Warehouse Your Data
- Reporting off of denormalized data is more performant than reporting off of raw transactional data.
- Data warehousing involves accepting some data staleness in exchange for improved performance.
- Triggers can be used to mitigate data staleness in warehousing environments.
Enterprise data typically begins its life cycle as transactional data. Transactional data is generated through any kind of exchange, whether between people, organizations, or machines. Purchases, form submissions, and timestamps all qualify as transactions. When recorded in its raw form, transactional data is often paired with master data and reference data, which tend to be non-transactional in nature. For example, if a Alex donates money to a research foundation, the data might be broken down into three types:
- Transactional Data: Alex’s name, Alex’s address, the method of payment, the donation amount, the donation date, whether Alex checked the newsletter signup box, etc.
- Reference Data: currency exchange rates, regional demographics, etc.
- Master Data: marketing campaign information, banking information, etc.
Transactional data usually consists of a large volume of tables containing minimally redundant, unsanitized data; and it can be a real challenge to report off of in its raw form.
As a result, it is generally recommended that transactional data go through some degree of denormalization before it is used for reporting purposes. Not only does this involve reducing the number of tables (and therefore the number of joins that must be executed when the data is queried), it also repeats the same data in multiple tables (increasing data redundancy) and standardizes input in terms of spelling and format. Reporting off of the second table instead of the two first tables means users will have an easier time understanding the data, and the database will have to do less processing to deliver it. Putting transactional data through this transformation process is sometimes referred to as ETL (Extract, Transform, and Load). The end result, a data set or entirely separate database used solely for reporting, is known as a data warehouse.
Great as they are for performance, data warehouses also introduce the problem of data staleness. Live transactional data is always up to date whereas warehoused data must be updated periodically as new records come in. Data has the potential to grow stale in between updates.
The way to mitigate warehouse staleness is first to prioritize your data and then set up triggers that will mark data for incremental extraction to the warehouse as specific values change. Your warehouse, for example, may be updated nightly at 3 AM when traffic is at its lowest, but tables with trigger markings may be set to update once every hour so that sensitive data is more current than it would be otherwise.
Which data qualifies as “high priority” will depend on your business needs. Reference data typically changes infrequently and ranks low in priority. Certain types of master data, like employee rosters or product catalogs, might also fit that description. Transactional data, by comparison, often has broader business implications and therefore demands greater accuracy. The programs (e.g., .NET assemblies, stored procedures, etc.) updating your data warehouse can be set to update specific columns between bulk warehouse updates as new, high-priority data is added to the source tables. Because only select segments of data are being updated throughout the workday, triggers preserve performance while granting users access to more recent information.
Minimize Processing Points
- Minimize the amount of processing performed during each query.
- Use materialized views to reduce the database workload.
Any processing that needs to happen to a data set for it to become reportable should be handled prior to report execution rather than as part of the query. Querying should be reserved for report-specific processing, as all data manipulation taxes performance. ETL and other universal adjustments should run behind the scenes with Exago querying a warehouse or other static dataset.
The diagram below illustrates a low-performance system with high data freshness.
Each and every time Exago BI makes a data call, the web service has to apply ETL processing to the raw transactional data to make it reportable, leaving the application server to perform report-specific processing. Though the data returned in these reports will be relatively fresh, performance will slow to accommodate the high volume of processing.
Below is an example of a high performance system that performs ETL (by way of stored procedures) in the background. Exago BI queries the data warehouse, which need only handle report-specific processing before returning the results, dramatically improving performance times.
In the event that data requires additional post-ETL processing in the data warehouse, materialized views can be used to further improve performance. They function as a sort of cached file for programmable data objects and offer speed in exchange for staleness. Because materialized views are executed ahead of time, querying them (instead of a standard view) means cutting back on data retrieval and processing time.
Minimize the Number of Database Queries
- Use execution caching except when live data is required.
- Use schema caching if schema rarely changes.
Queries to the database are expensive in terms of processing, no matter how many rows are returned. Not only does the application have to formulate a call to the database, the database then has to execute the query and return the data set over the network back to the server. Sometimes the application then has to do additional processing before finally generating the report output. Administrators can report off of the application log file to determine how much processing happens in Exago BI.
Exago BI offers admins and users a number of opportunities to reduce the number of calls to the database in a given session. Execution caching, enabled at the admin level, can be implemented by users on a report-by-report basis to generate a local cache of data returned for a given report. This is especially useful for large reports, as users can specify that a cache be refreshed after hours when it will be less likely to affect worker productivity. Users also have the ability to force a re-execution in the event up-to-date information is required. Exago BI’s monitoring feature can help administrators determine which reports to cache by allowing them to report on execution frequency and time.
Exago BI defaults to querying the database for schema information (e.g., column name, data type, etc.), but admins can instead elect to retrieve that information from a static file in order to reduce the number of queries to the database. Setting the Global Schema Access Type toggle to “Metadata” in the Admin Console stores the schema in the application config file for any objects invoking the setting. From there, the schemas can be updated manually, or, for specified data objects with frequently-changing schemas, overridden.
Minimize the Number of Rows Returned Per Query
- Offload joining and filtering to the database when using programmable objects.
- Push aggregate calculations to the database when possible.
- Apply incremental loading when it is unnecessary to have the full data set returned.
The goal in transferring as few rows as possible from the database to Exago BI is to have the database handle the bulk of the heavy lifting. The less data processing left to the application server, the more dramatic the performance gains. If a report is going to return 100 rows of data, the goal would be for the database to return exactly that many rows to Exago BI.
Unfortunately, this isn’t always easy to achieve. Stored procedures, .NET assemblies, and web services will all join and filter on the application server by default, requiring that any and all data objects joined to them be passed over the network in their entirety before being whittled down to just the report output. This can have heavy performance implications. Use Exago’s Programmable Data Object settings, found in the Admin Console, to offload this processing onto the database or data source server. This will help ensure that only the rows needed for the report output are passed to the application, minimizing lag and maximizing efficiency.
As of v2016.3, Exago also supports pushing aggregate calculations to the database, rather than performing them at the application level.This allows Exago to retrieve only one row from the database, which contains the aggregate, instead of all the individual data rows. This can be a significant performance boon, especially for large reports.
Lastly, as of v2017.2, slated for release at the end of October, users will have the option to load report output incrementally rather than all at once. Not only is this feature helpful in the designing of very large reports, it is also a means of preventing users from overloading the database as they explore data in ExpressView. Incremental loading limits the query to a subset of the total rows so that the user can assess the report’s design without waiting for the full execution. Aggregate calculations will necessarily reflect the sample data rather than all data qualifying for the report, but this is sufficient for checking the validity of a formula or function.
It’s important to remember that no matter how much you work to optimize performance from a data perspective, execution times will vary. Longer times, while not as long as they may otherwise have been, will nevertheless stand out as “long” to end users. For this reason, we’ll cover managing user expectations in a future installment of the Performance & Scaling series. In the meantime, if you have questions regarding any of the principles or methods above, let us know in the comments.
Thumbnail Photo Credit: This modification of “Bike wheel and six bicycle floor pumps” by Your Best Digs is licensed under CC BY 2.0.