If there’s one thing I’ve learned as a BI consultant, it’s that Data Management problems, like speeding tickets and jury duty, are terribly common but somehow still feel unlikely to happen to you.
I can’t tell you how many times I’ve seen BI implementations drag on for months and months because issues around data extraction, modeling, aliasing, and stewardship weren’t resolved or even considered at the onset of the project. It’s never fun to put in the time upfront, but it’s a lot less painful than having to backtrack.
That said, I’ve also seen companies give serious consideration to their Data Management strategy from the start and get their BI implementation in front of customers well within their deadline. Data Management may sound like optional busy work for paper pushers, but let me assure you, it’s critical to your success, especially if you offer BI.
So, here’s a free consultation, a detailed explanation of the Data Management pitfalls I encounter most often in the field. This list probably won’t surprise you, but I hope it will prompt you to take a hard, objective look at your own implementation and assess it for vulnerabilities. With a little foresight and some elbow grease, you can realize that smooth, unimpeded BI deployment you’re counting on.
Poor (or Nonexistent) Data Stewardship
The most common Data Managementmistakes I see are related to a lack of internal ownership over the database itself. It’s rare to find a true owner of the database in such a way that a single person can tell the story from data collection to storage to extraction.
There are a number of ways this can happen to a company. When folks come to us looking for a reporting solution to meet their customers’ needs (such as a BI solution designed to be embedded into SaaS applications), they’re not setting up the database in the same step. They’ve already been collecting data for a long time — long before reporting was even a consideration, in most cases. Sometimes we discover that the person who initially set up the database doesn’t even work at the organization anymore and didn’t leave much in the way of documentation or tribal knowledge to help onboard a successor.
Other times, responsibility for (and knowledge of) the data is distributed throughout the company. One group might have a deep understanding of the data’s semantics while another, such as IT, might have some insight into its maintenance and traffic capacity. A third group responsible for data analysis might be most familiar with its utility to stakeholders.
Unfortunately, none of these groups have a grasp of the database’s structure or complete knowledge of the data itself. You can see how this could begin to create some confusion.
If no one owns the data, then it becomes incredibly difficult to manage. Oftentimes, multiple individuals will attempt to understand how the data is set up, balk at the sheer scope of this undertaking, and then modify their project goals to avoid having to dig deeper. This results in multiple people understanding only parts and pieces of the data lifecycle. I’ve had clients ask me how their tables are joined together.
Consolidating tribal knowledge into a single document — such as a data dictionary, business glossary, and/or entity-relationship diagram (ERD) — is the first step toward facilitating good stewardship protocol. Ideally, a single person or team would be devoted to the task, but creating reliable records can at least help a distributed network of stewards stay in sync.
Most companies’ databases are first optimized for data storage rather than extraction. This makes sense, as we need somewhere to put the data before we can do anything with it. As long as we have it somewhere, we can decide how to analyze it later, right?
When SaaS providers come to us for an embedded BI solution, they’re often under pressure to get the solution to users quickly. Their data isn’t ready, but they launch against the raw data in order to make the deadline, invariably resulting in inefficient queries and poor report execution times. Such setbacks are easily avoided with a little structural preparation.
Raw databases are typically highly normalized and must be somewhat denormalized for more efficient querying. Normalization, the process of increasing the number of tables in a database while decreasing their redundancy, improves write performance to the database and is a common initial setup.
If no one owns the data, then it becomes incredibly difficult to manage.
Denormalization, however, speeds up data retrieval by decreasing the number of tables in the database and increasing their redundancy. This is the step I often see either skipped or thrown together last minute in the form of a view or stored procedure that serves only one specific reporting use case.
Now, this is not to say that a single table is an answer to all your reporting inefficiencies. On the contrary, reporting off of a single humongous data table presents its own set of problems. A middle ground between these two extremes is almost always necessary to ensure acceptable report performance. Denormalizing your raw data and eliminating any unnecessary bulk before launching your BI solution will keep you from having to double back and redesign all your reports after the fact.
While you’re improving your database performance, why not give some thought to accessibility? Databases are often designed by administrators who have, at best, a foggy understanding of how the data will be consumed.
It can be difficult to put yourself in the users’ shoes, to try and imagine how they will interpret schema names, to anticipate the types of decisions they’ll be trying to make based on their reports. But getting into those shoes, whether through personas or actual user research, is critical to making the data (and therefore your BI implementation) useful to consumers.
The first issue is straightforward but often overlooked: column names. An admin setting up a reporting system likely knows which fields are which by looking at the schema names and might deem aliasing “nice to have” rather than a necessity. Those who do choose to alias their columns sometimes fail to consider what will make the most sense to users. If my sales team is looking for a field containing all of their leads, naming this field “Contact Name” might be technically correct but make it difficult for users to find.
The second issue is data overwhelm. Knowing what sort of reports your users want is vital to not inundating them with tables and columns.
I can’t tell you how many times I’ve seen admins expose users to 12 tables, each containing 25 to 100 columns, just because they’re not sure which columns users will need. Then, because there are so many columns, the administrator doesn’t have time to alias the fields, never mind descriptions or additional metadata. The users in question might only need 15 columns from a total of 3 of those tables (which could be conveniently combined into a single view) but instead, get bombarded with field options and exit the tool in terror. It doesn’t have to be that way.
Lastly, providing data definitions can be extremely helpful to end-users. Simple explanations of what each field contains will do, but you could also give examples of use cases or suggest how fields might be combined. Identifying fields using familiar terminology can make reporting a painless and valuable experience for end-users.
Poor Handling of Dimension Information
Most companies start out with what we call fact information — recorded transactions and concomitant data. Imagine a table of sales data; a record would show an order placed by this customer on this day for these products. This table might not have a record on some dates because no sales were made on those days. In fact tables, we only create a new record when that event occurs.
Dimension information, on the other hand, is a complete set of some type of data intended to provide additional context. The most common dimension information that we recommend adding is a calendar table, which would have a list of all dates, whether an event occurred on that date or not. Dimension tables allow you to look holistically at your collection of data points rather than just at the data points that have events related to them. Both fact and dimension information are critical to getting value out of BI.
The question now becomes how to introduce this dimension information, and there are two basic options:
- Train BI users on when to include these tables in their report definitions, or
- Build dimension tables directly into the exposed data models.
It’s all too common for companies to do neither, leading to user frustration and distrust. It’s important for companies to anticipate this issue and act on one of the above resolutions so as not to alienate their customer base.
It is absolutely possible to address these Data Management challenges before they derail your BI implementation, but that takes foresight, research, time, and a willingness to make that upfront investment. Good Data Management practices will pay off in the long run — in time saved and in boosts to your company’s reputation. Thoughtful BI implementations are incredibly “sticky,” meaning users want to keep using them rather than migrate to a different system. Embedded BI implementations pass on that stickiness to the host application, but no BI solution can make up for poor Data Management.
Originally published with Dataversity.