

There are different takes on what veracity refers to, but the overall consensus is that data veracity reflects the truthfulness of a data set and your level of confidence or trust in it. I’ll take this a step further and say that data veracity is your level of confidence/trust in the data based on its provenance as well as the data processing method.
Think about this: when you get a box of chocolates you haven’t tried before, how do you estimate how good it is? The first step is to look at where it was made, and by what shop or brand. You can assess its quality mainly by its provenance. As a second step, you probably also want to ensure that after you open the box, you won’t taint the chocolates somehow before you taste them.
Data veracity helps us better understand the risks associated with analysis and business decisions based on a particular big data set.
Looking at a data example, imagine you want to enrich your sales prospect information with employment data: where those prospects work and their job titles. Not only can this provide you with additional contact data, but it can also help you create different market segments and do a better job of serving them. LinkedIn collects lots of employment data, but unfortunately you can’t purchase it from them. So what can you do? You might go to a third-party provider who claims to scrape LinkedIn data from search engine results (a legally grey area in my opinion; I’m not a legal expert, so let’s just treat this as a theoretical example). You might consider purchasing this employment data, but how do you gauge its veracity? Well, you need to ask the service provider these questions:
- Who created and contributed to the data source?
- When was the data collected?
- Was the original data source enriched in any way?
- What methodology do they follow in collecting the data?
- What algorithm do they use to match records and what are the matching confidence levels?
- Were only certain industries or locations included in the data source?
- Has the information been edited or modified in any way?
- Did the creators summarize the information?
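
If you want to track the answers systematically, one option is to record them per data source and flag the gaps. The snippet below is only a minimal sketch of that idea: the `ProvenanceAnswers` class, its field names, and the 0.9 confidence threshold are all hypothetical choices made for illustration, not an established standard.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ProvenanceAnswers:
    """Hypothetical record of a provider's answers; None means unanswered."""
    creators: Optional[str] = None             # who created/contributed to the source
    collected_on: Optional[str] = None         # when the data was collected
    enrichment: Optional[str] = None           # how the original source was enriched
    collection_method: Optional[str] = None    # methodology used to collect the data
    matching_algorithm: Optional[str] = None   # how records are matched
    matching_confidence: Optional[float] = None  # 0.0 to 1.0, as reported by the provider
    coverage_limits: Optional[str] = None      # industries/locations included
    modifications: Optional[str] = None        # edits made to the information
    summarization: Optional[str] = None        # whether the data was summarized


def unanswered(answers: ProvenanceAnswers) -> List[str]:
    """Return the names of the questions the provider has not answered."""
    return [f.name for f in fields(answers) if getattr(answers, f.name) is None]


def needs_review(answers: ProvenanceAnswers, min_confidence: float = 0.9) -> bool:
    """Flag a source with open questions or a low reported matching confidence.

    The default threshold is an arbitrary assumption; pick one that fits
    the risk of the decision the data will support.
    """
    confidence = answers.matching_confidence
    low_confidence = confidence is not None and confidence < min_confidence
    return bool(unanswered(answers)) or low_confidence
```

For example, `needs_review(ProvenanceAnswers(creators="Vendor X", matching_confidence=0.8))` flags the source, both because most questions are still open and because the reported matching confidence falls below the assumed threshold.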