
Lakes, Swamps, Ponds, and Other Bodies of Data

by | Deploying BI

In our continued effort to demystify analytics-related jargon, this post defines the data collections named after types of buildings and bodies of water. Figurative language makes a helpful mnemonic device, but some of these terms are drifting into extended-metaphor territory, a murky place indeed.

First, it’s worth asking why the buildings-and-water terms are so ubiquitous. Why is this a thing?

According to Dataversity, the first of these words was “datamart,” coined in the 1970s and followed by “data warehouse” a decade later. Then, in 2010, James Dixon extended the metaphor by conceiving of a datamart as a “store of bottled water”:

If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.

Data and water have so much in common (liquidity, commercial value, fuel potential) that it’s no surprise the metaphor took on a life of its own from there. Let’s see where the current leads.

Data Lake

DATA LAKE A repository of raw data in its native format. The term describes any large data pool in which the schema and data requirements are not defined until the data is queried.

Example: An unstructured repository of Google Maps search information, including string, numeric, array, geospatial, and image data types.

Source: TechTarget
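
To make "not defined until the data is queried" concrete, here is a minimal Python sketch of applying a schema at read time. The records, field names, and the `read_with_schema` helper are all hypothetical, invented for illustration; a real data lake would sit on file or object storage rather than in an in-memory list.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (no shared schema).
raw_lake = [
    '{"query": "coffee near me", "lat": 41.3, "lon": -72.9}',
    '{"query": "gas station", "results": ["A", "B"]}',
    '{"image_id": "img-001", "bytes": 20480}',
]

def read_with_schema(lines, fields):
    """Apply a schema at read time: yield only records that contain every requested field."""
    for line in lines:
        record = json.loads(line)
        if all(name in record for name in fields):
            yield {name: record[name] for name in fields}

# The "schema" is chosen by the person querying, not by the store.
geo_searches = list(read_with_schema(raw_lake, ["query", "lat", "lon"]))
print(geo_searches)
```

Nothing is rejected on the way in; each query decides which shape of record it cares about and ignores the rest.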

 

Data Pond

DATA PONDS A series of isolated repositories of raw data in its native format, also referred to as “data puddles.” They serve as a temporary intermediary location for raw, just-imported information, which is then typically added to a data lake.

Example: Importing ticket sales data for a particular theme park before adding it to a data lake containing information for all parks in the system.

Source: Cask Data, Bill Inmon

 

Data Swamp

DATA SWAMP A repository of ungoverned data, typically a data pond or data lake of questionable quality. The term is often invoked as an argument for cultivating data reservoirs rather than data lakes.

Example: An unmanaged database of philanthropic gifts whose records do not agree with an organization’s other financial data and therefore cannot be trusted.

Source: Gartner

 

Data Reservoir

DATA RESERVOIR A repository of data that has undergone information management and governance, which typically includes access controls, transformations enforcing semantic consistency, and cataloging methods. The term is possibly analogous to “data warehouse.”

Example: A centralized medical database that is regularly updated by medical staff and governed by a dedicated team of data stewards. Data is primed for reporting and tenanted according to staff clearance levels.

Sources: IBM, Dell EMC

 

Data Warehouse

DATA WAREHOUSE A repository of data that has undergone ETL (Extract, Transform, Load) processing, which may include information management and governance, for the purpose of integrating data from diverse sources and making it easier to analyze.

Example: Clothing manufacturing, shipping, and sales data that has been consolidated into a single database, shaped, and released to business users.

Sources: TechTarget, Spotless Data
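
As a rough illustration of the extract, transform, and load steps, the Python sketch below consolidates two hypothetical source extracts into one in-memory SQLite database. The table names, columns, sample rows, and the `transform_sku` helper are assumptions made for this example, not a description of any particular warehouse product.

```python
import sqlite3

# Extract: hypothetical rows pulled from two separate source systems.
manufacturing = [("SKU-1", "jacket", 40.0), ("SKU-2", "scarf", 8.0)]      # sku, name, unit_cost
sales = [("sku-1", 3, 120.0), ("SKU-2 ", 10, 25.0), ("sku-1", 1, 120.0)]  # sku, qty, unit_price

def transform_sku(sku):
    """Transform: normalize SKU formatting so rows from both sources can be joined."""
    return sku.strip().upper()

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE product (sku TEXT PRIMARY KEY, name TEXT, unit_cost REAL)")
warehouse.execute("CREATE TABLE sale (sku TEXT, qty INTEGER, unit_price REAL)")

# Load: write the cleaned rows into the shared warehouse tables.
warehouse.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [(transform_sku(sku), name, cost) for sku, name, cost in manufacturing],
)
warehouse.executemany(
    "INSERT INTO sale VALUES (?, ?, ?)",
    [(transform_sku(sku), qty, price) for sku, qty, price in sales],
)

# Once consolidated, a question that spans both sources is a single query.
margin_by_product = warehouse.execute(
    """
    SELECT p.name, SUM(s.qty * (s.unit_price - p.unit_cost)) AS margin
    FROM sale AS s
    JOIN product AS p ON p.sku = s.sku
    GROUP BY p.name
    ORDER BY p.name
    """
).fetchall()
print(margin_by_product)
```

Without the transform step, the inconsistent SKU spellings in the sales extract would not match the manufacturing rows, and the join would silently drop them.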

 

Data Mart

DATA MART A repository of data that has undergone ETL and is tailored to the needs of a specific end user group. Datamarts (also “data marts”) can either be crafted from a data warehouse or combined to form a data warehouse.

Example: A subset of the data warehouse example above where the data has been groomed for a particular user set, such as a sales and marketing team.

Source: TechTarget
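
One simple way to picture a data mart carved from a warehouse is as a narrower, pre-aggregated view over a warehouse table. The Python sketch below uses an in-memory SQLite database; the `warehouse_orders` table, its columns, and the `sales_marketing_mart` view are hypothetical, and real data marts are often separate physical stores rather than views.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Hypothetical warehouse table holding columns for every department.
db.execute(
    "CREATE TABLE warehouse_orders "
    "(order_id INTEGER, region TEXT, product TEXT, revenue REAL, unit_cost REAL, carrier TEXT)"
)
db.executemany(
    "INSERT INTO warehouse_orders VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "East", "jacket", 120.0, 40.0, "CarrierA"),
        (2, "West", "scarf", 25.0, 8.0, "CarrierB"),
        (3, "East", "scarf", 25.0, 8.0, "CarrierA"),
    ],
)

# The mart exposes only what a sales and marketing team needs,
# leaving out cost and shipping details.
db.execute(
    """
    CREATE VIEW sales_marketing_mart AS
    SELECT region, product, SUM(revenue) AS total_revenue, COUNT(*) AS order_count
    FROM warehouse_orders
    GROUP BY region, product
    """
)

mart_rows = db.execute(
    "SELECT region, product, total_revenue, order_count "
    "FROM sales_marketing_mart ORDER BY region, product"
).fetchall()
print(mart_rows)
```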

 

Data Silo

DATA SILO A repository of data that exists in isolation from other repositories of data. Data silos can be intentional or the accidental result of mismanaged data channels.

Example: A series of spreadsheets maintained by different people, or a database that has been disconnected from other systems to control a data breach.

Source: PC Mag

Data Lake vs. Data Warehouse

Since data lakes, marts, and warehouses seem to be the most commonly confused terms, we’ve given them extra treatment below. For an even deeper dive, check out this cheat sheet from TechTarget!

If we were to arrange these three terms in process order, data lakes would come first because they contain raw data in its native format. That data is either queried directly or loaded via ETL into a data warehouse, which may be further partitioned into data marts.

[Figure: data lake vs. data warehouse]

Data Mart vs. Data Warehouse

 

[Figure: data mart vs. data warehouse]

As these definitions suggest, it’s important to know where your enterprise data originated, how it integrates with data from other sources, and how it will be made available to end users. Whether you call them ponds or silos, unintentionally isolated repositories of data should be avoided and instead incorporated into well-managed warehouses and reservoirs.

What other terms can we help demystify? Let us know in the comments!
