This article was first published by Springboard.
Good public data is a lot like spare change: plentiful, and yet somehow impossible to find when you need it.
That’s why we’ve compiled this list of free, reputable, publicly available data sets. Software-as-a-service providers can use public data for all kinds of purposes, including but not limited to:
- Crafting more engaging product demos. Randomly generated data might serve in a pinch, but it’s by no means as engaging as fact-based data. If prospective end users are interested in the data they’re looking at—be it crime statistics, census data, or something else—they’ll be more likely to want to explore and interact with the application they’re using to analyze it.
- Testing new data types. Public data can be a quick means of diversifying your testing data. Maybe you just developed a mapping tool and need some geographic data with which to test it. Or perhaps you simply need a larger volume of data for performance testing. A quick import into your database could have you off and running in no time.
- Enriching your data. If there is publicly available data related to your industry vertical, you could use it to supplement or enrich your clients’ proprietary data. Google Analytics does something akin to this with its benchmarking tool, which allows GA users to compare their traffic metrics to those of other sites similar in size and content area.
- Protecting personally identifiable information. Need to strip client data of its PII? Public data to the rescue! Simply swap out the sensitive fields for publicly available data in a similar format, and you’re in the clear.
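To make the PII swap in the last bullet concrete, here is a minimal Python sketch. It assumes records are plain dictionaries and that you have already pulled a column of replacement values from a public data set. The function name `swap_field`, the sample rows, and the placeholder pool are all hypothetical, made up for illustration.

```python
from typing import Dict, Iterable, List


def swap_field(records: List[dict], field: str, public_values: Iterable[str]) -> List[dict]:
    """Return copies of the records with `field` replaced by values from a public pool.

    The same original value always maps to the same replacement, so
    grouping and joining on the field still work after the swap.
    """
    pool = iter(public_values)
    mapping: Dict[str, str] = {}
    swapped = []
    for record in records:
        original = record[field]
        if original not in mapping:
            try:
                mapping[original] = next(pool)
            except StopIteration:
                raise ValueError("public pool has fewer values than the data has distinct entries") from None
        # Build a new dict so the input records are left untouched.
        swapped.append({**record, field: mapping[original]})
    return swapped


# Hypothetical client rows; the names are invented for this example.
clients = [
    {"name": "Jane Roe", "plan": "pro"},
    {"name": "John Doe", "plan": "free"},
    {"name": "Jane Roe", "plan": "pro"},
]

# Stand-in for a column taken from a public data set.
public_pool = ["Public Name A", "Public Name B", "Public Name C"]

masked = swap_field(clients, "name", public_pool)
for row in masked:
    print(row)
```

Keeping the mapping consistent is a design choice: a repeated client still shows up as one repeated placeholder, so counts and joins in a demo behave as they would on the real data.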
Ready to up your data game? Check out these 19 resources to get started.
- United States Census Data: The U.S. Census Bureau publishes reams of demographic data at the state, city, and even ZIP code level. The data set is fantastic for creating geographic data visualizations and can be accessed on the Census Bureau website. Alternatively, the data can be accessed via an API. One convenient way to use that API is through the choroplethr package for R. In general, this data is very clean and very comprehensive.
- FBI Crime Data: The FBI crime data set is fascinating. If you’re interested in analyzing time series data, you can use it to chart changes in crime rates at the national level over a 20-year period. Alternatively, you can look at the data geographically.
- CDC Cause of Death: The Centers for Disease Control and Prevention maintains a database on cause of death. The data can be segmented in almost every way imaginable: age, race, year, and so on.
- Medicare Hospital Quality: The Centers for Medicare & Medicaid Services maintains a database on quality of care at more than 4,000 Medicare-certified hospitals across the U.S., providing for interesting comparisons.
- SEER Cancer Incidence: The U.S. government also has data about cancer incidence, again segmented by age, race, gender, year, and other factors. It comes from the National Cancer Institute’s Surveillance, Epidemiology, and End Results Program.
- Bureau of Labor Statistics: Many important economic indicators for the United States (like unemployment and inflation) can be found on the Bureau of Labor Statistics website. Most of the data can be segmented both by time and by geography.
- Bureau of Economic Analysis: The Bureau of Economic Analysis also has national and regional economic data, including gross domestic product and exchange rates.
- IMF Economic Data: For access to global financial statistics and other data, check out the International Monetary Fund’s website.
- Dow Jones Weekly Returns: Predicting stock prices is a major application of data analysis and machine learning. One relevant data set to explore is the weekly returns of the Dow Jones Index from the Center for Machine Learning and Intelligent Systems at the University of California, Irvine.
- Data.gov.uk: The British government’s official data portal offers access to tens of thousands of data sets on topics such as crime, education, transportation, and health.
- Enron Emails: After the collapse of Enron, a data set of roughly 500,000 emails, with message text and metadata, was released. The data set is now famous and provides an excellent testing ground for text-related analysis. You can also explore other research uses of this data set through its page.
- Google Books Ngrams: If you’re interested in truly massive data, the Ngram viewer data set counts the frequency of words and phrases by year across a huge number of text sources. The resulting file is 2.2 TB.
- UNICEF: If data about the lives of children around the world is of interest, UNICEF is the most credible source. The organization’s public data sets cover nutrition, immunization, and education, among other topics.
- Reddit Comments: Reddit released a data set of every comment that has ever been made on the site. That’s over a terabyte of data uncompressed, so if you want a smaller data set to work with, Kaggle hosts the comments from May 2015 on its site.
- Wikipedia: Wikipedia provides instructions for downloading the text of English-language articles, in addition to other projects from the Wikimedia Foundation.
- Lending Club: Lending Club provides data about loan applications it has rejected as well as the performance of loans that it issued. The data set lends itself both to classification techniques (will a given loan default?) and to regression (how much will be paid back on a given loan?).
- Walmart: Walmart has released historical sales data for 45 stores located in different regions across the United States.
- Airbnb: Inside Airbnb offers different data sets related to Airbnb listings in dozens of cities around the world.
- Yelp: Yelp maintains a data set for personal, educational, and academic use. It includes 6 million reviews spanning 189,000 businesses in 10 metropolitan areas. Students are welcome to participate in Yelp’s dataset challenge.
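As a quick illustration of the two framings mentioned in the Lending Club entry, the sketch below derives a classification label and a regression target from the same loan rows. The field names (`amount`, `total_paid`, `status`) and the status strings are assumptions made up for this example, not Lending Club's actual schema.

```python
# Hypothetical loan rows; field names and values are assumptions for illustration.
loans = [
    {"amount": 10000.0, "total_paid": 11200.0, "status": "Fully Paid"},
    {"amount": 5000.0, "total_paid": 1500.0, "status": "Charged Off"},
]


def default_label(loan: dict) -> int:
    """Classification target: 1 if the loan defaulted, 0 otherwise."""
    return 1 if loan["status"] == "Charged Off" else 0


def repaid_fraction(loan: dict) -> float:
    """Regression target: how much was paid back relative to the amount lent."""
    return loan["total_paid"] / loan["amount"]


labels = [default_label(loan) for loan in loans]
targets = [repaid_fraction(loan) for loan in loans]

print(labels)
print(targets)
```

The same rows feed either kind of model. Only the target column changes: a yes/no label for classification, a continuous value for regression.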
Do you have any favorite go-to sources of free public data? Tell us about them in the comments!