Geodemographics Blog

Jonathan Ortiz: Linking U.S. Census data

19-04-2017

Through a grant from the National Science Foundation’s DataStart program, I was brought to data.world, a start-up based in Austin, TX, where we partnered with the U.S. Census Bureau and Amazon Web Services (AWS) to provide U.S. Census data in a new linked data format that makes vital demographic information more consumable for data users and more usable by machines.

The American Community Survey

The U.S. Census Bureau conducts more than 130 surveys and programs each year. We decided to tackle the American Community Survey (ACS), because it is the Census Bureau’s biggest and most up-to-date survey as well as a super-connector dataset that’s leveraged throughout the U.S. by countless individuals, non-profit organizations, and businesses.

ACS data affect compliance with federal civil rights laws and the allocation of close to $400 billion annually for planning and implementation of federal programs and services such as school construction, housing and community development, road and transportation planning, and job training. Additionally, businesses of all sizes and types rely on ACS data for marketing, hiring, and selecting site locations, as well as forecasting future demand for goods and services.

The U.S. Census Bureau’s primary channels for distributing ACS data are the American FactFinder website and the Census API. The API is an amazing tool that powers derivative works such as Census Reporter and Data USA. Census Reporter aims to make it “easier for journalists to write stories using ACS data,” but the website is useful to more than just journalists. Data USA is the “most comprehensive visualization of U.S. public data.” Both outlets provide social and economic benefits by putting easily digestible, visual information into the public’s hands. But aside from reading the summaries and visualizations these sites provide, there isn’t much one can do with the underlying data. With this in mind, we set out to provide a new iteration of ACS data that allows users to do more than simply read it.

We decided to transform the ACS microdata release into a Linked Open Data (LOD) graph database using semantic web technologies.

Conversion to Linked Data

Most U.S. Census channels provide pre-tabulated population estimates by geography. But to get an instance-level view of the ACS, data analysts must download the ACS Public Use Microdata Sample (PUMS).

PUMS metadata is stored in a human-readable data dictionary document, separate from the data. This gives the PUMS a steep learning curve. Simply looking at the data and trying to make sense of it requires constantly referencing the data dictionary, which takes time to master. This also limits what a computer can do with the data, because the concepts in the data dictionary are not represented in a format the computer can understand.

We addressed these issues by…

1. …creating a linked data schema using the concepts, relationships, and semantic meaning contained in the PUMS data dictionary. This schema spans years, accounting for differences in the PUMS data model from one year to the next, so all metadata for all years is in one place.

2. …normalizing and transforming the data from raw CSV to linked data using a series of SPARQL updates and a Java importer developed by data.world.

3. …linking real-world concepts contained in the PUMS to external data sources like DBpedia and Wikidata.
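
To make the three steps concrete, here is a minimal sketch in Python. It is not the data.world pipeline, which used SPARQL updates and a Java importer; it only illustrates the idea with the standard library. The example.org namespace, the predicate names, and the sample row are assumptions made up for illustration. The column names (SERIALNO, SPORDER, ST, AGEP, SEX) follow PUMS conventions, and the lookup tables stand in for a small slice of the data dictionary and of the links to DBpedia.

```python
import csv
import io

# Hypothetical namespace for illustration; not the URIs of the real linked ACS.
BASE = "http://example.org/acs/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

# Small excerpts of what a data dictionary and a link table might hold.
SEX_LABELS = {"1": "Male", "2": "Female"}
STATE_LINKS = {"48": "http://dbpedia.org/resource/Texas"}

# One made-up person record in PUMS-style CSV.
SAMPLE_CSV = "SERIALNO,SPORDER,ST,AGEP,SEX\n84,1,48,35,2\n"


def row_to_triples(row):
    """Turn one PUMS-style person record into N-Triples lines."""
    person = "<{}person/{}-{}>".format(BASE, row["SERIALNO"], row["SPORDER"])
    sex = "<{}sex/{}>".format(BASE, row["SEX"])
    triples = [
        # A typed literal for the numeric age.
        '{} <{}age> "{}"^^<{}> .'.format(person, BASE, int(row["AGEP"]), XSD_INTEGER),
        # The coded value becomes a URI instead of a bare "2".
        "{} <{}sex> {} .".format(person, BASE, sex),
        # The data dictionary meaning travels with the data as a label.
        '{} <{}> "{}" .'.format(sex, RDFS_LABEL, SEX_LABELS[row["SEX"]]),
    ]
    # Link the state code to an external resource when one is known.
    state = STATE_LINKS.get(row["ST"])
    if state is not None:
        triples.append("{} <{}state> <{}> .".format(person, BASE, state))
    return triples


def convert(csv_text):
    """Convert every row of a CSV string into a flat list of N-Triples lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [triple for row in reader for triple in row_to_triples(row)]


if __name__ == "__main__":
    for line in convert(SAMPLE_CSV):
        print(line)
```

In this sketch the label triple is emitted once per row, so a real import would de-duplicate it or load the schema separately from the instance data.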

Why semantic technologies?

The future of distributing data on the web is Linked Data: modeling data with strong identifiers for real-world concepts that can be cross-referenced between multiple datasets and can apply context and meaning via schemas that live alongside and within the data. We have provided this LOD release to create a center of gravity for ongoing work to link datasets to the ACS. The linked ACS release also provides users with additional new features and benefits.

New Features:

Original PUMS csv ---> New Linked ACS graph data

· Flat, tabular data ---> URIs denote linkable concepts

· Complex data dictionary ---> Metadata built into the data itself

· Unwieldy raw files ---> Secure, queryable database
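
The contrast above can be illustrated with a small sketch of what “metadata built into the data itself” means in practice. This is a toy in-memory example in plain Python, not the actual linked ACS store or its query interface. The example.org URIs and the sample triples are assumptions for illustration; only the rdfs:label property URI is a real, standard identifier.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
EX = "http://example.org/acs/"  # hypothetical namespace

# Instance data and its metadata sit side by side as (subject, predicate, object).
TRIPLES = [
    (EX + "person/84-1", EX + "sex", EX + "sex/2"),
    (EX + "sex/2", RDFS_LABEL, "Female"),
    (EX + "sex", RDFS_LABEL, "Sex"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def label(triples, uri):
    """Look up a human-readable label in the graph, falling back to the URI."""
    hits = match(triples, s=uri, p=RDFS_LABEL)
    return hits[0][2] if hits else uri


def describe(triples, subject):
    """Render a subject's properties using labels found in the same graph."""
    return {label(triples, p): label(triples, o) for _, p, o in match(triples, s=subject)}


if __name__ == "__main__":
    print(describe(TRIPLES, EX + "person/84-1"))
```

The point of the design is that `describe` never consults a separate data dictionary: the meaning of the coded value is resolved from the same graph that holds the record.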

Benefits:

· Usability and availability: By using semantic technologies to both model and distribute the data, the data is more available and more usable by machines.

· Sharing knowledge with data: With semantic modeling, we put the information about the data into the data itself, which reduces friction caused by the separation of the data from the data dictionary.

· Better starting point: This exploratory data format gives users a leg up when learning the data schema.

Readers are strongly encouraged to review the MRS CGG’s past article covering the 2009 European Geodemographics Conference and linking spatial data on the web. You can read Christian Becker’s presentation from that conference here. Even though the presentation is from 2009, it is a terrific introduction to LOD and its content is still relevant today! Readers may also be interested in learning why semantics and Linked Data are vital to ushering in the next wave of artificial intelligence.

Learn more

data.world and the U.S. Census Bureau have worked with AWS to provide the linked ACS as an AWS Public Dataset. This means anyone can analyze the linked ACS in the Amazon cloud without needing to download or store their own copy. We encourage all data scientists, data engineers, researchers, developers, and semantic web enthusiasts who have an interest in demographic topics to use this vital data in its new format, expand on it, and develop applications with it.

This resource will be featured in future webinars and a session at the ACS User Conference in May 2017. Readers can learn more about how to access and use this new resource at docs.data.world/uscensus.


Through a grant from the National Science Foundation’s DataStart program, data scientist Jonathan Ortiz was brought to data.world, a start-up based in Austin, TX, where the company partnered with the U.S. Census Bureau and Amazon Web Services (AWS) to provide U.S. Census data in a new linked data format that makes vital demographic information more consumable for data users and more usable by machines.


You can contact Jonathan via jonathan.ortiz@data.world.


Any views or opinions presented are solely those of the author and do not necessarily represent those of the MRS Census and Geodemographic Group unless otherwise specifically stated.
