Overview
The Census Bureau collects data from people in the United States through multiple survey programs. Federal, state, and local governments use the data to assess how constituents are represented and to allocate spending. The data are also made freely available to the public and have a wide range of uses.1 In this post, I parse, load, and verify data from the Census Bureau's American Community Survey (ACS) 2013 5-year Public Use Microdata Sample (PUMS) for Washington DC.
Brief process:
- Start with "Running an IPython Notebook on Google Compute Engine from Chrome".
- For additional storage, create and mount a disk to the instance.2
- Download (and decompress):3 4
- 2013 5-year PUMS data dictionary: PUMS_Data_Dictionary_2009-2013.txt (<1 MB)
- 2013 5-year PUMS person and housing records for Washington DC:
- Person records: csv_pdc.zip (5 MB compressed, 30 MB decompressed)
- Housing records: csv_hdc.zip (2 MB compressed, 13 MB decompressed)
- 2013 5-year PUMS estimates for user verification: pums_estimates_9_13.csv (<1 MB)
- Load the files:
- Data dictionary TXT: dsdemos.census.parse_pumsdatadict (see the dsdemos package below). This is a customized parser I wrote for PUMS_Data_Dictionary_2009-2013.txt. The data dictionary is inconsistently formatted, which complicates parsing.5 6
- Person/housing records and user verification CSVs: pandas.read_csv.7 8
- Confirm the user verification estimates (see the minimal sketch after this list and the full example below):9
- To calculate an estimate \(X\) for a specific "characteristic" (e.g. "Age 25-34"), sum the column '[P]WGTP' of the filtered data ('PWGTP' for person records, 'WGTP' for housing records).10 '[P]WGTP' are the sample weights.
- To calculate the estimate's "direct standard error", use the ACS's modified root-mean-square deviation:
$$\mathrm{SE}(X) = \sqrt{\frac{4}{80}\sum_{r=1}^{80}(X_r-X)^2}$$
where each \(X_r\) is the sum of the column '[P]WGTPr' of the filtered data. '[P]WGTP[1-80]' are the "replicate weights".
- To calculate the estimate's margin of error (defined by the ACS at the 90% confidence level):
$$\mathrm{MOE}(X) = 1.645\,\mathrm{SE}(X)$$
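A minimal sketch of how these formulas map onto pandas operations (the verified calculation for Washington DC is in the Example section below). The helper pums_estimate and its df argument are illustrative only, not part of dsdemos; df is assumed to be a PUMS DataFrame already filtered to the characteristic of interest, with wt='WGTP' for housing records or wt='PWGTP' for person records:
def pums_estimate(df, wt='WGTP'):
    """Return (estimate, direct standard error, margin of error) for filtered PUMS records."""
    wts = [wt + str(r) for r in range(1, 81)]  # replicate weight columns, e.g. 'WGTP1'-'WGTP80'
    est = df[wt].sum()                         # estimate X: sum of the sample weights
    se = ((4/80)*((df[wts].sum() - est)**2).sum())**0.5  # SE(X) from the replicate weights
    return est, se, 1.645*se                   # MOE(X) at the 90% confidence level
# e.g. owner-occupied units: pums_estimate(dfh.loc[dfh['TEN'].isin([1, 2])], wt='WGTP')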
Source code:
- For step-by-step details, see the Jupyter Notebook (click the HTML export to render in-browser):
- 20160110-etl-census-with-python.ipynb
- 20160110-etl-census-with-python-full.html
- This post uses dsdemos v0.0.3.11
Motivations
Why am I using the American Community Survey (ACS)?
- The ACS is a relevant data set. A future step is to predict an individual's household income, which is among the subjects that the ACS survey addresses.
- The ACS is a reliable data set.12 The ACS has quality controls to ensure that it is representative. The survey samples about 3 million addresses per year with a response rate of about 97%.
- The ACS is a time-series data set. The survey sends questionnaires throughout the year and releases data once per year. A future step is to use the time series to forecast an individual's household income.
- I recognize that using ACS data can be problematic. Data from the Census Bureau has been used for harm,13 and current ACS terminology asks respondents to identify by terms such as "race".14 For this project, I take data from the Census Bureau at face value, and I infer from it at face value. It's important to respect that these aren't simply data points; these are people.
Why am I using the ACS 5-year estimate?
As of Dec 2015, the ACS offers two windowing options for its data releases: 1-year estimates and 5-year estimates.15 Because the ACS 5-year estimates include data over a 5-year window, they have the largest sample size and thus the highest precision for modeling small populations. However, the larger sample size comes at the expense of currency. A future step is to forecast predictions from the 5-year window so that they are more relevant to a specific year.
Why am I using Python?
This project can be done using Python, R, SQL, and/or other languages.16 I'm using Python since a future step is to build a machine learning pipeline, a popular application of Python's scikit-learn.
What about "big data"?
I'm starting with a data set small enough to be processed in memory (i.e. operated on in RAM), since many Python packages focus on in-memory operations on single machines.17 These packages often parallelize operations across the machine's processor cores. For operations that exceed the machine's available RAM (i.e. out-of-core computations), there's Dask for Python, and for operations that require a cluster of machines, there's Spark for Java, Scala, Python, and R. Scaling the pipeline to a data set large enough to require a cluster is a future step.
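As a rough illustration of the out-of-core option, here is a minimal sketch using Dask's pandas-like dataframe API to sum the housing sample weights without loading the whole CSV into RAM. For a file this small, plain pandas is sufficient; the sketch assumes dask is installed and that ss13hdc.csv is in the working directory:
import dask.dataframe as dd
# Build a lazy task graph over the CSV instead of reading it all into memory.
ddf = dd.read_csv('ss13hdc.csv')
# compute() executes the graph, processing the file in chunks across processor cores.
total_weight = ddf['WGTP'].sum().compute()
print(total_weight)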
Example
This is an abbreviated example of my ETL procedure in the Jupyter Notebook for this post (see links to source code above).
cd ~
# Import standard packages.
import os
import subprocess
import sys
# Import installed packages.
import numpy as np
import pandas as pd
# Import local packages.
# Insert current directory into module search path.
# dsdemos version: https://github.com/stharrold/dsdemos/releases/tag/v0.0.3
sys.path.insert(0, os.path.join(os.path.curdir, r'dsdemos'))
import dsdemos as dsd
# File paths
# Base path to ACS:
# http://www2.census.gov/programs-surveys/acs/
# 2013 5-year PUMS data dictionary:
# tech_docs/pums/data_dict/PUMS_Data_Dictionary_2009-2013.txt
# 2013 5-year PUMS housing records for Washington DC (extracted from csv_hdc.zip):
# data/pums/2013/5-Year/ss13hdc.csv
# 2013 5-year PUMS user verification estimates:
# tech_docs/pums/estimates/pums_estimates_9_13.csv
path_acs = r'/mnt/disk-20151227t211000z/www2-census-gov/programs-surveys/acs/'
path_dtxt = os.path.join(path_acs, r'tech_docs/pums/data_dict/PUMS_Data_Dictionary_2009-2013.txt')
path_hcsv = os.path.join(path_acs, r'data/pums/2013/5-Year/ss13hdc.csv')
path_ecsv = os.path.join(path_acs, r'tech_docs/pums/estimates/pums_estimates_9_13.csv')
# Load and display the data dictionary.
ddict = dsd.census.parse_pumsdatadict(path=path_dtxt)
print("ddict
, dfd
: Convert the nested dict
into a hierarchical data frame.")
tmp = dict() # tmp
is a throwaway variable
for record_type in ddict['record_types']:
tmp[record_type] = pd.DataFrame.from_dict(ddict['record_types'][record_type], orient='index')
dfd = pd.concat(tmp, names=['record_type', 'var_name'])
dfd.head()
# Load and display the housing records.
print("dfh
: First 5 housing records and first 10 columns.")
dfh = pd.read_csv(path_hcsv)
dfh.iloc[:, :10].head()
# Load and display the verification estimates.
# Select the estimates for Washington DC then for the
# characteristic 'Owner occupied units (TEN in 1,2)'.
dfe = pd.read_csv(path_ecsv)
tfmask_dc = dfe['state'] == 'District of Columbia'
dfe_dc = dfe.loc[tfmask_dc]
print("dfe_dc
: This example verifies these quantities.")
dfe_dc.loc[[310]]
# Verify the estimates following
# https://www.census.gov/programs-surveys/acs/
# technical-documentation/pums/documentation.2013.html
# tech_docs/pums/accuracy/2009_2013AccuracyPUMS.pdf
# Define the column names for the housing weights.
# Select the reference verification data for the characteristic,
# and select the records for the characteristic.
hwt = 'WGTP'
hwts = [hwt+str(inum) for inum in range(1, 81)] # ['WGTP1', ..., 'WGTP80']
char = 'Owner occupied units (TEN in 1,2)'
print("'{char}'".format(char=char))
tfmask_ref = dfe_dc['characteristic'] == char
tfmask_test = np.logical_or(dfh['TEN'] == 1, dfh['TEN'] == 2)
# Calculate and verify the estimate ('est') for the characteristic.
# The estimate is the sum of the sample weights 'WGTP'.
col = 'pums_est_09_to_13'
print(" '{col}':".format(col=col), end=' ')
ref_est = int(dfe_dc.loc[tfmask_ref, col].values[0].replace(',', ''))
test_est = dfh.loc[tfmask_test, hwt].sum()
assert np.isclose(ref_est, test_est, rtol=0, atol=1)
print("(ref, test) = {tup}".format(tup=(ref_est, test_est)))
# Calculate and verify the "direct standard error" ('se') of the estimate.
# The direct standard error is a modified root-mean-square deviation
# using the "replicate weights" 'WGTP[1-80]'.
col = 'pums_se_09_to_13'
print(" '{col}' :".format(col=col), end=' ')
ref_se = dfe_dc.loc[tfmask_ref, col].values[0]
test_se = ((4/80)*((dfh.loc[tfmask_test, hwts].sum() - test_est)**2).sum())**0.5
assert np.isclose(ref_se, test_se, rtol=0, atol=1)
print("(ref, test) = {tup}".format(tup=(ref_se, test_se)))
# Calculate and verify the margin of error ('moe') at the
# 90% confidence level (+/- 1.645 standard errors).
col = 'pums_moe_09_to_13'
print(" '{col}':".format(col=col), end=' ')
ref_moe = dfe_dc.loc[tfmask_ref, col].values[0]
test_moe = 1.645*test_se
assert np.isclose(ref_moe, test_moe, rtol=0, atol=1)
print("(ref, test) = {tup}".format(tup=(ref_moe, test_moe)))
# Export ipynb to html
path_static = os.path.join(os.path.expanduser(r'~'), r'stharrold.github.io/content/static')
basename = r'20160110-etl-census-with-python'
filename = r'example'
path_ipynb = os.path.join(path_static, basename, filename+'.ipynb')
for template in ['basic', 'full']:
path_html = os.path.splitext(path_ipynb)[0]+'-'+template+'.html'
cmd = ['jupyter', 'nbconvert', '--to', 'html', '--template', template, path_ipynb, '--output', path_html]
print(' '.join(cmd))
subprocess.run(args=cmd, check=True)
print()
Helpful links
Some links I found helpful for this blog post:
- American Community Survey (ACS):
- ACS About.
- ACS Guidance for Data Users describes how to get started with ACS data.
- ACS Technical Documentation links to resources for learning how to work with ACS data.
- ACS Methodology includes design details, sample sizes, coverage estimates, and past questionnaires.
- ACS Library has a collection of reports and infographics using ACS data.
- Python:
- Learning Python, 5th ed. (2013, O'Reilly) was my formal introduction to Python.
- Python for Data Analysis (2012, O'Reilly) introduced me to pandas.
- Python Cookbook, 3rd ed. (2013, O'Reilly) has a collection of optimized recipes.
- From the well-documented Python 3.5 standard library, I used collections, functools, os, pdb, subprocess, sys, and time for this post.
- Likewise, the documentation for numpy and pandas is thorough and invaluable.
- Of IPython's convenient "magic" commands, I used %pdb, %reload_ext, and the %autoreload extension within this post's Jupyter Notebooks.
- StackOverflow "How to get line count cheaply in Python" (not as performant as
wc -l
but still neat).
- Git:
- GitHub Guides are where I started with git.
- Git documentation answers a lot of questions.
- Git-flow streamlines my repository management with this branching model.
- StackOverflow "Download a specific tag with git".
- dsdemos v0.0.3 (browse code):
- To design the package file structure, I used Learning Python, 5th ed. (2013, O'Reilly), Part V, "Modules and Packages"; and Python Cookbook, 3rd ed. (2013, O'Reilly), Ch. 10, "Modules and Packages".
- I use Google-style docstrings adapted from the example by the Napoleon extension to Sphinx (a Python documentation generator, not yet used by dsdemos).
- Pytest for testing.
- Semantic Versioning for version numbers.
Footnotes
-
Some use cases for the Census Bureau's American Community Survey with data access recommendations: ACS Which Data Tool ↩
-
For this post, I mounted a single disk for storage to a single instance and loaded the data in RAM. A scaled version of this pipeline on Google Cloud Platform could include integrated services such as Cloud Storage and Big Query. ↩
-
To download ACS data via FTP:
$ sudo curl --remote-name <url>
Decompress the ZIP files with unzip. ↩
-
I'm downloading the data files rather than using the Census Bureau's API because this project requires one-time access to all data rather than dynamic access to a subset of the data. ↩
-
StackOverflow "regex to parse well-formatted multi-line data dictionary". ↩
-
With dsdemos v0.0.3 above, you can export the data dictionary to JSON format with the json Python library. See example from dsdemos tests.
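A minimal sketch of such an export, assuming ddict is the dict returned by dsd.census.parse_pumsdatadict and 'ddict.json' is a hypothetical output path:
import json
with open('ddict.json', 'w') as f:
    json.dump(ddict, f, indent=2, sort_keys=True)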
↩
-
Pandas 0.17.1 has a compatibility issue with Python 3.5. See GitHub pandas issue 11915 for a temporary fix. The issue should be resolved in pandas 0.18. ↩
-
For the formulas to calculate the estimates, the direct standard error, and the margin of error, as well as for example calculations, see 2013 5-year PUMS Accuracy, section 7, "Measuring Sampling Error". The sample weights '[P]WGTP' mitigate over/under-representation and control agreement with published ACS estimates. The accuracy PDF describes two methods of calculating the error (i.e. uncertainty) associated with an estimate of a characteristic:
- Calculate the "generalized standard error" of the estimate using "design factors" from the survey. This method does not use columns '[P]WGTP[1-80]' and requires looking up a design factor for the specific characteristic (e.g. "population by tenure"). See the accuracy PDF for the design factors.
- Calculate the "direct standard error" of the estimate using the "replicate weights", which are the columns '[P]WGTP[1-80]'. This method is extensible to many kinds of characteristics (e.g. population by tenure by age).
Note: Controlled estimates of characteristics like "total population" have 0 direct standard error from replicate weights. Use the generalized standard error for these estimates. ↩
-
There are several ways to select rows by filtering on conditions using pandas. I prefer creating a pandas.Series of boolean values as a true-false mask, then using the true-false mask as an index to filter the rows. See the docs for pandas.DataFrame.loc.
Example query: Select columns 'AGEP' and 'WGTP' where values for 'AGEP' are between 25 and 34.
tfmask = np.logical_and(25 <= df['AGEP'], df['AGEP'] <= 34)
df_subset = df.loc[tfmask, ['AGEP', 'WGTP']]
↩
-
To download dsdemos and check out v0.0.3 for following the example above:
$ cd ~
$ git clone https://github.com/stharrold/dsdemos.git
$ cd dsdemos
$ git checkout tags/v0.0.3
If you already have a clone of dsdemos, update your local repository then check out the version's tag:
$ cd dsdemos
$ git pull
$ git checkout tags/v0.0.3
↩
-
ACS Methodology includes design details, sample sizes, coverage estimates, and past questionnaires. ↩
-
Data from the Census Bureau was used to identify Japanese communities as part of the internment of US citizens and residents with Japanese ancestry during World War II. See the ACLU's FAQ section about census data and the Wikipedia article "Internment of Japanese Americans". ↩
-
"Race" is a problematic term with historical connotations and conflicts between self-identification and labeling by others. The 2015 ACS questionnaire refers to "race" and "ethnicity" separately. The American Anthropological Association recommended in 1997 that questions about "race" and "ethnicity" are ambiguous given the historical context and would be better phrased as about "race/ethnicity". For this project, I refer to "race" and "ethnicity" as "race/ethnicity". The following links are also helpful:
- Census Bureau's statement about "race" (2013)
- Office of Management and Budget, "Standards for the Classification of Federal Data on Race and Ethnicity" (1994), Appendix Directive No. 15 (1977)
- Office of Management and Budget, "Review of the Racial and Ethnic Standards to the OMB Concerning Changes" (Jul 1997)
- Office of Management and Budget, "Revisions to the Standards for the Classification of Federal Data on Race and Ethnicity" (Oct 1997)
- American Anthropological Association, "Statement on Race" (1998)
- Wikipedia article "Race and ethnicity in the United States Census"
-
The ACS 3-year estimates are discontinued; 2013 is the last year included in the 3-year estimates. For guidance in choosing, accessing, and using a data set, see ACS Guidance for Data Users. ↩
-
StackExchange Programmers "R vs Python for data analysis". ↩
-
See the PyData stack for a collection of performant Python packages. ↩