Overview
The Census Bureau collects data from people in the United States through multiple survey programs. Federal, state, and local governments use the data to assess how constituents are represented and to allocate spending. The data are also made freely available to the public and have a wide range of uses.1 In this post, I parse, load, and verify data from the Census Bureau's American Community Survey (ACS) 2013 5-year Public Use Microdata Sample (PUMS) for Washington DC.
Brief process:
- Start with "Running an IPython Notebook on Google Compute Engine from Chrome".
- For additional storage, create and mount a disk to the instance.2
- Download (and decompress):3 4
- 2013 5-year PUMS data dictionary: PUMS_Data_Dictionary_2009-2013.txt (<1 MB)
- 2013 5-year PUMS person and housing records for Washington DC:
- Person records: csv_pdc.zip (5 MB compressed, 30 MB decompressed)
- Housing records: csv_hdc.zip (2 MB compressed, 13 MB decompressed)
- 2013 5-year PUMS estimates for user verification: pums_estimates_9_13.csv (<1 MB)
- Load the files:
- Data dictionary TXT: dsdemos.census.parse_pumsdatadict (see the dsdemos package below). This is a customized parser I wrote for PUMS_Data_Dictionary_2009-2013.txt. The data dictionary is inconsistently formatted, which complicates parsing.5 6
- Person/housing records and user verification CSVs: pandas.read_csv.7 8
- Confirm the user verification estimates (see the minimal sketch after this list and the full example below):9
- To calculate an estimate \(X\) for a specific "characteristic" (e.g. "Age 25-34"), sum the column '[P]WGTP' of the filtered data ('PWGTP' for person records, 'WGTP' for housing records).10 '[P]WGTP' are the sample weights.
- To calculate the estimate's "direct standard error", use the ACS's modified root-mean-square deviation:
$$\mathrm{SE}(X) = \sqrt{\frac{4}{80}\sum_{r=1}^{80}(X_r-X)^2}$$
where each \(X_r\) is the sum of the column '[P]WGTPr' of the filtered data. '[P]WGTP[1-80]' are the "replicate weights".
- To calculate the estimate's margin of error (defined by the ACS at the 90% confidence level):
$$\mathrm{MOE}(X) = 1.645\,\mathrm{SE}(X)$$
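A minimal sketch of how these formulas map onto pandas operations (the verified calculation for Washington DC is in the Example section below). The helper pums_estimate and its df argument are illustrative only, not part of dsdemos; df is assumed to be a PUMS DataFrame already filtered to the characteristic of interest, with wt='WGTP' for housing records or wt='PWGTP' for person records:
def pums_estimate(df, wt='WGTP'):
    """Return (estimate, direct standard error, margin of error) for filtered PUMS records."""
    wts = [wt + str(r) for r in range(1, 81)]  # replicate weight columns, e.g. 'WGTP1'-'WGTP80'
    est = df[wt].sum()                         # estimate X: sum of the sample weights
    se = ((4/80)*((df[wts].sum() - est)**2).sum())**0.5  # SE(X) from the replicate weights
    return est, se, 1.645*se                   # MOE(X) at the 90% confidence level
# e.g. owner-occupied units: pums_estimate(dfh.loc[dfh['TEN'].isin([1, 2])], wt='WGTP')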
Source code:
- For step-by-step details, see the Jupyter Notebook (click the HTML export to render in-browser):
- 20160110-etl-census-with-python.ipynb
- 20160110-etl-census-with-python-full.html
- This post uses dsdemos v0.0.3.11
Motivations
Why am I using the American Community Survey (ACS)?
- The ACS is a relevant data set. A future step is to predict an individual's household income, which is among the subjects that the ACS survey addresses.
- The ACS is a reliable data set.12 The ACS has quality controls to ensure that it is representative. The survey samples about 3 million addresses per year with a response rate of about 97%.
- The ACS is a time-series data set. The survey sends questionnaires throughout the year and releases data once per year. A future step is to use the time series to forecast an individual's household income.
- I recognize that using ACS data can be problematic. Data from the Census Bureau has been used for harm,13 and current ACS terminology asks respondents to identify by terms such as "race".14 For this project, I take data from the Census Bureau at face value, and I infer from it at face value. It's important to respect that these aren't simply data points; these are people.
Why am I using the ACS 5-year estimate?
As of Dec 2015, the ACS offers two windowing options for its data releases: 1-year estimates and 5-year estimates.15 Because the ACS 5-year estimates include data over a 5-year window, they have the largest sample size and thus the highest precision for modeling small populations. However, the larger sample size comes at the expense of currency. A future step is to forecast predictions from the 5-year window so that they are more relevant to a specific year.
Why am I using Python?
This project can be done using Python, R, SQL, and/or other languages.16 I'm using Python since a future step is to build a machine learning pipeline, a popular application of Python's scikit-learn.
What about "big data"?
I'm starting with a data set small enough to be processed in memory (i.e. operated on in RAM), since many Python packages focus on in-memory operations on single machines.17 These packages often parallelize operations across the machine's processor cores. For operations that exceed the machine's available RAM (i.e. out-of-core computations), there's Dask for Python, and for operations that require a cluster of machines, there's Spark for Java, Scala, Python, and R. Scaling the pipeline to a data set large enough to require a cluster is a future step.
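As a rough illustration of the out-of-core option, here is a minimal sketch using Dask's pandas-like dataframe API to sum the housing sample weights without loading the whole CSV into RAM. For a file this small, plain pandas is sufficient; the sketch assumes dask is installed and that ss13hdc.csv is in the working directory:
import dask.dataframe as dd
# Build a lazy task graph over the CSV instead of reading it all into memory.
ddf = dd.read_csv('ss13hdc.csv')
# compute() executes the graph, processing the file in chunks across processor cores.
total_weight = ddf['WGTP'].sum().compute()
print(total_weight)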
Example
This is an abbreviated example of my ETL procedure in the Jupyter Notebook for this post (see links to source code above).
cd ~
# Import standard packages.
import os
import subprocess
import sys
# Import installed packages.
import numpy as np
import pandas as pd
# Import local packages.
# Insert current directory into module search path.
# dsdemos version: https://github.com/stharrold/dsdemos/releases/tag/v0.0.3
sys.path.insert(0, os.path.join(os.path.curdir, r'dsdemos'))
import dsdemos as dsd
# File paths
# Base path to ACS:
# http://www2.census.gov/programs-surveys/acs/
# 2013 5-year PUMS data dictionary:
# tech_docs/pums/data_dict/PUMS_Data_Dictionary_2009-2013.txt
# 2013 5-year PUMS housing records for Washington DC (extracted from csv_hdc.zip):
# data/pums/2013/5-Year/ss13hdc.csv
# 2013 5-year PUMS user verification estimates:
# tech_docs/pums/estimates/pums_estimates_9_13.csv
path_acs = r'/mnt/disk-20151227t211000z/www2-census-gov/programs-surveys/acs/'
path_dtxt = os.path.join(path_acs, r'tech_docs/pums/data_dict/PUMS_Data_Dictionary_2009-2013.txt')
path_hcsv = os.path.join(path_acs, r'data/pums/2013/5-Year/ss13hdc.csv')
path_ecsv = os.path.join(path_acs, r'tech_docs/pums/estimates/pums_estimates_9_13.csv')
# Load and display the data dictionary.
ddict = dsd.census.parse_pumsdatadict(path=path_dtxt)
print("ddict
, dfd
: Convert the nested dict
into a hierarchical data frame.")
tmp = dict() # tmp
is a throwaway variable
for record_type in ddict['record_types']:
tmp[record_type] = pd.DataFrame.from_dict(ddict['record_types'][record_type], orient='index')
dfd = pd.concat(tmp, names=['record_type', 'var_name'])
dfd.head()
# Load and display the housing records.
print("dfh
: First 5 housing records and first 10 columns.")
dfh = pd.read_csv(path_hcsv)
dfh.iloc[:, :10].head()
# Load and display the verification estimates.
# Select the estimates for Washington DC then for the
# characteristic 'Owner occupied units (TEN in 1,2)'.
dfe = pd.read_csv(path_ecsv)
tfmask_dc = dfe['state'] == 'District of Columbia'
dfe_dc = dfe.loc[tfmask_dc]
print("dfe_dc
: This example verifies these quantities.")
dfe_dc.loc[[310]]
# Verify the estimates following
# https://www.census.gov/programs-surveys/acs/
# technical-documentation/pums/documentation.2013.html
# tech_docs/pums/accuracy/2009_2013AccuracyPUMS.pdf
# Define the column names for the housing weights.
# Select the reference verification data for the characteristic,
# and select the records for the characteristic.
hwt = 'WGTP'
hwts = [hwt+str(inum) for inum in range(1, 81)] # ['WGTP1', ..., 'WGTP80']
char = 'Owner occupied units (TEN in 1,2)'
print("'{char}'".format(char=char))
tfmask_ref = dfe_dc['characteristic'] == char
tfmask_test = np.logical_or(dfh['TEN'] == 1, dfh['TEN'] == 2)
# Calculate and verify the estimate ('est') for the characteristic.
# The estimate is the sum of the sample weights 'WGTP'.
col = 'pums_est_09_to_13'
print(" '{col}':".format(col=col), end=' ')
ref_est = int(dfe_dc.loc[tfmask_ref, col].values[0].replace(',', ''))
test_est = dfh.loc[tfmask_test, hwt].sum()
assert np.isclose(ref_est, test_est, rtol=0, atol=1)
print("(ref, test) = {tup}".format(tup=(ref_est, test_est)))
# Calculate and verify the "direct standard error" ('se') of the estimate.
# The direct standard error is a modified root-mean-square deviation
# using the "replicate weights" 'WGTP[1-80]'.
col = 'pums_se_09_to_13'
print(" '{col}' :".format(col=col), end=' ')
ref_se = dfe_dc.loc[tfmask_ref, col].values[0]
test_se = ((4/80)*((dfh.loc[tfmask_test, hwts].sum() - test_est)**2).sum())**0.5
assert np.isclose(ref_se, test_se, rtol=0, atol=1)
print("(ref, test) = {tup}".format(tup=(ref_se, test_se)))
# Calculate and verify the margin of error ('moe') at the
# 90% confidence level (+/- 1.645 standard errors).
col = 'pums_moe_09_to_13'
print(" '{col}':".format(col=col), end=' ')
ref_moe = dfe_dc.loc[tfmask_ref, col].values[0]
test_moe = 1.645*test_se
assert np.isclose(ref_moe, test_moe, rtol=0, atol=1)
print("(ref, test) = {tup}".format(tup=(ref_moe, test_moe)))
# Export ipynb to html
path_static = os.path.join(os.path.expanduser(r'~'), r'stharrold.github.io/content/static')
basename = r'20160110-etl-census-with-python'
filename = r'example'
path_ipynb = os.path.join(path_static, basename, filename+'.ipynb')
for template in ['basic', 'full']:
path_html = os.path.splitext(path_ipynb)[0]+'-'+template+'.html'
cmd = ['jupyter', 'nbconvert', '--to', 'html', '--template', template, path_ipynb, '--output', path_html]
print(' '.join(cmd))
subprocess.run(args=cmd, check=True)
print()
Helpful links
Some links I found helpful for this blog post:
- American Community Survey (ACS):
- ACS About.
- ACS Guidance for Data Users describes how to get started with ACS data.
- ACS Technical Documentation links to resources for learning how to work with ACS data.
- ACS Methodology includes design details, sample sizes, coverage estimates, and past questionnaires.
- ACS Library has a collection of reports and infographics using ACS data.
- Python:
- Learning Python, 5th ed. (2013, O'Reilly) was my formal introduction to Python.
- Python for Data Analysis (2012, O'Reilly) introduced me to pandas.
- Python Cookbook, 3rd ed. (2013, O'Reilly) has a collection of optimized recipes.
- From the well-documented Python 3.5 standard library, I used collections, functools, os, pdb, subprocess, sys, and time for this post.
- Likewise, the documentation for numpy and pandas is thorough and invaluable.
- Of IPython's convenient "magic" commands, I used %pdb, %reload_ext, and the %autoreload extension within this post's Jupyter Notebooks.
- StackOverflow "How to get line count cheaply in Python" (not as performant as
wc -l
but still neat).
- Git:
- GitHub Guides are where I started with git.
- Git documentation answers a lot of questions.
- Git-flow streamlines my repository management with this branching model.
- StackOverflow "Download a specific tag with git".
- dsdemos v0.0.3 (browse code):
- To design the package file structure, I used Learning Python, 5th ed. (2013, O'Reilly), Part V, "Modules and Packages"; and Python Cookbook, 3rd ed. (2013, O'Reilly), Ch. 10, "Modules and Packages".
- I use Google-style docstrings adapted from the example by the Napoleon extension to Sphinx (a Python documentation generator, not yet used by dsdemos).
- Pytest for testing.
- Semantic Versioning for version numbers.
Footnotes
-
Some use cases for the Census Bureau's American Community Survey with data access recommendations: ACS Which Data Tool ↩
-
For this post, I mounted a single disk for storage to a single instance and loaded the data in RAM. A scaled version of this pipeline on Google Cloud Platform could include integrated services such as Cloud Storage and Big Query. ↩
-
To download ACS data via FTP:
$ sudo curl --remote-name <url>
Decompress the ZIP files with unzip. ↩
-
I'm downloading the data files rather than using the Census Bureau's API because this project requires one-time access to all data rather than dynamic access to a subset of the data. ↩
-
StackOverflow "regex to parse well-formatted multi-line data dictionary". ↩
-
With dsdemos v0.0.3 above, you can export the data dictionary to JSON format with the json Python library. See example from dsdemos tests.
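A minimal sketch of such an export, assuming ddict is the dict returned by dsd.census.parse_pumsdatadict and 'ddict.json' is a hypothetical output path:
import json
with open('ddict.json', 'w') as f:
    json.dump(ddict, f, indent=2, sort_keys=True)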
↩
-
Pandas 0.17.1 has a compatibility issue with Python 3.5. See GitHub pandas issue 11915 for a temporary fix. The issue should be resolved in pandas 0.18. ↩
-
For the formulas to calculate the estimates, the direct standard error, and the margin of error, as well as for example calculations, see 2013 5-year PUMS Accuracy, section 7, "Measuring Sampling Error". The sample weights '[P]WGTP' mitigate over/under-representation and control agreement with published ACS estimates. The accuracy PDF describes two methods of calculating the error (i.e. uncertainty) associated with an estimate of a characteristic:
- Calculate the "generalized standard error" of the estimate using "design factors" from the survey. This method does not use columns '[P]WGTP[1-80]' and requires looking up a design factor for the specific characteristic (e.g. "population by tenure"). See the accuracy PDF for the design factors.
- Calculate the "direct standard error" of the estimate using the "replicate weights", which are the columns '[P]WGTP[1-80]'. This method is extensible to many kinds of characteristics (e.g. population by tenure by age).
Note: Controlled estimates of characteristics like "total population" have 0 direct standard error from replicate weights. Use the generalized standard error for these estimates. ↩
-
There are several ways to select rows by filtering on conditions using pandas. I prefer creating a pandas.Series of boolean values as a true-false mask, then using the true-false mask as an index to filter the rows. See the docs for pandas.DataFrame.loc.
Example query: Select columns 'AGEP' and 'WGTP' where values for 'AGEP' are between 25 and 34.
tfmask = np.logical_and(25 <= df['AGEP'], df['AGEP'] <= 34)
df_subset = df.loc[tfmask, ['AGEP', 'WGTP']]
↩
-
To download dsdemos and check out v0.0.3 for following the example above:
$ cd ~
$ git clone https://github.com/stharrold/dsdemos.git
$ cd dsdemos
$ git checkout tags/v0.0.3
If you already have a clone of dsdemos, update your local repository then check out the version's tag:
$ cd dsdemos
$ git pull
$ git checkout tags/v0.0.3
↩
-
ACS Methodology includes design details, sample sizes, coverage estimates, and past questionnaires. ↩
-
Data from the Census Bureau was used to identify Japanese communities as part of the internment of US citizens and residents with Japanese ancestry during World War II. See the ACLU's FAQ section about census data and the Wikipedia article "Internment of Japanese Americans". ↩
-
"Race" is a problematic term with historical connotations and conflicts between self-identification and labeling by others. The 2015 ACS questionnaire refers to "race" and "ethnicity" separately. The American Anthropological Association recommended in 1997 that questions about "race" and "ethnicity" are ambiguous given the historical context and would be better phrased as about "race/ethnicity". For this project, I refer to "race" and "ethnicity" as "race/ethnicity". The following links are also helpful:
- Census Bureau's statement about "race" (2013)
- Office of Management and Budget, "Standards for the Classification of Federal Data on Race and Ethnicity" (1994), Appendix Directive No. 15 (1977)
- Office of Management and Budget, "Review of the Racial and Ethnic Standards to the OMB Concerning Changes" (Jul 1997)
- Office of Management and Budget, "Revisions to the Standards for the Classification of Federal Data on Race and Ethnicity" (Oct 1997)
- American Anthropological Association, "Statement on Race" (1998)
- Wikipedia article "Race and ethnicity in the United States Census"
-
The ACS 3-year estimates are discontinued; 2013 is the last year included in the 3-year estimates. For guidance in choosing, accessing, and using a data set, see ACS Guidance for Data Users. ↩
-
StackExchange Programmers "R vs Python for data analysis". ↩
-
See the PyData stack for a collection of performant Python packages. ↩