tigris
& getcensus
tidycensus
Data scientists using R often run into two problems:
Understanding what data the Census Bureau publishes.
Understanding what R packages on CRAN are available to help with their project.
The U.S. Census Bureau is the premier source of data about US people, places and economy.
This makes the Bureau a natural source of information for data analysts.
The Census Bureau conducts over 100 Censuses, Surveys and Programs.
You can view the full list of programs here.
Top 5 most popular(by API requests):
The Census Bureau’s geographic hierarchy!
Designed to be relatively homogeneous, e.g. population characteristics, economic status, living conditions
Average about 4,000 inhabitants
Every house, every tree, every city has its own unique latitude and longitude coordinates.
There are two underlying important pieces of information for spatial data:
Though we refer to a shape file in the singular, it’s actually a collection of at least three basic files:
All files must be present in the directory and named the same (except for the file extension) to import correctly.
Use vector data and shapefiles to create choropleth maps
Household Income by Census Tract
tigris
for maps & getcensus
for census datatidycensus
for maps and census data (WINNER!)Download a shape file of state boundaries from the Census.
Point R (or other spatial software) to correct filepath find File Paths, folders, etc.
Using the tigris package, which lets us download Census shapefiles directly into R without having to unzip and point to directories, etc.
Simply call any of these functions (with the proper location variables):
tracts()
counties()
school_districts()
roads()
Instead of downloading data from the horrible-to-navigate Census FactFinder or pleasant-to-navigate CensusReporter.org we can pull the code with the censusapi package from Hannah Recht, of Bloomberg.
First, sign up for a census key.
Second, replace YOURKEYHERE
with your Census API key.
# Add key to .Renviron
Sys.setenv(CENSUS_KEY="YOURKEYHERE")
# Check to see that the expected key is output in your R console
Sys.getenv("CENSUS_KEY")
Check out the dozens of data sets you can access.
apis <- listCensusApis()
View(apis)
We’ll focus on using the getCensus()
function from the package. It makes an API call and returns a data frame of results.
These are the arguments you’ll need to pass it:
name
- the name of the Census data set, like “acs5” or “timeseries/bds/firms”vintage
- the year of the data setvars
- one or more variables to accessregion
- the geography level of data, like county or tracts or stateYou can use listCensusMetadata()
to see what tables might be available from the ACS Census survey.
acs_vars <- listCensusMetadata(name="acs/acs5", type="variables", vintage=2016)
View(acs_vars)
Slow Process! Please don’t run this right now.
In the search finder window, type variable of interest, i.e. median household income
az_income <- getCensus(name = "acs/acs5", vintage = 2016,
vars = c("NAME", "B19013_001E", "B19013_001M"),
region = "county:*", regionin = "state:04")
head(az_income)
## state county NAME B19013_001E B19013_001M
## 1 04 001 Apache County, Arizona 32460 1381
## 2 04 003 Cochise County, Arizona 45383 1470
## 3 04 005 Coconino County, Arizona 51106 1578
## 4 04 007 Gila County, Arizona 40593 1420
## 5 04 009 Graham County, Arizona 47422 3914
## 6 04 011 Greenlee County, Arizona 51813 5787
## Data Wrangling
az_income %>%
rename(MHI_est = B19013_001E ,
MHI_moe = B19013_001M)%>%
mutate(se = MHI_moe/1.645,
cv = (se/MHI_est)*100) %>% #CV is the coefficient of variation
##Plot
ggplot( aes(x = MHI_est,
y = reorder(NAME, MHI_est))) +
geom_point(color = "black", size = 2) +
geom_errorbarh(aes(xmin = MHI_est - MHI_moe,
xmax = MHI_est + MHI_moe )) +
labs(title = "Median Household Income by County:",
subtitle =
paste0("Arizona US Census/ACS5 2016"),
x = "Median Household Income", y="") +
scale_x_continuous(labels = scales::dollar) +
theme_minimal() +
theme(panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank())
First, let’s make sure the shape files download as sf files (because it can also handle sp versions, as well)
If cb is set to TRUE, it downloads a generalized (1:500k) counties file. Default is FALSE (the most detailed TIGER file).
View(az)
theme_void() +
theme(panel.grid.major = element_line(colour = 'transparent')) +
Time to join the map data and median income data
state county NAME B19013_001E B19013_001M
1 04 001 Apache County, Arizona 32460 1381
[1] "015" "005" "009" "013" "027" "001" "025" "021" "011" "023" "003"
[12] "017" "019" "012" "007"
## [1] "STATEFP" "COUNTYFP" "COUNTYNS" "AFFGEOID" "GEOID"
## [6] "NAME.x" "LSAD" "ALAND" "AWATER" "state"
## [11] "NAME.y" "B19013_001E" "B19013_001M" "geometry"
## Simple feature collection with 1 feature and 13 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -114.7548 ymin: 34.20963 xmax: -112.5293 ymax: 37.00075
## epsg (SRID): 4269
## proj4string: +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
## STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID NAME.x LSAD ALAND
## 1 04 015 00025445 0500000US04015 04015 Mohave 06 34475503964
## AWATER state NAME.y B19013_001E B19013_001M
## 1 387344307 04 Mohave County, Arizona 39856 1002
## geometry
## 1 MULTIPOLYGON (((-114.7532 3...
ggplot(az4ever) +
geom_sf(aes(fill=B19013_001E), color="white") +
theme_void() +
theme(panel.grid.major = element_line(colour = 'transparent')) +
scale_fill_distiller(palette="Oranges", direction=1, name="Median income") +
labs(title="2016 Median income in Arizona counties", caption="Source: US Census/ACS5 2016")
Newest package for Census data: tidycensus
With tidycensus, you can download the shape files with the data you want already attached. No joins necessary.
Let’s get right into mapping. We’ll calculate unemployment percents by Census tract in Maricopa County. It’ll involve wrangling some data. But querying the data with get_acs()
will be easy and so will getting the shape file by simply passing it geometry=T
.
Pass it your Census key.
VarSearch <- load_variables( 2017, "acs5", cache=TRUE )
# convert all letters to upper case
VarSearch$label <- toupper(VarSearch$label)
unemployment <- VarSearch %>%
mutate( contains.unemployment = grepl( "UNEMPLOYED", label ) ) %>%
#Create new variable with Mutate that has UNEMPLOYED in title using grepl
filter( contains.unemployment )
head(unemployment, 1)
## # A tibble: 1 x 4
## name label concept contains.unemplo…
## <chr> <chr> <chr> <lgl>
## 1 B12006… ESTIMATE!!TOTAL!!NEVER M… MARITAL STATUS BY SE… TRUE
View
Filter
(top left). In label
box, type Hispanic
B03001_002
)jobs <- c(labor_force = "B23025_005E",
unemployed = "B23025_002E")
arizona <- get_acs(geography="tract", year=2017, survey="acs5",
variables= jobs, county = "Maricopa",
state="AZ", geometry=T)
## Simple feature collection with 6 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -112.0654 ymin: 33.46573 xmax: -111.04 ymax: 34.03733
## epsg (SRID): 4269
## proj4string: +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
## GEOID NAME variable
## 1 04013010101 Census Tract 101.01, Maricopa County, Arizona B23025_002
## 2 04013010101 Census Tract 101.01, Maricopa County, Arizona B23025_005
## 3 04013010102 Census Tract 101.02, Maricopa County, Arizona B23025_002
## 4 04013010102 Census Tract 101.02, Maricopa County, Arizona B23025_005
## 5 04013030401 Census Tract 304.01, Maricopa County, Arizona B23025_002
## 6 04013030401 Census Tract 304.01, Maricopa County, Arizona B23025_005
## estimate moe geometry
## 1 1681 333 MULTIPOLYGON (((-111.7869 3...
## 2 87 59 MULTIPOLYGON (((-111.7869 3...
## 3 1767 431 MULTIPOLYGON (((-112.0654 3...
## 4 54 47 MULTIPOLYGON (((-112.0654 3...
## 5 1354 296 MULTIPOLYGON (((-111.9648 3...
## 6 30 49 MULTIPOLYGON (((-111.9648 3...
library(tidyr)
arizona<-arizona %>%
mutate(variable=case_when(
variable=="B23025_005" ~ "Unemployed",
variable=="B23025_002" ~ "Workforce")) %>%
select(-moe) %>%
spread(variable, estimate) %>% #Spread moves rows into columns
mutate(percent_unemployed=round(Unemployed/Workforce*100,2))
head(arizona)
library(tidyr)
ggplot(arizona, aes(fill=percent_unemployed)) +
geom_sf(color="white") +
theme_void() + theme(panel.grid.major = element_line(colour = 'transparent')) +
scale_fill_distiller(palette="Reds", direction=1, name="Estimate") +
labs(title="Percent unemployed in Maricopa County", caption="Source: US Census/ACS5 2016") +
NULL
head(maricopa)
## Simple feature collection with 6 features and 7 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -112.0654 ymin: 33.46573 xmax: -111.04 ymax: 34.03733
## epsg (SRID): 4269
## proj4string: +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
## GEOID NAME variable
## 1 04013010101 Census Tract 101.01, Maricopa County, Arizona White
## 2 04013010101 Census Tract 101.01, Maricopa County, Arizona Black
## 3 04013010101 Census Tract 101.01, Maricopa County, Arizona Asian
## 4 04013010101 Census Tract 101.01, Maricopa County, Arizona Hispanic
## 5 04013010102 Census Tract 101.02, Maricopa County, Arizona White
## 6 04013010102 Census Tract 101.02, Maricopa County, Arizona Black
## estimate moe summary_est summary_moe geometry
## 1 4814 513 4915 517 MULTIPOLYGON (((-111.7869 3...
## 2 0 12 4915 517 MULTIPOLYGON (((-111.7869 3...
## 3 19 33 4915 517 MULTIPOLYGON (((-111.7869 3...
## 4 78 104 4915 517 MULTIPOLYGON (((-111.7869 3...
## 5 4408 592 4602 602 MULTIPOLYGON (((-112.0654 3...
## 6 25 38 4602 602 MULTIPOLYGON (((-112.0654 3...
Combine data wrangling and mapping into 1
library(viridis)
maricopa %>%
mutate(pct = 100 * (estimate / summary_est)) %>%
ggplot(aes(fill = pct, color = pct)) +
facet_wrap(~variable) +
geom_sf() +
coord_sf(crs = 26915) +
scale_fill_viridis(direction=-1) +
scale_color_viridis(direction=-1) +
theme_void() +
theme(panel.grid.major = element_line(colour = 'transparent')) +
labs(title="Racial geography of Maricopa County, Arizona", caption="Source: US Census 2010")
Poverty = B17001_002
county_pov <- get_acs(geography = "county",
variables = "B17001_002",
summary_var = "B17001_001",
year = 2017,
survey = "acs5",
geometry = TRUE,
shift_geo = TRUE) %>%
mutate(pctpov = 100 * (estimate/summary_est))
## Plot
ggplot(county_pov) +
geom_sf(aes(fill = pctpov), color=NA) +
coord_sf(datum=NA) +
labs(title = "Percent of population in poverty by county",
subtitle = "Alaska and Hawaii are shifted and not to scale",
caption = "Source: ACS 5-year, 2016",
fill = "% in poverty") +
scale_fill_viridis(direction=-1)