Part 1: Exploratory data analysis
Data wrangling
As stated in the beginning, the LODES data structure is both high-resolution and quite longitudinal; the Census geography is the block, the smallest possible, but ACS data is only up to the block group level, so fortunately we can just aggregate by the first 12 digits of the FIPS code for each block. Temporally, data stretches back to 2002, though as 2009 was the start of higher-quality data collection according to the Census Bureau, we begin with that year instead.
The LODES database consists of three separate datasets. These consist of:
- Workplace Area Characteristics (WAC): Per block, count of the workplaces of workers. Crosstabs include age group, income group, NAICS industry sector (construction, finance, etc.), race, education, age and size of firm.
- Residence Area Characteristics (RAC): Per block, count of the residences of workers. Crosstabs are the same as WAC
- Origin-Destination (OD): Per worker, denote their residence block, workplace block, and age group/income group/basic industry sector (goods producing, trade/transportation, and others)
This section is concerned with WAC and RAC datasets, and basic inferences on their content. We use built-in Python functions to directly download from the Census Bureau FTP server, and as data is split by year, aggregate years 2009-2019 together in one Dask DataFrame.
years = ['2009','2010','2011','2012','2013','2014','2015','2016','2017','2018','2019']
url = "https://lehd.ces.census.gov/data/lodes/LODES7/pa/wac/pa_wac_S000_JT00_"
wac = dd.concat([pd.read_csv(gzip.open(BytesIO(urlopen(url+year+".csv.gz").read()))).assign(year=int(year)) for year in years])
wac['w_geocode'] = wac['w_geocode'].astype('str').str[:12]
wac.head()
url = "https://lehd.ces.census.gov/data/lodes/LODES7/pa/rac/pa_rac_S000_JT00_"
rac = dd.concat([pd.read_csv(gzip.open(BytesIO(urlopen(url+year+".csv.gz").read()))).assign(year=int(year)) for year in years])
rac['h_geocode'] = rac['h_geocode'].astype('str').str[:12]
rac.head()
To limit our selected area of analysis, we load Census TIGER block group shapefiles for Philadelphia and its five surrounding PA counties. Including New Jersey and Delaware block groups as well as LODES data is possible, but the size of the data already was at the limit for the available computing power.
phl_bg = gpd.read_file("bg/tl_2019_42_bg/tl_2019_42_bg.shp").query("COUNTYFP in ['017','029','045','091','101']").to_crs('EPSG:3857')
Viz 1: Hexbin of workplace and residence locations
Simply plotting the overall density of workplaces and residences will give us a good idea of the kind of inherent spatial structure we are dealing with in the Philadelphia region. After summing and merging each block group, we take the centroid of each block group, as hvPlot
only recognizes lat/long for hexbins.
wac_bg_sum = wac.groupby(['year','w_geocode'])['C000'].sum().compute().reset_index()
wac_bg_sum = gpd.GeoDataFrame(phl_bg[['GEOID','geometry']].merge(wac_bg_sum, left_on='GEOID', right_on='w_geocode', how='inner'),
geometry='geometry',crs='EPSG:3857')
wac_bg_sum['Longitude'] = wac_bg_sum.to_crs(4326).geometry.centroid.x
wac_bg_sum['Latitude'] = wac_bg_sum.to_crs(4326).geometry.centroid.y
wac_bg_sum['C000'].isnull().values.any()
wac_bg_sum
rac_bg_sum = rac.groupby(['year','h_geocode'])['C000'].sum().compute().reset_index()
rac_bg_sum = gpd.GeoDataFrame(phl_bg[['GEOID','geometry']].merge(rac_bg_sum, left_on='GEOID', right_on='h_geocode', how='inner'),
geometry='geometry',crs='EPSG:3857')
rac_bg_sum['Longitude'] = rac_bg_sum.to_crs(4326).geometry.centroid.x
rac_bg_sum['Latitude'] = rac_bg_sum.to_crs(4326).geometry.centroid.y
rac_bg_sum['C000'].isnull().values.any()
rac_bg_sum
We see that while the workplace hexbin on the left does show a few hexes with high density, the residential hexbin on right is more or less uniform across the entire region, and this is not a mistake as the colors do change slightly over years. Later we will see if the OD dataset and using raw point data is perhaps more reliable to visualize residential density.
hv_panel = pn.panel(hv.Layout(
wac_bg_sum.hvplot.hexbin(title='Workplace hexbin',x='Longitude',y='Latitude',geo=True,c='C000',tiles='StamenTonerBackground',hover_cols='C000',groupby='year').map(gv.HexTiles,hv.HexTiles) \
.options(hv.opts.HexTiles(alpha=0.7,aggregator=np.mean,width=800,height=800,xaxis=None,yaxis=None)) +
rac_bg_sum.hvplot.hexbin(title='Residence hexbin',x='Longitude',y='Latitude',geo=True,c='C000',tiles='StamenTonerBackground',hover_cols='C000',groupby='year').map(gv.HexTiles,hv.HexTiles) \
.options(hv.opts.HexTiles(alpha=0.7,aggregator=np.mean,width=800,height=800,xaxis=None,yaxis=None))
).cols(2))
widgets=hv_panel[1]
comb = pn.Column(
pn.Row(*widgets),
hv_panel[0])
comb.save('./charts/1_hexbin.html',embed=True)
The following was an attempt to use xarray
and hvPlot’s contourf
2d kernel density function, but this was not successful.
wac_bg_xr = pd.DataFrame(wac_bg_sum.drop(columns=['GEOID','geometry','w_geocode'])).set_index(['Longitude','Latitude','year']).to_xarray()
wac_bg_xr
wac_bg_xr.hvplot.contourf(
x='Longitude',y='Latitude',z='C000',groupby='year',levels=8)
Viz 2: Comparing worker job locations to residence locations
Since every block group has both workplaces and residences, it is worthwhile to see how they compare head-to-head. The left plot simply reflects if the block group has a greater raw count of workplaces or residences, while the second is a ratio; the second actually is relatively close to the algorithm Manduca used to determine central business districts, as he found was more reliable and versatile in different urban contexts. We can see that the first plot casts a much wider net on the eligibility for a workplace-characterized tract than the second. Later on we will use clustering to create a more sophisticated workplace-detection method.
comp = pd.merge(wac_bg_sum[['year','w_geocode','geometry','C000']],rac_bg_sum[['year','h_geocode','C000']],
left_on=['year','w_geocode'],right_on=['year','h_geocode'],how='inner')
comp['net'] = np.where(comp['C000_x'] - comp['C000_y'] > 0,'More workplaces','More residences')
comp['ratio'] =comp['C000_x']/comp['C000_y']
comp = gpd.GeoDataFrame(comp[comp['ratio'] < 700])
# hv_panel = pn.panel(hv.Layout(
# comp.hvplot(title="More workplaces or residents by block group",geo=True,crs=3857,groupby='year',c='net')
# +
# comp.hvplot(title="Ratio of workplaces to residents by block group",geo=True,crs=3857,groupby='year',c='ratio')
# ).cols(2))
# widgets=hv_panel[1]
# comb = pn.Column(
# pn.Row(*widgets),
# hv_panel[0])
hv.save(comp.hvplot(title="More workplaces or residents by block group",geo=True,crs=3857,groupby='year',c='net',widget_position='top'),'./charts/1_w_or_h.html',embed=True)
hv.save(comp.hvplot(title="Ratio of workplaces to residents by block group",geo=True,crs=3857,groupby='year',c='ratio',widget_position='top'),'./charts/1_wh_ratio.html',embed=True)
Viz 3: Looking at jobs and broken by industry/income over time
Here we take the crosstabs for industry and income, and evaluate how they change over the 11-year period 2009-2019. We can see that while job sectors have barely changed over time except for an increase in healthcare jobs (not a surprise for Philadelphia), the highest income bracket has increased over time relative to the others.
allcol = [col for col in wac if col.startswith('C')]
cnscol = [col for col in wac if col.startswith('CNS')]
inccol = [col for col in wac if col.startswith('CE')]
wac_year_sum = wac.groupby('year')[allcol].sum().compute()
hv.save(wac_year_sum.hvplot.area(title='Jobs by industry, 2009-2019',
widget_location='top',x='year',y=cnscol,value_label='Job count'),'./charts/1_jobs_industry.html',embed=True)
hv.save(wac_year_sum.hvplot.area(title='Jobs by income bracket, 2009-2019',
widget_location='top',x='year',y=inccol,value_label='Job count'),'./charts/1_jobs_income.html',embed=True)
Viz 4: Density-distance plot: Workplace and population as a function of distance from CBD
Next, we aggregate a level higher onto the census tract and merge with ACS population for 2013-2019. Then we calculate the distance in feet from each census tract centroid to Center City, and plot the results for workplace and population density by this distance. We can see that while population density is relatively uniform by distance. job density peaks significantly at the CBD, as expected
pop_phl = pd.DataFrame()
for yr in [int(i) for i in years[4:]]:
acs = cenpy.products.ACS(year=yr)
pop_phl = pop_phl.append(acs.from_msa(msa='Philadelphia-Camden-Wilmington, PA-NJ-DE-MD', variables=["B01001_001E"], level='tract', return_geometry=True) \
.assign(year=yr))
from shapely.geometry import Point
wac_ct_sum = wac_bg_sum.drop(columns=['geometry','GEOID'])
wac_ct_sum['GEOID'] = wac_ct_sum['w_geocode'].astype('str').str[:11]
wac_ct_sum = wac_ct_sum.groupby(['year','GEOID'])['C000'].sum().reset_index()
wac_ct_sum = gpd.GeoDataFrame(pd.merge(wac_ct_sum,pop_phl[['year','GEOID','geometry','B01001_001E']],on=['year','GEOID'],how='inner'),geometry='geometry',crs='EPSG:3857')
wac_ct_sum['distCenterCity'] = wac_ct_sum['geometry'].centroid.distance(Point(-8368112.55, 4859373.64))
wac_ct_sum = wac_ct_sum.sort_values(by=['year','distCenterCity'])
wac_ct_sum['densJob'] = (wac_ct_sum['C000'] / wac_ct_sum['geometry'].area) * 2.788e+7
wac_ct_sum['densPop'] = (wac_ct_sum['B01001_001E'] / wac_ct_sum['geometry'].area) * 2.788e+7
hv.save(pd.DataFrame(wac_ct_sum).hvplot.scatter(title='Distance-density plot, Philadelphia region census tracts',
widget_location='top',x='distCenterCity',y=['densJob','densPop'],groupby='year'),"./charts/1_dens_distance.html",embed=True)
The same phenomenon can be found when plotting cumulative count of population and jobs by distance, where job count overtakes population count nearest Center City for up to 5000 feet; then the population rise overtakes and rises slightly faster than jobs, suggesting that population density still dominates over job density in the suburbs.
wac_ct_sum['cumulPop'] = wac_ct_sum.groupby(['year'])['B01001_001E'].cumsum()
wac_ct_sum['cumulJob'] = wac_ct_sum.groupby(['year'])['C000'].cumsum()
cum_distance = pd.DataFrame(wac_ct_sum).hvplot.scatter(title='Distance-cumulative total plot, Philadelphia region census tracts',
widget_location='top',x='distCenterCity',y=['cumulPop','cumulJob'],groupby='year')
cum_distance.save("./charts/1_cum_distance.html",embed=True)