Part 3: Origin-Destination Analysis
The LODES origin-destination dataset is the heart of the entire database structure. Each row pairs two Census blocks, the home and work locations of a group of workers, with job counts broken down by age group, earnings level, and industry sector. We can build a few insightful visualizations on top of this.
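For reference, the coded columns follow the LODES technical documentation; the rough meanings are paraphrased below (consult the official documentation for the exact definitions).
# Paraphrased LODES OD column meanings (per the LODES technical documentation)
od_columns = {
    "w_geocode": "workplace Census block code",
    "h_geocode": "home Census block code",
    "S000": "total number of jobs",
    "SA01": "jobs of workers age 29 or younger",
    "SA02": "jobs of workers age 30 to 54",
    "SA03": "jobs of workers age 55 or older",
    "SE01": "jobs with earnings of $1,250/month or less",
    "SE02": "jobs with earnings of $1,251 to $3,333/month",
    "SE03": "jobs with earnings above $3,333/month",
    "SI01": "jobs in Goods Producing industry sectors",
    "SI02": "jobs in Trade, Transportation, and Utilities sectors",
    "SI03": "jobs in All Other Services sectors",
}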
As usual, we pull the OD data directly from the Census LEHD server. With all 11 years combined, the dataset is over 25 million rows, large enough to be nearly unmanageable, so for the dot map and the network map we will only consider 2019.
# Pull the LODES7 PA OD "main" files (home and work both in PA, all job types) for each year and tag with the year
years = ['2009','2010','2011','2012','2013','2014','2015','2016','2017','2018','2019']
url = "https://lehd.ces.census.gov/data/lodes/LODES7/pa/od/pa_od_main_JT00_"
od_all = dd.concat([pd.read_csv(gzip.open(BytesIO(urlopen(url+year+".csv.gz").read()))).assign(year=int(year)) for year in years])
# Keep only 2019 (materialized in memory) for the block-level maps
od = od_all[od_all['year'] == 2019].compute()
od
  | w_geocode | h_geocode | S000 | SA01 | SA02 | SA03 | SE01 | SE02 | SE03 | SI01 | SI02 | SI03 | createdate | year
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 420010301011003 | 420010303002134 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 20211018 | 2019 |
1 | 420010301011003 | 420210137001010 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 20211018 | 2019 |
2 | 420010301011012 | 420010303002079 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 20211018 | 2019 |
3 | 420010301011012 | 420550120003013 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 20211018 | 2019 |
4 | 420010301011012 | 420550120006025 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 20211018 | 2019 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5128502 | 421330240022035 | 421330240013021 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 20211018 | 2019 |
5128503 | 421330240022035 | 421330240021000 | 2 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 2 | 20211018 | 2019 |
5128504 | 421330240022035 | 421330240021004 | 2 | 0 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 2 | 20211018 | 2019 |
5128505 | 421330240022035 | 421330240021072 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 20211018 | 2019 |
5128506 | 421330240022035 | 421330240022034 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 20211018 | 2019 |
5128507 rows × 14 columns
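Since the full pull is slow, it can be worth caching the combined frame locally before moving on; a minimal sketch, with an illustrative Parquet path:
# Optional cache: write the combined OD data to Parquet once, then reload from disk
# ("data/pa_od_main_JT00.parquet" is an illustrative path, not part of the original workflow)
od_all.to_parquet("data/pa_od_main_JT00.parquet")
od_all = dd.read_parquet("data/pa_od_main_JT00.parquet")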
Next we aggregate each block to its block group (the first 12 digits of the GEOID) and merge with block-group centroids derived from the TIGER shapefile for the five-county region.
# Truncate 15-digit block GEOIDs to 12 digits to get the block group
od['w_geocode'] = od['w_geocode'].astype('str').str[:12]
od['h_geocode'] = od['h_geocode'].astype('str').str[:12]
# Unique id for each O-D row, used later to pair origins with destinations
od["id"] = od.index + 1
# Block groups for the five PA counties (Bucks, Chester, Delaware, Montgomery, Philadelphia),
# reduced to centroids in a projected CRS before converting back to lat/lon
phl_bg = gpd.read_file("bg/tl_2019_42_bg/tl_2019_42_bg.shp").query("COUNTYFP in ['017','029','045','091','101']").to_crs(2272)
phl_bg = phl_bg[['GEOID','geometry']].assign(geometry=phl_bg.geometry.centroid)
phl_bg = phl_bg.to_crs(4326)
# Attach work (geo_w) and home (geo_h) centroids to each O-D row
od_geo = od.merge(phl_bg.rename(columns={"geometry":"geo_w"}), left_on='w_geocode', right_on='GEOID', how='inner')
od_geo = od_geo.merge(phl_bg.rename(columns={"geometry":"geo_h"}), left_on='h_geocode', right_on='GEOID', how='inner')
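Note that the step above only recodes each row to its block group; the rows themselves stay at the block level, which is all the dot map needs. If true block-group totals were wanted instead, a minimal roll-up would look something like this sketch:
# Hypothetical roll-up: total jobs for each home/work block-group pair
od_bg = od.groupby(['h_geocode', 'w_geocode'], as_index=False)['S000'].sum()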
Viz 1: Dot map of each worker residence and job
First, we produce a datashaded dot map of every worker residence and job location, similar to the hexbin map in Part 1. To do so, it is best to convert the data to a spatialpandas GeoDataFrame, which is developed by the HoloViz team to work well with hvPlot and Datashader. The result appears more complete and discernible than the hexmap.
import spatialpandas as spd
# Separate frames for work locations (geo_w) and home locations (geo_h)
od_ds_work = spd.GeoDataFrame(od_geo.drop(columns='geo_h'))
od_ds_home = spd.GeoDataFrame(od_geo.drop(columns='geo_w'))
combine = (
od_ds_work.hvplot(c='S000', tiles='StamenTonerRetina', datashade=True, project=True, frame_height=600, frame_width=600, xaxis=None,yaxis=None,cmap='hot')
+
od_ds_home.hvplot(c='S000', tiles='StamenTonerRetina', datashade=True, project=True, frame_height=600, frame_width=600, xaxis=None,yaxis=None)
)
hv.save(combine,"./charts/3_datashade.html", backend='bokeh')
Viz 2: Employment efficiency and growing spatial mismatch over time
Adding the temporal dimension to the O-D analysis helps show how people’s commutes have changed over time. Because analysis of the full 25-million-row dataset is infeasible, we sample the O-D data down to a tenth of the total.
Next we classify each O-D pair into one of four categories based on whether the home and the workplace each lie in Philadelphia or in the surrounding counties: an in-city commute, an in-suburb commute, a suburb-to-city commute, or a city-to-suburb commute. In Census parlance, this rate of cross-border commuting is called “employment efficiency”. We produce two plots below: the first shows the percent change in the count of each commute type relative to 2009, while the second is a stacked bar chart of the absolute counts.
The last type (city-to-suburb) is commonly known as the reverse commute, a product of the job sprawl and suburbanization trends of the last fifty years. The Philadelphia region has one of the highest rates of reverse commuting, nearing forty percent. Nonetheless, the bar plot shows that suburb-to-suburb commutes still clearly outnumber any commute type involving the city, and the strong percentage growth of the reverse commute seen in the first chart reflects the continued suburban dominance of job locations.
The consequences of this trend are severe. The regional transit network, Regional Rail in particular, is not well optimized for the reverse commute as opposed to the traditional peak direction, and offices have not been keen to locate near rail stations, opting instead for business parks along highways. Numerous policies could be modified or put in place to incentivize offices to relocate closer to stations or, better yet, move back into the city.
# Sample a tenth of all years' OD pairs; only workplace, home, year, and total jobs are needed
od_all_samp = od_all[['w_geocode','h_geocode','year','S000']].sample(frac=0.1).compute()
# County FIPS is digits 3-5 of the geocode; '101' is Philadelphia County
cond = [np.logical_and(od_all_samp['h_geocode'].astype('str').str[2:5] == '101',od_all_samp['w_geocode'].astype('str').str[2:5] == '101'),
        np.logical_and(od_all_samp['h_geocode'].astype('str').str[2:5] == '101',od_all_samp['w_geocode'].astype('str').str[2:5] != '101'),
        np.logical_and(od_all_samp['h_geocode'].astype('str').str[2:5] != '101',od_all_samp['w_geocode'].astype('str').str[2:5] == '101'),
        np.logical_and(od_all_samp['h_geocode'].astype('str').str[2:5] != '101',od_all_samp['w_geocode'].astype('str').str[2:5] != '101')]
choice = ['in Phila','city-to-suburb commute','suburb-to-city commute','in suburb']
od_all_samp['emp_eff'] = np.select(cond,choice)
# Total sampled jobs by year and commute type
od_eff = od_all_samp.groupby(['year','emp_eff'])['S000'].sum().reset_index()
# Percent change of each commute type relative to its first year (2009)
od_eff['pctchg'] = od_eff.sort_values(['emp_eff','year']).groupby(['emp_eff'])['S000'].apply(lambda x: x.div(x.iloc[0]).subtract(1).mul(100))
hv.save(hv.HLine(0).opts(hv.opts.HLine(line_dash='dotted')) * od_eff.hvplot.line(x='year',y='pctchg',by='emp_eff').opts(legend_position='top'),
        "./charts/3_emp_eff.html",backend='bokeh')
hv.save(od_all_samp.groupby(['year','emp_eff'])['S000'].sum().hvplot.bar(stacked=True, legend='top_left'),
        "./charts/3_emp_eff_pct.html",backend='bokeh')
Viz 3: Origin-destination network flow graph
Now we get to the meat of O-D analysis: the origin-destination network flow graph, which aggregates the lines from origin (home) to destination (work) into nodes and edges (or flows). The library used here is MovingPandas, a newer library for spatio-temporal trajectories such as flight paths and bird migrations. Since the O-D data unfortunately carries no real time information, however, we will forgo the temporal dimension.
First we must melt the data so that each origin and destination block group is on a separate row. Then we assign a mock time to each origin and destination just so MovingPandas can process the data.
valuevar = ['geo_w','geo_h']
# Reshape to long format: one row per origin point and one per destination point, keyed by id
od_long = od_geo.melt(value_vars=valuevar,id_vars=od.columns.difference(valuevar),var_name='location',value_name="geometry")
# Mock timestamps: homes "depart" at 6am, workplaces are "reached" at 8am
od_long['t'] = np.where(od_long['location'] == 'geo_h',pd.to_datetime('2019-01-01 06:00:00'),pd.to_datetime('2019-01-01 08:00:00'))
od_long = od_long.set_index('t')
od_long.sort_values('id')
t | S000 | SA01 | SA02 | SA03 | SE01 | SE02 | SE03 | SI01 | SI02 | SI03 | createdate | h_geocode | id | w_geocode | year | location | geometry
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2019-01-01 08:00:00 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 20211018 | 420171001022 | 989830 | 420171001021 | 2019 | geo_w | POINT (-74.92836437904931 40.07687511288584) |
2019-01-01 06:00:00 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 20211018 | 420171001022 | 989830 | 420171001021 | 2019 | geo_h | POINT (-74.96474358791602 40.0606801695214) |
2019-01-01 08:00:00 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 20211018 | 420171001022 | 989831 | 420171001021 | 2019 | geo_w | POINT (-74.92836437904931 40.07687511288584) |
2019-01-01 06:00:00 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 20211018 | 420171001022 | 989831 | 420171001021 | 2019 | geo_h | POINT (-74.96474358791602 40.0606801695214) |
2019-01-01 08:00:00 | 2 | 2 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 2 | 20211018 | 420171001022 | 989832 | 420171001021 | 2019 | geo_w | POINT (-74.92836437904931 40.07687511288584) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2019-01-01 08:00:00 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 20211018 | 421010390008 | 4579656 | 421019891001 | 2019 | geo_w | POINT (-75.0057899620279 40.03434987047037) |
2019-01-01 06:00:00 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 20211018 | 421019802001 | 4579657 | 421019891001 | 2019 | geo_h | POINT (-75.04900990952665 40.06974013053726) |
2019-01-01 08:00:00 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 20211018 | 421019802001 | 4579657 | 421019891001 | 2019 | geo_w | POINT (-75.0057899620279 40.03434987047037) |
2019-01-01 08:00:00 | 2 | 0 | 1 | 1 | 0 | 0 | 2 | 0 | 0 | 2 | 20211018 | 421019802001 | 4579658 | 421019891001 | 2019 | geo_w | POINT (-75.0057899620279 40.03434987047037) |
2019-01-01 06:00:00 | 2 | 0 | 1 | 1 | 0 | 0 | 2 | 0 | 0 | 2 | 20211018 | 421019802001 | 4579658 | 421019891001 | 2019 | geo_h | POINT (-75.04900990952665 40.06974013053726) |
3012858 rows × 17 columns
We sample 2,000 O-D pairs to save time, which is fine since we are only concerned with relative flows, and convert the result to a GeoDataFrame with point geometry.
# The trajectory aggregation function is very slow, so the OD data will be sampled for 2000 pairs
sample_od = np.random.choice(od_long['id'].unique(),2000)
od_geo_sample = gpd.GeoDataFrame(od_long[od_long['id'].isin(sample_od)],geometry='geometry',crs='EPSG:4326')
od_geo_sample.sort_values(['id','location'])
t | S000 | SA01 | SA02 | SA03 | SE01 | SE02 | SE03 | SI01 | SI02 | SI03 | createdate | h_geocode | id | w_geocode | year | location | geometry
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2019-01-01 06:00:00 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 20211018 | 420454041012 | 990434 | 420171001021 | 2019 | geo_h | POINT (-75.34330 39.88576) |
2019-01-01 08:00:00 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 20211018 | 420454041012 | 990434 | 420171001021 | 2019 | geo_w | POINT (-74.92836 40.07688) |
2019-01-01 06:00:00 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 20211018 | 420912001052 | 990507 | 420171001021 | 2019 | geo_h | POINT (-75.06218 40.11149) |
2019-01-01 08:00:00 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 20211018 | 420912001052 | 990507 | 420171001021 | 2019 | geo_w | POINT (-74.92836 40.07688) |
2019-01-01 06:00:00 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 20211018 | 421010348012 | 992342 | 420171001021 | 2019 | geo_h | POINT (-75.02046 40.05865) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2019-01-01 08:00:00 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 20211018 | 420171014011 | 4576670 | 421019891001 | 2019 | geo_w | POINT (-75.00579 40.03435) |
2019-01-01 06:00:00 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 20211018 | 421010333003 | 4579116 | 421019891001 | 2019 | geo_h | POINT (-75.04795 40.05165) |
2019-01-01 08:00:00 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 20211018 | 421010333003 | 4579116 | 421019891001 | 2019 | geo_w | POINT (-75.00579 40.03435) |
2019-01-01 06:00:00 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 20211018 | 421010356021 | 4579415 | 421019891001 | 2019 | geo_h | POINT (-75.04733 40.10327) |
2019-01-01 08:00:00 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 20211018 | 421010356021 | 4579415 | 421019891001 | 2019 | geo_w | POINT (-75.00579 40.03435) |
4000 rows × 17 columns
Next we call MovingPandas to build the initial TrajectoryCollection, an object holding all trajectories and their connecting line segments (each O-D pair here has only one).
# Build one trajectory per O-D pair, dropping any shorter than min_length
traj_collection = mpd.TrajectoryCollection(od_geo_sample, 'id', min_length=1000)
print("Finished creating {} trajectories".format(len(traj_collection)))
Finished creating 1946 trajectories
Next we use TrajectoryCollectionAggregator, which implements an algorithm developed by Andrienko and Andrienko (2011) to aggregate the trajectories into a weighted network. The crucial parameter here is max_distance, which should be set to a reasonable maximum commute length for the region; setting it too high will simply aggregate everything into a single giant O-D pair.
# Aggregate trajectories: cluster significant points into nodes and total the flows between them
aggregator = mpd.TrajectoryCollectionAggregator(traj_collection, max_distance=15000, min_distance=1000, min_stop_duration=dt.timedelta(minutes=30))
flows = aggregator.get_flows_gdf()
clusters = aggregator.get_clusters_gdf()
print("Finished creating {} flows".format(len(flows)) + " and {} clusters".format(len(clusters)))
Finished creating 342 flows and 29 clusters
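The flows GeoDataFrame holds line geometries with a weight column, and the clusters hold point geometries with a size column n. A quick look at the heaviest flows (a sketch, not part of the saved charts):
# Ten heaviest aggregated flows by weight
flows.sort_values('weight', ascending=False).head(10)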
Here you can see what a single O-D pair looks like.
hv.save(traj_collection.trajectories[64].hvplot(title="A single O-D trajectory",line_width=2, tiles='StamenTonerBackground', width=600, height=500),
"./charts/3_single_traj.html",backend='bokeh')
The results of the graph are striking. First, we can see the sheer concentrated density of commute flows into Center City, strongest from other Philadelphia neighborhoods. Accompanying this are several significant suburban nodes, most of which correspond to major job centers like Chesterbrook and Horsham. While each individual suburb-to-suburb flow is small to moderate, their combined weight across an everywhere-to-everywhere network, as opposed to Philadelphia’s largely radial one, is why suburb-to-suburb commutes outnumber every other type in the region.
# Drop flows with weight 1 to reduce clutter, and project to Web Mercator for plotting
flows = flows[flows['weight'] > 1].to_crs('EPSG:3857')
clusters = clusters.to_crs('EPSG:3857')
odnet = (flows.hvplot(title='Origin-destination network graph, Philadelphia and 5 PA counties',
geo=True,
crs='EPSG:3857',
hover_cols=['weight'],
line_width='weight',
alpha=0.5, color='#1f77b3', tiles='StamenTonerRetina',
frame_height=600, frame_width=800) *
clusters.hvplot(geo=True, crs='EPSG:3857', color='red', size='n')
)
hv.save(odnet,"./charts/3_orig_dest.html",backend="bokeh")
Lastly, here are two histograms showing the distribution of flows by weight and of nodes by size; this gives a general picture of how much certain job centers still dominate the region despite the trend toward more distributed job sprawl.
hv.Layout(flows.hvplot.hist(title='Distribution of flows by weight',y='weight') +
clusters.hvplot.hist(title='Distribution of nodes by size',y='n')).cols(1)
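Unlike the other visualizations, these histograms are not written out above; if they should be saved alongside the other charts, the same hv.save pattern works (the output path here is illustrative):
hv.save(hv.Layout(flows.hvplot.hist(title='Distribution of flows by weight',y='weight') +
                  clusters.hvplot.hist(title='Distribution of nodes by size',y='n')).cols(1),
        "./charts/3_flow_hist.html", backend='bokeh')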