Metadata and Queries¶

This notebok covers accessing and using metadata (data about the data) and basic queries useful when reading geoJSON files from open data sources.

The sodapy module is used to explore metatdata from Socrata based open data sources. You need to install Sodapy using pip in a terminal window.

SODAPY has a method to get_metadata which returns dictionary containing information summarizing the dataset. If you are working with Socrata datasets, it is best to request an application token.

You must create a user name and password login before requesting an APP TOKEN.

With respect to queries, two appraoches are demonstrated, namely using the SODAPY get method and using the geopandas read_file method. Both appraoches use a string containing the url of the SODA endpoint. The query string structure is a bit different and the SODA get method returns a dictionary that has to be converted to a GeoDataFrame.

# import libraries

import sodapy
from sodapy import Socrata

import pandas as pd
import geopandas as gpd
import numpy as np
import os

import matplotlib.pyplot as plt
# draw maps in the notebook
%matplotlib inline

Get polygons that represent neighborhoods in Buffalo, NY.¶

Find the dataset identifier from the Buffalo Open data website.
Copy from view data window - EXPORT/SODA API - or home page API button.
- Neighborhoods in Buffalo - https://data.buffalony.gov/resource/g7bi-nz8b.json.
- This is a polygon dataset with the boundaries of Buffalo's neighborhoods as defined by the city planning department.

First look at the metadata.
- In this example the field names for each column in the dataset are displayed

I have stored my SODAPY App Token in my environment variables. In Windows, edit your environment variables. I do not know how to do this on a mac!

If you do not want to modify the environment, you can include your app token as a string but everyone who sees your notebook will see your app token.

socrata_domain = "data.buffalony.gov/" 
socrata_dataset_identifier = "g7bi-nz8b"
app_token = os.environ.get("SODAPY_APPTOKEN") # this is in my environment path

client = Socrata(socrata_domain, app_token)
metadata = client.get_metadata(socrata_dataset_identifier)

[x['fieldName'] for x in metadata['columns']]# show field names in dataset

['the_geom',
 'objectid',
 'shape_starea',
 'shape_stlength',
 'nbhdname',
 'placename',
 'objectid_1',
 'nbhdnum',
 'calcacres',
 'sqmiles',
 'shape_leng',
 'sde_vector_gis_osp_neighborhoods_2018_area']

View metadata for specific fields¶

You can view the metadata for each field to find out

the data type,
how many null values there are, and
get a frequency distribution of the top values if relevant.

meta_geom = [x for x in metadata['columns'] if x['fieldName'] == 'the_geom'][0]
meta_geom

{'id': 411985490,
 'name': 'the_geom',
 'dataTypeName': 'multipolygon',
 'fieldName': 'the_geom',
 'position': 1,
 'renderTypeName': 'multipolygon',
 'tableColumnId': 53228908,
 'format': {}}

meta_name = [x for x in metadata['columns'] if x['fieldName'] == 'nbhdname'][0]
meta_name

{'id': 411985669,
 'name': 'NbhdName',
 'dataTypeName': 'text',
 'fieldName': 'nbhdname',
 'position': 12,
 'renderTypeName': 'text',
 'tableColumnId': 82055654,
 'cachedContents': {'largest': 'West Side',
  'non_null': '35',
  'null': '0',
  'top': [{'item': 'Allentown', 'count': '1'},
   {'item': 'Seneca Babcock', 'count': '1'},
   {'item': 'Lovejoy', 'count': '1'},
   {'item': 'Fruit Belt', 'count': '1'},
   {'item': 'Delavan Grider', 'count': '1'},
   {'item': 'Kensington-Bailey', 'count': '1'},
   {'item': 'Genesee-Moselle', 'count': '1'},
   {'item': 'Grant-Amherst', 'count': '1'},
   {'item': 'Ellicott', 'count': '1'},
   {'item': 'Central', 'count': '1'},
   {'item': 'West Side', 'count': '1'},
   {'item': 'Black Rock', 'count': '1'},
   {'item': 'Lower West Side', 'count': '1'},
   {'item': 'Fillmore-Leroy', 'count': '1'},
   {'item': 'Kaisertown', 'count': '1'},
   {'item': 'Broadway Fillmore', 'count': '1'},
   {'item': 'Riverside', 'count': '1'},
   {'item': 'Central Park', 'count': '1'},
   {'item': 'University Heights', 'count': '1'},
   {'item': 'MLK Park', 'count': '1'}],
  'not_null': '35',
  'smallest': 'Allentown',
  'cardinality': '35'},
 'format': {}}

Use SODAPY GET¶

To read the entire data set
To read one neighborhood

You must include content_type ='geojson'. The result is a dictionary.

allNeigh = client.get(socrata_dataset_identifier,content_type='geojson')
#allNeigh

ElmBid = client.get(socrata_dataset_identifier,content_type='geojson',nbhdname = 'Elmwood Bidwell')
#ElmBid

Convert dictionary to geoDataFrame from features¶

nbhood=gpd.GeoDataFrame.from_features(allNeigh)
nbhood.plot();

elmBid=gpd.GeoDataFrame.from_features(ElmBid)
elmBid.plot();

Alternatively¶

Use geopandas.read_file with the url to get the geojson version of the file.

To only read a single neightborhood, add ?$where=nbhdname=\%27Elmwood\%20Bidwell\%27 to the url

url1="https://data.buffalony.gov/resource/g7bi-nz8b.geojson"

#If no SODA API 
# url = "https://data.buffalony.gov/api/geospatial/q9bk-zu3p?method=export&format=GeoJSON"

gdf = gpd.read_file(url1)

Read only 1 neighborhood¶

In a where clause you have to specify the url encode values. %27= ' and %20 = space

url2="https://data.buffalony.gov/resource/g7bi-nz8b.GeoJSON?$where=nbhdname=%27Elmwood%20Bidwell%27"
ngdf = gpd.read_file(url2)

gdf.plot();

ngdf.plot();

If the coordinate reference system is not projected, change to EPSG:3857 and plot the file using matplotlib. Colors are based on the neighborhood name field.

gdf.shape

(35, 12)

gdf.geometry.name #renamed the_geom to geometry

'geometry'

gdf.crs

{'init': 'epsg:4326'}

EPSG 4326 is WGS 84, units = degrees (unprojected)

Reproject to EPSG 3857 is Web Mercator Auxiliary Sphere, WGS 84, units = meters¶

gdf=gdf.to_crs('epsg:3857')
gdf.plot(column='nbhdname', figsize=(9,10));

C:\Users\mixwa\AppData\Local\Continuum\anaconda3\envs\GEG584\lib\site-packages\pyproj\crs.py:77: FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method.
  return _prepare_from_string(" ".join(pjargs))

Create a new dataframe and with a new field containing sq. kilometers.¶

The data owner provided a field with sq. miles and acres calculated. We could use either (or both) of these for a choropleth map. But, since the new units of the projected data is meters, we can calculate sq. kilometers as a third measure of size for a choropleth map.

neighborhoods = gdf[['nbhdname', 'geometry', 'sqmiles']].copy()
neighborhoods['sqkilo']=neighborhoods.geometry.area/1000**2 # in sq kilometers
neighborhoods.nbhdname+ ' Area = '+neighborhoods.sqkilo.astype(str)

0        Genesee-Moselle Area = 7.147232792295891
1             Allentown Area = 1.6107571697020384
2            West Hertel Area = 4.106325934356094
3               Central Area = 11.110428479065419
4          Elmwood Bryant Area = 4.10109639144569
5         Fillmore-Leroy Area = 4.350064316196875
6         Delavan Grider Area = 4.942666025512347
7                Lovejoy Area = 8.721404053616778
8            Black Rock Area = 2.9336975627739625
9         Hopkins-Tifft Area = 19.980052929825007
10      Lower West Side Area = 2.7154485587566897
11         Schiller Park Area = 4.575175753750093
12     Kensington-Bailey Area = 5.677638786388734
13             Ellicott Area = 4.2190936254504985
14             Kenfield Area = 3.3950988968459432
15           Masten Park Area = 5.162537133033123
16         Grant-Amherst Area = 2.787756353782923
17             Riverside Area = 4.616184288692757
18      Seneca-Cazenovia Area = 5.414997115929739
19           First Ward Area = 3.6378655941913296
20        Seneca Babcock Area = 6.702458672078505
21         Pratt-Willert Area = 3.232096716586115
22     Broadway Fillmore Area = 10.86469445208743
23         Central Park Area = 3.9076379124912095
24             North Park Area = 8.74376267797996
25       Upper West Side Area = 4.087098972018056
26         Elmwood Bidwell Area = 8.4497205532233
27               Parkside Area = 8.15821059108441
28             West Side Area = 4.291040097237276
29    University Heights Area = 6.461279625343363
30          Hamlin Park Area = 2.2862697700775425
31            Fruit Belt Area = 3.284535733621377
32             South Park Area = 8.02689631490958
33            Kaisertown Area = 4.246150798109207
34              MLK Park Area = 3.318255473442588
dtype: object

Get open data on pot hole service requests¶

The City of Buffalo has an open data set on 311 request for service calls. One type of call they receive from the public is for pot hole repair. On Feb 16, WIVB reported on the city's pot hole blitz (https://www.wivb.com/news/local-news/city-of-buffalo-goes-on-saturday-pothole-blitz/1788423791). The story indicted where the crews were dispatched to fix the pot holes.

The blitz included:

2 crews on Main Street between Humboldt and Hertel
1 crew on South Park Avenue between Bailey and Dorrance
1 crew the Hamlin Park neighborhood
1 crew the Broadway Fillmore area
4 other crews across the city on various pothole requests that were logged into 311.

The story went on to indicate:

"The City of Buffalo said that 311 calls are not as high as they were in 2018. In February 2018, the city got 1,100 calls for pot hole repair through 311. In the two weeks of February 2019, the city said there have been 400 calls."

We will pull data for the first two weeks in Feb 2019 and map that data to the Buffalo Neighborhoods. The SODA endpoint is:

311 Service Requests https://data.buffalony.gov/resource/whkc-e5vr.geojson

Two methods are shown to query the data:

using SODAPY
using geopandas read_file

There are three commonly used parameters in a query:

\$limit - to increase the number of records beyond the default of 1,000
\$where - read specific rows (records)
\$select - read specific columns (fields)

See the API Docs (Developers) for a full list of parameters (SQLQueries) https://dev.socrata.com/docs/queries/

Read data into a geodataframe since there is geometry in the dataset. The query will include

type contains 'Pot Hole" and open_date between 2/1/2019 and 2/15/2019.

socrata_dataset_identifier = "whkc-e5vr"
metadata = client.get_metadata(socrata_dataset_identifier)
[x['fieldName'] for x in metadata['columns']]# show field names in dataset

['case_reference',
 'open_date',
 'closed_date',
 'status',
 'subject',
 'reason',
 'type',
 'object_type',
 'address_number',
 'address_line_1',
 'address_line_2',
 'city',
 'state',
 'zip_code',
 'property_id',
 'location',
 'latitude',
 'longitude',
 'council_district',
 'police_district',
 'census_tract',
 'census_block_group',
 'census_block',
 'neighborhood',
 'x_coordinate',
 'y_coordinate',
 ':@computed_region_tmcg_v66k',
 ':@computed_region_xbxg_7ifr',
 ':@computed_region_eziv_p4ck',
 ':@computed_region_fk4y_hpmh',
 ':@computed_region_kwzn_pe6v',
 ':@computed_region_uh5x_q5mi',
 ':@computed_region_dwzh_dtk5',
 ':@computed_region_k9un_h6vm']

ph_2019 = client.get(socrata_dataset_identifier,
                     content_type='geojson',
                     where ="(open_date between '2019-02-01T00:00:00' AND '2019-02-15T00:00:00') AND (type like 'Pot Hole%')")

phgdf = gpd.GeoDataFrame.from_features(ph_2019)
phgdf.shape

(336, 30)

Geopandas URL¶

is a bit more complicated than the sodapy one since there is no client to convert the url to the proper codes. The only new code in this url is %25 which is the code for % which is the wild card character for searching substrings, Basically, this will read any row between the appropriate dates and whre the type field contains "Pot Hole%". The string must start with the Pot and Hole can be followed by any number of characters.

url3='https://data.buffalony.gov/resource/p7e8-krif.geojson?\
$where=open_date%20between%20%272019-02-01T00:00:00%27%20and%20%272019-02-15T00:00:00%27\
%20and%20type%20like%20%27Pot%20Hole%25%27'

phgdf2 =  gpd.read_file(url3)
phgdf2.shape

(336, 30)

Just for fun, read the 2020 calls for Pot Hole repairs during the first two weeks of February.

url4='https://data.buffalony.gov/resource/p7e8-krif.geojson?$limit=1000000&\
$where=open_date%20between%20%272020-02-01T00:00:00%27%20and%20%272020-02-15T00:00:00%27\
%20and%20type%20like%20%27Pot%20Hole%25%27'

phgdf3 =  gpd.read_file(url4)
phgdf3.shape

(320, 30)

phgdf2.tail()

phgdf2['council_district'] = phgdf2.council_district.fillna("UNKNOWN")
phgdf2=phgdf2.loc[phgdf2.geometry.notnull()]
phgdf2=phgdf2.to_crs('epsg:3857')

C:\Users\mixwa\AppData\Local\Continuum\anaconda3\envs\GEG584\lib\site-packages\pyproj\crs.py:77: FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method.
  return _prepare_from_string(" ".join(pjargs))

phgdf2.plot(column='council_district',figsize=(9,10),legend=True);

It appears that the UNKNOWN Council District should actually = NORTH. Change UNKNOWN to NORTH.

phgdf2['council_district'].replace({"UNKNOWN":"NORTH"}, inplace=True)
phgdf2.plot(column='council_district',figsize=(9,10),legend=True);

Another query example.¶

Find all pot hole complaints within a mile of my house.

Need the coordinates of my house (lat/lon) which is (y,x)
Converision of a mile to meters
- within_circle(location,42.915431,-78.877030,1609)
Instead of reading all columns, use \$select to keep a few of them
Extract the year and month from open_date
- date_extract_y(open_date)%20as%20year,date_extract_m(open_date)%20as%20month

url="https://data.buffalony.gov/resource/p7e8-krif.geojson?$limit=10000&\
$select=date_extract_y(open_date)%20as%20year,date_extract_m(open_date)%20as%20month,subject,reason,type,location&\
$where=within_circle(location,42.915431,-78.877030,1609)\
%20and%20type%20like%20%27Pot%20Hole%25%27"
phnear = gpd.read_file(url)
                         
phnear.tail()

phnear=phnear.to_crs('epsg:3857')
phnear.plot(column='year',figsize=(9,9), legend=True);

C:\Users\mixwa\AppData\Local\Continuum\anaconda3\envs\GEG584\lib\site-packages\pyproj\crs.py:77: FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method.
  return _prepare_from_string(" ".join(pjargs))

phsummary=phnear.groupby(['year','month']).agg({'month':'count'})
phsummary.columns = ['Count_Mth']
phsummary=phsummary.reset_index()
phsummary

phsummary.loc[phsummary['Count_Mth'].idxmax()]

year         2018
month           2
Count_Mth     143
Name: 116, dtype: object

Assignment: New York City 311 data analysis¶

Borough Boundaries (water areas excluded) https://data.cityofnewyork.us/resource/7t3b-ywvw.geojson

311 Service Requests from 2010 to present https://data.cityofnewyork.us/resource/erm2-nwe9.geojson

where:

created_date between 1/1/2018 and 3/1/2020
descriptor contains Loud Music select:
keep year and month of created_date
keep - city, borough, location, open_data_channel_type, complaint_type, descriptor

Answer the following questions:

Which Borough had the most noise complaints during this time period?
During what month are noise complaints most frequent?
How many compaints by borough are from residential propoerties versus commercial?
What channel type is most frequently used, by borrough.

Map all commercial complaints (use the borough boundaries as the back drop behind the points.)

Let's make prettier maps. Below is an example that leads to adding a basemap, polygons, and points to the same map plot!

ax = gdf.plot(figsize=(10, 10), alpha=0.5, edgecolor='k') # plots polygons

# conda install -c conda-forge contextily
import contextily as ctx

ax = gdf.plot(figsize=(10, 10), alpha=0.5, edgecolor='k')
ctx.add_basemap(ax) # add default basemap to plot

ax = gdf.plot(figsize=(10, 10), alpha=0.5, edgecolor='k')
ctx.add_basemap(ax, url=ctx.providers.Stamen.TonerLite) #change basemap from ctx.providers

nei = gdf.plot(figsize=(10, 10), alpha=0.5, edgecolor='k')
ctx.add_basemap(nei, url=ctx.providers.Stamen.TonerLite)
phgdf2.plot(ax=nei, marker='o', color='red', markersize=5);#add points to plot

	location_state	zip_code	city	location_zip	police_district	x_coordinate	subject	latitude	state	neighborhood	...	census_block_group	y_coordinate	census_tract	open_date	address_number	type	address_line_1	closed_date	council_district	geometry
331	None	None	Buffalo	None	None	-78.84987	Dept of Public Works	None	NY	UNKNOWN	...	None	42.94737	None	2019-02-14T17:01:00	Unknown	Pot Hole (Req_Serv)	Unknown	2019-02-15T08:13:00	None	None
332	None	14216	Buffalo	None	District D	1069359.8	Dept of Public Works	42.947746958340794	NY	UNKNOWN	...	3	1074169.6	56	2019-02-14T17:33:00	INTERSECTION	Pot Hole (Req_Serv)	Hertel Ave	2019-02-15T15:04:00	NORTH	POINT (-78.87822 42.94775)
333	None	14214	Buffalo	None	District D	-8776997.371	Dept of Public Works	42.94394955613908	NY	UNKNOWN	...	2	5303495.9275	48	2019-02-14T23:55:00	124	Pot Hole (Req_Serv)	DEPEW AVE	2019-02-23T08:06:00	DELAWARE	POINT (-78.84489 42.94395)
334	None	14214	Buffalo	None	District D	-8777371.516	Dept of Public Works	42.94397977607642	NY	UNKNOWN	...	2	5303499.9118	48	2019-02-14T23:58:00	34	Pot Hole (Req_Serv)	DEPEW AVE	2019-02-15T15:06:00	DELAWARE	POINT (-78.84849 42.94398)
335	None	14214	Buffalo	None	District D	1076936.6	Dept of Public Works	42.944067963326404	NY	UNKNOWN	...	1	1072803.9	52.01	2019-02-14T23:59:00	INTERSECTION	Pot Hole (Req_Serv)	Linden Ave	2019-02-19T14:44:00	DELAWARE	POINT (-78.84983 42.94407)

	subject	year	reason	type	month	geometry
3327	Dept of Public Works	2018	Engineering - Street Repairs	Pot Hole (Req_Serv)	5	POINT (-78.88992 42.90466)
3328	Dept of Public Works	2019	Engineering - Street Repairs	Pot Hole (Req_Serv)	4	POINT (-78.88992 42.90466)
3329	Dept of Public Works	2019	Engineering - Street Repairs	Pot Hole (Req_Serv)	3	POINT (-78.88584 42.90545)
3330	Dept of Public Works	2018	Engineering - Street Repairs	Pot Hole (Req_Serv)	2	POINT (-78.89455 42.90936)
3331	Dept of Public Works	2018	Engineering - Street Repairs	Pot Hole (Req_Serv)	4	POINT (-78.89455 42.90936)

	year	month	Count_Mth
0	2008	10	1
1	2008	11	4
2	2008	12	4
3	2008	7	2
4	2008	8	5
...	...	...	...
134	2019	8	35
135	2019	9	18
136	2020	1	58
137	2020	2	43
138	2020	3	6