Metadata and Queries

This notebok covers accessing and using metadata (data about the data) and basic queries useful when reading geoJSON files from open data sources.

The sodapy module is used to explore metatdata from Socrata based open data sources. You need to install Sodapy using pip in a terminal window.

Sodapy

SODAPY has a method to get_metadata which returns dictionary containing information summarizing the dataset. If you are working with Socrata datasets, it is best to request an application token.

SODA APP Token

You must create a user name and password login before requesting an APP TOKEN.

With respect to queries, two appraoches are demonstrated, namely using the SODAPY get method and using the geopandas read_file method. Both appraoches use a string containing the url of the SODA endpoint. The query string structure is a bit different and the SODA get method returns a dictionary that has to be converted to a GeoDataFrame.

In [1]:
# import libraries

import sodapy
from sodapy import Socrata

import pandas as pd
import geopandas as gpd
import numpy as np
import os

import matplotlib.pyplot as plt
# draw maps in the notebook
%matplotlib inline 

Get polygons that represent neighborhoods in Buffalo, NY.

  • Find the dataset identifier from the Buffalo Open data website.

  • Copy from view data window - EXPORT/SODA API - or home page API button.

  • First look at the metadata.

    • In this example the field names for each column in the dataset are displayed

I have stored my SODAPY App Token in my environment variables. In Windows, edit your environment variables. I do not know how to do this on a mac!

Edit Environment

If you do not want to modify the environment, you can include your app token as a string but everyone who sees your notebook will see your app token.

In [37]:
socrata_domain = "data.buffalony.gov/" 
socrata_dataset_identifier = "g7bi-nz8b"
app_token = os.environ.get("SODAPY_APPTOKEN") # this is in my environment path

client = Socrata(socrata_domain, app_token)
metadata = client.get_metadata(socrata_dataset_identifier)

[x['fieldName'] for x in metadata['columns']]# show field names in dataset
Out[37]:
['the_geom',
 'objectid',
 'shape_starea',
 'shape_stlength',
 'nbhdname',
 'placename',
 'objectid_1',
 'nbhdnum',
 'calcacres',
 'sqmiles',
 'shape_leng',
 'sde_vector_gis_osp_neighborhoods_2018_area']

View metadata for specific fields

You can view the metadata for each field to find out

  • the data type,
  • how many null values there are, and
  • get a frequency distribution of the top values if relevant.
In [3]:
meta_geom = [x for x in metadata['columns'] if x['fieldName'] == 'the_geom'][0]
meta_geom
Out[3]:
{'id': 411985490,
 'name': 'the_geom',
 'dataTypeName': 'multipolygon',
 'fieldName': 'the_geom',
 'position': 1,
 'renderTypeName': 'multipolygon',
 'tableColumnId': 53228908,
 'format': {}}
In [4]:
meta_name = [x for x in metadata['columns'] if x['fieldName'] == 'nbhdname'][0]
meta_name
Out[4]:
{'id': 411985669,
 'name': 'NbhdName',
 'dataTypeName': 'text',
 'fieldName': 'nbhdname',
 'position': 12,
 'renderTypeName': 'text',
 'tableColumnId': 82055654,
 'cachedContents': {'largest': 'West Side',
  'non_null': '35',
  'null': '0',
  'top': [{'item': 'Allentown', 'count': '1'},
   {'item': 'Seneca Babcock', 'count': '1'},
   {'item': 'Lovejoy', 'count': '1'},
   {'item': 'Fruit Belt', 'count': '1'},
   {'item': 'Delavan Grider', 'count': '1'},
   {'item': 'Kensington-Bailey', 'count': '1'},
   {'item': 'Genesee-Moselle', 'count': '1'},
   {'item': 'Grant-Amherst', 'count': '1'},
   {'item': 'Ellicott', 'count': '1'},
   {'item': 'Central', 'count': '1'},
   {'item': 'West Side', 'count': '1'},
   {'item': 'Black Rock', 'count': '1'},
   {'item': 'Lower West Side', 'count': '1'},
   {'item': 'Fillmore-Leroy', 'count': '1'},
   {'item': 'Kaisertown', 'count': '1'},
   {'item': 'Broadway Fillmore', 'count': '1'},
   {'item': 'Riverside', 'count': '1'},
   {'item': 'Central Park', 'count': '1'},
   {'item': 'University Heights', 'count': '1'},
   {'item': 'MLK Park', 'count': '1'}],
  'not_null': '35',
  'smallest': 'Allentown',
  'cardinality': '35'},
 'format': {}}

Use SODAPY GET

  • To read the entire data set
  • To read one neighborhood

You must include content_type ='geojson'. The result is a dictionary.

In [38]:
allNeigh = client.get(socrata_dataset_identifier,content_type='geojson')
#allNeigh
In [39]:
ElmBid = client.get(socrata_dataset_identifier,content_type='geojson',nbhdname = 'Elmwood Bidwell')
#ElmBid

Convert dictionary to geoDataFrame from features

In [7]:
nbhood=gpd.GeoDataFrame.from_features(allNeigh)
nbhood.plot();
In [8]:
elmBid=gpd.GeoDataFrame.from_features(ElmBid)
elmBid.plot();

Alternatively

Use geopandas.read_file with the url to get the geojson version of the file.

To only read a single neightborhood, add ?$where=nbhdname=\%27Elmwood\%20Bidwell\%27 to the url

In [9]:
url1="https://data.buffalony.gov/resource/g7bi-nz8b.geojson"

#If no SODA API 
# url = "https://data.buffalony.gov/api/geospatial/q9bk-zu3p?method=export&format=GeoJSON"

gdf = gpd.read_file(url1)

Read only 1 neighborhood

In a where clause you have to specify the url encode values. %27= ' and %20 = space

In [10]:
url2="https://data.buffalony.gov/resource/g7bi-nz8b.GeoJSON?$where=nbhdname=%27Elmwood%20Bidwell%27"
ngdf = gpd.read_file(url2)
In [11]:
gdf.plot();
In [12]:
ngdf.plot();

If the coordinate reference system is not projected, change to EPSG:3857 and plot the file using matplotlib. Colors are based on the neighborhood name field.

In [13]:
gdf.shape
Out[13]:
(35, 12)
In [14]:
gdf.geometry.name #renamed the_geom to geometry
Out[14]:
'geometry'
In [15]:
gdf.crs
Out[15]:
{'init': 'epsg:4326'}

EPSG 4326 is WGS 84, units = degrees (unprojected)

Reproject to EPSG 3857 is Web Mercator Auxiliary Sphere, WGS 84, units = meters

In [16]:
gdf=gdf.to_crs('epsg:3857')
gdf.plot(column='nbhdname', figsize=(9,10));
C:\Users\mixwa\AppData\Local\Continuum\anaconda3\envs\GEG584\lib\site-packages\pyproj\crs.py:77: FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method.
  return _prepare_from_string(" ".join(pjargs))

Create a new dataframe and with a new field containing sq. kilometers.

The data owner provided a field with sq. miles and acres calculated. We could use either (or both) of these for a choropleth map. But, since the new units of the projected data is meters, we can calculate sq. kilometers as a third measure of size for a choropleth map.

In [17]:
neighborhoods = gdf[['nbhdname', 'geometry', 'sqmiles']].copy()
neighborhoods['sqkilo']=neighborhoods.geometry.area/1000**2 # in sq kilometers
neighborhoods.nbhdname+ ' Area = '+neighborhoods.sqkilo.astype(str)
Out[17]:
0        Genesee-Moselle Area = 7.147232792295891
1             Allentown Area = 1.6107571697020384
2            West Hertel Area = 4.106325934356094
3               Central Area = 11.110428479065419
4          Elmwood Bryant Area = 4.10109639144569
5         Fillmore-Leroy Area = 4.350064316196875
6         Delavan Grider Area = 4.942666025512347
7                Lovejoy Area = 8.721404053616778
8            Black Rock Area = 2.9336975627739625
9         Hopkins-Tifft Area = 19.980052929825007
10      Lower West Side Area = 2.7154485587566897
11         Schiller Park Area = 4.575175753750093
12     Kensington-Bailey Area = 5.677638786388734
13             Ellicott Area = 4.2190936254504985
14             Kenfield Area = 3.3950988968459432
15           Masten Park Area = 5.162537133033123
16         Grant-Amherst Area = 2.787756353782923
17             Riverside Area = 4.616184288692757
18      Seneca-Cazenovia Area = 5.414997115929739
19           First Ward Area = 3.6378655941913296
20        Seneca Babcock Area = 6.702458672078505
21         Pratt-Willert Area = 3.232096716586115
22     Broadway Fillmore Area = 10.86469445208743
23         Central Park Area = 3.9076379124912095
24             North Park Area = 8.74376267797996
25       Upper West Side Area = 4.087098972018056
26         Elmwood Bidwell Area = 8.4497205532233
27               Parkside Area = 8.15821059108441
28             West Side Area = 4.291040097237276
29    University Heights Area = 6.461279625343363
30          Hamlin Park Area = 2.2862697700775425
31            Fruit Belt Area = 3.284535733621377
32             South Park Area = 8.02689631490958
33            Kaisertown Area = 4.246150798109207
34              MLK Park Area = 3.318255473442588
dtype: object

Get open data on pot hole service requests

The City of Buffalo has an open data set on 311 request for service calls. One type of call they receive from the public is for pot hole repair. On Feb 16, WIVB reported on the city's pot hole blitz (https://www.wivb.com/news/local-news/city-of-buffalo-goes-on-saturday-pothole-blitz/1788423791). The story indicted where the crews were dispatched to fix the pot holes.

The blitz included:

  • 2 crews on Main Street between Humboldt and Hertel

  • 1 crew on South Park Avenue between Bailey and Dorrance

  • 1 crew the Hamlin Park neighborhood

  • 1 crew the Broadway Fillmore area

  • 4 other crews across the city on various pothole requests that were logged into 311.

The story went on to indicate:

"The City of Buffalo said that 311 calls are not as high as they were in 2018. In February 2018, the city got 1,100 calls for pot hole repair through 311. In the two weeks of February 2019, the city said there have been 400 calls."

We will pull data for the first two weeks in Feb 2019 and map that data to the Buffalo Neighborhoods. The SODA endpoint is:

311 Service Requests https://data.buffalony.gov/resource/whkc-e5vr.geojson

Two methods are shown to query the data:

  • using SODAPY
  • using geopandas read_file

There are three commonly used parameters in a query:

  • \$limit - to increase the number of records beyond the default of 1,000
  • \$where - read specific rows (records)
  • \$select - read specific columns (fields)

    See the API Docs (Developers) for a full list of parameters (SQLQueries) https://dev.socrata.com/docs/queries/

Read data into a geodataframe since there is geometry in the dataset. The query will include

  • type contains 'Pot Hole" and open_date between 2/1/2019 and 2/15/2019.
In [18]:
socrata_dataset_identifier = "whkc-e5vr"
metadata = client.get_metadata(socrata_dataset_identifier)
[x['fieldName'] for x in metadata['columns']]# show field names in dataset
Out[18]:
['case_reference',
 'open_date',
 'closed_date',
 'status',
 'subject',
 'reason',
 'type',
 'object_type',
 'address_number',
 'address_line_1',
 'address_line_2',
 'city',
 'state',
 'zip_code',
 'property_id',
 'location',
 'latitude',
 'longitude',
 'council_district',
 'police_district',
 'census_tract',
 'census_block_group',
 'census_block',
 'neighborhood',
 'x_coordinate',
 'y_coordinate',
 ':@computed_region_tmcg_v66k',
 ':@computed_region_xbxg_7ifr',
 ':@computed_region_eziv_p4ck',
 ':@computed_region_fk4y_hpmh',
 ':@computed_region_kwzn_pe6v',
 ':@computed_region_uh5x_q5mi',
 ':@computed_region_dwzh_dtk5',
 ':@computed_region_k9un_h6vm']
In [19]:
ph_2019 = client.get(socrata_dataset_identifier,
                     content_type='geojson',
                     where ="(open_date between '2019-02-01T00:00:00' AND '2019-02-15T00:00:00') AND (type like 'Pot Hole%')")

phgdf = gpd.GeoDataFrame.from_features(ph_2019)
phgdf.shape
Out[19]:
(336, 30)

Geopandas URL

is a bit more complicated than the sodapy one since there is no client to convert the url to the proper codes. The only new code in this url is %25 which is the code for % which is the wild card character for searching substrings, Basically, this will read any row between the appropriate dates and whre the type field contains "Pot Hole%". The string must start with the Pot and Hole can be followed by any number of characters.

In [20]:
url3='https://data.buffalony.gov/resource/p7e8-krif.geojson?\
$where=open_date%20between%20%272019-02-01T00:00:00%27%20and%20%272019-02-15T00:00:00%27\
%20and%20type%20like%20%27Pot%20Hole%25%27'

phgdf2 =  gpd.read_file(url3)
phgdf2.shape
Out[20]:
(336, 30)

Just for fun, read the 2020 calls for Pot Hole repairs during the first two weeks of February.

In [21]:
url4='https://data.buffalony.gov/resource/p7e8-krif.geojson?$limit=1000000&\
$where=open_date%20between%20%272020-02-01T00:00:00%27%20and%20%272020-02-15T00:00:00%27\
%20and%20type%20like%20%27Pot%20Hole%25%27'

phgdf3 =  gpd.read_file(url4)
phgdf3.shape
Out[21]:
(320, 30)
In [22]:
phgdf2.tail()
Out[22]:
location_state zip_code city location_zip police_district x_coordinate subject latitude state neighborhood ... census_block_group y_coordinate census_tract open_date address_number type address_line_1 closed_date council_district geometry
331 None None Buffalo None None -78.84987 Dept of Public Works None NY UNKNOWN ... None 42.94737 None 2019-02-14T17:01:00 Unknown Pot Hole (Req_Serv) Unknown 2019-02-15T08:13:00 None None
332 None 14216 Buffalo None District D 1069359.8 Dept of Public Works 42.947746958340794 NY UNKNOWN ... 3 1074169.6 56 2019-02-14T17:33:00 INTERSECTION Pot Hole (Req_Serv) Hertel Ave 2019-02-15T15:04:00 NORTH POINT (-78.87822 42.94775)
333 None 14214 Buffalo None District D -8776997.371 Dept of Public Works 42.94394955613908 NY UNKNOWN ... 2 5303495.9275 48 2019-02-14T23:55:00 124 Pot Hole (Req_Serv) DEPEW AVE 2019-02-23T08:06:00 DELAWARE POINT (-78.84489 42.94395)
334 None 14214 Buffalo None District D -8777371.516 Dept of Public Works 42.94397977607642 NY UNKNOWN ... 2 5303499.9118 48 2019-02-14T23:58:00 34 Pot Hole (Req_Serv) DEPEW AVE 2019-02-15T15:06:00 DELAWARE POINT (-78.84849 42.94398)
335 None 14214 Buffalo None District D 1076936.6 Dept of Public Works 42.944067963326404 NY UNKNOWN ... 1 1072803.9 52.01 2019-02-14T23:59:00 INTERSECTION Pot Hole (Req_Serv) Linden Ave 2019-02-19T14:44:00 DELAWARE POINT (-78.84983 42.94407)

5 rows × 30 columns

In [23]:
phgdf2['council_district'] = phgdf2.council_district.fillna("UNKNOWN")
phgdf2=phgdf2.loc[phgdf2.geometry.notnull()]
phgdf2=phgdf2.to_crs('epsg:3857')
C:\Users\mixwa\AppData\Local\Continuum\anaconda3\envs\GEG584\lib\site-packages\pyproj\crs.py:77: FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method.
  return _prepare_from_string(" ".join(pjargs))
In [24]:
phgdf2.plot(column='council_district',figsize=(9,10),legend=True);

It appears that the UNKNOWN Council District should actually = NORTH. Change UNKNOWN to NORTH.

In [25]:
phgdf2['council_district'].replace({"UNKNOWN":"NORTH"}, inplace=True)
phgdf2.plot(column='council_district',figsize=(9,10),legend=True);

Another query example.

Find all pot hole complaints within a mile of my house.

  • Need the coordinates of my house (lat/lon) which is (y,x)
  • Converision of a mile to meters
    • within_circle(location,42.915431,-78.877030,1609)
  • Instead of reading all columns, use \$select to keep a few of them
  • Extract the year and month from open_date
    • date_extract_y(open_date)%20as%20year,date_extract_m(open_date)%20as%20month
In [26]:
url="https://data.buffalony.gov/resource/p7e8-krif.geojson?$limit=10000&\
$select=date_extract_y(open_date)%20as%20year,date_extract_m(open_date)%20as%20month,subject,reason,type,location&\
$where=within_circle(location,42.915431,-78.877030,1609)\
%20and%20type%20like%20%27Pot%20Hole%25%27"
phnear = gpd.read_file(url)
                         
phnear.tail()
Out[26]:
subject year reason type month geometry
3327 Dept of Public Works 2018 Engineering - Street Repairs Pot Hole (Req_Serv) 5 POINT (-78.88992 42.90466)
3328 Dept of Public Works 2019 Engineering - Street Repairs Pot Hole (Req_Serv) 4 POINT (-78.88992 42.90466)
3329 Dept of Public Works 2019 Engineering - Street Repairs Pot Hole (Req_Serv) 3 POINT (-78.88584 42.90545)
3330 Dept of Public Works 2018 Engineering - Street Repairs Pot Hole (Req_Serv) 2 POINT (-78.89455 42.90936)
3331 Dept of Public Works 2018 Engineering - Street Repairs Pot Hole (Req_Serv) 4 POINT (-78.89455 42.90936)
In [27]:
phnear=phnear.to_crs('epsg:3857')
phnear.plot(column='year',figsize=(9,9), legend=True);
C:\Users\mixwa\AppData\Local\Continuum\anaconda3\envs\GEG584\lib\site-packages\pyproj\crs.py:77: FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method.
  return _prepare_from_string(" ".join(pjargs))
In [28]:
phsummary=phnear.groupby(['year','month']).agg({'month':'count'})
phsummary.columns = ['Count_Mth']
phsummary=phsummary.reset_index()
phsummary
Out[28]:
year month Count_Mth
0 2008 10 1
1 2008 11 4
2 2008 12 4
3 2008 7 2
4 2008 8 5
... ... ... ...
134 2019 8 35
135 2019 9 18
136 2020 1 58
137 2020 2 43
138 2020 3 6

139 rows × 3 columns

In [29]:
phsummary.loc[phsummary['Count_Mth'].idxmax()]
Out[29]:
year         2018
month           2
Count_Mth     143
Name: 116, dtype: object

Assignment: New York City 311 data analysis

Borough Boundaries (water areas excluded) https://data.cityofnewyork.us/resource/7t3b-ywvw.geojson

311 Service Requests from 2010 to present https://data.cityofnewyork.us/resource/erm2-nwe9.geojson

where:

  • created_date between 1/1/2018 and 3/1/2020
  • descriptor contains Loud Music select:
  • keep year and month of created_date
  • keep - city, borough, location, open_data_channel_type, complaint_type, descriptor

Answer the following questions:

  1. Which Borough had the most noise complaints during this time period?
  2. During what month are noise complaints most frequent?
  3. How many compaints by borough are from residential propoerties versus commercial?
  4. What channel type is most frequently used, by borrough.

Map all commercial complaints (use the borough boundaries as the back drop behind the points.)

Let's make prettier maps. Below is an example that leads to adding a basemap, polygons, and points to the same map plot!

In [30]:
ax = gdf.plot(figsize=(10, 10), alpha=0.5, edgecolor='k') # plots polygons
In [31]:
# conda install -c conda-forge contextily
import contextily as ctx
In [32]:
ax = gdf.plot(figsize=(10, 10), alpha=0.5, edgecolor='k')
ctx.add_basemap(ax) # add default basemap to plot