How the mySidewalk Data Library is Built

Overview of the processes undertaken to build and maintain the data in the mySidewalk data library.

Written by Sarah Byrd

The mySidewalk data library is a curated set of community level data. It covers a wide variety of topics and themes, so that it can be used for a diverse set of needs in communities across the country.

A dedicated team at mySidewalk works to build, maintain, expand, and curate the data library, helping it grow and change. All of the data contained in the library undergoes the same high-level steps: Acquisition, Processing, and Metadata.

These three steps then allow for the Application of data across mySidewalk products.

In the rest of this article, we will provide some details on the Acquisition, Processing, and Metadata steps undertaken for data in the mySidewalk Data Library.

Acquisition

Data begins as an idea. We solicit data ideas from customers, prospects, subject matter experts, partners, and from our own extensive research and reading. We also work with partners to purchase data or create an exchange that allows us to add it to the library. Examples of data partners include ATTOM Data (property data), GoDaddy Venture Forward, the National Housing Preservation Database (NHPD), and Tether Data (noise data).

With a data idea in hand, we then commit to research to determine:

  • Granularity (smallest geography available)

  • Coverage (is it available for the whole United States?)

  • Use restrictions (cost of data, legal use restrictions)

  • Use cases (if the data is added, how might it be used to address a community need or problem?)

  • Duplication (do we already have similar data?)

  • Currency (can we find a more up-to-date source, with similar coverage and granularity?)

  • Update Frequency (will new versions of the data be published? How often?)

After passing the research stage, a data idea gets prioritized on our backlog to get added to the system data library.

Once picked up and assigned to be worked on, our data team develops a data pipeline for the idea. This pipeline generally acquires the raw data files and associated documentation, preserves the original raw files, does initial cleaning (e.g. removal of headers and footers, unmerging of cells), and loads the data into a working database.
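
As a rough illustration of what such a pipeline step might look like, here is a minimal Python sketch using pandas and SQLite. The file paths, table name, and header/footer row counts are hypothetical and stand in for whatever a given source requires; this is not mySidewalk's actual pipeline code.

import shutil
import sqlite3

import pandas as pd

# Hypothetical locations for the raw download and its preserved copy.
RAW_FILE = "downloads/source_table.csv"
ARCHIVE_FILE = "archive/source_table_original.csv"

# Preserve the original raw file before any cleaning touches it.
shutil.copy2(RAW_FILE, ARCHIVE_FILE)

# Initial cleaning: skip a 3-row header block and drop a 2-row footer (counts are assumed).
df = pd.read_csv(RAW_FILE, skiprows=3, skipfooter=2, engine="python")

# Load the cleaned rows into a working database for later processing steps.
with sqlite3.connect("working.db") as conn:
    df.to_sql("source_table_raw", conn, if_exists="replace", index=False)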

Processing

Once we have the data set up in a working database, we revisit the research step checklist. We review the provided documentation, methodology, data dictionaries, and publications to check our understanding of the data and how it might be used by our customers. Preliminary research is conducted to check the coverage and granularity, as the published availability sometimes does not match the availability in practice once suppressed or placeholder values are taken into account. Throughout all these steps, we want to catch data that does not meet our standards and stop before it gets added to the mySidewalk data library.

Next, we do some cleaning of the data, which includes but is not limited to the following (a brief sketch follows the list):

  • Removal of placeholder values (become NULL or No Data)

  • Correction of known location id (geoid) or name changes

  • Geocoding (adding latitude and longitude coordinates to address data)

  • Georeferencing (cleaning location and state names so that they join to mySidewalk geographies)
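
The sketch below shows, in pandas, roughly what the placeholder and geoid cleanup can look like. The placeholder codes and column names are assumptions for illustration; the geoid correction shown (Shannon County, SD becoming Oglala Lakota County) is a real, well-known change, but the rest is not mySidewalk's actual cleaning logic.

import numpy as np
import pandas as pd

# Hypothetical extract from the working database.
df = pd.read_csv("working_extract.csv", dtype={"geoid": str})

# Placeholder values become NULL (No Data) rather than being treated as real numbers.
PLACEHOLDERS = {-999, -666666666, 9999}  # example codes only
df["value"] = df["value"].where(~df["value"].isin(PLACEHOLDERS), np.nan)

# Correct known geoid changes via a small lookup table.
GEOID_FIXES = {"46113": "46102"}  # Shannon County, SD was renumbered as Oglala Lakota County
df["geoid"] = df["geoid"].replace(GEOID_FIXES)

# Georeference: tidy state names so rows join cleanly to mySidewalk geographies.
df["state_name"] = df["state_name"].str.strip().str.title()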

After cleaning, we start the pre-work that then allows us to build the data. Depending on the set of raw tables, this may involve the following (a harmonization sketch follows the list):

  • Harmonization (transformation of data from 2010 Census Tracts or Block groups into the 2020 Census Tracts or Block groups in the mySidewalk data library)

  • Imputation (mathematical filling in of missing data values using known data values)
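
As a rough sketch of what harmonization can look like, the example below translates counts published on 2010 tracts to 2020 tracts by joining to a crosswalk of overlap weights and re-totaling. The file names, column names, and crosswalk are assumptions for illustration, not mySidewalk's actual method.

import pandas as pd

# Hypothetical inputs: counts on 2010 tracts, plus a 2010-to-2020 crosswalk whose
# weights describe how much of each 2010 tract falls into each 2020 tract.
data_2010 = pd.read_csv("counts_2010_tracts.csv", dtype={"geoid_2010": str})
crosswalk = pd.read_csv("tract_2010_to_2020_crosswalk.csv",
                        dtype={"geoid_2010": str, "geoid_2020": str})

# Split each 2010 value across the 2020 tracts it overlaps, then re-total by 2020 tract.
merged = data_2010.merge(crosswalk, on="geoid_2010", how="inner")
merged["weighted_count"] = merged["count"] * merged["weight"]
harmonized = (merged.groupby("geoid_2020", as_index=False)["weighted_count"]
                    .sum()
                    .rename(columns={"weighted_count": "count"}))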

Next, we do the work to build the data. This may involve the following (an apportionment sketch follows the list):

  • Aggregation (for point data, calculating into the mySidewalk geographies)

  • Regrouping (creating custom data indicators by combining two or more raw data values or categories)

  • Transformation (doing math so that the output is a usable value, such as average, median, percent, percentile, etc)

  • Apportionment (a spatial math process that allows the calculation of data into mySidewalk geographies for which it was not originally published. See Getting data into Modern and/or Custom Geographic Boundaries for details)

  • Projection (calculation of data for future times, using the published raw data values)
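
To give a feel for the aggregation and apportionment steps, here is a rough geopandas sketch that counts points per geography and then apportions a polygon-level value by overlapping area. The layer names, the "geoid" and "value" columns, and the choice of simple area weighting are assumptions for illustration (real workflows often weight by population or housing units instead), and the sketch assumes the layers use a projected coordinate system so areas are meaningful.

import geopandas as gpd

# Hypothetical layers: geocoded point records and target geographies.
points = gpd.read_file("facilities_points.geojson")
targets = gpd.read_file("target_geographies.geojson")  # has a "geoid" column

# Aggregation: count how many points fall inside each target geography.
joined = gpd.sjoin(points, targets, how="inner", predicate="within")
counts = joined.groupby("geoid").size().rename("facility_count")

# Apportionment: split each source polygon's value across the target polygons it
# overlaps, in proportion to the overlapping area.
source = gpd.read_file("source_polygons.geojson")  # has a "value" column
source["source_area"] = source.geometry.area
pieces = gpd.overlay(source, targets, how="intersection")
pieces["share"] = pieces.geometry.area / pieces["source_area"]
apportioned = (pieces.assign(value=pieces["value"] * pieces["share"])
                     .groupby("geoid", as_index=False)["value"].sum())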

Once built, the data then undergoes an extensive QA process, checking things such as the following (a sketch of a few of these checks follows the list):

  • Nulls (reviewing missing data, if it should be missing or should have a value)

  • Zero (checking that the zero value is real and not a placeholder for missing or unknown data)

  • Parts equal the whole (if there are 6 categories, when combined do they equal the same value as the total?)

  • Nesting (combining the values of geographies whose borders match and checking that the smaller geographies add up to the container geography, so that block group, tract, county, state, and nation totals add up as expected)

  • Change over time (are the year-to-year differences consistent? If not, is there a cause like Covid-19 shutdown?)

  • Normal range (reviewing against documentation that values fall within the expected range; for example, if an index is defined as 0-1, all values should be between 0 and 1)

  • Outliers (creating a histogram or box-and-whisker plot of each indicator and reviewing outlier values to check whether they reflect a real pattern or a data error)

  • Numeracy (a logic check that the values make sense given the indicator; for example, if counting doctors in Washington D.C., 1 million is not correct, since the whole population of Washington D.C. is less than that)
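
A few of these checks are straightforward to express as code. The pandas sketch below illustrates the parts-equal-the-whole, normal-range, and outlier checks on a hypothetical built table; the column names, tolerance, and outlier rule are assumptions for illustration, not mySidewalk's actual QA suite.

import pandas as pd

built = pd.read_csv("built_indicator.csv")  # hypothetical output of the build step

# Parts equal the whole: category columns should sum to the published total (small tolerance).
category_cols = [c for c in built.columns if c.startswith("category_")]
mismatch = (built[category_cols].sum(axis=1) - built["total"]).abs() > 0.5
print(f"{mismatch.sum()} rows where the categories do not add up to the total")

# Normal range: an index documented as 0-1 should never leave that range.
out_of_range = ~built["index_value"].between(0, 1)
print(f"{out_of_range.sum()} rows outside the documented 0-1 range")

# Outliers: flag values far outside the interquartile range for manual review.
q1, q3 = built["total"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = built[(built["total"] < q1 - 3 * iqr) | (built["total"] > q3 + 3 * iqr)]
print(f"{len(outliers)} potential outlier values to review")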

Throughout the above processing steps, much of the data changes shape from the original raw data values. The two main reasons are:

  1. Geographic

    1. mySidewalk calculates all of the data values for Metropolitan Planning Organization (MPO), Neighborhood, and City Council District geographies.

    2. mySidewalk calculates data values beyond the original raw published values, using harmonization, apportionment, geocoding, and spatial aggregation.

  2. Calculations

    1. The team at mySidewalk tries to provide the data in ready-to-use formats. This often means creating custom time groups (e.g. pooling multiple years of data to increase availability) or custom data groups (e.g. 30% or more of income spent on housing, age generations, R/ECAP, etc.), as sketched after this list.

    2. These are values not directly available from a raw data download.
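
As a hedged illustration of both kinds of calculation, the sketch below pools several years of a hypothetical housing-cost table into one estimate per geography and then combines two published cost-burden categories into a single "30% or more of income spent on housing" indicator. The column names, years, and pooling method are assumptions for illustration only.

import pandas as pd

# Hypothetical raw table with one row per geography per year.
raw = pd.read_csv("housing_costs_by_year.csv", dtype={"geoid": str})

# Custom time group: pool several single years into one multi-year estimate per geography.
pooled = (raw[raw["year"].between(2019, 2023)]
            .groupby("geoid", as_index=False)[["burden_30_to_49_pct", "burden_50_plus_pct"]]
            .mean())

# Custom data group: combine two published categories into one ready-to-use indicator.
pooled["burden_30_plus_pct"] = pooled["burden_30_to_49_pct"] + pooled["burden_50_plus_pct"]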

Once the data is built and has passed the working database QA, it is then sent down the pipeline for metadata documentation.

Metadata

Metadata is the key to making it possible to search or browse the data library, select data, normalize data, visualize it, and leverage AI to help tell data stories with it. Some of the metadata we enrich includes the following (an illustrative record follows the list):

  • Citation (the pre-populated footer for data visualizations, along with the detailed catalog page, provides full source information)

  • Methodology (Available on every source page, it explains how the data values were calculated by the original source and then by mySidewalk)

  • Labels (these include the names for the data and data groups, time names, source names, and even the units for each data value)

  • Normalization (we store the data such that we can provide the data's own normalizers, if available, within the data tree, along with standard normalizers such as population, households, housing units, property, and area)

  • Groupings (Data that adds up or logically is used together are pre-grouped so that you can select the whole group, rather than each part individually.)

  • Synonyms (In data search, we provide synonyms and re-ranking rules to make it easier to find the most commonly used data.)
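
This article does not publish a metadata schema, but as a purely illustrative sketch, an enriched indicator record might carry fields along these lines (none of the field names or values below are mySidewalk's actual schema):

# Purely illustrative structure -- not mySidewalk's actual metadata schema.
indicator_metadata = {
    "label": "Cost-Burdened Households",
    "units": "households",
    "citation": "U.S. Census Bureau, American Community Survey 5-Year Estimates",
    "methodology": "Households spending 30% or more of income on housing, as calculated from the source table.",
    "normalizers": ["households"],      # used to turn counts into percentages
    "grouping": "Housing Cost Burden",  # selected as a whole group, not item by item
    "synonyms": ["housing cost burden", "rent burden"],
}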

After all of these steps, we then sync the data into a copy of the final production environment. This allows an additional QA pass, checking all of the things above in addition to how the data looks, reads, and behaves across the different mySidewalk products. In this environment, our team builds test components in Reports and observes data in Seek data tables. This provides a final check prior to the release.

Application

The mySidewalk data library is the same across all mySidewalk products. This allows you to use it however you need, including hopping between products for different workflows and use cases.
