The mySidewalk Data Library is a carefully curated repository of community-level data designed to address diverse needs across communities in the U.S. This article explains how our team builds, maintains, and updates the library through three main steps: Acquisition, Processing, and Metadata.

1. Acquisition: Collecting the Right Data

Our process starts with ideas from customers, experts, and our own internal research. We also work with partners to purchase or create an exchange that allows us to add the data to the library. Examples of data partners include: property data (ATTOM Data), GoDaddy Venture Forward, National Housing Preservation Database (NHPD), and noise data (Tether Data). We evaluate potential datasets based on:

Granularity: What is the smallest geographic unit available?
Coverage: Is it available nationwide?
Use Restrictions: Are there legal or cost-related limitations?
Use Cases: How can the data address community needs?
Duplication: Do we already have similar data?
Currency and Updates: Is the data current, and how often is it updated?

Once approved, our team creates a data pipeline to acquire and clean the raw files. This involves:

Initial Cleaning: Removing headers, footers, and unmerged cells.
Database Setup: Loading cleaned data into a working database.

2. Processing: Preparing the Data

Our data team revisits the research checklist and conducts detailed cleaning. This includes but is not limited to:

Placeholder Removal: Replacing placeholders like “N/A” with “NULL”.
Geocoding: Adding latitude and longitude to address data.
Georeferencing: Aligning location data with mySidewalk geographies.

We also perform advanced transformations like:

Harmonization: Standardizing Census tract data across different years.
Aggregation: Summarizing point data for larger geographies.
Apportionment: Using spatial math to align data with specific geographic boundaries.
Projection: Calculating data for future time periods using the published raw values.

Quality Assurance (QA) Process

Once built, the data undergoes an extensive QA process to ensure accuracy and reliability. Key checks include:

Valid Nulls and Zeros: Reviewing missing data and zero values to confirm whether they represent valid data points or placeholders for unknown data.
Basic Calculations: Ensuring that aggregated values align. For instance, if a dataset has six categories, their combined total should match the overall dataset total.
Basic Logic: Verifying data ranges and numeracy against documentation. For example, an index ranging from 0 to 1 should have all values within this range. Similarly, a count of 1 million doctors in Washington D.C. would be flagged since the city’s population is lower than this number.
Nesting: Validating hierarchical geographic data. For example, ensuring that all census block group values aggregate correctly to match the value for a census tract.
Changes Over Time: Checking year-to-year differences for consistency. For example, anomalies caused by events like COVID-19 are reviewed and documented.
Outliers: Identifying and examining outlier values using visual tools like histograms or box-and-whisker plots to distinguish real patterns from potential errors.

Why Does Data Change Shape?

Throughout the QA and processing steps, data may change shape from its original raw form due to the following factors:

Geography:

Data values are calculated for Metropolitan Planning Organizations (MPOs), Neighborhoods, and City Council Districts.
Additional geographic calculations are performed using harmonization, apportionment, geocoding, and spatial aggregation.

Calculations:

Data is transformed into ready-to-use formats. This often involves creating custom groupings, such as pooling data across multiple years for increased availability or calculating new indicators like “30% or more of income spent on housing” or “age generations.”
These values are not directly available from raw datasets but are crucial for actionable insights.

One final QA pass is conducted once the data has been uploaded to the platform. This includes checking all of the things above in addition to how the data looks, reads, and behaves across the different mySidewalk products. In this environment, our team builds test components in Reports and observes data in Seek data tables. This step ensures the data is ready to power features like data search and visualization.

Once the data passes the working database QA process, it moves into the pipeline for metadata documentation.

3. Metadata: Adding Context and Accessibility

Metadata enhances the usability of our data by providing:

Citations: Full source information for transparency.
Methodology Descriptions: Detailed explanations of how values are calculated.
Labels and Synonyms: Ensuring data is easy to search and understand.
Groupings: Logical combinations of related data for convenience.

4. Application: Putting Data to Use

The mySidewalk Data Library integrates seamlessly into all our products, allowing users to transition between tools for various workflows. Whether you’re visualizing data, addressing community needs, or preparing reports, our library ensures reliable and actionable insights.

Getting data into Modern and/or Custom Geographic Boundaries

Troubleshooting Visualizations Built with your Data

Census Projections in mySidewalk

Understanding Historical Census Data in mySidewalk

Data Availability

How the mySidewalk Data Library is Built

1. Acquisition: Collecting the Right Data

2. Processing: Preparing the Data

Quality Assurance (QA) Process

3. Metadata: Adding Context and Accessibility

4. Application: Putting Data to Use