Step by Step: GeoReference Tutorial for a Correlation

A step-by-step walkthrough for creating an example that correlates your data with mySidewalk data.

Written by Jennifer Funk
Updated over a week ago

If you want to assign a geography in mySidewalk, the georeference tool is a great way to do that. It lets you upload data without relying on a third-party tool or gathering latitude/longitude coordinates that may not relate to the visualization you are trying to make.

Additionally, once you have mapped your data to the geographic regions that we support, you can leverage mySidewalk data to make a correlation.

These articles can provide more information as needed:

Goal of this Example

In this article, we are going to walk through an example of how to use the georeference tool to create a correlation scatter plot between mySidewalk data and your data. The goal is to show the correlation between the COVID-19 case rate and population density in Kansas City by following these steps:

1. Get and Clean the Data

We can use the data publicly available at data.kcmo.org. I used the COVID-19 Data by ZIP Code dataset.

  • First, we can remove all of the rows that we do not want to upload. For example, any row listed as ‘not calculated’ will not be helpful to us.

  • Then, we need to ensure that there is only one row that contains header information (you can access the cleaned file here to check your work). The column headers will become the names of the data values when you load them in. For example, you will have data called “Crude Rate Per 100,000” for each geography (here, each ZIP Code).

  • You can also remove extra data columns if you want (for this example, we will not be using “Two-Week Total Cases” but it’s fine to leave it in the spreadsheet)

  • Save the cleaned file as a .csv
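If you would rather script the cleanup than edit the spreadsheet by hand, the steps above can be sketched as follows. This is only an illustration: the sample values are made up, and the “ZIP Code” and “Total Cases” column names are assumptions, so check them against the headers in the file you actually downloaded.

```python
import csv
import io

# Made-up sample rows for illustration; the real export's columns may differ.
RAW_CSV = '''ZIP Code,Total Cases,"Crude Rate Per 100,000",Two-Week Total Cases
64101,120,2400.5,8
64102,not calculated,not calculated,not calculated
64105,95,1810.0,5
'''


def clean_rows(rows, drop_columns=()):
    """Drop rows containing 'not calculated' and remove unwanted columns."""
    cleaned = []
    for row in rows:
        values = [str(value).strip().lower() for value in row.values()]
        if "not calculated" in values:
            continue  # this row has nothing we can upload
        cleaned.append({k: v for k, v in row.items() if k not in drop_columns})
    return cleaned


def to_csv_text(rows):
    """Write the rows back out as CSV text with a single header row."""
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
cleaned = clean_rows(rows, drop_columns={"Two-Week Total Cases"})
cleaned_csv = to_csv_text(cleaned)
print(cleaned_csv)
```

To run this on a real file, open it with `open(path, newline="")` in place of the in-memory sample and write `cleaned_csv` to a new .csv file.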

2. Upload and Assign the geographies

We will now take that cleaned file and upload it into mySidewalk.

  • Log In

  • Click Upload -> Upload and Georeference

  • Upload the file we created

At this point, you will see a sample of the file.

  • Select the “Multiple Geographies” button

  • Using the dropdowns, assign a ZIP Code to each row. Every row must have a corresponding geography selected.

  • You can also change the name of your file by updating the “Layer File name” at the top of the page

  • Click “Submit” to upload the layer with the geographies assigned
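Because every row needs a matching geography, it can help to check the ZIP Code column before uploading. The sketch below is not part of mySidewalk. It is a hypothetical pre-upload check that assumes plain five-digit ZIP Codes and flags anything malformed or repeated.

```python
import re

# Assumption: geographies are plain 5-digit ZIP Codes (no ZIP+4 suffix).
ZIP_PATTERN = re.compile(r"\d{5}")


def find_zip_problems(zip_codes):
    """Return (spreadsheet_line, value, reason) for each suspicious ZIP Code."""
    problems = []
    seen = set()
    # Data starts on spreadsheet line 2 because line 1 holds the headers.
    for line_number, raw in enumerate(zip_codes, start=2):
        zip_code = raw.strip()
        if not ZIP_PATTERN.fullmatch(zip_code):
            problems.append((line_number, raw, "not a 5-digit ZIP Code"))
        elif zip_code in seen:
            problems.append((line_number, raw, "duplicate ZIP Code"))
        else:
            seen.add(zip_code)
    return problems


# Made-up values for illustration.
for problem in find_zip_problems(["64101", "6410", "64101", " 64105 "]):
    print(problem)
```

An empty result means each row has a distinct, well-formed ZIP Code to pick from the dropdowns.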

A successful layer upload will look like the above image.

Things to note:

  • Columns that contain numeric values show “number” rather than “text” in the Type column. This means the format can be changed within mySidewalk (to percent, for example) and the values can be used in calculations such as normalization

  • All the column headers load into the “Name” list. We want to ensure that the data we need is the data we are getting. You can confirm the data values by choosing the different geographies and watching the data values change to match the spreadsheet

  • The geographies are outlined on the map image. Use this to confirm that you agree with our representation of the geography

Things to Do:

  • Choose a label for your layer with the radio buttons in the “Use as Label?” column (typically the name of the geographic region). In this example, choose the row named “Label”. The label is the name of the geographic region that appears when you aren’t using a map (see callout example)

  • Change the aliases if desired

3. Use the layer in a Correlation

To start using this layer, you will need to navigate to a report. You can choose “New Report” at the top of the page or select “Reports” from the blue quick start menu button on the left.

Create the Correlation

  • Hover over the page to reveal the blue “+” (add content) button, select it, then choose “Correlation”

  • Select “Add My Layer” since you are adding a layer you uploaded and not a layer created by mySidewalk

  • Select the layer you just uploaded

  • You can see the Correlation populate with two generic datasets

Adjust both Data Selections

Because the goal is to correlate the COVID-19 case rate with population density, we need to make some adjustments to the selected data.

First, we will choose the Rate per 100,000 data for the x-axis:

  • Under “X-Axis”, click Change Data and choose your layer (“Data from KC COVID-19_Data_by_ZIP_Code”)

  • Choose the plus button next to “Crude Rate Per 100,000”

Next, we will choose Total Population data for the y-axis:

  • Under “Y-Axis”, click Change Data and choose mySidewalk Data

  • Search for and select “Total Population”

Finally, we will normalize the population data because the Rate per 100,000 data is already normalized and we want an appropriate comparison.

  • Under “Y-Axis” in the “Normalize by” dropdown, choose “Total Area (acres)”
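To see why normalizing matters, here is the arithmetic behind the two axes as a small sketch. The crude rate formula is the standard cases-per-100,000-residents definition, and population density is simply population divided by area. The function names and sample numbers are made up for illustration and are not mySidewalk's internals.

```python
def crude_rate_per_100k(cases, population):
    """Cases per 100,000 residents (already normalized by population)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * 100_000 / population


def population_density(total_population, total_area_acres):
    """People per acre: the result of normalizing population by area."""
    if total_area_acres <= 0:
        raise ValueError("area must be positive")
    return total_population / total_area_acres


# Made-up numbers for a single hypothetical ZIP Code.
rate = crude_rate_per_100k(cases=150, population=12_000)
density = population_density(total_population=12_000, total_area_acres=3_000)
print(rate, density)  # 1250.0 4.0
```

Both values are now independent of how large the ZIP Code is, which makes them an appropriate pair to compare.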

What does this correlation tell me?

As described in the footer, there is not a strong relationship between population density and crude COVID rate per 100K in Kansas City.
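If you want to sanity-check the strength of a relationship outside the report, one common measure is the Pearson correlation coefficient. The sketch below assumes that measure. The article does not say which statistic the footer uses, so treat this as an independent check and not a reproduction of mySidewalk's calculation. The sample values are made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return covariance / math.sqrt(spread_x * spread_y)


# Made-up values: crude rate per 100,000 and people per acre for five ZIP Codes.
crude_rates = [2400.5, 1810.0, 2100.0, 1950.0, 2250.0]
densities = [4.0, 6.5, 2.1, 5.2, 3.3]
print(round(pearson_r(crude_rates, densities), 2))
```

Values near 1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one.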
