Load Packages We Need

We begin by loading the packages we need. Here you might explain briefly what each one does and why it’s needed. I can add echo=FALSE to a chunk if I want to suppress the R code being printed in my final document.

library(dplyr)
library(ggplot2)
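
For instance, to run this chunk but hide the code in the knitted document, I could set echo=FALSE in the chunk header of my .Rmd source. A minimal sketch (the chunk label load-packages is just an example name of my own):

```{r load-packages, echo=FALSE}
library(dplyr)    # data manipulation verbs like filter() and transmute()
library(ggplot2)  # plotting with the grammar of graphics
```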

Data Preparation

It is often said that in data analytics, 80% of the work is data preparation and cleaning and only 20% is actual statistical work and hypothesis testing. This would be a great place to write a brief introduction explaining where you found your data and what you plan to do with it in this script.

Get Data

Here I load the data. I have to make sure that the file path for the Philadelphia Census data is correct.

# Blood lead data, pulled straight from the City of Philadelphia's CARTO API
lead_data <- read.csv("https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+child_blood_lead_levels_by_ct&filename=child_blood_lead_levels_by_ct&format=csv&skipfields=cartodb_id,the_geom,the_geom_webmercator")
# Census data from a local file; adjust the path to wherever you saved it
philadelphia_census_data <- read.csv("philly_census.csv")

Examine Data

Here, I’m going to use some combination of head(), tail(), names(), dim(), str() and summary() to give my readers a bit of insight into the two datasets.

head(lead_data) # head, not tail: the redacted rows we care about (like row 3) show up right at the top
##   census_tract data_redacted num_bll_5plus num_screen perc_5plus
## 1   4.2101e+10         false             0        100          0
## 2   4.2101e+10          true            NA        109         NA
## 3   4.2101e+10          true            NA        110         NA
## 4   4.2101e+10          true            NA         61         NA
## 5   4.2101e+10         false             0         41          0
## 6   4.2101e+10          true            NA         49         NA
summary(philadelphia_census_data)
##   census_tract      pct_unemployed  pct_commute_public_transit
##  Min.   :4.21e+10   Min.   : 0.00   Min.   : 0.00             
##  1st Qu.:4.21e+10   1st Qu.: 7.30   1st Qu.:16.75             
##  Median :4.21e+10   Median :11.90   Median :26.15             
##  Mean   :4.21e+10   Mean   :13.07   Mean   :27.16             
##  3rd Qu.:4.21e+10   3rd Qu.:17.10   3rd Qu.:37.12             
##  Max.   :4.21e+10   Max.   :75.00   Max.   :79.90             
##                     NA's   :8       NA's   :8                 
##  median_household_income mean_household_income pct_child_uninsured
##  Min.   : 11473          Min.   : 22414        Min.   : 0.000     
##  1st Qu.: 27195          1st Qu.: 37700        1st Qu.: 0.000     
##  Median : 39225          Median : 50357        Median : 2.400     
##  Mean   : 43380          Mean   : 58543        Mean   : 4.115     
##  3rd Qu.: 53725          3rd Qu.: 71770        3rd Qu.: 5.850     
##  Max.   :145104          Max.   :228178        Max.   :33.900     
##  NA's   :9               NA's   :9             NA's   :9          
##  pct_families_below_poverty_line
##  Min.   : 0.00                  
##  1st Qu.: 7.55                  
##  Median :18.50                  
##  Mean   :19.86                  
##  3rd Qu.:28.95                  
##  Max.   :60.10                  
##  NA's   :9
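
To round out the picture, I might also run a few of the other functions mentioned above; a quick sketch (output omitted here):

dim(lead_data)                 # number of rows and columns
names(lead_data)               # column names
str(philadelphia_census_data)  # one line per column, showing its type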

Here, I’m going to add some insights from what I’ve read about the lead database and what I’ve been told about the census dataset.

For example, I could type a bulleted list like this:

  • Where there were 1-5 children with high lead, the exact number is suppressed.
  • If there were 0 children, or 6 or more children, with high lead, the exact number is given.
  • The census data is a selection of the American Community Survey’s estimates of Selected Economic Characteristics (2016 estimates).

I can add emphasis to my text in various ways.

Imputation

Here I’m going to describe my criteria for imputation – that is, what I do when the number of kiddos with high lead is suppressed. For example:

In lead_data, we want to impute some values. Where the data was redacted, it’s because the number of children with a blood lead level above 5 micrograms per deciliter was between 1 and 5, inclusive.

Make a Copy!

We can make changes directly in our lead_data, but what if we mess up and have to re-download it? Better to make a copy to work with.

lead_data_corrected <- lead_data

Impute

Take the rows where the data was redacted and the number of kiddos with high lead is NA, and turn those NAs into a number. A defensible choice is 3, the midpoint of the suppressed 1-5 range.

# Replace the redacted NA counts with 3, the midpoint of the suppressed 1-5 range
lead_data_corrected$num_bll_5plus[which(is.na(lead_data_corrected$num_bll_5plus))] <- 3

Clean Up

But now that we’ve imputed a value for those NAs, the perc_5plus variable is no longer accurate.

That means it’s a good time to do some more radical cleaning of our data.

transmute() allows us to create a new data frame that keeps only the columns we name and adds new ones we compute.

lead_data_corrected <- lead_data_corrected %>%
  # keep the tract ID and recompute the percentage from the imputed counts
  transmute(census_tract, percent_high_lead = round((num_bll_5plus / num_screen) * 100, 1))

head(lead_data_corrected)
##   census_tract percent_high_lead
## 1   4.2101e+10               0.0
## 2   4.2101e+10               2.8
## 3   4.2101e+10               2.7
## 4   4.2101e+10               4.9
## 5   4.2101e+10               0.0
## 6   4.2101e+10               6.1

Combination

What column is the “hinge” between our two data frames?

Does it have the same variable name in both data frames, or is it different?

Which tracts do we want? Ones that appear in both? Ones that appear in at least lead_data? You’ll want to explain your rationale here for joining the data and for including or excluding data that exists in only one of the datasets.

We’re going to use merge to combine these.

Use ?merge to get the information you need to complete the following:

# all = TRUE keeps every tract that appears in either data frame (a full outer join)
lead_ses <- merge(x = lead_data_corrected, 
                  y = philadelphia_census_data, 
                  by = "census_tract",
                  all = TRUE)

head(lead_ses)
##   census_tract percent_high_lead pct_unemployed pct_commute_public_transit
## 1   4.2101e+10               0.0            2.6                       33.4
## 2   4.2101e+10               2.8            7.6                       22.0
## 3   4.2101e+10               2.7            4.6                       11.5
## 4   4.2101e+10               4.9            5.9                       21.8
## 5   4.2101e+10               0.0            2.0                       14.6
## 6   4.2101e+10               6.1           12.3                       20.6
##   median_household_income mean_household_income pct_child_uninsured
## 1                  103772                124525                33.9
## 2                   50455                 99337                 6.5
## 3                   93036                121867                 3.3
## 4                   57604                 83354                 0.0
## 5                   70038                 92625                 0.0
## 6                   40568                 60813                 9.1
##   pct_families_below_poverty_line
## 1                             1.7
## 2                            13.7
## 3                             3.4
## 4                            23.2
## 5                             0.0
## 6                             0.0

Visualize Data

We want to visualize our data to understand it better.

Histogram

Let’s make a histogram (frequency graph) for median income.

ggplot2 allows you to gradually add layers of complexity to a plot using the “grammar of graphics”.

You start with the data source (in our case, lead_ses) and some “aesthetic” values – what you’re going to plot from that data. Then you add a geometric representation (histogram, point, bar, box).
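
Here’s a minimal sketch of that recipe for median income (the binwidth is a guess; tune it to your data):

ggplot(lead_ses, aes(x = median_household_income)) +
  geom_histogram(binwidth = 5000)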

Try changing the code above to use geom_boxplot() instead of geom_histogram(). Put the code below. Here’s a hint: you have to have an x, even though you don’t really need one. So try x=0.
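
One way that might look (a sketch using the x=0 trick):

ggplot(lead_ses, aes(x = 0, y = median_household_income)) +
  geom_boxplot()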

Boxplot

What about putting two variables into play? If you have a geometric representation that requires an x and a y value, ggplot will assume that you’re going to give it two variables in aes. Unless you say otherwise, the first will be x, the second will be y.
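
For example, here is one possible two-variable boxplot; the 15% poverty cutoff is my own arbitrary grouping choice, not a standard threshold:

ggplot(lead_ses, aes(x = pct_families_below_poverty_line >= 15, y = percent_high_lead)) +
  geom_boxplot()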

Scatterplot

Below, finish the code that will make a scatterplot (points) where x is pct_unemployed and y is percent_high_lead:
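
A sketch of the finished code:

ggplot(lead_ses, aes(x = pct_unemployed, y = percent_high_lead)) +
  geom_point()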

Do you think there’s a relationship?

Try looking at scatterplots of other relationships, like high lead as a function of mean household income, or uninsured children plotted against public transit use.

Find anything interesting?

Adding Color

Let’s add in a third variable, using color!
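
For instance, we can map a third column to color inside aes(); a sketch using the poverty variable (any numeric column would do):

ggplot(lead_ses, aes(x = pct_unemployed, y = percent_high_lead,
                     color = pct_families_below_poverty_line)) +
  geom_point()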

Improving Labels

Let’s spruce this up and make our graph more attractive. Fill in the missing values below.
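
Something along these lines, using labs() to set a title and axis labels (the label text is just placeholder wording to adapt):

ggplot(lead_ses, aes(x = pct_unemployed, y = percent_high_lead,
                     color = pct_families_below_poverty_line)) +
  geom_point() +
  labs(title = "Childhood Lead Exposure in Philadelphia",
       x = "Percent Unemployed",
       y = "Percent of Children with Elevated Blood Lead")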

It’s better, but try it again below, this time adding a line that says

scale_color_continuous("Percent of Families\nBelow the Poverty Line")
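
Appended to the plot above, the whole thing might look like this sketch (the new line renames the color legend):

ggplot(lead_ses, aes(x = pct_unemployed, y = percent_high_lead,
                     color = pct_families_below_poverty_line)) +
  geom_point() +
  scale_color_continuous("Percent of Families\nBelow the Poverty Line")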

As you can see, the syntax is complex. You can add colors to the lines, axes, backgrounds, etc. You can tell ggplot where you want your tick marks and where to put the plot legend. You can customize so much of your plot, but it isn’t super easy, syntax-wise.

BUT! If you have installed the package “ggThemeAssist”, there is some good stuff you can use by highlighting some ggplot code and then going to the “Addins” menu up top and selecting “ggplot Theme Assistant”.

ggThemeAssist (START HERE!)

Let’s try that now! Take the code below, highlight it, and select ggThemeAssist. Play with different ways to enhance / improve your plot!
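
Here’s a simple plot you could highlight and hand to the addin (a sketch; any ggplot code works):

ggplot(lead_ses, aes(x = pct_commute_public_transit, y = pct_child_uninsured)) +
  geom_point()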

You can also use some pre-set themes if you like, using ggthemes.

ggthemes

Here I use ggthemes to use ready-made themes that emulate a certain style or publication, like so:
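
A sketch, assuming you’ve installed ggthemes; theme_economist() mimics the look of plots in The Economist:

library(ggthemes)
ggplot(lead_ses, aes(x = pct_unemployed, y = percent_high_lead)) +
  geom_point() +
  theme_economist()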

Statistics

So, you think you have something statistically interesting! A real pattern. Poorer areas have children with higher lead levels than wealthier areas. But is the trend you see statistically significant?

Aside: I won’t actually teach statistics here. But a great book (free, online, with a nice printable version) is the Handbook of Biological Statistics at http://www.biostathandbook.com/. There’s a fan of this book who created an R companion to it. I can’t testify to how good it is, but it might be helpful. Check it out at https://rcompanion.org/rcompanion/a_02.html.

Statistics for a linear model

You think you have a pretty good predictor of lead levels for a given census tract: the percent of families in that area who live below the poverty line.

ggplot(lead_ses, aes(x=pct_families_below_poverty_line, y = percent_high_lead)) +
  geom_point()

my_model <- lm(percent_high_lead ~ pct_families_below_poverty_line, lead_ses)
summary(my_model)
## 
## Call:
## lm(formula = percent_high_lead ~ pct_families_below_poverty_line, 
##     data = lead_ses)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2234 -2.4512 -0.7699  1.7294 11.5795 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      2.50216    0.28438   8.799   <2e-16 ***
## pct_families_below_poverty_line  0.12410    0.01154  10.755   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.215 on 370 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.2382, Adjusted R-squared:  0.2361 
## F-statistic: 115.7 on 1 and 370 DF,  p-value: < 2.2e-16

Check out the R-squared and the p-values (Pr(>|t|)). You may also want to try plotting these points and the suggested linear model to see if it looks good to you.
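
A quick sketch of that overlay; geom_smooth() with method = "lm" draws the fitted line over the points:

ggplot(lead_ses, aes(x = pct_families_below_poverty_line, y = percent_high_lead)) +
  geom_point() +
  geom_smooth(method = "lm")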

Statistics for a t-test

Don’t want to do a predictive model, but rather demonstrate a difference in means between two groups? We can make groups like this:

# Split tracts at 15% of families below the poverty line (an arbitrary cutoff)
high_poverty <- lead_ses %>% filter(pct_families_below_poverty_line >= 15)
low_poverty <- lead_ses %>% filter(pct_families_below_poverty_line < 15)

t.test(high_poverty$percent_high_lead, low_poverty$percent_high_lead)
## 
##  Welch Two Sample t-test
## 
## data:  high_poverty$percent_high_lead and low_poverty$percent_high_lead
## t = 10.182, df = 368.59, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.730961 4.038283
## sample estimates:
## mean of x mean of y 
##  6.426761  3.042138

Check out the t, p-value, and confidence interval. How could you plot this? Here’s one way:
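
For example, a sketch that labels each tract’s poverty group and compares the two distributions side by side:

lead_ses %>%
  mutate(poverty_group = ifelse(pct_families_below_poverty_line >= 15,
                                "high poverty", "low poverty")) %>%
  ggplot(aes(x = poverty_group, y = percent_high_lead)) +
  geom_boxplot()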

So we find a statistically significant difference. What’s the effect size? Why does this matter? This is a good place to write about that.

library(effsize)
# Cohen's d: the standardized difference between the two group means
cohen.d(high_poverty$percent_high_lead, low_poverty$percent_high_lead, na.rm = TRUE)
## 
## Cohen's d
## 
## d estimate: 1.032427 (large)
## 95 percent confidence interval:
##       inf       sup 
## 0.8133096 1.2515436