We begin by loading packages we need. Here you might explain briefly what each one does and why it’s needed. I can add echo=FALSE if I want to suppress the R code being printed to my final document.
library(dplyr)
library(ggplot2)It is often said that in data analytics, 80% of the work is in data preparation and cleaning and only 20% is actual statistical work and hypothesis testing. This would be a great place for you to write a brief intro to where you found your data, and what you plan to do with it in this script.
Here, I make sure that the data is loaded. I have to make sure that the file path for the Philadelphia Census data is correct.
lead_data <- read.csv("https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+child_blood_lead_levels_by_ct&filename=child_blood_lead_levels_by_ct&format=csv&skipfields=cartodb_id,the_geom,the_geom_webmercator")
philadelphia_census_data <- read.csv("philly_census.csv")Here, I’m going to use some combination of head(), tail(), names(), dim(), str() and summary() to give my readers a bit of insight into the two datasets.
head(lead_data) # head, not tail, we want to see row 3## census_tract data_redacted num_bll_5plus num_screen perc_5plus
## 1 4.2101e+10 false 0 100 0
## 2 4.2101e+10 true NA 109 NA
## 3 4.2101e+10 true NA 110 NA
## 4 4.2101e+10 true NA 61 NA
## 5 4.2101e+10 false 0 41 0
## 6 4.2101e+10 true NA 49 NA
summary(philadelphia_census_data)## census_tract pct_unemployed pct_commute_public_transit
## Min. :4.21e+10 Min. : 0.00 Min. : 0.00
## 1st Qu.:4.21e+10 1st Qu.: 7.30 1st Qu.:16.75
## Median :4.21e+10 Median :11.90 Median :26.15
## Mean :4.21e+10 Mean :13.07 Mean :27.16
## 3rd Qu.:4.21e+10 3rd Qu.:17.10 3rd Qu.:37.12
## Max. :4.21e+10 Max. :75.00 Max. :79.90
## NA's :8 NA's :8
## median_household_income mean_household_income pct_child_uninsured
## Min. : 11473 Min. : 22414 Min. : 0.000
## 1st Qu.: 27195 1st Qu.: 37700 1st Qu.: 0.000
## Median : 39225 Median : 50357 Median : 2.400
## Mean : 43380 Mean : 58543 Mean : 4.115
## 3rd Qu.: 53725 3rd Qu.: 71770 3rd Qu.: 5.850
## Max. :145104 Max. :228178 Max. :33.900
## NA's :9 NA's :9 NA's :9
## pct_families_below_poverty_line
## Min. : 0.00
## 1st Qu.: 7.55
## Median :18.50
## Mean :19.86
## 3rd Qu.:28.95
## Max. :60.10
## NA's :9
Here, I’m going to add some insights from what I’ve read about the lead database and what I’ve been told about the census dataset.
For example, I could type a bulleted list like this:
I can add emphasis to my text in various ways.
Here I’m going to describe my criteria for imputation, e.g. what I do when the number of kiddos with high lead is suppressed. For example:
In lead_data, we want to impute some values. Where the data was redacted, it’s because the number of children with blood lead level above 5 micrograms per deciliter was between 1-5, inclusive.
We can make changes directly in our lead_data, but what if we mess up and have to re-download it? Better to make a copy to work with.
lead_data_corrected <- lead_dataTake the section of rows where the data was redacted and the number of kiddos with high lead is “NA”, and turn those NA into a number.
lead_data_corrected$num_bll_5plus[which(is.na(lead_data_corrected$num_bll_5plus))] <- 3But now that we’ve imputed a value for those NA values, the perc_5plus variable is not really accurate any more.
That means it is a good time to do some more radical cleaning of our data.
transmute() allows us to create a new dataset that keeps some columns and adds new ones.
lead_data_corrected <- lead_data_corrected %>%
transmute(census_tract, percent_high_lead = round((num_bll_5plus / num_screen) * 100, 1))
head(lead_data_corrected)## census_tract percent_high_lead
## 1 4.2101e+10 0.0
## 2 4.2101e+10 2.8
## 3 4.2101e+10 2.7
## 4 4.2101e+10 4.9
## 5 4.2101e+10 0.0
## 6 4.2101e+10 6.1
What column is the “hinge” between our two data frames?
Does it have the same variable name in both data frames, or is it different?
Which tracts do we want? Ones that appear in both? In at least lead_data? You want to explain here your rationale for joining the data and including / excluding data that exists in only one of the datasets.
We’re going to use merge to combine these.
Use ?merge() to get the information you need to complete the following:
lead_ses <- merge(x=lead_data_corrected,
y=philadelphia_census_data,
by="census_tract",
all = TRUE)
head(lead_ses)## census_tract percent_high_lead pct_unemployed pct_commute_public_transit
## 1 4.2101e+10 0.0 2.6 33.4
## 2 4.2101e+10 2.8 7.6 22.0
## 3 4.2101e+10 2.7 4.6 11.5
## 4 4.2101e+10 4.9 5.9 21.8
## 5 4.2101e+10 0.0 2.0 14.6
## 6 4.2101e+10 6.1 12.3 20.6
## median_household_income mean_household_income pct_child_uninsured
## 1 103772 124525 33.9
## 2 50455 99337 6.5
## 3 93036 121867 3.3
## 4 57604 83354 0.0
## 5 70038 92625 0.0
## 6 40568 60813 9.1
## pct_families_below_poverty_line
## 1 1.7
## 2 13.7
## 3 3.4
## 4 23.2
## 5 0.0
## 6 0.0
We want to visualize our data to understand it more.
Let’s make a histogram (frequency graph) for median income.
ggplot2 allows you to gradually add layers of complexity to a plot using the “grammar of graphics”.
You start with the data source (in our case, lead_ses) and some “aesthetic” values – what you’re going to plot from that data. Then you add a geometric representation (histogram, point, bar, box)
Try changing the code above to use geom_boxplot() instead of geom_histogram. Put the code below. Here’s a hint: you have to have an x, even though you don’t really need one. So try x=0.
What about putting two variables into play? If you have a geometric representation that requires an x and a y value, ggplot will assume that you’re going to give it two variables in aes. Unless you say otherwise, the first will be x, the second will be y.
Below, finish the code that will make a scatterplot (points) where x is pct_unemployed and y is percent_high_lead:
Do you think there’s a relationship?
Try looking at scatterplots of other relationships, like high lead as a function of mean household income, or uninsured children plotted against public transit use.
Find anything interesting?
Let’s add in a third variable, using color!
Let’s spruce this up and make our graph more attractive. Fill in the missing values below.
It’s better, but try it again below, but adding a line that says
scale_color_continuous("Percent of Families\nBelow the Poverty Line")
As you can see the syntax is complex. You can add colors to the lines, axes, backgrounds, etc. You can tell ggplot where you want your tick marks and where to put the plot legend. You can customize so much of your plot, but it isn’t super easy, syntax-wise.
BUT! If you have installed the package “ggThemeAssist”, there is some good stuff you can use by highlighting some ggplot code and then going to the “Addins” menu up top and selecting “ggplot Theme Assistant”.
Let’s try that now! Take the code below, highlight it, and select ggThemeAssist. Play with different ways to enhance / improve your plot!
You can also use some pre-set themes if you like, using ggthemes.
Here I use ggthemes to use ready-made themes that emulate a certain style or publication, like so:
So, you think you have something statistically interesting! A real pattern. Poorer areas have children with higher lead than more wealthy areas. But is the trend you see statistically significant?
Aside: I won’t actually teach statistics here. But a great book (free, online, with a nice printable version) is the Handbook of Biological Statistics at http://www.biostathandbook.com/. There’s a fan of this book that created an R companion to it. I can’t testify to how good it is, but it might be helpful. Check it out at https://rcompanion.org/rcompanion/a_02.html.
You think you have a pretty good predictor of lead levels for a given census tract: the percent of families in that area who live below the poverty line.
ggplot(lead_ses, aes(x=pct_families_below_poverty_line, y = percent_high_lead)) +
geom_point()my_model <- lm(percent_high_lead ~ pct_families_below_poverty_line, lead_ses)
summary(my_model)##
## Call:
## lm(formula = percent_high_lead ~ pct_families_below_poverty_line,
## data = lead_ses)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2234 -2.4512 -0.7699 1.7294 11.5795
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.50216 0.28438 8.799 <2e-16 ***
## pct_families_below_poverty_line 0.12410 0.01154 10.755 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.215 on 370 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.2382, Adjusted R-squared: 0.2361
## F-statistic: 115.7 on 1 and 370 DF, p-value: < 2.2e-16
Check out the R squared and the p values (Pr(>|t|)). You may also want to try plotting these points and the suggested linear model to see if it looks good to you.
Don’t want to do a predictive model, but rather, demonstrate 2-group difference in mean? We can make groups like this:
high_poverty <- lead_ses %>% filter(pct_families_below_poverty_line >= 15)
low_poverty <- lead_ses %>% filter(pct_families_below_poverty_line < 15)
t.test(high_poverty$percent_high_lead, low_poverty$percent_high_lead)##
## Welch Two Sample t-test
##
## data: high_poverty$percent_high_lead and low_poverty$percent_high_lead
## t = 10.182, df = 368.59, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.730961 4.038283
## sample estimates:
## mean of x mean of y
## 6.426761 3.042138
Check out the t, p-value, and confidence interval. How could you plot this? Here’s one way:
So we find a statistically significant difference. What’s the effect size? Why does this matter? Good place to write here.
library(effsize)
cohen.d(high_poverty$percent_high_lead, low_poverty$percent_high_lead, na.rm = TRUE)##
## Cohen's d
##
## d estimate: 1.032427 (large)
## 95 percent confidence interval:
## inf sup
## 0.8133096 1.2515436