[Video]
# Scatterplot of weight vs. weeks
ggplot(data = ncbirths, aes(x = weeks, y = weight)) +
geom_point()
## Warning: Removed 2 rows containing missing values (geom_point).
# Boxplot of weight vs. weeks
ggplot(data = ncbirths,
aes(x = cut(weeks, breaks = 5), y = weight)) +
geom_boxplot()
[Video]
# Mammals scatterplot
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
geom_point()
# Baseball player scatterplot
ggplot(data = mlbBat10, aes(x = OBP, y = SLG)) +
geom_point()
# Body dimensions scatterplot
ggplot(data = bdims, aes(x = hgt, y = wgt, color = factor(sex))) +
geom_point()
# Smoking scatterplot
ggplot(data = smoking, aes(x = age, y = amtWeekdays)) +
geom_point()
## Warning: Removed 1270 rows containing missing values (geom_point).
This scatterplot shows the relationship between the poverty rates and high school graduation rates of counties in the United States.
Describe the form, direction, and strength of this relationship.
# Scatterplot with coord_trans()
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
geom_point() +
coord_trans(x = "log10", y = "log10")
# Scatterplot with scale_x_log10() and scale_y_log10()
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
geom_point() +
scale_x_log10() +
scale_y_log10()
[Video]
# Filter for AB greater than or equal to 200
ab_gt_200 <- mlbBat10 %>%
filter(AB >= 200)
# Scatterplot of SLG vs. OBP
ggplot(ab_gt_200, aes(x = OBP, y = SLG)) +
geom_point()
# Identify the outlying player
ab_gt_200 %>%
filter(OBP < 0.2)
## name team position G AB R H 2B 3B HR RBI TB BB SO SB CS OBP SLG
## 1 B Wood LAA 3B 81 226 20 33 2 0 4 14 47 6 71 1 0 0.174 0.208
## AVG
## 1 0.146
[Video]
In a scientific paper, three correlations are reported with the following values:
Choose the correct interpretation of these findings.
In a scientific paper, three correlations are reported with the following values:
Which of these values represents the strongest correlation?
# Compute correlation
ncbirths %>%
summarize(N = n(), r = cor(weight, mage))
## N r
## 1 1000 0.05506589
# Compute correlation for all non-missing pairs
ncbirths %>%
summarize(N = n(), r = cor(weight, weeks, use = "pairwise.complete.obs"))
## N r
## 1 1000 0.6701013
[Video]
# Compute properties of Anscombe
Anscombe %>%
group_by(set) %>%
summarize(
N = n(),
mean_of_x = mean(x),
std_dev_of_x = sd(x),
mean_of_y = mean(y),
std_dev_of_y = sd(y),
correlation_between_x_and_y = cor(x, y)
)
## # A tibble: 4 x 7
## set N mean_of_x std_dev_of_x mean_of_y std_dev_of_y correlation_between…
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 11 9 3.32 7.50 2.03 0.816
## 2 2 11 9 3.32 7.50 2.03 0.816
## 3 3 11 9 3.32 7.5 2.03 0.816
## 4 4 11 9 3.32 7.50 2.03 0.817
Recall the scatterplot between the poverty rate of counties in the United States and the high school graduation rate in those counties from the previous chapter. Which of the following values is the correct correlation between poverty rate and high school graduation rate?
# Run this and look at the plot
ggplot(data = mlbBat10, aes(x = OBP, y = SLG)) +
geom_point()
# Correlation for all baseball players
mlbBat10 %>%
summarize(N = n(), r = cor(OBP, SLG))
## N r
## 1 1199 0.8145628
# Run this and look at the plot
mlbBat10 %>%
filter(AB > 200) %>%
ggplot(aes(x = OBP, y = SLG)) +
geom_point()
# Correlation for all players with at least 200 ABs
mlbBat10 %>%
filter(AB >= 200) %>%
summarize(N = n(), r = cor(OBP, SLG))
## N r
## 1 329 0.6855364
# Run this and look at the plot
ggplot(data = bdims, aes(x = hgt, y = wgt, color = factor(sex))) +
geom_point()
# Correlation of body dimensions
bdims %>%
group_by(sex) %>%
summarize(N = n(), r = cor(hgt, wgt))
## # A tibble: 2 x 3
## sex N r
## <int> <int> <dbl>
## 1 0 260 0.431
## 2 1 247 0.535
# Run this and look at the plot
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
geom_point() + scale_x_log10() + scale_y_log10()
# Correlation among mammals, with and without log
mammals %>%
summarize(N = n(),
r = cor(BodyWt, BrainWt),
r_log = cor(log(BodyWt), log(BrainWt)))
## N r r_log
## 1 62 0.9341638 0.9595748
[Video]
Recall that you previously determined the value of the correlation coefficient between the poverty rate of counties in the United States and the high school graduation rate in those counties was -0.681. Choose the correct interpretation of this value.
In the San Francisco Bay Area from 1960-1967, the correlation between the birthweight of 1,236 babies and the length of their gestational period was 0.408. Which of the following conclusions is not a valid statistical interpretation of these results.
[Video]
# Create faceted scatterplot
ggplot(data = noise, aes(x = x, y = y)) +
geom_point() +
facet_wrap(~ z)
# Compute correlations for each dataset
noise_summary <- noise %>%
group_by(z) %>%
summarize(N = n(), spurious_cor = cor(x, y))
# Isolate sets with correlations above 0.2 in absolute strength
noise_summary %>%
filter(abs(spurious_cor) > 0.2)
## # A tibble: 3 x 3
## z N spurious_cor
## <int> <int> <dbl>
## 1 7 50 0.240
## 2 15 50 0.309
## 3 16 50 0.218
[Video]
# Scatterplot with regression line
ggplot(data = bdims, aes(x = hgt, y = wgt)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# Estimate optimal value of my_slope
add_line(my_slope = 1)
[Video]
Consider a linear regression model of the form:
Y=β0+β1⋅X+ϵ, where ϵ∼N(0,σϵ).
The slope coefficient is:
The fitted model for the poverty rate of U.S. counties as a function of high school graduation rate is:
povertyˆ=64.594−0.591⋅hs_grad
In Hampshire County in western Massachusetts, the high school graduation rate is 92.4%. These two facts imply that the poverty rate in Hampshire County is ___.
# Print bdims_summary
bdims_summary
## N r mean_hgt sd_hgt mean_wgt sd_wgt
## 1 507 0.7173011 171.1438 9.407205 69.14753 13.34576
# Add slope and intercept
bdims_summary %>%
mutate(slope = r * sd_wgt / sd_hgt,
intercept = mean_wgt - slope * mean_hgt)
## N r mean_hgt sd_hgt mean_wgt sd_wgt slope intercept
## 1 507 0.7173011 171.1438 9.407205 69.14753 13.34576 1.017617 -105.0113
[Video]
# Height of children vs. height of father
ggplot(data = Galton_men, aes(x = father, y = height)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# Height of children vs. height of mother
ggplot(data = Galton_women, aes(x = mother, y = height)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
In an opinion piece about nepotism published in The New York Times in 2015, economist Seth Stephens-Davidowitz wrote that:
"Regression to the mean is so powerful that once-in-a-generation talent basically never sires once-in-a-generation talent. It explains why Michael Jordan’s sons were middling college basketball players and Jakob Dylan wrote two good songs. It is why there are no American parent-child pairs among Hall of Fame players in any major professional sports league."
The author is arguing that…
[Video]
Recall that the fitted model for the poverty rate of U.S. counties as a function of high school graduation rate is:
povertyˆ=64.594−0.591⋅hs_grad
Which of the following is the correct interpretation of the slope coefficient?
A politician interpreting the relationship between poverty rates and high school graduation rates implores his constituents:
If we can lower the poverty rate by 59%, we’ll double the high school graduate rate in our county (i.e. raise it by 100%).
Which of the following mistakes in interpretation has the politician made?
# Linear model for weight as a function of height
lm(wgt ~ hgt, data = bdims)
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
##
## Coefficients:
## (Intercept) hgt
## -105.011 1.018
# Linear model for SLG as a function of OBP
lm(SLG ~ OBP, data = mlbBat10)
##
## Call:
## lm(formula = SLG ~ OBP, data = mlbBat10)
##
## Coefficients:
## (Intercept) OBP
## 0.009407 1.110323
# Log-linear model for body weight as a function of brain weight
lm(log(BodyWt) ~ log(BrainWt), data = mammals)
##
## Call:
## lm(formula = log(BodyWt) ~ log(BrainWt), data = mammals)
##
## Coefficients:
## (Intercept) log(BrainWt)
## -2.509 1.225
In the previous examples, we fit two regression models:
wgtˆ=−105.011+1.018⋅hgt
and
SLGˆ=0.009+1.110⋅OBP.
Which of the following statements is incorrect?
[Video]
# Show the coefficients
coef(mod)
## (Intercept) hgt
## -105.011254 1.017617
# Show the full output
summary(mod)
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.743 -6.402 -1.231 5.059 41.103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -105.01125 7.53941 -13.93 <2e-16 ***
## hgt 1.01762 0.04399 23.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared: 0.5145, Adjusted R-squared: 0.5136
## F-statistic: 535.2 on 1 and 505 DF, p-value: < 2.2e-16
# Mean of weights equal to mean of fitted values?
mean(bdims$wgt) == mean(fitted.values(mod))
## [1] TRUE
# Mean of the residuals
mean(residuals(mod))
## [1] -1.266971e-15
# Load broom
library(broom)
# Create bdims_tidy
bdims_tidy <- augment(mod)
# Glimpse the resulting data frame
glimpse(bdims_tidy)
## Rows: 507
## Columns: 9
## $ wgt <dbl> 65.6, 71.8, 80.7, 72.6, 78.8, 74.8, 86.4, 78.4, 62.0, 81.6…
## $ hgt <dbl> 174.0, 175.3, 193.5, 186.5, 187.2, 181.5, 184.0, 184.5, 17…
## $ .fitted <dbl> 72.05406, 73.37697, 91.89759, 84.77427, 85.48661, 79.68619…
## $ .se.fit <dbl> 0.4320546, 0.4520060, 1.0667332, 0.7919264, 0.8183471, 0.6…
## $ .resid <dbl> -6.4540648, -1.5769666, -11.1975919, -12.1742745, -6.68660…
## $ .hat <dbl> 0.002154570, 0.002358152, 0.013133942, 0.007238576, 0.0077…
## $ .sigma <dbl> 9.312824, 9.317005, 9.303732, 9.301360, 9.312471, 9.314716…
## $ .cooksd <dbl> 5.201807e-04, 3.400330e-05, 9.758463e-03, 6.282074e-03, 2.…
## $ .std.resid <dbl> -0.69413418, -0.16961994, -1.21098084, -1.31269063, -0.721…
[Video]
# Print ben
ben
## wgt hgt
## 1 74.8 182.8
# Predict the weight of ben
predict(mod, newdata = ben)
## 1
## 81.00909
# Add the line to the scatterplot
ggplot(data = bdims, aes(x = hgt, y = wgt)) +
geom_point() +
geom_abline(data = coefs,
aes(intercept = Intercept, slope = hgt),
color = "dodgerblue")
[Video]
The residual standard error reported for the regression model for poverty rate of U.S. counties in terms of high school graduation rate is 4.67. What does this mean?
# View summary of model
summary(mod)
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.743 -6.402 -1.231 5.059 41.103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -105.01125 7.53941 -13.93 <2e-16 ***
## hgt 1.01762 0.04399 23.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared: 0.5145, Adjusted R-squared: 0.5136
## F-statistic: 535.2 on 1 and 505 DF, p-value: < 2.2e-16
# Compute the mean of the residuals
mean(residuals(mod))
## [1] -1.266971e-15
# Compute RMSE
sqrt(sum(residuals(mod)^2) / df.residual(mod))
## [1] 9.30804
[Video]
# View model summary
summary(mod)
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.743 -6.402 -1.231 5.059 41.103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -105.01125 7.53941 -13.93 <2e-16 ***
## hgt 1.01762 0.04399 23.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared: 0.5145, Adjusted R-squared: 0.5136
## F-statistic: 535.2 on 1 and 505 DF, p-value: < 2.2e-16
# Compute R-squared
bdims_tidy %>%
summarize(var_y = var(wgt), var_e = var(.resid)) %>%
mutate(R_squared = 1 - var_e / var_y)
## # A tibble: 1 x 3
## var_y var_e R_squared
## <dbl> <dbl> <dbl>
## 1 178. 86.5 0.515
The R2 reported for the regression model for poverty rate of U.S. counties in terms of high school graduation rate is 0.464.
lm(formula = poverty ~ hs_grad, data = countyComplete) %>% summary()
How should this result be interpreted?
# Compute SSE for null model
mod_null %>%
summarize(SSE = sum(.resid^2))
## SSE
## 1 90123.34
# Compute SSE for regression model
mod_hgt %>%
summarize(SSE = sum(.resid^2))
## SSE
## 1 43753.01
[Video]
# Rank points of high leverage
mod %>%
augment() %>%
arrange(desc(.hat)) %>%
head()
## # A tibble: 6 x 9
## wgt hgt .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 85.5 198. 96.6 1.26 -11.1 0.0182 9.30 0.0134 -1.20
## 2 90.9 197. 95.6 1.21 -4.66 0.0170 9.31 0.00221 -0.505
## 3 49.8 147. 44.8 1.13 5.02 0.0148 9.31 0.00221 0.543
## 4 80.7 194. 91.9 1.07 -11.2 0.0131 9.30 0.00976 -1.21
## 5 95.9 193 91.4 1.05 4.51 0.0126 9.32 0.00152 0.488
## 6 44.8 150. 47.1 1.04 -2.32 0.0124 9.32 0.000397 -0.251
# Rank influential points
mod %>%
augment() %>%
arrange(desc(.cooksd)) %>%
head()
## # A tibble: 6 x 9
## wgt hgt .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 73.2 151. 48.8 0.974 24.4 0.0109 9.25 0.0386 2.64
## 2 116. 178. 75.9 0.507 40.5 0.00296 9.14 0.0282 4.36
## 3 104. 165. 63.0 0.491 41.1 0.00279 9.14 0.0273 4.42
## 4 109. 190. 88.8 0.946 19.8 0.0103 9.28 0.0238 2.13
## 5 67.3 152. 50.1 0.922 17.2 0.00982 9.29 0.0171 1.86
## 6 76.8 158. 55.3 0.729 21.5 0.00613 9.27 0.0166 2.32
[Video]
# Create nontrivial_players
nontrivial_players <- mlbBat10 %>%
filter(AB >= 10, OBP < 0.5)
# Fit model to new data
mod_cleaner <- lm(SLG ~ OBP, data = nontrivial_players)
# View model summary
summary(mod_cleaner)
##
## Call:
## lm(formula = SLG ~ OBP, data = nontrivial_players)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31383 -0.04165 -0.00261 0.03992 0.35819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.043326 0.009823 -4.411 1.18e-05 ***
## OBP 1.345816 0.033012 40.768 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07011 on 734 degrees of freedom
## Multiple R-squared: 0.6937, Adjusted R-squared: 0.6932
## F-statistic: 1662 on 1 and 734 DF, p-value: < 2.2e-16
# Visualize new model
ggplot(data = nontrivial_players, aes(x = OBP, y = SLG)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
# Rank high leverage points
mod %>%
augment() %>%
arrange(desc(.hat), .cooksd) %>%
head()
## # A tibble: 6 x 9
## wgt hgt .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 85.5 198. 96.6 1.26 -11.1 0.0182 9.30 0.0134 -1.20
## 2 90.9 197. 95.6 1.21 -4.66 0.0170 9.31 0.00221 -0.505
## 3 49.8 147. 44.8 1.13 5.02 0.0148 9.31 0.00221 0.543
## 4 80.7 194. 91.9 1.07 -11.2 0.0131 9.30 0.00976 -1.21
## 5 95.9 193 91.4 1.05 4.51 0.0126 9.32 0.00152 0.488
## 6 44.8 150. 47.1 1.04 -2.32 0.0124 9.32 0.000397 -0.251
[Video]
Michael is a hybrid thinker and doer—a byproduct of being a StrengthsFinder “Learner” over time. With 20+ years of engineering, design, and product experience, he helps organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with business goals. He has been entrusted with problem-solving for brands—ranging from Fortune 500 companies to early-stage startups to not-for-profit organizations.
Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He is also a candidate to receive his MS in Applied Analytics from Columbia University.
LinkedIn | Twitter | www.michaelmallari.com/data | www.columbia.edu/~mm5470