Scatterplots are the most common and effective tools for visualizing the relationship between two numeric variables.
# Import data
ncbirths <- read.csv("ncbirths.csv", stringsAsFactors = FALSE)
# Load ggplot2
library(ggplot2)
# Scatterplot of weight vs. weeks
ggplot(data = ncbirths, aes(x = weeks, y = weight)) + geom_point()
You can think of boxplots as scatterplots for which the variable on the x-axis has been discretized. The cut() function takes two arguments: the continuous variable you want to discretize and the number of breaks that you want to make in that continuous variable in order to discretize it.
# Boxplot of weight vs. weeks
ggplot(data = ncbirths,
aes(x = cut(weeks, breaks = 5), y = weight)) +
geom_boxplot()
Scatterplots can reveal the nature of the relationship between two variables.
# Load package
library(openintro)
# Mammals scatterplot
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) + geom_point()
# Baseball player scatterplot
ggplot(data = mlbBat10, aes(x = OBP, y = SLG)) + geom_point()
# Body dimensions scatterplot
ggplot(data = bdims, aes(x = hgt, y = wgt, color = factor(sex))) + geom_point()
# Smoking scatterplot
ggplot(data = smoking, aes(x = age, y = amtWeekdays)) + geom_point()
This scatterplot shows the relationship between the poverty rates and high school graduation rates of counties in the United States.
Describe the form, direction, and strength of this relationship.
In cases where the relationship between two variables is not linear, a careful transformation of one or both of the variables can reveal a clear relationship. ggplot2 provides several mechanisms for viewing transformed relationships. The coord_trans() function transforms the coordinates of the plot after any statistics have been computed, while the scale_x_log10() and scale_y_log10() functions apply a base-10 log transformation to each axis before statistics are computed.
# Scatterplot with coord_trans()
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
geom_point() +
coord_trans(x = "log10", y = "log10")
# Scatterplot with scale_x_log10() and scale_y_log10()
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
geom_point() +
scale_x_log10() + scale_y_log10()
Outliers can affect the results of a linear regression model, so it is important to identify them and decide how to deal with them.
# Load dplyr
library(dplyr)
# Scatterplot of SLG vs. OBP
mlbBat10 %>%
filter(AB >= 200) %>%
ggplot(aes(x = OBP, y = SLG)) +
geom_point()
# Identify the outlying player
mlbBat10 %>%
filter(AB >= 200, OBP < 0.200)
## name team position G AB R H 2B 3B HR RBI TB BB SO SB CS OBP
## 1 B Wood LAA 3B 81 226 20 33 2 0 4 14 47 6 71 1 0 0.174
## SLG AVG
## 1 0.208 0.146
In a scientific paper, three correlations are reported with the following values:
-0.395, 1.827, 0.738
Choose the correct interpretation of these findings.
Correlation cannot be greater than 1.
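A quick sanity check in R (a minimal illustration with made-up data) shows why: the correlation coefficient attains its bounds of -1 and 1 only for perfectly linear data.
# Correlation is bounded between -1 and 1
x <- 1:10
cor(x, 2 * x + 3) # exactly 1: perfect positive linear relationship
cor(x, -x) # exactly -1: perfect negative linear relationship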
In a scientific paper, three correlations are reported with the following values:
0.582, 0.134, -0.795
Which of these values represents the strongest correlation?
-0.795
The cor(x, y) function computes the Pearson product-moment correlation between variables x and y. Since this quantity is symmetric in x and y, the order of the variables doesn't matter. Setting the use argument to "pairwise.complete.obs" allows cor() to compute the correlation coefficient using only those observations where both x and y are non-missing.
# Compute correlation
ncbirths %>%
summarize(N = n(), r = cor(weight, mage))
## N r
## 1 1000 0.05506589
# Compute correlation for all non-missing pairs
ncbirths %>%
summarize(N = n(), r = cor(weight, weeks, use = "pairwise.complete.obs"))
## N r
## 1 1000 0.6701013
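For contrast (a quick illustration; weeks has missing values in this dataset, which is why pairwise.complete.obs was needed), the default behavior propagates NAs:
# With the default use = "everything", any missing value makes cor() return NA
cor(ncbirths$weight, ncbirths$weeks)
## [1] NA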
In 1973, Francis Anscombe famously created four datasets with remarkably similar numerical properties but obviously different graphical relationships. The Anscombe dataset contains the x and y coordinates for these four datasets, along with a grouping variable, set, that distinguishes the quartet.
# Load data
Anscombe <- read.csv("Anscombe.csv", stringsAsFactors = FALSE)
# Compute properties of Anscombe
Anscombe %>%
group_by(set) %>%
summarize(N = n(), mean(x), sd(x), mean(y), sd(y), cor(x, y))
## # A tibble: 4 x 7
## set N `mean(x)` `sd(x)` `mean(y)` `sd(y)` `cor(x, y)`
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 11 9 3.316625 7.500909 2.031568 0.8164205
## 2 2 11 9 3.316625 7.500909 2.031657 0.8162365
## 3 3 11 9 3.316625 7.500000 2.030424 0.8162867
## 4 4 11 9 3.316625 7.500909 2.030579 0.8165214
Recall the scatterplot between the poverty rate of counties in the United States and the high school graduation rate in those counties from the previous chapter. Which of the following values is the correct correlation between poverty rate and high school graduation rate?
-0.681
Estimating the value of the correlation coefficient between two quantities from their scatterplot can be difficult because your perception of the strength of these relationships can be influenced by design choices like the x and y scales.
# Correlation for all baseball players
mlbBat10 %>%
summarize(N = n(), r = cor(OBP, SLG))
## N r
## 1 1199 0.8145628
# Correlation for all players with at least 200 ABs
mlbBat10 %>%
filter(AB >= 200) %>%
summarize(N = n(), r = cor(OBP, SLG))
## N r
## 1 329 0.6855364
# Correlation of body dimensions
bdims %>%
group_by(sex) %>%
summarize(N = n(), r = cor(hgt, wgt))
## # A tibble: 2 x 3
## sex N r
## <int> <int> <dbl>
## 1 0 260 0.4310593
## 2 1 247 0.5347418
# Correlation among mammals, with and without log
mammals %>%
summarize(N = n(),
r = cor(BodyWt, BrainWt),
r_log = cor(log(BodyWt), log(BrainWt)))
## N r r_log
## 1 62 0.9341638 0.9595748
Recall that you previously determined the value of the correlation coefficient between the poverty rate of counties in the United States and the high school graduation rate in those counties was -0.681. Choose the correct interpretation of this value.
In the San Francisco Bay Area from 1960-1967, the correlation between the birthweight of 1,236 babies and the length of their gestational period was 0.408. Which of the following conclusions is not a valid statistical interpretation of these results?
Staying in the womb longer causes babies to be heavier when they are born.
Statisticians must always be skeptical of potentially spurious correlations.
# Load data
noise <- read.csv("noise.csv", stringsAsFactors = FALSE)
# Create faceted scatterplot
ggplot(data = noise, aes(x = x, y = y)) +
geom_point() +
facet_wrap(~ z)
# Compute correlations for each dataset
noise_summary <- noise %>%
group_by(z) %>%
summarize(N = n(), spurious_cor = cor(x, y))
# Isolate sets with correlations above 0.2 in absolute strength
noise_summary %>%
filter(abs(spurious_cor) > 0.2)
## # A tibble: 4 x 3
## z N spurious_cor
## <int> <int> <dbl>
## 1 3 50 -0.2063690
## 2 5 50 0.3060001
## 3 14 50 -0.2883678
## 4 20 50 -0.2323851
The simple linear regression model for a numeric response as a function of a numeric explanatory variable can be visualized on the corresponding scatterplot by a straight line. This is a "best fit" line that cuts through the data in a way that minimizes the sum of the squared vertical distances between the line and the data points.
# Scatterplot with regression line
ggplot(data = bdims, aes(x = hgt, y = wgt)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
The least squares criterion implies that the slope of the regression line has a unique optimal value; in practice, R computes it for you.
# Estimate the optimal value of my_slope
add_line(my_slope = 1)
Note: add_line() is a custom function defined in the course environment, not part of ggplot2 or base R.
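Since add_line() isn't available outside the course environment, here is a minimal sketch of the same idea using bdims: evaluate the sum of squared residuals over a grid of candidate slopes (forcing each line through the point of means, which the least squares line always passes through) and confirm that the minimizer is close to the slope lm() reports (about 1.018).
# Grid search for the least squares slope
slopes <- seq(0.5, 1.5, by = 0.01)
sse <- sapply(slopes, function(b1) {
  b0 <- mean(bdims$wgt) - b1 * mean(bdims$hgt) # intercept through the means
  sum((bdims$wgt - (b0 + b1 * bdims$hgt))^2) # sum of squared vertical distances
})
slopes[which.min(sse)] # approximately 1.02, near lm()'s 1.018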
Consider a linear regression model of the form:
$Y = \beta_0 + \beta_1 \cdot X + \epsilon$, where $\epsilon \sim N(0, \sigma_\epsilon)$. The slope coefficient is:
$\beta_1$
The fitted model for the poverty rate of U.S. counties as a function of high school graduation rate is:
$\widehat{poverty} = 64.594 - 0.591 \cdot hs\_grad$
In Hampshire County in western Massachusetts, the high school graduation rate is 92.4%. These two facts imply that the poverty rate in Hampshire County is ___.
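Working through the arithmetic: $64.594 - 0.591 \times 92.4 \approx 9.99$, so the model predicts a poverty rate of about 10% for Hampshire County.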
Two facts enable you to compute the slope b1 and intercept b0 of a simple linear regression model from some basic summary statistics.
First, the slope can be defined as:
$b_1 = r_{X,Y} \cdot \frac{s_Y}{s_X}$
where $r_{X,Y}$ represents the correlation (cor()) of $X$ and $Y$, and $s_X$ and $s_Y$ represent the standard deviations (sd()) of $X$ and $Y$, respectively.
Second, the point $(\bar{x}, \bar{y})$ is always on the least squares regression line, where $\bar{x}$ and $\bar{y}$ denote the means of $x$ and $y$, respectively.
N <- 507
r <- 0.7173011
mean_hgt <- 171.1438
sd_hgt <- 9.407205
mean_wgt <- 69.14753
sd_wgt <- 13.34576
bdims_summary <- data.frame(N, r, mean_hgt, sd_hgt, mean_wgt, sd_wgt)
# Print bdims_summary
bdims_summary
## N r mean_hgt sd_hgt mean_wgt sd_wgt
## 1 507 0.7173011 171.1438 9.407205 69.14753 13.34576
# Add slope and intercept
bdims_summary %>%
mutate(slope = r * sd_wgt / sd_hgt,
intercept = mean_wgt - slope * mean_hgt)
## N r mean_hgt sd_hgt mean_wgt sd_wgt slope intercept
## 1 507 0.7173011 171.1438 9.407205 69.14753 13.34576 1.017617 -105.0112
Regression to the mean is a concept attributed to Sir Francis Galton. The basic idea is that extreme random observations will tend to be less extreme upon a second trial.
# Load data
Galton_men <- read.csv("Galton_men.csv")
Galton_women <- read.csv("Galton_women.csv")
# Height of children vs. height of father
ggplot(data = Galton_men, aes(x = father, y = height)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
geom_smooth(method = "lm", se = FALSE)
# Height of children vs. height of mother
ggplot(data = Galton_women, aes(x = mother, y = height)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
geom_smooth(method = "lm", se = FALSE)
In an opinion piece about nepotism published in The New York Times in 2015, economist Seth Stephens-Davidowitz wrote:
"Regression to the mean is so powerful that once-in-a-generation talent basically never sires once-in-a-generation talent. It explains why Michael Jordan's sons were middling college basketball players and Jakob Dylan wrote two good songs. It is why there are no American parent-child pairs among Hall of Fame players in any major professional sports league."
The author is arguing that…
Recall that the fitted model for the poverty rate of U.S. counties as a function of high school graduation rate is:
$\widehat{poverty} = 64.594 - 0.591 \cdot hs\_grad$
Which of the following is the correct interpretation of the slope coefficient?
Among U.S. counties, each additional percentage point increase in the high school graduation rate is associated with about a 0.591 percentage point decrease in the poverty rate.
A politician interpreting the relationship between poverty rates and high school graduation rates implores his constituents:
If we can lower the poverty rate by 59%, we'll double the high school graduate rate in our county (i.e., raise it by 100%).
Which of the following mistakes in interpretation has the politician made?
All of the above.
The function that creates linear models is lm(). It generally takes two arguments: a formula that specifies the model, and a data argument giving the data frame that contains the data you want to use to fit the model.
# Linear model for weight as a function of height
lm(wgt ~ hgt, data = bdims)
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
##
## Coefficients:
## (Intercept) hgt
## -105.011 1.018
# Linear model for SLG as a function of OBP
lm(SLG ~ OBP, data = mlbBat10)
##
## Call:
## lm(formula = SLG ~ OBP, data = mlbBat10)
##
## Coefficients:
## (Intercept) OBP
## 0.009407 1.110323
# Log-linear model for body weight as a function of brain weight
lm(log(BodyWt) ~ log(BrainWt), data = mammals)
##
## Call:
## lm(formula = log(BodyWt) ~ log(BrainWt), data = mammals)
##
## Coefficients:
## (Intercept) log(BrainWt)
## -2.509 1.225
In the previous examples, we fit two regression models: $\widehat{wgt} = -105.011 + 1.018 \cdot hgt$ and $\widehat{SLG} = 0.009 + 1.110 \cdot OBP$.
Which of the following statements is incorrect?
The coef() function displays only the values of the coefficients. In contrast, the summary() function displays not only that information but much more, including the standard error and p-value associated with each coefficient, the $R^2$, the adjusted $R^2$, and the residual standard error.
mod <- lm(wgt ~ hgt, data = bdims)
# Show the coefficients
coef(mod)
## (Intercept) hgt
## -105.011254 1.017617
# Show the full output
summary(mod)
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.743 -6.402 -1.231 5.059 41.103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -105.01125 7.53941 -13.93 <2e-16 ***
## hgt 1.01762 0.04399 23.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared: 0.5145, Adjusted R-squared: 0.5136
## F-statistic: 535.2 on 1 and 505 DF, p-value: < 2.2e-16
Once you have fit a regression model, you are often interested in the fitted values ($\hat{y}_i$) and the residuals ($e_i$), where $i$ indexes the observations.
# Mean of weights equal to mean of fitted values?
mean(bdims$wgt) == mean(fitted.values(mod))
## [1] TRUE
# Mean of the residuals
mean(residuals(mod))
## [1] -1.266971e-15
The augment() function from the broom package takes a model object as an argument and returns a data frame that contains the data on which the model was fit, along with several quantities specific to the regression model, including the fitted values, residuals, leverage scores, and standardized residuals.
# Load broom
library(broom)
# Create bdims_tidy
bdims_tidy <- augment(mod)
# Glimpse the resulting data frame
glimpse(bdims_tidy)
## Observations: 507
## Variables: 9
## $ wgt <dbl> 65.6, 71.8, 80.7, 72.6, 78.8, 74.8, 86.4, 78.4, 62....
## $ hgt <dbl> 174.0, 175.3, 193.5, 186.5, 187.2, 181.5, 184.0, 18...
## $ .fitted <dbl> 72.05406, 73.37697, 91.89759, 84.77427, 85.48661, 7...
## $ .se.fit <dbl> 0.4320546, 0.4520060, 1.0667332, 0.7919264, 0.81834...
## $ .resid <dbl> -6.4540648, -1.5769666, -11.1975919, -12.1742745, -...
## $ .hat <dbl> 0.002154570, 0.002358152, 0.013133942, 0.007238576,...
## $ .sigma <dbl> 9.312824, 9.317005, 9.303732, 9.301360, 9.312471, 9...
## $ .cooksd <dbl> 5.201807e-04, 3.400330e-05, 9.758463e-03, 6.282074e...
## $ .std.resid <dbl> -0.69413418, -0.16961994, -1.21098084, -1.31269063,...
You can compute expected values for observations that were not present in the data on which the model was fit. These types of predictions are called out-of-sample. You can use the predict() function to generate expected values for the weight of new individuals. We must pass the data frame of new observations through the newdata argument.
wgt <- 74.8
hgt <- 182.8
ben <- data.frame(wgt, hgt)
# Print ben
ben
## wgt hgt
## 1 74.8 182.8
# Predict the weight of ben
predict(mod, newdata = ben)
## 1
## 81.00909
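As a sanity check, the same value falls out of the fitted coefficients directly:
# Hand-check the prediction: intercept + slope * new height
coef(mod)["(Intercept)"] + coef(mod)["hgt"] * 182.8
## roughly 81.009, matching predict()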
You can add regression lines to a scatterplot manually by using the geom_abline() function, which takes slope and intercept arguments.
# Define coefs with rounded values; equivalently: coefs <- data.frame(t(coef(mod)))
coefs <- data.frame(-105, 1.02)
names(coefs) <- c("(Intercept)", "hgt")
str(coefs)
## 'data.frame': 1 obs. of 2 variables:
## $ (Intercept): num -105
## $ hgt : num 1.02
# Add the line to the scatterplot
ggplot(data = bdims, aes(x = hgt, y = wgt)) +
geom_point() +
geom_abline(data = coefs,
aes(intercept = `(Intercept)`, slope = hgt),
color = "dodgerblue")
The residual standard error reported for the regression model for poverty rate of U.S. counties in terms of high school graduation rate is 4.67. What does this mean?
You can recover the residuals from mod with residuals(), and the degrees of freedom with df.residual().
# View summary of model
summary(mod)
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.743 -6.402 -1.231 5.059 41.103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -105.01125 7.53941 -13.93 <2e-16 ***
## hgt 1.01762 0.04399 23.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared: 0.5145, Adjusted R-squared: 0.5136
## F-statistic: 535.2 on 1 and 505 DF, p-value: < 2.2e-16
# Compute the mean of the residuals
mean(residuals(mod))
## [1] -1.266971e-15
# Compute RMSE
sqrt(sum(residuals(mod)^2) / df.residual(mod))
## [1] 9.30804
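Incidentally, base R's sigma() extracts the same residual standard error directly from the model object:
# Equivalent one-liner
sigma(mod)
## [1] 9.30804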
Recall that the coefficient of determination ($R^2$) can be computed as $R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{Var(e)}{Var(y)}$, where $e$ is the vector of residuals and $y$ is the response variable. This gives us the interpretation of $R^2$ as the percentage of the variability in the response that is explained by the model, since the residuals are the part of that variability that remains unexplained by the model.
# View model summary
summary(mod)
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.743 -6.402 -1.231 5.059 41.103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -105.01125 7.53941 -13.93 <2e-16 ***
## hgt 1.01762 0.04399 23.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared: 0.5145, Adjusted R-squared: 0.5136
## F-statistic: 535.2 on 1 and 505 DF, p-value: < 2.2e-16
# Compute R-squared
bdims_tidy %>%
summarize(var_y = var(wgt), var_e = var(.resid)) %>%
mutate(R_squared = 1 - var_e / var_y)
## var_y var_e R_squared
## 1 178.1094 86.46839 0.5145208
The $R^2$ reported for the regression model for poverty rate of U.S. counties in terms of high school graduation rate is 0.464.
lm(formula = poverty ~ hs_grad, data = countyComplete) %>% summary()
How should this result be interpreted?
The $R^2$ gives us a numerical measurement of the strength of fit relative to a null model based on the average of the response variable: $\hat{y}_{null} = \bar{y}$
This model has an $R^2$ of zero because $SSE = SST$.
# Define mod_null and mod_hgt
mod <- lm(wgt ~ 1, data = bdims) # The null model only includes the constant 1.
mod_null <-
mod %>%
augment(bdims)
mod <- lm(wgt ~ hgt, data = bdims)
mod_hgt <-
mod %>%
augment(bdims)
# Compute SSE for null model (var() gives SSE scaled by 1/(n-1); the scaling cancels in the R-squared ratio)
mod_null %>%
summarize(SSE = var(.resid))
## SSE
## 1 178.1094
# Compute SSE for regression model (same scaling)
mod_hgt %>%
summarize(SSE = var(.resid))
## SSE
## 1 86.46839
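Taking one minus the ratio of these two quantities reproduces the $R^2$ from the model summary:
# R-squared from the two variances above
1 - 86.46839 / 178.1094
## about 0.5145, matching Multiple R-squared: 0.5145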
The leverage of an observation in a regression model is defined entirely in terms of the distance of that observation from the mean of the explanatory variable. That is, observations close to the mean of the explanatory variable have low leverage, while observations far from the mean of the explanatory variable have high leverage.
# Rank points of high leverage
mod <- lm(HR ~ SB, data = mlbBat10)
mod %>%
augment() %>%
arrange(desc(.hat)) %>%
head()
## HR SB .fitted .se.fit .resid .hat .sigma .cooksd
## 1 1 68 30.23089 2.025288 -29.230889 0.08619986 6.844170 0.926810510
## 2 2 52 23.78922 1.536351 -21.789218 0.04960366 6.870720 0.273961953
## 3 5 50 22.98401 1.475381 -17.984010 0.04574474 6.880486 0.170721375
## 4 19 47 21.77620 1.384016 -2.776196 0.04025455 6.900562 0.003539207
## 5 5 47 21.77620 1.384016 -16.776196 0.04025455 6.883261 0.129238703
## 6 6 42 19.76317 1.232039 -13.763174 0.03189934 6.889185 0.067745543
## .std.resid
## 1 -4.4328476
## 2 -3.2400795
## 3 -2.6688283
## 4 -0.4108077
## 5 -2.4824577
## 6 -2.0277982
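The .hat values above can be reproduced from the definition: for simple regression, the leverage of observation $i$ is $h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}$. A minimal sketch (assuming SB has no missing values, so the rows line up with augment()'s output):
# Leverage by hand for the regression of HR on SB
x <- mlbBat10$SB
h <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
head(sort(h, decreasing = TRUE)) # top values match .hat above (0.0862, 0.0496, ...)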
Influential points are likely to have high leverage and deviate from the general relationship between the two variables. We measure influence using Cook’s distance, which incorporates both the leverage and residual of each observation.
# Rank influential points
mod %>%
augment() %>%
arrange(desc(.cooksd)) %>%
head()
## HR SB .fitted .se.fit .resid .hat .sigma .cooksd
## 1 1 68 30.23089 2.025288 -29.23089 0.08619986 6.844170 0.92681051
## 2 2 52 23.78922 1.536351 -21.78922 0.04960366 6.870720 0.27396195
## 3 5 50 22.98401 1.475381 -17.98401 0.04574474 6.880486 0.17072138
## 4 5 47 21.77620 1.384016 -16.77620 0.04025455 6.883261 0.12923870
## 5 1 42 19.76317 1.232039 -18.76317 0.03189934 6.878983 0.12590881
## 6 6 42 19.76317 1.232039 -13.76317 0.03189934 6.889185 0.06774554
## .std.resid
## 1 -4.432848
## 2 -3.240079
## 3 -2.668828
## 4 -2.482458
## 5 -2.764474
## 6 -2.027798
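Cook's distance can be computed from quantities already in the table: $D_i = \frac{r_i^2}{p} \cdot \frac{h_i}{1 - h_i}$, where $r_i$ is the standardized residual and $p$ is the number of coefficients (here 2). Checking the top row above:
# Verify .cooksd for the most influential point
(-4.432848)^2 / 2 * (0.08619986 / (1 - 0.08619986))
## about 0.9268, matching .cooksd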
Although a better model fit can sometimes be achieved by simply removing outliers and re-fitting the model, doing so can compromise the integrity of the data, so you need a good justification for removing them.
# Create nontrivial_players
nontrivial_players <- mlbBat10 %>%
filter(AB >= 10, OBP < 0.5)
# Fit model to new data
mod_cleaner <- lm(SLG ~ OBP, data = nontrivial_players)
# View model summary
summary(mod_cleaner)
##
## Call:
## lm(formula = SLG ~ OBP, data = nontrivial_players)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31383 -0.04165 -0.00261 0.03992 0.35819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.043326 0.009823 -4.411 1.18e-05 ***
## OBP 1.345816 0.033012 40.768 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07011 on 734 degrees of freedom
## Multiple R-squared: 0.6937, Adjusted R-squared: 0.6932
## F-statistic: 1662 on 1 and 734 DF, p-value: < 2.2e-16
# Visualize new model
ggplot(data = nontrivial_players, aes(x = OBP, y = SLG)) +
geom_point() +
geom_smooth(method = "lm")
High leverage points that lie close to the regression line are not influential.
# Rank high leverage points
mod %>% augment() %>% arrange(desc(.hat), .cooksd) %>% head()
## HR SB .fitted .se.fit .resid .hat .sigma .cooksd
## 1 1 68 30.23089 2.025288 -29.230889 0.08619986 6.844170 0.926810510
## 2 2 52 23.78922 1.536351 -21.789218 0.04960366 6.870720 0.273961953
## 3 5 50 22.98401 1.475381 -17.984010 0.04574474 6.880486 0.170721375
## 4 19 47 21.77620 1.384016 -2.776196 0.04025455 6.900562 0.003539207
## 5 5 47 21.77620 1.384016 -16.776196 0.04025455 6.883261 0.129238703
## 6 6 42 19.76317 1.232039 -13.763174 0.03189934 6.889185 0.067745543
## .std.resid
## 1 -4.4328476
## 2 -3.2400795
## 3 -2.6688283
## 4 -0.4108077
## 5 -2.4824577
## 6 -2.0277982
How can we revitalize a region's economy? (Continued from Quiz 4) Use the code below to create a new variable, rural, which takes the value "rural" if the county has fewer than 1,000 people per square mile and the value "urban" otherwise.
# Load data
data(countyComplete) # It comes from the openintro package
# Create a new variable, rural
countyComplete$rural <- ifelse(countyComplete$density < 1000, "rural", "urban")
countyComplete$rural <- factor(countyComplete$rural)
# Scatterplot of per_capita_income vs. bachelors
ggplot(data = countyComplete, aes(x = per_capita_income, y = bachelors)) + geom_point()
# Boxplot of per_capita_income vs. bachelors
ggplot(data = countyComplete,
aes(x = cut(per_capita_income, breaks = 5), y = bachelors)) +
geom_boxplot()
There is a positive linear relationship between the share of residents holding a bachelor's degree and per capita income: counties with more college graduates generally have higher incomes. To the extent that education drives income, the community should support higher education if it wants to raise its standard of living.
ggplot(data = countyComplete, aes(x = per_capita_income, y = bachelors, color = factor(rural))) + geom_point()
countyComplete %>%
summarize(N = n(), r = cor(per_capita_income, bachelors))
## N r
## 1 3143 0.7924464
countyComplete %>%
summarize(N = n(), r = cor(per_capita_income, white))
## N r
## 1 3143 0.1542088
The correlation between per capita income and college graduation is strong at 0.792, while the correlation between per capita income and the share of white residents is much weaker at 0.154. Correlation alone cannot establish that a bachelor's degree causes higher income, but education is clearly far more strongly associated with per capita income than race is.
# Scatterplot with regression line
ggplot(data = countyComplete, aes(x = per_capita_income, y = bachelors)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
mod <- lm(per_capita_income ~ bachelors, data = countyComplete)
# Show the coefficients
coef(mod)
## (Intercept) bachelors
## 13087.6795 494.7534
# Show the full output
summary(mod)
##
## Call:
## lm(formula = per_capita_income ~ bachelors, data = countyComplete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18032.7 -1708.2 73.8 1748.0 21756.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13087.680 142.091 92.11 <2e-16 ***
## bachelors 494.753 6.795 72.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3299 on 3141 degrees of freedom
## Multiple R-squared: 0.628, Adjusted R-squared: 0.6279
## F-statistic: 5302 on 1 and 3141 DF, p-value: < 2.2e-16
There is a positive association between education and standard of living: counties where more people hold bachelor's degrees generally have a higher standard of living. The slope of the linear model is 494.7534, which means that each additional percentage point in a county's bachelor's-degree rate is associated with an increase of about $495 in per capita income.
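To see the slope in action (a quick illustration with hypothetical graduation rates):
# Predicted per capita income at bachelor's-degree rates of 20% and 21%
predict(mod, newdata = data.frame(bachelors = c(20, 21)))
# The two predictions differ by the slope, about $495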
# Create countyComplete_tidy
countyComplete_tidy <- augment(mod)
# Compute R-squared
countyComplete_tidy %>%
summarize(var_y = var(per_capita_income), var_e = var(.resid)) %>%
mutate(R_squared = 1 - var_e / var_y)
## var_y var_e R_squared
## 1 29253692 10883214 0.6279713
The $R^2$ reported for the regression model for per capita income of U.S. counties in terms of college graduation rate is 0.628. This means that 62.8% of the variability in per capita income among U.S. counties can be explained by the county's bachelor's-degree rate.
ggplot(data = countyComplete, aes(x = per_capita_income, y = white)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
mod <- lm(per_capita_income ~ white, data = countyComplete)
# Show the coefficients
coef(mod)
## (Intercept) white
## 18403.11249 49.48507
# Show the full output
summary(mod)
##
## Call:
## lm(formula = per_capita_income ~ white, data = countyComplete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12588 -3394 -942 2078 41351
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18403.112 478.496 38.460 <2e-16 ***
## white 49.485 5.657 8.747 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5345 on 3141 degrees of freedom
## Multiple R-squared: 0.02378, Adjusted R-squared: 0.02347
## F-statistic: 76.51 on 1 and 3141 DF, p-value: < 2.2e-16
The $R^2$ in this regression model is only 0.0238, which means that the share of white residents explains only about 2.4% of the variability in per capita income across counties.