Assignment 5 Correlation and Regression

Chapter 1 Visualizing two variables

1.1 Scatterplots

Scatterplots are the most common and effective tools for visualizing the relationship between two numeric variables.

# Import data
ncbirths <- read.csv("ncbirths.csv", stringsAsFactors = FALSE) 

# Load ggplot2
library(ggplot2)

# Scatterplot of weight vs. weeks
ggplot (data = ncbirths, aes(y = weight, x = weeks)) + geom_point()

1.2 Boxplots as deiscretized / conditioned scatterplots

You can think of boxplots as scatterplots for which the variable on the x-axis has been discretized. The cut() function takes two arguments: the continuous variable you want to discretize and the number of breaks that you want to make in that continuous variable in order to discretize it.

# Boxplot of weight vs. weeks
ggplot(data = ncbirths, 
       aes(x = cut(weeks, breaks = 5), y = weight)) + 
  geom_boxplot()

1.3 Creating scatterplots

Scatterplots can reveal the nature of the relationship between two variables.

# Load package
library( openintro)

# Mammals scatterplot
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) + geom_point()



# Baseball player scatterplot
ggplot(data = mlbBat10, aes(x = OBP, y = SLG)) + geom_point()



# Body dimensions scatterplot
ggplot(data = bdims, aes(x = hgt, y = wgt, color = factor(sex))) + geom_point()



# Smoking scatterplot
ggplot(data = smoking, aes(x = age, y = amtWeekdays)) + geom_point()

1.4 Characterizing scatterplots

This scatterplot shows the relationship between the poverty rates and high school graduation rates of counties in the United States.

Describe the form, direction, and strength of this relationship.

  1. Linear, negative, moderately strong

1.5 Transformations

In cases where the relationship between two variables is not linear, a careful transformation of one or both of the variables can reveal a clear relationship. ggplot2 provides several different mechanisms for viewing transformed relationships. The coord_trans() function transforms the coordinates of the plot. Alternatively, the scale_x_log10() and scale_y_log10() functions perform a base-10 log transformation of each axis.

# Scatterplot with coord_trans()
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
  geom_point() + 
  coord_trans(x = "log10", y = "log10")


# Scatterplot with scale_x_log10() and scale_y_log10()
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
  geom_point() +
  scale_x_log10() + scale_y_log10()

1.6 Identifying outliers

Outliers can affect the results of a linear regression model and how we can deal with them.


# Load dplyr
library(dplyr)

# Scatterplot of SLG vs. OBP
mlbBat10 %>%
  filter(AB >= 200) %>%
  ggplot(aes(x = OBP, y = SLG)) +
  geom_point()


# Identify the outlying player
mlbBat10 %>%
  filter(AB >= 200, OBP < 0.200)
##     name team position  G  AB  R  H 2B 3B HR RBI TB BB SO SB CS   OBP
## 1 B Wood  LAA       3B 81 226 20 33  2  0  4  14 47  6 71  1  0 0.174
##     SLG   AVG
## 1 0.208 0.146

Chapter 2 Correlation

2.1 Understanding correlation scale

In a scientific paper, three correlations are reported with the following values:

-0.395 1.827 0.738 Choose the correct interpretation of these findings.

    1. is invalid.

Correlation cannot be greater than 1.

2.2 Understanding correlation sign

In a scientific paper, three correlations are reported with the following values:

0.582 0.134 -0.795 Which of these values represents the strongest correlation?

-0.795

2.3 Computing correlation

The cor(x, y) function will compute the Pearson product-moment correlation between variables, x and y. Since this quantity is symmetric with respect to x and y, it doesn’t matter in which order you put the variables. Setting the use argument to “pairwise.complete.obs” allows cor() to compute the correlation coefficient for those observations where the values of x and y are both not missing.

# Compute correlation
ncbirths %>%
  summarize(N = n(), r = cor(weight, mage))
##      N          r
## 1 1000 0.05506589

# Compute correlation for all non-missing pairs
ncbirths %>%
  summarize(N = n(), r = cor(weight, weeks, use = "pairwise.complete.obs"))
##      N         r
## 1 1000 0.6701013

2.4 Exploring Anscombe

In 1973, Francis Anscombe famously created four datasets with remarkably similar numerical properties, but obviously different graphic relationships. The Anscombe dataset contains the x and y coordinates for these four datasets, along with a grouping variable, set, that distinguishes the quartet.

# Load data
Anscombe <- read.csv("Anscombe.csv", stringsAsFactors = FALSE) 

# Compute properties of Anscombe
Anscombe %>%
  group_by(set) %>%
  summarize(N = n(), mean(x), sd(x), mean(y), sd(y), cor(x, y))
## # A tibble: 4 x 7
##     set     N `mean(x)`  `sd(x)` `mean(y)`  `sd(y)` `cor(x, y)`
##   <int> <int>     <dbl>    <dbl>     <dbl>    <dbl>       <dbl>
## 1     1    11         9 3.316625  7.500909 2.031568   0.8164205
## 2     2    11         9 3.316625  7.500909 2.031657   0.8162365
## 3     3    11         9 3.316625  7.500000 2.030424   0.8162867
## 4     4    11         9 3.316625  7.500909 2.030579   0.8165214

2.5 Perception of correlation

Recall the scatterplot between the poverty rate of counties in the United States and the high school graduation rate in those counties from the previous chapter. Which of the following values is the correct correlation between poverty rate and high school graduation rate?

Possible Answers

  1. -0.681

2.6 Perception of correlation (2)

Estimating the value of the correlation coefficient between two quantities from their scatterplot can be difficult because your perception of the strength of these relationships can be influenced by design choices like the x and y scales.

# Correlation for all baseball players
mlbBat10 %>%
  summarize(N = n(), r = cor(OBP, SLG))
##      N         r
## 1 1199 0.8145628

# Correlation for all players with at least 200 ABs
mlbBat10 %>%
  filter(AB >= 200) %>%
  summarize(N = n(), r = cor(OBP, SLG))
##     N         r
## 1 329 0.6855364

# Correlation of body dimensions
bdims %>%
  group_by(sex) %>%
  summarize(N = n(), r = cor(hgt, wgt))
## # A tibble: 2 x 3
##     sex     N         r
##   <int> <int>     <dbl>
## 1     0   260 0.4310593
## 2     1   247 0.5347418

# Correlation among mammals, with and without log
mammals %>%
  summarize(N = n(), 
            r = cor(BodyWt, BrainWt), 
            
            r_log = cor(log(BodyWt), log(BrainWt)))
##    N         r     r_log
## 1 62 0.9341638 0.9595748

2.7 Interpreting correlation in context

Recall that you previously determined the value of the correlation coefficient between the poverty rate of counties in the United States and the high school graduation rate in those counties was -0.681. Choose the correct interpretation of this value.

  1. Counties with lower high school graduation rates are likely to have higher poverty rates.

2.8 Correlation and causation

In the San Francisco Bay Area from 1960-1967, the correlation between the birthweight of 1,236 babies and the length of their gestational period was 0.408. Which of the following conclusions is not a valid statistical interpretation of these results.

Staying in the womb longer causes babies to be heavier when they are born.

2.9 Spurious correlation in random data

Statisticians must always be skeptical of potentially spurious correlations.

# Load data
noise <- read.csv("noise.csv", stringsAsFactors = FALSE) 

# Create faceted scatterplot
ggplot(data = noise, aes(x = x, y = y)) +
  geom_point() + 
  facet_wrap(~ z)


# Compute correlations for each dataset
noise_summary <- noise %>%
  group_by(z) %>%
  summarize(N = n(), spurious_cor = cor(x, y))

# Isolate sets with correlations above 0.2 in absolute strength
noise_summary %>%
  filter(abs(spurious_cor) > 0.2)
## # A tibble: 4 x 3
##       z     N spurious_cor
##   <int> <int>        <dbl>
## 1     3    50   -0.2063690
## 2     5    50    0.3060001
## 3    14    50   -0.2883678
## 4    20    50   -0.2323851

Chapter 3 Simple Linear Regression

3.1 The “best fit” line

The simple linear regression model for a numeric response as a function of a numeric explanatory variable can be visualized on the corresponding scatterplot by a straight line. This is a “best fit” line that cuts through the data in a way that minimizes the distance between the line and the data points.

# Scatterplot with regression line
ggplot(data = bdims, aes(x = hgt, y = wgt)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

3.2 Uniqueness of least squares regression line

The least squares criterion implies that the slope of the regression line is unique. In practice, the slope is computed by R.

Estimate optimal value of my_slope add_line(my_slope = 1)

Note: add_line is a custom function.

3.3 Regression model terminology

Consider a linear regression model of the form:

Y=β0+β1⋅X+ϵ, where ϵ∼N(0,σϵ). The slope coefficient is:

β1

3.4 Regression model output terminology

The fitted model for the poverty rate of U.S. counties as a function of high school graduation rate is:

povertyˆ=64.594−0.591⋅hs_grad In Hampshire County in western Massachusetts, the high school graduation rate is 92.4%. These two facts imply that the poverty rate in Hampshire County is ___.

  1. expected to be about 10.0%

3.5 Fitting a linear model “by hand”

Two facts enable you to compute the slope b1 and intercept b0 of a simple linear regression model from some basic summary statistics.

First, the slope can be defined as:

b1=rX,Y⋅sYsX

where rX,Y represents the correlation (cor()) of X and Y and sX and sY represent the standard deviation (sd()) of X and Y, respectively.

Second, the point (x¯,y¯)is always on the least squares regression line, where x¯and y¯denote the average of x and y, respectively.

N <- c(1507)
r <- c(0.7173011)
mean_hgt <- c(171.1438)
sd_hgt <- c(9.407205)
mean_wgt <- c(69.14753)
sd_wgt <- c(13.34576)
bdims_summary <- data.frame(N, r, mean_hgt, sd_hgt, mean_wgt, sd_wgt)

# Print bdims_summary
bdims_summary
##      N         r mean_hgt   sd_hgt mean_wgt   sd_wgt
## 1 1507 0.7173011 171.1438 9.407205 69.14753 13.34576

# Add slope and intercept
bdims_summary %>%
  mutate(slope = r * sd_wgt / sd_hgt, 
         intercept = mean_wgt - slope * mean_hgt)
##      N         r mean_hgt   sd_hgt mean_wgt   sd_wgt    slope intercept
## 1 1507 0.7173011 171.1438 9.407205 69.14753 13.34576 1.017617 -105.0112

3.6 Regression to the mean

Regression to the mean is a concept attributed to Sir Francis Galton. The basic idea is that extreme random observations will tend to be less extreme upon a second trial.

# Load data
Galton_men <- read.csv("Galton_men.csv")
Galton_women <- read.csv("Galton_women.csv")


# Height of children vs. height of father
ggplot(data = Galton_men, aes(x = father, y = height)) +
  geom_point() + 
  geom_abline(slope = 1, intercept = 0) + 
  geom_smooth(method = "lm", se = FALSE)


# Height of children vs. height of mother
ggplot(data = Galton_women, aes(x = mother, y = height)) +
  geom_point() + 
  geom_abline(slope = 1, intercept = 0) + 
  geom_smooth(method = "lm", se = FALSE)

3.7 “Regression” in the parlance of our time

In an opinion piece about nepotism published in The New York Times in 2015, economist Seth Stephens-Davidowitz wrote that:

“Regression to the mean is so powerful that once-in-a-generation talent basically never sires once-in-a-generation talent. It explains why Michael Jordan’s sons were middling college basketball players and Jakob Dylan wrote two good songs. It is why there are no American parent-child pairs among Hall of Fame players in any major professional sports league.” The author is arguing that…

  1. Because of regression to the mean, an outstanding basketball player is likely to have sons that are good at basketball, but not as good as him.

Chapter 4 Interpreting Regression Models

4.1 Interpretation of coeficients

Recall that the fitted model for the poverty rate of U.S. counties as a function of high school graduation rate is:

povertyˆ=64.594−0.591⋅hs_grad Which of the following is the correct interpretation of the slope coefficient?

Among U.S. counties, each additional percentage point increase in the high school graduation rate is associated with about a 0.591 percentage point decrease in the poverty rate.

4.2 Interpretation in context

A politician interpreting the relationship between poverty rates and high school graduation rates implores his constituents:

If we can lower the poverty rate by 59%, we’ll double the high school graduate rate in our county (i.e. raise it by 100%). Which of the following mistakes in interpretation has the politician made?

All of the above.

4.3 Fitting simple linear models

The function that creates linear models is lm(). This function generally takes two arguments: A formula that specifies the model; and A data argument for the data frame that contains the data you want to use to fit the model.

# Linear model for weight as a function of height
lm(wgt ~ hgt, data = bdims)
## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Coefficients:
## (Intercept)          hgt  
##    -105.011        1.018

# Linear model for SLG as a function of OBP
lm(SLG ~ OBP, data = mlbBat10)
## 
## Call:
## lm(formula = SLG ~ OBP, data = mlbBat10)
## 
## Coefficients:
## (Intercept)          OBP  
##    0.009407     1.110323

# Log-linear model for body weight as a function of brain weight
lm(log(BodyWt) ~ log(BrainWt), data = mammals)
## 
## Call:
## lm(formula = log(BodyWt) ~ log(BrainWt), data = mammals)
## 
## Coefficients:
##  (Intercept)  log(BrainWt)  
##       -2.509         1.225

4.4 Units and scale

In the previous examples, we fit two regression models: wgtˆ=−105.011+1.018⋅hgt and SLGˆ=0.009+1.110⋅OBP.

Which of the following statements is incorrect?

  1. Because the slope coefficient for OBP is larger (1.110) than the slope coefficient for hgt (1.018), we can conclude that the association between OBP and SLG is stronger than the association between height and weight.

4.5 The lm summary output

The coef() function displays only the values of the coefficients. Conversely, the summary() function displays not only that information, but a bunch of other information, including the associated standard error and p-value for each coefficient, the R2, adjusted R2, and the residual standard error.


mod <- lm(wgt ~ hgt, data = bdims)

# Show the coefficients
coef(mod)
## (Intercept)         hgt 
## -105.011254    1.017617

# Show the full output
summary(mod)
## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.743  -6.402  -1.231   5.059  41.103 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -105.01125    7.53941  -13.93   <2e-16 ***
## hgt            1.01762    0.04399   23.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared:  0.5145, Adjusted R-squared:  0.5136 
## F-statistic: 535.2 on 1 and 505 DF,  p-value: < 2.2e-16

4.6 Fitted values and residuals

Once you have fit a regression model, you are often interested in the fitted values (ŷ i) and the residuals (ei), where i indexes the observations.

# Mean of weights equal to mean of fitted values?
mean(bdims$wgt) == mean(fitted.values(mod))
## [1] TRUE

# Mean of the residuals
mean(residuals(mod))
## [1] -1.266971e-15

4.7 Tidying your linear model

The augment() function from the broom package takes a model object as an argument and returns a data frame that contains the data on which the model was fit, along with several quantities specific to the regression model, including the fitted values, residuals, leverage scores, and standardized residuals.

# Load broom
library(broom)
# Create bdims_tidy
bdims_tidy <- augment(mod)

# Glimpse the resulting data frame
glimpse(bdims_tidy)
## Observations: 507
## Variables: 9
## $ wgt        <dbl> 65.6, 71.8, 80.7, 72.6, 78.8, 74.8, 86.4, 78.4, 62....
## $ hgt        <dbl> 174.0, 175.3, 193.5, 186.5, 187.2, 181.5, 184.0, 18...
## $ .fitted    <dbl> 72.05406, 73.37697, 91.89759, 84.77427, 85.48661, 7...
## $ .se.fit    <dbl> 0.4320546, 0.4520060, 1.0667332, 0.7919264, 0.81834...
## $ .resid     <dbl> -6.4540648, -1.5769666, -11.1975919, -12.1742745, -...
## $ .hat       <dbl> 0.002154570, 0.002358152, 0.013133942, 0.007238576,...
## $ .sigma     <dbl> 9.312824, 9.317005, 9.303732, 9.301360, 9.312471, 9...
## $ .cooksd    <dbl> 5.201807e-04, 3.400330e-05, 9.758463e-03, 6.282074e...
## $ .std.resid <dbl> -0.69413418, -0.16961994, -1.21098084, -1.31269063,...

4.8 Making predictions

You can compute expected values for observations that were not present in the data on which the model was fit. These types of predictions are called out-of-sample. You can use the predict() function to generate expected values for the weight of new individuals. We must pass the data frame of new observations through the newdata argument.

wgt <- c(74.8)
hgt <-c(182.8)
ben <- data.frame(wgt, hgt)

# Print ben
ben
##    wgt   hgt
## 1 74.8 182.8

# Predict the weight of ben
predict(mod,newdata = ben)
##        1 
## 81.00909

4.9 Adding a regression line to a plot manually

You can add regression lines to a scatterplot manually by using the geom_abline() function, which takes slope and intercept arguments.

# Define coefs
#mod <- lm(wgt ~ hgt, data = bdims)
#coefs <- data.frame(t(coef(mod)))
coefs <- data.frame(-105, 1.02)
names(coefs) <- c("(Intercept)", "hgt")
str(coefs)
## 'data.frame':    1 obs. of  2 variables:
##  $ (Intercept): num -105
##  $ hgt        : num 1.02

# Add the line to the scatterplot
ggplot(data = bdims, aes(x = hgt, y = wgt)) + 
  geom_point() + 
  geom_abline(data = coefs, 
              aes(intercept = `(Intercept)`, slope = hgt),  
              color = "dodgerblue")

Chapter 5 Model fit

5.1 RMSE

The residual standard error reported for the regression model for poverty rate of U.S. counties in terms of high school graduation rate is 4.67. What does this mean?

  1. The typical difference between the observed poverty rate and the poverty rate predicted by the model is about 4.67 percentage points.

5.2 Standard error of residuals

You can recover the residuals from mod with residuals(), and the degrees of freedom with df.residual().

# View summary of model
summary(mod)
## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.743  -6.402  -1.231   5.059  41.103 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -105.01125    7.53941  -13.93   <2e-16 ***
## hgt            1.01762    0.04399   23.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared:  0.5145, Adjusted R-squared:  0.5136 
## F-statistic: 535.2 on 1 and 505 DF,  p-value: < 2.2e-16

# Compute the mean of the residuals
mean(residuals(mod))
## [1] -1.266971e-15

# Compute RMSE
sqrt(sum(residuals(mod)^2) / df.residual(mod))
## [1] 9.30804

5.3 Assessing simple linear model fit

Recall that the coefficient of determination (R2), can be computed as R2=1−SSESST=1−Var(e)Var(y),where e is the vector of residuals and y is the response variable. This gives us the interpretation of R2 as the percentage of the variability in the response that is explained by the model, since the residuals are the part of that variability that remains unexplained by the model.

# View model summary
summary(mod)
## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.743  -6.402  -1.231   5.059  41.103 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -105.01125    7.53941  -13.93   <2e-16 ***
## hgt            1.01762    0.04399   23.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared:  0.5145, Adjusted R-squared:  0.5136 
## F-statistic: 535.2 on 1 and 505 DF,  p-value: < 2.2e-16

# Compute R-squared
bdims_tidy %>%
  summarize(var_y = var(wgt), var_e = var(.resid)) %>%
  mutate(R_squared = 1-var_e / var_y)
##      var_y    var_e R_squared
## 1 178.1094 86.46839 0.5145208

5.4 Interpretation of R^2

The R2 reported for the regression model for poverty rate of U.S. counties in terms of high school graduation rate is 0.464.

lm(formula = poverty ~ hs_grad, data = countyComplete) %>% summary() How should this result be interpreted?

  1. 46.4% of the variability in poverty rate among U.S. counties can be explained by high school graduation rate.

5.5 Linear vs. average

The R2 gives us a numerical measurement of the strength of fit relative to a null model based on the average of the response variable: ŷ null=y¯

This model has an R2 of zero because SSE=SST.


# Define mod_null and mod_hgt
mod <- lm(wgt ~ 1, data = bdims) # The null model only includes the constant 1.
mod_null <-
  mod %>%
  augment(bdims)

mod <- lm(wgt ~ hgt, data = bdims)
mod_hgt <-
  mod %>%
  augment(bdims)


# Compute SSE for null model
mod_null %>%
  summarize(SSE = var(.resid))
##        SSE
## 1 178.1094

# Compute SSE for regression model
mod_hgt %>%
  summarize(SSE = var(.resid))
##        SSE
## 1 86.46839

5.6 Leverage

The leverage of an observation in a regression model is defined entirely in terms of the distance of that observation from the mean of the explanatory variable. That is, observations close to the mean of the explanatory variable have low leverage, while observations far from the mean of the explanatory variable have high leverage.

# Rank points of high leverage

mod <- lm(HR ~ SB, data = mlbBat10)

mod %>%
  augment() %>%
  arrange(desc(.hat)) %>%
  head()
##   HR SB  .fitted  .se.fit     .resid       .hat   .sigma     .cooksd
## 1  1 68 30.23089 2.025288 -29.230889 0.08619986 6.844170 0.926810510
## 2  2 52 23.78922 1.536351 -21.789218 0.04960366 6.870720 0.273961953
## 3  5 50 22.98401 1.475381 -17.984010 0.04574474 6.880486 0.170721375
## 4 19 47 21.77620 1.384016  -2.776196 0.04025455 6.900562 0.003539207
## 5  5 47 21.77620 1.384016 -16.776196 0.04025455 6.883261 0.129238703
## 6  6 42 19.76317 1.232039 -13.763174 0.03189934 6.889185 0.067745543
##   .std.resid
## 1 -4.4328476
## 2 -3.2400795
## 3 -2.6688283
## 4 -0.4108077
## 5 -2.4824577
## 6 -2.0277982

5.7 Influence

Influential points are likely to have high leverage and deviate from the general relationship between the two variables. We measure influence using Cook’s distance, which incorporates both the leverage and residual of each observation.

# Rank influential points
mod %>%
     augment() %>%
     arrange(desc(.cooksd)) %>%
     head()
##   HR SB  .fitted  .se.fit    .resid       .hat   .sigma    .cooksd
## 1  1 68 30.23089 2.025288 -29.23089 0.08619986 6.844170 0.92681051
## 2  2 52 23.78922 1.536351 -21.78922 0.04960366 6.870720 0.27396195
## 3  5 50 22.98401 1.475381 -17.98401 0.04574474 6.880486 0.17072138
## 4  5 47 21.77620 1.384016 -16.77620 0.04025455 6.883261 0.12923870
## 5  1 42 19.76317 1.232039 -18.76317 0.03189934 6.878983 0.12590881
## 6  6 42 19.76317 1.232039 -13.76317 0.03189934 6.889185 0.06774554
##   .std.resid
## 1  -4.432848
## 2  -3.240079
## 3  -2.668828
## 4  -2.482458
## 5  -2.764474
## 6  -2.027798

5.8 Removing outliers

Although a better model fit can sometimes be achieved by simply removing outliers and re-fitting the model, it can harm the integrity of the data so you need to have a good justification for removing them.

# Create nontrivial_players
nontrivial_players <- mlbBat10 %>%
  filter(AB >= 10, OBP < 0.5)

# Fit model to new data
mod_cleaner <- lm(SLG ~ OBP, data = nontrivial_players)

# View model summary
summary(mod_cleaner)
## 
## Call:
## lm(formula = SLG ~ OBP, data = nontrivial_players)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31383 -0.04165 -0.00261  0.03992  0.35819 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.043326   0.009823  -4.411 1.18e-05 ***
## OBP          1.345816   0.033012  40.768  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07011 on 734 degrees of freedom
## Multiple R-squared:  0.6937, Adjusted R-squared:  0.6932 
## F-statistic:  1662 on 1 and 734 DF,  p-value: < 2.2e-16

# Visualize new model
ggplot(data = nontrivial_players, aes(x = OBP, y = SLG)) +
  geom_point() + 
  geom_smooth(method = "lm")

5.9 High leverage points

High leverage points that lie right near the regression are not influential.

# Rank high leverage points

mod %>% augment() %>% arrange(desc(.hat), .cooksd) %>% head()
##   HR SB  .fitted  .se.fit     .resid       .hat   .sigma     .cooksd
## 1  1 68 30.23089 2.025288 -29.230889 0.08619986 6.844170 0.926810510
## 2  2 52 23.78922 1.536351 -21.789218 0.04960366 6.870720 0.273961953
## 3  5 50 22.98401 1.475381 -17.984010 0.04574474 6.880486 0.170721375
## 4 19 47 21.77620 1.384016  -2.776196 0.04025455 6.900562 0.003539207
## 5  5 47 21.77620 1.384016 -16.776196 0.04025455 6.883261 0.129238703
## 6  6 42 19.76317 1.232039 -13.763174 0.03189934 6.889185 0.067745543
##   .std.resid
## 1 -4.4328476
## 2 -3.2400795
## 3 -2.6688283
## 4 -0.4108077
## 5 -2.4824577
## 6 -2.0277982

Quiz 5

How can we revitalize a region’s economy? (Continued from Quiz 4) Use the code below to create a new variable, rural takes the value of rural if it has less than 1,000 people per square mile and the value of urban otherwise.

# Load data
data(countyComplete) # It comes from the openintro package

# Create a new variable, rural
countyComplete$rural <- ifelse(countyComplete$density < 1000, "rural", "urban")
countyComplete$rural <- factor(countyComplete$rural)

1. Now that you have done expoloratory data analysis with the countyComplete data set in the previous quiz, you are ready to identify what variables are strongly associated with the standard of living. To that end, do the following:

1.1 Create a scatterplot of per_capita_income and bachelors in the data set (1.1 Scatterplots).

# Scatterplot of per_capita_income vs. bachelors
ggplot (data = countyComplete, aes(x = per_capita_income, y = bachelors)) + geom_point()

1.2 Create a boxplot of per_capita_income and bachelors (1.2 Boxplots as discretized/conditioned scatterplots).

# Boxplot of per_capita_income vs. bachelors
ggplot(data = countyComplete, 
       aes(x = cut(per_capita_income, breaks = 5), y = bachelors)) + 
  geom_boxplot()

1.3 What did you learn from the scatterplot and the boxplot you created above? Keep in mind that you wish to learn to revitalize the rural community you live in.

There is a positive linear relationship between having a bachelors degree and per capita income. Having a bachelors degree will generally result in having a higher income. The community should support higher education if it wants to increase the standard of living.

1.4 Add the rural variable to the scatterplot of per_capita_income and bachelors to see whether there is any difference in their relationship between rural and urban (1.3 Creating scatterplots). You may choose to add the third variable as color or as facet_grid.


ggplot(data = countyComplete, aes(x = per_capita_income, y = bachelors, color = factor(rural))) + geom_point()

1.5 Using the same techniques as above, identify another variable that seems to be correlated to per_capita_income.

ggplot(data = countyComplete, aes(x = per_capita_income, y = bachelors, color = white)) + geom_point()

2. Compute correlations (2.1 Computing correlation).

2.1 between income_per_capita and bachelors

countyComplete %>%
  summarize(N = n(), r = cor(per_capita_income, bachelors))
##      N         r
## 1 3143 0.7924464

2.2 between income_per_capita and another influencial variable you found above

countyComplete %>%
  summarize(N = n(), r = cor(per_capita_income, white))
##      N         r
## 1 3143 0.1542088

2.3 Interpret them. How are they different from causation?

The correlation between per capita income and college graduation is strong at 0.792. The correlation between per capita income and being white is much weaker at 0.154. Having a bachelors degree causes per capita income to be higher, being white does not have a strong causal effect on per capita income.

3. Run a simple linear regression

3.1 Create a scatterplot of per_capita_income and bachelors with regression line (3.1 The “best fit” line).


# Scatterplot with regression line
ggplot(data = countyComplete, aes(x = per_capita_income, y = bachelors)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

3.2 Run a linear model for per_capita_income as a function of bachelor and generate a summary out (4.2 The lm summary output).

mod <- lm(per_capita_income ~ bachelors, data = countyComplete)

# Show the coefficients
coef(mod)
## (Intercept)   bachelors 
##  13087.6795    494.7534

# Show the full output
summary(mod)
## 
## Call:
## lm(formula = per_capita_income ~ bachelors, data = countyComplete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18032.7  -1708.2     73.8   1748.0  21756.5 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13087.680    142.091   92.11   <2e-16 ***
## bachelors     494.753      6.795   72.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3299 on 3141 degrees of freedom
## Multiple R-squared:  0.628,  Adjusted R-squared:  0.6279 
## F-statistic:  5302 on 1 and 3141 DF,  p-value: < 2.2e-16

3.3 How is education associated with the standard of living? Positively or negatively? Interpret the coefficient of bachelors.

There is a positive association between education and standard of living. People who have bachelors degrees generally have a higher standard of living. The slope of the linear model = 494.7534. This means that having a bachelors degree will result in having a per capita income of approximaely 5 times that of someone who does not have a bachelors degree.

3.4 How much of the variations in per_capita_income was explained by the model (5.4 Interpretation of R^2)?

# Create bdims_tidy
countyComplete_tidy <- augment(mod)


# Compute R-squared
countyComplete_tidy %>%
  summarize(var_y = var(per_capita_income), var_e = var(.resid)) %>%
  mutate(R_squared = 1-var_e / var_y)
##      var_y    var_e R_squared
## 1 29253692 10883214 0.6279713

The R^2 reported for the regression model for per capita income of U.S. counties in terms of college graduation rate is 0.6279713. This means that 62.8% of the variability in per capita income among U.S. counties can be explained by having a bachelors degree .

3.5 Create a scatterplot of per_capita_income and another influencial variable you found above.

ggplot(data = countyComplete, aes(x = per_capita_income, y = white)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

3.6 Run a linear model for per_capita_income as a function of another influencial variable you found above.

mod <- lm(per_capita_income ~ white, data = countyComplete)

# Show the coefficients
coef(mod)
## (Intercept)       white 
## 18403.11249    49.48507

# Show the full output
summary(mod)
## 
## Call:
## lm(formula = per_capita_income ~ white, data = countyComplete)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12588  -3394   -942   2078  41351 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18403.112    478.496  38.460   <2e-16 ***
## white          49.485      5.657   8.747   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5345 on 3141 degrees of freedom
## Multiple R-squared:  0.02378,    Adjusted R-squared:  0.02347 
## F-statistic: 76.51 on 1 and 3141 DF,  p-value: < 2.2e-16

The R^2 in this regression model is only 0.02347 which means that being white only accounts for 2.35% difference in per capita income.