Disclaimer: The content of this RMarkdown note comes from the Correlation and Regression course on DataCamp.

Correlation and Regression

Ultimately, data analysis is about understanding relationships among variables. In this course, you will learn how to describe and quantify the relationship between two numerical variables: visualizing them, measuring correlation, fitting and interpreting simple linear regression models, and assessing model fit.

Chapter 1: Visualizing two variables

# Import data

#ncbirths <- read.csv(paste(getwd(),"/Bus Statistics/data/Correlation and Regression/ncbirths.csv", sep = ""))

ncbirths <- read.csv("/resources/rstudio/Bus Statistics/data/Correlation and Regression/ncbirths.csv")
head(ncbirths)
##   fage mage      mature weeks    premie visits marital gained weight
## 1   NA   13 younger mom    39 full term     10 married     38   7.63
## 2   NA   14 younger mom    42 full term     15 married     20   7.88
## 3   19   15 younger mom    37 full term     11 married     38   6.63
## 4   21   15 younger mom    41 full term      6 married     34   8.00
## 5   NA   15 younger mom    39 full term      9 married     27   6.38
## 6   NA   15 younger mom    38 full term     19 married     22   5.38
##   lowbirthweight gender     habit  whitemom
## 1        not low   male nonsmoker not white
## 2        not low   male nonsmoker not white
## 3        not low female nonsmoker     white
## 4        not low   male nonsmoker     white
## 5        not low female nonsmoker not white
## 6            low   male nonsmoker not white
str(ncbirths)
## 'data.frame':    1000 obs. of  13 variables:
##  $ fage          : int  NA NA 19 21 NA NA 18 17 NA 20 ...
##  $ mage          : int  13 14 15 15 15 15 15 15 16 16 ...
##  $ mature        : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
##  $ weeks         : int  39 42 37 41 39 38 37 35 38 37 ...
##  $ premie        : Factor w/ 3 levels "<NA>","full term",..: 2 2 2 2 2 2 2 3 2 2 ...
##  $ visits        : int  10 15 11 6 9 19 12 5 9 13 ...
##  $ marital       : Factor w/ 3 levels "<NA>","married",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ gained        : int  38 20 38 34 27 22 76 15 NA 52 ...
##  $ weight        : num  7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
##  $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
##  $ gender        : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
##  $ habit         : Factor w/ 3 levels "<NA>","nonsmoker",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ whitemom      : Factor w/ 3 levels "<NA>","not white",..: 2 2 3 3 2 2 2 2 3 3 ...

1.1 Scatterplots

The ncbirths dataset is a random sample of 1,000 cases taken from a larger dataset collected in 2004. Each case describes the birth of a single child born in North Carolina, along with:

  • various characteristics of the child (e.g. birth weight, length of gestation, etc.),
  • the child’s mother (e.g. age, weight gained during pregnancy, smoking habits, etc.) and
  • the child’s father (e.g. age).
# Load packages
library(ggplot2)

# Scatterplot of weight vs. weeks
ggplot(data = ncbirths, aes(x = weeks, y = weight)) + geom_point()

1.2 Boxplots as discretized/conditioned scatterplots

# Boxplot of weight vs. weeks
ggplot(data = ncbirths, 
       aes(x = cut(weeks, breaks = 5), y = weight)) + 
  geom_boxplot()

Interpretation

  • The relationship no longer seems linear.

1.3 Creating scatterplots

We will use several datasets listed below, which are available through the openintro package.

  • The mammals dataset contains information about 62 different species of mammals, including their body weight, brain weight, gestation time, and a few other variables.
  • The mlbBat10 dataset contains batting statistics for 1,199 Major League Baseball players during the 2010 season.
  • The bdims dataset contains body girth and skeletal diameter measurements for 507 physically active individuals.
  • The smoking dataset contains information on the smoking habits of 1,691 citizens of the United Kingdom.
# Load the package
library(openintro)

# Mammals scatterplot
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
  geom_point()


# Baseball player scatterplot
ggplot(data = mlbBat10, aes(x = OBP, y = SLG)) +
  geom_point()


# Body dimensions scatterplot
ggplot(data = bdims, aes(x = hgt, y = wgt, color = factor(sex))) +
  geom_point()


# Smoking scatterplot
ggplot(data = smoking, aes(x = age, y = amtWeekdays)) +
  geom_point()

1.4 Transformations

When there is no linear relationship between two variables, there are two possibilities:

  1. There really is no meaningful relationship between the two variables.
  2. A careful transformation of one or both of the variables can reveal a clear relationship.

Use transformations to clarify the bizarre relationship between body weight and brain weight among mammals.

# Scatterplot with coord_trans()
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
  geom_point() + 
  coord_trans(x = "log10", y = "log10")


# Scatterplot with scale_x_log10() and scale_y_log10()
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
  geom_point() +
  scale_x_log10() + scale_y_log10()

1.5 Identifying outliers

Recall that in the baseball example earlier in the chapter, most of the points were clustered in the lower left corner of the plot, making it difficult to see the general pattern of the majority of the data. This difficulty was caused by a few outlying players whose on-base percentages (OBPs) were exceptionally high.

Use filter() to create a scatterplot of SLG as a function of OBP among players who had at least 200 at-bats. Then identify the outlier: the one player with at least 200 at-bats whose OBP was below 0.200.

# Load package
library(dplyr) # for the pipe operators (%>%) below

# Scatterplot of SLG vs. OBP without filtering outliers
mlbBat10 %>%
  ggplot(aes(x = OBP, y = SLG)) +
  geom_point()


# Scatterplot of SLG vs. OBP
mlbBat10 %>%
  filter(AB >= 200) %>%
  ggplot(aes(x = OBP, y = SLG)) +
  geom_point()


# Identify the outlying player
mlbBat10 %>%
  filter(AB >= 200, OBP < 0.2)
##     name team position  G  AB  R  H 2B 3B HR RBI TB BB SO SB CS   OBP
## 1 B Wood  LAA       3B 81 226 20 33  2  0  4  14 47  6 71  1  0 0.174
##     SLG   AVG
## 1 0.208 0.146

Chapter 2: Correlation

Correlation:

  • is a means of quantifying bivariate relationships;
  • ranges between -1 and 1;
  • indicates that the variables move in opposite directions when its sign is negative;
  • indicates that the variables move in the same direction when its sign is positive;
  • indicates a stronger association when its absolute value is closer to 1 (see the quick illustration below).
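
A quick illustration with simulated data (a sketch; the seed, sample size, and noise level are arbitrary and not from the course):

# Simulate strongly positive, strongly negative, and weak associations
set.seed(42)                        # arbitrary seed
x <- rnorm(100)
cor(x,  x + rnorm(100, sd = 0.2))   # close to  1: same direction, strong association
cor(x, -x + rnorm(100, sd = 0.2))   # close to -1: opposite directions, strong association
cor(x, rnorm(100))                  # near 0: weak association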

2.1 Computing correlation

The cor(x, y) function will compute the Pearson product-moment correlation between variables, x and y.

# Compute correlation
ncbirths %>%
  summarize(N = n(), r = cor(weight, mage))
##      N          r
## 1 1000 0.05506589

# Compute correlation for all non-missing pairs
ncbirths %>%
  summarize(N = n(), r = cor(weight, weeks, use = "pairwise.complete.obs"))
##      N         r
## 1 1000 0.6701013

2.2 Exploring Anscombe

In 1973, Francis Anscombe famously created four datasets with remarkably similar numerical properties, but obviously different graphical relationships. The Anscombe dataset contains the x and y coordinates for these four datasets, along with a grouping variable, set, that distinguishes the quartet.

# Load Anscombe data
Anscombe <- read.csv("data/Correlation and Regression/Anscombe.csv")

# Plot 
ggplot(data = Anscombe, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ set)


# Compute properties of Anscombe
Anscombe %>%
  group_by(set) %>%
  summarize(N = n(), mean(x), sd(x), mean(y), sd(y), cor(x,y))
## # A tibble: 4 x 7
##     set     N `mean(x)`  `sd(x)` `mean(y)`  `sd(y)` `cor(x, y)`
##   <int> <int>     <dbl>    <dbl>     <dbl>    <dbl>       <dbl>
## 1     1    11         9 3.316625  7.500909 2.031568   0.8164205
## 2     2    11         9 3.316625  7.500909 2.031657   0.8162365
## 3     3    11         9 3.316625  7.500000 2.030424   0.8162867
## 4     4    11         9 3.316625  7.500909 2.030579   0.8165214

2.3 Perception of correlation

Examine the four scatterplots and jot down your best estimate of the value of the correlation coefficient between each pair of variables. Then, compare these values to the actual values you compute in this exercise.

# Baseball player scatterplot
ggplot(data = mlbBat10, aes(x = OBP, y = SLG)) +
  geom_point()


# Scatterplot of SLG vs. OBP
mlbBat10 %>%
  filter(AB >= 200) %>%
  ggplot(aes(x = OBP, y = SLG)) +
  geom_point()


# Body dimensions scatterplot
ggplot(data = bdims, aes(x = hgt, y = wgt, color = factor(sex))) +
  geom_point()


# Scatterplot with scale_x_log10() and scale_y_log10()
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
  geom_point() +
  scale_x_log10() + scale_y_log10()


# Correlation for all baseball players
mlbBat10 %>%
  summarize(N = n(), r = cor(OBP, SLG))
##      N         r
## 1 1199 0.8145628

# Correlation for all players with at least 200 ABs
mlbBat10 %>%
  filter(AB >= 200) %>%
  summarize(N = n(), r = cor(OBP, SLG))
##     N         r
## 1 329 0.6855364

# Correlation of body dimensions
bdims %>%
  group_by(sex) %>%
  summarize(N = n(), r = cor(hgt, wgt))
## # A tibble: 2 x 3
##     sex     N         r
##   <int> <int>     <dbl>
## 1     0   260 0.4310593
## 2     1   247 0.5347418

# Correlation among mammals, with and without log
mammals %>%
  summarize(N = n(), 
            r = cor(BodyWt, BrainWt), 
            r_log = cor(log(BodyWt), log(BrainWt)))
##    N         r     r_log
## 1 62 0.9341638 0.9595748

2.4 Correlation and causation

Correlation does not imply causation. Consider the following statement as an example:

In the San Francisco Bay Area from 1960-1967, the correlation between the birthweight of 1,236 babies and the length of their gestational period was 0.408.

  • We observed that babies with longer gestational periods tended to be heavier at birth.
  • These data suggest that babies with longer gestational periods tend to be heavier at birth, but there are many potential confounding factors that were not taken into account.
  • It may be that a longer gestational period contributes to a heavier birthweight among babies, but a randomized, controlled experiment is needed to confirm this observation.
  • It would be incorrect to conclude, however, that staying in the womb longer causes babies to be heavier when they are born.

2.5 Spurious correlation in random data

Remarkable but nonsensical correlations are called spurious. Statisticians must always be skeptical of potentially spurious correlations.

Create a faceted scatterplot that shows the relationship between each of the 20 sets of pairs of random variables x and y. Do you see any pairs of variables that might be meaningfully correlated? Identify the datasets that show non-trivial correlation of greater than 0.2 in absolute value.
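
The noise data is read from a CSV below. In case that file is unavailable, data with the same structure can be simulated (a sketch; noise_sim, the seed, and the sample sizes are illustrative, so the correlations will not match the output below exactly):

# Simulate 20 sets (z = 1, ..., 20) of 50 independent standard normal (x, y) pairs
set.seed(1234)   # arbitrary seed
noise_sim <- data.frame(x = rnorm(1000), y = rnorm(1000), z = rep(1:20, each = 50))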

# Import data
noise <- read.csv("data/Correlation and Regression/noise.csv")

# Create faceted scatterplot
ggplot(noise, aes(x = x, y = y)) +
geom_point() + 
facet_wrap(~ z)




# Compute correlations for each dataset
noise_summary <- noise %>%
  group_by(z) %>%
  summarize(N = n(), spurious_cor = cor(x, y))

# Isolate sets with correlations above 0.2 in absolute strength
noise_summary %>%
  filter(abs(spurious_cor) > 0.2)
## # A tibble: 4 x 3
##       z     N spurious_cor
##   <int> <int>        <dbl>
## 1     3    50   -0.2063690
## 2     5    50    0.3060001
## 3    14    50   -0.2883678
## 4    20    50   -0.2323851

Chapter 3: Simple linear regression

3.1 The “best fit” line

The simple linear regression model can be visualized as a straight line, a “best fit” line that cuts through the data in a way that minimizes the sum of squared vertical distances between the line and the data points. Such a line can be added to a scatterplot with the geom_smooth() function.

# Scatterplot with regression line
ggplot(data = bdims, aes(x = hgt, y = wgt)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) # method = "lm" fits a linear model; se = FALSE suppresses the standard error band

3.2 Uniqueness of least squares regression line

Experiment with finding the optimal value of the regression slope for weight as a function of height in the bdims dataset via trial and error.

# Define add_line function
add_line <- function(my_slope) {
  ggplot(data = bdims, aes(hgt, wgt)) +
    geom_point() +
    geom_abline(intercept = -102, slope = my_slope, col = "blue")
}

# Estimate optimal value of my_slope
add_line(my_slope = 1)

3.3 Fitting a linear model “by hand”

The least squares slope can be computed from summary statistics as r * sd_wgt / sd_hgt, and because the regression line passes through the point of means, the intercept is mean_wgt - slope * mean_hgt.

# Summarize bdims
bdims_summary <-
  bdims %>%
  summarize(N = n(), r = cor(wgt, hgt), mean_hgt = mean(hgt), sd_hgt = sd(hgt),
            mean_wgt = mean(wgt), sd_wgt = sd(wgt))

# Print bdims_summary
bdims_summary
##     N         r mean_hgt   sd_hgt mean_wgt   sd_wgt
## 1 507 0.7173011 171.1438 9.407205 69.14753 13.34576

# Add slope and intercept
bdims_summary %>%
  mutate(slope = r * sd_wgt / sd_hgt, 
         intercept = mean_wgt - slope * mean_hgt)
##     N         r mean_hgt   sd_hgt mean_wgt   sd_wgt    slope intercept
## 1 507 0.7173011 171.1438 9.407205 69.14753 13.34576 1.017617 -105.0113

3.4 Regression to the mean

Regression to the mean is a concept attributed to Sir Francis Galton. The basic idea is that extreme random observations will tend to be less extreme upon a second trial.

While it is true that tall mothers and fathers tend to have tall children, those children tend to be less tall than their parents, relative to average. That is, fathers who are 3 inches taller than the average father tend to have children who may be taller than average, but by less than 3 inches.

# Load data
Galton_men <- read.csv("~/resources/rstudio/Bus Statistics/data/Correlation and Regression/Galton_men.csv")
Galton_women <- read.csv("~/resources/rstudio/Bus Statistics/data/Correlation and Regression/Galton_women.csv")


# Height of children vs. height of father
ggplot(data = Galton_men, aes(x = father, y = height)) +
  geom_point() + 
  geom_abline(slope = 1, intercept = 0) + 
  geom_smooth(method = "lm", se = FALSE)


# Height of children vs. height of mother
ggplot(data = Galton_women, aes(x = mother, y = height)) +
  geom_point() + 
  geom_abline(slope = 1, intercept = 0) + 
  geom_smooth(method = "lm", se = FALSE)

Interpretation

  • The slope of the regression line, which is smaller than 1 (the slope of the diagonal line) for both males and females, verifies Sir Francis Galton’s regression to the mean concept!
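
A short sketch quantifying this for the fathers (assuming the Galton_men data loaded above): since the regression line passes through the point of means, a father 3 inches above the average father is predicted to have a son only 3 * slope inches above the average son.

# Fit height of sons on height of fathers and extract the slope
mod_men <- lm(height ~ father, data = Galton_men)
b <- coef(mod_men)["father"]
b       # expected to be less than 1
3 * b   # predicted excess height (inches) for sons of fathers 3 inches above average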

Chapter 4: Interpreting regression models

4.1 Fitting simple linear models

Create a linear model using lm(). This function returns a model object of class “lm”, which contains lots of information about your regression model, including:

  • the data used to fit the model,
  • the specification of the model,
  • the fitted values, and
  • the residuals.
# Linear model for weight as a function of height
lm(wgt ~ hgt, data = bdims)
## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Coefficients:
## (Intercept)          hgt  
##    -105.011        1.018

# Linear model for SLG as a function of OBP
lm(SLG ~ OBP, data = mlbBat10)
## 
## Call:
## lm(formula = SLG ~ OBP, data = mlbBat10)
## 
## Coefficients:
## (Intercept)          OBP  
##    0.009407     1.110323

# Log-log model for body weight as a function of brain weight
library(ggplot2)
ggplot(data = mammals, aes(BrainWt, BodyWt)) + geom_point()

ggplot(data = mammals, aes(log(BrainWt), log(BodyWt))) + geom_point()

lm(log(BodyWt) ~ log(BrainWt), data = mammals)
## 
## Call:
## lm(formula = log(BodyWt) ~ log(BrainWt), data = mammals)
## 
## Coefficients:
##  (Intercept)  log(BrainWt)  
##       -2.509         1.225

4.2 The lm summary output

# Create a linear model
mod <- lm(wgt ~ hgt, data = bdims)

# Show the coefficients
coef(mod)
## (Intercept)         hgt 
## -105.011254    1.017617

# Show the full output
summary(mod)
## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.743  -6.402  -1.231   5.059  41.103 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -105.01125    7.53941  -13.93   <2e-16 ***
## hgt            1.01762    0.04399   23.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared:  0.5145, Adjusted R-squared:  0.5136 
## F-statistic: 535.2 on 1 and 505 DF,  p-value: < 2.2e-16

What are degrees of freedom?

  • It is essentially the number of observations that are free to vary after the model coefficients have been estimated (here, n - 2 for the intercept and slope); a quick check is sketched below.
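
A quick check (a sketch, using the model mod fit above): for simple linear regression, the residual degrees of freedom equal the number of observations minus the two estimated coefficients.

# Residual degrees of freedom = n - number of estimated coefficients (intercept + slope)
df.residual(mod)   # 505, as reported in the summary above
nrow(bdims) - 2    # 507 observations minus 2 coefficients = 505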

4.3 Fitted values and residuals

Confirm the two mathematical facts that the least squares fitting procedure guarantees:

  • the mean of the residuals is zero;
  • the mean of the fitted values must equal the mean of the response variable.
# Mean of weights equal to mean of fitted values?
mean(bdims$wgt) == mean(fitted.values(mod))
## [1] TRUE

# Mean of the residuals
mean(resid(mod))
## [1] -1.266971e-15

4.4 Tidying your linear model

As you fit a regression model, there are some quantities (e.g. R^2) that apply to the model as a whole, while others apply to each observation (e.g. fitted values).

Attach per-observation quantities to the original data as new variables, using the augment() function from the broom package. It takes a model object as an argument and returns a data frame.

# Install package
# install.packages("broom")

# Load package
library(broom)

# Create bdims_tidy
bdims_tidy <- augment(mod)

# Glimpse the resulting data frame
glimpse(bdims_tidy)
## Observations: 507
## Variables: 9
## $ wgt        <dbl> 65.6, 71.8, 80.7, 72.6, 78.8, 74.8, 86.4, 78.4, 62....
## $ hgt        <dbl> 174.0, 175.3, 193.5, 186.5, 187.2, 181.5, 184.0, 18...
## $ .fitted    <dbl> 72.05406, 73.37697, 91.89759, 84.77427, 85.48661, 7...
## $ .se.fit    <dbl> 0.4320546, 0.4520060, 1.0667332, 0.7919264, 0.81834...
## $ .resid     <dbl> -6.4540648, -1.5769666, -11.1975919, -12.1742745, -...
## $ .hat       <dbl> 0.002154570, 0.002358152, 0.013133942, 0.007238576,...
## $ .sigma     <dbl> 9.312824, 9.317005, 9.303732, 9.301360, 9.312471, 9...
## $ .cooksd    <dbl> 5.201807e-04, 3.400330e-05, 9.758463e-03, 6.282074e...
## $ .std.resid <dbl> -0.69413418, -0.16961994, -1.21098084, -1.31269063,...

4.5 Making predictions

Using the predict() function, generate expected values for the weight of a new individual.

  • 1st argument: the mod object that contains the fitted model for weight as a function of height for the observations in the bdims dataset.
  • 2nd argument: the ben data frame that contains a height and weight observation for one person.
# Define ben
ben <- data.frame(wgt = 74.8, hgt = 182.8)

# Print ben
ben # Note that the data frame ben has variables with the exact same names as those in the fitted model.
##    wgt   hgt
## 1 74.8 182.8

# Predict the weight of ben
predict(mod, ben)
##        1 
## 81.00909

4.6 Adding a regression line to a plot manually

Add a regression line to the scatterplot manually, using the geom_abline() function, which takes slope and intercept arguments. Alternatively, the geom_smooth() function automatically adds a simple linear regression line to a scatterplot of the corresponding variables.

# Define coefs
#mod <- lm(wgt ~ hgt, data = bdims)
#coefs <- data.frame(t(coef(mod)))
coefs <- data.frame(-105, 1.02)
names(coefs) <- c("(Intercept)", "hgt")
str(coefs)
## 'data.frame':    1 obs. of  2 variables:
##  $ (Intercept): num -105
##  $ hgt        : num 1.02

# Add the line to the scatterplot
ggplot(data = bdims, aes(x = hgt, y = wgt)) + 
  geom_point() + 
  geom_abline(data = coefs, 
              aes(intercept = `(Intercept)`, slope = hgt),  
              color = "dodgerblue")

Chapter 5: Model Fit

Learn to assess how well your linear model fits the data. One can get a sense of model fit by looking at how far, as a group, the actual observations fall from the regression line (the predicted values). Can we quantify this?

  1. SSE: Since the sum of the residuals (actual minus predicted values) is guaranteed to be zero, we square them before summing. The result is the sum of squared errors (SSE). SSE penalizes large residuals disproportionately, which may be a desirable feature, since one would usually prefer Model 1 to Model 2:
  • Model 1 misses by a little bit but never by a lot.
  • Model 2 works well most of the time but is occasionally way off.
  2. RMSE: Unfortunately, SSE is hard to interpret because its units are the square of the response variable's units. The Root Mean Squared Error (RMSE), the square root of SSE divided by the residual degrees of freedom, has the same units as the response variable (see the sketch below). R reports this measure as the Residual Standard Error.
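
A minimal sketch of that relationship, using the model mod for weight as a function of height fit in Chapter 4; the RMSE matches the "Residual standard error" line in summary(mod).

# SSE and RMSE for the wgt ~ hgt model
SSE <- sum(residuals(mod)^2)
RMSE <- sqrt(SSE / df.residual(mod))
c(SSE = SSE, RMSE = RMSE)   # RMSE is about 9.31 (see the summary output in 5.2)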

5.1 RMSE

The residual standard error reported for the regression model for poverty rate of U.S. counties in terms of high school graduation rate is 4.67. What does this mean?

  • The typical difference between the observed poverty rate and the poverty rate predicted by the model is about 4.67 percentage points.
  • A percentage point is the unit for the arithmetic difference of two percentages.

5.2 Standard error of residuals (Residual Standard Error)

Show that the mean of the residuals is zero (not exactly zero here, due to floating-point rounding). Then calculate the residual standard error.

# View summary of model
summary(mod)
## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.743  -6.402  -1.231   5.059  41.103 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -105.01125    7.53941  -13.93   <2e-16 ***
## hgt            1.01762    0.04399   23.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared:  0.5145, Adjusted R-squared:  0.5136 
## F-statistic: 535.2 on 1 and 505 DF,  p-value: < 2.2e-16

# Compute the mean of the residuals
mean(residuals(mod))
## [1] -1.266971e-15

# Compute RMSE
sqrt(sum(residuals(mod)^2) / df.residual(mod))
## [1] 9.30804

5.3 Assessing simple linear model fit

The coefficient of determination (R^2) shows the percentage of the variability in the response that is explained by the model, since the residuals are the part of that variability that remains unexplained by the model.

# View model summary
summary(mod)
## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.743  -6.402  -1.231   5.059  41.103 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -105.01125    7.53941  -13.93   <2e-16 ***
## hgt            1.01762    0.04399   23.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared:  0.5145, Adjusted R-squared:  0.5136 
## F-statistic: 535.2 on 1 and 505 DF,  p-value: < 2.2e-16

# Compute R-squared
bdims_tidy %>%
  summarize(var_y = var(wgt), var_e = var(.resid)) %>%
  mutate(R_squared = 1 - var_e / var_y)
##      var_y    var_e R_squared
## 1 178.1094 86.46839 0.5145208

Interpretation

  • 51.4% of the variability in weight is explained by height.
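
As a quick cross-check (a sketch): for simple linear regression, R^2 is also the square of the correlation coefficient between the explanatory and response variables.

# Squared correlation between height and weight reproduces R-squared
cor(bdims$hgt, bdims$wgt)^2   # about 0.5145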

5.4 Interpretation of R^2

The R^2 reported for the regression model for poverty rate of U.S. counties in terms of high school graduation rate is 0.464. This means that 46.4% of the variability in poverty rate among U.S. counties can be explained by high school graduation rate.

5.5 Linear vs. average

The R^2 shows how well the model fits relative to a null model that uses the average of the response variable as the predicted value for every observation.

# Define mod_null and mod_hgt
mod <- lm(wgt ~ 1, data = bdims) # The null model only includes the constant 1.
mod_null <-
  mod %>%
  augment(bdims)

mod <- lm(wgt ~ hgt, data = bdims)
mod_hgt <-
  mod %>%
  augment(bdims)

# Compute SSE for null model (note: var(.resid) equals SSE / (n - 1))
mod_null %>%
  summarize(SSE = var(.resid))
##        SSE
## 1 178.1094

# Compute SSE for regression model (again scaled by 1 / (n - 1))
mod_hgt %>%
  summarize(SSE = var(.resid))
##        SSE
## 1 86.46839
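
Combining the two (a sketch, using mod_null and mod_hgt from above): the ratio of these quantities reproduces the R^2 reported earlier, since the 1 / (n - 1) scaling introduced by var() cancels.

# R^2 = 1 - SSE(regression model) / SSE(null model)
1 - var(mod_hgt$.resid) / var(mod_null$.resid)   # about 0.5145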

5.6 Leverage

To refine our understanding of outliers, we introduce leverage and influence. In fitting a linear model, we want to be mindful that one or two individual observations should not have a disproportionate influence on the estimated slope. The concepts of leverage and influence help us quantify this.

The leverage of an observation measures how far its value of the explanatory variable lies from the mean of the explanatory variable. The augment() function reports each observation's leverage in the .hat column.

# Define mod
mod <- lm(SLG ~ OBP, data = filter(mlbBat10, AB >= 10))

mod %>%
  augment() %>%
  arrange(desc(.hat)) %>%
  head()
##     SLG   OBP     .fitted     .se.fit      .resid       .hat     .sigma
## 1 0.000 0.000 -0.03744579 0.009956861  0.03744579 0.01939493 0.07153050
## 2 0.000 0.000 -0.03744579 0.009956861  0.03744579 0.01939493 0.07153050
## 3 0.000 0.000 -0.03744579 0.009956861  0.03744579 0.01939493 0.07153050
## 4 0.308 0.550  0.69049108 0.009158810 -0.38249108 0.01641049 0.07011360
## 5 0.000 0.037  0.01152451 0.008770891 -0.01152451 0.01504981 0.07154283
## 6 0.038 0.038  0.01284803 0.008739031  0.02515197 0.01494067 0.07153800
##        .cooksd .std.resid
## 1 0.0027664282  0.5289049
## 2 0.0027664282  0.5289049
## 3 0.0027664282  0.5289049
## 4 0.2427446800 -5.3943121
## 5 0.0002015398 -0.1624191
## 6 0.0009528017  0.3544561
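
For simple linear regression, leverage has a closed form that can serve as a check on the .hat values above (a sketch; obp and h_manual are illustrative names): h = 1/n + (x - mean(x))^2 / sum((x - mean(x))^2).

# Compute leverage by hand for the SLG ~ OBP model on players with AB >= 10
obp <- filter(mlbBat10, AB >= 10)$OBP
h_manual <- 1 / length(obp) + (obp - mean(obp))^2 / sum((obp - mean(obp))^2)
max(h_manual)   # should match the largest .hat value above (about 0.0194)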

5.7 Influence

The influence of an observation depends on:

  1. its leverage, and
  2. the magnitude of its residual (the difference between the actual and fitted values).

The augment() function reports each observation's influence (Cook's distance) in the .cooksd column.

# Rank influential points
mod %>%
  augment() %>%
  arrange(desc(.cooksd)) %>%
  head()
##     SLG   OBP    .fitted     .se.fit     .resid        .hat     .sigma
## 1 0.308 0.550 0.69049108 0.009158810 -0.3824911 0.016410487 0.07011360
## 2 0.833 0.385 0.47211002 0.004190644  0.3608900 0.003435619 0.07028875
## 3 0.800 0.455 0.56475653 0.006186785  0.2352435 0.007488132 0.07101125
## 4 0.379 0.133 0.13858258 0.005792344  0.2404174 0.006563752 0.07098798
## 5 0.786 0.438 0.54225666 0.005678026  0.2437433 0.006307223 0.07097257
## 6 0.231 0.077 0.06446537 0.007506974  0.1665346 0.011024863 0.07127661
##      .cooksd .std.resid
## 1 0.24274468  -5.394312
## 2 0.04407145   5.056428
## 3 0.04114818   3.302718
## 4 0.03760256   3.373787
## 5 0.03712042   3.420018
## 6 0.03057912   2.342252

5.8 Removing outliers

Observations can be outliers for a number of different reasons. Statisticians must always be careful—and more importantly, transparent—when dealing with outliers.

For example, in the mlbBat10 data, the outlier with an OBP of 0.550 is Bobby Scales, an infielder who had four hits in 13 at-bats for the Chicago Cubs. Scales also walked seven times, resulting in his unusually high OBP. The justification for removing Scales here is weak. While his performance was unusual, there is nothing to suggest that it is not a valid data point, nor is there a good reason to think that somehow we will learn more about Major League Baseball players by excluding him.

Nevertheless, we can demonstrate how removing him will affect our model.

# Create nontrivial_players
nontrivial_players <-
  mlbBat10 %>%
  filter(AB >= 10, OBP < 0.5)

# Fit model to new data
mod_cleaner <- lm(SLG ~ OBP, data = nontrivial_players)

# View model summary
summary(mod_cleaner)
## 
## Call:
## lm(formula = SLG ~ OBP, data = nontrivial_players)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31383 -0.04165 -0.00261  0.03992  0.35819 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.043326   0.009823  -4.411 1.18e-05 ***
## OBP          1.345816   0.033012  40.768  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07011 on 734 degrees of freedom
## Multiple R-squared:  0.6937, Adjusted R-squared:  0.6932 
## F-statistic:  1662 on 1 and 734 DF,  p-value: < 2.2e-16

# Visualize new model
ggplot(aes(x = OBP, y = SLG), data = nontrivial_players) +
  geom_point() +
  geom_smooth(method = "lm")

5.9 High leverage points

Not all points of high leverage are influential. While the high leverage observation corresponding to Bobby Scales in the previous exercise is influential, the three observations for players with OBP and SLG values of 0 are not influential.

This is because they happen to lie right near the regression line anyway, so their residuals are small.

# Rank high leverage points
mod %>%
  augment() %>%
  arrange(desc(.hat), .cooksd) %>%
  head()
##     SLG   OBP     .fitted     .se.fit      .resid       .hat     .sigma
## 1 0.000 0.000 -0.03744579 0.009956861  0.03744579 0.01939493 0.07153050
## 2 0.000 0.000 -0.03744579 0.009956861  0.03744579 0.01939493 0.07153050
## 3 0.000 0.000 -0.03744579 0.009956861  0.03744579 0.01939493 0.07153050
## 4 0.308 0.550  0.69049108 0.009158810 -0.38249108 0.01641049 0.07011360
## 5 0.000 0.037  0.01152451 0.008770891 -0.01152451 0.01504981 0.07154283
## 6 0.038 0.038  0.01284803 0.008739031  0.02515197 0.01494067 0.07153800
##        .cooksd .std.resid
## 1 0.0027664282  0.5289049
## 2 0.0027664282  0.5289049
## 3 0.0027664282  0.5289049
## 4 0.2427446800 -5.3943121
## 5 0.0002015398 -0.1624191
## 6 0.0009528017  0.3544561