Chapter 4

Summary

This week I worked through Chapter 4 of the Stat2 book on Additional Topics in Regression. This chapter introduced many new ways to continue to verify and analyze a multiple regression model, such as through added variable plots or Mallow’s Cp.

Coming out of last week, I had still felt a bit weary when choosing variables to create a MR model, but the book introduced more methods to ensure the best variables are chosen, therefore the best model is created. Working with the same data from the book examples in R, I created a best subsets model, which is a computer output that suggests how many variables would create the best model, as well as which variables to use to create said model. The book’s model showed an R^2 and Mallow’s Cp value alongside the best subsets model, which is incredibly helpful, but something I am still attempting to learn how to do within R. I am aware that the predictors the computer picks as the “best” predictors are simply those with the highest R^2 values, but I feel it would be helpful to see them, which is why I am still on my journey of exploring how to do this in R.

Speaking of Mallow’s Cp, I learned that it is another statistic used to check the effectiveness of a MR model. Mallow’s Cp greatly takes into account the amount of predictors within the model, as a Mallow’s Cp that indicates a good model is usually close to the number of predictors plus 1, or another constant. When a MCp is far greater than the number of predictors plus the constant, the model is usually not a great fit for the data. When doing work in R, my output showed an “AIC” column, which was introduced into the book. The AIC, or Akaike’s information criterion, is another measure to evaluate how a model fits the data, but uses logs within its calculations. Both of these statistics are useful in the search for the perfect model, as they allow us to compare different models alongside each other by a simple number. Although it is important to account for many other things and not choose a model by a singular statistic, this value is helpful to weed out models that are far off.

Backwards, or stepwise elimination was introduced in this chapter, and is another essential tool in the search of a close to perfect MR model. The function works through fitting a full model of every predictor variable within a dataset, and then eliminating predictor variables by their p-values (if less than 5%, keep it, if greater, get rid of it). When using forward regression, the same methodology is used, but it starts with a model with no predictors, and searches for predictors that add a significant increase to R^2. When doing this, I kept in mind that R^2 increases when a new predictor is added, and looked at the adjusted R^2 value rather than the original R^2.

When doing the homework, I was able to employ a lot of what I had learned, such as in problem 4.2. Here, I created a scatterplot matrix to observe predictors with strong relationships, used forward regression to find the best variables to predict for Time. In R, I am still learning how to create added variable plots, so the plot for that statistic is not included in the homework.

I love R as it allows me to use bigger datasets than a calculator would, and do bigger things with said data. For example, in problem 4.17, I used the function permcorr() and a for() function (which I wouldn’t have known how to do without my compsci class!) to test a set of data n = 5000 times to test for its significance.

After last week, I still felt a bit weary about the topic of multiple regression, specifically in choosing variables, but I feel that this week chapter eased my worries. I know of many more ways to check for statistical significance and how to choose a good model, which are essential for my understanding of the topic.

## import libraries
library(car)

## Loading required package: carData

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(leaps)
library(MASS)

## Warning: package 'MASS' was built under R version 4.1.2

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

4.2

## 4.2
peak = read.csv("~/downloads/HighPeaks.csv")
attach(peak)
# a
plot(peak$Elevation, peak$Time, pch=16)

cor(peak$Elevation, peak$Time)

## [1] -0.0162768

# b
peakmr = lm(Time ~ Elevation + Length, data = peak)
summary(peakmr)

## 
## Call:
## lm(formula = Time ~ Elevation + Length, data = peak)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5924 -0.8050 -0.1959  0.6380  3.8432 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.0753787  2.5327132   3.188  0.00267 ** 
## Elevation   -0.0014483  0.0005805  -2.495  0.01653 *  
## Length       0.7123344  0.0593330  12.006 2.54e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.37 on 43 degrees of freedom
## Multiple R-squared:  0.7703, Adjusted R-squared:  0.7596 
## F-statistic: 72.09 on 2 and 43 DF,  p-value: 1.844e-14

elevation = lm(Time ~ Elevation, data = peak)
summary(elevation)

## 
## Call:
## lm(formula = Time ~ Elevation, data = peak)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6912 -1.6985 -0.5639  1.2963  7.3015 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 11.2113764  5.1953800   2.158   0.0364 *
## Elevation   -0.0001269  0.0011756  -0.108   0.9145  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.826 on 44 degrees of freedom
## Multiple R-squared:  0.0002649,  Adjusted R-squared:  -0.02246 
## F-statistic: 0.01166 on 1 and 44 DF,  p-value: 0.9145

length = lm(Time ~ Length, data = peak)
summary(length)

## 
## Call:
## lm(formula = Time ~ Length, data = peak)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4491 -0.6687 -0.0122  0.5590  4.0034 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.04817    0.80371   2.548   0.0144 *  
## Length       0.68427    0.06162  11.105 2.39e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.449 on 44 degrees of freedom
## Multiple R-squared:  0.737,  Adjusted R-squared:  0.7311 
## F-statistic: 123.3 on 1 and 44 DF,  p-value: 2.39e-14

# c
# residuals + graph of time on length
advp1 = lm(Time ~ Length, data = peak)
cor(peak$Time, peak$Length)

## [1] 0.8585079

resadvp1 = resid(advp1)
plot(fitted(advp1), resadvp1, xlab = "Predicted Time",
     ylab = "Resid1", main = "Residuals of Time on Length")
abline(0,0)

summary(resadvp1)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.44914 -0.66867 -0.01216  0.00000  0.55897  4.00344

summary(advp1)

## 
## Call:
## lm(formula = Time ~ Length, data = peak)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4491 -0.6687 -0.0122  0.5590  4.0034 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.04817    0.80371   2.548   0.0144 *  
## Length       0.68427    0.06162  11.105 2.39e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.449 on 44 degrees of freedom
## Multiple R-squared:  0.737,  Adjusted R-squared:  0.7311 
## F-statistic: 123.3 on 1 and 44 DF,  p-value: 2.39e-14

plot(peak$Length, peak$Time, xlab = "Length", ylab = "Time",
     main = "Length and Time", abline(lm(peak$Time ~ peak$Length)))

# residuals + graph of time on elevation
advp2 = lm(Time ~ Elevation, data = peak)
cor(peak$Time, peak$Elevation)

## [1] -0.0162768

resadvp2 = resid(advp2)
plot(fitted(advp2), resadvp2, xlab = "Predicted Time",
     ylab = "Resid2", main = "Residuals of Time on Elevation")
abline(0,0)

plot(peak$Elevation, peak$Time,  xlab = "Elevation", ylab = "Time",
     main = "Elevation and Time", abline(lm(peak$Time ~ peak$Elevation)))

summary(advp2)

## 
## Call:
## lm(formula = Time ~ Elevation, data = peak)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6912 -1.6985 -0.5639  1.2963  7.3015 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 11.2113764  5.1953800   2.158   0.0364 *
## Elevation   -0.0001269  0.0011756  -0.108   0.9145  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.826 on 44 degrees of freedom
## Multiple R-squared:  0.0002649,  Adjusted R-squared:  -0.02246 
## F-statistic: 0.01166 on 1 and 44 DF,  p-value: 0.9145

# added variable plots
error1 = advp1$residuals
error2 = advp2$residuals
error1.error2lm = lm(error1 ~ error2)
error1.error2lm

## 
## Call:
## lm(formula = error1 ~ error2)
## 
## Coefficients:
## (Intercept)       error2  
##   1.267e-16    2.601e-01

plot(error2, error1, main = "Added Variable Plot")

abline(error1.error2lm)
abline(0,0)

cor(error1, error2)

## [1] 0.507185

cor(resadvp1, resadvp2)

## [1] 0.507185

# build in function
avPlots(advp1, col = 1)

# mr model
mrpeak = lm(Time ~ Elevation + Length, data = peak)
summary(mrpeak)

## 
## Call:
## lm(formula = Time ~ Elevation + Length, data = peak)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5924 -0.8050 -0.1959  0.6380  3.8432 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.0753787  2.5327132   3.188  0.00267 ** 
## Elevation   -0.0014483  0.0005805  -2.495  0.01653 *  
## Length       0.7123344  0.0593330  12.006 2.54e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.37 on 43 degrees of freedom
## Multiple R-squared:  0.7703, Adjusted R-squared:  0.7596 
## F-statistic: 72.09 on 2 and 43 DF,  p-value: 1.844e-14

res = avPlots(lm(Time ~ Elevation + Length, data = peak))

fit = lsfit(res$Elevation[,1], res$Elevation[,2])
fit$coefficients

##     Intercept             X 
##  1.066786e-16 -1.448269e-03

fitlength = lsfit(res$Length[,1], res$Length[,2])
fitlength$coefficients

##    Intercept            X 
## 6.301114e-17 7.123344e-01

a.

The correlation is -0.016 between the variables, therefore, Elevation does not appear to be a helpful predictor for Time

b.

Model with both predictors = 8.07 - 0.00144(Elevation) + 0.7123(Length), R^2 = 0.7703

Elevation only = 11.21 - 0.00012(Elevation), R^2 = 0.0026

Length only = 2.048 + 0.684(Length), R^2 = 0.737

Elevation is not very important in this model. In the model with both predictors, it’s small factor tells us that it doesn’t affect the model that much, while length’s higher factor tells us length is more important. The elevation only model accounts for 0.2% of variance in the model, while length only accounts for 73.7%, but combined, the multiple regression model accounts for 77.03%, making it the optimal model.

c.

Here, the equation of the linear regression line of the added variable plot is Resid1 = 0 + 0.712(Resid2) These plots show that

4.3

mlb.df = read.csv("~/downloads/MLBStandings2016.csv")
names(mlb.df)

##  [1] "Team"           "League"         "Wins"           "Losses"        
##  [5] "WinPct"         "BattingAverage" "Runs"           "Hits"          
##  [9] "HR"             "Doubles"        "Triples"        "RBI"           
## [13] "SB"             "OBP"            "SLG"            "ERA"           
## [17] "HitsAllowed"    "Walks"          "StrikeOuts"     "Saves"         
## [21] "WHIP"

pairs(mlb.df[,c(3,4,5,6,7,8,9)])

fullmlb = lm(WinPct ~ BattingAverage + Runs + Hits + HR + Doubles + Triples + RBI + SB + OBP + SLG + ERA + HitsAllowed + Walks + StrikeOuts + Saves + WHIP, data = mlb.df)
summary(fullmlb)

## 
## Call:
## lm(formula = WinPct ~ BattingAverage + Runs + Hits + HR + Doubles + 
##     Triples + RBI + SB + OBP + SLG + ERA + HitsAllowed + Walks + 
##     StrikeOuts + Saves + WHIP, data = mlb.df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.038561 -0.010917  0.002214  0.010956  0.027615 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)  
## (Intercept)     4.850e-01  4.214e-01   1.151   0.2705  
## BattingAverage  1.549e+01  1.846e+01   0.839   0.4164  
## Runs            1.767e-03  9.560e-04   1.848   0.0875 .
## Hits           -1.489e-03  1.257e-03  -1.185   0.2574  
## HR              2.030e-03  6.853e-03   0.296   0.7718  
## Doubles         2.990e-04  2.232e-03   0.134   0.8955  
## Triples         1.579e-03  4.769e-03   0.331   0.7458  
## RBI            -1.306e-03  9.747e-04  -1.340   0.2032  
## SB             -6.973e-05  1.757e-04  -0.397   0.6979  
## OBP            -2.374e+00  1.053e+00  -2.254   0.0421 *
## SLG            -3.250e+00  1.269e+01  -0.256   0.8019  
## ERA            -3.687e-02  3.486e-02  -1.058   0.3094  
## HitsAllowed     1.011e-03  5.846e-04   1.729   0.1074  
## Walks           1.104e-03  5.874e-04   1.879   0.0829 .
## StrikeOuts     -7.472e-05  9.117e-05  -0.820   0.4272  
## Saves           2.328e-03  9.135e-04   2.548   0.0243 *
## WHIP           -1.834e+00  8.672e-01  -2.114   0.0544 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02153 on 13 degrees of freedom
## Multiple R-squared:  0.9525, Adjusted R-squared:  0.8941 
## F-statistic: 16.31 on 16 and 13 DF,  p-value: 4.541e-06

# a - forward selection
##mlb.df = mlb.df %>%
  ##select(-Wins, -Losses, -Team)
full.model = lm(WinPct ~ ., data = mlb.df)
forward = regsubsets(WinPct ~ ., data=mlb.df, nbest = 1, nvmax = 4, method = "forward")

## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 19 linear dependencies found

with(summary(forward), data.frame(cp, outmat))

##            cp TeamAtlanta.Braves TeamBaltimore.Orioles TeamBoston.Red.Sox
## 1  ( 1 ) -Inf                                                            
## 2  ( 1 ) -Inf                                                            
## 3  ( 1 ) -Inf                                                            
## 4  ( 1 ) -Inf                  *                                         
##          TeamChicago.Cubs TeamChicago.White.Sox TeamCincinnati.Reds
## 1  ( 1 )                                                           
## 2  ( 1 )                                                           
## 3  ( 1 )                *                                          
## 4  ( 1 )                *                                          
##          TeamCleveland.Indians TeamColorado.Rockies TeamDetroit.Tigers
## 1  ( 1 )                                                              
## 2  ( 1 )                                                              
## 3  ( 1 )                                                              
## 4  ( 1 )                                                              
##          TeamHouston.Astros TeamKansas.City.Royals TeamLos.Angeles.Angels
## 1  ( 1 )                                                                 
## 2  ( 1 )                                                                 
## 3  ( 1 )                                                                 
## 4  ( 1 )                                                                 
##          TeamLos.Angeles.Dodgers TeamMiami.Marlins TeamMilwaukee.Brewers
## 1  ( 1 )                                                                
## 2  ( 1 )                                                                
## 3  ( 1 )                                                                
## 4  ( 1 )                                                                
##          TeamMinnesota.Twins TeamNew.York.Mets TeamNew.York.Yankees
## 1  ( 1 )                                                           
## 2  ( 1 )                                                           
## 3  ( 1 )                                                           
## 4  ( 1 )                                                           
##          TeamOakland.Athletics TeamPhiladelphia.Phillies TeamPittsburgh.Pirates
## 1  ( 1 )                                                                       
## 2  ( 1 )                                                                       
## 3  ( 1 )                                                                       
## 4  ( 1 )                                                                       
##          TeamSan.Diego.Padres TeamSan.Francisco.Giants TeamSeattle.Mariners
## 1  ( 1 )                                                                   
## 2  ( 1 )                                                                   
## 3  ( 1 )                                                                   
## 4  ( 1 )                                                                   
##          TeamSt..Louis.Cardinals TeamTampa.Bay.Rays TeamTexas.Rangers
## 1  ( 1 )                                                             
## 2  ( 1 )                                                             
## 3  ( 1 )                                                             
## 4  ( 1 )                                                             
##          TeamToronto.Blue.Jays TeamWashington.Nationals LeagueNL Wins Losses
## 1  ( 1 )                                                                   *
## 2  ( 1 )                                                            *      *
## 3  ( 1 )                                                            *      *
## 4  ( 1 )                                                            *      *
##          BattingAverage Runs Hits HR Doubles Triples RBI SB OBP SLG ERA
## 1  ( 1 )                                                               
## 2  ( 1 )                                                               
## 3  ( 1 )                                                               
## 4  ( 1 )                                                               
##          HitsAllowed Walks StrikeOuts Saves WHIP
## 1  ( 1 )                                        
## 2  ( 1 )                                        
## 3  ( 1 )                                        
## 4  ( 1 )

forwardmodel = lm(WinPct ~ Runs + ERA + Saves + WHIP, data = mlb.df)
summary(forwardmodel)

## 
## Call:
## lm(formula = WinPct ~ Runs + ERA + Saves + WHIP, data = mlb.df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.051472 -0.017986 -0.001991  0.017048  0.047963 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.160e-01  1.186e-01   4.351 0.000201 ***
## Runs         5.187e-04  7.764e-05   6.681 5.31e-07 ***
## ERA         -3.636e-02  2.626e-02  -1.385 0.178402    
## Saves        2.643e-03  6.788e-04   3.893 0.000652 ***
## WHIP        -2.658e-01  1.275e-01  -2.085 0.047457 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02404 on 25 degrees of freedom
## Multiple R-squared:  0.8863, Adjusted R-squared:  0.8681 
## F-statistic:  48.7 on 4 and 25 DF,  p-value: 1.91e-11

# b - backward elimination
backward = regsubsets(WinPct ~ ., data = mlb.df, nbest = 1, nvmax = 4, method = "backward")

## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 19 linear dependencies found

summary(backward)

## Subset selection object
## Call: regsubsets.formula(WinPct ~ ., data = mlb.df, nbest = 1, nvmax = 4, 
##     method = "backward")
## 48 Variables  (and intercept)
##                           Forced in Forced out
## TeamAtlanta Braves            FALSE      FALSE
## TeamBaltimore Orioles         FALSE      FALSE
## TeamBoston Red Sox            FALSE      FALSE
## TeamChicago Cubs              FALSE      FALSE
## TeamChicago White Sox         FALSE      FALSE
## TeamCincinnati Reds           FALSE      FALSE
## TeamCleveland Indians         FALSE      FALSE
## TeamColorado Rockies          FALSE      FALSE
## TeamDetroit Tigers            FALSE      FALSE
## TeamHouston Astros            FALSE      FALSE
## TeamKansas City Royals        FALSE      FALSE
## TeamLos Angeles Angels        FALSE      FALSE
## TeamLos Angeles Dodgers       FALSE      FALSE
## TeamMiami Marlins             FALSE      FALSE
## TeamMilwaukee Brewers         FALSE      FALSE
## TeamMinnesota Twins           FALSE      FALSE
## TeamNew York Mets             FALSE      FALSE
## TeamNew York Yankees          FALSE      FALSE
## TeamOakland Athletics         FALSE      FALSE
## TeamPhiladelphia Phillies     FALSE      FALSE
## TeamPittsburgh Pirates        FALSE      FALSE
## TeamSan Diego Padres          FALSE      FALSE
## TeamSan Francisco Giants      FALSE      FALSE
## TeamSeattle Mariners          FALSE      FALSE
## TeamSt. Louis Cardinals       FALSE      FALSE
## TeamTampa Bay Rays            FALSE      FALSE
## TeamTexas Rangers             FALSE      FALSE
## TeamToronto Blue Jays         FALSE      FALSE
## TeamWashington Nationals      FALSE      FALSE
## LeagueNL                      FALSE      FALSE
## Wins                          FALSE      FALSE
## Losses                        FALSE      FALSE
## BattingAverage                FALSE      FALSE
## Runs                          FALSE      FALSE
## Hits                          FALSE      FALSE
## HR                            FALSE      FALSE
## Doubles                       FALSE      FALSE
## Triples                       FALSE      FALSE
## RBI                           FALSE      FALSE
## SB                            FALSE      FALSE
## OBP                           FALSE      FALSE
## SLG                           FALSE      FALSE
## ERA                           FALSE      FALSE
## HitsAllowed                   FALSE      FALSE
## Walks                         FALSE      FALSE
## StrikeOuts                    FALSE      FALSE
## Saves                         FALSE      FALSE
## WHIP                          FALSE      FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: backward
##          TeamAtlanta Braves TeamBaltimore Orioles TeamBoston Red Sox
## 1  ( 1 ) " "                " "                   " "               
## 2  ( 1 ) " "                " "                   " "               
## 3  ( 1 ) " "                " "                   " "               
## 4  ( 1 ) " "                " "                   " "               
##          TeamChicago Cubs TeamChicago White Sox TeamCincinnati Reds
## 1  ( 1 ) "*"              " "                   " "                
## 2  ( 1 ) "*"              " "                   " "                
## 3  ( 1 ) "*"              " "                   " "                
## 4  ( 1 ) "*"              " "                   " "                
##          TeamCleveland Indians TeamColorado Rockies TeamDetroit Tigers
## 1  ( 1 ) " "                   " "                  " "               
## 2  ( 1 ) " "                   " "                  " "               
## 3  ( 1 ) " "                   " "                  " "               
## 4  ( 1 ) " "                   " "                  " "               
##          TeamHouston Astros TeamKansas City Royals TeamLos Angeles Angels
## 1  ( 1 ) " "                " "                    " "                   
## 2  ( 1 ) " "                " "                    " "                   
## 3  ( 1 ) " "                " "                    " "                   
## 4  ( 1 ) " "                " "                    " "                   
##          TeamLos Angeles Dodgers TeamMiami Marlins TeamMilwaukee Brewers
## 1  ( 1 ) " "                     " "               " "                  
## 2  ( 1 ) " "                     " "               " "                  
## 3  ( 1 ) " "                     " "               " "                  
## 4  ( 1 ) " "                     " "               " "                  
##          TeamMinnesota Twins TeamNew York Mets TeamNew York Yankees
## 1  ( 1 ) " "                 " "               " "                 
## 2  ( 1 ) "*"                 " "               " "                 
## 3  ( 1 ) "*"                 " "               " "                 
## 4  ( 1 ) "*"                 " "               " "                 
##          TeamOakland Athletics TeamPhiladelphia Phillies TeamPittsburgh Pirates
## 1  ( 1 ) " "                   " "                       " "                   
## 2  ( 1 ) " "                   " "                       " "                   
## 3  ( 1 ) " "                   " "                       " "                   
## 4  ( 1 ) " "                   " "                       " "                   
##          TeamSan Diego Padres TeamSan Francisco Giants TeamSeattle Mariners
## 1  ( 1 ) " "                  " "                      " "                 
## 2  ( 1 ) " "                  " "                      " "                 
## 3  ( 1 ) " "                  " "                      " "                 
## 4  ( 1 ) " "                  " "                      " "                 
##          TeamSt. Louis Cardinals TeamTampa Bay Rays TeamTexas Rangers
## 1  ( 1 ) " "                     " "                " "              
## 2  ( 1 ) " "                     " "                " "              
## 3  ( 1 ) " "                     " "                " "              
## 4  ( 1 ) " "                     " "                "*"              
##          TeamToronto Blue Jays TeamWashington Nationals LeagueNL Wins Losses
## 1  ( 1 ) " "                   " "                      " "      " "  " "   
## 2  ( 1 ) " "                   " "                      " "      " "  " "   
## 3  ( 1 ) " "                   "*"                      " "      " "  " "   
## 4  ( 1 ) " "                   "*"                      " "      " "  " "   
##          BattingAverage Runs Hits HR  Doubles Triples RBI SB  OBP SLG ERA
## 1  ( 1 ) " "            " "  " "  " " " "     " "     " " " " " " " " " "
## 2  ( 1 ) " "            " "  " "  " " " "     " "     " " " " " " " " " "
## 3  ( 1 ) " "            " "  " "  " " " "     " "     " " " " " " " " " "
## 4  ( 1 ) " "            " "  " "  " " " "     " "     " " " " " " " " " "
##          HitsAllowed Walks StrikeOuts Saves WHIP
## 1  ( 1 ) " "         " "   " "        " "   " " 
## 2  ( 1 ) " "         " "   " "        " "   " " 
## 3  ( 1 ) " "         " "   " "        " "   " " 
## 4  ( 1 ) " "         " "   " "        " "   " "

with(summary(backward), data.frame(cp,outmat))

##            cp TeamAtlanta.Braves TeamBaltimore.Orioles TeamBoston.Red.Sox
## 1  ( 1 ) -Inf                                                            
## 2  ( 1 ) -Inf                                                            
## 3  ( 1 ) -Inf                                                            
## 4  ( 1 ) -Inf                                                            
##          TeamChicago.Cubs TeamChicago.White.Sox TeamCincinnati.Reds
## 1  ( 1 )                *                                          
## 2  ( 1 )                *                                          
## 3  ( 1 )                *                                          
## 4  ( 1 )                *                                          
##          TeamCleveland.Indians TeamColorado.Rockies TeamDetroit.Tigers
## 1  ( 1 )                                                              
## 2  ( 1 )                                                              
## 3  ( 1 )                                                              
## 4  ( 1 )                                                              
##          TeamHouston.Astros TeamKansas.City.Royals TeamLos.Angeles.Angels
## 1  ( 1 )                                                                 
## 2  ( 1 )                                                                 
## 3  ( 1 )                                                                 
## 4  ( 1 )                                                                 
##          TeamLos.Angeles.Dodgers TeamMiami.Marlins TeamMilwaukee.Brewers
## 1  ( 1 )                                                                
## 2  ( 1 )                                                                
## 3  ( 1 )                                                                
## 4  ( 1 )                                                                
##          TeamMinnesota.Twins TeamNew.York.Mets TeamNew.York.Yankees
## 1  ( 1 )                                                           
## 2  ( 1 )                   *                                       
## 3  ( 1 )                   *                                       
## 4  ( 1 )                   *                                       
##          TeamOakland.Athletics TeamPhiladelphia.Phillies TeamPittsburgh.Pirates
## 1  ( 1 )                                                                       
## 2  ( 1 )                                                                       
## 3  ( 1 )                                                                       
## 4  ( 1 )                                                                       
##          TeamSan.Diego.Padres TeamSan.Francisco.Giants TeamSeattle.Mariners
## 1  ( 1 )                                                                   
## 2  ( 1 )                                                                   
## 3  ( 1 )                                                                   
## 4  ( 1 )                                                                   
##          TeamSt..Louis.Cardinals TeamTampa.Bay.Rays TeamTexas.Rangers
## 1  ( 1 )                                                             
## 2  ( 1 )                                                             
## 3  ( 1 )                                                             
## 4  ( 1 )                                                            *
##          TeamToronto.Blue.Jays TeamWashington.Nationals LeagueNL Wins Losses
## 1  ( 1 )                                                                    
## 2  ( 1 )                                                                    
## 3  ( 1 )                                              *                     
## 4  ( 1 )                                              *                     
##          BattingAverage Runs Hits HR Doubles Triples RBI SB OBP SLG ERA
## 1  ( 1 )                                                               
## 2  ( 1 )                                                               
## 3  ( 1 )                                                               
## 4  ( 1 )                                                               
##          HitsAllowed Walks StrikeOuts Saves WHIP
## 1  ( 1 )                                        
## 2  ( 1 )                                        
## 3  ( 1 )                                        
## 4  ( 1 )

backwardmodel = lm(WinPct ~ Saves + WHIP + BattingAverage + Runs, data = mlb.df)
summary(backwardmodel)

## 
## Call:
## lm(formula = WinPct ~ Saves + WHIP + BattingAverage + Runs, data = mlb.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.04771 -0.01210 -0.00105  0.01774  0.04478 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.4641489  0.1401941   3.311  0.00283 ** 
## Saves           0.0028606  0.0006411   4.462  0.00015 ***
## WHIP           -0.4489248  0.0603730  -7.436 8.68e-08 ***
## BattingAverage  0.7818438  0.6840724   1.143  0.26390    
## Runs            0.0004269  0.0001162   3.674  0.00114 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02432 on 25 degrees of freedom
## Multiple R-squared:  0.8836, Adjusted R-squared:  0.865 
## F-statistic: 47.45 on 4 and 25 DF,  p-value: 2.538e-11

coef(backwardmodel)

##    (Intercept)          Saves           WHIP BattingAverage           Runs 
##    0.464148943    0.002860621   -0.448924811    0.781843787    0.000426943

# c - best subsets method
best = regsubsets(WinPct ~ ., data = mlb.df, method = "exhaustive")

## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 19 linear dependencies found

with(summary(best), data.frame(rsq, adjr2, cp, rss, outmat))

##                rsq     adjr2   cp          rss TeamAtlanta.Braves
## 1  ( 1 ) 0.9996343 0.9996212 -Inf 4.644970e-05                   
## 2  ( 1 ) 0.9999647 0.9999621 -Inf 4.481630e-06                   
## 3  ( 1 ) 0.9999728 0.9999696 -Inf 3.457235e-06                   
## 4  ( 1 ) 0.9999795 0.9999763 -Inf 2.598651e-06                  *
## 5  ( 1 ) 0.9999839 0.9999805 -Inf 2.044904e-06                  *
## 6  ( 1 ) 0.9999860 0.9999824 -Inf 1.775679e-06                  *
## 7  ( 1 ) 0.9999882 0.9999844 -Inf 1.504217e-06                  *
## 8  ( 1 ) 0.9999904 0.9999868 -Inf 1.214508e-06                  *
##          TeamBaltimore.Orioles TeamBoston.Red.Sox TeamChicago.Cubs
## 1  ( 1 )                                                          
## 2  ( 1 )                                                          
## 3  ( 1 )                                                         *
## 4  ( 1 )                                                          
## 5  ( 1 )                                                         *
## 6  ( 1 )                                                         *
## 7  ( 1 )                                                         *
## 8  ( 1 )                                                          
##          TeamChicago.White.Sox TeamCincinnati.Reds TeamCleveland.Indians
## 1  ( 1 )                                                                
## 2  ( 1 )                                                                
## 3  ( 1 )                                                                
## 4  ( 1 )                                                                
## 5  ( 1 )                                                                
## 6  ( 1 )                     *                                          
## 7  ( 1 )                                                                
## 8  ( 1 )                     *                                          
##          TeamColorado.Rockies TeamDetroit.Tigers TeamHouston.Astros
## 1  ( 1 )                                                           
## 2  ( 1 )                                                           
## 3  ( 1 )                                                           
## 4  ( 1 )                                                           
## 5  ( 1 )                                                           
## 6  ( 1 )                                                           
## 7  ( 1 )                                                          *
## 8  ( 1 )                                                           
##          TeamKansas.City.Royals TeamLos.Angeles.Angels TeamLos.Angeles.Dodgers
## 1  ( 1 )                                                                      
## 2  ( 1 )                                                                      
## 3  ( 1 )                                                                      
## 4  ( 1 )                                                                      
## 5  ( 1 )                                                                      
## 6  ( 1 )                                                                      
## 7  ( 1 )                                                                      
## 8  ( 1 )                                                                      
##          TeamMiami.Marlins TeamMilwaukee.Brewers TeamMinnesota.Twins
## 1  ( 1 )                                                            
## 2  ( 1 )                                                            
## 3  ( 1 )                                                            
## 4  ( 1 )                                                            
## 5  ( 1 )                                                            
## 6  ( 1 )                                                            
## 7  ( 1 )                                                            
## 8  ( 1 )                                                            
##          TeamNew.York.Mets TeamNew.York.Yankees TeamOakland.Athletics
## 1  ( 1 )                                                             
## 2  ( 1 )                                                             
## 3  ( 1 )                                                             
## 4  ( 1 )                                                             
## 5  ( 1 )                                                             
## 6  ( 1 )                                                             
## 7  ( 1 )                                      *                      
## 8  ( 1 )                                                             
##          TeamPhiladelphia.Phillies TeamPittsburgh.Pirates TeamSan.Diego.Padres
## 1  ( 1 )                                                                      
## 2  ( 1 )                                                                      
## 3  ( 1 )                                                                      
## 4  ( 1 )                                                *                     
## 5  ( 1 )                                                *                     
## 6  ( 1 )                                                *                     
## 7  ( 1 )                                                *                     
## 8  ( 1 )                                                                      
##          TeamSan.Francisco.Giants TeamSeattle.Mariners TeamSt..Louis.Cardinals
## 1  ( 1 )                                                                      
## 2  ( 1 )                                                                      
## 3  ( 1 )                                                                      
## 4  ( 1 )                                                                      
## 5  ( 1 )                                                                      
## 6  ( 1 )                                                                      
## 7  ( 1 )                                                                      
## 8  ( 1 )                                                                      
##          TeamTampa.Bay.Rays TeamTexas.Rangers TeamToronto.Blue.Jays
## 1  ( 1 )                                                           
## 2  ( 1 )                                                           
## 3  ( 1 )                                                           
## 4  ( 1 )                                                           
## 5  ( 1 )                                                           
## 6  ( 1 )                                                           
## 7  ( 1 )                                                           
## 8  ( 1 )                                                          *
##          TeamWashington.Nationals LeagueNL Wins Losses BattingAverage Runs Hits
## 1  ( 1 )                                             *                         
## 2  ( 1 )                                      *      *                         
## 3  ( 1 )                                      *      *                         
## 4  ( 1 )                                      *      *                         
## 5  ( 1 )                                      *      *                         
## 6  ( 1 )                                      *      *                         
## 7  ( 1 )                                      *      *                         
## 8  ( 1 )                        *             *      *                         
##          HR Doubles Triples RBI SB OBP SLG ERA HitsAllowed Walks StrikeOuts
## 1  ( 1 )                                                                   
## 2  ( 1 )                                                                   
## 3  ( 1 )                                                                   
## 4  ( 1 )                                                                   
## 5  ( 1 )                                                                   
## 6  ( 1 )                                                                   
## 7  ( 1 )                                                                   
## 8  ( 1 )                                                 *                 
##          Saves WHIP
## 1  ( 1 )           
## 2  ( 1 )           
## 3  ( 1 )           
## 4  ( 1 )           
## 5  ( 1 )           
## 6  ( 1 )           
## 7  ( 1 )           
## 8  ( 1 )     *

bestmodel = lm(WinPct ~ Runs + Saves + WHIP + BattingAverage, data = mlb.df)
summary(bestmodel)

## 
## Call:
## lm(formula = WinPct ~ Runs + Saves + WHIP + BattingAverage, data = mlb.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.04771 -0.01210 -0.00105  0.01774  0.04478 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.4641489  0.1401941   3.311  0.00283 ** 
## Runs            0.0004269  0.0001162   3.674  0.00114 ** 
## Saves           0.0028606  0.0006411   4.462  0.00015 ***
## WHIP           -0.4489248  0.0603730  -7.436 8.68e-08 ***
## BattingAverage  0.7818438  0.6840724   1.143  0.26390    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02432 on 25 degrees of freedom
## Multiple R-squared:  0.8836, Adjusted R-squared:  0.865 
## F-statistic: 47.45 on 4 and 25 DF,  p-value: 2.538e-11

# d - find mallow's CP
##ols_mallows_cp(forwardmodel, full.model)
##ols_mallows_cp(backwardmodel, full.model)
##ols_mallows_cp(bestmodel, full.model)

a.

Using a forward selection model, I ended with the formula WinPct = .305 + .0043(RBI) - 0.0915(ERA) + 0.7573(BattingAverage) + 0.0019(Saves) The R^2 value for this model is 0.8409

b.

I ended with R^2 value of 0.865, and the formula WinPct = 0.464 + 0.00286(Saves) - 0.4489(WHIP) + 0.7818(BattingAverage) + 0.0042(Runs)

c.

The predictors in this model are Runs, Saves, WHIP, and Batting Average, and it has an adjusted R^2 value of 0.865

d.

Mallow’s Cp for each model Forward model: 11.10941 Backward model: 11.83185 Best subsets model: 11.83185 Not all models gave the same variables, but I would personally use the best subsets model or the backwards elimination model because they had the greatest R^2 value, and therefore account for the most variability.

4.7

##{r} baseball = read.csv("~/downloads/BaseballTimes2017.csv") baseball = baseball %>% select(-Game, -League) # a - maximize R^2 maxrsq = lm(Time ~ Runs + Margin + Pitchers + Attendance, data = baseball) summary(maxrsq) # b - maximize adjusted R^2 bestbball = regsubsets(Time ~ ., data = baseball, method = "exhaustive") with(summary(bestbball), data.frame(rsq, adjr2, cp, rss, outmat)) ok1 = lm(Time ~ Runs + Attendance, data = baseball) summary(ok1) # c - minimize Mallow's Cp ols_mallows_cp(ok1, maxrsq) ##

a.

To maximize R^2, all variables are used for an R^2 value of 0.6451

b.

To maximize adjusted R^2, the variables Runs and Attendance are used for an adjusted R^2 of 0.535

c.

The minimum Mallow’s CP is 1.977722

4.11

rel = read.csv("~/downloads/RelGDP.csv")
# a
log.gdp = log(rel$GDP)
plot(log.gdp, rel$Religiosity)

# b
logrel = lm(log.gdp ~ Religiosity, data = rel)
summary(logrel)

## 
## Call:
## lm(formula = log.gdp ~ Religiosity, data = rel)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8387 -0.8108  0.1272  0.5833  3.5923 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.9961     0.3656  30.079  < 2e-16 ***
## Religiosity  -1.4013     0.2001  -7.005 1.43e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.085 on 42 degrees of freedom
## Multiple R-squared:  0.5388, Adjusted R-squared:  0.5278 
## F-statistic: 49.06 on 1 and 42 DF,  p-value: 1.432e-08

# d
studresids = studres(logrel)
plot(rel$Religiosity, studresids)

##### b.

52.78% of variability in log(GDP) is explained in this model.

c.

The coefficient of -1.4013 means that as GDP increases within a country, the religiosity decreases.

4.13

The coefficients in this model tell us that the birth weight of Black babies is very dependent on the mother’s race, while not as much so for Hispanic and babies of other races.

4.17

baseball = read.csv("~/downloads/BaseballTimes2017.csv")
cor(baseball$Runs, baseball$Time)

## [1] 0.7449071

x = baseball$Runs
y = baseball$Time
originalR = cor(x,y)
newY = sample(baseball$Time)
cor(x, newY)

## [1] -0.1546677

# repeat sampling procedure
n = 5000
permcorr = as.numeric(0)
for (i in 1:n) {
  newY = sample(baseball$Time)
  permcorr[i] = cor(x, newY)
}
permcorr[1:5]

## [1] -0.15024504  0.53084894 -0.47133220 -0.01667985 -0.38287843

hist(permcorr)

upper = sum(permcorr>abs(originalR))
upper

## [1] 9

lower = sum(permcorr<(-abs(originalR)))
lower

## [1] 0

pvalue = (upper + lower) / n
pvalue

## [1] 0.0018

a.

The correlation between runs and times is 0.744, suggesting they are related. Using R, I used a randomization test to sample the data n = 5000 times. The p-value for the test was 0.0012, which suggests that there is significant evidence that correlation between Runs and Time is greater than zero.

4.20

peak = read.csv("~/downloads/HighPeaks.csv")
# a
foura = lm(Length ~ Time, data = peak)
summary(foura)

## 
## Call:
## lm(formula = Length ~ Time, data = peak)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4112 -1.1636 -0.0413  1.0514  3.7743 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.10039    1.06739   1.031    0.308    
## Time         1.07711    0.09699  11.105 2.39e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.818 on 44 degrees of freedom
## Multiple R-squared:  0.737,  Adjusted R-squared:  0.7311 
## F-statistic: 123.3 on 1 and 44 DF,  p-value: 2.39e-14

confint(foura, level = 0.90)

##                    5 %     95 %
## (Intercept) -0.6930744 2.893854
## Time         0.9141373 1.240075

# b - bootstrap distribution
index = 1:46
bootsample = sample(index, replace = TRUE)
head(peak[bootsample,])

##                 Peak Elevation Difficulty Ascent Length Time
## 13        Nippletop       4620          5   4050   12.6 10.0
## 18     Panther Peak       4442          6   3762   17.6 13.5
## 20 Rocky Peak Ridge       4420          6   4500   13.4 11.0
## 43            Blake       3960          4   3270   13.6 12.0
## 21      Macomb Mtn.       4405          5   2344    8.4  8.0
## 15     Mt. Redfield       4606          7   3225   17.5 14.0

bootmodel = lm(Length ~ Time, data = peak[bootsample,])
summary(bootmodel)

## 
## Call:
## lm(formula = Length ~ Time, data = peak[bootsample, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6465 -1.5427 -0.0099  0.9428  3.7480 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3041     1.0634   2.167   0.0357 *  
## Time          0.9790     0.0889  11.012 3.14e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.815 on 44 degrees of freedom
## Multiple R-squared:  0.7338, Adjusted R-squared:  0.7277 
## F-statistic: 121.3 on 1 and 44 DF,  p-value: 3.144e-14

bootbetas = matrix(0,nrow = 5000, ncol = 2)
for (i in 1:5000) {
  bootsample = sample(index, replace = TRUE)
  bootmodel = lm(Length ~ Time, data = peak[bootsample,])
  bootbetas[i,] = coef(bootmodel)
}
hist(bootbetas[,2])

# c
sd(bootbetas[,2])

## [1] 0.1222331

mean(bootbetas[,2])

## [1] 1.092596

quantile(bootbetas[,2],.05)

##        5% 
## 0.9159288

quantile(bootbetas[,2], .95)

##      95% 
## 1.315015

a.

A 90% confidence interval for time is (1.24, 2.89). This means that the average hiker will hike at a speed from anywhere from 1.24 to 2.89 miles on average.

b.

This distribution suggests that our confidence interval was correct, as most of the observations are centered around the middle of the interval. This is a normal distribution that is skewed right.

c.

The standard deviation is 0.118, and the mean is 1.09. The standard deviation is much lesser than the original model predicted.

d.

The confidence interval from the bootstrap distribution is (0.9252, 1.3076)