This week I worked through Chapter 4 of the Stat2 book on Additional Topics in Regression. This chapter introduced many new ways to continue to verify and analyze a multiple regression model, such as through added variable plots or Mallow’s Cp.
Coming out of last week, I had still felt a bit weary when choosing variables to create a MR model, but the book introduced more methods to ensure the best variables are chosen, therefore the best model is created. Working with the same data from the book examples in R, I created a best subsets model, which is a computer output that suggests how many variables would create the best model, as well as which variables to use to create said model. The book’s model showed an R^2 and Mallow’s Cp value alongside the best subsets model, which is incredibly helpful, but something I am still attempting to learn how to do within R. I am aware that the predictors the computer picks as the “best” predictors are simply those with the highest R^2 values, but I feel it would be helpful to see them, which is why I am still on my journey of exploring how to do this in R.
Speaking of Mallow’s Cp, I learned that it is another statistic used to check the effectiveness of a MR model. Mallow’s Cp greatly takes into account the amount of predictors within the model, as a Mallow’s Cp that indicates a good model is usually close to the number of predictors plus 1, or another constant. When a MCp is far greater than the number of predictors plus the constant, the model is usually not a great fit for the data. When doing work in R, my output showed an “AIC” column, which was introduced into the book. The AIC, or Akaike’s information criterion, is another measure to evaluate how a model fits the data, but uses logs within its calculations. Both of these statistics are useful in the search for the perfect model, as they allow us to compare different models alongside each other by a simple number. Although it is important to account for many other things and not choose a model by a singular statistic, this value is helpful to weed out models that are far off.
Backwards, or stepwise elimination was introduced in this chapter, and is another essential tool in the search of a close to perfect MR model. The function works through fitting a full model of every predictor variable within a dataset, and then eliminating predictor variables by their p-values (if less than 5%, keep it, if greater, get rid of it). When using forward regression, the same methodology is used, but it starts with a model with no predictors, and searches for predictors that add a significant increase to R^2. When doing this, I kept in mind that R^2 increases when a new predictor is added, and looked at the adjusted R^2 value rather than the original R^2.
When doing the homework, I was able to employ a lot of what I had learned, such as in problem 4.2. Here, I created a scatterplot matrix to observe predictors with strong relationships, used forward regression to find the best variables to predict for Time. In R, I am still learning how to create added variable plots, so the plot for that statistic is not included in the homework.
I love R as it allows me to use bigger datasets than a calculator would, and do bigger things with said data. For example, in problem 4.17, I used the function permcorr() and a for() function (which I wouldn’t have known how to do without my compsci class!) to test a set of data n = 5000 times to test for its significance.
After last week, I still felt a bit weary about the topic of multiple regression, specifically in choosing variables, but I feel that this week chapter eased my worries. I know of many more ways to check for statistical significance and how to choose a good model, which are essential for my understanding of the topic.
## import libraries
library(car)
## Loading required package: carData
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(leaps)
library(MASS)
## Warning: package 'MASS' was built under R version 4.1.2
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## 4.2
peak = read.csv("~/downloads/HighPeaks.csv")
attach(peak)
# a
plot(peak$Elevation, peak$Time, pch=16)
cor(peak$Elevation, peak$Time)
## [1] -0.0162768
# b
peakmr = lm(Time ~ Elevation + Length, data = peak)
summary(peakmr)
##
## Call:
## lm(formula = Time ~ Elevation + Length, data = peak)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5924 -0.8050 -0.1959 0.6380 3.8432
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.0753787 2.5327132 3.188 0.00267 **
## Elevation -0.0014483 0.0005805 -2.495 0.01653 *
## Length 0.7123344 0.0593330 12.006 2.54e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.37 on 43 degrees of freedom
## Multiple R-squared: 0.7703, Adjusted R-squared: 0.7596
## F-statistic: 72.09 on 2 and 43 DF, p-value: 1.844e-14
elevation = lm(Time ~ Elevation, data = peak)
summary(elevation)
##
## Call:
## lm(formula = Time ~ Elevation, data = peak)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6912 -1.6985 -0.5639 1.2963 7.3015
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.2113764 5.1953800 2.158 0.0364 *
## Elevation -0.0001269 0.0011756 -0.108 0.9145
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.826 on 44 degrees of freedom
## Multiple R-squared: 0.0002649, Adjusted R-squared: -0.02246
## F-statistic: 0.01166 on 1 and 44 DF, p-value: 0.9145
length = lm(Time ~ Length, data = peak)
summary(length)
##
## Call:
## lm(formula = Time ~ Length, data = peak)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4491 -0.6687 -0.0122 0.5590 4.0034
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.04817 0.80371 2.548 0.0144 *
## Length 0.68427 0.06162 11.105 2.39e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.449 on 44 degrees of freedom
## Multiple R-squared: 0.737, Adjusted R-squared: 0.7311
## F-statistic: 123.3 on 1 and 44 DF, p-value: 2.39e-14
# c
# residuals + graph of time on length
advp1 = lm(Time ~ Length, data = peak)
cor(peak$Time, peak$Length)
## [1] 0.8585079
resadvp1 = resid(advp1)
plot(fitted(advp1), resadvp1, xlab = "Predicted Time",
ylab = "Resid1", main = "Residuals of Time on Length")
abline(0,0)
summary(resadvp1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.44914 -0.66867 -0.01216 0.00000 0.55897 4.00344
summary(advp1)
##
## Call:
## lm(formula = Time ~ Length, data = peak)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4491 -0.6687 -0.0122 0.5590 4.0034
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.04817 0.80371 2.548 0.0144 *
## Length 0.68427 0.06162 11.105 2.39e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.449 on 44 degrees of freedom
## Multiple R-squared: 0.737, Adjusted R-squared: 0.7311
## F-statistic: 123.3 on 1 and 44 DF, p-value: 2.39e-14
plot(peak$Length, peak$Time, xlab = "Length", ylab = "Time",
main = "Length and Time", abline(lm(peak$Time ~ peak$Length)))
# residuals + graph of time on elevation
advp2 = lm(Time ~ Elevation, data = peak)
cor(peak$Time, peak$Elevation)
## [1] -0.0162768
resadvp2 = resid(advp2)
plot(fitted(advp2), resadvp2, xlab = "Predicted Time",
ylab = "Resid2", main = "Residuals of Time on Elevation")
abline(0,0)
plot(peak$Elevation, peak$Time, xlab = "Elevation", ylab = "Time",
main = "Elevation and Time", abline(lm(peak$Time ~ peak$Elevation)))
summary(advp2)
##
## Call:
## lm(formula = Time ~ Elevation, data = peak)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6912 -1.6985 -0.5639 1.2963 7.3015
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.2113764 5.1953800 2.158 0.0364 *
## Elevation -0.0001269 0.0011756 -0.108 0.9145
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.826 on 44 degrees of freedom
## Multiple R-squared: 0.0002649, Adjusted R-squared: -0.02246
## F-statistic: 0.01166 on 1 and 44 DF, p-value: 0.9145
# added variable plots
error1 = advp1$residuals
error2 = advp2$residuals
error1.error2lm = lm(error1 ~ error2)
error1.error2lm
##
## Call:
## lm(formula = error1 ~ error2)
##
## Coefficients:
## (Intercept) error2
## 1.267e-16 2.601e-01
plot(error2, error1, main = "Added Variable Plot")
abline(error1.error2lm)
abline(0,0)
cor(error1, error2)
## [1] 0.507185
cor(resadvp1, resadvp2)
## [1] 0.507185
# build in function
avPlots(advp1, col = 1)
# mr model
mrpeak = lm(Time ~ Elevation + Length, data = peak)
summary(mrpeak)
##
## Call:
## lm(formula = Time ~ Elevation + Length, data = peak)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5924 -0.8050 -0.1959 0.6380 3.8432
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.0753787 2.5327132 3.188 0.00267 **
## Elevation -0.0014483 0.0005805 -2.495 0.01653 *
## Length 0.7123344 0.0593330 12.006 2.54e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.37 on 43 degrees of freedom
## Multiple R-squared: 0.7703, Adjusted R-squared: 0.7596
## F-statistic: 72.09 on 2 and 43 DF, p-value: 1.844e-14
res = avPlots(lm(Time ~ Elevation + Length, data = peak))
fit = lsfit(res$Elevation[,1], res$Elevation[,2])
fit$coefficients
## Intercept X
## 1.066786e-16 -1.448269e-03
fitlength = lsfit(res$Length[,1], res$Length[,2])
fitlength$coefficients
## Intercept X
## 6.301114e-17 7.123344e-01
The correlation is -0.016 between the variables, therefore, Elevation does not appear to be a helpful predictor for Time
Model with both predictors = 8.07 - 0.00144(Elevation) + 0.7123(Length), R^2 = 0.7703
Elevation only = 11.21 - 0.00012(Elevation), R^2 = 0.0026
Length only = 2.048 + 0.684(Length), R^2 = 0.737
Elevation is not very important in this model. In the model with both predictors, it’s small factor tells us that it doesn’t affect the model that much, while length’s higher factor tells us length is more important. The elevation only model accounts for 0.2% of variance in the model, while length only accounts for 73.7%, but combined, the multiple regression model accounts for 77.03%, making it the optimal model.
Here, the equation of the linear regression line of the added variable plot is Resid1 = 0 + 0.712(Resid2) These plots show that
mlb.df = read.csv("~/downloads/MLBStandings2016.csv")
names(mlb.df)
## [1] "Team" "League" "Wins" "Losses"
## [5] "WinPct" "BattingAverage" "Runs" "Hits"
## [9] "HR" "Doubles" "Triples" "RBI"
## [13] "SB" "OBP" "SLG" "ERA"
## [17] "HitsAllowed" "Walks" "StrikeOuts" "Saves"
## [21] "WHIP"
pairs(mlb.df[,c(3,4,5,6,7,8,9)])
fullmlb = lm(WinPct ~ BattingAverage + Runs + Hits + HR + Doubles + Triples + RBI + SB + OBP + SLG + ERA + HitsAllowed + Walks + StrikeOuts + Saves + WHIP, data = mlb.df)
summary(fullmlb)
##
## Call:
## lm(formula = WinPct ~ BattingAverage + Runs + Hits + HR + Doubles +
## Triples + RBI + SB + OBP + SLG + ERA + HitsAllowed + Walks +
## StrikeOuts + Saves + WHIP, data = mlb.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.038561 -0.010917 0.002214 0.010956 0.027615
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.850e-01 4.214e-01 1.151 0.2705
## BattingAverage 1.549e+01 1.846e+01 0.839 0.4164
## Runs 1.767e-03 9.560e-04 1.848 0.0875 .
## Hits -1.489e-03 1.257e-03 -1.185 0.2574
## HR 2.030e-03 6.853e-03 0.296 0.7718
## Doubles 2.990e-04 2.232e-03 0.134 0.8955
## Triples 1.579e-03 4.769e-03 0.331 0.7458
## RBI -1.306e-03 9.747e-04 -1.340 0.2032
## SB -6.973e-05 1.757e-04 -0.397 0.6979
## OBP -2.374e+00 1.053e+00 -2.254 0.0421 *
## SLG -3.250e+00 1.269e+01 -0.256 0.8019
## ERA -3.687e-02 3.486e-02 -1.058 0.3094
## HitsAllowed 1.011e-03 5.846e-04 1.729 0.1074
## Walks 1.104e-03 5.874e-04 1.879 0.0829 .
## StrikeOuts -7.472e-05 9.117e-05 -0.820 0.4272
## Saves 2.328e-03 9.135e-04 2.548 0.0243 *
## WHIP -1.834e+00 8.672e-01 -2.114 0.0544 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02153 on 13 degrees of freedom
## Multiple R-squared: 0.9525, Adjusted R-squared: 0.8941
## F-statistic: 16.31 on 16 and 13 DF, p-value: 4.541e-06
# a - forward selection
##mlb.df = mlb.df %>%
##select(-Wins, -Losses, -Team)
full.model = lm(WinPct ~ ., data = mlb.df)
forward = regsubsets(WinPct ~ ., data=mlb.df, nbest = 1, nvmax = 4, method = "forward")
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 19 linear dependencies found
with(summary(forward), data.frame(cp, outmat))
## cp TeamAtlanta.Braves TeamBaltimore.Orioles TeamBoston.Red.Sox
## 1 ( 1 ) -Inf
## 2 ( 1 ) -Inf
## 3 ( 1 ) -Inf
## 4 ( 1 ) -Inf *
## TeamChicago.Cubs TeamChicago.White.Sox TeamCincinnati.Reds
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 ) *
## 4 ( 1 ) *
## TeamCleveland.Indians TeamColorado.Rockies TeamDetroit.Tigers
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## TeamHouston.Astros TeamKansas.City.Royals TeamLos.Angeles.Angels
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## TeamLos.Angeles.Dodgers TeamMiami.Marlins TeamMilwaukee.Brewers
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## TeamMinnesota.Twins TeamNew.York.Mets TeamNew.York.Yankees
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## TeamOakland.Athletics TeamPhiladelphia.Phillies TeamPittsburgh.Pirates
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## TeamSan.Diego.Padres TeamSan.Francisco.Giants TeamSeattle.Mariners
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## TeamSt..Louis.Cardinals TeamTampa.Bay.Rays TeamTexas.Rangers
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## TeamToronto.Blue.Jays TeamWashington.Nationals LeagueNL Wins Losses
## 1 ( 1 ) *
## 2 ( 1 ) * *
## 3 ( 1 ) * *
## 4 ( 1 ) * *
## BattingAverage Runs Hits HR Doubles Triples RBI SB OBP SLG ERA
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## HitsAllowed Walks StrikeOuts Saves WHIP
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
forwardmodel = lm(WinPct ~ Runs + ERA + Saves + WHIP, data = mlb.df)
summary(forwardmodel)
##
## Call:
## lm(formula = WinPct ~ Runs + ERA + Saves + WHIP, data = mlb.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.051472 -0.017986 -0.001991 0.017048 0.047963
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.160e-01 1.186e-01 4.351 0.000201 ***
## Runs 5.187e-04 7.764e-05 6.681 5.31e-07 ***
## ERA -3.636e-02 2.626e-02 -1.385 0.178402
## Saves 2.643e-03 6.788e-04 3.893 0.000652 ***
## WHIP -2.658e-01 1.275e-01 -2.085 0.047457 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02404 on 25 degrees of freedom
## Multiple R-squared: 0.8863, Adjusted R-squared: 0.8681
## F-statistic: 48.7 on 4 and 25 DF, p-value: 1.91e-11
# b - backward elimination
backward = regsubsets(WinPct ~ ., data = mlb.df, nbest = 1, nvmax = 4, method = "backward")
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 19 linear dependencies found
summary(backward)
## Subset selection object
## Call: regsubsets.formula(WinPct ~ ., data = mlb.df, nbest = 1, nvmax = 4,
## method = "backward")
## 48 Variables (and intercept)
## Forced in Forced out
## TeamAtlanta Braves FALSE FALSE
## TeamBaltimore Orioles FALSE FALSE
## TeamBoston Red Sox FALSE FALSE
## TeamChicago Cubs FALSE FALSE
## TeamChicago White Sox FALSE FALSE
## TeamCincinnati Reds FALSE FALSE
## TeamCleveland Indians FALSE FALSE
## TeamColorado Rockies FALSE FALSE
## TeamDetroit Tigers FALSE FALSE
## TeamHouston Astros FALSE FALSE
## TeamKansas City Royals FALSE FALSE
## TeamLos Angeles Angels FALSE FALSE
## TeamLos Angeles Dodgers FALSE FALSE
## TeamMiami Marlins FALSE FALSE
## TeamMilwaukee Brewers FALSE FALSE
## TeamMinnesota Twins FALSE FALSE
## TeamNew York Mets FALSE FALSE
## TeamNew York Yankees FALSE FALSE
## TeamOakland Athletics FALSE FALSE
## TeamPhiladelphia Phillies FALSE FALSE
## TeamPittsburgh Pirates FALSE FALSE
## TeamSan Diego Padres FALSE FALSE
## TeamSan Francisco Giants FALSE FALSE
## TeamSeattle Mariners FALSE FALSE
## TeamSt. Louis Cardinals FALSE FALSE
## TeamTampa Bay Rays FALSE FALSE
## TeamTexas Rangers FALSE FALSE
## TeamToronto Blue Jays FALSE FALSE
## TeamWashington Nationals FALSE FALSE
## LeagueNL FALSE FALSE
## Wins FALSE FALSE
## Losses FALSE FALSE
## BattingAverage FALSE FALSE
## Runs FALSE FALSE
## Hits FALSE FALSE
## HR FALSE FALSE
## Doubles FALSE FALSE
## Triples FALSE FALSE
## RBI FALSE FALSE
## SB FALSE FALSE
## OBP FALSE FALSE
## SLG FALSE FALSE
## ERA FALSE FALSE
## HitsAllowed FALSE FALSE
## Walks FALSE FALSE
## StrikeOuts FALSE FALSE
## Saves FALSE FALSE
## WHIP FALSE FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: backward
## TeamAtlanta Braves TeamBaltimore Orioles TeamBoston Red Sox
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## TeamChicago Cubs TeamChicago White Sox TeamCincinnati Reds
## 1 ( 1 ) "*" " " " "
## 2 ( 1 ) "*" " " " "
## 3 ( 1 ) "*" " " " "
## 4 ( 1 ) "*" " " " "
## TeamCleveland Indians TeamColorado Rockies TeamDetroit Tigers
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## TeamHouston Astros TeamKansas City Royals TeamLos Angeles Angels
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## TeamLos Angeles Dodgers TeamMiami Marlins TeamMilwaukee Brewers
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## TeamMinnesota Twins TeamNew York Mets TeamNew York Yankees
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) "*" " " " "
## 3 ( 1 ) "*" " " " "
## 4 ( 1 ) "*" " " " "
## TeamOakland Athletics TeamPhiladelphia Phillies TeamPittsburgh Pirates
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## TeamSan Diego Padres TeamSan Francisco Giants TeamSeattle Mariners
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## TeamSt. Louis Cardinals TeamTampa Bay Rays TeamTexas Rangers
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " "*"
## TeamToronto Blue Jays TeamWashington Nationals LeagueNL Wins Losses
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " "
## 4 ( 1 ) " " "*" " " " " " "
## BattingAverage Runs Hits HR Doubles Triples RBI SB OBP SLG ERA
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " " " " " " " " " " " "
## HitsAllowed Walks StrikeOuts Saves WHIP
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " "
with(summary(backward), data.frame(cp,outmat))
## cp TeamAtlanta.Braves TeamBaltimore.Orioles TeamBoston.Red.Sox
## 1 ( 1 ) -Inf
## 2 ( 1 ) -Inf
## 3 ( 1 ) -Inf
## 4 ( 1 ) -Inf
## TeamChicago.Cubs TeamChicago.White.Sox TeamCincinnati.Reds
## 1 ( 1 ) *
## 2 ( 1 ) *
## 3 ( 1 ) *
## 4 ( 1 ) *
## TeamCleveland.Indians TeamColorado.Rockies TeamDetroit.Tigers
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## TeamHouston.Astros TeamKansas.City.Royals TeamLos.Angeles.Angels
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## TeamLos.Angeles.Dodgers TeamMiami.Marlins TeamMilwaukee.Brewers
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## TeamMinnesota.Twins TeamNew.York.Mets TeamNew.York.Yankees
## 1 ( 1 )
## 2 ( 1 ) *
## 3 ( 1 ) *
## 4 ( 1 ) *
## TeamOakland.Athletics TeamPhiladelphia.Phillies TeamPittsburgh.Pirates
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## TeamSan.Diego.Padres TeamSan.Francisco.Giants TeamSeattle.Mariners
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## TeamSt..Louis.Cardinals TeamTampa.Bay.Rays TeamTexas.Rangers
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 ) *
## TeamToronto.Blue.Jays TeamWashington.Nationals LeagueNL Wins Losses
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 ) *
## 4 ( 1 ) *
## BattingAverage Runs Hits HR Doubles Triples RBI SB OBP SLG ERA
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## HitsAllowed Walks StrikeOuts Saves WHIP
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
backwardmodel = lm(WinPct ~ Saves + WHIP + BattingAverage + Runs, data = mlb.df)
summary(backwardmodel)
##
## Call:
## lm(formula = WinPct ~ Saves + WHIP + BattingAverage + Runs, data = mlb.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.04771 -0.01210 -0.00105 0.01774 0.04478
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4641489 0.1401941 3.311 0.00283 **
## Saves 0.0028606 0.0006411 4.462 0.00015 ***
## WHIP -0.4489248 0.0603730 -7.436 8.68e-08 ***
## BattingAverage 0.7818438 0.6840724 1.143 0.26390
## Runs 0.0004269 0.0001162 3.674 0.00114 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02432 on 25 degrees of freedom
## Multiple R-squared: 0.8836, Adjusted R-squared: 0.865
## F-statistic: 47.45 on 4 and 25 DF, p-value: 2.538e-11
coef(backwardmodel)
## (Intercept) Saves WHIP BattingAverage Runs
## 0.464148943 0.002860621 -0.448924811 0.781843787 0.000426943
# c - best subsets method
best = regsubsets(WinPct ~ ., data = mlb.df, method = "exhaustive")
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 19 linear dependencies found
with(summary(best), data.frame(rsq, adjr2, cp, rss, outmat))
## rsq adjr2 cp rss TeamAtlanta.Braves
## 1 ( 1 ) 0.9996343 0.9996212 -Inf 4.644970e-05
## 2 ( 1 ) 0.9999647 0.9999621 -Inf 4.481630e-06
## 3 ( 1 ) 0.9999728 0.9999696 -Inf 3.457235e-06
## 4 ( 1 ) 0.9999795 0.9999763 -Inf 2.598651e-06 *
## 5 ( 1 ) 0.9999839 0.9999805 -Inf 2.044904e-06 *
## 6 ( 1 ) 0.9999860 0.9999824 -Inf 1.775679e-06 *
## 7 ( 1 ) 0.9999882 0.9999844 -Inf 1.504217e-06 *
## 8 ( 1 ) 0.9999904 0.9999868 -Inf 1.214508e-06 *
## TeamBaltimore.Orioles TeamBoston.Red.Sox TeamChicago.Cubs
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 ) *
## 4 ( 1 )
## 5 ( 1 ) *
## 6 ( 1 ) *
## 7 ( 1 ) *
## 8 ( 1 )
## TeamChicago.White.Sox TeamCincinnati.Reds TeamCleveland.Indians
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## 5 ( 1 )
## 6 ( 1 ) *
## 7 ( 1 )
## 8 ( 1 ) *
## TeamColorado.Rockies TeamDetroit.Tigers TeamHouston.Astros
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## 5 ( 1 )
## 6 ( 1 )
## 7 ( 1 ) *
## 8 ( 1 )
## TeamKansas.City.Royals TeamLos.Angeles.Angels TeamLos.Angeles.Dodgers
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## 5 ( 1 )
## 6 ( 1 )
## 7 ( 1 )
## 8 ( 1 )
## TeamMiami.Marlins TeamMilwaukee.Brewers TeamMinnesota.Twins
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## 5 ( 1 )
## 6 ( 1 )
## 7 ( 1 )
## 8 ( 1 )
## TeamNew.York.Mets TeamNew.York.Yankees TeamOakland.Athletics
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## 5 ( 1 )
## 6 ( 1 )
## 7 ( 1 ) *
## 8 ( 1 )
## TeamPhiladelphia.Phillies TeamPittsburgh.Pirates TeamSan.Diego.Padres
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 ) *
## 5 ( 1 ) *
## 6 ( 1 ) *
## 7 ( 1 ) *
## 8 ( 1 )
## TeamSan.Francisco.Giants TeamSeattle.Mariners TeamSt..Louis.Cardinals
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## 5 ( 1 )
## 6 ( 1 )
## 7 ( 1 )
## 8 ( 1 )
## TeamTampa.Bay.Rays TeamTexas.Rangers TeamToronto.Blue.Jays
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## 5 ( 1 )
## 6 ( 1 )
## 7 ( 1 )
## 8 ( 1 ) *
## TeamWashington.Nationals LeagueNL Wins Losses BattingAverage Runs Hits
## 1 ( 1 ) *
## 2 ( 1 ) * *
## 3 ( 1 ) * *
## 4 ( 1 ) * *
## 5 ( 1 ) * *
## 6 ( 1 ) * *
## 7 ( 1 ) * *
## 8 ( 1 ) * * *
## HR Doubles Triples RBI SB OBP SLG ERA HitsAllowed Walks StrikeOuts
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## 5 ( 1 )
## 6 ( 1 )
## 7 ( 1 )
## 8 ( 1 ) *
## Saves WHIP
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## 5 ( 1 )
## 6 ( 1 )
## 7 ( 1 )
## 8 ( 1 ) *
bestmodel = lm(WinPct ~ Runs + Saves + WHIP + BattingAverage, data = mlb.df)
summary(bestmodel)
##
## Call:
## lm(formula = WinPct ~ Runs + Saves + WHIP + BattingAverage, data = mlb.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.04771 -0.01210 -0.00105 0.01774 0.04478
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4641489 0.1401941 3.311 0.00283 **
## Runs 0.0004269 0.0001162 3.674 0.00114 **
## Saves 0.0028606 0.0006411 4.462 0.00015 ***
## WHIP -0.4489248 0.0603730 -7.436 8.68e-08 ***
## BattingAverage 0.7818438 0.6840724 1.143 0.26390
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02432 on 25 degrees of freedom
## Multiple R-squared: 0.8836, Adjusted R-squared: 0.865
## F-statistic: 47.45 on 4 and 25 DF, p-value: 2.538e-11
# d - find mallow's CP
##ols_mallows_cp(forwardmodel, full.model)
##ols_mallows_cp(backwardmodel, full.model)
##ols_mallows_cp(bestmodel, full.model)
Using a forward selection model, I ended with the formula WinPct = .305 + .0043(RBI) - 0.0915(ERA) + 0.7573(BattingAverage) + 0.0019(Saves) The R^2 value for this model is 0.8409
I ended with R^2 value of 0.865, and the formula WinPct = 0.464 + 0.00286(Saves) - 0.4489(WHIP) + 0.7818(BattingAverage) + 0.0042(Runs)
The predictors in this model are Runs, Saves, WHIP, and Batting Average, and it has an adjusted R^2 value of 0.865
Mallow’s Cp for each model Forward model: 11.10941 Backward model: 11.83185 Best subsets model: 11.83185 Not all models gave the same variables, but I would personally use the best subsets model or the backwards elimination model because they had the greatest R^2 value, and therefore account for the most variability.
##{r} baseball = read.csv("~/downloads/BaseballTimes2017.csv") baseball = baseball %>% select(-Game, -League) # a - maximize R^2 maxrsq = lm(Time ~ Runs + Margin + Pitchers + Attendance, data = baseball) summary(maxrsq) # b - maximize adjusted R^2 bestbball = regsubsets(Time ~ ., data = baseball, method = "exhaustive") with(summary(bestbball), data.frame(rsq, adjr2, cp, rss, outmat)) ok1 = lm(Time ~ Runs + Attendance, data = baseball) summary(ok1) # c - minimize Mallow's Cp ols_mallows_cp(ok1, maxrsq) ##
To maximize R^2, all variables are used for an R^2 value of 0.6451
To maximize adjusted R^2, the variables Runs and Attendance are used for an adjusted R^2 of 0.535
The minimum Mallow’s CP is 1.977722
rel = read.csv("~/downloads/RelGDP.csv")
# a
log.gdp = log(rel$GDP)
plot(log.gdp, rel$Religiosity)
# b
logrel = lm(log.gdp ~ Religiosity, data = rel)
summary(logrel)
##
## Call:
## lm(formula = log.gdp ~ Religiosity, data = rel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8387 -0.8108 0.1272 0.5833 3.5923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.9961 0.3656 30.079 < 2e-16 ***
## Religiosity -1.4013 0.2001 -7.005 1.43e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.085 on 42 degrees of freedom
## Multiple R-squared: 0.5388, Adjusted R-squared: 0.5278
## F-statistic: 49.06 on 1 and 42 DF, p-value: 1.432e-08
# d
studresids = studres(logrel)
plot(rel$Religiosity, studresids)
##### b.
52.78% of variability in log(GDP) is explained in this model.
The coefficient of -1.4013 means that as GDP increases within a country, the religiosity decreases.
The coefficients in this model tell us that the birth weight of Black babies is very dependent on the mother’s race, while not as much so for Hispanic and babies of other races.
baseball = read.csv("~/downloads/BaseballTimes2017.csv")
cor(baseball$Runs, baseball$Time)
## [1] 0.7449071
x = baseball$Runs
y = baseball$Time
originalR = cor(x,y)
newY = sample(baseball$Time)
cor(x, newY)
## [1] -0.1546677
# repeat sampling procedure
n = 5000
permcorr = as.numeric(0)
for (i in 1:n) {
newY = sample(baseball$Time)
permcorr[i] = cor(x, newY)
}
permcorr[1:5]
## [1] -0.15024504 0.53084894 -0.47133220 -0.01667985 -0.38287843
hist(permcorr)
upper = sum(permcorr>abs(originalR))
upper
## [1] 9
lower = sum(permcorr<(-abs(originalR)))
lower
## [1] 0
pvalue = (upper + lower) / n
pvalue
## [1] 0.0018
The correlation between runs and times is 0.744, suggesting they are related. Using R, I used a randomization test to sample the data n = 5000 times. The p-value for the test was 0.0012, which suggests that there is significant evidence that correlation between Runs and Time is greater than zero.
peak = read.csv("~/downloads/HighPeaks.csv")
# a
foura = lm(Length ~ Time, data = peak)
summary(foura)
##
## Call:
## lm(formula = Length ~ Time, data = peak)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4112 -1.1636 -0.0413 1.0514 3.7743
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.10039 1.06739 1.031 0.308
## Time 1.07711 0.09699 11.105 2.39e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.818 on 44 degrees of freedom
## Multiple R-squared: 0.737, Adjusted R-squared: 0.7311
## F-statistic: 123.3 on 1 and 44 DF, p-value: 2.39e-14
confint(foura, level = 0.90)
## 5 % 95 %
## (Intercept) -0.6930744 2.893854
## Time 0.9141373 1.240075
# b - bootstrap distribution
index = 1:46
bootsample = sample(index, replace = TRUE)
head(peak[bootsample,])
## Peak Elevation Difficulty Ascent Length Time
## 13 Nippletop 4620 5 4050 12.6 10.0
## 18 Panther Peak 4442 6 3762 17.6 13.5
## 20 Rocky Peak Ridge 4420 6 4500 13.4 11.0
## 43 Blake 3960 4 3270 13.6 12.0
## 21 Macomb Mtn. 4405 5 2344 8.4 8.0
## 15 Mt. Redfield 4606 7 3225 17.5 14.0
bootmodel = lm(Length ~ Time, data = peak[bootsample,])
summary(bootmodel)
##
## Call:
## lm(formula = Length ~ Time, data = peak[bootsample, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6465 -1.5427 -0.0099 0.9428 3.7480
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3041 1.0634 2.167 0.0357 *
## Time 0.9790 0.0889 11.012 3.14e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.815 on 44 degrees of freedom
## Multiple R-squared: 0.7338, Adjusted R-squared: 0.7277
## F-statistic: 121.3 on 1 and 44 DF, p-value: 3.144e-14
bootbetas = matrix(0,nrow = 5000, ncol = 2)
for (i in 1:5000) {
bootsample = sample(index, replace = TRUE)
bootmodel = lm(Length ~ Time, data = peak[bootsample,])
bootbetas[i,] = coef(bootmodel)
}
hist(bootbetas[,2])
# c
sd(bootbetas[,2])
## [1] 0.1222331
mean(bootbetas[,2])
## [1] 1.092596
quantile(bootbetas[,2],.05)
## 5%
## 0.9159288
quantile(bootbetas[,2], .95)
## 95%
## 1.315015
A 90% confidence interval for time is (1.24, 2.89). This means that the average hiker will hike at a speed from anywhere from 1.24 to 2.89 miles on average.
This distribution suggests that our confidence interval was correct, as most of the observations are centered around the middle of the interval. This is a normal distribution that is skewed right.
The standard deviation is 0.118, and the mean is 1.09. The standard deviation is much lesser than the original model predicted.
The confidence interval from the bootstrap distribution is (0.9252, 1.3076)