CAP 3330. Summer 2023
Assignment 6 (100 points)
Instructions
Create an R notebook and include all the code and interpretations/explanations needed to answer the three questions below.
The submission for this assignment consists of uploading to Canvas the R notebook with the answers for all the questions below. You can submit either the HTML version of the notebook OR an exported PDF. ONLY these two formats will be accepted !!! (DO NOT submit the .Rmd file)
Make sure the document that you submit (HTML or the PDF) includes:
The R code you used for each exercise part.
The R output you got after running each line of code.
Your written answers (if needed). If a question asks you to justify/explain, your notebook must include a text section with the justification/explanation.
Dataset
The dataset you are going to use for this assignment is called Hitters_Fixed, which is a cleaned version of the Hitters dataset. First, you are going to read the Hitters dataset; then, you will clean it to get the Hitters_Fixed dataset. Take the following steps to do so:
Load the “ISLR” package in R. The “ISLR” package contains the Hitters dataset.
Remove the missing data from the Hitters dataset and save the results into Hitters_Fixed. Run the following line of code to do so:
Hitters_Fixed = na.omit(Hitters)
That is all. Now you can use the Hitters_Fixed dataset to do this assignment! Note: If you want more info about the variables in the Hitters_Fixed dataset, you can get it using this link: https://docs.google.com/document/d/1qKeEoWVnAkrlPDwuq7Uz65ybOzrXpBC2pplRbabcgJ0/edit?usp=sharing
install.packages("ISLR")
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:
https://cran.rstudio.com/bin/windows/Rtools/
Installing package into ‘C:/Users/gisse/AppData/Local/R/win-library/4.3’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.3/ISLR_1.4.zip'
Content type 'application/zip' length 2924200 bytes (2.8 MB)
downloaded 2.8 MB
package ‘ISLR’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\gisse\AppData\Local\Temp\Rtmp2jdNZP\downloaded_packages
library(ISLR)
Hitters <- ISLR::Hitters
str(Hitters)
'data.frame': 322 obs. of 20 variables:
$ AtBat : int 293 315 479 496 321 594 185 298 323 401 ...
$ Hits : int 66 81 130 141 87 169 37 73 81 92 ...
$ HmRun : int 1 7 18 20 10 4 1 0 6 17 ...
$ Runs : int 30 24 66 65 39 74 23 24 26 49 ...
$ RBI : int 29 38 72 78 42 51 8 24 32 66 ...
$ Walks : int 14 39 76 37 30 35 21 7 8 65 ...
$ Years : int 1 14 3 11 2 11 2 3 2 13 ...
$ CAtBat : int 293 3449 1624 5628 396 4408 214 509 341 5206 ...
$ CHits : int 66 835 457 1575 101 1133 42 108 86 1332 ...
$ CHmRun : int 1 69 63 225 12 19 1 0 6 253 ...
$ CRuns : int 30 321 224 828 48 501 30 41 32 784 ...
$ CRBI : int 29 414 266 838 46 336 9 37 34 890 ...
$ CWalks : int 14 375 263 354 33 194 24 12 8 866 ...
$ League : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
$ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
$ PutOuts : int 446 632 880 200 805 282 76 121 143 0 ...
$ Assists : int 33 43 82 11 40 421 127 283 290 0 ...
$ Errors : int 20 10 14 3 4 25 7 9 19 0 ...
$ Salary : num NA 475 480 500 91.5 750 70 100 75 1100 ...
$ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...
summary(Hitters)
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division
Min. : 16.0 Min. : 1 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 1.000 Min. : 19.0 Min. : 4.0 Min. : 0.00 Min. : 1.0 Min. : 0.00 Min. : 0.00 A:175 E:157
1st Qu.:255.2 1st Qu.: 64 1st Qu.: 4.00 1st Qu.: 30.25 1st Qu.: 28.00 1st Qu.: 22.00 1st Qu.: 4.000 1st Qu.: 816.8 1st Qu.: 209.0 1st Qu.: 14.00 1st Qu.: 100.2 1st Qu.: 88.75 1st Qu.: 67.25 N:147 W:165
Median :379.5 Median : 96 Median : 8.00 Median : 48.00 Median : 44.00 Median : 35.00 Median : 6.000 Median : 1928.0 Median : 508.0 Median : 37.50 Median : 247.0 Median : 220.50 Median : 170.50
Mean :380.9 Mean :101 Mean :10.77 Mean : 50.91 Mean : 48.03 Mean : 38.74 Mean : 7.444 Mean : 2648.7 Mean : 717.6 Mean : 69.49 Mean : 358.8 Mean : 330.12 Mean : 260.24
3rd Qu.:512.0 3rd Qu.:137 3rd Qu.:16.00 3rd Qu.: 69.00 3rd Qu.: 64.75 3rd Qu.: 53.00 3rd Qu.:11.000 3rd Qu.: 3924.2 3rd Qu.:1059.2 3rd Qu.: 90.00 3rd Qu.: 526.2 3rd Qu.: 426.25 3rd Qu.: 339.25
Max. :687.0 Max. :238 Max. :40.00 Max. :130.00 Max. :121.00 Max. :105.00 Max. :24.000 Max. :14053.0 Max. :4256.0 Max. :548.00 Max. :2165.0 Max. :1659.00 Max. :1566.00
PutOuts Assists Errors Salary NewLeague
Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 67.5 A:176
1st Qu.: 109.2 1st Qu.: 7.0 1st Qu.: 3.00 1st Qu.: 190.0 N:146
Median : 212.0 Median : 39.5 Median : 6.00 Median : 425.0
Mean : 288.9 Mean :106.9 Mean : 8.04 Mean : 535.9
3rd Qu.: 325.0 3rd Qu.:166.0 3rd Qu.:11.00 3rd Qu.: 750.0
Max. :1378.0 Max. :492.0 Max. :32.00 Max. :2460.0
NA's :59
Hitters_Fixed <- na.omit(Hitters)
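A quick sanity check on the cleaning step, as a sketch that assumes the ISLR package is installed as above. The summary() output reports 59 missing values, all in Salary, so dropping incomplete rows should leave 322 - 59 = 263 players. The variable names na_per_column and rows_removed are introduced here for illustration only.

```r
library(ISLR)

# Count missing values per column; per the summary() output only Salary has NAs
na_per_column <- colSums(is.na(Hitters))
na_per_column[na_per_column > 0]

# Remove incomplete rows and compare the row counts before and after
Hitters_Fixed <- na.omit(Hitters)
rows_removed <- nrow(Hitters) - nrow(Hitters_Fixed)
rows_removed

# No missing values should remain after cleaning
anyNA(Hitters_Fixed)
```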
Questions
We are going to obtain a regression equation to predict “Salary” (i.e., Salary is the outcome variable).
Question 1) Conduct a regression analysis using “Hits” as the only predictor. Answer the following questions:
plot(Hitters_Fixed$Hits, Hitters_Fixed$Salary)
abline(lm(Salary~Hits, data = Hitters_Fixed), col = "red")
abline(h = mean(Hitters_Fixed$Salary), col = "blue")
sum_lm <- lm(Salary~Hits,data=Hitters_Fixed)
summary(sum_lm)
Call:
lm(formula = Salary ~ Hits, data = Hitters_Fixed)
Residuals:
Min 1Q Median 3Q Max
-893.99 -245.63 -59.08 181.12 2059.90
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 63.0488 64.9822 0.970 0.333
Hits 4.3854 0.5561 7.886 8.53e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 406.2 on 261 degrees of freedom
Multiple R-squared: 0.1924, Adjusted R-squared: 0.1893
F-statistic: 62.19 on 1 and 261 DF, p-value: 8.531e-14
Hits and Salary are related. The p-value for the Hits coefficient (8.53e-14) is far below 0.05, so we can conclude that the relationship between Hits and Salary is statistically significant.
Estimated regression equation: Salary = 63.0488 + 4.3854 * Hits
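To make the equation concrete, here is a minimal sketch that plugs a few values of Hits into it. The coefficients are copied from the summary(sum_lm) output above; predict_salary is a hypothetical helper, not part of the original notebook. According to the ISLR documentation, Salary is recorded in thousands of dollars.

```r
# Coefficients copied from the summary(sum_lm) output above
intercept <- 63.0488
slope_hits <- 4.3854

# Hypothetical helper: predicted Salary (thousands of dollars) for a given number of Hits
predict_salary <- function(hits) {
  intercept + slope_hits * hits
}

# A player with 100 hits: 63.0488 + 4.3854 * 100 = 501.5888
predict_salary(100)

# Each additional hit adds the slope (4.3854) to the predicted salary
predict_salary(101) - predict_salary(100)
```

With the fitted model object available, predict(sum_lm, newdata = data.frame(Hits = 100)) should give the same value up to the rounding of the printed coefficients.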
Question 2) Use the same regression analysis from question 1 to answer this question.
Assume that Assumption 2 regarding the validity of the linear regression analysis is satisfied. Run an analysis to check for the validity of all other assumptions. Comment on your findings.
par(mfrow=c(2,2))
plot(sum_lm)
Taking “Assumption 2” as satisfied, we look at Plot 1 (Residuals vs Fitted Values). The residuals show a non-linear pattern, so Assumption 1 is not satisfied. This suggests that a quadratic model would be more appropriate than a linear one.
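The visual diagnostics can be supplemented with a few numeric checks. The sketch below is one possible approach, not the assignment's prescribed method. It assumes the ISLR package is installed, and the object names fit_linear, fit_quadratic and fitted_upper_half are introduced here for illustration. It tests whether a squared Hits term improves the fit, tests the residuals for normality, and compares the residual spread for low and high fitted values as a rough check of constant variance.

```r
library(ISLR)

Hitters_Fixed <- na.omit(Hitters)

# Straight-line model from Question 1 and a candidate quadratic model
fit_linear <- lm(Salary ~ Hits, data = Hitters_Fixed)
fit_quadratic <- lm(Salary ~ Hits + I(Hits^2), data = Hitters_Fixed)

# Nested-model F test: a small p-value suggests the squared term helps
anova(fit_linear, fit_quadratic)

# Shapiro-Wilk test: a small p-value suggests non-normal residuals
shapiro.test(residuals(fit_linear))

# Rough constant-variance check: residual SD below vs above the median fitted value
fitted_upper_half <- fitted(fit_linear) > median(fitted(fit_linear))
tapply(residuals(fit_linear), fitted_upper_half, sd)
```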
Question 3) Apply the best subset selection method to find a good multiple linear equation. DO NOT INCLUDE in the analysis the following four predictors:
CHits, CAtBat, CRuns, CRBI
library(leaps)
best_subset_Salary <- regsubsets(Salary~.-CHits -CAtBat -CRuns -CRBI, data=Hitters_Fixed, nvmax=13)
summary(best_subset_Salary)
Subset selection object
Call: regsubsets.formula(Salary ~ . - CHits - CAtBat - CRuns - CRBI,
data = Hitters_Fixed, nvmax = 13)
15 Variables (and intercept)
Forced in Forced out
AtBat FALSE FALSE
Hits FALSE FALSE
HmRun FALSE FALSE
Runs FALSE FALSE
RBI FALSE FALSE
Walks FALSE FALSE
Years FALSE FALSE
CHmRun FALSE FALSE
CWalks FALSE FALSE
LeagueN FALSE FALSE
DivisionW FALSE FALSE
PutOuts FALSE FALSE
Assists FALSE FALSE
Errors FALSE FALSE
NewLeagueN FALSE FALSE
1 subsets of each size up to 13
Selection Algorithm: exhaustive
AtBat Hits HmRun Runs RBI Walks Years CHmRun CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " " " " " "
2 ( 1 ) " " "*" " " " " " " " " " " "*" " " " " " " " " " " " " " "
3 ( 1 ) " " "*" " " " " " " " " " " "*" " " " " " " "*" " " " " " "
4 ( 1 ) "*" "*" " " " " " " " " " " "*" " " " " " " "*" " " " " " "
5 ( 1 ) "*" "*" " " " " " " " " " " "*" " " " " "*" "*" " " " " " "
6 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*" " " " " " "
7 ( 1 ) "*" "*" " " " " " " "*" "*" "*" " " " " "*" "*" " " " " " "
8 ( 1 ) "*" "*" " " " " " " "*" "*" "*" " " " " "*" "*" "*" " " " "
9 ( 1 ) "*" "*" " " " " " " "*" "*" "*" " " "*" "*" "*" "*" " " " "
10 ( 1 ) "*" "*" "*" " " " " "*" "*" "*" " " "*" "*" "*" "*" " " " "
11 ( 1 ) "*" "*" "*" " " " " "*" "*" "*" " " "*" "*" "*" "*" "*" " "
12 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" " " "*" "*" "*" "*" "*" " "
13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" " "
summary(best_subset_Salary)$adjr2
[1] 0.2727764 0.3902841 0.4136565 0.4410945 0.4578190 0.4699300 0.4759338 0.4781049 0.4790087 0.4780628 0.4766292 0.4745954 0.4725383
adjR2_vector <- round(summary(best_subset_Salary)$adjr2,3)
# Name the columns with `=`; using `<-` inside data.frame() creates a stray global variable
a <- data.frame(model_size = 1:13, adjR2 = adjR2_vector)
a
which.max(adjR2_vector)
[1] 9
coef(best_subset_Salary, 9)
which.max() shows that the maximum adjusted R2 occurs at position 9 (0.4790). Adding a 10th predictor lowers the adjusted R2 to 0.4781, so the extra predictor does not improve the model enough to justify the added complexity. It is better to use the equation with 9 predictors.
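Adjusted R2 is only one way to pick a model size. As a sketch (assuming the ISLR and leaps packages are installed), the same regsubsets summary also reports Mallows' Cp and BIC, which can be tabulated side by side. The names subset_summary, criteria and best_sizes are introduced here for illustration, and the criteria may not all agree on the same size.

```r
library(ISLR)
library(leaps)

Hitters_Fixed <- na.omit(Hitters)

# Same best subset search as above
best_subset_Salary <- regsubsets(Salary ~ . - CHits - CAtBat - CRuns - CRBI,
                                 data = Hitters_Fixed, nvmax = 13)
subset_summary <- summary(best_subset_Salary)

# One row per model size with three selection criteria
criteria <- data.frame(
  model_size = 1:13,
  adjR2 = round(subset_summary$adjr2, 3),
  cp = round(subset_summary$cp, 2),
  bic = round(subset_summary$bic, 2)
)
criteria

# Model size preferred by each criterion (higher adjR2 is better; lower Cp and BIC are better)
best_sizes <- c(
  adjR2 = which.max(subset_summary$adjr2),
  cp = which.min(subset_summary$cp),
  bic = which.min(subset_summary$bic)
)
best_sizes
```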
summary(best_subset_Salary)$rsq[9]
[1] 0.4969054
Based on the output above, nearly 50% (0.497) of the total variability in Salary is explained by the nine-predictor model.
I would say that the prediction error of this equation is still a bit high: about half of the variance in the dependent variable (Salary) is left unexplained by the nine predictors.
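The R2 of 0.4969 and the adjusted R2 of 0.4790 reported for the nine-predictor model are linked by the usual formula, adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1). The sketch below checks this by hand. It takes n = 263 from the Question 1 output (261 residual degrees of freedom plus 2 estimated coefficients); the variable names are introduced here for illustration.

```r
# Values copied from the output above
r_squared <- 0.4969054  # summary(best_subset_Salary)$rsq[9]
n <- 263                # observations: 261 residual df + 2 coefficients in Question 1
p <- 9                  # predictors in the selected model

# Adjusted R2 penalizes R2 for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# Should agree with summary(best_subset_Salary)$adjr2[9] = 0.4790087 up to rounding
adj_r_squared
```

The penalty factor (n - 1)/(n - p - 1) grows with each added predictor, which is why adjusted R2 can fall even though R2 never decreases when predictors are added.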