knitr::opts_chunk$set(message=FALSE, warning=FALSE, echo=TRUE, fig.height=10, fig.width=7, cache=FALSE)

#Kevin Kuipers (Completed by myself) ##01/22/2019

1. Question 3.7.5 pg 121

Given: \(\hat{y}_{i} = x_{i}\hat{\beta}\)

Where: \(\hat{\beta} = \sum_{i=1}^{n}x_{i}y_{i}/\sum_{k=1}^{n}x_{k}^{2}\)

Show: \(\hat{y}_{i} = \sum_{j=1}^{n}a_{j}y_{j}\)

Substituting \(\hat{\beta}\) (renaming the summation index to \(j\) to avoid clashing with the fixed \(i\)):

\(\hat{y}_{i} = x_{i}\left(\sum_{j=1}^{n}x_{j}y_{j}\right)/\left(\sum_{k=1}^{n}x_{k}^{2}\right)\)

\(= \sum_{j=1}^{n}\left(x_{i}x_{j}/\sum_{k=1}^{n}x_{k}^{2}\right)y_{j}\)

\(= \sum_{j=1}^{n}a_{j}y_{j}\), where \(a_{j} = x_{i}x_{j}/\sum_{k=1}^{n}x_{k}^{2}\).
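As a quick numerical sanity check (a small sketch with simulated data; all object names are my own), the directly computed fitted value and the weighted sum \(\sum_{j=1}^{n}a_{j}y_{j}\) agree:

set.seed(1)
x <- rnorm(20)
y <- 2*x + rnorm(20)

#least squares slope for regression through the origin
beta_hat <- sum(x*y)/sum(x^2)

#check the identity for a single observation i
i <- 5
a <- x[i]*x/sum(x^2)          #a_j = x_i * x_j / sum_k x_k^2
c(x[i]*beta_hat, sum(a*y))    #the two numbers should match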

2. Question 3.7.10 pg 123

#Problem 10 a)

First I load the data set from the ISLR library, then fit a multiple regression model to predict Sales using Price, Urban, and US. Then I output the summary of the model.

Sales represents unit sales (in thousands) at each location. Price represents the price the company charges for car seats at each site. Urban is a factor with levels No and Yes indicating whether the store is in an urban or rural location. US is a factor with levels No and Yes indicating whether the store is in the US or not.

Because Urban and US are factor variables, lm() will encode dummy variables for them: UrbanYes = 1 for urban stores and 0 otherwise, and USYes = 1 for US stores and 0 otherwise.
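The exact 0/1 coding can be inspected with base R's contrasts() (a small self-contained sketch; it loads the data here so it can be run on its own):

#the 0/1 dummy coding lm() will use for each factor
library(ISLR)
data(Carseats)
contrasts(Carseats$Urban)
contrasts(Carseats$US)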

set.seed(123)
#Loading data set
library(ISLR)
data(Carseats)



#Fitting multiple regression model
mod_1 <- lm(Sales ~ Price + Urban + US, data=Carseats)

#summary of regression model
summary(mod_1)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

#Problem 10 b)

It appears that Price and USYes are highly significant, since they have very small p-values, while UrbanYes is not significant in the regression model. Price and Sales have a negative relationship.

#Problem 10 c)

\(Sales = 13.043469 - 0.054459 \cdot Price - 0.021916 \cdot UrbanYes + 1.200573 \cdot USYes\)
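To see the equation in action, here is a sketch that predicts Sales for a hypothetical urban US store with Price = 100 (values chosen purely for illustration), first with predict() and then by hand from the coefficients:

#prediction from the fitted model
predict(mod_1, newdata=data.frame(Price=100, Urban='Yes', US='Yes'))

#the same number by hand from the fitted equation
13.043469 - 0.054459*100 - 0.021916*1 + 1.200573*1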

#Problem 10 d)

For which of the predictors can you reject the null hypothesis \(H_{0} : \beta_{j} = 0\)? We can reject it for Price and USYes, which have very small p-values; we cannot reject it for UrbanYes, which shows no significance in predicting Sales.

#Problem 10 e)

Now I will fit a multiple regression model for predicting Sales based on Price and US.

mod_2 <- lm(Sales ~ Price + US, data=Carseats)
summary(mod_2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

#Problem 10 f)

I will use an ANOVA test to compare the two models.

anova(mod_1, mod_2)

Looking over the summaries of the first model compared to the second, the second appears to fit slightly better: it achieves the same \(R^{2}\) with one fewer predictor, so its adjusted \(R^{2}\) is higher. However, both \(R^{2}\) values are rather low (about 0.24).
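The adjusted \(R^{2}\) values can also be pulled out directly to confirm this (a small sketch using the models fit above):

#adjusted R^2 for each model
summary(mod_1)$adj.r.squared
summary(mod_2)$adj.r.squared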

#Problem 10 g)

Now I obtain the 95% confidence intervals for the second model from part e).

confint(mod_2)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

#Problem 10 h)

Now I will check whether there are outliers or high-leverage observations in the second model from part e) by plotting the model diagnostics.

par(mfrow=c(2,2))
plot(mod_2)

It appears there are only about 3 outliers, but nothing too drastic. The same applies to the leverage points: there appear to be about 3 high-leverage points.
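The same diagnostics can be checked numerically (a sketch; the thresholds are common rules of thumb, not part of the original analysis): studentized residuals beyond |3| flag potential outliers, and hat values well above the average leverage (p + 1)/n flag high-leverage points.

#potential outliers: studentized residuals beyond |3|
which(abs(rstudent(mod_2)) > 3)

#high-leverage points: hat values above twice the average (p + 1)/n
p <- 2
n <- nrow(Carseats)
which(hatvalues(mod_2) > 2*(p + 1)/n)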

3. Question 3.7.15 pg 126

Now I will load the Boston data set from the MASS library.

library(MASS)
data(Boston)

#Problem 15 a)

Now I will fit a simple linear regression model for each predictor to predict the response (crim), and output a table giving each predictor's estimate, std. error, t value, and p-value. I will also produce a plot of the p-values for each predictor. However, due to the extreme range of values, I will limit the y-axis to run from the minimum p-value up to the median p-value across all the predictors. The table is the exhaustive list of all the predictors and their coefficients from each model.

#Fitting all the individual simple linear regression models
mod_15zn <- lm(crim ~ zn, data=Boston)
mod_15indus <- lm(crim ~ indus, data=Boston)
mod_15chas <- lm(crim ~ chas, data=Boston)
mod_15nox <- lm(crim ~ nox, data=Boston)
mod_15rm <- lm(crim ~ rm, data=Boston)
mod_15age <- lm(crim ~ age, data=Boston)
mod_15dis <- lm(crim ~ dis, data=Boston)
mod_15rad <- lm(crim ~ rad, data=Boston)
mod_15tax <- lm(crim ~ tax, data=Boston)
mod_15ptratio <- lm(crim ~ ptratio, data=Boston)
mod_15black <- lm(crim ~ black, data=Boston)
mod_15lstat <- lm(crim ~ lstat, data=Boston)
mod_15medv <- lm(crim ~ medv, data=Boston)


#grabbing the slope coefficient row (estimate, std. error, t value, p-value) from each model
znc <- summary(mod_15zn)$coefficients[2,]
indusc <- summary(mod_15indus)$coefficients[2,]
chasc <- summary(mod_15chas)$coefficients[2,]
noxc <- summary(mod_15nox)$coefficients[2,]
rmc <- summary(mod_15rm)$coefficients[2,]
agec <-summary(mod_15age)$coefficients[2,]
disc <- summary(mod_15dis)$coefficients[2,]
radc <- summary(mod_15rad)$coefficients[2,]
taxc <- summary(mod_15tax)$coefficients[2,]
ptratioc <- summary(mod_15ptratio)$coefficients[2,]
blackc <- summary(mod_15black)$coefficients[2,]
lstatc <- summary(mod_15lstat)$coefficients[2,]
medvc <- summary(mod_15medv)$coefficients[2,]

#putting all summary coefficients from each model in a data.frame
bos_summary <- data.frame (
  zn = znc,
  indus = indusc,
  chas = chasc,
  nox = noxc,
  rm = rmc,
  age = agec,
  dis = disc,
  rad = radc,
  tax = taxc,
  ptratio = ptratioc,
  black = blackc,
  lstat = lstatc,
  medv = medvc
)

#transposing the data.frame with base R's t() (no extra package is needed)
t_bos_summary <- t(bos_summary)

t_bos_summary <- as.data.frame(t_bos_summary)

#grabbing the column names and row names and assigning them to the transposed data
colnames(t_bos_summary) <- rownames(bos_summary)
rownames(t_bos_summary) <- colnames(bos_summary)

#adding a predictor column from the row names
t_bos_summary <- cbind(predictor = rownames(t_bos_summary), t_bos_summary)

rownames(t_bos_summary) <-NULL

colnames(t_bos_summary)[5] <- 'pvalue'

#table ordered from the highest significance to the least
t_bos_summary <- t_bos_summary[order(t_bos_summary[,5]),]
t_bos_summary
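As an aside, the same table can be built more compactly by looping over the predictor names (a sketch using base R's sapply() and reformulate(); the object names are my own):

#compact alternative: one simple regression per predictor
preds <- setdiff(names(Boston), 'crim')
coef_tab <- t(sapply(preds, function(p)
  summary(lm(reformulate(p, response='crim'), data=Boston))$coefficients[2,]))
coef_tab[order(coef_tab[,4]),]   #sorted by p-value, smallest first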
#install.packages('tidyverse')

#plotting the pvalues for a given predictor using ggplot
library(tidyverse)
ggplot(data=t_bos_summary, aes(predictor, pvalue)) + geom_point() + ylim(min(t_bos_summary$pvalue), median(t_bos_summary$pvalue)) + labs(x='Predictor', y='P-Value', title='P-values For Predictors')

#regular plot() command for plotting pvalues for a given predictor
plot(pvalue ~ predictor, data=t_bos_summary, ylim=c(min(t_bos_summary$pvalue), median(t_bos_summary$pvalue)), main='P-values for Predictor', xlab='Predictor', ylab='P-value')

The graphs show that lstat, nox, rad, and tax seem to have the highest significance. However, the table provides the exhaustive list for every predictor, sorted from most significant to least: rad has the highest significance for crim, followed by tax, lstat, and nox, in that order.

#Problem 15 b)

Now I will fit a multiple regression model using all the predictors in one model.

mod_15all <- lm(crim ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + black + lstat + medv, data=Boston)

summary(mod_15all)
## 
## Call:
## lm(formula = crim ~ zn + indus + chas + nox + rm + age + dis + 
##     rad + tax + ptratio + black + lstat + medv, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.924 -2.120 -0.353  1.019 75.051 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17.033228   7.234903   2.354 0.018949 *  
## zn            0.044855   0.018734   2.394 0.017025 *  
## indus        -0.063855   0.083407  -0.766 0.444294    
## chas         -0.749134   1.180147  -0.635 0.525867    
## nox         -10.313535   5.275536  -1.955 0.051152 .  
## rm            0.430131   0.612830   0.702 0.483089    
## age           0.001452   0.017925   0.081 0.935488    
## dis          -0.987176   0.281817  -3.503 0.000502 ***
## rad           0.588209   0.088049   6.680 6.46e-11 ***
## tax          -0.003780   0.005156  -0.733 0.463793    
## ptratio      -0.271081   0.186450  -1.454 0.146611    
## black        -0.007538   0.003673  -2.052 0.040702 *  
## lstat         0.126211   0.075725   1.667 0.096208 .  
## medv         -0.198887   0.060516  -3.287 0.001087 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared:  0.454,  Adjusted R-squared:  0.4396 
## F-statistic: 31.47 on 13 and 492 DF,  p-value: < 2.2e-16

For which predictors can we reject the null hypothesis \(H_{0} : \beta_{j} = 0\)? At the 5% level we may reject it for zn, dis, rad, black, and medv; nox (p ≈ 0.051) and lstat (p ≈ 0.096) are only marginal.
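As an aside, the same all-predictor model can be written more compactly with the formula shorthand crim ~ ., which includes every other column of Boston as a predictor:

#equivalent fit using the dot shorthand
mod_15all_dot <- lm(crim ~ ., data=Boston)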

#Problem 15 c)

I will produce a scatter plot of the coefficients from the individual regression models against the coefficients from the model containing all predictors. I will first do this using the base R plot() command and then ggplot. In order to do this, I will grab the coefficients from all the individual models and from the model with all the predictors, and put them into a data frame.

#Grabbing the slope coefficients from the model that includes all the predictors,
#dropping the intercept but keeping the names
all_coef <- mod_15all$coefficients[-1]

#pairing each multiple-regression coefficient with the simple-regression
#estimate for the same predictor (t_bos_summary was reordered by p-value,
#so the rows must be matched by name rather than by position)
dat_15 <- data.frame(
  all_coef = all_coef,
  indiv_coef = t_bos_summary$Estimate[match(names(all_coef), t_bos_summary$predictor)]
)

plot(all_coef ~ indiv_coef, data=dat_15, main="Individual Coefficient Model vs All Coefficients Model", xlab='Individual Coefficients',ylab='All Coefficients')

ggplot(data=dat_15, aes(indiv_coef, all_coef)) + geom_point() + labs(title="Individual Coefficient Model vs All Coefficients Model", x='Individual Coefficients',y='All Coefficients')

It appears there is a difference between the coefficients estimated when all predictors are included in one model and the separate coefficients from each individual model. This makes sense: a model with only one covariate attributes everything to that one slope, whereas a model with all the covariates must account for the other predictors when estimating each slope.
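A sketch comparing the two estimates for a single predictor, e.g. nox, makes the gap concrete (both models were fit above):

#slope for nox alone vs its coefficient in the full model
coef(mod_15nox)['nox']
coef(mod_15all)['nox']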

#Problem 15 d)

Now I will fit all the individual models with only one covariate, in the form of a cubic polynomial. Then I will compare their first-degree coefficients against the coefficients from the all-predictor model from part b), producing a scatterplot like in the previous step. Note that chas is excluded here, since it is a binary dummy variable and a cubic polynomial of it makes no sense.
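One caveat worth noting before the fits: poly() builds orthogonal polynomial terms by default, so the reported coefficients are not the raw coefficients of \(x, x^{2}, x^{3}\); passing raw=TRUE would give the raw ones. A quick sketch of the difference for one predictor:

#orthogonal (default) vs raw polynomial columns
head(poly(Boston$dis, 3))
head(poly(Boston$dis, 3, raw=TRUE))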

#Fitting all the individual cubic polynomial models (chas is excluded since it is binary)
mod_15zn <- lm(crim ~ poly(zn, 3), data=Boston)
mod_15indus <- lm(crim ~ poly(indus, 3), data=Boston)
mod_15nox <- lm(crim ~ poly(nox, 3), data=Boston)
mod_15rm <- lm(crim ~ poly(rm, 3), data=Boston)
mod_15age <- lm(crim ~ poly(age, 3), data=Boston)
mod_15dis <- lm(crim ~ poly(dis, 3), data=Boston)
mod_15rad <- lm(crim ~ poly(rad, 3), data=Boston)
mod_15tax <- lm(crim ~ poly(tax, 3), data=Boston)
mod_15ptratio <- lm(crim ~ poly(ptratio, 3), data=Boston)
mod_15black <- lm(crim ~ poly(black, 3), data=Boston)
mod_15lstat <- lm(crim ~ poly(lstat, 3), data=Boston)
mod_15medv <- lm(crim ~ poly(medv, 3), data=Boston)



#grabbing the first-degree polynomial coefficient row from each cubic model
znc <- summary(mod_15zn)$coefficients[2,]
indusc <- summary(mod_15indus)$coefficients[2,]
noxc <- summary(mod_15nox)$coefficients[2,]
rmc <- summary(mod_15rm)$coefficients[2,]
agec <-summary(mod_15age)$coefficients[2,]
disc <- summary(mod_15dis)$coefficients[2,]
radc <- summary(mod_15rad)$coefficients[2,]
taxc <- summary(mod_15tax)$coefficients[2,]
ptratioc <- summary(mod_15ptratio)$coefficients[2,]
blackc <- summary(mod_15black)$coefficients[2,]
lstatc <- summary(mod_15lstat)$coefficients[2,]
medvc <- summary(mod_15medv)$coefficients[2,]

#putting all summary coefficients from each model in a data.frame
bos_summary <- data.frame (
  zn = znc,
  indus = indusc,
  nox = noxc,
  rm = rmc,
  age = agec,
  dis = disc,
  rad = radc,
  tax = taxc,
  ptratio = ptratioc,
  black = blackc,
  lstat = lstatc,
  medv = medvc
)

#transposing the data.frame with base R's t() (no extra package is needed)
t_bos_summary <- t(bos_summary)

t_bos_summary <- as.data.frame(t_bos_summary)

#grabbing the column names and row names and assigning them to the transposed data
colnames(t_bos_summary) <- rownames(bos_summary)
rownames(t_bos_summary) <- colnames(bos_summary)

#adding a predictor column from the row names
t_bos_summary <- cbind(predictor = rownames(t_bos_summary), t_bos_summary)

rownames(t_bos_summary) <-NULL

colnames(t_bos_summary)[5] <- 'pvalue'

#table ordered from the highest significance to the least
t_bos_summary <- t_bos_summary[order(t_bos_summary[,5]),]
t_bos_summary
#Grabbing the slope coefficients from the all-predictor model from part b),
#dropping the intercept and chas (chas was not refit with a polynomial)
all_coef <- mod_15all$coefficients[-1]
all_coef <- all_coef[names(all_coef) != 'chas']

#pairing each coefficient with the first-degree polynomial estimate for the
#same predictor, matching by name since t_bos_summary is ordered by p-value
dat_15 <- data.frame(
  all_coef = all_coef,
  indiv_coef = t_bos_summary$Estimate[match(names(all_coef), t_bos_summary$predictor)]
)



plot(all_coef ~ indiv_coef, data=dat_15, main="Individual Coefficient Model vs All Coefficients Model", xlab='Individual Coefficients',ylab='All Coefficients')

ggplot(data=dat_15, aes(indiv_coef, all_coef)) + geom_point() + labs(title="Individual Coefficient Model vs All Coefficients Model", x='Individual Coefficients',y='All Coefficients')

It appears there is a much different pattern when each individual covariate model is fit with a cubic polynomial for predicting crime rate. The graphs above show a different relationship than the graphs in part c). For tax, lstat, rm, zn, and rad as the predictor, the p-values indicate the cubic coefficients are not significant. However, for nox, medv, dis, indus, and ptratio, the p-values suggest an association for a cubic fit.
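One way to back up the nonlinearity claim for a given predictor is to compare the linear and cubic fits directly with anova() (a sketch for dis; the same comparison can be run for any predictor):

#nested-model test: does the cubic fit improve on the linear fit for dis?
anova(lm(crim ~ dis, data=Boston), lm(crim ~ poly(dis, 3), data=Boston))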