Assignment 11

Problem set 1. LINEAR REGRESSION IN R

Using R’s lm function, perform regression analysis and measure the significance of the independent variables for the following two data sets. In the first case, you are evaluating the commonly heard statement that a person's maximum heart rate is related to their age by the following equation: MaxHR = 220 − Age. You have been given the following sample:

Age    18  23  25  35  65  54  34  56  72  19  23  42  18  39  37
MaxHR 202 186 187 180 156 169 174 172 153 199 193 174 198 183 178

age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
hr <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)

df <- data.frame(age, hr)

fit <- lm(hr ~ age, df)
fit

## 
## Call:
## lm(formula = hr ~ age, data = df)
## 
## Coefficients:
## (Intercept)          age  
##    210.0485      -0.7977

#The equation for the best fit line is Y = -0.7977x + 210.0485
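As a rough illustration of how the fitted line can be used, the sketch below refits the model on the sample above and predicts the maximum heart rate for a single age. The age of 40 and the names new_person, pred, and manual are assumptions chosen only for illustration.

```r
# Refit the simple regression from the sample above
age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
hr <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
df <- data.frame(age, hr)
fit <- lm(hr ~ age, data = df)

# Hypothetical new observation (age 40 is an arbitrary illustrative value)
new_person <- data.frame(age = 40)

# Point prediction together with a 95% prediction interval
pred <- predict(fit, newdata = new_person, interval = "prediction", level = 0.95)
print(pred)

# The point prediction is simply intercept + slope * age
manual <- unname(coef(fit)["(Intercept)"] + coef(fit)["age"] * 40)
print(manual)
```

The prediction interval is wider than a confidence interval for the mean response, because it also accounts for the scatter of individual observations around the line.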


summary(fit)

## 
## Call:
## lm(formula = hr ~ age, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9258 -2.5383  0.3879  3.1867  6.6242 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 210.04846    2.86694   73.27  < 2e-16 ***
## age          -0.79773    0.06996  -11.40 3.85e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.578 on 13 degrees of freedom
## Multiple R-squared:  0.9091, Adjusted R-squared:  0.9021 
## F-statistic:   130 on 1 and 13 DF,  p-value: 3.848e-08

#The relationship (not necessarily a causal effect) between age and max heart rate is significant at the 0.001 level.
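A significant slope only shows that the slope differs from zero. One way to examine the specific claim MaxHR = 220 − Age is to check whether an intercept of 220 and a slope of −1 fall inside the 95% confidence intervals of the fitted coefficients. The following is a minimal sketch of that check; the names ci, intercept_ok, and slope_ok are illustrative.

```r
# Refit the simple regression from the sample above
age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
hr <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
fit <- lm(hr ~ age, data = data.frame(age, hr))

# 95% confidence intervals for the intercept and the slope
ci <- confint(fit, level = 0.95)
print(ci)

# Does the claimed intercept (220) lie inside its interval?
intercept_ok <- ci["(Intercept)", "2.5 %"] <= 220 && 220 <= ci["(Intercept)", "97.5 %"]

# Does the claimed slope (-1) lie inside its interval?
slope_ok <- ci["age", "2.5 %"] <= -1 && -1 <= ci["age", "97.5 %"]

print(c(intercept_ok = intercept_ok, slope_ok = slope_ok))
```

For this sample neither 220 nor −1 lies inside its interval, which suggests the data are not consistent with the exact rule even though age is a strong predictor of maximum heart rate.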


library(ggplot2)

#linear regression plot
ggplot(df, aes(x = age, y = hr)) + geom_point() + geom_smooth(method = "lm")

Using the Auto data set from Assignment 5 (also attached here), perform a linear regression analysis using mpg as the dependent variable and the other 4 (displacement, horsepower, weight, acceleration) as independent variables. What is the final linear regression fit equation? Which of the 4 independent variables have a significant impact on mpg? What are their corresponding significance levels? What are the standard errors on each of the coefficients? Please perform this experiment in two ways. First take any random 40 data points from the entire auto data sample and perform the linear regression fit and measure the 95% confidence intervals. Then, take the entire data set (all 392 points) and perform linear regression and measure the 95% confidence intervals. Please report the resulting fit equation, their significance values and confidence intervals for each of the two runs. Please submit an R-markdown file documenting your experiments. Your submission should include the final linear fits, and their corresponding significance levels. In addition, you should clearly state what you concluded from looking at the fit and their significance levels.

cars <- read.table("https://raw.githubusercontent.com/bkreis84/Math/master/auto-mpg.data") 

colnames(cars) <- c("disp", "hp", "wgt", "acc", "mpg")


#Full Data

mfit <- lm(mpg ~ disp + hp + wgt + acc, cars)
summary(mfit)

## 
## Call:
## lm(formula = mpg ~ disp + hp + wgt + acc, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.378  -2.793  -0.333   2.193  16.256 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 45.2511397  2.4560447  18.424  < 2e-16 ***
## disp        -0.0060009  0.0067093  -0.894  0.37166    
## hp          -0.0436077  0.0165735  -2.631  0.00885 ** 
## wgt         -0.0052805  0.0008109  -6.512  2.3e-10 ***
## acc         -0.0231480  0.1256012  -0.184  0.85388    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.247 on 387 degrees of freedom
## Multiple R-squared:  0.707,  Adjusted R-squared:  0.704 
## F-statistic: 233.4 on 4 and 387 DF,  p-value: < 2.2e-16

# The fitted equation is mpg = 45.25 - 0.006*disp - 0.044*hp - 0.005*wgt - 0.023*acc

#Of the four variables, only weight (significant at the 0.001 level) and horsepower (significant at the
#0.01 level) have a statistically significant relationship with mpg.
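The assignment also asks for the standard errors and significance levels of each coefficient. These can be pulled out of the summary object programmatically rather than read off the printout. The sketch below uses R's built-in mtcars data as a stand-in (an assumption made so the example runs without downloading the Auto file); the same calls should apply to mfit.

```r
# Stand-in model on built-in data (mtcars is NOT the Auto data set)
demo_fit <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)

# Coefficient table with columns: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(demo_fit))
print(coef_table)

# Standard errors and p-values as named vectors
std_errors <- coef_table[, "Std. Error"]
p_values <- coef_table[, "Pr(>|t|)"]

# Names of the predictors significant at the 0.05 level (intercept excluded)
significant <- setdiff(names(p_values)[p_values < 0.05], "(Intercept)")
print(significant)
```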




# Sample

samp <- cars[sample(nrow(cars), 40), ]
sampfit <- lm(mpg ~ disp + hp + wgt + acc, samp)
summary(sampfit)

## 
## Call:
## lm(formula = mpg ~ disp + hp + wgt + acc, data = samp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.1697 -2.1951 -0.1889  2.2019  7.7237 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 44.923642  10.317714   4.354 0.000111 ***
## disp        -0.014929   0.025071  -0.595 0.555361    
## hp           0.015593   0.062440   0.250 0.804252    
## wgt         -0.005962   0.002966  -2.010 0.052210 .  
## acc         -0.188418   0.466440  -0.404 0.688708    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.758 on 35 degrees of freedom
## Multiple R-squared:  0.6943, Adjusted R-squared:  0.6593 
## F-statistic: 19.87 on 4 and 35 DF,  p-value: 1.297e-08

#In the case of the sample, no relationship is found to be statistically significant at the 0.05 level;
#weight comes closest (p = 0.052).
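Because sample() draws different rows on every run, the sample fit will change each time the document is knitted. Fixing the random seed is one way to make the draw repeatable. The sketch below uses the built-in mtcars data as a stand-in, a subsample of 20 rows, and an arbitrary seed of 123; all three are assumptions for illustration.

```r
# Draw row indices reproducibly by fixing the seed before sampling
set.seed(123)                          # arbitrary seed, chosen for illustration
idx_first <- sample(nrow(mtcars), 20)

set.seed(123)                          # same seed, so the same rows are drawn
idx_second <- sample(nrow(mtcars), 20)

print(identical(idx_first, idx_second))

# Fit the model on the reproducible subsample
demo_samp <- mtcars[idx_first, ]
demo_sampfit <- lm(mpg ~ disp + hp + wt + qsec, data = demo_samp)
print(coef(demo_sampfit))
```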



confint(mfit, level=0.95)

##                    2.5 %       97.5 %
## (Intercept) 40.422278855 50.080000544
## disp        -0.019192122  0.007190380
## hp          -0.076193029 -0.011022433
## wgt         -0.006874738 -0.003686277
## acc         -0.270094049  0.223798050

confint(sampfit, level=0.95)

##                   2.5 %       97.5 %
## (Intercept) 23.97756913 6.586972e+01
## disp        -0.06582462 3.596716e-02
## hp          -0.11116610 1.423530e-01
## wgt         -0.01198438 6.022197e-05
## acc         -1.13534138 7.585060e-01
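The intervals reported by confint are the estimate plus or minus a t quantile times the standard error. As a worked check, the sketch below recomputes the full-data interval for hp from the values printed in the full-data summary (estimate −0.0436077, standard error 0.0165735, 387 residual degrees of freedom). The variable names are illustrative.

```r
# Values copied from the full-data summary output for hp
estimate <- -0.0436077
std_error <- 0.0165735
df_resid <- 387

# Two-sided 95% interval uses the 97.5th percentile of the t distribution
t_crit <- qt(0.975, df = df_resid)

lower <- estimate - t_crit * std_error
upper <- estimate + t_crit * std_error
print(c(lower = lower, upper = upper))
```

The result should agree with the hp row of confint(mfit) up to rounding of the printed inputs.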

The larger sample yields more precise coefficient estimates: the standard error of every coefficient is smaller, the 95% confidence intervals are narrower, and the p-values for weight and horsepower drop below 0.01. Smaller standard errors indicate that the coefficient estimates are likely to be closer to the true population values. The adjusted R-squared is also slightly closer to 1 for the full data set (0.704 versus 0.6593), indicating that the full-data model explains somewhat more of the variance in mpg than the fit to the 40-point sample. It may be useful to remove variables one at a time to see if the adjusted R-squared value improves.
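The suggestion to remove variables one at a time can be sketched as follows. The example again uses the built-in mtcars data as a stand-in (an assumption, so that it runs without the Auto file) and compares the adjusted R-squared of the full model with that of each model that omits one predictor.

```r
# Stand-in full model on built-in data (mtcars is NOT the Auto data set)
full_fit <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
adj_full <- summary(full_fit)$adj.r.squared

# Drop each predictor in turn and record the adjusted R-squared of the reduced model
predictors <- c("disp", "hp", "wt", "qsec")
adj_reduced <- sapply(predictors, function(p) {
  reduced <- update(full_fit, as.formula(paste(". ~ . -", p)))
  summary(reduced)$adj.r.squared
})

print(adj_full)
print(adj_reduced)
```

A reduced model whose adjusted R-squared is higher than adj_full would suggest that the dropped variable adds little beyond the remaining predictors.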


Brian Kreis

April 26, 2017
