Using R’s lm function, perform regression analysis and measure the significance of the independent variables for the following two data sets. In the first case, you are evaluating the commonly heard statement that a person's maximum heart rate is related to their age by the equation MaxHR = 220 − Age. You have been given the following sample:
Age:   18  23  25  35  65  54  34  56  72  19  23  42  18  39  37
MaxHR: 202 186 187 180 156 169 174 172 153 199 193 174 198 183 178
age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
hr <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
df <- data.frame(age, hr)
fit <- lm(hr ~ age, df)
fit
##
## Call:
## lm(formula = hr ~ age, data = df)
##
## Coefficients:
## (Intercept) age
## 210.0485 -0.7977
# The equation for the best-fit line is MaxHR = 210.0485 - 0.7977 * Age
summary(fit)
##
## Call:
## lm(formula = hr ~ age, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9258 -2.5383 0.3879 3.1867 6.6242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 210.04846 2.86694 73.27 < 2e-16 ***
## age -0.79773 0.06996 -11.40 3.85e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.578 on 13 degrees of freedom
## Multiple R-squared: 0.9091, Adjusted R-squared: 0.9021
## F-statistic: 130 on 1 and 13 DF, p-value: 3.848e-08
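# The relationship (not necessarily a causal effect) between age and max heart rate is significant at the 0.001 level.
# Note that the fitted intercept (210.05) and slope (-0.80) differ from the 220 and -1 in the stated rule.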
library(ggplot2)
# Scatter plot of the sample with the fitted regression line
ggplot(df, aes(x = age, y = hr)) + geom_point() + geom_smooth(method = "lm")
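The summary above tests whether each coefficient differs from zero, which is not quite the question posed by the rule MaxHR = 220 − Age. A minimal sketch of how the rule itself could be checked is shown here: each coefficient is compared with its claimed value (220 for the intercept, -1 for the slope) using the estimates and standard errors that lm already reports. The names `h0`, `t_stat` and `p_val` are introduced only for this illustration, and the two tests are run separately rather than as one joint test.

```r
# Sample from the assignment
age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
hr  <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
fit <- lm(hr ~ age)

# Coefficient table: estimate, standard error, t value, p-value (vs. zero)
est <- coef(summary(fit))

# Values claimed by the rule MaxHR = 220 - Age (hypothetical helper vector)
h0 <- c("(Intercept)" = 220, age = -1)

# t statistic and two-sided p-value for "coefficient equals claimed value"
t_stat <- (est[, "Estimate"] - h0[rownames(est)]) / est[, "Std. Error"]
p_val  <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

round(cbind(t_stat, p_val), 4)

# The same question read off the 95% confidence intervals:
# do they contain 220 and -1?
confint(fit, level = 0.95)
```

On this sample the t statistics come out at roughly -3.5 for the intercept and 2.9 for the slope, and neither 95% interval contains the claimed value, so the data fit a flatter line than the rule suggests. With only 15 observations this is a sketch of the approach rather than a firm verdict on the rule.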
Using the Auto data set from Assignment 5 (also attached here) perform a Linear Regression analysis using mpg as the dependent variable and the other 4 (displacement, horsepower, weight, acceleration) as independent variables. What is the final linear regression fit equation? Which of the 4 independent variables have a significant impact on mpg? What are their corresponding significance levels? What are the standard errors on each of the coefficients? Please perform this experiment in two ways. First take any random 40 data points from the entire auto data sample and perform the linear regression fit and measure the 95% confidence intervals. Then, take the entire data set (all 392 points) and perform linear regression and measure the 95% confidence intervals. Please report the resulting fit equation, their significance values and confidence intervals for each of the two runs. Please submit an R-markdown file documenting your experiments. Your submission should include the final linear fits, and their corresponding significance levels. In addition, you should clearly state what you concluded from looking at the fit and their significance levels.
cars <- read.table("https://raw.githubusercontent.com/bkreis84/Math/master/auto-mpg.data")
colnames(cars) <- c("disp", "hp", "wgt", "acc", "mpg")
#Full Data
mfit <- lm(mpg ~ disp + hp + wgt + acc, cars)
summary(mfit)
##
## Call:
## lm(formula = mpg ~ disp + hp + wgt + acc, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.378 -2.793 -0.333 2.193 16.256
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.2511397 2.4560447 18.424 < 2e-16 ***
## disp -0.0060009 0.0067093 -0.894 0.37166
## hp -0.0436077 0.0165735 -2.631 0.00885 **
## wgt -0.0052805 0.0008109 -6.512 2.3e-10 ***
## acc -0.0231480 0.1256012 -0.184 0.85388
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.247 on 387 degrees of freedom
## Multiple R-squared: 0.707, Adjusted R-squared: 0.704
## F-statistic: 233.4 on 4 and 387 DF, p-value: < 2.2e-16
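# The fitted equation is mpg = 45.25 - 0.0060*disp - 0.0436*hp - 0.0053*wgt - 0.0231*acc
# Of the four variables, only weight (p = 2.3e-10, significant at the 0.001 level) and horsepower
# (p = 0.00885, significant at the 0.01 level) have a statistically significant relationship with mpg.
# The standard error of each coefficient is listed in the Std. Error column above.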
# Random sample of 40 rows (no seed is set, so the rows drawn differ between runs)
samp <- cars[sample(nrow(cars), 40), ]
sampfit <- lm(mpg ~ disp + hp + wgt + acc, samp)
summary(sampfit)
##
## Call:
## lm(formula = mpg ~ disp + hp + wgt + acc, data = samp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1697 -2.1951 -0.1889 2.2019 7.7237
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.923642 10.317714 4.354 0.000111 ***
## disp -0.014929 0.025071 -0.595 0.555361
## hp 0.015593 0.062440 0.250 0.804252
## wgt -0.005962 0.002966 -2.010 0.052210 .
## acc -0.188418 0.466440 -0.404 0.688708
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.758 on 35 degrees of freedom
## Multiple R-squared: 0.6943, Adjusted R-squared: 0.6593
## F-statistic: 19.87 on 4 and 35 DF, p-value: 1.297e-08
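# The fitted equation for the sample is mpg = 44.92 - 0.0149*disp + 0.0156*hp - 0.0060*wgt - 0.1884*acc
# In the 40-point sample, no individual coefficient is significant at the 0.05 level; weight comes
# closest (p = 0.052, significant only at the 0.1 level). The overall F-test is still significant (p = 1.297e-08).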
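Because sample() draws different rows on each run, the sample fit above will not reproduce exactly when the document is knitted again. A minimal sketch of one way to make the draw repeatable is to fix the random seed before sampling. The seed value 42 and the names `n`, `idx1` and `idx2` are arbitrary choices for this illustration, and a seeded draw would give different numbers from the output shown above.

```r
n <- 392  # number of rows in the full auto data set, as stated in the assignment

# Fixing the seed makes the random draw repeatable
set.seed(42)
idx1 <- sample(n, 40)

# Resetting the same seed reproduces exactly the same 40 row indices
set.seed(42)
idx2 <- sample(n, 40)

identical(idx1, idx2)

# sample() draws without replacement by default, so no row appears twice
anyDuplicated(idx1) == 0
```

The indices would then be used in place of the inline call, for example `samp <- cars[idx1, ]`.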
confint(mfit, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 40.422278855 50.080000544
## disp -0.019192122 0.007190380
## hp -0.076193029 -0.011022433
## wgt -0.006874738 -0.003686277
## acc -0.270094049 0.223798050
confint(sampfit, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 23.97756913 6.586972e+01
## disp -0.06582462 3.596716e-02
## hp -0.11116610 1.423530e-01
## wgt -0.01198438 6.022197e-05
## acc -1.13534138 7.585060e-01
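The assignment asks for the estimates, standard errors, significance values and confidence intervals of each run, which are spread over the summary() and confint() outputs above. The sketch below gathers them into one table per model. `coef_table` is a hypothetical helper written for this illustration, not part of base R, and it is demonstrated on the heart-rate data from the first problem so that the block runs without downloading the auto data.

```r
# Hypothetical helper: one row per coefficient with estimate, standard error,
# p-value and confidence interval, built from summary() and confint()
coef_table <- function(fit, level = 0.95) {
  est <- coef(summary(fit))
  ci  <- confint(fit, level = level)
  data.frame(
    estimate  = est[, "Estimate"],
    std_error = est[, "Std. Error"],
    p_value   = est[, "Pr(>|t|)"],
    ci_lower  = ci[, 1],
    ci_upper  = ci[, 2],
    ci_width  = ci[, 2] - ci[, 1]
  )
}

# Demonstration on the heart-rate sample from the first problem
age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
hr  <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
tab <- coef_table(lm(hr ~ age))
tab
```

Calling `coef_table(mfit)` and `coef_table(sampfit)` would put the two runs side by side; the `ci_width` column makes it easy to see how much wider the intervals are for the 40-point sample.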
The larger sample gives a more precise model. The standard errors of the coefficients are roughly four times smaller on the full data set, so the 95% confidence intervals are much narrower and the p-values for weight and horsepower are smaller. A smaller standard error means that a coefficient estimate is likely to lie closer to the true population coefficient. The adjusted R-squared is also slightly closer to 1 for the full data set (0.704 versus 0.659), suggesting that the model explains somewhat more of the variation in mpg. It may be useful to remove variables one at a time to see whether the adjusted R-squared improves.
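The idea of removing variables one at a time can be sketched as follows. To keep the block runnable without the download, it uses R's built-in `mtcars` data as a stand-in for the auto data set, with `wt` and `qsec` playing the roles of weight and acceleration. That substitution is an assumption for illustration only, so the numbers it produces say nothing about the assignment's data. The names `adj_r2`, `candidates` and `drop_one` are likewise introduced just for this example.

```r
# Stand-in data: built-in mtcars (NOT the assignment's auto data set)
full <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)

# Small helper to pull the adjusted R-squared out of a fitted model
adj_r2 <- function(fit) summary(fit)$adj.r.squared

candidates <- c("disp", "hp", "wt", "qsec")

# Refit the model with each predictor removed in turn and record adjusted R-squared
drop_one <- sapply(candidates, function(v) {
  reduced <- update(full, as.formula(paste(". ~ . -", v)))
  adj_r2(reduced)
})

# Adjusted R-squared of the full model, followed by each reduced model
round(c(full = adj_r2(full), drop_one), 4)
```

A predictor whose removal leaves the adjusted R-squared unchanged or higher is a candidate for dropping. For the assignment's models the same loop would be run with `mfit` and the column names `disp`, `hp`, `wgt` and `acc`.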