Using R’s lm function, perform regression analysis and measure the significance of the independent variables for the following two data sets.

Problem 1: Maximum Heart Rate

In the first case, you are evaluating the commonly heard claim that a person's maximum heart rate is related to their age by the following equation: MaxHR = 220 - Age

You have been given the following sample:

Age    18  23  25  35  65  54  34  56  72  19  23  42  18  39  37
MaxHR  202 186 187 180 156 169 174 172 153 199 193 174 198 183 178

Perform a linear regression analysis fitting the Max Heart Rate to Age using the lm function in R.

What is the resulting equation?

MaxHR = -0.7977 Age + 210.0485

# Put data in vectors
age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
mhr <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)

# Put data in a dataframe for lm function
maxHRdf <- data.frame(age, mhr)

# Fit with lm function
mhrfit <- lm(mhr ~ age, maxHRdf)
mhrfit
## 
## Call:
## lm(formula = mhr ~ age, data = maxHRdf)
## 
## Coefficients:
## (Intercept)          age  
##    210.0485      -0.7977
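To make the fitted equation concrete, the sketch below uses predict() to evaluate it at a few ages and sets the result beside the 220 - Age rule. The ages in new_ages are hypothetical values chosen for illustration, not part of the sample, and the names fitted_hr and rule_hr are introduced here.

```r
# Sample data from the problem statement
age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
mhr <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
mhrfit <- lm(mhr ~ age)

# Hypothetical ages to predict for (assumption: chosen only for illustration)
new_ages <- data.frame(age = c(20, 40, 60))

# Predicted max heart rate from the fitted line
fitted_hr <- predict(mhrfit, newdata = new_ages)

# Max heart rate according to the 220 - Age rule
rule_hr <- 220 - new_ages$age

# Side-by-side comparison of the two
cbind(new_ages, fitted_hr, rule_hr)
```

Because the fitted slope is shallower than -1 and the intercept is below 220, the two lines cross at about age 49, so they disagree most at the youngest and oldest ages.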

Is the effect of Age on Max HR significant? What is the significance level?

Yes. The summary gives a p-value of \(3.848 \times 10^{-8}\) for the age coefficient, which is far below 0.001, so age carries the '***' significance code. The effect of Age on Max HR is therefore significant at the 0.001 (0.1%) level.

# Use summary function for details
summary(mhrfit)
## 
## Call:
## lm(formula = mhr ~ age, data = maxHRdf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9258 -2.5383  0.3879  3.1867  6.6242 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 210.04846    2.86694   73.27  < 2e-16 ***
## age          -0.79773    0.06996  -11.40 3.85e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.578 on 13 degrees of freedom
## Multiple R-squared:  0.9091, Adjusted R-squared:  0.9021 
## F-statistic:   130 on 1 and 13 DF,  p-value: 3.848e-08
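The p-value quoted above can also be read from the summary object rather than from the printed table. The sketch below does that, and then uses confint() as one possible way to check whether the slope of -1 and intercept of 220 from the MaxHR = 220 - Age rule fall inside the 95% confidence intervals of the fit. The 95% level and the variable names are assumptions made for this illustration.

```r
# Sample data from the problem statement
age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
mhr <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
mhrfit <- lm(mhr ~ age)

# Coefficient matrix: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(mhrfit))

# p-value for the age coefficient
p_age <- coefs["age", "Pr(>|t|)"]
p_age

# 95% confidence intervals for intercept and slope (level is an assumption)
ci <- confint(mhrfit, level = 0.95)
ci

# Do the values from the 220 - Age rule lie inside the intervals?
slope_rule_in_ci <- ci["age", 1] <= -1 && -1 <= ci["age", 2]
intercept_rule_in_ci <- ci["(Intercept)", 1] <= 220 && 220 <= ci["(Intercept)", 2]
c(slope = slope_rule_in_ci, intercept = intercept_rule_in_ci)
```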

Please also plot the fitted relationship between Max HR and Age.

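# Plot Age vs. Max Heart Rate to show what we are fitting
plot(age, mhr, col = "red", main = "Age vs. Max Heart Rate", xlab = "Age", ylab = "Max Heart Rate")
# Reference line for the rule MaxHR = 220 - Age
abline(a = 220, b = -1, col = "green")
# Fitted regression line from lm
abline(mhrfit, col = "blue")

# Diagnostic plots for the fit (residuals, Q-Q, scale-location, leverage)
plot(mhrfit)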

Problem 2: Auto Data Set

Using the Auto data set from Assignment 5, perform a linear regression analysis using mpg as the dependent variable and the other four variables (displacement, horsepower, weight, acceleration) as independent variables.

What is the final linear regression fit equation?

# Read in the mpg data from Github
mpgdf <- read.table("https://raw.githubusercontent.com/Godbero/IS605/master/auto-mpg.data", col.names = c("dis", "hp", "wt", "acc", "mpg"))
head(mpgdf)
##   dis  hp   wt  acc mpg
## 1 307 130 3504 12.0  18
## 2 350 165 3693 11.5  15
## 3 318 150 3436 11.0  18
## 4 304 150 3433 12.0  16
## 5 302 140 3449 10.5  17
## 6 429 198 4341 10.0  15
tail(mpgdf)
##     dis hp   wt  acc mpg
## 387 151 90 2950 17.3  27
## 388 140 86 2790 15.6  27
## 389  97 52 2130 24.6  44
## 390 135 84 2295 11.6  32
## 391 120 79 2625 18.6  28
## 392 119 82 2720 19.4  31
# Draw a random sample of 40 rows from the full auto data set
set.seed(42)
mpgdf40 <- mpgdf[sample(nrow(mpgdf), 40), ]
mpgdf40
##     dis  hp   wt  acc  mpg
## 359 231 110 3415 15.8 22.4
## 367 135  84 2525 16.0 29.0
## 112 122  85 2310 18.5 19.0
## 324  90  48 2085 21.7 44.3
## 249 318 140 3735 13.2 19.4
## 201 258  95 3193 17.8 17.5
## 285 302 129 3725 13.4 17.6
## 52   88  76 2065 14.5 30.0
## 253 200  85 2965 15.8 20.2
## 271 151  85 2855 17.6 23.8
## 175 232  90 3211 17.0 19.0
## 274 163 125 3140 13.6 17.0
## 356 145  76 3160 19.6 30.7
## 97  225 105 3121 16.5 18.0
## 382 262  85 3015 17.0 38.0
## 355 141  80 3230 20.4 28.1
## 368 151  90 2735 18.0 27.0
## 45  258 110 2962 13.5 18.0
## 178 121  98 2945 14.5 22.0
## 210 168 120 3820 16.7 16.5
## 337 156  92 2620 14.4 25.8
## 385 144  96 2665 13.9 32.0
## 366 112  85 2575 16.2 31.0
## 350 105  74 2190 14.2 33.0
## 31  140  90 2264 15.5 28.0
## 189 351 152 4215 12.8 14.5
## 143  76  52 1649 16.5 31.0
## 331 168 132 2910 11.4 32.7
## 163 231 110 3039 15.0 21.0
## 304 151  90 2670 16.0 28.4
## 268 105  75 2230 14.5 30.9
## 293  86  65 1975 15.2 34.1
## 140  98  83 2219 16.5 29.0
## 246  85  70 2070 18.6 39.4
## 2   350 165 3693 11.5 15.0
## 298 141  71 3190 24.8 27.2
## 3   318 150 3436 11.0 18.0
## 74  302 140 4294 16.0 13.0
## 321  86  65 2110 17.9 46.6
## 216 111  80 2155 14.8 30.0
# Fit with lm function on the full data set
mpgfit <- lm(mpg ~ dis + hp + wt + acc, mpgdf)
mpgfit
## 
## Call:
## lm(formula = mpg ~ dis + hp + wt + acc, data = mpgdf)
## 
## Coefficients:
## (Intercept)          dis           hp           wt          acc  
##   45.251140    -0.006001    -0.043608    -0.005281    -0.023148
# Fit with lm function on the 40-row sample
mpgfit40 <- lm(mpg ~ dis + hp + wt + acc, mpgdf40)
mpgfit40
## 
## Call:
## lm(formula = mpg ~ dis + hp + wt + acc, data = mpgdf40)
## 
## Coefficients:
## (Intercept)          dis           hp           wt          acc  
##   48.760681    -0.006323    -0.085800    -0.005613     0.164672

For the random sample of 40 rows: MPG = -0.006323 Displacement - 0.085800 Horsepower - 0.005613 Weight + 0.164672 Acceleration + 48.760681

For the complete data set: MPG = -0.006001 Displacement - 0.043608 Horsepower - 0.005281 Weight - 0.023148 Acceleration + 45.251140
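As a worked check of the full-data equation, the sketch below plugs in the first car shown by head(mpgdf) (displacement 307, horsepower 130, weight 3504, acceleration 12.0, observed mpg 18). The coefficients are copied by hand from the lm output, so the result is accurate only to the printed precision. The names coefs, car and pred_mpg are introduced here for illustration.

```r
# Full-data coefficients copied from the printed lm output
coefs <- c(intercept = 45.251140, dis = -0.006001, hp = -0.043608,
           wt = -0.005281, acc = -0.023148)

# First row of the data set as shown by head(mpgdf)
car <- c(dis = 307, hp = 130, wt = 3504, acc = 12.0)
observed_mpg <- 18

# Evaluate the equation: intercept plus the sum of coefficient * value
pred_mpg <- coefs[["intercept"]] + sum(coefs[names(car)] * car)
pred_mpg

# Residual for this car (observed minus predicted)
residual <- observed_mpg - pred_mpg
residual
```

The prediction comes out at roughly 18.96 mpg against an observed 18, a residual of about -0.96, which is small compared with the residual standard error of 4.247 reported for the full fit.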

Which of the 4 independent variables have a significant impact on mpg & what are their significance levels?

# Use summary function on fit to get significance and standard error
summary(mpgfit40)
## 
## Call:
## lm(formula = mpg ~ dis + hp + wt + acc, data = mpgdf40)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7767  -2.7214  -0.5021   1.9057  12.8559 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 48.760681  11.872539   4.107 0.000229 ***
## dis         -0.006323   0.023991  -0.264 0.793677    
## hp          -0.085800   0.105529  -0.813 0.421692    
## wt          -0.005613   0.003972  -1.413 0.166447    
## acc          0.164672   0.630572   0.261 0.795510    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.43 on 35 degrees of freedom
## Multiple R-squared:  0.6036, Adjusted R-squared:  0.5583 
## F-statistic: 13.32 on 4 and 35 DF,  p-value: 1.074e-06
summary(mpgfit)
## 
## Call:
## lm(formula = mpg ~ dis + hp + wt + acc, data = mpgdf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.378  -2.793  -0.333   2.193  16.256 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 45.2511397  2.4560447  18.424  < 2e-16 ***
## dis         -0.0060009  0.0067093  -0.894  0.37166    
## hp          -0.0436077  0.0165735  -2.631  0.00885 ** 
## wt          -0.0052805  0.0008109  -6.512  2.3e-10 ***
## acc         -0.0231480  0.1256012  -0.184  0.85388    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.247 on 387 degrees of freedom
## Multiple R-squared:  0.707,  Adjusted R-squared:  0.704 
## F-statistic: 233.4 on 4 and 387 DF,  p-value: < 2.2e-16

Variable      p-value (40-row sample)  p-value (all data)
Displacement  0.793677                 0.37166
Horsepower    0.421692                 0.00885
Weight        0.166447                 2.3e-10
Acceleration  0.795510                 0.85388

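None of the four variables is significant at the 0.05 level for the 40-row sample. For the full data set, horsepower (p = 0.00885, significant at the 0.01 level) and weight (p = 2.3e-10, significant at the 0.001 level) are significant, while displacement and acceleration are not.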

What are the standard errors on each of the coefficients?

Variable      Std. Error (40-row sample)  Std. Error (all data)
Displacement  0.023991                    0.0067093
Horsepower    0.105529                    0.0165735
Weight        0.003972                    0.0008109
Acceleration  0.630572                    0.1256012
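The standard-error and p-value tables above were transcribed by hand from the summary output. One way to build them directly is to pull the columns out of coef(summary(fit)). So that it runs without a download, the sketch below uses R's built-in mtcars data as a stand-in for the auto data. This is an assumption made for illustration: mtcars is a different, much smaller data set, its qsec column (quarter-mile time) is only a rough analogue of acceleration, and the subsample size of 16 is arbitrary.

```r
# Stand-in data: built-in mtcars (assumption; NOT the auto-mpg data used above)
fit_all <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)

# Random subsample, mirroring the 40-row sample idea (size 16 is arbitrary)
set.seed(42)
mtcars_sub <- mtcars[sample(nrow(mtcars), 16), ]
fit_sub <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars_sub)

# Coefficient matrices: Estimate, Std. Error, t value, Pr(>|t|)
cs_all <- coef(summary(fit_all))
cs_sub <- coef(summary(fit_sub))

# Combined table of standard errors and p-values, one row per coefficient
se_p_table <- data.frame(
  se_sample = cs_sub[, "Std. Error"],
  se_all    = cs_all[, "Std. Error"],
  p_sample  = cs_sub[, "Pr(>|t|)"],
  p_all     = cs_all[, "Pr(>|t|)"]
)
se_p_table

# Names of the coefficients significant at the 0.05 level in the full fit
rownames(se_p_table)[se_p_table$p_all < 0.05]
```

Applied to mpgfit and mpgfit40 in place of fit_all and fit_sub, the same extraction should reproduce the two tables in this section.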