LINEAR REGRESSION IN R

Using R’s lm function, perform regression analysis and measure the significance of the independent variables for the following two data sets. In the first case, you are evaluating the commonly heard statement that a person's Maximum Heart Rate is related to their age by the following equation:

MaxHR = 220 - Age

You have been given the following sample:

Age    18  23  25  35  65  54  34  56  72  19  23  42  18  39  37
MaxHR 202 186 187 180 156 169 174 172 153 199 193 174 198 183 178

Perform a linear regression analysis fitting the Max Heart Rate to Age using the lm function in R. What is the resulting equation? Is the effect of Age on Max HR significant? What is the significance level? Please also plot the fitted relationship between Max HR and Age.

age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
MaxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
df <- data.frame(age, MaxHR)
(summary(lm(MaxHR ~ age, df)))
## 
## Call:
## lm(formula = MaxHR ~ age, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9258 -2.5383  0.3879  3.1867  6.6242 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 210.04846    2.86694   73.27  < 2e-16 ***
## age          -0.79773    0.06996  -11.40 3.85e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.578 on 13 degrees of freedom
## Multiple R-squared:  0.9091, Adjusted R-squared:  0.9021 
## F-statistic:   130 on 1 and 13 DF,  p-value: 3.848e-08
# regression model
MaxHR = -0.79773 * age + 210.04846

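# p-value 3.848e-08 < 0.05, so the effect of Age on Max HR is significant and the null hypothesis (slope = 0) is rejected.
# the given equation MaxHR = 220 - Age does describe a decreasing relationship, but the fitted intercept (210.05) and slope (-0.80) each differ from 220 and -1 by more than two standard errors, so this sample does not support that exact equation.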
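
One way to compare the fitted line with the rule of thumb is to ask whether 220 and -1 fall inside the 95% confidence intervals of the fitted intercept and slope. The following is a minimal sketch of that check using base R's confint; the names `fit`, `claimed` and `inside` are introduced here for illustration and are not part of the original analysis.

```r
# Sample from the exercise
age   <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
MaxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
fit   <- lm(MaxHR ~ age)

# 95% confidence intervals for the intercept and the slope
ci <- confint(fit, level = 0.95)
print(ci)

# Values claimed by the rule of thumb MaxHR = 220 - Age
claimed <- c("(Intercept)" = 220, age = -1)

# TRUE where the claimed value lies inside the fitted interval
inside <- claimed >= ci[, 1] & claimed <= ci[, 2]
print(inside)
```

A claimed value that falls outside its interval is inconsistent with the sample at the 5% level; with only 15 observations this is a rough check rather than a definitive verdict on the rule.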

library(ggplot2)
ggplot(df, aes(age, MaxHR)) + 
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") +
  ggtitle('Regression model fit of MaxHR and Age')
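
The fitted line can also be evaluated numerically rather than only read off the plot. This sketch uses predict with `interval = "confidence"` to get the fitted MaxHR and a 95% interval for the mean response; the ages 20, 40 and 60 are hypothetical values chosen for illustration.

```r
# Sample from the exercise
age   <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
MaxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
df    <- data.frame(age, MaxHR)
fit   <- lm(MaxHR ~ age, data = df)

# Hypothetical ages at which to evaluate the fitted line
new_ages <- data.frame(age = c(20, 40, 60))

# Fitted MaxHR plus a 95% confidence interval for the mean response
pred <- predict(fit, newdata = new_ages, interval = "confidence", level = 0.95)
print(cbind(new_ages, pred))
```

The columns `fit`, `lwr` and `upr` hold the point estimate and the interval bounds. Using `interval = "prediction"` instead would give the wider interval for an individual person.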

Using the Auto data set from Assignment 5 (also attached here), perform a Linear Regression analysis using mpg as the dependent variable and the other 4 (displacement, horsepower, weight, acceleration) as independent variables. What is the final linear regression fit equation? Which of the 4 independent variables have a significant impact on mpg? What are their corresponding significance levels? What are the standard errors on each of the coefficients? Please perform this experiment in two ways. First take any random 40 data points from the entire auto data sample and perform the linear regression fit and measure the 95% confidence intervals. Then, take the entire data set (all 392 points) and perform linear regression and measure the 95% confidence intervals. Please report the resulting fit equation, their significance values and confidence intervals for each of the two runs.

# read in data
auto.mpg <- read.table('auto-mpg.data', quote='\'', comment.char='')
names(auto.mpg) <- c('displacement', 'horsepower', 'weight', 'acceleration', 'mpg')
head(auto.mpg)
##   displacement horsepower weight acceleration mpg
## 1          307        130   3504         12.0  18
## 2          350        165   3693         11.5  15
## 3          318        150   3436         11.0  18
## 4          304        150   3433         12.0  16
## 5          302        140   3449         10.5  17
## 6          429        198   4341         10.0  15
str(auto.mpg)
## 'data.frame':    392 obs. of  5 variables:
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
lm.auto <- summary(lm(mpg ~ displacement + horsepower + weight + acceleration, auto.mpg))

# randomly select 40 rows and fit the regression model to them
set.seed(234)
df.s <- auto.mpg[sample(1:nrow(auto.mpg), 40, replace = FALSE), ]
str(df.s)
## 'data.frame':    40 obs. of  5 variables:
##  $ displacement: num  86 173 440 85 360 302 231 89 168 140 ...
##  $ horsepower  : num  65 115 215 65 215 139 110 71 116 72 ...
##  $ weight      : num  1975 2700 4312 2020 4615 ...
##  $ acceleration: num  15.2 12.9 8.5 19.2 14 12.8 15.8 14.9 12.6 19.5 ...
##  $ mpg         : num  34.1 26.8 14 31.8 10 20.2 22.4 31.5 25.4 21 ...
lm.s <- summary(lm(mpg ~ displacement + horsepower + weight + acceleration, df.s))

The regression fit equation for the full data set (all 392 points, referred to below as the population):

mpg = -0.0060009 * displacement - 0.0436077 * horsepower - 0.0052805 * weight - 0.0231480 * acceleration + 45.2511397

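The sample regression fit equation (coefficients consistent with the sample standard errors, p-values and confidence intervals reported below):

mpg.s = -0.048786 * displacement - 0.009503 * horsepower - 0.002490 * weight - 0.382250 * acceleration + 48.130139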
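
Copying coefficients into an equation by hand invites transcription errors, so it can help to build the equation string directly from coef. The sketch below does this with sprintf and paste. Since auto-mpg.data is not bundled with R, it uses the built-in mtcars data as a stand-in, with disp, hp, wt and qsec playing the roles of displacement, horsepower, weight and acceleration; that substitution is an assumption made purely for illustration.

```r
# Stand-in data (assumption): mtcars ships with R
fit <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
b   <- coef(fit)

# One signed "coefficient * name" term per predictor (intercept excluded)
terms <- sprintf("%+.5f * %s", b[-1], names(b)[-1])

# Assemble: response, intercept, then the predictor terms
equation <- paste("mpg =", sprintf("%.5f", b[1]), paste(terms, collapse = " "))
print(equation)
```

Replacing mtcars and the formula with auto.mpg or df.s and the formula used above would produce the corresponding equations for the two runs.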

# significance: for the full data set (392 points), only weight and horsepower have p-values < 0.05 (2.3e-10 and 0.00885 respectively); for the 40-row sample, no predictor has a p-value < 0.05
# significance: population
(lm.auto$coefficients[, 4])
##  (Intercept) displacement   horsepower       weight acceleration 
## 7.072099e-55 3.716584e-01 8.848982e-03 2.302545e-10 8.538765e-01
# significance: sample
(lm.s$coefficients[, 4])
##  (Intercept) displacement   horsepower       weight acceleration 
## 5.245973e-06 1.044330e-01 8.810093e-01 3.939484e-01 4.229004e-01
# The standard errors
# population
(lm.auto$coefficients[, 2])
##  (Intercept) displacement   horsepower       weight acceleration 
## 2.4560446927 0.0067093055 0.0165734633 0.0008108541 0.1256011622
# The standard errors
# sample
(lm.s$coefficients[, 2])
##  (Intercept) displacement   horsepower       weight acceleration 
##  8.964438465  0.029265407  0.063022264  0.002885248  0.471386621
# 95% confidence intervals
# population
confint(lm(mpg ~ displacement + horsepower + weight + acceleration, auto.mpg), level = .95)
##                     2.5 %       97.5 %
## (Intercept)  40.422278855 50.080000544
## displacement -0.019192122  0.007190380
## horsepower   -0.076193029 -0.011022433
## weight       -0.006874738 -0.003686277
## acceleration -0.270094049  0.223798050
# 95% confidence intervals
# sample
confint(lm(mpg ~ displacement + horsepower + weight + acceleration, df.s), level = .95)
##                     2.5 %       97.5 %
## (Intercept)  29.931361230 66.328916430
## displacement -0.108198365  0.010625505
## horsepower   -0.137444946  0.118439050
## weight       -0.008347676  0.003367054
## acceleration -1.339215408  0.574716027
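
The estimates, standard errors, p-values and confidence intervals reported above can all be read from one fitted lm object, which avoids refitting the model for each quantity. This is a minimal sketch of that pattern, again using the built-in mtcars data as a stand-in for the auto data (an assumption for illustration; `report` is a name introduced here).

```r
# Stand-in data (assumption): disp, hp, wt, qsec stand for
# displacement, horsepower, weight, acceleration
fit <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)

coef_table <- summary(fit)$coefficients   # Estimate, Std. Error, t value, Pr(>|t|)
ci         <- confint(fit, level = 0.95)  # 2.5 % and 97.5 % bounds

# One table with everything the exercise asks to report
report <- cbind(coef_table[, c("Estimate", "Std. Error", "Pr(>|t|)")], ci)
print(round(report, 5))
```

Each estimate sits at the midpoint of its confidence interval, which gives a quick consistency check between a reported equation and the reported intervals.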

Conclusions from comparing the regression on the whole data set (392 points) with the regression on the 40-row sample:

  1. With the whole data set, weight and horsepower are significant at the 0.05 level; with only 40 rows, no predictor reaches that level.

  2. The coefficient estimates from the 40-row sample have larger standard errors than those from the whole data set.

  3. Consequently, the 95% confidence intervals from the whole data set are narrower than those from the sample.
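
The pattern in these conclusions is what sampling theory suggests: standard errors shrink roughly in proportion to 1/sqrt(n), so going from 392 rows to 40 should widen intervals by a factor near sqrt(392/40), about 3.1. The sketch below illustrates this on simulated data; the population model, its coefficients, the seed and the helper `ci_width` are all hypothetical choices made for illustration, not part of the auto data analysis.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Hypothetical population: y depends linearly on x with Gaussian noise
n_full <- 392
x <- runif(n_full, 0, 10)
y <- 3 - 0.5 * x + rnorm(n_full, sd = 2)
full <- data.frame(x, y)

# Width of the 95% confidence interval for the slope of y ~ x
ci_width <- function(d) {
  ci <- confint(lm(y ~ x, data = d), "x", level = 0.95)
  unname(ci[1, 2] - ci[1, 1])
}

width_full <- ci_width(full)

# Average interval width over 200 random subsamples of 40 rows
width_sub <- mean(replicate(200, ci_width(full[sample(n_full, 40), ])))

print(c(full = width_full, subsample = width_sub, ratio = width_sub / width_full))
```

Averaging over many subsamples smooths out the luck of any single draw, which is why a single 40-row sample, like the one taken above with set.seed(234), can look somewhat better or worse than this average.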
