Using R’s lm function, perform regression analysis and measure the significance of the independent variables for the following two data sets. In the first case, you are evaluating the commonly heard statement that a person's maximum heart rate is related to their age by the following equation:
MaxHR = 220 - Age
You have been given the following sample:
Age: 18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37
MaxHR: 202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178
Perform a linear regression analysis fitting the Max Heart Rate to Age using the lm function in R. What is the resulting equation? Is the effect of Age on Max HR significant? What is the significance level? Please also plot the fitted relationship between Max HR and Age.
age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
MaxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
df <- data.frame(age, MaxHR)
(summary(lm(MaxHR ~ age, df)))
##
## Call:
## lm(formula = MaxHR ~ age, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9258 -2.5383 0.3879 3.1867 6.6242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 210.04846 2.86694 73.27 < 2e-16 ***
## age -0.79773 0.06996 -11.40 3.85e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.578 on 13 degrees of freedom
## Multiple R-squared: 0.9091, Adjusted R-squared: 0.9021
## F-statistic: 130 on 1 and 13 DF, p-value: 3.848e-08
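The numbers quoted from this summary can also be read off the fitted object instead of being copied by hand. The following is a minimal sketch that reuses the sample above; the object names `fit`, `coefs`, `slope`, `intercept` and `p_age` are illustrative and not part of the original analysis.

```r
# Same sample as above
age   <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
MaxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
df    <- data.frame(age, MaxHR)

fit   <- lm(MaxHR ~ age, data = df)  # keep the model object, not only its summary
coefs <- coef(summary(fit))          # matrix with Estimate, Std. Error, t value, Pr(>|t|)

# Index by row and column name rather than by position
slope     <- coefs["age", "Estimate"]
intercept <- coefs["(Intercept)", "Estimate"]
p_age     <- coefs["age", "Pr(>|t|)"]

cat(sprintf("MaxHR = %.5f + (%.5f) * age, p-value for age = %.3g\n",
            intercept, slope, p_age))
```

Indexing the coefficient matrix by name keeps the code readable and still works if the column order ever differs.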
# regression model
MaxHR = -0.79773 * age + 210.04846
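# p-value 3.848e-08 < 0.05 (and well below 0.001), so the effect of age on MaxHR is significant and we reject the null hypothesis that the slope is zero.
# the fitted equation differs from the given equation MaxHR = 220 - Age: the estimated slope is about -0.80 rather than -1 and the intercept is about 210 rather than 220, so the given equation is not supported by this sample.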
library(ggplot2)
ggplot(df, aes(age, MaxHR)) +
geom_point() +
geom_smooth(se = FALSE, method = "lm") +
ggtitle('Regression model fit of MaxHR and Age')
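One way to compare the fitted line with the rule MaxHR = 220 - Age is to check whether the rule's intercept (220) and slope (-1) fall inside the 95% confidence intervals of the fitted coefficients. The sketch below assumes the usual linear-model assumptions hold; the names `fit`, `ci`, `rule` and `inside` are illustrative.

```r
# Same sample as above
age   <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
MaxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)

fit <- lm(MaxHR ~ age)
ci  <- confint(fit, level = 0.95)  # rows: (Intercept), age; columns: 2.5 %, 97.5 %
print(ci)

# Rule-of-thumb values, named to match the coefficient names
rule <- c("(Intercept)" = 220, age = -1)

# TRUE where the rule-of-thumb value lies inside the fitted interval
inside <- ci[names(rule), 1] <= rule & rule <= ci[names(rule), 2]
print(inside)
```

With the estimates and standard errors in the summary above, the intervals come out at roughly 203.9 to 216.2 for the intercept and -0.95 to -0.65 for the slope, so neither 220 nor -1 falls inside.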
Using the Auto data set from Assignment 5 (also attached here), perform a linear regression analysis using mpg as the dependent variable and the other 4 (displacement, horsepower, weight, acceleration) as independent variables. What is the final linear regression fit equation? Which of the 4 independent variables have a significant impact on mpg? What are their corresponding significance levels? What are the standard errors on each of the coefficients? Please perform this experiment in two ways. First, take any random 40 data points from the entire auto data sample, perform the linear regression fit and measure the 95% confidence intervals. Then take the entire data set (all 392 points), perform the linear regression and measure the 95% confidence intervals. Please report the resulting fit equation, the significance values and the confidence intervals for each of the two runs.
# read in data
auto.mpg <- read.table('auto-mpg.data', quote='\'', comment.char='')
names(auto.mpg) <- c('displacement', 'horsepower', 'weight', 'acceleration', 'mpg')
head(auto.mpg)
## displacement horsepower weight acceleration mpg
## 1 307 130 3504 12.0 18
## 2 350 165 3693 11.5 15
## 3 318 150 3436 11.0 18
## 4 304 150 3433 12.0 16
## 5 302 140 3449 10.5 17
## 6 429 198 4341 10.0 15
str(auto.mpg)
## 'data.frame': 392 obs. of 5 variables:
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
lm.auto <- summary(lm(mpg ~ displacement + horsepower + weight + acceleration, auto.mpg))
# randomly select 40 observations (without replacement) and fit the regression model
set.seed(234)
df.s <- auto.mpg[sample(1:nrow(auto.mpg), 40, replace = FALSE), ]
str(df.s)
## 'data.frame': 40 obs. of 5 variables:
## $ displacement: num 86 173 440 85 360 302 231 89 168 140 ...
## $ horsepower : num 65 115 215 65 215 139 110 71 116 72 ...
## $ weight : num 1975 2700 4312 2020 4615 ...
## $ acceleration: num 15.2 12.9 8.5 19.2 14 12.8 15.8 14.9 12.6 19.5 ...
## $ mpg : num 34.1 26.8 14 31.8 10 20.2 22.4 31.5 25.4 21 ...
lm.s <- summary(lm(mpg ~ displacement + horsepower + weight + acceleration, df.s))
The population (all 392 points) regression fit equation:
mpg = -0.0060009 * displacement - 0.0436077 * horsepower - 0.0052805 * weight - 0.0231480 * acceleration + 45.2511397
The sample (40 points) regression fit equation:
mpg.s = -0.048786 * displacement - 0.009503 * horsepower - 0.002490 * weight - 0.382250 * acceleration + 48.130139
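Fit equations like the two above can be assembled directly from `coef()`, which avoids transcription slips when the model is refitted. This is a sketch only: it uses R's built-in `mtcars` data as a stand-in (an assumption, not the auto-mpg file), with `disp`, `hp`, `wt` and `qsec` in place of the four predictors, and the names `fit`, `b`, `terms` and `equation` are illustrative.

```r
# Stand-in data: built-in mtcars, NOT the auto-mpg file used in this report
fit <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
b   <- coef(fit)  # named vector: (Intercept), disp, hp, wt, qsec

# One signed term per predictor, e.g. "-0.012345 * disp"
terms <- sprintf("%+.6f * %s", b[-1], names(b)[-1])

# Intercept first, then the signed terms
equation <- paste0("mpg = ", sprintf("%.6f", b[1]), " ",
                   paste(terms, collapse = " "))
cat(equation, "\n")
```

The same lines should work on the auto-mpg fits once the formula and data frame are swapped back.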
# significance: with all 392 data points, only weight and horsepower have p-values < 0.05 (2.3e-10 and 0.00885 respectively); with the 40-point sample, none of the four independent variables has a p-value < 0.05
# significance: population
(lm.auto$coefficients[, 4])
## (Intercept) displacement horsepower weight acceleration
## 7.072099e-55 3.716584e-01 8.848982e-03 2.302545e-10 8.538765e-01
# significance: sample
(lm.s$coefficients[, 4])
## (Intercept) displacement horsepower weight acceleration
## 5.245973e-06 1.044330e-01 8.810093e-01 3.939484e-01 4.229004e-01
# The standard errors
# population
(lm.auto$coefficients[, 2])
## (Intercept) displacement horsepower weight acceleration
## 2.4560446927 0.0067093055 0.0165734633 0.0008108541 0.1256011622
# The standard errors
# sample
(lm.s$coefficients[, 2])
## (Intercept) displacement horsepower weight acceleration
## 8.964438465 0.029265407 0.063022264 0.002885248 0.471386621
# 95% confidence intervals
# population
confint(lm(mpg ~ displacement + horsepower + weight + acceleration, auto.mpg), level = .95)
## 2.5 % 97.5 %
## (Intercept) 40.422278855 50.080000544
## displacement -0.019192122 0.007190380
## horsepower -0.076193029 -0.011022433
## weight -0.006874738 -0.003686277
## acceleration -0.270094049 0.223798050
# 95% confidence intervals
# sample
confint(lm(mpg ~ displacement + horsepower + weight + acceleration, df.s), level = .95)
## 2.5 % 97.5 %
## (Intercept) 29.931361230 66.328916430
## displacement -0.108198365 0.010625505
## horsepower -0.137444946 0.118439050
## weight -0.008347676 0.003367054
## acceleration -1.339215408 0.574716027
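The estimates, standard errors, p-values and confidence intervals reported separately above can be gathered into one table per run. The sketch below is a hedged illustration: `report_fit` is a hypothetical helper, and the built-in `mtcars` data (with a 20-row subsample) stands in for the auto-mpg data.

```r
# Hypothetical helper: one row per coefficient, all reported quantities side by side
report_fit <- function(data) {
  fit <- lm(mpg ~ disp + hp + wt + qsec, data = data)
  tab <- coef(summary(fit))          # Estimate, Std. Error, t value, Pr(>|t|)
  ci  <- confint(fit, level = 0.95)  # 2.5 % and 97.5 % bounds
  data.frame(
    estimate  = tab[, "Estimate"],
    std_error = tab[, "Std. Error"],
    p_value   = tab[, "Pr(>|t|)"],
    ci_lower  = ci[, 1],
    ci_upper  = ci[, 2],
    row.names = rownames(tab)
  )
}

# Stand-in data: built-in mtcars, NOT the auto-mpg file used in this report
full <- report_fit(mtcars)

set.seed(234)
sub <- report_fit(mtcars[sample(nrow(mtcars), 20), ])

print(full)
print(sub)
```

Fitting the model once and reusing the object for both `summary()` and `confint()` also avoids repeating the formula in several places.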
Conclusions from comparing the regression on the whole population (392 data points) with the regression on the 40-point sample:
the population model identifies two significant independent variables (horsepower and weight, p < 0.05), whereas the 40-point sample model identifies none.
the sample coefficients have larger standard errors than the population coefficients.
the population coefficients have narrower 95% confidence intervals than the sample coefficients.
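The pattern in these conclusions is what one would expect, since standard errors shrink roughly in proportion to 1/sqrt(n). The sketch below illustrates this on simulated data, not on the auto-mpg data; the true coefficients (2 and 3), the seed and all object names are arbitrary assumptions.

```r
set.seed(1)
n <- 392
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)  # known linear relationship plus noise
sim <- data.frame(x, y)

fit_full <- lm(y ~ x, data = sim)                   # all 392 simulated points
fit_sub  <- lm(y ~ x, data = sim[sample(n, 40), ])  # random 40-point subsample

# Width of each 95% confidence interval (upper bound minus lower bound)
width <- function(fit) unname(apply(confint(fit), 1, diff))
width_full <- width(fit_full)
width_sub  <- width(fit_sub)

print(rbind(full = width_full, sub = width_sub))
print(width_sub / width_full)  # ratios near sqrt(392 / 40), about 3, would be typical
```

Wider intervals in the subsample mean the same underlying effect is less likely to reach p < 0.05 with 40 points than with 392.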