Using R’s lm function, perform a regression analysis and measure the significance of the independent variables for the following two data sets. In the first case, you are evaluating the commonly heard statement that a person's Maximum Heart Rate is related to their age by the following equation:
\[MaxHR = 220 - Age\]
You have been given the following sample:
Age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
MaxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
(HR_data <- data.frame(Age,MaxHR))
##    Age MaxHR
## 1   18   202
## 2   23   186
## 3   25   187
## 4   35   180
## 5   65   156
## 6   54   169
## 7   34   174
## 8   56   172
## 9   72   153
## 10  19   199
## 11  23   193
## 12  42   174
## 13  18   198
## 14  39   183
## 15  37   178
Perform a linear regression analysis fitting the Max Heart Rate to Age using the lm function in R. What is the resulting equation? Is the effect of Age on Max HR significant? What is the significance level? Please also plot the fitted relationship between Max HR and Age.
# lm function to perform a linear regression analysis
(HR_data_lm <- summary(lm(MaxHR ~ Age, data = HR_data)))
##
## Call:
## lm(formula = MaxHR ~ Age, data = HR_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.9258 -2.5383  0.3879  3.1867  6.6242
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 210.04846    2.86694   73.27  < 2e-16 ***
## Age          -0.79773    0.06996  -11.40 3.85e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.578 on 13 degrees of freedom
## Multiple R-squared: 0.9091, Adjusted R-squared: 0.9021
## F-statistic: 130 on 1 and 13 DF, p-value: 3.848e-08
Regression fit equation
\(\widehat{MaxHR} = 210.0484584 - 0.7977266 \times Age\)
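When an equation like this is generated from the fitted coefficients, joining them with a fixed plus sign produces forms such as "+ -0.79". The following is a minimal sketch of a helper that writes each sign explicitly. The name fit_equation and its digits argument are hypothetical, introduced only for illustration, and the sample is repeated so the block runs on its own.

```r
# Sample data from the exercise
Age   <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
MaxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
fit <- lm(MaxHR ~ Age)

# Hypothetical helper: render a fitted lm as a LaTeX equation string,
# emitting "-" or "+" from the sign of each slope coefficient
fit_equation <- function(fit, digits = 4) {
  b <- coef(fit)
  response <- all.vars(formula(fit))[1]
  slope_terms <- sprintf("%s %s \\times %s",
                         ifelse(b[-1] < 0, "-", "+"),
                         formatC(abs(b[-1]), digits = digits, format = "f"),
                         names(b)[-1])
  paste0("\\widehat{", response, "} = ",
         formatC(b[1], digits = digits, format = "f"), " ",
         paste(slope_terms, collapse = " "))
}

eq <- fit_equation(fit)
cat(eq, "\n")
# \widehat{MaxHR} = 210.0485 - 0.7977 \times Age
```

The same helper should work for the multi-predictor fits later in the report, since it loops over every coefficient after the intercept.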
Hypotheses
\(H_0\): There is no linear relationship between a person's Maximum Heart Rate and Age (the slope is zero).
\(H_A\): There is a linear relationship between a person's Maximum Heart Rate and Age (the slope is not zero).
The Age coefficient has a p-value of \(3.8479865 \times 10^{-8}\), which is far below 0.05 and is significant even at the 0.001 level (marked *** in the summary). Hence we can reject \(H_0\) and conclude that Age has a statistically significant effect on Max HR.
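Significance of the slope only tells us that it differs from zero. It does not tell us whether the sample agrees with the stated rule MaxHR = 220 - Age. One way to check, sketched here under the usual linear-model assumptions, is to see whether the 95% confidence intervals for the intercept and slope contain 220 and -1. The names claimed and contains_claim are illustrative and not part of the original analysis.

```r
# Sample data from the exercise
Age   <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
MaxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
fit <- lm(MaxHR ~ Age)

# 95% confidence intervals for the intercept and the slope
ci <- confint(fit, level = 0.95)
print(ci)

# Coefficient values implied by the rule MaxHR = 220 - Age
claimed <- c("(Intercept)" = 220, Age = -1)

# TRUE where the interval contains the claimed value
contains_claim <- ci[, 1] <= claimed & claimed <= ci[, 2]
print(contains_claim)
```

With the estimates and standard errors reported in the summary, the intervals are roughly 203.9 to 216.2 for the intercept and -0.95 to -0.65 for the slope. This small sample therefore suggests a flatter line than the rule of thumb, though 15 observations are a thin basis for a general claim.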
plot(HR_data$Age, HR_data$MaxHR, main = "Scatter Plot", xlab = "Age", ylab = "MaxHR")
# Overlay the fitted regression line
abline(lm(MaxHR ~ Age, data = HR_data), col = "red")
Using the Auto data set from Assignment 5 (also attached here) perform a Linear Regression analysis using mpg as the dependent variable and the other 4 (displacement, horsepower, weight, acceleration) as independent variables. What is the final linear regression fit equation? Which of the 4 independent variables have a significant impact on mpg? What are their corresponding significance levels? What are the standard errors on each of the coefficients? Please perform this experiment in two ways. First take any random 40 data points from the entire auto data sample and perform the linear regression fit and measure the 95% confidence intervals. Then, take the entire data set (all 392 points) and perform linear regression and measure the 95% confidence intervals. Please report the resulting fit equation, their significance values and confidence intervals for each of the two runs.
# Helper to set the working directory
# Note: both the .Rmd and the .data file need to be in the same directory
set_directory <- function(directory) {
  if (!is.null(directory)) {
    setwd(directory)
  }
}
# Read data and give column names
auto_data <- read.table("auto-mpg.data", header = FALSE, as.is = TRUE)
colnames(auto_data) <- c("displacement", "horsepower", "weight", "acceleration", "mpg")
# Take a random sample of 40 data points (seed fixed for reproducibility)
set.seed(7)
auto_rand40 <- auto_data[sample(seq_len(nrow(auto_data)), 40, replace = FALSE), ]
# Fit the linear regression on the 40-point sample
auto_rand40_fit <- lm(mpg ~ displacement + horsepower + weight + acceleration, data = auto_rand40)
(auto_rand40_lm <- summary(auto_rand40_fit))
##
## Call:
## lm(formula = mpg ~ displacement + horsepower + weight + acceleration,
## data = auto_rand40)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7.2596 -2.5462 -0.3807  3.3916  6.7119
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  52.828807   8.795366   6.006 7.57e-07 ***
## displacement  0.031700   0.019796   1.601  0.11830
## horsepower   -0.072520   0.051217  -1.416  0.16563
## weight       -0.008321   0.002637  -3.156  0.00328 **
## acceleration -0.154434   0.483699  -0.319  0.75141
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.189 on 35 degrees of freedom
## Multiple R-squared: 0.7588, Adjusted R-squared: 0.7312
## F-statistic: 27.52 on 4 and 35 DF, p-value: 2.223e-10
Final linear regression fit equation
\[\widehat{mpg} = 52.8288075 + 0.0316996 \times displacement - 0.0725195 \times horsepower - 0.008321 \times weight - 0.1544343 \times acceleration\]
p_values <- auto_rand40_lm$coefficients[2:5,"Pr(>|t|)"]
p_values[which(p_values < .05)]
## weight
## 0.00328421
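Based on the p-values, we can conclude that only weight (p = 0.0032842) has a significant impact on mpg at the 0.05 level; it is also significant at the 0.01 level (marked ** in the summary). Displacement, horsepower, and acceleration are not significant in this 40-point sample.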
Standard errors
auto_rand40_lm$coefficients[2:5,"Std. Error"]
## displacement   horsepower       weight acceleration
##   0.01979603   0.05121655   0.00263675   0.48369871
95% confidence intervals
confint(auto_rand40_fit, level=0.95)
##                     2.5 %       97.5 %
## (Intercept)  34.973266090 70.684348870
## displacement -0.008488451  0.071887684
## horsepower   -0.176494639  0.031455610
## weight       -0.013673926 -0.002968152
## acceleration -1.136394839  0.827526336
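For reference, each interval reported by confint for an lm fit is the estimate plus or minus a t critical value times the standard error. The sketch below reproduces the weight interval from the numbers printed above, so it does not need the data file. The variable names are illustrative, and the estimate is taken at the rounded precision shown in the summary.

```r
# Values copied from the 40-point summary output (weight row)
estimate  <- -0.008321
std_error <- 0.00263675
df_resid  <- 35  # 40 observations minus 5 estimated coefficients

# Two-sided 95% critical value of the t distribution
t_crit <- qt(0.975, df = df_resid)

# Interval: estimate -/+ t_crit * standard error
ci_weight <- c(lower = estimate - t_crit * std_error,
               upper = estimate + t_crit * std_error)
print(ci_weight)
```

The result should agree with the weight row of the confint output up to the rounding of the inputs.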
# Fit the same linear regression on the entire data set (all 392 points)
auto_data_fit <- lm(mpg ~ displacement + horsepower + weight + acceleration, auto_data)
auto_data_lm <- summary(auto_data_fit)
Final linear regression fit equation
\[\widehat{mpg} = 45.2511397 - 0.0060009 \times displacement - 0.0436077 \times horsepower - 0.0052805 \times weight - 0.023148 \times acceleration\]
p_values <- auto_data_lm$coefficients[2:5,"Pr(>|t|)"]
p_values[which(p_values < .05)]
## horsepower weight
## 8.848982e-03 2.302545e-10
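Based on the p-values, we can conclude that horsepower (p = 0.008849) and weight (p = \(2.302545 \times 10^{-10}\)) have a significant impact on mpg at the 0.05 level. Horsepower is significant at the 0.01 level and weight at the 0.001 level; displacement and acceleration are not significant.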
Standard errors
auto_data_lm$coefficients[2:5,"Std. Error"]
## displacement   horsepower       weight acceleration
## 0.0067093055 0.0165734633 0.0008108541 0.1256011622
95% confidence intervals
confint(auto_data_fit, level=0.95)
##                     2.5 %       97.5 %
## (Intercept)  40.422278855 50.080000544
## displacement -0.019192122  0.007190380
## horsepower   -0.076193029 -0.011022433
## weight       -0.006874738 -0.003686277
## acceleration -0.270094049  0.223798050
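Comparing the two fits, the full data set gives smaller standard errors and narrower 95% confidence intervals for all four coefficients. The p-values for weight and horsepower also drop, and horsepower becomes significant, while displacement and acceleration remain non-significant in both fits. We can therefore conclude that the larger data set gives more precise coefficient estimates and a more reliable basis for estimation.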
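The effect of sample size on interval width can also be illustrated without the auto data. The sketch below uses simulated data with made-up coefficients (an assumption for illustration only, not the auto data set) and compares the 95% confidence-interval widths from a 40-row subsample against those from all 392 rows. The helpers simulate_sample and ci_width are hypothetical.

```r
set.seed(1)

# Simulate a data set with known, purely illustrative coefficients
simulate_sample <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n, sd = 2)
  data.frame(y, x1, x2)
}

# Width of the 95% confidence interval for each coefficient
ci_width <- function(d) {
  ci <- confint(lm(y ~ x1 + x2, data = d), level = 0.95)
  ci[, 2] - ci[, 1]
}

full  <- simulate_sample(392)
small <- full[sample(nrow(full), 40), ]

# Side-by-side widths and their ratio
widths <- cbind(n40 = ci_width(small), n392 = ci_width(full))
widths <- cbind(widths, ratio = widths[, "n40"] / widths[, "n392"])
print(round(widths, 3))
```

Because standard errors shrink roughly in proportion to the square root of the sample size, the ratio column should come out somewhere near \(\sqrt{392/40} \approx 3.1\), although individual draws vary.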