Using R’s lm function, perform regression analysis and measure the significance of the independent variables for the following two data sets. In the first case, you are evaluating the commonly heard statement that a person's maximum heart rate is related to their age by the following equation:
\[ MaxHR = 220 - Age \]
You have been given the following sample:
Age: 18 23 25 35 65 54 34 56 72 19 23 42 18 39 37
MaxHR: 202 186 187 180 156 169 174 172 153 199 193 174 198 183 178
Perform a linear regression analysis fitting the Max Heart Rate to Age using the lm function in R. What is the resulting equation? Is the effect of Age on Max HR significant? What is the significance level? Please also plot the fitted relationship between Max HR and Age.
age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37 )
maxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
hr.model <- lm(maxHR ~ age)
summary(hr.model)
##
## Call:
## lm(formula = maxHR ~ age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9258 -2.5383 0.3879 3.1867 6.6242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 210.04846 2.86694 73.27 < 2e-16 ***
## age -0.79773 0.06996 -11.40 3.85e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.578 on 13 degrees of freedom
## Multiple R-squared: 0.9091, Adjusted R-squared: 0.9021
## F-statistic: 130 on 1 and 13 DF, p-value: 3.848e-08
# MaxHR = 210.04846 - 0.79773 * Age
The resulting equation is:
\begin{multline} \widehat{MaxHR} = 210.04846 - 0.79773 * Age \end{multline}
Let’s set up the hypothesis test using a simple linear regression model:
\begin{multline} y = b_0 + b_1x + e \end{multline}
\(H_0\): Age does not have an effect on maximum heart rate, i.e. \(b_1 = 0\) (\(b_1\) represents the size of the effect).
\(H_A\): Age does have an effect on maximum heart rate, i.e. \(b_1 \neq 0\).
The p-value associated with age, 0.00000003847987 (about 3.85e-08), is much less than 0.05. Consequently, we reject the null hypothesis that \(b_1 = 0\) and conclude that there is a significant relationship between age and maximum heart rate.
This is also confirmed by the triple-star flag next to the age coefficient in the summary of hr.model, which marks p-values below 0.001.
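A type I error is the false rejection of a null hypothesis that is actually true. The probability of committing a type I error is called the significance level of the hypothesis test. With a p-value of about 3.85e-08, the effect of age is significant at the conventional 0.05 level and even at the much stricter 0.001 level.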
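The test described above can be reproduced by hand from the coefficient table. The following is a minimal sketch, assuming the same age and maxHR vectors as above; the names t_age, p_age, alpha and reject_h0 are introduced here purely for illustration, and 0.05 is an assumed significance level.

```r
# Data copied from the sample given above
age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
maxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
hr.model <- lm(maxHR ~ age)

# Coefficient table with columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(hr.model)$coefficients

# Under H0: b1 = 0, the t statistic is the estimate divided by its standard error
t_age <- coefs["age", "Estimate"] / coefs["age", "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom
df_resid <- df.residual(hr.model)
p_age <- 2 * pt(-abs(t_age), df = df_resid)

# Decision at an assumed significance level of 0.05
alpha <- 0.05
reject_h0 <- p_age < alpha

print(c(t = t_age, p = p_age))
print(reject_h0)
```

The hand-computed p-value should agree with the Pr(>|t|) entry that summary() reports for age, since both come from the same t statistic and degrees of freedom.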
We’ll plot the fitted values along with the actual values for comparison.
# plot the fitted values; widen ylim so that no actual observation is clipped
plot(age, fitted(hr.model), col = "red", xlab = "Age", ylab = "Maximum Heart Rate",
     ylim = range(maxHR, fitted(hr.model)),
     main = "Max Heart Rate ~ Age: Fitted vs. Actual")
# overlay the observed values
points(age, maxHR, col = "blue")
legend("topright",                      # legend position
       legend = c("Fitted", "Actual"),  # legend labels
       pch = c(1, 1),                   # point symbols matching the plot
       col = c("red", "blue"))          # colors matching the plot
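Besides plotting, the fitted relationship can be evaluated numerically and set against the rule of thumb MaxHR = 220 - Age. This is only a sketch: the ages 20, 40 and 60 are hypothetical evaluation points chosen for illustration, and the names new_ages, pred and comparison are not part of the original analysis.

```r
# Data copied from the sample given above
age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
maxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
hr.model <- lm(maxHR ~ age)

# Hypothetical ages at which to evaluate the fitted line
new_ages <- data.frame(age = c(20, 40, 60))

# Fitted values with 95% confidence intervals for the mean response
pred <- predict(hr.model, newdata = new_ages, interval = "confidence", level = 0.95)

# Side-by-side comparison with the rule of thumb MaxHR = 220 - Age
comparison <- cbind(new_ages, pred, rule_of_thumb = 220 - new_ages$age)
print(comparison)
```

The columns fit, lwr and upr hold the fitted value and the bounds of the confidence interval, so each row shows whether the rule-of-thumb value falls inside the interval estimated from this sample.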
Using the Auto data set from Assignment 5, perform a linear regression analysis using mpg as the dependent variable and the other 4 (displacement, horsepower, weight, acceleration) as independent variables.
Consider the modified auto-mpg data (obtained from the UC Irvine Machine Learning Repository). This dataset contains 5 columns: displacement, horsepower, weight, acceleration, mpg.
# read the auto-mpg dataset into a dataframe
setwd("C:\\Users\\keith\\Documents\\DataScience\\CUNY\\DATA605\\Week5\\assign5")
auto_mpg <- read.table("auto-mpg.data", header=FALSE)
# set the column names
colnames(auto_mpg) <- c("displacement", "horsepower", "weight", "acceleration", "mpg")
head(auto_mpg)
## displacement horsepower weight acceleration mpg
## 1 307 130 3504 12.0 18
## 2 350 165 3693 11.5 15
## 3 318 150 3436 11.0 18
## 4 304 150 3433 12.0 16
## 5 302 140 3449 10.5 17
## 6 429 198 4341 10.0 15
str(auto_mpg)
## 'data.frame': 392 obs. of 5 variables:
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
Please perform this experiment in two ways. First take any random 40 data points from the entire auto data sample and perform the linear regression fit and measure the 95% confidence intervals.
# use dplyr
suppressWarnings(suppressMessages(library(dplyr)))
set.seed(123)
# take 40 random data points from the auto data
auto_mpg.sample <- sample_n(auto_mpg, size = 40)
# confirm
dim(auto_mpg.sample)
## [1] 40 5
# perform the linear regression
auto_fit.sample <- lm(mpg ~ displacement + horsepower + weight + acceleration, data = auto_mpg.sample)
# model output
summary(auto_fit.sample)
##
## Call:
## lm(formula = mpg ~ displacement + horsepower + weight + acceleration,
## data = auto_mpg.sample)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3741 -3.2178 -0.4445 2.5421 10.9468
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.756053 10.886881 1.723 0.09375 .
## displacement -0.005115 0.023178 -0.221 0.82660
## horsepower 0.113919 0.064034 1.779 0.08392 .
## weight -0.009710 0.002754 -3.526 0.00120 **
## acceleration 1.497830 0.472153 3.172 0.00314 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.4 on 35 degrees of freedom
## Multiple R-squared: 0.7626, Adjusted R-squared: 0.7355
## F-statistic: 28.11 on 4 and 35 DF, p-value: 1.691e-10
Two independent variables appear to have a significant impact on mpg at the 0.05 level: weight and acceleration. Weight has the smallest p-value and is therefore the most statistically significant predictor.
Variable | Significance Level |
---|---|
displacement | 0.826603564 |
horsepower | 0.083918861 |
weight | 0.001197785 |
acceleration | 0.003142646 |
In R:
summary(auto_fit.sample)$coefficients[, 4]
## (Intercept) displacement horsepower weight acceleration
## 0.093748756 0.826603564 0.083918861 0.001197785 0.003142646
summary(auto_fit.sample)$coefficients[, 2]
## (Intercept) displacement horsepower weight acceleration
## 10.886880701 0.023177660 0.064033658 0.002753505 0.472153349
confint(auto_fit.sample, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -3.34548975 40.857595897
## displacement -0.05216861 0.041937695
## horsepower -0.01607643 0.243914043
## weight -0.01529989 -0.004120062
## acceleration 0.53930776 2.456352270
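For a single coefficient, the interval returned by confint() is the estimate plus or minus a critical t value times the standard error. The sketch below redoes that arithmetic for the weight coefficient, using only the numbers printed above rather than the data file; the variable names are illustrative, and the result matches the printed bounds only up to the rounding of the reported estimate.

```r
# Values reported above for the weight coefficient in the 40-row sample
estimate  <- -0.009710     # Estimate column of summary(auto_fit.sample)
std_error <- 0.002753505   # Std. Error from summary(auto_fit.sample)$coefficients[, 2]
df_resid  <- 35            # residual degrees of freedom (40 rows - 5 coefficients)

# Critical t value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Interval: estimate +/- t_crit * standard error
ci_weight <- c(lower = estimate - t_crit * std_error,
               upper = estimate + t_crit * std_error)
print(ci_weight)
```

Because the whole interval lies below zero, it agrees with the p-value for weight being under 0.05.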
set.seed(123)
# confirm size of the auto dataset
dim(auto_mpg)
## [1] 392 5
# perform the linear regression
auto.fit <- lm(mpg ~ displacement + horsepower + weight + acceleration, data = auto_mpg)
# model output
summary(auto.fit)
##
## Call:
## lm(formula = mpg ~ displacement + horsepower + weight + acceleration,
## data = auto_mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.378 -2.793 -0.333 2.193 16.256
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.2511397 2.4560447 18.424 < 2e-16 ***
## displacement -0.0060009 0.0067093 -0.894 0.37166
## horsepower -0.0436077 0.0165735 -2.631 0.00885 **
## weight -0.0052805 0.0008109 -6.512 2.3e-10 ***
## acceleration -0.0231480 0.1256012 -0.184 0.85388
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.247 on 387 degrees of freedom
## Multiple R-squared: 0.707, Adjusted R-squared: 0.704
## F-statistic: 233.4 on 4 and 387 DF, p-value: < 2.2e-16
Using the full auto mpg dataset, two independent variables appear to have a significant impact on mpg at the 0.05 level: horsepower and weight. Weight has the smallest p-value and is therefore the most statistically significant predictor.
Variable | Significance Level |
---|---|
displacement | 0.3716583860 |
horsepower | 0.0088489821 |
weight | 0.0000000002 |
acceleration | 0.8538764883 |
In R:
format(round(summary(auto.fit)$coefficients[,4], 10), scientific = FALSE)
## (Intercept) displacement horsepower weight acceleration
## "0.0000000000" "0.3716583860" "0.0088489821" "0.0000000002" "0.8538764883"
summary(auto.fit)$coefficients[, 2]
## (Intercept) displacement horsepower weight acceleration
## 2.4560446927 0.0067093055 0.0165734633 0.0008108541 0.1256011622
confint(auto.fit, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 40.422278855 50.080000544
## displacement -0.019192122 0.007190380
## horsepower -0.076193029 -0.011022433
## weight -0.006874738 -0.003686277
## acceleration -0.270094049 0.223798050
Model | Variable | Coefficient | Standard Error | Significance Level | CI Lower (95%) | CI Upper (95%) |
---|---|---|---|---|---|---|
Auto - Sampled | Intercept | 18.756053 | 10.886881 | 0.09375 | -3.34548975 | 40.857595897 |
Auto - Sampled | displacement | -0.005115 | 0.023178 | 0.82660 | -0.05216861 | 0.041937695 |
Auto - Sampled | horsepower | 0.113919 | 0.064034 | 0.08392 | -0.01607643 | 0.243914043 |
Auto - Sampled | weight | -0.009710 | 0.002754 | 0.00120 | -0.01529989 | -0.004120062 |
Auto - Sampled | acceleration | 1.497830 | 0.472153 | 0.00314 | 0.53930776 | 2.456352270 |
Model | Variable | Coefficient | Standard Error | Significance Level | CI Lower (95%) | CI Upper (95%) |
---|---|---|---|---|---|---|
Auto - Full | Intercept | 45.2511397 | 2.4560447 | 0.0000000000 | 40.422278855 | 50.080000544 |
Auto - Full | displacement | -0.0060009 | 0.0067093 | 0.3716583860 | -0.019192122 | 0.007190380 |
Auto - Full | horsepower | -0.0436077 | 0.0165735 | 0.0088489821 | -0.076193029 | -0.011022433 |
Auto - Full | weight | -0.0052805 | 0.0008109 | 0.0000000002 | -0.006874738 | -0.003686277 |
Auto - Full | acceleration | -0.0231480 | 0.1256012 | 0.8538764883 | -0.270094049 | 0.223798050 |
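The comparison above shows that every standard error decreased when the full Auto mpg dataset (392 observations) was used in place of the 40-row sample; for example, the standard error for weight fell from 0.002754 to 0.0008109. Consequently, the 95% confidence intervals are narrower for the full model.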
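The narrowing can be quantified directly from the bounds in the two tables. The following sketch copies the reported 95% bounds into two small data frames (the names sample_ci, full_ci and widths are illustrative) and compares the interval widths; it does not refit either model.

```r
vars <- c("displacement", "horsepower", "weight", "acceleration")

# 95% confidence bounds copied from the confint() output for the 40-row sample
sample_ci <- data.frame(
  lower = c(-0.05216861, -0.01607643, -0.01529989, 0.53930776),
  upper = c(0.041937695, 0.243914043, -0.004120062, 2.456352270),
  row.names = vars
)

# 95% confidence bounds copied from the confint() output for the full dataset
full_ci <- data.frame(
  lower = c(-0.019192122, -0.076193029, -0.006874738, -0.270094049),
  upper = c(0.007190380, -0.011022433, -0.003686277, 0.223798050),
  row.names = vars
)

# Interval widths for both models and how many times wider the sample interval is
widths <- data.frame(
  sample_40 = sample_ci$upper - sample_ci$lower,
  full_392  = full_ci$upper - full_ci$lower,
  row.names = vars
)
widths$ratio <- widths$sample_40 / widths$full_392
print(round(widths, 4))
```

A ratio above 1 means the interval from the 40-row sample is wider than the interval from the full dataset for that variable.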