Linear Regression in R

Using R’s lm function, perform regression analysis and measure the significance of the independent variables for the following two data sets. In the first case, you are evaluating the commonly heard statement that a person's maximum heart rate is related to their age by the following equation:

\[ MaxHR = 220 - Age \]

You have been given the following sample:

Age: 18 23 25 35 65 54 34 56 72 19 23 42 18 39 37
MaxHR: 202 186 187 180 156 169 174 172 153 199 193 174 198 183 178

Perform a linear regression analysis fitting the Max Heart Rate to Age using the lm function in R. What is the resulting equation? Is the effect of Age on Max HR significant? What is the significance level? Please also plot the fitted relationship between Max HR and Age.

age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39,  37 )
maxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)

hr.model <- lm(maxHR ~ age)

summary(hr.model)
## 
## Call:
## lm(formula = maxHR ~ age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9258 -2.5383  0.3879  3.1867  6.6242 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 210.04846    2.86694   73.27  < 2e-16 ***
## age          -0.79773    0.06996  -11.40 3.85e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.578 on 13 degrees of freedom
## Multiple R-squared:  0.9091, Adjusted R-squared:  0.9021 
## F-statistic:   130 on 1 and 13 DF,  p-value: 3.848e-08
# MaxHR = 210.04846 - 0.79773 * Age

The resulting equation is:

\begin{multline} \widehat{MaxHR} = 210.04846 - 0.79773 \times Age \end{multline}
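As a rough check on how the fitted line relates to the MaxHR = 220 - Age rule quoted at the start, the sketch below refits the model on the sample and compares predictions from both at a few ages. The ages 20, 40 and 60 and the names `new_ages` and `comparison` are illustrative choices, not part of the assignment.

```r
# Sample data from the assignment
age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
maxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)

hr.model <- lm(maxHR ~ age)

# Illustrative ages (an assumption, chosen only for the comparison)
new_ages <- data.frame(age = c(20, 40, 60))

comparison <- data.frame(
  age           = new_ages$age,
  fitted_model  = predict(hr.model, newdata = new_ages), # 210.04846 - 0.79773 * age
  rule_of_thumb = 220 - new_ages$age                     # the quoted rule
)

print(comparison)
```

The fitted slope is shallower than the -1 implied by the rule, so the two lines diverge most at the youngest ages in this sample.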

Is the effect of Age on Max HR significant?

Let’s set up the hypothesis test using a simple linear regression model:

\begin{multline} y = b_0 + b_1x + e \end{multline}

\(H_0\) : Age does not have an effect on maximum heart rate, that is, \(b_1 = 0\)
(\(b_1\) represents the size of the effect)
\(H_A\) : Age does have an effect on maximum heart rate, that is, \(b_1 \neq 0\)

The p-value associated with age, 0.00000003847987 (about 3.85e-08), is much less than 0.05. Consequently, we reject the null hypothesis that \(b_1 = 0\) and conclude that there is a significant relationship between age and maximum heart rate.

This is also confirmed by the triple-star code for age in the summary of hr.model, which marks p-values below 0.001.

A type I error is the mistake of rejecting a null hypothesis that is actually true. The probability of committing a type I error is called the significance level of the hypothesis test. Here the test is carried out at a significance level of 0.05, and the effect of age would remain significant even at the 0.001 level.
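The p-value reported by lm can be reproduced by hand from the slope estimate and its standard error. The following is a minimal sketch that copies those two numbers from the summary output rather than refitting the model; the variable names `b1`, `se_b1`, `t_stat` and `p_value` are illustrative.

```r
# Values copied from the summary(hr.model) output
b1    <- -0.79773  # estimated slope for age
se_b1 <- 0.06996   # standard error of the slope
df    <- 13        # residual degrees of freedom (15 observations - 2 coefficients)

# t statistic for H0: b1 = 0
t_stat <- b1 / se_b1

# two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df)

print(t_stat)
print(p_value)

# decision at the 0.05 significance level
print(p_value < 0.05)
```

Because the inputs are rounded, the result should agree with the summary output only to the displayed precision.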

Plot the relationship

We’ll plot the fitted values along with the actual values for comparison.

# plot the observed values first so the axes cover the full range of maxHR
plot(age, maxHR, col = "blue", xlab = "Age", ylab = "Maximum Heart Rate",
     main = "Max Heart Rate ~ Age: Fitted vs. Actual")

# overlay the fitted values and the fitted regression line
points(age, fitted(hr.model), col = "red")
abline(hr.model, col = "red", lwd = 2.5)

legend("topright",                      # legend position
       legend = c("Fitted", "Actual"),  # legend labels
       pch = c(1, 1),                   # point symbols used in the plot
       lty = c(1, NA),                  # only the fitted series has a line
       col = c("red", "blue"))          # colors matching the plotted series

Using the Auto data set from Assignment 5, perform a linear regression analysis using mpg as the dependent variable and the other 4 variables (displacement, horsepower, weight, acceleration) as independent variables.

Consider the modified auto-mpg data (obtained from the UC Irvine Machine Learning Repository). This dataset contains 5 columns: displacement, horsepower, weight, acceleration, mpg.

# read the auto-mpg dataset into a dataframe

setwd("C:\\Users\\keith\\Documents\\DataScience\\CUNY\\DATA605\\Week5\\assign5")

auto_mpg <-  read.table("auto-mpg.data",  header=FALSE)

# set the column names
colnames(auto_mpg) <- c("displacement", "horsepower", "weight", "acceleration", "mpg")

head(auto_mpg)
##   displacement horsepower weight acceleration mpg
## 1          307        130   3504         12.0  18
## 2          350        165   3693         11.5  15
## 3          318        150   3436         11.0  18
## 4          304        150   3433         12.0  16
## 5          302        140   3449         10.5  17
## 6          429        198   4341         10.0  15
str(auto_mpg)
## 'data.frame':    392 obs. of  5 variables:
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...

Questions:

  • What is the final linear regression fit equation?
  • Which of the 4 independent variables have a significant impact on mpg?
  • What are their corresponding significance levels?
  • What are the standard errors on each of the coefficients?

Please perform this experiment in two ways. First, take any random 40 data points from the entire auto data sample, perform the linear regression fit, and measure the 95% confidence intervals. Then repeat the same analysis using the full data set.

Experiment 1 - Using Sampled Data Points

# use dplyr
suppressWarnings(suppressMessages(library(dplyr)))

set.seed(123)

# take 40 random data points from the auto data
auto_mpg.sample <- sample_n(auto_mpg, size = 40)

# confirm
dim(auto_mpg.sample)
## [1] 40  5
# perform the linear regression 
auto_fit.sample <- lm(mpg ~ displacement + horsepower + weight + acceleration, data = auto_mpg.sample)

# model output
summary(auto_fit.sample)
## 
## Call:
## lm(formula = mpg ~ displacement + horsepower + weight + acceleration, 
##     data = auto_mpg.sample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3741 -3.2178 -0.4445  2.5421 10.9468 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  18.756053  10.886881   1.723  0.09375 . 
## displacement -0.005115   0.023178  -0.221  0.82660   
## horsepower    0.113919   0.064034   1.779  0.08392 . 
## weight       -0.009710   0.002754  -3.526  0.00120 **
## acceleration  1.497830   0.472153   3.172  0.00314 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.4 on 35 degrees of freedom
## Multiple R-squared:  0.7626, Adjusted R-squared:  0.7355 
## F-statistic: 28.11 on 4 and 35 DF,  p-value: 1.691e-10

What is the final linear regression fit equation?

\begin{multline} \widehat{mpg} = 18.756053 - 0.005115\times displacement + 0.113919\times horsepower - 0.009710 \times weight + 1.497830 \times acceleration \end{multline}

Which of the 4 independent variables have a significant impact on mpg?

Two independent variables appear to have a significant impact on mpg at the 0.05 level: weight and acceleration. Weight has the smallest p-value, so it is the most significant of the four.

| Variable | Significance Level |
|---|---|
| displacement | 0.826603564 |
| horsepower | 0.083918861 |
| weight | 0.001197785 |
| acceleration | 0.003142646 |

In R:

summary(auto_fit.sample)$coefficients[, 4]
##  (Intercept) displacement   horsepower       weight acceleration 
##  0.093748756  0.826603564  0.083918861  0.001197785  0.003142646
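The significant variables can also be picked out programmatically by comparing each p-value with the chosen threshold. This sketch hard-codes the p-values printed above so that it runs without the data file; the 0.05 threshold and the names `p_values`, `alpha` and `significant` are assumptions for illustration.

```r
# p-values copied from the coefficient table of the sampled model
p_values <- c(displacement = 0.826603564,
              horsepower   = 0.083918861,
              weight       = 0.001197785,
              acceleration = 0.003142646)

alpha <- 0.05  # assumed significance level

# names of the predictors whose p-value falls below alpha
significant <- names(p_values)[p_values < alpha]
print(significant)

# predictors ordered from most to least significant
print(sort(p_values))
```

With a fitted model in the session, the same filter could be applied directly to `summary(auto_fit.sample)$coefficients[, 4]`.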

What are the standard errors on each of the coefficients?

summary(auto_fit.sample)$coefficients[, 2]
##  (Intercept) displacement   horsepower       weight acceleration 
## 10.886880701  0.023177660  0.064033658  0.002753505  0.472153349

Measure 95% Confidence Interval

confint(auto_fit.sample, level = 0.95)
##                    2.5 %       97.5 %
## (Intercept)  -3.34548975 40.857595897
## displacement -0.05216861  0.041937695
## horsepower   -0.01607643  0.243914043
## weight       -0.01529989 -0.004120062
## acceleration  0.53930776  2.456352270
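For a linear model, each interval returned by confint is the estimate plus or minus a t critical value times the standard error. The sketch below reproduces the weight interval from the numbers reported above; the variable names are illustrative, and the inputs are the rounded values from the output rather than the fitted object.

```r
# Values copied from the sampled-model output for weight
estimate  <- -0.009710     # coefficient estimate
std_error <- 0.002753505   # standard error
df_resid  <- 35            # residual degrees of freedom (40 rows - 5 coefficients)

# critical value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

ci <- c(lower = estimate - t_crit * std_error,
        upper = estimate + t_crit * std_error)
print(ci)
```

The interval excludes zero, which is consistent with weight being significant at the 0.05 level.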

Experiment 2 - Using Full Dataset

set.seed(123)

# confirm size of the auto dataset
dim(auto_mpg)
## [1] 392   5
# perform the linear regression 

auto.fit <- lm(mpg ~ displacement + horsepower + weight + acceleration, data = auto_mpg)

# model output
summary(auto.fit)
## 
## Call:
## lm(formula = mpg ~ displacement + horsepower + weight + acceleration, 
##     data = auto_mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.378  -2.793  -0.333   2.193  16.256 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.2511397  2.4560447  18.424  < 2e-16 ***
## displacement -0.0060009  0.0067093  -0.894  0.37166    
## horsepower   -0.0436077  0.0165735  -2.631  0.00885 ** 
## weight       -0.0052805  0.0008109  -6.512  2.3e-10 ***
## acceleration -0.0231480  0.1256012  -0.184  0.85388    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.247 on 387 degrees of freedom
## Multiple R-squared:  0.707,  Adjusted R-squared:  0.704 
## F-statistic: 233.4 on 4 and 387 DF,  p-value: < 2.2e-16

What is the final linear regression fit equation?

\begin{multline} \widehat{mpg} = 45.2511397 - 0.0060009\times displacement - 0.0436077\times horsepower - 0.0052805 \times weight - 0.0231480\times acceleration \end{multline}

Which of the 4 independent variables have a significant impact on mpg?

Using the full auto mpg dataset, two independent variables appear to have a significant impact on mpg at the 0.05 level: horsepower and weight. Weight has the smallest p-value, so it is the most significant of the four.

| Variable | Significance Level |
|---|---|
| displacement | 0.3716583860 |
| horsepower | 0.0088489821 |
| weight | 0.0000000002 |
| acceleration | 0.8538764883 |

In R:

format(round(summary(auto.fit)$coefficients[,4], 10), scientific = F)
##    (Intercept)   displacement     horsepower         weight   acceleration 
## "0.0000000000" "0.3716583860" "0.0088489821" "0.0000000002" "0.8538764883"

What are the standard errors on each of the coefficients?

summary(auto.fit)$coefficients[, 2]
##  (Intercept) displacement   horsepower       weight acceleration 
## 2.4560446927 0.0067093055 0.0165734633 0.0008108541 0.1256011622

Measure 95% Confidence Interval

confint(auto.fit, level = 0.95)
##                     2.5 %       97.5 %
## (Intercept)  40.422278855 50.080000544
## displacement -0.019192122  0.007190380
## horsepower   -0.076193029 -0.011022433
## weight       -0.006874738 -0.003686277
## acceleration -0.270094049  0.223798050

Model Results

| Model | Variable | Coefficient | Standard Error | Significance Level | CI Lower (95%) | CI Upper (95%) |
|---|---|---|---|---|---|---|
| Auto - Sampled | Intercept | 18.756053 | 10.886881 | 0.09375 | -3.34548975 | 40.857595897 |
| Auto - Sampled | displacement | -0.005115 | 0.023178 | 0.82660 | -0.05216861 | 0.041937695 |
| Auto - Sampled | horsepower | 0.113919 | 0.064034 | 0.08392 | -0.01607643 | 0.243914043 |
| Auto - Sampled | weight | -0.009710 | 0.002754 | 0.00120 | -0.01529989 | -0.004120062 |
| Auto - Sampled | acceleration | 1.497830 | 0.472153 | 0.00314 | 0.53930776 | 2.456352270 |
| Auto - Full | Intercept | 45.2511397 | 2.4560447 | 0.0000000000 | 40.422278855 | 50.080000544 |
| Auto - Full | displacement | -0.0060009 | 0.0067093 | 0.3716583860 | -0.019192122 | 0.007190380 |
| Auto - Full | horsepower | -0.0436077 | 0.0165735 | 0.0088489821 | -0.076193029 | -0.011022433 |
| Auto - Full | weight | -0.0052805 | 0.0008109 | 0.0000000002 | -0.006874738 | -0.003686277 |
| Auto - Full | acceleration | -0.0231480 | 0.1256012 | 0.8538764883 | -0.270094049 | 0.223798050 |

The comparison above shows that the standard error of every coefficient decreased when the full Auto mpg dataset was used. Consequently, the 95% confidence intervals are narrower for the full-data model than for the 40-point sample.
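To put a number on the narrowing, the sketch below computes the width of each 95% interval from the bounds reported above and takes the ratio of sampled to full-data widths. The data frame `ci` and its column names are illustrative. As a loose reference point only, if standard errors scaled purely with the square root of the sample size, the ratio would be about sqrt(392 / 40); the observed ratios need not match it, since the 40-point sample is one particular random draw.

```r
# 95% confidence bounds copied from the confint output of both models
ci <- data.frame(
  variable     = c("displacement", "horsepower", "weight", "acceleration"),
  sample_lower = c(-0.05216861, -0.01607643, -0.01529989, 0.53930776),
  sample_upper = c(0.041937695, 0.243914043, -0.004120062, 2.456352270),
  full_lower   = c(-0.019192122, -0.076193029, -0.006874738, -0.270094049),
  full_upper   = c(0.007190380, -0.011022433, -0.003686277, 0.223798050)
)

# interval widths for each model
ci$sample_width <- ci$sample_upper - ci$sample_lower
ci$full_width   <- ci$full_upper - ci$full_lower

# how many times wider the sampled-data interval is
ci$ratio <- ci$sample_width / ci$full_width

print(ci[, c("variable", "sample_width", "full_width", "ratio")])

# rough reference value under pure sqrt(n) scaling (an idealization)
print(sqrt(392 / 40))
```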
