DATA 605 - Assignment 11

LINEAR REGRESSION IN R

Using R’s lm function, perform regression analysis and measure the significance of the independent variables for the following two data sets. In the first case, you are evaluating the commonly heard statement that a person's Maximum Heart Rate is related to their age by the following equation.

#load data and packages
suppressWarnings(library(faraway))
age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
maxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193,  174, 198, 183, 178)

Perform a linear regression analysis fitting the Max Heart Rate to Age using the lm function in R. What is the resulting equation? Is the effect of Age on Max HR significant? What is the significance level?

#Max HR is the response and age the predictor, as the question asks
model1<-lm(maxHR~age)
#model call
model1$call

## lm(formula = maxHR ~ age)

#variable/target coefficients // model equation
coef(model1)

## (Intercept)         age 
## 210.0484584  -0.7977266

sumary(model1)

##               Estimate Std. Error t value  Pr(>|t|)
## (Intercept) 210.048458   2.866939  73.266 < 2.2e-16
## age          -0.797727   0.069963 -11.402 3.848e-08
## 
## n = 15, p = 2, Residual SE = 4.57780, R-Squared = 0.91

The resulting equation is MaxHR = 210.05 - 0.80 * Age. Yes, the effect of Age on Max HR is significant: the p-value for age is 3.848e-08, far below the 0.05 threshold.
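
As a hedged illustration of where the significance level comes from, the sketch below refits the model with maxHR as the response (as the question asks) and recomputes the p-value for age by hand from the estimate and its standard error. The names fit, coefs, t_age, p_age, and alpha are illustrative and not part of the original analysis, and alpha = 0.05 is an assumed threshold.

```r
# Data from the assignment
age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
maxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)

fit <- lm(maxHR ~ age)

# Coefficient matrix with columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients

# t statistic = estimate divided by its standard error
t_age <- coefs["age", "Estimate"] / coefs["age", "Std. Error"]

# Two-sided p-value from the t distribution with n - 2 residual degrees of freedom
p_age <- 2 * pt(-abs(t_age), df = df.residual(fit))

# Assumed significance threshold (hypothetical choice for this sketch)
alpha <- 0.05
p_age < alpha
```

The hand-computed p_age should agree with the Pr(>|t|) entry that summary() reports for age.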

Please also plot the fitted relationship between Max HR and Age.

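#scatter plot of the data with the fitted regression line
plot(age, maxHR, xlab="Age", ylab="Max HR", main="Max HR vs. Age")
abline(lm(maxHR~age), col="red")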

Using the Auto data set from Assignment 5 (also attached here) perform a Linear Regression analysis using mpg as the dependent variable and the other 4 (displacement, horsepower, weight, acceleration) as independent variables.

Please perform this experiment in two ways. First take any random 40 data points from the entire auto data sample and perform the linear regression fit and measure the 95% confidence intervals. Then repeat the fit and the confidence intervals on the entire data set.

#full data set#
data <- suppressWarnings(read.table("https://raw.githubusercontent.com/RobertSellers/dataWarehouse/master/auto-mpg.data"))
names(data) <- c("mpg", "displacement", "horsepower","weight","acceleration")

#sample of 40
sampleSet<-data[sample(nrow(data), 40), ]

model40<-lm(mpg~.,sampleSet)
modelFull<-lm(mpg~.,data)
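
Because sample() draws a different set of 40 rows on every run, the sample-based results change each time the analysis is rerun. The following is a minimal sketch of one way to make the draw repeatable, assuming a fixed seed is acceptable for the assignment. The autos data frame is a simulated stand-in (not the real auto data), and the seed values are arbitrary.

```r
# Hypothetical stand-in for the auto data (392 rows); values are simulated
set.seed(605)
autos <- data.frame(
  mpg    = rnorm(392, mean = 23, sd = 8),
  weight = rnorm(392, mean = 3000, sd = 800)
)

# Fix the seed immediately before sampling so the same 40 rows are drawn each run
set.seed(11)
rows <- sample(nrow(autos), 40)
sampleSet <- autos[rows, ]

# Re-running with the same seed reproduces the sample exactly
set.seed(11)
identical(rows, sample(nrow(autos), 40))
```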

What is the final linear regression fit equation?

#equation
model40$call

## lm(formula = mpg ~ ., data = sampleSet)

#variable/target coefficients // model equation
coef(model40)

##   (Intercept)  displacement    horsepower        weight  acceleration 
## -235.27711038    1.72368189    0.06479871    2.11936733    1.12521423

#equation
modelFull$call

## lm(formula = mpg ~ ., data = data)

#variable/target coefficients // model equation
coef(modelFull)

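##  (Intercept) displacement   horsepower       weight acceleration 
## -55.47963630   0.63155975   0.08281334  -3.51392367  -0.34375687

Rounded to four decimal places, the fitted equations are:

Sample (n = 40): mpg = -235.2771 + 1.7237*displacement + 0.0648*horsepower + 2.1194*weight + 1.1252*acceleration
Full (n = 392): mpg = -55.4796 + 0.6316*displacement + 0.0828*horsepower - 3.5139*weight - 0.3438*acceleration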
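
A fit equation can be assembled directly from coef() instead of being copied by hand. The sketch below shows one possible helper. fit_equation is a hypothetical function written for this illustration, and R's built-in mtcars data set is used as a stand-in because the auto data file is not bundled here.

```r
# Hypothetical helper: format a fitted lm object as a readable equation
fit_equation <- function(model, digits = 4) {
  b <- coef(model)
  response <- all.vars(formula(model))[1]
  fmt_signed <- paste0("%+.", digits, "f")   # keeps the sign on each slope
  fmt_plain  <- paste0("%.", digits, "f")
  slopes <- paste0(sprintf(fmt_signed, b[-1]), " * ", names(b)[-1])
  paste0(response, " = ", sprintf(fmt_plain, b[1]), " ",
         paste(slopes, collapse = " "))
}

# Stand-in model on mtcars (disp, hp, wt, qsec play the role of the four predictors)
fit <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
fit_equation(fit)
```

The same helper should work on a model fitted with the `mpg ~ .` shorthand, since formula() on an lm object returns the expanded terms.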

Which of the 4 independent variables have a significant impact on mpg? What are their corresponding significance levels? What are the standard errors on each of the coefficients?

#variable/target coefficients
sumary(model40)

##                 Estimate  Std. Error t value  Pr(>|t|)
## (Intercept)  -235.277110   97.481322 -2.4136   0.02117
## displacement    1.723682    0.347974  4.9535 1.846e-05
## horsepower      0.064799    0.013828  4.6861 4.125e-05
## weight          2.119367    3.381309  0.6268   0.53486
## acceleration    1.125214    1.262144  0.8915   0.37874
## 
## n = 40, p = 5, Residual SE = 34.56110, R-Squared = 0.91

sumary(modelFull)

##                 Estimate  Std. Error t value  Pr(>|t|)
## (Intercept)  -55.4796363  25.3119536 -2.1918 0.0289866
## displacement   0.6315597   0.1224161  5.1591 3.972e-07
## horsepower     0.0828133   0.0049061 16.8796 < 2.2e-16
## weight        -3.5139237   0.9337417 -3.7633 0.0001937
## acceleration  -0.3437569   0.3843392 -0.8944 0.3716584
## 
## n = 392, p = 5, Residual SE = 32.14197, R-Squared = 0.91

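The sumary function (from the faraway package) reports each of these values: the Std. Error column gives the standard error of each coefficient and the Pr(>|t|) column gives its significance level.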

Sample Data Set

The significant predictors for the sample dataset are displacement (P=1.846e-05, SE=0.347974) and horsepower (P=4.125e-05, SE=0.013828). Weight (P=0.53486, SE=3.381309) and acceleration (P=0.37874, SE=1.262144) are insignificant. Because the 40 rows are drawn at random, these values change from run to run.

Full Data Set

The significant predictors for the full dataset are displacement (P=3.972e-07, SE=0.1224161), horsepower (P<2.2e-16, SE=0.0049061), and weight (P=0.0001937, SE=0.9337417). Acceleration is insignificant (P=0.3716584, SE=0.3843392).
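
Rather than reading significance off the printed table, the coefficient matrix can be filtered in code. This is a minimal sketch under stated assumptions: mtcars again stands in for the auto data, the 0.05 cutoff is an assumed threshold, and the column names given to tab are illustrative.

```r
# Stand-in model on R's built-in mtcars data
fit <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)

# summary()$coefficients is a matrix: Estimate, Std. Error, t value, Pr(>|t|)
tab <- as.data.frame(summary(fit)$coefficients)
names(tab) <- c("estimate", "std_error", "t_value", "p_value")

# Drop the intercept and flag predictors below the assumed 0.05 threshold
tab <- tab[rownames(tab) != "(Intercept)", ]
tab$significant <- tab$p_value < 0.05

tab[, c("std_error", "p_value", "significant")]
```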

#confidence intervals
confint(model40, level=0.95)

##                      2.5 %      97.5 %
## (Intercept)  -433.17471457 -37.3795062
## displacement    1.01725631   2.4301075
## horsepower      0.03672671   0.0928707
## weight         -4.74505460   8.9837893
## acceleration   -1.43707415   3.6875026

confint(modelFull, level=0.95)

##                      2.5 %      97.5 %
## (Intercept)  -105.24579153 -5.71348108
## displacement    0.39087586  0.87224363
## horsepower      0.07316736  0.09245933
## weight         -5.34976508 -1.67808227
## acceleration   -1.09941104  0.41189731

Comparing the two fits, the full data set (n = 392) gives smaller standard errors and narrower 95% confidence intervals for every coefficient than the 40-point sample, and weight becomes significant (P=0.0001937 versus P=0.53486 in the sample). The smaller the sample, the less precisely the coefficients are estimated.
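
To put numbers on that comparison, the sketch below computes the width of each 95% interval from the confint output shown above and takes the ratio of sample width to full-data width. The interval endpoints are copied from that output; the variable names are illustrative. As a rough guide only, standard errors scale with 1/sqrt(n), so a ratio in the neighborhood of sqrt(392/40) would be expected.

```r
# 95% interval endpoints copied from the confint output for the 40-point sample
sample_ci <- rbind(
  displacement = c(1.01725631, 2.4301075),
  horsepower   = c(0.03672671, 0.0928707),
  weight       = c(-4.74505460, 8.9837893),
  acceleration = c(-1.43707415, 3.6875026)
)

# 95% interval endpoints copied from the confint output for the full data set
full_ci <- rbind(
  displacement = c(0.39087586, 0.87224363),
  horsepower   = c(0.07316736, 0.09245933),
  weight       = c(-5.34976508, -1.67808227),
  acceleration = c(-1.09941104, 0.41189731)
)

# Interval width = upper bound minus lower bound
width_sample <- sample_ci[, 2] - sample_ci[, 1]
width_full   <- full_ci[, 2] - full_ci[, 1]

# How many times wider the sample intervals are than the full-data intervals
ratio <- width_sample / width_full
round(ratio, 2)

# Rough reference value from the 1/sqrt(n) scaling of standard errors
sqrt(392 / 40)
```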


Robert Sellers

November 6, 2016
