Using R’s lm function, perform regression analysis and measure the significance of the independent variables for the following two data sets:

age   <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
maxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
heartData <- data.frame(age, maxHR)

Is the effect of Age on Max HR significant?

\(H_0\) : Age has no effect on Max HR ==> \(b_1 = 0\), where \(b_1\) is the ‘size of the effect’ (the slope) in the linear regression equation \(y = b_0 + b_1 x + e\)
\(H_1\) : Age has an effect on Max HR ==> \(b_1 \neq 0\)

Let’s now use R’s built-in lm() function to fit a linear regression model to the data above:

fit.heartdata <- lm(maxHR ~ age, data = heartData) 
fit.heartdata
## 
## Call:
## lm(formula = maxHR ~ age, data = heartData)
## 
## Coefficients:
## (Intercept)          age  
##    210.0485      -0.7977

Hence, the fitted linear model for the above data is: \[ MaxHR = 210.0485 - 0.7977 \, Age \]
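As a quick illustration of the fitted equation, we can plug an age into the model with predict(). The value 30 is chosen here purely for illustration; from the coefficients above, the prediction should be about 210.05 - 0.80 * 30 ≈ 186:

predict(fit.heartdata, newdata = data.frame(age = 30))  # about 186 bpm for a 30-year-old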

Now, let’s see how significant the model is:

(hdsummary <- summary(fit.heartdata))
## 
## Call:
## lm(formula = maxHR ~ age, data = heartData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9258 -2.5383  0.3879  3.1867  6.6242 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 210.04846    2.86694   73.27  < 2e-16 ***
## age          -0.79773    0.06996  -11.40 3.85e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.578 on 13 degrees of freedom
## Multiple R-squared:  0.9091, Adjusted R-squared:  0.9021 
## F-statistic:   130 on 1 and 13 DF,  p-value: 3.848e-08

From the above, we can reject the null hypothesis \(H_0\) (Age has no effect on Max HR): the p-value for the age coefficient is far below a 0.01 significance threshold.

round(hdsummary$coefficients["age", 4], 10) 
## [1] 3.85e-08
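As a cross-check, the same p-value can be recomputed by hand from the estimate, its standard error, and the residual degrees of freedom (13) reported above:

est  <- hdsummary$coefficients["age", "Estimate"]    # -0.79773
se   <- hdsummary$coefficients["age", "Std. Error"]  #  0.06996
tval <- est / se                                     # about -11.40, matching the t value in the table
2 * pt(abs(tval), df = 13, lower.tail = FALSE)       # two-sided p-value, about 3.85e-08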

And we can conclude that there is a (negative) correlation between Age and Max HR.
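An equivalent way to see this is through a confidence interval on the slope; given the estimate of about -0.798 and standard error of about 0.070, the 95% interval should come out to roughly (-0.95, -0.65), comfortably excluding zero:

confint(fit.heartdata, "age", level = 0.95)  # expected to be roughly (-0.95, -0.65), excluding 0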

library(ggplot2)
ggplot(heartData, aes(age, maxHR)) +
  geom_point() +
  stat_smooth(method = lm, level = .95) +
  xlab("Age") + ylab("Max Heart Rate") +
  ggtitle("Max Heart Rate Vs Age")

Using the Auto data set, perform a linear regression analysis using mpg as the dependent variable and the other 4 (displacement, horsepower, weight, acceleration) as independent variables.

\(H_0\) : displacement, horsepower, weight, and acceleration have NO effect on fuel efficiency, i.e. \(b_1 = b_2 = b_3 = b_4 = 0\)
\(H_1\) : displacement, horsepower, weight, and acceleration have an effect on fuel efficiency, i.e. at least one of \(b_1, b_2, b_3, b_4 \neq 0\)

autodata <- read.table('auto-mpg.data',
                       col.names = c('displacement', 'horsepower', 'weight', 'acceleration', 'mpg'))

head(autodata)
##   displacement horsepower weight acceleration mpg
## 1          307        130   3504         12.0  18
## 2          350        165   3693         11.5  15
## 3          318        150   3436         11.0  18
## 4          304        150   3433         12.0  16
## 5          302        140   3449         10.5  17
## 6          429        198   4341         10.0  15
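A quick sanity check on the import: the full-data fit later in this section reports 387 residual degrees of freedom with 4 predictors plus the intercept, which implies 392 rows were used.

nrow(autodata)  # the full-data fit below (387 residual df, 5 estimated coefficients) implies 392 rows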

Perform linear regression using 40 random data rows from the data set.

Let’s take a random sample of 40 rows, fit the linear model, and measure the 95% confidence interval for each independent variable.

set.seed(10)
random40 <- autodata[sample(nrow(autodata), 40), ]
(auto.fit <- lm(mpg ~ . , data = random40))
## 
## Call:
## lm(formula = mpg ~ ., data = random40)
## 
## Coefficients:
##  (Intercept)  displacement    horsepower        weight  acceleration  
##    44.117698     -0.023242     -0.006429     -0.005408      0.076267
summary(auto.fit)
## 
## Call:
## lm(formula = mpg ~ ., data = random40)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.2563 -2.6450 -0.3425  2.2191 12.2042 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  44.117698  10.879547   4.055 0.000266 ***
## displacement -0.023242   0.025681  -0.905 0.371646    
## horsepower   -0.006429   0.075464  -0.085 0.932590    
## weight       -0.005408   0.003029  -1.785 0.082860 .  
## acceleration  0.076267   0.502775   0.152 0.880301    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.502 on 35 degrees of freedom
## Multiple R-squared:  0.7566, Adjusted R-squared:  0.7288 
## F-statistic:  27.2 on 4 and 35 DF,  p-value: 2.597e-10
auto.fit$coefficients
##  (Intercept) displacement   horsepower       weight acceleration 
## 44.117697855 -0.023241502 -0.006429309 -0.005408394  0.076266690

The equation is: \[mpg = 44.12 - 0.02 Displacement - 0.006 Horsepower - 0.005 Weight + 0.076 Acceleration\]
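As a quick check of the sample model, we can predict mpg for the first row of the original data (displacement 307, horsepower 130, weight 3504, acceleration 12.0); plugging those values into the equation above gives roughly 18.1, close to the actual 18 mpg:

predict(auto.fit, newdata = autodata[1, ])  # roughly 18.1; the actual mpg for this row is 18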

Here, the low p-value of the overall F-test (2.597e-10) suggests we reject the null hypothesis that the independent variables have no effect on the dependent variable.

Let’s measure the 95% confidence intervals:

confint(auto.fit, level=0.95)
##                    2.5 %       97.5 %
## (Intercept)  22.03104408 6.620435e+01
## displacement -0.07537641 2.889340e-02
## horsepower   -0.15962850 1.467699e-01
## weight       -0.01155797 7.411781e-04
## acceleration -0.94442052 1.096954e+00
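From the intervals printed above, every predictor’s interval straddles zero; a small programmatic check of the same thing:

ci <- confint(auto.fit)[-1, ]                          # drop the intercept row
data.frame(lower = ci[, 1], upper = ci[, 2],
           contains_zero = ci[, 1] < 0 & ci[, 2] > 0)  # TRUE for all four predictors above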

Other than the intercept, none of the variables appears particularly significant (each 95% interval contains zero). Let’s repeat the same analysis on the entire data set:

(auto.fit.full = lm(mpg ~ . , data = autodata))
## 
## Call:
## lm(formula = mpg ~ ., data = autodata)
## 
## Coefficients:
##  (Intercept)  displacement    horsepower        weight  acceleration  
##    45.251140     -0.006001     -0.043608     -0.005281     -0.023148
summary(auto.fit.full)
## 
## Call:
## lm(formula = mpg ~ ., data = autodata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.378  -2.793  -0.333   2.193  16.256 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.2511397  2.4560447  18.424  < 2e-16 ***
## displacement -0.0060009  0.0067093  -0.894  0.37166    
## horsepower   -0.0436077  0.0165735  -2.631  0.00885 ** 
## weight       -0.0052805  0.0008109  -6.512  2.3e-10 ***
## acceleration -0.0231480  0.1256012  -0.184  0.85388    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.247 on 387 degrees of freedom
## Multiple R-squared:  0.707,  Adjusted R-squared:  0.704 
## F-statistic: 233.4 on 4 and 387 DF,  p-value: < 2.2e-16
confint(auto.fit.full, level = 0.95)
##                     2.5 %       97.5 %
## (Intercept)  40.422278855 50.080000544
## displacement -0.019192122  0.007190380
## horsepower   -0.076193029 -0.011022433
## weight       -0.006874738 -0.003686277
## acceleration -0.270094049  0.223798050

Using the full data we get the linear fit: \[ mpg = 45.25 - 0.006 Displacement - 0.043 Horsepower - 0.005 Weight - 0.023 Acceleration \]

From the above, in addition to the intercept, the Weight and Horsepower variables show a significant impact on fuel efficiency; note the negative effect they have on mpg. [The Weight coefficient’s p-value is \(2.3 \times 10^{-10}\) and its 95% CI, roughly (-0.0069, -0.0037), excludes zero.] For the other variables (displacement and acceleration), the significance levels and the 95% CIs do not rule out the null hypothesis.
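To see how much the 40-row subsample changes the picture, the two sets of coefficient estimates (both printed above) can be put side by side; with the full data the standard errors shrink enough (e.g. weight: 0.0030 vs 0.0008) for horsepower and weight to become clearly significant:

round(cbind(sample40 = coef(auto.fit), full = coef(auto.fit.full)), 4)  # compare the two fits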

ggplot(autodata, aes(x = weight, y = mpg)) +
  geom_point() +
  stat_smooth(method = lm, level = .95) +
  xlab("Weight") + ylab("mpg") +
  ggtitle("Vehicle Weight vs Fuel Efficiency - Linear Regression 95% CI")