Question 1: Using R’s lm function, perform regression analysis and measure the significance of the independent variables for the following two data sets. In the first case, you are evaluating the statement that we hear that Maximum Heart Rate of a person is related to their age by the following equation:

  • MaxHR = 220 - Age
options(warn = -1)
suppressMessages(library(ggplot2))
suppressMessages(library(plotly))
library(knitr)

dataset <- data.frame( age = c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37), maxHR = c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178))
str(dataset)
## 'data.frame':    15 obs. of  2 variables:
##  $ age  : num  18 23 25 35 65 54 34 56 72 19 ...
##  $ maxHR: num  202 186 187 180 156 169 174 172 153 199 ...

SETTING UP HYPOTHESIS:

  • Null Hypothesis: The population parameters(means) are the same

\({ H }_{ 0 }\quad :\quad { \mu }_{ 1 }=\quad { \mu }_{ 2 }\quad ={ \quad \mu }_{ 3 }............={ \mu }_{ n }\)

  • Alternative Hypothesis: Not \({ H }_{ 0 }\quad\)

\({ H }_{ 1 }\quad :\quad { \mu }_{ 1 }\not=\quad { \mu }_{ 2 }\quad\not ={ \quad \mu }_{ 3 }............\not={ \mu }_{ n }\)

Rejection:

Rejection \({ H }_{ 0 }\) if the calculated value (P-Value) is less the tabulated value(Table Value = 0.05) i.e P \(\le \quad \alpha\), at 0.05, otherwise do not reject \({ H }_{ 0 }\quad\).

plot(dataset$age, dataset$maxHR, main = "Scatter Plot", xlab = "Age", ylab = "MAx HR")
abline(lm(dataset$maxHR~dataset$age), col="blue")

Above is the plot of Age against maxHR, and it reveals that there is a strong negative correlation between Age and MaxHR. Where its correlation co-efficient is stated below;

cor(dataset$maxHR, dataset$age)
## [1] -0.9534656

Below is the linear model between Age and MaxHR with the plot of Residual against the fitted values.

linearmodel <- lm(dataset$maxHR~dataset$age)

summary(linearmodel)
## 
## Call:
## lm(formula = dataset$maxHR ~ dataset$age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9258 -2.5383  0.3879  3.1867  6.6242 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 210.04846    2.86694   73.27  < 2e-16 ***
## dataset$age  -0.79773    0.06996  -11.40 3.85e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.578 on 13 degrees of freedom
## Multiple R-squared:  0.9091, Adjusted R-squared:  0.9021 
## F-statistic:   130 on 1 and 13 DF,  p-value: 3.848e-08

selecting best model using both Forward and backward selection

stepwise <- step(linearmodel, direction = "both")
## Start:  AIC=47.49
## dataset$maxHR ~ dataset$age
## 
##               Df Sum of Sq     RSS    AIC
## <none>                      272.43 47.490
## - dataset$age  1    2724.5 2996.93 81.459
summary(stepwise)
## 
## Call:
## lm(formula = dataset$maxHR ~ dataset$age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9258 -2.5383  0.3879  3.1867  6.6242 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 210.04846    2.86694   73.27  < 2e-16 ***
## dataset$age  -0.79773    0.06996  -11.40 3.85e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.578 on 13 degrees of freedom
## Multiple R-squared:  0.9091, Adjusted R-squared:  0.9021 
## F-statistic:   130 on 1 and 13 DF,  p-value: 3.848e-08
kable(confint(stepwise))
2.5 % 97.5 %
(Intercept) 203.854813 216.2421034
dataset$age -0.948872 -0.6465811

Plot of fitted values and showing relationship between Age and MaxHR

residual_age = data.frame(Fitted = fitted(stepwise),
Residuals = resid(stepwise), Treatment = dataset$age)

ggplot(residual_age, aes(Fitted, Residuals, colour = Treatment)) + geom_point()

Decision and Conclusion:

Since P-value is less than the tabulated value, we therefore fail to accept (reject) \({ H }_{ 0 }\), and conclude that there is statistically significant relationship between Age and MaxHR, i.e there population means are not the same, and that age contribute significantly to MAXHR or better say, the higher the age, the lower the MaxHR.

Question 2:

Using the Auto data set from Assignment 5 (also attached here) perform a Linear Re- gression analysis using mpg as the dependent variable and the other 4 (displacement, horse- power, weight, acceleration) as independent variables. What is the final linear regression fit equation? Which of the 4 independent variables have a significant impact on mpg? What are their corresponding significance levels? What are the standard errors on each of the coeficients? Please perform this experiment in two ways. First take any random 40 data points from the entire auto data sample and perform the linear regression fit and measure the 95% confidence intervals. Then, take the entire data set (all 392 points) and perform linear regression and measure the 95% confidence intervals. Please report the resulting fit equation, their significance values and confidence intervals for each of the two runs. Please submit an R-markdown file documenting your experiments. Your submission should include the final linear fits, and their corresponding signifcance levels. In addition, you should clearly state what you concluded from looking at the fit and their significance levels. 1

auto <- read.csv("https://raw.githubusercontent.com/mascotinme/GitHub/master/MSDA%20605/auto-mpg.txt", header=FALSE, sep = "")

colnames(auto)[1] <- "displacement"
colnames(auto)[2] <- "horsePower"
colnames(auto)[3] <- "weight"
colnames(auto)[4] <-  "acceleration"
colnames(auto)[5] <- "mpg"

kable(head(auto))
displacement horsePower weight acceleration mpg
307 130 3504 12.0 18
350 165 3693 11.5 15
318 150 3436 11.0 18
304 150 3433 12.0 16
302 140 3449 10.5 17
429 198 4341 10.0 15
set.seed(123)
sample1 <- auto[sample(nrow(auto),40 ), ]
dim(sample1)
## [1] 40  5

SETTING UP HYPOTHESIS:

  • Null Hypothesis: The population parameters(means) are the same

\({ H }_{ 0 }\quad :\quad { \mu }_{ 1 }=\quad { \mu }_{ 2 }\quad ={ \quad \mu }_{ 3 }............={ \mu }_{ n }\)

  • Alternative Hypothesis: Not \({ H }_{ 0 }\quad\)

\({ H }_{ 1 }\quad :\quad { \mu }_{ 1 }\not=\quad { \mu }_{ 2 }\quad\not ={ \quad \mu }_{ 3 }............\not={ \mu }_{ n }\)

Rejection:

Rejection \({ H }_{ 0 }\) if the calculated value (P-Value) is less the tabulated value(Table Value = 0.05) i.e P \(\le \quad \alpha\), at 0.05, otherwise do not reject \({ H }_{ 0 }\quad\).

auto_40 <- lm(mpg ~ (displacement+weight+acceleration+horsePower), data=sample1 )

summary(auto_40)
## 
## Call:
## lm(formula = mpg ~ (displacement + weight + acceleration + horsePower), 
##     data = sample1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3741 -3.2178 -0.4445  2.5421 10.9468 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  18.756053  10.886881   1.723  0.09375 . 
## displacement -0.005115   0.023178  -0.221  0.82660   
## weight       -0.009710   0.002754  -3.526  0.00120 **
## acceleration  1.497830   0.472153   3.172  0.00314 **
## horsePower    0.113919   0.064034   1.779  0.08392 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.4 on 35 degrees of freedom
## Multiple R-squared:  0.7626, Adjusted R-squared:  0.7355 
## F-statistic: 28.11 on 4 and 35 DF,  p-value: 1.691e-10
stepwise2 <- step(auto_40, direction = "both")
## Start:  AIC=123.18
## mpg ~ (displacement + weight + acceleration + horsePower)
## 
##                Df Sum of Sq    RSS    AIC
## - displacement  1     0.943 678.45 121.24
## <none>                      677.51 123.18
## - horsePower    1    61.266 738.78 124.64
## - acceleration  1   194.808 872.32 131.29
## - weight        1   240.720 918.23 133.34
## 
## Step:  AIC=121.24
## mpg ~ weight + acceleration + horsePower
## 
##                Df Sum of Sq     RSS    AIC
## <none>                       678.45 121.24
## - horsePower    1     64.80  743.25 122.89
## + displacement  1      0.94  677.51 123.18
## - acceleration  1    193.90  872.36 129.29
## - weight        1    501.04 1179.49 141.36
summary(stepwise2)
## 
## Call:
## lm(formula = mpg ~ weight + acceleration + horsePower, data = sample1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3356 -3.2119 -0.3329  2.3184 11.1514 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.712020   9.855301   2.000  0.05307 .  
## weight       -0.010130   0.001965  -5.156 9.32e-06 ***
## acceleration  1.489111   0.464240   3.208  0.00281 ** 
## horsePower    0.108624   0.058579   1.854  0.07190 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.341 on 36 degrees of freedom
## Multiple R-squared:  0.7623, Adjusted R-squared:  0.7424 
## F-statistic: 38.47 on 3 and 36 DF,  p-value: 2.533e-11
# Note that we use both Forward and Backward selection.

From the above linear regression and best model selection (using backward selection method), we can deduce that only “weight” and “acceleration” would be select as the best model for further analysis.

Thus the equation;

mpg = 19.7120 - 0.010weight + 1.4891acceleration + e

where e is the error term.

confint(stepwise2)
##                    2.5 %      97.5 %
## (Intercept)  -0.27545679 39.69949657
## weight       -0.01411413 -0.00614536
## acceleration  0.54758909  2.43063257
## horsePower   -0.01018100  0.22742820
#Note that the confidence intervals below are non-zero for both weight and acceleration.

Below is the linear regression model for all the 392 observations and selection of the best model using backward selection method.

auto_392 <- lm(mpg~displacement+horsePower+weight+acceleration, data=auto)
summary(auto_392)
## 
## Call:
## lm(formula = mpg ~ displacement + horsePower + weight + acceleration, 
##     data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.378  -2.793  -0.333   2.193  16.256 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.2511397  2.4560447  18.424  < 2e-16 ***
## displacement -0.0060009  0.0067093  -0.894  0.37166    
## horsePower   -0.0436077  0.0165735  -2.631  0.00885 ** 
## weight       -0.0052805  0.0008109  -6.512  2.3e-10 ***
## acceleration -0.0231480  0.1256012  -0.184  0.85388    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.247 on 387 degrees of freedom
## Multiple R-squared:  0.707,  Adjusted R-squared:  0.704 
## F-statistic: 233.4 on 4 and 387 DF,  p-value: < 2.2e-16
stepwise3 <- step(auto_392, direction = "both")
## Start:  AIC=1138.75
## mpg ~ displacement + horsePower + weight + acceleration
## 
##                Df Sum of Sq    RSS    AIC
## - acceleration  1      0.61 6980.0 1136.8
## - displacement  1     14.43 6993.8 1137.6
## <none>                      6979.4 1138.8
## - horsePower    1    124.86 7104.3 1143.7
## - weight        1    764.85 7744.3 1177.5
## 
## Step:  AIC=1136.78
## mpg ~ displacement + horsePower + weight
## 
##                Df Sum of Sq    RSS    AIC
## - displacement  1     13.82 6993.8 1135.6
## <none>                      6980.0 1136.8
## + acceleration  1      0.61 6979.4 1138.8
## - horsePower    1    190.28 7170.3 1145.3
## - weight        1   1015.31 7995.3 1188.0
## 
## Step:  AIC=1135.56
## mpg ~ horsePower + weight
## 
##                Df Sum of Sq    RSS    AIC
## <none>                      6993.8 1135.6
## + displacement  1     13.82 6980.0 1136.8
## + acceleration  1      0.01 6993.8 1137.6
## - horsePower    1    327.39 7321.2 1151.5
## - weight        1   2392.07 9385.9 1248.9
summary(stepwise3)
## 
## Call:
## lm(formula = mpg ~ horsePower + weight, data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.0762  -2.7340  -0.3312   2.1752  16.2601 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 45.6402108  0.7931958  57.540  < 2e-16 ***
## horsePower  -0.0473029  0.0110851  -4.267 2.49e-05 ***
## weight      -0.0057942  0.0005023 -11.535  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.24 on 389 degrees of freedom
## Multiple R-squared:  0.7064, Adjusted R-squared:  0.7049 
## F-statistic: 467.9 on 2 and 389 DF,  p-value: < 2.2e-16

From the above, it can deduce that the following variables have less P-values;

horsePower weight

and the best formular or model is;

mpg = 45.6402 -0.0058xweight + 0.0473x horsePower + e

where e is the error term or residual which is 4.24

kable(confint(stepwise3))
2.5 % 97.5 %
(Intercept) 44.0807235 47.1996981
horsePower -0.0690970 -0.0255087
weight -0.0067818 -0.0048065
# Non-zero confidence intervals.
residual_mpg = data.frame(Fitted = fitted(stepwise3),
Residuals = resid(stepwise3), Treatment = auto$mpg)
#fitted with mpg

ggplot(residual_mpg, aes(Fitted, Residuals, colour = Treatment)) + geom_point()

Decision and Conclusion:

Since p-value for all the two variables (horsePower and weight) are lesser than the tabulated value of 0.05, we therefore fail to accept (reject) \({ H }_{ 0 }\quad\) and conclude that they do not have the same means. Or that the relationship between the mpg and other variables is statistically significance.

Interpretation:

Weight: The heavier the vehicle is, the less the MPG (and Vice-versa)

Horse Power: The more the horsepower, the more the MPG maybe.(Vice-versa)