|IS 605 FUNDAMENTALS OF COMPUTATIONAL MATHEMATICS - WEEK 11 | Data Analysis
Question 1: Using R’s lm function, perform regression analysis and measure the significance of the independent variables for the following two data sets. In the first case, you are evaluating the statement that we hear that Maximum Heart Rate of a person is related to their age by the following equation:
options(warn = -1)
suppressMessages(library(ggplot2))
suppressMessages(library(plotly))
library(knitr)
dataset <- data.frame( age = c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37), maxHR = c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178))
str(dataset)## 'data.frame': 15 obs. of 2 variables:
## $ age : num 18 23 25 35 65 54 34 56 72 19 ...
## $ maxHR: num 202 186 187 180 156 169 174 172 153 199 ...
SETTING UP HYPOTHESIS:
\({ H }_{ 0 }\quad :\quad { \mu }_{ 1 }=\quad { \mu }_{ 2 }\quad ={ \quad \mu }_{ 3 }............={ \mu }_{ n }\)
\({ H }_{ 1 }\quad :\quad { \mu }_{ 1 }\not=\quad { \mu }_{ 2 }\quad\not ={ \quad \mu }_{ 3 }............\not={ \mu }_{ n }\)
Rejection:
Rejection \({ H }_{ 0 }\) if the calculated value (P-Value) is less the tabulated value(Table Value = 0.05) i.e P \(\le \quad \alpha\), at 0.05, otherwise do not reject \({ H }_{ 0 }\quad\).
plot(dataset$age, dataset$maxHR, main = "Scatter Plot", xlab = "Age", ylab = "MAx HR")
abline(lm(dataset$maxHR~dataset$age), col="blue")Above is the plot of Age against maxHR, and it reveals that there is a strong negative correlation between Age and MaxHR. Where its correlation co-efficient is stated below;
cor(dataset$maxHR, dataset$age)## [1] -0.9534656
Below is the linear model between Age and MaxHR with the plot of Residual against the fitted values.
linearmodel <- lm(dataset$maxHR~dataset$age)
summary(linearmodel)##
## Call:
## lm(formula = dataset$maxHR ~ dataset$age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9258 -2.5383 0.3879 3.1867 6.6242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 210.04846 2.86694 73.27 < 2e-16 ***
## dataset$age -0.79773 0.06996 -11.40 3.85e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.578 on 13 degrees of freedom
## Multiple R-squared: 0.9091, Adjusted R-squared: 0.9021
## F-statistic: 130 on 1 and 13 DF, p-value: 3.848e-08
selecting best model using both Forward and backward selection
stepwise <- step(linearmodel, direction = "both")## Start: AIC=47.49
## dataset$maxHR ~ dataset$age
##
## Df Sum of Sq RSS AIC
## <none> 272.43 47.490
## - dataset$age 1 2724.5 2996.93 81.459
summary(stepwise)##
## Call:
## lm(formula = dataset$maxHR ~ dataset$age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9258 -2.5383 0.3879 3.1867 6.6242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 210.04846 2.86694 73.27 < 2e-16 ***
## dataset$age -0.79773 0.06996 -11.40 3.85e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.578 on 13 degrees of freedom
## Multiple R-squared: 0.9091, Adjusted R-squared: 0.9021
## F-statistic: 130 on 1 and 13 DF, p-value: 3.848e-08
kable(confint(stepwise))| 2.5 % | 97.5 % | |
|---|---|---|
| (Intercept) | 203.854813 | 216.2421034 |
| dataset$age | -0.948872 | -0.6465811 |
Plot of fitted values and showing relationship between Age and MaxHR
residual_age = data.frame(Fitted = fitted(stepwise),
Residuals = resid(stepwise), Treatment = dataset$age)
ggplot(residual_age, aes(Fitted, Residuals, colour = Treatment)) + geom_point()Decision and Conclusion:
Since P-value is less than the tabulated value, we therefore fail to accept (reject) \({ H }_{ 0 }\), and conclude that there is statistically significant relationship between Age and MaxHR, i.e there population means are not the same, and that age contribute significantly to MAXHR or better say, the higher the age, the lower the MaxHR.
Question 2:
Using the Auto data set from Assignment 5 (also attached here) perform a Linear Re- gression analysis using mpg as the dependent variable and the other 4 (displacement, horse- power, weight, acceleration) as independent variables. What is the final linear regression fit equation? Which of the 4 independent variables have a significant impact on mpg? What are their corresponding significance levels? What are the standard errors on each of the coeficients? Please perform this experiment in two ways. First take any random 40 data points from the entire auto data sample and perform the linear regression fit and measure the 95% confidence intervals. Then, take the entire data set (all 392 points) and perform linear regression and measure the 95% confidence intervals. Please report the resulting fit equation, their significance values and confidence intervals for each of the two runs. Please submit an R-markdown file documenting your experiments. Your submission should include the final linear fits, and their corresponding signifcance levels. In addition, you should clearly state what you concluded from looking at the fit and their significance levels. 1
auto <- read.csv("https://raw.githubusercontent.com/mascotinme/GitHub/master/MSDA%20605/auto-mpg.txt", header=FALSE, sep = "")
colnames(auto)[1] <- "displacement"
colnames(auto)[2] <- "horsePower"
colnames(auto)[3] <- "weight"
colnames(auto)[4] <- "acceleration"
colnames(auto)[5] <- "mpg"
kable(head(auto))| displacement | horsePower | weight | acceleration | mpg |
|---|---|---|---|---|
| 307 | 130 | 3504 | 12.0 | 18 |
| 350 | 165 | 3693 | 11.5 | 15 |
| 318 | 150 | 3436 | 11.0 | 18 |
| 304 | 150 | 3433 | 12.0 | 16 |
| 302 | 140 | 3449 | 10.5 | 17 |
| 429 | 198 | 4341 | 10.0 | 15 |
set.seed(123)
sample1 <- auto[sample(nrow(auto),40 ), ]
dim(sample1)## [1] 40 5
SETTING UP HYPOTHESIS:
\({ H }_{ 0 }\quad :\quad { \mu }_{ 1 }=\quad { \mu }_{ 2 }\quad ={ \quad \mu }_{ 3 }............={ \mu }_{ n }\)
\({ H }_{ 1 }\quad :\quad { \mu }_{ 1 }\not=\quad { \mu }_{ 2 }\quad\not ={ \quad \mu }_{ 3 }............\not={ \mu }_{ n }\)
Rejection:
Rejection \({ H }_{ 0 }\) if the calculated value (P-Value) is less the tabulated value(Table Value = 0.05) i.e P \(\le \quad \alpha\), at 0.05, otherwise do not reject \({ H }_{ 0 }\quad\).
auto_40 <- lm(mpg ~ (displacement+weight+acceleration+horsePower), data=sample1 )
summary(auto_40)##
## Call:
## lm(formula = mpg ~ (displacement + weight + acceleration + horsePower),
## data = sample1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3741 -3.2178 -0.4445 2.5421 10.9468
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.756053 10.886881 1.723 0.09375 .
## displacement -0.005115 0.023178 -0.221 0.82660
## weight -0.009710 0.002754 -3.526 0.00120 **
## acceleration 1.497830 0.472153 3.172 0.00314 **
## horsePower 0.113919 0.064034 1.779 0.08392 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.4 on 35 degrees of freedom
## Multiple R-squared: 0.7626, Adjusted R-squared: 0.7355
## F-statistic: 28.11 on 4 and 35 DF, p-value: 1.691e-10
stepwise2 <- step(auto_40, direction = "both")## Start: AIC=123.18
## mpg ~ (displacement + weight + acceleration + horsePower)
##
## Df Sum of Sq RSS AIC
## - displacement 1 0.943 678.45 121.24
## <none> 677.51 123.18
## - horsePower 1 61.266 738.78 124.64
## - acceleration 1 194.808 872.32 131.29
## - weight 1 240.720 918.23 133.34
##
## Step: AIC=121.24
## mpg ~ weight + acceleration + horsePower
##
## Df Sum of Sq RSS AIC
## <none> 678.45 121.24
## - horsePower 1 64.80 743.25 122.89
## + displacement 1 0.94 677.51 123.18
## - acceleration 1 193.90 872.36 129.29
## - weight 1 501.04 1179.49 141.36
summary(stepwise2)##
## Call:
## lm(formula = mpg ~ weight + acceleration + horsePower, data = sample1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3356 -3.2119 -0.3329 2.3184 11.1514
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.712020 9.855301 2.000 0.05307 .
## weight -0.010130 0.001965 -5.156 9.32e-06 ***
## acceleration 1.489111 0.464240 3.208 0.00281 **
## horsePower 0.108624 0.058579 1.854 0.07190 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.341 on 36 degrees of freedom
## Multiple R-squared: 0.7623, Adjusted R-squared: 0.7424
## F-statistic: 38.47 on 3 and 36 DF, p-value: 2.533e-11
# Note that we use both Forward and Backward selection.From the above linear regression and best model selection (using backward selection method), we can deduce that only “weight” and “acceleration” would be select as the best model for further analysis.
Thus the equation;
mpg = 19.7120 - 0.010weight + 1.4891acceleration + e
where e is the error term.
confint(stepwise2)## 2.5 % 97.5 %
## (Intercept) -0.27545679 39.69949657
## weight -0.01411413 -0.00614536
## acceleration 0.54758909 2.43063257
## horsePower -0.01018100 0.22742820
#Note that the confidence intervals below are non-zero for both weight and acceleration.Below is the linear regression model for all the 392 observations and selection of the best model using backward selection method.
auto_392 <- lm(mpg~displacement+horsePower+weight+acceleration, data=auto)
summary(auto_392)##
## Call:
## lm(formula = mpg ~ displacement + horsePower + weight + acceleration,
## data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.378 -2.793 -0.333 2.193 16.256
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.2511397 2.4560447 18.424 < 2e-16 ***
## displacement -0.0060009 0.0067093 -0.894 0.37166
## horsePower -0.0436077 0.0165735 -2.631 0.00885 **
## weight -0.0052805 0.0008109 -6.512 2.3e-10 ***
## acceleration -0.0231480 0.1256012 -0.184 0.85388
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.247 on 387 degrees of freedom
## Multiple R-squared: 0.707, Adjusted R-squared: 0.704
## F-statistic: 233.4 on 4 and 387 DF, p-value: < 2.2e-16
stepwise3 <- step(auto_392, direction = "both")## Start: AIC=1138.75
## mpg ~ displacement + horsePower + weight + acceleration
##
## Df Sum of Sq RSS AIC
## - acceleration 1 0.61 6980.0 1136.8
## - displacement 1 14.43 6993.8 1137.6
## <none> 6979.4 1138.8
## - horsePower 1 124.86 7104.3 1143.7
## - weight 1 764.85 7744.3 1177.5
##
## Step: AIC=1136.78
## mpg ~ displacement + horsePower + weight
##
## Df Sum of Sq RSS AIC
## - displacement 1 13.82 6993.8 1135.6
## <none> 6980.0 1136.8
## + acceleration 1 0.61 6979.4 1138.8
## - horsePower 1 190.28 7170.3 1145.3
## - weight 1 1015.31 7995.3 1188.0
##
## Step: AIC=1135.56
## mpg ~ horsePower + weight
##
## Df Sum of Sq RSS AIC
## <none> 6993.8 1135.6
## + displacement 1 13.82 6980.0 1136.8
## + acceleration 1 0.01 6993.8 1137.6
## - horsePower 1 327.39 7321.2 1151.5
## - weight 1 2392.07 9385.9 1248.9
summary(stepwise3)##
## Call:
## lm(formula = mpg ~ horsePower + weight, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.0762 -2.7340 -0.3312 2.1752 16.2601
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.6402108 0.7931958 57.540 < 2e-16 ***
## horsePower -0.0473029 0.0110851 -4.267 2.49e-05 ***
## weight -0.0057942 0.0005023 -11.535 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.24 on 389 degrees of freedom
## Multiple R-squared: 0.7064, Adjusted R-squared: 0.7049
## F-statistic: 467.9 on 2 and 389 DF, p-value: < 2.2e-16
From the above, it can deduce that the following variables have less P-values;
horsePower weight
and the best formular or model is;
mpg = 45.6402 -0.0058xweight + 0.0473x horsePower + e
where e is the error term or residual which is 4.24
kable(confint(stepwise3))| 2.5 % | 97.5 % | |
|---|---|---|
| (Intercept) | 44.0807235 | 47.1996981 |
| horsePower | -0.0690970 | -0.0255087 |
| weight | -0.0067818 | -0.0048065 |
# Non-zero confidence intervals.residual_mpg = data.frame(Fitted = fitted(stepwise3),
Residuals = resid(stepwise3), Treatment = auto$mpg)
#fitted with mpg
ggplot(residual_mpg, aes(Fitted, Residuals, colour = Treatment)) + geom_point()Decision and Conclusion:
Since p-value for all the two variables (horsePower and weight) are lesser than the tabulated value of 0.05, we therefore fail to accept (reject) \({ H }_{ 0 }\quad\) and conclude that they do not have the same means. Or that the relationship between the mpg and other variables is statistically significance.
Interpretation:
Weight: The heavier the vehicle is, the less the MPG (and Vice-versa)
Horse Power: The more the horsepower, the more the MPG maybe.(Vice-versa)