age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
maxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
heartData <- data.frame(age, maxHR)
Is the effect of Age on Max HR significant?
\(H_0\): Age has no effect on Max HR, i.e. \(b_1 = 0\), where \(b_1\) is the slope (the ‘size of the effect’) in the linear regression equation \(y = b_0 + b_1 x + e\)
\(H_1\): Age has an effect on Max HR, i.e. \(b_1 \neq 0\)
Let's now use R’s built-in lm() function to fit a linear regression model to the data above:
fit.heartdata <- lm(maxHR ~ age, data = heartData)
fit.heartdata
##
## Call:
## lm(formula = maxHR ~ age, data = heartData)
##
## Coefficients:
## (Intercept) age
## 210.0485 -0.7977
Hence, the fitted linear model for the above data is: \[ \text{MaxHR} = 210.0485 - 0.7977 \times \text{Age} \]
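To make the fitted equation concrete, the sketch below refits the same model and uses predict() to estimate Max HR for a few ages. The ages 30, 40 and 50 and the names newAges, predicted and byHand are illustrative assumptions, not part of the original analysis.

```r
# Rebuild the data and the model so this block runs on its own
age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
maxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
heartData <- data.frame(age, maxHR)
fit.heartdata <- lm(maxHR ~ age, data = heartData)

# Hypothetical ages, chosen only for illustration
newAges <- data.frame(age = c(30, 40, 50))

# Predictions from the fitted model
predicted <- predict(fit.heartdata, newdata = newAges)

# The same predictions computed by hand as b0 + b1 * age
b <- coef(fit.heartdata)
byHand <- unname(b["(Intercept)"] + b["age"] * newAges$age)

print(predicted)
```

For age 40 the model predicts a Max HR of about 178, which is simply 210.0485 - 0.7977 × 40.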
Now, let's see how significant the model is:
(hdsummary <- summary(fit.heartdata))
##
## Call:
## lm(formula = maxHR ~ age, data = heartData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9258 -2.5383 0.3879 3.1867 6.6242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 210.04846 2.86694 73.27 < 2e-16 ***
## age -0.79773 0.06996 -11.40 3.85e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.578 on 13 degrees of freedom
## Multiple R-squared: 0.9091, Adjusted R-squared: 0.9021
## F-statistic: 130 on 1 and 13 DF, p-value: 3.848e-08
From the above, we can reject the null hypothesis \(H_0\) (no effect on Max HR): the p-value for the age coefficient is far below the 0.01 significance threshold.
round(hdsummary$coefficients["age", 4], 10)
## [1] 3.85e-08
We can therefore conclude that Age has a statistically significant, negative linear association with MaxHR.
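As a check on where that p-value comes from, the following sketch recomputes it by hand: the t value is the estimate divided by its standard error, and the two-sided p-value is taken from a t distribution with the residual degrees of freedom. The variable names (fit, coefs, tValue, dfResid, pValue) are illustrative.

```r
# Rebuild the data and the model so this block runs on its own
age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
maxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
fit <- lm(maxHR ~ age)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients

# t statistic for the slope: estimate divided by its standard error
tValue <- coefs["age", "Estimate"] / coefs["age", "Std. Error"]

# Residual degrees of freedom: n - 2 for a simple linear regression
dfResid <- df.residual(fit)

# Two-sided p-value from the t distribution
pValue <- 2 * pt(abs(tValue), df = dfResid, lower.tail = FALSE)

print(c(t = tValue, df = dfResid, p = pValue))
```

The hand-computed value should agree with the Pr(>|t|) entry that summary() reports for age.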
library(ggplot2)
ggplot(heartData, aes(age, maxHR)) +
  geom_point() +
  stat_smooth(method = lm, level = .95) +
  xlab("Age") + ylab("Max Heart Rate") +
  ggtitle("Max Heart Rate Vs Age")
\(H_0\): displacement, horsepower, weight and acceleration have NO effect on fuel efficiency, which means \(b_1 = b_2 = b_3 = b_4 = 0\)
\(H_1\): at least one of displacement, horsepower, weight and acceleration has an effect on fuel efficiency, which means at least one of \(b_1, b_2, b_3, b_4\) is non-zero
autodata <- read.table('auto-mpg.data',
col.names = c('displacement', 'horsepower', 'weight', 'acceleration', 'mpg'))
head(autodata)
## displacement horsepower weight acceleration mpg
## 1 307 130 3504 12.0 18
## 2 350 165 3693 11.5 15
## 3 318 150 3436 11.0 18
## 4 304 150 3433 12.0 16
## 5 302 140 3449 10.5 17
## 6 429 198 4341 10.0 15
Let's take a random sample of 40 rows, fit the linear model, and compute the 95% confidence interval for the coefficient of each independent variable:
set.seed(10)
random40 <- autodata[sample(nrow(autodata), 40), ]
(auto.fit <- lm(mpg ~ . , data = random40))
##
## Call:
## lm(formula = mpg ~ ., data = random40)
##
## Coefficients:
## (Intercept) displacement horsepower weight acceleration
## 44.117698 -0.023242 -0.006429 -0.005408 0.076267
summary(auto.fit)
##
## Call:
## lm(formula = mpg ~ ., data = random40)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2563 -2.6450 -0.3425 2.2191 12.2042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.117698 10.879547 4.055 0.000266 ***
## displacement -0.023242 0.025681 -0.905 0.371646
## horsepower -0.006429 0.075464 -0.085 0.932590
## weight -0.005408 0.003029 -1.785 0.082860 .
## acceleration 0.076267 0.502775 0.152 0.880301
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.502 on 35 degrees of freedom
## Multiple R-squared: 0.7566, Adjusted R-squared: 0.7288
## F-statistic: 27.2 on 4 and 35 DF, p-value: 2.597e-10
auto.fit$coefficients
## (Intercept) displacement horsepower weight acceleration
## 44.117697855 -0.023241502 -0.006429309 -0.005408394 0.076266690
The fitted equation is: \[ mpg = 44.12 - 0.023 \, Displacement - 0.006 \, Horsepower - 0.005 \, Weight + 0.076 \, Acceleration \]
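Here, the low p-value of the overall F-test (2.597e-10) lets us reject the null hypothesis that none of the independent variables affects the dependent variable, even though no individual coefficient is significant at the 5% level in this sample.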
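The overall F-statistic can be reproduced from the reported R-squared, which shows what the F-test measures. This is a sketch using the rounded figures printed in the summary above (R-squared 0.7566, 4 predictors, 35 residual degrees of freedom), so it matches the reported values only approximately; the variable names are illustrative.

```r
# Figures copied from the printed summary of the 40-row sample (rounded)
rSquared <- 0.7566   # Multiple R-squared
k <- 4               # number of predictors
dfResid <- 35        # residual degrees of freedom (40 - 4 - 1)

# F = (explained variance per predictor) / (unexplained variance per residual df)
fStat <- (rSquared / k) / ((1 - rSquared) / dfResid)

# Upper-tail probability of the F distribution gives the overall p-value
pValue <- pf(fStat, df1 = k, df2 = dfResid, lower.tail = FALSE)

print(c(F = fStat, p = pValue))
```

The result is close to the reported F-statistic of 27.2, with a p-value of the same order as the one printed by summary().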
Let's compute the 95% confidence intervals:
confint(auto.fit, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 22.03104408 6.620435e+01
## displacement -0.07537641 2.889340e-02
## horsepower -0.15962850 1.467699e-01
## weight -0.01155797 7.411781e-04
## acceleration -0.94442052 1.096954e+00
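Each interval above is the estimate plus or minus a critical t value times the standard error. The sketch below reproduces the weight interval from the rounded numbers in the summary table for the 40-row sample; the variable names are illustrative, and because the inputs are rounded the result matches confint() only to about five decimal places.

```r
# Figures copied from the weight row of the printed summary (rounded)
estimate <- -0.005408
stdError <- 0.003029
dfResid <- 35

# Critical value for a two-sided 95% interval
tCrit <- qt(0.975, df = dfResid)

# Interval: estimate -/+ tCrit * standard error
ci <- estimate + c(-1, 1) * tCrit * stdError

# An interval that contains zero cannot rule out "no effect" at the 5% level
containsZero <- ci[1] < 0 && ci[2] > 0

print(ci)
print(containsZero)
```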
Other than the intercept, no variable is significant at the 5% level: each of their 95% confidence intervals contains zero. Let's repeat the same analysis on the entire data set:
(auto.fit.full <- lm(mpg ~ . , data = autodata))
##
## Call:
## lm(formula = mpg ~ ., data = autodata)
##
## Coefficients:
## (Intercept) displacement horsepower weight acceleration
## 45.251140 -0.006001 -0.043608 -0.005281 -0.023148
summary(auto.fit.full)
##
## Call:
## lm(formula = mpg ~ ., data = autodata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.378 -2.793 -0.333 2.193 16.256
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.2511397 2.4560447 18.424 < 2e-16 ***
## displacement -0.0060009 0.0067093 -0.894 0.37166
## horsepower -0.0436077 0.0165735 -2.631 0.00885 **
## weight -0.0052805 0.0008109 -6.512 2.3e-10 ***
## acceleration -0.0231480 0.1256012 -0.184 0.85388
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.247 on 387 degrees of freedom
## Multiple R-squared: 0.707, Adjusted R-squared: 0.704
## F-statistic: 233.4 on 4 and 387 DF, p-value: < 2.2e-16
confint(auto.fit.full, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 40.422278855 50.080000544
## displacement -0.019192122 0.007190380
## horsepower -0.076193029 -0.011022433
## weight -0.006874738 -0.003686277
## acceleration -0.270094049 0.223798050
Using the full data we get the linear fit: \[ mpg = 45.25 - 0.006 \, Displacement - 0.044 \, Horsepower - 0.005 \, Weight - 0.023 \, Acceleration \]
With the full data, in addition to the intercept, Weight and Horsepower show a statistically significant effect on fuel efficiency, and both effects are negative: heavier and more powerful vehicles get lower mpg. The p-value for Weight is \(2.3 \times 10^{-10}\) and its 95% CI, \([-0.0069, -0.0037]\), excludes zero; for Horsepower the p-value is 0.00885 and the 95% CI is \([-0.0762, -0.0110]\). For the other variables (displacement and acceleration), the p-values are large and the 95% CIs contain zero, so we cannot reject the null hypothesis for them.
ggplot(autodata, aes(x = weight, y = mpg)) +
  geom_point() +
  stat_smooth(method = lm, level = .95) +
  xlab("Weight") + ylab("mpg") +
  ggtitle("Vehicle Weight vs Fuel Efficiency - Linear Regression 95% CI")