attach(Auto)
## The following object is masked from package:ggplot2:
##
## mpg
eq <- lm(data = Auto, formula = mpg ~ horsepower)
S <- summary(eq)
S$coefficients; S$adj.r.squared
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.9358610 0.717498656 55.65984 1.220362e-187
## horsepower -0.1578447 0.006445501 -24.48914 7.031989e-81
## [1] 0.6049379
Given the p.value for the coeffieicent of horsepower is much less than 0.05. And adjusted r.squared value for this equation is more than 0.6. So we can conclude that the relationship is not just significant, but also highly strong.
cor(mpg, horsepower)
## [1] -0.7784268
Cor function would help us verify the direction of correlation. As you see above, the direction is minus
We can varify this simply with confint function.
kable(confint(eq,level = 0.95))
| 2.5 % | 97.5 % | |
|---|---|---|
| (Intercept) | 38.525212 | 41.3465103 |
| horsepower | -0.170517 | -0.1451725 |
kable(confint(eq,level = 0.98))
| 1 % | 99 % | |
|---|---|---|
| (Intercept) | 38.2598220 | 41.6119001 |
| horsepower | -0.1729011 | -0.1427884 |
attach(Auto)
## The following objects are masked from Auto (pos = 3):
##
## acceleration, cylinders, displacement, horsepower, mpg, name,
## origin, weight, year
## The following object is masked from package:ggplot2:
##
## mpg
eq <- lm(data = Auto, formula = mpg ~ horsepower)
Pred.mpg.h <- predict(eq)
newdt <- data.frame(cbind(mpg,horsepower,Pred.mpg.h))
head(newdt)
## mpg horsepower Pred.mpg.h
## 1 18 130 19.416046
## 2 15 165 13.891480
## 3 18 150 16.259151
## 4 16 150 16.259151
## 5 17 140 17.837598
## 6 15 198 8.682604
plot1 <- ggplot(newdt, aes(x = horsepower, y = mpg)) + geom_point() + geom_line(aes(y=Pred.mpg.h,colour = "red"))
plot1
Polynomial Version - I used the log
attach(Auto)
## The following objects are masked from Auto (pos = 3):
##
## acceleration, cylinders, displacement, horsepower, mpg, name,
## origin, weight, year
## The following objects are masked from Auto (pos = 4):
##
## acceleration, cylinders, displacement, horsepower, mpg, name,
## origin, weight, year
## The following object is masked from package:ggplot2:
##
## mpg
eq1 <- lm(data = Auto, formula = mpg ~ log(horsepower))
Pred.mpg.h1 <- predict(eq1)
newdt1 <- data.frame(cbind(mpg,horsepower,Pred.mpg.h1))
plot2 <- ggplot(newdt1, aes(x = horsepower, y = mpg)) + geom_point() + geom_line(aes(y=Pred.mpg.h1,colour = "red"))
plot2
As you see in the graph, the residule shows normal distribution. And The leverage is quite low. Meaning there is no outlier in the residule. You can’t see this kind of plot in the real world actually…
par(mfrow = c(2,2))
plot(eq)
Note that three explanatory variables, displacement, horsepower, weight, shows non-linear relationship with mpg.
pairs(Auto)
For the variables with non-linear relationship, each has strong correlation. Um…
cor(Auto[,1:8])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
As you see in the model, the r.square value is high, and most of variable has low p.values. But horsepower, unlike the first analysis we took, has high p.value, 0.21963. It might mean there is some interaction between some variables and horsepower.
model <- lm(data = Auto[,1:8], mpg ~.)
Sm <- summary(model)
Sm$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218434622 4.6442941494 -3.707438 2.401841e-04
## cylinders -0.493376319 0.3232823146 -1.526147 1.277965e-01
## displacement 0.019895644 0.0075150792 2.647430 8.444649e-03
## horsepower -0.016951144 0.0137868914 -1.229512 2.196328e-01
## weight -0.006474043 0.0006520478 -9.928787 7.874953e-21
## acceleration 0.080575838 0.0988449567 0.815174 4.154780e-01
## year 0.750772678 0.0509731223 14.728795 3.055983e-39
## origin 1.426140495 0.2781360924 5.127492 4.665681e-07
The coefficients of year is 0.750773. It means as the year of Model increases, Miles per gallon(mpg) increases. That means there has been technical improvement on mpg as the time passed.
It shows high leverage as you see in the right-bottom of graph.
par(mfrow = c(2,2))
plot(model)
Let’s try with displacement, horsepower weight ones which show the corlinearity along with higher correlation with mpg.
model2 <- lm(data = Auto[,1:8], mpg ~ horsepower*weight*displacement)
model3 <- update(model2, mpg ~.- horsepower:weight)
model4 <- update(model3, mpg ~.- horsepower:weight:displacement)
model5 <- update(model4, mpg ~.- weight:displacement)
summary(model5)
##
## Call:
## lm(formula = mpg ~ horsepower + weight + displacement + horsepower:displacement,
## data = Auto[, 1:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.8438 -2.2068 -0.2347 2.0116 16.4152
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.446e+01 1.544e+00 35.276 < 2e-16 ***
## horsepower -1.989e-01 2.134e-02 -9.319 < 2e-16 ***
## weight -2.749e-03 7.148e-04 -3.846 0.00014 ***
## displacement -7.129e-02 9.565e-03 -7.454 6e-13 ***
## horsepower:displacement 4.937e-04 5.604e-05 8.811 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.876 on 387 degrees of freedom
## Multiple R-squared: 0.7559, Adjusted R-squared: 0.7534
## F-statistic: 299.6 on 4 and 387 DF, p-value: < 2.2e-16
As you see, the interaction between horsepower and displacement has been detacted.
Then, how about log one? All parameters are significant, and Adjusted R squre is quite higher than the previous model.
model2 <- lm(data = Auto[,1:8], mpg ~ log(horsepower)*log(weight)*log(displacement))
summary(model2)
##
## Call:
## lm(formula = mpg ~ log(horsepower) * log(weight) * log(displacement),
## data = Auto[, 1:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.3760 -2.0492 -0.3067 1.8102 16.2392
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -2513.045 1007.463 -2.494
## log(horsepower) 554.944 225.352 2.463
## log(weight) 359.082 127.785 2.810
## log(displacement) 445.825 195.671 2.278
## log(horsepower):log(weight) -78.126 28.579 -2.734
## log(horsepower):log(displacement) -93.910 42.413 -2.214
## log(weight):log(displacement) -62.672 24.329 -2.576
## log(horsepower):log(weight):log(displacement) 13.168 5.278 2.495
## Pr(>|t|)
## (Intercept) 0.01304 *
## log(horsepower) 0.01423 *
## log(weight) 0.00521 **
## log(displacement) 0.02325 *
## log(horsepower):log(weight) 0.00655 **
## log(horsepower):log(displacement) 0.02740 *
## log(weight):log(displacement) 0.01037 *
## log(horsepower):log(weight):log(displacement) 0.01302 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.85 on 384 degrees of freedom
## Multiple R-squared: 0.761, Adjusted R-squared: 0.7566
## F-statistic: 174.7 on 7 and 384 DF, p-value: < 2.2e-16