R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
colnames(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

Linear regression to predict mpg

lm_mpg <- lm(mpg ~ wt + hp, data = mtcars)
summary(lm_mpg)
## 
## Call:
## lm(formula = mpg ~ wt + hp, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.941 -1.600 -0.182  1.050  5.854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
## wt          -3.87783    0.63273  -6.129 1.12e-06 ***
## hp          -0.03177    0.00903  -3.519  0.00145 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148 
## F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

Interpretation of linear regression of mpg

Intercept (37.2273): When both wt and hp are 0, the predicted mpg is 37.23. This is only a baseline with no practical meaning, because a car cannot have zero weight or zero horsepower.

wt (−3.8778): Holding horsepower constant, each one-unit increase in weight (1,000 lbs) lowers predicted mpg by 3.88. The effect is statistically significant (p < 0.001), which is consistent with heavier vehicles burning more fuel.

hp (−0.0318): Holding weight constant, each additional unit of horsepower lowers predicted mpg by 0.032. The effect is statistically significant (p = 0.0015), which is consistent with more powerful engines consuming more fuel.
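The "holding horsepower constant" reading of the wt coefficient can be checked directly. The sketch below refits the same model and predicts mpg for two made-up cars (the wt and hp values are illustrative, not rows of mtcars) that differ only by one unit of weight.

```r
# Refit the additive model (mtcars ships with base R).
lm_mpg <- lm(mpg ~ wt + hp, data = mtcars)

# Two hypothetical cars: same horsepower, weights of 3,000 and 4,000 lbs.
new_cars <- data.frame(wt = c(3, 4), hp = c(110, 110))
pred <- predict(lm_mpg, newdata = new_cars)

# With hp fixed, the gap between the two predictions is the wt coefficient.
wt_effect <- unname(pred[2] - pred[1])
print(wt_effect)
print(coef(lm_mpg)[["wt"]])
```

Both printed values should agree, which is what "for every unit increase in weight" means in an additive model.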

plot(lm_mpg, which = 1:4)

MSE

mean(lm_mpg$residuals^2)
## [1] 6.095242
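
This MSE divides the residual sum of squares by n, while the residual standard error printed by summary() divides by the residual degrees of freedom. A minimal sketch of how the two relate, using only base R:

```r
lm_mpg <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(residuals(lm_mpg)^2)   # residual sum of squares
n   <- nrow(mtcars)               # 32 observations

mse <- rss / n                          # same as mean(lm_mpg$residuals^2)
rse <- sqrt(rss / df.residual(lm_mpg))  # divides by 29, as in summary()

print(c(mse = mse, rse_squared = rse^2))
```

Because the RSE uses the smaller denominator, its square is slightly larger than the MSE.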

R^2

summary(lm_mpg)$r.squared
## [1] 0.8267855
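
As a cross-check, R^2 can be rebuilt from its definition, 1 - RSS/TSS, and the adjusted version follows from penalising the number of predictors. The sketch below assumes the same two-predictor model as above.

```r
lm_mpg <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(residuals(lm_mpg)^2)                 # unexplained variation
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total variation in mpg
r2  <- 1 - rss / tss

n <- nrow(mtcars)   # observations
p <- 2              # predictors: wt and hp
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(r2 = r2, adj_r2 = adj_r2))
```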

Interaction between wt and hp for lm_mpg

lm_wt_hp <- lm(mpg ~ wt * hp, data = mtcars)
summary(lm_wt_hp)
## 
## Call:
## lm(formula = mpg ~ wt * hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0632 -1.6491 -0.7362  1.4211  4.5513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 49.80842    3.60516  13.816 5.01e-14 ***
## wt          -8.21662    1.26971  -6.471 5.20e-07 ***
## hp          -0.12010    0.02470  -4.863 4.04e-05 ***
## wt:hp        0.02785    0.00742   3.753 0.000811 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.153 on 28 degrees of freedom
## Multiple R-squared:  0.8848, Adjusted R-squared:  0.8724 
## F-statistic: 71.66 on 3 and 28 DF,  p-value: 2.981e-13

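This model has one more term than model #1 (lm_mpg): 4 coefficients instead of 3. It therefore has 28 residual degrees of freedom instead of 29, because estimating the interaction uses one. The intercept is 12.58 larger than in the first model (49.81 vs. 37.23). The coefficients for wt, hp, and wt:hp change from -3.878, -0.032, and NA (not in the model) to -8.217, -0.120, and +0.02785, respectively. With the interaction included, the wt and hp coefficients are the slopes when the other predictor equals 0, so they are not directly comparable to those of the first model.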
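
One way to read the interaction is that the slope of wt is no longer a single number: it equals the wt coefficient plus the wt:hp coefficient times hp. The sketch below evaluates that slope at two illustrative horsepower levels (100 and 200, chosen arbitrarily) and compares the two models with an F-test.

```r
lm_mpg   <- lm(mpg ~ wt + hp, data = mtcars)
lm_wt_hp <- lm(mpg ~ wt * hp, data = mtcars)

b <- coef(lm_wt_hp)

# Slope of wt at a given hp: b_wt + b_wt:hp * hp
hp_levels <- c(100, 200)   # illustrative values only
wt_slope  <- b[["wt"]] + b[["wt:hp"]] * hp_levels
print(setNames(wt_slope, paste0("hp=", hp_levels)))

# Nested-model F-test: does adding wt:hp improve the fit?
cmp <- anova(lm_mpg, lm_wt_hp)
p_interaction <- cmp[["Pr(>F)"]][2]
print(p_interaction)
```

The penalty for extra weight comes out smaller for high-horsepower cars (roughly -5.4 mpg per 1,000 lbs at 100 hp versus roughly -2.6 at 200 hp). Because only one term is added, the F-test p-value matches the t-test p-value for wt:hp in the summary.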

Winsorisation

q5  <- quantile(mtcars$hp, 0.05)   # 5th percentile of hp
q95 <- quantile(mtcars$hp, 0.95)   # 95th percentile of hp

# Cap hp at the 5th and 95th percentiles
hp_win <- pmax(pmin(mtcars$hp, q95), q5)

lm_win <- lm(mpg ~ wt + hp_win, data = mtcars)  # refit the first model with winsorised hp
summary(lm_win)
## 
## Call:
## lm(formula = mpg ~ wt + hp_win, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8825 -1.6545 -0.0968  0.8367  5.7259 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.31722    1.56964  23.774  < 2e-16 ***
## wt          -3.58279    0.66427  -5.394  8.5e-06 ***
## hp_win      -0.03952    0.01059  -3.732 0.000824 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.546 on 29 degrees of freedom
## Multiple R-squared:  0.833,  Adjusted R-squared:  0.8215 
## F-statistic: 72.34 on 2 and 29 DF,  p-value: 5.348e-12
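
To see what the pmax(pmin(...)) step does, the sketch below applies it first to a small made-up vector with hand-picked caps, then counts how many hp values the percentile caps actually change.

```r
# Toy example: made-up values, caps chosen by hand.
x     <- c(1, 5, 6, 7, 8, 9, 100)
x_win <- pmax(pmin(x, 9), 5)   # values above 9 become 9, values below 5 become 5
print(x_win)                   # 5 5 6 7 8 9 9

# Same idea on hp with percentile caps, as in the code above.
q5     <- unname(quantile(mtcars$hp, 0.05))
q95    <- unname(quantile(mtcars$hp, 0.95))
hp_win <- pmax(pmin(mtcars$hp, q95), q5)

n_changed <- sum(hp_win != mtcars$hp)   # number of cars whose hp was capped
print(n_changed)
```

Only the extreme values move; every observation stays in the data set, which is the difference between winsorising and trimming.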
library(car)
## Loading required package: carData
vif_values <- vif(lm_mpg)
corrplot_data <- cor(mtcars[, c("mpg", "wt", "hp")])

We find that the VIF for both predictors is about 1.77, which is well below the common threshold of 5. Although the correlation between wt and hp is 0.659, it is not high enough to cause problematic multicollinearity. Both predictors contribute explanatory power for mpg that the other does not capture.
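
The VIF can also be computed by hand, which shows why both predictors get the same value: with only two predictors, each VIF is 1 / (1 - r^2), where r is their correlation. A minimal sketch in base R, without the car package:

```r
# VIF for wt: regress wt on the other predictor and use that R^2.
r2_wt  <- summary(lm(wt ~ hp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)

# With two predictors, that R^2 is the squared wt-hp correlation.
r_wt_hp  <- cor(mtcars$wt, mtcars$hp)
vif_from_r <- 1 / (1 - r_wt_hp^2)

print(c(vif_wt = vif_wt, vif_from_r = vif_from_r, r = r_wt_hp))
```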

9. No, it does not improve model predictability. Instead, it shows how well the model explains past data, which does not necessarily make it the best predictor of new data.
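
One possible way to separate "explains past data" from "predicts new data" is leave-one-out cross-validation, sketched below for the two models fitted earlier. The helper loocv_mse is a name introduced here for illustration. It relies on the standard linear-model shortcut that the leave-one-out residual equals the ordinary residual divided by (1 - leverage).

```r
lm_mpg   <- lm(mpg ~ wt + hp, data = mtcars)
lm_wt_hp <- lm(mpg ~ wt * hp, data = mtcars)

# Hypothetical helper: leave-one-out MSE without refitting the model n times.
loocv_mse <- function(fit) {
  mean((residuals(fit) / (1 - hatvalues(fit)))^2)
}

in_sample  <- c(additive    = mean(residuals(lm_mpg)^2),
                interaction = mean(residuals(lm_wt_hp)^2))
out_sample <- c(additive    = loocv_mse(lm_mpg),
                interaction = loocv_mse(lm_wt_hp))

print(rbind(in_sample, out_sample))
```

The out-of-sample error is always larger than the in-sample error, and comparing models on the second row is a fairer test of predictive ability than comparing them on in-sample fit.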