data605 wk 11 discussion

library(ggplot2)
library(caret)

## Loading required package: lattice

mtcars <- read.csv("https://raw.githubusercontent.com/johnpannyc/data605_wk11_discussion/master/mtcars.csv")

The “mtcars” dataset has 32 observations of 11 numeric variables; the CSV used here also carries a model column with each car's name. In addition to the fuel consumption (in miles per gallon) of 32 automobiles, the dataset captures 10 other attributes of the vehicles: the number of cylinders (cyl), displacement in cu. in. (disp), gross horsepower (hp), rear axle ratio (drat), weight in 1000 lbs (wt), quarter-mile time in seconds (qsec), engine shape, V-shaped or straight (vs), automatic or manual transmission (am), number of forward gears (gear), and number of carburetors (carb).

Objective: determine whether weight (wt) and miles per gallon (mpg) fit a one-factor (simple) linear regression model.

summary(mtcars)

##                 model         mpg             cyl             disp      
##  AMC Javelin       : 1   Min.   :10.40   Min.   :4.000   Min.   : 71.1  
##  Cadillac Fleetwood: 1   1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8  
##  Camaro Z28        : 1   Median :19.20   Median :6.000   Median :196.3  
##  Chrysler Imperial : 1   Mean   :20.09   Mean   :6.188   Mean   :230.7  
##  Datsun 710        : 1   3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0  
##  Dodge Challenger  : 1   Max.   :33.90   Max.   :8.000   Max.   :472.0  
##  (Other)           :26                                                  
##        hp             drat             wt             qsec      
##  Min.   : 52.0   Min.   :2.760   Min.   :1.513   Min.   :14.50  
##  1st Qu.: 96.5   1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89  
##  Median :123.0   Median :3.695   Median :3.325   Median :17.71  
##  Mean   :146.7   Mean   :3.597   Mean   :3.217   Mean   :17.85  
##  3rd Qu.:180.0   3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90  
##  Max.   :335.0   Max.   :4.930   Max.   :5.424   Max.   :22.90  
##                                                                 
##        vs               am              gear            carb      
##  Min.   :0.0000   Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4375   Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :5.000   Max.   :8.000  
##

head(mtcars)

##               model  mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

# Note: this formula treats wt as the response and mpg as the predictor.
fit <- lm(wt ~ mpg, data = mtcars)
summary(fit)

## 
## Call:
## lm(formula = wt ~ mpg, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6516 -0.3490 -0.1381  0.3190  1.3684 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.04726    0.30869  19.590  < 2e-16 ***
## mpg         -0.14086    0.01474  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4945 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
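
The fit above uses the formula wt ~ mpg, while the objective and the plot below put mpg on the response side. The following is a minimal sketch, not part of the original analysis, that compares the two directions. It assumes the built-in datasets::mtcars matches the CSV loaded above (with car names as row names instead of a model column); the object names fit_wt_on_mpg and fit_mpg_on_wt are made up for illustration.

```r
# Assumption: the built-in mtcars data matches the CSV used in this write-up.
data(mtcars)

fit_wt_on_mpg <- lm(wt ~ mpg, data = mtcars)  # formula used in this write-up
fit_mpg_on_wt <- lm(mpg ~ wt, data = mtcars)  # direction stated in the objective

# In simple linear regression, R-squared equals the squared correlation,
# so it is the same whichever variable is the response.
r2_a   <- summary(fit_wt_on_mpg)$r.squared
r2_b   <- summary(fit_mpg_on_wt)$r.squared
r2_cor <- cor(mtcars$wt, mtcars$mpg)^2
print(c(wt_on_mpg = r2_a, mpg_on_wt = r2_b, cor_squared = r2_cor))

# The coefficients differ because they are in different units
# (1000 lbs per mpg versus mpg per 1000 lbs).
print(coef(fit_wt_on_mpg))
print(coef(fit_mpg_on_wt))
```

The strength of the linear relationship is the same for both directions, but the coefficients and the residual standard error are not interchangeable, so which variable is the response should be decided by what needs to be predicted.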

g = ggplot(mtcars, aes(x = wt, y = mpg))
g = g + xlab("weight of vehicle")
g = g + ylab("miles per gallon")
g = g + geom_point(size = 3, color = "blue", alpha=0.5)
g = g + geom_smooth(method = "lm", color = "red")
g

# According to the plot, the coefficient estimates, the multiple R-squared, and the p-value, a linear regression model fits these data reasonably well.

Here is my explanation. Note that the formula wt ~ mpg treats wt as the response and mpg as the predictor. In a one-predictor model, the R-squared, the F-statistic, and the slope's p-value are the same whichever variable is the response, so the conclusions about the strength of the relationship do not depend on that choice.

The distribution of the residuals does not appear to be strongly symmetrical: the median is -0.1381, and the maximum (1.3684) is about twice as far from zero as the minimum (-0.6516). That means the model's predictions for some points fall far from the observed values. We could take this further by plotting the residuals to see whether they are normally distributed.

A small p-value indicates that a relationship as strong as the one observed would be unlikely to arise by chance if the predictor and the response were truly unrelated. Typically, a p-value of 5% or less is used as the cut-off. In our model, the p-values are very close to zero; three stars (asterisks) mark a highly significant p-value. The small p-value for the slope lets us reject the null hypothesis of no linear relationship and conclude that wt and mpg are related. The small p-value for the intercept only says that the intercept differs from zero.

The residual standard error is a measure of the quality of a linear regression fit. Our result is 0.4945, in the units of the response (1000 lbs), which is small relative to the mean weight of 3.217 (about 15%).

The R-squared (R2) statistic measures how well the model fits the actual data, that is, the strength of the linear relationship between the predictor and the response. It always lies between 0 and 1. In our study, R2 is 0.7528, so roughly 75% of the variance in the response (wt in the model as fitted) is explained by the predictor (mpg); the same figure holds with mpg as the response and wt as the predictor.

The F-statistic is a good indicator of whether there is a relationship between the predictor and the response. The further the F-statistic is above 1, the stronger the evidence. In our study, the F-statistic is 91.38 on 1 and 30 degrees of freedom, which is much larger than 1 given the size of our data (32 observations).
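
To make the summary statistics less of a black box, here is a short sketch that recomputes the residual standard error, R-squared, and F-statistic by hand from the residuals. It assumes the built-in datasets::mtcars matches the CSV used above; the variable names (rss, tss, rse, r2, f_stat, p_val) are chosen for illustration.

```r
# Assumption: the built-in mtcars data matches the CSV used in this write-up.
data(mtcars)
fit <- lm(wt ~ mpg, data = mtcars)

n   <- nrow(mtcars)   # number of observations
p   <- 1              # number of predictors
res <- resid(fit)

rss <- sum(res^2)                             # residual sum of squares
tss <- sum((mtcars$wt - mean(mtcars$wt))^2)   # total sum of squares

rse    <- sqrt(rss / (n - p - 1))             # residual standard error
r2     <- 1 - rss / tss                       # share of variance explained
f_stat <- ((tss - rss) / p) / (rss / (n - p - 1))
p_val  <- pf(f_stat, p, n - p - 1, lower.tail = FALSE)

print(round(c(RSE = rse, R2 = r2, F = f_stat), 4))
print(p_val)
```

These values should agree with summary(fit)$sigma, summary(fit)$r.squared, and summary(fit)$fstatistic.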

x1 <- mtcars$wt
y1 <- resid(fit)  # same residuals as lm(mtcars$wt ~ mtcars$mpg), reusing the fitted model
g = ggplot(data.frame(x = x1, y = y1), aes(x = x, y = y))
g = g + geom_hline(yintercept = 0, size = 1)
g = g + geom_point(size = 3, color = "blue", alpha = 0.5)
g = g + xlab("weight") + ylab("residual")
g

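Overall, the residuals meet the assumptions of a linear regression model, except for three very heavy vehicles. I suspect these vehicles may use diesel rather than gasoline, although the dataset does not record fuel type, so this remains a guess.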
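
One way to see which vehicles those are is to list the heaviest cars together with their residuals. This is a rough sketch under the assumption that the built-in datasets::mtcars matches the CSV used above (car names are row names there); resid_table and heaviest are illustrative names.

```r
# Assumption: the built-in mtcars data matches the CSV used in this write-up.
data(mtcars)
fit <- lm(wt ~ mpg, data = mtcars)

# One row per car: name, weight, mpg, and residual from the fit.
resid_table <- data.frame(
  model    = rownames(mtcars),
  wt       = mtcars$wt,
  mpg      = mtcars$mpg,
  residual = unname(resid(fit)),
  stringsAsFactors = FALSE
)

# The three heaviest vehicles, sorted by descending weight.
heaviest <- resid_table[order(-resid_table$wt), ][1:3, ]
print(heaviest)
```

Looking up the named models is a more direct check of the fuel-type guess than the residual plot alone.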

par(mfrow = c(2,2))
plot(fit)

The Residuals vs Fitted plot shows no distinct pattern; the points lie roughly around the horizontal line. The Normal Q-Q plot shows a slight S shape, which suggests that a few extreme values depart from a normal distribution. In the Scale-Location plot, the residuals are spread randomly but are not centered along the horizontal line. In the Residuals vs Leverage plot, most cases fall within the Cook’s distance lines.
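
The visual reading of the diagnostic plots can be backed up with numbers. The sketch below, which is an addition and not part of the original analysis, runs a Shapiro-Wilk normality test on the residuals and lists observations with large Cook's distance. It assumes the built-in datasets::mtcars matches the CSV used above, and the 4/n cutoff is only a common rule of thumb.

```r
# Assumption: the built-in mtcars data matches the CSV used in this write-up.
data(mtcars)
fit <- lm(wt ~ mpg, data = mtcars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a small p-value points to a departure from normality.
sw <- shapiro.test(resid(fit))
print(sw)

# Cook's distance for each observation; 4 / n is a rule-of-thumb cutoff,
# not a formal test.
cd     <- cooks.distance(fit)
cutoff <- 4 / nrow(mtcars)
print(sort(cd[cd > cutoff], decreasing = TRUE))
```

With only 32 observations, both checks have limited power, so they complement the plots and do not replace them.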

I think the plots indicate that a linear regression model fits the data fairly well.

A few other factors could influence our model, such as whether the car is powered by gasoline or diesel, and whether it has a manual or automatic transmission.
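
Transmission type is recorded in the dataset (am), so its influence can be checked directly; fuel type is not recorded, so it cannot. The following is a sketch only, assuming the built-in datasets::mtcars matches the CSV used above. It takes mpg as the response, following the stated objective and not the wt ~ mpg formula fitted earlier; base_fit and am_fit are illustrative names.

```r
# Assumption: the built-in mtcars data matches the CSV used in this write-up.
data(mtcars)

# Response is mpg here, following the stated objective.
base_fit <- lm(mpg ~ wt, data = mtcars)
am_fit   <- lm(mpg ~ wt + factor(am), data = mtcars)  # am: 0 = automatic, 1 = manual

print(summary(am_fit)$coefficients)

# Nested-model F test: does adding transmission improve on weight alone?
print(anova(base_fit, am_fit))
```

A large p-value in the anova table would suggest that transmission adds little once weight is accounted for; a small one would suggest it is worth keeping in the model.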

Jun Pan

April 12, 2019