library(ggplot2)
library(caret)
## Loading required package: lattice
mtcars <- read.csv("https://raw.githubusercontent.com/johnpannyc/data605_wk11_discussion/master/mtcars.csv")
The “mtcars” dataset has 32 observations of 11 variables; the CSV loaded here also carries the car name in a model column. In addition to the fuel consumption (mpg, in miles per gallon) of 32 automobiles, the dataset captures 10 other attributes of the vehicles: the number of cylinders (cyl), displacement in cubic inches (disp), gross horsepower (hp), rear axle ratio (drat), weight in thousands of pounds (wt), time to cover 1/4 mile in seconds (qsec), engine shape (vs: 0 = V-shaped, 1 = straight), transmission (am: 0 = automatic, 1 = manual), number of forward gears (gear), and number of carburetors (carb).
Objective: determine whether weight (wt) and miles per gallon (mpg) fit a one-factor (simple) linear regression model.
summary(mtcars)
## model mpg cyl disp
## AMC Javelin : 1 Min. :10.40 Min. :4.000 Min. : 71.1
## Cadillac Fleetwood: 1 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8
## Camaro Z28 : 1 Median :19.20 Median :6.000 Median :196.3
## Chrysler Imperial : 1 Mean :20.09 Mean :6.188 Mean :230.7
## Datsun 710 : 1 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0
## Dodge Challenger : 1 Max. :33.90 Max. :8.000 Max. :472.0
## (Other) :26
## hp drat wt qsec
## Min. : 52.0 Min. :2.760 Min. :1.513 Min. :14.50
## 1st Qu.: 96.5 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89
## Median :123.0 Median :3.695 Median :3.325 Median :17.71
## Mean :146.7 Mean :3.597 Mean :3.217 Mean :17.85
## 3rd Qu.:180.0 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90
## Max. :335.0 Max. :4.930 Max. :5.424 Max. :22.90
##
## vs am gear carb
## Min. :0.0000 Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4375 Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :1.0000 Max. :5.000 Max. :8.000
##
head(mtcars)
## model mpg cyl disp hp drat wt qsec vs am gear carb
## 1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
fit <- lm(wt ~ mpg, data = mtcars)
summary(fit)
##
## Call:
## lm(formula = wt ~ mpg, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6516 -0.3490 -0.1381 0.3190 1.3684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.04726 0.30869 19.590 < 2e-16 ***
## mpg -0.14086 0.01474 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4945 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
g = ggplot(mtcars, aes(x = wt, y = mpg))
g = g + xlab("weight of vehicle")
g = g + ylab("miles per gallon")
g = g + geom_point(size = 3, color = "blue", alpha=0.5)
g = g + geom_smooth(method = "lm", color = "red")
g
# Judging by the plot, the coefficient estimates, the multiple R-squared, and the p-value, a linear regression model fits these data.
Here is my explanation. Note that the model fitted above is wt ~ mpg, so weight is the response and mpg is the predictor. In a simple linear regression the R-squared, the F-statistic, and the slope p-value are the same in either direction, but the coefficients and the residual standard error are in units of wt.
We can see that the distribution of the residuals does not appear to be strongly symmetrical: the median is -0.1381, and the maximum (1.3684) lies about twice as far from zero as the minimum (-0.6516). That means the model predicts some points that fall far from the observed values. We could take this further by plotting the residuals to see whether they are normally distributed.
A small p-value indicates that a relationship this strong between the predictor and the response would be unlikely to appear by chance alone if no real relationship existed. Typically, a p-value of 5% or less is a good cut-off point. In our model, the p-values are very close to zero. Three stars (or asterisks) represent a highly significant p-value. The small p-value for the slope means we can reject the null hypothesis of a zero slope, which allows us to conclude that there is a relationship between wt and mpg. The small p-value for the intercept only tells us that the intercept differs from zero.
Residual standard error is a measure of the quality of a linear regression fit. Our result is 0.4945 (in thousands of pounds), which is small relative to the mean weight of 3.217.
The R-squared (R2) statistic provides a measure of how well the model fits the actual data. In a simple regression, R2 measures the strength of the linear relationship between the predictor and the response. It always lies between 0 and 1. In our study, the R2 is 0.7528, so roughly 75% of the variance in one variable can be explained by the other; this holds whether mpg or wt is treated as the response.
The F-statistic is a good indicator of whether there is a relationship between the predictor and the response variables. The further the F-statistic is from 1, the better. In our study, the F-statistic is 91.38 on 1 and 30 degrees of freedom, which is much larger than 1 given the size of our data (32 observations).
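The summary above comes from `lm(wt ~ mpg)`, while the objective reads more naturally with mpg as the response. The following is a minimal sketch of how the two directions compare. It assumes the built-in `datasets::mtcars` matches the CSV loaded above, and the names `cars`, `fit_wt_on_mpg`, and `fit_mpg_on_wt` are only illustrative.

```r
# Assumption: the built-in mtcars data matches the CSV used in this post.
cars <- datasets::mtcars

fit_wt_on_mpg <- lm(wt ~ mpg, data = cars)  # direction used in the summary above
fit_mpg_on_wt <- lm(mpg ~ wt, data = cars)  # direction stated in the objective

# In simple linear regression, R-squared equals the squared correlation,
# so it does not depend on which variable is the response.
r2_a <- summary(fit_wt_on_mpg)$r.squared
r2_b <- summary(fit_mpg_on_wt)$r.squared
all.equal(r2_a, r2_b)
all.equal(r2_a, cor(cars$wt, cars$mpg)^2)

# The coefficients and the residual standard error do depend on the direction.
coef(fit_mpg_on_wt)
sigma(fit_mpg_on_wt)  # in miles per gallon
```

With mpg as the response, the slope reads as the expected change in miles per gallon per additional 1000 lbs of weight, which may be easier to interpret.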
# Residuals of the fitted model (wt ~ mpg), plotted against vehicle weight
res_df <- data.frame(x = mtcars$wt, y = resid(fit))
g = ggplot(res_df, aes(x = x, y = y))
g = g + geom_hline(yintercept = 0, size = 1)
g = g + geom_point(size = 3, color = "blue", alpha = 0.5)
g = g + xlab("weight") + ylab("residual")
g
Basically, the fit meets the assumptions of a linear regression model, except for three very heavy vehicles. I guess these vehicles may use diesel instead of gas, but the dataset does not record fuel type, so this cannot be checked here.
par(mfrow = c(2,2))
plot(fit)
The Residuals vs Fitted plot does not show a distinct pattern; the points lie roughly around the horizontal line. The Normal Q-Q plot shows a slight S shape, which indicates that a few extreme values depart from the normal distribution. In the Scale-Location plot, the residuals are spread randomly but are not centered along the horizontal line. In the Residuals vs Leverage plot, most cases fall within the Cook’s distance lines.
I think the plots indicate that a linear regression model fits the data fairly well.
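The visual checks can be backed up with numbers. This is a rough sketch, again assuming the built-in `datasets::mtcars` matches the CSV; `fit_check` is an illustrative name, and the 4/n cut-off for Cook's distance is a common rule of thumb rather than a strict test.

```r
# Assumption: the built-in mtcars data matches the CSV used in this post.
cars <- datasets::mtcars
fit_check <- lm(wt ~ mpg, data = cars)  # same formula as the model above

# Shapiro-Wilk test on the residuals: a small p-value suggests non-normality.
sw <- shapiro.test(resid(fit_check))
sw$p.value

# Cook's distance per car; list the cars above the rule-of-thumb cut-off 4/n.
cd <- cooks.distance(fit_check)
threshold <- 4 / nrow(cars)
cd[cd > threshold]
```

Any cars listed by the last line are worth a closer look before trusting the fit.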
A few other factors could be influential in our model, such as whether the car is powered by gas or diesel, and whether it has a manual or an automatic transmission.
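Fuel type is not in the dataset, but transmission (am) is, so its influence can be tested. The sketch below assumes the built-in `datasets::mtcars` matches the CSV and treats mpg as the response, as in the objective; `fit_simple` and `fit_am` are illustrative names.

```r
# Assumption: the built-in mtcars data matches the CSV used in this post.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

fit_simple <- lm(mpg ~ wt, data = cars)       # weight only
fit_am     <- lm(mpg ~ wt + am, data = cars)  # weight plus transmission

# F test for nested models: does adding transmission improve the fit?
cmp <- anova(fit_simple, fit_am)
cmp

# Adjusted R-squared penalizes the extra term, so it is a fair comparison.
summary(fit_simple)$adj.r.squared
summary(fit_am)$adj.r.squared
```

A large p-value in the `anova` table would suggest that transmission adds little once weight is accounted for.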