price and weight. Suppose that you want to build a regression model that predicts the price of cars using a data set named cars.
Make sure to interpret the direction and the magnitude of the relationship. In addition, keep in mind that correlation (or regression) coefficients do not show causation but only association.
Create scatterplots
# Load the package
library(openintro)
library(ggplot2)
str(bdims)
## 'data.frame': 507 obs. of 25 variables:
## $ bia.di: num 42.9 43.7 40.1 44.3 42.5 43.3 43.5 44.4 43.5 42 ...
## $ bii.di: num 26 28.5 28.2 29.9 29.9 27 30 29.8 26.5 28 ...
## $ bit.di: num 31.5 33.5 33.3 34 34 31.5 34 33.2 32.1 34 ...
## $ che.de: num 17.7 16.9 20.9 18.4 21.5 19.6 21.9 21.8 15.5 22.5 ...
## $ che.di: num 28 30.8 31.7 28.2 29.4 31.3 31.7 28.8 27.5 28 ...
## $ elb.di: num 13.1 14 13.9 13.9 15.2 14 16.1 15.1 14.1 15.6 ...
## $ wri.di: num 10.4 11.8 10.9 11.2 11.6 11.5 12.5 11.9 11.2 12 ...
## $ kne.di: num 18.8 20.6 19.7 20.9 20.7 18.8 20.8 21 18.9 21.1 ...
## $ ank.di: num 14.1 15.1 14.1 15 14.9 13.9 15.6 14.6 13.2 15 ...
## $ sho.gi: num 106 110 115 104 108 ...
## $ che.gi: num 89.5 97 97.5 97 97.5 ...
## $ wai.gi: num 71.5 79 83.2 77.8 80 82.5 82 76.8 68.5 77.5 ...
## $ nav.gi: num 74.5 86.5 82.9 78.8 82.5 80.1 84 80.5 69 81.5 ...
## $ hip.gi: num 93.5 94.8 95 94 98.5 95.3 101 98 89.5 99.8 ...
## $ thi.gi: num 51.5 51.5 57.3 53 55.4 57.5 60.9 56 50 59.8 ...
## $ bic.gi: num 32.5 34.4 33.4 31 32 33 42.4 34.1 33 36.5 ...
## $ for.gi: num 26 28 28.8 26.2 28.4 28 32.3 28 26 29.2 ...
## $ kne.gi: num 34.5 36.5 37 37 37.7 36.6 40.1 39.2 35.5 38.3 ...
## $ cal.gi: num 36.5 37.5 37.3 34.8 38.6 36.1 40.3 36.7 35 38.6 ...
## $ ank.gi: num 23.5 24.5 21.9 23 24.4 23.5 23.6 22.5 22 22.2 ...
## $ wri.gi: num 16.5 17 16.9 16.6 18 16.9 18.8 18 16.5 16.9 ...
## $ age : int 21 23 28 23 22 21 26 27 23 21 ...
## $ wgt : num 65.6 71.8 80.7 72.6 78.8 74.8 86.4 78.4 62 81.6 ...
## $ hgt : num 174 175 194 186 187 ...
## $ sex : int 1 1 1 1 1 1 1 1 1 1 ...
# Relationship between height and weight
ggplot(data = bdims, aes(x = wgt, y = hgt)) + # bdims dataset is from the openintro R package
  geom_point()
# Compute correlation coefficient
cor(bdims$hgt, bdims$wgt, use = "pairwise.complete.obs")
## [1] 0.7173011
Interpretation
Run a regression model for price with one explanatory variable, weight, and answer Q2 through Q5.
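The correlation between the two variables is above 0.6, which indicates a strong positive association between weight and price: heavier cars tend to be more expensive. This is an association only and does not show that weight causes a higher price.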
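In a regression with a single explanatory variable, the squared correlation equals the multiple R-squared, so the strength of the correlation can be cross-checked against the model summary. The sketch below illustrates this on a small made-up data frame; the name `toy` and its values are hypothetical and are not taken from the cars data.

```r
# Hypothetical toy data, for illustration only (not the cars data set)
toy <- data.frame(
  weight = c(2200, 2500, 2800, 3100, 3400, 3700, 4000),
  price  = c(9, 12, 16, 15, 22, 27, 35)
)

# Correlation is symmetric: cor(x, y) equals cor(y, x)
r <- cor(toy$price, toy$weight)

# In simple linear regression, R-squared equals the squared correlation
fit <- lm(price ~ weight, data = toy)
r_squared <- summary(fit)$r.squared

all.equal(r^2, r_squared)
```

By the same identity, a multiple R-squared of 0.5747 together with a positive slope, as in the summary of model 1, corresponds to a correlation of about 0.76.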
Yes, the coefficient is statistically significant at the 5% level: its p-value (3.17e-11) is far below 0.05, which is why three asterisks are displayed next to it. The more asterisks, the smaller the p-value and the stronger the evidence against a zero coefficient.
## Q3. What price does the model predict for a car that weighs 4000 pounds?
Reading from the plot, a car that weighs 4000 pounds is predicted to cost about $48,000 USD.
## Q4. What is the reported residual standard error? What does it mean?
The residual standard error is 433 on 52 degrees of freedom. It measures the typical size of the residuals, that is, the typical difference between a car's actual weight and the weight the model predicts for it.
## Q5. What is the reported adjusted R squared? What does it mean?
The adjusted R-squared is 0.5666, which means that about 56.7% of the variability in weight is explained by price, after adjusting for the number of explanatory variables in the model.
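The quantities asked for in Q3 through Q5 can also be read directly from a fitted model object instead of from a plot or the printed summary. The following is a minimal sketch on made-up data, assuming price is the response and weight the explanatory variable as the task describes; the data frame `toy` and its values are hypothetical.

```r
# Hypothetical toy data, for illustration only (not the cars data set)
toy <- data.frame(
  weight = c(2200, 2500, 2800, 3100, 3400, 3700, 4000),
  price  = c(9, 12, 16, 15, 22, 27, 35)
)

# Price is the response and weight is the explanatory variable
fit <- lm(price ~ weight, data = toy)

# Q3: predicted price for a car that weighs 4000 pounds
pred_4000 <- predict(fit, newdata = data.frame(weight = 4000))

# Q4: residual standard error, in the units of the response
rse <- sigma(fit)

# Q5: adjusted R-squared
adj_r2 <- summary(fit)$adj.r.squared

c(prediction = unname(pred_4000), rse = rse, adj_r2 = adj_r2)
```

Using `predict()` avoids reading an approximate value off a plot, and the prediction is always expressed in the units of whichever variable sits on the left of the formula.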
Run a second regression model for price with two explanatory variables: weight and passengers, and answer Q6.
The second model fits better: its residual standard error is smaller (347.4 versus 433) and its adjusted R-squared is higher (0.7209 versus 0.5666), so it leaves less unexplained variation even after accounting for the additional explanatory variable.
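A side-by-side table of fit statistics makes this kind of comparison easier to check. The sketch below shows one possible way to build it, again on made-up data; the data frame `toy`, its values, and the names `models` and `comparison` are hypothetical.

```r
# Hypothetical toy data, for illustration only (not the cars data set)
toy <- data.frame(
  weight     = c(2200, 2500, 2800, 3100, 3400, 3700, 4000, 3000),
  price      = c(9, 12, 16, 15, 22, 27, 35, 20),
  passengers = c(4, 5, 5, 6, 5, 6, 7, 4)
)

# Two candidate models for the same response
models <- list(
  one_predictor  = lm(price ~ weight, data = toy),
  two_predictors = lm(price ~ weight + passengers, data = toy)
)

# One row per model: residual standard error and adjusted R-squared
comparison <- data.frame(
  rse    = sapply(models, sigma),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared)
)
comparison
```

Adjusted R-squared is the more suitable of the two R-squared values here because, unlike multiple R-squared, it does not automatically increase when a predictor is added.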
Build regression model
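# Create linear model 1
# Note: this formula treats weight as the response and price as the predictor;
# predicting price from weight, as the task states, would be price ~ weight
mod_1 <- lm(weight ~ price, data = cars)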
# View summary of model 1
summary(mod_1)
##
## Call:
## lm(formula = weight ~ price, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1328.29 -228.09 10.92 258.19 924.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2171.113 118.956 18.251 < 2e-16 ***
## price 43.331 5.169 8.383 3.17e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 433 on 52 degrees of freedom
## Multiple R-squared: 0.5747, Adjusted R-squared: 0.5666
## F-statistic: 70.28 on 1 and 52 DF, p-value: 3.173e-11
# Create linear model 2
# Note: weight is the response here; the task's model would be price ~ weight + passengers
mod_2 <- lm(weight ~ price + passengers, data = cars)
# View summary of model 2
summary(mod_2)
##
## Call:
## lm(formula = weight ~ price + passengers, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -976.81 -201.56 6.13 151.33 799.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 294.25 356.98 0.824 0.414
## price 35.99 4.36 8.256 5.80e-11 ***
## passengers 395.91 72.56 5.456 1.44e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 347.4 on 51 degrees of freedom
## Multiple R-squared: 0.7315, Adjusted R-squared: 0.7209
## F-statistic: 69.46 on 2 and 51 DF, p-value: 2.748e-15
Interpretation