# Load packages
library(dplyr)
library(ggplot2)
library(openintro)
The simple linear regression model can be visualized as a straight “best fit” line through the data. The line is chosen to minimize the sum of squared vertical distances (the residuals) between the line and the data points. In ggplot2, the geom_smooth() function with method = "lm" draws this line.
# Scatterplot with regression line
ggplot(data = cars, aes(x = weight, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) # method = "lm" fits a linear model; se = FALSE hides the confidence band
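To make the "best fit" idea concrete, here is a minimal sketch comparing the least-squares line with two slightly different lines. The data frame `toy` and the helper `ssr()` are hypothetical stand-ins invented for illustration (they are not the openintro data), so the block runs on its own.

```r
# Hypothetical toy data (NOT the openintro cars data):
# weight in pounds, price in thousands of dollars
toy <- data.frame(
  weight = c(2200, 2600, 2900, 3200, 3600, 4000),
  price  = c(9.5, 13.0, 18.5, 21.0, 29.5, 31.0)
)

# Sum of squared vertical distances between the points and the line a + b * weight
ssr <- function(a, b, data) {
  sum((data$price - (a + b * data$weight))^2)
}

# Least-squares fit, the same kind of line geom_smooth(method = "lm") draws
fit   <- lm(price ~ weight, data = toy)
a_hat <- coef(fit)[["(Intercept)"]]
b_hat <- coef(fit)[["weight"]]

ssr_fit     <- ssr(a_hat, b_hat, toy)          # least-squares line
ssr_shifted <- ssr(a_hat + 1, b_hat, toy)      # same slope, intercept raised by 1
ssr_tilted  <- ssr(a_hat, b_hat * 1.05, toy)   # same intercept, slope 5% steeper

# The fitted line should have the smallest sum of squared residuals
c(fit = ssr_fit, shifted = ssr_shifted, tilted = ssr_tilted)
```

Any other choice of intercept or slope gives a larger sum of squared residuals than the fitted line, which is the sense in which the line is the best fit.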
Show that the mean of residuals is zero (not exactly zero due to rounding error). Calculate residual standard error.
# Create a linear model
mod <- lm(price ~ weight, data = cars)
# View summary of model
summary(mod)
##
## Call:
## lm(formula = price ~ weight, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -12.767  -3.766  -1.155   2.568  35.440
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20.295205   4.915159  -4.129 0.000132 ***
## weight        0.013264   0.001582   8.383 3.17e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.575 on 52 degrees of freedom
## Multiple R-squared: 0.5747, Adjusted R-squared: 0.5666
## F-statistic: 70.28 on 1 and 52 DF, p-value: 3.173e-11
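The task above asks for the mean of the residuals and the residual standard error, but the code only prints the summary. The sketch below shows one way to compute both. It uses a hypothetical data frame `toy` (made up for illustration) so that it runs without the openintro package; with the report's own model, the same calls would presumably be applied to `mod`.

```r
# Hypothetical toy data (NOT the openintro cars data):
# weight in pounds, price in thousands of dollars
toy <- data.frame(
  weight = c(2200, 2600, 2900, 3200, 3600, 4000),
  price  = c(9.5, 13.0, 18.5, 21.0, 29.5, 31.0)
)
fit <- lm(price ~ weight, data = toy)

# Residuals: observed price minus fitted price
res <- residuals(fit)

# Mean of the residuals: zero up to floating-point rounding error
mean_res <- mean(res)

# Residual standard error by hand: sqrt(SSR / (n - 2)) for simple regression,
# where df.residual(fit) returns n - 2
rse_manual <- sqrt(sum(res^2) / df.residual(fit))

# The same quantity as reported by summary(); sigma() extracts it directly
rse_builtin <- sigma(fit)

c(mean_residual = mean_res, rse_manual = rse_manual, rse_builtin = rse_builtin)
```

The degrees of freedom are n - 2 because two parameters (intercept and slope) are estimated, which matches the "52 degrees of freedom" reported for the 54 cars in the summary.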
Interpretation
Is the coefficient statistically significant at 5%? Yes. Its p-value (3.17e-11) is far below 0.05.
Is the y-intercept statistically significant at 5%? Yes. Its p-value (0.000132) is below 0.05.
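Interpret the coefficient of weight. Each additional pound of weight is associated with an increase of about 0.013264 in the predicted price. With price recorded in thousands of dollars, that is about $13.26 per pound.
Interpret the y-intercept. The intercept of -20.295 is the predicted price (in thousands of dollars) of a car that weighs 0 pounds. No car weighs 0 pounds, so this value lies far outside the range of the data and has no practical meaning. A negative intercept does not make the data or the model invalid.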
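What would be the price of a car that weighs 3000 pounds? About $19,497, since -20.295205 + 0.013264 × 3000 ≈ 19.497 in thousands of dollars. The intercept must be included in the calculation.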
What is the reported residual standard error? What does it mean? 7.575 on 52 degrees of freedom. It means observed prices typically deviate from the best-fit line by about 7.575, or roughly $7,575 with price in thousands of dollars.
What is the reported adjusted R squared? What does it mean? The adjusted R squared is 0.5666, meaning about 56.66% of the variability in price is explained by weight, after adjusting for the number of predictors.
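The prediction for a 3000-pound car can be checked with a few lines of arithmetic. This sketch plugs in the coefficients as printed in the summary output and assumes, as the answers above do, that price is recorded in thousands of dollars. The variable names are illustrative.

```r
# Coefficients copied from the printed summary(mod) output (rounded as displayed)
intercept <- -20.295205
slope     <- 0.013264

weight_lb <- 3000

# Predicted price = intercept + slope * weight
# (assumed to be in thousands of dollars)
price_thousands <- intercept + slope * weight_lb
price_dollars   <- price_thousands * 1000

round(price_dollars)  # about 19497
```

With the fitted model object available, `predict(mod, newdata = data.frame(weight = 3000))` should give essentially the same value, possibly differing in the last digits because the printed coefficients are rounded.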
Find another variable that is strongly correlated to price. Demonstrate the nature of the relationship with a scatterplot and a regression model. Explain them in a sentence or two.
## Chapter 3: Simple linear regression
# Scatterplot with regression line
ggplot(data = cars, aes(x = mpgCity, y = price)) + # predictor on x, response on y, matching price ~ mpgCity
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) # method = "lm" fits a linear model; se = FALSE hides the confidence band
Show that the mean of residuals is zero (not exactly zero due to rounding error). Calculate residual standard error.
# Create a linear model
mod <- lm(price ~ mpgCity, data = cars)
# View summary of model
summary(mod)
##
## Call:
## lm(formula = price ~ mpgCity, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -11.447  -5.368  -2.850   5.140  37.134
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  45.7839     4.4983  10.178 5.63e-14 ***
## mpgCity      -1.1062     0.1857  -5.956 2.26e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.956 on 52 degrees of freedom
## Multiple R-squared: 0.4056, Adjusted R-squared: 0.3941
## F-statistic: 35.48 on 1 and 52 DF, p-value: 2.256e-07
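Price is negatively correlated with mpgCity: in general, as city mileage increases, price decreases. The slope of -1.1062 means each additional mile per gallon in the city is associated with a predicted price about 1.1062 lower (roughly $1,106 with price in thousands of dollars). The relationship is statistically significant (p-value 2.26e-07), and the adjusted R squared of 0.3941 indicates that mpgCity explains about 39% of the variability in price, less than weight does.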
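The strength and direction of the relationship can also be summarized with the correlation coefficient. In simple linear regression, the squared correlation equals the multiple R squared, so the reported 0.4056 corresponds to a correlation of about -0.64. The sketch below checks that identity on a hypothetical data frame `toy` (made up for illustration, not the openintro data) so that it runs on its own.

```r
# Hypothetical toy data (NOT the openintro cars data):
# city mileage in mpg, price in thousands of dollars
toy <- data.frame(
  mpgCity = c(18, 20, 23, 25, 29, 33),
  price   = c(34.0, 27.5, 21.0, 19.0, 12.5, 9.0)
)
fit <- lm(price ~ mpgCity, data = toy)

# Pearson correlation: the sign gives the direction, the magnitude the strength
r <- cor(toy$price, toy$mpgCity)

# Multiple R squared from the regression summary
r2 <- summary(fit)$r.squared

# In simple linear regression, r^2 equals the multiple R squared
c(correlation = r, r_squared_from_cor = r^2, r_squared_from_lm = r2)
```

A negative correlation always goes with a negative slope, which is why the downward-sloping line in the scatterplot and the negative mpgCity coefficient tell the same story.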