Quiz on Regression

# Load packages
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(openintro)

## Please visit openintro.org for free statistics materials

## 
## Attaching package: 'openintro'

## The following object is masked from 'package:ggplot2':
## 
##     diamonds

## The following objects are masked from 'package:datasets':
## 
##     cars, trees

Chapter 3: Simple linear regression

3.1 The “best fit” line

The simple linear regression model can be visualized as a straight line, a “best fit” line that cuts through the data so as to minimize the sum of squared vertical distances (the residuals) between the line and the data points. In ggplot2, geom_smooth(method = "lm") draws this line on top of a scatterplot.

# Scatterplot with regression line
ggplot(data = cars, aes(x = weight, y = price)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) # lm stands for linear model; se for standard errors
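
To make the “best fit” idea concrete, here is a minimal sketch that computes the least-squares slope and intercept by hand and compares them with lm(). It uses the built-in mtcars data as a stand-in (an assumption, so that the code runs without openintro), and the helper sse() is a hypothetical name introduced only for this illustration.

```r
# Stand-in data: built-in mtcars (wt = car weight, mpg = miles per gallon).
# This is NOT the quiz data; it is used only so the sketch runs on its own.
x <- mtcars$wt
y <- mtcars$mpg

# Closed-form least-squares estimates for a single predictor
slope <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# The same estimates from lm()
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)

# Hypothetical helper: sum of squared residuals for any candidate line
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

sse(intercept, slope)        # fitted line
sse(intercept + 0.5, slope)  # shifted line: larger sum of squares
sse(intercept, slope + 0.1)  # tilted line: larger sum of squares
```

Any other line gives a larger sum of squared residuals than the fitted one, which is the sense in which the line is the “best fit”.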

Chapter 4: Interpreting regression models

4.1 Fitting simple linear models

Create a linear model using lm(). This function returns a model object of class “lm”. The object contains a lot of information about your regression model, including:

the data used to fit the model,
the specification of the model,
the fitted values and residuals.

# Linear model for price as a function of weight
lm(price ~ weight, data = cars)

## 
## Call:
## lm(formula = price ~ weight, data = cars)
## 
## Coefficients:
## (Intercept)       weight  
##   -20.29521      0.01326
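
The pieces stored in an “lm” object can be pulled out with standard accessor functions. The sketch below again uses the built-in mtcars data as a stand-in (an assumption, not the quiz data); the object name fit is likewise only illustrative.

```r
# Stand-in model on built-in data (NOT the quiz data)
fit <- lm(mpg ~ wt, data = mtcars)

class(fit)               # class of the model object
formula(fit)             # the specification of the model
head(model.frame(fit))   # the data used to fit the model
head(fitted(fit))        # fitted values
head(residuals(fit))     # residuals
coef(fit)                # estimated intercept and slope
```

For every observation, the fitted value plus the residual gives back the observed response.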

Interpretation

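coefficient A one-pound increase in the weight of a car is associated with an increase of about 0.01326 in its price, measured in the units in which price is recorded (these appear to be thousands of dollars).
intercept When a car weighs 0 pounds, its predicted price is -20.295. Obviously, the intercept is meaningless in this case, since no car weighs 0 pounds and a price cannot be negative.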

Chapter 5: Model Fit

5.2 Standard error of residuals (Residual Standard Error)

Show that the mean of the residuals is zero (not exactly zero, because of rounding error). Calculate the residual standard error.

# Create a linear model
mod <- lm(price ~ weight, data = cars)

# View summary of model
summary(mod)

## 
## Call:
## lm(formula = price ~ weight, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.767  -3.766  -1.155   2.568  35.440 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -20.295205   4.915159  -4.129 0.000132 ***
## weight        0.013264   0.001582   8.383 3.17e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.575 on 52 degrees of freedom
## Multiple R-squared:  0.5747, Adjusted R-squared:  0.5666 
## F-statistic: 70.28 on 1 and 52 DF,  p-value: 3.173e-11

Interpretation

coefficient The coefficient on weight is statistically significant at the 5% level: its p-value (3.17e-11) is far below 0.05. Three stars mean the p-value is below 0.001.
intercept The y-intercept is also statistically significant at the 5% level (p-value 0.000132, three stars).
prediction The predicted price of a car that weighs 3000 pounds is about 19.50, in the same units as price. I found it by plugging the weight into the fitted line: -20.295205 + 0.013264 × 3000 = 19.496795.
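
The prediction for a 3000-pound car can be checked directly from the coefficients printed in the summary. This is a small sketch; the variable names b0, b1 and pred_3000 are introduced here only for illustration.

```r
# Coefficients copied from the summary(mod) output above
b0 <- -20.295205  # intercept
b1 <- 0.013264    # slope on weight

# Predicted price for a car weighing 3000 pounds
pred_3000 <- b0 + b1 * 3000
pred_3000  # 19.496795
```

With mod still in the session, predict(mod, newdata = data.frame(weight = 3000)) should give the same value up to the rounding of the printed coefficients.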

The magnitude of a typical residual is 7.575 (the residual standard error, in units of price).
In other words, the model's predicted price typically misses the actual price by about 7.575.
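
The two tasks in this section, checking that the residuals average to zero and computing the residual standard error, can be sketched as follows. The example uses the built-in mtcars data as a stand-in (an assumption, so that the code runs without openintro).

```r
# Stand-in model on built-in data (NOT the quiz data)
fit <- lm(mpg ~ wt, data = mtcars)

res <- residuals(fit)

# Mean of the residuals: zero up to floating-point rounding
mean_res <- mean(res)
mean_res

# Residual standard error: sqrt(sum of squared residuals / residual df)
rse <- sqrt(sum(res^2) / df.residual(fit))
rse

# The same quantity as reported by summary()
summary(fit)$sigma
```

Run on the quiz's own model, with mod in place of fit, the same calls should reproduce the residual standard error and degrees of freedom shown in the summary above.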

5.4 Interpretation of R^2

The R^2 reported for the regression model for price in terms of weight is 0.5747. This means that 57.47% of the variability in price can be explained by weight.
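
As a sketch of where R^2 comes from, it can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. In simple linear regression it also equals the squared correlation between the two variables. The built-in mtcars data is used as a stand-in (an assumption, not the quiz data).

```r
# Stand-in model on built-in data (NOT the quiz data)
fit <- lm(mpg ~ wt, data = mtcars)

sse <- sum(residuals(fit)^2)                    # unexplained variation
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total variation

r2 <- 1 - sse / sst                             # share of variation explained
r2_cor <- cor(mtcars$wt, mtcars$mpg)^2          # squared correlation

r2
r2_cor
summary(fit)$r.squared                          # value reported by summary()
```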

# Scatterplot with regression line
ggplot(data = cars, aes(x = weight, y = price)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) # lm stands for linear model; se for standard errors