Getting started: Setting up the environment
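
A minimal setup sketch; it assumes the packages used below are the tidyverse (for ggplot2 and the dplyr pipe/summarise) and skimr (for skim()):

# Install once if needed:
# install.packages(c("tidyverse", "skimr"))
library(tidyverse)   # ggplot2, dplyr, and the %>% pipe
library(skimr)       # skim() data summaries

# The cars dataset ships with base R (datasets package)
data(cars)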

The dataset

head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

Data exploration

dim(cars)
## [1] 50  2
str(cars)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
skim(cars)
Data summary

  Name                     cars
  Number of rows           50
  Number of columns        2
  Column type frequency:
    numeric                2
  Group variables          None

Variable type: numeric

  skim_variable  n_missing  complete_rate   mean     sd  p0  p25  p50  p75  p100  hist
  speed                  0              1  15.40   5.29   4   12   15   19    25  ▂▅▇▇▃
  dist                   0              1  42.98  25.77   2   26   36   56   120  ▅▇▅▂▁

Summary statistics

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Correlation between speed & distance

ggplot(cars, aes(x = speed, y = dist)) +
  geom_point(color = 4)

Strength and direction of the correlation

cars %>%
  summarise(cor(speed, dist, use = "complete.obs"))
##   cor(speed, dist, use = "complete.obs")
## 1                              0.8068949
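
For a formal check of this correlation, base R's cor.test() reports the same coefficient together with a p-value and confidence interval (a quick sketch):

# Test the correlation between speed and stopping distance
cor.test(cars$speed, cars$dist)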

Linear regression

stop_dist <- lm(dist ~ speed, data = cars)
model <- stop_dist
summary(model)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
# Predicted stopping distance for an example speed of 20 mph,
# using the fitted equation: dist = -17.5791 + 3.9324 * speed
y <- -17.5791 + 3.9324 * 20
y
## [1] 61.0689
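
The same prediction can be obtained directly from the fitted model with predict(), which can also give a prediction interval for a new observation (a sketch, reusing the example speed of 20 mph):

# Same point prediction via predict(); newdata uses the model's variable name
predict(model, newdata = data.frame(speed = 20))

# With a 95% prediction interval for a single new car
predict(model, newdata = data.frame(speed = 20), interval = "prediction")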

Prediction and prediction errors

Regression line

ggplot(data = cars, aes(x = speed, y = dist)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
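
One simple way to quantify the prediction errors mentioned in the heading above is to look at the residuals of the fitted model and their root mean squared error (a sketch, using the model fitted earlier):

# Prediction errors (residuals): observed minus predicted stopping distance
errors <- cars$dist - predict(model)   # equivalent to resid(model)

# Root mean squared error: typical size of a prediction error, in feet
sqrt(mean(errors^2))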

Model diagnostics

To assess whether our linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability (homoscedasticity).

Residuals vs. fitted (predicted) values

# Fit a linear regression model
model <- lm(dist ~ speed, data = cars)

# Get fitted values and residuals from the model
fitted_values <- fitted(model)
residuals <- resid(model)

# Create residual plot

ggplot(data = data.frame(fitted_values, residuals),
       aes(x = fitted_values, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

Quantile vs. quantile plot

# Fit a linear regression model
cars.lm <- lm(dist ~ speed, data = cars)

# Create Q-Q plot of residuals
qqnorm(resid(cars.lm))
qqline(resid(cars.lm))
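
As an aside, base R can produce the same kinds of diagnostic plots directly from the fitted model object; a minimal sketch:

# Base R diagnostics: residuals vs. fitted (which = 1) and Q-Q plot (which = 2)
plot(cars.lm, which = 1:2)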

Our conclusion:

  • The residuals vs. fitted values plot gives no indication that the assumptions of our model are violated. The variability of the residuals is about the same across the different values of x, supporting the assumption of constant variance (homoscedasticity).

  • The scatterplot shows that while many values cluster more or less around the regression line, there are also a few values that lie rather far from it, indicating that our model may not be perfect and may need improvement.

  • The quantile vs. quantile plot shows the residuals forming an almost straight line, but it also indicates the possible presence of outliers. We can address this and improve our model by applying a log transformation to either the x or y values, or by adding an x^2 term.

Overall, the residual plots support the assumptions of linearity and homoscedasticity. However, the Q-Q plot suggests that the assumption of nearly normal residuals is not reasonable. To address this, a log transformation of either the x or y values may be needed to reduce the influence of outliers and improve our model before we move on to statistical inference techniques.
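
As a sketch of that suggested next step (one possible transformation, not necessarily the final model), the response can be log-transformed and the residuals rechecked:

# Refit with a log-transformed response and recheck the Q-Q plot
cars.log <- lm(log(dist) ~ speed, data = cars)
qqnorm(resid(cars.log))
qqline(resid(cars.log))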