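library(ggplot2)  # ggplot(), geom_point(), stat_smooth()
library(dplyr)    # %>% and summarise()
library(skimr)    # skim()
head(cars)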
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
dim(cars)
## [1] 50 2
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
skim(cars)
| Name | cars |
| Number of rows | 50 |
| Number of columns | 2 |
| _______________________ | |
| Column type frequency: | |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| speed | 0 | 1 | 15.40 | 5.29 | 4 | 12 | 15 | 19 | 25 | ▂▅▇▇▃ |
| dist | 0 | 1 | 42.98 | 25.77 | 2 | 26 | 36 | 56 | 120 | ▅▇▅▂▁ |
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
ggplot(cars, aes(x = speed, y = dist)) +
  geom_point(color = 4)
### Strength and direction of the correlation:
cars %>%
summarise(cor(speed, dist, use = "complete.obs"))
## cor(speed, dist, use = "complete.obs")
## 1 0.8068949
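A correlation of about 0.81 points to a strong, positive linear association between speed and stopping distance. As a minimal sketch, assuming base R's `cor.test()` is an acceptable way to attach a confidence interval to that estimate, the check below also shows that the squared correlation matches the `Multiple R-squared` of a simple linear regression. The names `r_test` and `r` are illustrative and do not appear elsewhere in this analysis.

```r
# Pearson correlation between speed and stopping distance,
# with a test of H0: rho = 0 and a 95% confidence interval.
r_test <- cor.test(cars$speed, cars$dist, method = "pearson")

r <- unname(r_test$estimate)
r                  # sample correlation
r^2                # in simple linear regression this equals Multiple R-squared
r_test$conf.int    # 95% confidence interval for the correlation
```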
stop_dist <- lm(dist ~ speed, data = cars)
model <- stop_dist
summary(model)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
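# Predicted stopping distance (ft) from the fitted line: dist = -17.5791 + 3.9324 * speed
# speed = 15 mph (the sample median) is used here as an illustrative value
y <- -17.5791 + 3.9324 * 15
y
## [1] 41.4069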
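A hand calculation like the one above can also be delegated to `predict()`, which avoids copying rounded coefficients and can return prediction intervals. This is a sketch under two assumptions: the model is fitted with bare column names (`dist ~ speed`), since `predict()` cannot match `newdata` to terms written as `cars$speed`, and the speeds in `new_speeds` are hypothetical values picked for illustration from within the observed range of 4 to 25 mph.

```r
# Fit with bare column names so predict() can match the columns of newdata.
fit <- lm(dist ~ speed, data = cars)

# Hypothetical speeds (mph), chosen for illustration only.
new_speeds <- data.frame(speed = c(10, 15, 20))

# Point predictions with 95% prediction intervals for a single new car.
pred <- predict(fit, newdata = new_speeds, interval = "prediction", level = 0.95)
cbind(new_speeds, pred)
```

The `fit` column holds the predicted stopping distance; `lwr` and `upr` bound where an individual new observation is expected to fall, which is wider than a confidence interval for the mean response.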
ggplot(data = cars, aes(x = speed, y = dist)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
## Model diagnostics
# Fit a linear regression model
model <- lm(dist ~ speed, data = cars)
# Collect fitted values and residuals in a data frame
# (this avoids masking the base function residuals())
diag_df <- data.frame(fitted = fitted(model), resid = resid(model))
# Create residual plot
ggplot(diag_df, aes(x = fitted, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")
# Fit a linear regression model
cars.lm <- lm(dist ~ speed, data = cars)
# Create Q-Q plot of residuals
qqnorm(resid(cars.lm))
qqline(resid(cars.lm))
The residuals vs. fitted values plot gives no clear indication that the assumptions of our model are violated. The spread of the residuals is roughly the same across the range of fitted values, which is consistent with homoscedasticity.
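The visual impression of constant variance can be complemented with a numeric check. The sketch below computes a studentized (Koenker) Breusch-Pagan statistic by hand in base R; it is offered as one possible check rather than the method used in this analysis, and the names `aux`, `bp_stat`, and `bp_p` are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)

# Auxiliary regression: squared residuals on the predictor.
aux <- lm(resid(fit)^2 ~ speed, data = cars)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary fit,
# compared with a chi-squared distribution with 1 df (one predictor).
bp_stat <- nrow(cars) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(statistic = bp_stat, p_value = bp_p)
```

A small p-value would suggest that the residual variance changes with speed; a large one is consistent with, but does not prove, constant variance.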
The scatterplot shows that most points cluster around the regression line, but a few lie rather far from it, which suggests that our model is not perfect and may need improvement.
The Q-Q plot shows the residuals falling along an almost straight line, but it also suggests the possible presence of outliers. We may be able to address this and improve our model by log-transforming either the x or the y values, or by adding an x² term.
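To make the reading of the Q-Q plot more concrete, the residuals can be tested for normality and the most extreme observations listed. This is a minimal sketch, assuming the Shapiro-Wilk test and a cutoff of 2 on the standardized residuals are reasonable choices here; the cutoff is a common rule of thumb, not something fixed by this analysis, and `sw`, `std_res`, and `flagged` are illustrative names.

```r
fit <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test of normality on the residuals (H0: residuals are normal).
sw <- shapiro.test(resid(fit))
sw

# Standardized residuals; values beyond about +/- 2 deserve a closer look.
std_res <- rstandard(fit)
keep    <- abs(std_res) > 2
flagged <- cbind(cars[keep, ], std_res = std_res[keep])
flagged
```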
Overall, the residual plots are consistent with the assumptions of linearity and homoscedasticity. However, the Q-Q plot indicates that the assumption of nearly normal residuals is questionable. A log transformation of either the x or the y values may therefore be needed to reduce the influence of the outlying points and improve our model before we move on to statistical inference.
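The remedies mentioned above can be tried side by side. The sketch below fits the original line, a version with a squared speed term, and a version with log-transformed distance, then compares the normality of their residuals. The model names (`fit_lin`, `fit_quad`, `fit_logy`) are illustrative, and which remedy is preferable is left open.

```r
fit_lin  <- lm(dist ~ speed, data = cars)
fit_quad <- lm(dist ~ speed + I(speed^2), data = cars)  # adds an x^2 term
fit_logy <- lm(log(dist) ~ speed, data = cars)          # log-transformed response

# Shapiro-Wilk p-values for the residuals of each candidate model.
sapply(
  list(linear = fit_lin, quadratic = fit_quad, log_dist = fit_logy),
  function(m) shapiro.test(resid(m))$p.value
)

# F-test for the nested pair: does the squared term improve on the line?
anova(fit_lin, fit_quad)
```

The log-distance model has a different response scale, so its R-squared, residual standard error, and AIC are not directly comparable with those of the other two; comparing residual diagnostics is the safer route.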