# Load the "cars" dataset
data(cars)
# Build a linear model for stopping distance as a function of speed
model <- lm(dist ~ speed, data = cars)
# Plot the data and overlay the fitted regression line
plot(dist ~ speed, data = cars, main = "Stopping Distance vs. Speed",
     xlab = "Speed", ylab = "Stopping Distance")
abline(model, col = "blue")
# Evaluate the quality of the model
summary(model)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
As per the book, we typically look for a standard error that is at least five to ten times smaller than the corresponding coefficient.
3.9324/0.4155
## [1] 9.46426
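For the speed coefficient, the ratio is about 9.5, which is within that range. Based on the t-value of 9.464 and its p-value of 1.49e-12, the probability of seeing a slope this extreme if speed had no effect on stopping distance is extremely small. The R-squared value of 0.6511 means that about 65% of the variance in stopping distance is explained by speed, which is a fairly good fit considering the sheer variability in cars.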
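The same ratio can be read straight from the fitted model instead of being typed in by hand. The following is a minimal sketch; the names `fit`, `coef_table`, `se_ratio`, and `r_squared` are ours and are only for illustration. Note that the intercept's ratio (about 2.6, matching its t value of -2.601) is well below the five-to-ten guideline, so the rule of thumb is met by the slope only.

```r
# Refit the model from the text so this block runs on its own.
data(cars)
fit <- lm(dist ~ speed, data = cars)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|).
coef_table <- coef(summary(fit))

# Magnitude of each coefficient relative to its standard error.
# This equals the absolute value of the "t value" column.
se_ratio <- abs(coef_table[, "Estimate"]) / coef_table[, "Std. Error"]
print(round(se_ratio, 3))

# Share of the variance in dist that is explained by speed.
r_squared <- summary(fit)$r.squared
print(round(r_squared, 4))
```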
# Perform residual analysis (Residuals vs Fitted, Normal Q-Q, Scale-Location, Cook's distance)
par(mfrow = c(2, 2))
plot(model, which = 1:4)
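Looking at the Q-Q plot, the upper end diverges from the reference line, indicating that more vehicles have much longer stopping distances than predicted than normally distributed residuals would produce. This makes sense, as tires are generally the limiting factor for how quickly a car can stop.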
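To put numbers on what the Q-Q plot shows, we can pair the largest standardized residuals with the normal quantiles they would be expected to match. This is only a sketch: `fit`, `std_res`, `qq`, `upper_tail`, and `sw` are illustrative names, and the Shapiro-Wilk test is our addition as a formal complement to the visual check, not something the book prescribes.

```r
# Refit the model from the text so this block runs on its own.
data(cars)
fit <- lm(dist ~ speed, data = cars)

# Standardized residuals, the quantity on the vertical axis of the Q-Q panel.
std_res <- rstandard(fit)

# Theoretical normal quantiles (x) paired with the sample values (y), no drawing.
qq <- qqnorm(std_res, plot.it = FALSE)

# The three largest residuals next to the quantiles a normal sample would give.
top <- order(std_res, decreasing = TRUE)[1:3]
upper_tail <- data.frame(
  obs         = names(std_res)[top],
  theoretical = qq$x[top],
  sample      = qq$y[top]
)
print(upper_tail)

# Shapiro-Wilk test of normality on the raw residuals (assumed complement).
sw <- shapiro.test(residuals(fit))
print(sw)
```

Sample values that sit well above their theoretical quantiles correspond to the points that lift off the reference line at the upper end of the plot.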
Practically, we look for four key conditions to judge whether a regression is reasonable:
Independence: this is a given property of the data set.
Linearity: there is a linear relationship, which is demonstrated loosely in the scatter plot above.
Constant variance: this is shown by the roughly uniform spread of the residuals.
Normality of residuals: the residuals are approximately normal across a reasonable range, which is demonstrated by the Q-Q plot above.
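The constant-variance condition is judged by eye above, so a numeric check can help. The sketch below is one possible approach that we are assuming, not the book's method: a Breusch-Pagan-style test in Koenker's studentized form, built from base R only. It regresses the squared residuals on speed and compares n times the auxiliary R-squared to a chi-squared distribution with one degree of freedom. All variable names are illustrative.

```r
# Refit the model from the text so this block runs on its own.
data(cars)
fit <- lm(dist ~ speed, data = cars)

# Auxiliary regression: do the squared residuals trend with speed?
sq_res <- residuals(fit)^2
aux <- lm(sq_res ~ cars$speed)

# Test statistic: n * R^2 of the auxiliary regression, chi-squared with 1 df
# because there is a single predictor.
n <- nrow(cars)
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

print(c(statistic = lm_stat, p_value = p_value))
```

A small p-value would suggest that the residual spread changes with speed; a large one is consistent with constant variance.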
Cook’s distance flags a few potentially influential points (possible outliers) that merit additional attention, but overall there is a linear relationship between speed and stopping distance.
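To see which observations Cook's distance is pointing at, we can list them and refit without them. This is a minimal sketch: the 4/n cutoff is a common rule of thumb that we are assuming here, not a threshold taken from the book, and `cooks`, `threshold`, `flagged`, and `fit_trim` are illustrative names.

```r
# Refit the model from the text so this block runs on its own.
data(cars)
fit <- lm(dist ~ speed, data = cars)

# Cook's distance for every observation, named by row number.
cooks <- cooks.distance(fit)

# Assumed rule-of-thumb cutoff: 4 divided by the number of observations.
threshold <- 4 / nrow(cars)
flagged <- cooks[cooks > threshold]
print(sort(flagged, decreasing = TRUE))

# Refit without the flagged rows to see how much the coefficients move.
keep <- !(names(cooks) %in% names(flagged))
fit_trim <- lm(dist ~ speed, data = cars[keep, ])
print(rbind(all = coef(fit), trimmed = coef(fit_trim)))
```

If the slope barely changes after trimming, the flagged points do not drive the conclusion that stopping distance increases linearly with speed.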
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
coffee_data <- read.csv("coffee_graded_data.csv", header = TRUE)
coffee_data_filtered <- coffee_data %>%
  filter(Total.Cup.Points >= 70,
         altitude_mean_meters >= 800,
         altitude_mean_meters <= 2500)
# Build a linear model for Total Cup Points as a function of altitude_mean_meters
model <- lm(Total.Cup.Points ~ altitude_mean_meters, data = coffee_data_filtered)
# Visualize the relationship between Total Cup Points and altitude_mean_meters
plot(Total.Cup.Points ~ altitude_mean_meters, data = coffee_data_filtered, main = "Total Cup Points vs. Altitude", xlab = "Altitude (m)", ylab = "Total Cup Points")
abline(model, col = "blue")
# Evaluate the quality of the model
summary(model)
##
## Call:
## lm(formula = Total.Cup.Points ~ altitude_mean_meters, data = coffee_data_filtered)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12.3141  -1.0156   0.2344   1.3370   6.6796
##
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)          7.893e+01  3.467e-01   227.7   <2e-16 ***
## altitude_mean_meters 2.437e-03  2.436e-04    10.0   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.246 on 975 degrees of freedom
## Multiple R-squared: 0.09307, Adjusted R-squared: 0.09214
## F-statistic: 100.1 on 1 and 975 DF, p-value: < 2.2e-16
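The slope is reported per meter, which makes it hard to read. The sketch below rescales it using only the numbers printed in the summary, so it does not need the CSV file; the variable names are illustrative. The R-squared of 0.09307 means altitude explains only about 9% of the variance in Total Cup Points, even though the slope is clearly different from zero.

```r
# Values copied from the summary output above.
intercept <- 7.893e+01
slope     <- 2.437e-03   # cup points per meter of altitude
slope_se  <- 2.436e-04

# Expected change in cup points per 1000 m of altitude.
points_per_1000m <- slope * 1000
print(points_per_1000m)

# Ratio of the slope to its standard error (the reported t value).
t_value <- slope / slope_se
print(round(t_value, 1))

# Fitted cup points at the two ends of the filtered altitude range.
predicted <- intercept + slope * c(800, 2500)
print(predicted)
```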
# Perform residual analysis
par(mfrow = c(2, 2))
plot(model, which = 1:4)