Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis from chapter 3 of your textbook (visualization, quality evaluation of the model, and residual analysis).

Data Loading

# Load the "cars" dataset
data(cars)

Model Building

# Build a linear model for stopping distance as a function of speed
model <- lm(dist ~ speed, data = cars)

Basic Visualisation

plot(dist ~ speed, data = cars, main = "Stopping Distance vs. Speed", xlab = "Speed", ylab = "Stopping Distance")
abline(model, col = "blue")

Model Evaluation

summary(model)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

As per the book, we typically look for a standard error that is at least five to ten times smaller than the corresponding coefficient. For the speed coefficient, the ratio is:

3.9324/0.4155
## [1] 9.46426

The ratio of about 9.5 is within that range. The t-value of 9.464 corresponds to a p-value of 1.49e-12, so the probability of seeing a slope this extreme if speed had no effect on stopping distance is extremely small. The R-squared value of 0.6511 means that about 65% of the variance in stopping distance is explained by speed, which is a fairly good fit considering the sheer variability in cars.
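The same ratio can be read off for every coefficient at once rather than typed in by hand. The following is a minimal sketch, assuming the model is refit under the hypothetical name cars_fit (a name not used elsewhere in this report) so that the block runs on its own:

```r
# Refit the cars model under a separate, illustrative name
data(cars)
cars_fit <- lm(dist ~ speed, data = cars)

# Coefficient table with columns: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- summary(cars_fit)$coefficients

# Ratio of each estimate to its standard error (absolute value)
se_ratio <- abs(coef_table[, "Estimate"] / coef_table[, "Std. Error"])
print(se_ratio)

# Proportion of the variance in dist explained by speed
r_squared <- summary(cars_fit)$r.squared
print(r_squared)
```

The ratio for speed matches the value computed by hand above. The intercept's ratio is simply the absolute value of its t value (2.601 in the summary), which falls short of the five-to-ten guideline.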

Residual Analysis

par(mfrow = c(2, 2))
plot(model, which = 1:4)

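Looking at the Q-Q plot, the upper end diverges from the reference line, indicating that more vehicles have longer-than-predicted stopping distances than normally distributed residuals would give. This matches the residual summary, where the largest positive residual (43.2) is well beyond the largest negative one (-29.1). This makes sense, as tires are generally the limiting factor for stopping distance.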
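To look at the Q-Q plot on its own rather than in the 2 x 2 grid, a minimal sketch follows. It assumes the model is refit under the hypothetical name cars_fit, and it plots the raw residuals, whereas plot() on an lm object uses standardized residuals, so the axis scale will differ:

```r
# Refit the cars model under a separate, illustrative name
data(cars)
cars_fit <- lm(dist ~ speed, data = cars)
res <- residuals(cars_fit)

# Q-Q plot of the raw residuals with a reference line
qqnorm(res, main = "Normal Q-Q Plot of Residuals")
qqline(res, col = "blue")

# The three largest positive residuals, named by row number in cars
largest <- head(sort(res, decreasing = TRUE), 3)
print(largest)
```

The printed residuals identify which observations sit furthest above the fitted line and are therefore the ones to inspect first.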

Overall Analysis

In practice, we check four key assumptions to judge whether a regression is reasonable.

Independence - This is taken as a given property of the data set.

Linearity - There is a linear relationship, which is demonstrated loosely in the scatter plot above.

Constant variance - This is shown by the roughly uniform spread of residuals in the residuals vs. fitted plot.

Normality of residuals - The residuals are approximately normal across a reasonable range, which is demonstrated by the Q-Q plot above.
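The constant-variance assumption can be given a rough numeric check alongside the visual one. The sketch below is only an illustration, not the book's method: it assumes a split at the median speed (an arbitrary cut) and uses the hypothetical names cars_fit, speed_group, and spread_by_group:

```r
# Refit the cars model under a separate, illustrative name
data(cars)
cars_fit <- lm(dist ~ speed, data = cars)
res <- residuals(cars_fit)

# Split observations at the median speed (an arbitrary, illustrative cut)
speed_group <- ifelse(cars$speed <= median(cars$speed), "slower", "faster")

# Standard deviation of the residuals within each half
spread_by_group <- tapply(res, speed_group, sd)
print(spread_by_group)
```

Similar values in the two halves would support the constant-variance reading. A much larger value in one half would suggest that the spread changes with speed.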

Cook’s distance flags a few potentially influential observations that merit additional attention, but overall there is a linear relationship between speed and stopping distance.
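The observations behind the Cook's distance plot can also be listed directly. This is a minimal sketch, assuming a 4 / n cutoff, which is a common rule of thumb rather than something taken from the book. The names cars_fit, cutoff, and flagged are hypothetical:

```r
# Refit the cars model under a separate, illustrative name
data(cars)
cars_fit <- lm(dist ~ speed, data = cars)

# Cook's distance for every observation
cooks_d <- cooks.distance(cars_fit)

# Rule-of-thumb cutoff (an assumption, not from the book): 4 / n
cutoff <- 4 / nrow(cars)
flagged <- cooks_d[cooks_d > cutoff]

# Show the flagged rows alongside their Cook's distance
print(cbind(cars[names(flagged), ], cooks_d = flagged))
```

Refitting the model without the flagged rows and comparing the slope is one way to judge whether these points materially change the conclusion.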

Coffee Plants

Is there a linear relationship between the elevation at which coffee is grown and its cup score?

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
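# Read the graded coffee data
coffee_data <- read.csv("coffee_graded_data.csv", header = TRUE)
# Keep coffees scoring at least 70 points with a mean altitude between 800 and 2500 m
coffee_data_filtered <- coffee_data %>%
  filter(Total.Cup.Points >= 70, altitude_mean_meters >= 800, altitude_mean_meters <= 2500)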
# Build a linear model for Total Cup Points as a function of altitude_mean_meters
model <- lm(Total.Cup.Points ~ altitude_mean_meters, data = coffee_data_filtered)

# Visualize the relationship between Total Cup Points and altitude_mean_meters
plot(Total.Cup.Points ~ altitude_mean_meters, data = coffee_data_filtered, main = "Total Cup Points vs. Altitude", xlab = "Altitude (m)", ylab = "Total Cup Points")
abline(model, col = "blue")

# Evaluate the quality of the model
summary(model)
## 
## Call:
## lm(formula = Total.Cup.Points ~ altitude_mean_meters, data = coffee_data_filtered)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.3141  -1.0156   0.2344   1.3370   6.6796 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          7.893e+01  3.467e-01   227.7   <2e-16 ***
## altitude_mean_meters 2.437e-03  2.436e-04    10.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.246 on 975 degrees of freedom
## Multiple R-squared:  0.09307,    Adjusted R-squared:  0.09214 
## F-statistic: 100.1 on 1 and 975 DF,  p-value: < 2.2e-16
# Perform residual analysis
par(mfrow = c(2, 2))
plot(model, which = 1:4)
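
The coffee summary can be read the same way as the cars summary. The sketch below is a minimal illustration that copies the slope, its standard error, and R-squared from the printed output above instead of refitting, so it runs without coffee_graded_data.csv. The variable names are hypothetical:

```r
# Values copied from the printed summary of the coffee model
slope     <- 2.437e-03   # estimate for altitude_mean_meters
slope_se  <- 2.436e-04   # its standard error
r_squared <- 0.09307     # multiple R-squared

# Ratio of the estimate to its standard error
slope_ratio <- slope / slope_se
print(slope_ratio)

# Change in Total Cup Points implied by 1000 m of extra altitude
points_per_1000m <- slope * 1000
print(points_per_1000m)

# Share of the variance in Total Cup Points explained by altitude, in percent
print(100 * r_squared)
```

The ratio of about 10 meets the same five-to-ten guideline used for the cars model, so the positive slope is well estimated. However, an R-squared of roughly 0.09 means altitude alone accounts for only about 9% of the variance in cup score within the filtered range, so the linear relationship is real but weak.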