The code chunk below cleans the data before we analyze it:
strava <-
strava_full |>
# Keeping the needed columns and renaming some
dplyr::select(
date = activity_date,
time = moving_time,
distance,
elevation_gain
) |>
# Converting units to US customary units
mutate(
date = mdy_hms(date),
# Time from seconds to minutes
time = time / 60,
# Distance from kilometers to miles
distance = distance * 0.621,
# Elevation from meters to 100 feet
elevation_gain = elevation_gain * 3.28/100
) |>
# Keeping the activities between 5 and 30 miles
filter(
between(distance, 5, 30),
# Removing outliers where the owner forgot to stop the Strava recording
date(date) != '2024-04-19',
date(date) != '2024-06-20'
)
tibble(strava)
## # A tibble: 122 × 4
## date time distance elevation_gain
## <dttm> <dbl> <dbl> <dbl>
## 1 2023-06-28 22:18:00 60.0 9.28 0.935
## 2 2023-06-30 17:53:13 54.7 8.33 1.50
## 3 2023-07-04 16:12:12 67.0 11.2 1.37
## 4 2023-07-14 18:29:18 77.9 15.5 5.16
## 5 2023-07-15 17:35:51 45.4 8.91 0.920
## 6 2023-07-17 16:51:24 59.4 11.5 1.64
## 7 2023-07-21 18:27:02 140. 25.5 4.94
## 8 2023-07-23 21:34:35 65.8 12.8 1.67
## 9 2023-07-24 22:07:51 82.1 10.7 1.45
## 10 2023-07-25 14:45:21 105. 18.8 2.08
## # ℹ 112 more rows
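The unit conversions in the chunk above can be checked on a single made-up ride. This is only a sketch: the helper name `convert_units` and the example ride (3600 seconds, 10 km, 100 m of climbing) are hypothetical, and the conversion factors are the same rounded ones used in the cleaning code.

```r
# Hypothetical helper that applies the same rounded factors as the cleaning chunk
convert_units <- function(seconds, kilometers, meters) {
  c(
    time           = seconds / 60,          # seconds -> minutes
    distance       = kilometers * 0.621,    # kilometers -> miles
    elevation_gain = meters * 3.28 / 100    # meters -> hundreds of feet
  )
}

# A made-up ride: 1 hour of moving time, 10 km, 100 m of climbing
example_ride <- convert_units(seconds = 3600, kilometers = 10, meters = 100)
example_ride
```

Note that an elevation_gain of 1 in the cleaned data therefore means 100 feet, not 1 foot.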
The data set strava contains bike activities and has the following four variables:
date: The day and time the trip started
time: The time spent moving during the activity (in minutes)
distance: The total distance of the bike trip (in miles)
elevation_gain: The total elevation climbed (in 100 feet) while biking
Create the individual graphs for time, distance, and elevation_gain
strava |>
# Placing the three columns into one column to create a faceted graph
pivot_longer(
cols = -date,
names_to = 'feature',
values_to = 'value'
) |>
# Starting the graph
ggplot(
mapping = aes(
x = value
)
) +
geom_density(
fill = 'orangered'
) +
facet_wrap(
facets = vars(feature),
scales = 'free',
ncol = 1
)
Describe the shape of each of the three variables.
time: Unimodal and right skewed
distance: Unimodal and right skewed
elevation_gain: Slightly bimodal and right skewed
Create a scatterplot matrix for the three variables
strava |>
dplyr::select(-date) |>
GGally::ggpairs()
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
Describe any important characteristics between the variables
time vs distance:
A fairly strong, positive linear association
time vs elevation_gain:
A moderate linear association
distance vs elevation_gain:
Also a moderate linear association
If we wanted to predict the amount of time a trip would take, which of the two other variables would be better to use? Justify your answer
Distance has a higher correlation with time, and a visibly stronger association with time in the scatterplot, than elevation_gain does, so it would be the better predictor of time.
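As a rough illustration of that comparison, the correlations can be computed directly with `cor()`. The sketch below uses only the first ten rows shown in the printed tibble (rounded as displayed), not the full 122 rides, so the values will not match the ones in the scatterplot matrix; the data frame name `strava_head` is made up for this example.

```r
# First ten rows as printed in the tibble output (NOT the full data set)
strava_head <- data.frame(
  time           = c(60.0, 54.7, 67.0, 77.9, 45.4, 59.4, 140, 65.8, 82.1, 105),
  distance       = c(9.28, 8.33, 11.2, 15.5, 8.91, 11.5, 25.5, 12.8, 10.7, 18.8),
  elevation_gain = c(0.935, 1.50, 1.37, 5.16, 0.920, 1.64, 4.94, 1.67, 1.45, 2.08)
)

# Correlation of time with each candidate predictor
cor_with_time <- c(
  distance       = cor(strava_head$time, strava_head$distance),
  elevation_gain = cor(strava_head$time, strava_head$elevation_gain)
)
round(cor_with_time, 2)
```

With the full data, the same comparison would be `cor(strava$time, strava$distance)` against `cor(strava$time, strava$elevation_gain)`.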
Create the linear regression model predicting time using
distance and name it bike_slr. Display the results using
get_regression_table()
bike_slr <- lm(time ~ distance, data = strava)
get_regression_table(bike_slr)
## # A tibble: 2 × 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 8.63 2.73 3.16 0.002 3.23 14.0
## 2 distance 5.00 0.204 24.5 0 4.60 5.41
How long is a trip predicted to take if the length of the trip is 20 miles?
\[\widehat{\text{time}} = 8.6 + 5.0(\text{distance}) = 8.6 + 5.0*20 = 108.6\]
The trip is expected to take about 108.6 minutes
If the trip takes 100 minutes, what is the residual?
\[\text{residual} = e = y - \hat{y} = 100 - 108.6 = -8.6\]
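The same prediction and residual can be reproduced in R. This is a minimal sketch that hard-codes the rounded estimates from the regression table; the helper name `predict_time_slr` is hypothetical. With the fitted model available, `predict(bike_slr, newdata = data.frame(distance = 20))` would use the unrounded estimates instead.

```r
# Rounded estimates copied from the get_regression_table(bike_slr) output
b0 <- 8.6   # intercept (minutes)
b1 <- 5.0   # slope (minutes per mile)

# Hypothetical helper: predicted time (minutes) for a distance (miles)
predict_time_slr <- function(distance) b0 + b1 * distance

time_hat_slr <- predict_time_slr(20)   # predicted time for a 20-mile trip
residual_slr <- 100 - time_hat_slr     # observed minus predicted
c(time_hat = time_hat_slr, residual = residual_slr)
```

A negative residual means the trip took less time than the model predicted.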
Interpret the effect of distance on the time a bike trip takes, in context
For every additional mile on the trip, the activity is expected to take about 5 minutes longer, on average
Create the residual plot.
# Calculating y_hat and the residuals
get_regression_points(model = bike_slr) |>
# Creating the residual plot: e vs x
ggplot(
mapping = aes(
x = distance,
y = residual
)
) +
geom_point() +
geom_hline(
yintercept = 0,
color = 'red'
)
Are the conditions for linear regression met? Briefly explain your answer
Yes, the residual plot shows no patterns or trends, which suggests that a linear model is appropriate for the association between time and distance
Calculate the fit statistics, \(R^2\) and rmse (or sigma)
get_regression_summaries(bike_slr)
## # A tibble: 1 × 9
## r_squared adj_r_squared mse rmse sigma statistic p_value df nobs
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.833 0.832 181. 13.5 13.6 600. 0 1 122
Does the model make accurate predictions?
An \(R^2\) of 0.83 means distance explains about 83% of the variation in time, which is decently high but not excellent. The rmse of 13.5 tells us that the typical prediction is off by about 13.5 minutes, which is pretty high! So the model could be better.
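To show where these fit statistics come from, the sketch below computes \(R^2\), rmse, and sigma by hand from the residuals of a simple linear regression. It is fit only to the first ten rows printed in the tibble output (the data frame name `strava_head` is made up), so the numbers will differ from the full-data results above.

```r
# First ten rows as printed in the tibble output (NOT the full data set)
strava_head <- data.frame(
  time     = c(60.0, 54.7, 67.0, 77.9, 45.4, 59.4, 140, 65.8, 82.1, 105),
  distance = c(9.28, 8.33, 11.2, 15.5, 8.91, 11.5, 25.5, 12.8, 10.7, 18.8)
)

fit <- lm(time ~ distance, data = strava_head)
e   <- resid(fit)            # residuals: y - y_hat
n   <- nrow(strava_head)

sse <- sum(e^2)                                            # unexplained variation
sst <- sum((strava_head$time - mean(strava_head$time))^2)  # total variation

r_squared <- 1 - sse / sst       # proportion of variation explained
rmse      <- sqrt(mean(e^2))     # divides by n
sigma     <- sqrt(sse / (n - 2)) # divides by residual degrees of freedom

round(c(r_squared = r_squared, rmse = rmse, sigma = sigma), 3)
```

The only difference between rmse and sigma is the divisor, which is why sigma is always slightly larger (13.6 versus 13.5 in the summary above).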
Build a model to predict time using distance AND
elevation_gain. Save it as bike_mlr. Display the results using
get_regression_table()
bike_mlr <- lm(time ~ distance + elevation_gain, data = strava)
get_regression_table(bike_mlr)
## # A tibble: 3 × 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 7.20 2.96 2.43 0.016 1.34 13.1
## 2 distance 4.86 0.233 20.9 0 4.40 5.32
## 3 elevation_gain 1.49 1.19 1.25 0.215 -0.875 3.86
How long is a trip predicted to take if the length of the trip is 18 miles and gained 200 feet in elevation? Round the model estimates to 1 decimal place
\[\widehat{\text{time}} = 7.2 + 4.9(\text{distance}) + 1.5(\text{elevation}) = 7.2 + 4.9*18 + 1.5*2 = 98.4\]
The trip is expected to take about 98.4 minutes
If the trip takes 100 minutes, what is the residual?
\[\text{residual} = e = y - \hat{y} = 100 - 98.4 = 1.6\]
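The multiple regression prediction can be checked the same way. This sketch hard-codes the estimates rounded to 1 decimal place, and the helper name `predict_time_mlr` is hypothetical. Because elevation_gain is measured in hundreds of feet, 200 feet enters the equation as 2.

```r
# Estimates from get_regression_table(bike_mlr), rounded to 1 decimal place
b0     <- 7.2   # intercept (minutes)
b_dist <- 4.9   # minutes per mile
b_elev <- 1.5   # minutes per 100 feet of elevation gained

# Hypothetical helper: distance in miles, elevation in feet
predict_time_mlr <- function(distance, elevation_feet) {
  b0 + b_dist * distance + b_elev * (elevation_feet / 100)
}

time_hat_mlr <- predict_time_mlr(distance = 18, elevation_feet = 200)
residual_mlr <- 100 - time_hat_mlr   # observed minus predicted
c(time_hat = time_hat_mlr, residual = residual_mlr)
```

With the fitted model available, `predict(bike_mlr, newdata = data.frame(distance = 18, elevation_gain = 2))` would give the prediction from the unrounded estimates.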
Interpret the effect of elevation_gain on the time a bike trip takes, in context
For every additional 100 feet of elevation gained on the activity, it is expected to take about 1.5 minutes longer if the distance is kept the same
Create the residual plot for the multiple linear regression model.
# Calculating y_hat and the residuals
get_regression_points(model = bike_mlr) |>
# Creating the residual plot: e vs y-hat
ggplot(
mapping = aes(
x = time_hat,
y = residual
)
) +
geom_point() +
geom_hline(
yintercept = 0,
color = 'red'
)
Are the conditions for linear regression met? Briefly explain your answer
Yes, the residual plot shows no patterns or trends, which suggests that a linear model is appropriate for predicting time from distance and elevation_gain
Calculate the fit statistics, \(R^2\) and rmse (or sigma)
get_regression_summaries(bike_mlr)
## # A tibble: 1 × 9
## r_squared adj_r_squared mse rmse sigma statistic p_value df nobs
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.835 0.833 179. 13.4 13.5 302. 0 2 122
Does the model make accurate predictions?
An \(R^2\) of 0.835 means the model explains about 84% of the variation in time, which is decently high but not excellent. The rmse of 13.4 tells us that the typical prediction is off by about 13.4 minutes, which is pretty high! So the model could be better.
Does it appear that elevation_gain is worth keeping in the model compared to using distance alone? Justify your answer!
bind_rows(
.id = 'model',
'slr' = get_regression_summaries(bike_slr),
'mlr' = get_regression_summaries(bike_mlr)
)
## # A tibble: 2 × 10
## model r_squared adj_r_squared mse rmse sigma statistic p_value df nobs
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 slr 0.833 0.832 181. 13.5 13.6 600. 0 1 122
## 2 mlr 0.835 0.833 179. 13.4 13.5 302. 0 2 122
No. Including elevation_gain barely increases \(R^2\) (0.833 to 0.835) and barely decreases rmse (13.5 to 13.4), indicating it isn’t worth including if distance is already in the model.
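The regression table for bike_mlr points the same way: the confidence interval for the elevation_gain slope includes 0. The sketch below roughly reconstructs that interval from the rounded estimate and standard error in the table, assuming the usual 95% t-based interval with n - 3 = 119 degrees of freedom. Because the inputs are rounded, the endpoints will differ slightly from the reported -0.875 and 3.86.

```r
# Rounded values copied from the get_regression_table(bike_mlr) output
estimate  <- 1.49   # slope for elevation_gain (minutes per 100 feet)
std_error <- 1.19
n_obs     <- 122    # number of activities
n_coef    <- 3      # intercept + distance + elevation_gain

# Assumed: a 95% t-based interval on n - 3 degrees of freedom
t_crit   <- qt(0.975, df = n_obs - n_coef)
lower_ci <- estimate - t_crit * std_error
upper_ci <- estimate + t_crit * std_error

round(c(lower = lower_ci, upper = upper_ci), 2)

# TRUE when a slope of 0 (no effect) is plausible
lower_ci < 0 && upper_ci > 0
```

An interval that contains 0 means the data are consistent with elevation_gain adding nothing once distance is accounted for.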