Regression Lab: Strava data

Initial Cleaning

The code chunk below cleans the data before we analyze it.

# Packages used throughout this lab
library(tidyverse)
library(lubridate)
library(moderndive)

strava <- 
  strava_full |> 
  # Keeping the needed columns and renaming some
  dplyr::select(
    date = activity_date, 
    time = moving_time,
    distance,
    elevation_gain
    ) |> 
  # Parsing the date and converting to US customary units
  mutate(
    date = mdy_hms(date),
    # Time from seconds to minutes
    time = time / 60,
    # Distance from kilometers to miles
    distance = distance * 0.621,
    # Elevation from meters to hundreds of feet
    elevation_gain = elevation_gain * 3.28/100
  ) |> 
  # Keeping the activities between 5 and 30 miles
  filter(
    between(distance, 5, 30),
    # Removing two outlier dates where the owner forgot to stop the Strava recording
    date(date) != '2024-04-19',
    date(date) != '2024-06-20'
  )


tibble(strava)

## # A tibble: 122 × 4
##    date                 time distance elevation_gain
##    <dttm>              <dbl>    <dbl>          <dbl>
##  1 2023-06-28 22:18:00  60.0     9.28          0.935
##  2 2023-06-30 17:53:13  54.7     8.33          1.50 
##  3 2023-07-04 16:12:12  67.0    11.2           1.37 
##  4 2023-07-14 18:29:18  77.9    15.5           5.16 
##  5 2023-07-15 17:35:51  45.4     8.91          0.920
##  6 2023-07-17 16:51:24  59.4    11.5           1.64 
##  7 2023-07-21 18:27:02 140.     25.5           4.94 
##  8 2023-07-23 21:34:35  65.8    12.8           1.67 
##  9 2023-07-24 22:07:51  82.1    10.7           1.45 
## 10 2023-07-25 14:45:21 105.     18.8           2.08 
## # ℹ 112 more rows
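As a quick sanity check on the conversion factors in the cleaning step, the sketch below applies them to a single made-up activity. The raw values (3600 seconds, 16 km, 150 m) are hypothetical and are not taken from strava_full; only the factors 60, 0.621, and 3.28/100 come from the code above.

```r
# Hypothetical raw activity, in the raw units implied by the cleaning chunk
# (illustrative values only, not real Strava data)
raw <- data.frame(
  time = 3600,          # moving time in seconds
  distance = 16,        # kilometers
  elevation_gain = 150  # meters
)

# Same conversion factors as the cleaning chunk
converted <- transform(
  raw,
  time = time / 60,                             # seconds -> minutes
  distance = distance * 0.621,                  # kilometers -> miles
  elevation_gain = elevation_gain * 3.28 / 100  # meters -> hundreds of feet
)

converted
```

A ride like this one, at just under 10 miles, would pass the between(distance, 5, 30) filter, which keeps both endpoints.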

Data description

The data set strava contains bike activities and has the following four variables:

date: The day and time the trip started
time: The time spent moving during the activity (in minutes)
distance: The total distance of the bike trip (in miles)
elevation_gain: The total elevation climbed (in hundreds of feet) while biking

Question 1: EDA

Part 1a: Univariate EDA

Create the individual graphs for time, distance, and elevation_gain

strava |>
  # Placing the three columns into one column to create a faceted graph
  pivot_longer(
    cols = -date,
    names_to = 'feature',
    values_to = 'value'
  ) |> 
  # Starting the graph
  ggplot(
    mapping = aes(
      x = value
    )
  ) + 
  geom_density(
    fill = 'orangered'
  ) + 
  facet_wrap(
    facets = vars(feature),
    scales = 'free',
    ncol = 1
  )

Describe the shape of each of the three variables.

time: Unimodal and right skewed

distance: Unimodal and right skewed

elevation_gain: Slightly bimodal and right skewed

Part 1b: Bivariate EDA

Create a scatterplot matrix for the three variables

strava |> 
  dplyr::select(-date) |> 
  GGally::ggpairs()

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Describe any important characteristics of the relationships between the variables

time vs distance:

A fairly strong, positive linear association

time vs elevation_gain:

A moderate linear association

distance vs elevation_gain:

Also a moderate linear association

Part 1c: Best predictor of time

If we wanted to predict the amount of time a trip would take, which of the two other variables would be better to use? Justify your answer

Distance has a higher correlation and stronger association from the scatterplot with time than elevation_gain does, so it would be the better predictor of time.
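The comparison of correlations can be sketched directly with cor(). The block below uses only the first 10 rows of strava as printed earlier (rounded as displayed), so the numbers it produces are illustrative and will not match the correlations from the full 122-row data set shown in the scatterplot matrix.

```r
# First 10 rows of strava, copied from the printed tibble above
# (values are rounded as displayed, so results are approximate)
first10 <- data.frame(
  time = c(60.0, 54.7, 67.0, 77.9, 45.4, 59.4, 140, 65.8, 82.1, 105),
  distance = c(9.28, 8.33, 11.2, 15.5, 8.91, 11.5, 25.5, 12.8, 10.7, 18.8),
  elevation_gain = c(0.935, 1.50, 1.37, 5.16, 0.920, 1.64, 4.94, 1.67, 1.45, 2.08)
)

# Correlation of time with each candidate predictor
cor_distance  <- cor(first10$time, first10$distance)
cor_elevation <- cor(first10$time, first10$elevation_gain)

round(c(distance = cor_distance, elevation_gain = cor_elevation), 2)
```

With the full data loaded, the same comparison would be cor(strava$time, strava$distance) against cor(strava$time, strava$elevation_gain).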

Question 2: Simple linear regression - Time by Distance

Part 2a: Build the model

Create the linear regression model predicting time using distance and name it bike_slr. Display the results using get_regression_table()

bike_slr <- lm(time ~ distance, data = strava)

get_regression_table(bike_slr)

## # A tibble: 2 × 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept     8.63     2.73       3.16   0.002     3.23    14.0 
## 2 distance      5.00     0.204     24.5    0         4.60     5.41

Part 2b: Predicting a trip’s time based on distance travelled

How long is a trip predicted to take if the length of the trip is 20 miles?

\[\widehat{\text{time}} = 8.6 + 5.0(\text{distance}) = 8.6 + 5.0 \times 20 = 108.6\]

The trip is expected to take about 108.6 minutes

If the trip takes 100 minutes, what is the residual?

\[\text{residual} = e = y - \hat{y} = 100 - 108.6 = -8.6\]
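The same arithmetic can be checked in R. This is a minimal sketch that hard-codes the rounded coefficients from the regression table (8.6 and 5.0); the variable names are chosen here for illustration.

```r
# Rounded coefficients from get_regression_table(bike_slr)
intercept <- 8.6
slope_distance <- 5.0

# Predicted time (minutes) for a 20-mile trip
time_hat <- intercept + slope_distance * 20

# Residual for a trip that actually took 100 minutes: observed - predicted
residual <- 100 - time_hat

round(c(time_hat = time_hat, residual = residual), 1)
```

With the fitted model available, predict(bike_slr, newdata = data.frame(distance = 20)) would use the unrounded coefficients, so its answer may differ slightly from the hand calculation.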

Part 2c: Interpreting the slope of distance

Interpret the effect of distance on the time a bike trip takes, in context

For every additional mile on the trip, the activity is expected to take about 5 minutes longer, on average

Part 2d: Residual Plot

Create the residual plot.

# Calculating y_hat and the residuals
get_regression_points(model = bike_slr) |> 
  # Creating the residual plot: e vs x
  ggplot(
    mapping = aes(
      x = distance,
      y = residual
    )
  ) + 
  geom_point() + 
  geom_hline(
    yintercept = 0, 
    color = 'red'
  )

Are the conditions for linear regression met? Briefly explain your answer

Yes, there aren’t any patterns or trends in the residual plot, indicating that a straight line adequately describes the association between time and distance

Part 2e: Fit statistics

Calculate the fit statistics, $R^2$ and rmse (or sigma)

get_regression_summaries(bike_slr)

## # A tibble: 1 × 9
##   r_squared adj_r_squared   mse  rmse sigma statistic p_value    df  nobs
##       <dbl>         <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl> <dbl> <dbl>
## 1     0.833         0.832  181.  13.5  13.6      600.       0     1   122

Does the model make accurate predictions?

An $R^2$ of 0.83 means distance explains about 83% of the variation in trip time, which is fairly high. However, the rmse of 13.5 tells us that the typical prediction is off by about 13.5 minutes, which is pretty high! So the model could be better.
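The rmse and sigma columns are closely related, which the sketch below illustrates using only the rounded values printed in the summary table. It assumes that mse is the average squared residual (dividing by n) and that sigma divides by the residual degrees of freedom instead; because the printed mse is rounded, the results are approximate.

```r
# Values read from get_regression_summaries(bike_slr)
mse <- 181   # mean squared residual (rounded in the printed output)
n   <- 122   # number of activities (nobs)
k   <- 1     # number of predictors (df)

# rmse divides the squared error by n; sigma divides by n - k - 1
rmse  <- sqrt(mse)
sigma <- sqrt(mse * n / (n - k - 1))

round(c(rmse = rmse, sigma = sigma), 1)
```

With 122 activities the two values are nearly identical, so either can be read as the typical size of a prediction error in minutes.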

Question 3: Multiple Linear Regression

Part 3a: Create the MLR model

Build a model to predict time using distance AND elevation_gain. Save it as bike_mlr. Display the results using get_regression_table()

bike_mlr <- lm(time ~ distance + elevation_gain, data = strava)

get_regression_table(bike_mlr)

## # A tibble: 3 × 7
##   term           estimate std_error statistic p_value lower_ci upper_ci
##   <chr>             <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept          7.20     2.96       2.43   0.016    1.34     13.1 
## 2 distance           4.86     0.233     20.9    0        4.40      5.32
## 3 elevation_gain     1.49     1.19       1.25   0.215   -0.875     3.86

Part 3b: Predicting a trip’s time based on distance travelled and elevation_gain

How long is a trip predicted to take if the length of the trip is 18 miles and gained 200 feet in elevation? Round the model estimates to 1 decimal place

\[\widehat{\text{time}} = 7.2 + 4.9(\text{distance}) + 1.5(\text{elevation}) = 7.2 + 4.9 \times 18 + 1.5 \times 2 = 98.4\]

The trip is expected to take about 98.4 minutes

If the trip takes 100 minutes, what is the residual?

\[\text{residual} = e = y - \hat{y} = 100 - 98.4 = 1.6\]
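A short check of this calculation in R, using the coefficients rounded to 1 decimal place as the question asks. Note that elevation_gain is measured in hundreds of feet, so 200 feet enters the model as 2; the variable names are illustrative.

```r
# Coefficients from get_regression_table(bike_mlr), rounded to 1 decimal place
intercept <- 7.2
slope_distance <- 4.9
slope_elevation <- 1.5

# Trip from the question: 18 miles with 200 feet of climbing
distance <- 18
elevation_gain <- 200 / 100  # convert feet to hundreds of feet

# Predicted time (minutes) and the residual for an observed time of 100 minutes
time_hat <- intercept + slope_distance * distance + slope_elevation * elevation_gain
residual <- 100 - time_hat

round(c(time_hat = time_hat, residual = residual), 1)
```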

Part 3c: Interpreting the slope of elevation_gain

Interpret the effect of elevation_gain on the time a bike trip takes, in context

For every additional 100 feet of elevation gained on the activity, it is expected to take about 1.5 minutes longer if the distance is kept the same
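The phrase "if the distance is kept the same" can be made concrete by comparing two predictions that differ only in elevation_gain. The helper predict_time() below is a hypothetical function written for this illustration from the rounded coefficients; it is not part of the lab code.

```r
# Hypothetical helper built from the rounded bike_mlr coefficients
predict_time <- function(distance, elevation_gain) {
  7.2 + 4.9 * distance + 1.5 * elevation_gain
}

# Add 100 feet of climbing (one unit of elevation_gain) at a fixed distance
diff_18_miles <- predict_time(18, 3) - predict_time(18, 2)
diff_10_miles <- predict_time(10, 3) - predict_time(10, 2)

round(c(diff_18_miles = diff_18_miles, diff_10_miles = diff_10_miles), 1)
```

The difference is the elevation_gain slope at either distance, which is what holding distance constant means in an additive model like this one.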

Part 3d: Residual Plot

Create the residual plot for the multiple linear regression model.

# Calculating y_hat and the residuals
get_regression_points(model = bike_mlr) |> 
  # Creating the residual plot: e vs y-hat
  ggplot(
    mapping = aes(
      x = time_hat,
      y = residual
    )
  ) + 
  geom_point() + 
  geom_hline(
    yintercept = 0, 
    color = 'red'
  )

Are the conditions for linear regression met? Briefly explain your answer

Yes, there aren’t any patterns or trends in the residual plot, indicating that a linear model using distance and elevation_gain is appropriate for predicting time

Part 3e: Fit statistics

Calculate the fit statistics, $R^2$ and rmse (or sigma)

get_regression_summaries(bike_mlr)

## # A tibble: 1 × 9
##   r_squared adj_r_squared   mse  rmse sigma statistic p_value    df  nobs
##       <dbl>         <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl> <dbl> <dbl>
## 1     0.835         0.833  179.  13.4  13.5      302.       0     2   122

Does the model make accurate predictions?

An $R^2$ of 0.835 means distance and elevation_gain together explain about 83.5% of the variation in trip time, which is fairly high. However, the rmse of 13.4 tells us that the typical prediction is off by about 13.4 minutes, which is pretty high! So the model could be better.

Part 3f: Model Improvement with elevation_gain

Does it appear that elevation_gain is worth keeping in the model compared to using distance alone? Justify your answer!

bind_rows(
  .id = 'model',
  'slr' = get_regression_summaries(bike_slr),
  'mlr' = get_regression_summaries(bike_mlr)
)

## # A tibble: 2 × 10
##   model r_squared adj_r_squared   mse  rmse sigma statistic p_value    df  nobs
##   <chr>     <dbl>         <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl> <dbl> <dbl>
## 1 slr       0.833         0.832  181.  13.5  13.6      600.       0     1   122
## 2 mlr       0.835         0.833  179.  13.4  13.5      302.       0     2   122

No, including elevation_gain barely increases $R^2$ (0.833 to 0.835) and barely decreases rmse (13.5 to 13.4), indicating it isn’t worth including if distance is already included in the model.
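The regression table for bike_mlr points the same way. As a rough check, the sketch below recomputes the test statistic and two-sided p-value for elevation_gain from the rounded estimate and standard error in that table, assuming the usual t-test with n minus the number of coefficients as its degrees of freedom. Because the inputs are rounded, the p-value will be close to, but not exactly, the 0.215 printed in the table.

```r
# Rounded values for elevation_gain from get_regression_table(bike_mlr)
estimate  <- 1.49
std_error <- 1.19

n      <- 122  # number of activities
n_coef <- 3    # intercept + distance + elevation_gain

# t statistic and two-sided p-value for H0: slope of elevation_gain = 0
t_stat   <- estimate / std_error
df_resid <- n - n_coef
p_value  <- 2 * pt(-abs(t_stat), df = df_resid)

round(c(t = t_stat, p = p_value), 3)
```

A p-value this large, together with a confidence interval that contains 0, is consistent with elevation_gain adding little once distance is in the model.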

Regression Lab: Strava data - key

Your name

2025-05-01
