Project 1 - Ultrarunning

Joseph Flores - Project 1: Ultrarunning

Introduction

I will be using a data set from an ultramarathon completed in 2023. The data covers the age of female and male runners as well as other attributes such as emotional intelligence, the type of surface the runners ran on, and the average distance run per week. I will investigate what effect the type of trail a runner runs on has on their emotional intelligence.

Set up the libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggfortify)

Set up the working directories

setwd("C:/Users/jfgam/Downloads/Data 101")
running <- read_csv("ultrarunning.csv")
Rows: 288 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (9): age, sex, pb_surface, pb_elev, pb100k_dec, avg_km, teique_sf, steu...
time (1): pb100k_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Next, clean up the data set

Exclude the other variables that you won’t be using

# Drop rows with missing values in the listed columns, then remove the columns not used in the plot
running_clean <- running |>
  filter(!is.na(pb_elev) & !is.na(age) & !is.na(pb100k_time) & !is.na(pb100k_dec) &
           !is.na(avg_km) & !is.na(steu_b) & !is.na(teique_sf)) |>
  select(-pb_elev, -sex, -pb100k_time, -pb100k_dec, -avg_km, -steu_b, -stem_b)
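
The cleaning step filters on several columns that are dropped right afterwards, which can discard rows that are complete for the variables actually plotted. A minimal sketch of an alternative is to keep only the needed columns first and then drop incomplete rows with tidyr's drop_na(). The toy tibble below is made up for illustration and only mimics the column names of the real data.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for `running` (hypothetical values, same column names)
toy <- tibble(
  age        = c(34, 41, NA, 52, 29),
  pb_surface = c(1, 3, 2, NA, 4),
  teique_sf  = c(5.1, 4.8, 5.5, 4.9, NA),
  pb_elev    = c(1200, NA, 300, 2500, 800)
)

# Keep only the columns of interest, then drop rows missing any of them
toy_clean <- toy |>
  select(age, pb_surface, teique_sf) |>
  drop_na()

nrow(toy_clean)
```

With the real data, the same two steps would be applied to running instead of toy. A missing pb_elev no longer removes a row here, because that column is not among the ones kept.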

Next up, form a graph

ggplot(running_clean, aes(x = pb_surface, y = teique_sf)) +
  geom_point(alpha = 0.2, color = "orange") +
  labs(title = "Emotional Intelligence correlated to trail type",
       x = "Type of running trail",
       y = "Emotional Intelligence",
       caption = "1: Dirt Trail, 2: Track, 3: Road, 4: Mix of all")
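
Because pb_surface is a numeric code for a category, a scatterplot stacks the points into four vertical strips. One possible alternative, sketched here on made-up data, is to recode the surface as a labelled factor and draw a boxplot per trail type. The labels come from the caption above. The toy data frame and the new surface column are assumptions for illustration only.

```r
library(ggplot2)

# Toy stand-in for `running_clean` (hypothetical values)
set.seed(1)
toy <- data.frame(
  pb_surface = rep(1:4, each = 10),
  teique_sf  = rnorm(40, mean = 5, sd = 0.6)
)

# Recode the numeric surface codes as a labelled factor
toy$surface <- factor(toy$pb_surface,
                      levels = 1:4,
                      labels = c("Dirt trail", "Track", "Road", "Mix of all"))

# One box per trail type, with the individual runners jittered on top
p <- ggplot(toy, aes(x = surface, y = teique_sf)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.4, color = "orange") +
  labs(title = "Emotional intelligence by trail type",
       x = "Type of running trail",
       y = "Emotional intelligence (teique_sf)")
p
```

With the axis labelled by trail name, the numeric key in the caption is no longer needed.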

Now let’s take a look at the regression model

full_model <- lm(sex ~ age + pb_surface + teique_sf + pb_elev + pb100k_time + pb100k_dec + avg_km + steu_b + stem_b, data = running)
summary(full_model)

Call:
lm(formula = sex ~ age + pb_surface + teique_sf + pb_elev + pb100k_time + 
    pb100k_dec + avg_km + steu_b + stem_b, data = running)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8952 -0.5122  0.2127  0.3659  0.6327 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.857e+00  5.491e-01   3.381 0.000987 ***
age         -1.344e-03  4.193e-03  -0.321 0.749160    
pb_surface  -3.707e-02  4.139e-02  -0.896 0.372304    
teique_sf    1.243e-01  6.786e-02   1.832 0.069567 .  
pb_elev      4.802e-05  2.389e-05   2.010 0.046798 *  
pb100k_time -8.830e-06  1.277e-05  -0.691 0.490788    
pb100k_dec   1.358e-02  4.559e-02   0.298 0.766373    
avg_km      -2.044e-04  2.078e-03  -0.098 0.921839    
steu_b      -1.487e-02  1.982e-02  -0.750 0.454747    
stem_b      -3.238e-02  2.382e-02  -1.359 0.176777    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4688 on 115 degrees of freedom
  (163 observations deleted due to missingness)
Multiple R-squared:  0.09376,   Adjusted R-squared:  0.02284 
F-statistic: 1.322 on 9 and 115 DF,  p-value: 0.2331

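Based on the regression output, I cannot conclude that the type of trail has an impact on the emotional intelligence of the runners. Note that this model uses sex as the response, so it does not test that relationship directly: the pb_surface coefficient (p = 0.37) only shows that trail type is not a significant predictor of sex once the other variables are included. The only predictor significant at the 5% level is pb_elev (p = 0.047), with teique_sf borderline (p = 0.07), and both of those relate to sex rather than to best times or emotional intelligence. The model as a whole is weak (adjusted R-squared of 0.023, F-test p = 0.23) and is fit on only 125 runners, because 163 observations were deleted due to missingness.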
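
To test the question from the introduction more directly, emotional intelligence would need to be the response and trail type a categorical predictor. The following is a minimal sketch of that setup on made-up data, not a result from the real data set. The toy data frame and the choice of age as the only covariate are assumptions for illustration.

```r
# Toy stand-in for `running` (hypothetical values, same column names)
set.seed(42)
toy <- data.frame(
  pb_surface = rep(1:4, each = 15),
  age        = round(runif(60, 25, 60)),
  teique_sf  = rnorm(60, mean = 5, sd = 0.6)
)

# Emotional intelligence is the response; surface enters as a factor
ei_model <- lm(teique_sf ~ factor(pb_surface) + age, data = toy)
summary(ei_model)

# F-test for whether surface type explains any variation in teique_sf
anova(lm(teique_sf ~ age, data = toy), ei_model)
```

Wrapping pb_surface in factor() gives each trail type its own coefficient relative to the first level, instead of assuming the codes 1 to 4 lie on a numeric scale. On the real data, data = running would replace data = toy.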

Conclusion

Cleaning up

For the cleaning-up part of this project, I used select() with “-” to exclude the variables I wasn’t focusing on, and filter() with !is.na() to remove the rows with missing (NA) values.

What it shows

The visualization shows the relationship between trail type and the emotional intelligence of each runner. Although there may be a slight pattern, there is no strong evidence, even with the linear regression model, that trail type has an impact on the emotional intelligence of runners.

What I would’ve done better

A lot. My handling of R and coding in general is very sloppy, and I had to refer back to the notes and past visualizations. I wanted to include legends, heat maps, more variables on the graph, and overall something more refined and polished.