I will be using a data set from an ultramarathon completed in 2023. The data covers the age of female and male runners as well as other attributes such as emotional intelligence, the type of surface the runners ran on, the average distance run per week, and so on. I will be investigating what effect the type of trail a runner runs on has on their emotional intelligence.
Set up the libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 288 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): age, sex, pb_surface, pb_elev, pb100k_dec, avg_km, teique_sf, steu...
time (1): pb100k_time
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Next, clean up the data set
Exclude the other variables that you won’t be using
Call:
lm(formula = sex ~ age + pb_surface + teique_sf + pb_elev + pb100k_time +
pb100k_dec + avg_km + steu_b + stem_b, data = running)
Residuals:
Min 1Q Median 3Q Max
-0.8952 -0.5122 0.2127 0.3659 0.6327
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.857e+00 5.491e-01 3.381 0.000987 ***
age -1.344e-03 4.193e-03 -0.321 0.749160
pb_surface -3.707e-02 4.139e-02 -0.896 0.372304
teique_sf 1.243e-01 6.786e-02 1.832 0.069567 .
pb_elev 4.802e-05 2.389e-05 2.010 0.046798 *
pb100k_time -8.830e-06 1.277e-05 -0.691 0.490788
pb100k_dec 1.358e-02 4.559e-02 0.298 0.766373
avg_km -2.044e-04 2.078e-03 -0.098 0.921839
steu_b -1.487e-02 1.982e-02 -0.750 0.454747
stem_b -3.238e-02 2.382e-02 -1.359 0.176777
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4688 on 115 degrees of freedom
(163 observations deleted due to missingness)
Multiple R-squared: 0.09376, Adjusted R-squared: 0.02284
F-statistic: 1.322 on 9 and 115 DF, p-value: 0.2331
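Based on the regression output, I conclude that there is no evidence that the type of trail affects the runners' emotional intelligence. One caveat: the model as fitted uses sex as the response, so each coefficient describes an association with sex rather than with emotional intelligence. Within that model, pb_surface is not significant (p = 0.37), and elevation (pb_elev) is the only predictor significant at the 5% level (p = 0.047). The model as a whole explains little (R-squared = 0.094; F-test p = 0.23), and 163 of the 288 rows were dropped because of missing values.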
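Since the question is about emotional intelligence, one way to line the model up with it is to put the emotional intelligence score on the left-hand side of the formula. The sketch below is only an illustration under assumptions: that teique_sf is the emotional intelligence score, that pb_surface is a numeric code for surface type, and that the `toy` data frame (invented values) stands in for the real running data. It is not the model fitted above.

```r
# Toy stand-in data; values are invented and carry no real-world meaning
toy <- data.frame(
  teique_sf  = c(5.1, 4.8, 5.6, 4.9, 5.3, 4.7, 5.0, 5.4),
  pb_surface = c(1, 1, 2, 2, 3, 3, 1, 2),
  pb_elev    = c(1200, 300, 2500, 800, 1500, 400, 950, 2100)
)

# Emotional intelligence is the response. Wrapping pb_surface in factor()
# treats it as a category, so each surface type gets its own coefficient
# (assumption: the numbers are codes, not measurements).
fit <- lm(teique_sf ~ factor(pb_surface) + pb_elev, data = toy)

summary(fit)
```

With the real data, the rows for factor(pb_surface) in the summary would be the ones that speak to the trail-type question.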
Conclusion
Cleaning up
For the cleaning step, I dropped the columns I wasn't focusing on using select() with "-", and filtered out the rows with NA values using filter() with !is.na().
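The cleaning step described above could look roughly like the sketch below. The `toy` tibble is invented stand-in data (the real running data is not reproduced here), and the choice of which columns to drop is an assumption made for illustration.

```r
library(dplyr)

# Toy stand-in for the running data; values are invented for illustration only
toy <- tibble(
  age        = c(34, 41, 29, 52, 47),
  sex        = c(1, 2, 1, 2, 1),
  pb_surface = c(1, 2, NA, 3, 1),
  avg_km     = c(60, 85, 40, 70, NA),
  teique_sf  = c(5.1, NA, 4.8, 5.6, 4.9)
)

cleaned <- toy |>
  select(-age, -avg_km) |>                        # drop columns not under study
  filter(!is.na(pb_surface), !is.na(teique_sf))   # keep rows where both values are present

cleaned
```

Dropping columns before filtering means a missing value in a column that is not being studied does not cost a row. That matters here, since the full model lost 163 observations to missingness.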
What it shows
The visualization shows a basic correlation between trail type and each runner's emotional intelligence. Although there was a correlation, there is no strong evidence, even with the linear regression model, that trail type has an impact on the runners' emotional intelligence.
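One possible way to draw this kind of comparison is a box plot of emotional intelligence for each surface type. This is a sketch, not the plot described above: it assumes teique_sf is the emotional intelligence score and pb_surface is a numeric code for surface type, and it uses invented toy values.

```r
library(ggplot2)

# Toy stand-in data; values are invented for illustration only
toy <- data.frame(
  pb_surface = c(1, 1, 2, 2, 3, 3, 1, 2),
  teique_sf  = c(5.1, 4.8, 5.6, 4.9, 5.3, 4.7, 5.0, 5.4)
)

# factor() makes each surface code its own box on the x axis
p <- ggplot(toy, aes(x = factor(pb_surface), y = teique_sf)) +
  geom_boxplot() +
  labs(
    x = "Surface type (pb_surface)",
    y = "Emotional intelligence (teique_sf)"
  )

p
```

Boxes that overlap heavily across surface types would be consistent with the regression finding of no clear trail-type effect.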
What I would’ve done better
A lot. My handling of R and coding in general is very sloppy, and I had to refer back to the notes and to past visualizations. I wanted to include legends, heat maps, and more variables on the graph, and overall to produce something more refined and polished.