Getting started with RStudio

Introduction

RStudio is a powerful tool for working with data. You can create and document your workflow; clean, tidy, wrangle and visualize your data; and run both simple and highly complex analyses.

At this point you should have already:

  1. Either downloaded R and RStudio, or accessed them through the app store on a university PC.

  2. Set up a folder for your workflow.

  3. Created an RStudio project saved in your folder.

The aim of this workbook is to get you comfortable with some of the basics for using R so you can create or read in data sets, produce some summary statistics and plot data.

PART 1: Getting started: exploring “base R”

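Essentially, R is a programming language: you ask it to do something and (presuming you give the correct instruction) R does it. For example, if we ask R to calculate 1 + 1 it will give us the answer 2, and for 2 * 2 we will get 4. But if we type 2 x 2 we will get an error, because x is not a multiplication operator in R (the line is commented out with # below so that it does not stop the rest of the code from running):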

1 + 1
[1] 2
2 * 2
[1] 4
#2 x 2

You can also give data a name using equals (=), the more common leftward assignment (<-), or the less common rightward assignment (->):

a = 2 * 2

b <- 1 + 1

a * b -> c

c
[1] 8

Creating a data frame

Using the same concept we can create vectors of data and bind them together into a “data frame” - a table of data that we can explore. For example, imagine we have 10 participants who have completed a strength training intervention and we ask all of them for a rating of perceived exertion (RPE) afterwards.

rpe <- c(7, 7.5, 3, 4, 6, 9, 8, 6.5, 5.5, 8)
participant <- c("P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8", "P9", "P10")

rpe
 [1] 7.0 7.5 3.0 4.0 6.0 9.0 8.0 6.5 5.5 8.0

Note how we can now type “rpe” and the values appear. But we probably want to bind these together with our participant data. To do this we use the function data.frame(). Here I have called the data frame “rpe_data”:

rpe_data <- data.frame(participant, rpe)
rpe_data
   participant rpe
1           P1 7.0
2           P2 7.5
3           P3 3.0
4           P4 4.0
5           P5 6.0
6           P6 9.0
7           P7 8.0
8           P8 6.5
9           P9 5.5
10         P10 8.0

If we want, we can quickly derive some summary statistics using functions such as mean(), sd() or median(). Inside the brackets we write the name of the data frame, followed by the $ symbol and then the column name, e.g. rpe_data$rpe:

mean(rpe_data$rpe)
[1] 6.45
sd(rpe_data$rpe)
[1] 1.877498
median(rpe_data$rpe)
[1] 6.75
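
If you would like several statistics at once, the sketch below shows one possible way using only base R. It rebuilds the rpe vector so it can be run on its own; the name rpe_stats is just a label chosen for this illustration.

```r
# The RPE scores from the example above
rpe <- c(7, 7.5, 3, 4, 6, 9, 8, 6.5, 5.5, 8)

# summary() reports the minimum, quartiles, median, mean and maximum in one call
summary(rpe)

# Collect a few chosen statistics in a named vector (rpe_stats is a made-up name)
rpe_stats <- c(mean = mean(rpe), sd = sd(rpe), median = median(rpe))
round(rpe_stats, 2)
```

The same calls should work on a data frame column, for example summary(rpe_data$rpe).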

We can also create a simple plot of the data using the function barplot(). The height argument sets the height of the bars (y axis), here rpe, and the names.arg argument sets the labels under the bars (x axis), here participant.

barplot(height = rpe_data$rpe, names.arg = rpe_data$participant)

We can also add arguments to our bar plot to change the look, for example below I have changed the color and added a main title and x and y axis labels. I also added limits to the y axis so the full range of the RPE scale could be included.

barplot(height = rpe_data$rpe, names.arg = rpe_data$participant, 
        col = "blue", 
        main = "Participant's RPE", 
        xlab = "Participants", 
        ylab = "RPE Scores (AU)",  
        ylim = c(0, 10))

What if I want the plot to be ordered from the lowest to the highest RPE? I can reorder the data frame using the order() function and then re-run the plot. The code is slightly trickier here as it uses both [ ] and ( ):

rpe_data <- rpe_data[order(rpe_data$rpe), ]

barplot(height = rpe_data$rpe, names.arg = rpe_data$participant, col = "blue", main = "Participant's RPE", xlab = "Participants", ylab = "RPE Scores (AU)",  ylim = c(0, 10))
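
To see why the reordering line needs both ( ) and [ ], it can help to look at what order() returns by itself. The sketch below uses a separate copy of the data (rpe_demo, a name made up for this illustration) so that it does not change rpe_data.

```r
# A separate copy of the example data, so rpe_data itself is left untouched
rpe_demo <- data.frame(
  participant = paste0("P", 1:10),
  rpe = c(7, 7.5, 3, 4, 6, 9, 8, 6.5, 5.5, 8)
)

# order() does not sort the values; it returns the row positions
# that would put them in order from smallest to largest
row_positions <- order(rpe_demo$rpe)
row_positions

# Inside [rows, columns] we supply those positions as the rows and
# leave the columns part blank to keep every column
rpe_sorted <- rpe_demo[row_positions, ]
rpe_sorted
```

Wrapping the column in order(..., decreasing = TRUE) would give the highest RPE first instead.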

Finally, we might be interested in the distribution of these data, so we could plot a histogram and/or run a test for normality (e.g. the Shapiro-Wilk test). I have done a basic histogram (which looks a bit odd as we only have 10 participants!) but you can customize it yourself now, maybe adding labels and changing the colors in the same way as you did for the bar plot above.

hist(rpe_data$rpe)

shapiro.test(rpe_data$rpe)

    Shapiro-Wilk normality test

data:  rpe_data$rpe
W = 0.95291, p-value = 0.703

With p = 0.703 (well above 0.05), this suggests the data are not significantly different from normal - there are issues with this type of test but let’s not worry about that for now!

Additional Challenges

  1. Can you add to this data set?

    • Perhaps add an additional column with a relevant variable (e.g., volume load of the strength session) or strength training experience?

    • You could add this data in a way that might correlate with RPE?

  2. Can you then work out how to run a Pearson’s correlation in RStudio?

  3. Maybe you could add RPE values for a different type of resistance training session?

  4. You could then see if you could work out how to run a t-test to see if there is a difference in RPE values between the sessions.

PART 2: Exploring data in the “tidyverse”

Reading in data from Excel or .csv

So far we have done all our work in base R, which is fine, but what if we already have data in an Excel or .csv file that we want to import? Firstly, it is important that the file we want to import is in our “working directory”. If you have set your R Project up in a folder, then as long as the file is in this folder it should read in fine.

We are going to work from the “Strength_testing.xlsx” file that is on Blackboard. First download this file and save it in your folder. If you open it in Excel you can see the data and the “dictionary” tab, which should give you information about the data in each column. Once you’ve familiarized yourself with it, I would recommend closing the Excel document before going any further.

Back in R, you should now see your Excel file in “Files” (the pane in the bottom right-hand corner of your RStudio window).
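
If you want to confirm from the console that R can see the file, the short sketch below is one way to check. It only uses base R functions and assumes the file has been saved with the name “Strength_testing.xlsx”.

```r
# Show the folder R is currently working in (your working directory)
getwd()

# List any Excel files R can see in that folder
list.files(pattern = "\\.xlsx$")

# TRUE if the workbook is visible to R, FALSE otherwise
file.exists("Strength_testing.xlsx")
```

If the last line returns FALSE, the file is probably saved somewhere other than your project folder.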

How do we get this into R?

Well, to do this we need a package called “readxl”. This package may not be installed yet, so we may need to install it before we go any further. Note, if you click on the Excel file in “Files” and choose “Import Dataset”, RStudio will automatically prompt you to download the package. If not, you can use the code install.packages("readxl").

Once a package is installed we need to tell R we want to use it, and to do this we use the code library().

N.B.: An R package is a collection of functions, data, and documentation that extends the capabilities of base R.
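
Packages only need to be installed once, so you may prefer not to re-run install.packages() every time. The sketch below is one common pattern, assuming you have an internet connection for the install step: it checks whether readxl is available and installs it only if it is missing.

```r
# requireNamespace() returns TRUE if the package is installed, FALSE if not
if (!requireNamespace("readxl", quietly = TRUE)) {
  install.packages("readxl")  # only runs when readxl is missing
}

# Confirm the package can now be found
requireNamespace("readxl", quietly = TRUE)
```

You still need to call library(readxl) in each new R session before using its functions.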

library(readxl) # this tells R we want to use the package "readxl"

data <- read_excel("Strength_testing.xlsx") # I use the function read_excel() to read in the file "Strength_testing.xlsx". I have also called it "data"

head(data) #This shows me the first few rows of my data frame - I can use View(data) to see the whole dataframe in a new tab
# A tibble: 6 × 13
  id    i_age age   date                time_point  height  mass   cmj rj_ct
  <chr> <chr> <chr> <dttm>              <chr>        <dbl> <dbl> <dbl> <dbl>
1 ETC02 u13s  u12s  2024-02-16 00:00:00 Winter 2024    148  37.6  20.7   190
2 ETC03 u13s  u12s  2024-02-16 00:00:00 Winter 2024    142  32.4  17.2   196
3 ETC05 u12s  u12s  2024-02-16 00:00:00 Winter 2024    151  45.5  12.9   187
4 ETC07 u13s  u12s  2024-02-16 00:00:00 Winter 2024    160  41.5  19.9   150
5 ETC08 u13s  u12s  2024-02-16 00:00:00 Winter 2024    157  44.6  19.6   184
6 ETC09 u13s  u12s  2024-02-16 00:00:00 Winter 2024    150  37.6  18.8    NA
# ℹ 4 more variables: rj_height <dbl>, rj_rsi <dbl>, rel_max_force <dbl>,
#   max_force <dbl>

Data manipulation with tidyverse

Tidyverse is a collection of packages that share an underlying design philosophy, grammar, and data structures. Tidyverse allows you to import, wrangle (tidy & transform), visualise and model your data. Below we will primarily use a package called dplyr, but we will also use some functions from other packages.

Here we are going to cover the following data manipulation functions that will serve you well over time:

  • filter() picks cases based on their values.

  • arrange() changes the ordering of the rows.

  • select() picks variables based on their names.

  • mutate() adds new variables that are functions of existing variables.

  • summarise() reduces multiple values down to a single summary.

  • group_by() takes an existing table and converts it into a grouped table.

Take a look at the tidyverse website where you can go into each of the different packages and functions.

Getting started with tidyverse

Firstly, if you haven’t already done so you will need to install tidyverse using the following code:

install.packages("tidyverse")

Once installed we still need to tell R we want to use tidyverse and do that using library()

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
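
With the packages loaded, the six functions listed earlier can be chained together. The sketch below is only an illustration on a small made-up table (toy, with invented values), not the workbook data, and it loads dplyr by itself so it can be run on its own.

```r
library(dplyr)

# Hypothetical toy data with invented values
toy <- data.frame(
  id   = c("A", "B", "C", "D"),
  age  = c("u12s", "u12s", "u13s", "u13s"),
  mass = c(40, 45, 50, NA),
  cmj  = c(18, 22, 20, 25)
)

toy_summary <- toy |>
  filter(!is.na(mass)) |>             # keep rows with a recorded mass
  select(id, age, mass, cmj) |>       # keep only the named columns
  mutate(cmj_per_kg = cmj / mass) |>  # add a new column from existing ones
  arrange(desc(cmj)) |>               # sort rows, highest jump first
  group_by(age) |>                    # one group per age category
  summarise(mean_cmj = mean(cmj), n = n())  # one summary row per group

toy_summary
```

Each function is covered one at a time in the sections that follow.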

Filtering our data

A useful little function within the forcats package in tidyverse is fct_count(), which counts the number of rows in a data frame for each level of a factor. In other words, if we want to count how many testing observations we have at each time point in our data we can do that easily:

fct_count(data$time_point) 
# A tibble: 4 × 2
  f               n
  <fct>       <int>
1 Autumn 2023    23
2 Spring 2024    40
3 Summer 2024    55
4 Winter 2024    39

f represents the factor levels and n the count.
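
For comparison, base R can produce the same kind of count with table(). The sketch below uses a short made-up vector (time_point_demo) rather than the workbook data.

```r
# Hypothetical vector of time points, invented for illustration
time_point_demo <- c("Winter 2024", "Summer 2024", "Summer 2024",
                     "Spring 2024", "Summer 2024")

# table() counts how many times each distinct value appears
counts <- table(time_point_demo)
counts

# prop.table() converts those counts into proportions
prop.table(counts)
```

The main practical difference is that fct_count() returns a tibble, which is easier to pass on to other tidyverse functions.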

If we want to, we can sort it from the highest to the lowest count and add the proportion too:

fct_count(data$time_point, sort = TRUE, prop = TRUE)
# A tibble: 4 × 3
  f               n     p
  <fct>       <int> <dbl>
1 Summer 2024    55 0.350
2 Spring 2024    40 0.255
3 Winter 2024    39 0.248
4 Autumn 2023    23 0.146

Okay, so we have data from four testing sessions from Autumn 2023 to Summer 2024 and “Summer 2024” has the most observations (55) which represents 35% of the data set.

For now let’s say we are only interested in the data from Summer 2024 as this has the most observations. We are going to need to filter the data. We can do that easily using the filter() function in dplyr:

# Keep all rows in "data" where time_point is equal to "Summer 2024"; note how we use == here.
# You can load your data back in again and try selecting data for a specific age group or id, for example.
data <- filter(data, time_point == "Summer 2024")
head(data)
# A tibble: 6 × 13
  id    i_age age   date                time_point  height  mass   cmj rj_ct
  <chr> <chr> <chr> <dttm>              <chr>        <dbl> <dbl> <dbl> <dbl>
1 ETC01 u13s  u13s  2024-08-29 00:00:00 Summer 2024   151   43.4  25.4   170
2 ETC03 u13s  u13s  2024-08-28 00:00:00 Summer 2024   146.  35.2  20.4   173
3 ETC05 u12s  u12s  2024-08-29 00:00:00 Summer 2024   155   51.4  14.2   174
4 ETC07 u13s  u13s  2024-08-29 00:00:00 Summer 2024   165.  46.1  25.5   158
5 ETC08 u13s  u13s  2024-08-29 00:00:00 Summer 2024   158.  46.9  17.3   178
6 ETC09 u13s  u13s  2024-08-29 00:00:00 Summer 2024   156.  39.8  20.6   139
# ℹ 4 more variables: rj_height <dbl>, rj_rsi <dbl>, rel_max_force <dbl>,
#   max_force <dbl>
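
filter() can also take more than one condition. The sketch below shows two patterns on a small made-up table (toy, with invented ids), assuming dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data, not the workbook file
toy <- data.frame(
  id         = c("A1", "A2", "A3", "A4"),
  age        = c("u12s", "u13s", "u14s", "u14s"),
  time_point = c("Summer 2024", "Summer 2024", "Winter 2024", "Summer 2024")
)

# Conditions separated by a comma must all be TRUE for a row to be kept
summer_u14 <- filter(toy, time_point == "Summer 2024", age == "u14s")
summer_u14

# %in% keeps rows whose value matches any entry in the set
younger <- filter(toy, age %in% c("u12s", "u13s"))
younger
```

Numeric columns can be filtered in the same way using <, <=, > or >=.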

Maybe we want to view this data in order of highest to lowest jump height. We can use the arrange() function combined with desc() like so:

data |> 
  arrange(desc(cmj))
# A tibble: 55 × 13
   id    i_age age   date                time_point  height  mass   cmj rj_ct
   <chr> <chr> <chr> <dttm>              <chr>        <dbl> <dbl> <dbl> <dbl>
 1 RTC46 u16s  u16s  2024-08-29 00:00:00 Summer 2024   160.  80.6  28.9   188
 2 RTC82 u14s  u14s  2024-08-29 00:00:00 Summer 2024   161.  46.7  28.3   160
 3 RTC90 u16s  u16s  2024-08-28 00:00:00 Summer 2024   162.  51.7  27.9   182
 4 RTC94 u16s  u16s  2024-08-29 00:00:00 Summer 2024   165   52.3  27.9   184
 5 ETC41 u12s  u12s  2024-08-28 00:00:00 Summer 2024   150.  35.8  27.7   201
 6 RTC50 u16s  u16s  2024-08-29 00:00:00 Summer 2024   168.  48.7  27.4   158
 7 RTC86 u14s  u14s  2024-08-28 00:00:00 Summer 2024   163   57.1  27     208
 8 RTC77 u14s  u14s  2024-08-29 00:00:00 Summer 2024   160.  66.4  26.3   152
 9 ETC07 u13s  u13s  2024-08-29 00:00:00 Summer 2024   165.  46.1  25.5   158
10 ETC01 u13s  u13s  2024-08-29 00:00:00 Summer 2024   151   43.4  25.4   170
# ℹ 45 more rows
# ℹ 4 more variables: rj_height <dbl>, rj_rsi <dbl>, rel_max_force <dbl>,
#   max_force <dbl>

Or maybe we want to arrange by age group (youngest to oldest) and then by jump height (highest to lowest):

data |> 
  arrange(age, desc(cmj))
# A tibble: 55 × 13
   id    i_age age   date                time_point  height  mass   cmj rj_ct
   <chr> <chr> <chr> <dttm>              <chr>        <dbl> <dbl> <dbl> <dbl>
 1 ETC41 u12s  u12s  2024-08-28 00:00:00 Summer 2024   150.  35.8  27.7   201
 2 ETC36 u12s  u12s  2024-08-28 00:00:00 Summer 2024   152.  43.6  23     208
 3 ETC27 u12s  u12s  2024-08-28 00:00:00 Summer 2024   156.  40.4  22     206
 4 ETC16 u12s  u12s  2024-08-28 00:00:00 Summer 2024   144.  38.4  18.7   173
 5 ETC28 u12s  u12s  2024-08-28 00:00:00 Summer 2024   134   32.6  17.7   171
 6 ETC18 u12s  u12s  2024-08-28 00:00:00 Summer 2024   160.  54.1  17.6   180
 7 ETC14 u12s  u12s  2024-08-28 00:00:00 Summer 2024   156.  42.8  16.7   165
 8 ETC39 u12s  u12s  2024-08-28 00:00:00 Summer 2024   144.  36.8  16.6   174
 9 ETC33 u12s  u12s  2024-08-29 00:00:00 Summer 2024   148.  35    16.2   194
10 ETC17 u12s  u12s  2024-08-28 00:00:00 Summer 2024   146.  41.4  14.8   183
# ℹ 45 more rows
# ℹ 4 more variables: rj_height <dbl>, rj_rsi <dbl>, rel_max_force <dbl>,
#   max_force <dbl>

Okay, can you now apply what you learnt in part 1 and work out some summary statistics and plot some of this data, maybe CMJ height for example? You might find a more normal looking histogram!

TASKS

  1. Can you plot a histogram of some of the variables in our filtered data set?

  2. Can you filter the data to just look at the u16s age group?

  3. Maybe we want to identify players who need more focused S&C training. Can you filter the data to only look at players with a low CMJ height, say <20 cm?

Summary statistics

Let’s now calculate some summary statistics. We can use the summarize() function and apply it across our data set. However, we can only calculate a mean value when the data are numeric, and our data also contain categorical variables.

As always with R, we have choices as to how to solve problems. We could select the variables we want to summarize across manually, e.g. c(height, mass, etc.):

data %>%
  summarize(across(c(height, mass, cmj), \(x) mean(x, na.rm = TRUE)))
# A tibble: 1 × 3
  height  mass   cmj
   <dbl> <dbl> <dbl>
1   158.  49.2  21.1

Or we could be a little bit clever and use where(is.numeric) to tell R to only calculate mean values for numeric columns.

We also have missing data, recorded as NA in some of the cells, therefore I have added na.rm = TRUE to make sure an average is still derived when there is missing data. Try changing this to FALSE and see what happens.

mean_values <- data %>%
  summarize(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))

mean_values
# A tibble: 1 × 8
  height  mass   cmj rj_ct rj_height rj_rsi rel_max_force max_force
   <dbl> <dbl> <dbl> <dbl>     <dbl>  <dbl>         <dbl>     <dbl>
1   158.  49.2  21.1  174.      19.2   1.12          26.9     1248.

N.B. if we wanted to delete rows with NA we could do this using the code drop_na(data), which is part of the tidyr package within the tidyverse.
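
To see what na.rm does in isolation, here is a minimal base R sketch using three made-up jump heights, one of which is missing (cmj_demo is a name chosen for this illustration).

```r
# Three hypothetical jump heights, with one missing value
cmj_demo <- c(20.7, NA, 18.8)

# Without na.rm the missing value makes the whole result NA
mean_with_na <- mean(cmj_demo)
mean_with_na

# With na.rm = TRUE the NA is dropped before the mean is calculated
mean_without_na <- mean(cmj_demo, na.rm = TRUE)
mean_without_na
```

The second result is the mean of the two recorded values only, so it is worth checking how many values are missing before relying on it.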

It might be more useful if we can calculate the mean for each age category. To do this we can use the group_by() function and combine it with summarize().

To do this we are going to create a pipe. You may have noticed the symbol %>% in the code above: this is the pipe, which tidyverse packages such as dplyr make available, and it is particularly useful for chaining lines of code together. Note you can also use |>, the native pipe built into base R (version 4.1 onwards); it behaves the same way in the code used here but lacks a few features of %>%, such as the . placeholder.

The code below adds one line to the code above: it groups the data by the column age before taking the mean of each numeric column.

Do you notice anything interesting from these means?

# Group by the age column and calculate the mean for each numeric column
mean_values <- data %>%
  group_by(age) %>%
  summarize(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))

mean_values
# A tibble: 4 × 9
  age   height  mass   cmj rj_ct rj_height rj_rsi rel_max_force max_force
  <chr>  <dbl> <dbl> <dbl> <dbl>     <dbl>  <dbl>         <dbl>     <dbl>
1 u12s    149.  41.6  18.3  185.      17.1  0.924          28.2     1065.
2 u13s    156.  44.9  20.6  175.      19.0  1.12           28.1     1144.
3 u14s    162.  54.2  22.0  171.      19.9  1.19           26.0     1347 
4 u16s    165.  55.1  22.9  168.      20.3  1.21           25.6     1407.
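
If the pipe still feels a little abstract, it may help to see that it simply passes the result on its left into the function on its right as the first argument. The sketch below writes the same calculation three ways on a small made-up table (toy), assuming dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data with invented values
toy <- data.frame(
  age = c("u12s", "u12s", "u13s"),
  cmj = c(18, 22, 20)
)

# 1. Nested calls: read from the inside out
nested <- summarise(group_by(toy, age), mean_cmj = mean(cmj))

# 2. The %>% pipe: read from top to bottom
piped <- toy %>%
  group_by(age) %>%
  summarise(mean_cmj = mean(cmj))

# 3. The native |> pipe gives the same result here
native <- toy |>
  group_by(age) |>
  summarise(mean_cmj = mean(cmj))

piped
```

All three objects hold the same table; the piped versions are usually easier to read once there are more than two steps.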

Selecting the columns we want to use

We might not want to keep all the variables (columns) in our data and we can use the select() function to select variables we want to keep or remove from our data. For example, we might want to remove variables i_age & date as we don’t really need them. Here we can use -c() and then state these variables:

data |> 
  select(-c(i_age, date))
# A tibble: 55 × 11
   id    age   time_point  height  mass   cmj rj_ct rj_height rj_rsi
   <chr> <chr> <chr>        <dbl> <dbl> <dbl> <dbl>     <dbl>  <dbl>
 1 ETC01 u13s  Summer 2024   151   43.4  25.4   170      20.3   1.2 
 2 ETC03 u13s  Summer 2024   146.  35.2  20.4   173      20.2   1.17
 3 ETC05 u12s  Summer 2024   155   51.4  14.2   174      17.3   0.99
 4 ETC07 u13s  Summer 2024   165.  46.1  25.5   158      17     1.07
 5 ETC08 u13s  Summer 2024   158.  46.9  17.3   178      22.8   1.28
 6 ETC09 u13s  Summer 2024   156.  39.8  20.6   139      19.1   1.37
 7 ETC10 u13s  Summer 2024   158.  49.1  18.9   186      16.7   0.89
 8 ETC11 u13s  Summer 2024   164   57.3  20.7   150      13.3   0.89
 9 ETC14 u12s  Summer 2024   156.  42.8  16.7   165      13.7   0.82
10 ETC15 u13s  Summer 2024   149.  36.5  17.8   130      23.8   1.83
# ℹ 45 more rows
# ℹ 2 more variables: rel_max_force <dbl>, max_force <dbl>

Or we might want to just look at two variables (age group and cmj height) e.g.,

data |> 
  select(age, cmj)
# A tibble: 55 × 2
   age     cmj
   <chr> <dbl>
 1 u13s   25.4
 2 u13s   20.4
 3 u12s   14.2
 4 u13s   25.5
 5 u13s   17.3
 6 u13s   20.6
 7 u13s   18.9
 8 u13s   20.7
 9 u12s   16.7
10 u13s   17.8
# ℹ 45 more rows

Creating new variables with mutate

Sometimes we might want to create new variables in a data set. Given we have mass and height data, we could calculate BMI using mutate(). For BMI we need height in meters, so we divide height (recorded in cm) by 100 and then square it. We then divide mass by this number: mass / (height / 100)^2.

See how the code below adds a new column to the end of the data set named bmi.

mutate(data, bmi = mass / (height / 100)^2)
# A tibble: 55 × 14
   id    i_age age   date                time_point  height  mass   cmj rj_ct
   <chr> <chr> <chr> <dttm>              <chr>        <dbl> <dbl> <dbl> <dbl>
 1 ETC01 u13s  u13s  2024-08-29 00:00:00 Summer 2024   151   43.4  25.4   170
 2 ETC03 u13s  u13s  2024-08-28 00:00:00 Summer 2024   146.  35.2  20.4   173
 3 ETC05 u12s  u12s  2024-08-29 00:00:00 Summer 2024   155   51.4  14.2   174
 4 ETC07 u13s  u13s  2024-08-29 00:00:00 Summer 2024   165.  46.1  25.5   158
 5 ETC08 u13s  u13s  2024-08-29 00:00:00 Summer 2024   158.  46.9  17.3   178
 6 ETC09 u13s  u13s  2024-08-29 00:00:00 Summer 2024   156.  39.8  20.6   139
 7 ETC10 u13s  u13s  2024-08-29 00:00:00 Summer 2024   158.  49.1  18.9   186
 8 ETC11 u13s  u13s  2024-08-28 00:00:00 Summer 2024   164   57.3  20.7   150
 9 ETC14 u12s  u12s  2024-08-28 00:00:00 Summer 2024   156.  42.8  16.7   165
10 ETC15 u13s  u13s  2024-08-29 00:00:00 Summer 2024   149.  36.5  17.8   130
# ℹ 45 more rows
# ℹ 5 more variables: rj_height <dbl>, rj_rsi <dbl>, rel_max_force <dbl>,
#   max_force <dbl>, bmi <dbl>
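
To check the formula by hand, the sketch below works through the calculation for a single player. It uses the height and mass printed earlier for player ETC28 (134 cm and 32.6 kg); these are the rounded values as displayed, so treat the result as approximate. The variable names are made up for this illustration.

```r
# Values as printed for player ETC28 (rounded in the output)
height_cm <- 134
mass_kg   <- 32.6

# Step 1: convert height from centimetres to metres
height_m <- height_cm / 100

# Step 2: divide mass by height squared
bmi_demo <- mass_kg / height_m^2

round(bmi_demo, 1)
```

mutate() applies exactly this arithmetic to every row of the data frame at once.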

Let’s now wrap this all into a pipe where we select id, age, height and mass and then create a new column bmi. We will finally arrange the data from shortest to tallest player.

bmi_data <- data %>%
  select(id, age, height, mass) %>%
  mutate(bmi = mass / (height / 100)^2) %>%
  arrange(height)

bmi_data
# A tibble: 55 × 5
   id    age   height  mass   bmi
   <chr> <chr>  <dbl> <dbl> <dbl>
 1 ETC28 u12s    134   32.6  18.2
 2 ETC39 u12s    144.  36.8  17.9
 3 ETC16 u12s    144.  38.4  18.6
 4 ETC31 u13s    144.  36.4  17.4
 5 ETC03 u13s    146.  35.2  16.5
 6 ETC17 u12s    146.  41.4  19.3
 7 ETC26 u12s    147.  46.7  21.6
 8 ETC33 u12s    148.  35    16.1
 9 ETC15 u13s    149.  36.5  16.4
10 ETC41 u12s    150.  35.8  16.0
# ℹ 45 more rows

And for good measure let’s count the number of observations in each age group:

fct_count(bmi_data$age, sort = TRUE, prop = TRUE)
# A tibble: 4 × 3
  f         n     p
  <fct> <int> <dbl>
1 u16s     16 0.291
2 u13s     15 0.273
3 u12s     12 0.218
4 u14s     12 0.218

TASKS

  1. Can you create a new data frame with player id, age group, cmj, rj_ct (contact time) and rj_height?

  2. Can you calculate an RSI from these data yourself in a new column?

  3. Can you take the average for these variables by age group and graph these using barplot() in base R?

Additional Challenges

Tidyverse includes the package ggplot2 which is a powerful tool for data visualization.

  1. Can you explore ggplot and see if you can create a plot to visualize some of the interesting differences in mean values we saw between age groups?

  2. If we wanted to see if these differences were real what statistical test could we run? Can you find out how to run this test on this particular data in R?

  3. What about athlete profiling? Could we, for example, look at the association between max strength (rel_max_force) and fast strength (e.g., rj_rsi)?

Example ggplot which we will explore in Part 3

PART 3: Data visualization with ggplot

Introduction

We are going to use some Athlete Reported Outcome Measures (AROM) of “readiness” to train to explore joining multiple data sheets with common column structure and to learn to use ggplot.

Here the Athlete Readiness to Train Questionnaire (ART-Q) was filled in on a weekly basis by a group of girls’ football players across a season. We also have RPE data in this data set for the training session performed, but we won’t look at these data here.

Data were collected using several different tablet applications, each creating a separate .csv file (these are on Blackboard).

Your initial challenge is to read the data in from these files and combine them into one data frame.

First you’ll need to set up a folder for your files (containing just the .csv files you want to combine).

Reading in multiple data files

Create a list of all your files

The code below creates a list of all .csv files in a particular folder; here we have called this “list_of_files”. Note the file path will be specific to where you have saved your .csv files.

list_of_files <- list.files(path = "~/Library/CloudStorage/OneDrive-TeessideUniversity/Work/Teaching/MSc/Advanced Testing, Monitoring and Data Analysis for Strength and Conditioning/2025/Work_area",
           recursive = TRUE,
           pattern = "\\.csv$",
           full.names = TRUE)

Combining these files using “rbind”

Here we are asking R to read all files in “list_of_files” using the lapply() function. As our .csv files have no column headers, we specify this with header = FALSE.

The code below uses rbind to bind the files together by rows (it adds the rows of each file underneath the previous one). NOTE: we will use cbind later in the module to combine files with the same rows but different columns.

data <- do.call(rbind, lapply(list_of_files, read.csv,
                              as.is = TRUE, header = FALSE))

head(data)
  V1         V2    V3 V4         V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15
1 51 31/10/2022 17:01 NA Resistance NA  7  6  7   7   5   6   7   7   2
2 28 31/10/2022 17:01 NA Resistance NA  7  7  7   7   7   7   7   7   0
3 12 31/10/2022 17:22 NA Resistance NA  4  4  4   4   4   4   4   4   0
4 30 31/10/2022 17:56 NA Resistance NA  5  6  4   5   6   5   5   6   3
5  3 31/10/2022 17:58 NA Resistance NA  4  4  6   6   5   7   3   7   0
6 14 31/10/2022 18:01 NA Resistance NA  6  3  4   3   5   7   1   6   0
                        V16 V17 V18 V19 V20 V21 V22                       V23
1 jess vasey                 80   0   0   0   0   0                          
2                            80   0   0   0   0   0                          
3                            80   0   0   0   0   0                          
4 Maddy Hillyer              80   0   0   0   0   0                          
5                            80   0   0   0   0   0                          
6                            80   0   0   0   0   0                          
       V24    V25  V26                V27
1 Football Female U16s Tactical/Technical
2 Football Female U16s Tactical/Technical
3 Football Female U16s Tactical/Technical
4 Football Female U16s Tactical/Technical
5 Football Female U16s Tactical/Technical
6 Football Female U16s Tactical/Technical
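
If you want to try the list.files(), lapply() and rbind steps without touching your own files, the sketch below builds two tiny header-less .csv files in R’s temporary folder and combines them. The folder name, file names and values are all made up for this illustration.

```r
# Hypothetical example folder inside R's temporary directory
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)

# Write two small header-less files with the same column layout
write.table(data.frame(c(51, 28), c(7, 7)), file.path(demo_dir, "app_a.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(data.frame(12, 4), file.path(demo_dir, "app_b.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)

# List the full paths of every .csv file in the folder
demo_files <- list.files(path = demo_dir, pattern = "\\.csv$", full.names = TRUE)

# lapply() reads each file into a list of data frames;
# do.call(rbind, ...) then stacks that list row by row
demo_data <- do.call(rbind, lapply(demo_files, read.csv, header = FALSE))
demo_data
```

Because the files have no headers, R names the columns V1, V2 and so on, which is why the combined workbook data also start with those names.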

Hey presto, we have multiple files combined! We just need to tidy the data up and give them some context now.

But it’s a pretty messy data set!

Data tidying

As with last week we are going to use tidyverse to help us:

First of all, we don’t need columns V4, V5 or V6, or columns 24 to 27 (these are default settings and incorrect). So we are going to use select() to leave us with all the columns that are not 4-6 or 24-27. Again I’ve used head(data) so you can see what you are doing.

data <- data %>% select(-c(4:6,24:27))  

head(data)
  V1         V2    V3 V7 V8 V9 V10 V11 V12 V13 V14 V15
1 51 31/10/2022 17:01  7  6  7   7   5   6   7   7   2
2 28 31/10/2022 17:01  7  7  7   7   7   7   7   7   0
3 12 31/10/2022 17:22  4  4  4   4   4   4   4   4   0
4 30 31/10/2022 17:56  5  6  4   5   6   5   5   6   3
5  3 31/10/2022 17:58  4  4  6   6   5   7   3   7   0
6 14 31/10/2022 18:01  6  3  4   3   5   7   1   6   0
                        V16 V17 V18 V19 V20 V21 V22                       V23
1 jess vasey                 80   0   0   0   0   0                          
2                            80   0   0   0   0   0                          
3                            80   0   0   0   0   0                          
4 Maddy Hillyer              80   0   0   0   0   0                          
5                            80   0   0   0   0   0                          
6                            80   0   0   0   0   0                          

So we are left with the data we are interested in, but what do the columns represent? Luckily we (or I) know this. I have written the names in the code below in the correct order. You could try changing these and see what happens.

The function we will use here is rename_with() - this is going to apply these names to the data frame “data”.

data <- data %>%
  rename_with(~ c("ID","Date","Time", "Mood", "Health", "Tiredness", "Sleep", "Soreness", "Food", "School", "Hydration" ,  "Other", "Comments",
                  "Duration", "sRPE", "sRPE_B", "sRPE_L", "sRPE_U", "sRPE_T", "Session Comments"))

head(data)
  ID       Date  Time Mood Health Tiredness Sleep Soreness Food School
1 51 31/10/2022 17:01    7      6         7     7        5    6      7
2 28 31/10/2022 17:01    7      7         7     7        7    7      7
3 12 31/10/2022 17:22    4      4         4     4        4    4      4
4 30 31/10/2022 17:56    5      6         4     5        6    5      5
5  3 31/10/2022 17:58    4      4         6     6        5    7      3
6 14 31/10/2022 18:01    6      3         4     3        5    7      1
  Hydration Other                  Comments Duration sRPE sRPE_B sRPE_L sRPE_U
1         7     2 jess vasey                      80    0      0      0      0
2         7     0                                 80    0      0      0      0
3         4     0                                 80    0      0      0      0
4         6     3 Maddy Hillyer                   80    0      0      0      0
5         7     0                                 80    0      0      0      0
6         6     0                                 80    0      0      0      0
  sRPE_T          Session Comments
1      0                          
2      0                          
3      0                          
4      0                          
5      0                          
6      0                          
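
To see what rename_with() is doing here, the sketch below applies the same pattern to a small made-up table (toy) and compares it with the base R function setNames(). It assumes dplyr is installed; the names and values are invented.

```r
library(dplyr)

# Hypothetical toy data with default column names
toy <- data.frame(
  V1 = c(51, 28),
  V2 = c("31/10/2022", "31/10/2022"),
  V7 = c(7, 7)
)

# One new name per column, in the same order as the columns
new_names <- c("ID", "Date", "Mood")

# The ~ creates a small function that returns the new names
renamed_a <- toy %>% rename_with(~ new_names)

# Base R equivalent for comparison
renamed_b <- setNames(toy, new_names)

names(renamed_a)
```

Both approaches rely on the names being listed in the correct order and on there being exactly one name per column.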

So, this looks a bit better. However, we need to tell R that “Date” refers to a date and that “ID” refers to a factor. So we can use as.Date() and as.factor() - if we wanted a column as a number we could use as.numeric(). Notice how these change:

NOTE: as.Date() does not work well here as the files have different date formats, so I had to use parse_date_time() and guess_formats() from the lubridate package.

data$Date <- parse_date_time(data$Date, guess_formats(data$Date, c("dmy", "ymd")))
data$ID <- as.factor(data$ID)
head(data)
  ID       Date  Time Mood Health Tiredness Sleep Soreness Food School
1 51 2022-10-31 17:01    7      6         7     7        5    6      7
2 28 2022-10-31 17:01    7      7         7     7        7    7      7
3 12 2022-10-31 17:22    4      4         4     4        4    4      4
4 30 2022-10-31 17:56    5      6         4     5        6    5      5
5  3 2022-10-31 17:58    4      4         6     6        5    7      3
6 14 2022-10-31 18:01    6      3         4     3        5    7      1
  Hydration Other                  Comments Duration sRPE sRPE_B sRPE_L sRPE_U
1         7     2 jess vasey                      80    0      0      0      0
2         7     0                                 80    0      0      0      0
3         4     0                                 80    0      0      0      0
4         6     3 Maddy Hillyer                   80    0      0      0      0
5         7     0                                 80    0      0      0      0
6         6     0                                 80    0      0      0      0
  sRPE_T          Session Comments
1      0                          
2      0                          
3      0                          
4      0                          
5      0                          
6      0                          
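
The sketch below illustrates the mixed-format problem on two made-up dates, one written day-month-year and one year-month-day. It is a simplified variation on the code above, passing the candidate orders straight to parse_date_time() rather than going through guess_formats(), and assumes lubridate is installed.

```r
library(lubridate)

# Two hypothetical dates written in different formats
mixed_dates <- c("31/10/2022", "2022-11-07")

# parse_date_time() tries each order in turn for every element
parsed <- parse_date_time(mixed_dates, orders = c("dmy", "ymd"))
parsed

# Keep just the date part if the time of day is not needed
as.Date(parsed)
```

It is worth checking for NA values afterwards, as any date that matches none of the listed orders will fail to parse.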

We have “athlete readiness to train (ART)” data with several AROM and some RPE data.

We might be interested in both or just in one set of data so we can subset these if we want. For today let’s focus just on the AROM. In this case the first 11 columns are what we need so we will use select again but rather than -c() we’ll use c(). Notice how the description here is of a dataframe 6x11 now rather than 6x20:

ART <- data %>%  select(c(1:11)) # select columns

view(ART)

Great, but why have we got “0”s in these data? This is because when the players rate their RPE on the app and not the ART items, the ART items are populated with a “0” - so we need to delete every row with a zero. We also need to drop rows containing NA.

ART <- ART %>% filter(Mood != 0) # deletes all rows where Mood has been rated 0
ART <- ART %>% drop_na() # deletes any rows with an NA

Of course we can combine all this into a pipe:

# or run a short pipe
ART <- data %>%
  select(c(1:11)) %>%
  filter(Mood != 0) %>%
  drop_na()

head(ART)
  ID       Date  Time Mood Health Tiredness Sleep Soreness Food School
1 51 2022-10-31 17:01    7      6         7     7        5    6      7
2 28 2022-10-31 17:01    7      7         7     7        7    7      7
3 12 2022-10-31 17:22    4      4         4     4        4    4      4
4 30 2022-10-31 17:56    5      6         4     5        6    5      5
5  3 2022-10-31 17:58    4      4         6     6        5    7      3
6 14 2022-10-31 18:01    6      3         4     3        5    7      1
  Hydration
1         7
2         7
3         4
4         6
5         7
6         6
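One thing worth checking at this step: drop_na() only removes real missing values (NA). If the app export stores the literal text “N/A”, those rows will survive. The sketch below uses a made-up toy data frame (not the real export) and assumes the scores have been read in as text; na_if() from dplyr converts the “N/A” text into a real NA first.

```r
library(dplyr)
library(tidyr)

# Toy data (invented for illustration): scores read in as text
toy <- data.frame(
  ID = c("1", "2", "3", "4"),
  Mood = c("7", "0", "N/A", NA),
  stringsAsFactors = FALSE
)

# drop_na() on its own only removes the real NA (row 4)
only_real_na_dropped <- toy %>% drop_na()

cleaned <- toy %>%
  mutate(Mood = na_if(Mood, "N/A")) %>%  # turn the text "N/A" into a real NA
  filter(Mood != "0") %>%                 # drop rows where Mood was left at 0
  drop_na()                               # drop rows containing an NA

cleaned
```

If your data already shows NA rather than “N/A” when you view it, the na_if() line is not needed.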

Do we need “Time” (column 3)?

If not, we can delete it the same way:

  ID       Date Mood Health Tiredness Sleep Soreness Food School Hydration
1 51 2022-10-31    7      6         7     7        5    6      7         7
2 28 2022-10-31    7      7         7     7        7    7      7         7
3 12 2022-10-31    4      4         4     4        4    4      4         4
4 30 2022-10-31    5      6         4     5        6    5      5         6
5  3 2022-10-31    4      4         6     6        5    7      3         7
6 14 2022-10-31    6      3         4     3        5    7      1         6
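The command that produced the output above is not shown. A minimal sketch of how it could be done, assuming the same select() approach with a minus sign, on a small made-up data frame (the values are only illustrative):

```r
library(dplyr)

# Toy data (invented for illustration)
toy <- data.frame(
  ID = c(51, 28),
  Date = c("2022-10-31", "2022-10-31"),
  Time = c("17:01", "17:01"),
  Mood = c(7, 7)
)

by_position <- toy %>% select(-c(3))  # drop the third column
by_name <- toy %>% select(-Time)      # drop the column called Time

names(by_name)
```

Dropping by name is a little safer than dropping by position, because it still works if the column order changes.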

Finally, should we consider our Likert scale scores as numeric variables? Probably. We can easily change these with mutate() and across():

ART <- ART %>%
  mutate(across(c(3:10), as.numeric))

head(ART)
  ID       Date Mood Health Tiredness Sleep Soreness Food School Hydration
1 51 2022-10-31    7      6         7     7        5    6      7         7
2 28 2022-10-31    7      7         7     7        7    7      7         7
3 12 2022-10-31    4      4         4     4        4    4      4         4
4 30 2022-10-31    5      6         4     5        6    5      5         6
5  3 2022-10-31    4      4         6     6        5    7      3         7
6 14 2022-10-31    6      3         4     3        5    7      1         6
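The same conversion can be written with column names instead of positions. This is a sketch on a made-up data frame where the scores start out as text; Mood:Sleep means “every column from Mood to Sleep”.

```r
library(dplyr)

# Toy data (invented for illustration): scores stored as text
toy <- data.frame(
  ID = c("51", "28"),
  Mood = c("7", "4"),
  Sleep = c("6", "5"),
  stringsAsFactors = FALSE
)

toy_num <- toy %>%
  mutate(across(Mood:Sleep, as.numeric))  # convert each selected column

str(toy_num)
```

One caution: if a column is a factor rather than text, as.numeric() returns the internal level codes, so convert with as.numeric(as.character(x)) in that case.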

Data visualization

Let’s visualize the data first to familiarize ourselves with it.

We will use ggplot to do this. Let’s start with a simple plot to look at “Mood”.

The first line of code gives us our canvas with Mood on the x axis:

ggplot(data = ART, aes(x = Mood))

Next let’s add a plot. I’ve started with geom_bar(), which gives us the count of each value. Most of the time players are in a reasonably good mood!

ggplot(data = ART, aes(x = Mood)) + 
  geom_bar()

We could use geom_boxplot(), which would give us the median and inter-quartile range plus any outliers (in this case we are interested in outliers, when players fall outside their normal ranges).

ggplot(data = ART, aes(x = Mood)) +
  geom_boxplot()

What about Soreness or School?

This is a useful snapshot, but it is also the total of lots of repeated data on players, so we might want to look at the data over different sessions. So let’s use both an x and a y axis:

ggplot(data = ART, aes(x = Mood, y = Date)) +
  geom_boxplot() 
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?

Okay so this doesn’t help us! Why does it look like this?

Date is a continuous variable and we need it to be a factor or categorical variable in this instance. So we can use mutate() again, this time to create a “Week” column with values like “Week 1”, “Week 2”, etc.

To do so we use week(Date) (from the lubridate package) to get the week number and paste the text “Week” in front of it:

ART<- ART %>%
  mutate(Week = factor(paste("Week", week(Date))))
head(ART)
  ID       Date Mood Health Tiredness Sleep Soreness Food School Hydration
1 51 2022-10-31    7      6         7     7        5    6      7         7
2 28 2022-10-31    7      7         7     7        7    7      7         7
3 12 2022-10-31    4      4         4     4        4    4      4         4
4 30 2022-10-31    5      6         4     5        6    5      5         6
5  3 2022-10-31    4      4         6     6        5    7      3         7
6 14 2022-10-31    6      3         4     3        5    7      1         6
     Week
1 Week 44
2 Week 44
3 Week 44
4 Week 44
5 Week 44
6 Week 44

ggplot(data = ART, aes(x = Mood, y = Week)) +
  geom_boxplot() 

So this is a bit more useful: we can see that while mood looks pretty stable throughout, there are one or two weeks where there are outliers. NOTE: this is far from the ideal way of doing this but it’s a start!
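One reason it is not ideal: factor() sorts its levels alphabetically, so “Week 10” would come before “Week 9”, and weeks after the new year would come before the autumn weeks. A possible fix, sketched on made-up dates, is to sort by Date and then use fct_inorder() from forcats so the levels follow the order in which they appear.

```r
library(dplyr)
library(lubridate)
library(forcats)

# Toy data (invented for illustration), deliberately out of date order
toy <- data.frame(
  Date = as.Date(c("2023-01-09", "2022-10-31", "2022-12-05")),
  Mood = c(6, 7, 5)
)

toy <- toy %>%
  arrange(Date) %>%                                        # oldest session first
  mutate(Week = fct_inorder(paste("Week", week(Date))))    # levels in order of appearance

levels(toy$Week)
```

This keeps the weeks in calendar order on the plot. It is only one option; plotting against the week start date would be another.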

Let’s finish this plot by tidying it up and making it look a bit neater:

First let’s fill the boxes with a color:

ggplot(data = ART, aes(x = Mood, y = Week, fill = Week)) +
  geom_boxplot() 

Nice but we don’t need the legend so we will get rid of that:

ggplot(data = ART, aes(x = Mood, y = Week, fill = Week)) +
  geom_boxplot()+
  theme(legend.position = "none")

Better, but let’s give the plot a nicer theme and add a title. It would also be useful to label the x axis from 1 to 7:

ggplot(data = ART, aes(x = Mood, y = Week, fill = Week)) +
  geom_boxplot() +
  scale_x_continuous(breaks = seq(1, 7, by = 1)) + # label every value from 1 to 7 on the x axis
  theme_classic() +
  theme(legend.position = "none") +
  labs(
    title = "Athlete Readiness to Train",  # plot title
    subtitle = "Mood across weeks",   # sub-title
    y = "" # y axis title - in this case leave it blank
  )

Remember rule 1: what is your question!

Can you think of some applied questions you might want to ask? Can you visualise the data to help?

Example:

Question: Can we identify players who may not be ready to train, those you need to go and speak to?

To do this we really need to understand a player’s normal score on these questions and their typical variation. We could crudely use a box plot to take a quick look, but this time across player IDs:

ggplot(data = ART, aes(x = Mood, y = ID, fill = ID)) +
  geom_boxplot() +
  theme_classic() + # you can play with different themes
  theme(legend.position = "none") +
  labs(
    title = "Athlete Readiness to Train",  # plot title
    subtitle = "Mood across players",   # sub-title
    y = "" # y axis title - in this case leave it blank
  )

What does this show us?

There isn’t a nice box plot for each player!

Why? They could be putting the same number every week or they might not have more than a couple of data entries? Or both!

So perhaps we need to find out how many observations we have for each player first:

Let’s steal a bit of code from “Part 2” to count factors:

count <- fct_count(ART$ID, sort = TRUE,
           prop = FALSE) 

view(count)
head(count)
# A tibble: 6 × 2
  f         n
  <fct> <int>
1 26       18
2 3        18
3 42       18
4 9        17
5 18       16
6 31       16

Okay, so we can see there are a few players who have not really engaged, so let’s filter these out of the data and keep only players who have at least 10 observations.

We can group by ID and then filter on n(), the number of observations in each group (>= 10):

ART %>%
  group_by(ID) %>%
  filter(n() >= 10) %>%
  ungroup() -> filtered_ART
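To see what group_by() followed by filter(n() >= ...) is doing, here is a small sketch on made-up data with a threshold of 3 instead of 10. The droplevels() step is my addition: it removes factor levels for players who no longer have any rows, which keeps later counts tidy.

```r
library(dplyr)

# Toy data (invented for illustration): player A has 3 rows, player B has 1
toy <- data.frame(
  ID = factor(c("A", "A", "A", "B")),
  Mood = c(6, 7, 6, 4)
)

kept <- toy %>%
  group_by(ID) %>%
  filter(n() >= 3) %>%         # n() is the number of rows in the current group
  ungroup() %>%
  mutate(ID = droplevels(ID))  # forget players that were filtered out

kept
```

Player B is removed as a whole group, because the test is applied once per player rather than once per row.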

We can now repeat our plot but on the filtered_ART data:

ggplot(data = filtered_ART, aes(x = Mood, y = ID, fill = ID)) +
  geom_boxplot() +
  theme_classic() + # you can play with different themes
  theme(legend.position = "none") +
  labs(
    title = "Athlete Readiness to Train",  # plot title
    subtitle = "Mood across players",   # sub-title
    y = "" # y axis title - in this case leave it blank
  )

Now we are getting a better picture of our data.

  • Most players are in a good mood when coming in to S&C sessions (apart from player 10).

  • However, there is a difference in how players vary week on week: some are always rating their Mood as good or very good and others are more variable.

We need to look at individuals over time and not the group over time.

Let’s plot this:

I’m going to plot Date and Mood and use a simple line graph. However, I’m going to use facet_wrap() to separate the plot out into different graphs for each player.

ggplot(data = filtered_ART, aes(x = Date, y = Mood)) +
  geom_line() +
  facet_wrap(~ID)

This is still quite crude but helpful:

  • Take player 53: a score of “4” is nothing to be concerned about but if player 11 was to give a score of 4 we would definitely want to go and chat to her!
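One way to put a number on “normal for this player” is to compare each score with that player’s own mean and standard deviation. The sketch below uses made-up scores, and the cut-off of one standard deviation below the mean is an arbitrary assumption for illustration, not a validated threshold.

```r
library(dplyr)

# Toy data (invented for illustration): two players who both report a 4 once
toy <- data.frame(
  ID = rep(c("P1", "P2"), each = 4),
  Mood = c(7, 7, 6, 4,   4, 5, 4, 4)
)

flagged <- toy %>%
  group_by(ID) %>%
  mutate(
    player_mean = mean(Mood),              # this player's typical score
    player_sd = sd(Mood),                  # this player's typical variation
    z = (Mood - player_mean) / player_sd,  # distance from their own normal
    flag = z <= -1                         # assumed cut-off: 1 SD below normal
  ) %>%
  ungroup()

flagged %>% filter(flag)
```

Only the 4 from P1 is flagged, because a 4 is unusual for P1 but ordinary for P2. Note that a player who always gives the same score has an SD of 0, so z cannot be calculated for them.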

We can plot this using a geom_point and adding a linear regression line + 95% CI using geom_smooth:

ggplot(data = filtered_ART, aes(x = Date, y = Mood)) +
  geom_point() +                    # Add scatter points
  geom_smooth(method = "lm", se = TRUE, color = "blue") +  # Add linear regression line
  labs(title = "Mood across time",
       x = "Date",
       y = "Mood") +
  theme_minimal() +
  scale_y_continuous(breaks = seq(1, 7, by = 1), limits = c(1, 8))+
  facet_wrap(~ID) 
`geom_smooth()` using formula = 'y ~ x'

This plot is useful as it enables us to see the most recent observation in relation to all others across a group of 36 players.
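If you want to see roughly what geom_smooth(method = "lm") is drawing, the same kind of line and 95% confidence band can be calculated with lm() and predict(). This is a sketch for one made-up player; the day column (days since the first session) is something I have added so the model has a plain numeric predictor.

```r
# Toy data (invented for illustration): six weekly sessions for one player
toy <- data.frame(
  Date = as.Date("2022-10-31") + 7 * 0:5,
  Mood = c(6, 7, 6, 5, 6, 5)
)
toy$day <- as.numeric(toy$Date - min(toy$Date))  # days since first session

fit <- lm(Mood ~ day, data = toy)  # straight line through the points

# Fitted value plus lower and upper 95% confidence limits at each session
ci <- predict(fit, newdata = toy, interval = "confidence", level = 0.95)

coef(fit)
head(ci)
```

The slope (the day coefficient) tells you the direction of the trend; with only a handful of sessions the confidence band will be wide, which is worth remembering when reading the facet plot.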

We might want to zoom in on an individual player more closely and we can filter the data by ID to do so within ggplot through a short pipe:

ART %>%
  filter(ID == "42") %>%
  ggplot(aes(x = Date, y = Mood)) +
  geom_point() +                    # Add scatter points
  geom_smooth(method = "lm", se = TRUE, color = "blue") +  # Add linear regression line
  labs(title = "Mood across time",
       x = "Date",
       y = "Mood") +
  theme_minimal() +
  scale_y_continuous(breaks = seq(1, 7, by = 1), limits = c(1, 8)) # sets the y axis limits and also tells ggplot to print numbers at 1, 2, 3, 4 ... 7 on the y axis. I like to see each number!
`geom_smooth()` using formula = 'y ~ x'

See how player 42 has two values on the same date (on three occasions) - this has to be an error.

You’ll see the extra value is always 4, but she does not ordinarily select 4 (in fact on no other occasion). We cannot be sure why, but it is possible (probable) another player has filled in the questionnaire as player 42 - probably because they’ve not engaged regularly and forgotten their player ID!!

Of course this is a guess - what to do with this data??
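Whatever you decide, it helps to find these duplicates with code rather than by eye. A minimal sketch on made-up data: group by player and date, then keep only the groups with more than one row.

```r
library(dplyr)

# Toy data (invented for illustration): player 42 has two entries on one date
toy <- data.frame(
  ID = c("42", "42", "42", "7"),
  Date = as.Date(c("2022-10-31", "2022-10-31", "2022-11-07", "2022-10-31")),
  Mood = c(7, 4, 6, 5)
)

dupes <- toy %>%
  group_by(ID, Date) %>%
  filter(n() > 1) %>%  # more than one entry for the same player on the same day
  ungroup()

dupes
```

This only lists the suspicious rows. Which of them to keep, if any, is still a judgement call.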

Finally, we could filter the data and combine different plots to get a good understanding of a player’s status. Here I have created a data frame called iART (for individual ART), filtered for player 47. I have then created a plot for each of Mood, Soreness, School and Hydration:

iART <- ART %>%
  filter(ID == "47")

m <- ggplot(data = iART, aes(x = Date, y = Mood)) +
  geom_point() +                    # Add scatter points
  geom_smooth(method = "lm", se = TRUE, color = "blue") +  # Add linear regression line
  labs(title = "Mood across time",
       x = "Date",
       y = "Mood") +
  theme_minimal() +
  scale_y_continuous(breaks = seq(1, 7, by = 1), limits = c(1, 8))

s <- ggplot(data = iART, aes(x = Date, y = Soreness)) +
  geom_point() +                    # Add scatter points
  geom_smooth(method = "lm", se = TRUE, color = "blue") +  # Add linear regression line
  labs(title = "Soreness across time",
       x = "Date",
       y = "Soreness") +
  theme_minimal() +
  scale_y_continuous(breaks = seq(1, 7, by = 1), limits = c(1, 8))

sch <- ggplot(data = iART, aes(x = Date, y = School)) +
  geom_point() +                    # Add scatter points
  geom_smooth(method = "lm", se = TRUE, color = "blue") +  # Add linear regression line
  labs(title = "School across time",
       x = "Date",
       y = "School") +
  theme_minimal() +
  scale_y_continuous(breaks = seq(1, 7, by = 1), limits = c(1, 8))

hyd <- ggplot(data = iART, aes(x = Date, y = Hydration)) +
  geom_point() +                    # Add scatter points
  geom_smooth(method = "lm", se = TRUE, color = "blue") +  # Add linear regression line
  labs(title = "Hydration across time",
       x = "Date",
       y = "Hydration") +
  theme_minimal() +
  scale_y_continuous(breaks = seq(1, 7, by = 1), limits = c(1, 8))

We can then combine these plots into one using ggarrange(). However, we need a new package for this (ggpubr), which you will need to install if you haven’t used it before.

library(ggpubr)

ggarrange(m,s,sch, hyd)
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

This plot might be useful in a player meeting. Are the trends in mood, for example, something to discuss?
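The four plot definitions differ only in the item being plotted, so the repeated code could be wrapped in a small function. This is a sketch: plot_item() is a hypothetical helper I have written for illustration (it is not part of ggplot2), and the data are made up. The .data[[item]] pronoun lets us pass the column name as text.

```r
library(ggplot2)

# plot_item() is a hypothetical helper: one scatter + trend plot per item
plot_item <- function(df, item) {
  ggplot(data = df, aes(x = Date, y = .data[[item]])) +
    geom_point() +                                           # scatter points
    geom_smooth(method = "lm", se = TRUE, color = "blue") +  # linear trend + 95% CI
    labs(title = paste(item, "across time"), x = "Date", y = item) +
    theme_minimal() +
    scale_y_continuous(breaks = seq(1, 7, by = 1), limits = c(1, 8))
}

# Toy data (invented for illustration): five weekly sessions for one player
toy <- data.frame(
  Date = as.Date("2022-10-31") + 7 * 0:4,
  Mood = c(6, 7, 6, 5, 6),
  Soreness = c(4, 5, 3, 4, 5)
)

# One plot per item, collected in a list
plots <- lapply(c("Mood", "Soreness"), function(item) plot_item(toy, item))
```

With ggpubr loaded, a list like this can be passed on with ggarrange(plotlist = plots). Adding another item is then just a matter of adding its name to the vector.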

TASKS

  1. Can you use mutate to create a total ART score (or a mean score)?

  2. Can you plot the association between items, e.g., are mood and sleep related? Can you run a correlation for this?

  3. Can you take the mean score and sd for these items for each player using group_by() and summarize()? Can you plot these?

Additional Challenges

  1. Can you create a box plot of the data with all items on the same plot? You might need to learn how to use pivot_longer() to do this.

  2. Can you run a linear mixed model and derive the between and within player variation in ART-Q responses?

  3. Can you find a novel way to present this data back to an athlete or coach?