Background

Technology has become part of the training routine in many sports, particularly high-performance sport. Athletes and coaches rely on technology to accurately quantify variables of interest that inform several aspects of training and competition.

Global positioning system (GPS) units are now part of virtually every competitive sport that seeks an edge in competition. Coaches and sport scientists are therefore keen for athletes to use these devices during training and, when allowed, during competition.

However, the technology market has grown fast, with several options to choose from. On the bright side, this generates competition between companies, and the devices keep gaining the features users need. At the same time, each company usually develops its own model for collecting and storing data and for outputting it to the customer, typically through proprietary software or cloud-based applications. This can cause some confusion for anyone who owns products from different companies, but a common feature is that most of them allow the user to export the data recorded by the devices as .csv files. This lets users tailor the data analysis to their needs. Therefore, this project aims to demonstrate a data-wrangling process for the output of the NK Speedcoach GPS Pro, using rowing data from several training sessions.

Scenario

A great friend of mine and an incredible rowing coach came to me to discuss an issue he was facing in training. To help me visualize the problem, he sent me a data set from the NK Speedcoach GPS Pro as a .csv file. The output file is shown in Figure 1 below:

Figure 1. Output of the .csv file exported from the NK Speedcoach GPS Pro

As you can see, this format is not “analysis friendly” and does not comply with tidy data principles. As such, I had to perform several steps to wrangle these data into a more analysis-friendly format.

Project Roadmap

Ask

Since the format of the original data needed to be changed, I had to answer the following questions:

  1. What is the final format of the file I need?
  2. How should I restructure the original file into the format I need for analysis?
  3. Can I add features to the data that will help me use it in the future for tasks other than this exploratory analysis?

With these questions in mind, I organized my thoughts and got ready to continue!

Prepare

The data is collected directly by two NK Speedcoach GPS Pro units. Each athlete uses a unit exclusively, without sharing it with other athletes. The NK system allows each training session's data to be exported as a .csv file for deeper analysis. The output format, however, is the problem this project sets out to solve. Since the data come from GPS units, any bias that might exist comes from equipment error and may be systematic.

The data was stored on a password-protected physical drive (an external hard drive), and I also created and stored a backup copy. I inspected the data visually and prepared it for sorting, filtering, and cleaning. Ensuring data integrity also enhances the quality of the interpretation of the outcomes of any future analysis.

Process

For this project, I will use R and RStudio because of the data-wrangling and visualization capabilities of these tools. I inspected the original .csv files to understand the format of the original output and to come up with solutions to wrangle the data they contain. All documentation has been done using R Markdown.

Analyse

Each of the eight files used in this project contains the data of an entire on-water training session. I need to reshape the original file into an organized table in which all variables are well defined, and to create additional variables that identify the date and time of the session, plus an ID variable in case I need to merge files in the future. All this information is already in the original files, though it is scattered around the document.

Let’s begin by setting up my working environment and loading the required packages for the task.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(readr)
library(psych)
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(snakecase)

After inspecting the .csv files, I confirmed that they all have the same format, with the information registered in the same cells (except for the per-stroke data: the longer the session, the more rows in those fields). Therefore, the code will be developed on one file and then tested on the others.
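Although I confirmed this by eye, the same check can be done programmatically. Below is a minimal sketch (my addition, not part of the original workflow) that assumes the exports sit in the working directory, follow the spdcoach_*.csv naming pattern, and that the label “Interval” marks the row where the per-stroke variable names begin in every file:

files <- list.files(pattern = "^spdcoach_.*\\.csv$") # Assumed naming pattern for the exports
name_row <- sapply(files, function(f) {
  meta <- read.csv(f, header = FALSE, nrows = 30)    # Only the metadata block is needed here
  which(meta$V1 == "Interval")[1]                    # Row where the per-stroke variable names start
})
name_row # If all exports share the same layout, this is the same row number for every file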

I will start by loading the file:

nksc_data <- read.csv("spdcoach_2586162_20220412_0533am.csv", header = FALSE) # header = FALSE because the first rows hold session metadata, not column names

Now, let’s view how the file was loaded into RStudio before beginning the data wrangling.

View(nksc_data)

As expected, the file was loaded into the working environment, though the data sit slightly differently from how they appeared when I inspected the file in Excel. Blank rows have been introduced into the per-stroke data, and the geo-positioning data has been placed on those blank rows. Therefore, the initial task will be to deal with that: fixing the per-stroke data.

nksc_data_fxd <- nksc_data[seq(28, nrow(nksc_data), 2), ] # Create a new table keeping every other row from row 28 on, where the per-stroke data begins, so the interleaved blank rows are dropped
nksc_data_fxd <- nksc_data_fxd[-c(14)] # Remove an unnecessary column
nksc_data_latlon <- nksc_data[seq(29, nrow(nksc_data), 2), ] # The interleaved rows hold the geo-positioning data
nksc_data_latlon <- nksc_data_latlon[-c(1:8, 11:14)] # Keep only the GPS latitude/longitude columns
nksc_data_cld <- cbind(nksc_data_fxd, nksc_data_latlon) # Combine the per-stroke data with its matching coordinates
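One detail worth noting: because the per-stroke rows and the coordinate rows are taken from alternating positions, the two slices can end up one row apart depending on whether the file has an even or odd number of rows, and cbind() would then fail. A quick sanity check (my addition, not part of the original script) catches this before combining:

stopifnot(nrow(nksc_data_fxd) == nrow(nksc_data_latlon)) # Both pieces must pair up row for row before cbind()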

Now that I have most of the data organized in a table, I will name the columns using the original variable names.

Here I created a vector with the variable names.

var_name1 <- nksc_data[24, 1:13] # Row 24 holds the names of the first 13 variables
var_name2 <- nksc_data[25, 9:10] # Row 25, columns 9 and 10, holds the GPS latitude/longitude names
var_name <- cbind(var_name1, var_name2)
print(var_name)
##          V1             V2             V3           V4          V5          V6
## 24 Interval Distance (GPS) Distance (IMP) Elapsed Time Split (GPS) Speed (GPS)
##             V7          V8          V9           V10                   V11
## 24 Split (IMP) Speed (IMP) Stroke Rate Total Strokes Distance/Stroke (GPS)
##                      V12        V13       V9      V10
## 24 Distance/Stroke (IMP) Heart Rate GPS Lat. GPS Lon.

Here I used the vector created above to name the variables.

names(nksc_data_cld) <- var_name
colnames(nksc_data_cld) <- to_snake_case(colnames(nksc_data_cld)) # For convenience and better system compatibility, convert all headers to snake_case
print(head(nksc_data_cld))
##    interval distance_gps distance_imp elapsed_time  split_gps speed_gps
## 28        1          1.5          0.0   00:00:07.4 01:23:20.0      0.10
## 30        1          4.1          0.0   00:00:09.9 00:15:43.4      0.53
## 32        1          7.1          0.0   00:00:12.3 00:08:00.8      1.04
## 34        1         10.7          0.0   00:00:14.7 00:06:21.7      1.31
## 36        1         15.1          0.0   00:00:17.3 00:05:26.8      1.53
## 38        1         18.5          0.0   00:00:19.3 00:04:57.6      1.68
##     split_imp speed_imp stroke_rate total_strokes distance_stroke_gps
## 28 00:00:00.0      0.00         8.0             1                 1.5
## 30 00:00:00.0      0.00        23.0             2                 2.6
## 32 00:00:00.0      0.00        26.0             3                 3.1
## 34 00:00:00.0      0.00        25.0             4                 3.5
## 36 00:00:00.0      0.00        23.0             5                 4.5
## 38 00:00:00.0      0.00        29.0             6                 3.4
##    distance_stroke_imp heart_rate     gps_lat     gps_lon
## 28                 0.0         84 -23.5556380 -46.7216440
## 30                 0.0         83 -23.5556173 -46.7216559
## 32                 0.0         85 -23.5555925 -46.7216687
## 34                 0.0         84 -23.5555648 -46.7216851
## 36                 0.0         83 -23.5555291 -46.7217050
## 38                 0.0         84 -23.5555029 -46.7217215
rownames(nksc_data_cld) <- NULL # Reset the row index
nksc_data_cld <- nksc_data_cld[-c(1132, 1250), ] # Remove two rows with noisy data
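The two row indices above were hard-coded after spotting the noisy strokes during inspection. If the same treatment were needed for other files, the suspect rows could instead be flagged programmatically. A minimal sketch, with purely hypothetical thresholds that would need to be agreed with the coach (the as.numeric() calls are needed because the columns are still stored as character at this point):

suspect <- which(as.numeric(nksc_data_cld$stroke_rate) > 60 |        # Stroke rates above ~60 spm are implausible for rowing
                 as.numeric(nksc_data_cld$distance_stroke_gps) <= 0) # A non-positive distance per stroke suggests a bad GPS reading
nksc_data_cld[suspect, ] # Inspect the flagged rows before deciding whether to drop them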

Then, I converted the columns, which were originally output as character, to time and numeric formats where appropriate.

# Convert character columns to time columns where appropriate
  library(chron) # The chron package is a dependency for this part of the script
## 
## Attaching package: 'chron'
## The following objects are masked from 'package:lubridate':
## 
##     days, hours, minutes, seconds, years
  nksc_data_cld$elapsed_time <- as.numeric(chron(times = nksc_data_cld$elapsed_time)) * 86400 # chron stores times as fractions of a day, so multiplying by 86400 gives elapsed time in seconds
  nksc_data_cld$split_gps <- chron(times = nksc_data_cld$split_gps)
  nksc_data_cld$split_imp <- chron(times = nksc_data_cld$split_imp)

  # Convert character columns to numeric columns where appropriate
  nksc_data_cld <- nksc_data_cld %>% 
  mutate_at(c("interval","distance_gps","distance_imp",
              "speed_gps","speed_imp","stroke_rate","total_strokes",
              "distance_stroke_gps","distance_stroke_imp","heart_rate",
              "gps_lat","gps_lon"), as.numeric)

Conclusion

Data processing is an important part of data analysis and data science. “Real-world” data is sometimes stored in a format that is difficult to analyse. The script developed in this project was formatted as a function so that it can be embedded in a Shiny app.
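That function is not reproduced here, but a minimal sketch of how the steps above could be wrapped and applied to all eight exports might look like this (the function name and file pattern are hypothetical, and the type conversions are left out for brevity):

wrangle_nksc <- function(path) {
  raw <- read.csv(path, header = FALSE)
  strokes <- raw[seq(28, nrow(raw), 2), -14]  # Per-stroke rows, dropping the unnecessary column
  latlon  <- raw[seq(29, nrow(raw), 2), 9:10] # Matching GPS latitude/longitude rows
  out <- cbind(strokes, latlon)
  names(out) <- to_snake_case(as.character(cbind(raw[24, 1:13], raw[25, 9:10]))) # Variable names from rows 24-25
  rownames(out) <- NULL
  out # Type conversions would follow here, as in the steps above
}
all_sessions <- purrr::map_dfr(list.files(pattern = "^spdcoach_.*\\.csv$"),
                               ~ mutate(wrangle_nksc(.x), source_file = .x)) # One tidy table with a file identifier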