Background

Technology has become part of the training routine in many sports, particularly high-performance sport. Athletes and coaches rely on technology to accurately quantify variables of interest that inform several aspects of training and competition.

Global positioning system (GPS) units are now part of virtually every competitive sport that seeks an edge in competition. Coaches and sport scientists are therefore keen for athletes to use these devices during training and, when allowed, during competition.

However, the technology market has grown fast, with several options to choose from. On the bright side, this generates competition between companies, and the devices keep gaining the features users need. At the same time, each company usually develops its own model for collecting and storing data and for outputting it to the customer, typically through proprietary software or cloud-based applications. This can cause some confusion for anyone who owns products from different companies, but a common feature is that most of them allow the user to export the data recorded by the devices as .csv files. This lets users tailor the data analysis to their needs. Therefore, this project aims to demonstrate a data-wrangling process for the output of the NK Speedcoach GPS Pro, using rowing data from several training sessions.

Scenario

A great friend of mine and an incredible rowing coach came to me to discuss an issue he was facing in training. To help me visualize the problem, he sent me a data set from the NK Speedcoach GPS Pro as a .csv file. The output file is shown in Figure 1 below:

Figure 1. Output of the .csv file exported from the NK Speedcoach GPS Pro

As you can see, this format is not “analysis friendly” and does not comply with tidy data principles. As such, I had to perform several steps to wrangle these data into a more analysis-friendly format.

Project Roadmap

Ask

Since the format of the original data needed to be changed, I had to answer the following questions:

  1. What is the final format of the file I need?
  2. How should I restructure the original file into the format I need for analysis?
  3. Can I add features to the data that will help me use it in the future for tasks other than this exploratory analysis?

With these questions in mind, I organized my thoughts and got ready to continue!

Prepare

The data is collected directly by two NK Speedcoach GPS Pro units. Each athlete uses a unit exclusively, without sharing it with other athletes. The NK system allows each training session's data to be exported as a .csv file for deeper analysis. The output format, however, is the problem this project sets out to solve. Since the data come from GPS units, any bias that might exist comes from equipment error and may be systematic.

The data was stored on a password-protected physical drive (an external hard drive), and I also created and stored a backup copy. I inspected the data visually and prepared it for sorting, filtering, and cleaning. Ensuring data integrity also enhances the quality of the interpretation of the outcomes of any future analysis.

Process

For this project, I will use R and RStudio because of the data-wrangling and visualization capabilities of these tools. I inspected the original .csv files to understand the format of the original output and to come up with solutions to wrangle the data they contain. All documentation has been done using R Markdown.

Analyse

Each of the eight files used in this project contains the data of an entire on-water training session. I need to reshape the original file into an organized table in which all variables are well defined, and to create additional variables that identify the date and time of the session, plus an ID variable in case I need to merge files in the future. All this information is already in the original files, though it is scattered around the document.

Let’s begin by setting up my working environment and loading the required packages for the task.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(readr)
library(psych)
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(snakecase)

After inspecting the .csv files, I confirmed that they all have the same format, with the information registered in the same cells (except for the per-stroke data: the longer the session, the more rows in those fields). Therefore, the code will be developed on one file and then tested on the others.
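Although I confirmed this by eye, the same check can be done programmatically. Below is a minimal sketch (my addition, not part of the original workflow) that assumes the exports sit in the working directory, follow the spdcoach_*.csv naming pattern, and that the label “Interval” marks the row where the per-stroke variable names begin in every file:

files <- list.files(pattern = "^spdcoach_.*\\.csv$") # Assumed naming pattern for the exports
name_row <- sapply(files, function(f) {
  meta <- read.csv(f, header = FALSE, nrows = 30)    # Only the metadata block is needed here
  which(meta$V1 == "Interval")[1]                    # Row where the per-stroke variable names start
})
name_row # If all exports share the same layout, this is the same row number for every file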

I will start by loading the file:

nksc_data <- read.csv("spdcoach_2586162_20220412_0533am.csv", header = FALSE) # header = FALSE because the first rows hold session metadata, not column names

Now, let’s view how the file was loaded into RStudio before beginning the data wrangling.

View(nksc_data)

As expected, the file was loaded into the working environment, though the data sit slightly differently from how they appeared when I inspected the file in Excel. Blank rows have been introduced into the per-stroke data, and the geo-positioning data has been placed on those blank rows. Therefore, the initial task will be to deal with that: fixing the per-stroke data.

nksc_data_fxd <- nksc_data[seq(28, nrow(nksc_data), 2), ] # Create a new table keeping every other row from row 28 on, where the per-stroke data begins, so the interleaved blank rows are dropped
nksc_data_fxd <- nksc_data_fxd[-c(14)] # Remove an unnecessary column
nksc_data_latlon <- nksc_data[seq(29, nrow(nksc_data), 2), ] # The interleaved rows hold the geo-positioning data
nksc_data_latlon <- nksc_data_latlon[-c(1:8, 11:14)] # Keep only the GPS latitude/longitude columns
nksc_data_cld <- cbind(nksc_data_fxd, nksc_data_latlon) # Combine the per-stroke data with its matching coordinates
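One detail worth noting: because the per-stroke rows and the coordinate rows are taken from alternating positions, the two slices can end up one row apart depending on whether the file has an even or odd number of rows, and cbind() would then fail. A quick sanity check (my addition, not part of the original script) catches this before combining:

stopifnot(nrow(nksc_data_fxd) == nrow(nksc_data_latlon)) # Both pieces must pair up row for row before cbind()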

Now that I have most of the data organized in a table, I will name the columns using the original variable names.

Here I created a vector with the variable names.

var_name1 <- nksc_data[24, 1:13] # Row 24 holds the names of the first 13 variables
var_name2 <- nksc_data[25, 9:10] # Row 25, columns 9 and 10, holds the GPS latitude/longitude names
var_name <- cbind(var_name1, var_name2)
print(var_name)
##          V1             V2             V3           V4          V5          V6
## 24 Interval Distance (GPS) Distance (IMP) Elapsed Time Split (GPS) Speed (GPS)
##             V7          V8          V9           V10                   V11
## 24 Split (IMP) Speed (IMP) Stroke Rate Total Strokes Distance/Stroke (GPS)
##                      V12        V13       V9      V10
## 24 Distance/Stroke (IMP) Heart Rate GPS Lat. GPS Lon.

Here I used the vector created above to name the variables.

names(nksc_data_cld) <- var_name
colnames(nksc_data_cld) <- to_snake_case(colnames(nksc_data_cld)) # For convenience and better system compatibility, convert all headers to snake_case
print(head(nksc_data_cld))
##    interval distance_gps distance_imp elapsed_time  split_gps speed_gps
## 28        1          1.5          0.0   00:00:07.4 01:23:20.0      0.10
## 30        1          4.1          0.0   00:00:09.9 00:15:43.4      0.53
## 32        1          7.1          0.0   00:00:12.3 00:08:00.8      1.04
## 34        1         10.7          0.0   00:00:14.7 00:06:21.7      1.31
## 36        1         15.1          0.0   00:00:17.3 00:05:26.8      1.53
## 38        1         18.5          0.0   00:00:19.3 00:04:57.6      1.68
##     split_imp speed_imp stroke_rate total_strokes distance_stroke_gps
## 28 00:00:00.0      0.00         8.0             1                 1.5
## 30 00:00:00.0      0.00        23.0             2                 2.6
## 32 00:00:00.0      0.00        26.0             3                 3.1
## 34 00:00:00.0      0.00        25.0             4                 3.5
## 36 00:00:00.0      0.00        23.0             5                 4.5
## 38 00:00:00.0      0.00        29.0             6                 3.4
##    distance_stroke_imp heart_rate     gps_lat     gps_lon
## 28                 0.0         84 -23.5556380 -46.7216440
## 30                 0.0         83 -23.5556173 -46.7216559
## 32                 0.0         85 -23.5555925 -46.7216687
## 34                 0.0         84 -23.5555648 -46.7216851
## 36                 0.0         83 -23.5555291 -46.7217050
## 38                 0.0         84 -23.5555029 -46.7217215
rownames(nksc_data_cld) <- NULL # Reset the row index
nksc_data_cld <- nksc_data_cld[-c(1132, 1250), ] # Remove two rows with noisy data
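The two row indices above were hard-coded after spotting the noisy strokes during inspection. If the same treatment were needed for other files, the suspect rows could instead be flagged programmatically. A minimal sketch, with purely hypothetical thresholds that would need to be agreed with the coach (the as.numeric() calls are needed because the columns are still stored as character at this point):

suspect <- which(as.numeric(nksc_data_cld$stroke_rate) > 60 |        # Stroke rates above ~60 spm are implausible for rowing
                 as.numeric(nksc_data_cld$distance_stroke_gps) <= 0) # A non-positive distance per stroke suggests a bad GPS reading
nksc_data_cld[suspect, ] # Inspect the flagged rows before deciding whether to drop them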

Then, I converted the columns, which were originally output as character, to time and numeric formats where appropriate.

# Convert character columns to time columns where appropriate
  library(chron) # The chron package is a dependency for this part of the script
## 
## Attaching package: 'chron'
## The following objects are masked from 'package:lubridate':
## 
##     days, hours, minutes, seconds, years
  nksc_data_cld$elapsed_time <- as.numeric(chron(times = nksc_data_cld$elapsed_time)) * 86400 # chron stores times as fractions of a day, so multiplying by 86400 gives elapsed time in seconds
  nksc_data_cld$split_gps <- chron(times = nksc_data_cld$split_gps)
  nksc_data_cld$split_imp <- chron(times = nksc_data_cld$split_imp)

  # Convert character columns to numeric columns where appropriate
  nksc_data_cld <- nksc_data_cld %>% 
  mutate_at(c("interval","distance_gps","distance_imp",
              "speed_gps","speed_imp","stroke_rate","total_strokes",
              "distance_stroke_gps","distance_stroke_imp","heart_rate",
              "gps_lat","gps_lon"), as.numeric)

Conclusion

Data processing is an important part of data analysis and data science. “Real-world” data is sometimes stored in a format that is difficult to analyse. The script developed in this project was formatted as a function so that it can be embedded in a Shiny app.
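That function is not reproduced here, but a minimal sketch of how the steps above could be wrapped and applied to all eight exports might look like this (the function name and file pattern are hypothetical, and the type conversions are left out for brevity):

wrangle_nksc <- function(path) {
  raw <- read.csv(path, header = FALSE)
  strokes <- raw[seq(28, nrow(raw), 2), -14]  # Per-stroke rows, dropping the unnecessary column
  latlon  <- raw[seq(29, nrow(raw), 2), 9:10] # Matching GPS latitude/longitude rows
  out <- cbind(strokes, latlon)
  names(out) <- to_snake_case(as.character(cbind(raw[24, 1:13], raw[25, 9:10]))) # Variable names from rows 24-25
  rownames(out) <- NULL
  out # Type conversions would follow here, as in the steps above
}
all_sessions <- purrr::map_dfr(list.files(pattern = "^spdcoach_.*\\.csv$"),
                               ~ mutate(wrangle_nksc(.x), source_file = .x)) # One tidy table with a file identifier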