A Basic Template for Data Science Projects

Introduction

This vignette describes the basic steps of a data science project, from getting data through to communicating results. The workflow serves as a template and uses deliberately basic code that can be extended in any direction and to any level of complexity. The vignette is aimed at beginners, so the code chunks are kept simple to show how the steps build on each other.

Step 0: Getting R ready for the task

While the task is entirely achievable in base R, I prefer the tidyverse packages. They are consistent with each other, provide strong tools for wrangling and visualising data, and come with many freely available resources.

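library(here)       # project-relative file paths
library(tidyverse)  # readr, dplyr, ggplot2 and related packages
library(dbplyr)     # database backend for dplyr
library(lubridate)  # date parsing, e.g. dmy()
library(ggplot2)    # already attached by tidyverse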

If any of the above packages is not yet installed, it can be installed with install.packages(pkgs, dependencies = TRUE), where pkgs is a character vector of package names.
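
To install only what is missing, a minimal sketch along the following lines could be used. The helper name missing_packages is hypothetical and introduced here purely for illustration; the install step is guarded so that it only runs in an interactive session.

```r
# Packages used in this vignette
pkgs <- c("here", "tidyverse", "dbplyr", "lubridate", "ggplot2")

# Hypothetical helper: return the packages that are not yet installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)

# Install only if something is missing, and only when run interactively
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install, dependencies = TRUE)
}
```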

Helpful cheat sheets for the packages can be found at: https://rstudio.com/resources/cheatsheets/

Step 1: Getting Data

I picked a random question that a data scientist could be asked.

Is the gold price related to COVID-19 infections?

After a few minutes of googling, I found two CSV files at https://www.perthmint.com/historical_metal_prices.aspx and https://www.covid19data.com.au/, downloaded them to the data folder of my local R project and read them into objects using read_csv.

# Read the COVID-19 csv file

covid_raw <- read_csv(here::here("data", "COVID_AU_national_cumulative.csv"))

# Read the Perth Mint Spot Price csv file

gold_raw <- read_csv(here::here("data", "gold-Current.csv"), skip = 4) # skip the first 4 lines

# New column names

colnames(gold_raw) <- c("date","PMSP_1", "PMSP_2", "PMSP_3", "PMSP_4", "PMSP_5", "PMSP_6", "PMSP_7", "PMSP_8", "PMSP_9", "PMSP_10", "PMSP_11", "PMSP_12", "PMSP_13", "PMSP_14", "PMSP_15", "PMSP_16", "PMSP_17", "PMSP_18", "LR")

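The csv file with the Perth Mint Spot Market Prices uses 4 rows to describe the variables. This layout does not match the tidy format that the tidyverse expects (a single header row, one variable per column, one observation per row), and reading it as-is causes problems when the file is parsed into a tibble. To solve this, I chose to skip those rows and assign new names to the variables.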
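
As a hedged alternative, the names can be passed to read_csv directly through col_names, so that no row after the skipped lines is consumed as a header. The sketch below uses a small made-up string in place of the real file; the layout, column names and values are assumptions for illustration only, and reading literal data with I() assumes readr 2.0 or later.

```r
library(readr)

# Toy stand-in for a file with four descriptive rows above the data
# (hypothetical layout and values, not the real Perth Mint file)
toy_csv <- paste(
  "Perth Mint spot prices",
  "All prices per troy ounce",
  ",Gold,Gold",
  "Date,AUD AM,AUD PM",
  "03/02/2020,2350.10,2361.50",
  "04/02/2020,2342.80,2339.00",
  sep = "\n"
)

# skip drops the four descriptive rows; col_names supplies the header,
# so the first remaining line is read as data
toy_raw <- read_csv(
  I(toy_csv),
  skip = 4,
  col_names = c("date", "am", "pm"),
  col_types = "cdd"  # character, double, double
)

toy_raw
```

With the default col_names = TRUE, the first line after the skipped rows is used as the header instead, so it is worth checking with view that no observation was lost.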

More information is available in: Wickham, H., 2014. Tidy data. Journal of Statistical Software, 59(10), pp.1-23.

Step 2: Tidying Data

As the view function shows, the data is not yet ready to be passed to a model. I use select to choose the variables of interest, mutate to convert them to the correct types for the model, and filter to keep the observations we want to investigate.

# Select Perth Mint Spot Price Data

gold_tidy <- gold_raw %>%
  select(date, PMSP_16) %>%
  mutate(
    date = dmy(date),                # text in day/month/year order to Date
    PMSP_16 = as.numeric(PMSP_16)    # price column to numeric
  ) %>%
  filter(date >= as.Date("2020-02-01"), date <= as.Date("2020-07-31"))

# Select COVID-19 Data

covid_tidy <- covid_raw %>%
  select(date, confirmed) %>%
  filter(date >= as.Date("2020-02-01"), date <= as.Date("2020-07-31"))
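
Both dmy and as.numeric return NA for values they cannot convert, so it can be worth counting how many values were lost in the conversion. The following is a minimal sketch on made-up values; the vectors raw_dates and raw_prices are hypothetical and not taken from the real files.

```r
library(lubridate)

# Toy values in day/month/year style (hypothetical, not from the real file)
raw_dates  <- c("03/02/2020", "04/02/2020", "not a date")
raw_prices <- c("2350.10", "n/a", "2339.00")

# dmy() returns NA where parsing fails; quiet = TRUE suppresses the warning
dates <- dmy(raw_dates, quiet = TRUE)

# as.numeric() also returns NA (with a warning) for non-numeric text
prices <- suppressWarnings(as.numeric(raw_prices))

# Count the values lost in each conversion
sum(is.na(dates))
sum(is.na(prices))
```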

Now the two data sets are ready to be merged, which we do with a left_join on the shared date column.

# Merge Data 

mod_data <- left_join(covid_tidy, gold_tidy, by = "date")

I have broken Step 2 into three sub-steps, each assigned to its own object, to make the code more readable. With the view function, the reader can inspect each object to see what the individual functions did to the data sets.
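
A left join keeps every row of the first table and fills in NA where the second table has no matching date. The sketch below illustrates this on made-up data (the toy tables and their values are hypothetical), assuming, as seems plausible here, that case counts exist for every day while prices exist only for some days.

```r
library(dplyr)

# Toy data (hypothetical values): daily case counts, prices on some days only
covid_toy <- tibble(
  date = as.Date(c("2020-02-07", "2020-02-08", "2020-02-09", "2020-02-10")),
  confirmed = c(15, 15, 15, 15)
)
gold_toy <- tibble(
  date = as.Date(c("2020-02-07", "2020-02-10")),
  PMSP_16 = c(2340.5, 2352.0)
)

# Naming the key avoids relying on the automatic choice of join columns
joined <- left_join(covid_toy, gold_toy, by = "date")

# Every row of the left table is kept; unmatched dates get NA
nrow(joined)
sum(is.na(joined$PMSP_16))

# anti_join() lists the left-hand rows without a match
anti_join(covid_toy, gold_toy, by = "date")
```

Rows with a missing price are dropped later by lm, so counting them at this stage makes the size of the modelling data set explicit.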

Step 3: Visualisation

I used ggplot to draw a scatter plot with geom_point and added a simple linear regression line with geom_smooth.

# Visualising Data

ggplot(mod_data, aes(x = confirmed, y = PMSP_16)) +
  geom_point() +
  geom_smooth(method = "lm", size = 0.5, se = FALSE) +
  ylab("Perth Mint Spot Price Gold in AUD") +
  xlab("Cumulative COVID-19 Confirmed") +
  ggtitle("COVID Cases and Gold Price", subtitle = "01/02/2020 to 31/07/2020")

Step 4: Modelling

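With the lm function, we can run a simple linear regression on our data. In the formula used here, confirmed ~ PMSP_16, the cumulative confirmed COVID-19 cases in Australia are the response variable and the Perth Mint Spot Market Price for gold in AUD is the independent variable. This is the reverse of the axes in the plot above; in a simple linear regression, swapping the two variables changes the coefficients but not the R-squared or the p-value of the slope. I assign the result to a new object and inspect it with summary.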

# Regression

mod_stats <- lm(confirmed ~ PMSP_16, data = mod_data)

summary(mod_stats)

## 
## Call:
## lm(formula = confirmed ~ PMSP_16, data = mod_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6428  -2371   -380   2718   7373 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -53287.987   7066.803  -7.541 7.58e-12 ***
## PMSP_16         22.920      2.741   8.362 8.98e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3370 on 128 degrees of freedom
##   (52 observations deleted due to missingness)
## Multiple R-squared:  0.3533, Adjusted R-squared:  0.3482 
## F-statistic: 69.92 on 1 and 128 DF,  p-value: 8.978e-14
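
Individual numbers can also be pulled out of a fitted model instead of being read off the printed summary. The sketch below uses R's built-in cars data purely so that it runs on its own; it is not the data from this vignette. It also checks that, in a simple linear regression, swapping response and predictor leaves the R-squared and the slope p-value unchanged.

```r
# Built-in example data (cars), used only so the sketch runs on its own
fit_a <- lm(dist ~ speed, data = cars)
fit_b <- lm(speed ~ dist, data = cars)

# Intercept and slope of the first model
coef(fit_a)

# R-squared of each model
r2_a <- summary(fit_a)$r.squared
r2_b <- summary(fit_b)$r.squared

# Slope p-values from the coefficient tables
p_a <- summary(fit_a)$coefficients["speed", "Pr(>|t|)"]
p_b <- summary(fit_b)$coefficients["dist", "Pr(>|t|)"]

# Swapping response and predictor changes the coefficients,
# but not R-squared or the slope p-value
all.equal(r2_a, r2_b)
all.equal(p_a, p_b)
```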

More information can be found in: Bruce, P., Bruce, A., & Gedeck, P. (2020). Practical statistics for data scientists: 50+ essential concepts using R and Python. (Second edition.). O’Reilly Media, Inc.

Step 5: Conclusion/Communication

At this point no definitive conclusion can be drawn, so I use this section to outline possible next actions a data scientist might consider to answer the question we posed as an example for this exercise.

As the visualisation shows, there are two anomalous areas: one right at the beginning and one where the cumulative number of cases reached approximately 7,000. We know from the news that not enough tests were available initially, and that the number of daily new cases was low before the second wave. In both periods the gold price fluctuated even though the cumulative case count changed little, which could explain the anomalies. Together with an R-squared of about 0.35, this suggests that the gold price is not influenced by COVID-19 cases alone.

What next? Is a multiple linear regression the answer? A different set of data (e.g. new infections per day)? This is where the real work begins.
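
As one possible starting point, daily new infections can be derived from the cumulative counts by subtracting the previous day's total. This is a minimal sketch on made-up numbers; the toy table and the column name new_cases are assumptions for illustration.

```r
library(dplyr)

# Toy cumulative counts (hypothetical values)
covid_toy <- tibble(
  date = as.Date("2020-03-01") + 0:4,
  confirmed = c(10, 12, 12, 17, 25)
)

# Daily new cases = today's cumulative total minus yesterday's;
# the first day has no previous value and therefore gets NA
covid_daily <- covid_toy %>%
  arrange(date) %>%
  mutate(new_cases = confirmed - lag(confirmed))

covid_daily$new_cases
```

A multiple linear regression would then add further predictors to the formula passed to lm, for example PMSP_16 ~ confirmed + new_cases.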

Whatever the data scientist decides to investigate next, if a project is structured in a similar way to this template, any or all of the above steps can be extended or refined with little effort.