The beginning of a data science project is about understanding the business problem to be tackled.
Then we need to get in contact with the data technicians to identify the available data. This is followed by a data phase where we work with the technicians to access and ingest the data into R. This puts us in a position to start a journey of discovery of new insights driven by data, getting in touch with the data in the context of the business problem.
# Required packages for session
library(FSelector) # Feature selection: information.gain().
library(dplyr) # Data wrangling, glimpse() and tbl_df().
library(ggplot2) # Visualize data.
library(lubridate) # Dates and time.
library(randomForest) # Impute missing values with na.roughfix().
library(rattle) # weatherAUS data and normVarNames().
library(readr) # Efficient reading of CSV data.
library(scales) # Format comma().
library(stringi) # String concat operator %s+%.
library(stringr) # String operations.
library(tibble) # Convert row names into a column.
library(tidyr) # Prepare a tidy dataset, gather().
library(magrittr) # Pipes %>%, %T>% and equals(), extract().
First we need to get our data into R. Practically any source format is supported through the many packages available.
The weatherAUS dataset from rattle comes in CSV format. We need to identify and record the location of the dataset to analyze, which can be a location on the internet or a local file.
We can save the internet location of the file as a string of characters in a variable, in this case dspath.
# Identify the source location of the dataset.
"http://rattle.togaware.com/weatherAUS.csv" dspath <-
Now we need to read the dataset into the memory of the computer. This can be achieved using readr::read_csv(). The function returns a data frame; because we use the readr package from the tidyverse, this is actually an enhanced data frame known as a tibble, a table data frame. The table is composed of rows that are observations and columns that are variables. Again, to store the data frame we use a variable, in this case weatherAUS.
# Ingest the dataset.
weatherAUS <- rattle::weatherAUS
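Since the prose mentions readr::read_csv(), an equivalent ingest directly from the URL recorded in dspath would be the sketch below, replacing the chunk above. Note that read_csv() guesses its own column types, so the result can differ slightly (for example character rather than factor columns) from the copy bundled with rattle.
# Alternative ingest straight from the source URL.
weatherAUS <- read_csv(dspath)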
To get a reusable template for data wrangling we create a reference to the original dataset using a generic variable, in this case ds
, short for dataset.
ds <- weatherAUS
Modifying ds will not affect the original weatherAUS dataset. R avoids making unnecessary copies of datasets, so until the copy is modified it is only a reference; once we start making changes, R uses extra memory to store the new dataset.
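A minimal way to observe this copy-on-modify behaviour, assuming an R build with memory profiling enabled (as the CRAN binaries are), is base::tracemem(), which reports when a real copy of the object is made:
# Report when ds is actually copied.
tracemem(ds)
ds$MinTemp[1] <- ds$MinTemp[1] # the first modification triggers the copy
untracemem(ds)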
Next we record the name of the dataset and a generic reference to it. The use of the function get() here is a convenient trick to place the original dataset reference into our ds variable. It is included simply as an example for template use.
# Prepare the dataset for usage with template
"weatherAUS"
dsname <- get(dsname) ds <-
Using generic variables has advantages but requires care. A disadvantage is that we may be working with several datasets referenced by the same generic variable ds, so the naming of datasets needs to be managed carefully.
Once the data is loaded we seek a basic understanding of its shape: the size of the dataset in terms of observations (rows) and variables (columns). Because the data is stored as a tidyverse tibble, simply typing the variable name that holds the dataset presents a useful summary.
# Print the dataset
ds
The function dim() provides the dimension information, observations and variables, while nrow() and ncol() give the same information separately. We use rattle::comcat() to format numbers with commas.
# Basic size information
dim(ds) %>% comcat()
## 176,747 24
# Observations
nrow(ds) %>% comcat()
## 176,747
# Variables
ncol(ds) %>% comcat()
## 24
A useful alternative for an initial insight into the dataset is tibble::glimpse(), which plays a similar role to base::summary().
# Quick view of dataset contents
glimpse(ds)
## Rows: 176,747
## Columns: 24
## $ Date <date> 2008-12-01, 2008-12-02, 2008-12-03, 2008-12-04, 2008...
## $ Location <chr> "Albury", "Albury", "Albury", "Albury", "Albury", "Al...
## $ MinTemp <dbl> 13.4, 7.4, 12.9, 9.2, 17.5, 14.6, 14.3, 7.7, 9.7, 13....
## $ MaxTemp <dbl> 22.9, 25.1, 25.7, 28.0, 32.3, 29.7, 25.0, 26.7, 31.9,...
## $ Rainfall <dbl> 0.6, 0.0, 0.0, 0.0, 1.0, 0.2, 0.0, 0.0, 0.0, 1.4, 0.0...
## $ Evaporation <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Sunshine <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ WindGustDir <ord> W, WNW, WSW, NE, W, WNW, W, W, NNW, W, N, NNE, W, SW,...
## $ WindGustSpeed <dbl> 44, 44, 46, 24, 41, 56, 50, 35, 80, 28, 30, 31, 61, 4...
## $ WindDir9am <ord> W, NNW, W, SE, ENE, W, SW, SSE, SE, S, SSE, NE, NNW, ...
## $ WindDir3pm <ord> WNW, WSW, WSW, E, NW, W, W, W, NW, SSE, ESE, ENE, NNW...
## $ WindSpeed9am <dbl> 20, 4, 19, 11, 7, 19, 20, 6, 7, 15, 17, 15, 28, 24, 4...
## $ WindSpeed3pm <dbl> 24, 22, 26, 9, 20, 24, 24, 17, 28, 11, 6, 13, 28, 20,...
## $ Humidity9am <int> 71, 44, 38, 45, 82, 55, 49, 48, 42, 58, 48, 89, 76, 6...
## $ Humidity3pm <int> 22, 25, 30, 16, 33, 23, 19, 19, 9, 27, 22, 91, 93, 43...
## $ Pressure9am <dbl> 1007.7, 1010.6, 1007.6, 1017.6, 1010.8, 1009.2, 1009....
## $ Pressure3pm <dbl> 1007.1, 1007.8, 1008.7, 1012.8, 1006.0, 1005.4, 1008....
## $ Cloud9am <int> 8, NA, NA, NA, 7, NA, 1, NA, NA, NA, NA, 8, 8, NA, NA...
## $ Cloud3pm <int> NA, NA, 2, NA, 8, NA, NA, NA, NA, NA, NA, 8, 8, 7, NA...
## $ Temp9am <dbl> 16.9, 17.2, 21.0, 18.1, 17.8, 20.6, 18.1, 16.3, 18.3,...
## $ Temp3pm <dbl> 21.8, 24.3, 23.2, 26.5, 29.7, 28.9, 24.6, 25.5, 30.2,...
## $ RainToday <fct> No, No, No, No, No, No, No, No, No, Yes, No, Yes, Yes...
## $ RISK_MM <dbl> 0.0, 0.0, 0.0, 1.0, 0.2, 0.0, 0.0, 0.0, 1.4, 0.0, 2.2...
## $ RainTomorrow <fct> No, No, No, No, No, No, No, No, Yes, No, Yes, Yes, Ye...
Next we review the variables included in the dataset.
names(ds)
## [1] "Date" "Location" "MinTemp" "MaxTemp"
## [5] "Rainfall" "Evaporation" "Sunshine" "WindGustDir"
## [9] "WindGustSpeed" "WindDir9am" "WindDir3pm" "WindSpeed9am"
## [13] "WindSpeed3pm" "Humidity9am" "Humidity3pm" "Pressure9am"
## [17] "Pressure3pm" "Cloud9am" "Cloud3pm" "Temp9am"
## [21] "Temp3pm" "RainToday" "RISK_MM" "RainTomorrow"
A useful convention is to map all variable names to lowercase. R is case sensitive, so the normalised names are different variable names as far as R is concerned. We use rattle::normVarNames() to make a reasonable attempt at converting the variable names of a dataset into a preferred standard form.
When we change the variable names within the dataset a new copy of the dataset is created, so that ds and weatherAUS now refer to different datasets in memory.
# Original names before normalising
names(ds)
## [1] "Date" "Location" "MinTemp" "MaxTemp"
## [5] "Rainfall" "Evaporation" "Sunshine" "WindGustDir"
## [9] "WindGustSpeed" "WindDir9am" "WindDir3pm" "WindSpeed9am"
## [13] "WindSpeed3pm" "Humidity9am" "Humidity3pm" "Pressure9am"
## [17] "Pressure3pm" "Cloud9am" "Cloud3pm" "Temp9am"
## [21] "Temp3pm" "RainToday" "RISK_MM" "RainTomorrow"
# Normalise the variable names
names(ds) %>% normVarNames() -> names(ds)
names(ds)
## [1] "date" "location" "min_temp" "max_temp"
## [5] "rainfall" "evaporation" "sunshine" "wind_gust_dir"
## [9] "wind_gust_speed" "wind_dir_9am" "wind_dir_3pm" "wind_speed_9am"
## [13] "wind_speed_3pm" "humidity_9am" "humidity_3pm" "pressure_9am"
## [17] "pressure_3pm" "cloud_9am" "cloud_3pm" "temp_9am"
## [21] "temp_3pm" "rain_today" "risk_mm" "rain_tomorrow"
glimpse(ds)
## Rows: 176,747
## Columns: 24
## $ date <date> 2008-12-01, 2008-12-02, 2008-12-03, 2008-12-04, 20...
## $ location <chr> "Albury", "Albury", "Albury", "Albury", "Albury", "...
## $ min_temp <dbl> 13.4, 7.4, 12.9, 9.2, 17.5, 14.6, 14.3, 7.7, 9.7, 1...
## $ max_temp <dbl> 22.9, 25.1, 25.7, 28.0, 32.3, 29.7, 25.0, 26.7, 31....
## $ rainfall <dbl> 0.6, 0.0, 0.0, 0.0, 1.0, 0.2, 0.0, 0.0, 0.0, 1.4, 0...
## $ evaporation <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ sunshine <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ wind_gust_dir <ord> W, WNW, WSW, NE, W, WNW, W, W, NNW, W, N, NNE, W, S...
## $ wind_gust_speed <dbl> 44, 44, 46, 24, 41, 56, 50, 35, 80, 28, 30, 31, 61,...
## $ wind_dir_9am <ord> W, NNW, W, SE, ENE, W, SW, SSE, SE, S, SSE, NE, NNW...
## $ wind_dir_3pm <ord> WNW, WSW, WSW, E, NW, W, W, W, NW, SSE, ESE, ENE, N...
## $ wind_speed_9am <dbl> 20, 4, 19, 11, 7, 19, 20, 6, 7, 15, 17, 15, 28, 24,...
## $ wind_speed_3pm <dbl> 24, 22, 26, 9, 20, 24, 24, 17, 28, 11, 6, 13, 28, 2...
## $ humidity_9am <int> 71, 44, 38, 45, 82, 55, 49, 48, 42, 58, 48, 89, 76,...
## $ humidity_3pm <int> 22, 25, 30, 16, 33, 23, 19, 19, 9, 27, 22, 91, 93, ...
## $ pressure_9am <dbl> 1007.7, 1010.6, 1007.6, 1017.6, 1010.8, 1009.2, 100...
## $ pressure_3pm <dbl> 1007.1, 1007.8, 1008.7, 1012.8, 1006.0, 1005.4, 100...
## $ cloud_9am <int> 8, NA, NA, NA, 7, NA, 1, NA, NA, NA, NA, 8, 8, NA, ...
## $ cloud_3pm <int> NA, NA, 2, NA, 8, NA, NA, NA, NA, NA, NA, 8, 8, 7, ...
## $ temp_9am <dbl> 16.9, 17.2, 21.0, 18.1, 17.8, 20.6, 18.1, 16.3, 18....
## $ temp_3pm <dbl> 21.8, 24.3, 23.2, 26.5, 29.7, 28.9, 24.6, 25.5, 30....
## $ rain_today <fct> No, No, No, No, No, No, No, No, No, Yes, No, Yes, Y...
## $ risk_mm <dbl> 0.0, 0.0, 0.0, 1.0, 0.2, 0.0, 0.0, 0.0, 1.4, 0.0, 2...
## $ rain_tomorrow <fct> No, No, No, No, No, No, No, No, Yes, No, Yes, Yes, ...
From this summary we can see the variable names, data types and the first few values of each variable. We can see date (date), character (chr) and numeric (dbl) data types.
Now we start to ask questions of the data: are the date values a sequence of days, what other locations are listed, are the data types of the variables correct? We also see missing values, and we want to know about the distributions of the numeric variables.
# Review first observations
head(ds)
# Review last observations
tail(ds)
The date variable starts in December 2008 and runs until April 2020.
Some variables are identified as characters. These are often categoric variables and are represented in R as the factor data type, which lets algorithms handle them specially. Where the character data takes a limited number of possible values we convert the variable from character into factor (categoric) so as to take advantage of this special handling. Each value of a factor is called a level.
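As a minimal illustration with a small made-up vector (not from the dataset), converting characters to a factor records the distinct values as levels:
# Characters become a factor with a fixed set of levels.
yn <- c("no", "yes", "no")
factor(yn) %>% levels()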
Our first character variable is location. We can use base::unique() to confirm the number of locations.
# How many locations are in the dataset
ds$location %>%
  unique() %>%
  length()
## [1] 49
Because we may not know in general what other locations we will come across in related datasets we might want to retain the variable as character.
# Converting location to factor
ds$location %>% as.factor() -> ds$location
We can review the distribution of locations with base::table()
.
table(ds$location)
##
## Adelaide Albany Albury AliceSprings
## 3832 3679 3680 3680
## BadgerysCreek Ballarat Bendigo Brisbane
## 3632 3680 3671 3833
## Cairns Canberra Cobar CoffsHarbour
## 3680 4076 3649 3649
## Dartmoor Darwin GoldCoast Hobart
## 3649 3833 3680 3833
## Katherine Launceston Melbourne MelbourneAirport
## 2218 3680 3833 3649
## Mildura Moree MountGambier MountGinini
## 3649 3649 3679 3680
## Newcastle Nhil NorahHead NorfolkIsland
## 3680 2218 3644 3649
## Nuriootpa PearceRAAF Penrith Perth
## 3648 3648 3679 3832
## PerthAirport Portland Richmond Sale
## 3648 3649 3649 3649
## SalmonGums Sydney SydneyAirport Townsville
## 3602 3984 3649 3680
## Tuggeranong Uluru WaggaWagga Walpole
## 3679 2218 3649 3645
## Watsonia Williamtown Witchcliffe Wollongong
## 3649 3649 3648 3680
## Woomera
## 3649
Now we look at rain_today
and rain_tomorrow
distributions.
# Review the distribution of observations across levels
ds %>%
  select(starts_with("rain_")) %>%
  sapply(table)
## rain_today rain_tomorrow
## No 135371 135353
## Yes 37058 37077
We confirm that No
and Yes
are the only values these two variables have and so it makes sense to convert them both to factors.
# Names of the rain variables
ds %>%
  select(starts_with("rain_")) %>%
  names() %>%
  print() ->
vnames
## [1] "rain_today" "rain_tomorrow"
# Confirm these are currently character variables
ds[vnames] %>% sapply(class) # They are factors but we will work as if characters
## rain_today rain_tomorrow
## "factor" "factor"
# Converting variables from character to factor
ds[vnames] %>%
  lapply(factor) %>%
  data.frame() %>%
  as_tibble() %>% # tbl_df() is deprecated
  sapply(class)
## rain_today rain_tomorrow
## "factor" "factor"
# Verifying the distribution has not changed
ds %>%
  select(starts_with("rain_")) %>%
  sapply(table)
## rain_today rain_tomorrow
## No 135371 135353
## Yes 37058 37077
We also have three wind direction variables, wind_gust_dir, wind_dir_9am and wind_dir_3pm, which are likewise treated as characters.
ds %>%
  select(contains("_dir")) %>%
  sapply(table)
## wind_gust_dir wind_dir_9am wind_dir_3pm
## N 10989 13978 10475
## NNE 7937 9782 8002
## NE 8715 9335 10092
## ENE 9965 9592 9605
## E 11071 11237 10123
## ESE 9055 9536 10290
## SE 11331 11398 12919
## SSE 10946 10954 11089
## S 11043 10519 11788
## SSW 10809 9272 9902
## SW 10793 10135 11166
## WSW 11136 8392 11700
## W 12122 10183 12411
## WNW 10045 9067 10846
## NW 9705 10488 10315
## NNW 7954 9468 9358
Ordinarily the values of a character variable would be listed alphabetically, but these variables have an intrinsic compass order, so we indicate the order of the levels explicitly.
# Levels ordered
c("N", "NNE", "NE", "ENE",
compass <-"E", "ESE", "SE", "SSE",
"S", "SSW", "SW", "WSW",
"W", "WNW", "NW", "NNW")
# Note the names of wind direction
ds %>%
  select(contains("_dir")) %>%
  names() %>%
  print() ->
vnames
## [1] "wind_gust_dir" "wind_dir_9am" "wind_dir_3pm"
# Confirm that these are characters
ds[vnames] %>% sapply(class) # We assume they are characters
## wind_gust_dir wind_dir_9am wind_dir_3pm
## [1,] "ordered" "ordered" "ordered"
## [2,] "factor" "factor" "factor"
# Convert characters to factors
ds[vnames] %>%
  lapply(factor, levels=compass, ordered=TRUE) %>%
  data.frame() %>%
  as_tibble() %>%
  sapply(class)
## wind_gust_dir wind_dir_9am wind_dir_3pm
## [1,] "ordered" "ordered" "ordered"
## [2,] "factor" "factor" "factor"
# Distribution
ds %>%
  select(contains("_dir")) %>%
  sapply(table)
## wind_gust_dir wind_dir_9am wind_dir_3pm
## N 10989 13978 10475
## NNE 7937 9782 8002
## NE 8715 9335 10092
## ENE 9965 9592 9605
## E 11071 11237 10123
## ESE 9055 9536 10290
## SE 11331 11398 12919
## SSE 10946 10954 11089
## S 11043 10519 11788
## SSW 10809 9272 9902
## SW 10793 10135 11166
## WSW 11136 8392 11700
## W 12122 10183 12411
## WNW 10045 9067 10846
## NW 9705 10488 10315
## NNW 7954 9468 9358
Two other variables that may arrive as characters are evaporation and sunshine; these have many missing values.
# Note the variable names
c("evaporation", "sunshine")
cvars <-
# Review variables
head(ds[cvars])
sample_n(ds[cvars], 6)
readr::read_csv() inspects only a subset of the rows to determine the data types, so when a variable begins with a long run of missing values it can be parsed as character.
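Had that happened here, readr lets us widen the guessing window or declare the types up front; a sketch, assuming the dspath URL from earlier (ds_raw is a hypothetical name for the re-read data):
# Guess column types from more rows.
ds_raw <- read_csv(dspath, guess_max=10000)
# Or state the types explicitly for the troublesome columns.
ds_raw <- read_csv(dspath, col_types=cols(Evaporation=col_double(), Sunshine=col_double()))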
# Checking variable class
ds[cvars] %>% sapply(class) # read_csv() actually determined they are numeric
## evaporation sunshine
## "numeric" "numeric"
# Convert to numeric
ds[cvars] %>% sapply(as.numeric) -> ds[cvars]
# Confirm conversion
ds[cvars] %>% sapply(class)
## evaporation sunshine
## "numeric" "numeric"
We have now dealt with all of the character variables in the dataset, converting them to factors or numerics case by case based on our understanding of the data.
This can take time so we need to be selective.
# Note which variables are categoric
ds %>%
  sapply(is.factor) %>%
  which() %>%
  print() ->
catc
## location wind_gust_dir wind_dir_9am wind_dir_3pm rain_today
## 2 8 10 11 22
## rain_tomorrow
## 24
# Normalise the levels of all categoric variables
for (v in catc)
levels(ds[[v]]) %<>% normVarNames()
# Review normalizing
glimpse(ds[catc])
## Rows: 176,747
## Columns: 6
## $ location <fct> albury, albury, albury, albury, albury, albury, albur...
## $ wind_gust_dir <ord> w, wnw, wsw, ne, w, wnw, w, w, nnw, w, n, nne, w, sw,...
## $ wind_dir_9am <ord> w, nnw, w, se, ene, w, sw, sse, se, s, sse, ne, nnw, ...
## $ wind_dir_3pm <ord> wnw, wsw, wsw, e, nw, w, w, w, nw, sse, ese, ene, nnw...
## $ rain_today <fct> no, no, no, no, no, no, no, no, no, yes, no, yes, yes...
## $ rain_tomorrow <fct> no, no, no, no, no, no, no, no, yes, no, yes, yes, ye...
Many data mining tasks can be expressed as building classification models. For such models we want to ensure the target is categoric. It might be recorded as 0/1 and hence loaded as numeric. In such cases we could tell the model algorithm to explicitly perform classification, or wrap the target in base::as.factor() within the formula. Nonetheless it is generally cleaner to do the conversion here; note that the following code has no effect if the target is already categoric.
# Note the target variable
"rain_tomorrow"
target <-
# Ensure the target is categoric
ds[[target]] %<>% as.factor()
# Confirm the distribution
ds[target] %>% table()
## .
## no yes
## 135353 37077
ds %>%
  ggplot(aes_string(x=target)) +
  geom_bar(width=0.2, fill="light blue") +
  theme(text=element_text(size=14)) +
  scale_y_continuous(labels=comma) +
  labs(title   = "Distribution of Rain Tomorrow",
       x       = "Rain Tomorrow",
       y       = "Count",
       caption = "Source: weatherAUS")
We now have a basic understanding of the dataset and its content; we have identified data types and cleaned the variable names. Next we identify the roles played by the variables within the dataset.
# Note the available variables
ds %>%
  names() %>%
  print() ->
vars
## [1] "date" "location" "min_temp" "max_temp"
## [5] "rainfall" "evaporation" "sunshine" "wind_gust_dir"
## [9] "wind_gust_speed" "wind_dir_9am" "wind_dir_3pm" "wind_speed_9am"
## [13] "wind_speed_3pm" "humidity_9am" "humidity_3pm" "pressure_9am"
## [17] "pressure_3pm" "cloud_9am" "cloud_3pm" "temp_9am"
## [21] "temp_3pm" "rain_today" "risk_mm" "rain_tomorrow"
By this stage in the project we will usually have identified a business problem that is the focus of attention. In our case we will assume it is to build a predictive analytics model to predict the chance of it raining tomorrow given today's weather observations. The variable rain_tomorrow is then the target variable, and the dataset we have is a training dataset of historic observations. The task in model building is to identify any patterns among the observed variables that suggest that it rains the following day.
We also move the target variable to be the first in the vector of variables recorded in vars
. This is common practice where the first variable in the dataset is the target variable and the remainder are the variables used to build a model. Another common practice is for the target to be the last variable in the dataset.
# Place the target variable at the beginning of the vars
c(target, vars) %>%
unique() %>%
print() ->
vars
## [1] "rain_tomorrow" "date" "location" "min_temp"
## [5] "max_temp" "rainfall" "evaporation" "sunshine"
## [9] "wind_gust_dir" "wind_gust_speed" "wind_dir_9am" "wind_dir_3pm"
## [13] "wind_speed_9am" "wind_speed_3pm" "humidity_9am" "humidity_3pm"
## [17] "pressure_9am" "pressure_3pm" "cloud_9am" "cloud_3pm"
## [21] "temp_9am" "temp_3pm" "rain_today" "risk_mm"
We use base::unique() to remove the original occurrence of the target from the vars vector.
Another variable related to the outcome rather than to today's observations is risk_mm. From the business context this records the amount of rain that fell "tomorrow". We refer to it as a risk variable: a measure of the impact (or risk) of the outcome we are predicting (whether it rains tomorrow). It is an output variable, not an independent variable, and so should not be used as an input to the modeling.
# Note risk variable
"risk_mm" risk <-
The date and location variables act as identifiers: given a date and a location we have one observation of the remaining variables. Identifiers are not usually used as independent variables in predictive analytics models.
# Note identifiers
c("date", "location") id <-
Now we want to identify variables that are irrelevant or inappropriate for modelling.
First we need to ignore the identifiers and the risk variable. We can join our id and risk variables into a single vector of variables to ignore in our model.
# Initialize ignored variables: identifiers and risk
id %>%
  union(if (exists("risk")) risk) %>%
  print() ->
ignore
## [1] "date" "location" "risk_mm"
We might also check for variables that have a unique value for every observation. These are often identifiers and should be ignored.
We begin by defining a helper function that, given a vector of data, returns the number of unique values found in that vector.
# Helper function to count number of distinct values
count_unique <- function(x) {
  x %>% unique() %>% length()
}
# Heuristic for candidate identifiers to possibly ignore
ds[vars] %>%
  sapply(count_unique) %>%
  equals(nrow(ds)) %>%
  which() %>%
  names() %>%
  print() ->
ids
## character(0)
# Add them to the variables to be ignored for modeling
ignore <- union(ignore, ids) %>% print()
## [1] "date" "location" "risk_mm"
Here we count the unique values of each variable and compare the counts with the number of rows in the dataset. Any variable whose count equals the number of observations is collected into the ids vector, which is then added to the ignore vector.
No further identifier variables are found.
Next we remove variables that have only missing values.
# Helper function to count the number of values missing
count_na <- function(x) {
  x %>% is.na() %>% sum()
}
# Identify variables with only missing values
ds[vars] %>%
  sapply(count_na) %>%
  equals(nrow(ds)) %>%
  which() %>%
  names() %>%
  print() ->
missing
## character(0)
# Add these variables to the ignored vector
ignore %<>% union(missing) %>% print()
## [1] "date" "location" "risk_mm"
It is also helpful to identify variables which are very sparse, that is, have mostly missing values. We can decide on a threshold for the proportion missing above which to ignore the variable as unlikely to add much value to the analysis. For example, we may want to ignore variables with more than 80% of their values missing.
# Identify our threshold
missing.threshold <- 0.8
# Identify variables that are mostly missing
ds[vars] %>%
  sapply(count_na) %>%
  '>'(missing.threshold*nrow(ds)) %>%
  which() %>%
  names() %>%
  print() ->
mostly
## character(0)
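Had any variables crossed the threshold we would add them to the ignore vector, following the same pattern as the earlier checks:
# Add the mostly-missing variables to the variables to be ignored
ignore <- union(ignore, mostly) %>% print()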
Another issue we often come across in our datasets is factors that have very many levels. We might want to ignore such variables (or perhaps group their levels appropriately). Here we simply identify them and add them to the list of variables to ignore.
# Helper function to count the number of levels
count_levels <- function(x) {
  ds %>% extract2(x) %>% levels() %>% length()
}
# Identify a threshold above which a factor has too many levels
levels.threshold <- 20
# Identify variables that have too many levels
ds[vars] %>%
  sapply(is.factor) %>%
  which() %>%
  names() %>%
  sapply(count_levels) %>%
  ">="(levels.threshold) %>%
  which() %>%
  names() %>%
  print() ->
too.many
## [1] "location"
# Add them to the variables to be ignored
ignore <- union(ignore, too.many) %>% print()
## [1] "date" "location" "risk_mm"
We ignore variables with constant values, because generally they add no extra information to the analysis.
# Helper function to test if all values in vector are the same
all_same <- function(x) {
  all(x == x[1L])
}
# Identify variables that have a single value
ds[vars] %>%
  sapply(all_same) %>%
  which() %>%
  names() %>%
  print() ->
constants
## character(0)
# Add them to ignored variables
ignore <- union(ignore, constants) %>% print()
## [1] "date" "location" "risk_mm"
Once we identify all variables to ignore we remove them from our list of variables to use.
# Check the number of variables currently
length(vars)
## [1] 24
# Remove the variables to ignore
vars %>% setdiff(ignore) %>% print() -> vars
## [1] "rain_tomorrow" "min_temp" "max_temp" "rainfall"
## [5] "evaporation" "sunshine" "wind_gust_dir" "wind_gust_speed"
## [9] "wind_dir_9am" "wind_dir_3pm" "wind_speed_9am" "wind_speed_3pm"
## [13] "humidity_9am" "humidity_3pm" "pressure_9am" "cloud_9am"
## [17] "cloud_3pm" "rain_today"
# Confirm they are ignored
length(vars)
## [1] 18
To further prepare our dataset we use the package FSelector, which provides functions to identify subsets of variables that might be most effective for modeling. We can use this (and other packages) to assist us in reducing the variables for modeling. FSelector::cfs() identifies a subset of variables to consider for use in modeling using correlation and entropy.
# Construct the formulation of the modeling to undertake
form <- formula(target %s+% " ~ .") %>% print()
## rain_tomorrow ~ .
# Use correlation search to identify key variables
cfs(form, ds[vars])
## [1] "rainfall" "sunshine" "humidity_3pm" "cloud_3pm" "rain_today"
The stringi operator %s+% concatenates strings; we use it to produce a formula indicating that we will model our target variable using all of the other variables of the dataset. A second example uses FSelector::information.gain(), which ranks the variables by importance and so suggests a useful subset of variables.
# Use information gain to identify variable importance
information.gain(form, ds[vars]) %>%
rownames_to_column("variable") %>%
arrange(-attr_importance)
Both methods identify the same variables as the most important for modeling.
When working with missing values we should analyse whether there is any pattern to the missing targets, as this may indicate a systematic data issue rather than simple randomness. Understanding why data is systematically missing gives us a better understanding of the data and how it was collected.
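One way to look for such a pattern, a quick sketch using dplyr, is to compare the rate of missing targets across locations:
# Proportion of observations with a missing target, per location (sketch).
ds %>%
  group_by(location) %>%
  summarise(prop_missing=mean(is.na(rain_tomorrow))) %>%
  arrange(desc(prop_missing)) %>%
  head()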
# Check dimensions of dataset
dim(ds)
## [1] 176747 24
# Identify observations with missing target
missing_target <- ds %>% extract2(target) %>% is.na()
# Check how many are found
sum(missing_target)
## [1] 4317
# Remove observations with a missing target
ds %>% filter(!missing_target) -> ds
# Confirm that filter delivered the expected dataset
dim(ds)
## [1] 172430 24
Some algorithms can deal with missing values in the input variables, but this is not always the case, so we need to address the situation. In previous steps we removed variables with many missing values, noting that this is not always appropriate. Alternatively we might impute missing values (also noting that this is not always wise, since we are inventing data).
Next we illustrate a process for imputing missing values using randomForest::na.roughfix()
.
# Backup the dataset
ods <- ds
# Count number of missing values
ds[vars] %>% is.na() %>% sum() %>% comcat()
## 399,165
# Impute missing values
ds[vars] %>% na.roughfix() -> ds[vars]
# Confirm that no missing values remain
ds[vars] %>% is.na() %>% sum() %>% comcat()
## 0
# Restore original dataset
ds <- ods
An alternative might be to remove the observations that have missing values. Here we use stats::na.omit() to identify the rows to omit, based on the vars to be included for modeling.
# Backup the dataset
ods <- ds
# Initialise the list of observations to be removed
omit <- NULL
# Review current dataset
ds[vars] %>% nrow() %>% comcat()
## 172,430
ds[vars] %>% is.na() %>% sum() %>% comcat()
## 399,165
# Identify observations with missing values
mo <- attr(na.omit(ds[vars]), "na.action")
# Record the observations to omit
omit <- union(omit, mo)
# If there are observations to omit, remove them
if (length(omit)) ds <- ds[-omit,]
# Confirm the observations have been removed
ds[vars] %>% nrow() %>% comcat()
## 61,820
ds[vars] %>% is.na() %>% sum()
## [1] 0
# Restore original dataset
ds   <- ods
omit <- NULL
Another important task for the data scientist is to create new features from the provided data where appropriate. This process of feature creation is iterative; we come back to it many times during the life cycle of a project, identifying new features as we gain insights from the data and from modeling.
The date variable lets us derive features that may be useful for model building. Here we add two derived features to the dataset: year and season. The year is simply extracted from date using base::format(). To compute the season we extract the month, convert it with base::as.integer(), and use it as an index into base::switch() to map each month to a season.
ds %>%
  mutate(year   = factor(format(date, "%Y")),
         season = format(date, "%m") %>%
                  as.integer() %>%
                  sapply(function(x)
                    switch(x,
                           "summer", "summer", "autumn",
                           "autumn", "autumn", "winter",
                           "winter", "winter", "spring",
                           "spring", "spring", "summer")) %>%
                  as.factor()) -> ds
ds %>%
  select(date, year, season) %>%
  sample_n(10)
# Introduce variables according to roles
%>% c("season") -> vars
vars %<>% c("year") id
Another option for developing new features for our model is to include model-generated features. A common one is to cluster observations, or some aggregation of observations, into similar groups. We then uniquely identify each group and add this grouping as another column or variable within the dataset.
We will illustrate the process with a cluster analysis over the numeric variables within the dataset. The aim of the clustering is to group the locations so that within a group the locations have similar values for the numeric variables, while between groups the numeric variables are more dissimilar. The traditional clustering algorithm is stats::kmeans().
We identify the number of clusters that we wish to create, storing it in the variable NCLUST. We then select the numc (numeric) variables from the dataset ds, dplyr::group_by() the location and dplyr::summarise_all() the variables. We then remove the location column, leaving just the numeric variables, and rescale them so that all values across the different variables are in the same range. The rescaling is required for clustering so as not to introduce bias from very differently scaled variables.
# Set seed for repeatability
set.seed(7465)
# Cluster the numeric data per location
NCLUST <- 5
c("location",numc)] %>%
ds[ group_by(location) %>%
summarise_all(funs(mean(., na.rm = TRUE))) %T>%
{locations <<- .$location} %>% #Store locations for later
select(-location) %>%
sapply(function(x) ifelse(is.na(x), 0, x)) %>%
as.data.frame %>%
sapply(scale) %>%
kmeans(NCLUST) %>%
print() %>%
extract2("cluster") ->
cluster
## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## K-means clustering with 5 clusters of sizes 16, 10, 3, 11, 9
##
## Cluster means:
## min_temp max_temp rainfall evaporation sunshine wind_gust_speed
## 1 -0.6329818 -0.7092251 -0.32589491 -0.03078092 -0.1201087 0.01980454
## 2 0.1230244 0.4891127 -0.63922651 0.74038988 1.0093003 -0.03001963
## 3 0.9467065 2.0565398 -0.67744951 0.61779897 -0.3928548 0.25915800
## 4 -0.5023530 -0.4665302 0.04592559 -1.17176045 -1.0679563 -0.33590190
## 5 1.2870255 0.6020765 1.45930564 0.45828488 0.5283132 0.32230784
## wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
## 1 -0.03453387 0.1735126 0.56565701 0.2964073 0.2957966
## 2 0.02110433 -0.4786273 -0.88108262 -0.9351112 0.2975354
## 3 0.25535647 -0.4108368 -2.26102789 -1.8714636 0.2874347
## 4 -0.58496983 -0.6115458 0.61788107 0.3863261 -1.0177971
## 5 0.66778859 1.1077317 -0.02814379 0.6637110 0.2917071
## pressure_3pm cloud_9am cloud_3pm temp_9am temp_3pm
## 1 0.2973469 0.91256343 0.88638514 -0.7210094 -0.6963267
## 2 0.2967594 -0.14604672 -0.04024751 0.2536996 0.5060387
## 3 0.2817550 0.06818767 0.08541570 1.4656647 2.0890292
## 4 -1.0170817 -1.41005922 -1.45077310 -0.4975308 -0.4895005
## 5 0.2908320 0.24061564 0.21361889 1.1194444 0.5775844
##
## Clustering vector:
## [1] 2 1 1 3 4 1 1 2 5 1 2 5 4 5 5 1 3 1 1 1 2 2 1 4 4 4 4 5 1 2 4 2 2 1 1 1 4 5
## [39] 5 5 4 3 2 4 1 5 4 1 2
##
## Within cluster sum of squares by cluster:
## [1] 96.11806 48.27105 27.46643 160.62506 56.41959
## (between_SS / total_SS = 49.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
head(cluster)
## [1] 2 1 1 3 4 1
# A more up-to-date version, using across() in place of the deprecated funs()
# Set seed for repeatability
set.seed(7465)
# Cluster the numeric data per location
NCLUST <- 5
c("location",numc)] %>%
ds[ group_by(location) %>%
summarise(across(.fns = ~mean(., na.rm = TRUE)),.groups = "drop") %T>%
{locations <<- .$location} %>% #Store locations for later
select(-location) %>%
sapply(function(x) ifelse(is.na(x),0,x)) %>%
as.data.frame() %>%
sapply(scale) %>%
kmeans(NCLUST) %>%
print() %>%
extract2("cluster") ->
cluster
## K-means clustering with 5 clusters of sizes 16, 10, 3, 11, 9
##
## Cluster means:
## min_temp max_temp rainfall evaporation sunshine wind_gust_speed
## 1 -0.6329818 -0.7092251 -0.32589491 -0.03078092 -0.1201087 0.01980454
## 2 0.1230244 0.4891127 -0.63922651 0.74038988 1.0093003 -0.03001963
## 3 0.9467065 2.0565398 -0.67744951 0.61779897 -0.3928548 0.25915800
## 4 -0.5023530 -0.4665302 0.04592559 -1.17176045 -1.0679563 -0.33590190
## 5 1.2870255 0.6020765 1.45930564 0.45828488 0.5283132 0.32230784
## wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
## 1 -0.03453387 0.1735126 0.56565701 0.2964073 0.2957966
## 2 0.02110433 -0.4786273 -0.88108262 -0.9351112 0.2975354
## 3 0.25535647 -0.4108368 -2.26102789 -1.8714636 0.2874347
## 4 -0.58496983 -0.6115458 0.61788107 0.3863261 -1.0177971
## 5 0.66778859 1.1077317 -0.02814379 0.6637110 0.2917071
## pressure_3pm cloud_9am cloud_3pm temp_9am temp_3pm
## 1 0.2973469 0.91256343 0.88638514 -0.7210094 -0.6963267
## 2 0.2967594 -0.14604672 -0.04024751 0.2536996 0.5060387
## 3 0.2817550 0.06818767 0.08541570 1.4656647 2.0890292
## 4 -1.0170817 -1.41005922 -1.45077310 -0.4975308 -0.4895005
## 5 0.2908320 0.24061564 0.21361889 1.1194444 0.5775844
##
## Clustering vector:
## [1] 2 1 1 3 4 1 1 2 5 1 2 5 4 5 5 1 3 1 1 1 2 2 1 4 4 4 4 5 1 2 4 2 2 1 1 1 4 5
## [39] 5 5 4 3 2 4 1 5 4 1 2
##
## Within cluster sum of squares by cluster:
## [1] 96.11806 48.27105 27.46643 160.62506 56.41959
## (between_SS / total_SS = 49.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
head(cluster)
## [1] 2 1 1 3 4 1
Each location now has an associated cluster stored in the cluster vector; we attach the locations as the vector's base::names(). We then dplyr::mutate() the dataset to add the new cluster number, indexed by the location of each observation.
# Index the cluster vector with the appropriate locations
names(cluster) <- locations
# Add cluster to the dataset
ds %<>% mutate(cluster = "area" %>%
                 paste(cluster[ds$location]) %>%
                 as.factor())
# Check clusters
ds %>% select(location, cluster) %>% sample_n(10)
# Record variables in appropriate role
%<>% c("cluster")
vars
# Checking that clusters look good
cluster[levels(ds$location)] %>% sort()
## albany albury ballarat bendigo
## 1 1 1 1
## canberra hobart launceston melbourne
## 1 1 1 1
## melbourne_airport mount_gambier nuriootpa portland
## 1 1 1 1
## richmond sale watsonia wollongong
## 1 1 1 1
## adelaide brisbane cobar mildura
## 2 2 2 2
## moree pearce_raaf perth perth_airport
## 2 2 2 2
## wagga_wagga woomera alice_springs katherine
## 2 2 3 3
## uluru badgerys_creek dartmoor mount_ginini
## 3 4 4 4
## newcastle nhil norah_head penrith
## 4 4 4 4
## salmon_gums tuggeranong walpole witchcliffe
## 4 4 4 4
## cairns coffs_harbour darwin gold_coast
## 5 5 5 5
## norfolk_island sydney sydney_airport townsville
## 5 5 5 5
## williamtown
## 5
cluster %>% sort()
## albany albury ballarat bendigo
## 1 1 1 1
## canberra hobart launceston melbourne
## 1 1 1 1
## melbourne_airport mount_gambier nuriootpa portland
## 1 1 1 1
## richmond sale watsonia wollongong
## 1 1 1 1
## adelaide brisbane cobar mildura
## 2 2 2 2
## moree pearce_raaf perth perth_airport
## 2 2 2 2
## wagga_wagga woomera alice_springs katherine
## 2 2 3 3
## uluru badgerys_creek dartmoor mount_ginini
## 3 4 4 4
## newcastle nhil norah_head penrith
## 4 4 4 4
## salmon_gums tuggeranong walpole witchcliffe
## 4 4 4 4
## cairns coffs_harbour darwin gold_coast
## 5 5 5 5
## norfolk_island sydney sydney_airport townsville
## 5 5 5 5
## williamtown
## 5
Metadata is data about the data. We now record data about our dataset that we will use later in further processing and analysis.
We have identified variable roles such as the target, the risk and the ignored variables. Now we identify the model inputs, or independent variables, recording them both as a vector of characters (variable names) and a vector of integers (variable indices).
# Noting variable names
vars %>%
  setdiff(target) %>%
  print() ->
inputs
## [1] "min_temp" "max_temp" "rainfall" "evaporation"
## [5] "sunshine" "wind_gust_dir" "wind_gust_speed" "wind_dir_9am"
## [9] "wind_dir_3pm" "wind_speed_9am" "wind_speed_3pm" "humidity_9am"
## [13] "humidity_3pm" "pressure_9am" "cloud_9am" "cloud_3pm"
## [17] "rain_today" "season" "cluster"
The integer indices are determined from the base::names()
of the variables in the original dataset. Note the use of USE.NAMES= from base::sapply()
to turn off the inclusion of names in the resulting vector to keep the result as a simple vector.
# Noting variable indices
inputs %>%
  sapply(function(x) which(x == names(ds)), USE.NAMES=FALSE) %>%
  print() ->
inputi
## [1] 3 4 5 6 7 8 9 10 11 12 13 14 15 16 18 19 22 26 27
# Recording number of observations
ds %>%
  nrow() %T>%
  comcat() ->
nobs
## 172,430
# Confirm various subset sizes
dim(ds) %>% comcat()
## 172,430 27
dim(ds[vars]) %>% comcat()
## 172,430 20
dim(ds[inputs]) %>% comcat()
## 172,430 19
dim(ds[inputi]) %>% comcat()
## 172,430 19
For later handling we identify the numeric and the categoric variables separately. Some cluster algorithms, for example, only accept numeric variables. Again we identify them both with a character vector containing their names and an integer vector containing their indices. Note that when we use the index we assume the variables remain in the same order within the dataset and that all variables are present.
# Identify numeric variables by index
ds %>%
  sapply(is.numeric) %>%
  which() %>%
  intersect(inputi) %>%
  print() ->
numi
## [1] 3 4 5 6 7 9 12 13 14 15 16 18 19
# Identify the numeric variables by name
ds %>%
  names() %>%
  extract(numi) %>%
  print() ->
numc
## [1] "min_temp" "max_temp" "rainfall" "evaporation"
## [5] "sunshine" "wind_gust_speed" "wind_speed_9am" "wind_speed_3pm"
## [9] "humidity_9am" "humidity_3pm" "pressure_9am" "cloud_9am"
## [13] "cloud_3pm"
# Identify categoric variables by index
ds %>%
  sapply(is.factor) %>%
  which() %>%
  intersect(inputi) %>%
  print() ->
cati
## [1] 8 10 11 22 26 27
# Identify the categoric variables by name
ds %>%
  names() %>%
  extract(cati) %>%
  print() ->
catc
## [1] "wind_gust_dir" "wind_dir_9am" "wind_dir_3pm" "rain_today"
## [5] "season" "cluster"
Our end goal is to build a model based on the data we have prepared. A model will capture knowledge about the world that the data represents. The final two tasks are to specify the form of the model and to identify the actual observations from which the model is to be built. For the latter we partition the dataset into subsets.
A formula is used to identify what it is that we will model from the supplied data. Typically we identify a target variable which we will model based on other input variables. We can construct a stats::formula()
automatically from a dataset if the first column is the target variable and the remaining ones are the inputs. Our vars
vector already has this format.
ds[vars] %>%
  formula() %>%
  print() ->
form
## rain_tomorrow ~ min_temp + max_temp + rainfall + evaporation +
## sunshine + wind_gust_dir + wind_gust_speed + wind_dir_9am +
## wind_dir_3pm + wind_speed_9am + wind_speed_3pm + humidity_9am +
## humidity_3pm + pressure_9am + cloud_9am + cloud_3pm + rain_today +
## season + cluster
## <environment: 0x0000000057e74540>
This formula indicates that we aim to build a model that captures the knowledge required to predict the outcome rain_tomorrow from the provided input variables. Because the target variable takes only a small number of values (here two), this is a classification model, in contrast to regression models which predict numeric outcomes. In particular it is a binary classification model.
A common methodology for building models is to partition the available data into a training dataset and a testing dataset. We will build (or train) a model using the training dataset and then test how good the model is by applying it to the testing dataset. Typically we also introduce a validation dataset. This is used during the building of the model to assist in tuning the algorithm through trialing different parameters to the machine learning algorithm. In this way we search for the best model using the validation dataset and then obtain a measure of the performance of the final model using the testing dataset.
# Set seed
seed <- 42
set.seed(seed)
The training dataset is typically a 70% random sample for building the model. The second and third consist of the remainder, used to tune and then estimate the expected performance of the model. Rather than creating three subsets, we simply record the index of the observations that belong to each of the three subsets.
# Partition the full dataset into three
# Train set
nobs %>%
  sample(0.7*nobs) %T>%
  {length(.) %>% comcat()} %T>%
  {sort(.) %>% head(30) %>% print()} ->
train
## 120,700
## [1] 1 2 6 7 8 9 10 11 12 14 15 16 18 19 20 21 22 23 26 27 30 31 32 34 35
## [26] 36 37 39 40 41
# Validation set
nobs %>%
  seq_len() %>%
  setdiff(train) %>%
  sample(0.15*nobs) %T>%
  {length(.) %>% comcat()} %T>%
  {sort(.) %>% head(15) %>% print()} ->
validate
## 25,864
## [1] 5 13 17 24 29 38 50 58 71 75 91 100 104 106 117
# Test set
nobs %>%
  seq_len() %>%
  setdiff(union(train, validate)) %T>%
  {length(.) %>% comcat()} %T>%
  {head(.) %>% print()} ->
test
## 25,866
## [1] 3 4 25 28 33 46
Now we record the actual target values across the three datasets, as these will be used in the evaluations performed in later chapters. A secondary target variable is our risk variable, which reports the amount of rain recorded tomorrow; the two are highly correlated.
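As a quick sanity check of that relationship, a sketch: the average of risk_mm should be far higher on the days where rain_tomorrow is yes.
# Average of the risk variable by target value (sketch).
ds %>%
  group_by(rain_tomorrow) %>%
  summarise(mean_risk_mm=mean(risk_mm, na.rm=TRUE))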
tr_target <- ds[train,][[target]] %T>%
  {head(., 20) %>% print()}
## [1] no no no no no no yes no no no no no no yes no no no yes yes
## [20] no
## Levels: no yes
tr_risk <- ds[train,][[risk]] %T>%
  {head(., 20) %>% print()}
## [1] 0.0 0.2 0.4 0.0 0.0 0.0 18.0 0.0 0.0 0.2 0.0 0.0 0.4 13.2 0.1
## [16] 0.0 0.0 14.6 4.4 0.0
va_target <- ds[validate,][[target]] %T>%
  {head(., 20) %>% print()}
## [1] no yes no no no no no no no no no no no yes no yes no yes no
## [20] no
## Levels: no yes
va_risk <- ds[validate,][[risk]] %T>%
  {head(., 20) %>% print()}
## [1] 0.0 1.4 0.2 0.0 0.0 0.0 0.0 0.8 0.0 0.0 0.0 0.0 0.0 2.4 0.8 7.0 0.6 1.4 0.2
## [20] 0.2
te_target <- ds[test,][[target]] %T>%
  {head(., 20) %>% print()}
## [1] no no no yes no no no no no yes no no no no no yes no no no
## [20] yes
## Levels: no yes
te_risk <- ds[test,][[risk]] %T>%
  {head(., 20) %>% print()}
## [1] 0.0 1.0 0.0 1.2 0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0
## [16] 1.8 0.0 0.0 0.0 20.0
Having transformed, cleaned, wrangled and added new variables we will now save the data and its metadata into a binary RData file. Saving it as a binary compressed file saves both storage space and time on reloading the dataset. This is generally faster than loading a CSV file.
# Save data into an appropriate folder
fpath <- "C:/Users/antho/Desktop/DS/R/The Essentials of Data Science/data"
# Timestamp
"_" %s+% format(Sys.Date(), "%Y%m%d") %T>% print() dsdate <-
## [1] "_20210125"
# Filename for the saved dataset
dsfile <- dsname %s+% dsdate %s+% ".RData"
# Full path to the dataset
fpath %>%
  file.path(dsfile) %T>%
  print() ->
dsrdata
## [1] "C:/Users/antho/Desktop/DS/R/The Essentials of Data Science/data/weatherAUS_20210125.RData"
# Save relevant R objects to the binary RData file
save(ds, dsname, dspath, dsdate, nobs,
     vars, target, risk, id, ignore, omit,
     inputi, inputs, numi, numc, cati, catc,
     form, seed, train, validate, test,
     tr_target, tr_risk, va_target, va_risk,
     te_target, te_risk,
     file=dsrdata)
# Check the resulting file size in bytes
file.size(dsrdata) %>% comma()
## [1] "6,935,779"
load(dsrdata) %>% print()
## [1] "ds" "dsname" "dspath" "dsdate" "nobs" "vars"
## [7] "target" "risk" "id" "ignore" "omit" "inputi"
## [13] "inputs" "numi" "numc" "cati" "catc" "form"
## [19] "seed" "train" "validate" "test" "tr_target" "tr_risk"
## [25] "va_target" "va_risk" "te_target" "te_risk"