The beginning of a data science project is about understanding the business problem to be tackled.
Then we need to get in contact with the data technicians to identify the available data. This is followed by a data phase where we work with the technicians to access and ingest the data into R. This puts us in a position to start a journey of discovery of new insights driven by data, getting in touch with the data in the context of the business problem.
# Required packages for session
library(FSelector) # Feature selection: information.gain().
library(dplyr) # Data wrangling, glimpse() and tbl_df().
library(ggplot2) # Visualize data.
library(lubridate) # Dates and time.
library(randomForest) # Impute missing values with na.roughfix().
library(rattle) # weatherAUS data and normVarNames().
library(readr) # Efficient reading of CSV data.
library(scales) # Format comma().
library(stringi) # String concat operator %s+%.
library(stringr) # String operations.
library(tibble) # Convert row names into a column.
library(tidyr) # Prepare a tidy dataset, gather().
library(magrittr) # Pipes %>%, %T>% and equals(), extract().
First we need to get our data into R. Practically any source format is supported through the many packages available.
The weatherAUS dataset from rattle comes in CSV format. We need to identify and record the location of the dataset to analyze, which can be a location on the internet or a local file.
We can save the internet location of the file as a string of characters in a variable, in this case dspath.
# Identify the source location of the dataset.
"http://rattle.togaware.com/weatherAUS.csv" dspath <-
Now we need to read the dataset into the memory of the computer. This can be achieved using readr::read_csv(). The function returns a data frame; because we use the readr package from the tidyverse, this is actually an enhanced data frame known as a tibble, a table data frame. The table is composed of rows that are observations and columns that are variables. Again, to store the data frame we use a variable, in this case weatherAUS.
# Ingest the dataset.
weatherAUS <- rattle::weatherAUS
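Since the prose mentions readr::read_csv(), an equivalent ingest directly from the URL recorded in dspath would be the sketch below, replacing the chunk above. Note that read_csv() guesses its own column types, so the result can differ slightly (for example character rather than factor columns) from the copy bundled with rattle.
# Alternative ingest straight from the source URL.
weatherAUS <- read_csv(dspath)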
To get a reusable template for data wrangling we create a reference to the original dataset using a generic variable, in this case ds
, short for dataset.
ds <- weatherAUS
Modifying ds will not affect the original weatherAUS dataset. R avoids making unnecessary copies of datasets, so until the copy is modified it is only a reference; once we start making changes, R uses extra memory to store the new dataset.
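A minimal way to observe this copy-on-modify behaviour, assuming an R build with memory profiling enabled (as the CRAN binaries are), is base::tracemem(), which reports when a real copy of the object is made:
# Report when ds is actually copied.
tracemem(ds)
ds$MinTemp[1] <- ds$MinTemp[1] # the first modification triggers the copy
untracemem(ds)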
Next we record the name of the dataset and a generic reference to it. The use of the function get() here is a convenient trick to place the original dataset reference into our ds variable. It is included simply as an example for template use.
# Prepare the dataset for usage with template
"weatherAUS"
dsname <- get(dsname) ds <-
Using generic variables has advantages but requires care. A disadvantage is that we may be working with several datasets referenced by the same generic variable ds, so the naming of datasets needs to be managed carefully.
Once the data is loaded we seek a basic understanding of its shape: the size of the dataset in terms of observations (rows) and variables (columns). Because the data is stored as a tidyverse tibble, simply typing the variable name that holds the dataset presents a useful summary.
# Print the dataset
ds
The function dim() provides the dimension information, observations and variables, while nrow() and ncol() give the same information separately. We use rattle::comcat() to format numbers with commas.
# Basic size information
dim(ds) %>% comcat()
## 176,747 24
# Observations
nrow(ds) %>% comcat()
## 176,747
# Variables
ncol(ds) %>% comcat()
## 24
A useful alternative for an initial insight into the dataset is tibble::glimpse(), which plays a similar role to base::summary().
# Quick view of dataset contents
glimpse(ds)
## Rows: 176,747
## Columns: 24
## $ Date <date> 2008-12-01, 2008-12-02, 2008-12-03, 2008-12-04, 2008...
## $ Location <chr> "Albury", "Albury", "Albury", "Albury", "Albury", "Al...
## $ MinTemp <dbl> 13.4, 7.4, 12.9, 9.2, 17.5, 14.6, 14.3, 7.7, 9.7, 13....
## $ MaxTemp <dbl> 22.9, 25.1, 25.7, 28.0, 32.3, 29.7, 25.0, 26.7, 31.9,...
## $ Rainfall <dbl> 0.6, 0.0, 0.0, 0.0, 1.0, 0.2, 0.0, 0.0, 0.0, 1.4, 0.0...
## $ Evaporation <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Sunshine <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ WindGustDir <ord> W, WNW, WSW, NE, W, WNW, W, W, NNW, W, N, NNE, W, SW,...
## $ WindGustSpeed <dbl> 44, 44, 46, 24, 41, 56, 50, 35, 80, 28, 30, 31, 61, 4...
## $ WindDir9am <ord> W, NNW, W, SE, ENE, W, SW, SSE, SE, S, SSE, NE, NNW, ...
## $ WindDir3pm <ord> WNW, WSW, WSW, E, NW, W, W, W, NW, SSE, ESE, ENE, NNW...
## $ WindSpeed9am <dbl> 20, 4, 19, 11, 7, 19, 20, 6, 7, 15, 17, 15, 28, 24, 4...
## $ WindSpeed3pm <dbl> 24, 22, 26, 9, 20, 24, 24, 17, 28, 11, 6, 13, 28, 20,...
## $ Humidity9am <int> 71, 44, 38, 45, 82, 55, 49, 48, 42, 58, 48, 89, 76, 6...
## $ Humidity3pm <int> 22, 25, 30, 16, 33, 23, 19, 19, 9, 27, 22, 91, 93, 43...
## $ Pressure9am <dbl> 1007.7, 1010.6, 1007.6, 1017.6, 1010.8, 1009.2, 1009....
## $ Pressure3pm <dbl> 1007.1, 1007.8, 1008.7, 1012.8, 1006.0, 1005.4, 1008....
## $ Cloud9am <int> 8, NA, NA, NA, 7, NA, 1, NA, NA, NA, NA, 8, 8, NA, NA...
## $ Cloud3pm <int> NA, NA, 2, NA, 8, NA, NA, NA, NA, NA, NA, 8, 8, 7, NA...
## $ Temp9am <dbl> 16.9, 17.2, 21.0, 18.1, 17.8, 20.6, 18.1, 16.3, 18.3,...
## $ Temp3pm <dbl> 21.8, 24.3, 23.2, 26.5, 29.7, 28.9, 24.6, 25.5, 30.2,...
## $ RainToday <fct> No, No, No, No, No, No, No, No, No, Yes, No, Yes, Yes...
## $ RISK_MM <dbl> 0.0, 0.0, 0.0, 1.0, 0.2, 0.0, 0.0, 0.0, 1.4, 0.0, 2.2...
## $ RainTomorrow <fct> No, No, No, No, No, No, No, No, Yes, No, Yes, Yes, Ye...
Next we review the variables included in the dataset.
names(ds)
## [1] "Date" "Location" "MinTemp" "MaxTemp"
## [5] "Rainfall" "Evaporation" "Sunshine" "WindGustDir"
## [9] "WindGustSpeed" "WindDir9am" "WindDir3pm" "WindSpeed9am"
## [13] "WindSpeed3pm" "Humidity9am" "Humidity3pm" "Pressure9am"
## [17] "Pressure3pm" "Cloud9am" "Cloud3pm" "Temp9am"
## [21] "Temp3pm" "RainToday" "RISK_MM" "RainTomorrow"
A useful convention is to map all variable names to lowercase. R is case sensitive, so the normalised names are different variable names as far as R is concerned. We use rattle::normVarNames() to make a reasonable attempt at converting the variable names of a dataset into a preferred standard form.
When we change the variable names within the dataset a new copy of the dataset is created, so that ds and weatherAUS now refer to different datasets in memory.
# Original names before normalising
names(ds)
## [1] "Date" "Location" "MinTemp" "MaxTemp"
## [5] "Rainfall" "Evaporation" "Sunshine" "WindGustDir"
## [9] "WindGustSpeed" "WindDir9am" "WindDir3pm" "WindSpeed9am"
## [13] "WindSpeed3pm" "Humidity9am" "Humidity3pm" "Pressure9am"
## [17] "Pressure3pm" "Cloud9am" "Cloud3pm" "Temp9am"
## [21] "Temp3pm" "RainToday" "RISK_MM" "RainTomorrow"
# Normalise the variable names
names(ds) %>% normVarNames() -> names(ds)
names(ds)
## [1] "date" "location" "min_temp" "max_temp"
## [5] "rainfall" "evaporation" "sunshine" "wind_gust_dir"
## [9] "wind_gust_speed" "wind_dir_9am" "wind_dir_3pm" "wind_speed_9am"
## [13] "wind_speed_3pm" "humidity_9am" "humidity_3pm" "pressure_9am"
## [17] "pressure_3pm" "cloud_9am" "cloud_3pm" "temp_9am"
## [21] "temp_3pm" "rain_today" "risk_mm" "rain_tomorrow"
glimpse(ds)
## Rows: 176,747
## Columns: 24
## $ date <date> 2008-12-01, 2008-12-02, 2008-12-03, 2008-12-04, 20...
## $ location <chr> "Albury", "Albury", "Albury", "Albury", "Albury", "...
## $ min_temp <dbl> 13.4, 7.4, 12.9, 9.2, 17.5, 14.6, 14.3, 7.7, 9.7, 1...
## $ max_temp <dbl> 22.9, 25.1, 25.7, 28.0, 32.3, 29.7, 25.0, 26.7, 31....
## $ rainfall <dbl> 0.6, 0.0, 0.0, 0.0, 1.0, 0.2, 0.0, 0.0, 0.0, 1.4, 0...
## $ evaporation <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ sunshine <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ wind_gust_dir <ord> W, WNW, WSW, NE, W, WNW, W, W, NNW, W, N, NNE, W, S...
## $ wind_gust_speed <dbl> 44, 44, 46, 24, 41, 56, 50, 35, 80, 28, 30, 31, 61,...
## $ wind_dir_9am <ord> W, NNW, W, SE, ENE, W, SW, SSE, SE, S, SSE, NE, NNW...
## $ wind_dir_3pm <ord> WNW, WSW, WSW, E, NW, W, W, W, NW, SSE, ESE, ENE, N...
## $ wind_speed_9am <dbl> 20, 4, 19, 11, 7, 19, 20, 6, 7, 15, 17, 15, 28, 24,...
## $ wind_speed_3pm <dbl> 24, 22, 26, 9, 20, 24, 24, 17, 28, 11, 6, 13, 28, 2...
## $ humidity_9am <int> 71, 44, 38, 45, 82, 55, 49, 48, 42, 58, 48, 89, 76,...
## $ humidity_3pm <int> 22, 25, 30, 16, 33, 23, 19, 19, 9, 27, 22, 91, 93, ...
## $ pressure_9am <dbl> 1007.7, 1010.6, 1007.6, 1017.6, 1010.8, 1009.2, 100...
## $ pressure_3pm <dbl> 1007.1, 1007.8, 1008.7, 1012.8, 1006.0, 1005.4, 100...
## $ cloud_9am <int> 8, NA, NA, NA, 7, NA, 1, NA, NA, NA, NA, 8, 8, NA, ...
## $ cloud_3pm <int> NA, NA, 2, NA, 8, NA, NA, NA, NA, NA, NA, 8, 8, 7, ...
## $ temp_9am <dbl> 16.9, 17.2, 21.0, 18.1, 17.8, 20.6, 18.1, 16.3, 18....
## $ temp_3pm <dbl> 21.8, 24.3, 23.2, 26.5, 29.7, 28.9, 24.6, 25.5, 30....
## $ rain_today <fct> No, No, No, No, No, No, No, No, No, Yes, No, Yes, Y...
## $ risk_mm <dbl> 0.0, 0.0, 0.0, 1.0, 0.2, 0.0, 0.0, 0.0, 1.4, 0.0, 2...
## $ rain_tomorrow <fct> No, No, No, No, No, No, No, No, Yes, No, Yes, Yes, ...
From this summary we can see the variable names, data types and the first few values of each variable. We can see date (date), character (chr) and numeric (dbl) data types.
Now we start to ask questions of the data: are the date values a sequence of days, what other locations are listed, are the data types of the variables correct? We also see missing values, and we want to know about the distributions of the numeric variables.
# Review first observations
head(ds)
# Review last observations
tail(ds)
The date variable starts in December 2008 and runs until April 2020.
Some variables are identified as characters. These are often categoric variables and are represented in R as the factor data type, which lets algorithms handle them specially. Where the character data takes a limited number of possible values we convert the variable from character into factor (categoric) so as to take advantage of this special handling. Each value of a factor is called a level.
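As a minimal illustration with a small made-up vector (not from the dataset), converting characters to a factor records the distinct values as levels:
# Characters become a factor with a fixed set of levels.
yn <- c("no", "yes", "no")
factor(yn) %>% levels()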
Our first character variable is location. We can use base::unique() to confirm the number of locations.
# How many locations are in the dataset
ds$location %>%
  unique() %>%
  length()
## [1] 49
Because we may not know in general what other locations we will come across in related datasets we might want to retain the variable as character.
# Converting location to factor
ds$location %>% as.factor() -> ds$location
We can review the distribution of locations with base::table()
.
table(ds$location)
##
## Adelaide Albany Albury AliceSprings
## 3832 3679 3680 3680
## BadgerysCreek Ballarat Bendigo Brisbane
## 3632 3680 3671 3833
## Cairns Canberra Cobar CoffsHarbour
## 3680 4076 3649 3649
## Dartmoor Darwin GoldCoast Hobart
## 3649 3833 3680 3833
## Katherine Launceston Melbourne MelbourneAirport
## 2218 3680 3833 3649
## Mildura Moree MountGambier MountGinini
## 3649 3649 3679 3680
## Newcastle Nhil NorahHead NorfolkIsland
## 3680 2218 3644 3649
## Nuriootpa PearceRAAF Penrith Perth
## 3648 3648 3679 3832
## PerthAirport Portland Richmond Sale
## 3648 3649 3649 3649
## SalmonGums Sydney SydneyAirport Townsville
## 3602 3984 3649 3680
## Tuggeranong Uluru WaggaWagga Walpole
## 3679 2218 3649 3645
## Watsonia Williamtown Witchcliffe Wollongong
## 3649 3649 3648 3680
## Woomera
## 3649
Now we look at rain_today
and rain_tomorrow
distributions.
# Review the distribution of observations across levels
ds %>%
  select(starts_with("rain_")) %>%
  sapply(table)
## rain_today rain_tomorrow
## No 135371 135353
## Yes 37058 37077
We confirm that No
and Yes
are the only values these two variables have and so it makes sense to convert them both to factors.
# Names of the rain variables
ds %>%
  select(starts_with("rain_")) %>%
  names() %>%
  print() ->
vnames
## [1] "rain_today" "rain_tomorrow"
# Confirm these are currently character variables
ds[vnames] %>% sapply(class) # They are factors but we will work as if characters
## rain_today rain_tomorrow
## "factor" "factor"
# Converting variables from character to factor
ds[vnames] %>%
  lapply(factor) %>%
  data.frame() %>%
  as_tibble() %>% # tbl_df() is deprecated
  sapply(class)
## rain_today rain_tomorrow
## "factor" "factor"
# Verifying the distribution has not changed
ds %>%
  select(starts_with("rain_")) %>%
  sapply(table)
## rain_today rain_tomorrow
## No 135371 135353
## Yes 37058 37077
We also have three wind direction variables, wind_gust_dir, wind_dir_9am and wind_dir_3pm, which are likewise treated as characters.
ds %>%
  select(contains("_dir")) %>%
  sapply(table)
## wind_gust_dir wind_dir_9am wind_dir_3pm
## N 10989 13978 10475
## NNE 7937 9782 8002
## NE 8715 9335 10092
## ENE 9965 9592 9605
## E 11071 11237 10123
## ESE 9055 9536 10290
## SE 11331 11398 12919
## SSE 10946 10954 11089
## S 11043 10519 11788
## SSW 10809 9272 9902
## SW 10793 10135 11166
## WSW 11136 8392 11700
## W 12122 10183 12411
## WNW 10045 9067 10846
## NW 9705 10488 10315
## NNW 7954 9468 9358
Ordinarily the values of a character variable would be listed alphabetically, but these variables have an intrinsic compass order, so we indicate the order of the levels explicitly.
# Levels ordered
c("N", "NNE", "NE", "ENE",
compass <-"E", "ESE", "SE", "SSE",
"S", "SSW", "SW", "WSW",
"W", "WNW", "NW", "NNW")
# Note the names of wind direction
ds %>%
  select(contains("_dir")) %>%
  names() %>%
  print() ->
vnames
## [1] "wind_gust_dir" "wind_dir_9am" "wind_dir_3pm"
# Confirm that these are characters
ds[vnames] %>% sapply(class) # We assume they are characters
## wind_gust_dir wind_dir_9am wind_dir_3pm
## [1,] "ordered" "ordered" "ordered"
## [2,] "factor" "factor" "factor"
# Convert characters to factors
ds[vnames] %>%
  lapply(factor, levels=compass, ordered=TRUE) %>%
  data.frame() %>%
  as_tibble() %>%
  sapply(class)
## wind_gust_dir wind_dir_9am wind_dir_3pm
## [1,] "ordered" "ordered" "ordered"
## [2,] "factor" "factor" "factor"
# Distribution
ds %>%
  select(contains("_dir")) %>%
  sapply(table)
## wind_gust_dir wind_dir_9am wind_dir_3pm
## N 10989 13978 10475
## NNE 7937 9782 8002
## NE 8715 9335 10092
## ENE 9965 9592 9605
## E 11071 11237 10123
## ESE 9055 9536 10290
## SE 11331 11398 12919
## SSE 10946 10954 11089
## S 11043 10519 11788
## SSW 10809 9272 9902
## SW 10793 10135 11166
## WSW 11136 8392 11700
## W 12122 10183 12411
## WNW 10045 9067 10846
## NW 9705 10488 10315
## NNW 7954 9468 9358
Two other variables that may arrive as characters are evaporation and sunshine; these have many missing values.
# Note the variable names
c("evaporation", "sunshine")
cvars <-
# Review variables
head(ds[cvars])
sample_n(ds[cvars], 6)
readr::read_csv() inspects only a subset of the rows to determine the data types, so when a variable begins with a long run of missing values it can be parsed as character.
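Had that happened here, readr lets us widen the guessing window or declare the types up front; a sketch, assuming the dspath URL from earlier (ds_raw is a hypothetical name for the re-read data):
# Guess column types from more rows.
ds_raw <- read_csv(dspath, guess_max=10000)
# Or state the types explicitly for the troublesome columns.
ds_raw <- read_csv(dspath, col_types=cols(Evaporation=col_double(), Sunshine=col_double()))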
# Checking variable class
ds[cvars] %>% sapply(class) # read_csv() actually determined they are numeric
## evaporation sunshine
## "numeric" "numeric"
# Convert to numeric
ds[cvars] %>% sapply(as.numeric) -> ds[cvars]
# Confirm conversion
ds[cvars] %>% sapply(class)
## evaporation sunshine
## "numeric" "numeric"
We have now dealt with all of the character variables in the dataset, converting them to factors or numerics case by case based on our understanding of the data.
This can take time so we need to be selective.
# Note which variables are categoric
ds %>%
  sapply(is.factor) %>%
  which() %>%
  print() ->
catc
## location wind_gust_dir wind_dir_9am wind_dir_3pm rain_today
## 2 8 10 11 22
## rain_tomorrow
## 24
# Normalise the levels of all categoric variables
for (v in catc)
levels(ds[[v]]) %<>% normVarNames()
# Review normalizing
glimpse(ds[catc])
## Rows: 176,747
## Columns: 6
## $ location <fct> albury, albury, albury, albury, albury, albury, albur...
## $ wind_gust_dir <ord> w, wnw, wsw, ne, w, wnw, w, w, nnw, w, n, nne, w, sw,...
## $ wind_dir_9am <ord> w, nnw, w, se, ene, w, sw, sse, se, s, sse, ne, nnw, ...
## $ wind_dir_3pm <ord> wnw, wsw, wsw, e, nw, w, w, w, nw, sse, ese, ene, nnw...
## $ rain_today <fct> no, no, no, no, no, no, no, no, no, yes, no, yes, yes...
## $ rain_tomorrow <fct> no, no, no, no, no, no, no, no, yes, no, yes, yes, ye...
Many data mining tasks can be expressed as building classification models. For such models we want to ensure the target is categoric. It might be recorded as 0/1 and hence loaded as numeric. In such cases we could tell the model algorithm to explicitly perform classification, or wrap the target in base::as.factor() within the formula. Nonetheless it is generally cleaner to do the conversion here; note that the following code has no effect if the target is already categoric.
# Note the target variable
"rain_tomorrow"
target <-
# Ensure the target is categoric
ds[[target]] %<>% as.factor()
# Confirm the distribution
ds[target] %>% table()
## .
## no yes
## 135353 37077
ds %>%
  ggplot(aes_string(x=target)) +
  geom_bar(width=0.2, fill="light blue") +
  theme(text=element_text(size=14)) +
  scale_y_continuous(labels=comma) +
  labs(title   = "Distribution of Rain Tomorrow",
       x       = "Rain Tomorrow",
       y       = "Count",
       caption = "Source: weatherAUS")
We now have a basic understanding of the dataset and its content; we have identified data types and cleaned the variable names. Next we identify the roles played by the variables within the dataset.
# Note the available variables
ds %>%
  names() %>%
  print() ->
vars
## [1] "date" "location" "min_temp" "max_temp"
## [5] "rainfall" "evaporation" "sunshine" "wind_gust_dir"
## [9] "wind_gust_speed" "wind_dir_9am" "wind_dir_3pm" "wind_speed_9am"
## [13] "wind_speed_3pm" "humidity_9am" "humidity_3pm" "pressure_9am"
## [17] "pressure_3pm" "cloud_9am" "cloud_3pm" "temp_9am"
## [21] "temp_3pm" "rain_today" "risk_mm" "rain_tomorrow"
By this stage in the project we will usually have identified a business problem that is the focus of attention. In our case we will assume it is to build a predictive analytics model to predict the chance of it raining tomorrow given today's weather observations. The variable rain_tomorrow is then the target variable, and the dataset we have is a training dataset of historic observations. The task in model building is to identify any patterns among the observed variables that suggest that it rains the following day.
We also move the target variable to be the first in the vector of variables recorded in vars
. This is common practice where the first variable in the dataset is the target variable and the remainder are the variables used to build a model. Another common practice is for the target to be the last variable in the dataset.
# Place the target variable at the beginning of the vars
c(target, vars) %>%
unique() %>%
print() ->
vars
## [1] "rain_tomorrow" "date" "location" "min_temp"
## [5] "max_temp" "rainfall" "evaporation" "sunshine"
## [9] "wind_gust_dir" "wind_gust_speed" "wind_dir_9am" "wind_dir_3pm"
## [13] "wind_speed_9am" "wind_speed_3pm" "humidity_9am" "humidity_3pm"
## [17] "pressure_9am" "pressure_3pm" "cloud_9am" "cloud_3pm"
## [21] "temp_9am" "temp_3pm" "rain_today" "risk_mm"
We use base::unique() to remove the original occurrence of the target from the vars vector.
Another variable related to the outcome rather than to today's observations is risk_mm. From the business context this records the amount of rain that fell "tomorrow". We refer to it as a risk variable: a measure of the impact (or risk) of the outcome we are predicting (whether it rains tomorrow). It is an output variable, not an independent variable, and so should not be used as an input to the modeling.
# Note risk variable
"risk_mm" risk <-
The date and location variables act as identifiers: given a date and a location we have one observation of the remaining variables. Identifiers are not usually used as independent variables in predictive analytics models.
# Note identifiers
c("date", "location") id <-
Now we want to identify variables that are irrelevant or inappropriate for modelling.
First we need to ignore the identifiers and the risk variable. We can join our id and risk variables into a single vector of variables to ignore in our model.
# Initialize ignored variables: identifiers and risk
id %>%
  union(if (exists("risk")) risk) %>%
  print() ->
ignore
## [1] "date" "location" "risk_mm"
We might also check for variables that have a unique value for every observation. These are often identifiers and should be ignored.
We begin by defining a helper function that, given a vector of data, returns the number of unique values found in that vector.
# Helper function to count number of distinct values
count_unique <- function(x) {
  x %>% unique() %>% length()
}
# Heuristic for candidate identifiers to possibly ignore
ds[vars] %>%
  sapply(count_unique) %>%
  equals(nrow(ds)) %>%
  which() %>%
  names() %>%
  print() ->
ids
## character(0)
# Add them to the variables to be ignored for modeling
ignore <- union(ignore, ids) %>% print()
## [1] "date" "location" "risk_mm"
Here we count the unique values of each variable and compare the counts with the number of rows in the dataset. Any variable whose count equals the number of observations is collected into the ids vector, which is then added to the ignore vector.
No further identifier variables are found.
Next we remove variables that have only missing values.
# Helper function to count the number of values missing
count_na <- function(x) {
  x %>% is.na() %>% sum()
}
# Identify variables with only missing values
ds[vars] %>%
  sapply(count_na) %>%
  equals(nrow(ds)) %>%
  which() %>%
  names() %>%
  print() ->
missing
## character(0)
# Add these variables to the ignored vector
ignore %<>% union(missing) %>% print()
## [1] "date" "location" "risk_mm"
It is also helpful to identify variables which are very sparse, that is, have mostly missing values. We can decide on a threshold for the proportion missing above which to ignore the variable as unlikely to add much value to the analysis. For example, we may want to ignore variables with more than 80% of their values missing.
# Identify our threshold
missing.threshold <- 0.8
# Identify variables that are mostly missing
ds[vars] %>%
  sapply(count_na) %>%
  '>'(missing.threshold*nrow(ds)) %>%
  which() %>%
  names() %>%
  print() ->
mostly
## character(0)
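Had any variables crossed the threshold we would add them to the ignore vector, following the same pattern as the earlier checks:
# Add the mostly-missing variables to the variables to be ignored
ignore <- union(ignore, mostly) %>% print()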
Another issue we often come across in our datasets is factors that have very many levels. We might want to ignore such variables (or perhaps group their levels appropriately). Here we simply identify them and add them to the list of variables to ignore.
# Helper function to count the number of levels
count_levels <- function(x) {
  ds %>% extract2(x) %>% levels() %>% length()
}
# Identify a threshold above which a factor has too many levels
levels.threshold <- 20
# Identify variables that have too many levels
ds[vars] %>%
  sapply(is.factor) %>%
  which() %>%
  names() %>%
  sapply(count_levels) %>%
  ">="(levels.threshold) %>%
  which() %>%
  names() %>%
  print() ->
too.many
## [1] "location"
# Add them to the variables to be ignored
ignore <- union(ignore, too.many) %>% print()
## [1] "date" "location" "risk_mm"
We ignore variables with constant values, because generally they add no extra information to the analysis.
# Helper function to test if all values in vector are the same
all_same <- function(x) {
  all(x == x[1L])
}
# Identify variables that have a single value
ds[vars] %>%
  sapply(all_same) %>%
  which() %>%
  names() %>%
  print() ->
constants
## character(0)
# Add them to ignored variables
ignore <- union(ignore, constants) %>% print()
## [1] "date" "location" "risk_mm"
Once we identify all variables to ignore we remove them from our list of variables to use.
# Check the number of variables currently
length(vars)
## [1] 24
# Remove the variables to ignore
vars %>% setdiff(ignore) %>% print() -> vars
## [1] "rain_tomorrow" "min_temp" "max_temp" "rainfall"
## [5] "evaporation" "sunshine" "wind_gust_dir" "wind_gust_speed"
## [9] "wind_dir_9am" "wind_dir_3pm" "wind_speed_9am" "wind_speed_3pm"
## [13] "humidity_9am" "humidity_3pm" "pressure_9am" "cloud_9am"
## [17] "cloud_3pm" "rain_today"
# Confirm they are ignored
length(vars)
## [1] 18
To further prepare our dataset we use the package FSelector, which provides functions to identify subsets of variables that might be most effective for modeling. We can use this (and other packages) to assist us in reducing the variables for modeling. FSelector::cfs() identifies a subset of variables to consider for use in modeling using correlation and entropy.
# Construct the formulation of the modeling to undertake
form <- formula(target %s+% " ~ .") %>% print()
## rain_tomorrow ~ .
# Use correlation search to identify key variables
cfs(form, ds[vars])
## [1] "rainfall" "sunshine" "humidity_3pm" "cloud_3pm" "rain_today"
The stringi operator %s+% concatenates strings; we use it to produce a formula indicating that we will model our target variable using all of the other variables of the dataset. A second example uses FSelector::information.gain(), which ranks the variables by importance and so suggests a useful subset of variables.
# Use information gain to identify variable importance
information.gain(form, ds[vars]) %>%
rownames_to_column("variable") %>%
arrange(-attr_importance)
Both methods identify the same variables as the most important for modeling.
When working with missing values we should analyse whether there is any pattern to the missing targets, as this may indicate a systematic data issue rather than simple randomness. Understanding why data is systematically missing gives us a better understanding of the data and how it was collected.
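One way to look for such a pattern, a quick sketch using dplyr, is to compare the rate of missing targets across locations:
# Proportion of observations with a missing target, per location (sketch).
ds %>%
  group_by(location) %>%
  summarise(prop_missing=mean(is.na(rain_tomorrow))) %>%
  arrange(desc(prop_missing)) %>%
  head()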
# Check dimensions of dataset
dim(ds)
## [1] 176747 24
# Identify observations with missing target
missing_target <- ds %>% extract2(target) %>% is.na()
# Check how many are found
sum(missing_target)
## [1] 4317
# Remove observations with a missing target
ds %>% filter(!missing_target) -> ds
# Confirm that filter delivered the expected dataset
dim(ds)
## [1] 172430 24
Some algorithms can deal with missing values in the input variables, but this is not always the case, so we need to address the situation. In previous steps we removed variables with many missing values, noting that this is not always appropriate. Alternatively we might impute missing values (also noting that this is not always wise, since we are inventing data).
Next we illustrate a process for imputing missing values using randomForest::na.roughfix()
.
# Backup the dataset
ods <- ds
# Count number of missing values
ds[vars] %>% is.na() %>% sum() %>% comcat()
## 399,165
# Impute missing values
ds[vars] %>% na.roughfix() -> ds[vars]
# Confirm that no missing values remain
ds[vars] %>% is.na() %>% sum() %>% comcat()
## 0
# Restore original dataset
ds <- ods
An alternative might be to remove the observations that have missing values. Here we use stats::na.omit() to identify the rows to omit, based on the vars to be included for modeling.
# Backup the dataset
ods <- ds
# Initialise the list of observations to be removed
omit <- NULL
# Review current dataset
ds[vars] %>% nrow() %>% comcat()
## 172,430
ds[vars] %>% is.na() %>% sum() %>% comcat()
## 399,165
# Identify observations with missing values
mo <- attr(na.omit(ds[vars]), "na.action")
# Record the observations to omit
omit <- union(omit, mo)
# If there are observations to omit, remove them
if (length(omit)) ds <- ds[-omit,]
# Confirm the observations have been removed
ds[vars] %>% nrow() %>% comcat()
## 61,820
ds[vars] %>% is.na() %>% sum()
## [1] 0
# Restore original dataset
ds   <- ods
omit <- NULL
Another important task for the data scientist is to create new features from the provided data where appropriate. This process of feature creation is iterative; we come back to it many times during the life cycle of a project, identifying new features as we gain insights from the data and from modeling.
The date variable lets us derive features that may be useful for model building. Here we add two derived features to the dataset: year and season. The year is simply extracted from date using base::format(). To compute the season we extract the month, convert it with base::as.integer(), and use it as an index into base::switch() to map each month to a season.
ds %>%
  mutate(year   = factor(format(date, "%Y")),
         season = format(date, "%m") %>%
                  as.integer() %>%
                  sapply(function(x)
                    switch(x,
                           "summer", "summer", "autumn",
                           "autumn", "autumn", "winter",
                           "winter", "winter", "spring",
                           "spring", "spring", "summer")) %>%
                  as.factor()) -> ds
ds %>%
  select(date, year, season) %>%
  sample_n(10)
# Introduce variables according to roles
%>% c("season") -> vars
vars %<>% c("year") id
Another option for developing new features for our model is to include model-generated features. A common one is to cluster observations, or some aggregation of observations, into similar groups. We then uniquely identify each group and add this grouping as another column or variable within the dataset.
We will illustrate the process with a cluster analysis over the numeric variables within the dataset. The aim of the clustering is to group the locations so that within a group the locations have similar values for the numeric variables, while between groups the numeric variables are more dissimilar. The traditional clustering algorithm is stats::kmeans().
We identify the number of clusters that we wish to create, storing it in the variable NCLUST. We then select the numc (numeric) variables from the dataset ds, dplyr::group_by() the location and dplyr::summarise_all() the variables. We then remove the location column, leaving just the numeric variables, and rescale them so that all values across the different variables are in the same range. The rescaling is required for clustering so as not to introduce bias from very differently scaled variables.
# Set seed for repeatability
set.seed(7465)
# Cluster the numeric data per location
NCLUST <- 5
c("location",numc)] %>%
ds[ group_by(location) %>%
summarise_all(funs(mean(., na.rm = TRUE))) %T>%
{locations <<- .$location} %>% #Store locations for later
select(-location) %>%
sapply(function(x) ifelse(is.na(x), 0, x)) %>%
as.data.frame %>%
sapply(scale) %>%
kmeans(NCLUST) %>%
print() %>%
extract2("cluster") ->
cluster
## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## K-means clustering with 5 clusters of sizes 16, 10, 3, 11, 9
##
## Cluster means:
## min_temp max_temp rainfall evaporation sunshine wind_gust_speed
## 1 -0.6329818 -0.7092251 -0.32589491 -0.03078092 -0.1201087 0.01980454
## 2 0.1230244 0.4891127 -0.63922651 0.74038988 1.0093003 -0.03001963
## 3 0.9467065 2.0565398 -0.67744951 0.61779897 -0.3928548 0.25915800
## 4 -0.5023530 -0.4665302 0.04592559 -1.17176045 -1.0679563 -0.33590190
## 5 1.2870255 0.6020765 1.45930564 0.45828488 0.5283132 0.32230784
## wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
## 1 -0.03453387 0.1735126 0.56565701 0.2964073 0.2957966
## 2 0.02110433 -0.4786273 -0.88108262 -0.9351112 0.2975354
## 3 0.25535647 -0.4108368 -2.26102789 -1.8714636 0.2874347
## 4 -0.58496983 -0.6115458 0.61788107 0.3863261 -1.0177971
## 5 0.66778859 1.1077317 -0.02814379 0.6637110 0.2917071
## pressure_3pm cloud_9am cloud_3pm temp_9am temp_3pm
## 1 0.2973469 0.91256343 0.88638514 -0.7210094 -0.6963267
## 2 0.2967594 -0.14604672 -0.04024751 0.2536996 0.5060387
## 3 0.2817550 0.06818767 0.08541570 1.4656647 2.0890292
## 4 -1.0170817 -1.41005922 -1.45077310 -0.4975308 -0.4895005
## 5 0.2908320 0.24061564 0.21361889 1.1194444 0.5775844
##
## Clustering vector:
## [1] 2 1 1 3 4 1 1 2 5 1 2 5 4 5 5 1 3 1 1 1 2 2 1 4 4 4 4 5 1 2 4 2 2 1 1 1 4 5
## [39] 5 5 4 3 2 4 1 5 4 1 2
##
## Within cluster sum of squares by cluster:
## [1] 96.11806 48.27105 27.46643 160.62506 56.41959
## (between_SS / total_SS = 49.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
head(cluster)
## [1] 2 1 1 3 4 1
# A more up-to-date version, using across() in place of the deprecated funs()
# Set seed for repeatability
set.seed(7465)
# Cluster the numeric data per location
NCLUST <- 5
c("location",numc)] %>%
ds[ group_by(location) %>%
summarise(across(.fns = ~mean(., na.rm = TRUE)),.groups = "drop") %T>%
{locations <<- .$location} %>% #Store locations for later
select(-location) %>%
sapply(function(x) ifelse(is.na(x),0,x)) %>%
as.data.frame() %>%
sapply(scale) %>%
kmeans(NCLUST) %>%
print() %>%
extract2("cluster") ->
cluster
## K-means clustering with 5 clusters of sizes 16, 10, 3, 11, 9
##
## Cluster means:
## min_temp max_temp rainfall evaporation sunshine wind_gust_speed
## 1 -0.6329818 -0.7092251 -0.32589491 -0.03078092 -0.1201087 0.01980454
## 2 0.1230244 0.4891127 -0.63922651 0.74038988 1.0093003 -0.03001963
## 3 0.9467065 2.0565398 -0.67744951 0.61779897 -0.3928548 0.25915800
## 4 -0.5023530 -0.4665302 0.04592559 -1.17176045 -1.0679563 -0.33590190
## 5 1.2870255 0.6020765 1.45930564 0.45828488 0.5283132 0.32230784
## wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
## 1 -0.03453387 0.1735126 0.56565701 0.2964073 0.2957966
## 2 0.02110433 -0.4786273 -0.88108262 -0.9351112 0.2975354
## 3 0.25535647 -0.4108368 -2.26102789 -1.8714636 0.2874347
## 4 -0.58496983 -0.6115458 0.61788107 0.3863261 -1.0177971
## 5 0.66778859 1.1077317 -0.02814379 0.6637110 0.2917071
## pressure_3pm cloud_9am cloud_3pm temp_9am temp_3pm
## 1 0.2973469 0.91256343 0.88638514 -0.7210094 -0.6963267
## 2 0.2967594 -0.14604672 -0.04024751 0.2536996 0.5060387
## 3 0.2817550 0.06818767 0.08541570 1.4656647 2.0890292
## 4 -1.0170817 -1.41005922 -1.45077310 -0.4975308 -0.4895005
## 5 0.2908320 0.24061564 0.21361889 1.1194444 0.5775844
##
## Clustering vector:
## [1] 2 1 1 3 4 1 1 2 5 1 2 5 4 5 5 1 3 1 1 1 2 2 1 4 4 4 4 5 1 2 4 2 2 1 1 1 4 5
## [39] 5 5 4 3 2 4 1 5 4 1 2
##
## Within cluster sum of squares by cluster:
## [1] 96.11806 48.27105 27.46643 160.62506 56.41959
## (between_SS / total_SS = 49.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
head(cluster)
## [1] 2 1 1 3 4 1
Each location now has an associated cluster stored in the cluster vector; we attach the locations as the vector's base::names(). We then dplyr::mutate() the dataset to add the new cluster number, indexed by the location of each observation.
# Index the cluster vector with the appropriate locations
names(cluster) <- locations
# Add cluster to the dataset
ds %<>% mutate(cluster = "area" %>%
                 paste(cluster[ds$location]) %>%
                 as.factor())
# Check clusters
ds %>% select(location, cluster) %>% sample_n(10)
# Record variables in appropriate role
%<>% c("cluster")
vars
# Checking that clusters look good
cluster[levels(ds$location)] %>% sort()
## albany albury ballarat bendigo
## 1 1 1 1
## canberra hobart launceston melbourne
## 1 1 1 1
## melbourne_airport mount_gambier nuriootpa portland
## 1 1 1 1
## richmond sale watsonia wollongong
## 1 1 1 1
## adelaide brisbane cobar mildura
## 2 2 2 2
## moree pearce_raaf perth perth_airport
## 2 2 2 2
## wagga_wagga woomera alice_springs katherine
## 2 2 3 3
## uluru badgerys_creek dartmoor mount_ginini
## 3 4 4 4
## newcastle nhil norah_head penrith
## 4 4 4 4
## salmon_gums tuggeranong walpole witchcliffe
## 4 4 4 4
## cairns coffs_harbour darwin gold_coast
## 5 5 5 5
## norfolk_island sydney sydney_airport townsville
## 5 5 5 5
## williamtown
## 5
cluster %>% sort()
## albany albury ballarat bendigo
## 1 1 1 1
## canberra hobart launceston melbourne
## 1 1 1 1
## melbourne_airport mount_gambier nuriootpa portland
## 1 1 1 1
## richmond sale watsonia wollongong
## 1 1 1 1
## adelaide brisbane cobar mildura
## 2 2 2 2
## moree pearce_raaf perth perth_airport
## 2 2 2 2
## wagga_wagga woomera alice_springs katherine
## 2 2 3 3
## uluru badgerys_creek dartmoor mount_ginini
## 3 4 4 4
## newcastle nhil norah_head penrith
## 4 4 4 4
## salmon_gums tuggeranong walpole witchcliffe
## 4 4 4 4
## cairns coffs_harbour darwin gold_coast
## 5 5 5 5
## norfolk_island sydney sydney_airport townsville
## 5 5 5 5
## williamtown
## 5
Metadata is data about the data. We now record data about our dataset that we will use later in further processing and analysis.
We have identified variable roles such as the target, the risk and the ignored variables. Now we identify the model inputs, or independent variables, recording them both as a vector of characters (variable names) and a vector of integers (variable indices).
# Noting variable names
vars %>%
  setdiff(target) %>%
  print() ->
inputs
## [1] "min_temp" "max_temp" "rainfall" "evaporation"
## [5] "sunshine" "wind_gust_dir" "wind_gust_speed" "wind_dir_9am"
## [9] "wind_dir_3pm" "wind_speed_9am" "wind_speed_3pm" "humidity_9am"
## [13] "humidity_3pm" "pressure_9am" "cloud_9am" "cloud_3pm"
## [17] "rain_today" "season" "cluster"
The integer indices are determined from the base::names()
of the variables in the original dataset. Note the use of USE.NAMES= from base::sapply()
to turn off the inclusion of names in the resulting vector to keep the result as a simple vector.
# Noting variable indices
inputs %>%
  sapply(function(x) which(x == names(ds)), USE.NAMES=FALSE) %>%
  print() ->
inputi
## [1] 3 4 5 6 7 8 9 10 11 12 13 14 15 16 18 19 22 26 27
# Recording number of observations
ds %>%
  nrow() %T>%
  comcat() ->
nobs
## 172,430
# Confirm various subset sizes
dim(ds) %>% comcat()
## 172,430 27
dim(ds[vars]) %>% comcat()
## 172,430 20
dim(ds[inputs]) %>% comcat()
## 172,430 19
dim(ds[inputi]) %>% comcat()
## 172,430 19
For later handling we identify the numeric and the categoric variables separately. Some cluster algorithms, for example, only accept numeric variables. Again we identify them both with a character vector containing their names and an integer vector containing their indices. Note that when we use the index we assume the variables remain in the same order within the dataset and that all variables are present.
# Identify numeric variables by index
ds %>%
  sapply(is.numeric) %>%
  which() %>%
  intersect(inputi) %>%
  print() ->
numi
## [1] 3 4 5 6 7 9 12 13 14 15 16 18 19
# Identify the numeric variables by name
ds %>%
  names() %>%
  extract(numi) %>%
  print() ->
numc
## [1] "min_temp" "max_temp" "rainfall" "evaporation"
## [5] "sunshine" "wind_gust_speed" "wind_speed_9am" "wind_speed_3pm"
## [9] "humidity_9am" "humidity_3pm" "pressure_9am" "cloud_9am"
## [13] "cloud_3pm"
# Identify categoric variables by index
ds %>%
  sapply(is.factor) %>%
  which() %>%
  intersect(inputi) %>%
  print() ->
cati
## [1] 8 10 11 22 26 27
# Identify the categoric variables by name
ds %>%
  names() %>%
  extract(cati) %>%
  print() ->
catc
## [1] "wind_gust_dir" "wind_dir_9am" "wind_dir_3pm" "rain_today"
## [5] "season" "cluster"
Our end goal is to build a model based on the data we have prepared. A model will capture knowledge about the world that the data represents. The final two tasks are to specify the form of the model and to identify the actual observations from which the model is to be built. For the latter we partition the dataset into subsets.
A formula is used to identify what it is that we will model from the supplied data. Typically we identify a target variable which we will model based on other input variables. We can construct a stats::formula()
automatically from a dataset if the first column is the target variable and the remaining ones are the inputs. Our vars
vector already has this format.
ds[vars] %>%
  formula() %>%
  print() ->
form
## rain_tomorrow ~ min_temp + max_temp + rainfall + evaporation +
## sunshine + wind_gust_dir + wind_gust_speed + wind_dir_9am +
## wind_dir_3pm + wind_speed_9am + wind_speed_3pm + humidity_9am +
## humidity_3pm + pressure_9am + cloud_9am + cloud_3pm + rain_today +
## season + cluster
## <environment: 0x0000000057e74540>
This formula indicates that we aim to build a model that captures the knowledge required to predict the outcome rain_tomorrow from the provided input variables. Because the target variable takes only a small number of values (here two), this is a classification model, in contrast to regression models which predict numeric outcomes. In particular it is a binary classification model.
A common methodology for building models is to partition the available data into a training dataset and a testing dataset. We will build (or train) a model using the training dataset and then test how good the model is by applying it to the testing dataset. Typically we also introduce a validation dataset. This is used during the building of the model to assist in tuning the algorithm through trialing different parameters to the machine learning algorithm. In this way we search for the best model using the validation dataset and then obtain a measure of the performance of the final model using the testing dataset.
# Set seed
seed <- 42
set.seed(seed)
The training dataset is typically a 70% random sample for building the model. The second and third consist of the remainder, used to tune and then estimate the expected performance of the model. Rather than creating three subsets, we simply record the index of the observations that belong to each of the three subsets.
# Partition the full dataset into three
# Train set
nobs %>%
  sample(0.7*nobs) %T>%
  {length(.) %>% comcat()} %T>%
  {sort(.) %>% head(30) %>% print()} ->
train
## 120,700
## [1] 1 2 6 7 8 9 10 11 12 14 15 16 18 19 20 21 22 23 26 27 30 31 32 34 35
## [26] 36 37 39 40 41
# Validation set
nobs %>%
  seq_len() %>%
  setdiff(train) %>%
  sample(0.15*nobs) %T>%
  {length(.) %>% comcat()} %T>%
  {sort(.) %>% head(15) %>% print()} ->
validate
## 25,864
## [1] 5 13 17 24 29 38 50 58 71 75 91 100 104 106 117
# Test set
nobs %>%
  seq_len() %>%
  setdiff(union(train, validate)) %T>%
  {length(.) %>% comcat()} %T>%
  {head(.) %>% print()} ->
test
## 25,866
## [1] 3 4 25 28 33 46
Now we record the actual target values across the three datasets, as these will be used in the evaluations performed in later chapters. A secondary target variable is our risk variable, which reports the amount of rain recorded tomorrow; the two are highly correlated.
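As a quick sanity check of that relationship, a sketch: the average of risk_mm should be far higher on the days where rain_tomorrow is yes.
# Average of the risk variable by target value (sketch).
ds %>%
  group_by(rain_tomorrow) %>%
  summarise(mean_risk_mm=mean(risk_mm, na.rm=TRUE))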
tr_target <- ds[train,][[target]] %T>%
  {head(., 20) %>% print()}
## [1] no no no no no no yes no no no no no no yes no no no yes yes
## [20] no
## Levels: no yes
tr_risk <- ds[train,][[risk]] %T>%
  {head(., 20) %>% print()}
## [1] 0.0 0.2 0.4 0.0 0.0 0.0 18.0 0.0 0.0 0.2 0.0 0.0 0.4 13.2 0.1
## [16] 0.0 0.0 14.6 4.4 0.0
va_target <- ds[validate,][[target]] %T>%
  {head(., 20) %>% print()}
## [1] no yes no no no no no no no no no no no yes no yes no yes no
## [20] no
## Levels: no yes
va_risk <- ds[validate,][[risk]] %T>%
  {head(., 20) %>% print()}
## [1] 0.0 1.4 0.2 0.0 0.0 0.0 0.0 0.8 0.0 0.0 0.0 0.0 0.0 2.4 0.8 7.0 0.6 1.4 0.2
## [20] 0.2
te_target <- ds[test,][[target]] %T>%
  {head(., 20) %>% print()}
## [1] no no no yes no no no no no yes no no no no no yes no no no
## [20] yes
## Levels: no yes
te_risk <- ds[test,][[risk]] %T>%
  {head(., 20) %>% print()}
## [1] 0.0 1.0 0.0 1.2 0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0
## [16] 1.8 0.0 0.0 0.0 20.0
Having transformed, cleaned, wrangled and added new variables we will now save the data and its metadata into a binary RData file. Saving it as a binary compressed file saves both storage space and time on reloading the dataset. This is generally faster than loading a CSV file.
# Save data into an appropriate folder
fpath <- "C:/Users/antho/Desktop/DS/R/The Essentials of Data Science/data"
# Timestamp
"_" %s+% format(Sys.Date(), "%Y%m%d") %T>% print() dsdate <-
## [1] "_20210125"
# Filename for the saved dataset
dsfile <- dsname %s+% dsdate %s+% ".RData"
# Full path to the dataset
fpath %>%
  file.path(dsfile) %T>%
  print() ->
dsrdata
## [1] "C:/Users/antho/Desktop/DS/R/The Essentials of Data Science/data/weatherAUS_20210125.RData"
# Save relevant R objects to the binary RData file
save(ds, dsname, dspath, dsdate, nobs,
     vars, target, risk, id, ignore, omit,
     inputi, inputs, numi, numc, cati, catc,
     form, seed, train, validate, test,
     tr_target, tr_risk, va_target, va_risk,
     te_target, te_risk,
     file=dsrdata)
# Check the resulting file size in bytes
file.size(dsrdata) %>% comma()
## [1] "6,935,779"
load(dsrdata) %>% print()
## [1] "ds" "dsname" "dspath" "dsdate" "nobs" "vars"
## [7] "target" "risk" "id" "ignore" "omit" "inputi"
## [13] "inputs" "numi" "numc" "cati" "catc" "form"
## [19] "seed" "train" "validate" "test" "tr_target" "tr_risk"
## [25] "va_target" "va_risk" "te_target" "te_risk"