MATH2349 Semester 2, 2019

Required packages

The packages that had been of use to this analysis were as follows: readr, magrittr, dplyr, tidyr, stringr, Hmisc, car, knitr, editrules, deductive, validate, deducorrect, outliers, mvn, forecast, infotheo and mlr.

library(readr)
library(magrittr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:magrittr':
## 
##     extract

library(stringr)
library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

library(knitr)
library(editrules)

## Loading required package: igraph

## 
## Attaching package: 'igraph'

## The following object is masked from 'package:tidyr':
## 
##     crossing

## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

## 
## Attaching package: 'editrules'

## The following objects are masked from 'package:igraph':
## 
##     blocks, normalize

## The following object is masked from 'package:tidyr':
## 
##     separate

## The following object is masked from 'package:dplyr':
## 
##     contains

library(deductive)
library(validate)

## 
## Attaching package: 'validate'

## The following object is masked from 'package:igraph':
## 
##     compare

## The following objects are masked from 'package:Hmisc':
## 
##     label, label<-

## The following object is masked from 'package:ggplot2':
## 
##     expr

## The following object is masked from 'package:dplyr':
## 
##     expr

library(deducorrect)
library(outliers)
library(MVN)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

## sROC 0.1-2 loaded

library(forecast)

## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## Registered S3 methods overwritten by 'forecast':
##   method             from    
##   fitted.fracdiff    fracdiff
##   residuals.fracdiff fracdiff

library(infotheo)
library(mlr)

## Loading required package: ParamHelpers

## 
## Attaching package: 'ParamHelpers'

## The following object is masked from 'package:editrules':
## 
##     isFeasible

## 
## Attaching package: 'mlr'

## The following object is masked from 'package:Hmisc':
## 
##     impute

setwd("~/Desktop/Data Preproc/assignment 3/data analysis")

Executive Summary

The report examines the step-by-step procedures that are often followed to pre-process data, prior to modelling. For this purpose, an open source dataset was utilised. The analysis begins with an explanation of the observations, structure and constituents of the dataset. It was then evaluated using tidy data principles which enabled us to determine that there was no need to manipulate the dataset. Additional columns were created where appropriate to represent a linear combination of several variables. Several methods were then utilised to not only detect but also replace missing values and obvious inconsistencies in our dataset. Followed by which, the dataset was scanned for outliers such that required adjustments could have been made where necessary. Furthermore, the dataset’s distributions were explored, and transformations were made, with the aim of converting skewed variables to normally distributed variables.

Data

The dataset at the crux of this report was the ‘Wine Quality’ dataset that had been sourced from the following link: https://archive.ics.uci.edu/ml/datasets/Wine+Quality . The dataset had comprised of two parts; ‘winequality-red.csv’ and ‘winequality-white.csv’, that consisted of information on the chemical components of wine observations. The variables were; fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and quality. Of all variables, only ‘quality’ was categorical (factor with ordered levels: 3, 4, 5, 6, 7, 8) while the rest were numerical. Each numeric variable was simply a measurement of the respective chemical content found in each wine observation. Whereas, the quality variable represented the quality of any given wine observation, on a scale of 3 to 8.

Furthermore, there was a total 1,599 red wine observations and 4,898 white wine observations. The ‘redwine’ and ‘whitewine’ datasets were merged together to create a combined dataset called ‘allwine’. The rows of white wine observations were added below the rows of the red wine observations to create a dataset of 6,497 wine observations, by using the ‘bind_rows’ function.

redwine <- read.csv("winequality-red.csv", sep=";")
whitewine <- read.csv("winequality-white.csv", sep=";")
head(redwine)

head(whitewine)

allwine <- bind_rows(redwine, whitewine)
head(allwine)

Understand

We first examined the overall structure of the data and its respective variables. The structure of the dataset revealed that ‘allwine’ was in dataframe format and, consisted of 6497 observations and 12 variables (11 numeric variables and 1 categorical variable). Each of the variables were in numeric (double) format except for quality which was in integer format. Hence, we proceeded to convert the data type of quality, from ‘integer’ to ‘factor’ with ordered levels (3< 4< 5< 6< 7< 8), while keeping the data types of remaining columns the same. Furthermore, the ‘attributes’ function had enabled us to verify the column names and the class of our dataset.

str(allwine)

## 'data.frame':    6497 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

attributes(allwine)$names

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

attributes(allwine)$class

## [1] "data.frame"

allwine$quality <- factor(allwine$quality,levels=c(3,4,5,6,7,8), ordered=TRUE)

Tidy & Manipulate Data I

We had then proceeded to adjust our data frame by adding an additional variable; colour, that was specified as ‘red’ for red wine observations and ‘white’ for white wine observations. We had accomplished this, by creating a new column called ‘colour’ that consisted of NAs. We had then specified the first 1599 observations as “red” and the remaining observations as “white”, given that we had combined the rows of the ‘whitewine’ dataframe below the rows of the ‘redwine’ dataframe. Furthermore, we had ensured the variable was assigned as an unordered factor with levels ‘red’ and ‘white’.

We had also evaluated the data using ‘Tidy Data Principles’:

Did each variable have its own column? Yes, given that each column represented an individual chemical component of the wine.
Did each observations have its own row? Yes, each row represented a single red/white wine observation.
Did each value have its own cell? Yes, each chemical-component measurement, quality and colour had its own cell.

Hence, given that our data had satisfied the tidy principles, a decision had been made to proceed with the pre-processing task.

allwine$colour <- NA
allwine[1:1599,13] = "red"
allwine[1600:6497,13] = "white"
allwine$colour <- factor(allwine$colour,levels=c("red","white"))

Tidy & Manipulate Data II

The total acidity of wine was determined by the amount of citric acid, fixed acidity and volatile acidity. Hence, we had proceeded to create the variable; ‘total.acidity’ as a representation of the total acidity in any given wine observation. For this purpose we had used the ‘mutate’ function as shown in the code chunk below, to add the values of ‘fixed.acidity’, ‘volatile.acidity’ and ‘citric.acid’.

allwine <- allwine %>% mutate(total.acidity = fixed.acidity+volatile.acidity+citric.acid)

#confirm by observing head of allwine
head(allwine[,c(1,2,3,14)])

Scan I

We had then examined our data for obvious inconsistencies and errors. We had specified the minimum and/or maximum bounds for each variable, using boolean operations in the ‘editset’ function. We had then used the ‘violatedEdits’ functions from the ‘editrules’ package, to examine our dataframe against these specified rules. It should be noted that these specified rules were based on standard requirements that wine producers had been provided by the Australian Grape and Wine Authority. For instance, volatile acidity has a maximum requirement of 1.5g/L, chloride has a maximum requirement of 300mg/L and pH content usually varies between 2 to 5. Hence, the rules were specified, having taken these requirements as rough estimates, because wine producers are not bound to them. On examining the summary of ‘violatedEdits’, it had been evident that the dataframe had not had any inconsistencies.

We had then begun to examine our data for missing values. The ‘colSums’ function was initially used to observe the sum of missing values (NA’s) that had existed in each column. The result indicated that we had 5 missing values in the ‘quality’ variable. We had then re-examined the data to ensure the numeric variables did not have any NaNs, infinities or NAs (although checking for NA’s was a bit redundant given that we had examined this using colSums). The result of the ‘is.special’ function that we had created for this purpose, had been applied on each column of the allwine dataframe, using the sapply function and, had revealed that none of the numeric columns consisted of NaNs, infinities or NAs.

As a corrective measure, we could have gotten rid of the 5 rows as it had only represented about 0.077% of our data (less than the 3% benchmark). However, given that the 5 rows had only one missing value (each), a decision was made to replace the missing quality values with the median quality of wine, grouped by the colour of wine, instead of losing whole rows of information on chemical components. This had been successfully done, using a for loop as shown below.

#check for inconsistencies
rules <- editset(c("volatile.acidity < 1.7","chlorides < 1",
                   "free.sulfur.dioxide < 300","pH<5","pH>2"))
violated <- violatedEdits(rules, allwine)
summary(violated)

## No violations detected, 0 checks evaluated to NA

## NULL

#check columns of dataframe for missing values
colSums(is.na(allwine))

##        fixed.acidity     volatile.acidity          citric.acid 
##                    0                    0                    0 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                    0                    0                    0 
## total.sulfur.dioxide              density                   pH 
##                    0                    0                    0 
##            sulphates              alcohol              quality 
##                    0                    0                    5 
##               colour        total.acidity 
##                    0                    0

#check all numeric columns for any type of missing/special value that is not a finite number.
is.special = function(x) {
  if (is.numeric(x)) which(is.na(x)|is.nan(x)|is.infinite(x))
}
sapply(allwine, is.special)

## $fixed.acidity
## integer(0)
## 
## $volatile.acidity
## integer(0)
## 
## $citric.acid
## integer(0)
## 
## $residual.sugar
## integer(0)
## 
## $chlorides
## integer(0)
## 
## $free.sulfur.dioxide
## integer(0)
## 
## $total.sulfur.dioxide
## integer(0)
## 
## $density
## integer(0)
## 
## $pH
## integer(0)
## 
## $sulphates
## integer(0)
## 
## $alcohol
## integer(0)
## 
## $quality
## NULL
## 
## $colour
## NULL
## 
## $total.acidity
## integer(0)

#replace missing quality with median of quality grouped by wine colour.
mean_quality_by_colour=as.data.frame(allwine %>% group_by(colour) %>%
                                       summarise(median(as.numeric(allwine$quality),
                                                        na.rm=TRUE)))
missing_quality = as.vector(which(is.na(allwine$quality)))
for (i in 1:length(missing_quality)) {
  if (allwine[missing_quality[i],13]=="white") (allwine[missing_quality[i],12] =
                                                  mean_quality_by_colour[2,2])
  else 
  if (allwine[missing_quality[i],13]=="red") (allwine[missing_quality[i],12] =
                                                mean_quality_by_colour[1,2])
}

#check if missing values is 0
colSums(is.na(allwine))

##        fixed.acidity     volatile.acidity          citric.acid 
##                    0                    0                    0 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                    0                    0                    0 
## total.sulfur.dioxide              density                   pH 
##                    0                    0                    0 
##            sulphates              alcohol              quality 
##                    0                    0                    0 
##               colour        total.acidity 
##                    0                    0

Scan II

The dataframe ‘allwine’ had then been evaluated for outliers. The preferred method for outlier evaluation had been the Mahalanobis Distance. However, the ‘mvn’ function, from the ‘outliers’ package, had only allowed us to evaluate 5000 observations at once. It is because of this that a decision had been made to get rid of the last 1497 observations of our dataframe. Given that ‘allwine’ consisted of 1599 red wine observations and 4898 white wine observations, our dataframe had been imbalanced. Hence, instead of getting rid of 1497 random wine observations, it had been decided to get rid of 1497 white wine observations in an effort to reduce the aforementioned imbalance. The new dataframe with only 5000 observations had been assigned the name, ‘new_allwine’. Furthermore, a multivariate outlier detection method had been used, given that several variables of our dataframe had been related to each other. For instance, total acidity was a sum of fixed acidity, volatile acidity and citric acid and, pH of wine was partially dependent on acidity levels.

The ‘mvn’ function had then been used to check ‘new_allwine’ for outliers. It had been determined that the dataframe had 106 outliers. The placement of the outliers relative to other observations can be observed in the Q-Q plot provided below. We had then proceeded to cap the outliers such that the maximum and minimum observation for each variable had coincided with the variable’s 95th and 5th percentile, respectively. This had been done by creating a function called ‘cap’, that had been applied to all columns of ‘new_allwine’ using the ‘sapply’ function. The new dataframe with ‘capped’ values had been assigned the name ‘adj_allwine’. The new dataframe had then been reevaluated using the ‘mvn’ function. As can be seen in the second Q-Q plot, ‘adj_allwine’ only had 24 outliers. Despite this, we had chosen to not adjust the data further. This is because, the quantile measurements for the data would have kept changing with each adjustment, thus creating additional outliers. It should be noted that the detection and capping of outliers had been done for all numeric columns only. This was achieved by simply specifying ‘new_allwine[,-c(12,13)]’, given that columns 12 and 13 were non-numeric. Hence, these columns were added back to ‘adj_allwine’. Our final outlier-adjusted dataframe had consisted of 5000 observations and 14 variables with only 24 outliers.

#reduce data to 5000 observations: better balance between red and white wine
new_allwine <- allwine[1:5000,]

#check for outliers using mahalanobis distance.
outlier_check <- mvn(data=new_allwine[,-c(12,13)], multivariateOutlierMethod = "quan", showOutliers = TRUE)

## Warning in covMcd(data, alpha = alpha): The covariance matrix of the data is singular.
## There are 5000 observations (in the entire dataset of 5000 obs.)
## lying on the hyperplane with equation a_1*(x_i1 - m_1) + ... +
## a_p*(x_ip - m_p) = 0 with (m_1, ..., m_p) the mean of these
## observations and coefficients a_i from the vector a <- c(-0.5,
## -0.5, -0.5, 0, 0, 0, 0, 0, 0, 0, 0, 0.5)

outlier_check$multivariateOutliers$Observation

##   [1] 4381 152  259  107  82   1320 93   1371 1373 2345 87   92   1052 3126
##  [15] 755  693  170  1261 452  227  2084 84   731  292  18   282  1166 724 
##  [29] 1300 653  3531 2817 20   43   2287 4752 3253 3263 3017 4650 862  14  
##  [43] 4907 244  245  3934 3375 615  1925 1375 2259 3465 673  3288 3322 2634
##  [57] 354  340  3151 3786 443  3727 4643 39   2283 3238 182  560  565  396 
##  [71] 4472 241  3551 558  128  127  555  556  1571 554  4268 1559 652  796 
##  [85] 2758 2763 545  640  1795 1796 4348 4350 1794 4420 1435 1436 3936 691 
##  [99] 2845 4474 341  3526 4225 95   1972 3525
## 5000 Levels: 1 10 100 1000 1001 1002 1003 1004 1005 1006 1007 1008 ... 999

#create cap function to cap all outliers in dataframe.
cap <- function(x){
  quantiles <- quantile(x, c(0.05,0.25,0.75,0.95))
  x[x<(quantiles[2]-(1.5*IQR(x)))] <-  quantiles[1]
  x[x>(quantiles[3]+(1.5*IQR(x)))] <-  quantiles[4]
  x
}

#assign the new dataframe a new name.
adj_allwine <- as.data.frame(sapply(new_allwine[,-c(12,13)], cap))

#confirm new dataframe doesnt have any outliers by using mahalanobis distance.
outlier_check <- mvn(data=adj_allwine, multivariateOutlierMethod = "quan", showOutliers = TRUE)

## Warning in covMcd(data, alpha = alpha): The covariance matrix has become singular during
## the iterations of the MCD algorithm.
## There are 4432 observations (in the entire dataset of 5000 obs.)
## lying on the hyperplane with equation a_1*(x_i1 - m_1) + ... +
## a_p*(x_ip - m_p) = 0 with (m_1, ..., m_p) the mean of these
## observations and coefficients a_i from the vector a <- c(0.5, 0.5,
## 0.5, 0, 0, 0, 0, 0, 0, 0, 0, -0.5)

#add non-numeric columns c(12,13) for new_allwine.
adj_allwine$quality <- new_allwine[,12]
adj_allwine$colour <- new_allwine[,13]

Transform

The newly created data set had then been transformed in order to decrease the skewness and convert the distribution into a normal distribution. For this purpose, we had to first observe the distribution of each numeric variable of ‘adj_allwine’ by constructing histograms. We did so by using for loops over columns 1 to 12 of ‘adj_allwine’.

#observe all the histogram plot to select the most skewed variable for transformation
par(mfrow=c(3,3))
for (i in (1:length(colnames(adj_allwine[,-c(13,14)])))) {
  hist(adj_allwine[,i], main=colnames(adj_allwine[i]))
}

As can be observed from the histograms above, several variables had been skewed to the right. In the code chunks to come, various transformations had been evaluated to get rid of the right-skew in variables; volatile acidity, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide and alcohol. We had first saved the column numbers of the variables that had a right skew, as a vector called ‘skewed_variables’. We had then observed histograms for these variables by directly using a log (to the base 10) tranformation in a for loop.

#selected variables for transformation:
# volatile.acidity, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, alcohol.
skewed_variable <- c(2,4,5,6,7,11)

#Log to the base 10 transformation
par(mfrow=c(3,3))
for (i in 1:length(skewed_variable)){
  hist(log10(adj_allwine[,skewed_variable[i]]), main=colnames(adj_allwine[skewed_variable[i]]))
}

As can be observed from the histograms above, the transformation had succeeded in making the distribution of most skewed variables, relatively more symmetric. However, this had not been the case for the distribution of free and total sulfur dioxide. The next transformation that had been used was a log transformation.

#Log transformation
par(mfrow=c(3,3))
for (i in 1:length(skewed_variable)){
  hist(log(adj_allwine[,skewed_variable[i]]), main=colnames(adj_allwine[skewed_variable[i]]))
}

As can be observed from the histograms above, the transformation had succeeded in making the distribution of volatile acidity, chlorides and total sulfur dioxide, relatively more symmetric. We had then reconsturcted histograms using a square root transformation.

#Square root transformation
par(mfrow=c(3,3))
for (i in 1:length(skewed_variable)){
  hist(sqrt(adj_allwine[,skewed_variable[i]]), main=colnames(adj_allwine[skewed_variable[i]]))
}

The distrbution for free and total sulfur doxide had seemed to have improved. However, the distribution for volatile acidity had seemed to have gotten relatively more skewed. We had then compared these results to a cube root transformation.

#Cube root transformation
par(mfrow=c(3,3))
for (i in 1:length(skewed_variable)){
  hist((adj_allwine[,skewed_variable[i]])^(1/3), main=colnames(adj_allwine[skewed_variable[i]]))
}

The histograms above had indicated that the result of the cube root transformation were similar to the result of the square root transformation. We had then compared the results to a BoxCox transformation where the lambda measure had been set to “auto”.

#BoxCox transformation
par(mfrow=c(3,3))
for (i in 1:length(skewed_variable)){
  hist(BoxCox(adj_allwine[,skewed_variable[i]],lambda="auto"),
       main=colnames(adj_allwine[skewed_variable[i]]))
}

As can be observed from the histogram above, the transformation had managed to make the distribution for most of our variables, relatively more symmetric. It should be noted however, that no single transformation could have been used to improve the distribution of all skewed variables together (i.e., each variable would have had to be transformed in its own unique way). The results of the histograms we had constructed, indicated that the following transformations had been most successful in making its respective variable’s distribution most symmetric:

#boxcox,log,cuberoot bimodal or boxcox skew, squareroot or boxcox, boxcox or cuberoot, log or boxcox
trans_allwine <- adj_allwine #create a seperate dataframe identical to adj_allwine

#volatile.acidity: boxcox transformation 
trans_allwine[,2] <- BoxCox(adj_allwine[,2], lambda="auto")
#residual.sugar: log transformation 
trans_allwine[,4] <- log(adj_allwine[,4])
#chlorides: cube root transformation 
trans_allwine[,5] <- (adj_allwine[,5])^(1/3)
#free.sulfur.dioxide: square root transformation 
trans_allwine[,6] <- (adj_allwine[,6])^(1/2)
#total.sulfur.dioxide: boxcox transformation 
trans_allwine[,7] <- BoxCox(adj_allwine[,7], lambda="auto")
#alcohol: boxcox transformation 
trans_allwine[,11] <- log(adj_allwine[,11])

#transformed data for adj_allwine
head(trans_allwine)

The new columns (transformed or not) had been saved as a new dataframe called ‘trans_allwine’.

Additional: Regression using MLR package

The ‘mlr’ package had then been used to carry out a regression task to predict the alcohol level in each wine observation. The dataframe ‘reg_wine’ had consisted of all numeric variables in ‘adj_allwine’ except for total acidity, given that ‘total.acidity’ had been a perfect linear combination of three other variables (volatile acidity, fixed acidity, citric acid). The task had been specified as a regression task followed by which a learner had been created. The training sample had then been randomly selected from ‘reg_wine’ and represented 2/3rds of the dataset. The testing sample had been specified to include the observations that had not existed in the training sample and thus, had included only 1667 rows of wine observations. The fitted model had been assigned the name ‘model’ and the predictions had been saved as ‘pred’. As per the ‘performance’ function results, our model had had a Mean Squared Error measure of 0.3208557 and a Mean Absolute Error of 0.4218432. The predicted alcohol levels alongside the true alcohol levels can be observed below.

reg_wine <- adj_allwine[,c(1:11)]

wine.task=makeRegrTask(data=reg_wine, target="alcohol")
lrn=makeLearner("regr.glm")

n=nrow(reg_wine)
train.set=sample(n, size=2/3*n)
test.set=setdiff(1:n, train.set)

model=mlr::train(lrn, wine.task, subset=train.set)
pred=predict(model, task=wine.task, subset=test.set)
performance(pred,measure=list(mse,mae))

##       mse       mae 
## 0.3279932 0.4329706

head(pred$data)