Assignment: Exercises 3.1 and 3.2 from the Kuhn & Johnson (KJ) textbook, Applied Predictive Modeling
library(mlbench)
library(corrr)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following object is masked from 'package:tidyr':
##
## crossing
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(ggraph)
## Loading required package: ggplot2
library(e1071)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(caret)
## Loading required package: lattice
library(ggplot2)
library(finalfit)
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library(DMwR)
## Loading required package: grid
##
## Attaching package: 'DMwR'
## The following object is masked from 'package:plyr':
##
## join
3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
A boxplot of the predictors shows that the Si values are much larger than those of the other elements, while Na and Ca have the widest ranges among the remaining predictors. Ca appears to have the most outliers, and K is right-skewed.
df <- Glass[,1:9]
boxplot(df)
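Because all nine predictors share a single axis in this boxplot, Si's much larger magnitude compresses the other boxes. As a supplementary view, the sketch below (using the already-loaded tidyr and ggplot2; gather() and free facet scales are illustrative choices, not part of the original analysis) redraws each predictor's boxplot on its own scale.
# Sketch: one boxplot per predictor, each on its own scale
df %>%
  gather(key = "predictor", value = "value") %>%
  ggplot(aes(x = predictor, y = value)) +
  geom_boxplot() +
  facet_wrap(~ predictor, scales = "free") +
  xlab(NULL)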
A correlation matrix gives the correlation coefficients between each pair of the nine predictors. We can also use it to build a correlation network that visualizes the relationships among the predictors. Ca and Si are the most strongly correlated with the refractive index, with Ca positively correlated and Si negatively correlated.
correlation <- correlate(df)
##
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
correlation
## # A tibble: 9 x 10
## rowname RI Na Mg Al Si K
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 RI NA -0.192 -0.122 -0.407 -0.542 -0.290
## 2 Na -1.92e-1 NA -0.274 0.157 -0.0698 -0.266
## 3 Mg -1.22e-1 -0.274 NA -0.482 -0.166 0.00540
## 4 Al -4.07e-1 0.157 -0.482 NA -0.00552 0.326
## 5 Si -5.42e-1 -0.0698 -0.166 -0.00552 NA -0.193
## 6 K -2.90e-1 -0.266 0.00540 0.326 -0.193 NA
## 7 Ca 8.10e-1 -0.275 -0.444 -0.260 -0.209 -0.318
## 8 Ba -3.86e-4 0.327 -0.492 0.479 -0.102 -0.0426
## 9 Fe 1.43e-1 -0.241 0.0831 -0.0744 -0.0942 -0.00772
## # … with 3 more variables: Ca <dbl>, Ba <dbl>, Fe <dbl>
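As a quick numeric check of the claim above, the sketch below uses corrr's focus() to keep only each predictor's correlation with RI and sorts by magnitude. dplyr functions are namespaced explicitly because plyr (loaded last) masks several of them.
# Sketch: correlations with RI only, ordered by absolute value
correlation %>%
  focus(RI) %>%
  dplyr::arrange(dplyr::desc(abs(RI)))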
#Correlation network for variables
tidy_cors <- df %>%
  correlate() %>%
  stretch()
##
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
tidy_cors
## # A tibble: 81 x 3
## x y r
## <chr> <chr> <dbl>
## 1 RI RI NA
## 2 RI Na -0.192
## 3 RI Mg -0.122
## 4 RI Al -0.407
## 5 RI Si -0.542
## 6 RI K -0.290
## 7 RI Ca 0.810
## 8 RI Ba -0.000386
## 9 RI Fe 0.143
## 10 Na RI -0.192
## # … with 71 more rows
graph_cors <- tidy_cors %>%
  filter(abs(r) > .3) %>%
  graph_from_data_frame(directed = FALSE)
ggraph(graph_cors) +
  geom_edge_link(aes(edge_alpha = abs(r), edge_width = abs(r), color = r)) +
  guides(edge_alpha = "none", edge_width = "none") +
  scale_edge_colour_gradientn(limits = c(-1, 1), colors = c("firebrick2", "dodgerblue2")) +
  geom_node_point(color = "grey", size = 2) +
  geom_node_text(aes(label = name), repel = FALSE) +
  theme_graph()
## Using `stress` as default layout
Ca appears to have the most outliers, though K, Na, and Ba also show a considerable number. K and Ba are strongly right-skewed, and Ca is right-skewed as well.
Computing the skewness statistic for each predictor confirms that K is by far the most skewed; Ba, Ca, Fe, and RI also have skewness values above 1, and Mg is left-skewed with a value below -1.
skewValues <- apply(df, 2, skewness)
skewValues
## RI Na Mg Al Si K
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889
## Ca Ba Fe
## 2.0184463 3.3686800 1.7298107
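To illustrate how a transformation changes these statistics, the short sketch below recomputes skewness for K after a log transform. log1p() (i.e., log(1 + x)) is used here as a precaution against zero values in K.
# Sketch: skewness of K before and after a log transform
skewness(df$K)          # original K, heavily right-skewed
skewness(log1p(df$K))   # after log(1 + K)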
We can use the powerTransform function (with the Yeo-Johnson family) to estimate lambda values and decide which transformations would be best for these predictors. Based on the rounded powers, we would apply a square-root transformation to Ca, a log transformation to K, no transformation to Al or Na, and a square transformation to Mg. The remaining variables may benefit from other transformations as well.
summary(powerTransform(df[,2:9], family="yjPower")) #removed the RI variable as it was throwing errors
## yjPower Transformations to Multinormality
## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## Na 1.3489 1.00 0.5645 2.1334
## Mg 1.7747 2.00 1.5398 2.0096
## Al 0.8491 1.00 0.4986 1.1996
## Si 13.4963 13.50 8.1223 18.8703
## K -0.1622 0.00 -0.3639 0.0395
## Ca 0.6185 0.50 0.3660 0.8710
## Ba -6.9783 -6.98 -8.1385 -5.8181
## Fe -14.4378 -14.44 -17.2836 -11.5919
##
## Likelihood ratio test that all transformation parameters are equal to 0
## LRT df pval
## LR test, lambda = (0 0 0 0 0 0 0 0) 863.7414 8 < 2.22e-16
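Rather than applying each rounded power by hand, caret's preProcess() can estimate and apply Yeo-Johnson transformations across all predictors at once. The sketch below shows this approach and recomputes the skewness values; the estimated lambdas need not match the rounded powers above exactly.
# Sketch: estimate and apply Yeo-Johnson transformations with caret, then
# recheck skewness of the transformed predictors
yj <- preProcess(df, method = "YeoJohnson")
df_trans <- predict(yj, newdata = df)
apply(df_trans, 2, skewness)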
3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
?Soybean
## Help on topic 'Soybean' was found in the following packages:
##
## Package Library
## nlme /Library/Frameworks/R.framework/Versions/3.5/Resources/library
## mlbench /Library/Frameworks/R.framework/Versions/3.5/Resources/library
##
##
## Using the first match ...
A predictor has a degenerate distribution if it takes on only a single value, or only a handful of distinct values that occur with very low frequencies. These are also called near-zero-variance predictors, and we can identify them with caret's nearZeroVar function. Doing so flags leaf.mild, mycelium, and sclerotia for possible removal from the model.
nearZeroVar(Soybean, names = TRUE)
## [1] "leaf.mild" "mycelium" "sclerotia"
barplot(table(Soybean$leaf.mild))
barplot(table(Soybean$mycelium))
barplot(table(Soybean$sclerotia))
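The diagnostics behind these flags can be inspected directly: with saveMetrics = TRUE, nearZeroVar() reports the frequency ratio and percentage of unique values for every predictor. A short sketch:
# Sketch: show the metrics only for the predictors flagged as near-zero-variance
nzv_metrics <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv_metrics[nzv_metrics$nzv, ]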
The plot below shows the percentage of missing data by predictor. The hail, sever, seed.tmt, and lodging predictors are tied for the most missing values (121 each, about 18% of observations).
results <- sapply(Soybean, function(x) sum(is.na(x)))
results <- results/683*100
results <- as.data.frame(results)
results <- cbind(predictor = rownames(results), results)
rownames(results) <- 1:nrow(results)
ggplot(results, aes(x = reorder(predictor, results), y = results)) +
  geom_bar(stat = "identity", fill = "#0073C2FF", width = 1) +
  ylab("% of missing values") +
  xlab("Predictor") +
  ggtitle("Missing values for predictors") +
  theme(text = element_text(size = 6), panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), panel.background = element_blank(),
        axis.line = element_line(colour = "black")) +
  coord_flip()
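A quick tabular check of the same information (a base-R sketch) lists the predictors with the highest percentage of missing values:
# Sketch: top predictors by percentage of missing values
round(head(sort(colMeans(is.na(Soybean)) * 100, decreasing = TRUE)), 1)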
The following missingness-pattern summary (and its accompanying plot) shows which predictors tend to be missing together and how the patterns relate to the Class variable. Each row is a distinct pattern: the left-hand number is how many observations share that pattern, 1 (blue in the plot) marks observed values, 0 (red) marks missing values, and the bottom row totals the missing values per predictor.
explanatory <- c(colnames(Soybean[,-1]))
dependent <- "Class"
Soybean %>% missing_pattern(dependent, explanatory)
## Class leaves date area.dam crop.hist plant.growth stem temp roots
## 562 1 1 1 1 1 1 1 1 1
## 13 1 1 1 1 1 1 1 1 1
## 55 1 1 1 1 1 1 1 1 1
## 8 1 1 1 1 1 1 1 1 1
## 9 1 1 1 1 1 1 1 1 0
## 6 1 1 1 1 1 1 1 1 0
## 14 1 1 1 1 1 1 1 0 1
## 15 1 1 1 1 0 0 0 0 0
## 1 1 1 0 0 0 0 0 0 0
## 0 0 1 1 16 16 16 30 31
## plant.stand precip stem.cankers canker.lesion ext.decay mycelium
## 562 1 1 1 1 1 1
## 13 1 1 1 1 1 1
## 55 1 1 1 1 1 1
## 8 1 0 0 0 0 0
## 9 1 1 1 1 1 1
## 6 0 1 1 1 1 1
## 14 0 0 0 0 0 0
## 15 0 0 0 0 0 0
## 1 0 0 0 0 0 0
## 36 38 38 38 38 38
## int.discolor sclerotia leaf.halo leaf.marg leaf.size leaf.malf
## 562 1 1 1 1 1 1
## 13 1 1 1 1 1 1
## 55 1 1 0 0 0 0
## 8 0 0 1 1 1 1
## 9 1 1 0 0 0 0
## 6 1 1 0 0 0 0
## 14 0 0 0 0 0 0
## 15 0 0 1 1 1 1
## 1 0 0 1 1 1 1
## 38 38 84 84 84 84
## fruit.pods seed mold.growth seed.size leaf.shread fruiting.bodies
## 562 1 1 1 1 1 1
## 13 0 0 0 0 1 0
## 55 0 0 0 0 0 0
## 8 1 0 0 0 1 0
## 9 1 1 1 1 0 1
## 6 1 1 1 1 0 1
## 14 1 1 1 1 0 0
## 15 0 0 0 0 0 0
## 1 0 0 0 0 0 0
## 84 92 92 92 100 106
## fruit.spots seed.discolor shriveling leaf.mild germ hail sever
## 562 1 1 1 1 1 1 1
## 13 0 0 0 1 0 0 0
## 55 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0
## 9 1 1 1 0 1 0 0
## 6 1 1 1 0 0 0 0
## 14 0 0 0 0 0 0 0
## 15 0 0 0 0 0 0 0
## 1 0 0 0 0 0 0 0
## 106 106 106 108 112 121 121
## seed.tmt lodging
## 562 1 1 0
## 13 0 0 13
## 55 0 0 19
## 8 0 0 20
## 9 0 0 11
## 6 0 0 13
## 14 0 0 24
## 15 0 0 28
## 1 0 0 30
## 121 121 2337
Grouping the missing values by class shows that the phytophthora-rot class has the most missing values; the only other classes with any missing data are 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury.
ddply(Soybean, .(Class), colwise(.fun = function(x) sum(is.na(x))))
## Class date plant.stand precip temp hail crop.hist
## 1 2-4-d-injury 1 16 16 16 16 16
## 2 alternarialeaf-spot 0 0 0 0 0 0
## 3 anthracnose 0 0 0 0 0 0
## 4 bacterial-blight 0 0 0 0 0 0
## 5 bacterial-pustule 0 0 0 0 0 0
## 6 brown-spot 0 0 0 0 0 0
## 7 brown-stem-rot 0 0 0 0 0 0
## 8 charcoal-rot 0 0 0 0 0 0
## 9 cyst-nematode 0 14 14 14 14 0
## 10 diaporthe-pod-&-stem-blight 0 6 0 0 15 0
## 11 diaporthe-stem-canker 0 0 0 0 0 0
## 12 downy-mildew 0 0 0 0 0 0
## 13 frog-eye-leaf-spot 0 0 0 0 0 0
## 14 herbicide-injury 0 0 8 0 8 0
## 15 phyllosticta-leaf-spot 0 0 0 0 0 0
## 16 phytophthora-rot 0 0 0 0 68 0
## 17 powdery-mildew 0 0 0 0 0 0
## 18 purple-seed-stain 0 0 0 0 0 0
## 19 rhizoctonia-root-rot 0 0 0 0 0 0
## area.dam sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg
## 1 1 16 16 16 16 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0
## 9 0 14 14 14 0 0 14 14
## 10 0 15 15 6 0 0 15 15
## 11 0 0 0 0 0 0 0 0
## 12 0 0 0 0 0 0 0 0
## 13 0 0 0 0 0 0 0 0
## 14 0 8 8 8 0 0 0 0
## 15 0 0 0 0 0 0 0 0
## 16 0 68 68 68 0 0 55 55
## 17 0 0 0 0 0 0 0 0
## 18 0 0 0 0 0 0 0 0
## 19 0 0 0 0 0 0 0 0
## leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers
## 1 0 16 0 16 16 16 16
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0
## 9 14 14 14 14 0 14 14
## 10 15 15 15 15 0 15 0
## 11 0 0 0 0 0 0 0
## 12 0 0 0 0 0 0 0
## 13 0 0 0 0 0 0 0
## 14 0 0 0 8 0 8 8
## 15 0 0 0 0 0 0 0
## 16 55 55 55 55 0 68 0
## 17 0 0 0 0 0 0 0
## 18 0 0 0 0 0 0 0
## 19 0 0 0 0 0 0 0
## canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia
## 1 16 16 16 16 16 16
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## 7 0 0 0 0 0 0
## 8 0 0 0 0 0 0
## 9 14 14 14 14 14 14
## 10 0 0 0 0 0 0
## 11 0 0 0 0 0 0
## 12 0 0 0 0 0 0
## 13 0 0 0 0 0 0
## 14 8 8 8 8 8 8
## 15 0 0 0 0 0 0
## 16 0 68 0 0 0 0
## 17 0 0 0 0 0 0
## 18 0 0 0 0 0 0
## 19 0 0 0 0 0 0
## fruit.pods fruit.spots seed mold.growth seed.discolor seed.size
## 1 16 16 16 16 16 16
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## 7 0 0 0 0 0 0
## 8 0 0 0 0 0 0
## 9 0 14 0 0 14 0
## 10 0 0 0 0 0 0
## 11 0 0 0 0 0 0
## 12 0 0 0 0 0 0
## 13 0 0 0 0 0 0
## 14 0 8 8 8 8 8
## 15 0 0 0 0 0 0
## 16 68 68 68 68 68 68
## 17 0 0 0 0 0 0
## 18 0 0 0 0 0 0
## 19 0 0 0 0 0 0
## shriveling roots
## 1 16 16
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## 7 0 0
## 8 0 0
## 9 14 0
## 10 0 15
## 11 0 0
## 12 0 0
## 13 0 0
## 14 8 0
## 15 0 0
## 16 68 0
## 17 0 0
## 18 0 0
## 19 0 0
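A compact way to total the missing cells per class (a base-R sketch that sidesteps the plyr/dplyr masking noted earlier) is to sum the row-wise NA counts within each class; phytophthora-rot should show the largest total.
# Sketch: total number of missing cells per class, largest first
na_per_row <- rowSums(is.na(Soybean[, -1]))
sort(tapply(na_per_row, Soybean$Class, sum), decreasing = TRUE)[1:5]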
I’ll use k-nearest-neighbors (kNN) imputation, since I have not used it before and would like to learn more about it. kNN imputation replaces a missing value with a weighted average of the values from the most similar (nearest-neighbor) observations. We perform the imputation on all variables except the response.
knnOutput <- knnImputation(Soybean[, !names(Soybean) %in% "Class"])
anyNA(knnOutput)
## [1] FALSE
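Since the missingness is concentrated in particular classes, it may itself be informative. As a complement to imputation, one option is to keep 0/1 indicators recording which values were originally missing. A sketch (the "_missing" suffix and the Soybean_imputed name are illustrative choices, not part of the original analysis):
# Sketch: append a 0/1 missingness indicator for each predictor to the imputed data
missing_flags <- as.data.frame(lapply(Soybean[, -1], function(x) as.integer(is.na(x))))
names(missing_flags) <- paste0(names(missing_flags), "_missing")
Soybean_imputed <- cbind(Class = Soybean$Class, knnOutput, missing_flags)
dim(Soybean_imputed)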