library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(rpart.plot)
## Loading required package: rpart
1.2. Import the data set:
setwd("E:/ML and DS projects/Kaggle/Breast Cancer Wisconsin")
breast_ml <- read.csv("data.csv")
2.1. We’ll first drop the irrelevant columns, column 1 (the sample ID) and column 33 (an empty trailing column), and then remove any rows with missing values:
breast_ml <- breast_ml[, c(-1, -33)]
breast_ml <- na.omit(breast_ml)
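Dropping by position only works while the file layout stays the same, and the order of the two steps matters: if an all-empty column is still present, na.omit() discards every row. The sketch below illustrates both points on a toy data frame. The column names "id" and "X" are assumptions about how read.csv() labels the ID column and the empty trailing column, so check names(breast_ml) before relying on them.

```r
# Toy stand-in for the raw file; "id" and "X" are assumed column names
# and the numbers are made up for illustration.
toy <- data.frame(
  id          = c(101, 102, 103),
  diagnosis   = c("M", "B", "B"),
  radius_mean = c(17.9, 13.5, 12.4),
  X           = NA                      # empty trailing column
)

# Calling na.omit() first removes every row, because X is NA everywhere.
rows_if_omitted_first <- nrow(na.omit(toy))

# Dropping the columns by name first keeps all complete rows.
toy_clean <- toy[, !(names(toy) %in% c("id", "X"))]
toy_clean <- na.omit(toy_clean)

rows_if_omitted_first   # 0
nrow(toy_clean)         # 3
names(toy_clean)        # "diagnosis" "radius_mean"
```

Selecting by name is a little more verbose than c(-1, -33), but it fails loudly if the columns are ever reordered.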
2.2. Check for near-zero-variance features:
nearZeroVar(breast_ml, saveMetrics = TRUE, freqCut = 2, uniqueCut = 5)
##                         freqRatio percentUnique zeroVar   nzv
## diagnosis                1.683962     0.3514938   FALSE FALSE
## radius_mean              1.333333    80.1405975   FALSE FALSE
## texture_mean             1.000000    84.1827768   FALSE FALSE
## perimeter_mean           1.000000    91.7398946   FALSE FALSE
## area_mean                1.500000    94.7275923   FALSE FALSE
## smoothness_mean          1.250000    83.3040422   FALSE FALSE
## compactness_mean         1.000000    94.3760984   FALSE FALSE
## concavity_mean           4.333333    94.3760984   FALSE FALSE
## concave.points_mean      4.333333    95.2548330   FALSE FALSE
## symmetry_mean            1.000000    75.9226714   FALSE FALSE
## fractal_dimension_mean   1.000000    87.6977153   FALSE FALSE
## radius_se                1.000000    94.9033392   FALSE FALSE
## texture_se               1.000000    91.2126538   FALSE FALSE
## perimeter_se             2.000000    93.6731107   FALSE FALSE
## area_se                  1.000000    92.7943761   FALSE FALSE
## smoothness_se            1.000000    96.1335677   FALSE FALSE
## compactness_se           1.000000    95.0790861   FALSE FALSE
## concavity_se             6.500000    93.6731107   FALSE FALSE
## concave.points_se        4.333333    89.1036907   FALSE FALSE
## symmetry_se              1.333333    87.5219684   FALSE FALSE
## fractal_dimension_se     1.000000    95.7820738   FALSE FALSE
## radius_worst             1.250000    80.3163445   FALSE FALSE
## texture_worst            1.000000    89.8066784   FALSE FALSE
## perimeter_worst          1.000000    90.3339192   FALSE FALSE
## area_worst               1.000000    95.6063269   FALSE FALSE
## smoothness_worst         1.000000    72.2319859   FALSE FALSE
## compactness_worst        1.000000    92.9701230   FALSE FALSE
## concavity_worst          4.333333    94.7275923   FALSE FALSE
## concave.points_worst     4.333333    86.4674868   FALSE FALSE
## symmetry_worst           1.000000    87.8734622   FALSE FALSE
## fractal_dimension_worst  1.500000    94.0246046   FALSE FALSE
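No feature is flagged: “nzv” is FALSE in every row. A few features exceed the freqCut of 2 (for example concavity_se, with a “freqRatio” of 6.5), but their “percentUnique” values are far above the uniqueCut of 5, so no feature pairs a high “freqRatio” with a low “percentUnique”.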
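To make the two metrics concrete, here is a minimal base R sketch that computes them by hand for a single vector. The helper name nzv_metrics is hypothetical, and it is a simplified reading of the rule (it ignores the separate “zeroVar” check and caret's exact boundary handling), not caret's implementation. The class counts 357 and 212 are an assumption; they are chosen because they reproduce the diagnosis row above.

```r
# Hypothetical helper: compute freqRatio and percentUnique for one vector.
nzv_metrics <- function(x, freqCut = 2, uniqueCut = 5) {
  counts <- sort(table(x), decreasing = TRUE)
  # Count of the most common value divided by that of the second most common.
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else 0
  # Distinct values as a percentage of the number of samples.
  percent_unique <- 100 * length(counts) / length(x)
  list(
    freqRatio     = freq_ratio,
    percentUnique = percent_unique,
    flagged       = freq_ratio > freqCut && percent_unique < uniqueCut
  )
}

# Assumed class counts (357 "B", 212 "M").
diagnosis <- rep(c("B", "M"), times = c(357, 212))
m <- nzv_metrics(diagnosis)
round(m$freqRatio, 6)      # 1.683962
round(m$percentUnique, 7)  # 0.3514938
m$flagged                  # FALSE: few unique values, but the ratio is below 2

# A heavily imbalanced toy vector trips both thresholds.
imbalanced <- nzv_metrics(rep(c(0, 1), times = c(95, 5)))
imbalanced$flagged         # TRUE
```

A feature is only flagged when both conditions hold, which is why diagnosis (low “percentUnique”, modest “freqRatio”) passes.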
2.3. Check for highly correlated features:
base_cor <- cor(breast_ml[,-1])
extreme_cor <- sum(abs(base_cor[upper.tri(base_cor)]) > .9)
extreme_cor
## [1] 21
There are 21 pairs of features with an absolute correlation above 0.9. At this point, I inspect “base_cor” and, whenever two features have an absolute correlation above 0.9, I drop one of them and keep the other. I then recompute “base_cor” and “extreme_cor” and repeat the process until no pair has an absolute correlation above 0.9.
breast_ml <- subset(breast_ml, select = -c(perimeter_mean, area_mean, radius_worst, perimeter_worst, area_worst))
breast_ml <- subset(breast_ml, select = -c(texture_worst, concave.points_mean, perimeter_se, area_se))
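The features above were chosen by hand. For comparison, the loop can also be automated. The following is a rough sketch on synthetic data, not the procedure used here: drop_correlated and count_extreme are hypothetical helpers, and the rule for choosing which member of a pair to drop (the one with the larger mean absolute correlation) is just one reasonable heuristic. caret also ships a findCorrelation() function with a cutoff argument that serves a similar purpose.

```r
set.seed(1)
n <- 200
a <- rnorm(n)
b <- rnorm(n)
# Synthetic data: two groups of near-duplicate features.
toy <- data.frame(
  a      = a,
  a_copy = a + rnorm(n, sd = 0.05),
  a_twin = a + rnorm(n, sd = 0.05),
  b      = b,
  b_copy = b + rnorm(n, sd = 0.05)
)

# Same count as "extreme_cor": pairs with |correlation| above the cutoff.
count_extreme <- function(df, cutoff = 0.9) {
  cm <- cor(df)
  sum(abs(cm[upper.tri(cm)]) > cutoff)
}

# Hypothetical helper: repeatedly find the most correlated pair and drop
# whichever member is, on average, more correlated with all the features.
drop_correlated <- function(df, cutoff = 0.9) {
  repeat {
    cm <- abs(cor(df))
    diag(cm) <- 0
    if (max(cm) <= cutoff) break
    pair  <- which(cm == max(cm), arr.ind = TRUE)[1, ]
    worse <- pair[which.max(colMeans(cm[, pair]))]
    df <- df[, -worse, drop = FALSE]
  }
  df
}

count_extreme(toy)                 # pairs above 0.9 before filtering
filtered <- drop_correlated(toy)
count_extreme(filtered)            # 0 after filtering
names(filtered)                    # one survivor from each group
```

Dropping one feature at a time and recomputing the matrix mirrors the manual check-and-repeat process, and it stops as soon as no pair exceeds the cutoff.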
2.4. Write out the filtered data set; I’ll continue the analysis in a Jupyter Notebook:
# Uncomment the following line to write the file
# (row.names = FALSE keeps R's row names out of the CSV)
# write.csv(breast_ml, file = "data_filtered.csv", row.names = FALSE)
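One detail worth checking before reading the file in another tool: by default write.csv() writes the row names as an extra, unnamed first column, while passing row.names = FALSE leaves it out. A small round trip on a toy data frame (made-up values, temporary files) shows the difference:

```r
toy <- data.frame(diagnosis = c("M", "B"), radius_mean = c(17.9, 13.5))

with_index    <- tempfile(fileext = ".csv")
without_index <- tempfile(fileext = ".csv")

write.csv(toy, file = with_index)                        # default: row names written
write.csv(toy, file = without_index, row.names = FALSE)  # data columns only

cols_with    <- names(read.csv(with_index))
cols_without <- names(read.csv(without_index))

cols_with      # "X" "diagnosis" "radius_mean"
cols_without   # "diagnosis" "radius_mean"
```

If the extra column is present, it has to be dropped again on the other side before modelling.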