1. Preparations:

1.1. Import required packages:

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(rpart.plot)
## Loading required package: rpart

1.2. Import the data set:

setwd("E:/ML and DS projects/Kaggle/Breast Cancer Wisconsin")
breast_ml <- read.csv("data.csv")
2. Preprocess the data:

2.1. We’ll first drop the two irrelevant columns, column 1 (the sample “id”) and column 33 (an empty trailing column), and then remove any rows with missing values:

breast_ml <- breast_ml[, c(-1, -33)]
breast_ml <- na.omit(breast_ml)

2.2. Check for near-zero-variance features:

nearZeroVar(breast_ml, saveMetrics = TRUE, freqCut = 2, uniqueCut = 5)
##                         freqRatio percentUnique zeroVar   nzv
## diagnosis                1.683962     0.3514938   FALSE FALSE
## radius_mean              1.333333    80.1405975   FALSE FALSE
## texture_mean             1.000000    84.1827768   FALSE FALSE
## perimeter_mean           1.000000    91.7398946   FALSE FALSE
## area_mean                1.500000    94.7275923   FALSE FALSE
## smoothness_mean          1.250000    83.3040422   FALSE FALSE
## compactness_mean         1.000000    94.3760984   FALSE FALSE
## concavity_mean           4.333333    94.3760984   FALSE FALSE
## concave.points_mean      4.333333    95.2548330   FALSE FALSE
## symmetry_mean            1.000000    75.9226714   FALSE FALSE
## fractal_dimension_mean   1.000000    87.6977153   FALSE FALSE
## radius_se                1.000000    94.9033392   FALSE FALSE
## texture_se               1.000000    91.2126538   FALSE FALSE
## perimeter_se             2.000000    93.6731107   FALSE FALSE
## area_se                  1.000000    92.7943761   FALSE FALSE
## smoothness_se            1.000000    96.1335677   FALSE FALSE
## compactness_se           1.000000    95.0790861   FALSE FALSE
## concavity_se             6.500000    93.6731107   FALSE FALSE
## concave.points_se        4.333333    89.1036907   FALSE FALSE
## symmetry_se              1.333333    87.5219684   FALSE FALSE
## fractal_dimension_se     1.000000    95.7820738   FALSE FALSE
## radius_worst             1.250000    80.3163445   FALSE FALSE
## texture_worst            1.000000    89.8066784   FALSE FALSE
## perimeter_worst          1.000000    90.3339192   FALSE FALSE
## area_worst               1.000000    95.6063269   FALSE FALSE
## smoothness_worst         1.000000    72.2319859   FALSE FALSE
## compactness_worst        1.000000    92.9701230   FALSE FALSE
## concavity_worst          4.333333    94.7275923   FALSE FALSE
## concave.points_worst     4.333333    86.4674868   FALSE FALSE
## symmetry_worst           1.000000    87.8734622   FALSE FALSE
## fractal_dimension_worst  1.500000    94.0246046   FALSE FALSE

No feature combines a high “freqRatio” with a low “percentUnique”, so none is flagged as near-zero variance (the “nzv” column is FALSE throughout) and all features are kept.
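
To make the two metrics concrete, here is a minimal sketch that recomputes them by hand on a toy data frame. The data, the helper names “freq_ratio” and “pct_unique”, and the flagging rule (freqRatio above the cut together with percentUnique below the cut) are my own assumptions for illustration; they are not taken from caret's source code.

```r
# Toy data (hypothetical values, not from the Wisconsin data set)
toy <- data.frame(
  varied   = seq_len(50) / 10,   # 50 distinct values
  lopsided = c(rep(0, 49), 1)    # one value dominates
)

# Count of the most common value divided by the count of the second most common
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(NA_real_)  # undefined with one distinct value
  as.numeric(counts[1]) / as.numeric(counts[2])
}

# Distinct values as a percentage of the number of samples
pct_unique <- function(x) 100 * length(unique(x)) / length(x)

metrics <- data.frame(
  freqRatio     = sapply(toy, freq_ratio),
  percentUnique = sapply(toy, pct_unique),
  row.names     = names(toy)
)

# Assumed rule, using the same cuts as above (freqCut = 2, uniqueCut = 5)
metrics$flagged <- metrics$freqRatio > 2 & metrics$percentUnique < 5
print(metrics)
```

Only the “lopsided” column should be flagged here, because it is the only one that is both dominated by a single value and nearly constant across samples.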

2.3. Check for highly correlated features:

base_cor <- cor(breast_ml[,-1])
extreme_cor <- sum(abs(base_cor[upper.tri(base_cor)]) > .9)
extreme_cor
## [1] 21
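
The count alone does not say which features are involved. A possible way to list the offending pairs by name is sketched below on simulated data; the toy columns “a”, “b”, “c” and the object names are hypothetical, and the same indexing should carry over to “base_cor”.

```r
# Simulated data: "b" is a noisy copy of "a", "c" is independent
set.seed(1)
a <- rnorm(100)
toy <- data.frame(a = a, b = a + rnorm(100, sd = 0.1), c = rnorm(100))

toy_cor <- cor(toy)

# Positions in the upper triangle whose absolute correlation exceeds 0.9
hits <- which(abs(toy_cor) > 0.9 & upper.tri(toy_cor), arr.ind = TRUE)

high_pairs <- data.frame(
  feature_1   = rownames(toy_cor)[hits[, "row"]],
  feature_2   = colnames(toy_cor)[hits[, "col"]],
  correlation = toy_cor[hits]
)
print(high_pairs)
```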

Twenty-one pairs of features have an absolute correlation above 0.9. To deal with them, I inspect “base_cor” and, whenever two features have an absolute correlation above 0.9, I drop one of the two. I then recompute “base_cor” and “extreme_cor” and repeat the process until no pair exceeds 0.9.

breast_ml <- subset(breast_ml, select = -c(perimeter_mean, area_mean, radius_worst, perimeter_worst, area_worst))
breast_ml <- subset(breast_ml, select = -c(texture_worst, concave.points_mean, perimeter_se, area_se))
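
I picked the columns above by hand. For reference, the same check-and-drop loop could be automated roughly as follows. This is only a sketch on simulated data: the function name “drop_correlated” is hypothetical, and the rule of always dropping the second feature of the strongest pair is an assumption, so it will not necessarily reproduce my manual choices.

```r
# Repeatedly drop one feature from the most strongly correlated pair
# until no absolute correlation exceeds the cutoff
drop_correlated <- function(df, cutoff = 0.9) {
  repeat {
    cm <- abs(cor(df))
    cm[lower.tri(cm, diag = TRUE)] <- 0          # look at each pair once
    if (max(cm) <= cutoff) break
    hit <- which(cm == max(cm), arr.ind = TRUE)[1, ]
    df <- df[, -hit[["col"]], drop = FALSE]      # assumed rule: drop the later column
  }
  df
}

# Simulated data: "b" and "c" are noisy copies of "a", "d" is independent
set.seed(42)
a <- rnorm(200)
toy <- data.frame(
  a = a,
  b = a + rnorm(200, sd = 0.1),
  c = a + rnorm(200, sd = 0.1),
  d = rnorm(200)
)

filtered <- drop_correlated(toy)
print(names(filtered))
```

The caret package also provides findCorrelation() for this kind of filtering; it uses its own rule for deciding which feature of a pair to remove.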

2.4. Export the filtered data set; I’ll continue the analysis in a Jupyter Notebook:

# uncomment the following line
# write.csv(breast_ml, file = "data_filtered.csv")