The breast cancer data set breast-cancer-wisconsin.data.txt from https://archive.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original (description available at the same URL) has missing values.
1. Use the mean/mode imputation method to impute values for the missing data.
2. Use regression to impute values for the missing data.
3. Use regression with perturbation to impute values for the missing data.
Methodology - To fill in the missing values in our dataset, we will use three different imputation methods: mean or mode imputation, regression imputation, and regression with perturbation. The first step is to identify which fields contain missing data so we know where to apply each method. For mean or mode imputation, we’ll calculate the average for numerical variables and the most common value for categorical variables, then use those values to replace any missing entries. Next, for regression imputation, we’ll create a regression model that predicts the missing values based on other variables in the dataset and use those predictions to fill the gaps. Finally, for regression with perturbation imputation, we’ll take the predicted values from the regression model and add a small amount of random noise to them. This helps maintain the natural variation in the data and prevents the imputed values from being too uniform or unrealistic.
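Before applying the three methods to the cancer data, it may help to see them side by side on a tiny example. The sketch below uses an invented data frame (`toy`, with columns `x` and `y`), so all names and numbers are assumptions for illustration only. It also scales the noise by the model's residual standard error via `sigma()`, which is one common choice and not necessarily the one used later in this report.

```r
# Toy data, invented for illustration: y depends on x, with two values missing.
set.seed(1)
toy <- data.frame(x = 1:10)
toy$y <- 2 * toy$x + rnorm(10)
toy$y[c(3, 8)] <- NA
miss <- is.na(toy$y)

# 1. Mean imputation: every gap receives the same value.
y_mean <- toy$y
y_mean[miss] <- mean(toy$y, na.rm = TRUE)

# 2. Regression imputation: predict the gaps from x.
fit <- lm(y ~ x, data = toy)   # lm() drops the rows with a missing y by default
y_reg <- toy$y
y_reg[miss] <- predict(fit, newdata = toy[miss, , drop = FALSE])

# 3. Regression with perturbation: prediction plus random noise.
#    Assumption: noise is scaled by the residual standard error of the fit.
y_pert <- toy$y
y_pert[miss] <- y_reg[miss] + rnorm(sum(miss), mean = 0, sd = sigma(fit))
```

Mean imputation gives both gaps the same value, regression places them on the fitted line, and perturbation scatters them around that line.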
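# The UCI file has no header row; with header = TRUE the first record is consumed as column names, so 698 of the 699 records are read (header = FALSE would keep all of them and shift the row indices by one).
cancer_data = read.table("C:/Users/james/OneDrive/Desktop/Georgia Tech/ISYE 6501/Homeworks/Homework 10/Homework10_ISYE6501/Homework10_ISYE6501/data 14.1/breast-cancer-wisconsin.data.txt", header = TRUE, sep = ",", na.strings = "?")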
colnames(cancer_data) = c("Sample_code_number", "Clump_thickness", "Uniformity_of_cell_size", "Uniformity_of_cell_shape", "Marginal_adhesion", "Single_epithelial_cell_size", "Bare_nuclei", "Bland_chromatin", "Normal_nucleoli", "Mitoses", "Class")
head(cancer_data)
cancer_data
Let us find which field(s) have missing values.
cancer_data$Class = as.factor(cancer_data$Class)
levels(cancer_data$Class) = c(0,1)
summary(cancer_data)
## Sample_code_number Clump_thickness Uniformity_of_cell_size
## Min. : 61634 Min. : 1.000 Min. : 1.000
## 1st Qu.: 870258 1st Qu.: 2.000 1st Qu.: 1.000
## Median : 1171710 Median : 4.000 Median : 1.000
## Mean : 1071807 Mean : 4.417 Mean : 3.138
## 3rd Qu.: 1238354 3rd Qu.: 6.000 3rd Qu.: 5.000
## Max. :13454352 Max. :10.000 Max. :10.000
##
## Uniformity_of_cell_shape Marginal_adhesion Single_epithelial_cell_size
## Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 2.000
## Median : 1.000 Median : 1.000 Median : 2.000
## Mean : 3.211 Mean : 2.809 Mean : 3.218
## 3rd Qu.: 5.000 3rd Qu.: 4.000 3rd Qu.: 4.000
## Max. :10.000 Max. :10.000 Max. :10.000
##
## Bare_nuclei Bland_chromatin Normal_nucleoli Mitoses Class
## Min. : 1.000 Min. : 1.000 Min. : 1.00 Min. : 1.00 0:457
## 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 1.00 1st Qu.: 1.00 1:241
## Median : 1.000 Median : 3.000 Median : 1.00 Median : 1.00
## Mean : 3.548 Mean : 3.438 Mean : 2.87 Mean : 1.59
## 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 4.00 3rd Qu.: 1.00
## Max. :10.000 Max. :10.000 Max. :10.00 Max. :10.00
## NA's :16
It appears that the only field with missing values is Bare_nuclei, which has 16 NA's.
cancer_data[is.na(cancer_data$Bare_nuclei),]
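As an alternative to reading the `summary()` output by eye, missing values can be counted per column directly. The following is a minimal sketch on an invented data frame (`df` and its columns are hypothetical).

```r
# Invented example data with gaps in two of the three columns.
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA), c = 1:3)

# is.na() on a data frame returns a logical matrix; colSums() counts TRUEs per column.
na_counts <- colSums(is.na(df))
na_counts[na_counts > 0]   # keep only the columns that have missing values

# Row positions of the missing entries in one column.
which(is.na(df$a))
```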
Let us now compute the mean of Bare_nuclei, rounded to the nearest integer to stay on the data's scale, and use it to replace the missing entries.
data_mean =round(mean(as.integer(cancer_data$Bare_nuclei),na.rm=TRUE))
data_mean
## [1] 4
cancer_mean = cancer_data
# Replace only the missing Bare_nuclei entries with the rounded mean
cancer_mean$Bare_nuclei[is.na(cancer_mean$Bare_nuclei)] = data_mean
cancer_mean[c(23, 158, 249),]
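The methodology also mentions mode imputation, which can suit a field scored in a few discrete levels. Base R has no built-in function for the statistical mode, so the sketch below defines a hypothetical helper, `mode_value`, and applies it to an invented vector of scores.

```r
# Hypothetical helper: most frequent non-missing value of a numeric vector.
mode_value <- function(x) {
  counts <- table(x)   # table() ignores NA by default
  as.numeric(names(counts)[which.max(counts)])
}

# Invented scores with two missing entries.
scores <- c(1, 1, 10, NA, 2, 1, NA, 10)

scores_mode <- scores
scores_mode[is.na(scores_mode)] <- mode_value(scores)
scores_mode
```

With ties, `which.max()` returns the first maximum, so the smallest of the tied values would be used.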
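Let us now find the rows where Bare_nuclei is missing, set them aside, and split the remaining complete rows 80/20 into training and testing data sets. We keep columns 2 to 10, which drops the sample ID and the Class label.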
missing_values = as.vector(which(is.na(cancer_data$Bare_nuclei)))
cancer_not_empty = cancer_data[-missing_values, 2:10]
set.seed(555)  # seed the random number generator so the split is reproducible
cancer_training_rows = sample(nrow(cancer_not_empty), round(nrow(cancer_not_empty) *.8))
cancer_training = cancer_not_empty[cancer_training_rows,]
cancer_test = cancer_not_empty[-cancer_training_rows,]
nrow(cancer_training) + nrow(cancer_test)
## [1] 682
nrow(cancer_not_empty)
## [1] 682
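The split logic can be checked in isolation. This is a minimal sketch with a hypothetical row count `n`; it also shows that calling `set.seed()` as a function before `sample()` makes the split repeatable.

```r
n <- 100                                   # hypothetical number of complete rows

set.seed(555)                              # fix the RNG state before sampling
train_rows <- sample(n, round(n * 0.8))    # 80% of the row indices, without replacement
test_rows <- setdiff(seq_len(n), train_rows)

# Re-seeding and sampling again reproduces the same split.
set.seed(555)
train_rows_again <- sample(n, round(n * 0.8))
identical(train_rows, train_rows_again)
```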
Let us create our first model with all variables.
cancer_model = lm(Bare_nuclei ~.,data = cancer_training)
summary(cancer_model)
##
## Call:
## lm(formula = Bare_nuclei ~ ., data = cancer_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9663 -0.8860 -0.2451 0.6095 8.8015
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.52075 0.21970 -2.370 0.018127 *
## Clump_thickness 0.22161 0.04730 4.685 3.55e-06 ***
## Uniformity_of_cell_size -0.01038 0.08599 -0.121 0.903992
## Uniformity_of_cell_shape 0.30749 0.08430 3.648 0.000291 ***
## Marginal_adhesion 0.39653 0.05365 7.391 5.62e-13 ***
## Single_epithelial_cell_size 0.11069 0.07018 1.577 0.115300
## Bland_chromatin 0.21460 0.06760 3.175 0.001586 **
## Normal_nucleoli -0.02371 0.05221 -0.454 0.649942
## Mitoses -0.03743 0.06571 -0.570 0.569149
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.303 on 537 degrees of freedom
## Multiple R-squared: 0.6081, Adjusted R-squared: 0.6023
## F-statistic: 104.2 on 8 and 537 DF, p-value: < 2.2e-16
Let us create a second model with only significant variables from model 1 and then a third model using stepwise variable selection.
cancer_model2 = lm(Bare_nuclei~Clump_thickness + Uniformity_of_cell_shape + Marginal_adhesion + Bland_chromatin, data = cancer_training)
cancer_model3 = step(cancer_model, direction = 'both')
## Start: AIC=919.99
## Bare_nuclei ~ Clump_thickness + Uniformity_of_cell_size + Uniformity_of_cell_shape +
## Marginal_adhesion + Single_epithelial_cell_size + Bland_chromatin +
## Normal_nucleoli + Mitoses
##
## Df Sum of Sq RSS AIC
## - Uniformity_of_cell_size 1 0.077 2848.8 918.00
## - Normal_nucleoli 1 1.094 2849.8 918.19
## - Mitoses 1 1.721 2850.4 918.32
## <none> 2848.7 919.99
## - Single_epithelial_cell_size 1 13.199 2861.9 920.51
## - Bland_chromatin 1 53.465 2902.1 928.14
## - Uniformity_of_cell_shape 1 70.579 2919.2 931.35
## - Clump_thickness 1 116.434 2965.1 939.86
## - Marginal_adhesion 1 289.777 3138.4 970.88
##
## Step: AIC=918
## Bare_nuclei ~ Clump_thickness + Uniformity_of_cell_shape + Marginal_adhesion +
## Single_epithelial_cell_size + Bland_chromatin + Normal_nucleoli +
## Mitoses
##
## Df Sum of Sq RSS AIC
## - Normal_nucleoli 1 1.144 2849.9 916.22
## - Mitoses 1 1.761 2850.5 916.34
## <none> 2848.8 918.00
## - Single_epithelial_cell_size 1 13.461 2862.2 918.57
## + Uniformity_of_cell_size 1 0.077 2848.7 919.99
## - Bland_chromatin 1 54.374 2903.1 926.32
## - Uniformity_of_cell_shape 1 110.674 2959.4 936.81
## - Clump_thickness 1 117.386 2966.1 938.05
## - Marginal_adhesion 1 292.018 3140.8 969.28
##
## Step: AIC=916.22
## Bare_nuclei ~ Clump_thickness + Uniformity_of_cell_shape + Marginal_adhesion +
## Single_epithelial_cell_size + Bland_chromatin + Mitoses
##
## Df Sum of Sq RSS AIC
## - Mitoses 1 2.342 2852.2 914.67
## <none> 2849.9 916.22
## - Single_epithelial_cell_size 1 12.772 2862.7 916.66
## + Normal_nucleoli 1 1.144 2848.8 918.00
## + Uniformity_of_cell_size 1 0.127 2849.8 918.19
## - Bland_chromatin 1 53.315 2903.2 924.34
## - Uniformity_of_cell_shape 1 114.833 2964.7 935.79
## - Clump_thickness 1 116.562 2966.4 936.11
## - Marginal_adhesion 1 292.139 3142.0 967.50
##
## Step: AIC=914.67
## Bare_nuclei ~ Clump_thickness + Uniformity_of_cell_shape + Marginal_adhesion +
## Single_epithelial_cell_size + Bland_chromatin
##
## Df Sum of Sq RSS AIC
## <none> 2852.2 914.67
## - Single_epithelial_cell_size 1 10.890 2863.1 914.75
## + Mitoses 1 2.342 2849.9 916.22
## + Normal_nucleoli 1 1.725 2850.5 916.34
## + Uniformity_of_cell_size 1 0.204 2852.0 916.63
## - Bland_chromatin 1 54.297 2906.5 922.96
## - Uniformity_of_cell_shape 1 113.553 2965.8 933.98
## - Clump_thickness 1 114.653 2966.9 934.19
## - Marginal_adhesion 1 289.906 3142.1 965.52
summary(cancer_model2)
##
## Call:
## lm(formula = Bare_nuclei ~ Clump_thickness + Uniformity_of_cell_shape +
## Marginal_adhesion + Bland_chromatin, data = cancer_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8170 -0.9117 -0.2031 0.5274 8.7969
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.40770 0.19799 -2.059 0.039953 *
## Clump_thickness 0.22070 0.04665 4.731 2.85e-06 ***
## Uniformity_of_cell_shape 0.32339 0.05797 5.578 3.84e-08 ***
## Marginal_adhesion 0.40082 0.05187 7.727 5.39e-14 ***
## Bland_chromatin 0.22195 0.06472 3.429 0.000651 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.3 on 541 degrees of freedom
## Multiple R-squared: 0.6061, Adjusted R-squared: 0.6032
## F-statistic: 208.1 on 4 and 541 DF, p-value: < 2.2e-16
summary(cancer_model3)
##
## Call:
## lm(formula = Bare_nuclei ~ Clump_thickness + Uniformity_of_cell_shape +
## Marginal_adhesion + Single_epithelial_cell_size + Bland_chromatin,
## data = cancer_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0634 -0.8597 -0.2799 0.5625 8.8006
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.51280 0.21090 -2.431 0.01536 *
## Clump_thickness 0.21739 0.04666 4.659 4.01e-06 ***
## Uniformity_of_cell_shape 0.28972 0.06248 4.637 4.45e-06 ***
## Marginal_adhesion 0.38885 0.05249 7.409 4.95e-13 ***
## Single_epithelial_cell_size 0.09426 0.06565 1.436 0.15162
## Bland_chromatin 0.20924 0.06526 3.206 0.00142 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.298 on 540 degrees of freedom
## Multiple R-squared: 0.6076, Adjusted R-squared: 0.604
## F-statistic: 167.2 on 5 and 540 DF, p-value: < 2.2e-16
Let us now generate predictions from each model on our testing data; we will use whichever one performs the best.
predict1 = predict(cancer_model, cancer_test)
predict2 = predict(cancer_model2, cancer_test)
predict3 = predict(cancer_model3, cancer_test)
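# Bare_nuclei is column 6 of cancer_test (column 7 is Bland_chromatin), so refer to it by name.
# Rerunning with this fix will change the R-squared values printed below.
actual = cancer_test$Bare_nuclei
sst = sum((actual - mean(actual))^2)
ssr1 = sum((predict1 - actual)^2)
ssr2 = sum((predict2 - actual)^2)
ssr3 = sum((predict3 - actual)^2)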
rsq1 =1-ssr1/sst
rsq1
## [1] 0.9561603
rsq2 =1-ssr2/sst
rsq2
## [1] 0.9570105
rsq3 =1-ssr3/sst
rsq3
## [1] 0.9566043
Our second model performed the best against our testing data, although the three scores are nearly identical; thus, we are going to use it to impute our missing values.
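The test-set R-squared comparison above follows the formula 1 - SSR/SST, where both sums are taken over the response variable only. A small helper makes that explicit; `test_r_squared` is a hypothetical name, and the vectors are invented to check the two boundary cases.

```r
# Hypothetical helper: R-squared of predictions against held-out actual values.
test_r_squared <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)      # residual sum of squares
  ss_tot <- sum((actual - mean(actual))^2)   # total sum of squares of the response
  1 - ss_res / ss_tot
}

# Invented values: perfect predictions score 1, predicting the mean scores 0.
actual <- c(1, 3, 5, 7, 10)
test_r_squared(actual, actual)
test_r_squared(actual, rep(mean(actual), length(actual)))
```

With the objects in this report, the call would look like `test_r_squared(cancer_test$Bare_nuclei, predict2)`.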
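# Predict from the rows of the full data set that are missing Bare_nuclei
# (cancer_not_empty no longer contains those rows, so it cannot be indexed by missing_values).
cancer_missing = predict(cancer_model2, cancer_data[missing_values,])
cancer_regression = cancer_data
# Round to the nearest integer and keep the values on the 1-10 scale
cancer_regression[missing_values,]$Bare_nuclei = as.integer(pmin(pmax(round(cancer_missing), 1), 10))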
cancer_regression[c(23, 158, 249),]
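Lastly, we are going to add random noise to our regression values by drawing each one from a normal distribution centered on its prediction, with a standard deviation equal to that of the predictions.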
rand = rnorm(length(cancer_missing), mean = cancer_missing, sd = sd(cancer_missing))
rand
## [1] 0.9007211 4.5415657 3.3508027 5.7927188 2.3804458 4.1192112
## [7] 3.0781166 1.7722304 11.9505361 5.4117700 3.1183695 3.8375178
## [13] 7.6210357 -0.8016900 3.7571199 4.5042752
Let us now impute these values with the variance included.
cancer_pertub = cancer_data
# Round to the nearest integer and keep the values on the 1-10 scale
cancer_pertub[missing_values,]$Bare_nuclei = as.integer(pmin(pmax(round(rand), 1), 10))
cancer_pertub[c(23, 158, 249),]
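Bare_nuclei is scored on an integer scale from 1 to 10, so noisy predictions such as 11.95 or -0.80 need to be mapped back onto that scale before they are stored. One possible way is sketched below with a hypothetical helper, `to_score`, compared against plain truncation with `as.integer()`. The input values are invented.

```r
# Hypothetical helper: round to the nearest integer, then limit to [lower, upper].
to_score <- function(x, lower = 1, upper = 10) {
  pmin(pmax(round(x), lower), upper)
}

raw <- c(0.4, 0.9, 3.6, 11.95, -0.8)   # invented raw predictions

to_score(raw)      # 1 1 4 10 1 : every value lands on the 1-10 scale
as.integer(raw)    # 0 0 3 11 0 : truncation toward zero leaves the scale
```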
Discussion of Results - When we performed the mean imputation, we inserted the rounded mean value of 4 into all of the missing values. Then, when we used regression imputation, the three sample values imputed were 1, 0, and 1. Lastly, when we added random noise to the regression predictions, we got sample values of 0, 3, and 1. A value of 0 lies outside the 1 to 10 range of Bare_nuclei, so imputed values should be rounded and limited to that range. Otherwise, the values are broadly consistent with each other, with regression with perturbation best preserving the natural variation we would expect in real-life data.
Question 15.1 - Describe a situation or problem from your job, everyday life, current events, etc., for which optimization would be appropriate. What data would you need?
When planning around existing animal habitats, optimization can be a powerful tool to guide our analysis and decision-making. For example, if we were tasked with creating new shipping routes for supply carriers off the coast of California, we would need to be mindful of the marine animal populations that migrate through these waters. By using optimization methods, we could design routes that minimize disruptions to these migration patterns, reducing potential harm to marine life while still maintaining efficient and cost-effective transportation paths.
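To make the idea concrete, the route choice can be framed as minimizing disruption subject to a cost limit. The sketch below is only an illustration: the candidate routes, their costs, the disruption scores, and the budget are all invented numbers. A real model would need shipping cost data and migration data for each route.

```r
# Invented candidate routes, for illustration only.
routes <- data.frame(
  route = c("A", "B", "C", "D"),
  cost = c(100, 120, 135, 160),   # e.g. fuel and time cost per trip
  disruption = c(9, 5, 2, 1)      # e.g. expected overlap with migration corridors
)
budget <- 140                     # hypothetical maximum acceptable cost

# Constraint: keep only the routes within budget.
feasible <- routes[routes$cost <= budget, ]

# Objective: among the feasible routes, minimize disruption.
best <- feasible[which.min(feasible$disruption), ]
best
```

Here route D has the lowest disruption but exceeds the budget, so route C is selected.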