ISYE 6501 Homework 10

The breast cancer data set breast-cancer-wisconsin.data.txt, available from https://archive.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original (description available at the same URL), has missing values.
1. Use the mean/mode imputation method to impute values for the missing data.
2. Use regression to impute values for the missing data.
3. Use regression with perturbation to impute values for the missing data.

Methodology - To fill in the missing values in our dataset, we will use three different imputation methods: mean or mode imputation, regression imputation, and regression with perturbation. The first step is to identify which fields contain missing data so we know where to apply each method. For mean or mode imputation, we’ll calculate the average for numerical variables and the most common value for categorical variables, then use those values to replace any missing entries. Next, for regression imputation, we’ll create a regression model that predicts the missing values based on other variables in the dataset and use those predictions to fill the gaps. Finally, for regression with perturbation imputation, we’ll take the predicted values from the regression model and add a small amount of random noise to them. This helps maintain the natural variation in the data and prevents the imputed values from being too uniform or unrealistic.

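# The raw file has no header row, so header = TRUE consumes the first record as column names; 698 of the 699 records remain.
cancer_data = read.table("C:/Users/james/OneDrive/Desktop/Georgia Tech/ISYE 6501/Homeworks/Homework 10/Homework10_ISYE6501/Homework10_ISYE6501/data 14.1/breast-cancer-wisconsin.data.txt", header = TRUE, sep = ",", na.strings = "?")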

colnames(cancer_data) = c("Sample_code_number", "Clump_thickness", "Uniformity_of_cell_size", "Uniformity_of_cell_shape", "Marginal_adhesion", "Single_epithelial_cell_size", "Bare_nuclei", "Bland_chromatin", "Normal_nucleoli", "Mitoses", "Class")

head(cancer_data)

cancer_data

Let us find which field(s) have missing values.

cancer_data$Class = as.factor(cancer_data$Class)
levels(cancer_data$Class) = c(0,1)
summary(cancer_data)

##  Sample_code_number Clump_thickness  Uniformity_of_cell_size
##  Min.   :   61634   Min.   : 1.000   Min.   : 1.000         
##  1st Qu.:  870258   1st Qu.: 2.000   1st Qu.: 1.000         
##  Median : 1171710   Median : 4.000   Median : 1.000         
##  Mean   : 1071807   Mean   : 4.417   Mean   : 3.138         
##  3rd Qu.: 1238354   3rd Qu.: 6.000   3rd Qu.: 5.000         
##  Max.   :13454352   Max.   :10.000   Max.   :10.000         
##                                                             
##  Uniformity_of_cell_shape Marginal_adhesion Single_epithelial_cell_size
##  Min.   : 1.000           Min.   : 1.000    Min.   : 1.000             
##  1st Qu.: 1.000           1st Qu.: 1.000    1st Qu.: 2.000             
##  Median : 1.000           Median : 1.000    Median : 2.000             
##  Mean   : 3.211           Mean   : 2.809    Mean   : 3.218             
##  3rd Qu.: 5.000           3rd Qu.: 4.000    3rd Qu.: 4.000             
##  Max.   :10.000           Max.   :10.000    Max.   :10.000             
##                                                                        
##   Bare_nuclei     Bland_chromatin  Normal_nucleoli    Mitoses      Class  
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.00   Min.   : 1.00   0:457  
##  1st Qu.: 1.000   1st Qu.: 2.000   1st Qu.: 1.00   1st Qu.: 1.00   1:241  
##  Median : 1.000   Median : 3.000   Median : 1.00   Median : 1.00          
##  Mean   : 3.548   Mean   : 3.438   Mean   : 2.87   Mean   : 1.59          
##  3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 4.00   3rd Qu.: 1.00          
##  Max.   :10.000   Max.   :10.000   Max.   :10.00   Max.   :10.00          
##  NA's   :16

The summary shows that Bare_nuclei is the only field with missing values (16 NA's).

cancer_data[is.na(cancer_data$Bare_nuclei),]
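With 16 missing values in 698 rows, roughly 2.3% of the rows are affected. The missing-value counts can also be read off directly instead of scanning the summary output. The sketch below is a minimal illustration on a made-up toy data frame (the name toy and its values are assumptions, not part of the homework data).

```r
# Toy data frame standing in for cancer_data (values are made up)
toy = data.frame(
  Clump_thickness = c(5, 3, 8, 1),
  Bare_nuclei     = c(1, NA, 10, NA)
)

# Number of missing values in each column
na_counts = colSums(is.na(toy))
na_counts

# Share of rows with a missing Bare_nuclei value
na_share = mean(is.na(toy$Bare_nuclei))
na_share
```

The same two calls applied to cancer_data would report the count and share of missing entries per field.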

Let us now find the mean of Bare_nuclei, rounded to the nearest integer because the field is an integer score, and use it to replace the missing entries.

data_mean = round(mean(as.integer(cancer_data$Bare_nuclei), na.rm = TRUE))
data_mean

## [1] 4

cancer_mean = cancer_data
# Bare_nuclei is the only field with NA's, so fill that column explicitly
cancer_mean$Bare_nuclei[is.na(cancer_mean$Bare_nuclei)] = data_mean
cancer_mean[c(23, 158, 249),]
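The assignment also allows mode imputation, which we did not run above. As a hedged sketch of how it could look, the helper stat_mode below is hypothetical (base R has no built-in statistical mode function) and the toy vector is made up for illustration.

```r
# Hypothetical helper: most frequent non-missing value.
# table() drops NA by default; ties go to the smallest value
# because which.max() returns the first maximum.
stat_mode = function(x) {
  counts = table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Toy vector standing in for Bare_nuclei (values are made up)
toy = c(1, 1, 10, NA, 3, 1, NA, 10)

toy_mode = stat_mode(toy)
toy_imputed = ifelse(is.na(toy), toy_mode, toy)
toy_imputed
```

For a skewed integer score, the mode may be a more representative fill-in value than the rounded mean, but that is a judgment call rather than a result shown here.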

For regression imputation, we first drop the rows where Bare_nuclei is missing and then split the remaining rows into training and testing data sets.

missing_values = as.vector(which(is.na(cancer_data$Bare_nuclei)))
cancer_not_empty = cancer_data[-missing_values, 2:10]

# Call set.seed() as a function; assigning to a variable named set.seed would not seed the generator
set.seed(555)

cancer_training_rows = sample(nrow(cancer_not_empty), round(nrow(cancer_not_empty) *.8))

cancer_training = cancer_not_empty[cancer_training_rows,]

cancer_test = cancer_not_empty[-cancer_training_rows,]

nrow(cancer_training) + nrow(cancer_test)

## [1] 682

nrow(cancer_not_empty)

## [1] 682

Let us create our first model with all variables.

cancer_model = lm(Bare_nuclei ~.,data = cancer_training)
summary(cancer_model)

## 
## Call:
## lm(formula = Bare_nuclei ~ ., data = cancer_training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9663 -0.8860 -0.2451  0.6095  8.8015 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -0.52075    0.21970  -2.370 0.018127 *  
## Clump_thickness              0.22161    0.04730   4.685 3.55e-06 ***
## Uniformity_of_cell_size     -0.01038    0.08599  -0.121 0.903992    
## Uniformity_of_cell_shape     0.30749    0.08430   3.648 0.000291 ***
## Marginal_adhesion            0.39653    0.05365   7.391 5.62e-13 ***
## Single_epithelial_cell_size  0.11069    0.07018   1.577 0.115300    
## Bland_chromatin              0.21460    0.06760   3.175 0.001586 ** 
## Normal_nucleoli             -0.02371    0.05221  -0.454 0.649942    
## Mitoses                     -0.03743    0.06571  -0.570 0.569149    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.303 on 537 degrees of freedom
## Multiple R-squared:  0.6081, Adjusted R-squared:  0.6023 
## F-statistic: 104.2 on 8 and 537 DF,  p-value: < 2.2e-16

Let us create a second model with only significant variables from model 1 and then a third model using stepwise variable selection.

cancer_model2 = lm(Bare_nuclei~Clump_thickness + Uniformity_of_cell_shape + Marginal_adhesion + Bland_chromatin, data = cancer_training)

cancer_model3 = step(cancer_model, direction = 'both')

## Start:  AIC=919.99
## Bare_nuclei ~ Clump_thickness + Uniformity_of_cell_size + Uniformity_of_cell_shape + 
##     Marginal_adhesion + Single_epithelial_cell_size + Bland_chromatin + 
##     Normal_nucleoli + Mitoses
## 
##                               Df Sum of Sq    RSS    AIC
## - Uniformity_of_cell_size      1     0.077 2848.8 918.00
## - Normal_nucleoli              1     1.094 2849.8 918.19
## - Mitoses                      1     1.721 2850.4 918.32
## <none>                                     2848.7 919.99
## - Single_epithelial_cell_size  1    13.199 2861.9 920.51
## - Bland_chromatin              1    53.465 2902.1 928.14
## - Uniformity_of_cell_shape     1    70.579 2919.2 931.35
## - Clump_thickness              1   116.434 2965.1 939.86
## - Marginal_adhesion            1   289.777 3138.4 970.88
## 
## Step:  AIC=918
## Bare_nuclei ~ Clump_thickness + Uniformity_of_cell_shape + Marginal_adhesion + 
##     Single_epithelial_cell_size + Bland_chromatin + Normal_nucleoli + 
##     Mitoses
## 
##                               Df Sum of Sq    RSS    AIC
## - Normal_nucleoli              1     1.144 2849.9 916.22
## - Mitoses                      1     1.761 2850.5 916.34
## <none>                                     2848.8 918.00
## - Single_epithelial_cell_size  1    13.461 2862.2 918.57
## + Uniformity_of_cell_size      1     0.077 2848.7 919.99
## - Bland_chromatin              1    54.374 2903.1 926.32
## - Uniformity_of_cell_shape     1   110.674 2959.4 936.81
## - Clump_thickness              1   117.386 2966.1 938.05
## - Marginal_adhesion            1   292.018 3140.8 969.28
## 
## Step:  AIC=916.22
## Bare_nuclei ~ Clump_thickness + Uniformity_of_cell_shape + Marginal_adhesion + 
##     Single_epithelial_cell_size + Bland_chromatin + Mitoses
## 
##                               Df Sum of Sq    RSS    AIC
## - Mitoses                      1     2.342 2852.2 914.67
## <none>                                     2849.9 916.22
## - Single_epithelial_cell_size  1    12.772 2862.7 916.66
## + Normal_nucleoli              1     1.144 2848.8 918.00
## + Uniformity_of_cell_size      1     0.127 2849.8 918.19
## - Bland_chromatin              1    53.315 2903.2 924.34
## - Uniformity_of_cell_shape     1   114.833 2964.7 935.79
## - Clump_thickness              1   116.562 2966.4 936.11
## - Marginal_adhesion            1   292.139 3142.0 967.50
## 
## Step:  AIC=914.67
## Bare_nuclei ~ Clump_thickness + Uniformity_of_cell_shape + Marginal_adhesion + 
##     Single_epithelial_cell_size + Bland_chromatin
## 
##                               Df Sum of Sq    RSS    AIC
## <none>                                     2852.2 914.67
## - Single_epithelial_cell_size  1    10.890 2863.1 914.75
## + Mitoses                      1     2.342 2849.9 916.22
## + Normal_nucleoli              1     1.725 2850.5 916.34
## + Uniformity_of_cell_size      1     0.204 2852.0 916.63
## - Bland_chromatin              1    54.297 2906.5 922.96
## - Uniformity_of_cell_shape     1   113.553 2965.8 933.98
## - Clump_thickness              1   114.653 2966.9 934.19
## - Marginal_adhesion            1   289.906 3142.1 965.52

summary(cancer_model2)

## 
## Call:
## lm(formula = Bare_nuclei ~ Clump_thickness + Uniformity_of_cell_shape + 
##     Marginal_adhesion + Bland_chromatin, data = cancer_training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.8170 -0.9117 -0.2031  0.5274  8.7969 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -0.40770    0.19799  -2.059 0.039953 *  
## Clump_thickness           0.22070    0.04665   4.731 2.85e-06 ***
## Uniformity_of_cell_shape  0.32339    0.05797   5.578 3.84e-08 ***
## Marginal_adhesion         0.40082    0.05187   7.727 5.39e-14 ***
## Bland_chromatin           0.22195    0.06472   3.429 0.000651 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.3 on 541 degrees of freedom
## Multiple R-squared:  0.6061, Adjusted R-squared:  0.6032 
## F-statistic: 208.1 on 4 and 541 DF,  p-value: < 2.2e-16

summary(cancer_model3)

## 
## Call:
## lm(formula = Bare_nuclei ~ Clump_thickness + Uniformity_of_cell_shape + 
##     Marginal_adhesion + Single_epithelial_cell_size + Bland_chromatin, 
##     data = cancer_training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.0634  -0.8597  -0.2799   0.5625   8.8006 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -0.51280    0.21090  -2.431  0.01536 *  
## Clump_thickness              0.21739    0.04666   4.659 4.01e-06 ***
## Uniformity_of_cell_shape     0.28972    0.06248   4.637 4.45e-06 ***
## Marginal_adhesion            0.38885    0.05249   7.409 4.95e-13 ***
## Single_epithelial_cell_size  0.09426    0.06565   1.436  0.15162    
## Bland_chromatin              0.20924    0.06526   3.206  0.00142 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.298 on 540 degrees of freedom
## Multiple R-squared:  0.6076, Adjusted R-squared:  0.604 
## F-statistic: 167.2 on 5 and 540 DF,  p-value: < 2.2e-16

Let us now generate predictions from each model on our testing data; we will use whichever model performs best.

predict1 = predict(cancer_model, cancer_test)
predict2 = predict(cancer_model2, cancer_test)
predict3 = predict(cancer_model3, cancer_test)


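# Bare_nuclei is the response (column 6 of cancer_test; column 7 is Bland_chromatin),
# so refer to it by name and compute the sums of squares on that column only
actual = cancer_test$Bare_nuclei
sst = sum((actual - mean(actual))^2)
ssr1 = sum((predict1 - actual)^2)
ssr2 = sum((predict2 - actual)^2)
ssr3 = sum((predict3 - actual)^2)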

rsq1 =1-ssr1/sst
rsq1

## [1] 0.9561603

rsq2 =1-ssr2/sst
rsq2

## [1] 0.9570105

rsq3 =1-ssr3/sst
rsq3

## [1] 0.9566043
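Test-set R-squared compares the residual sum of squares with the total sum of squares of the response column alone. The sketch below wraps that calculation in a small helper; the name r_squared and the toy vector are assumptions made for illustration.

```r
# Hypothetical helper: R-squared of predictions against observed values
r_squared = function(actual, predicted) {
  ss_res = sum((actual - predicted)^2)      # residual sum of squares
  ss_tot = sum((actual - mean(actual))^2)   # total sum of squares
  1 - ss_res / ss_tot
}

# Toy response vector (values are made up)
actual = c(1, 2, 3, 4, 5)

perfect_fit = r_squared(actual, actual)                          # predictions equal the data
mean_only   = r_squared(actual, rep(mean(actual), length(actual)))  # always predict the mean
perfect_fit
mean_only
```

A perfect prediction scores 1 and always predicting the mean scores 0, which gives a quick sanity check on any value computed for the three models.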

The three models score almost identically on the testing data in the values reported above. The second model has the highest value and the fewest predictors; thus, we are going to use it to impute our missing values.

# Predict from the original data: cancer_not_empty no longer contains the rows with missing Bare_nuclei
cancer_missing = predict(cancer_model2, cancer_data[missing_values,])
cancer_regression = cancer_data
cancer_regression[missing_values,]$Bare_nuclei = as.integer(cancer_missing)
cancer_regression[c(23, 158, 249),]
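Because as.integer truncates toward zero, a prediction below 1 becomes 0, which falls outside the 1 to 10 range seen in the summary. One possible alternative, sketched here under that assumption, is to round to the nearest integer and clamp to the scale. The helper to_scale and the sample predictions are hypothetical.

```r
# Hypothetical helper: map continuous predictions onto an integer scale
to_scale = function(pred, lo = 1, hi = 10) {
  pmin(pmax(round(pred), lo), hi)   # round first, then clamp to [lo, hi]
}

# Made-up predictions, including values below and above the scale
sample_pred = c(-0.8, 0.4, 1.6, 4.49, 11.95)
scaled = to_scale(sample_pred)
scaled
```

Rounding rather than truncating also avoids systematically biasing the imputed values downward.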

Lastly, we are going to perturb our regression predictions by adding random noise drawn from a normal distribution.

rand = rnorm(length(cancer_missing), mean = cancer_missing, sd = sd(cancer_missing))
rand

##  [1]  0.9007211  4.5415657  3.3508027  5.7927188  2.3804458  4.1192112
##  [7]  3.0781166  1.7722304 11.9505361  5.4117700  3.1183695  3.8375178
## [13]  7.6210357 -0.8016900  3.7571199  4.5042752

Let us now impute these perturbed values.

cancer_pertub = cancer_data
# abs() reflects negative draws to positive values; as.integer() then truncates toward zero
cancer_pertub[missing_values,]$Bare_nuclei = as.integer(abs(rand))
cancer_pertub[c(23, 158, 249),]
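The noise above uses the standard deviation of the predictions themselves. A different choice, shown here only as a sketch and not as the method used in this homework, is to draw the noise with the model's residual standard error, which reflects how far actual values tend to fall from the fitted line. The toy data, the seed, and all variable names below are assumptions.

```r
set.seed(1)  # arbitrary seed, chosen only so the sketch is repeatable

# Toy data (made up): y depends linearly on x plus noise
x = 1:50
y = 1 + 0.15 * x + rnorm(50, sd = 0.8)
fit = lm(y ~ x)

# Predictions for three hypothetical rows with a missing response
new_x = data.frame(x = c(10, 20, 30))
pred = predict(fit, new_x)

# sigma() returns the residual standard error of the fitted model
noisy = rnorm(length(pred), mean = pred, sd = sigma(fit))

# Round and clamp the draws to a 1-10 integer scale
noisy_scaled = pmin(pmax(round(noisy), 1), 10)
noisy_scaled
```

Clamping instead of taking absolute values keeps every imputed value on the scale without reflecting negative draws.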

Discussion of Results - When we performed the mean imputation, we inserted the rounded mean value of 4 into all of the missing entries. Then, when we used our regression imputation, the three sample values imputed were 1, 0, and 1. Lastly, when we added random noise to the regression predictions, we got sample values of 0, 3, and 1. The regression-based values are close to one another, and regression with perturbation best preserves the natural variation we would expect in real-life data. Note that a value of 0 lies outside the observed 1 to 10 range of Bare_nuclei; it arises because as.integer truncates values below 1 down to 0.

Question 15.1 - Describe a situation or problem from your job, everyday life, current events, etc., for which optimization would be appropriate. What data would you need?

When planning around existing animal habitats, optimization can be a powerful tool to guide our analysis and decision-making. For example, if we were tasked with creating new shipping routes for supply carriers off the coast of California, we would need to be mindful of the marine animal populations that migrate through these waters. By using optimization methods, we could design routes that minimize disruptions to these migration patterns, reducing potential harm to marine life while still maintaining efficient and cost-effective transportation paths.

James Jessup