Question 14.1 The breast cancer data set breast-cancer-wisconsin.data.txt from http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/ (description at http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29 ) has missing values.
Use the mean/mode imputation method to impute values for the missing data.
Use regression to impute values for the missing data.
Use regression with perturbation to impute values for the missing data.
(Optional) Compare the results and quality of classification models (e.g., SVM, KNN) built using (1) the data sets from questions 1, 2, and 3; (2) the data that remains after data points with missing values are removed; and (3) the data set when a binary variable is introduced to indicate missing values.
Data = read.csv("breast-cancer-wisconsin.data.txt",header = FALSE,na.strings = "?")
str(Data)
'data.frame': 699 obs. of 11 variables:
$ V1 : int 1000025 1002945 1015425 1016277 1017023 1017122 1018099 1018561 1033078 1033078 ...
$ V2 : int 5 5 3 6 4 8 1 2 2 4 ...
$ V3 : int 1 4 1 8 1 10 1 1 1 2 ...
$ V4 : int 1 4 1 8 1 10 1 2 1 1 ...
$ V5 : int 1 5 1 1 3 8 1 1 1 1 ...
$ V6 : int 2 7 2 3 2 7 2 2 2 2 ...
$ V7 : int 1 10 2 4 1 10 10 1 1 1 ...
$ V8 : int 3 3 3 3 3 9 3 3 1 2 ...
$ V9 : int 1 2 1 7 1 7 1 1 1 1 ...
$ V10: int 1 1 1 1 1 1 1 1 5 1 ...
$ V11: int 2 2 2 2 2 4 2 2 2 2 ...
print(summary(Data))
V1 V2 V3 V4
Min. : 61634 Min. : 1.000 Min. : 1.000 Min. : 1.000
1st Qu.: 870688 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000
Median : 1171710 Median : 4.000 Median : 1.000 Median : 1.000
Mean : 1071704 Mean : 4.418 Mean : 3.134 Mean : 3.207
3rd Qu.: 1238298 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 5.000
Max. :13454352 Max. :10.000 Max. :10.000 Max. :10.000
V5 V6 V7 V8
Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000
1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 2.000
Median : 1.000 Median : 2.000 Median : 1.000 Median : 3.000
Mean : 2.807 Mean : 3.216 Mean : 3.545 Mean : 3.438
3rd Qu.: 4.000 3rd Qu.: 4.000 3rd Qu.: 6.000 3rd Qu.: 5.000
Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
NA's :16
V9 V10 V11
Min. : 1.000 Min. : 1.000 Min. :2.00
1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.:2.00
Median : 1.000 Median : 1.000 Median :2.00
Mean : 2.867 Mean : 1.589 Mean :2.69
3rd Qu.: 4.000 3rd Qu.: 1.000 3rd Qu.:4.00
Max. :10.000 Max. :10.000 Max. :4.00
# 1. Sample code number: id number
# 2. Clump Thickness: 1 - 10
# 3. Uniformity of Cell Size: 1 - 10
# 4. Uniformity of Cell Shape: 1 - 10
# 5. Marginal Adhesion: 1 - 10
# 6. Single Epithelial Cell Size: 1 - 10
# 7. Bare Nuclei: 1 - 10
# 8. Bland Chromatin: 1 - 10
# 9. Normal Nucleoli: 1 - 10
# 10. Mitoses: 1 - 10
# 11. Class: (2 for benign, 4 for malignant)
colnames(Data) <- c("ID", "Clump_Thickness", "Cell_Size", "Cell_Shape",
"Marginal_Adhesion", "Single_Epith_Cell_Size", "Bare_Nuclei",
"Bland_Chromatin","Normal_Nucleoli", "Mitoses", "Class")
require(dplyr)
Data <- Data %>% mutate(Class = ifelse(Class == 4, 1, 0))
drops <- c("ID")
Data <- Data[ , !(names(Data) %in% drops)]
summary(Data)
Clump_Thickness Cell_Size Cell_Shape Marginal_Adhesion
Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000
1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000
Median : 4.000 Median : 1.000 Median : 1.000 Median : 1.000
Mean : 4.418 Mean : 3.134 Mean : 3.207 Mean : 2.807
3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 5.000 3rd Qu.: 4.000
Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
Single_Epith_Cell_Size Bare_Nuclei Bland_Chromatin Normal_Nucleoli
Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000
1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 1.000
Median : 2.000 Median : 1.000 Median : 3.000 Median : 1.000
Mean : 3.216 Mean : 3.545 Mean : 3.438 Mean : 2.867
3rd Qu.: 4.000 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 4.000
Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
NA's :16
Mitoses Class
Min. : 1.000 Min. :0.0000
1st Qu.: 1.000 1st Qu.:0.0000
Median : 1.000 Median :0.0000
Mean : 1.589 Mean :0.3448
3rd Qu.: 1.000 3rd Qu.:1.0000
Max. :10.000 Max. :1.0000
#Bare_Nuclei has 16 missing values
library(VIM)
mice_plot <- aggr(Data, col=c('navyblue','yellow'),numbers=TRUE, sortVars=TRUE,
labels=names(Data), cex.axis=.7,gap=3, ylab=c("Missing data","Pattern"))
Variables sorted by number of missings: only Bare_Nuclei has missing values.
# From the chart I see:
# About 97.7% of the observations (683 of 699) have no missing values; Bare_Nuclei is the
# only column with missing entries (16 values, roughly 2.3%).
# This chart is most useful when multiple columns have missing values.
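# As a quick cross-check of the chart (a small addition, not part of the original run),
# the per-column NA counts can be read off directly:
colSums(is.na(Data))
# This should show 16 for Bare_Nuclei and 0 for every other column.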
Data_imput_w_mean <- Data
Data_imput_w_mean$Bare_Nuclei[is.na(Data_imput_w_mean$Bare_Nuclei)] <- mean(Data_imput_w_mean$Bare_Nuclei, na.rm=TRUE)
Data_imput_w_mode <- Data
# base R's mode() returns the storage type, not the most frequent value, so impute with the most common level instead
Data_imput_w_mode$Bare_Nuclei[is.na(Data_imput_w_mode$Bare_Nuclei)] <-
  as.numeric(names(which.max(table(Data_imput_w_mode$Bare_Nuclei))))
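# A small sanity check (added as a sketch): neither imputed copy should have any NAs left.
# Note that the imputed mean (about 3.54) is not an integer on the 1-10 scale, while the
# imputed mode (the most frequent value) stays on the integer scale.
stopifnot(!anyNA(Data_imput_w_mean$Bare_Nuclei), !anyNA(Data_imput_w_mode$Bare_Nuclei))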
# Use regression to impute missing values
library(mice)
Data_imput_w_regression <- Data
imp <- mice(Data, method="norm.predict",m=1)
iter imp variable
1 1 Bare_Nuclei
2 1 Bare_Nuclei
3 1 Bare_Nuclei
4 1 Bare_Nuclei
5 1 Bare_Nuclei
#method="norm.predict"
#Linear regression, predicted values (numeric)
imp
Class: mids
Number of multiple imputations: 1
Imputation methods:
Clump_Thickness Cell_Size Cell_Shape
"" "" ""
Marginal_Adhesion Single_Epith_Cell_Size Bare_Nuclei
"" "" "norm.predict"
Bland_Chromatin Normal_Nucleoli Mitoses
"" "" ""
Class
""
PredictorMatrix:
Clump_Thickness Cell_Size Cell_Shape Marginal_Adhesion
Clump_Thickness 0 1 1 1
Cell_Size 1 0 1 1
Cell_Shape 1 1 0 1
Marginal_Adhesion 1 1 1 0
Single_Epith_Cell_Size 1 1 1 1
Bare_Nuclei 1 1 1 1
Single_Epith_Cell_Size Bare_Nuclei Bland_Chromatin
Clump_Thickness 1 1 1
Cell_Size 1 1 1
Cell_Shape 1 1 1
Marginal_Adhesion 1 1 1
Single_Epith_Cell_Size 0 1 1
Bare_Nuclei 1 0 1
Normal_Nucleoli Mitoses Class
Clump_Thickness 1 1 1
Cell_Size 1 1 1
Cell_Shape 1 1 1
Marginal_Adhesion 1 1 1
Single_Epith_Cell_Size 1 1 1
Bare_Nuclei 1 1 1
# the actual imputations for Bare_Nuclei can be listed with imp$imp$Bare_Nuclei
Data_imput_w_regression <- complete(imp)
# Use regression with perturbation to impute values for the missing data
imp_perturbation <- mice(Data, method="norm.nob",m=1)
iter imp variable
1 1 Bare_Nuclei
2 1 Bare_Nuclei
3 1 Bare_Nuclei
4 1 Bare_Nuclei
5 1 Bare_Nuclei
#method="norm.nob"
#Linear regression, predicted values (numeric)
imp_perturbation
Class: mids
Number of multiple imputations: 1
Imputation methods:
Clump_Thickness Cell_Size Cell_Shape
"" "" ""
Marginal_Adhesion Single_Epith_Cell_Size Bare_Nuclei
"" "" "norm.nob"
Bland_Chromatin Normal_Nucleoli Mitoses
"" "" ""
Class
""
PredictorMatrix:
Clump_Thickness Cell_Size Cell_Shape Marginal_Adhesion
Clump_Thickness 0 1 1 1
Cell_Size 1 0 1 1
Cell_Shape 1 1 0 1
Marginal_Adhesion 1 1 1 0
Single_Epith_Cell_Size 1 1 1 1
Bare_Nuclei 1 1 1 1
Single_Epith_Cell_Size Bare_Nuclei Bland_Chromatin
Clump_Thickness 1 1 1
Cell_Size 1 1 1
Cell_Shape 1 1 1
Marginal_Adhesion 1 1 1
Single_Epith_Cell_Size 0 1 1
Bare_Nuclei 1 0 1
Normal_Nucleoli Mitoses Class
Clump_Thickness 1 1 1
Cell_Size 1 1 1
Cell_Shape 1 1 1
Marginal_Adhesion 1 1 1
Single_Epith_Cell_Size 1 1 1
Bare_Nuclei 1 1 1
# the actual imputations for Bare_Nuclei can be listed with imp_perturbation$imp$Bare_Nuclei
Data_imput_w_regression_pert <- complete(imp_perturbation)
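# A brief comparison sketch (not part of the original run): norm.nob adds random residual
# noise to the regression predictions, so its imputations differ from the deterministic
# norm.predict values and will change between runs unless a seed is passed to mice().
missing_rows <- which(is.na(Data$Bare_Nuclei))
cbind(regression   = Data_imput_w_regression$Bare_Nuclei[missing_rows],
      perturbation = Data_imput_w_regression_pert$Bare_Nuclei[missing_rows])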
(Optional) Compare classification models built using (1) the data sets from questions 1, 2, and 3; (2) the data that remains after data points with missing values are removed; and (3) the data set when a binary variable is introduced to indicate missing values.
# Split a data frame into a 75% training set and a 25% test set. Because set.seed(1) is
# called inside, data frames with the same number of rows are split on identical indices,
# which keeps the comparison between imputation methods fair.
split_test_train <- function(df) {
  set.seed(1)
  train <- sample(1:nrow(df), 0.75*nrow(df))
  train_df <- df[train,]
  test_df <- df[-train,]
  return(list("train" = train_df, "test" = test_df))
}
result_w_mean <- split_test_train(Data_imput_w_mean)
summary(result_w_mean$train)
Clump_Thickness Cell_Size Cell_Shape Marginal_Adhesion
Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000
1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000
Median : 4.000 Median : 1.000 Median : 2.000 Median : 1.000
Mean : 4.393 Mean : 3.145 Mean : 3.179 Mean : 2.847
3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 5.000 3rd Qu.: 4.000
Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
Single_Epith_Cell_Size Bare_Nuclei Bland_Chromatin Normal_Nucleoli
Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000
1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 1.000
Median : 2.000 Median : 1.000 Median : 3.000 Median : 1.000
Mean : 3.174 Mean : 3.498 Mean : 3.454 Mean : 2.855
3rd Qu.: 4.000 3rd Qu.: 5.000 3rd Qu.: 5.000 3rd Qu.: 3.250
Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
Mitoses Class
Min. : 1.000 Min. :0.0000
1st Qu.: 1.000 1st Qu.:0.0000
Median : 1.000 Median :0.0000
Mean : 1.532 Mean :0.3359
3rd Qu.: 1.000 3rd Qu.:1.0000
Max. :10.000 Max. :1.0000
dim(result_w_mean$train)
[1] 524 10
dim(result_w_mean$test)
[1] 175 10
data_train_target_category <- result_w_mean$train[,"Class"]
data_test_target_category <- result_w_mean$test[,"Class"]
#run KNN models
library(class)
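# Note: knn() from the class package defaults to k = 1; k is left at that default in all
# of the models below so the comparison across data sets is like-for-like.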
pr <- knn(result_w_mean$train,result_w_mean$test, cl=data_train_target_category)
pr
[1] 0 1 0 0 1 1 0 1 0 0 0 0 1 1 1 1 1 1 0 1 0 0 0 0 0 1 1 0 1 0 0 1 0 0 1 1 0
[38] 0 0 1 0 1 0 1 0 0 0 1 1 0 1 1 0 1 0 0 1 1 0 1 1 0 1 0 0 0 0 0 1 0 0 1 1 0
[75] 1 1 1 0 1 0 1 0 1 0 1 0 1 1 0 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0
[112] 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
[149] 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
Levels: 0 1
##create confusion matrix
cf <- table(pr,data_test_target_category)
cf
data_test_target_category
pr 0 1
0 110 6
1 0 59
##this function divides the number of correct predictions by the total number of predictions, which tells us how accurate the model is.
accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(cf)
[1] 96.57143
#~96.6% accuracy for the data imputed with the mean
#Data_imput_w_regression
result_w_regression <- split_test_train(Data_imput_w_regression)
summary(result_w_regression$train)
Clump_Thickness Cell_Size Cell_Shape Marginal_Adhesion
Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000
1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000
Median : 4.000 Median : 1.000 Median : 2.000 Median : 1.000
Mean : 4.393 Mean : 3.145 Mean : 3.179 Mean : 2.847
3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 5.000 3rd Qu.: 4.000
Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
Single_Epith_Cell_Size Bare_Nuclei Bland_Chromatin Normal_Nucleoli
Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000
1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 1.000
Median : 2.000 Median : 1.000 Median : 3.000 Median : 1.000
Mean : 3.174 Mean : 3.462 Mean : 3.454 Mean : 2.855
3rd Qu.: 4.000 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 3.250
Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
Mitoses Class
Min. : 1.000 Min. :0.0000
1st Qu.: 1.000 1st Qu.:0.0000
Median : 1.000 Median :0.0000
Mean : 1.532 Mean :0.3359
3rd Qu.: 1.000 3rd Qu.:1.0000
Max. :10.000 Max. :1.0000
dim(result_w_regression$train)
[1] 524 10
dim(result_w_regression$test)
[1] 175 10
data_train_target_category <- result_w_regression$train[,"Class"]
data_test_target_category <- result_w_regression$test[,"Class"]
pr <- knn(result_w_regression$train,result_w_regression$test, cl=data_train_target_category)
##create confusion matrix
cf <- table(pr,data_test_target_category)
cf
data_test_target_category
pr 0 1
0 110 6
1 0 59
accuracy(cf)
[1] 96.57143
#~96.6% accuracy for the data imputed with regression
#Data_imput_w_regression_w_perturbation
result_w_regression_pert <- split_test_train(Data_imput_w_regression_pert)
summary(result_w_regression_pert$train)
Clump_Thickness Cell_Size Cell_Shape Marginal_Adhesion
Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000
1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000
Median : 4.000 Median : 1.000 Median : 2.000 Median : 1.000
Mean : 4.393 Mean : 3.145 Mean : 3.179 Mean : 2.847
3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 5.000 3rd Qu.: 4.000
Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
Single_Epith_Cell_Size Bare_Nuclei Bland_Chromatin Normal_Nucleoli
Min. : 1.000 Min. :-0.5784 Min. : 1.000 Min. : 1.000
1st Qu.: 2.000 1st Qu.: 1.0000 1st Qu.: 2.000 1st Qu.: 1.000
Median : 2.000 Median : 1.0000 Median : 3.000 Median : 1.000
Mean : 3.174 Mean : 3.4775 Mean : 3.454 Mean : 2.855
3rd Qu.: 4.000 3rd Qu.: 6.0000 3rd Qu.: 5.000 3rd Qu.: 3.250
Max. :10.000 Max. :10.0000 Max. :10.000 Max. :10.000
Mitoses Class
Min. : 1.000 Min. :0.0000
1st Qu.: 1.000 1st Qu.:0.0000
Median : 1.000 Median :0.0000
Mean : 1.532 Mean :0.3359
3rd Qu.: 1.000 3rd Qu.:1.0000
Max. :10.000 Max. :1.0000
data_train_target_category <- result_w_regression_pert$train[,"Class"]
data_test_target_category <- result_w_regression_pert$test[,"Class"]
pr <- knn(result_w_regression_pert$train,result_w_regression_pert$test, cl=data_train_target_category)
pr
[1] 0 1 0 0 1 1 0 1 0 0 0 0 1 1 1 1 1 1 0 1 0 0 0 0 0 1 1 0 1 0 0 1 0 0 1 1 0
[38] 0 0 1 0 1 0 1 0 0 0 1 1 0 1 1 0 1 0 0 1 1 0 1 1 0 1 0 0 0 0 0 1 0 0 1 1 0
[75] 1 1 1 0 1 0 1 0 1 0 1 0 1 1 0 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0
[112] 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
[149] 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
Levels: 0 1
##create confusion matrix
cf <- table(pr,data_test_target_category)
cf
data_test_target_category
pr 0 1
0 110 6
1 0 59
accuracy(cf)
[1] 96.57143
#~96.6% accuracy for the data imputed with regression with perturbation
#Data with rows w/ missing values removed
data_w_no_na = Data[complete.cases(Data), ]
dim(data_w_no_na)
[1] 683 10
result_w_no_na <- split_test_train(data_w_no_na)
data_train_target_category <- result_w_no_na$train[,"Class"]
data_test_target_category <- result_w_no_na$test[,"Class"]
pr <- knn(result_w_no_na$train,result_w_no_na$test, cl=data_train_target_category)
pr
[1] 1 0 1 0 0 0 0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 1 0 0 0 0
[38] 0 1 1 1 0 0 1 0 0 1 0 1 0 0 1 0 1 1 0 1 0 0 0 1 1 0 1 1 1 1 0 0 1 0 0 1 1
[75] 1 0 1 0 1 1 0 0 1 1 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 1 1 1 1 1 0 0 0 0
[112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0
[149] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
Levels: 0 1
##create confusion matrix
cf <- table(pr,data_test_target_category)
cf
data_test_target_category
pr 0 1
0 113 3
1 3 52
accuracy(cf)
[1] 96.49123
#~96.5% accuracy for the data with missing-value rows removed
# Comparing KNN models built with
# (1) the data with missing-value rows removed,
# (2) the data imputed with the mean,
# (3) the data imputed with regression, and
# (4) the data imputed with regression with perturbation:
# (1) performs marginally worse (~96.5%), while the other three perform identically well (~96.6%).
# These accuracies seem suspiciously high. Likely part of the reason is that the Class column
# was left in the feature matrices passed to knn(), which leaks the label into the distance
# calculation; imputation that draws on the other variables (including Class) can also bias
# the comparison optimistically.
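# A hedged sketch (not run above) of two follow-ups: dropping Class from the KNN feature
# matrices to remove the label leakage, and building the part-(3) data set with a binary
# missingness indicator. The names below (knn_features, Data_w_indicator) are new helpers
# introduced only for this sketch.
knn_features <- function(df) df[, setdiff(names(df), "Class")]
Data_w_indicator <- Data
Data_w_indicator$Bare_Nuclei_missing <- as.integer(is.na(Data$Bare_Nuclei))
Data_w_indicator$Bare_Nuclei[is.na(Data_w_indicator$Bare_Nuclei)] <-
  mean(Data$Bare_Nuclei, na.rm = TRUE)   # any fill value works alongside the indicator
result_w_ind <- split_test_train(Data_w_indicator)
pr_ind <- knn(knn_features(result_w_ind$train), knn_features(result_w_ind$test),
              cl = result_w_ind$train[,"Class"])
accuracy(table(pr_ind, result_w_ind$test[,"Class"]))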
Question 15.1 Describe a situation or problem from your job, everyday life, current events, etc., for which optimization would be appropriate. What data would you need?
Optimization could be used by an airline to decide how many first-class, premium, and coach tickets to sell in order to maximize profit on each flight. The airline would need to respect certain constraints, for example: only a fixed number of first-class and coach seats are available on the aircraft, and the number of staff available to serve first class is limited. The data needed would include the seat capacity of each cabin class, the price and profit margin of each ticket type, forecast demand for each fare class, and staffing levels.
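As a rough sketch (all numbers below are made-up assumptions, and lpSolve is just one of several R packages that could be used), the seat-allocation decision can be written as a small linear program:
library(lpSolve)
profit <- c(400, 250, 120)      # assumed profit per first-class, premium, and coach ticket
A <- rbind(c(1, 0, 0),          # first-class seats on the aircraft
           c(0, 1, 0),          # premium seats on the aircraft
           c(0, 0, 1),          # coach seats on the aircraft
           c(1, 1, 1))          # total seats the cabin crew can serve
rhs <- c(12, 30, 150, 180)      # assumed capacities for the four constraints above
sol <- lp("max", profit, A, rep("<=", 4), rhs, all.int = TRUE)
sol$solution                    # optimal number of tickets to sell in each class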