As breast cancer continues to affect millions of women around the world, more precise treatments and diagnostic tests have become available, such as fine needle aspiration (FNA) tests. This case study focuses on building the most accurate model for classifying tumors in women's breasts as benign or malignant. We will compare six different models to help make this determination: logistic regression, linear discriminant analysis (LDA), support vector machine (SVM), decision tree, boosting, and random forest.
Breast cancer is the second most common cancer among women, after skin cancer. It forms when abnormal cells within the breast tissue grow together to produce a tumor. The tumor can be benign (noncancerous), pre-malignant (pre-cancerous), or malignant (cancerous). Women often must have surgery to remove the tumor even when it is noncancerous, as it can develop into cancer. Several tests can help with a diagnosis, such as biopsies, ultrasounds, and mammograms, which are recommended on a regular schedule as women age.
Women born in the United States have a 1 in 8 chance of developing breast cancer within their lifetime. As of 2022, 3.8 million women in the United States alone have been affected by breast cancer.
This case study examines the accuracy of cancer-cell prediction across several parametric and nonparametric models. The findings matter because early diagnosis of cancer can improve health outcomes and life expectancy for many women around the world.
There are 212 malignant samples and 357 benign samples.
Dependent variable:
- Diagnosis (binary): malignant or benign

Independent variables (ten real-valued features are computed for each cell nucleus):
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)
The data also include the mean, standard error, and “worst” for each of the cell features, resulting in 30 features.
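For reference, the feature columns follow a consistent feature-plus-suffix naming pattern; a quick way to list each group (assuming the data frame df loaded in the preprocessing code below):
grep("_mean$",  names(df), value = TRUE)   # the ten "_mean" columns
grep("_se$",    names(df), value = TRUE)   # the ten "_se" columns
grep("_worst$", names(df), value = TRUE)   # the ten "_worst" columns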
The models being tested are logistic regression, linear discriminant analysis (LDA), support vector machine (SVM), decision tree, boosting, and random forest.
All 30 features are used for the nonparametric models (SVM, decision tree, boosting, and random forest), while the parametric models (logistic regression and LDA) use only the variables ending in '_mean'. Both groups of models are evaluated with a 70/30 training/testing split.
The data come from the Wisconsin Breast Cancer Database and contain 569 samples.
We generate logistic regression, linear discriminant analysis, and support vector machine models (alongside the tree-based models) to test which variables matter when determining whether a tumor is benign or malignant. To choose the best model and predictions possible, we review the sensitivity, specificity, and accuracy of each individual model.
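For reference, all three metrics come straight from the confusion matrix. Below is a minimal illustrative helper (not part of the original analysis; class_metrics is a hypothetical name) that treats 'B' as the positive class, matching the caret output later on.
# Illustrative only: accuracy, sensitivity, and specificity from predicted
# and true class labels, with "B" as the positive class.
class_metrics <- function(pred, truth, positive = "B") {
  tp <- sum(pred == positive & truth == positive)  # true positives
  tn <- sum(pred != positive & truth != positive)  # true negatives
  fp <- sum(pred == positive & truth != positive)  # false positives
  fn <- sum(pred != positive & truth == positive)  # false negatives
  c(accuracy    = (tp + tn) / length(truth),
    sensitivity = tp / (tp + fn),   # true-positive rate
    specificity = tn / (tn + fp))   # true-negative rate
}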
Logistic regression has the following assumptions:
- The dependent variable must be categorical and binary, which it is in this case
- The observations must be independent of one another
- Multicollinearity should be investigated and eliminated; independent variables should not be correlated with one another
- There must be a large sample size

LDA has these assumptions:
- Sample measurements are independent of each other
- The distributions are normal
- The covariance of the measurements is identical across the different classes (a sketch probing this assumption follows this list)

SVM model assumptions:
- The margin should be as large as possible
- The data are independent and identically distributed
- Normality of the data is not assumed, however

The methodological assumptions of the decision tree are as follows:
- The dependent variable must be binary
- The observations are assumed to be independent of each other
- The data are also assumed to be mostly free of outliers and missing values
- The limitations of decision trees are high variance and potential overfitting

Random forest assumptions and limitations:
- The dependent variable is binary
- The observations are independent of one another
- The data are mostly free of outliers and missing values
- The main limitation is that the ensemble is difficult to interpret compared to a single tree
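The equal-covariance assumption behind LDA can be probed directly. Below is a minimal sketch using Box's M test, assuming the heplots package is installed and using the train split created in the preprocessing code below; small p-values would suggest the class covariance matrices differ (the test is sensitive to non-normality, so interpret with care).
library(heplots)
# Box's M test for homogeneity of covariance matrices across diagnosis classes
boxM(train[, setdiff(names(train), "diagnosis")], train$diagnosis)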
This dataset is exceptionally clean and requires very little modification. The only data cleaning necessary was to change the outcome variable, diagnosis, to a factor to perform binary classification analysis. The diagnosis variable was changed to recognize the two possible levels: ‘B’ for benign, and ‘M’ for malignant.
library(MASS)
library(corrplot)
## corrplot 0.92 loaded
library(tidyverse)
## ── Attaching packages ─────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
library(SMCRM) # CRM data
library(dplyr) # data wrangling
library(tidyr) # data wrangling
library(ggplot2) # plotting
library(survival) # survival
library(rpart) # DT
library(randomForestSRC)
## Warning: package 'randomForestSRC' was built under R version 4.2.3
##
## randomForestSRC 3.2.1
##
## Type rfsrc.news() to see new features, changes, and bug fixes.
##
##
##
## Attaching package: 'randomForestSRC'
##
## The following object is masked from 'package:purrr':
##
## partial
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:survival':
##
## cluster
##
## The following object is masked from 'package:purrr':
##
## lift
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.2.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
df <- read.csv("CancerData.csv")
#Change to factor
df$diagnosis <- as.factor(df$diagnosis)
#Create correlation plot (must have numerical values)
num = select_if(df, is.numeric)
num <- num[,2:11] # keep the ten "_mean" feature columns (drop the id column)
M = cor(num)
corrplot::corrplot(M, method = "number")
high_corr <- which(M > 0.9 & M != 1, arr.ind = TRUE) # variable pairs correlated above 0.9
high_corr
## row col
## perimeter_mean 3 1
## area_mean 4 1
## radius_mean 1 3
## area_mean 4 3
## radius_mean 1 4
## perimeter_mean 3 4
## concave.points_mean 8 7
## concavity_mean 7 8
The variance inflation factor (VIF) was excessively high for numerous variables in the dataset. As a result, the final parametric models only contained the dependent variable diagnosis and the independent variables radius_mean, texture_mean, smoothness_mean, compactness_mean, concave.points_mean, and symmetry_mean. By using these variables, we were able to ensure that the VIF for each variable was below 5 and would not cause multicollinearity issues within the models. As the machine learning, nonparametric models do not require the removal of multicollinearity issues, they were fitted using the entire dataset.
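The pruning above was done by hand. As an illustration only, here is one way it could be automated (assuming car::vif, a binary response named diagnosis, and the hypothetical helper name prune_vif):
# Illustrative sketch: repeatedly drop the predictor with the largest VIF
# until every remaining VIF falls below the threshold.
prune_vif <- function(data, response = "diagnosis", threshold = 5) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- glm(reformulate(predictors, response), data = data, family = binomial)
    v <- car::vif(fit)
    if (max(v) < threshold || length(predictors) <= 2) break
    predictors <- setdiff(predictors, names(which.max(v)))  # drop the worst offender
  }
  predictors  # the surviving low-VIF predictor set
}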
# Keep diagnosis and nine of the "_mean" columns (drops the id column), then
# remove the highly collinear means identified above
df <- df[,2:11]
df = dplyr::select(df, -c(area_mean, concavity_mean, perimeter_mean))
# Create training and testing data
set.seed(1)
idx.train <- sample(1:nrow(df), size = 0.7 * nrow(df))
train <- df[idx.train,]
test <- df[-idx.train,]
table(train$diagnosis)
##
## B M
## 254 144
table(test$diagnosis)
##
## B M
## 103 68
# Check VIF for variables to include:
library(car) #for VIF
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
model <- glm(diagnosis~., data = train, family = binomial)
vif(model)
## radius_mean texture_mean smoothness_mean compactness_mean
## 2.349201 1.927692 2.497485 2.706049
## concave.points_mean symmetry_mean
## 3.897135 1.911515
The only limitation in the dataset was the high collinearity between numerous variables. This issue was dealt with as described in the “Data preprocessing” section above.
#Logistic Regression model
logreg <- glm(diagnosis ~ ., data = train, family = "binomial")
summary(logreg)
##
## Call:
## glm(formula = diagnosis ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.26698 -0.14933 -0.03961 0.02398 2.88528
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -31.90431 6.61478 -4.823 1.41e-06 ***
## radius_mean 0.74338 0.21535 3.452 0.000556 ***
## texture_mean 0.39548 0.07529 5.253 1.50e-07 ***
## smoothness_mean 43.66871 29.20245 1.495 0.134816
## compactness_mean -12.60970 10.49676 -1.201 0.229637
## concave.points_mean 89.90824 27.44312 3.276 0.001052 **
## symmetry_mean 31.34013 13.90255 2.254 0.024179 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 520.94 on 397 degrees of freedom
## Residual deviance: 102.80 on 391 degrees of freedom
## AIC: 116.8
##
## Number of Fisher Scoring iterations: 8
predprob = predict.glm(logreg, newdata = test, type = "response")
predclass_log = ifelse(predprob >= 0.5,"M","B")
caret::confusionMatrix(as.factor(predclass_log), as.factor(test$diagnosis))
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 97 6
## M 6 62
##
## Accuracy : 0.9298
## 95% CI : (0.8806, 0.9632)
## No Information Rate : 0.6023
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8535
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9417
## Specificity : 0.9118
## Pos Pred Value : 0.9417
## Neg Pred Value : 0.9118
## Prevalence : 0.6023
## Detection Rate : 0.5673
## Detection Prevalence : 0.6023
## Balanced Accuracy : 0.9268
##
## 'Positive' Class : B
##
The logistic regression model has an accuracy of 92.98%, with a sensitivity of 94.17% and a specificity of 91.18%.
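Because the 0.5 cutoff is one choice among many, a threshold-free summary can be a useful complement. A hedged sketch with the pROC package (not part of the original analysis):
library(pROC)
# ROC curve and AUC for the logistic model's predicted probabilities
roc_obj <- roc(response = test$diagnosis, predictor = predprob, levels = c("B", "M"))
auc(roc_obj)   # area under the ROC curve; closer to 1 is better
plot(roc_obj)  # sensitivity vs. specificity across all thresholds
# coords() can suggest a cutoff that balances sensitivity and specificity
coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"))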
#LDA model
lda = lda(diagnosis ~ ., data = train)
lda
## Call:
## lda(diagnosis ~ ., data = train)
##
## Prior probabilities of groups:
## B M
## 0.638191 0.361809
##
## Group means:
## radius_mean texture_mean smoothness_mean compactness_mean concave.points_mean
## B 12.13154 17.83161 0.09166201 0.07898228 0.02508475
## M 17.64986 21.65347 0.10291486 0.14485500 0.09007229
## symmetry_mean
## B 0.1730815
## M 0.1926306
##
## Coefficients of linear discriminants:
## LD1
## radius_mean 0.16955009
## texture_mean 0.09408202
## smoothness_mean 11.23571469
## compactness_mean -4.03515325
## concave.points_mean 26.32177015
## symmetry_mean 6.29311302
predclass_lda = predict(lda, newdata = test)
caret::confusionMatrix(as.factor(predclass_lda$class), as.factor(test$diagnosis))
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 102 16
## M 1 52
##
## Accuracy : 0.9006
## 95% CI : (0.8456, 0.941)
## No Information Rate : 0.6023
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7844
##
## Mcnemar's Test P-Value : 0.000685
##
## Sensitivity : 0.9903
## Specificity : 0.7647
## Pos Pred Value : 0.8644
## Neg Pred Value : 0.9811
## Prevalence : 0.6023
## Detection Rate : 0.5965
## Detection Prevalence : 0.6901
## Balanced Accuracy : 0.8775
##
## 'Positive' Class : B
##
Compared to the logistic regression model, the linear discriminant analysis model has a lower accuracy (90.06%) and a markedly lower specificity (76.47%). It does, however, have a higher sensitivity of 99.03%.
We reload the dataset so that all 30 features are available to the nonparametric models.
#Reload the dataset
df1 <- read.csv("CancerData.csv")
df1$diagnosis <- as.factor(df1$diagnosis)
#Remove ID
df1 <- df1[,-1]
set.seed(1)
idx.train <- sample(1:nrow(df1), size = 0.7 * nrow(df1))
train1 <- df1[idx.train,]
test1 <- df1[-idx.train,]
table(train1$diagnosis)
##
## B M
## 254 144
table(test1$diagnosis)
##
## B M
## 103 68
#SVM model:
library(e1071) #for SVM model
##
## Attaching package: 'e1071'
## The following objects are masked from 'package:randomForestSRC':
##
## impute, tune
form1 = diagnosis ~ .
# Tune gamma and cost via cross-validation; this may take a few minutes
tuned = tune.svm(form1, data = train1, gamma = seq(0.01, 0.1, by = 0.01),
                 cost = seq(0.1, 1, by = 0.1))
# Fit the SVM on the training data using the tuned hyperparameters
mysvm = svm(formula = form1, data = train1, gamma = tuned$best.parameters$gamma,
            cost = tuned$best.parameters$cost)
summary(mysvm)
##
## Call:
## svm(formula = form1, data = train1, gamma = tuned$best.parameters$gamma,
## cost = tuned$best.parameters$cost)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.9
##
## Number of Support Vectors: 169
##
## ( 85 84 )
##
##
## Number of Classes: 2
##
## Levels:
## B M
svmpredict = predict(mysvm, test1, type = 'response') # classify the test data; the confusion matrix follows
caret::confusionMatrix(as.factor(svmpredict), as.factor(test1$diagnosis))
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 98 3
## M 5 65
##
## Accuracy : 0.9532
## 95% CI : (0.9099, 0.9796)
## No Information Rate : 0.6023
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9028
##
## Mcnemar's Test P-Value : 0.7237
##
## Sensitivity : 0.9515
## Specificity : 0.9559
## Pos Pred Value : 0.9703
## Neg Pred Value : 0.9286
## Prevalence : 0.6023
## Detection Rate : 0.5731
## Detection Prevalence : 0.5906
## Balanced Accuracy : 0.9537
##
## 'Positive' Class : B
##
Compared with the LDA and logistic regression models, the SVM model achieves a higher accuracy of 95.32% and a higher specificity of 95.59%. Its sensitivity of 95.15% is, however, lower than the LDA model's.
#Decision Tree
dt.model <- rpart(diagnosis ~ ., data = train1) # simple DT model
rattle::fancyRpartPlot(dt.model, sub = "") # visualize the DT (requires the rattle package)
From the decision tree, we see that concave.points_mean is the most influential predictor, since the tree's first split is on this variable.
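This can be confirmed directly from the fitted rpart object:
print(dt.model)                # the root split is on concave.points_mean
dt.model$variable.importance   # rpart's variable-importance scores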
#Random forest
set.seed(123)
forest1 <- randomForest(diagnosis ~ ., data = train1)
predictions <- predict(forest1, newdata = test1)
caret::confusionMatrix(predictions, test1$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 102 7
## M 1 61
##
## Accuracy : 0.9532
## 95% CI : (0.9099, 0.9796)
## No Information Rate : 0.6023
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9009
##
## Mcnemar's Test P-Value : 0.0771
##
## Sensitivity : 0.9903
## Specificity : 0.8971
## Pos Pred Value : 0.9358
## Neg Pred Value : 0.9839
## Prevalence : 0.6023
## Detection Rate : 0.5965
## Detection Prevalence : 0.6374
## Balanced Accuracy : 0.9437
##
## 'Positive' Class : B
##
Compared to the SVM model, the random forest achieves the same accuracy of 95.32%. However, its sensitivity is much better at 99.03%, while its specificity is lower at 89.71%.
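As a complementary check (not shown in the original run), the random forest's out-of-bag (OOB) error provides an internal performance estimate that never touches the test set:
print(forest1)  # OOB error rate and the per-class confusion matrix
plot(forest1)   # OOB error as a function of the number of trees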
We then try boosting to see whether it can improve the model's accuracy:
#Boosting set up:
library(gbm)
## Warning: package 'gbm' was built under R version 4.2.3
## Loaded gbm 2.1.8.1
set.seed(1)
train1$diagnosis <- ifelse(train1$diagnosis == "B", 1, 0) # Needed for boosting function
test1$diagnosis <- ifelse(test1$diagnosis == "B", 1, 0) # Needed for boosting function.
#Perform boosting (on the single strongest predictor, concave.points_mean):
boost <- gbm(diagnosis ~ concave.points_mean, data = train1, distribution = "bernoulli",
             n.trees = 1000, interaction.depth = 1, shrinkage = 0.001, verbose = F)
boost.probs <- predict(boost, newdata = test1, n.trees = 1000, type = "response")
boost.pred <- ifelse(boost.probs > 0.5, 1, 0)
caret::confusionMatrix(as.factor(boost.pred), as.factor(test1$diagnosis))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 58 10
## 1 10 93
##
## Accuracy : 0.883
## 95% CI : (0.8252, 0.9271)
## No Information Rate : 0.6023
## P-Value [Acc > NIR] : 4.103e-16
##
## Kappa : 0.7559
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8529
## Specificity : 0.9029
## Pos Pred Value : 0.8529
## Neg Pred Value : 0.9029
## Prevalence : 0.3977
## Detection Rate : 0.3392
## Detection Prevalence : 0.3977
## Balanced Accuracy : 0.8779
##
## 'Positive' Class : 0
##
Since the random forest predictions show no strong bias, boosting does little to improve prediction accuracy. Its accuracy and sensitivity are lower than the random forest's, at 88.3% and 85.29% respectively, and its specificity is only slightly higher at 90.29%. (Note that after the 0/1 recoding the boosting confusion matrix treats the malignant class, coded 0, as the positive class, so its sensitivity and specificity are not directly comparable to those of the other models.)
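Note also that the boosting model above uses only concave.points_mean as a predictor. A hypothetical variant (results not shown) would boost on all 30 features and let cross-validation pick the number of trees:
# Hypothetical sketch: gbm on all predictors with CV-selected iterations
boost_all <- gbm(diagnosis ~ ., data = train1, distribution = "bernoulli",
                 n.trees = 3000, interaction.depth = 2, shrinkage = 0.01,
                 cv.folds = 5, verbose = FALSE)
best_iter <- gbm.perf(boost_all, method = "cv")  # CV-chosen iteration count
boost.probs_all <- predict(boost_all, newdata = test1, n.trees = best_iter,
                           type = "response")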
models <- c("Log Reg", "LDA", "SVM", "RF", "Boosting")
Accuracy <- c(0.9298, 0.9006, 0.9532, 0.9532, 0.883)
Sensitivity <- c(0.9417, 0.9903, 0.9515, 0.9903, 0.8529)
Specificity <- c(0.9118, 0.7647, 0.9559, 0.8971, 0.9029)
models.df <- data.frame(models, Accuracy, Sensitivity, Specificity)
df_long <- tidyr::pivot_longer(models.df, -1, names_to = "variable", values_to = "value")
ggplot(df_long, aes(x = variable, y = value, fill = models)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = c("#9933FF", "#33FFFF", "red", "darkblue", "green")) +
xlab("Performance Metrics") +
ylab("Value") +
ggtitle("Comparison of Five Models") +
theme(plot.title = element_text(hjust = 0.5))
As we can see from the plot, the SVM model is the best overall, with high accuracy, sensitivity, and specificity. Random forest and LDA perform slightly better than SVM only on sensitivity; the LDA's accuracy and specificity are both lower, and the random forest matches the SVM's accuracy while trailing on specificity.
# Random forest variable importance
varImpPlot(forest1, type = 2)
# Logistic regression beta coefficient
summary(logreg)
##
## Call:
## glm(formula = diagnosis ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.26698 -0.14933 -0.03961 0.02398 2.88528
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -31.90431 6.61478 -4.823 1.41e-06 ***
## radius_mean 0.74338 0.21535 3.452 0.000556 ***
## texture_mean 0.39548 0.07529 5.253 1.50e-07 ***
## smoothness_mean 43.66871 29.20245 1.495 0.134816
## compactness_mean -12.60970 10.49676 -1.201 0.229637
## concave.points_mean 89.90824 27.44312 3.276 0.001052 **
## symmetry_mean 31.34013 13.90255 2.254 0.024179 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 520.94 on 397 degrees of freedom
## Residual deviance: 102.80 on 391 degrees of freedom
## AIC: 116.8
##
## Number of Fisher Scoring iterations: 8
When we evaluate importance based on the random forest, concave.points_mean has the highest predictive power. This agrees with the regression results: concave.points_mean has the largest beta coefficient (89.91) among the variables with p-values below 0.05.
The parametric models performed very well with this dataset and were easily interpretable as the final models only contained seven variables including the dependent variable. The logistic regression model was more accurate overall (93%) than the LDA (90%), but the LDA had an impressive sensitivity of 99.03%. If the target of the parametric model is overall accuracy, then the logistic regression would be the best choice. If the target is to identify true positives, then the LDA model would be the best choice.
The best performing models that we built were the SVM and random forest models. The accuracy, sensitivity, and specificity of the SVM model were all just over 95%. The random forest model had an accuracy of 95%, sensitivity of 99%, and specificity of 90%. These models were more accurate than the parametric models, but much more difficult to interpret as they do not provide beta coefficients to help describe the relationship of the independent variables and the dependent variable.
In conclusion, this analysis shows that a small number of features carry most of the predictive value for diagnosing cancer. These observations were confirmed by the variable importance plot as well as the beta coefficients, which show that concave.points_mean, perimeter_worst, and concave.points_worst are the most influential predictors.
We found that the nonparametric models predict better than the parametric ones, with the SVM performing best, followed by the random forest. The dataset is clean and fairly balanced between benign and malignant samples (357 vs. 212), which helps explain why boosting does little to improve accuracy.
#Reload data
df <- read.csv("CancerData.csv")
#Break up columns into groups according to suffix (_mean, _se, and _worst)
data_mean <- dplyr::select(df, c("radius_mean", "texture_mean","perimeter_mean", "area_mean", "smoothness_mean", "compactness_mean", "concavity_mean", "concave.points_mean", "symmetry_mean", "fractal_dimension_mean" ))
data_se <- dplyr::select(df, c("radius_se", "texture_se","perimeter_se", "area_se", "smoothness_se", "compactness_se", "concavity_se", "concave.points_se", "symmetry_se", "fractal_dimension_se" ))
data_worst <- dplyr::select(df, c("radius_worst", "texture_worst","perimeter_worst", "area_worst", "smoothness_worst", "compactness_worst", "concavity_worst", "concave.points_worst", "symmetry_worst", "fractal_dimension_worst" ))
#Plot histograms of the "_mean" variables
par(mfrow=c(3,3))
var_names <- names(data_mean)
for (var in var_names) {
var_data <- data_mean[[var]]
hist(var_data, main = paste("Distribution of", var), xlab = var)
}
As we can see, most of the '_mean' variables are approximately normally distributed, except for area, concavity, and concave points.
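A quick numerical complement to the visual impression (illustrative only): Shapiro-Wilk p-values below 0.05 suggest departure from normality.
# Shapiro-Wilk normality test for each "_mean" feature (n = 569 is within
# the test's supported sample-size range)
round(sapply(data_mean, function(x) shapiro.test(x)$p.value), 4)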
#Plot histograms of the "_se" variables
par(mfrow=c(3,3))
var_names <- names(data_se)
for (i in var_names) {
var_data <- data_se[[i]]
hist(var_data, main = paste("Distribution of", i), xlab = i)
}
#Plot histograms of the "_worst" variables
par(mfrow=c(3,3))
var_names <- names(data_worst)
for (i in var_names) {
var_data <- data_worst[[i]]
hist(var_data, main = paste("Distribution of", i), xlab = i)
}
The variables in '_se' and '_worst' are all right-skewed.
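To quantify the skew, one could compute the sample skewness with e1071 (loaded earlier for the SVM); clearly positive values confirm the right skew.
# Sample skewness for each "_se" and "_worst" feature
round(sapply(data_se, e1071::skewness), 2)
round(sapply(data_worst, e1071::skewness), 2)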