R Notebook

library(dplyr)
library(tidyverse)
library(caret)
library(InformationValue)
library(ROSE)
library(caTools)

You get to decide which dataset you want to work on. The data set must be different You can work on a problem from your work, or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.Gov, Census Bureau, USGS or other open data portal.

Select one of the methodologies studied in weeks 1-10, and one methodology from weeks 11-15 to apply in the new dataset selected. To complete this task:. - Describe the problem you are trying to solve. - Describe your dataset and what you did to prepare the data for analysis. - Methodologies you used for analyzing the data - Why you did what you did - Make your conclusions from your analysis. Please be sure to address the business impact (it could be of any domain) of your solution.

Your final presentation could be the traditional R file and essay, or it could be an oral presentation with the execution and explanation of your code, recorded on any platform of your choice (Youtube, Free Cam). If you select the presentation, it should be a 5 to 8 minutes recording.

Problem

For this project, I’m planning on comparing several of the more advanced machine learning techniques along with replicating the results of the original paper comparing naive Bayes and logistic regression. The task at hand is prediction of cervical cancer given using over a dozen patient features. Risk stratification is an important part of medicine as diagnosing an illness has costs and risks, which can be further exacerbated by the presence of false positive and negative results. Cervical cancer in particular is a unique cases because some of its most common manifestations are infectious in origin and can be prevented by immunization.

Dataset

The dataset I’m using is available through UCI’s Machine Learning Repository. These data originate from a 2026 paper entitled Behavior Determinant based Cervical Cancer Early Detection with Machine Learning Algorithm. The features are a set of 18 attributes representing 7 larger behavioral determinants of illness: behavior, intention, attitude, norm, perception, social support, and empowerment. These refer to four competing models of the psychological perception of health and illness. These dimensions of behavior were part of a questionaire with nine questions per variable. I’ve described the variables further in the table 1.

Methods

cc_df <- read.csv('sobar-72.csv')

cc_scale <- cc_df %>%
  select(-ca_cervix) %>%
  mutate_all(scale)
corrplot::corrplot(cor(cc_scale), diag = T)

pca <- (prcomp(cc_df[-20], center =  TRUE, scale = TRUE))
summary(pca)

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.3355 1.8802 1.29646 1.25329 1.11881 0.98099 0.91314
## Proportion of Variance 0.2871 0.1861 0.08846 0.08267 0.06588 0.05065 0.04389
## Cumulative Proportion  0.2871 0.4732 0.56161 0.64428 0.71017 0.76081 0.80470
##                           PC8     PC9   PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.8131 0.75586 0.6837 0.67570 0.62814 0.54410 0.49351
## Proportion of Variance 0.0348 0.03007 0.0246 0.02403 0.02077 0.01558 0.01282
## Cumulative Proportion  0.8395 0.86957 0.8942 0.91820 0.93897 0.95455 0.96737
##                           PC15    PC16    PC17    PC18    PC19
## Standard deviation     0.45012 0.39060 0.33609 0.29679 0.25268
## Proportion of Variance 0.01066 0.00803 0.00594 0.00464 0.00336
## Cumulative Proportion  0.97803 0.98606 0.99200 0.99664 1.00000

cc_tots <- cc_df %>% 
  rowwise() %>%
  mutate(total_behavior = sum(behavior_sexualRisk, behavior_eating, behavior_personalHygine)) %>%
  mutate(total_intention = sum(intention_aggregation, intention_commitment)) %>%
  mutate(total_attitude = sum(attitude_consistency, attitude_spontaneity)) %>%
  mutate(total_norm = sum(norm_significantPerson, norm_fulfillment)) %>%
  mutate(total_perception = sum(perception_severity, perception_severity)) %>%
  mutate(total_motivation = sum(motivation_strength, motivation_willingness)) %>%
  mutate(total_social = sum(socialSupport_emotionality, socialSupport_appreciation, socialSupport_instrumental)) %>%
  mutate(total_empowerment = sum(empowerment_knowledge, empowerment_abilities, empowerment_desires)) %>%
  mutate(total_HBM = sum(total_behavior, total_perception)) %>%
  mutate(total_PMT = sum(total_motivation, total_intention)) %>%
  mutate(total_TPB = sum(total_norm, total_attitude)) %>%
  mutate(total_SCT = sum(total_social, total_empowerment))

set.seed(0522)
sample <- sample.split(1:nrow(cc_df), SplitRatio = 0.8)
train <- subset(cc_df, sample == TRUE)
test <- subset(cc_df, sample == FALSE)

os_train <- ovun.sample(ca_cervix~., train, method = 'over', p  = 0.5)
cc_mod1 <- glm(os_train$data, formula = ca_cervix~., family = binomial(link = 'logit'))

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

predicted1 <- predict(cc_mod1, test, type = 'response')
optimal1 <- optimalCutoff(test$ca_cervix, predicted1[1])
confusionMatrix(test$ca_cervix, predicted1)

##    0 1
## 0 10 3
## 1  1 1

sample <- sample.split(1:nrow(cc_tots), SplitRatio = 0.8)
train <- subset(cc_tots, sample == TRUE)
test <- subset(cc_tots, sample == FALSE)

os_train <- ovun.sample(ca_cervix~., train, method = 'over', p  = 0.5)
cc_mod2 <- glm(os_train$data, formula = ca_cervix~., family = binomial(link = 'logit'))

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

predicted2<- predict(cc_mod1, test, type = 'response')
optimal2 <- optimalCutoff(test$ca_cervix, predicted2[1])
confusionMatrix(test$ca_cervix, predicted2)

##    0 1
## 0 11 0
## 1  1 3

pca_tots <- prcomp(train[-20], center = TRUE, scale = TRUE)
summary(pca_tots)

## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     3.120 2.6821 1.7879 1.63248 1.35178 1.12795 1.10121
## Proportion of Variance 0.314 0.2321 0.1031 0.08597 0.05895 0.04104 0.03912
## Cumulative Proportion  0.314 0.5461 0.6492 0.73518 0.79413 0.83517 0.87429
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.86757 0.77220 0.70218 0.67615 0.61603 0.52874 0.51613
## Proportion of Variance 0.02428 0.01924 0.01591 0.01475 0.01224 0.00902 0.00859
## Cumulative Proportion  0.89857 0.91780 0.93371 0.94845 0.96070 0.96971 0.97831
##                          PC15    PC16    PC17    PC18    PC19      PC20
## Standard deviation     0.4659 0.42618 0.36872 0.28498 0.23790 8.327e-16
## Proportion of Variance 0.0070 0.00586 0.00439 0.00262 0.00183 0.000e+00
## Cumulative Proportion  0.9853 0.99117 0.99555 0.99817 1.00000 1.000e+00
##                             PC21      PC22      PC23      PC24      PC25
## Standard deviation     5.313e-16 2.499e-16 2.499e-16 2.499e-16 2.499e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                             PC26      PC27      PC28      PC29      PC30
## Standard deviation     2.499e-16 2.499e-16 2.499e-16 2.499e-16 2.499e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                             PC31
## Standard deviation     2.063e-16
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00

screeplot(pca_tots)

library()
pca_train <- predict(pca_tots, train)
pca_train <- pca_train %>%
  cbind(train[20])
pca_train$ca_cervix <- as.factor(pca_train$ca_cervix)
pca_test <- predict(pca_tots, test)
pca_test <- pca_test %>%
  cbind(test[20])

pca_mod <- glm(data = pca_train, formula = ca_cervix ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7, family = binomial(link = 'logit'))

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

#train_predictions <- predict(pca_mod, pca_train, type = 'response')
#table(round(train_predictions,0), as.numeric(pca_train$ca_cervix))
test_predictions <- predict(pca_mod, pca_test, type = 'response')
table(round(test_predictions,0), as.numeric(pca_test$ca_cervix))

##    
##      0  1
##   0 12  0
##   1  0  3

The SVM model for this dataset helped push this logistic regression model to perfect classification.

Conclusion

The addition of oversampling and PCA can improve upon simple supervised learning techniques like logistic regression. The dataset was a unique one for a medical application as it was based only on patient responses and no demographic or clinical findings. Coupled with these findings, responses could offer clinicians additional support in targeting and persuading patients to seek medical care. PCA may be useful in validating other behavioral determinant theories or combining existing ones into a more comprehensive therapy.

R Notebook

by Thomas Hill

Problem

Dataset

Methods

Conclusion