library(dplyr)
library(tidyverse)
library(caret)
library(InformationValue)
library(ROSE)
library(caTools)You get to decide which dataset you want to work on. The data set must be different You can work on a problem from your work, or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.Gov, Census Bureau, USGS or other open data portal.
Select one of the methodologies studied in weeks 1-10, and one methodology from weeks 11-15 to apply in the new dataset selected. To complete this task:. - Describe the problem you are trying to solve. - Describe your dataset and what you did to prepare the data for analysis. - Methodologies you used for analyzing the data - Why you did what you did - Make your conclusions from your analysis. Please be sure to address the business impact (it could be of any domain) of your solution.
Your final presentation could be the traditional R file and essay, or it could be an oral presentation with the execution and explanation of your code, recorded on any platform of your choice (Youtube, Free Cam). If you select the presentation, it should be a 5 to 8 minutes recording.
For this project, I’m planning on comparing several of the more advanced machine learning techniques along with replicating the results of the original paper comparing naive Bayes and logistic regression. The task at hand is prediction of cervical cancer given using over a dozen patient features. Risk stratification is an important part of medicine as diagnosing an illness has costs and risks, which can be further exacerbated by the presence of false positive and negative results. Cervical cancer in particular is a unique cases because some of its most common manifestations are infectious in origin and can be prevented by immunization.
The dataset I’m using is available through UCI’s Machine Learning Repository. These data originate from a 2026 paper entitled Behavior Determinant based Cervical Cancer Early Detection with Machine Learning Algorithm. The features are a set of 18 attributes representing 7 larger behavioral determinants of illness: behavior, intention, attitude, norm, perception, social support, and empowerment. These refer to four competing models of the psychological perception of health and illness. These dimensions of behavior were part of a questionaire with nine questions per variable. I’ve described the variables further in the table 1.
cc_df <- read.csv('sobar-72.csv')cc_scale <- cc_df %>%
select(-ca_cervix) %>%
mutate_all(scale)
corrplot::corrplot(cor(cc_scale), diag = T)pca <- (prcomp(cc_df[-20], center = TRUE, scale = TRUE))
summary(pca)## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.3355 1.8802 1.29646 1.25329 1.11881 0.98099 0.91314
## Proportion of Variance 0.2871 0.1861 0.08846 0.08267 0.06588 0.05065 0.04389
## Cumulative Proportion 0.2871 0.4732 0.56161 0.64428 0.71017 0.76081 0.80470
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.8131 0.75586 0.6837 0.67570 0.62814 0.54410 0.49351
## Proportion of Variance 0.0348 0.03007 0.0246 0.02403 0.02077 0.01558 0.01282
## Cumulative Proportion 0.8395 0.86957 0.8942 0.91820 0.93897 0.95455 0.96737
## PC15 PC16 PC17 PC18 PC19
## Standard deviation 0.45012 0.39060 0.33609 0.29679 0.25268
## Proportion of Variance 0.01066 0.00803 0.00594 0.00464 0.00336
## Cumulative Proportion 0.97803 0.98606 0.99200 0.99664 1.00000
cc_tots <- cc_df %>%
rowwise() %>%
mutate(total_behavior = sum(behavior_sexualRisk, behavior_eating, behavior_personalHygine)) %>%
mutate(total_intention = sum(intention_aggregation, intention_commitment)) %>%
mutate(total_attitude = sum(attitude_consistency, attitude_spontaneity)) %>%
mutate(total_norm = sum(norm_significantPerson, norm_fulfillment)) %>%
mutate(total_perception = sum(perception_severity, perception_severity)) %>%
mutate(total_motivation = sum(motivation_strength, motivation_willingness)) %>%
mutate(total_social = sum(socialSupport_emotionality, socialSupport_appreciation, socialSupport_instrumental)) %>%
mutate(total_empowerment = sum(empowerment_knowledge, empowerment_abilities, empowerment_desires)) %>%
mutate(total_HBM = sum(total_behavior, total_perception)) %>%
mutate(total_PMT = sum(total_motivation, total_intention)) %>%
mutate(total_TPB = sum(total_norm, total_attitude)) %>%
mutate(total_SCT = sum(total_social, total_empowerment))set.seed(0522)
sample <- sample.split(1:nrow(cc_df), SplitRatio = 0.8)
train <- subset(cc_df, sample == TRUE)
test <- subset(cc_df, sample == FALSE)os_train <- ovun.sample(ca_cervix~., train, method = 'over', p = 0.5)
cc_mod1 <- glm(os_train$data, formula = ca_cervix~., family = binomial(link = 'logit'))## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
predicted1 <- predict(cc_mod1, test, type = 'response')
optimal1 <- optimalCutoff(test$ca_cervix, predicted1[1])
confusionMatrix(test$ca_cervix, predicted1)## 0 1
## 0 10 3
## 1 1 1
sample <- sample.split(1:nrow(cc_tots), SplitRatio = 0.8)
train <- subset(cc_tots, sample == TRUE)
test <- subset(cc_tots, sample == FALSE)os_train <- ovun.sample(ca_cervix~., train, method = 'over', p = 0.5)
cc_mod2 <- glm(os_train$data, formula = ca_cervix~., family = binomial(link = 'logit'))## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
predicted2<- predict(cc_mod1, test, type = 'response')
optimal2 <- optimalCutoff(test$ca_cervix, predicted2[1])
confusionMatrix(test$ca_cervix, predicted2)## 0 1
## 0 11 0
## 1 1 3
pca_tots <- prcomp(train[-20], center = TRUE, scale = TRUE)
summary(pca_tots)## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.120 2.6821 1.7879 1.63248 1.35178 1.12795 1.10121
## Proportion of Variance 0.314 0.2321 0.1031 0.08597 0.05895 0.04104 0.03912
## Cumulative Proportion 0.314 0.5461 0.6492 0.73518 0.79413 0.83517 0.87429
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.86757 0.77220 0.70218 0.67615 0.61603 0.52874 0.51613
## Proportion of Variance 0.02428 0.01924 0.01591 0.01475 0.01224 0.00902 0.00859
## Cumulative Proportion 0.89857 0.91780 0.93371 0.94845 0.96070 0.96971 0.97831
## PC15 PC16 PC17 PC18 PC19 PC20
## Standard deviation 0.4659 0.42618 0.36872 0.28498 0.23790 8.327e-16
## Proportion of Variance 0.0070 0.00586 0.00439 0.00262 0.00183 0.000e+00
## Cumulative Proportion 0.9853 0.99117 0.99555 0.99817 1.00000 1.000e+00
## PC21 PC22 PC23 PC24 PC25
## Standard deviation 5.313e-16 2.499e-16 2.499e-16 2.499e-16 2.499e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
## PC26 PC27 PC28 PC29 PC30
## Standard deviation 2.499e-16 2.499e-16 2.499e-16 2.499e-16 2.499e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
## PC31
## Standard deviation 2.063e-16
## Proportion of Variance 0.000e+00
## Cumulative Proportion 1.000e+00
screeplot(pca_tots)library()
pca_train <- predict(pca_tots, train)
pca_train <- pca_train %>%
cbind(train[20])
pca_train$ca_cervix <- as.factor(pca_train$ca_cervix)
pca_test <- predict(pca_tots, test)
pca_test <- pca_test %>%
cbind(test[20])
pca_mod <- glm(data = pca_train, formula = ca_cervix ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7, family = binomial(link = 'logit'))## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#train_predictions <- predict(pca_mod, pca_train, type = 'response')
#table(round(train_predictions,0), as.numeric(pca_train$ca_cervix))
test_predictions <- predict(pca_mod, pca_test, type = 'response')
table(round(test_predictions,0), as.numeric(pca_test$ca_cervix))##
## 0 1
## 0 12 0
## 1 0 3
The SVM model for this dataset helped push this logistic regression model to perfect classification.
The addition of oversampling and PCA can improve upon simple supervised learning techniques like logistic regression. The dataset was a unique one for a medical application as it was based only on patient responses and no demographic or clinical findings. Coupled with these findings, responses could offer clinicians additional support in targeting and persuading patients to seek medical care. PCA may be useful in validating other behavioral determinant theories or combining existing ones into a more comprehensive therapy.