Imagine you’re a detective walking into a room full of clues. Each clue might tell you a part of the story, but on its own, it’s not enough to draw a conclusion. In our case, the ‘room’ is this dataset, and the ‘clues’ are various lifestyle factors, medical histories, and habits of different women. Our ‘mystery’ to solve is understanding what factors are most strongly linked to the risk of developing cervical cancer.
Before touching a line of code, let’s understand what we’re looking at. This dataset contains medical and demographic information about a large group of women. It tracks potential risk factors for cervical cancer, such as:
Demographics & Habits: Age, smoking history.
Sexual History: Number of partners, age of first intercourse.
Medical History: Use of hormonal contraceptives, IUDs (Intrauterine Devices), and history of Sexually Transmitted Diseases (STDs).
Outcomes (The Diagnosis): We have several
columns at the very end like Hinselmann,
Schiller, Citology, and Biopsy.
These are different medical tests used to detect abnormalities that
could lead to cervical cancer. A ‘1’ means the test found something
abnormal, and a ‘0’ means it didn’t.
Our ultimate goal is to figure out which of these risk factors are the “prime suspects” that cause the highest risk for an abnormal biopsy result.
Now that we know what we’re looking for, we need to carefully bring our evidence into the lab. In data science, our lab is the software R. We are going to load our CSV file into the system so we can start poking and prodding it.
One important detail: the caretakers of this data used a question
mark (?) to represent information they didn’t have (missing
data). We need to tell our system to recognize ? as ‘Not
Available’ (or NA in R).
# Define the path to our dataset
data_path <- "archive (5)/kag_risk_factors_cervical_cancer.csv"
# Load the data, making sure to treat '?' as missing values (NA)
cervical_data <- read.csv(data_path, na.strings = c("?", "NA", ""))
# Let's take a quick peek at the first few rows just to make sure it loaded correctly
head(cervical_data)
## Age Number.of.sexual.partners First.sexual.intercourse Num.of.pregnancies
## 1 18 4 15 1
## 2 15 1 14 1
## 3 34 1 NA 1
## 4 52 5 16 4
## 5 46 3 21 4
## 6 42 3 23 2
## Smokes Smokes..years. Smokes..packs.year. Hormonal.Contraceptives
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 1 37 37 1
## 5 0 0 0 1
## 6 0 0 0 0
## Hormonal.Contraceptives..years. IUD IUD..years. STDs STDs..number.
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 3 0 0 0 0
## 5 15 0 0 0 0
## 6 0 0 0 0 0
## STDs.condylomatosis STDs.cervical.condylomatosis STDs.vaginal.condylomatosis
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## STDs.vulvo.perineal.condylomatosis STDs.syphilis
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## STDs.pelvic.inflammatory.disease STDs.genital.herpes
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## STDs.molluscum.contagiosum STDs.AIDS STDs.HIV STDs.Hepatitis.B STDs.HPV
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## STDs..Number.of.diagnosis STDs..Time.since.first.diagnosis
## 1 0 NA
## 2 0 NA
## 3 0 NA
## 4 0 NA
## 5 0 NA
## 6 0 NA
## STDs..Time.since.last.diagnosis Dx.Cancer Dx.CIN Dx.HPV Dx Hinselmann
## 1 NA 0 0 0 0 0
## 2 NA 0 0 0 0 0
## 3 NA 0 0 0 0 0
## 4 NA 1 0 1 0 0
## 5 NA 0 0 0 0 0
## 6 NA 0 0 0 0 0
## Schiller Citology Biopsy
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
Imagine opening a filing cabinet. Before reading every single document, you first check how many folders there are, how many pages are in each folder, and what kind of information is stamped on them. This step is about getting the “lay of the land.” How big is our dataset? What kinds of numbers or text are we dealing with?
# How many rows (women) and columns (pieces of information) do we have?
dim(cervical_data)
## [1] 858 36
# Let's look at the structure of the data.
# Are the columns treated as text (characters)? Or as numbers?
str(cervical_data)
## 'data.frame': 858 obs. of 36 variables:
## $ Age : int 18 15 34 52 46 42 51 26 45 44 ...
## $ Number.of.sexual.partners : num 4 1 1 5 3 3 3 1 1 3 ...
## $ First.sexual.intercourse : num 15 14 NA 16 21 23 17 26 20 15 ...
## $ Num.of.pregnancies : num 1 1 1 4 4 2 6 3 5 NA ...
## $ Smokes : num 0 0 0 1 0 0 1 0 0 1 ...
## $ Smokes..years. : num 0 0 0 37 0 ...
## $ Smokes..packs.year. : num 0 0 0 37 0 0 3.4 0 0 2.8 ...
## $ Hormonal.Contraceptives : num 0 0 0 1 1 0 0 1 0 0 ...
## $ Hormonal.Contraceptives..years. : num 0 0 0 3 15 0 0 2 0 0 ...
## $ IUD : num 0 0 0 0 0 0 1 1 0 NA ...
## $ IUD..years. : num 0 0 0 0 0 0 7 7 0 NA ...
## $ STDs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ STDs..number. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ STDs.condylomatosis : num 0 0 0 0 0 0 0 0 0 0 ...
## $ STDs.cervical.condylomatosis : num 0 0 0 0 0 0 0 0 0 0 ...
## $ STDs.vaginal.condylomatosis : num 0 0 0 0 0 0 0 0 0 0 ...
## $ STDs.vulvo.perineal.condylomatosis: num 0 0 0 0 0 0 0 0 0 0 ...
## $ STDs.syphilis : num 0 0 0 0 0 0 0 0 0 0 ...
## $ STDs.pelvic.inflammatory.disease : num 0 0 0 0 0 0 0 0 0 0 ...
## $ STDs.genital.herpes : num 0 0 0 0 0 0 0 0 0 0 ...
## $ STDs.molluscum.contagiosum : num 0 0 0 0 0 0 0 0 0 0 ...
## $ STDs.AIDS : num 0 0 0 0 0 0 0 0 0 0 ...
## $ STDs.HIV : num 0 0 0 0 0 0 0 0 0 0 ...
## $ STDs.Hepatitis.B : num 0 0 0 0 0 0 0 0 0 0 ...
## $ STDs.HPV : num 0 0 0 0 0 0 0 0 0 0 ...
## $ STDs..Number.of.diagnosis : int 0 0 0 0 0 0 0 0 0 0 ...
## $ STDs..Time.since.first.diagnosis : num NA NA NA NA NA NA NA NA NA NA ...
## $ STDs..Time.since.last.diagnosis : num NA NA NA NA NA NA NA NA NA NA ...
## $ Dx.Cancer : int 0 0 0 1 0 0 0 0 1 0 ...
## $ Dx.CIN : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Dx.HPV : int 0 0 0 1 0 0 0 0 1 0 ...
## $ Dx : int 0 0 0 0 0 0 0 0 1 0 ...
## $ Hinselmann : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Schiller : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Citology : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Biopsy : int 0 0 0 0 0 0 1 0 0 0 ...
Every detective knows that crime scenes can be messy. Data is exactly the same. We often have missing pieces of information, numbers accidentally saved as text, or strange values that don’t make sense. Before we can do any real analysis, we have to clean the room. We will convert all those text columns back into proper numbers, and we’ll decide what to do about the missing gaps in our knowledge.
# Many columns loaded as 'character' because of the '?' marks that were present originally.
# Let's convert all applicable columns to numeric.
cervical_data_clean <- cervical_data %>%
mutate_all(as.numeric) %>%
clean_names()
# Let's check how many missing values we have in each column now.
colSums(is.na(cervical_data_clean))
## age number_of_sexual_partners
## 0 26
## first_sexual_intercourse num_of_pregnancies
## 7 56
## smokes smokes_years
## 13 13
## smokes_packs_year hormonal_contraceptives
## 13 108
## hormonal_contraceptives_years iud
## 108 117
## iud_years st_ds
## 117 105
## st_ds_number st_ds_condylomatosis
## 105 105
## st_ds_cervical_condylomatosis st_ds_vaginal_condylomatosis
## 105 105
## st_ds_vulvo_perineal_condylomatosis st_ds_syphilis
## 105 105
## st_ds_pelvic_inflammatory_disease st_ds_genital_herpes
## 105 105
## st_ds_molluscum_contagiosum st_ds_aids
## 105 105
## st_ds_hiv st_ds_hepatitis_b
## 105 105
## st_ds_hpv st_ds_number_of_diagnosis
## 105 0
## st_ds_time_since_first_diagnosis st_ds_time_since_last_diagnosis
## 787 787
## dx_cancer dx_cin
## 0 0
## dx_hpv dx
## 0 0
## hinselmann schiller
## 0 0
## citology biopsy
## 0 0
# Based on the missing data count:
# 1. 'st_ds_time_since_first_diagnosis' and 'st_ds_time_since_last_diagnosis' have 787 missing values.
# Since we only have 858 rows total, these columns are almost entirely empty. We must drop them.
cervical_data_clean <- cervical_data_clean %>%
select(-st_ds_time_since_first_diagnosis, -st_ds_time_since_last_diagnosis)
# 2. For the other columns with missing values (like smokes, hormonal_contraceptives, iud, st_ds),
# we have a relatively small number of missing rows (around 13 to 117).
# A common and safe approach for boolean (0/1) or count data is to impute (fill in) the median
# value, which essentially assumes the "typical" patient profile for missing data,
# or to just omit rows when building models.
# Here, let's impute the median for the core risk factors to retain as much data as possible.
impute_median <- function(x) {
replace_na(x, median(x, na.rm = TRUE))
}
cervical_data_clean <- cervical_data_clean %>%
mutate(across(everything(), impute_median))
# Verify it's clean
sum(is.na(cervical_data_clean))
## [1] 0
In univariate analysis, we look at each variable entirely on its own. How old are most of the women in this dataset? How many of them smoke?
1. Age Distribution
# Distribution of Age
ggplot(cervical_data_clean, aes(x = age)) +
geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
theme_minimal() +
labs(title = "Age Distribution of Patients", x = "Age", y = "Count")
Looking at this age lineup in this dataset, it’s clear we’re largely dealing with a very young crowd. The vast majority of the women in our records are in their late teens to early twenties, with the highest spike sitting right around age 20. As we trace the graph to the right, the numbers drop off sharply—meaning we have far fewer women in their 40s, 50s, or older. This shape (what statisticians call a ‘right skew’) tells us that any patterns we find in this data are going to be heavily weighted toward what young women are experiencing. It’s a vital piece of context: we aren’t looking at a retirement community; we are investigating the health profiles of women in the peak of their reproductive and sexually active years.
2. What is a Biopsy and Why Does it Matter?
The cervix is the lower part of the uterus that connects to the vagina. Sometimes, usually due to a viral infection like HPV, the cells on the surface of the cervix begin to mutate and grow abnormally. If left unchecked over many years, these abnormal cells can turn into cervical cancer.
A biopsy is a definitive medical procedure where a doctor removes a tiny pinch of tissue from the cervix to examine under a microscope. It is the final judge and jury on whether a patient actually has precancerous or cancerous cells.
# How many patients had an abnormal Biopsy?
ggplot(cervical_data_clean, aes(x = factor(biopsy))) +
geom_bar(fill = c("lightgreen", "salmon"), color = "black") +
theme_minimal() +
labs(title = "Biopsy Results (0 = Normal, 1 = Abnormal)", x = "Biopsy Result", y = "Count")
This bar chart reveals a classic “needle in a haystack” problem in data science. Over 800 of our patients (the towering green bar) had perfectly healthy, normal biopsies. Only a very small handful (the sliver of red on the right) actually had abnormal, potentially dangerous cells.
Why is this important? Because it means cervical abnormalities are extremely rare in this dataset. We have to be very careful detectives. If we try to build a psychological profile of a “high risk” patient, our mathematical models might get lazy and just automatically guess “Healthy” every single time, because they know they’ll be right 93% of the time! We now know going forward that predicting the rare “1” is going to require finding very specific, undeniably strong risk factors.
3. Smoking Habits
Smoking is a known immunosuppressant and a major risk factor for many cancers. Let’s see what proportion of the women in this dataset are smokers.
ggplot(cervical_data_clean, aes(x = factor(smokes))) +
geom_bar(fill = c("lightblue", "orange"), color = "black") +
theme_minimal() +
labs(title = "Smoking Habits (0 = Non-Smoker, 1 = Smoker)", x = "Smokes?", y = "Count")
The vast majority of the women in this dataset are non-smokers (0). While this is good for general public health, for our investigation it means that if we do find a strong link between smoking and abnormal biopsies later on, that link is clinically very potent, because the smoking group (1) is quite small to begin with.
4. Number of Sexual Partners
Cervical cancer is strongly linked to HPV, a sexually transmitted infection. Therefore, the number of sexual partners is a key behavioral metric to investigate.
ggplot(cervical_data_clean, aes(x = number_of_sexual_partners)) +
geom_histogram(binwidth = 1, fill = "purple", color = "black", alpha = 0.7) +
theme_minimal() +
labs(title = "Distribution of Number of Sexual Partners", x = "Number of Partners", y = "Count")
This distribution is also heavily ‘right-skewed’. Most women report having between 1 and 3 partners, with a massive peak at 1 or 2. There is a long tail stretching out to the right, showing a few individuals with a highly elevated number of partners, which we will need to keep an eye on when examining STD and biopsy rates.
Bivariate analysis looks at two variables at the same time to see if they are related. For example, does age have a connection to whether a biopsy is abnormal? Does smoking increase the chances?
# Let's look at Age vs Biopsy Result
ggplot(cervical_data_clean, aes(x = factor(biopsy), y = age, fill = factor(biopsy))) +
geom_boxplot() +
theme_minimal() +
labs(title = "Age vs. Biopsy Result", x = "Biopsy Result (0 = Normal, 1 = Abnormal)", y = "Age") +
scale_fill_manual(values = c("lightgreen", "salmon")) +
theme(legend.position = "none")
Interpreting the Age vs Biopsy Boxplot: What exactly are we looking at here? A boxplot is a fantastic way to see the “typical” range of a group. The thick black line in the middle of each box represents the median (average) age of that group.
# Smoking vs Biopsy Result
cervical_data_clean %>%
ggplot(aes(x = factor(smokes), fill = factor(biopsy))) +
geom_bar(position = "dodge", color = "black") +
theme_minimal() +
labs(
title = "Count of Biopsy Results by Smoking Status",
x = "Smokes? (0 = No, 1 = Yes)", y = "Count of Patients"
) +
scale_fill_manual(values = c("lightgreen", "salmon"), name = "Biopsy")
Interpreting the Smoking vs Biopsy Chart (Side-by-Side Counts): This is a much more straightforward chart. It simply places the healthy people (green bars) next to the people with abnormal biopsies (red bars) and groups them by whether they smoke or not.
In multivariate analysis, we look at many variables at once. Are smoking and age connected? Are older women more likely to use certain contraceptives? We use a correlation matrix, a heat map to instantly see which clues are strongly linked to each other.
# We want to see how numerical variables correlate.
# We'll select a subset of important numerical features and compute correlations.
cor_data <- cervical_data_clean %>%
select(
age, number_of_sexual_partners, first_sexual_intercourse,
num_of_pregnancies, smokes, hormonal_contraceptives, iud, st_ds_number_of_diagnosis, biopsy
)
# Compute the correlation matrix
cor_matrix <- cor(cor_data)
cor_matrix
## age number_of_sexual_partners
## age 1.000000000 0.0859713224
## number_of_sexual_partners 0.085971322 1.0000000000
## first_sexual_intercourse 0.369175414 -0.1458466166
## num_of_pregnancies 0.525891837 0.0774392125
## smokes 0.057203756 0.2368575171
## hormonal_contraceptives 0.029201254 0.0040274531
## iud 0.279429180 0.0324596214
## st_ds_number_of_diagnosis -0.001605942 0.0530557811
## biopsy 0.055955515 -0.0004082348
## first_sexual_intercourse num_of_pregnancies
## age 0.369175414 0.52589184
## number_of_sexual_partners -0.145846617 0.07743921
## first_sexual_intercourse 1.000000000 -0.05637438
## num_of_pregnancies -0.056374378 1.00000000
## smokes -0.123279803 0.08151714
## hormonal_contraceptives -0.009232491 0.11893816
## iud -0.010757838 0.20450054
## st_ds_number_of_diagnosis -0.013331389 0.03491232
## biopsy 0.007258726 0.04021507
## smokes hormonal_contraceptives iud
## age 0.057203756 0.0292012536 0.2794291799
## number_of_sexual_partners 0.236857517 0.0040274531 0.0324596214
## first_sexual_intercourse -0.123279803 -0.0092324912 -0.0107578382
## num_of_pregnancies 0.081517136 0.1189381564 0.2045005423
## smokes 1.000000000 0.0040356837 -0.0551154385
## hormonal_contraceptives 0.004035684 1.0000000000 0.0001882045
## iud -0.055115439 0.0001882045 1.0000000000
## st_ds_number_of_diagnosis 0.090725487 -0.0621985801 0.0357912898
## biopsy 0.028723760 -0.0180152523 0.0592305229
## st_ds_number_of_diagnosis biopsy
## age -0.001605942 0.0559555151
## number_of_sexual_partners 0.053055781 -0.0004082348
## first_sexual_intercourse -0.013331389 0.0072587257
## num_of_pregnancies 0.034912321 0.0402150719
## smokes 0.090725487 0.0287237598
## hormonal_contraceptives -0.062198580 -0.0180152523
## iud 0.035791290 0.0592305229
## st_ds_number_of_diagnosis 1.000000000 0.0974489209
## biopsy 0.097448921 1.0000000000
# Convert to a format suitable for ggplot
cor_melted <- as.data.frame(as.table(cor_matrix))
cor_melted
## Var1 Var2 Freq
## 1 age age 1.0000000000
## 2 number_of_sexual_partners age 0.0859713224
## 3 first_sexual_intercourse age 0.3691754140
## 4 num_of_pregnancies age 0.5258918373
## 5 smokes age 0.0572037563
## 6 hormonal_contraceptives age 0.0292012536
## 7 iud age 0.2794291799
## 8 st_ds_number_of_diagnosis age -0.0016059424
## 9 biopsy age 0.0559555151
## 10 age number_of_sexual_partners 0.0859713224
## 11 number_of_sexual_partners number_of_sexual_partners 1.0000000000
## 12 first_sexual_intercourse number_of_sexual_partners -0.1458466166
## 13 num_of_pregnancies number_of_sexual_partners 0.0774392125
## 14 smokes number_of_sexual_partners 0.2368575171
## 15 hormonal_contraceptives number_of_sexual_partners 0.0040274531
## 16 iud number_of_sexual_partners 0.0324596214
## 17 st_ds_number_of_diagnosis number_of_sexual_partners 0.0530557811
## 18 biopsy number_of_sexual_partners -0.0004082348
## 19 age first_sexual_intercourse 0.3691754140
## 20 number_of_sexual_partners first_sexual_intercourse -0.1458466166
## 21 first_sexual_intercourse first_sexual_intercourse 1.0000000000
## 22 num_of_pregnancies first_sexual_intercourse -0.0563743783
## 23 smokes first_sexual_intercourse -0.1232798026
## 24 hormonal_contraceptives first_sexual_intercourse -0.0092324912
## 25 iud first_sexual_intercourse -0.0107578382
## 26 st_ds_number_of_diagnosis first_sexual_intercourse -0.0133313891
## 27 biopsy first_sexual_intercourse 0.0072587257
## 28 age num_of_pregnancies 0.5258918373
## 29 number_of_sexual_partners num_of_pregnancies 0.0774392125
## 30 first_sexual_intercourse num_of_pregnancies -0.0563743783
## 31 num_of_pregnancies num_of_pregnancies 1.0000000000
## 32 smokes num_of_pregnancies 0.0815171361
## 33 hormonal_contraceptives num_of_pregnancies 0.1189381564
## 34 iud num_of_pregnancies 0.2045005423
## 35 st_ds_number_of_diagnosis num_of_pregnancies 0.0349123213
## 36 biopsy num_of_pregnancies 0.0402150719
## 37 age smokes 0.0572037563
## 38 number_of_sexual_partners smokes 0.2368575171
## 39 first_sexual_intercourse smokes -0.1232798026
## 40 num_of_pregnancies smokes 0.0815171361
## 41 smokes smokes 1.0000000000
## 42 hormonal_contraceptives smokes 0.0040356837
## 43 iud smokes -0.0551154385
## 44 st_ds_number_of_diagnosis smokes 0.0907254866
## 45 biopsy smokes 0.0287237598
## 46 age hormonal_contraceptives 0.0292012536
## 47 number_of_sexual_partners hormonal_contraceptives 0.0040274531
## 48 first_sexual_intercourse hormonal_contraceptives -0.0092324912
## 49 num_of_pregnancies hormonal_contraceptives 0.1189381564
## 50 smokes hormonal_contraceptives 0.0040356837
## 51 hormonal_contraceptives hormonal_contraceptives 1.0000000000
## 52 iud hormonal_contraceptives 0.0001882045
## 53 st_ds_number_of_diagnosis hormonal_contraceptives -0.0621985801
## 54 biopsy hormonal_contraceptives -0.0180152523
## 55 age iud 0.2794291799
## 56 number_of_sexual_partners iud 0.0324596214
## 57 first_sexual_intercourse iud -0.0107578382
## 58 num_of_pregnancies iud 0.2045005423
## 59 smokes iud -0.0551154385
## 60 hormonal_contraceptives iud 0.0001882045
## 61 iud iud 1.0000000000
## 62 st_ds_number_of_diagnosis iud 0.0357912898
## 63 biopsy iud 0.0592305229
## 64 age st_ds_number_of_diagnosis -0.0016059424
## 65 number_of_sexual_partners st_ds_number_of_diagnosis 0.0530557811
## 66 first_sexual_intercourse st_ds_number_of_diagnosis -0.0133313891
## 67 num_of_pregnancies st_ds_number_of_diagnosis 0.0349123213
## 68 smokes st_ds_number_of_diagnosis 0.0907254866
## 69 hormonal_contraceptives st_ds_number_of_diagnosis -0.0621985801
## 70 iud st_ds_number_of_diagnosis 0.0357912898
## 71 st_ds_number_of_diagnosis st_ds_number_of_diagnosis 1.0000000000
## 72 biopsy st_ds_number_of_diagnosis 0.0974489209
## 73 age biopsy 0.0559555151
## 74 number_of_sexual_partners biopsy -0.0004082348
## 75 first_sexual_intercourse biopsy 0.0072587257
## 76 num_of_pregnancies biopsy 0.0402150719
## 77 smokes biopsy 0.0287237598
## 78 hormonal_contraceptives biopsy -0.0180152523
## 79 iud biopsy 0.0592305229
## 80 st_ds_number_of_diagnosis biopsy 0.0974489209
## 81 biopsy biopsy 1.0000000000
# Plot the heatmap
ggplot(cor_melted, aes(Var1, Var2, fill = Freq)) +
geom_tile(color = "white") +
scale_fill_gradient2(
low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), name = "Pearson\nCorrelation"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
labs(title = "Correlation Heatmap of Risk Factors", x = "", y = "")
How to Read the Evidence Heatmap:
This grid is called a “Correlation Matrix.” The colors tell us the strength* of the relationship between any two clues.*
age and num_of_pregnancies intersect. It’s
solid red. This makes perfect logical sense: as women get older, they
typically have had more pregnancies.biopsy):
The most important row in this entire grid is the top row (labeled
biopsy). Scan your eyes across that top row. It is mostly
very faint white/pink squares. What does this mean? It means there is no
single, massive “smoking gun” variable that perfectly predicts cervical
cancer all by itself. All the connections to an abnormal biopsy are weak
on their own. This proves that cervical cancer risk is
multivariate it’s likely caused by a combination of smaller
factors rather than one giant obvious one.*Sometimes a clue is so unusual that it demands special attention. Is it a typo, or a genuinely extreme case? We call these ‘outliers.’ Let’s look at the ‘Number of sexual partners’ and ‘Num of pregnancies’ to see if anyone falls far outside the norm.
# Boxplot for Number of Sexual Partners
ggplot(cervical_data_clean, aes(y = number_of_sexual_partners)) +
geom_boxplot(fill = "orange") +
theme_minimal() +
labs(title = "Outlier Detection: Number of Sexual Partners", y = "Partners")
Interpreting the ‘Number of Sexual Partners’ Outliers:
# Boxplot for Number of Pregnancies
ggplot(cervical_data_clean, aes(y = num_of_pregnancies)) +
geom_boxplot(fill = "lightblue") +
theme_minimal() +
labs(title = "Outlier Detection: Number of Pregnancies", y = "Pregnancies")
# We can see some extreme values (e.g., individuals reporting > 15 partners or > 10 pregnancies).
# In medical data, these could be real values, so we won't delete them, but we will keep them in mind
# for our final analysis, as they might heavily influence our results.
Interpreting the ‘Number of Pregnancies’ Outliers:
Sometimes, the clues we have aren’t quite enough, but if we combine them, they tell a bigger story. This is “Feature Engineering.” For example, maybe it’s not just about one specific STD, but whether the patient has a history of ANY STD combined with smoking.
# Let's create a new 'High Risk Behavior Score' just as an example.
# We will sum up indications of smoking, high number of partners (which we'll define as > 5), and STD history.
# (Note: We already handled all missing NA values in Step 4, so we can cleanly add these together in one pass!)
cervical_data_clean <- cervical_data_clean %>%
mutate(
high_partners_flag = ifelse(number_of_sexual_partners > 5, 1, 0),
behavior_risk_score = smokes + st_ds + high_partners_flag
)
# Let's see if this new score relates to Biopsy results
cervical_data_clean %>%
ggplot(aes(x = factor(behavior_risk_score), fill = factor(biopsy))) +
geom_bar(position = "dodge", color = "black") +
theme_minimal() +
labs(title = "Count of Biopsy Results by Behavior Risk Score", x = "Risk Score (0 - 3)", y = "Count of Patients") +
scale_fill_manual(values = c("lightgreen", "salmon"), name = "Biopsy")
Interpreting the Feature Engineering Chart (Behavior Risk Score):
score 0. This tells us that the vast majority of our
patients exhibited none of these three risky behaviors and their
biopsies were overwhelmingly normal.After sifting through all the clues, we need to gather our prime suspects into one interrogation room. Medical research and our initial exploration tell us that cervical cancer is heavily linked to sexual history (especially HPV infection), smoking, and age. We are now going to create a smaller, focused dataset a ‘Prime Suspects’ list to dig deeper.
# Let's create a focused dataframe containing only the most critical risk factors
# and our main target: Biopsy.
prime_suspects_data <- cervical_data_clean %>%
select(
age,
number_of_sexual_partners,
first_sexual_intercourse,
smokes,
smokes_years,
hormonal_contraceptives,
hormonal_contraceptives_years,
iud,
st_ds_number_of_diagnosis,
st_ds_hpv, # HPV is the leading cause of cervical cancer
st_ds_hiv, # Immune system suppression is a massive risk factor
biopsy
)
# Let's run a quick statistical summary (a Logistic Regression model)
# to see which of these prime suspects are actually guilty beyond a reasonable doubt.
# In layman's terms: This math equation tries to predict the 'Biopsy' result using our clues.
risk_model <- glm(biopsy ~ ., data = prime_suspects_data, family = "binomial")
# Let's view the summary of our 'trial'
summary(risk_model)
##
## Call:
## glm(formula = biopsy ~ ., family = "binomial", data = prime_suspects_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.980528 1.002690 -2.973 0.00295 **
## age 0.002312 0.020097 0.115 0.90842
## number_of_sexual_partners -0.034906 0.103524 -0.337 0.73598
## first_sexual_intercourse 0.008598 0.056426 0.152 0.87888
## smokes -0.137280 0.605746 -0.227 0.82071
## smokes_years 0.040550 0.042202 0.961 0.33663
## hormonal_contraceptives -0.396010 0.339082 -1.168 0.24285
## hormonal_contraceptives_years 0.092681 0.034375 2.696 0.00701 **
## iud 0.460522 0.420309 1.096 0.27322
## st_ds_number_of_diagnosis 0.432076 0.455683 0.948 0.34303
## st_ds_hpv -12.255220 623.920041 -0.020 0.98433
## st_ds_hiv 1.264908 0.769217 1.644 0.10009
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 408.60 on 857 degrees of freedom
## Residual deviance: 387.65 on 846 degrees of freedom
## AIC: 411.65
##
## Number of Fisher Scoring iterations: 13
# Evaluating Feature Importance:
# Which factors contribute the most to the prediction?
# We can look at the absolute value of the t-statistic (or z-value in a binomial model) for each variable.
# We use the 'broom' package which is lightweight and built for cleanly extracting model data.
library(broom)
importance <- tidy(risk_model) %>%
filter(term != "(Intercept)") %>% # We don't care about the baseline intercept
mutate(Overall = abs(statistic)) %>%
arrange(desc(Overall)) %>%
rename(Feature = term)
# Let's visualize the "Guilt" of our suspects
ggplot(importance, aes(x = reorder(Feature, Overall), y = Overall)) +
geom_bar(stat = "identity", fill = "steelblue", color = "black") +
coord_flip() +
theme_minimal() +
labs(title = "Feature Importance for Predicting Abnormal Biopsy", x = "Risk Factor", y = "Importance (absolute z-value)")
The Verdict: Who is Guilty? The bar chart above is the final result of our Logistic Regression model. It ranks the clues based on how strongly they pull the prediction toward an “Abnormal” biopsy. Let’s look at the top suspects in our specific dataset:
hormonal_contraceptives_years and
hormonal_contraceptives are near the very top of the list!
While the pill does not cause cancer directly like a virus
does, medical studies confirm that long-term use can slightly increase
the risk of cervical cancer in women who are already infected with HPV.
The pill changes the hormonal environment of the cervix, making it
easier for the virus to thrive and mutate cells.st_ds_hiv):
HIV is extremely high on the list. This makes perfect medical sense! HIV
attacks the body’s immune system. If a woman’s immune system is severely
compromised, she cannot fight off an underlying HPV infection, allowing
the virus to rapidly develop into cervical dysplasia (which is exactly
what an abnormal Biopsy detects).st_ds_hpv)?
Surprisingly, it is at the very bottom of this specific chart! Does this
mean medical science is wrong? Absolutely not. It just means in this
specific, real-world dataset, the recording of HPV diagnoses was
likely sparse, incomplete, or patients were undiagnosed at the time of
their visits. Models can only learn from the data they are handed—this
highlights why domain knowledge (like knowing HIV suppresses the immune
system) is so critical when interpreting data!*Our detective work is almost done. We know the suspects and we know how they operate. But solving a crime isn’t just about catching the bad guy it’s about preventing the next one. How can we use this data to inform public health policy?
Actionable Policy and Prevention Insights: Based on our deep dive into the data, we shouldn’t just rely on generic advice. We need policies targeted at what our logistic regression model actually found in this population:
Revising Hormonal Contraceptive Guidelines:
Since hormonal_contraceptives (and years used) featured so
heavily as a predictor of abnormal biopsies, policymakers need to ensure
women are fully educated on the long-term risks. Clinics should be
required to offer mandatory, aggressive HPV screening before prescribing
long-term hormonal birth control, as the viral-hormonal interaction is a
massive risk multiplier.
Integrated HIV & Cervical Cancer Screening:
The presence of st_ds_hiv was one of the strongest
predictors of abnormal cells. Public health policy must stop treating
these diseases in silos. Funding should be directed toward “One-Stop”
clinics: if a woman tests positive for HIV, an immediate, subsidized
cervical biopsy and HPV test must become the standard of care due to her
suppressed immune system.
Addressing the “Age & Partners” Reality: The data showed abnormal biopsies ticking upward in older patients and those with a high number of sexual partners. Policies need to aggressively fund accessible, free Pap smears specifically targeting women in their late 20s and 30s. Education campaigns must destigmatize regular checks for women with multiple partners, framing it as routine maintenance rather than a consequence.
Fixing the Data Gap (The HPV Vaccine): We know
medically that HPV causes cervical cancer, yet st_ds_hpv
barely registered in our model. Why? Because the data is missing or
patients are undiagnosed! The biggest policy recommendation is to fund
massive, subsidized HPV testing and vaccination campaigns for young
girls and boys before their first sexual encounter. We need
better real-world tracking of HPV to stop this cancer before it
starts.
Conclusion: The data tells a clear story: In this population, predicting cervical cancer risk is heavily reliant on a patient’s hormonal history (birth control) and their immune system’s strength (HIV/STDs). By rewriting public health policy to aggressively test for HPV when prescribing birth control or treating HIV, we can catch the precursor to cancer and save lives.