Cervical Cancer

Step 1: Understanding the Problem and Dataset

Imagine you’re a detective walking into a room full of clues. Each clue might tell you a part of the story, but on its own, it’s not enough to draw a conclusion. In our case, the ‘room’ is this dataset, and the ‘clues’ are various lifestyle factors, medical histories, and habits of different women. Our ‘mystery’ to solve is understanding what factors are most strongly linked to the risk of developing cervical cancer.

Before touching a line of code, let’s understand what we’re looking at. This dataset contains medical and demographic information about a large group of women. It tracks potential risk factors for cervical cancer, such as:

Demographics & Habits: Age, smoking history.
Sexual History: Number of partners, age of first intercourse.
Medical History: Use of hormonal contraceptives, IUDs (Intrauterine Devices), and history of Sexually Transmitted Diseases (STDs).
Outcomes (The Diagnosis): We have several columns at the very end like Hinselmann, Schiller, Citology, and Biopsy. These are different medical tests used to detect abnormalities that could lead to cervical cancer. A ‘1’ means the test found something abnormal, and a ‘0’ means it didn’t.
Our ultimate goal is to figure out which of these risk factors are the “prime suspects” that cause the highest risk for an abnormal biopsy result.

Step 2: Import / Load the Data

Now that we know what we’re looking for, we need to carefully bring our evidence into the lab. In data science, our lab is the software R. We are going to load our CSV file into the system so we can start poking and prodding it.

One important detail: the caretakers of this data used a question mark (?) to represent information they didn’t have (missing data). We need to tell our system to recognize ? as ‘Not Available’ (or NA in R).

# Define the path to our dataset
data_path <- "archive (5)/kag_risk_factors_cervical_cancer.csv"

# Load the data, making sure to treat '?' as missing values (NA)
cervical_data <- read.csv(data_path, na.strings = c("?", "NA", ""))

# Let's take a quick peek at the first few rows just to make sure it loaded correctly
head(cervical_data)

##   Age Number.of.sexual.partners First.sexual.intercourse Num.of.pregnancies
## 1  18                         4                       15                  1
## 2  15                         1                       14                  1
## 3  34                         1                       NA                  1
## 4  52                         5                       16                  4
## 5  46                         3                       21                  4
## 6  42                         3                       23                  2
##   Smokes Smokes..years. Smokes..packs.year. Hormonal.Contraceptives
## 1      0              0                   0                       0
## 2      0              0                   0                       0
## 3      0              0                   0                       0
## 4      1             37                  37                       1
## 5      0              0                   0                       1
## 6      0              0                   0                       0
##   Hormonal.Contraceptives..years. IUD IUD..years. STDs STDs..number.
## 1                               0   0           0    0             0
## 2                               0   0           0    0             0
## 3                               0   0           0    0             0
## 4                               3   0           0    0             0
## 5                              15   0           0    0             0
## 6                               0   0           0    0             0
##   STDs.condylomatosis STDs.cervical.condylomatosis STDs.vaginal.condylomatosis
## 1                   0                            0                           0
## 2                   0                            0                           0
## 3                   0                            0                           0
## 4                   0                            0                           0
## 5                   0                            0                           0
## 6                   0                            0                           0
##   STDs.vulvo.perineal.condylomatosis STDs.syphilis
## 1                                  0             0
## 2                                  0             0
## 3                                  0             0
## 4                                  0             0
## 5                                  0             0
## 6                                  0             0
##   STDs.pelvic.inflammatory.disease STDs.genital.herpes
## 1                                0                   0
## 2                                0                   0
## 3                                0                   0
## 4                                0                   0
## 5                                0                   0
## 6                                0                   0
##   STDs.molluscum.contagiosum STDs.AIDS STDs.HIV STDs.Hepatitis.B STDs.HPV
## 1                          0         0        0                0        0
## 2                          0         0        0                0        0
## 3                          0         0        0                0        0
## 4                          0         0        0                0        0
## 5                          0         0        0                0        0
## 6                          0         0        0                0        0
##   STDs..Number.of.diagnosis STDs..Time.since.first.diagnosis
## 1                         0                               NA
## 2                         0                               NA
## 3                         0                               NA
## 4                         0                               NA
## 5                         0                               NA
## 6                         0                               NA
##   STDs..Time.since.last.diagnosis Dx.Cancer Dx.CIN Dx.HPV Dx Hinselmann
## 1                              NA         0      0      0  0          0
## 2                              NA         0      0      0  0          0
## 3                              NA         0      0      0  0          0
## 4                              NA         1      0      1  0          0
## 5                              NA         0      0      0  0          0
## 6                              NA         0      0      0  0          0
##   Schiller Citology Biopsy
## 1        0        0      0
## 2        0        0      0
## 3        0        0      0
## 4        0        0      0
## 5        0        0      0
## 6        0        0      0

Step 3: Inspect the Data Structure

Imagine opening a filing cabinet. Before reading every single document, you first check how many folders there are, how many pages are in each folder, and what kind of information is stamped on them. This step is about getting the “lay of the land.” How big is our dataset? What kinds of numbers or text are we dealing with?

# How many rows (women) and columns (pieces of information) do we have?
dim(cervical_data)

## [1] 858  36

# Let's look at the structure of the data.
# Are the columns treated as text (characters)? Or as numbers?
str(cervical_data)

## 'data.frame':    858 obs. of  36 variables:
##  $ Age                               : int  18 15 34 52 46 42 51 26 45 44 ...
##  $ Number.of.sexual.partners         : num  4 1 1 5 3 3 3 1 1 3 ...
##  $ First.sexual.intercourse          : num  15 14 NA 16 21 23 17 26 20 15 ...
##  $ Num.of.pregnancies                : num  1 1 1 4 4 2 6 3 5 NA ...
##  $ Smokes                            : num  0 0 0 1 0 0 1 0 0 1 ...
##  $ Smokes..years.                    : num  0 0 0 37 0 ...
##  $ Smokes..packs.year.               : num  0 0 0 37 0 0 3.4 0 0 2.8 ...
##  $ Hormonal.Contraceptives           : num  0 0 0 1 1 0 0 1 0 0 ...
##  $ Hormonal.Contraceptives..years.   : num  0 0 0 3 15 0 0 2 0 0 ...
##  $ IUD                               : num  0 0 0 0 0 0 1 1 0 NA ...
##  $ IUD..years.                       : num  0 0 0 0 0 0 7 7 0 NA ...
##  $ STDs                              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs..number.                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs.condylomatosis               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs.cervical.condylomatosis      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs.vaginal.condylomatosis       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs.vulvo.perineal.condylomatosis: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs.syphilis                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs.pelvic.inflammatory.disease  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs.genital.herpes               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs.molluscum.contagiosum        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs.AIDS                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs.HIV                          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs.Hepatitis.B                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs.HPV                          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs..Number.of.diagnosis         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs..Time.since.first.diagnosis  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ STDs..Time.since.last.diagnosis   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Dx.Cancer                         : int  0 0 0 1 0 0 0 0 1 0 ...
##  $ Dx.CIN                            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Dx.HPV                            : int  0 0 0 1 0 0 0 0 1 0 ...
##  $ Dx                                : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ Hinselmann                        : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Schiller                          : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Citology                          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Biopsy                            : int  0 0 0 0 0 0 1 0 0 0 ...

Step 4: Data Cleaning

Every detective knows that crime scenes can be messy. Data is exactly the same. We often have missing pieces of information, numbers accidentally saved as text, or strange values that don’t make sense. Before we can do any real analysis, we have to clean the room. We will convert all those text columns back into proper numbers, and we’ll decide what to do about the missing gaps in our knowledge.

# Many columns loaded as 'character' because of the '?' marks that were present originally.
# Let's convert all applicable columns to numeric.
cervical_data_clean <- cervical_data %>%
  mutate_all(as.numeric) %>%
  clean_names()

# Let's check how many missing values we have in each column now.
colSums(is.na(cervical_data_clean))

##                                 age           number_of_sexual_partners 
##                                   0                                  26 
##            first_sexual_intercourse                  num_of_pregnancies 
##                                   7                                  56 
##                              smokes                        smokes_years 
##                                  13                                  13 
##                   smokes_packs_year             hormonal_contraceptives 
##                                  13                                 108 
##       hormonal_contraceptives_years                                 iud 
##                                 108                                 117 
##                           iud_years                               st_ds 
##                                 117                                 105 
##                        st_ds_number                st_ds_condylomatosis 
##                                 105                                 105 
##       st_ds_cervical_condylomatosis        st_ds_vaginal_condylomatosis 
##                                 105                                 105 
## st_ds_vulvo_perineal_condylomatosis                      st_ds_syphilis 
##                                 105                                 105 
##   st_ds_pelvic_inflammatory_disease                st_ds_genital_herpes 
##                                 105                                 105 
##         st_ds_molluscum_contagiosum                          st_ds_aids 
##                                 105                                 105 
##                           st_ds_hiv                   st_ds_hepatitis_b 
##                                 105                                 105 
##                           st_ds_hpv           st_ds_number_of_diagnosis 
##                                 105                                   0 
##    st_ds_time_since_first_diagnosis     st_ds_time_since_last_diagnosis 
##                                 787                                 787 
##                           dx_cancer                              dx_cin 
##                                   0                                   0 
##                              dx_hpv                                  dx 
##                                   0                                   0 
##                          hinselmann                            schiller 
##                                   0                                   0 
##                            citology                              biopsy 
##                                   0                                   0

# Based on the missing data count:
# 1. 'st_ds_time_since_first_diagnosis' and 'st_ds_time_since_last_diagnosis' have 787 missing values.
#    Since we only have 858 rows total, these columns are almost entirely empty. We must drop them.
cervical_data_clean <- cervical_data_clean %>%
  select(-st_ds_time_since_first_diagnosis, -st_ds_time_since_last_diagnosis)

# 2. For the other columns with missing values (like smokes, hormonal_contraceptives, iud, st_ds),
#    we have a relatively small number of missing rows (around 13 to 117).
#    A common and safe approach for boolean (0/1) or count data is to impute (fill in) the median
#    value, which essentially assumes the "typical" patient profile for missing data,
#    or to just omit rows when building models.
#    Here, let's impute the median for the core risk factors to retain as much data as possible.

impute_median <- function(x) {
  replace_na(x, median(x, na.rm = TRUE))
}

cervical_data_clean <- cervical_data_clean %>%
  mutate(across(everything(), impute_median))

# Verify it's clean
sum(is.na(cervical_data_clean))

## [1] 0

Step 5: Univariate Analysis

In univariate analysis, we look at each variable entirely on its own. How old are most of the women in this dataset? How many of them smoke?

1. Age Distribution

# Distribution of Age
ggplot(cervical_data_clean, aes(x = age)) +
  geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
  theme_minimal() +
  labs(title = "Age Distribution of Patients", x = "Age", y = "Count")

Looking at this age lineup in this dataset, it’s clear we’re largely dealing with a very young crowd. The vast majority of the women in our records are in their late teens to early twenties, with the highest spike sitting right around age 20. As we trace the graph to the right, the numbers drop off sharply—meaning we have far fewer women in their 40s, 50s, or older. This shape (what statisticians call a ‘right skew’) tells us that any patterns we find in this data are going to be heavily weighted toward what young women are experiencing. It’s a vital piece of context: we aren’t looking at a retirement community; we are investigating the health profiles of women in the peak of their reproductive and sexually active years.

2. What is a Biopsy and Why Does it Matter?

The cervix is the lower part of the uterus that connects to the vagina. Sometimes, usually due to a viral infection like HPV, the cells on the surface of the cervix begin to mutate and grow abnormally. If left unchecked over many years, these abnormal cells can turn into cervical cancer.

A biopsy is a definitive medical procedure where a doctor removes a tiny pinch of tissue from the cervix to examine under a microscope. It is the final judge and jury on whether a patient actually has precancerous or cancerous cells.

- Normal (0 - The Giant Green Bar): The tissue sample showed healthy, perfectly normal cells. The patient is clear of immediate cervical cancer risk.
- Abnormal (1 - The Tiny Red Bar): The tissue sample showed dysplasia (precancerous abnormal cells) or actual cervical cancer. Finding the specific clues and risk factors that lead to this dangerous ‘1’ ranking is the entire purpose of our investigation.*

# How many patients had an abnormal Biopsy?
ggplot(cervical_data_clean, aes(x = factor(biopsy))) +
  geom_bar(fill = c("lightgreen", "salmon"), color = "black") +
  theme_minimal() +
  labs(title = "Biopsy Results (0 = Normal, 1 = Abnormal)", x = "Biopsy Result", y = "Count")

This bar chart reveals a classic “needle in a haystack” problem in data science. Over 800 of our patients (the towering green bar) had perfectly healthy, normal biopsies. Only a very small handful (the sliver of red on the right) actually had abnormal, potentially dangerous cells.

Why is this important? Because it means cervical abnormalities are extremely rare in this dataset. We have to be very careful detectives. If we try to build a psychological profile of a “high risk” patient, our mathematical models might get lazy and just automatically guess “Healthy” every single time, because they know they’ll be right 93% of the time! We now know going forward that predicting the rare “1” is going to require finding very specific, undeniably strong risk factors.

3. Smoking Habits

Smoking is a known immunosuppressant and a major risk factor for many cancers. Let’s see what proportion of the women in this dataset are smokers.

ggplot(cervical_data_clean, aes(x = factor(smokes))) +
  geom_bar(fill = c("lightblue", "orange"), color = "black") +
  theme_minimal() +
  labs(title = "Smoking Habits (0 = Non-Smoker, 1 = Smoker)", x = "Smokes?", y = "Count")

The vast majority of the women in this dataset are non-smokers (0). While this is good for general public health, for our investigation it means that if we do find a strong link between smoking and abnormal biopsies later on, that link is clinically very potent, because the smoking group (1) is quite small to begin with.

4. Number of Sexual Partners

Cervical cancer is strongly linked to HPV, a sexually transmitted infection. Therefore, the number of sexual partners is a key behavioral metric to investigate.

ggplot(cervical_data_clean, aes(x = number_of_sexual_partners)) +
  geom_histogram(binwidth = 1, fill = "purple", color = "black", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Distribution of Number of Sexual Partners", x = "Number of Partners", y = "Count")

This distribution is also heavily ‘right-skewed’. Most women report having between 1 and 3 partners, with a massive peak at 1 or 2. There is a long tail stretching out to the right, showing a few individuals with a highly elevated number of partners, which we will need to keep an eye on when examining STD and biopsy rates.

Step 6: Bivariate Analysis

Bivariate analysis looks at two variables at the same time to see if they are related. For example, does age have a connection to whether a biopsy is abnormal? Does smoking increase the chances?

# Let's look at Age vs Biopsy Result
ggplot(cervical_data_clean, aes(x = factor(biopsy), y = age, fill = factor(biopsy))) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Age vs. Biopsy Result", x = "Biopsy Result (0 = Normal, 1 = Abnormal)", y = "Age") +
  scale_fill_manual(values = c("lightgreen", "salmon")) +
  theme(legend.position = "none")

Interpreting the Age vs Biopsy Boxplot: What exactly are we looking at here? A boxplot is a fantastic way to see the “typical” range of a group. The thick black line in the middle of each box represents the median (average) age of that group.

- The Green Box (Normal Biopsy): This box sits lower, showing that the vast majority of women with healthy, normal biopsies are in their late teens to mid-20s.
- The Dots (Outliers): You’ll notice several black dots hovering way above the green box. These are called “outliers.” Does this mean the data is broken? No! Because our overarching dataset is overwhelmingly young (as we saw in the age histogram earlier), the statistical formula considers anyone over the age of 45-50 to be “unusual” compared to the rest of the pool. Those are real women in their 50s and 60s who simply fall outside the typical age cluster of this specific clinic’s patient demographic.*
- The Red Box (Abnormal Biopsy): Notice how the red box is shifted slightly higher than the green box. The entire “typical range” moves up a few years, suggesting that abnormal biopsies occur slightly more often as women get slightly older (moving from early 20s to late 20s/early 30s).*

# Smoking vs Biopsy Result
cervical_data_clean %>%
  ggplot(aes(x = factor(smokes), fill = factor(biopsy))) +
  geom_bar(position = "dodge", color = "black") +
  theme_minimal() +
  labs(
    title = "Count of Biopsy Results by Smoking Status",
    x = "Smokes? (0 = No, 1 = Yes)", y = "Count of Patients"
  ) +
  scale_fill_manual(values = c("lightgreen", "salmon"), name = "Biopsy")

Interpreting the Smoking vs Biopsy Chart (Side-by-Side Counts): This is a much more straightforward chart. It simply places the healthy people (green bars) next to the people with abnormal biopsies (red bars) and groups them by whether they smoke or not.

- The Non-Smokers (0): Look at the giant green bar on the left. This shows that the vast majority of non-smokers had perfectly healthy biopsies. The tiny red bar next to it shows the small handful of non-smokers who did have an abnormal biopsy.
- The Smokers (1): Now look at the pair of bars on the right. Because we have far fewer smokers in our data overall, both of these bars are much shorter. However, notice the height of the red bar (abnormal biopsy) compared to the green bar (healthy biopsy) in this smoker group.
- The Conclusion: While the total number of sick smokers is small, the ratio of sick-to-healthy is worse in the smoker group than it is in the non-smoker group. This gives us our first visual clue that smoking might be a contributing risk factor.

Step 7: Multivariate Analysis & Data Visualization

In multivariate analysis, we look at many variables at once. Are smoking and age connected? Are older women more likely to use certain contraceptives? We use a correlation matrix, a heat map to instantly see which clues are strongly linked to each other.

# We want to see how numerical variables correlate.
# We'll select a subset of important numerical features and compute correlations.
cor_data <- cervical_data_clean %>%
  select(
    age, number_of_sexual_partners, first_sexual_intercourse,
    num_of_pregnancies, smokes, hormonal_contraceptives, iud, st_ds_number_of_diagnosis, biopsy
  )

# Compute the correlation matrix
cor_matrix <- cor(cor_data)
cor_matrix

##                                    age number_of_sexual_partners
## age                        1.000000000              0.0859713224
## number_of_sexual_partners  0.085971322              1.0000000000
## first_sexual_intercourse   0.369175414             -0.1458466166
## num_of_pregnancies         0.525891837              0.0774392125
## smokes                     0.057203756              0.2368575171
## hormonal_contraceptives    0.029201254              0.0040274531
## iud                        0.279429180              0.0324596214
## st_ds_number_of_diagnosis -0.001605942              0.0530557811
## biopsy                     0.055955515             -0.0004082348
##                           first_sexual_intercourse num_of_pregnancies
## age                                    0.369175414         0.52589184
## number_of_sexual_partners             -0.145846617         0.07743921
## first_sexual_intercourse               1.000000000        -0.05637438
## num_of_pregnancies                    -0.056374378         1.00000000
## smokes                                -0.123279803         0.08151714
## hormonal_contraceptives               -0.009232491         0.11893816
## iud                                   -0.010757838         0.20450054
## st_ds_number_of_diagnosis             -0.013331389         0.03491232
## biopsy                                 0.007258726         0.04021507
##                                 smokes hormonal_contraceptives           iud
## age                        0.057203756            0.0292012536  0.2794291799
## number_of_sexual_partners  0.236857517            0.0040274531  0.0324596214
## first_sexual_intercourse  -0.123279803           -0.0092324912 -0.0107578382
## num_of_pregnancies         0.081517136            0.1189381564  0.2045005423
## smokes                     1.000000000            0.0040356837 -0.0551154385
## hormonal_contraceptives    0.004035684            1.0000000000  0.0001882045
## iud                       -0.055115439            0.0001882045  1.0000000000
## st_ds_number_of_diagnosis  0.090725487           -0.0621985801  0.0357912898
## biopsy                     0.028723760           -0.0180152523  0.0592305229
##                           st_ds_number_of_diagnosis        biopsy
## age                                    -0.001605942  0.0559555151
## number_of_sexual_partners               0.053055781 -0.0004082348
## first_sexual_intercourse               -0.013331389  0.0072587257
## num_of_pregnancies                      0.034912321  0.0402150719
## smokes                                  0.090725487  0.0287237598
## hormonal_contraceptives                -0.062198580 -0.0180152523
## iud                                     0.035791290  0.0592305229
## st_ds_number_of_diagnosis               1.000000000  0.0974489209
## biopsy                                  0.097448921  1.0000000000

# Convert to a format suitable for ggplot
cor_melted <- as.data.frame(as.table(cor_matrix))
cor_melted

##                         Var1                      Var2          Freq
## 1                        age                       age  1.0000000000
## 2  number_of_sexual_partners                       age  0.0859713224
## 3   first_sexual_intercourse                       age  0.3691754140
## 4         num_of_pregnancies                       age  0.5258918373
## 5                     smokes                       age  0.0572037563
## 6    hormonal_contraceptives                       age  0.0292012536
## 7                        iud                       age  0.2794291799
## 8  st_ds_number_of_diagnosis                       age -0.0016059424
## 9                     biopsy                       age  0.0559555151
## 10                       age number_of_sexual_partners  0.0859713224
## 11 number_of_sexual_partners number_of_sexual_partners  1.0000000000
## 12  first_sexual_intercourse number_of_sexual_partners -0.1458466166
## 13        num_of_pregnancies number_of_sexual_partners  0.0774392125
## 14                    smokes number_of_sexual_partners  0.2368575171
## 15   hormonal_contraceptives number_of_sexual_partners  0.0040274531
## 16                       iud number_of_sexual_partners  0.0324596214
## 17 st_ds_number_of_diagnosis number_of_sexual_partners  0.0530557811
## 18                    biopsy number_of_sexual_partners -0.0004082348
## 19                       age  first_sexual_intercourse  0.3691754140
## 20 number_of_sexual_partners  first_sexual_intercourse -0.1458466166
## 21  first_sexual_intercourse  first_sexual_intercourse  1.0000000000
## 22        num_of_pregnancies  first_sexual_intercourse -0.0563743783
## 23                    smokes  first_sexual_intercourse -0.1232798026
## 24   hormonal_contraceptives  first_sexual_intercourse -0.0092324912
## 25                       iud  first_sexual_intercourse -0.0107578382
## 26 st_ds_number_of_diagnosis  first_sexual_intercourse -0.0133313891
## 27                    biopsy  first_sexual_intercourse  0.0072587257
## 28                       age        num_of_pregnancies  0.5258918373
## 29 number_of_sexual_partners        num_of_pregnancies  0.0774392125
## 30  first_sexual_intercourse        num_of_pregnancies -0.0563743783
## 31        num_of_pregnancies        num_of_pregnancies  1.0000000000
## 32                    smokes        num_of_pregnancies  0.0815171361
## 33   hormonal_contraceptives        num_of_pregnancies  0.1189381564
## 34                       iud        num_of_pregnancies  0.2045005423
## 35 st_ds_number_of_diagnosis        num_of_pregnancies  0.0349123213
## 36                    biopsy        num_of_pregnancies  0.0402150719
## 37                       age                    smokes  0.0572037563
## 38 number_of_sexual_partners                    smokes  0.2368575171
## 39  first_sexual_intercourse                    smokes -0.1232798026
## 40        num_of_pregnancies                    smokes  0.0815171361
## 41                    smokes                    smokes  1.0000000000
## 42   hormonal_contraceptives                    smokes  0.0040356837
## 43                       iud                    smokes -0.0551154385
## 44 st_ds_number_of_diagnosis                    smokes  0.0907254866
## 45                    biopsy                    smokes  0.0287237598
## 46                       age   hormonal_contraceptives  0.0292012536
## 47 number_of_sexual_partners   hormonal_contraceptives  0.0040274531
## 48  first_sexual_intercourse   hormonal_contraceptives -0.0092324912
## 49        num_of_pregnancies   hormonal_contraceptives  0.1189381564
## 50                    smokes   hormonal_contraceptives  0.0040356837
## 51   hormonal_contraceptives   hormonal_contraceptives  1.0000000000
## 52                       iud   hormonal_contraceptives  0.0001882045
## 53 st_ds_number_of_diagnosis   hormonal_contraceptives -0.0621985801
## 54                    biopsy   hormonal_contraceptives -0.0180152523
## 55                       age                       iud  0.2794291799
## 56 number_of_sexual_partners                       iud  0.0324596214
## 57  first_sexual_intercourse                       iud -0.0107578382
## 58        num_of_pregnancies                       iud  0.2045005423
## 59                    smokes                       iud -0.0551154385
## 60   hormonal_contraceptives                       iud  0.0001882045
## 61                       iud                       iud  1.0000000000
## 62 st_ds_number_of_diagnosis                       iud  0.0357912898
## 63                    biopsy                       iud  0.0592305229
## 64                       age st_ds_number_of_diagnosis -0.0016059424
## 65 number_of_sexual_partners st_ds_number_of_diagnosis  0.0530557811
## 66  first_sexual_intercourse st_ds_number_of_diagnosis -0.0133313891
## 67        num_of_pregnancies st_ds_number_of_diagnosis  0.0349123213
## 68                    smokes st_ds_number_of_diagnosis  0.0907254866
## 69   hormonal_contraceptives st_ds_number_of_diagnosis -0.0621985801
## 70                       iud st_ds_number_of_diagnosis  0.0357912898
## 71 st_ds_number_of_diagnosis st_ds_number_of_diagnosis  1.0000000000
## 72                    biopsy st_ds_number_of_diagnosis  0.0974489209
## 73                       age                    biopsy  0.0559555151
## 74 number_of_sexual_partners                    biopsy -0.0004082348
## 75  first_sexual_intercourse                    biopsy  0.0072587257
## 76        num_of_pregnancies                    biopsy  0.0402150719
## 77                    smokes                    biopsy  0.0287237598
## 78   hormonal_contraceptives                    biopsy -0.0180152523
## 79                       iud                    biopsy  0.0592305229
## 80 st_ds_number_of_diagnosis                    biopsy  0.0974489209
## 81                    biopsy                    biopsy  1.0000000000

# Plot the heatmap
ggplot(cor_melted, aes(Var1, Var2, fill = Freq)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(
    low = "blue", high = "red", mid = "white",
    midpoint = 0, limit = c(-1, 1), name = "Pearson\nCorrelation"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(title = "Correlation Heatmap of Risk Factors", x = "", y = "")

How to Read the Evidence Heatmap:

This grid is called a “Correlation Matrix.” The colors tell us the strength* of the relationship between any two clues.*

- The White Diagonal Line: Notice the solid white squares going diagonally down the middle? That’s just a variable matching with itself (e.g., Age vs Age). We can ignore those.
- Red (Positive Connection): The darker the red, the more likely two things happen together. For example, find where age and num_of_pregnancies intersect. It’s solid red. This makes perfect logical sense: as women get older, they typically have had more pregnancies.
- Blue (Negative Connection): If we saw dark blue squares, it would mean “as one goes up, the other goes down.”
- Looking at our Prime Target (biopsy): The most important row in this entire grid is the top row (labeled biopsy). Scan your eyes across that top row. It is mostly very faint white/pink squares. What does this mean? It means there is no single, massive “smoking gun” variable that perfectly predicts cervical cancer all by itself. All the connections to an abnormal biopsy are weak on their own. This proves that cervical cancer risk is multivariate it’s likely caused by a combination of smaller factors rather than one giant obvious one.*

Step 8: Detect Outliers and Anomalies

Sometimes a clue is so unusual that it demands special attention. Is it a typo, or a genuinely extreme case? We call these ‘outliers.’ Let’s look at the ‘Number of sexual partners’ and ‘Num of pregnancies’ to see if anyone falls far outside the norm.

# Boxplot for Number of Sexual Partners
ggplot(cervical_data_clean, aes(y = number_of_sexual_partners)) +
  geom_boxplot(fill = "orange") +
  theme_minimal() +
  labs(title = "Outlier Detection: Number of Sexual Partners", y = "Partners")

Interpreting the ‘Number of Sexual Partners’ Outliers:

- The Squished Orange Box: You’ll notice the actual “box” part of the graph is incredibly thin and squished all the way down at the bottom near 1, 2, and 3. This matches our earlier histogram: the vast majority of women in this dataset report having very few sexual partners, so the biological “norm” for this specific group is very low.
- The Spire of Dots (Outliers): Look at the long vertical line of dots stretching high above the box. Each dot is a real patient. The statistical formula flagged them as “outliers” because having 10, 15, or even 28 partners is mathematically highly unusual compared to the rest of the women in the waiting room who only have 1 or 2.
- Medical Significance: We absolutely DO NOT delete these outliers. Cervical Cancer is primarily driven by HPV (a sexually transmitted infection). A patient with 28 partners (that top dot!) is biologically exposing themselves to a significantly higher risk of encountering an HPV strain than someone with 1 partner. Therefore, these statistical “outliers” are actually our prime medical suspects!

# Boxplot for Number of Pregnancies
ggplot(cervical_data_clean, aes(y = num_of_pregnancies)) +
  geom_boxplot(fill = "lightblue") +
  theme_minimal() +
  labs(title = "Outlier Detection: Number of Pregnancies", y = "Pregnancies")

# We can see some extreme values (e.g., individuals reporting > 15 partners or > 10 pregnancies).
# In medical data, these could be real values, so we won't delete them, but we will keep them in mind
# for our final analysis, as they might heavily influence our results.

Interpreting the ‘Number of Pregnancies’ Outliers:

- The Light Blue Box: This box is a bit wider, showing that the “typical” or average number of pregnancies for women in this dataset ranges largely from 1 to 3.
- The Dots: We again see a few dots stretching upwards, maxing out near 11 pregnancies. Again, the math calls these stats “anomalies,” but medically, they are perfectly possible.
- Why We Keep Them: In the medical world, multiple full-term pregnancies have been loosely linked to an increased risk of cervical cancer in HPV-positive women (due to hormonal changes and immune suppression during pregnancy). The women represented by those top dots aren’t “bad data”—they are just a high-risk demographic that our predictive model absolutely needs to learn from!

Step 9: Feature Transformation / Feature Engineering

Sometimes, the clues we have aren’t quite enough, but if we combine them, they tell a bigger story. This is “Feature Engineering.” For example, maybe it’s not just about one specific STD, but whether the patient has a history of ANY STD combined with smoking.

# Let's create a new 'High Risk Behavior Score' just as an example.
# We will sum up indications of smoking, high number of partners (which we'll define as > 5), and STD history.
# (Note: We already handled all missing NA values in Step 4, so we can cleanly add these together in one pass!)

cervical_data_clean <- cervical_data_clean %>%
  mutate(
    high_partners_flag = ifelse(number_of_sexual_partners > 5, 1, 0),
    behavior_risk_score = smokes + st_ds + high_partners_flag
  )

# Let's see if this new score relates to Biopsy results
cervical_data_clean %>%
  ggplot(aes(x = factor(behavior_risk_score), fill = factor(biopsy))) +
  geom_bar(position = "dodge", color = "black") +
  theme_minimal() +
  labs(title = "Count of Biopsy Results by Behavior Risk Score", x = "Risk Score (0 - 3)", y = "Count of Patients") +
  scale_fill_manual(values = c("lightgreen", "salmon"), name = "Biopsy")

Interpreting the Feature Engineering Chart (Behavior Risk Score):

- What did we build? We created a custom “Behavior Risk Score” ranging from 0 to 3. A patient gets a +1 if they smoke, a +1 if they’ve had any STDs, and a +1 if they have a highly unusual number of sexual partners (over 5).
- The ‘0’ Score: Look at the massive green bar at score 0. This tells us that the vast majority of our patients exhibited none of these three risky behaviors and their biopsies were overwhelmingly normal.
- The ‘1’ to ‘3’ Scores: As the score increases, the total number of patients drops significantly. However, look at the ratio of the red bar (abnormal) to the green bar (normal). As the risk score ticks upward, the red bar becomes taller relative to its green neighbor. This engineered feature successfully combined three weak clues into one stronger metric showing that stacked risky behaviors correlate with an increase in abnormal biopsies!*

Step 10: Identifying the Major Causes

After sifting through all the clues, we need to gather our prime suspects into one interrogation room. Medical research and our initial exploration tell us that cervical cancer is heavily linked to sexual history (especially HPV infection), smoking, and age. We are now going to create a smaller, focused dataset a ‘Prime Suspects’ list to dig deeper.

# Let's create a focused dataframe containing only the most critical risk factors
# and our main target: Biopsy.
prime_suspects_data <- cervical_data_clean %>%
  select(
    age,
    number_of_sexual_partners,
    first_sexual_intercourse,
    smokes,
    smokes_years,
    hormonal_contraceptives,
    hormonal_contraceptives_years,
    iud,
    st_ds_number_of_diagnosis,
    st_ds_hpv, # HPV is the leading cause of cervical cancer
    st_ds_hiv, # Immune system suppression is a massive risk factor
    biopsy
  )

# Let's run a quick statistical summary (a Logistic Regression model)
# to see which of these prime suspects are actually guilty beyond a reasonable doubt.
# In layman's terms: This math equation tries to predict the 'Biopsy' result using our clues.
risk_model <- glm(biopsy ~ ., data = prime_suspects_data, family = "binomial")

# Let's view the summary of our 'trial'
summary(risk_model)

## 
## Call:
## glm(formula = biopsy ~ ., family = "binomial", data = prime_suspects_data)
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                    -2.980528   1.002690  -2.973  0.00295 **
## age                             0.002312   0.020097   0.115  0.90842   
## number_of_sexual_partners      -0.034906   0.103524  -0.337  0.73598   
## first_sexual_intercourse        0.008598   0.056426   0.152  0.87888   
## smokes                         -0.137280   0.605746  -0.227  0.82071   
## smokes_years                    0.040550   0.042202   0.961  0.33663   
## hormonal_contraceptives        -0.396010   0.339082  -1.168  0.24285   
## hormonal_contraceptives_years   0.092681   0.034375   2.696  0.00701 **
## iud                             0.460522   0.420309   1.096  0.27322   
## st_ds_number_of_diagnosis       0.432076   0.455683   0.948  0.34303   
## st_ds_hpv                     -12.255220 623.920041  -0.020  0.98433   
## st_ds_hiv                       1.264908   0.769217   1.644  0.10009   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 408.60  on 857  degrees of freedom
## Residual deviance: 387.65  on 846  degrees of freedom
## AIC: 411.65
## 
## Number of Fisher Scoring iterations: 13

# Evaluating Feature Importance:
# Which factors contribute the most to the prediction?
# We can look at the absolute value of the t-statistic (or z-value in a binomial model) for each variable.
# We use the 'broom' package which is lightweight and built for cleanly extracting model data.
library(broom)

importance <- tidy(risk_model) %>%
  filter(term != "(Intercept)") %>% # We don't care about the baseline intercept
  mutate(Overall = abs(statistic)) %>%
  arrange(desc(Overall)) %>%
  rename(Feature = term)

# Let's visualize the "Guilt" of our suspects
ggplot(importance, aes(x = reorder(Feature, Overall), y = Overall)) +
  geom_bar(stat = "identity", fill = "steelblue", color = "black") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Feature Importance for Predicting Abnormal Biopsy", x = "Risk Factor", y = "Importance (absolute z-value)")

The Verdict: Who is Guilty? The bar chart above is the final result of our Logistic Regression model. It ranks the clues based on how strongly they pull the prediction toward an “Abnormal” biopsy. Let’s look at the top suspects in our specific dataset:

- The Unsuspecting Leader (Hormonal Contraceptives): Both hormonal_contraceptives_years and hormonal_contraceptives are near the very top of the list! While the pill does not cause cancer directly like a virus does, medical studies confirm that long-term use can slightly increase the risk of cervical cancer in women who are already infected with HPV. The pill changes the hormonal environment of the cervix, making it easier for the virus to thrive and mutate cells.
- The Immune System Clue (st_ds_hiv): HIV is extremely high on the list. This makes perfect medical sense! HIV attacks the body’s immune system. If a woman’s immune system is severely compromised, she cannot fight off an underlying HPV infection, allowing the virus to rapidly develop into cervical dysplasia (which is exactly what an abnormal Biopsy detects).
- Where is HPV (st_ds_hpv)? Surprisingly, it is at the very bottom of this specific chart! Does this mean medical science is wrong? Absolutely not. It just means in this specific, real-world dataset, the recording of HPV diagnoses was likely sparse, incomplete, or patients were undiagnosed at the time of their visits. Models can only learn from the data they are handed—this highlights why domain knowledge (like knowing HIV suppresses the immune system) is so critical when interpreting data!*

Step 11: Prevention and Policy Recommendations

Our detective work is almost done. We know the suspects and we know how they operate. But solving a crime isn’t just about catching the bad guy it’s about preventing the next one. How can we use this data to inform public health policy?

Actionable Policy and Prevention Insights: Based on our deep dive into the data, we shouldn’t just rely on generic advice. We need policies targeted at what our logistic regression model actually found in this population:

Revising Hormonal Contraceptive Guidelines: Since hormonal_contraceptives (and years used) featured so heavily as a predictor of abnormal biopsies, policymakers need to ensure women are fully educated on the long-term risks. Clinics should be required to offer mandatory, aggressive HPV screening before prescribing long-term hormonal birth control, as the viral-hormonal interaction is a massive risk multiplier.
Integrated HIV & Cervical Cancer Screening: The presence of st_ds_hiv was one of the strongest predictors of abnormal cells. Public health policy must stop treating these diseases in silos. Funding should be directed toward “One-Stop” clinics: if a woman tests positive for HIV, an immediate, subsidized cervical biopsy and HPV test must become the standard of care due to her suppressed immune system.
Addressing the “Age & Partners” Reality: The data showed abnormal biopsies ticking upward in older patients and those with a high number of sexual partners. Policies need to aggressively fund accessible, free Pap smears specifically targeting women in their late 20s and 30s. Education campaigns must destigmatize regular checks for women with multiple partners, framing it as routine maintenance rather than a consequence.
Fixing the Data Gap (The HPV Vaccine): We know medically that HPV causes cervical cancer, yet st_ds_hpv barely registered in our model. Why? Because the data is missing or patients are undiagnosed! The biggest policy recommendation is to fund massive, subsidized HPV testing and vaccination campaigns for young girls and boys before their first sexual encounter. We need better real-world tracking of HPV to stop this cancer before it starts.

Conclusion: The data tells a clear story: In this population, predicting cervical cancer risk is heavily reliant on a patient’s hormonal history (birth control) and their immune system’s strength (HIV/STDs). By rewriting public health policy to aggressively test for HPV when prescribing birth control or treating HIV, we can catch the precursor to cancer and save lives.