Step 1: Understanding the Problem and Dataset

Imagine you’re a detective walking into a room full of clues. Each clue might tell you a part of the story, but on its own, it’s not enough to draw a conclusion. In our case, the ‘room’ is this dataset, and the ‘clues’ are various lifestyle factors, medical histories, and habits of different women. Our ‘mystery’ to solve is understanding what factors are most strongly linked to the risk of developing cervical cancer.

Before touching a line of code, let’s understand what we’re looking at. This dataset contains medical and demographic information about a large group of women. It tracks potential risk factors for cervical cancer, such as:


Step 2: Import / Load the Data

Now that we know what we’re looking for, we need to carefully bring our evidence into the lab. In data science, our lab is the software R. We are going to load our CSV file into the system so we can start poking and prodding it.

One important detail: the caretakers of this data used a question mark (?) to represent information they didn’t have (missing data). We need to tell our system to recognize ? as ‘Not Available’ (or NA in R).

# Define the path to our dataset
data_path <- "archive (5)/kag_risk_factors_cervical_cancer.csv"

# Load the data, making sure to treat '?' as missing values (NA)
cervical_data <- read.csv(data_path, na.strings = c("?", "NA", ""))

# Let's take a quick peek at the first few rows just to make sure it loaded correctly
head(cervical_data)
##   Age Number.of.sexual.partners First.sexual.intercourse Num.of.pregnancies
## 1  18                         4                       15                  1
## 2  15                         1                       14                  1
## 3  34                         1                       NA                  1
## 4  52                         5                       16                  4
## 5  46                         3                       21                  4
## 6  42                         3                       23                  2
##   Smokes Smokes..years. Smokes..packs.year. Hormonal.Contraceptives
## 1      0              0                   0                       0
## 2      0              0                   0                       0
## 3      0              0                   0                       0
## 4      1             37                  37                       1
## 5      0              0                   0                       1
## 6      0              0                   0                       0
##   Hormonal.Contraceptives..years. IUD IUD..years. STDs STDs..number.
## 1                               0   0           0    0             0
## 2                               0   0           0    0             0
## 3                               0   0           0    0             0
## 4                               3   0           0    0             0
## 5                              15   0           0    0             0
## 6                               0   0           0    0             0
##   STDs.condylomatosis STDs.cervical.condylomatosis STDs.vaginal.condylomatosis
## 1                   0                            0                           0
## 2                   0                            0                           0
## 3                   0                            0                           0
## 4                   0                            0                           0
## 5                   0                            0                           0
## 6                   0                            0                           0
##   STDs.vulvo.perineal.condylomatosis STDs.syphilis
## 1                                  0             0
## 2                                  0             0
## 3                                  0             0
## 4                                  0             0
## 5                                  0             0
## 6                                  0             0
##   STDs.pelvic.inflammatory.disease STDs.genital.herpes
## 1                                0                   0
## 2                                0                   0
## 3                                0                   0
## 4                                0                   0
## 5                                0                   0
## 6                                0                   0
##   STDs.molluscum.contagiosum STDs.AIDS STDs.HIV STDs.Hepatitis.B STDs.HPV
## 1                          0         0        0                0        0
## 2                          0         0        0                0        0
## 3                          0         0        0                0        0
## 4                          0         0        0                0        0
## 5                          0         0        0                0        0
## 6                          0         0        0                0        0
##   STDs..Number.of.diagnosis STDs..Time.since.first.diagnosis
## 1                         0                               NA
## 2                         0                               NA
## 3                         0                               NA
## 4                         0                               NA
## 5                         0                               NA
## 6                         0                               NA
##   STDs..Time.since.last.diagnosis Dx.Cancer Dx.CIN Dx.HPV Dx Hinselmann
## 1                              NA         0      0      0  0          0
## 2                              NA         0      0      0  0          0
## 3                              NA         0      0      0  0          0
## 4                              NA         1      0      1  0          0
## 5                              NA         0      0      0  0          0
## 6                              NA         0      0      0  0          0
##   Schiller Citology Biopsy
## 1        0        0      0
## 2        0        0      0
## 3        0        0      0
## 4        0        0      0
## 5        0        0      0
## 6        0        0      0

Step 3: Inspect the Data Structure

Imagine opening a filing cabinet. Before reading every single document, you first check how many folders there are, how many pages are in each folder, and what kind of information is stamped on them. This step is about getting the “lay of the land.” How big is our dataset? What kinds of numbers or text are we dealing with?

# How many rows (women) and columns (pieces of information) do we have?
dim(cervical_data)
## [1] 858  36
# Let's look at the structure of the data.
# Are the columns treated as text (characters)? Or as numbers?
str(cervical_data)
## 'data.frame':    858 obs. of  36 variables:
##  $ Age                               : int  18 15 34 52 46 42 51 26 45 44 ...
##  $ Number.of.sexual.partners         : num  4 1 1 5 3 3 3 1 1 3 ...
##  $ First.sexual.intercourse          : num  15 14 NA 16 21 23 17 26 20 15 ...
##  $ Num.of.pregnancies                : num  1 1 1 4 4 2 6 3 5 NA ...
##  $ Smokes                            : num  0 0 0 1 0 0 1 0 0 1 ...
##  $ Smokes..years.                    : num  0 0 0 37 0 ...
##  $ Smokes..packs.year.               : num  0 0 0 37 0 0 3.4 0 0 2.8 ...
##  $ Hormonal.Contraceptives           : num  0 0 0 1 1 0 0 1 0 0 ...
##  $ Hormonal.Contraceptives..years.   : num  0 0 0 3 15 0 0 2 0 0 ...
##  $ IUD                               : num  0 0 0 0 0 0 1 1 0 NA ...
##  $ IUD..years.                       : num  0 0 0 0 0 0 7 7 0 NA ...
##  $ STDs                              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs..number.                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs.condylomatosis               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs.cervical.condylomatosis      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs.vaginal.condylomatosis       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs.vulvo.perineal.condylomatosis: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs.syphilis                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs.pelvic.inflammatory.disease  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs.genital.herpes               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs.molluscum.contagiosum        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs.AIDS                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs.HIV                          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs.Hepatitis.B                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs.HPV                          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs..Number.of.diagnosis         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ STDs..Time.since.first.diagnosis  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ STDs..Time.since.last.diagnosis   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Dx.Cancer                         : int  0 0 0 1 0 0 0 0 1 0 ...
##  $ Dx.CIN                            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Dx.HPV                            : int  0 0 0 1 0 0 0 0 1 0 ...
##  $ Dx                                : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ Hinselmann                        : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Schiller                          : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Citology                          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Biopsy                            : int  0 0 0 0 0 0 1 0 0 0 ...

Step 4: Data Cleaning

Every detective knows that crime scenes can be messy. Data is exactly the same. We often have missing pieces of information, numbers accidentally saved as text, or strange values that don’t make sense. Before we can do any real analysis, we have to clean the room. We will convert all those text columns back into proper numbers, and we’ll decide what to do about the missing gaps in our knowledge.

# Many columns loaded as 'character' because of the '?' marks that were present originally.
# Let's convert all applicable columns to numeric.
cervical_data_clean <- cervical_data %>%
  mutate_all(as.numeric) %>%
  clean_names()

# Let's check how many missing values we have in each column now.
colSums(is.na(cervical_data_clean))
##                                 age           number_of_sexual_partners 
##                                   0                                  26 
##            first_sexual_intercourse                  num_of_pregnancies 
##                                   7                                  56 
##                              smokes                        smokes_years 
##                                  13                                  13 
##                   smokes_packs_year             hormonal_contraceptives 
##                                  13                                 108 
##       hormonal_contraceptives_years                                 iud 
##                                 108                                 117 
##                           iud_years                               st_ds 
##                                 117                                 105 
##                        st_ds_number                st_ds_condylomatosis 
##                                 105                                 105 
##       st_ds_cervical_condylomatosis        st_ds_vaginal_condylomatosis 
##                                 105                                 105 
## st_ds_vulvo_perineal_condylomatosis                      st_ds_syphilis 
##                                 105                                 105 
##   st_ds_pelvic_inflammatory_disease                st_ds_genital_herpes 
##                                 105                                 105 
##         st_ds_molluscum_contagiosum                          st_ds_aids 
##                                 105                                 105 
##                           st_ds_hiv                   st_ds_hepatitis_b 
##                                 105                                 105 
##                           st_ds_hpv           st_ds_number_of_diagnosis 
##                                 105                                   0 
##    st_ds_time_since_first_diagnosis     st_ds_time_since_last_diagnosis 
##                                 787                                 787 
##                           dx_cancer                              dx_cin 
##                                   0                                   0 
##                              dx_hpv                                  dx 
##                                   0                                   0 
##                          hinselmann                            schiller 
##                                   0                                   0 
##                            citology                              biopsy 
##                                   0                                   0
# Based on the missing data count:
# 1. 'st_ds_time_since_first_diagnosis' and 'st_ds_time_since_last_diagnosis' have 787 missing values.
#    Since we only have 858 rows total, these columns are almost entirely empty. We must drop them.
cervical_data_clean <- cervical_data_clean %>%
  select(-st_ds_time_since_first_diagnosis, -st_ds_time_since_last_diagnosis)

# 2. For the other columns with missing values (like smokes, hormonal_contraceptives, iud, st_ds),
#    we have a relatively small number of missing rows (around 13 to 117).
#    A common and safe approach for boolean (0/1) or count data is to impute (fill in) the median
#    value, which essentially assumes the "typical" patient profile for missing data,
#    or to just omit rows when building models.
#    Here, let's impute the median for the core risk factors to retain as much data as possible.

impute_median <- function(x) {
  replace_na(x, median(x, na.rm = TRUE))
}

cervical_data_clean <- cervical_data_clean %>%
  mutate(across(everything(), impute_median))

# Verify it's clean
sum(is.na(cervical_data_clean))
## [1] 0

Step 5: Univariate Analysis

In univariate analysis, we look at each variable entirely on its own. How old are most of the women in this dataset? How many of them smoke?

1. Age Distribution

# Distribution of Age
ggplot(cervical_data_clean, aes(x = age)) +
  geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
  theme_minimal() +
  labs(title = "Age Distribution of Patients", x = "Age", y = "Count")

Looking at this age lineup in this dataset, it’s clear we’re largely dealing with a very young crowd. The vast majority of the women in our records are in their late teens to early twenties, with the highest spike sitting right around age 20. As we trace the graph to the right, the numbers drop off sharply—meaning we have far fewer women in their 40s, 50s, or older. This shape (what statisticians call a ‘right skew’) tells us that any patterns we find in this data are going to be heavily weighted toward what young women are experiencing. It’s a vital piece of context: we aren’t looking at a retirement community; we are investigating the health profiles of women in the peak of their reproductive and sexually active years.

2. What is a Biopsy and Why Does it Matter?

The cervix is the lower part of the uterus that connects to the vagina. Sometimes, usually due to a viral infection like HPV, the cells on the surface of the cervix begin to mutate and grow abnormally. If left unchecked over many years, these abnormal cells can turn into cervical cancer.

A biopsy is a definitive medical procedure where a doctor removes a tiny pinch of tissue from the cervix to examine under a microscope. It is the final judge and jury on whether a patient actually has precancerous or cancerous cells.

# How many patients had an abnormal Biopsy?
ggplot(cervical_data_clean, aes(x = factor(biopsy))) +
  geom_bar(fill = c("lightgreen", "salmon"), color = "black") +
  theme_minimal() +
  labs(title = "Biopsy Results (0 = Normal, 1 = Abnormal)", x = "Biopsy Result", y = "Count")

This bar chart reveals a classic “needle in a haystack” problem in data science. Over 800 of our patients (the towering green bar) had perfectly healthy, normal biopsies. Only a very small handful (the sliver of red on the right) actually had abnormal, potentially dangerous cells.

Why is this important? Because it means cervical abnormalities are extremely rare in this dataset. We have to be very careful detectives. If we try to build a psychological profile of a “high risk” patient, our mathematical models might get lazy and just automatically guess “Healthy” every single time, because they know they’ll be right 93% of the time! We now know going forward that predicting the rare “1” is going to require finding very specific, undeniably strong risk factors.

3. Smoking Habits

Smoking is a known immunosuppressant and a major risk factor for many cancers. Let’s see what proportion of the women in this dataset are smokers.

ggplot(cervical_data_clean, aes(x = factor(smokes))) +
  geom_bar(fill = c("lightblue", "orange"), color = "black") +
  theme_minimal() +
  labs(title = "Smoking Habits (0 = Non-Smoker, 1 = Smoker)", x = "Smokes?", y = "Count")

The vast majority of the women in this dataset are non-smokers (0). While this is good for general public health, for our investigation it means that if we do find a strong link between smoking and abnormal biopsies later on, that link is clinically very potent, because the smoking group (1) is quite small to begin with.

4. Number of Sexual Partners

Cervical cancer is strongly linked to HPV, a sexually transmitted infection. Therefore, the number of sexual partners is a key behavioral metric to investigate.

ggplot(cervical_data_clean, aes(x = number_of_sexual_partners)) +
  geom_histogram(binwidth = 1, fill = "purple", color = "black", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Distribution of Number of Sexual Partners", x = "Number of Partners", y = "Count")

This distribution is also heavily ‘right-skewed’. Most women report having between 1 and 3 partners, with a massive peak at 1 or 2. There is a long tail stretching out to the right, showing a few individuals with a highly elevated number of partners, which we will need to keep an eye on when examining STD and biopsy rates.


Step 6: Bivariate Analysis

Bivariate analysis looks at two variables at the same time to see if they are related. For example, does age have a connection to whether a biopsy is abnormal? Does smoking increase the chances?

# Let's look at Age vs Biopsy Result
ggplot(cervical_data_clean, aes(x = factor(biopsy), y = age, fill = factor(biopsy))) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Age vs. Biopsy Result", x = "Biopsy Result (0 = Normal, 1 = Abnormal)", y = "Age") +
  scale_fill_manual(values = c("lightgreen", "salmon")) +
  theme(legend.position = "none")

Interpreting the Age vs Biopsy Boxplot: What exactly are we looking at here? A boxplot is a fantastic way to see the “typical” range of a group. The thick black line in the middle of each box represents the median (average) age of that group.

# Smoking vs Biopsy Result
cervical_data_clean %>%
  ggplot(aes(x = factor(smokes), fill = factor(biopsy))) +
  geom_bar(position = "dodge", color = "black") +
  theme_minimal() +
  labs(
    title = "Count of Biopsy Results by Smoking Status",
    x = "Smokes? (0 = No, 1 = Yes)", y = "Count of Patients"
  ) +
  scale_fill_manual(values = c("lightgreen", "salmon"), name = "Biopsy")

Interpreting the Smoking vs Biopsy Chart (Side-by-Side Counts): This is a much more straightforward chart. It simply places the healthy people (green bars) next to the people with abnormal biopsies (red bars) and groups them by whether they smoke or not.


Step 7: Multivariate Analysis & Data Visualization

In multivariate analysis, we look at many variables at once. Are smoking and age connected? Are older women more likely to use certain contraceptives? We use a correlation matrix, a heat map to instantly see which clues are strongly linked to each other.

# We want to see how numerical variables correlate.
# We'll select a subset of important numerical features and compute correlations.
cor_data <- cervical_data_clean %>%
  select(
    age, number_of_sexual_partners, first_sexual_intercourse,
    num_of_pregnancies, smokes, hormonal_contraceptives, iud, st_ds_number_of_diagnosis, biopsy
  )

# Compute the correlation matrix
cor_matrix <- cor(cor_data)
cor_matrix
##                                    age number_of_sexual_partners
## age                        1.000000000              0.0859713224
## number_of_sexual_partners  0.085971322              1.0000000000
## first_sexual_intercourse   0.369175414             -0.1458466166
## num_of_pregnancies         0.525891837              0.0774392125
## smokes                     0.057203756              0.2368575171
## hormonal_contraceptives    0.029201254              0.0040274531
## iud                        0.279429180              0.0324596214
## st_ds_number_of_diagnosis -0.001605942              0.0530557811
## biopsy                     0.055955515             -0.0004082348
##                           first_sexual_intercourse num_of_pregnancies
## age                                    0.369175414         0.52589184
## number_of_sexual_partners             -0.145846617         0.07743921
## first_sexual_intercourse               1.000000000        -0.05637438
## num_of_pregnancies                    -0.056374378         1.00000000
## smokes                                -0.123279803         0.08151714
## hormonal_contraceptives               -0.009232491         0.11893816
## iud                                   -0.010757838         0.20450054
## st_ds_number_of_diagnosis             -0.013331389         0.03491232
## biopsy                                 0.007258726         0.04021507
##                                 smokes hormonal_contraceptives           iud
## age                        0.057203756            0.0292012536  0.2794291799
## number_of_sexual_partners  0.236857517            0.0040274531  0.0324596214
## first_sexual_intercourse  -0.123279803           -0.0092324912 -0.0107578382
## num_of_pregnancies         0.081517136            0.1189381564  0.2045005423
## smokes                     1.000000000            0.0040356837 -0.0551154385
## hormonal_contraceptives    0.004035684            1.0000000000  0.0001882045
## iud                       -0.055115439            0.0001882045  1.0000000000
## st_ds_number_of_diagnosis  0.090725487           -0.0621985801  0.0357912898
## biopsy                     0.028723760           -0.0180152523  0.0592305229
##                           st_ds_number_of_diagnosis        biopsy
## age                                    -0.001605942  0.0559555151
## number_of_sexual_partners               0.053055781 -0.0004082348
## first_sexual_intercourse               -0.013331389  0.0072587257
## num_of_pregnancies                      0.034912321  0.0402150719
## smokes                                  0.090725487  0.0287237598
## hormonal_contraceptives                -0.062198580 -0.0180152523
## iud                                     0.035791290  0.0592305229
## st_ds_number_of_diagnosis               1.000000000  0.0974489209
## biopsy                                  0.097448921  1.0000000000
# Convert to a format suitable for ggplot
cor_melted <- as.data.frame(as.table(cor_matrix))
cor_melted
##                         Var1                      Var2          Freq
## 1                        age                       age  1.0000000000
## 2  number_of_sexual_partners                       age  0.0859713224
## 3   first_sexual_intercourse                       age  0.3691754140
## 4         num_of_pregnancies                       age  0.5258918373
## 5                     smokes                       age  0.0572037563
## 6    hormonal_contraceptives                       age  0.0292012536
## 7                        iud                       age  0.2794291799
## 8  st_ds_number_of_diagnosis                       age -0.0016059424
## 9                     biopsy                       age  0.0559555151
## 10                       age number_of_sexual_partners  0.0859713224
## 11 number_of_sexual_partners number_of_sexual_partners  1.0000000000
## 12  first_sexual_intercourse number_of_sexual_partners -0.1458466166
## 13        num_of_pregnancies number_of_sexual_partners  0.0774392125
## 14                    smokes number_of_sexual_partners  0.2368575171
## 15   hormonal_contraceptives number_of_sexual_partners  0.0040274531
## 16                       iud number_of_sexual_partners  0.0324596214
## 17 st_ds_number_of_diagnosis number_of_sexual_partners  0.0530557811
## 18                    biopsy number_of_sexual_partners -0.0004082348
## 19                       age  first_sexual_intercourse  0.3691754140
## 20 number_of_sexual_partners  first_sexual_intercourse -0.1458466166
## 21  first_sexual_intercourse  first_sexual_intercourse  1.0000000000
## 22        num_of_pregnancies  first_sexual_intercourse -0.0563743783
## 23                    smokes  first_sexual_intercourse -0.1232798026
## 24   hormonal_contraceptives  first_sexual_intercourse -0.0092324912
## 25                       iud  first_sexual_intercourse -0.0107578382
## 26 st_ds_number_of_diagnosis  first_sexual_intercourse -0.0133313891
## 27                    biopsy  first_sexual_intercourse  0.0072587257
## 28                       age        num_of_pregnancies  0.5258918373
## 29 number_of_sexual_partners        num_of_pregnancies  0.0774392125
## 30  first_sexual_intercourse        num_of_pregnancies -0.0563743783
## 31        num_of_pregnancies        num_of_pregnancies  1.0000000000
## 32                    smokes        num_of_pregnancies  0.0815171361
## 33   hormonal_contraceptives        num_of_pregnancies  0.1189381564
## 34                       iud        num_of_pregnancies  0.2045005423
## 35 st_ds_number_of_diagnosis        num_of_pregnancies  0.0349123213
## 36                    biopsy        num_of_pregnancies  0.0402150719
## 37                       age                    smokes  0.0572037563
## 38 number_of_sexual_partners                    smokes  0.2368575171
## 39  first_sexual_intercourse                    smokes -0.1232798026
## 40        num_of_pregnancies                    smokes  0.0815171361
## 41                    smokes                    smokes  1.0000000000
## 42   hormonal_contraceptives                    smokes  0.0040356837
## 43                       iud                    smokes -0.0551154385
## 44 st_ds_number_of_diagnosis                    smokes  0.0907254866
## 45                    biopsy                    smokes  0.0287237598
## 46                       age   hormonal_contraceptives  0.0292012536
## 47 number_of_sexual_partners   hormonal_contraceptives  0.0040274531
## 48  first_sexual_intercourse   hormonal_contraceptives -0.0092324912
## 49        num_of_pregnancies   hormonal_contraceptives  0.1189381564
## 50                    smokes   hormonal_contraceptives  0.0040356837
## 51   hormonal_contraceptives   hormonal_contraceptives  1.0000000000
## 52                       iud   hormonal_contraceptives  0.0001882045
## 53 st_ds_number_of_diagnosis   hormonal_contraceptives -0.0621985801
## 54                    biopsy   hormonal_contraceptives -0.0180152523
## 55                       age                       iud  0.2794291799
## 56 number_of_sexual_partners                       iud  0.0324596214
## 57  first_sexual_intercourse                       iud -0.0107578382
## 58        num_of_pregnancies                       iud  0.2045005423
## 59                    smokes                       iud -0.0551154385
## 60   hormonal_contraceptives                       iud  0.0001882045
## 61                       iud                       iud  1.0000000000
## 62 st_ds_number_of_diagnosis                       iud  0.0357912898
## 63                    biopsy                       iud  0.0592305229
## 64                       age st_ds_number_of_diagnosis -0.0016059424
## 65 number_of_sexual_partners st_ds_number_of_diagnosis  0.0530557811
## 66  first_sexual_intercourse st_ds_number_of_diagnosis -0.0133313891
## 67        num_of_pregnancies st_ds_number_of_diagnosis  0.0349123213
## 68                    smokes st_ds_number_of_diagnosis  0.0907254866
## 69   hormonal_contraceptives st_ds_number_of_diagnosis -0.0621985801
## 70                       iud st_ds_number_of_diagnosis  0.0357912898
## 71 st_ds_number_of_diagnosis st_ds_number_of_diagnosis  1.0000000000
## 72                    biopsy st_ds_number_of_diagnosis  0.0974489209
## 73                       age                    biopsy  0.0559555151
## 74 number_of_sexual_partners                    biopsy -0.0004082348
## 75  first_sexual_intercourse                    biopsy  0.0072587257
## 76        num_of_pregnancies                    biopsy  0.0402150719
## 77                    smokes                    biopsy  0.0287237598
## 78   hormonal_contraceptives                    biopsy -0.0180152523
## 79                       iud                    biopsy  0.0592305229
## 80 st_ds_number_of_diagnosis                    biopsy  0.0974489209
## 81                    biopsy                    biopsy  1.0000000000
# Plot the heatmap
ggplot(cor_melted, aes(Var1, Var2, fill = Freq)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(
    low = "blue", high = "red", mid = "white",
    midpoint = 0, limit = c(-1, 1), name = "Pearson\nCorrelation"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(title = "Correlation Heatmap of Risk Factors", x = "", y = "")

How to Read the Evidence Heatmap:

This grid is called a “Correlation Matrix.” The colors tell us the strength* of the relationship between any two clues.*


Step 8: Detect Outliers and Anomalies

Sometimes a clue is so unusual that it demands special attention. Is it a typo, or a genuinely extreme case? We call these ‘outliers.’ Let’s look at the ‘Number of sexual partners’ and ‘Num of pregnancies’ to see if anyone falls far outside the norm.

# Boxplot for Number of Sexual Partners
ggplot(cervical_data_clean, aes(y = number_of_sexual_partners)) +
  geom_boxplot(fill = "orange") +
  theme_minimal() +
  labs(title = "Outlier Detection: Number of Sexual Partners", y = "Partners")

Interpreting the ‘Number of Sexual Partners’ Outliers:

# Boxplot for Number of Pregnancies
ggplot(cervical_data_clean, aes(y = num_of_pregnancies)) +
  geom_boxplot(fill = "lightblue") +
  theme_minimal() +
  labs(title = "Outlier Detection: Number of Pregnancies", y = "Pregnancies")

# We can see some extreme values (e.g., individuals reporting > 15 partners or > 10 pregnancies).
# In medical data, these could be real values, so we won't delete them, but we will keep them in mind
# for our final analysis, as they might heavily influence our results.

Interpreting the ‘Number of Pregnancies’ Outliers:


Step 9: Feature Transformation / Feature Engineering

Sometimes, the clues we have aren’t quite enough, but if we combine them, they tell a bigger story. This is “Feature Engineering.” For example, maybe it’s not just about one specific STD, but whether the patient has a history of ANY STD combined with smoking.

# Let's create a new 'High Risk Behavior Score' just as an example.
# We will sum up indications of smoking, high number of partners (which we'll define as > 5), and STD history.
# (Note: We already handled all missing NA values in Step 4, so we can cleanly add these together in one pass!)

cervical_data_clean <- cervical_data_clean %>%
  mutate(
    high_partners_flag = ifelse(number_of_sexual_partners > 5, 1, 0),
    behavior_risk_score = smokes + st_ds + high_partners_flag
  )

# Let's see if this new score relates to Biopsy results
cervical_data_clean %>%
  ggplot(aes(x = factor(behavior_risk_score), fill = factor(biopsy))) +
  geom_bar(position = "dodge", color = "black") +
  theme_minimal() +
  labs(title = "Count of Biopsy Results by Behavior Risk Score", x = "Risk Score (0 - 3)", y = "Count of Patients") +
  scale_fill_manual(values = c("lightgreen", "salmon"), name = "Biopsy")

Interpreting the Feature Engineering Chart (Behavior Risk Score):


Step 10: Identifying the Major Causes

After sifting through all the clues, we need to gather our prime suspects into one interrogation room. Medical research and our initial exploration tell us that cervical cancer is heavily linked to sexual history (especially HPV infection), smoking, and age. We are now going to create a smaller, focused dataset a ‘Prime Suspects’ list to dig deeper.

# Let's create a focused dataframe containing only the most critical risk factors
# and our main target: Biopsy.
prime_suspects_data <- cervical_data_clean %>%
  select(
    age,
    number_of_sexual_partners,
    first_sexual_intercourse,
    smokes,
    smokes_years,
    hormonal_contraceptives,
    hormonal_contraceptives_years,
    iud,
    st_ds_number_of_diagnosis,
    st_ds_hpv, # HPV is the leading cause of cervical cancer
    st_ds_hiv, # Immune system suppression is a massive risk factor
    biopsy
  )

# Let's run a quick statistical summary (a Logistic Regression model)
# to see which of these prime suspects are actually guilty beyond a reasonable doubt.
# In layman's terms: This math equation tries to predict the 'Biopsy' result using our clues.
risk_model <- glm(biopsy ~ ., data = prime_suspects_data, family = "binomial")

# Let's view the summary of our 'trial'
summary(risk_model)
## 
## Call:
## glm(formula = biopsy ~ ., family = "binomial", data = prime_suspects_data)
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                    -2.980528   1.002690  -2.973  0.00295 **
## age                             0.002312   0.020097   0.115  0.90842   
## number_of_sexual_partners      -0.034906   0.103524  -0.337  0.73598   
## first_sexual_intercourse        0.008598   0.056426   0.152  0.87888   
## smokes                         -0.137280   0.605746  -0.227  0.82071   
## smokes_years                    0.040550   0.042202   0.961  0.33663   
## hormonal_contraceptives        -0.396010   0.339082  -1.168  0.24285   
## hormonal_contraceptives_years   0.092681   0.034375   2.696  0.00701 **
## iud                             0.460522   0.420309   1.096  0.27322   
## st_ds_number_of_diagnosis       0.432076   0.455683   0.948  0.34303   
## st_ds_hpv                     -12.255220 623.920041  -0.020  0.98433   
## st_ds_hiv                       1.264908   0.769217   1.644  0.10009   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 408.60  on 857  degrees of freedom
## Residual deviance: 387.65  on 846  degrees of freedom
## AIC: 411.65
## 
## Number of Fisher Scoring iterations: 13
# Evaluating Feature Importance:
# Which factors contribute the most to the prediction?
# We can look at the absolute value of the t-statistic (or z-value in a binomial model) for each variable.
# We use the 'broom' package which is lightweight and built for cleanly extracting model data.
library(broom)

importance <- tidy(risk_model) %>%
  filter(term != "(Intercept)") %>% # We don't care about the baseline intercept
  mutate(Overall = abs(statistic)) %>%
  arrange(desc(Overall)) %>%
  rename(Feature = term)

# Let's visualize the "Guilt" of our suspects
ggplot(importance, aes(x = reorder(Feature, Overall), y = Overall)) +
  geom_bar(stat = "identity", fill = "steelblue", color = "black") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Feature Importance for Predicting Abnormal Biopsy", x = "Risk Factor", y = "Importance (absolute z-value)")

The Verdict: Who is Guilty? The bar chart above is the final result of our Logistic Regression model. It ranks the clues based on how strongly they pull the prediction toward an “Abnormal” biopsy. Let’s look at the top suspects in our specific dataset:


Step 11: Prevention and Policy Recommendations

Our detective work is almost done. We know the suspects and we know how they operate. But solving a crime isn’t just about catching the bad guy it’s about preventing the next one. How can we use this data to inform public health policy?

Actionable Policy and Prevention Insights: Based on our deep dive into the data, we shouldn’t just rely on generic advice. We need policies targeted at what our logistic regression model actually found in this population:

  1. Revising Hormonal Contraceptive Guidelines: Since hormonal_contraceptives (and years used) featured so heavily as a predictor of abnormal biopsies, policymakers need to ensure women are fully educated on the long-term risks. Clinics should be required to offer mandatory, aggressive HPV screening before prescribing long-term hormonal birth control, as the viral-hormonal interaction is a massive risk multiplier.

  2. Integrated HIV & Cervical Cancer Screening: The presence of st_ds_hiv was one of the strongest predictors of abnormal cells. Public health policy must stop treating these diseases in silos. Funding should be directed toward “One-Stop” clinics: if a woman tests positive for HIV, an immediate, subsidized cervical biopsy and HPV test must become the standard of care due to her suppressed immune system.

  3. Addressing the “Age & Partners” Reality: The data showed abnormal biopsies ticking upward in older patients and those with a high number of sexual partners. Policies need to aggressively fund accessible, free Pap smears specifically targeting women in their late 20s and 30s. Education campaigns must destigmatize regular checks for women with multiple partners, framing it as routine maintenance rather than a consequence.

  4. Fixing the Data Gap (The HPV Vaccine): We know medically that HPV causes cervical cancer, yet st_ds_hpv barely registered in our model. Why? Because the data is missing or patients are undiagnosed! The biggest policy recommendation is to fund massive, subsidized HPV testing and vaccination campaigns for young girls and boys before their first sexual encounter. We need better real-world tracking of HPV to stop this cancer before it starts.

Conclusion: The data tells a clear story: In this population, predicting cervical cancer risk is heavily reliant on a patient’s hormonal history (birth control) and their immune system’s strength (HIV/STDs). By rewriting public health policy to aggressively test for HPV when prescribing birth control or treating HIV, we can catch the precursor to cancer and save lives.