Cancer Risk Factors

2025-10-19

Data Description

This project uses data from the Kaggle dataset “cancer_risk.” The dataset showcases the lifestyle, environmental, and genetic factors of patients diagnosed with skin, lung, colon, prostate, or breast cancer. It was constructed using reported public health information and epidemiological patterns.
The dataset includes 2,000 records and 21 features. Features include notable cancer risk factors such as Alcohol_Use, Smoking, and Obesity.
Engineered features include Overall_Risk_Score and Risk_Level. Overall_Risk_Score is a numerical variable that serves as a composite risk score based on an individual’s risk factor values. Risk_Level is a categorical variable that classifies an individual’s level of cancer risk as “Low”, “Medium”, or “High”.
We cleaned up the dataset by removing null values, creating the dataframe “cancer_data”.
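The cleaning step can be sketched in base R as follows. The data frame `raw` and its values are hypothetical stand-ins for the Kaggle file, and `na.omit()` is one common way to drop incomplete rows, not necessarily the exact call we used.

```r
# Hypothetical stand-in for the raw Kaggle data (values are made up).
raw <- data.frame(
  Smoking     = c(3, NA, 7, 5),
  Alcohol_Use = c(2, 4, NA, 6),
  Obesity     = c(5, 6, 8, 1)
)

# Drop every row that contains at least one missing value.
cancer_data <- na.omit(raw)

nrow(raw)          # rows before cleaning
nrow(cancer_data)  # rows after cleaning
```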

Pie chart (plotly)

Of 2000 patients, the most common cancer type was lung cancer, at 26.4% frequency.

Pie chart (plotly)

Most cancer patients had a “medium” risk of cancer. Interestingly, the least common risk diagnosis among such patients was “high”.

3D Scatter plot (plotly)

Commentary for 3D Scatter plot

Cancer types overlap considerably.
The overall spread of data points suggests that risk increases gradually, rather than abruptly, with age and BMI.
Older patients generally have higher risk scores, while younger patients tend to cluster at lower risk scores.
Breast and colon cancers cluster in moderate BMI and risk ranges, while lung cancer cases are more widely dispersed.
Skin cancer typically appears for patients with lower BMI and moderate age levels.
A few outliers show high risk despite average BMI and age, hinting at other underlying factors.

Box plot (ggplot)

Commentary for Box plot

Risk Level is a factor variable based on Overall Risk Score, a composite risk metric with values between 0 and 1.
Risk Level categorizes Overall Risk Scores from 0 to .35 as Low, .35 to .65 as Medium, and .65 to 1 as High.
The notches on the box plots provide a 95% confidence interval around the median value for each Risk Level.
Since none of these notches overlap, the plot indicates that the differences between the median Smoking Levels for each Risk Level are statistically significant.
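As a rough sketch of the two ideas above, the code below bins a score into the three risk levels and computes the notch limits for each level. The data are simulated, not the real dataset, and which side of each cutoff the boundary values (0.35 and 0.65) fall on is an assumption. The notch limits come from `boxplot.stats()`, which uses median ± 1.58 × IQR / sqrt(n).

```r
# Simulated scores and smoking levels; NOT the real dataset.
set.seed(1)
score   <- runif(300)  # composite score in [0, 1]
smoking <- pmin(10, pmax(0, round(10 * score + rnorm(300, sd = 2))))

# Bin the score with the cutoffs described above.
# Assumption: intervals are closed on the right, e.g. (0.35, 0.65].
risk_level <- cut(score, breaks = c(0, 0.35, 0.65, 1),
                  labels = c("Low", "Medium", "High"),
                  include.lowest = TRUE)

# Lower and upper notch limits for the smoking level within each risk level.
notches <- t(sapply(split(smoking, risk_level),
                    function(x) boxplot.stats(x)$conf))
colnames(notches) <- c("lower", "upper")
notches
```

If the intervals in `notches` do not overlap between two risk levels, that is the same visual check as non-overlapping notches on the plot.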

Statistical Analysis for Box Plot

##   Risk_Level min q1 median q3 max     mean
## 1       High   0  6    8.5 10  10 7.519608
## 2        Low   0  1    2.0  4  10 2.845679
## 3     Medium   0  3    6.0  9  10 5.479670

Patients in the High risk level had the highest mean smoking level, approximately 7.52, which makes sense since smoking is a known cause of cancer.
While the Low risk level showed a much smaller mean smoking level of about 2.85, three low-risk patients were outliers with smoking levels of 9 and 10. This suggests that other factors are influencing these patients’ risk levels. For example, other lifestyle habits could be offsetting their smoking in the overall risk score, or their smoking levels could be inaccurately self-reported.
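A summary table of this shape can be produced with a few lines of base R. This is a minimal sketch: `toy` is a hypothetical data frame with made-up values, and the exact code behind the table above may differ.

```r
# Hypothetical rows standing in for cancer_data (values are made up).
toy <- data.frame(
  Risk_Level = c("Low", "Low", "Low", "Medium", "Medium", "High", "High"),
  Smoking    = c(1, 2, 9, 4, 6, 8, 10)
)

# Summary of Smoking within each Risk_Level, one row per level.
stats <- do.call(rbind, lapply(split(toy$Smoking, toy$Risk_Level), function(x) {
  q <- quantile(x, c(0.25, 0.5, 0.75))
  data.frame(min = min(x), q1 = q[[1]], median = q[[2]], q3 = q[[3]],
             max = max(x), mean = mean(x))
}))
stats
```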

Bar graph (ggplot)

All four risk factors plotted show a positive correlation with overall cancer risk score.

Statistical Analysis for Bar Graph

Call:
lm(formula = Overall_Risk_Score ~ Air_Pollution + Alcohol_Use + 
    Obesity + Occupational_Hazards, data = cancer_data)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.305332 -0.055599 -0.002445  0.056468  0.263308 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          0.1688402  0.0064234   26.29   <2e-16 ***
Air_Pollution        0.0178606  0.0005837   30.60   <2e-16 ***
Alcohol_Use          0.0134763  0.0005705   23.62   <2e-16 ***
Obesity              0.0103827  0.0006076   17.09   <2e-16 ***
Occupational_Hazards 0.0121961  0.0005797   21.04   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08296 on 1995 degrees of freedom
Multiple R-squared:  0.5466,    Adjusted R-squared:  0.5457 
F-statistic: 601.2 on 4 and 1995 DF,  p-value: < 2.2e-16

Air pollution, alcohol use, obesity, and occupational hazards all had very small p-values (below 2e-16), so we can conclude that each has a statistically significant association with the overall risk score that is unlikely to be a result of chance.
Additionally, all the variables have high t values (at the 0.05 significance level, the cutoff for |t| is approximately 2), which further supports that each risk factor has a legitimate effect on overall risk score.
Lastly, the linear model explains about 55% of the variation in overall risk score (R-squared is approximately 0.55).
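The numbers in the model output can be cross-checked by hand. The sketch below recomputes each t value as estimate / standard error, looks up the two-sided 5% cutoff for 1995 degrees of freedom, and rebuilds the F-statistic from R-squared. All inputs are copied from the output above; small differences from the printed values are due to rounding.

```r
# Estimates and standard errors copied from the lm() output above.
est <- c(Air_Pollution = 0.0178606, Alcohol_Use = 0.0134763,
         Obesity = 0.0103827, Occupational_Hazards = 0.0121961)
se  <- c(0.0005837, 0.0005705, 0.0006076, 0.0005797)

t_val  <- est / se               # t value = estimate / standard error
t_crit <- qt(0.975, df = 1995)   # two-sided 5% cutoff for |t|
round(t_val, 2)
t_crit

# F-statistic implied by R-squared with k predictors and 1995 residual df.
r2 <- 0.5466; k <- 4; df_res <- 1995
f_stat <- (r2 / k) / ((1 - r2) / df_res)
round(f_stat, 1)
```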

Statistical Analysis of Dataset

Call:
lm(formula = Overall_Risk_Score ~ Smoking + Alcohol_Use + Obesity + 
    Family_History + Fruit_Veg_Intake + Physical_Activity_Level + 
    BMI + Diet_Red_Meat + Diet_Salted_Processed + BRCA_Mutation + 
    H_Pylori_Infection + Calcium_Intake + Occupational_Hazards, 
    data = cancer_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.22215 -0.04111 -0.00037  0.04240  0.19684 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)              1.66e-02   1.14e-02    1.46   0.1449    
Smoking                  1.81e-02   4.22e-04   43.02  < 2e-16 ***
Alcohol_Use              1.33e-02   4.23e-04   31.54  < 2e-16 ***
Obesity                  1.18e-02   4.50e-04   26.27  < 2e-16 ***
Family_History           1.48e-02   3.45e-03    4.30  1.8e-05 ***
Fruit_Veg_Intake        -1.45e-03   4.68e-04   -3.11   0.0019 ** 
Physical_Activity_Level -9.67e-05   4.32e-04   -0.22   0.8230    
BMI                      8.75e-04   3.45e-04    2.53   0.0114 *  
Diet_Red_Meat            1.20e-02   4.53e-04   26.50  < 2e-16 ***
Diet_Salted_Processed    1.31e-02   4.62e-04   28.39  < 2e-16 ***
BRCA_Mutation            1.20e-02   7.73e-03    1.55   0.1221    
H_Pylori_Infection       3.45e-03   3.50e-03    0.99   0.3247    
Calcium_Intake          -1.14e-04   4.58e-04   -0.25   0.8032    
Occupational_Hazards     1.32e-02   4.27e-04   30.91  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0609 on 1986 degrees of freedom
Multiple R-squared:  0.757, Adjusted R-squared:  0.755 
F-statistic:  475 on 13 and 1986 DF,  p-value: <2e-16

Statistical Analysis of Dataset (cont’d)

Smoking, alcohol use, obesity, family history, red meat consumption, salted and processed food consumption, and frequency of occupational hazards had the smallest p-values. These risk factors have a statistically significant effect on overall risk score that is unlikely to be due to chance. Fruit and vegetable intake (p ≈ 0.002) and BMI (p ≈ 0.011) were also statistically significant, but to a lesser degree. Surprisingly, physical activity level has no statistically significant effect on an individual’s overall risk score.
Additionally, whether a patient has the BRCA genetic mutation, has a history of H. pylori infection, or consumes a certain amount of calcium has no statistically significant effect on their overall risk score. This makes sense logically, because these factors are less common among most people and have a smaller scope of effect than lifestyle habits that persist over long periods of time. For example, taking a calcium supplement every day likely won’t have as much of an impact as cutting alcohol or red meat out of one’s diet.
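To make the significant versus non-significant split explicit, the p-values can be recomputed from the t values in the output above. This is a sketch using only the reported t values and the 1986 residual degrees of freedom; the 0.05 threshold is the conventional cutoff.

```r
# t values copied from the full-model output above; residual df = 1986.
t_val <- c(Smoking = 43.02, Alcohol_Use = 31.54, Obesity = 26.27,
           Family_History = 4.30, Fruit_Veg_Intake = -3.11,
           Physical_Activity_Level = -0.22, BMI = 2.53,
           Diet_Red_Meat = 26.50, Diet_Salted_Processed = 28.39,
           BRCA_Mutation = 1.55, H_Pylori_Infection = 0.99,
           Calcium_Intake = -0.25, Occupational_Hazards = 30.91)

p_val <- 2 * pt(-abs(t_val), df = 1986)  # two-sided p-value

names(p_val)[p_val >= 0.05]     # predictors not significant at the 5% level
signif(p_val[p_val < 0.05], 2)  # predictors significant at the 5% level
```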

Mean and Standard Deviations of Statistically Significant Factors

##                         Risk_Factor   Mean Standard_Deviation
## 1                           Smoking 5.1570          3.3253391
## 2                       Alcohol Use 5.0350          3.2609956
## 3                           Obesity 5.9675          3.0613934
## 4                    Family History 0.1945          0.3959143
## 5                  Diet of Red Meat 5.1895          3.1544516
## 6 Diet of Salted and Processed Food 4.5635          3.0883226
## 7              Occupational Hazards 4.9790          3.2128991

All of these statistically significant risk factors except family history were measured on a scale of 0-10, with 0 being least frequently practiced and 10 being most frequently practiced. Family history was measured as either 0 or 1, with 1 meaning an individual has a family history of cancer.
Overall, people fall in the moderate frequency range for the risk factors, around a value of 5. This means that on average, patients reported smoking consumption at a level of around 5.16, alcohol consumption around 5.04, red meat consumption around 5.19, and salty processed food consumption at about 4.56.
Additionally, patients are on average moderately obese, at a level of around 5.97. Occupational hazards were reported to be, on average, moderately frequent, at a value of approximately 4.98. That’s a questionably high mean for the frequency of occupational hazards among the patients. Lastly, about 19% of individuals in the dataset have a family history of cancer.
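Because family history is coded 0 or 1, its mean is simply the proportion of patients with a family history, and its standard deviation follows from that proportion. A quick check, using the mean from the table and the 2,000-record count (389 ones is 0.1945 × 2000):

```r
p <- 0.1945  # mean of Family History from the table above
n <- 2000    # number of records in the dataset

# Sample SD of a 0/1 variable with proportion p (n - 1 denominator, as in sd()).
sd_binary <- sqrt(p * (1 - p) * n / (n - 1))
sd_binary

# Same result from an explicit 0/1 vector with 389 ones.
x <- rep(c(1, 0), times = c(389, 1611))
c(mean = mean(x), sd = sd(x))
```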

Overall Risk Score between Sexes - Mean and S.D

##   Gender mean_risk   sd_risk
## 1 Female 0.4540896 0.1222258
## 2   Male 0.4548243 0.1240155

There is practically no difference between the average female and male overall risk score. It’s likely that if you analyzed the overall risk score between sexes by cancer type, there would be more of a noticeable difference. For example, it is a known fact that women are more at risk of breast cancer than men.
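One way to put a number on "practically no difference" is a standardized mean difference (Cohen's d). The sketch below uses the means and standard deviations from the table; the group sizes are not shown, so the pooled standard deviation assumes equally sized groups.

```r
# Means and SDs copied from the table above.
m_f <- 0.4540896; s_f <- 0.1222258
m_m <- 0.4548243; s_m <- 0.1240155

# Pooled SD under the ASSUMPTION of equal group sizes (counts are not shown).
s_pooled <- sqrt((s_f^2 + s_m^2) / 2)

d <- (m_m - m_f) / s_pooled  # standardized mean difference (Cohen's d)
round(d, 4)
```

A value this close to zero is far below the 0.2 that is conventionally called a small effect.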

ANOVA Test comparing Overall Risk Score, Cancer_Type, Smoking, and Obesity

##               Df Sum Sq Mean Sq F value Pr(>F)    
## Cancer_Type    4  2.646   0.662   63.93 <2e-16 ***
## Smoking        1  4.413   4.413  426.41 <2e-16 ***
## Obesity        1  2.594   2.594  250.65 <2e-16 ***
## Residuals   1993 20.626   0.010                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA test shows that cancer type, smoking level, and obesity level each have a statistically significant effect on overall risk score.
Each exhibits a very low p-value, which shows that its effect is unlikely to be due to chance. Smoking level resulted in a very large F-value, 426.41, which tells us that smoking has a particularly strong effect on the overall risk score.
Similarly, obesity level has a notable impact on the overall risk score, with an F-value of 250.65. Cancer type also affects overall risk score, with an F-value of 63.93, but not as strongly as smoking and obesity levels.
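Each F-value in the table is the term's mean square divided by the residual mean square, which can be checked from the reported degrees of freedom and sums of squares. The sketch below also computes each term's share of the total sum of squares (eta squared) as a complementary measure. Note that `aov()` reports sequential sums of squares, so these values depend on the order in which the terms were entered.

```r
# Df and Sum Sq copied from the ANOVA table above.
dfs <- c(Cancer_Type = 4, Smoking = 1, Obesity = 1)
ss  <- c(Cancer_Type = 2.646, Smoking = 4.413, Obesity = 2.594)
df_res <- 1993; ss_res <- 20.626

ms_res <- ss_res / df_res     # residual mean square
f_val  <- (ss / dfs) / ms_res # F = term mean square / residual mean square
round(f_val, 1)

# Share of the total sum of squares attributed to each term (eta squared).
eta_sq <- ss / (sum(ss) + ss_res)
round(eta_sq, 3)
```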

Pairs() Visualization

Using the pairs() visualization didn’t give a clear picture of whether the variables had a positive or negative linear relationship; they all visually seemed independent of each other.
The only two variables that showed the possibility of a linear trend against overall risk score were smoking and obesity. When graphing the relationships using scatterplots, it was hard to come to any conclusion because the points were separated by level and didn’t show an obvious trend. As a result, we decided to visualize the relationships using box plots instead.

Visualizing using Scatter Plots Instead

Smoking vs Overall Score Scatterplot Analysis

Smoking vs Overall Score Scatterplot Analysis (cont’d)

Smoking versus overall risk score shows a positive linear relationship. This makes sense, since higher smoking levels would put an individual at increased risk of cancer.
Compared to our previous box plot that showed smoking against the categorical variable Risk_Level, this box plot reveals more outliers that weren’t obvious before with the bins “Low”, “Medium”, and “High”. For example, one patient who reported a smoking level of 0 exhibited an overall risk score of approximately 0.7, compared to the median score of 0.37 for their smoking level.
Similarly, there were three outliers at smoking level 9 that had noticeably low overall risk scores ranging from approximately 0.15 to 0.21, compared to the approximate median risk score 0.53 for the smoking level.
It was interesting to see that one patient at smoking level 1 had an almost 0 overall risk score (approx. 0.03). These outliers are likely explained by other risk factors impacting their risk scores. For example, family history of cancer and/or poor lifestyle habits like frequent alcohol consumption could impact the risk score of a patient with a low smoking level but high overall risk score.
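Outliers like these are the points that fall beyond the box plot whiskers, conventionally more than 1.5 × IQR outside the box. A minimal sketch with `boxplot.stats()` on made-up scores for a single smoking level (the values are hypothetical, not taken from the dataset):

```r
# Hypothetical risk scores for one smoking level (values are made up).
scores <- c(0.30, 0.35, 0.37, 0.38, 0.40, 0.42, 0.70)

s <- boxplot.stats(scores)  # default coef = 1.5
s$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out    # points beyond the whiskers
```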

Obesity vs Overall Score Scatterplot Analysis

Obesity versus overall risk score also demonstrates a positive linear trend. As obesity level approaches ten, the median overall risk score increases from approximately 0.38 to 0.48. The spreads of the plots remain consistently similar across all obesity levels.
Obesity level 6 and level 9 had highly similar lower fences and upper fences. However, obesity level 6 saw an outlier patient with an overall risk score of approximately 0.105, the minimum for the level. This is a surprisingly low overall risk score, given that the reported obesity level was 6.
Similarly, an individual with an obesity level of 5 demonstrated a very low overall risk score of roughly 0.08. Other outliers included an individual at obesity level 4 with an overall risk score of approximately 0.76, and an individual at obesity level 2 with an extremely low overall risk score of approximately 0.03. These outliers are likely a result of other factors impacting the individuals’ overall risk scores.

Age vs. Cancer Type, categorized by Risk Level

Interpreting Age vs. Cancer Type - Low Risk Level Box Plot

The median age for patients with breast cancer was 60 years old. Colon cancer, lung cancer, and skin cancer had similar medians of 64, 63, and 61 years, respectively.
These four cancer types also had similar upper and lower fences to each other, with breast cancer’s being most similar to lung cancer’s and colon cancer’s being most similar to skin cancer’s.
Prostate cancer had the highest median age, at 72. Overall, prostate cancer’s spread had higher values compared to the other cancer types, with ages ranging from 46 to 89 years.
Overall, two outliers were present: a 38-year-old patient with colon cancer and a 32-year-old patient with skin cancer.

Interpretation - Medium Risk Level Box Plot

The median ages for patients with breast cancer, colon cancer, lung cancer, and skin cancer were similar to the medians in the low risk level box plot: 63, 61, 62, and 63 years, respectively.
Additionally, the four cancer types had similar spreads, with colon cancer and breast cancer having an identical lower fence and upper fence from 37 to 85 years.
Prostate cancer exhibited higher ages and the highest median at 73 years of age compared to the other cancer types.
Outliers included two individuals with breast cancer aged 31 and 34, an individual with lung cancer aged 32, and an individual with prostate cancer aged 46 years.
The medium risk level box plot and the low risk level box plot had very similar distributions overall.

Interpretation - High Risk Level Box Plot

The high risk level box plot differed the most in shape from the other two box plots. Most noticeably, prostate cancer’s spread is extremely condensed. Its range is ages 67 to 74, with its median age being on the higher end at 73 years.
Interestingly, the median age didn’t change much across the three risk levels for prostate cancer. Breast cancer, lung cancer, and skin cancer have extremely similar ranges, going from 45 years to 81 years (breast cancer and lung cancer) or 80 years (skin cancer). Colon cancer’s fences were shorter, from 50 to 78.
Across the whole plot, there was only one outlier, an 85-year-old patient with colon cancer.
Colon cancer, lung cancer, and skin cancer all had a similar median around 63.5 (63, 64, and 63.5 respectively). Breast cancer’s median was on the low end, at 59 years of age.
Overall, the spreads of the high risk level box plots were more condensed compared to the other risk levels’ box plots. The medians across the three risk level plots typically stayed in the early to mid 60s range.

Conclusions

Of the dataset’s 2000 patients, the most common cancer type was lung cancer.
Most patients had a medium level risk of cancer. The least common risk level was “high,” which was an interesting find given that all of the patients were diagnosed with either breast, lung, colon, skin, or prostate cancer.
Older patients generally had a higher overall risk score. Conversely, younger patients were clustered at lower risk scores. Additionally, in our distribution of age versus cancer type, we found that across all risk levels and most cancer types, the median age for a patient with cancer was typically in the early to mid 60s. Prostate cancer was the exception, with a consistent median age of 72 to 73 years across all three risk levels.
Men and women across the dataset had, on average, practically identical overall risk scores. If separated by cancer type, there would likely be a greater difference between the means.

Conclusions (cont’d)

In our statistical analysis, we wanted to see which factors had a statistically significant effect on the overall risk a patient has for developing cancer. By looking at the relationships between the variables and the overall risk score through linear modeling, an ANOVA test, and visualizations, we found that smoking, alcohol use, obesity, family history, red meat consumption, processed food consumption, and frequency of occupational hazards impacted the overall risk score in a statistically significant way. Smoking had the largest impact on overall risk score when looking at the F-statistic for each variable, which makes sense because smoking is a well-established cause of cancer.

Thank you!