## GENDER AGE SMOKING YELLOW_FINGERS ANXIETY PEER_PRESSURE CHRONIC_DISEASE
## 1 M 65 1 1 1 2 2
## 2 F 55 1 2 2 1 1
## 3 F 78 2 2 1 1 1
## 4 M 60 2 1 1 1 2
## 5 F 80 1 1 2 1 1
## 6 F 58 1 1 1 2 2
## FATIGUE ALLERGY WHEEZING ALCOHOL_CONSUMING COUGHING SHORTNESS_OF_BREATH
## 1 1 2 2 2 2 2
## 2 2 2 2 1 1 1
## 3 2 1 2 1 1 2
## 4 1 2 1 1 2 1
## 5 2 1 2 1 1 1
## 6 2 2 1 2 2 1
## SWALLOWING_DIFFICULTY CHEST_PAIN LUNG_CANCER
## 1 2 1 NO
## 2 2 2 NO
## 3 1 1 YES
## 4 2 2 YES
## 5 1 2 NO
## 6 1 2 YES
This report analyzes a lung cancer dataset of 3,000 patients and 16 variables to identify which demographic, psychosocial, environmental, and clinical attributes are associated with a lung cancer diagnosis.
Displays the structure of the dataset, showing the number of observations, variables, and their data types.
## 'data.frame': 3000 obs. of 16 variables:
## $ GENDER : chr "M" "F" "F" "M" ...
## $ AGE : int 65 55 78 60 80 58 70 74 77 67 ...
## $ SMOKING : int 1 1 2 2 1 1 1 2 1 2 ...
## $ YELLOW_FINGERS : int 1 2 2 1 1 1 1 2 2 2 ...
## $ ANXIETY : int 1 2 1 1 2 1 1 1 1 2 ...
## $ PEER_PRESSURE : int 2 1 1 1 1 2 2 1 2 2 ...
## $ CHRONIC_DISEASE : int 2 1 1 2 1 2 2 1 1 1 ...
## $ FATIGUE : int 1 2 2 1 2 2 1 1 1 2 ...
## $ ALLERGY : int 2 2 1 2 1 2 2 2 1 2 ...
## $ WHEEZING : int 2 2 2 1 2 1 2 1 1 1 ...
## $ ALCOHOL_CONSUMING : int 2 1 1 1 1 2 2 1 2 2 ...
## $ COUGHING : int 2 1 1 2 1 2 2 1 1 1 ...
## $ SHORTNESS_OF_BREATH : int 2 1 2 1 1 1 2 1 1 2 ...
## $ SWALLOWING_DIFFICULTY: int 2 2 1 2 1 1 2 2 1 1 ...
## $ CHEST_PAIN : int 1 2 1 2 2 2 1 1 2 1 ...
## $ LUNG_CANCER : chr "NO" "NO" "YES" "YES" ...
## [1] "GENDER" "AGE" "SMOKING"
## [4] "YELLOW_FINGERS" "ANXIETY" "PEER_PRESSURE"
## [7] "CHRONIC_DISEASE" "FATIGUE" "ALLERGY"
## [10] "WHEEZING" "ALCOHOL_CONSUMING" "COUGHING"
## [13] "SHORTNESS_OF_BREATH" "SWALLOWING_DIFFICULTY" "CHEST_PAIN"
## [16] "LUNG_CANCER"
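Output of this shape is what `head()`, `str()`, and `names()` print for a data frame. A minimal sketch follows, assuming the data sit in a data frame called `lung` (a hypothetical name; the report does not show the loading code). Only four of the 16 columns are reproduced, using the first six rows printed at the top of the report.

```r
# Toy stand-in for the real data: first six rows, four of the 16 columns.
# The object name `lung` is an assumption, not taken from the report.
lung <- data.frame(
  GENDER      = c("M", "F", "F", "M", "F", "F"),
  AGE         = c(65L, 55L, 78L, 60L, 80L, 58L),
  SMOKING     = c(1L, 1L, 2L, 2L, 1L, 1L),
  LUNG_CANCER = c("NO", "NO", "YES", "YES", "NO", "YES"),
  stringsAsFactors = FALSE
)

head(lung)   # first rows of the data
str(lung)    # number of observations, variables, and their types
names(lung)  # column names only
```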
Generates summary statistics for all variables, giving insight into distributions and missing values.
## GENDER AGE SMOKING YELLOW_FINGERS
## Length:3000 Min. :30.00 Min. :1.000 Min. :1.000
## Class :character 1st Qu.:42.00 1st Qu.:1.000 1st Qu.:1.000
## Mode :character Median :55.00 Median :1.000 Median :2.000
## Mean :55.17 Mean :1.491 Mean :1.514
## 3rd Qu.:68.00 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :80.00 Max. :2.000 Max. :2.000
## ANXIETY PEER_PRESSURE CHRONIC_DISEASE FATIGUE ALLERGY
## Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.00 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:1.00 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :2.00 Median :1.00 Median :2.000
## Mean :1.494 Mean :1.499 Mean :1.51 Mean :1.49 Mean :1.507
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.00 3rd Qu.:2.00 3rd Qu.:2.000
## Max. :2.000 Max. :2.000 Max. :2.00 Max. :2.00 Max. :2.000
## WHEEZING ALCOHOL_CONSUMING COUGHING SHORTNESS_OF_BREATH
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :2.000 Median :1.000
## Mean :1.497 Mean :1.491 Mean :1.511 Mean :1.488
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :2.000 Max. :2.000 Max. :2.000 Max. :2.000
## SWALLOWING_DIFFICULTY CHEST_PAIN LUNG_CANCER
## Min. :1.00 Min. :1.000 NO :1482
## 1st Qu.:1.00 1st Qu.:1.000 YES:1518
## Median :1.00 Median :1.000
## Mean :1.49 Mean :1.499
## 3rd Qu.:2.00 3rd Qu.:2.000
## Max. :2.00 Max. :2.000
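In the summary above, `LUNG_CANCER` is reported as counts per level, which suggests it had been converted to a factor before `summary()` was called. That conversion is an assumption here, as are the object name `lung` and the toy values. The sketch also adds an explicit missing-value count, because `summary()` only prints an `NA's` row for columns that actually contain missing values.

```r
# Toy data; the real data frame has 3000 rows and 16 columns.
lung <- data.frame(
  AGE         = c(65L, 55L, 78L, 60L, 80L, 58L),
  SMOKING     = c(1L, 1L, 2L, 2L, 1L, 1L),
  LUNG_CANCER = c("NO", "NO", "YES", "YES", "NO", "YES"),
  stringsAsFactors = FALSE
)

# Assumed preprocessing step: a factor outcome is summarised as level counts
lung$LUNG_CANCER <- factor(lung$LUNG_CANCER, levels = c("NO", "YES"))

summary(lung)

# Explicit count of missing values per column
na_counts <- colSums(is.na(lung))
print(na_counts)
```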
Pie Chart of Lung Cancer Cases
Uses ggplot2 to create a pie chart showing the proportion of lung cancer cases (YES vs NO).
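One way to draw such a chart is a stacked bar in polar coordinates, since ggplot2 has no dedicated pie geom. The sketch below is not the report's original code. It uses the class counts printed in the summary output (1482 NO, 1518 YES); the column names `n` and `pct` and the title are illustrative choices.

```r
library(ggplot2)

# Class counts taken from the summary output in this report
counts <- data.frame(LUNG_CANCER = c("NO", "YES"), n = c(1482, 1518))
counts$pct <- round(100 * counts$n / sum(counts$n), 1)

# A single stacked bar wrapped into a circle becomes a pie chart
pie <- ggplot(counts, aes(x = "", y = n, fill = LUNG_CANCER)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = paste0(pct, "%")),
            position = position_stack(vjust = 0.5)) +
  labs(title = "Lung Cancer Cases", fill = "Lung cancer") +
  theme_void()
print(pie)
```

The two classes are close to balanced, so the slices are nearly equal in size.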
Psychosocial Variables vs Lung Cancer
Uses ggplot2 to create grouped bar charts showing how anxiety and
fatigue differ between individuals with and without lung cancer.
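A grouped bar chart of this kind can be sketched as follows. The data are simulated, because the plotting code is not shown in the report. The object names `lung` and `long` are hypothetical, and the variables are left in their raw 1/2 coding because the report does not say what the two codes mean.

```r
library(ggplot2)

# Simulated stand-in for the real data (values are random, not from the report)
set.seed(1)
n <- 200
lung <- data.frame(
  ANXIETY     = sample(1:2, n, replace = TRUE),
  FATIGUE     = sample(1:2, n, replace = TRUE),
  LUNG_CANCER = sample(c("NO", "YES"), n, replace = TRUE)
)

# Long format: one row per (patient, variable) pair
long <- data.frame(
  variable    = rep(c("ANXIETY", "FATIGUE"), each = n),
  level       = factor(c(lung$ANXIETY, lung$FATIGUE)),
  LUNG_CANCER = rep(lung$LUNG_CANCER, times = 2)
)

# Bars dodged by outcome, one panel per variable
p <- ggplot(long, aes(x = level, fill = LUNG_CANCER)) +
  geom_bar(position = "dodge") +
  facet_wrap(~ variable) +
  labs(x = "Coded level (1 or 2)", y = "Patients", fill = "Lung cancer")
print(p)
```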
Clinical Variables vs Lung Cancer
Uses ggplot2 to create bar charts comparing clinical symptoms between
lung cancer and non–lung cancer patients.
Environmental Variables vs Lung Cancer
Uses ggplot2 to create bar charts showing how environmental factors
relate to lung cancer presence.
Demographic Variables vs Lung Cancer
Correlation Plots (Psychosocial, Clinical, Demographic, Environmental)
Psychosocial Correlation
## corrplot 0.95 loaded
Clinical Correlation
Uses corrplot to show correlations among clinical symptoms and lung
cancer.
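A correlation plot like this needs a numeric matrix, so the YES/NO outcome has to be encoded as a number first. The sketch below uses simulated data and only three of the clinical variables. The 0/1 encoding, the object name `clinical`, and the `corrplot()` styling arguments are assumptions rather than the report's own choices.

```r
library(corrplot)

# Simulated stand-in for the clinical columns (random values)
set.seed(1)
n <- 200
clinical <- data.frame(
  WHEEZING    = sample(1:2, n, replace = TRUE),
  COUGHING    = sample(1:2, n, replace = TRUE),
  CHEST_PAIN  = sample(1:2, n, replace = TRUE),
  LUNG_CANCER = sample(c("NO", "YES"), n, replace = TRUE)
)

# cor() requires numeric input, so encode the outcome as 0/1 (assumed encoding)
clinical$LUNG_CANCER <- as.integer(clinical$LUNG_CANCER == "YES")

# Pearson correlation matrix, drawn as coloured tiles with coefficients
corr_mat <- cor(clinical)
corrplot(corr_mat, method = "color", addCoef.col = "black", tl.col = "black")
```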
Demographic Correlation
Visualizes how gender and age relate to lung cancer.
Environmental Correlation
Logistic Regression Models
Demographic Model
##
## Call:
## glm(formula = LUNG_CANCER ~ GENDER + AGE_GROUP, family = binomial,
## data = lung_model)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.22556 0.09558 2.360 0.0183 *
## GENDERM -0.06616 0.07316 -0.904 0.3658
## AGE_GROUP40–49 -0.11665 0.12100 -0.964 0.3350
## AGE_GROUP50–59 -0.25875 0.11966 -2.162 0.0306 *
## AGE_GROUP60–69 -0.23162 0.12113 -1.912 0.0558 .
## AGE_GROUP70+ -0.20473 0.11527 -1.776 0.0757 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.5 on 2999 degrees of freedom
## Residual deviance: 4151.5 on 2994 degrees of freedom
## AIC: 4163.5
##
## Number of Fisher Scoring iterations: 3
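The coefficient names above (`AGE_GROUP40–49` through `AGE_GROUP70+`) imply that age was binned into a factor whose first level serves as the reference. The sketch below shows one way to build such a factor with `cut()` and fit the same formula on simulated data. The break points, the reference label `30-39`, and the use of plain hyphens in the labels are assumptions; the report shows only the non-reference labels.

```r
# Simulated stand-in for the real data (random values, so no real signal)
set.seed(1)
n <- 500
lung <- data.frame(
  GENDER      = sample(c("F", "M"), n, replace = TRUE),
  AGE         = sample(30:80, n, replace = TRUE),
  LUNG_CANCER = factor(sample(c("NO", "YES"), n, replace = TRUE),
                       levels = c("NO", "YES"))
)

# Bin age into decade groups; intervals are closed on the left: [30, 40), ...
lung$AGE_GROUP <- cut(
  lung$AGE,
  breaks = c(30, 40, 50, 60, 70, Inf),
  labels = c("30-39", "40-49", "50-59", "60-69", "70+"),
  right  = FALSE
)

# With a factor response, glm() models the probability of the second level ("YES")
fit <- glm(LUNG_CANCER ~ GENDER + AGE_GROUP, family = binomial, data = lung)
summary(fit)
```

Each `AGE_GROUP` coefficient is a log-odds difference relative to the reference group, not relative to the adjacent group.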
##
## Call:
## glm(formula = LUNG_CANCER ~ GENDER + AGE_GROUP_NUM, family = binomial,
## data = demo_model)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.25651 0.11797 2.174 0.0297 *
## GENDER -0.06359 0.07309 -0.870 0.3843
## AGE_GROUP_NUM -0.04855 0.02571 -1.888 0.0590 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.5 on 2999 degrees of freedom
## Residual deviance: 4154.1 on 2997 degrees of freedom
## AIC: 4160.1
##
## Number of Fisher Scoring iterations: 3
## (Intercept) GENDER AGE_GROUP_NUM
## 1.2924141 0.9383884 0.9526135
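The values printed above are odds ratios, obtained by exponentiating the logistic regression coefficients. The sketch below reproduces them from the rounded estimates in the second demographic model's summary, so the results agree with the printed values only to about four decimal places. With a fitted model object, the equivalent call would be `exp(coef(fit))`.

```r
# Coefficients copied from the printed summary (rounded to five decimals)
beta <- c("(Intercept)" = 0.25651, GENDER = -0.06359, AGE_GROUP_NUM = -0.04855)

# Odds ratio = exp(log-odds coefficient)
odds_ratio <- exp(beta)
print(round(odds_ratio, 4))

# Percent change in the odds for a one-unit increase in each predictor
pct_change <- 100 * (odds_ratio - 1)
print(round(pct_change, 1))
```

An odds ratio below 1 means the odds of the modelled outcome decrease as the predictor increases. Here each step up in `AGE_GROUP_NUM` multiplies the odds by roughly 0.95.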
Psychosocial Model
##
## Call:
## glm(formula = LUNG_CANCER ~ ANXIETY + FATIGUE, family = binomial,
## data = psych_model)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.0009635 0.0640820 -0.015 0.988
## ANXIETY 0.0580576 0.0730492 0.795 0.427
## FATIGUE -0.0086361 0.0730599 -0.118 0.906
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.5 on 2999 degrees of freedom
## Residual deviance: 4157.8 on 2997 degrees of freedom
## AIC: 4163.8
##
## Number of Fisher Scoring iterations: 3
## (Intercept) ANXIETY FATIGUE
## 0.9990369 1.0597760 0.9914011
Environmental Model
##
## Call:
## glm(formula = LUNG_CANCER ~ SMOKING + YELLOW_FINGERS + ALCOHOL_CONSUMING +
## PEER_PRESSURE, family = binomial, data = env_model)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.03108 0.08157 -0.381 0.7032
## SMOKING -0.05437 0.07315 -0.743 0.4573
## YELLOW_FINGERS -0.05663 0.07317 -0.774 0.4389
## ALCOHOL_CONSUMING 0.12241 0.07315 1.673 0.0943 .
## PEER_PRESSURE 0.09592 0.07315 1.311 0.1898
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.5 on 2999 degrees of freedom
## Residual deviance: 4152.8 on 2995 degrees of freedom
## AIC: 4162.8
##
## Number of Fisher Scoring iterations: 3
## (Intercept) SMOKING YELLOW_FINGERS ALCOHOL_CONSUMING
## 0.9693992 0.9470810 0.9449407 1.1302120
## PEER_PRESSURE
## 1.1006759
Clinical Model
##
## Call:
## glm(formula = LUNG_CANCER ~ CHRONIC_DISEASE + ALLERGY + WHEEZING +
## COUGHING + SHORTNESS_OF_BREATH + SWALLOWING_DIFFICULTY +
## CHEST_PAIN, family = binomial, data = clinical_model)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.010255 0.104916 -0.098 0.9221
## CHRONIC_DISEASE 0.035889 0.073218 0.490 0.6240
## ALLERGY -0.032506 0.073331 -0.443 0.6576
## WHEEZING 0.157201 0.073208 2.147 0.0318 *
## COUGHING -0.136435 0.073277 -1.862 0.0626 .
## SHORTNESS_OF_BREATH 0.008370 0.073288 0.114 0.9091
## SWALLOWING_DIFFICULTY 0.038078 0.073215 0.520 0.6030
## CHEST_PAIN -0.006406 0.073187 -0.088 0.9303
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.5 on 2999 degrees of freedom
## Residual deviance: 4149.8 on 2992 degrees of freedom
## AIC: 4165.8
##
## Number of Fisher Scoring iterations: 3
## (Intercept) CHRONIC_DISEASE ALLERGY
## 0.9897972 1.0365407 0.9680169
## WHEEZING COUGHING SHORTNESS_OF_BREATH
## 1.1702310 0.8724630 1.0084047
## SWALLOWING_DIFFICULTY CHEST_PAIN
## 1.0388121 0.9936148
Combined Model
##
## Call:
## glm(formula = formula, family = binomial, data = lung_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.584876 0.415155 1.409 0.1589
## AGE_GROUP40–49 -0.112444 0.121764 -0.923 0.3558
## AGE_GROUP50–59 -0.259972 0.120139 -2.164 0.0305 *
## AGE_GROUP60–69 -0.239082 0.121665 -1.965 0.0494 *
## AGE_GROUP70+ -0.204974 0.115812 -1.770 0.0767 .
## GENDERM -0.068707 0.073438 -0.936 0.3495
## CHRONIC_DISEASE -0.043137 0.073575 -0.586 0.5577
## ALLERGY 0.035763 0.073569 0.486 0.6269
## WHEEZING -0.163271 0.073437 -2.223 0.0262 *
## COUGHING 0.135164 0.073529 1.838 0.0660 .
## SHORTNESS_OF_BREATH -0.013984 0.073672 -0.190 0.8495
## SWALLOWING_DIFFICULTY -0.030237 0.073576 -0.411 0.6811
## CHEST_PAIN 0.005679 0.073496 0.077 0.9384
## SMOKING 0.049094 0.073674 0.666 0.5052
## YELLOW_FINGERS 0.055328 0.073635 0.751 0.4524
## ALCOHOL_CONSUMING -0.125788 0.073565 -1.710 0.0873 .
## PEER_PRESSURE -0.100039 0.073545 -1.360 0.1738
## ANXIETY -0.056038 0.073641 -0.761 0.4467
## FATIGUE 0.010885 0.073427 0.148 0.8822
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.5 on 2999 degrees of freedom
## Residual deviance: 4136.2 on 2981 degrees of freedom
## AIC: 4174.2
##
## Number of Fisher Scoring iterations: 3
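The six models can be compared using only the figures printed in their summaries. The sketch below tabulates each residual deviance and coefficient count, recomputes AIC, and runs a likelihood-ratio test of each model against the intercept-only model (null deviance 4158.5). The model labels and column names are illustrative, and the deviances are the rounded printed values.

```r
# Residual deviance and number of estimated coefficients, read off the summaries
models <- data.frame(
  model = c("Demographic (AGE_GROUP)", "Demographic (AGE_GROUP_NUM)",
            "Psychosocial", "Environmental", "Clinical", "Combined"),
  resid_dev = c(4151.5, 4154.1, 4157.8, 4152.8, 4149.8, 4136.2),
  n_coef    = c(6, 3, 3, 5, 8, 19)
)
null_dev <- 4158.5

# For a binary-outcome binomial GLM: AIC = residual deviance + 2 * coefficients
models$AIC <- models$resid_dev + 2 * models$n_coef

# Likelihood-ratio test against the intercept-only model
models$lr_stat <- null_dev - models$resid_dev
models$lr_df   <- models$n_coef - 1
models$p_value <- pchisq(models$lr_stat, df = models$lr_df, lower.tail = FALSE)

print(models[order(models$AIC), ])
```

The recomputed AIC values match those printed in the summaries. On these figures the combined model has the highest AIC despite the lowest deviance, the numeric-age demographic model has the lowest AIC, and no model's likelihood-ratio test against the intercept-only model reaches the 5% significance level.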