##   GENDER AGE SMOKING YELLOW_FINGERS ANXIETY PEER_PRESSURE CHRONIC_DISEASE
## 1      M  65       1              1       1             2               2
## 2      F  55       1              2       2             1               1
## 3      F  78       2              2       1             1               1
## 4      M  60       2              1       1             1               2
## 5      F  80       1              1       2             1               1
## 6      F  58       1              1       1             2               2
##   FATIGUE ALLERGY WHEEZING ALCOHOL_CONSUMING COUGHING SHORTNESS_OF_BREATH
## 1       1       2        2                 2        2                   2
## 2       2       2        2                 1        1                   1
## 3       2       1        2                 1        1                   2
## 4       1       2        1                 1        2                   1
## 5       2       1        2                 1        1                   1
## 6       2       2        1                 2        2                   1
##   SWALLOWING_DIFFICULTY CHEST_PAIN LUNG_CANCER
## 1                     2          1          NO
## 2                     2          2          NO
## 3                     1          1         YES
## 4                     2          2         YES
## 5                     1          2          NO
## 6                     1          2         YES

Lung Cancer

This report analyzes a lung cancer dataset of 3,000 patient records to examine how well demographic, psychosocial, environmental, and clinical attributes predict whether a patient has lung cancer.

Data Preprocessing

Displays the structure of the dataset (number of observations, number of variables, and the data type of each variable), followed by the column names.

## 'data.frame':    3000 obs. of  16 variables:
##  $ GENDER               : chr  "M" "F" "F" "M" ...
##  $ AGE                  : int  65 55 78 60 80 58 70 74 77 67 ...
##  $ SMOKING              : int  1 1 2 2 1 1 1 2 1 2 ...
##  $ YELLOW_FINGERS       : int  1 2 2 1 1 1 1 2 2 2 ...
##  $ ANXIETY              : int  1 2 1 1 2 1 1 1 1 2 ...
##  $ PEER_PRESSURE        : int  2 1 1 1 1 2 2 1 2 2 ...
##  $ CHRONIC_DISEASE      : int  2 1 1 2 1 2 2 1 1 1 ...
##  $ FATIGUE              : int  1 2 2 1 2 2 1 1 1 2 ...
##  $ ALLERGY              : int  2 2 1 2 1 2 2 2 1 2 ...
##  $ WHEEZING             : int  2 2 2 1 2 1 2 1 1 1 ...
##  $ ALCOHOL_CONSUMING    : int  2 1 1 1 1 2 2 1 2 2 ...
##  $ COUGHING             : int  2 1 1 2 1 2 2 1 1 1 ...
##  $ SHORTNESS_OF_BREATH  : int  2 1 2 1 1 1 2 1 1 2 ...
##  $ SWALLOWING_DIFFICULTY: int  2 2 1 2 1 1 2 2 1 1 ...
##  $ CHEST_PAIN           : int  1 2 1 2 2 2 1 1 2 1 ...
##  $ LUNG_CANCER          : chr  "NO" "NO" "YES" "YES" ...
##  [1] "GENDER"                "AGE"                   "SMOKING"              
##  [4] "YELLOW_FINGERS"        "ANXIETY"               "PEER_PRESSURE"        
##  [7] "CHRONIC_DISEASE"       "FATIGUE"               "ALLERGY"              
## [10] "WHEEZING"              "ALCOHOL_CONSUMING"     "COUGHING"             
## [13] "SHORTNESS_OF_BREATH"   "SWALLOWING_DIFFICULTY" "CHEST_PAIN"           
## [16] "LUNG_CANCER"
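
The code behind this output is not included in the report. A minimal sketch of the inspection step could look like the following. The data frame name `lung` is an assumption, and a six-row stand-in built from the first rows printed at the top of the report (with only four of the sixteen columns) replaces the real file, whose name is not given.

```r
# Stand-in for the real data set: first six rows, four of the sixteen columns.
# The name `lung` is hypothetical; the report does not show how the data were loaded.
lung <- data.frame(
  GENDER      = c("M", "F", "F", "M", "F", "F"),
  AGE         = c(65L, 55L, 78L, 60L, 80L, 58L),
  SMOKING     = c(1L, 1L, 2L, 2L, 1L, 1L),
  LUNG_CANCER = c("NO", "NO", "YES", "YES", "NO", "YES"),
  stringsAsFactors = FALSE
)

str(lung)    # number of observations and variables, plus the type of each column
names(lung)  # column names only
```

On the full data set the same two calls would produce the 3000-observation listing shown above.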

Generates summary statistics for all variables, giving insight into distributions and missing values.

##     GENDER               AGE           SMOKING      YELLOW_FINGERS 
##  Length:3000        Min.   :30.00   Min.   :1.000   Min.   :1.000  
##  Class :character   1st Qu.:42.00   1st Qu.:1.000   1st Qu.:1.000  
##  Mode  :character   Median :55.00   Median :1.000   Median :2.000  
##                     Mean   :55.17   Mean   :1.491   Mean   :1.514  
##                     3rd Qu.:68.00   3rd Qu.:2.000   3rd Qu.:2.000  
##                     Max.   :80.00   Max.   :2.000   Max.   :2.000  
##     ANXIETY      PEER_PRESSURE   CHRONIC_DISEASE    FATIGUE        ALLERGY     
##  Min.   :1.000   Min.   :1.000   Min.   :1.00    Min.   :1.00   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.00    1st Qu.:1.00   1st Qu.:1.000  
##  Median :1.000   Median :1.000   Median :2.00    Median :1.00   Median :2.000  
##  Mean   :1.494   Mean   :1.499   Mean   :1.51    Mean   :1.49   Mean   :1.507  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.00    3rd Qu.:2.00   3rd Qu.:2.000  
##  Max.   :2.000   Max.   :2.000   Max.   :2.00    Max.   :2.00   Max.   :2.000  
##     WHEEZING     ALCOHOL_CONSUMING    COUGHING     SHORTNESS_OF_BREATH
##  Min.   :1.000   Min.   :1.000     Min.   :1.000   Min.   :1.000      
##  1st Qu.:1.000   1st Qu.:1.000     1st Qu.:1.000   1st Qu.:1.000      
##  Median :1.000   Median :1.000     Median :2.000   Median :1.000      
##  Mean   :1.497   Mean   :1.491     Mean   :1.511   Mean   :1.488      
##  3rd Qu.:2.000   3rd Qu.:2.000     3rd Qu.:2.000   3rd Qu.:2.000      
##  Max.   :2.000   Max.   :2.000     Max.   :2.000   Max.   :2.000      
##  SWALLOWING_DIFFICULTY   CHEST_PAIN    LUNG_CANCER
##  Min.   :1.00          Min.   :1.000   NO :1482   
##  1st Qu.:1.00          1st Qu.:1.000   YES:1518   
##  Median :1.00          Median :1.000              
##  Mean   :1.49          Mean   :1.499              
##  3rd Qu.:2.00          3rd Qu.:2.000              
##  Max.   :2.00          Max.   :2.000
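
In the summary above `LUNG_CANCER` is reported as counts (NO: 1482, YES: 1518) although `str()` listed it as `chr`, which suggests the column was converted to a factor in between. The sketch below shows one way to do that, together with an explicit missing-value count. It reuses a hypothetical six-row stand-in named `lung`, not the real data.

```r
# Six-row stand-in; `lung` is a hypothetical name, values copied from the first rows shown.
lung <- data.frame(
  AGE         = c(65L, 55L, 78L, 60L, 80L, 58L),
  SMOKING     = c(1L, 1L, 2L, 2L, 1L, 1L),
  LUNG_CANCER = c("NO", "NO", "YES", "YES", "NO", "YES"),
  stringsAsFactors = FALSE
)

# Convert the outcome to a factor so that summary() reports counts per level.
lung$LUNG_CANCER <- factor(lung$LUNG_CANCER, levels = c("NO", "YES"))

summary(lung)

# Explicit count of missing values per column.
na_counts <- colSums(is.na(lung))
na_counts
```

For numeric columns, `summary()` adds an `NA's` entry only when missing values are present, so the absence of such entries in the output above suggests the numeric columns are complete. `colSums(is.na(...))` makes that check explicit for every column.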

Pie Chart of Lung Cancer Cases

Uses ggplot2 to create a pie chart showing the proportion of lung cancer cases (“YES” vs “NO”).
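
The plotting code is not shown in the report. As a rough sketch, a ggplot2 pie chart is usually built as a single stacked bar mapped to polar coordinates. The class counts below are the ones reported by `summary()` in this report; the object names (`counts`, `pie`) and the styling choices are assumptions.

```r
# Class counts as reported by summary() earlier in the report.
counts <- data.frame(
  LUNG_CANCER = c("NO", "YES"),
  n           = c(1482, 1518)
)
counts$share <- counts$n / sum(counts$n)
counts$label <- sprintf("%s: %.1f%%", counts$LUNG_CANCER, 100 * counts$share)

# Draw the chart only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  pie <- ggplot(counts, aes(x = "", y = n, fill = LUNG_CANCER)) +
    geom_col(width = 1) +                       # one stacked bar
    coord_polar(theta = "y") +                  # bend the bar into a circle
    geom_text(aes(label = label),
              position = position_stack(vjust = 0.5)) +
    labs(title = "Lung Cancer Cases", fill = "LUNG_CANCER") +
    theme_void()
  print(pie)
}
```

With 1482 NO and 1518 YES cases the two slices are nearly equal (49.4% and 50.6%).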


Psychosocial Variables vs Lung Cancer

Uses ggplot2 to create grouped bar charts showing how anxiety and fatigue differ between individuals with and without lung cancer.

Clinical Variables vs Lung Cancer

Uses ggplot2 to create bar charts comparing clinical symptoms between patients with and without lung cancer.

Environmental Variables vs Lung Cancer

Uses ggplot2 to create bar charts showing how environmental factors relate to lung cancer presence.

Demographic Variables vs Lung Cancer

Correlation Plots (Psychosocial, Clinical, Demographic, Environmental)

Psychosocial Correlation

## corrplot 0.95 loaded
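
The correlation plot itself did not survive in this output. A minimal sketch of how a psychosocial correlation matrix could be computed and drawn with corrplot follows. The data are simulated in the report's 1/2 coding, and the data frame name `psych` and the plot options are assumptions.

```r
# Simulated data in the report's 1/2 coding; NOT the real data set.
set.seed(1)
n <- 200
psych <- data.frame(
  ANXIETY     = sample(1:2, n, replace = TRUE),
  FATIGUE     = sample(1:2, n, replace = TRUE),
  LUNG_CANCER = sample(c("NO", "YES"), n, replace = TRUE)
)

# cor() needs numeric input, so encode the outcome as 0 (NO) / 1 (YES).
psych$LUNG_CANCER <- as.integer(psych$LUNG_CANCER == "YES")

cor_mat <- cor(psych)  # Pearson correlation matrix
round(cor_mat, 3)

# Draw the matrix only if corrplot is installed.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_mat, method = "color",
                     addCoef.col = "black", tl.col = "black")
}
```

Because every variable here is binary, the Pearson correlations are phi coefficients, which is worth keeping in mind when reading the colour scale.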

Clinical Correlation

Uses corrplot to show correlations among clinical symptoms and lung cancer.

Demographic Correlation

Visualizes how gender and age relate to lung cancer.

Environmental Correlation

Logistic Regression Models

Demographic Model

## 
## Call:
## glm(formula = LUNG_CANCER ~ GENDER + AGE_GROUP, family = binomial, 
##     data = lung_model)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)  
## (Intercept)     0.22556    0.09558   2.360   0.0183 *
## GENDERM        -0.06616    0.07316  -0.904   0.3658  
## AGE_GROUP40–49 -0.11665    0.12100  -0.964   0.3350  
## AGE_GROUP50–59 -0.25875    0.11966  -2.162   0.0306 *
## AGE_GROUP60–69 -0.23162    0.12113  -1.912   0.0558 .
## AGE_GROUP70+   -0.20473    0.11527  -1.776   0.0757 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4158.5  on 2999  degrees of freedom
## Residual deviance: 4151.5  on 2994  degrees of freedom
## AIC: 4163.5
## 
## Number of Fisher Scoring iterations: 3
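
The model above uses `AGE_GROUP`, which is not one of the sixteen raw columns. The coefficient names imply bands 40–49, 50–59, 60–69 and 70+, with a reference band that is presumably 30–39, since the minimum age is 30. The sketch below shows one way such a factor could be derived with `cut()` and fed to `glm()`. The break points, the reference label, and the simulated data are all assumptions, and the labels use plain hyphens for portability.

```r
# Simulated stand-in for the modelling data; NOT the real data set.
set.seed(42)
n <- 500
lung_model <- data.frame(
  GENDER      = sample(c("F", "M"), n, replace = TRUE),
  AGE         = sample(30:80, n, replace = TRUE),
  LUNG_CANCER = factor(sample(c("NO", "YES"), n, replace = TRUE),
                       levels = c("NO", "YES"))
)

# Bin age into decade bands. right = FALSE makes the intervals left-closed,
# so an age of 40 falls into "40-49" rather than "30-39".
lung_model$AGE_GROUP <- cut(
  lung_model$AGE,
  breaks = c(30, 40, 50, 60, 70, Inf),
  labels = c("30-39", "40-49", "50-59", "60-69", "70+"),
  right  = FALSE
)

# With a two-level factor response, glm() models the probability of the
# second level ("YES"); the first factor level of each predictor is the reference.
fit <- glm(LUNG_CANCER ~ GENDER + AGE_GROUP, family = binomial, data = lung_model)
summary(fit)
```

Each `AGE_GROUP` coefficient is then a log odds ratio relative to the reference band.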
## 
## Call:
## glm(formula = LUNG_CANCER ~ GENDER + AGE_GROUP_NUM, family = binomial, 
##     data = demo_model)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)    0.25651    0.11797   2.174   0.0297 *
## GENDER        -0.06359    0.07309  -0.870   0.3843  
## AGE_GROUP_NUM -0.04855    0.02571  -1.888   0.0590 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4158.5  on 2999  degrees of freedom
## Residual deviance: 4154.1  on 2997  degrees of freedom
## AIC: 4160.1
## 
## Number of Fisher Scoring iterations: 3
##   (Intercept)        GENDER AGE_GROUP_NUM 
##     1.2924141     0.9383884     0.9526135
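
The unlabelled vector printed after the model summary matches the exponentiated coefficients, that is, the odds ratios. This can be checked directly from the printed estimates; the sketch below uses the values from the second demographic model, and the object names are assumptions.

```r
# Coefficients copied from the printed summary of the AGE_GROUP_NUM model.
coefs <- c("(Intercept)"   =  0.25651,
           "GENDER"        = -0.06359,
           "AGE_GROUP_NUM" = -0.04855)

# Exponentiating a log-odds coefficient gives an odds ratio.
odds_ratios <- exp(coefs)
round(odds_ratios, 4)

# With a fitted glm object `fit`, the equivalent call would be exp(coef(fit)).
```

An odds ratio of about 0.95 for `AGE_GROUP_NUM` means the estimated odds of lung cancer are multiplied by roughly 0.95 for each one-step increase in age group, although the corresponding p-value (0.059) is above 0.05.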

Psychosocial Model

## 
## Call:
## glm(formula = LUNG_CANCER ~ ANXIETY + FATIGUE, family = binomial, 
##     data = psych_model)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.0009635  0.0640820  -0.015    0.988
## ANXIETY      0.0580576  0.0730492   0.795    0.427
## FATIGUE     -0.0086361  0.0730599  -0.118    0.906
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4158.5  on 2999  degrees of freedom
## Residual deviance: 4157.8  on 2997  degrees of freedom
## AIC: 4163.8
## 
## Number of Fisher Scoring iterations: 3
## (Intercept)     ANXIETY     FATIGUE 
##   0.9990369   1.0597760   0.9914011

Environmental Model

## 
## Call:
## glm(formula = LUNG_CANCER ~ SMOKING + YELLOW_FINGERS + ALCOHOL_CONSUMING + 
##     PEER_PRESSURE, family = binomial, data = env_model)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)  
## (Intercept)       -0.03108    0.08157  -0.381   0.7032  
## SMOKING           -0.05437    0.07315  -0.743   0.4573  
## YELLOW_FINGERS    -0.05663    0.07317  -0.774   0.4389  
## ALCOHOL_CONSUMING  0.12241    0.07315   1.673   0.0943 .
## PEER_PRESSURE      0.09592    0.07315   1.311   0.1898  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4158.5  on 2999  degrees of freedom
## Residual deviance: 4152.8  on 2995  degrees of freedom
## AIC: 4162.8
## 
## Number of Fisher Scoring iterations: 3
##       (Intercept)           SMOKING    YELLOW_FINGERS ALCOHOL_CONSUMING 
##         0.9693992         0.9470810         0.9449407         1.1302120 
##     PEER_PRESSURE 
##         1.1006759

Clinical Model

## 
## Call:
## glm(formula = LUNG_CANCER ~ CHRONIC_DISEASE + ALLERGY + WHEEZING + 
##     COUGHING + SHORTNESS_OF_BREATH + SWALLOWING_DIFFICULTY + 
##     CHEST_PAIN, family = binomial, data = clinical_model)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)  
## (Intercept)           -0.010255   0.104916  -0.098   0.9221  
## CHRONIC_DISEASE        0.035889   0.073218   0.490   0.6240  
## ALLERGY               -0.032506   0.073331  -0.443   0.6576  
## WHEEZING               0.157201   0.073208   2.147   0.0318 *
## COUGHING              -0.136435   0.073277  -1.862   0.0626 .
## SHORTNESS_OF_BREATH    0.008370   0.073288   0.114   0.9091  
## SWALLOWING_DIFFICULTY  0.038078   0.073215   0.520   0.6030  
## CHEST_PAIN            -0.006406   0.073187  -0.088   0.9303  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4158.5  on 2999  degrees of freedom
## Residual deviance: 4149.8  on 2992  degrees of freedom
## AIC: 4165.8
## 
## Number of Fisher Scoring iterations: 3
##           (Intercept)       CHRONIC_DISEASE               ALLERGY 
##             0.9897972             1.0365407             0.9680169 
##              WHEEZING              COUGHING   SHORTNESS_OF_BREATH 
##             1.1702310             0.8724630             1.0084047 
## SWALLOWING_DIFFICULTY            CHEST_PAIN 
##             1.0388121             0.9936148
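
The odds ratios above are point estimates only. As a rough illustration of the uncertainty around them, a 95% Wald interval can be computed from a printed estimate and its standard error. The sketch below does this for `WHEEZING`, the only clinical term with p < 0.05; the variable names are assumptions.

```r
# Estimate and standard error for WHEEZING, copied from the clinical model summary.
est <- 0.157201
se  <- 0.073208

z <- qnorm(0.975)                        # about 1.96 for a 95% interval
ci_log_odds   <- est + c(-1, 1) * z * se # interval on the log-odds scale
ci_odds_ratio <- exp(ci_log_odds)        # back-transform to the odds-ratio scale

round(c(OR = exp(est), lower = ci_odds_ratio[1], upper = ci_odds_ratio[2]), 3)
```

The lower bound sits only just above 1, which is consistent with the p-value of 0.0318. For a fitted model object, `confint.default()` returns Wald intervals of this kind on the log-odds scale.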

Combined Model

## 
## Call:
## glm(formula = formula, family = binomial, data = lung_clean)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)  
## (Intercept)            0.584876   0.415155   1.409   0.1589  
## AGE_GROUP40–49        -0.112444   0.121764  -0.923   0.3558  
## AGE_GROUP50–59        -0.259972   0.120139  -2.164   0.0305 *
## AGE_GROUP60–69        -0.239082   0.121665  -1.965   0.0494 *
## AGE_GROUP70+          -0.204974   0.115812  -1.770   0.0767 .
## GENDERM               -0.068707   0.073438  -0.936   0.3495  
## CHRONIC_DISEASE       -0.043137   0.073575  -0.586   0.5577  
## ALLERGY                0.035763   0.073569   0.486   0.6269  
## WHEEZING              -0.163271   0.073437  -2.223   0.0262 *
## COUGHING               0.135164   0.073529   1.838   0.0660 .
## SHORTNESS_OF_BREATH   -0.013984   0.073672  -0.190   0.8495  
## SWALLOWING_DIFFICULTY -0.030237   0.073576  -0.411   0.6811  
## CHEST_PAIN             0.005679   0.073496   0.077   0.9384  
## SMOKING                0.049094   0.073674   0.666   0.5052  
## YELLOW_FINGERS         0.055328   0.073635   0.751   0.4524  
## ALCOHOL_CONSUMING     -0.125788   0.073565  -1.710   0.0873 .
## PEER_PRESSURE         -0.100039   0.073545  -1.360   0.1738  
## ANXIETY               -0.056038   0.073641  -0.761   0.4467  
## FATIGUE                0.010885   0.073427   0.148   0.8822  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4158.5  on 2999  degrees of freedom
## Residual deviance: 4136.2  on 2981  degrees of freedom
## AIC: 4174.2
## 
## Number of Fisher Scoring iterations: 3
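
The six fitted models can be compared using only the figures printed in their summaries. The sketch below tabulates the residual deviance, residual degrees of freedom and AIC, then computes a likelihood-ratio test of each model against the intercept-only model (null deviance 4158.5 on 2999 degrees of freedom). The table layout and names are assumptions, and the inputs are the rounded printed values, so the results are approximate.

```r
# Values copied from the printed model summaries in this report.
models <- data.frame(
  model     = c("Demographic (AGE_GROUP)", "Demographic (AGE_GROUP_NUM)",
                "Psychosocial", "Environmental", "Clinical", "Combined"),
  resid_dev = c(4151.5, 4154.1, 4157.8, 4152.8, 4149.8, 4136.2),
  resid_df  = c(2994, 2997, 2997, 2995, 2992, 2981),
  aic       = c(4163.5, 4160.1, 4163.8, 4162.8, 4165.8, 4174.2)
)
null_dev <- 4158.5
null_df  <- 2999

# Likelihood-ratio test against the intercept-only model:
# the drop in deviance is compared with a chi-squared distribution.
models$lr_stat <- null_dev - models$resid_dev
models$lr_df   <- null_df - models$resid_df
models$p_value <- pchisq(models$lr_stat, df = models$lr_df, lower.tail = FALSE)

# Sort by AIC (lower is better).
models[order(models$aic), c("model", "aic", "lr_stat", "lr_df", "p_value")]
```

On these rounded figures, none of the likelihood-ratio tests reaches the 0.05 level, and the combined model has the highest AIC of the six despite having the lowest residual deviance. With the fitted objects available, `anova(fit, test = "Chisq")` and `AIC()` would give the same comparison without rounding.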