Dataset source/url: https://www.kaggle.com/datasets/thedevastator/cancer-patients-and-air-pollution-a-new-link

Introduction

Lung cancer, also known as lung carcinoma, is one of the most common cancers around the globe, and adenocarcinoma is its most common type. Cancer in general is a condition in which the body’s cells proliferate uncontrollably and spread to other parts of the body. Lung cancer is frequently the result of well-known risk factors, especially smoking. This project focuses on the risk factors of lung cancer and on the social stigma surrounding smoking habits and their contribution to the disease. According to an article on the Malaysian government health portal, about 90% of lung cancer cases are caused by smoking, both active and second-hand, because tobacco smoke contains more than 4000 harmful substances, most of which have been identified as causes of lung cancer.

Throughout this project, several stages were carried out to produce adequate and accurate findings regarding the risk factors of lung cancer and the social stigma surrounding smoking habits. Data pre-processing, data cleaning, Exploratory Data Analysis (EDA), data analysis and data interpretation were used to process, clean, analyse, validate and summarise the raw data set, which in turn produced the graphical and statistical visualisations and evaluation models that support the findings of this project.

Objective:

• To examine the veracity of the social stigma that smoking causes lung cancer.

• To discover the risk factors of lung cancer.

Problem:

• Is smoking the only cause of lung cancer?

• What additional risk factors are associated with lung cancer?

Data Preprocessing

Data pre-processing, also called data profiling, covers the techniques of examining, evaluating, and reviewing data in order to compile statistics about its quality. It begins with an analysis of the properties of the current data: identify the data sets that are relevant to the issue at hand, list their key characteristics, and then consider which properties might be pertinent to the intended analytics or machine learning activity. In this project, the participants in the raw data set were lung disease patients, and the data set contains the patients’ personal details, their occupational hazards, their smoking habits and the common symptoms they experience with lung cancer. This stage gives only a rough view and understanding of the topic, because the data is still raw and uncleaned. Dirty data can lead to inaccurate views and assumptions about statistical insights, poorly informed judgements based on those insights, and general mistrust in the analytics process.

#Import library
#install.packages("janitor")
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("stats")
library(tidyr)
# getting data from csv
data_uncleaned<-read.csv(file = "cancer_patient_datasets.csv")
head(data_uncleaned)
##   index Patient.Id Age Gender Air.Pollution Alcohol.use Dust.Allergy
## 1     0         P1  33      1             2           4            5
## 2     1        P10  17      1             3           1            5
## 3     2       P100  35      1             4           5            6
## 4     3      P1000  37      1             7           7            7
## 5     4       P101  46      1             6           8            7
## 6     5       P102  35      1             4           5            6
##   OccuPational.Hazards Genetic.Risk chronic.Lung.Disease Balanced.Diet Obesity
## 1                    4            3                    2             2       4
## 2                    3            4                    2             2       2
## 3                    5            5                    4             6       7
## 4                    7            6                    7             7       7
## 5                    7            7                    6             7       7
## 6                    5            5                    4             6       7
##   Smoking Passive.Smoker Chest.Pain Coughing.of.Blood Fatigue Weight.Loss
## 1       3              2          2                 4       3           4
## 2       2              4          2                 3       1           3
## 3       2              3          4                 8       8           7
## 4       7              7          7                 8       4           2
## 5       8              7          7                 9       3           2
## 6       2              3          4                 8       8           7
##   Shortness.of.Breath Wheezing Swallowing.Difficulty Clubbing.of.Finger.Nails
## 1                   2        2                     3                        1
## 2                   7        8                     6                        2
## 3                   9        2                     1                        4
## 4                   3        1                     4                        5
## 5                   4        1                     4                        2
## 6                   9        2                     1                        4
##   Frequent.Cold Dry.Cough Snoring  Level
## 1             2         3       4    Low
## 2             1         7       2 Medium
## 3             6         7       2   High
## 4             6         7       5   High
## 5             4         2       3   High
## 6             6         7       2   High

Data Cleaning

Data cleansing is generally defined as the practice of correcting or removing inaccurate, damaged, poorly formatted, duplicate, or incomplete data from a dataset. The first step in data cleansing is deduplication: removing duplicate or irrelevant data and other undesirable observations from the dataset, since most duplicated data appears during data gathering. As shown below, in this stage the null values and duplicated rows are identified and removed from the raw dataset.
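
As a minimal sketch of the same idea, the two cleaning steps can be tried on a small toy data frame first, while recording how many rows each step removes. The values below are hypothetical, and the base-R functions complete.cases() and duplicated() stand in for drop_na() and distinct() so that the sketch runs without extra packages.

```r
# Toy data frame (hypothetical values) with one incomplete row and one exact duplicate
toy <- data.frame(
  Patient.Id = c("P1", "P2", "P2", "P3", "P4"),
  Age        = c(33, 17, 17, NA, 46),
  Smoking    = c(3, 2, 2, 7, 8)
)

# Count missing values per column before removing anything
na_per_column <- colSums(is.na(toy))

# Base-R counterpart of tidyr::drop_na(): keep only complete rows
toy_complete <- toy[complete.cases(toy), ]

# Base-R counterpart of dplyr::distinct(): drop exact duplicate rows
toy_clean <- toy_complete[!duplicated(toy_complete), ]

# Record how many rows each step removed
rows_removed <- c(
  missing   = nrow(toy) - nrow(toy_complete),
  duplicate = nrow(toy_complete) - nrow(toy_clean)
)
print(na_per_column)
print(rows_removed)
```

Keeping a per-step count like rows_removed makes it easier to confirm that the final number of rows matches what the NA and duplicate checks lead one to expect.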

# extracting positions of NA values
print ("Row and Col positions of NA values")
## [1] "Row and Col positions of NA values"
which(is.na(data_uncleaned), arr.ind=TRUE)
##       row col
##  [1,] 943   3
##  [2,] 977   3
##  [3,] 943   4
##  [4,] 977   4
##  [5,] 943   5
##  [6,] 977   5
##  [7,] 943   6
##  [8,] 977   6
##  [9,] 943   7
## [10,] 943   8
## [11,] 943   9
## [12,] 943  10
## [13,] 977  11
## [14,] 977  12
## [15,] 977  13
## [16,] 943  14
## [17,] 977  14
## [18,] 943  15
## [19,] 977  15
## [20,] 943  16
## [21,] 977  16
## [22,] 943  17
## [23,] 977  17
## [24,] 943  18
## [25,] 977  18
## [26,] 943  19
## [27,] 977  19
## [28,] 943  20
## [29,] 943  21
## [30,] 943  22
## [31,] 943  23
## [32,] 943  24
## [33,] 977  24
## [34,] 943  25
## [35,] 977  25
# Identifying Duplicate Data rows
print (" Sum of duplicate row")
## [1] " Sum of duplicate row"
sum(duplicated(data_uncleaned))
## [1] 3
# Remove NA values
data_uncleaned_2<- data_uncleaned %>% drop_na()

# Remove duplicates from data frame:
data<- data_uncleaned_2 %>% distinct()
# let see if we still have duplicate rows and NA values
# extracting positions of NA values
print ("Row and Col positions of NA values")
## [1] "Row and Col positions of NA values"
which(is.na(data), arr.ind=TRUE)
##      row col
# Identifying Duplicate Data rows
print ("Row and Col sum of duplicate row")
## [1] "Row and Col sum of duplicate row"
sum(duplicated(data))
## [1] 0
#change level of lung Cancer to numeric data
data$Level<-c(Low=1,Medium=2,High=3)[data$Level]
head(data)
##   index Patient.Id Age Gender Air.Pollution Alcohol.use Dust.Allergy
## 1     0         P1  33      1             2           4            5
## 2     1        P10  17      1             3           1            5
## 3     2       P100  35      1             4           5            6
## 4     3      P1000  37      1             7           7            7
## 5     4       P101  46      1             6           8            7
## 6     5       P102  35      1             4           5            6
##   OccuPational.Hazards Genetic.Risk chronic.Lung.Disease Balanced.Diet Obesity
## 1                    4            3                    2             2       4
## 2                    3            4                    2             2       2
## 3                    5            5                    4             6       7
## 4                    7            6                    7             7       7
## 5                    7            7                    6             7       7
## 6                    5            5                    4             6       7
##   Smoking Passive.Smoker Chest.Pain Coughing.of.Blood Fatigue Weight.Loss
## 1       3              2          2                 4       3           4
## 2       2              4          2                 3       1           3
## 3       2              3          4                 8       8           7
## 4       7              7          7                 8       4           2
## 5       8              7          7                 9       3           2
## 6       2              3          4                 8       8           7
##   Shortness.of.Breath Wheezing Swallowing.Difficulty Clubbing.of.Finger.Nails
## 1                   2        2                     3                        1
## 2                   7        8                     6                        2
## 3                   9        2                     1                        4
## 4                   3        1                     4                        5
## 5                   4        1                     4                        2
## 6                   9        2                     1                        4
##   Frequent.Cold Dry.Cough Snoring Level
## 1             2         3       4     1
## 2             1         7       2     2
## 3             6         7       2     3
## 4             6         7       5     3
## 5             4         2       3     3
## 6             6         7       2     3
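
An alternative way to express the Low/Medium/High recoding is an ordered factor, sketched below on the six Level values visible in the first head() output. This is only an illustration of the technique, not the method used in this project; the misspelled label is a hypothetical example.

```r
# First six Level values as they appear in the raw head() output
level_raw <- c("Low", "Medium", "High", "High", "High", "High")

# Declare the allowed labels and their order explicitly
level_fac <- factor(level_raw, levels = c("Low", "Medium", "High"), ordered = TRUE)

# Integer codes follow the declared order: Low = 1, Medium = 2, High = 3
level_num <- as.integer(level_fac)
print(level_num)

# A label outside the declared set (hypothetical typo) becomes NA, which is easy to detect
unexpected <- factor(c("Low", "Hgih"), levels = c("Low", "Medium", "High"), ordered = TRUE)
print(is.na(unexpected))
```

The ordered factor keeps the readable labels for plotting while the integer codes remain available for numeric summaries.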

Exploratory Data Analysis (EDA)

Exploratory data analysis (EDA) frequently employs data visualisation techniques to examine and summarise large data sets. It makes it simpler to find patterns, identify anomalies, test hypotheses, or verify assumptions by determining how best to manipulate the data sources to get the answers needed. In this project, the EDA stage is conducted to determine the dimensions of the data frame, obtain its structure and obtain the overall summary of the data. The overall summary shown below gives the minimum, 1st quartile, median, mean, 3rd quartile and maximum values for each numeric variable in the data set, which gives a detailed picture of the data.

Correlation analysis is a statistical technique for measuring the connection between two variables and the strength of their linear relationship, that is, how closely a change in one variable is accompanied by a change in the other. A correlation matrix, on the other hand, is simply a table that displays the correlation coefficients for pairs of variables and applies only to numerical columns. Several statistical visualisation diagrams have been plotted based on the cleaned data produced in the earlier stages. From the ggplot histogram and density plot shown below, we can see that patients aged around 30-40 have a higher density than the other patients, which indicates that the sample participants are mostly 30-40 years old.
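
To make the correlation coefficient concrete, the sketch below computes Pearson's r by hand and compares it with cor(). It uses only the Smoking and Level values from the first six rows of the head() output, so the resulting number illustrates the formula and says nothing about the full data set.

```r
# Smoking and Level values from the first six rows shown in the head() output
smoking <- c(3, 2, 2, 7, 8, 2)
level   <- c(1, 2, 3, 3, 3, 3)

# Pearson's r: sum of products of the centred values,
# divided by the square root of the product of their sums of squares
dx <- smoking - mean(smoking)
dy <- level - mean(level)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# cor() uses Pearson's method by default, so both values should agree
r_builtin <- cor(smoking, level)
print(c(manual = r_manual, builtin = r_builtin))
```

A value near 1 or -1 indicates a strong linear relationship and a value near 0 indicates a weak one; in either case the coefficient describes association only.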

#To determine the dimensions of the data frame
dim(data)
## [1] 998  26
#To get the summary of the data frame
str(data)
## 'data.frame':    998 obs. of  26 variables:
##  $ index                   : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Patient.Id              : chr  "P1" "P10" "P100" "P1000" ...
##  $ Age                     : int  33 17 35 37 46 35 52 28 35 46 ...
##  $ Gender                  : int  1 1 1 1 1 1 2 2 2 1 ...
##  $ Air.Pollution           : int  2 3 4 7 6 4 2 3 4 2 ...
##  $ Alcohol.use             : int  4 1 5 7 8 5 4 1 5 3 ...
##  $ Dust.Allergy            : int  5 5 6 7 7 6 5 4 6 4 ...
##  $ OccuPational.Hazards    : int  4 3 5 7 7 5 4 3 5 2 ...
##  $ Genetic.Risk            : int  3 4 5 6 7 5 3 2 6 4 ...
##  $ chronic.Lung.Disease    : int  2 2 4 7 6 4 2 3 5 3 ...
##  $ Balanced.Diet           : int  2 2 6 7 7 6 2 4 5 3 ...
##  $ Obesity                 : int  4 2 7 7 7 7 4 3 5 3 ...
##  $ Smoking                 : int  3 2 2 7 8 2 3 1 6 2 ...
##  $ Passive.Smoker          : int  2 4 3 7 7 3 2 4 6 3 ...
##  $ Chest.Pain              : int  2 2 4 7 7 4 2 3 6 4 ...
##  $ Coughing.of.Blood       : int  4 3 8 8 9 8 4 1 5 4 ...
##  $ Fatigue                 : int  3 1 8 4 3 8 3 3 1 1 ...
##  $ Weight.Loss             : int  4 3 7 2 2 7 4 2 4 2 ...
##  $ Shortness.of.Breath     : int  2 7 9 3 4 9 2 2 3 4 ...
##  $ Wheezing                : int  2 8 2 1 1 2 2 4 2 6 ...
##  $ Swallowing.Difficulty   : int  3 6 1 4 4 1 3 2 4 5 ...
##  $ Clubbing.of.Finger.Nails: int  1 2 4 5 2 4 1 2 6 4 ...
##  $ Frequent.Cold           : int  2 1 6 6 4 6 2 3 2 2 ...
##  $ Dry.Cough               : int  3 7 7 7 2 7 3 4 4 1 ...
##  $ Snoring                 : int  4 2 2 5 3 2 4 3 1 5 ...
##  $ Level                   : num  1 2 3 3 3 3 1 1 2 2 ...
#summary of the data
summary(data)
##      index        Patient.Id             Age            Gender     
##  Min.   :  0.0   Length:998         Min.   :14.00   Min.   :1.000  
##  1st Qu.:251.2   Class :character   1st Qu.:28.00   1st Qu.:1.000  
##  Median :500.5   Mode  :character   Median :36.00   Median :1.000  
##  Mean   :500.3                      Mean   :37.18   Mean   :1.403  
##  3rd Qu.:749.8                      3rd Qu.:45.00   3rd Qu.:2.000  
##  Max.   :999.0                      Max.   :73.00   Max.   :2.000  
##  Air.Pollution    Alcohol.use     Dust.Allergy   OccuPational.Hazards
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000       
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:4.000   1st Qu.:3.000       
##  Median :3.000   Median :5.000   Median :6.000   Median :5.000       
##  Mean   :3.839   Mean   :4.565   Mean   :5.165   Mean   :4.843       
##  3rd Qu.:6.000   3rd Qu.:7.000   3rd Qu.:7.000   3rd Qu.:7.000       
##  Max.   :8.000   Max.   :8.000   Max.   :8.000   Max.   :8.000       
##   Genetic.Risk   chronic.Lung.Disease Balanced.Diet      Obesity     
##  Min.   :1.000   Min.   :1.000        Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:3.000        1st Qu.:2.000   1st Qu.:3.000  
##  Median :5.000   Median :4.000        Median :4.000   Median :4.000  
##  Mean   :4.581   Mean   :4.383        Mean   :4.491   Mean   :4.464  
##  3rd Qu.:7.000   3rd Qu.:6.000        3rd Qu.:7.000   3rd Qu.:7.000  
##  Max.   :7.000   Max.   :7.000        Max.   :7.000   Max.   :7.000  
##     Smoking      Passive.Smoker    Chest.Pain   Coughing.of.Blood
##  Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000    
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.00   1st Qu.:3.000    
##  Median :3.000   Median :4.000   Median :4.00   Median :4.000    
##  Mean   :3.952   Mean   :4.198   Mean   :4.44   Mean   :4.858    
##  3rd Qu.:7.000   3rd Qu.:7.000   3rd Qu.:7.00   3rd Qu.:7.000    
##  Max.   :8.000   Max.   :8.000   Max.   :9.00   Max.   :9.000    
##     Fatigue       Weight.Loss    Shortness.of.Breath    Wheezing    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000       Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000       1st Qu.:2.000  
##  Median :3.000   Median :3.000   Median :4.000       Median :4.000  
##  Mean   :3.852   Mean   :3.851   Mean   :4.233       Mean   :3.778  
##  3rd Qu.:5.000   3rd Qu.:6.000   3rd Qu.:6.000       3rd Qu.:5.000  
##  Max.   :9.000   Max.   :8.000   Max.   :9.000       Max.   :8.000  
##  Swallowing.Difficulty Clubbing.of.Finger.Nails Frequent.Cold     Dry.Cough    
##  Min.   :1.000         Min.   :1.000            Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000         1st Qu.:2.000            1st Qu.:2.000   1st Qu.:2.000  
##  Median :4.000         Median :4.000            Median :3.000   Median :4.000  
##  Mean   :3.747         Mean   :3.923            Mean   :3.531   Mean   :3.849  
##  3rd Qu.:5.000         3rd Qu.:5.000            3rd Qu.:5.000   3rd Qu.:6.000  
##  Max.   :8.000         Max.   :9.000            Max.   :7.000   Max.   :7.000  
##     Snoring          Level      
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:1.000  
##  Median :3.000   Median :2.000  
##  Mean   :2.926   Mean   :2.061  
##  3rd Qu.:4.000   3rd Qu.:3.000  
##  Max.   :7.000   Max.   :3.000
#Correlation Analysis is statistical method that is used to discover if 
#there is a relationship between two variables/datasets, and how strong that relationship may be
# correlation matrix (only apply to numerical column)

corr_data <- cor(data[, unlist(lapply(data, is.numeric))]) 
round(corr_data, 2)
##                          index   Age Gender Air.Pollution Alcohol.use
## index                     1.00  0.00  -0.03          0.05        0.04
## Age                       0.00  1.00  -0.20          0.10        0.15
## Gender                   -0.03 -0.20   1.00         -0.25       -0.23
## Air.Pollution             0.05  0.10  -0.25          1.00        0.75
## Alcohol.use               0.04  0.15  -0.23          0.75        1.00
## Dust.Allergy              0.04  0.03  -0.20          0.64        0.82
## OccuPational.Hazards      0.03  0.06  -0.19          0.61        0.88
## Genetic.Risk              0.03  0.07  -0.22          0.71        0.88
## chronic.Lung.Disease      0.02  0.13  -0.21          0.63        0.76
## Balanced.Diet             0.03  0.00  -0.10          0.52        0.65
## Obesity                   0.05  0.03  -0.12          0.60        0.67
## Smoking                   0.02  0.08  -0.21          0.48        0.55
## Passive.Smoker            0.02  0.00  -0.19          0.61        0.59
## Chest.Pain                0.02  0.01  -0.22          0.59        0.72
## Coughing.of.Blood         0.05  0.05  -0.15          0.61        0.67
## Fatigue                   0.05  0.09  -0.12          0.21        0.24
## Weight.Loss               0.03  0.11  -0.06          0.26        0.21
## Shortness.of.Breath       0.03  0.03  -0.04          0.27        0.44
## Wheezing                  0.01 -0.09  -0.08          0.06        0.18
## Swallowing.Difficulty     0.00 -0.10  -0.06         -0.08       -0.11
## Clubbing.of.Finger.Nails  0.02  0.04  -0.03          0.24        0.42
## Frequent.Cold             0.05 -0.01   0.00          0.17        0.18
## Dry.Cough                 0.01  0.01  -0.12          0.26        0.21
## Snoring                   0.00  0.00  -0.18         -0.02        0.12
## Level                     0.06  0.06  -0.16          0.64        0.72
##                          Dust.Allergy OccuPational.Hazards Genetic.Risk
## index                            0.04                 0.03         0.03
## Age                              0.03                 0.06         0.07
## Gender                          -0.20                -0.19        -0.22
## Air.Pollution                    0.64                 0.61         0.71
## Alcohol.use                      0.82                 0.88         0.88
## Dust.Allergy                     1.00                 0.84         0.79
## OccuPational.Hazards             0.84                 1.00         0.89
## Genetic.Risk                     0.79                 0.89         1.00
## chronic.Lung.Disease             0.62                 0.86         0.84
## Balanced.Diet                    0.65                 0.69         0.68
## Obesity                          0.70                 0.72         0.73
## Smoking                          0.36                 0.50         0.54
## Passive.Smoker                   0.56                 0.55         0.61
## Chest.Pain                       0.64                 0.78         0.83
## Coughing.of.Blood                0.63                 0.65         0.63
## Fatigue                          0.33                 0.27         0.23
## Weight.Loss                      0.32                 0.18         0.27
## Shortness.of.Breath              0.52                 0.37         0.46
## Wheezing                         0.31                 0.18         0.21
## Swallowing.Difficulty            0.03                 0.00        -0.06
## Clubbing.of.Finger.Nails         0.35                 0.37         0.36
## Frequent.Cold                    0.22                 0.08         0.09
## Dry.Cough                        0.30                 0.16         0.19
## Snoring                          0.05                 0.02        -0.06
## Level                            0.71                 0.67         0.70
##                          chronic.Lung.Disease Balanced.Diet Obesity Smoking
## index                                    0.02          0.03    0.05    0.02
## Age                                      0.13          0.00    0.03    0.08
## Gender                                  -0.21         -0.10   -0.12   -0.21
## Air.Pollution                            0.63          0.52    0.60    0.48
## Alcohol.use                              0.76          0.65    0.67    0.55
## Dust.Allergy                             0.62          0.65    0.70    0.36
## OccuPational.Hazards                     0.86          0.69    0.72    0.50
## Genetic.Risk                             0.84          0.68    0.73    0.54
## chronic.Lung.Disease                     1.00          0.62    0.60    0.58
## Balanced.Diet                            0.62          1.00    0.71    0.65
## Obesity                                  0.60          0.71    1.00    0.49
## Smoking                                  0.58          0.65    0.49    1.00
## Passive.Smoker                           0.57          0.73    0.68    0.76
## Chest.Pain                               0.78          0.80    0.67    0.65
## Coughing.of.Blood                        0.60          0.74    0.81    0.56
## Fatigue                                  0.25          0.40    0.55    0.20
## Weight.Loss                              0.11         -0.01    0.31   -0.21
## Shortness.of.Breath                      0.18          0.34    0.41   -0.02
## Wheezing                                 0.06          0.07    0.10   -0.05
## Swallowing.Difficulty                    0.01          0.05    0.13    0.24
## Clubbing.of.Finger.Nails                 0.30          0.04    0.15   -0.04
## Frequent.Cold                            0.03          0.26    0.29    0.04
## Dry.Cough                                0.12          0.33    0.20    0.01
## Snoring                                  0.04          0.15    0.04    0.19
## Level                                    0.61          0.71    0.83    0.52
##                          Passive.Smoker Chest.Pain Coughing.of.Blood Fatigue
## index                              0.02       0.02              0.05    0.05
## Age                                0.00       0.01              0.05    0.09
## Gender                            -0.19      -0.22             -0.15   -0.12
## Air.Pollution                      0.61       0.59              0.61    0.21
## Alcohol.use                        0.59       0.72              0.67    0.24
## Dust.Allergy                       0.56       0.64              0.63    0.33
## OccuPational.Hazards               0.55       0.78              0.65    0.27
## Genetic.Risk                       0.61       0.83              0.63    0.23
## chronic.Lung.Disease               0.57       0.78              0.60    0.25
## Balanced.Diet                      0.73       0.80              0.74    0.40
## Obesity                            0.68       0.67              0.81    0.55
## Smoking                            0.76       0.65              0.56    0.20
## Passive.Smoker                     1.00       0.70              0.64    0.38
## Chest.Pain                         0.70       1.00              0.71    0.25
## Coughing.of.Blood                  0.64       0.71              1.00    0.48
## Fatigue                            0.38       0.25              0.48    1.00
## Weight.Loss                        0.06       0.00              0.10    0.47
## Shortness.of.Breath                0.06       0.24              0.32    0.40
## Wheezing                           0.20       0.11             -0.08    0.18
## Swallowing.Difficulty              0.35       0.07              0.09    0.15
## Clubbing.of.Finger.Nails          -0.04       0.08             -0.07    0.04
## Frequent.Cold                      0.11       0.04              0.24    0.41
## Dry.Cough                          0.12       0.14              0.15    0.27
## Snoring                            0.25       0.14              0.09    0.23
## Level                              0.71       0.65              0.78    0.62
##                          Weight.Loss Shortness.of.Breath Wheezing
## index                           0.03                0.03     0.01
## Age                             0.11                0.03    -0.09
## Gender                         -0.06               -0.04    -0.08
## Air.Pollution                   0.26                0.27     0.06
## Alcohol.use                     0.21                0.44     0.18
## Dust.Allergy                    0.32                0.52     0.31
## OccuPational.Hazards            0.18                0.37     0.18
## Genetic.Risk                    0.27                0.46     0.21
## chronic.Lung.Disease            0.11                0.18     0.06
## Balanced.Diet                  -0.01                0.34     0.07
## Obesity                         0.31                0.41     0.10
## Smoking                        -0.21               -0.02    -0.05
## Passive.Smoker                  0.06                0.06     0.20
## Chest.Pain                      0.00                0.24     0.11
## Coughing.of.Blood               0.10                0.32    -0.08
## Fatigue                         0.47                0.40     0.18
## Weight.Loss                     1.00                0.57     0.33
## Shortness.of.Breath             0.57                1.00     0.21
## Wheezing                        0.33                0.21     1.00
## Swallowing.Difficulty           0.05               -0.20     0.39
## Clubbing.of.Finger.Nails        0.38                0.48     0.34
## Frequent.Cold                   0.16                0.35     0.10
## Dry.Cough                       0.19                0.49     0.06
## Snoring                        -0.19               -0.16     0.12
## Level                           0.35                0.50     0.24
##                          Swallowing.Difficulty Clubbing.of.Finger.Nails
## index                                     0.00                     0.02
## Age                                      -0.10                     0.04
## Gender                                   -0.06                    -0.03
## Air.Pollution                            -0.08                     0.24
## Alcohol.use                              -0.11                     0.42
## Dust.Allergy                              0.03                     0.35
## OccuPational.Hazards                      0.00                     0.37
## Genetic.Risk                             -0.06                     0.36
## chronic.Lung.Disease                      0.01                     0.30
## Balanced.Diet                             0.05                     0.04
## Obesity                                   0.13                     0.15
## Smoking                                   0.24                    -0.04
## Passive.Smoker                            0.35                    -0.04
## Chest.Pain                                0.07                     0.08
## Coughing.of.Blood                         0.09                    -0.07
## Fatigue                                   0.15                     0.04
## Weight.Loss                               0.05                     0.38
## Shortness.of.Breath                      -0.20                     0.48
## Wheezing                                  0.39                     0.34
## Swallowing.Difficulty                     1.00                    -0.12
## Clubbing.of.Finger.Nails                 -0.12                     1.00
## Frequent.Cold                             0.13                     0.24
## Dry.Cough                                -0.05                     0.31
## Snoring                                   0.21                    -0.02
## Level                                     0.25                     0.28
##                          Frequent.Cold Dry.Cough Snoring Level
## index                             0.05      0.01    0.00  0.06
## Age                              -0.01      0.01    0.00  0.06
## Gender                            0.00     -0.12   -0.18 -0.16
## Air.Pollution                     0.17      0.26   -0.02  0.64
## Alcohol.use                       0.18      0.21    0.12  0.72
## Dust.Allergy                      0.22      0.30    0.05  0.71
## OccuPational.Hazards              0.08      0.16    0.02  0.67
## Genetic.Risk                      0.09      0.19   -0.06  0.70
## chronic.Lung.Disease              0.03      0.12    0.04  0.61
## Balanced.Diet                     0.26      0.33    0.15  0.71
## Obesity                           0.29      0.20    0.04  0.83
## Smoking                           0.04      0.01    0.19  0.52
## Passive.Smoker                    0.11      0.12    0.25  0.71
## Chest.Pain                        0.04      0.14    0.14  0.65
## Coughing.of.Blood                 0.24      0.15    0.09  0.78
## Fatigue                           0.41      0.27    0.23  0.62
## Weight.Loss                       0.16      0.19   -0.19  0.35
## Shortness.of.Breath               0.35      0.49   -0.16  0.50
## Wheezing                          0.10      0.06    0.12  0.24
## Swallowing.Difficulty             0.13     -0.05    0.21  0.25
## Clubbing.of.Finger.Nails          0.24      0.31   -0.02  0.28
## Frequent.Cold                     1.00      0.51    0.34  0.44
## Dry.Cough                         0.51      1.00    0.18  0.37
## Snoring                           0.34      0.18    1.00  0.29
## Level                             0.44      0.37    0.29  1.00
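
Because the matrix is large, it can help to rank variables by their correlation with Level. The sketch below copies the rounded Level column for ten columns from the output above; treating these ten as the risk factors (as opposed to symptoms) is an assumption made for this illustration.

```r
# Correlations with Level, copied from the rounded matrix above
# (which columns count as risk factors is an assumption of this sketch)
cor_with_level <- c(
  Air.Pollution        = 0.64,
  Alcohol.use          = 0.72,
  Dust.Allergy         = 0.71,
  OccuPational.Hazards = 0.67,
  Genetic.Risk         = 0.70,
  chronic.Lung.Disease = 0.61,
  Balanced.Diet        = 0.71,
  Obesity              = 0.83,
  Smoking              = 0.52,
  Passive.Smoker       = 0.71
)

# Strongest positive correlation first
ranked <- sort(cor_with_level, decreasing = TRUE)
print(ranked)
```

With the full matrix in memory, sort(corr_data[, "Level"], decreasing = TRUE) should give the same kind of ranking without copying values by hand.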

The boxplot below the histogram shows that the median age of the patients lies in the 30-40 range. The scatter plot distinguishes 3 levels of severity for the disease: Level 1, Level 2 and Level 3. From the scatter plot it can be deduced that the level of severity is highest, at Level 3, among patients aged 30-40 and lowest among patients aged below 30 years.

library(ggplot2)
# Histogram of Age on the density scale with a density curve on top
# (after_stat() and linewidth replace the notation deprecated in ggplot2 3.4.0)
ggplot(data, aes(x = Age)) + 
  geom_histogram(aes(y = after_stat(density)), binwidth = 5,
                 col = "black", fill = "green", alpha = .2) +
  geom_density(linewidth = 1.0, colour = "red")

Interpretation: The data mostly consists of participants aged 30-40.

# A really basic boxplot.
ggplot(data, aes( y=Age)) + 
    geom_boxplot(fill="slateblue", alpha=0.2) + 
    ylab("Age")

Interpretation: Most participants are aged 30-40, and the median line of the box also lies in that range. Ages around 70 appear as outliers.
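The reading of the boxplot can be cross-checked numerically. The following is a minimal sketch, assuming a numeric vector of ages; the `age` values are made up for illustration and are not taken from the dataset. It uses base R's `boxplot.stats()`, which applies the same 1.5 x IQR rule that a boxplot uses to flag outliers.

```r
# Hypothetical ages standing in for data$Age (not the real dataset)
age <- c(24, 27, 33, 35, 35, 37, 38, 39, 42, 44, 46, 52, 73)

# A boxplot draws the median, not the mean, so report both for comparison
age_mean   <- mean(age)
age_median <- median(age)

# boxplot.stats() returns the points beyond the whiskers in $out
age_outliers <- boxplot.stats(age)$out

print(c(mean = age_mean, median = age_median))
print(age_outliers)
```

On the real data, replacing `age` with `data$Age` would give the values behind the boxplot above.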

#install.packages("ggbeeswarm")
library(ggbeeswarm)

# Beeswarm plot in ggplot2
ggplot(data, aes(x = Level, y = Age, color = Level)) +
  geom_beeswarm(cex = 1)

Interpretation: The graph is a beeswarm (scatter) plot of participants' ages against their level of lung cancer prediction (1: Low, 3: High).

The four boxplot graphs below summarise the variables in the cleaned dataset by the values they take. These variables are a compilation of risk factors and symptoms of lung cancer.

data_boxplot <- data[,5:10]
ggplot(stack(data_boxplot), aes(x = ind, y = values)) +
  geom_boxplot(fill="slateblue", alpha=0.2)

The first boxplot covers air pollution, dust allergy, alcohol use, occupational hazard and chronic lung disease. In this graph, dust allergy has the highest median value and air pollution has the lowest.

data_boxplot <- data[,11:15]
ggplot(stack(data_boxplot), aes(x = ind, y = values)) +
  geom_boxplot(fill="slateblue", alpha=0.2)

The second boxplot covers balanced diet, obesity, smoking, passive smoker and chest pain. In this graph, smoking has the lowest median, while the other categories share the same median value.

data_boxplot <- data[,16:20]
ggplot(stack(data_boxplot), aes(x = ind, y = values)) +
  geom_boxplot(fill="slateblue", alpha=0.2)

The third boxplot covers coughing of blood, fatigue, weight loss, shortness of breath and wheezing. In this graph, fatigue and weight loss have the lowest medians, while the other three categories share the same median value.

data_boxplot <- data[,21:26]
ggplot(stack(data_boxplot), aes(x = ind, y = values)) +
  geom_boxplot(fill="slateblue", alpha=0.2)

The fourth boxplot covers difficulty swallowing, clubbing of fingernails, frequent cold, dry cough and snoring. In this graph, difficulty swallowing, clubbing of fingernails and dry cough have higher median values than the other two categories.
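The median comparisons read off the four boxplots can also be computed directly. This is a minimal sketch on a made-up data frame; `toy` and its values are hypothetical, and only the column names mirror the dataset.

```r
# Hypothetical stand-in for a few columns of the cleaned dataset
toy <- data.frame(
  Air.Pollution = c(2, 3, 4, 6, 6),
  Dust.Allergy  = c(5, 6, 7, 7, 8),
  Smoking       = c(1, 2, 3, 7, 8)
)

# Median of every column, sorted so the highest and lowest are easy to spot
medians <- sort(sapply(toy, median), decreasing = TRUE)
print(medians)
```

Applied to the real data, the same call on `data[, 5:10]` (and the other column ranges used above) would list the medians behind each boxplot.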

count_level_barchart<- ggplot(data, aes(x = Level))+ 
  geom_bar( stat = "count", fill="slateblue", alpha=0.4)+ 
  ggtitle("Plot of Count by Level of Lung Cancer Prediction") 
count_level_barchart+ coord_flip()

Interpretation: The graph shows the number of participants at each level of lung cancer prediction (1: Low, 3: High).

In the bar chart above, participants of various ages are counted by their predicted level of lung cancer. Level 3, the most severe level, has the highest count among the sampled participants, followed by Level 2 and lastly Level 1. Based on these counts, a pie chart was created to display the percentage for each level of severity.

count_level_piechart <- table(data$Level)
pct <- round(100 * count_level_piechart / sum(count_level_piechart))
# Label each slice with its level name and rounded percentage
pie(count_level_piechart,
    labels = paste(names(count_level_piechart), pct, "%"),
    main = "Percentage by Level of Lung Cancer Prediction")

Interpretation: The graph shows the percentage of participants at each level of lung cancer prediction. The High level has the largest percentage, followed by Medium and Low.
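The percentages on the pie chart come from dividing each level's count by the total. A minimal sketch of that arithmetic with base R's `prop.table()` follows; the `level` vector is hypothetical and much smaller than the real dataset.

```r
# Hypothetical level labels (not the real dataset)
level <- c("High", "High", "High", "High",
           "Medium", "Medium", "Medium",
           "Low", "Low", "Low")

counts <- table(level)

# prop.table() turns counts into shares of the total; scale to percent
pct <- round(100 * prop.table(counts), 1)
print(pct)
```

Rounded percentages do not always add up to exactly 100, which is worth keeping in mind when reading the slice labels.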

test<- table(data$Level, data$Gender)
barplot(t(as.matrix(test)),beside=TRUE, col = 1:ncol(test),
        main = "Bar Plot Level vs Gender Count")
legend("topright",
       legend = colnames(test),
       pch = 15,
       col = 1:ncol(test))

Interpretation: The graph shows the count of participants by level and gender. Black: Male, Pink: Female. 1: Low, 2: Medium, 3: High level of lung cancer.

The bar chart above visualises the relationship between the gender of participants and the level of severity of lung cancer. At Level 1, the male and female counts are closer to each other than at the other levels. Level 2 has a higher count of male participants than female participants, while Level 3 has the highest count of male participants and the lowest count of female participants.
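Because the groups differ in size, it can help to compare shares within each level rather than raw counts. The sketch below uses made-up data and assumes the coding 1 = male and 2 = female, which matches the legend order above but is not confirmed by the source.

```r
# Hypothetical data; Gender coding (1 = male, 2 = female) is an assumption
toy <- data.frame(
  Level  = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
  Gender = c(1, 2, 1, 1, 2, 1, 1, 1, 1, 2)
)

tab <- table(Level = toy$Level, Gender = toy$Gender)

# margin = 1 normalises each row, so every level sums to 1
share <- prop.table(tab, margin = 1)
print(round(share, 2))
```

Each row then reads as the male and female share within one severity level.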

Data Analysis

Correlation coefficients are used to measure the strength of the linear relationship between two variables. A correlation coefficient greater than zero indicates a positive relationship, while a value less than zero signifies a negative (inverse) relationship. A value of zero indicates no linear relationship between the two variables being compared.
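A small illustration of the three cases, using base R's `cor()` (Pearson by default). The vectors are invented for this sketch; the last one shows that a coefficient near zero rules out only a linear relationship, not every relationship.

```r
x <- 1:10

y_pos <- 2 * x + 1      # rises with x: coefficient close to +1
y_neg <- 20 - 3 * x     # falls as x rises: coefficient close to -1
y_sym <- (x - 5.5)^2    # depends on x, but symmetrically, not linearly

r_pos <- cor(x, y_pos)
r_neg <- cor(x, y_neg)
r_sym <- cor(x, y_sym)

print(c(positive = r_pos, negative = r_neg, symmetric = r_sym))
```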

# heatmap for correlation matrix
#install.packages("pheatmap")        
library("pheatmap")
pheatmap(corr_data, main = "Heatmap for correlation matrix", cellwidth=12, cellheight=12)

The heatmap for the correlation matrix shows the correlation between all variables in the dataset. A block of cells in the center of the heatmap shades from orange to red, indicating that the variables in that block are highly correlated with one another. The analysis therefore begins with the variables within the center box: smoking, passive smoker, balanced diet, obesity, coughing of blood, chronic lung disease, chest pain, air pollution, dust allergy, alcohol use, occupational hazards and genetic risk.

For the 12 variables within the center box mentioned above, 12 individual bar graphs were plotted to study the relationship between each variable and the likelihood of lung cancer.
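All twelve plots follow the same pattern, so they could also be drawn from one helper called in a loop. The following is a sketch only, not the code used in this report: `plot_level_vs`, `df` and `var` are names introduced here for illustration, and `toy` is made-up data.

```r
# Skip writing a plot file when run as a script; remove this line to see the plots
if (!interactive()) pdf(NULL)

# Draw one grouped bar plot of Level against a single variable.
# `df` is a data frame with a Level column, `var` a column name (both hypothetical).
plot_level_vs <- function(df, var) {
  counts <- table(df$Level, df[[var]])
  barplot(t(as.matrix(counts)), beside = TRUE, col = seq_len(ncol(counts)),
          main = paste("Bar Plot Level vs", var, "Count"))
  legend("topright", cex = 0.7, legend = colnames(counts),
         pch = 15, col = seq_len(ncol(counts)))
  invisible(counts)  # return the table so it can be inspected
}

# Hypothetical data with the same column names as the dataset
toy <- data.frame(
  Level   = c(1, 1, 2, 3, 3, 3),
  Smoking = c(1, 2, 2, 7, 8, 8),
  Obesity = c(2, 2, 4, 7, 7, 7)
)

for (v in c("Smoking", "Obesity")) plot_level_vs(toy, v)
counts_smoking <- plot_level_vs(toy, "Smoking")
```

With the real data, the loop would run over the twelve column names listed above.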

test<- table(data$Level, data$Smoking)
barplot(t(as.matrix(test)),beside=TRUE, col = 1:ncol(test),
        main = "Bar Plot Level vs Smoking Count")
legend("topright",
       cex = 0.7,
       legend = colnames(test),
       pch = 15,
       col = 1:ncol(test))

The first bar graph displays the relationship between the severity of a person's smoking habit and lung cancer. From the graph, a person with a heavier smoking habit (a higher smoking score) has a higher chance of having lung cancer.

test<- table(data$Level, data$Passive.Smoker)
barplot(t(as.matrix(test)),beside=TRUE, col = 1:ncol(test),
        main = "Bar Plot Level vs Passive.Smoker Count")
legend("topright",
       cex = 0.7,
       legend = colnames(test),
       pch = 15,
       col = 1:ncol(test))

The second bar graph displays the relationship between passive smoking and lung cancer. From the graph, a passive smoker with heavy exposure to smoke (a higher passive smoker score) has a higher chance of having lung cancer.

test<- table(data$Level, data$Balanced.Diet)
barplot(t(as.matrix(test)),beside=TRUE, col = 1:ncol(test),
        main = "Bar Plot Level vs Balanced.Diet Count")
legend("topright",
       cex = 0.7,
       legend = colnames(test),
       pch = 15,
       col = 1:ncol(test))

The third bar graph displays the relationship between a balanced diet in a person's lifestyle and the probability of having lung cancer. From the graph we can deduce that diet and lifestyle play an important role in a person's likelihood of having lung cancer: a person with an imbalanced diet is more likely to develop the disease.

test<- table(data$Level, data$Obesity)
barplot(t(as.matrix(test)),beside=TRUE, col = 1:ncol(test),
        main = "Bar Plot Level vs Obesity Count")
legend("topright",
       cex = 0.7,
       legend = colnames(test),
       pch = 15,
       col = 1:ncol(test))

The fourth bar graph displays the relationship between being overweight or obese and the probability of a person having lung cancer. From the graph we can deduce that a person who is overweight or obese has a higher chance of suffering from lung cancer. This is consistent with the previous bar graph, which supports the claim that a balanced diet and healthy lifestyle play a significant role in preventing such diseases.

test<- table(data$Level, data$Coughing.of.Blood)
barplot(t(as.matrix(test)),beside=TRUE, col = 1:ncol(test),
        main = "Bar Plot Level vs Coughing.of.Blood Count")
legend("topright",
       cex = 0.7,
       legend = colnames(test),
       pch = 15,
       col = 1:ncol(test))

The fifth bar graph displays the relationship between the symptom of coughing up blood and the probability of a person having lung cancer. From the graph, a person who coughs up blood has a higher chance of suffering from lung cancer, as coughing up blood is a recognised symptom of lung cancer.

test<- table(data$Level, data$chronic.Lung.Disease)
barplot(t(as.matrix(test)),beside=TRUE, col = 1:ncol(test),
        main = "Bar Plot Level vs chronic.Lung.Disease Count")
legend("topright",
       cex = 0.7,
       legend = colnames(test),
       pch = 15,
       col = 1:ncol(test))

The sixth bar graph displays the relationship between prior chronic lung disease and the probability of a person having lung cancer. From the graph, a person with chronic lung disease has a higher chance of suffering from lung cancer, as chronic lung disease is a known risk factor for lung cancer.

test<- table(data$Level, data$Chest.Pain)
barplot(t(as.matrix(test)),beside=TRUE, col = 1:ncol(test),
        main = "Bar Plot Level vs Chest.Pain Count")
legend("topright",
       cex = 0.7,
       legend = colnames(test),
       pch = 15,
       col = 1:ncol(test))

The seventh bar graph displays the relationship between chest pain and the probability of a person having lung cancer. From the graph, a person who experiences chest pain has a higher chance of suffering from lung cancer, as chest pain is one of the early-stage symptoms of lung cancer.

test<- table(data$Level, data$Air.Pollution)
barplot(t(as.matrix(test)),beside=TRUE, col = 1:ncol(test),
        main = "Bar Plot Level vs Air.Pollution Count")
legend("topright",
       cex = 0.7,
       legend = colnames(test),
       pch = 15,
       col = 1:ncol(test))

The eighth bar graph displays the relationship between the severity of exposure to air pollution and the probability of a person having lung cancer. From the graph, a person who is highly exposed to polluted air has a higher chance of suffering from lung disease than a person who is not exposed. Polluted air generally contains various toxic substances and gases emitted from vehicles and factories, as well as dust, all of which contribute to respiratory disease as exposure increases.

test<- table(data$Level, data$Dust.Allergy)
barplot(t(as.matrix(test)),beside=TRUE, col = 1:ncol(test),
        main = "Bar Plot Level vs Dust.Allergy Count")
legend("topright",
       cex = 0.7,
       legend = colnames(test),
       pch = 15,
       col = 1:ncol(test))

The ninth bar graph displays the relationship between dust allergy and the probability of a person having lung cancer. From the graph, a person who is allergic to dust has a higher chance of suffering from lung disease than a person who is not allergic. Dust allergy is often an early symptom which, in the long run, can lead to other respiratory diseases because of breathing difficulty and reduced oxygen intake.

test<- table(data$Level, data$Alcohol.use)
barplot(t(as.matrix(test)),beside=TRUE, col = 1:ncol(test),
        main = "Bar Plot Level vs Alcohol.use Count")
legend("topright",
       cex = 0.7,
       legend = colnames(test),
       pch = 15,
       col = 1:ncol(test))

The tenth bar graph displays the relationship between alcohol use in an individual's lifestyle and the probability of a person having lung cancer. From the graph, a person who consumes alcohol has a higher chance of suffering from lung disease than a person who does not drink.

test<- table(data$Level, data$OccuPational.Hazards)
barplot(t(as.matrix(test)),beside=TRUE, col = 1:ncol(test),
        main = "Bar Plot Level vs OccuPational.Hazards Count")
legend("topright",
       cex = 0.7,
       legend = colnames(test),
       pch = 15,
       col = 1:ncol(test))

The eleventh bar graph displays the relationship between occupational hazards in an individual's lifestyle and the probability of a person having lung cancer. From the graph, a person who is exposed to occupational hazards has a higher chance of suffering from lung disease than a person who is not. Occupational hazards fall into many categories, but in the context of lung cancer it is understandable that a person exposed to various chemicals or harmful smoke in their working environment has a higher tendency to suffer from lung cancer, given the amount of toxic material inhaled on a daily basis.

test<- table(data$Level, data$Genetic.Risk)
barplot(t(as.matrix(test)),beside=TRUE, col = 1:ncol(test),
        main = "Bar Plot Level vs Genetic.Risk Count")
legend("topright",
       cex = 0.7,
       legend = colnames(test),
       pch = 15,
       col = 1:ncol(test))

The twelfth bar graph displays the relationship between genetic risk and the probability of a person having lung cancer. From the graph we can deduce that a person with genetic risk has a higher chance of suffering from lung disease than a person without it.

Data Interpretation/Conclusion

Problems:

From the stages and processes applied to the data set above, we are able to confirm that smoking is not the only factor that causes lung cancer. Other factors such as an imbalanced diet, obesity, exposure to air pollution and occupational hazards also play a vital role in contributing to lung cancer. The statistical and graphical diagrams shown above indicate that the sampled participants have been exposed to various environments and suffer from various conditions, such as coughing blood, dust allergy, chronic lung diseases and wheezing, which are all associated with lung cancer. Moreover, even if a person has no habit of consuming alcohol or smoking, they still have a chance of suffering from lung cancer due to second-hand smoke or their environment.

From the project conducted and the data analysed, it is deduced that smoking and passive smoking are not the only factors in lung cancer; however, they are among the highest contributors to the disease, as shown in “Bar Plot Level vs Smoking Count” and “Bar Plot Level vs Passive.Smoker Count”.

# To determine the highest correlation with the Level variable
library(dplyr)  # provides %>%, select(), arrange() and desc()

# Select the variables to include in the analysis (drop the first two columns)
df_subset <- data %>% select(-1, -2)

# Compute the correlations between the Level variable and all other variables in the subsetted data frame
cor_vector_level <- cor(df_subset, data$Level)

# Save as a data frame and sort by correlation value
Corr_Level_as_df_Combine <- as.data.frame(as.table(cor_vector_level))
Corr_Level_as_df_Combine <- subset(Corr_Level_as_df_Combine, select = -c(Var2))
Corr_Level_as_df_Combine <- arrange(Corr_Level_as_df_Combine, desc(Freq))
Corr_Level_as_df_Combine
##                        Var1        Freq
## 1                     Level  1.00000000
## 2                   Obesity  0.82734548
## 3         Coughing.of.Blood  0.78194836
## 4               Alcohol.use  0.71927828
## 5              Dust.Allergy  0.71397135
## 6             Balanced.Diet  0.70622520
## 7            Passive.Smoker  0.70500156
## 8              Genetic.Risk  0.70169424
## 9      OccuPational.Hazards  0.67413018
## 10               Chest.Pain  0.64620639
## 11            Air.Pollution  0.63561909
## 12                  Fatigue  0.62447434
## 13     chronic.Lung.Disease  0.61103911
## 14                  Smoking  0.52103646
## 15      Shortness.of.Breath  0.49623934
## 16            Frequent.Cold  0.44366809
## 17                Dry.Cough  0.37298462
## 18              Weight.Loss  0.35177365
## 19                  Snoring  0.29047223
## 20 Clubbing.of.Finger.Nails  0.28021504
## 21    Swallowing.Difficulty  0.25096655
## 22                 Wheezing  0.24414078
## 23                      Age  0.05911626
## 24                   Gender -0.16432218
# Re-sort ascending and fix the factor order so the bars are drawn in ranked order
df_corr_level <- arrange(Corr_Level_as_df_Combine, Freq)
df_corr_level$Var1 <- factor(df_corr_level$Var1, levels = df_corr_level$Var1)
plot_corr_df <- ggplot(df_corr_level,
                       aes(x = Freq, y = Var1, fill = Freq)) +
  geom_col() +
  scale_fill_gradient(low = "#353436",
                      high = "#FF7373",
                      guide = "colorbar") +
  labs(title = "Horizontal Barcharts of Correlation Matrix against Level variable",
       x = "Correlations", y = "Variables")

plot_corr_df

# To determine the highest correlation with the Smoking variable

# Select the variables to include in the analysis (drop the first two columns)
df_subset <- data %>% select(-1, -2)

# Compute the correlations between the Smoking variable and all other variables in the subsetted data frame
cor_vector_smoking <- cor(df_subset, data$Smoking)

# Save as a data frame and sort by correlation value
Corr_Smoking_as_df_Combine <- as.data.frame(as.table(cor_vector_smoking))
Corr_Smoking_as_df_Combine <- subset(Corr_Smoking_as_df_Combine, select = -c(Var2))
Corr_Smoking_as_df_Combine <- arrange(Corr_Smoking_as_df_Combine, desc(Freq))
Corr_Smoking_as_df_Combine
##                        Var1        Freq
## 1                   Smoking  1.00000000
## 2            Passive.Smoker  0.76138495
## 3                Chest.Pain  0.64781906
## 4             Balanced.Diet  0.64611037
## 5      chronic.Lung.Disease  0.57826244
## 6         Coughing.of.Blood  0.55667269
## 7               Alcohol.use  0.54700342
## 8              Genetic.Risk  0.54332507
## 9                     Level  0.52103646
## 10     OccuPational.Hazards  0.49745503
## 11                  Obesity  0.48795077
## 12            Air.Pollution  0.48302073
## 13             Dust.Allergy  0.35887350
## 14    Swallowing.Difficulty  0.23597980
## 15                  Fatigue  0.20199370
## 16                  Snoring  0.18933648
## 17                      Age  0.07517265
## 18            Frequent.Cold  0.04179185
## 19                Dry.Cough  0.01177305
## 20      Shortness.of.Breath -0.02109016
## 21 Clubbing.of.Finger.Nails -0.04112159
## 22                 Wheezing -0.04732729
## 23                   Gender -0.20847511
## 24              Weight.Loss -0.21178878
# Re-sort ascending and fix the factor order so the bars are drawn in ranked order
df_corr_smoking <- arrange(Corr_Smoking_as_df_Combine, Freq)
df_corr_smoking$Var1 <- factor(df_corr_smoking$Var1, levels = df_corr_smoking$Var1)
plot_corr_df <- ggplot(df_corr_smoking,
                       aes(x = Freq, y = Var1, fill = Freq)) +
  geom_col() +
  scale_fill_gradient(low = "#353436",
                      high = "#FF7373",
                      guide = "colorbar") +
  labs(title = "Horizontal Barcharts of Correlation Matrix against Smoking variable",
       x = "Correlations", y = "Variables")

plot_corr_df
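
The two rankings above repeat the same steps for different target variables. As a hedged alternative, the ranking can be read straight from one column of the full correlation matrix. `rank_correlations` is a hypothetical helper introduced for this sketch, and `toy` is made-up data, not the dataset.

```r
# Rank every column of `df` by its Pearson correlation with `target`.
# Both argument names are hypothetical; all columns must be numeric.
rank_correlations <- function(df, target) {
  r <- cor(df)[, target]       # one column of the correlation matrix
  sort(r, decreasing = TRUE)   # named vector, strongest positive first
}

# Hypothetical data with a few of the dataset's column names
toy <- data.frame(
  Level   = c(1, 1, 2, 2, 3, 3),
  Obesity = c(2, 3, 4, 4, 7, 7),
  Smoking = c(1, 5, 2, 6, 3, 8),
  Gender  = c(2, 2, 1, 2, 1, 1)
)

ranked <- rank_correlations(toy, "Level")
print(round(ranked, 2))
```

The target variable always appears first with a correlation of 1, as in the tables above.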

Answers

Problem:

1. Is smoking the only cause of cancer?

Smoking is not the only cause of cancer, but it plays a major role in contributing to lung cancer. Refer: “Bar Plot Level vs Smoking Count” and “Bar Plot Level vs Passive.Smoker Count”.

Based on the correlation matrix, obesity is the factor most strongly correlated with the level of lung cancer (0.83). Refer: “Horizontal Barcharts of Correlation Matrix against Level variable”.

“Horizontal Barcharts of Correlation Matrix against Smoking variable” also shows that smoking is not the main factor in lung cancer, but it does have a high correlation with Level relative to the other factors listed in the top 15. Smoking is also correlated with most of the factors listed below.

  1. Passive.Smoker 0.76138495

  2. Chest.Pain 0.64781906

  3. Balanced.Diet 0.64611037

  4. chronic.Lung.Disease 0.57826244