Data Exploration

1.1 Load all of the necessary packages
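
The report does not show the loading code itself. A minimal sketch follows, assuming each function used later comes from the package named in the comment next to it; this mapping is inferred from the function names and is not stated in the report.

```r
# Assumed mapping from the functions used in this report to their packages.
pkgs <- c(
  "readxl",       # read_excel()
  "Amelia",       # missmap()
  "visdat",       # vis_miss()
  "summarytools", # dfSummary(), st_options()
  "missMDA",      # imputePCA()
  "robustbase",   # covMcd()
  "janitor",      # clean_names()
  "dplyr",        # %>%
  "FactoMineR",   # PCA()
  "factoextra",   # fviz_screeplot(), fviz_pca_var()
  "readr"         # read_csv()
)

# Report packages that are not installed instead of stopping with an error.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
if (any(!installed)) {
  message("Not installed: ", paste(pkgs[!installed], collapse = ", "))
}

# Attach the packages that are available.
invisible(lapply(pkgs[installed], library, character.only = TRUE))
```

Checking with requireNamespace() first gives one readable message listing everything that still needs install.packages(), rather than failing at the first missing package.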

1.2 Read the dataset

data <- read_excel("~/Desktop/aicha_labs_for_data_mining/liver_data_with_metadata(4).xlsx")

# View the first few rows of the dataset
head(data)

## # A tibble: 6 × 11
##   `Age of the patient` `Gender of the patient` `Total Bilirubin`
##                  <dbl> <chr>                               <dbl>
## 1                   65 Female                                0.7
## 2                   62 Male                                 10.9
## 3                   62 Male                                  7.3
## 4                   58 Male                                  1  
## 5                   72 Male                                  3.9
## 6                   46 Male                                  1.8
## # ℹ 8 more variables: `Direct Bilirubin` <dbl>,
## #   ` Alkphos Alkaline Phosphotase` <dbl>,
## #   ` Sgpt Alamine Aminotransferase` <dbl>,
## #   `Sgot Aspartate Aminotransferase` <dbl>, `Total Protiens` <dbl>,
## #   ` ALB Albumin` <dbl>, `A/G Ratio Albumin and Globulin Ratio` <dbl>,
## #   Result <dbl>

1.3 Check the size of the dataset

Display the number of rows and columns in the dataset

dim(data)

## [1] 30691    11

Display the number of rows

nrow(data)

## [1] 30691

Display the number of columns

ncol(data)

## [1] 11

1.4 Overview of the Variables

Display the column names of the dataset

colnames(data)

##  [1] "Age of the patient"                  
##  [2] "Gender of the patient"               
##  [3] "Total Bilirubin"                     
##  [4] "Direct Bilirubin"                    
##  [5] " Alkphos Alkaline Phosphotase"       
##  [6] " Sgpt Alamine Aminotransferase"      
##  [7] "Sgot Aspartate Aminotransferase"     
##  [8] "Total Protiens"                      
##  [9] " ALB Albumin"                        
## [10] "A/G Ratio Albumin and Globulin Ratio"
## [11] "Result"
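
Several of the names printed above start with a space (for example " ALB Albumin"), which is easy to miss when typing a column name by hand. As an illustration only, base R's trimws() removes that leading whitespace; the vector nm below is a hypothetical copy of three of the names.

```r
# Three of the column names printed above, copied with their leading spaces.
nm <- c(" Alkphos Alkaline Phosphotase", " ALB Albumin", "Total Protiens")

# trimws() strips leading and trailing whitespace from each string.
clean <- trimws(nm)
clean
```

On the real data the same idea would read names(data) <- trimws(names(data)); this report keeps the original names at this stage.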

Display a summary of the dataset structure

str(data)

## tibble [30,691 × 11] (S3: tbl_df/tbl/data.frame)
##  $ Age of the patient                  : num [1:30691] 65 62 62 58 72 46 26 29 17 55 ...
##  $ Gender of the patient               : chr [1:30691] "Female" "Male" "Male" "Male" ...
##  $ Total Bilirubin                     : num [1:30691] 0.7 10.9 7.3 1 3.9 1.8 0.9 0.9 0.9 0.7 ...
##  $ Direct Bilirubin                    : num [1:30691] 0.1 5.5 4.1 0.4 2 0.7 0.2 0.3 0.3 0.2 ...
##  $  Alkphos Alkaline Phosphotase       : num [1:30691] 187 699 490 182 195 208 154 202 202 290 ...
##  $  Sgpt Alamine Aminotransferase      : num [1:30691] 16 64 60 14 27 19 NA 14 22 53 ...
##  $ Sgot Aspartate Aminotransferase     : num [1:30691] 18 100 68 20 59 14 12 11 19 58 ...
##  $ Total Protiens                      : num [1:30691] 6.8 7.5 7 6.8 7.3 7.6 7 6.7 7.4 6.8 ...
##  $  ALB Albumin                        : num [1:30691] 3.3 3.2 3.3 3.4 2.4 4.4 3.5 3.6 4.1 3.4 ...
##  $ A/G Ratio Albumin and Globulin Ratio: num [1:30691] 0.9 0.74 0.89 1 0.4 1.3 1 1.1 1.2 1 ...
##  $ Result                              : num [1:30691] 1 1 1 1 1 1 1 1 2 1 ...

1.5 Display a summary of the dataset

# Gives basic statistics for numeric columns; character columns report only length, class, and mode
summary(data)

##  Age of the patient Gender of the patient Total Bilirubin Direct Bilirubin
##  Min.   : 4.00      Length:30691          Min.   : 0.40   Min.   : 0.100  
##  1st Qu.:32.00      Class :character      1st Qu.: 0.80   1st Qu.: 0.200  
##  Median :45.00      Mode  :character      Median : 1.00   Median : 0.300  
##  Mean   :44.11                            Mean   : 3.37   Mean   : 1.528  
##  3rd Qu.:55.00                            3rd Qu.: 2.70   3rd Qu.: 1.300  
##  Max.   :90.00                            Max.   :75.00   Max.   :19.700  
##  NA's   :2                                NA's   :648     NA's   :561     
##   Alkphos Alkaline Phosphotase  Sgpt Alamine Aminotransferase
##  Min.   :  63.0                Min.   :  10.00               
##  1st Qu.: 175.0                1st Qu.:  23.00               
##  Median : 209.0                Median :  35.00               
##  Mean   : 289.1                Mean   :  81.49               
##  3rd Qu.: 298.0                3rd Qu.:  62.00               
##  Max.   :2110.0                Max.   :2000.00               
##  NA's   :796                   NA's   :538                   
##  Sgot Aspartate Aminotransferase Total Protiens   ALB Albumin 
##  Min.   :  10.0                  Min.   :2.70   Min.   :0.90  
##  1st Qu.:  26.0                  1st Qu.:5.80   1st Qu.:2.60  
##  Median :  42.0                  Median :6.60   Median :3.10  
##  Mean   : 111.5                  Mean   :6.48   Mean   :3.13  
##  3rd Qu.:  88.0                  3rd Qu.:7.20   3rd Qu.:3.80  
##  Max.   :4929.0                  Max.   :9.60   Max.   :5.50  
##  NA's   :462                     NA's   :463    NA's   :494   
##  A/G Ratio Albumin and Globulin Ratio     Result     
##  Min.   :0.3000                       Min.   :1.000  
##  1st Qu.:0.7000                       1st Qu.:1.000  
##  Median :0.9000                       Median :1.000  
##  Mean   :0.9435                       Mean   :1.286  
##  3rd Qu.:1.1000                       3rd Qu.:2.000  
##  Max.   :2.8000                       Max.   :2.000  
##  NA's   :559
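
The NA's rows in the output above can also be collected into a single table. A small sketch on a toy data frame (the data frame toy and its values are made up for illustration) shows one way to count and express missing values per column:

```r
# Toy data frame with a few missing values (illustrative values only).
toy <- data.frame(
  age    = c(65, NA, 62, 58),
  gender = c("Female", "Male", NA, NA),
  result = c(1, 1, 2, 1)
)

# is.na() gives a logical matrix; colSums() counts the TRUE entries per column.
na_count <- colSums(is.na(toy))
na_pct <- round(100 * na_count / nrow(toy), 1)

data.frame(missing = na_count, percent = na_pct)
```

Applied to the liver dataset, colSums(is.na(data)) should reproduce the NA counts shown by summary(), including the character column, for which summary() prints no count.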

1.6 Create a heatmap-like visualization of missing values

missmap(data)

1.7 Create another visual representation of missingness in the dataset

vis_miss(data)

1.8 Generate a detailed summary of the dataset, including statistics and data structure

# Ensure summarytools is loaded
library(summarytools)
# Use plain-ASCII (non-markdown) styling for summarytools text output
st_options(plain.ascii = TRUE)

# Generate the summary table; the name df_summary avoids reusing the name of base R's summary()
df_summary <- dfSummary(data)

# Print the table in R Markdown with HTML styling
print(df_summary, method = "render")

Data Frame Summary: data
Dimensions: 30691 x 11
Duplicates: 11323

Variable [type] | Stats / Values | Freqs (% of Valid) | Valid | Missing
Age of the patient [numeric] | Mean (sd): 44.1 (16); min ≤ med ≤ max: 4 ≤ 45 ≤ 90; IQR (CV): 23 (0.4) | 77 distinct values | 30689 (100.0%) | 2 (0.0%)
Gender of the patient [character] | 1. Female; 2. Male | 7803 (26.2%); 21986 (73.8%) | 29789 (97.1%) | 902 (2.9%)
Total Bilirubin [numeric] | Mean (sd): 3.4 (6.3); min ≤ med ≤ max: 0.4 ≤ 1 ≤ 75; IQR (CV): 1.9 (1.9) | 113 distinct values | 30043 (97.9%) | 648 (2.1%)
Direct Bilirubin [numeric] | Mean (sd): 1.5 (2.9); min ≤ med ≤ max: 0.1 ≤ 0.3 ≤ 19.7; IQR (CV): 1.1 (1.9) | 80 distinct values | 30130 (98.2%) | 561 (1.8%)
Alkphos Alkaline Phosphotase [numeric] | Mean (sd): 289.1 (238.5); min ≤ med ≤ max: 63 ≤ 209 ≤ 2110; IQR (CV): 123 (0.8) | 263 distinct values | 29895 (97.4%) | 796 (2.6%)
Sgpt Alamine Aminotransferase [numeric] | Mean (sd): 81.5 (182.2); min ≤ med ≤ max: 10 ≤ 35 ≤ 2000; IQR (CV): 39 (2.2) | 152 distinct values | 30153 (98.2%) | 538 (1.8%)
Sgot Aspartate Aminotransferase [numeric] | Mean (sd): 111.5 (280.9); min ≤ med ≤ max: 10 ≤ 42 ≤ 4929; IQR (CV): 62 (2.5) | 177 distinct values | 30229 (98.5%) | 462 (1.5%)
Total Protiens [numeric] | Mean (sd): 6.5 (1.1); min ≤ med ≤ max: 2.7 ≤ 6.6 ≤ 9.6; IQR (CV): 1.4 (0.2) | 58 distinct values | 30228 (98.5%) | 463 (1.5%)
ALB Albumin [numeric] | Mean (sd): 3.1 (0.8); min ≤ med ≤ max: 0.9 ≤ 3.1 ≤ 5.5; IQR (CV): 1.2 (0.3) | 40 distinct values | 30197 (98.4%) | 494 (1.6%)
A/G Ratio Albumin and Globulin Ratio [numeric] | Mean (sd): 0.9 (0.3); min ≤ med ≤ max: 0.3 ≤ 0.9 ≤ 2.8; IQR (CV): 0.4 (0.3) | 69 distinct values | 30132 (98.2%) | 559 (1.8%)
Result [numeric] | Min: 1; Mean: 1.3; Max: 2 | 1: 21917 (71.4%); 2: 8774 (28.6%) | 30691 (100.0%) | 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.4.1), 2025-01-07

Data Cleaning

2.1 Identify numeric columns in the dataset

# sapply() checks each column with is.numeric() and returns a named logical vector.
# which() returns the indices of the TRUE entries.
k <- which(sapply(data, is.numeric))
k # Print the indices of numeric columns for verification

##                   Age of the patient                      Total Bilirubin 
##                                    1                                    3 
##                     Direct Bilirubin         Alkphos Alkaline Phosphotase 
##                                    4                                    5 
##        Sgpt Alamine Aminotransferase      Sgot Aspartate Aminotransferase 
##                                    6                                    7 
##                       Total Protiens                          ALB Albumin 
##                                    8                                    9 
## A/G Ratio Albumin and Globulin Ratio                               Result 
##                                   10                                   11

2.2 Apply a logarithmic transformation to all numeric columns

Xdata <- log(data[, k])
# log() applies the natural logarithm to the selected numeric columns.
# This transformation is often used to reduce skewness in the data.
# Note that k also includes the numeric Result column (index 11).

2.3 Perform Principal Component Analysis (PCA) to impute missing values

pc <- imputePCA(Xdata)

# imputePCA() (missMDA package) estimates the missing values from a PCA model of the data,
# using the relationships between variables to fill in the gaps.

2.4 Extract the completed dataset with imputed values

Xdata <- pc$completeObs
# pc$completeObs contains the imputed data where missing values have been replaced.

2.5 Revert the logarithmic transformation by applying the exponential function

data[, k] <- exp(Xdata)

# The exp() function reverses the natural logarithm, restoring the original scale of the data.
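
Steps 2.2 to 2.5 form a round trip: transform to the log scale, impute there, then transform back. The sketch below illustrates only that round trip on a toy vector; simple mean imputation stands in for imputePCA(), which is a deliberate simplification and not the method used in this report.

```r
# Toy positive, skewed variable with one missing value.
x <- c(1, 10, NA, 1000)

# 1. Move to the log scale.
log_x <- log(x)

# 2. Impute on the log scale (mean imputation as a stand-in for imputePCA()).
log_x[is.na(log_x)] <- mean(log_x, na.rm = TRUE)

# 3. Return to the original scale.
x_filled <- exp(log_x)
x_filled
```

Because exp() is always positive, values imputed this way cannot be negative, which suits laboratory measurements. With mean imputation the filled value is the geometric mean of the observed values.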

2.6 Generate a detailed summary of the dataset after imputation

df_summary <- dfSummary(data)

print(df_summary, method = "render")

Data Frame Summary: data
Dimensions: 30691 x 11
Duplicates: 11323

Variable [type] | Stats / Values | Freqs (% of Valid) | Valid | Missing
Age of the patient [numeric] | Mean (sd): 44.1 (16); min ≤ med ≤ max: 4 ≤ 45 ≤ 90; IQR (CV): 23 (0.4) | 78 distinct values | 30691 (100.0%) | 0 (0.0%)
Gender of the patient [character] | 1. Female; 2. Male | 7803 (26.2%); 21986 (73.8%) | 29789 (97.1%) | 902 (2.9%)
Total Bilirubin [numeric] | Mean (sd): 3.3 (6.2); min ≤ med ≤ max: 0.4 ≤ 1 ≤ 75; IQR (CV): 1.9 (1.9) | 643 distinct values | 30691 (100.0%) | 0 (0.0%)
Direct Bilirubin [numeric] | Mean (sd): 1.5 (2.8); min ≤ med ≤ max: 0.1 ≤ 0.3 ≤ 19.7; IQR (CV): 1.1 (1.9) | 560 distinct values | 30691 (100.0%) | 0 (0.0%)
Alkphos Alkaline Phosphotase [numeric] | Mean (sd): 288 (235.7); min ≤ med ≤ max: 63 ≤ 210 ≤ 2110; IQR (CV): 122 (0.8) | 933 distinct values | 30691 (100.0%) | 0 (0.0%)
Sgpt Alamine Aminotransferase [numeric] | Mean (sd): 80.9 (180.7); min ≤ med ≤ max: 10 ≤ 35 ≤ 2000; IQR (CV): 38 (2.2) | 604 distinct values | 30691 (100.0%) | 0 (0.0%)
Sgot Aspartate Aminotransferase [numeric] | Mean (sd): 110.9 (278.9); min ≤ med ≤ max: 10 ≤ 42 ≤ 4929; IQR (CV): 62 (2.5) | 561 distinct values | 30691 (100.0%) | 0 (0.0%)
Total Protiens [numeric] | Mean (sd): 6.5 (1.1); min ≤ med ≤ max: 2.7 ≤ 6.6 ≤ 9.6; IQR (CV): 1.4 (0.2) | 415 distinct values | 30691 (100.0%) | 0 (0.0%)
ALB Albumin [numeric] | Mean (sd): 3.1 (0.8); min ≤ med ≤ max: 0.9 ≤ 3.1 ≤ 5.5; IQR (CV): 1.1 (0.3) | 438 distinct values | 30691 (100.0%) | 0 (0.0%)
A/G Ratio Albumin and Globulin Ratio [numeric] | Mean (sd): 0.9 (0.3); min ≤ med ≤ max: 0.3 ≤ 0.9 ≤ 2.8; IQR (CV): 0.4 (0.3) | 484 distinct values | 30691 (100.0%) | 0 (0.0%)
Result [numeric] | Min: 1; Mean: 1.3; Max: 2 | 1: 21917 (71.4%); 2: 8774 (28.6%) | 30691 (100.0%) | 0 (0.0%)

2.7 Identify rows with missing values in the “Gender of the patient” column

j <- which(is.na(data$`Gender of the patient`))

# is.na() returns TRUE for missing values (NA) and FALSE otherwise.
# which() returns the row indices where the value is missing.
# j stores these row indices for use in the next step.

2.8 Remove rows with missing values in the “Gender of the patient” column

data <- data[-j,]

# data[-j,]: Removes rows from the dataset where the row indices match those in `j`.
# The minus sign (-j) tells R to exclude these rows.
# After this step, the dataset will no longer have any rows with missing "Gender of the patient" values.
# This step ensures that analyses requiring this column won't fail due to missing data.
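
Negative indexing with the result of which() works here because some rows really are missing a gender. As a hedged aside, the toy example below (the data frames df and complete are made up) shows what happens when nothing is missing, and a logical-index alternative that behaves the same in both cases.

```r
# Case 1: one or more missing values, so negative indexing works as expected.
df <- data.frame(id = 1:4, gender = c("Female", NA, "Male", "Male"))
j <- which(is.na(df$gender))
rows_kept <- nrow(df[-j, ])

# Case 2: no missing values, so j0 is integer(0) and df[-j0, ] selects NO rows.
complete <- data.frame(id = 1:2, gender = c("Female", "Male"))
j0 <- which(is.na(complete$gender))
rows_kept_empty <- nrow(complete[-j0, ])

# Logical indexing keeps every non-missing row in both cases.
rows_kept_safe <- nrow(complete[!is.na(complete$gender), ])

c(rows_kept, rows_kept_empty, rows_kept_safe)
```

The logical form, data[!is.na(data$`Gender of the patient`), ], is therefore the safer choice when the same script may later run on data without missing values.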

2.9 Regenerate the summary table after cleaning the dataset

# The dataset has been cleaned by removing rows with missing values in the "Gender of the patient" column.
# Now, we generate an updated summary of the dataset to review its current structure and statistics.

df_summary <- dfSummary(data)

print(df_summary, method = "render")

Data Frame Summary: data
Dimensions: 29789 x 11
Duplicates: 11217

Variable [type] | Stats / Values | Freqs (% of Valid) | Valid | Missing
Age of the patient [numeric] | Mean (sd): 44.1 (16); min ≤ med ≤ max: 4 ≤ 45 ≤ 90; IQR (CV): 23 (0.4) | 78 distinct values | 29789 (100.0%) | 0 (0.0%)
Gender of the patient [character] | 1. Female; 2. Male | 7803 (26.2%); 21986 (73.8%) | 29789 (100.0%) | 0 (0.0%)
Total Bilirubin [numeric] | Mean (sd): 3.4 (6.2); min ≤ med ≤ max: 0.4 ≤ 1 ≤ 75; IQR (CV): 1.9 (1.9) | 459 distinct values | 29789 (100.0%) | 0 (0.0%)
Direct Bilirubin [numeric] | Mean (sd): 1.5 (2.9); min ≤ med ≤ max: 0.1 ≤ 0.3 ≤ 19.7; IQR (CV): 1.1 (1.9) | 471 distinct values | 29789 (100.0%) | 0 (0.0%)
Alkphos Alkaline Phosphotase [numeric] | Mean (sd): 288.4 (236.4); min ≤ med ≤ max: 63 ≤ 210 ≤ 2110; IQR (CV): 122 (0.8) | 871 distinct values | 29789 (100.0%) | 0 (0.0%)
Sgpt Alamine Aminotransferase [numeric] | Mean (sd): 80.9 (180.5); min ≤ med ≤ max: 10 ≤ 35 ≤ 2000; IQR (CV): 38 (2.2) | 554 distinct values | 29789 (100.0%) | 0 (0.0%)
Sgot Aspartate Aminotransferase [numeric] | Mean (sd): 111 (278.2); min ≤ med ≤ max: 10 ≤ 42 ≤ 4929; IQR (CV): 62 (2.5) | 517 distinct values | 29789 (100.0%) | 0 (0.0%)
Total Protiens [numeric] | Mean (sd): 6.5 (1.1); min ≤ med ≤ max: 2.7 ≤ 6.6 ≤ 9.6; IQR (CV): 1.4 (0.2) | 382 distinct values | 29789 (100.0%) | 0 (0.0%)
ALB Albumin [numeric] | Mean (sd): 3.1 (0.8); min ≤ med ≤ max: 0.9 ≤ 3.1 ≤ 5.5; IQR (CV): 1.1 (0.3) | 403 distinct values | 29789 (100.0%) | 0 (0.0%)
A/G Ratio Albumin and Globulin Ratio [numeric] | Mean (sd): 0.9 (0.3); min ≤ med ≤ max: 0.3 ≤ 0.9 ≤ 2.8; IQR (CV): 0.4 (0.3) | 449 distinct values | 29789 (100.0%) | 0 (0.0%)
Result [numeric] | Min: 1; Mean: 1.3; Max: 2 | 1: 21295 (71.5%); 2: 8494 (28.5%) | 29789 (100.0%) | 0 (0.0%)

cat("The summary table has been updated")

## The summary table has been updated

2.10 Outlier detection

# Use robust Mahalanobis distances to detect multivariate outliers.
# Columns 2 (gender) and 11 (Result) are excluded.
X <- data[, -c(2, 11)]
robust_dist <- covMcd(X)  # Robust (MCD) estimates of the center and covariance
threshold <- qchisq(0.99, df = ncol(X))  # 0.99 quantile of the chi-squared distribution, one degree of freedom per variable

# mahalanobis() returns squared distances, which are compared with the chi-squared threshold
robust_outliers <- mahalanobis(X, robust_dist$center, robust_dist$cov) > threshold

2.11 Display a summary of detected robust outliers

table(robust_outliers)

## robust_outliers
## FALSE  TRUE 
## 15034 14755
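
The table above flags 14755 of 29789 rows (about 49.5%), far more than the roughly 1% a 0.99 chi-squared cutoff would flag if the variables were jointly normal, so the result probably reflects the strong skew of the laboratory values as much as genuine outliers. To show the mechanics of the rule in isolation, here is a minimal sketch on simulated data. It uses the classical mean and covariance instead of covMcd(), which is a simplification, and the matrix sim and the planted outlier are made up.

```r
set.seed(1)

# Simulated, roughly normal data with one planted outlier in row 1.
sim <- matrix(rnorm(500 * 3), ncol = 3)
sim[1, ] <- c(10, 10, 10)

# Classical estimates (the report uses robust MCD estimates instead).
center <- colMeans(sim)
covmat <- cov(sim)

# mahalanobis() returns SQUARED distances, which are compared
# with a chi-squared quantile with one degree of freedom per column.
d2 <- mahalanobis(sim, center, covmat)
cutoff <- qchisq(0.99, df = ncol(sim))
flagged <- d2 > cutoff

table(flagged)
mean(flagged)  # share of rows flagged
```

On data like this only a small share of rows is flagged, and the planted point is among them. A share near one half, as in the report, suggests checking the distributional assumption (for example by working on the log scale) before dropping rows.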

2.12 Remove rows identified as outliers to create a clean dataset

robust_clean_data <- data[!robust_outliers, ]

2.13 Display the dimensions of the cleaned dataset

dim(robust_clean_data)

## [1] 15034    11

2.14 Generate a detailed summary of the cleaned dataset

df_summary <- dfSummary(robust_clean_data)

print(df_summary, method = "render")

Data Frame Summary: robust_clean_data
Dimensions: 15034 x 11
Duplicates: 5661

Variable [type] | Stats / Values | Freqs (% of Valid) | Valid | Missing
Age of the patient [numeric] | Mean (sd): 44.1 (15.8); min ≤ med ≤ max: 4 ≤ 45 ≤ 90; IQR (CV): 22 (0.4) | 75 distinct values | 15034 (100.0%) | 0 (0.0%)
Gender of the patient [character] | 1. Female; 2. Male | 4012 (26.7%); 11022 (73.3%) | 15034 (100.0%) | 0 (0.0%)
Total Bilirubin [numeric] | Mean (sd): 0.9 (0.3); min ≤ med ≤ max: 0.5 ≤ 0.8 ≤ 2.2; IQR (CV): 0.3 (0.3) | 159 distinct values | 15034 (100.0%) | 0 (0.0%)
Direct Bilirubin [numeric] | Mean (sd): 0.3 (0.2); min ≤ med ≤ max: 0.1 ≤ 0.2 ≤ 1; IQR (CV): 0.1 (0.7) | 189 distinct values | 15034 (100.0%) | 0 (0.0%)
Alkphos Alkaline Phosphotase [numeric] | Mean (sd): 197.1 (54); min ≤ med ≤ max: 63 ≤ 188 ≤ 418; IQR (CV): 52 (0.3) | 450 distinct values | 15034 (100.0%) | 0 (0.0%)
Sgpt Alamine Aminotransferase [numeric] | Mean (sd): 29.2 (12.2); min ≤ med ≤ max: 10 ≤ 26 ≤ 72; IQR (CV): 16 (0.4) | 262 distinct values | 15034 (100.0%) | 0 (0.0%)
Sgot Aspartate Aminotransferase [numeric] | Mean (sd): 32.4 (15.6); min ≤ med ≤ max: 10 ≤ 28 ≤ 95; IQR (CV): 20 (0.5) | 242 distinct values | 15034 (100.0%) | 0 (0.0%)
Total Protiens [numeric] | Mean (sd): 6.6 (1); min ≤ med ≤ max: 3.6 ≤ 6.7 ≤ 9.2; IQR (CV): 1.3 (0.2) | 164 distinct values | 15034 (100.0%) | 0 (0.0%)
ALB Albumin [numeric] | Mean (sd): 3.4 (0.7); min ≤ med ≤ max: 0.9 ≤ 3.4 ≤ 5.5; IQR (CV): 1.1 (0.2) | 170 distinct values | 15034 (100.0%) | 0 (0.0%)
A/G Ratio Albumin and Globulin Ratio [numeric] | Mean (sd): 1 (0.3); min ≤ med ≤ max: 0.3 ≤ 1 ≤ 1.8; IQR (CV): 0.3 (0.2) | 209 distinct values | 15034 (100.0%) | 0 (0.0%)
Result [numeric] | Min: 1; Mean: 1.4; Max: 2 | 1: 8423 (56.0%); 2: 6611 (44.0%) | 15034 (100.0%) | 0 (0.0%)

2.15 Save the cleaned dataset to a CSV file

write.csv(robust_clean_data, file = "cleaned_data.csv", row.names = FALSE)
# robust_clean_data: This is the cleaned data you want to save.
# file: The name of the output CSV file. 
# row.names = FALSE: Ensures that row numbers are not added as a separate column in the CSV file.

Data Preprocessing (Discretization)

3.1 Clean the column names to get a uniform, easy-to-use format

data <- robust_clean_data

# clean_names() (janitor package) converts the names to lowercase snake_case
data <- data %>% clean_names()

3.2 Categorize the ‘age_of_the_patient’ column into age groups

data$age_cat <- cut(data$age_of_the_patient,
                    breaks = c(-Inf, 18, 40, 65, Inf), # Define the age group boundaries
                    labels = c("Youth (<18)", "Adult (18-40)", "Middle-aged (40-65)", "Senior (>65)")) # Assign group labels
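
By default cut() builds right-closed intervals, so with these breaks a patient aged exactly 18 falls in the first group even though its label reads "<18". The short sketch below, on a made-up vector of boundary ages, compares the default with right = FALSE.

```r
ages <- c(17, 18, 40, 65, 66)
breaks <- c(-Inf, 18, 40, 65, Inf)
labels <- c("Youth (<18)", "Adult (18-40)", "Middle-aged (40-65)", "Senior (>65)")

# Default: intervals are (-Inf, 18], (18, 40], (40, 65], (65, Inf]
right_closed <- cut(ages, breaks = breaks, labels = labels)

# right = FALSE: intervals are [-Inf, 18), [18, 40), [40, 65), [65, Inf)
left_closed <- cut(ages, breaks = breaks, labels = labels, right = FALSE)

data.frame(ages, right_closed, left_closed)
```

Neither variant matches every label exactly (with right = FALSE, age 65 lands in "Senior (>65)"), so whichever convention is chosen, the labels are best worded to match it.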

3.3 Discretize the ‘total_bilirubin’ variable into categories based on medical thresholds

data$total_bilirubin_cat <- cut(data$total_bilirubin, 
                                breaks = c(-Inf, 1.2, Inf), # Threshold for 'Normal' and 'Elevated'
                                labels = c("Normal", "Elevated"))

3.4 Similarly, discretize ‘direct_bilirubin’ into ‘Normal’ and ‘Elevated’

data$direct_bilirubin_cat <- cut(data$direct_bilirubin, 
                                 breaks = c(-Inf, 0.3, Inf), 
                                 labels = c("Normal", "Elevated"))

3.5 Categorize ‘alkphos_alkaline_phosphotase’ into ‘Low’, ‘Normal’, and ‘High’ based on specified ranges

data$alkphos_cat <- cut(data$alkphos_alkaline_phosphotase, 
                        breaks = c(-Inf, 44, 147, Inf), 
                        labels = c("Low", "Normal", "High"))

3.6 Discretize ‘sgpt_alamine_aminotransferase’ as ‘Normal’ or ‘Elevated’ using a threshold

data$sgpt_cat <- cut(data$sgpt_alamine_aminotransferase, 
                     breaks = c(-Inf, 45, Inf), 
                     labels = c("Normal", "Elevated"))

3.7 Discretize ‘sgot_aspartate_aminotransferase’ using similar logic

data$sgot_cat <- cut(data$sgot_aspartate_aminotransferase, 
                     breaks = c(-Inf, 40, Inf), 
                     labels = c("Normal", "Elevated"))

3.8 Categorize ‘total_protiens’ based on ranges for ‘Low’, ‘Normal’, and ‘High’

data$total_protiens_cat <- cut(data$total_protiens, 
                               breaks = c(-Inf, 6.4, 8.3, Inf), 
                               labels = c("Low", "Normal", "High"))

3.9 Categorize ‘alb_albumin’ into similar categories

data$alb_cat <- cut(data$alb_albumin, 
                    breaks = c(-Inf, 3.5, 5.5, Inf), 
                    labels = c("Low", "Normal", "High"))

3.10 Discretize ‘a_g_ratio_albumin_and_globulin_ratio’ into ‘Low’, ‘Normal’, and ‘High’

data$a_g_ratio_cat <- cut(data$a_g_ratio_albumin_and_globulin_ratio, 
                          breaks = c(-Inf, 1.2, 2.2, Inf), 
                          labels = c("Low", "Normal", "High"))

3.11 Create a new summarized data frame containing only categorized variables and Gender

data_categorized <- data.frame(
  Age = data$age_cat,
  Gender = data$gender_of_the_patient,
  Total_Bilirubin = data$total_bilirubin_cat,
  Direct_Bilirubin = data$direct_bilirubin_cat,
  Alkaline_Phosphatase = data$alkphos_cat,
  Alanine_Aminotransferase = data$sgpt_cat,
  Aspartate_Aminotransferase = data$sgot_cat,
  Total_Proteins = data$total_protiens_cat,
  Albumin = data$alb_cat,
  AG_Ratio = data$a_g_ratio_cat,
  Result = as.factor(data$result) # Convert 'Result' to a factor for categorical analysis
)

3.12 Convert ‘Gender’ to a factor type for better handling in analysis

data_categorized$Gender <- as.factor(data_categorized$Gender)

3.13 Drop unused levels from the factors in the data frame

data_categorized <- droplevels(data_categorized)

3.14 Display a summary of the cleaned and categorized data frame

summary(data_categorized)

##                   Age          Gender      Total_Bilirubin  Direct_Bilirubin
##  Youth (<18)        : 831   Female: 4012   Normal  :13232   Normal  :12193  
##  Adult (18-40)      :5506   Male  :11022   Elevated: 1802   Elevated: 2841  
##  Middle-aged (40-65):7242                                                   
##  Senior (>65)       :1455                                                   
##  Alkaline_Phosphatase Alanine_Aminotransferase Aspartate_Aminotransferase
##  Normal: 1876         Normal  :13092           Normal  :11133            
##  High  :13158         Elevated: 1942           Elevated: 3901            
##                                                                          
##                                                                          
##  Total_Proteins   Albumin       AG_Ratio     Result  
##  Low   :6653    Low   :8690   Low   :12379   1:8423  
##  Normal:7821    Normal:6344   Normal: 2655   2:6611  
##  High  : 560                                         
##
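
In the summary above, Alkaline_Phosphatase has no "Low" level, and Albumin and AG_Ratio have no "High" level. That is the effect of droplevels(): cut() creates every level named in labels, even when no observation falls in it. A small sketch with three made-up values in the range seen after outlier removal (63 to 418) shows the behavior.

```r
# All values lie above 44, so the "Low" level is created but stays empty.
alkphos <- c(63, 188, 418)
f <- cut(alkphos,
         breaks = c(-Inf, 44, 147, Inf),
         labels = c("Low", "Normal", "High"))

table(f)              # "Low" appears with a count of 0
f2 <- droplevels(f)   # removes levels that have no observations
levels(f2)
```

Dropping empty levels keeps later tables and models from carrying categories that never occur in the data.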

Data Preprocessing (Dimensionality Reduction)

4.1 Perform Principal Component Analysis (PCA) to reduce dimensionality and explore variance

library(FactoMineR)
library(factoextra)
# PCA requires numeric (quantitative) variables.

4.2 Run PCA, treating columns 2 and 11 as supplementary (not used for PCA computation)

library(readr)
data <- read_csv("cleaned_data.csv")
summary(data)

##  Age of the patient Gender of the patient Total Bilirubin  Direct Bilirubin
##  Min.   : 4.00      Length:15034          Min.   :0.4598   Min.   :0.1000  
##  1st Qu.:33.00      Class :character      1st Qu.:0.7000   1st Qu.:0.2000  
##  Median :45.00      Mode  :character      Median :0.8000   Median :0.2000  
##  Mean   :44.07                            Mean   :0.8991   Mean   :0.2704  
##  3rd Qu.:55.00                            3rd Qu.:1.0000   3rd Qu.:0.3000  
##  Max.   :90.00                            Max.   :2.2000   Max.   :1.0000  
##   Alkphos Alkaline Phosphotase  Sgpt Alamine Aminotransferase
##  Min.   : 63.0                 Min.   :10.00                 
##  1st Qu.:163.0                 1st Qu.:20.00                 
##  Median :188.0                 Median :26.00                 
##  Mean   :197.1                 Mean   :29.21                 
##  3rd Qu.:215.0                 3rd Qu.:36.00                 
##  Max.   :418.0                 Max.   :72.00                 
##  Sgot Aspartate Aminotransferase Total Protiens   ALB Albumin  
##  Min.   :10.00                   Min.   :3.6    Min.   :0.900  
##  1st Qu.:21.00                   1st Qu.:6.0    1st Qu.:2.900  
##  Median :28.03                   Median :6.7    Median :3.400  
##  Mean   :32.41                   Mean   :6.6    Mean   :3.358  
##  3rd Qu.:41.00                   3rd Qu.:7.3    3rd Qu.:4.000  
##  Max.   :95.00                   Max.   :9.2    Max.   :5.500  
##  A/G Ratio Albumin and Globulin Ratio     Result    
##  Min.   :0.300                        Min.   :1.00  
##  1st Qu.:0.880                        1st Qu.:1.00  
##  Median :1.000                        Median :1.00  
##  Mean   :1.021                        Mean   :1.44  
##  3rd Qu.:1.200                        3rd Qu.:2.00  
##  Max.   :1.800                        Max.   :2.00

# scale.unit = TRUE standardizes the variables (subtract the mean, divide by the standard deviation).
# quali.sup marks columns 2 and 11 as supplementary: they are not used to compute the components but can help interpret them.
pca_res <- PCA(data, quali.sup = c(2, 11), scale.unit = TRUE, graph = FALSE)
pca_res # Print the list of result objects available in pca_res

## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 15034 individuals, described by 11 variables
## *The results are available in the following objects:
## 
##    name                description                                          
## 1  "$eig"              "eigenvalues"                                        
## 2  "$var"              "results for the variables"                          
## 3  "$var$coord"        "coord. for the variables"                           
## 4  "$var$cor"          "correlations variables - dimensions"                
## 5  "$var$cos2"         "cos2 for the variables"                             
## 6  "$var$contrib"      "contributions of the variables"                     
## 7  "$ind"              "results for the individuals"                        
## 8  "$ind$coord"        "coord. for the individuals"                         
## 9  "$ind$cos2"         "cos2 for the individuals"                           
## 10 "$ind$contrib"      "contributions of the individuals"                   
## 11 "$quali.sup"        "results for the supplementary categorical variables"
## 12 "$quali.sup$coord"  "coord. for the supplementary categories"            
## 13 "$quali.sup$v.test" "v-test of the supplementary categories"             
## 14 "$call"             "summary statistics"                                 
## 15 "$call$centre"      "mean of the variables"                              
## 16 "$call$ecart.type"  "standard error of the variables"                    
## 17 "$call$row.w"       "weights for the individuals"                        
## 18 "$call$col.w"       "weights for the variables"

Visualize Results

4.3 Visualize the percentage of variance explained by each principal component

fviz_screeplot(pca_res, addlabels = TRUE) # Scree plot of the variance explained by each component

4.4 Plot the variables in the PCA space to see their contributions and correlations

# fviz_pca_var() shows how strongly each variable is associated with the components.
fviz_pca_var(pca_res, repel = TRUE) # repel = TRUE keeps the labels from overlapping

4.5 Extract and display eigenvalues to understand the variance captured by each component

pca_res$eig

##         eigenvalue percentage of variance cumulative percentage of variance
## comp 1 2.490702555            27.67447284                          27.67447
## comp 2 2.113850233            23.48722481                          51.16170
## comp 3 1.318682351            14.65202612                          65.81372
## comp 4 1.023384805            11.37094228                          77.18467
## comp 5 0.990938100            11.01042334                          88.19509
## comp 6 0.588098669             6.53442965                          94.72952
## comp 7 0.410417994             4.56019993                          99.28972
## comp 8 0.057601672             0.64001858                          99.92974
## comp 9 0.006323621             0.07026246                         100.00000
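
The percentage columns follow directly from the eigenvalues: with standardized variables the eigenvalues sum to the number of active variables (9 here), and each percentage is the eigenvalue divided by that sum. The sketch below recomputes them from the values printed above; the "eigenvalue greater than 1" cutoff is a common rule of thumb, not something this report applies.

```r
# Eigenvalues copied from the pca_res$eig output above.
eig <- c(2.490702555, 2.113850233, 1.318682351, 1.023384805, 0.990938100,
         0.588098669, 0.410417994, 0.057601672, 0.006323621)

pct <- 100 * eig / sum(eig)   # percentage of variance per component
cum_pct <- cumsum(pct)        # cumulative percentage of variance

# Rule of thumb: keep components whose eigenvalue exceeds 1.
n_keep <- sum(eig > 1)

round(data.frame(eigenvalue = eig, pct, cum_pct), 2)
n_keep
```

Under that rule four components would be retained, covering about 77% of the variance, although the fifth eigenvalue (0.99) is close enough to 1 that the scree plot is worth consulting as well.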

Data Preprocessing (Feature Selection)

5.1 Build the full logistic regression model

library(readr)
data <- read_csv("cleaned_data.csv")
data$Result <- as.factor(data$Result)  # Convert to factor

# The `glm()` function fits a generalized linear model.
# `Result ~ .` uses all other columns as predictors of `Result`.
# `family = binomial` specifies that this is a logistic regression model.
full_model <- glm(Result ~ ., data = data, family = binomial)
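
When the response of a binomial glm() is a factor, R treats the first level as "failure" and the remaining levels as "success", so with levels "1" and "2" the fitted probabilities refer to Result = 2. The intercept-only sketch below, on a made-up response vector y, makes this visible.

```r
# Made-up two-level response coded like Result (levels "1" and "2").
y <- factor(c(1, 1, 2, 2, 2, 1, 2, 2))

# Intercept-only logistic regression.
fit <- glm(y ~ 1, family = binomial)

# plogis() converts the intercept from the log-odds scale to a probability.
p_hat <- unname(plogis(coef(fit)))

p_hat             # fitted probability
mean(y == "2")    # observed share of the second level
```

The two numbers agree, which confirms that the model describes the probability of the second level. Coefficient signs in the summary should be read with that direction in mind.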

5.2 Perform stepwise backward elimination

# The `step()` function selects predictors by comparing AIC (Akaike Information Criterion) values.
# `direction = "backward"` starts with all predictors and, at each step, drops the one whose removal lowers the AIC the most; it stops when no removal lowers the AIC.
# AIC balances goodness of fit against the number of parameters; lower values indicate better models.
stepwise_model <- step(full_model, direction = "backward")

## Start:  AIC=20021.44
## Result ~ `Age of the patient` + `Gender of the patient` + `Total Bilirubin` + 
##     `Direct Bilirubin` + ` Alkphos Alkaline Phosphotase` + ` Sgpt Alamine Aminotransferase` + 
##     `Sgot Aspartate Aminotransferase` + `Total Protiens` + ` ALB Albumin` + 
##     `A/G Ratio Albumin and Globulin Ratio`
## 
##                                          Df Deviance   AIC
## - `Age of the patient`                    1    19999 20019
## - `Gender of the patient`                 1    20000 20020
## - `Sgot Aspartate Aminotransferase`       1    20001 20021
## <none>                                         19999 20021
## - `Direct Bilirubin`                      1    20027 20047
## - ` Alkphos Alkaline Phosphotase`         1    20034 20054
## - ` Sgpt Alamine Aminotransferase`        1    20042 20062
## - `Total Bilirubin`                       1    20059 20079
## - `A/G Ratio Albumin and Globulin Ratio`  1    20276 20296
## - `Total Protiens`                        1    20324 20344
## - ` ALB Albumin`                          1    20328 20348
## 
## Step:  AIC=20019.44
## Result ~ `Gender of the patient` + `Total Bilirubin` + `Direct Bilirubin` + 
##     ` Alkphos Alkaline Phosphotase` + ` Sgpt Alamine Aminotransferase` + 
##     `Sgot Aspartate Aminotransferase` + `Total Protiens` + ` ALB Albumin` + 
##     `A/G Ratio Albumin and Globulin Ratio`
## 
##                                          Df Deviance   AIC
## - `Gender of the patient`                 1    20000 20018
## - `Sgot Aspartate Aminotransferase`       1    20001 20019
## <none>                                         19999 20019
## - `Direct Bilirubin`                      1    20027 20045
## - ` Alkphos Alkaline Phosphotase`         1    20034 20052
## - ` Sgpt Alamine Aminotransferase`        1    20042 20060
## - `Total Bilirubin`                       1    20059 20077
## - `A/G Ratio Albumin and Globulin Ratio`  1    20276 20294
## - `Total Protiens`                        1    20324 20342
## - ` ALB Albumin`                          1    20328 20346
## 
## Step:  AIC=20018.45
## Result ~ `Total Bilirubin` + `Direct Bilirubin` + ` Alkphos Alkaline Phosphotase` + 
##     ` Sgpt Alamine Aminotransferase` + `Sgot Aspartate Aminotransferase` + 
##     `Total Protiens` + ` ALB Albumin` + `A/G Ratio Albumin and Globulin Ratio`
## 
##                                          Df Deviance   AIC
## - `Sgot Aspartate Aminotransferase`       1    20002 20018
## <none>                                         20000 20018
## - `Direct Bilirubin`                      1    20028 20044
## - ` Alkphos Alkaline Phosphotase`         1    20036 20052
## - ` Sgpt Alamine Aminotransferase`        1    20043 20059
## - `Total Bilirubin`                       1    20060 20076
## - `A/G Ratio Albumin and Globulin Ratio`  1    20277 20293
## - `Total Protiens`                        1    20324 20340
## - ` ALB Albumin`                          1    20329 20345
## 
## Step:  AIC=20018.25
## Result ~ `Total Bilirubin` + `Direct Bilirubin` + ` Alkphos Alkaline Phosphotase` + 
##     ` Sgpt Alamine Aminotransferase` + `Total Protiens` + ` ALB Albumin` + 
##     `A/G Ratio Albumin and Globulin Ratio`
## 
##                                          Df Deviance   AIC
## <none>                                         20002 20018
## - `Direct Bilirubin`                      1    20028 20042
## - ` Alkphos Alkaline Phosphotase`         1    20039 20053
## - `Total Bilirubin`                       1    20061 20075
## - ` Sgpt Alamine Aminotransferase`        1    20079 20093
## - `A/G Ratio Albumin and Globulin Ratio`  1    20278 20292
## - `Total Protiens`                        1    20326 20340
## - ` ALB Albumin`                          1    20331 20345

5.3 Summarize the selected model

# `summary()` provides the coefficients of the selected model and their statistical significance.
# Look at the `Pr(>|z|)` column to see which predictors are significant:
# Values < 0.05 indicate statistically significant predictors.
# The model's residual deviance and AIC are also useful for assessing overall fit.
summary(stepwise_model)

## 
## Call:
## glm(formula = Result ~ `Total Bilirubin` + `Direct Bilirubin` + 
##     ` Alkphos Alkaline Phosphotase` + ` Sgpt Alamine Aminotransferase` + 
##     `Total Protiens` + ` ALB Albumin` + `A/G Ratio Albumin and Globulin Ratio`, 
##     family = binomial, data = data)
## 
## Coefficients:
##                                          Estimate Std. Error z value Pr(>|z|)
## (Intercept)                             6.7816537  0.3630895  18.678  < 2e-16
## `Total Bilirubin`                      -1.2440321  0.1635907  -7.605 2.86e-14
## `Direct Bilirubin`                      1.4184726  0.2771691   5.118 3.09e-07
## ` Alkphos Alkaline Phosphotase`        -0.0019375  0.0003222  -6.013 1.82e-09
## ` Sgpt Alamine Aminotransferase`       -0.0126591  0.0014520  -8.719  < 2e-16
## `Total Protiens`                       -2.0064790  0.1139280 -17.612  < 2e-16
## ` ALB Albumin`                          3.9617879  0.2232686  17.744  < 2e-16
## `A/G Ratio Albumin and Globulin Ratio` -5.4919388  0.3371597 -16.289  < 2e-16
##                                           
## (Intercept)                            ***
## `Total Bilirubin`                      ***
## `Direct Bilirubin`                     ***
## ` Alkphos Alkaline Phosphotase`        ***
## ` Sgpt Alamine Aminotransferase`       ***
## `Total Protiens`                       ***
## ` ALB Albumin`                         ***
## `A/G Ratio Albumin and Globulin Ratio` ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20623  on 15033  degrees of freedom
## Residual deviance: 20002  on 15026  degrees of freedom
## AIC: 20018
## 
## Number of Fisher Scoring iterations: 4
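
The AIC reported above can be checked by hand. For a logistic regression fitted to ungrouped binary outcomes, AIC equals the residual deviance plus twice the number of estimated coefficients. The sketch below uses the rounded figures printed in the summary; the variable names are illustrative and not part of the original analysis. It also runs a likelihood-ratio test of the selected model against the intercept-only model.

```r
# Figures copied from the summary output above (rounded as printed)
residual_deviance <- 20002
null_deviance     <- 20623
n_coefficients    <- 8    # 7 predictors + intercept

# AIC = deviance + 2 * k for a binary-outcome logistic regression
aic_by_hand <- residual_deviance + 2 * n_coefficients
print(aic_by_hand)

# Likelihood-ratio test against the intercept-only model
lr_statistic <- null_deviance - residual_deviance   # drop in deviance
lr_df        <- 15033 - 15026                       # difference in residual df
lr_p_value   <- pchisq(lr_statistic, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_statistic, df = lr_df, p_value = lr_p_value))
```

The hand-computed AIC agrees with the printed value up to rounding. The drop in deviance is statistically significant, but it is only about 3% of the null deviance, so the predictors explain a small share of the variation in `Result`.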

Data Mining

Build a logistic regression model

library(readr)
data <- read_csv("cleaned_data.csv")
data$Result <- as.factor(data$Result)
set.seed(123)
# Use all other variables in the dataset to predict `Result`
model <- glm(Result ~ ., data = data, family = binomial)

6.1 Generate predictions based on the fitted model using 0.5 threshold

initial_probs <- fitted(model) # fitted(model) returns the predicted probability of class 2 for each row
initial_preds <- ifelse(initial_probs > 0.5, 2, 1) # ifelse() converts probabilities to class labels (1 or 2)

6.2 Create initial confusion matrix

print("Confusion Matrix with 0.5 threshold:")

## [1] "Confusion Matrix with 0.5 threshold:"

initial_conf_matrix <- table(initial_preds, data$Result) # table() compares predicted vs actual classes
# Shows how many predictions were correct/incorrect
print(initial_conf_matrix)

##              
## initial_preds    1    2
##             1 6447 3986
##             2 1976 2625

6.3 ROC curve analysis

library(pROC) # Load the `pROC` package to compute and visualize the ROC curve.
# Creates ROC curve comparing true vs false positive rates
# Higher AUC (Area Under Curve) means better model
roc_obj <- roc(data$Result, initial_probs)
plot(roc_obj, col = "blue", lwd = 2, 
     main = paste("ROC Curve (AUC =", round(auc(roc_obj), 3), ")"))

6.4 Find optimal threshold

# coords() finds threshold that best balances sensitivity/specificity
# "best" method uses Youden's index
best_threshold <- coords(roc_obj, "best", ret = "threshold", 
                        best.method = "youden")
print(paste("Best threshold:", round(best_threshold, 3)))

## [1] "Best threshold: 0.462"
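
Youden's index is J = sensitivity + specificity - 1, and the "best" threshold is the one that maximises it. The following is a minimal base-R sketch of that idea on simulated data; the data, the `youden_j` helper and the fixed grid of candidate thresholds are all assumptions made for illustration, not the procedure `pROC` uses internally.

```r
set.seed(1)
# Simulated data for illustration only: 200 true 0/1 labels and scores
labels <- rbinom(200, size = 1, prob = 0.5)
scores <- plogis(rnorm(200, mean = ifelse(labels == 1, 0.8, -0.8)))

# Youden's J at a given threshold: sensitivity + specificity - 1
youden_j <- function(threshold, scores, labels) {
  predicted   <- as.integer(scores > threshold)
  sensitivity <- mean(predicted[labels == 1] == 1)
  specificity <- mean(predicted[labels == 0] == 0)
  sensitivity + specificity - 1
}

# Evaluate J on a grid of candidate thresholds and keep the maximiser
candidates <- seq(0.01, 0.99, by = 0.01)
j_values   <- sapply(candidates, youden_j, scores = scores, labels = labels)
best       <- candidates[which.max(j_values)]
print(c(threshold = best, J = max(j_values)))
```

`pROC` derives its candidate thresholds from the observed scores rather than from a fixed grid, so its result can differ slightly from a grid search like this one.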

6.5 Make new predictions using optimal threshold

initial_probs <- fitted(model) # fitted(model) gets probability predictions from the logistic regression model
best_threshold <- best_threshold[1,1] # coords() in 6.4 returned a one-cell table; extract the threshold as a plain number
optimized_preds <- ifelse(initial_probs > best_threshold, 2, 1)

6.6 Create new confusion matrix with optimal threshold

print("Confusion Matrix with optimal threshold:")

## [1] "Confusion Matrix with optimal threshold:"

optimized_conf_matrix <- table(optimized_preds, data$Result)
print(optimized_conf_matrix)

##                
## optimized_preds    1    2
##               1 5499 2967
##               2 2924 3644
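
To see what the new threshold changed, the two confusion matrices can be compared side by side. The sketch below rebuilds both from the printed counts and reports accuracy, per-class recall and balanced accuracy; `summarise_conf` is a hypothetical helper written for this comparison.

```r
# Counts copied from the two confusion matrices in this section
# (rows = predicted class, columns = actual class)
conf_050 <- matrix(c(6447, 1976, 3986, 2625), nrow = 2)
conf_opt <- matrix(c(5499, 2924, 2967, 3644), nrow = 2)

summarise_conf <- function(conf) {
  recall_1 <- conf[1, 1] / sum(conf[, 1])  # actual class 1 predicted as 1
  recall_2 <- conf[2, 2] / sum(conf[, 2])  # actual class 2 predicted as 2
  c(accuracy          = sum(diag(conf)) / sum(conf),
    recall_class_1    = recall_1,
    recall_class_2    = recall_2,
    balanced_accuracy = (recall_1 + recall_2) / 2)
}

comparison <- rbind(threshold_0.5 = summarise_conf(conf_050),
                    threshold_opt = summarise_conf(conf_opt))
print(round(comparison, 4))
```

Lowering the threshold trades recall on class 1 for recall on class 2. Overall accuracy barely moves (roughly 0.603 to 0.608), while balanced accuracy rises from about 0.581 to about 0.602.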

Build a decision tree model

# `rpart` is a package for recursive partitioning, the method used to build decision trees.
library(rpart)
library(readr)
data <- read_csv("cleaned_data.csv")
data$Result <- as.factor(data$Result)
model <- rpart(Result ~ ., data = data, method = "class")

# `Result ~ .` means using all other columns in the dataset to predict the `Result` variable.
# `data = data` specifies the dataset.
# `method = "class"` indicates that this is a classification tree.

7.1 Visualize the decision tree

# `rpart.plot` is a library used for plotting decision trees in an interpretable manner.
library(rpart.plot)
# `rpart.plot(model)` creates a visual representation of the decision tree.
# The plot shows how the data is split on different predictors at each node.
# Leaf nodes represent the final predictions.
rpart.plot(model)

7.2 Make predictions using the decision tree model

# `predict()` generates predictions for the `Result` variable based on the model.
# `newdata = data` specifies that predictions are made on the same dataset used to train the model.
# `type = "class"` returns the predicted class labels.
predictions <- predict(model, newdata = data, type = "class")

7.3 Evaluate the decision tree model

# Load the `caret` library, which provides tools for model evaluation.
library(caret)

# Compare actual (`data$Result`) and predicted (`predictions`) classes.
# `confusionMatrix()` reports accuracy, kappa, sensitivity, specificity, and related metrics.
# The result is stored under a different name so it does not shadow the `confusionMatrix()` function.
tree_conf_matrix <- confusionMatrix(as.factor(predictions), as.factor(data$Result))
print(tree_conf_matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2
##          1 6870 1850
##          2 1553 4761
##                                           
##                Accuracy : 0.7736          
##                  95% CI : (0.7669, 0.7803)
##     No Information Rate : 0.5603          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5384          
##                                           
##  Mcnemar's Test P-Value : 3.893e-07       
##                                           
##             Sensitivity : 0.8156          
##             Specificity : 0.7202          
##          Pos Pred Value : 0.7878          
##          Neg Pred Value : 0.7540          
##              Prevalence : 0.5603          
##          Detection Rate : 0.4570          
##    Detection Prevalence : 0.5800          
##       Balanced Accuracy : 0.7679          
##                                           
##        'Positive' Class : 1               
##
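
The accuracy above is measured on the same rows the tree was trained on, so it is likely to be optimistic. A common alternative is to hold out part of the data for testing. The sketch below shows the idea; it uses the `kyphosis` data set that ships with `rpart` as a stand-in for `cleaned_data.csv`, and the 70/30 split is an assumption rather than a choice made in the original analysis.

```r
library(rpart)

set.seed(123)
# kyphosis ships with rpart; it stands in for cleaned_data.csv in this sketch
n         <- nrow(kyphosis)
train_idx <- sample(n, size = round(0.7 * n))  # 70% of rows for training
train_set <- kyphosis[train_idx, ]
test_set  <- kyphosis[-train_idx, ]

# Fit the classification tree on the training rows only
tree_fit <- rpart(Kyphosis ~ ., data = train_set, method = "class")

# Evaluate on rows the tree has not seen
test_pred     <- predict(tree_fit, newdata = test_set, type = "class")
test_conf     <- table(predicted = test_pred, actual = test_set$Kyphosis)
test_accuracy <- sum(diag(test_conf)) / sum(test_conf)
print(test_conf)
print(test_accuracy)
```

Applying the same split to both the tree and the logistic regression would put their comparison on held-out data rather than on training fit.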

7.4 Build a logistic regression model for comparison

# `glm()` fits a logistic regression model with `Result` as the outcome and all other columns as predictors.
# `family = binomial` specifies that this is a logistic regression for binary outcomes.
library(readr)
data <- read_csv("cleaned_data.csv")
data$Result <- as.factor(data$Result)
set.seed(123)
model1 <- glm(Result ~ ., data = data, family = binomial)

7.5 Generate predictions from the logistic regression model

# `predict()` returns predicted probabilities for each observation.
# `type = "response"` specifies that probabilities (not log-odds) are returned.
predictions1 <- predict(model1, newdata = data, type = "response")

7.6 Convert probabilities to class labels

# A threshold of 0.5 is used for classification.
# `glm()` models the probability of the second factor level of `Result`, which is "2".
# If the predicted probability is greater than 0.5, classify as "2"; otherwise, classify as "1".
# The labels are written as strings so they match the levels of `data$Result`.
predicted_classes1 <- ifelse(predictions1 > 0.5, "2", "1")
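
The reason a high fitted probability maps to "2" is that `glm()` with a factor response treats the first level as the reference outcome and models the probability of the other level. The toy example below illustrates this; the simulated data and all variable names are assumptions made for the sketch.

```r
set.seed(42)
# Toy data for illustration: outcome coded "1"/"2" as in the dataset,
# with larger x making "2" more likely
x       <- rnorm(500)
outcome <- factor(ifelse(runif(500) < plogis(2 * x), "2", "1"))

toy_model <- glm(outcome ~ x, family = binomial)
toy_probs <- fitted(toy_model)

# The first factor level is the reference; fitted values are
# probabilities of the remaining level
print(levels(outcome))

# Fitted probabilities should be higher, on average, among
# observations whose actual outcome is "2"
print(tapply(toy_probs, outcome, mean))
```

If the factor levels were reordered, the same threshold rule would need its labels swapped, so it is worth checking `levels()` before converting probabilities to classes.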

7.7 Evaluate the logistic regression model

# Compare the predicted class labels with the actual class labels.
# `positive = "1"` specifies that the positive class is labeled as "1".

# Convert the predictions to a factor with the same levels as `data$Result`
predicted_classes1 <- factor(predicted_classes1, levels = levels(data$Result))

# Compute the confusion matrix
confusionMatrix1 <- confusionMatrix(predicted_classes1, data$Result, positive = "1")

# Print the confusion matrix
print(confusionMatrix1)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2
##          1 6447 3986
##          2 1976 2625
##                                           
##                Accuracy : 0.6034          
##                  95% CI : (0.5956, 0.6113)
##     No Information Rate : 0.5603          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.168           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7654          
##             Specificity : 0.3971          
##          Pos Pred Value : 0.6179          
##          Neg Pred Value : 0.5705          
##              Prevalence : 0.5603          
##          Detection Rate : 0.4288          
##    Detection Prevalence : 0.6940          
##       Balanced Accuracy : 0.5812          
##                                           
##        'Positive' Class : 1               
##

Group Project

Group 7

2024-12-19
