library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(readr)
latestdeprecated <- read.csv("latest_deprecated.csv")

Works Cited:

Direct dataset: https://raw.githubusercontent.com/globaldothealth/monkeypox/946edb545947af7f5195459ce52bb71d098e240c/latest_deprecated.csv

Website: https://github.com/globaldothealth/monkeypox

Question: Do gender and age group predict the likelihood of hospitalization among confirmed monkeypox patients?

Introduction:

The dataset I chose records confirmed monkeypox cases from around the world, along with factors related to each case. The original dataset, “latestdeprecated”, has 69,595 rows (one per case, each case being one person) and 36 columns. I used three variables. “Hospitalised..Y.N.NA.” records whether the individual was admitted to hospital because of monkeypox (Y = yes, N = no, NA = not available). “Gender” records male or female. “Age” is reported as age ranges rather than exact ages; the ranges I used were “60-64”, “65-69”, “70-74”, “20-24”, “20-25”, “20-29”, “20-35”, “20-39”, “20-44”, “20-49”, “20-50”, “20-54”, “20-59”, “15-19”, “15-20”, “15-39”, “15-59”, “15-64”, “15-69”, “15-74”, “0-14”, “0-5” and “0-9”. I chose this topic because I thought it would be interesting to see how statistically meaningful factors such as age and gender are in predicting the chance of hospitalization.

#Data exploration

dim(latestdeprecated)
## [1] 69595    36
colnames(latestdeprecated)
##  [1] "ID"                      "Status"                 
##  [3] "Location"                "City"                   
##  [5] "Country"                 "Country_ISO3"           
##  [7] "Age"                     "Gender"                 
##  [9] "Date_onset"              "Date_confirmation"      
## [11] "Symptoms"                "Hospitalised..Y.N.NA."  
## [13] "Date_hospitalisation"    "Isolated..Y.N.NA."      
## [15] "Date_isolation"          "Outcome"                
## [17] "Contact_comment"         "Contact_ID"             
## [19] "Contact_location"        "Travel_history..Y.N.NA."
## [21] "Travel_history_entry"    "Travel_history_start"   
## [23] "Travel_history_location" "Travel_history_country" 
## [25] "Genomics_Metadata"       "Confirmation_method"    
## [27] "Source"                  "Source_II"              
## [29] "Source_III"              "Source_IV"              
## [31] "Source_V"                "Source_VI"              
## [33] "Source_VII"              "Date_entry"             
## [35] "Date_death"              "Date_last_modified"
str(latestdeprecated)
## 'data.frame':    69595 obs. of  36 variables:
##  $ ID                     : chr  "N1" "N2" "N3" "N4" ...
##  $ Status                 : chr  "confirmed" "confirmed" "confirmed" "confirmed" ...
##  $ Location               : chr  "Guy's and St Thomas Hospital London" "Guy's and St Thomas Hospital London" "London" "London" ...
##  $ City                   : chr  "London" "London" "London" "London" ...
##  $ Country                : chr  "England" "England" "England" "England" ...
##  $ Country_ISO3           : chr  "GBR" "GBR" "GBR" "GBR" ...
##  $ Age                    : chr  "" "" "" "" ...
##  $ Gender                 : chr  "" "" "" "male" ...
##  $ Date_onset             : chr  "2022-04-29" "2022-05-05" "2022-04-30" "" ...
##  $ Date_confirmation      : chr  "2022-05-06" "2022-05-12" "2022-05-13" "2022-05-15" ...
##  $ Symptoms               : chr  "rash" "rash" "vesicular rash" "vesicular rash" ...
##  $ Hospitalised..Y.N.NA.  : chr  "Y" "Y" "N" "Y" ...
##  $ Date_hospitalisation   : chr  "2022-05-04" "2022-05-06" "" "" ...
##  $ Isolated..Y.N.NA.      : chr  "Y" "Y" "Y" "Y" ...
##  $ Date_isolation         : chr  "2022-05-04" "2022-05-09" "" "" ...
##  $ Outcome                : chr  "" "" "" "" ...
##  $ Contact_comment        : chr  "" "Index Case of household cluster" "" "Under investigation" ...
##  $ Contact_ID             : int  NA 3 2 NA NA NA NA NA NA NA ...
##  $ Contact_location       : chr  "" "Household" "Household" "" ...
##  $ Travel_history..Y.N.NA.: chr  "Y" "N" "N" "N" ...
##  $ Travel_history_entry   : chr  "2022-05-04" "" "" "" ...
##  $ Travel_history_start   : chr  "late April" "" "" "" ...
##  $ Travel_history_location: chr  "Lagos and Delta States" "" "" "" ...
##  $ Travel_history_country : chr  "Nigeria" "" "" "" ...
##  $ Genomics_Metadata      : chr  "West African Clade" "West African Clade" "West African Clade" "West African Clade" ...
##  $ Confirmation_method    : chr  "RT-PCR" "RT-PCR" "RT-PCR" "" ...
##  $ Source                 : chr  "https://www.gov.uk/government/news/monkeypox-cases-confirmed-in-england-latest-updates" "https://www.gov.uk/government/news/monkeypox-cases-confirmed-in-england-latest-updates" "https://www.gov.uk/government/news/monkeypox-cases-confirmed-in-england-latest-updates" "https://www.gov.uk/government/news/monkeypox-cases-confirmed-in-england-latest-updates" ...
##  $ Source_II              : chr  "https://www.who.int/emergencies/disease-outbreak-news/item/2022-DON381" "" "" "" ...
##  $ Source_III             : chr  "" "" "" "" ...
##  $ Source_IV              : chr  "" "" "" "" ...
##  $ Source_V               : logi  NA NA NA NA NA NA ...
##  $ Source_VI              : logi  NA NA NA NA NA NA ...
##  $ Source_VII             : logi  NA NA NA NA NA NA ...
##  $ Date_entry             : chr  "2022-05-18" "2022-05-18" "2022-05-18" "2022-05-18" ...
##  $ Date_death             : chr  "" "" "" "" ...
##  $ Date_last_modified     : chr  "2022-05-18" "2022-05-18" "2022-05-18" "2022-05-18" ...
head(latestdeprecated)
##   ID    Status                            Location   City Country Country_ISO3
## 1 N1 confirmed Guy's and St Thomas Hospital London London England          GBR
## 2 N2 confirmed Guy's and St Thomas Hospital London London England          GBR
## 3 N3 confirmed                              London London England          GBR
## 4 N4 confirmed                              London London England          GBR
## 5 N5 confirmed                              London London England          GBR
## 6 N6 confirmed                              London London England          GBR
##   Age Gender Date_onset Date_confirmation       Symptoms Hospitalised..Y.N.NA.
## 1            2022-04-29        2022-05-06           rash                     Y
## 2            2022-05-05        2022-05-12           rash                     Y
## 3            2022-04-30        2022-05-13 vesicular rash                     N
## 4       male                   2022-05-15 vesicular rash                     Y
## 5       male                   2022-05-15 vesicular rash                     Y
## 6       male                   2022-05-15 vesicular rash                      
##   Date_hospitalisation Isolated..Y.N.NA. Date_isolation Outcome
## 1           2022-05-04                 Y     2022-05-04        
## 2           2022-05-06                 Y     2022-05-09        
## 3                                      Y                       
## 4                                      Y                       
## 5                                      Y                       
## 6                                      Y                       
##                   Contact_comment Contact_ID Contact_location
## 1                                         NA                 
## 2 Index Case of household cluster          3        Household
## 3                                          2        Household
## 4             Under investigation         NA                 
## 5             Under investigation         NA                 
## 6             Under investigation         NA                 
##   Travel_history..Y.N.NA. Travel_history_entry Travel_history_start
## 1                       Y           2022-05-04           late April
## 2                       N                                          
## 3                       N                                          
## 4                       N                                          
## 5                       N                                          
## 6                       N                                          
##   Travel_history_location Travel_history_country  Genomics_Metadata
## 1  Lagos and Delta States                Nigeria West African Clade
## 2                                                West African Clade
## 3                                                West African Clade
## 4                                                West African Clade
## 5                                                West African Clade
## 6                                                West African Clade
##   Confirmation_method
## 1              RT-PCR
## 2              RT-PCR
## 3              RT-PCR
## 4                    
## 5                    
## 6                    
##                                                                                   Source
## 1 https://www.gov.uk/government/news/monkeypox-cases-confirmed-in-england-latest-updates
## 2 https://www.gov.uk/government/news/monkeypox-cases-confirmed-in-england-latest-updates
## 3 https://www.gov.uk/government/news/monkeypox-cases-confirmed-in-england-latest-updates
## 4 https://www.gov.uk/government/news/monkeypox-cases-confirmed-in-england-latest-updates
## 5 https://www.gov.uk/government/news/monkeypox-cases-confirmed-in-england-latest-updates
## 6 https://www.gov.uk/government/news/monkeypox-cases-confirmed-in-england-latest-updates
##                                                                Source_II
## 1 https://www.who.int/emergencies/disease-outbreak-news/item/2022-DON381
## 2                                                                       
## 3                                                                       
## 4                                                                       
## 5                                                                       
## 6                                                                       
##   Source_III Source_IV Source_V Source_VI Source_VII Date_entry Date_death
## 1                            NA        NA         NA 2022-05-18           
## 2                            NA        NA         NA 2022-05-18           
## 3                            NA        NA         NA 2022-05-18           
## 4                            NA        NA         NA 2022-05-18           
## 5                            NA        NA         NA 2022-05-18           
## 6                            NA        NA         NA 2022-05-18           
##   Date_last_modified
## 1         2022-05-18
## 2         2022-05-18
## 3         2022-05-18
## 4         2022-05-18
## 5         2022-05-18
## 6         2022-05-18

Explain: In this chunk, we explore the dimensions of the dataset “latestdeprecated” and find that it has 69,595 rows (individual cases) and 36 columns. We list the column names with colnames(); the three we will use are “Hospitalised..Y.N.NA.”, “Age” and “Gender”. We then check the structure with str() and preview the first rows with head(). Both show that the Age and Gender columns contain blanks, which we handle in the next chunks.
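
To make the blanks concrete, the check below counts empty or whitespace-only strings per column. It is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not rows from the dataset); the same call could be pointed at the Age and Gender columns of latestdeprecated.

```r
# Hypothetical toy data standing in for the Age and Gender columns
toy <- data.frame(
  Age    = c("", "20-24", " ", "30-34"),
  Gender = c("male", "", "Female", "male"),
  stringsAsFactors = FALSE
)

# Count entries that are empty once surrounding whitespace is trimmed
blank_counts <- sapply(toy, function(col) sum(trimws(col) == ""))
blank_counts
```

Trimming first means that a single space, which looks blank in head(), is counted the same way as an empty string.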

#Data cleaning

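latestdeprecated_2 <- latestdeprecated %>%
  # keep only rows with a recorded hospitalisation status
  filter(`Hospitalised..Y.N.NA.` %in% c("Y", "N")) %>%
  mutate(
    # label blank ages explicitly
    Age = ifelse(Age %in% c("", " "), "Undisclosed", Age),
    # binary outcome: 1 = hospitalised, 0 = not hospitalised
    Hospitalised_2 = ifelse(`Hospitalised..Y.N.NA.` == "Y", 1, 0),
    # standardise capitalisation; anything else becomes "Uncertain"
    Gender = case_when(
      Gender %in% c("male", "Male") ~ "Male",
      Gender %in% c("Female", "female") ~ "Female",
      TRUE ~ "Uncertain"
    )
  ) %>%
  mutate(Gender = factor(Gender))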

Explain: In this chunk, we begin cleaning “latestdeprecated” and save the result as latestdeprecated_2, which will be our cleaned dataset. We use filter() to keep only the rows where “Hospitalised..Y.N.NA.” is Y or N, which drops blanks and any other values. Inside mutate(), we replace blank ages with “Undisclosed”, create a numeric outcome Hospitalised_2 (1 = Y, 0 = N), and recode Gender to a standard spelling, labelling anything that is not male or female (including blanks) as “Uncertain”. Lastly, we convert Gender to a factor so it can be used as a categorical predictor in the logistic model.
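
As an illustration of the Gender recoding, here is a small base R sketch. It is not the code used in this report: `standardise_gender()` is a hypothetical helper, and it matches case-insensitively after trimming whitespace, so a value such as " MALE" is mapped to "Male" here, whereas the case_when() rules above would label it "Uncertain".

```r
# Hypothetical helper: case-insensitive recode of free-text gender labels
standardise_gender <- function(x) {
  key <- tolower(trimws(x))
  out <- ifelse(key == "male", "Male",
         ifelse(key == "female", "Female", "Uncertain"))
  # Fix the level order so "Female" is the reference level in a model
  factor(out, levels = c("Female", "Male", "Uncertain"))
}

# Made-up input values, including a blank and a padded upper-case label
raw <- c("male", "Male", "female", "Female", "", " MALE")
cleaned <- standardise_gender(raw)

# Cross-tabulate raw against cleaned labels to verify the mapping
table(raw, cleaned)
```

Cross-tabulating the raw and cleaned values is a quick way to confirm that no label was recoded unexpectedly.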

#Data cleaning 2

latestdeprecated_2 <- latestdeprecated_2 %>%
  mutate(
    Age_Groups = case_when(
      Age %in% c("60-64", "65-69", "70-74") ~ "60-74",
      Age %in% c("20-24", "20-25", "20-29", "20-35", "20-39", "20-44", "20-49", "20-50", "20-54", "20-59") ~ "20-59",
      Age %in% c("15-19", "15-20", "15-39", "15-59", "15-64", "15-69", "15-74") ~ "15-74",
      Age %in% c("0-14", "0-5", "0-9") ~ "0-14",
      Age == "Undisclosed" ~ "Undisclosed"
    ),
    Age_Groups = factor(Age_Groups)
  ) %>%
  select(Age, Age_Groups, Gender, Hospitalised_2)

Explain: In this chunk, we continue cleaning latestdeprecated_2. The original Age column was difficult to interpret because it mixes many overlapping ranges, so we used mutate() with case_when() to create a new column, “Age_Groups”, that condenses the listed ranges into five categories: “0-14”, “15-74”, “20-59”, “60-74” and “Undisclosed”. Some of these categories still overlap (for example, “15-74” contains “20-59”) because the source ranges themselves overlap. “Undisclosed” covers only the rows where Age was blank. Any age label that is not listed in case_when() has no matching rule, so it becomes NA instead of being placed in a group that isn’t correct; those rows are dropped when the model is fitted (the model summary reports 136 observations deleted due to missingness). Lastly, we used select() to keep only the columns we need for the logistic model: “Age, Age_Groups, Gender, Hospitalised_2”.
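
The behaviour of case_when() when no rule matches can be seen in a small sketch. The vector `age` below is made up for illustration, and only a few of the ranges are included; the point is that a label with no matching rule comes back as NA.

```r
library(dplyr)

# Hypothetical age labels; "30-34" is deliberately not covered by any rule
age <- c("20-24", "30-34", "Undisclosed", "0-5")

age_group <- case_when(
  age %in% c("20-24", "20-29") ~ "20-59",
  age %in% c("0-5", "0-9")     ~ "0-14",
  age == "Undisclosed"         ~ "Undisclosed"
)

age_group

# Count how many labels were left unmatched
sum(is.na(age_group))
```

Counting the NA values right after the recode shows how many rows a later glm() call would silently drop.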

#Summarization and verification

head(latestdeprecated_2)
##           Age  Age_Groups    Gender Hospitalised_2
## 1 Undisclosed Undisclosed Uncertain              1
## 2 Undisclosed Undisclosed Uncertain              1
## 3 Undisclosed Undisclosed Uncertain              0
## 4 Undisclosed Undisclosed      Male              1
## 5 Undisclosed Undisclosed      Male              1
## 6 Undisclosed Undisclosed      Male              1

Explain: In this chunk, we do a quick check that the “latestdeprecated_2” dataset has the columns and coding we expect and is ready to use for our logistic model.

#Logistic Model

logistic_model <- glm(Hospitalised_2 ~ Age_Groups + Gender,
                      data = latestdeprecated_2, family = "binomial")

summary(logistic_model)
## 
## Call:
## glm(formula = Hospitalised_2 ~ Age_Groups + Gender, family = "binomial", 
##     data = latestdeprecated_2)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)   
## (Intercept)             3.1096     1.3211   2.354  0.01858 * 
## Age_Groups20-59        -1.0995     0.8166  -1.346  0.17818   
## Age_Groups60-74        16.0769   840.2745   0.019  0.98474   
## Age_GroupsUndisclosed  -0.3159     0.7667  -0.412  0.68033   
## GenderMale             -3.6204     1.1009  -3.289  0.00101 **
## GenderUncertain        -2.6555     1.1464  -2.316  0.02054 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 287.66  on 217  degrees of freedom
## Residual deviance: 250.55  on 212  degrees of freedom
##   (136 observations deleted due to missingness)
## AIC: 262.55
## 
## Number of Fisher Scoring iterations: 14
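
The coefficients printed above are on the log-odds scale. A common way to make them easier to read is to exponentiate them into odds ratios. The sketch below does this for the two gender coefficients, using estimates and standard errors copied by hand from the summary output; the 95% interval is a simple Wald interval, which is an assumption of this sketch and not something the report computes.

```r
# Estimates and standard errors copied from the summary output above
est <- c(GenderMale = -3.6204, GenderUncertain = -2.6555)
se  <- c(GenderMale =  1.1009, GenderUncertain =  1.1464)

# Odds ratio relative to the reference level, with a 95% Wald interval
odds_ratio <- data.frame(
  OR    = exp(est),
  lower = exp(est - qnorm(0.975) * se),
  upper = exp(est + qnorm(0.975) * se)
)

round(odds_ratio, 3)
```

With these inputs the odds ratios come out at roughly 0.03 for males and 0.07 for the uncertain group, both relative to females, the reference level of Gender.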

#Calculating pseudo R-squared

r_square <- 1 - (logistic_model$deviance/logistic_model$null.deviance)

r_square
## [1] 0.1290054

#Confusion Matrix and Important Metrics

# Predicted probabilities

predicted_prob <- logistic_model$fitted.values

# Predicted classes

predicted_class <-ifelse(predicted_prob > 0.5,1,0)

#values used in model

values_used <-latestdeprecated_2$Hospitalised_2[complete.cases(latestdeprecated_2$Age_Groups, latestdeprecated_2$Gender)]

# Confusion matrix

confusion_max <- table(Predicted = factor(predicted_class, levels = c(0,1)), Actual = factor(values_used, levels = c(0,1)))

confusion_max
##          Actual
## Predicted   0   1
##         0 109  38
##         1  28  43

Explain: From our logistic model, we found that none of the age-group coefficients were statistically significant, because all of their p-values are greater than 0.05. The coefficient for the 60-74 group (16.08, with a standard error of 840.27) is not reliably estimated; values like this suggest that the group is very small or that every case in it had the same outcome. Gender, however, was significant: the p-values for both GenderMale (0.00101) and GenderUncertain (0.02054) are below 0.05. Overall, we can conclude that gender is associated with hospitalization in this data, with males and individuals labeled as uncertain having lower log-odds of being hospitalised than females. The pseudo R-squared, calculated as 1 minus the ratio of the residual deviance to the null deviance, is 0.1290054. This indicates a weak model fit, because our predictors account for only about 12.9% of the null deviance. It also suggests that variables not included in our model may predict hospitalization better than the ones we have.
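
The fit statistic discussed above (1 minus residual deviance over null deviance) can be reproduced from the two deviances in the summary output. The sketch below uses the rounded values printed there, so the result agrees with 0.1290054 only to about four decimal places. It also adds a likelihood-ratio test of the whole model against the intercept-only model; that test is an addition for illustration and is not part of the original analysis.

```r
# Deviances and degrees of freedom copied from the summary output
null_dev  <- 287.66
null_df   <- 217
resid_dev <- 250.55
resid_df  <- 212

# Deviance-based pseudo R-squared
pseudo_r2 <- 1 - resid_dev / null_dev

# Likelihood-ratio test: fitted model vs. intercept-only model
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

c(pseudo_r2 = pseudo_r2, lr_stat = lr_stat, lr_df = lr_df, lr_p = lr_p)
```

A small p-value here would say that the predictors jointly improve on the intercept-only model, which is a separate question from how much of the deviance they account for.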

#Extract Values:
TN <- 196
FP <- 7
FN <- 133
TP <- 18

#Metrics    
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
precision <-  TP / (TP + FP) 

cat("Accuracy:", round(accuracy, 3), "\nSensitivity:", round(sensitivity, 3), "\nSpecificity:", round(specificity, 3), "\nPrecision:", round(precision, 3))
## Accuracy: 0.605 
## Sensitivity: 0.119 
## Specificity: 0.966 
## Precision: 0.72

Explain: These performance metrics are calculated from the counts entered above (TN = 196, FP = 7, FN = 133, TP = 18). Those counts total 354 cases, whereas the confusion matrix printed earlier covers the 218 cases used to fit the model, so the two do not describe the same table. Based on the entered counts, accuracy is 60.5%, meaning that about 60.5% of all cases, hospitalised or not, were classified correctly. Sensitivity is only 11.9%, which shows that most actual hospitalized cases are being missed. Specificity is 96.6%, so the model does an excellent job of identifying individuals who were not hospitalised. Precision is 72%, meaning that 72% of the cases predicted as hospitalised actually were, which is moderate performance. It is important to note that although few false positives are reported, the low sensitivity means that many actual hospitalizations are missed. Individuals who actually require hospital admission might not be identified in time, which can pose a threat to their health and safety as well as to the general public.
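
The same metrics can also be read directly off a confusion matrix instead of typing the counts by hand. The sketch below rebuilds the 2x2 table printed earlier (109, 38, 28, 43) as a matrix named `confusion`, which is a stand-in for `confusion_max`, and indexes it by its row and column labels.

```r
# Rebuild the printed table: rows = Predicted, columns = Actual
confusion <- matrix(
  c(109, 28, 38, 43), nrow = 2,
  dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1"))
)

# Index as [predicted, actual]
TN <- confusion["0", "0"]
FP <- confusion["1", "0"]
FN <- confusion["0", "1"]
TP <- confusion["1", "1"]

metrics <- c(
  accuracy    = (TP + TN) / sum(confusion),
  sensitivity = TP / (TP + FN),
  specificity = TN / (TN + FP),
  precision   = TP / (TP + FP)
)

round(metrics, 3)
```

With these counts the accuracy is about 0.697, sensitivity 0.531, specificity 0.796 and precision 0.606. These differ from the values printed above, which come from the hand-entered counts.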

#ROC Curve, AUC

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
values_used <- latestdeprecated_2$Hospitalised_2[
  complete.cases(latestdeprecated_2$Age_Groups, latestdeprecated_2$Gender)
]
roc_obj <- roc(response = values_used,
               predictor = logistic_model$fitted.values,
               levels = c("0", "1"), direction = "<")

auc_val <- auc(roc_obj)
auc_val
## Area under the curve: 0.7159
plot.roc(roc_obj, print.auc = TRUE, legacy.axes = TRUE, 
         xlab = "False Positive Rate (1 - Specificity)", ylab = "True Positive Rate (Sensitivity)")

Explain: The AUC measures the model's overall ability to distinguish hospitalized from non-hospitalized individuals. An AUC of 0.5 corresponds to chance and an AUC of 1 to perfect separation. In this case, the AUC is 0.716, which is well above 0.5 but still short of 1, indicating that the model does a fair job of distinguishing the two groups.
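
One way to read the AUC is as the probability that a randomly chosen hospitalized case receives a higher predicted probability than a randomly chosen non-hospitalized case. The base R sketch below computes it that way on made-up scores; `auc_by_pairs()`, `score` and `label` are hypothetical names, and this is not how pROC computes the value internally.

```r
# Hypothetical helper: AUC as the share of (positive, negative) pairs
# in which the positive case has the higher score; ties count as half
auc_by_pairs <- function(score, label) {
  pos <- score[label == 1]
  neg <- score[label == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

# Made-up predicted probabilities and true outcomes
score <- c(0.90, 0.80, 0.35, 0.60, 0.20, 0.10)
label <- c(1, 1, 1, 0, 0, 0)

toy_auc <- auc_by_pairs(score, label)
toy_auc
```

In this toy data 8 of the 9 positive-negative pairs are ordered correctly, so the result is 8/9, or about 0.889.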

#Conclusion and Future Directions:

With our logistic model, we found that the age groups were not significant predictors of hospitalisation, since all of their p-values were greater than the 0.05 significance level. Gender, on the other hand, was significant at the 0.05 level: males and individuals labeled as uncertain gender had lower log-odds of being hospitalised than females. The pseudo R-squared of our model was 12.9%, which shows a weak model fit and indicates that the model explains only a small portion of the variability in hospitalisation. The performance metrics showed that the model did a good job of identifying individuals who were not hospitalised but had a very low sensitivity of 11.9%, which means that many actual hospitalizations are missed. Individuals who actually require hospital admission might not be identified in time, which can pose a threat to their health and safety as well as to the general public, given the severity of monkeypox. Lastly, the AUC was 0.716, which indicates that the model does a fair job of distinguishing hospitalized from non-hospitalized individuals. Taken together with the weak fit, this suggests that factors not considered in our model are important for predicting hospitalization.

Overall, future work could add more variables to the model, such as underlying health conditions, to better predict hospitalization. Researchers could also look at geography in more depth and test whether living in a rural versus an urban area predicts hospitalization.