This is my first project converted to Markdown.
#Summon and explore data
library(gtsummary)
library(tidyverse)
library(lubridate)
library(readxl)
library(ggpubr)
library(scales)
library(pillar)
library(ggthemes)
library(kableExtra)
library(ggsci)
library(hrbrthemes)
library(easystats)
library(reshape2)
library(Hmisc)
library(gsubfn)
library(tableone)
library(Publish)
library(broom)
#Factorization and Cleaning
x <- read.csv("mydata.CSV")
x <- as.data.frame(unclass(x), stringsAsFactors = T)
# cleaning death, recovered: any non-zero entry (1 or a date) counts as the event
x$death <- as.integer(x$death != 0)
x$recovered <- as.integer(x$recovered != 0)
#clean misspellings and divide symptoms column
x <- x %>%
  mutate(symptom = str_to_lower(symptom)) %>%
  mutate(symptom = str_replace_all(symptom, "//", "/")) %>%
  mutate(symptom = str_replace_all(symptom, "feaver", "fever")) %>%
  mutate(symptom = str_replace_all(symptom, "high fever", "mild fever")) %>%
  mutate(symptom = sapply(symptom, paste, collapse = ", "))
x <- x %>%
  mutate(
    fever = as.integer(grepl("fever", symptom)),
    cough = as.integer(grepl("cough", symptom) | grepl("sputum", symptom)),
    chills = as.integer(grepl("chills", symptom)),
    `joint_pain` = as.integer(grepl("joint pain", symptom)),
    fatigue = as.integer(
      grepl("fatigue", symptom) |
        grepl("malaise", symptom) |
        grepl("myalgias", symptom) |
        grepl("sore body", symptom) |
        grepl("aching muscles", symptom)),
    `abdominal_pain` = as.integer(grepl("abdominal pain", symptom)),
    diarrhea = as.integer(grepl("diarrhea", symptom)),
    cold = as.integer(
      grepl("cold", symptom) |
        grepl("sneeze", symptom) |
        grepl("flu", symptom) |
        grepl("runny nose", symptom)),
    pneumonia = as.integer(grepl("pneumonia", symptom)),
    vomiting = as.integer(grepl("vomiting", symptom)),
    `loss_of_appetite` = as.integer(grepl("loss of appetite", symptom)),
    headache = as.integer(
      grepl("headache", symptom) |
        grepl("heavy head", symptom)),
    `difficulty_breathing` = as.integer(
      grepl("difficulty breathing", symptom) |
        grepl("difficult in breathing", symptom) |
        grepl("respiratory distress", symptom) |
        grepl("chest discomfort", symptom)),
    `sore_throat` = as.integer(
      grepl("sore throat", symptom) |
        grepl("itchy throat", symptom) |
        grepl("throat pain", symptom)),
    thirst = as.integer(grepl("thirst", symptom))
  )
#barchart
x1 <- x %>%
  select(id, fever, cough, chills, joint_pain, fatigue, abdominal_pain, diarrhea,
         cold, pneumonia, vomiting, loss_of_appetite, headache,
         difficulty_breathing, sore_throat, thirst) %>%
  melt(id = c("id"))
x1 %>%
  group_by(variable) %>%
  summarise(count = round(sum(value))) %>%
  ggplot(aes(x = reorder(variable, count), y = count, fill = variable)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  xlab("") +
  ylab("") +
  ggtitle("symptoms of covid infection") +
  theme_pubr(legend = "none") +
  labs_pubr() +
  theme(plot.title = element_text(hjust = 3, vjust = .12))
The bar chart shows each symptom on the y-axis and its frequency on the x-axis. Fever, cough, fatigue, and sore throat are the four most frequent symptoms; each of the remaining symptoms was reported by fewer than 25 of the 1,085 patients.
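The flag-and-count approach behind the chart can be sketched on a few made-up entries. The `symptom` values below are hypothetical and only illustrate how free text becomes 0/1 columns and then per-symptom counts; this is a minimal sketch, not the full cleaning pipeline.

```r
# Hypothetical free-text symptom entries (not taken from the data set)
symptom <- c("fever, cough", "Feaver", "sore throat", "cough with sputum", NA)

# Normalise case and fix the known misspelling before matching
symptom <- tolower(symptom)
symptom <- gsub("feaver", "fever", symptom, fixed = TRUE)

# grepl() returns FALSE for NA, so missing entries become 0
flags <- data.frame(
  fever = as.integer(grepl("fever", symptom)),
  cough = as.integer(grepl("cough|sputum", symptom)),
  sore_throat = as.integer(grepl("sore throat", symptom))
)

# Column sums give the per-symptom counts that a bar chart would display
counts <- sort(colSums(flags), decreasing = TRUE)
counts
```

A single pattern with `|` (as in `"cough|sputum"`) is equivalent to combining two `grepl()` calls with `|`, which is the form used in the code above.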
#death rate calculation and mean age
sum(x$death) / nrow(x)
## [1] 0.05806452
The death rate is approximately 5.8%, so a notable share of individuals in the data set died. To test the hypotheses that death is more likely among older individuals and recovery is more likely among younger individuals, t-tests were conducted comparing age by death status and age by recovery status. Both relationships were statistically significant.
The t-test shows a statistically significant and large difference in age between the two outcome groups. The mean age of individuals who died (group 1) is 68.59, compared with 48.07 for those who survived (group 0), a difference of about 20.5 years (95% CI [16, 24]). The t-statistic of -10.84 is negative because the difference is taken as group 0 minus group 1; together with p < 0.001, it indicates that older age is strongly associated with death.
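The call that produced these numbers is not shown, so the following is only a sketch of how such a comparison is typically run in base R. The data frame `toy` is simulated for illustration and its values are assumptions, not the study's data.

```r
set.seed(1)  # simulated data only; these are not the study's values
toy <- data.frame(
  death = rep(c(0, 1), times = c(200, 50)),
  age = c(rnorm(200, mean = 48, sd = 15), rnorm(50, mean = 68, sd = 15))
)

# Welch two-sample t-test of age by outcome group (the default for t.test)
age_by_death <- t.test(age ~ death, data = toy)

# The difference is taken as mean(group 0) - mean(group 1), so a higher
# mean age among deaths gives a negative t-statistic
age_by_death$statistic
age_by_death$estimate
age_by_death$conf.int
```

On the real data the equivalent call would presumably be `t.test(age ~ death, data = x)`, assuming `death` holds exactly two groups.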
The chi-square test shows a statistically significant association between gender and death (p < 0.005), so the proportion of deaths differs between genders in this data set. The test alone does not show the direction or size of that difference; further analysis would be needed to characterize it.
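A chi-square test of this kind, together with the row proportions that reveal the direction of the difference, might look like the sketch below. The counts in `toy_tab` are hypothetical and chosen only for illustration.

```r
# Hypothetical counts for illustration; not the study's actual table
toy_tab <- matrix(
  c(370, 12,
    480, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("female", "male"), death = c("alive", "dead"))
)

# Pearson chi-square test of independence (for 2x2 tables R applies
# Yates' continuity correction by default)
gender_death <- chisq.test(toy_tab)
gender_death$p.value

# Row-wise proportions show which gender has the higher share of deaths,
# which the test statistic alone does not
prop.table(toy_tab, margin = 1)
```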
#creating an excel table to summarize the findings
tab1 <- table(x$gender,x$death)
colnames(tab1) <- c("Alive","Dead")
row.names(tab1) <- c("Female","Male")
This code summarizes death counts by gender so that the table can be exported as a CSV file and opened in Excel.
It was also used to separate the categorical and numerical variables into a table that makes the analysis easier, leading up to the logistic regression.
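The export step itself is not shown. One way it could be done is sketched below with a stand-in for `tab1` built from hypothetical counts; the file name is an assumption, and a temporary directory is used so the sketch has no side effects.

```r
# Stand-in for tab1 with hypothetical counts (not the study's values)
tab1 <- as.table(matrix(
  c(370, 480, 12, 40), nrow = 2,
  dimnames = list(c("Female", "Male"), c("Alive", "Dead"))
))

# as.data.frame.matrix() keeps the 2x2 layout; plain as.data.frame()
# would reshape a table into long format with a Freq column
out_file <- file.path(tempdir(), "gender_death.csv")  # hypothetical file name
write.csv(as.data.frame.matrix(tab1), out_file)

# Read the file back to confirm the layout survived the round trip
back <- read.csv(out_file, row.names = 1)
back
```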
# Logistic Regression to predict death using age, visiting.Wuhan and from.Wuhan
# Shapiro-Wilk normality test for every numeric column
shapiro_results <- lapply(x[sapply(x, is.numeric)], function(var) {
  shapiro_test_result <- shapiro.test(var)
  return(shapiro_test_result)
})
col_indices <- 20:34
x[, col_indices] <- lapply(x[, col_indices], as.factor)
glm_mortality <- glm(death ~ age + visiting.Wuhan + from.Wuhan, data = x , family=binomial())
tbl_regression(glm_mortality)
| Characteristic | log(OR) | 95% CI | p-value |
|---|---|---|---|
| age | 0.08 | 0.06, 0.11 | <0.001 |
| visiting.Wuhan | -0.96 | -3.9, 0.66 | 0.4 |
| from.Wuhan | 2.1 | 1.5, 2.8 | <0.001 |

OR = Odds Ratio, CI = Confidence Interval
glm_model <- glm_mortality %>% tidy()
A logistic regression model was used to predict the likelihood of death from age, visiting Wuhan, and being from Wuhan. The coefficients (beta) below are on the log-odds scale.
Age has a statistically significant and positive effect (beta = 0.08, 95% CI [0.06, 0.11], p < 0.001; Std. beta = 1.50, 95% CI [1.10, 1.94]). This suggests that as age increases, the likelihood of death also increases.
Visiting Wuhan has a statistically non-significant and negative effect (beta = -0.96, 95% CI [-3.86, 0.66], p = 0.357; Std. beta = -0.36, 95% CI [-1.47, 0.25]). This indicates that visiting Wuhan does not significantly impact the likelihood of death.
Being from Wuhan has a statistically significant and positive effect (beta = 2.14, 95% CI [1.52, 2.79], p < 0.001; Std. beta = 0.83, 95% CI [0.58, 1.08]). Being from Wuhan is associated with a significantly higher likelihood of death.
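Because the coefficients are log-odds, exponentiating them gives odds ratios, which are often easier to interpret. The sketch below uses only the three beta values quoted in the text above; it does not refit the model.

```r
# Log-odds coefficients as reported in the text above
log_odds <- c(age = 0.08, visiting.Wuhan = -0.96, from.Wuhan = 2.14)

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratio <- exp(log_odds)
round(odds_ratio, 2)

# With a fitted model the same conversion is exp(coef(model)); gtsummary's
# tbl_regression(model, exponentiate = TRUE) reports odds ratios directly
```

Read this way, each additional year of age multiplies the odds of death by roughly 1.08, and being from Wuhan multiplies them by roughly 8.5.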
#Grouped bar plot for countries and locations (country)
x %>%
  # is.numeric() tests the whole column, not each row; drop missing values instead
  filter(!is.na(case_in_country)) %>%
  group_by(location) %>%
  summarise(count = sum(case_in_country, na.rm = TRUE)) %>%
  filter(count > 1000) %>%
  mutate(location = reorder(location, count)) %>%
  ggbarplot(
    y = "count",
    x = "location",
    fill = "location",
    xlab = "High Risk Countries",
    ylab = "Patient count",
    title = "Cases Per Country"
  ) +
  scale_fill_jama() + rremove("legend") + coord_flip()
This bar chart shows the locations with the highest case counts (more than 1,000). South Korea has the most infected patients, but other locations also show high counts.