| Let's first load the dataset. |
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
data <- read.csv("C:/Users/dilip/Downloads/covid_R/COVID19_line_list_data.csv")
describe(data) # Hmisc command to check the dataset entries
## data
##
## 27 Variables 1085 Observations
## --------------------------------------------------------------------------------
## ï..id
## n missing distinct Info Mean Gmd .05 .10
## 1085 0 1085 1 543 362 55.2 109.4
## .25 .50 .75 .90 .95
## 272.0 543.0 814.0 976.6 1030.8
##
## lowest : 1 2 3 4 5, highest: 1081 1082 1083 1084 1085
## --------------------------------------------------------------------------------
## case_in_country
## n missing distinct Info Mean Gmd .05 .10
## 888 197 197 1 48.84 54.99 2.00 4.00
## .25 .50 .75 .90 .95
## 11.00 28.00 67.25 110.30 153.65
##
## lowest : 1 2 3 4 5, highest: 365 443 875 925 1443
##
## Value 0 20 40 60 80 100 120 140 160 180 200
## Frequency 215 241 137 81 84 40 22 19 22 19 1
## Proportion 0.242 0.271 0.154 0.091 0.095 0.045 0.025 0.021 0.025 0.021 0.001
##
## Value 280 300 360 440 880 920 1440
## Frequency 1 1 1 1 1 1 1
## Proportion 0.001 0.001 0.001 0.001 0.001 0.001 0.001
##
## For the frequency table, variable is rounded to the nearest 20
## --------------------------------------------------------------------------------
## reporting.date
## n missing distinct
## 1084 1 43
##
## lowest : 02/01/20 02/02/20 02/03/20 02/04/20 02/05/20
## highest: 2/24/2020 2/25/2020 2/26/2020 2/27/2020 2/28/2020
## --------------------------------------------------------------------------------
## summary
## n missing distinct
## 1080 5 967
##
## lowest : confirmed COVID-19 pneumonia patient No.11 in Tianjin: female, 55, symptom onset on 01/23/2020, hospitalized on 01/23/2020, confirmed on 01/26/2020 confirmed COVID-19 pneumonia patient No.12 in Tianjin: female, 79, symptom onset on 01/24/2020, hospitalized on 01/24/2020, confirmed on 01/26/2020 confirmed COVID-19 pneumonia patient No.13 in Tianjin: female, 19, symptom onset on 01/19/2020, hospitalized on 01/20/2020, confirmed on 01/26/2020 confirmed COVID-19 pneumonia patient No.14 in Tianjin: male, 71, Wuhan resident, visited Malaysia from 01/19/2020 to 01/25/2020, arrived in Tianjin on 01/25/2020, symptom onset on 01/25/2020, hospitalized on 01/25/2020, confirmed on 01/26/2020 confirmed imported COVID-19 pneumonia patient in Gansu: female, 20, lives in Wuhan, arrived in Gansu on 01/18/2020, symptom onset on 01/19/2020, visit clinic on 01/24/2020, hospitalized on 01/24/2020.
## highest: new recovered imported COVID-19 pneumonia patient in Beijing: female, returned to Beijing from Wuhan on 01/08/2020, symptom onset afterwards, recovered on 01/24/2020. new recovered imported COVID-19 pneumonia patient in Beijing: male, returned to Beijing from Wuhan on 01/08/2020, symptom onset afterwards, recovered on 01/25/2020. Second confirmed imported COVID-19 pneumonia patient in Guangxi: male, 46, in contact with individuals from Wuhan before symptom onset. symptom onset on 01/20/2020. Second confirmed imported COVID-19 pneumonia patient in Liaoning: male, 40, works in Wuhan, visit Fushun, Liaoning on 01/12/2020, symptom onset on 01/14/2020, visit clinic in Fushun Dalian on 01/19/2020. Second confirmed imported COVID-19 pneumonia patient in Sichuan: male, 57, Wuhan resident, visited Sichuan on 01/15/2020, symptom onset on 01/16/2020 and hospitalized.
## --------------------------------------------------------------------------------
## location
## n missing distinct
## 1085 0 156
##
## lowest : Afghanistan Aichi Prefecture Alappuzha Algeria Amiens
## highest: Yunnan Zabaikalsky Zaragoza Zhejiang Zhuhai
## --------------------------------------------------------------------------------
## country
## n missing distinct
## 1085 0 38
##
## lowest : Afghanistan Algeria Australia Austria Bahrain
## highest: Thailand UAE UK USA Vietnam
## --------------------------------------------------------------------------------
## gender
## n missing distinct
## 902 183 2
##
## Value female male
## Frequency 382 520
## Proportion 0.424 0.576
## --------------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 843 242 85 0.999 49.48 20.79 22.0 25.0
## .25 .50 .75 .90 .95
## 35.0 51.0 64.0 75.0 78.9
##
## lowest : 0.25 0.50 1.00 2.00 4.00, highest: 86.00 87.00 89.00 91.00 96.00
## --------------------------------------------------------------------------------
## symptom_onset
## n missing distinct
## 563 522 70
##
## lowest : 01/02/20 01/03/20 01/04/20 01/05/20 01/06/20
## highest: 2/22/2020 2/23/2020 2/24/2020 2/25/2020 2/26/2020
## --------------------------------------------------------------------------------
## If_onset_approximated
## n missing distinct Info Sum Mean Gmd
## 560 525 2 0.123 24 0.04286 0.08219
##
## --------------------------------------------------------------------------------
## hosp_visit_date
## n missing distinct
## 507 578 60
##
## lowest : 01/01/20 01/03/20 01/05/20 01/06/20 01/08/20
## highest: 2/24/2020 2/25/2020 2/26/2020 2/27/2020 2/28/2020
## --------------------------------------------------------------------------------
## exposure_start
## n missing distinct
## 128 957 39
##
## lowest : 01/03/20 01/06/20 01/08/20 01/09/20 01/10/20
## highest: 2/15/2020 2/17/2020 2/19/2020 2/20/2020 2/21/2020
## --------------------------------------------------------------------------------
## exposure_end
## n missing distinct
## 341 744 52
##
## lowest : 01/02/20 01/03/20 01/04/20 01/05/20 01/06/20
## highest: 2/21/2020 2/22/2020 2/23/2020 2/24/2020 2/25/2020
## --------------------------------------------------------------------------------
## visiting.Wuhan
## n missing distinct Info Sum Mean Gmd
## 1085 0 2 0.437 192 0.177 0.2916
##
## --------------------------------------------------------------------------------
## from.Wuhan
## n missing distinct Info Sum Mean Gmd
## 1081 4 2 0.37 156 0.1443 0.2472
##
## --------------------------------------------------------------------------------
## death
## n missing distinct
## 1085 0 14
##
## lowest : 0 02/01/20 1 2/13/2020 2/14/2020
## highest: 2/24/2020 2/25/2020 2/26/2020 2/27/2020 2/28/2020
##
## 0 (1022, 0.942), 02/01/20 (1, 0.001), 1 (42, 0.039), 2/13/2020 (1, 0.001),
## 2/14/2020 (1, 0.001), 2/19/2020 (2, 0.002), 2/21/2020 (2, 0.002), 2/22/2020 (1,
## 0.001), 2/23/2020 (4, 0.004), 2/24/2020 (1, 0.001), 2/25/2020 (2, 0.002),
## 2/26/2020 (3, 0.003), 2/27/2020 (2, 0.002), 2/28/2020 (1, 0.001)
## --------------------------------------------------------------------------------
## recovered
## n missing distinct
## 1085 0 32
##
## lowest : 0 02/02/20 02/04/20 02/05/20 02/06/20
## highest: 2/24/2020 2/25/2020 2/26/2020 2/27/2020 2/28/2020
## --------------------------------------------------------------------------------
## symptom
## n missing distinct
## 270 815 108
##
## lowest : chest discomfort chills cold, fever, pneumonia cough cough with sputum
## highest: throat pain, chills throat pain, fever tired vomiting, cough, fever, sore throat vomiting, diarrhea, fever, cough
## --------------------------------------------------------------------------------
## source
## n missing distinct
## 1085 0 85
##
## lowest : 央视新闻 ABC ABC News 新浪 Al Arabiya
## highest: Wa.de Washington Examiner Xin Hua Net Yahoo News Yonnhap News Agency
## --------------------------------------------------------------------------------
## link
## n missing distinct
## 1085 0 490
##
## lowest : http://behdasht.gov.ir/news/%DA%A9%D8%B1%D9%88%D9%86%D8%A7+%D9%88%DB%8C%D8%B1%D9%88%D8%B3/199807/%D8%AF%D8%B1+%D8%B1%D9%88%D8%B2%D9%87%D8%A7%DB%8C+%DA%AF%D8%B0%D8%B4%D8%AA%D9%87+735+%D8%A8%DB%8C%D9%85%D8%A7%D8%B1+%D8%A8%D8%A7+%D8%B9%D9%84%D8%A7%D8%A6%D9%85+%D8%B4%D8%A8%D9%87+%D8%A2%D9%86%D9%81%D9%84%D9%88%D8%A2%D9%86%D8%B2%D8%A7+%D8%AF%D8%B1+%DA%A9%D8%B4%D9%88%D8%B1+%D8%A8%D8%B3%D8%AA%D8%B1%DB%8C+%D8%B4%D8%AF%D9%86%D8%AF+%D8%A8%D8%B1+%D8%A7%D8%B3%D8%A7%D8%B3+%D8%A2%D8%AE%D8%B1%DB%8C%D9%86+%D9%86%D8%AA%D8%A7%DB%8C%D8%AC+%D8%A2%D8%B2%D9%85%D8%A7%DB%8C%D8%B4+%D9%87%D8%A7+%D8%A7%D8%A8%D8%AA%D9%84%D8%A7%DB%8C+13+%D9%85%D9%88%D8%B1%D8%AF+%D8%AF%DB%8C%DA%AF%D8%B1+%D8%A8%D9%87+%DA%A9%D9%88%D9%88%DB%8C%D8%AF19+%D9%82%D8%B7%D8%B9%DB%8C+%D8%A8%D9%87+%D9%86%D8%B8%D8%B1+%D9%85%DB%8C+%D8%B1%D8%B3%D8%AF http://english.alarabiya.net/en/News/gulf/2020/02/25/Number-of-Kuwait-coronavirus-cases-rises-to-eight-KUNA.html http://sxwjw.shaanxi.gov.cn/art/2020/1/27/art_9_67483.html http://wjw.beijing.gov.cn/xwzx_20031/wnxw/202001/t20200121_1620353.html http://wjw.sz.gov.cn/wzx/202001/t20200120_18987787.htm
## highest: https://www3.nhk.or.jp/nhkworld/en/news/20200116_23/ https://www3.nhk.or.jp/nhkworld/en/news/20200124_14/ https://www3.nhk.or.jp/nhkworld/en/news/20200126_31/ https://www3.nhk.or.jp/nhkworld/en/news/20200130_02/ https://www3.nhk.or.jp/nhkworld/en/news/20200131_01/
## --------------------------------------------------------------------------------
##
## Variables with all observations missing:
##
## [1] X X.1 X.2 X.3 X.4 X.5 X.6
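| The describe() output above shows seven columns (X to X.6) with no observations, and date columns that mix two formats such as 02/01/20 and 2/24/2020. The sketch below shows one possible way to tidy both on a toy data frame. The toy values and the helper parse_mixed_date() are hypothetical and not part of the original analysis. |

```r
# Toy data frame imitating the issues seen in the describe() output.
# All values here are invented for illustration.
toy <- data.frame(
  id = 1:3,
  reporting.date = c("02/01/20", "2/24/2020", NA),
  X = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# Assumption: the stray characters before "id" in the first column name are a
# byte-order mark; read.csv(..., fileEncoding = "UTF-8-BOM") usually avoids them.

# Drop every column in which all observations are missing
all_missing <- vapply(toy, function(col) all(is.na(col)), logical(1))
toy <- toy[, !all_missing, drop = FALSE]

# Hypothetical helper: parse dates written as either m/d/yy or m/d/yyyy
parse_mixed_date <- function(x) {
  x <- as.character(x)
  four_digit <- grepl("/[0-9]{4}$", x)          # TRUE when the year has 4 digits
  out <- rep(as.Date(NA), length(x))
  out[four_digit]  <- as.Date(x[four_digit],  format = "%m/%d/%Y")
  out[!four_digit] <- as.Date(x[!four_digit], format = "%m/%d/%y")
  out
}

toy$reporting.date <- parse_mixed_date(toy$reporting.date)
str(toy)
```

| Values that are not dates, such as the 0 and 1 entries in the death column, would simply become NA under this helper. |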
| Next we clean the dataset. The death column holds 0, 1, or a date of death, so we first turn it into a binary indicator and then calculate the overall death rate. |
data$death_dummy <- as.integer(data$death != 0)
# Let's calculate the death rate
sum(data$death_dummy)/nrow(data)
## [1] 0.05806452
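| To see why the comparison with 0 works as a death indicator, here is a minimal sketch on a hypothetical vector that mimics the mix of values in the death column (0, 1, or a date). It also re-derives the death rate from the counts reported by describe(): 1022 of the 1085 rows have the value 0. |

```r
# Hypothetical values mimicking the death column: "0" means alive,
# anything else ("1" or a date of death) means the person died.
death <- c("0", "1", "2/26/2020", "0", "02/01/20")

death_dummy <- as.integer(death != "0")
death_dummy        # 0 1 1 0 1
mean(death_dummy)  # share of deaths in the toy vector: 0.6

# Cross-check with the describe() output: 1022 of 1085 rows are "0",
# so 63 rows count as deaths.
rate_from_counts <- (1085 - 1022) / 1085
round(rate_from_counts, 8)  # 0.05806452
```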
| AGE |
| Claim: the people who died were older than those who survived. |
dead <- subset(data, death_dummy == 1)
alive <- subset(data, death_dummy == 0)
# Mean age of the people who died and of those who are alive
mean(dead$age, na.rm = TRUE)
## [1] 68.58621
mean(alive$age, na.rm = TRUE)
## [1] 48.07229
t.test(dead$age, alive$age, alternative = "two.sided", conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: dead$age and alive$age
## t = 10.839, df = 72.234, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 16.74114 24.28669
## sample estimates:
## mean of x mean of y
## 68.58621 48.07229
| The mean age is about 68.6 years for the people who died and about 48.1 years for those who are alive, a difference of roughly 20.5 years. To check whether this difference is statistically significant, we run a Welch two-sample t-test on the two age vectors. |
| By convention, we reject the null hypothesis when the p-value is below 0.05. |
| Here the p-value is below 2.2e-16, so we reject the null hypothesis that the two mean ages are equal. |
| The 95% confidence interval for the difference is 16.7 to 24.3 years, so the data support the claim: the people who died were, on average, older than those who survived. |
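| As a sketch of what the Welch t-test computes, the statistic and its degrees of freedom can be reproduced by hand. The two vectors below are invented toy values, not the ages from the dataset; only t.test() itself comes from the analysis above. |

```r
# Toy samples, invented purely for illustration
x <- c(70, 65, 80, 72, 68)
y <- c(45, 50, 38, 60, 41, 55, 47)

# Per-group variance of the sample mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t statistic: difference in means over its standard error
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation for the degrees of freedom
df_manual <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# t.test() defaults to var.equal = FALSE, i.e. the Welch test
res <- t.test(x, y)

all.equal(unname(res$statistic), t_manual)   # TRUE
all.equal(unname(res$parameter), df_manual)  # TRUE
```

| Because the Welch test does not assume equal variances, the degrees of freedom are generally not a whole number, which is why the output above reports df = 72.234. |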
| GENDER |
| Claim: gender affects the death rate. |
men <- subset(data, gender == "male")
women <- subset(data, gender == "female")
# Death rate among men and among women
mean(men$death_dummy, na.rm = TRUE)
## [1] 0.08461538
mean(women$death_dummy, na.rm = TRUE)
## [1] 0.03664921
# The death rate is about 8.5% for men versus 3.7% for women
# So is this statistically significant?
t.test(men$death_dummy, women$death_dummy, alternative = "two.sided", conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: men$death_dummy and women$death_dummy
## t = 3.084, df = 894.06, p-value = 0.002105
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.01744083 0.07849151
## sample estimates:
## mean of x mean of y
## 0.08461538 0.03664921
| The death rate is about 8.5% among men and about 3.7% among women, a difference of about 4.8 percentage points. To check whether this difference is statistically significant, we again run a t-test, this time on the death indicator of each group. |
| The 95% confidence interval for the difference (men minus women) runs from 1.7 to 7.8 percentage points. |
| The p-value is 0.002105 < 0.05, so we reject the null hypothesis that the two death rates are equal. |
| Hence the data support the claim: in this dataset, men had a higher death rate than women. |
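| Since the death indicator is binary, a two-sample test of proportions is a common alternative to the t-test. The sketch below is not part of the original analysis. The counts are recovered from the output above: 0.08461538 × 520 = 44 deaths among men and 0.03664921 × 382 = 14 deaths among women. |

```r
# Counts derived from the reported group sizes and death rates
deaths <- c(men = 44, women = 14)
totals <- c(men = 520, women = 382)

# Two-sample test for equality of proportions
# (prop.test applies a continuity correction by default)
res <- prop.test(deaths, totals)

res$estimate  # death rate per group
res$conf.int  # 95% CI for the difference (men minus women)
res$p.value   # p-value for the null hypothesis of equal death rates
```

| With these counts the p-value is also below 0.05, so the test of proportions points in the same direction as the t-test. |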