Download the COVID-19 Case Surveillance Public Use Data from the US Centers for Disease Control and Prevention (CDC). Click on the “Export” button in the top right corner to download the actual data set.
Hint: It might be useful to create a new numeric column for severity based on the values of the hosp, icu, death columns. If you do this, make sure you explain the meaning and rationale behind your choices.
a) Describe this data set at a high level. What does each row represent?
Each row represents one COVID-19 case reported to the CDC, that is, one patient record, not simply a person who came into contact with the coronavirus.
b) How many rows and columns do you expect to see based on the website? Load the data into R. Is this reflected in the data?
covid = read.csv("~/Desktop/Stat 128/COVID-19_Case_Surveillance_Public_Use_Data.csv")
nrow(covid)
## [1] 4481062
ncol(covid)
## [1] 11
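Because each row represents one reported case, the data set should have millions of rows. Each column records one attribute of a case, such as a date, a demographic field, or an outcome. The loaded data frame, with 4,481,062 rows and 11 columns, reflects this.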
There are few NA values in this data, but much of the data is missing, unknown, or blank. Which columns have the highest proportion of missing data? Why might these be missing?
missingdata = colSums(covid == "Missing" | covid == "Unknown" | covid == "" | is.na(covid))
missingdata
## cdc_report_dt pos_spec_dt
## 0 2957494
## onset_dt current_status
## 2113224 0
## sex age_group
## 54045 5281
## Race.and.ethnicity..combined. hosp_yn
## 1895805 2138483
## icu_yn death_yn
## 3932347 2269410
## medcond_yn
## 3536092
It is not uncommon for a data set to be missing some values, and missing data is better than falsified data. According to the counts above, icu_yn has the highest proportion of missing data (3,932,347 of 4,481,062 rows, about 88%), followed by medcond_yn (about 79%) and pos_spec_dt (about 66%). The outcome and medical condition columns may be missing because that information is often not available when a case is first reported. onset_dt, the date the patient's symptoms began, is also missing for about 47% of rows. This may be because symptom onset is hard to pin down. Coughing, a common symptom of coronavirus, is shared with many other illnesses, so patients may notice it without attributing it to coronavirus. Hence it is not medically established that every patient was symptomatic.
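The raw counts above are easier to compare as proportions. The following is a minimal sketch, assuming a small made-up data frame `toy` (hypothetical values, not the CDC data) that uses the same kinds of placeholder values:

```r
# Toy data frame (made-up values) standing in for the CDC file
toy = data.frame(
  onset_dt = c("2020/03/01", "", NA, "2020/04/02"),
  hosp_yn  = c("Yes", "Missing", "Unknown", "No"),
  icu_yn   = c("Missing", "Missing", "Unknown", "Unknown"),
  stringsAsFactors = FALSE
)

# TRUE wherever a cell is NA, blank, "Missing", or "Unknown"
is_missing = is.na(toy) | toy == "" | toy == "Missing" | toy == "Unknown"

# colMeans of a logical matrix gives the proportion of TRUE per column
missing_prop = sort(colMeans(is_missing), decreasing = TRUE)
print(round(missing_prop, 2))
```

On the real data, the same two lines applied to covid instead of toy should rank the columns by their share of missing values.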
The CDC states “Among adults, the risk for severe illness from COVID-19 increases with age, with older adults at highest risk.” Does the data support this claim?
Hint 1: It might be useful to think about these COVID illness rates by age after adjusting for the relative sizes of each age group in the country. You’ll need to bring in another data source to do this.
table(covid$age_group)
##
## 0 - 9 Years 10 - 19 Years 20 - 29 Years 30 - 39 Years 40 - 49 Years
## 140213 386579 869057 748175 694670
## 50 - 59 Years 60 - 69 Years 70 - 79 Years 80+ Years Unknown
## 671827 468486 263186 233588 5188
t_age = table(covid[, c("age_group")])
plot(t_age)
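This plot shows that the 20 - 29 age group has the most reported cases (869,057), followed by the 30 - 39 age group (748,175). Case counts alone do not measure severity, and they are not adjusted for the size of each age group in the population, so this plot by itself neither supports nor contradicts the CDC's claim.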
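One way to look at severity rather than case counts is the share of cases in each age group that were hospitalized, among cases where the hospitalization status is known. The following is a minimal sketch on made-up data (the data frame `toy` and all of its values are hypothetical):

```r
# Toy data (made-up values) with the same column names as the CDC file
toy = data.frame(
  age_group = c("20 - 29 Years", "20 - 29 Years", "20 - 29 Years", "20 - 29 Years",
                "80+ Years", "80+ Years", "80+ Years", "80+ Years"),
  hosp_yn   = c("No", "No", "No", "Yes", "Yes", "Yes", "No", "Missing"),
  stringsAsFactors = FALSE
)

# Keep only rows where hospitalization status is actually known
known = toy[toy$hosp_yn %in% c("Yes", "No"), ]

# Row proportions: within each age group, the share of "No" and "Yes"
t_hosp = table(known$age_group, known$hosp_yn)
hosp_rate = prop.table(t_hosp, margin = 1)
print(round(hosp_rate, 2))
```

This sketch does not adjust for population size as Hint 1 suggests. That would require dividing the counts by population estimates for each age group, which are not part of this data set.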
The CDC states “People of any age with certain underlying medical conditions are at increased risk for severe illness from COVID-19” Does the data support this claim?
t_med = table(covid[, c("age_group", "medcond_yn")])
plot(t_med)
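This plot suggests that older patients more often have a reported underlying medical condition. Because it compares age with medcond_yn rather than with an outcome such as hospitalization, it does not by itself show that medical conditions increase the risk of severe illness.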
In statistics, an interaction effect is when two or more variables have a non additive effect in the value of some response. In the questions above, we explored the severity of COVID-19 (response) given variables age and medical condition. Do age and medical condition have an interaction effect in the severity of COVID-19?
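# Severity score: 0 = not recorded as hospitalized, 1 = hospitalized, 2 = ICU
severity = rep(0, times = nrow(covid))
severity[covid$hosp_yn == "Yes"] = 1
# ICU is assigned last, so it overrides a hospitalization score of 1
severity[covid$icu_yn == "Yes"] = 2
covid$severity = severity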
table(covid$severity)
##
## 0 1 2
## 4101132 329316 50614
t_sev = table(covid[, c("age_group", "severity")])
plot(t_sev)
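This plot supports the claim that older patients are more likely to have a severe outcome. A severity of 0 means the patient was not recorded as hospitalized, which includes "No", "Missing", and "Unknown" values. A severity of 1 means the patient was hospitalized but was not recorded as requiring ICU care. A severity of 2 means the patient required ICU care. To assess an interaction, severity would need to be compared across age groups separately for patients with and without a medical condition.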
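One option for checking an interaction is to compute mean severity for every combination of age group and medical condition. If the gap between the "Yes" and "No" columns changes with age, the effect is not additive. The following is a minimal sketch on made-up data (all values in `toy` are hypothetical, and the severity column follows the 0/1/2 coding defined above):

```r
# Toy data (made-up values): two age groups, with and without a medical condition
toy = data.frame(
  age_group  = rep(c("20 - 29 Years", "80+ Years"), each = 4),
  medcond_yn = rep(c("No", "No", "Yes", "Yes"), times = 2),
  severity   = c(0, 0, 0, 1,   # younger group: No, No, Yes, Yes
                 0, 1, 2, 2),  # older group:   No, No, Yes, Yes
  stringsAsFactors = FALSE
)

# Mean severity for every age group / medical condition combination
mean_sev = tapply(toy$severity, list(toy$age_group, toy$medcond_yn), mean)
print(mean_sev)

# Effect of a medical condition within each age group
medcond_gap = mean_sev[, "Yes"] - mean_sev[, "No"]
print(medcond_gap)
```

In this toy example the gap grows from 0.5 in the younger group to 1.5 in the older group, which is what an interaction would look like. On the real data, rows with "Missing" or "Unknown" values of medcond_yn would need to be dropped first.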
Come up with your own question about COVID-19, and answer it using this data set. You may want to explore other demographic columns: sex, and race / ethnicity. You could also look at how the number of cases have changed over time using the report dates.
Question: Are older patients with coronavirus more likely to die?
t_exp = table(covid[, c("age_group", "death_yn")])
plot(t_exp)
This plot shows that a higher proportion of older patients with coronavirus died. It is not clear that these patients died because of COVID-19, but the plots above for #3 - 5 show that underlying medical conditions and severe outcomes are also more common in the older age groups.