Outcomes

Instructions

Download the COVID-19 Case Surveillance Public Use Data from the US Centers for Disease Control and Prevention (CDC). Click the “Export” button in the top right corner to download the actual data set.

Hint: It might be useful to create a new numeric column for severity based on the values of the hosp, icu, death columns. If you do this, make sure you explain the meaning and rationale behind your choices.

Questions

1 - Reading the Documentation

a) Describe this data set at a high level. What does each row represent?

Each row represents one individual patient whose COVID-19 case was reported to the CDC.

b) How many rows and columns do you expect to see based on the website? Load the data into R. Is this reflected in the data?

covid = read.csv("~/Desktop/Stat 128/COVID-19_Case_Surveillance_Public_Use_Data.csv")

nrow(covid)
## [1] 4481062
ncol(covid)
## [1] 11

Because each row represents one individual patient, the number of rows should equal the number of reported cases, and each column records one attribute of a case. The loaded data has 4,481,062 rows and 11 columns, so this is reflected in the data.

2 - Missing Data

There are few NA values in this data, but much of the data is missing, unknown, or blank. Which columns have the highest proportion of missing data? Why might these be missing?

missingdata = colSums(covid == "Missing" | covid == "Unknown" | covid == "" | is.na(covid)) 

missingdata
##                 cdc_report_dt                   pos_spec_dt 
##                             0                       2957494 
##                      onset_dt                current_status 
##                       2113224                             0 
##                           sex                     age_group 
##                         54045                          5281 
## Race.and.ethnicity..combined.                       hosp_yn 
##                       1895805                       2138483 
##                        icu_yn                      death_yn 
##                       3932347                       2269410 
##                    medcond_yn 
##                       3536092

Missing data is not uncommon, and it is better than falsified data. According to the counts above, icu_yn has the highest proportion of missing data (3,932,347 of 4,481,062 rows, about 88%), followed by medcond_yn (about 79%) and pos_spec_dt (about 66%). ICU status and underlying conditions may be missing because they require follow-up information that was not available when the case was reported. onset_dt, the date an individual’s symptoms started, is also missing for about 47% of rows. This may be because symptom onset is self-reported and unreliable: a common symptom of coronavirus, and of many other illnesses, is coughing, which is usually noticed by the patient, and many patients may deny that it was caused by coronavirus. Hence it is not medically determined whether each patient is symptomatic.
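The counts above are easier to compare as proportions. The following is a minimal sketch, assuming the same four markers of missingness used above; `missing_prop` is a hypothetical helper name and `toy` is a made-up data frame standing in for `covid`.

```r
# Toy data frame standing in for `covid`; the values are illustrative only.
toy <- data.frame(
  hosp_yn = c("Yes", "No", "Missing", "Unknown"),
  icu_yn  = c("Missing", "Missing", "Unknown", "No"),
  sex     = c("Male", "Female", "Female", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: share of rows in each column that are NA, blank,
# "Missing", or "Unknown".
missing_prop <- function(df) {
  is_missing <- is.na(df) | df == "Missing" | df == "Unknown" | df == ""
  colMeans(is_missing)
}

# Sort so the column with the most missing data comes first.
sort(missing_prop(toy), decreasing = TRUE)
```

Calling `sort(missing_prop(covid), decreasing = TRUE)` on the real data would rank the columns directly, instead of comparing raw counts by eye.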

3 - Age Risk

The CDC states “Among adults, the risk for severe illness from COVID-19 increases with age, with older adults at highest risk.” Does the data support this claim?

Hint 1: It might be useful to think about these COVID illness rates by age after adjusting for the relative sizes of each age group in the country. You’ll need to bring in another data source to do this.

table(covid$age_group)
## 
##   0 - 9 Years 10 - 19 Years 20 - 29 Years 30 - 39 Years 40 - 49 Years 
##        140213        386579        869057        748175        694670 
## 50 - 59 Years 60 - 69 Years 70 - 79 Years     80+ Years       Unknown 
##        671827        468486        263186        233588          5188
t_age = table(covid[, c("age_group")])
plot(t_age)

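This plot shows that the 20 - 29 age group has the most reported cases (869,057), followed by the 30 - 39 age group (748,175). Case counts alone do not measure severity, so this plot by itself does not show whether the risk of severe illness increases with age.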
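Hint 1 suggests adjusting for the size of each age group. A minimal sketch of that adjustment is below, assuming a second data source that supplies a population count per group. `cases_per_100k` is a hypothetical helper, and the numbers are made up for illustration; they are not census or CDC figures.

```r
# Hypothetical helper: cases per 100,000 people in each group.
# `cases` and `population` are named numeric vectors; names are used to
# line the two vectors up, so their order does not need to match.
cases_per_100k <- function(cases, population) {
  stopifnot(!is.null(names(cases)), all(names(cases) %in% names(population)))
  1e5 * cases / population[names(cases)]
}

# Made-up numbers, NOT real census or CDC figures.
toy_cases <- c(young = 300, old = 200)
toy_pop   <- c(old = 10000, young = 60000)

rates <- cases_per_100k(toy_cases, toy_pop)
rates
```

In this toy example the smaller group has fewer cases but a higher rate, which is the pattern raw counts can hide. The same function could be applied to hospitalization or death counts per age group.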

4 - Medical Condition Risk

The CDC states “People of any age with certain underlying medical conditions are at increased risk for severe illness from COVID-19” Does the data support this claim?

t_med = table(covid[, c("age_group", "medcond_yn")])
plot(t_med)

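This plot suggests that older patients are more likely to have a reported medical condition. It shows age against medical condition only, so it does not by itself show whether patients with a medical condition have more severe outcomes.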
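To address severity more directly, one option is to compare hospitalization rates between patients with and without a medical condition, using only rows where both answers are known. This is a sketch on a made-up data frame (`toy`), not a result from the CDC data.

```r
# Toy rows standing in for `covid`; the values are illustrative only.
toy <- data.frame(
  medcond_yn = c("Yes", "Yes", "Yes", "No", "No", "No", "Missing", "Unknown"),
  hosp_yn    = c("Yes", "Yes", "No",  "No", "No", "Yes", "Yes",     "No"),
  stringsAsFactors = FALSE
)

# Keep only rows where both answers are a definite "Yes" or "No".
known <- toy[toy$medcond_yn %in% c("Yes", "No") &
             toy$hosp_yn %in% c("Yes", "No"), ]

# Row proportions: within each medcond_yn level, the share hospitalized.
hosp_by_medcond <- prop.table(table(known$medcond_yn, known$hosp_yn), margin = 1)
hosp_by_medcond
```

Dropping the Missing and Unknown rows assumes they are missing at random, which may not hold for this data set.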

5 - Interaction Effects

In statistics, an interaction effect is when two or more variables have a non additive effect in the value of some response. In the questions above, we explored the severity of COVID-19 (response) given variables age and medical condition. Do age and medical condition have an interaction effect in the severity of COVID-19?

severity = rep(0, times = nrow(covid))
severity[covid$hosp_yn == "Yes"] = 1
severity[covid$icu_yn == "Yes"] = 2
covid$severity = severity

table(covid$severity)
## 
##       0       1       2 
## 4101132  329316   50614
t_sev = table(covid[, c("age_group", "severity")])
plot(t_sev)

This plot supports the claim that as patients age, they are more likely to have a severe outcome. 0 means the patient was not recorded as hospitalized or in the ICU; because of how severity is coded, this also includes rows where hosp_yn and icu_yn are Missing or Unknown. 1 means the patient was hospitalized but not recorded as needing ICU care. 2 means the patient required ICU care.
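An interaction can be checked by asking whether the effect of a medical condition differs across age groups. The sketch below does this on the rate scale with a difference of differences. It uses a made-up data frame (`toy`) and a 0/1 hospitalization indicator `hosp`, which is an assumption and not the `severity` column defined above.

```r
# Toy data: 4 patients in each age-by-condition cell; values are illustrative.
toy <- data.frame(
  age_group  = rep(c("20 - 29 Years", "70 - 79 Years"), each = 8),
  medcond_yn = rep(rep(c("No", "Yes"), each = 4), times = 2),
  hosp       = c(0, 0, 0, 0,   0, 0, 0, 1,    # younger: no condition, condition
                 0, 0, 0, 1,   1, 1, 1, 0),   # older:   no condition, condition
  stringsAsFactors = FALSE
)

# The mean of a 0/1 indicator is the hospitalization rate in each cell.
rate <- tapply(toy$hosp, list(toy$age_group, toy$medcond_yn), mean)

# Effect of a medical condition within each age group.
effect <- rate[, "Yes"] - rate[, "No"]

# Difference of differences: a value far from 0 hints at an interaction
# on the additive (rate) scale.
interaction_gap <- effect[["70 - 79 Years"]] - effect[["20 - 29 Years"]]
interaction_gap
```

A more formal check would fit a model with an interaction term, for example `glm(hosp ~ age_group * medcond_yn, family = binomial)`, and inspect the interaction coefficients.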

6 - Your Question

Come up with your own question about COVID-19, and answer it using this data set. You may want to explore other demographic columns: sex, and race / ethnicity. You could also look at how the number of cases have changed over time using the report dates.

Question: Which patients with coronavirus are most likely to expire?

t_exp = table(covid[, c("age_group", "death_yn")])
               
plot(t_exp)

This plot shows that a higher proportion of older patients with coronavirus have expired. It is not clear that these patients expired because of COVID, but the plots in questions 3 - 5 show that a higher proportion of medical conditions and severe outcomes appears in the older age groups.
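The proportion read off the plot can also be computed as a number per age group. This is a minimal sketch on a made-up data frame (`toy`), restricted to rows where death_yn is a definite "Yes" or "No".

```r
# Toy rows standing in for `covid`; the values are illustrative only.
toy <- data.frame(
  age_group = c("20 - 29 Years", "20 - 29 Years", "20 - 29 Years", "20 - 29 Years",
                "80+ Years", "80+ Years", "80+ Years", "80+ Years"),
  death_yn  = c("No", "No", "No", "Missing", "Yes", "No", "Unknown", "Yes"),
  stringsAsFactors = FALSE
)

# Drop rows where the outcome is Missing or Unknown.
known <- toy[toy$death_yn %in% c("Yes", "No"), ]

# Share of known outcomes that are deaths, per age group.
death_rate <- tapply(known$death_yn == "Yes", known$age_group, mean)
death_rate
```

Because death_yn is missing for roughly half of the real rows, rates computed this way describe only the cases with a recorded outcome.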