Introduction

This week's (Week 5) data dive focuses on how data documentation affects interpretation, highlights risks that arise from unclear or incomplete documentation, and evaluates missing data and outliers. Each section below corresponds directly to a required task and is labeled for grading.

library(tidyverse)  # provides read_csv(), glimpse(), the dplyr verbs, and ggplot2

covid_data <- read_csv("covid_combined_groups.csv")
glimpse(covid_data)

## Rows: 41,602
## Columns: 30
## $ iso_code                            <chr> "AUT", "AUT", "AUT", "AUT", "AUT",…
## $ continent                           <chr> "Europe", "Europe", "Europe", "Eur…
## $ location                            <chr> "Austria", "Austria", "Austria", "…
## $ date                                <date> 2020-03-01, 2020-03-02, 2020-03-0…
## $ new_cases_smoothed_per_million      <dbl> 0.11, 0.11, 0.11, 0.11, 0.11, 0.11…
## $ new_deaths_smoothed_per_million     <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ total_cases_per_million             <dbl> 0.77, 0.77, 0.77, 0.77, 0.77, 0.77…
## $ total_deaths_per_million            <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ stringency_index                    <dbl> 11.11, 11.11, 11.11, 11.11, 11.11,…
## $ reproduction_rate                   <dbl> 1.07, 1.07, 1.07, 1.07, 1.07, 1.07…
## $ total_vaccinations_per_hundred      <dbl> 69.3, 69.3, 69.3, 69.3, 69.3, 69.3…
## $ people_vaccinated_per_hundred       <dbl> 43.6, 43.6, 43.6, 43.6, 43.6, 43.6…
## $ people_fully_vaccinated_per_hundred <dbl> 30.58, 30.58, 30.58, 30.58, 30.58,…
## $ hospital_beds_per_thousand          <dbl> 7.37, 7.37, 7.37, 7.37, 7.37, 7.37…
## $ life_expectancy                     <dbl> 81.54, 81.54, 81.54, 81.54, 81.54,…
## $ cardiovasc_death_rate               <dbl> 145.18, 145.18, 145.18, 145.18, 14…
## $ diabetes_prevalence                 <dbl> 6.35, 6.35, 6.35, 6.35, 6.35, 6.35…
## $ gdp_per_capita                      <dbl> 45436.69, 45436.69, 45436.69, 4543…
## $ population_density                  <dbl> 106.75, 106.75, 106.75, 106.75, 10…
## $ median_age                          <dbl> 44.4, 44.4, 44.4, 44.4, 44.4, 44.4…
## $ aged_65_older                       <dbl> 19.2, 19.2, 19.2, 19.2, 19.2, 19.2…
## $ human_development_index             <dbl> 0.92, 0.92, 0.92, 0.92, 0.92, 0.92…
## $ population                          <dbl> 8939617, 8939617, 8939617, 8939617…
## $ country_group                       <chr> "EU", "EU", "EU", "EU", "EU", "EU"…
## $ year                                <dbl> 2020, 2020, 2020, 2020, 2020, 2020…
## $ month                               <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ year_month                          <chr> "2020-03", "2020-03", "2020-03", "…
## $ case_fatality_rate                  <dbl> 0.000000000, 0.000000000, 0.000000…
## $ vax_coverage                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ days_since_start                    <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, …

1. Why do you think they chose to encode the data the way they did? What could have happened if you didn’t read the documentation?

Two columns requiring documentation:

stringency_index
reproduction_rate

summary(covid_data$stringency_index)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   45.37   57.87   58.26   71.76  100.00

summary(covid_data$reproduction_rate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.110   0.890   1.040   1.074   1.210   4.650

Answer:
These variables are encoded numerically to standardize comparisons across countries and time. Numeric encoding allows modeling, aggregation, and visualization. However, without documentation, a user might assume stringency_index measures real behavior (it measures policy strictness) or that reproduction_rate is directly observed rather than modeled. Ignoring documentation could lead to incorrect causal interpretations and misleading conclusions.
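
One way to make the documented meaning explicit is to turn it into checks that run before any analysis. The sketch below is illustrative only: the `toy` tibble is hypothetical stand-in data, and the ranges (stringency_index between 0 and 100, reproduction_rate non-negative) are assumptions based on the summary output above rather than quotations from the documentation.

```r
library(dplyr)
library(tibble)

# Hypothetical toy rows standing in for covid_data (values are illustrative only)
toy <- tibble(
  location          = c("A", "A", "B", "B"),
  stringency_index  = c(11.11, 57.87, 100, 45.37),
  reproduction_rate = c(1.07, 0.89, NA, 4.65)
)

# Assumed ranges: stringency_index in [0, 100], reproduction_rate >= 0
range_check <- toy %>%
  summarise(
    stringency_in_range = all(between(stringency_index, 0, 100), na.rm = TRUE),
    r_nonnegative       = all(reproduction_rate >= 0, na.rm = TRUE),
    r_missing           = sum(is.na(reproduction_rate))
  )

range_check
```

A check like this does not confirm that a variable means what the analyst assumes, but it surfaces values that contradict the documented scale.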

2. What element remains unclear even after reading documentation?

covid_data %>%
  select(location, date, reproduction_rate) %>%
  filter(!is.na(reproduction_rate)) %>%
  slice_sample(n = 10)  # random rows, so the output differs between runs

## # A tibble: 10 × 3
##    location      date       reproduction_rate
##    <chr>         <date>                 <dbl>
##  1 Sweden        2021-04-02              1.22
##  2 Iran          2021-05-26              0.82
##  3 Estonia       2020-09-16              1.27
##  4 Japan         2021-09-30              0.42
##  5 Finland       2021-03-22              0.96
##  6 South Africa  2020-10-20              1.03
##  7 Greece        2020-07-07              1.29
##  8 Iceland       2021-06-27              0.55
##  9 Iceland       2020-07-05              0.9 
## 10 United States 2021-02-17              0.75

Answer:
Cross-country comparability of reproduction_rate remains unclear. Differences in testing, reporting delays, and estimation methods may bias comparisons. The documentation does not fully explain how these structural differences are corrected.

3. Build at least two visualizations using a column affected by Question 2. Highlight what is unclear and explain risks.

Visualization 1

covid_data %>%
  filter(location %in% c("United States", "India", "Brazil")) %>%
  ggplot(aes(date, reproduction_rate, color = location)) +
  geom_line() +
  geom_hline(yintercept = 1, linetype = "dashed") +
  labs(
    title = "Reproduction Rate Over Time (R=1 Threshold Shown)",
    x = "Date",
    y = "Reproduction Rate"
  )

Visualization 2

covid_data %>%
  filter(location %in% c("United States", "India", "Brazil")) %>%
  ggplot(aes(location, reproduction_rate, fill = location)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Reproduction Rate by Country",
    x = "Country",
    y = "Reproduction Rate"
  )

Explanation & Risks:
Both plots invite direct cross-country comparison. However, because reproduction_rate is a modeled estimate that depends on each country's testing and reporting, the comparison may be biased. Risks include incorrect policy conclusions and unfair country comparisons.

To reduce negative consequences:

Avoid causal language.
Report modeling assumptions.
Include uncertainty where possible.
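
A minimal sketch of one such safeguard: summarise reproduction_rate within each country and report how many days actually have an estimate, so the reader sees the coverage behind each line or box. The `toy` tibble is hypothetical, and the "share of days above 1" summary is an illustrative choice, not a method taken from the dataset's documentation.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for covid_data
toy <- tibble(
  location          = rep(c("A", "B"), each = 4),
  reproduction_rate = c(1.2, 0.9, NA, 1.1, NA, NA, 0.8, 1.3)
)

# Per-country coverage and share of estimated days with R above 1
r_summary <- toy %>%
  group_by(location) %>%
  summarise(
    n_days        = n(),
    n_estimated   = sum(!is.na(reproduction_rate)),
    share_above_1 = mean(reproduction_rate > 1, na.rm = TRUE),
    .groups = "drop"
  )

r_summary
```

Reporting n_estimated next to any plot makes it visible when one country's curve rests on far fewer modeled days than another's.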

4. For at least two categorical columns:

Are there explicitly missing rows?
Are there implicitly missing rows?
Are there empty groups?

covid_data %>% count(continent)

## # A tibble: 6 × 2
##   continent         n
##   <chr>         <int>
## 1 Africa         3355
## 2 Asia          10065
## 3 Europe        21472
## 4 North America  2684
## 5 Oceania        1342
## 6 South America  2684

covid_data %>% count(country_group)

## # A tibble: 3 × 2
##   country_group     n
##   <chr>         <int>
## 1 EU            17446
## 2 Non_OECD      13420
## 3 OECD_Non_EU   10736

Answer:

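Continent:

Explicitly missing rows: No. count() returns no NA group, and the six counts sum to 41,602, the full row count.
Implicitly missing rows: Possible. Countries absent from the dataset, or country-date combinations with no row, leave no NA to flag them and cannot be seen in these counts.
Empty groups: None among the six continents listed; a continent with no sampled countries would simply not appear.

Country_group:

Explicitly missing rows: No. The three counts also sum to 41,602, with no NA group.
Implicitly missing rows: Possible. Any country that was never assigned to one of the three groups is not in the data at all, since no row has an NA group.
Empty groups: None of the three groups is empty, but a defined category with zero records would be dropped from the count() output.

These gaps affect the accuracy of grouped analysis, because an absent group or row is silently excluded from every summary.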
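
The three kinds of missingness can be checked separately. The sketch below uses a hypothetical `toy` tibble and a hypothetical `expected_continents` vector; it assumes the data should contain one row per location and date, which the documentation would need to confirm.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data standing in for covid_data
toy <- tibble(
  location  = c("A", "A", "B", "C"),
  date      = as.Date(c("2020-03-01", "2020-03-02", "2020-03-01", "2020-03-02")),
  continent = c("Europe", "Europe", "Asia", NA)
)

# Explicitly missing: an NA is stored in the column
n_explicit <- sum(is.na(toy$continent))

# Implicitly missing: location-date combinations that have no row at all
implicit <- toy %>%
  tidyr::expand(location, date) %>%
  anti_join(toy, by = c("location", "date"))

# Empty groups: declared categories with zero records
expected_continents <- c("Africa", "Asia", "Europe")  # hypothetical expected set
empty_groups <- toy %>%
  mutate(continent = factor(continent, levels = expected_continents)) %>%
  count(continent, .drop = FALSE) %>%
  filter(n == 0)
```

Converting to a factor with declared levels matters here: count() on a character column can only report values that occur, so an empty category stays invisible unless the expected levels are supplied.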

5. For at least one continuous column, what would you define as an outlier, and why?

We define an outlier as any value above the 99th percentile of total_cases_per_million.

cutoff <- quantile(covid_data$total_cases_per_million, 0.99, na.rm = TRUE)

covid_data %>%
  filter(total_cases_per_million > cutoff) %>%
  select(location, date, total_cases_per_million) %>%
  arrange(desc(total_cases_per_million)) %>%
  head(10)

## # A tibble: 10 × 3
##    location date       total_cases_per_million
##    <chr>    <date>                       <dbl>
##  1 Slovenia 2021-12-26                 215977.
##  2 Slovenia 2021-12-27                 215977.
##  3 Slovenia 2021-12-28                 215977.
##  4 Slovenia 2021-12-29                 215977.
##  5 Slovenia 2021-12-30                 215977.
##  6 Slovenia 2021-12-31                 215977.
##  7 Slovenia 2021-12-19                 212384.
##  8 Slovenia 2021-12-20                 212384.
##  9 Slovenia 2021-12-21                 212384.
## 10 Slovenia 2021-12-22                 212384.

Explanation:
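Values above the 99th percentile are extreme relative to the overall distribution. These may reflect reporting spikes, small populations, or true outbreak extremes. Because total_cases_per_million is cumulative, the largest values also cluster at the latest dates, as the Slovenia rows from December 2021 show. Identifying them is important because they can heavily influence averages and regression results.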
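
For comparison, the percentile cutoff can be set beside a 1.5 × IQR upper fence, a different and commonly used outlier rule. This is a sketch on a hypothetical `toy` tibble, not a re-analysis of the real data; the names `p99` and `upper_fence` are introduced here for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: one value per country, with one extreme value
toy <- tibble(
  location = paste0("C", 1:10),
  total_cases_per_million = c(100, 120, 130, 150, 160, 170, 180, 200, 210, 900)
)

# Rule 1: 99th percentile cutoff (the definition used in this section)
p99 <- quantile(toy$total_cases_per_million, 0.99, na.rm = TRUE)[[1]]

# Rule 2: upper fence at Q3 + 1.5 * IQR
q <- quantile(toy$total_cases_per_million, c(0.25, 0.75), na.rm = TRUE)
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])

flagged <- toy %>%
  mutate(
    above_p99   = total_cases_per_million > p99,
    above_fence = total_cases_per_million > upper_fence
  ) %>%
  filter(above_p99 | above_fence)

flagged
```

The two rules behave differently: a percentile cutoff flags roughly the top 1% of rows whatever the distribution looks like, while the fence flags only values far from the bulk of the data and may flag none.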

Conclusion

Documentation gaps, modeled variables, missing categories, and extreme values introduce analytical risk. Careful interpretation and transparency are essential.

Week 5 Data Dive: Documentation

Krish Shah

February 10, 2026