This week's (Week 5) data dive focuses on how data documentation affects interpretation, highlights the risks that arise from unclear or incomplete documentation, and evaluates missing data and outliers. Each section below corresponds directly to a required task and is clearly labeled for grading purposes.
## Rows: 41,602
## Columns: 30
## $ iso_code <chr> "AUT", "AUT", "AUT", "AUT", "AUT",…
## $ continent <chr> "Europe", "Europe", "Europe", "Eur…
## $ location <chr> "Austria", "Austria", "Austria", "…
## $ date <date> 2020-03-01, 2020-03-02, 2020-03-0…
## $ new_cases_smoothed_per_million <dbl> 0.11, 0.11, 0.11, 0.11, 0.11, 0.11…
## $ new_deaths_smoothed_per_million <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ total_cases_per_million <dbl> 0.77, 0.77, 0.77, 0.77, 0.77, 0.77…
## $ total_deaths_per_million <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ stringency_index <dbl> 11.11, 11.11, 11.11, 11.11, 11.11,…
## $ reproduction_rate <dbl> 1.07, 1.07, 1.07, 1.07, 1.07, 1.07…
## $ total_vaccinations_per_hundred <dbl> 69.3, 69.3, 69.3, 69.3, 69.3, 69.3…
## $ people_vaccinated_per_hundred <dbl> 43.6, 43.6, 43.6, 43.6, 43.6, 43.6…
## $ people_fully_vaccinated_per_hundred <dbl> 30.58, 30.58, 30.58, 30.58, 30.58,…
## $ hospital_beds_per_thousand <dbl> 7.37, 7.37, 7.37, 7.37, 7.37, 7.37…
## $ life_expectancy <dbl> 81.54, 81.54, 81.54, 81.54, 81.54,…
## $ cardiovasc_death_rate <dbl> 145.18, 145.18, 145.18, 145.18, 14…
## $ diabetes_prevalence <dbl> 6.35, 6.35, 6.35, 6.35, 6.35, 6.35…
## $ gdp_per_capita <dbl> 45436.69, 45436.69, 45436.69, 4543…
## $ population_density <dbl> 106.75, 106.75, 106.75, 106.75, 10…
## $ median_age <dbl> 44.4, 44.4, 44.4, 44.4, 44.4, 44.4…
## $ aged_65_older <dbl> 19.2, 19.2, 19.2, 19.2, 19.2, 19.2…
## $ human_development_index <dbl> 0.92, 0.92, 0.92, 0.92, 0.92, 0.92…
## $ population <dbl> 8939617, 8939617, 8939617, 8939617…
## $ country_group <chr> "EU", "EU", "EU", "EU", "EU", "EU"…
## $ year <dbl> 2020, 2020, 2020, 2020, 2020, 2020…
## $ month <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ year_month <chr> "2020-03", "2020-03", "2020-03", "…
## $ case_fatality_rate <dbl> 0.000000000, 0.000000000, 0.000000…
## $ vax_coverage <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ days_since_start <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, …
Two columns requiring documentation, stringency_index and reproduction_rate (summarized below in that order):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 45.37 57.87 58.26 71.76 100.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.110 0.890 1.040 1.074 1.210 4.650
Answer:
These variables are encoded numerically to standardize comparisons
across countries and time. Numeric encoding allows modeling,
aggregation, and visualization. However, without documentation, a user
might assume stringency_index measures real behavior (it measures policy
strictness) or that reproduction_rate is directly observed rather than
modeled. Ignoring documentation could lead to incorrect causal
interpretations and misleading conclusions.
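The two summaries above are consistent with calling `summary()` on each column. The following is a minimal sketch of that check, assuming a data frame shaped like `covid_data`. The `toy` data frame and its values are hypothetical and are not taken from the dataset.

```r
# Hypothetical toy data standing in for covid_data (values are illustrative only)
toy <- data.frame(
  location          = c("A", "A", "B", "B"),
  stringency_index  = c(0, 50, 75, 100),
  reproduction_rate = c(0.8, 1.0, 1.2, NA)
)

# Min, quartiles, median, and mean for each column; NAs are counted separately
stringency_summary <- summary(toy$stringency_index)
rt_summary         <- summary(toy$reproduction_rate)

print(stringency_summary)
print(rt_summary)

# Sanity check based on the 0-100 range seen in the summary above:
# count values that fall outside that range
out_of_range <- sum(toy$stringency_index < 0 | toy$stringency_index > 100,
                    na.rm = TRUE)
print(out_of_range)
```

A range check like `out_of_range` only confirms that values are plausible; it does not say what the variable means, which is why the documentation is still needed.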
covid_data %>%
  select(location, date, reproduction_rate) %>%
  filter(!is.na(reproduction_rate)) %>%
  sample_n(10)
## # A tibble: 10 × 3
## location date reproduction_rate
## <chr> <date> <dbl>
## 1 Sweden 2021-04-02 1.22
## 2 Iran 2021-05-26 0.82
## 3 Estonia 2020-09-16 1.27
## 4 Japan 2021-09-30 0.42
## 5 Finland 2021-03-22 0.96
## 6 South Africa 2020-10-20 1.03
## 7 Greece 2020-07-07 1.29
## 8 Iceland 2021-06-27 0.55
## 9 Iceland 2020-07-05 0.9
## 10 United States 2021-02-17 0.75
Answer:
Cross-country comparability of reproduction_rate remains unclear.
Differences in testing, reporting delays, and estimation methods may
bias comparisons. The documentation does not fully explain how these
structural differences are corrected.
covid_data %>%
  filter(location %in% c("United States", "India", "Brazil")) %>%
  ggplot(aes(date, reproduction_rate, color = location)) +
  geom_line() +
  geom_hline(yintercept = 1, linetype = "dashed") +
  labs(
    title = "Reproduction Rate Over Time (R=1 Threshold Shown)",
    x = "Date",
    y = "Reproduction Rate"
  )
covid_data %>%
  filter(location %in% c("United States", "India", "Brazil")) %>%
  ggplot(aes(location, reproduction_rate, fill = location)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Reproduction Rate by Country",
    x = "Country",
    y = "Reproduction Rate"
  )
Explanation & Risks:
- These visuals invite direct cross-country comparison. However, because
reproduction_rate is a modeled estimate that is influenced by differences
in reporting and testing, the comparison may be biased. Risks include
incorrect policy conclusions and unfair comparisons between countries.
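The comparison drawn by the line chart and boxplot code above can also be tabulated, which makes it easier to see how many observations sit behind each country's distribution. This is a minimal sketch only: the `toy` tibble and its values are hypothetical, and the column names `n_obs`, `n_missing`, `median_rt`, and `share_above_1` are introduced here for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for covid_data (values are illustrative only)
toy <- tibble(
  location          = rep(c("X", "Y"), each = 4),
  reproduction_rate = c(0.8, 1.2, 1.4, NA, 0.9, 0.95, 1.1, 1.3)
)

# Per-country summary: how much data there is, and how often the estimate exceeds 1
rt_by_country <- toy %>%
  group_by(location) %>%
  summarise(
    n_obs         = sum(!is.na(reproduction_rate)),
    n_missing     = sum(is.na(reproduction_rate)),
    median_rt     = median(reproduction_rate, na.rm = TRUE),
    share_above_1 = mean(reproduction_rate > 1, na.rm = TRUE)
  )

print(rt_by_country)
```

Reporting `n_missing` next to the medians does not remove the bias described above, but it shows when a country's distribution rests on fewer estimates than the others.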
To reduce negative consequences:
## # A tibble: 6 × 2
## continent n
## <chr> <int>
## 1 Africa 3355
## 2 Asia 10065
## 3 Europe 21472
## 4 North America 2684
## 5 Oceania 1342
## 6 South America 2684
## # A tibble: 3 × 2
## country_group n
## <chr> <int>
## 1 EU 17446
## 2 Non_OECD 13420
## 3 OECD_Non_EU 10736
Answer:
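Continent:
Explicitly missing rows: None in the counts above. The six continent counts sum to 41,602, the full row count, and no NA group appears.
Implicitly missing rows: Possible. Countries that are absent from the dataset contribute no rows at all, so each count reflects only the countries that were included.
Empty groups: A continent with zero records does not appear in the output at all.
Country_group:
Explicitly missing rows: None in the counts above. The three group counts also sum to 41,602, with no NA group.
Implicitly missing rows: Possible. A country that was never assigned to a group, or never included, contributes no rows.
Empty groups: Possible if a defined category has zero records; such a category would not be listed.
All three kinds of missingness affect the accuracy of grouped analysis, because group counts and averages silently exclude whatever is absent.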
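The three kinds of missingness can each be checked with a different call. This is a minimal sketch, assuming dplyr and tidyr are available. The `toy` tibble is hypothetical, and `continent` is converted to a factor with an assumed set of levels (in the data above it is a character column), because `count()` can only report an empty group when the possible categories are declared.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: country "C" has an NA continent (explicitly missing),
# and country "B" has no row for 2020-03-02 (implicitly missing)
toy <- tibble(
  location  = c("A", "A", "B", "C", "C"),
  date      = as.Date(c("2020-03-01", "2020-03-02", "2020-03-01",
                        "2020-03-01", "2020-03-02")),
  continent = factor(c("Europe", "Europe", "Asia", NA, NA),
                     levels = c("Africa", "Asia", "Europe"))
)

# 1. Explicitly missing: rows where the value itself is NA
n_explicit <- sum(is.na(toy$continent))

# 2. Implicitly missing: location/date combinations that have no row at all
implicit <- toy %>%
  expand(location, date) %>%
  anti_join(toy, by = c("location", "date"))

# 3. Empty groups: declared factor levels with zero rows, kept by .drop = FALSE
group_counts <- toy %>%
  count(continent, .drop = FALSE)

print(n_explicit)
print(implicit)
print(group_counts)
```

In this toy example "Africa" is listed with a count of 0 only because it was declared as a factor level; with a plain character column it would not appear.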
We define an outlier as any value above the 99th percentile of total_cases_per_million.
cutoff <- quantile(covid_data$total_cases_per_million, 0.99, na.rm = TRUE)
covid_data %>%
  filter(total_cases_per_million > cutoff) %>%
  select(location, date, total_cases_per_million) %>%
  arrange(desc(total_cases_per_million)) %>%
  head(10)
## # A tibble: 10 × 3
## location date total_cases_per_million
## <chr> <date> <dbl>
## 1 Slovenia 2021-12-26 215977.
## 2 Slovenia 2021-12-27 215977.
## 3 Slovenia 2021-12-28 215977.
## 4 Slovenia 2021-12-29 215977.
## 5 Slovenia 2021-12-30 215977.
## 6 Slovenia 2021-12-31 215977.
## 7 Slovenia 2021-12-19 212384.
## 8 Slovenia 2021-12-20 212384.
## 9 Slovenia 2021-12-21 212384.
## 10 Slovenia 2021-12-22 212384.
Explanation:
Values above the 99th percentile are extreme relative to the overall
distribution. These may reflect reporting spikes, small populations, or
true outbreak extremes. Identifying them is important because they can
heavily influence averages and regression results.
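The effect on averages can be illustrated with the same 99th-percentile rule applied to a small vector. This is a minimal sketch in base R; the vector `x` is hypothetical and is not drawn from `covid_data`.

```r
# Hypothetical toy values: 99 ordinary observations plus one extreme value
x <- c(1:99, 10000)

# Same rule as in the text: anything above the 99th percentile is an outlier
cutoff     <- quantile(x, 0.99, na.rm = TRUE)
is_outlier <- x > cutoff

# Compare the mean with and without the flagged values, and the median
mean_all     <- mean(x)
mean_trimmed <- mean(x[!is_outlier])
median_all   <- median(x)

print(c(mean_all = mean_all, mean_trimmed = mean_trimmed, median_all = median_all))
```

In this toy case a single flagged value moves the mean from 50 to 149.5, while the median stays at 50.5. That is one reason to report a median or a trimmed mean alongside the plain average when extreme values are present.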
Documentation gaps, modeled variables, missing categories, and extreme values introduce analytical risk. Careful interpretation and transparency are essential.