#Weekly Assignment #3
The functions pivot_longer() and pivot_wider() from the tidyr package are used to reshape data tables so they follow tidy data principles: every column represents a variable, every row represents an observation, and every cell contains a single value. The function pivot_longer() is used when a table is too wide, meaning that key observations are spread across multiple columns. It restructures the table by stacking the original column names into one new column and placing their associated values into another, appropriately named column. This produces a long format where each observation is represented by its own row, ensuring that only variables appear as column headers. In contrast, pivot_wider() is used when a table is too long, meaning that variables are spread across rows instead of columns. It reorganizes the data by spreading variable names and their corresponding values into new columns, reducing the number of rows and creating a wider table.
The pipe operator (%>% in the tidyverse or |> in base R) makes code easier to read, write, and understand. Instead of nesting functions inside each other, the pipe sends the result of one function directly into the next, creating a step-by-step flow that reads from left to right like a sentence. This makes data wrangling workflows clearer and less cluttered, since each step is written in order. Pipes also make it easier to debug or adjust code, because you can inspect the results at each stage without rewriting everything. Overall, using pipes reduces complexity, improves clarity, and supports a clean, modular approach to data analysis.
We will start with publicly available data about COIVD-19 diagnosis and outcomes in Ontario. This data is maintained by Ontario Public Health and can be directly downloaded from their webpage. I have converted it into an R object and saved it. Use load() to bring the object into the work space. I have also loaded in the libraries I will use using the library() function.
# Load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(formattable)
library(cowplot)
##
## Attaching package: 'cowplot'
##
## The following object is masked from 'package:lubridate':
##
## stamp
library(corrplot)
## corrplot 0.95 loaded
library(scales)
##
## Attaching package: 'scales'
##
## The following objects are masked from 'package:formattable':
##
## comma, percent, scientific
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
# Load an R object
load("covid.Rdata")
# Checking the data and what is in the table
#covid
head(covid)
## Row_ID Accurate_Episode_Date Case_Reported_Date Test_Reported_Date
## 1 1 2019-05-30 2020-05-05 2020-05-05
## 2 2 2019-11-20 2020-10-21 2020-11-21
## 3 3 2019-12-18 2021-12-20 2021-12-20
## 4 4 2020-01-01 2020-04-24 2020-04-24
## 5 5 2020-01-01 2020-05-17 2020-05-17
## 6 6 2020-01-01 2021-05-26 2021-03-31
## Specimen_Date Age_Group Client_Gender Case_AcquisitionInfo Outcome1
## 1 2020-05-03 50s FEMALE CC Resolved
## 2 2019-11-20 20s FEMALE NO KNOWN EPI LINK Resolved
## 3 2019-12-18 30s MALE MISSING INFORMATION Resolved
## 4 2020-04-23 80s MALE NO KNOWN EPI LINK Resolved
## 5 2020-05-15 50s MALE CC Resolved
## 6 2021-03-28 UNKNOWN MALE TRAVEL Resolved
## Outbreak_Related Reporting_PHU_ID Reporting_PHU
## 1 2260 Simcoe Muskoka District Health Unit
## 2 4913 Southwestern Public Health
## 3 2270 York Region Public Health Services
## 4 2234 Haldimand-Norfolk Health Unit
## 5 2265 Region of Waterloo, Public Health
## 6 2263 Timiskaming Health Unit
## Reporting_PHU_Address Reporting_PHU_City Reporting_PHU_Postal_Code
## 1 15 Sperling Drive Barrie L4M 6K9
## 2 1230 Talbot Street St. Thomas N5P 1G9
## 3 17250 Yonge Street Newmarket L3Y 6Z1
## 4 12 Gilbertson Drive Simcoe N3Y 4N5
## 5 99 Regina Street South Waterloo N2J 4V3
## 6 247 Whitewood Avenue, Unit 43 New Liskeard P0J 1P0
## Reporting_PHU_Website Reporting_PHU_Latitude
## 1 www.simcoemuskokahealth.org 44.41071
## 2 www.swpublichealth.ca 42.77780
## 3 www.york.ca/wps/portal/yorkhome/health/ 44.04802
## 4 www.hnhu.org 42.84783
## 5 www.regionofwaterloo.ca 43.46288
## 6 www.timiskaminghu.com 47.50928
## Reporting_PHU_Longitude
## 1 -79.68631
## 2 -81.15116
## 3 -79.48024
## 4 -80.30381
## 5 -80.52091
## 6 -79.68163
This data contains several dates related to infection and detection, demographic information related to the ages, genders and locations of the individuals.
Age_Group, Client_Gender, Reporting_PHU_City and others are
categories. The dates are not categories they are dates, which are
numeric ordinal data as the order is important. We are not considering
time in this analysis.
Thinking of the processing steps we need to: 1. Ensure the data is tidy
2. Select the columns of data needed 3. Clean by removing missing data
or unknowns/incomplete data 4. Make transformations of the data through
pivoting and mutating.
I will use these functions and this code to assess the quality of the dataset and prepare it for analysis. The summary() function provides an overview of the dataset, helping identify potential issues such as missing or unusual values. To check for missing data, I use is.na() on specific columns and colSums(is.na()) across the dataset to count how many missing values exist in each column. If missing values were present, I could apply the filter() function to remove those incomplete entries. If the dataset does not contain missing values, I will directly use select() to retain only the relevant variables—Age_Group, Client_Gender, and Outcome1—to keep the analysis focused. Finally, I will apply filter() again to remove unresolved outcomes, unknown age groups, or unspecified genders, ensuring the dataset remains consistent and reliable. To confirm these steps worked correctly, I use head() to preview the cleaned dataset.
summary(covid)
## Row_ID Accurate_Episode_Date Case_Reported_Date Test_Reported_Date
## Min. : 1 Length:969437 Length:969437 Length:969437
## 1st Qu.:242360 Class :character Class :character Class :character
## Median :484719 Mode :character Mode :character Mode :character
## Mean :484719
## 3rd Qu.:727078
## Max. :969437
## Specimen_Date Age_Group Client_Gender Case_AcquisitionInfo
## Length:969437 Length:969437 Length:969437 Length:969437
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Outcome1 Outbreak_Related Reporting_PHU_ID Reporting_PHU
## Length:969437 Length:969437 Min. :2226 Length:969437
## Class :character Class :character 1st Qu.:2247 Class :character
## Mode :character Mode :character Median :2260 Mode :character
## Mean :2726
## 3rd Qu.:3895
## Max. :5183
## Reporting_PHU_Address Reporting_PHU_City Reporting_PHU_Postal_Code
## Length:969437 Length:969437 Length:969437
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Reporting_PHU_Website Reporting_PHU_Latitude Reporting_PHU_Longitude
## Length:969437 Min. :42.31 Min. :-94.49
## Class :character 1st Qu.:43.46 1st Qu.:-79.74
## Mode :character Median :43.66 Median :-79.48
## Mean :43.82 Mean :-79.59
## 3rd Qu.:43.97 3rd Qu.:-79.38
## Max. :49.77 Max. :-74.74
# Check if NA values exist in a column from the Covid dataset
summary(is.na(covid$Age_Group))
## Mode FALSE
## logical 969437
# Check if NA values exist in any column from the Covid dataset
head(colSums(is.na(covid)))
## Row_ID Accurate_Episode_Date Case_Reported_Date
## 0 0 0
## Test_Reported_Date Specimen_Date Age_Group
## 0 0 0
# There appears to be no NA values in the dataset
# Clean and filter the data
covid_clean<-covid%>%
select(Age_Group,Client_Gender,Outcome1)%>%
filter(Outcome1!="Not Resolved",
Client_Gender %in% c("MALE","FEMALE"),
Age_Group!="UNKNOWN")
#covid_clean
head(covid_clean)
## Age_Group Client_Gender Outcome1
## 1 50s FEMALE Resolved
## 2 20s FEMALE Resolved
## 3 30s MALE Resolved
## 4 80s MALE Resolved
## 5 50s MALE Resolved
## 6 50s MALE Resolved
The results show that there were no missing (NA) values in the dataset, which means the data is complete and ready to be worked with. After applying the select() and filter() steps, the dataset has been cleaned to include only valid observations, leaving out unresolved outcomes, unknown age groups, and unspecified genders. This ensures that only meaningful and accurate data remain in covid_clean, making it a more consistent and reliable dataset. By narrowing down the variables and removing irrelevant or incomplete information, the dataset is now properly prepared for further analysis, such as calculating mortality percentages and comparing trends between male and female patients across different age groups.
It was noted in global data that elderly people were more susceptible to COVID19 fatalities that younger. Our goal is to test this relationship in the Ontario data. To accomplish this we need a table that can be used in a Chi sqr test. We required a table with age categories as the rows and columns of male and female cases in the categories of resolved or deceased.
To explore the relationship between age, sex, and COVID-19 outcomes, I will create a contingency table that organizes the data in a way that can later be used for a chi-square test. My plan is to group the dataset by Age_Group, Client_Gender, and Outcome1, which will allow me to count the number of resolved and fatal cases within each age and sex category. Starting with the covid_clean data and creating a pipe chain (%>%), I will use group_by() to structure the data by these variables, followed by count() to tally the number of cases for each subgroup. Since the resulting table will initially be in a long format, I will apply pivot_wider() to reshape it into a wider format, creating separate columns for female and male outcomes (resolved and fatal). This table will make it easier to compare mortality outcomes across sex and age bins, setting the foundation for statistical testing.
outcome_table<-covid_clean%>%
group_by(Age_Group,Client_Gender,Outcome1)%>%
count(Client_Gender,Outcome1)%>%
pivot_wider(names_from = c(Client_Gender,Outcome1),
values_from=n)
#covid_clean
outcome_table
## # A tibble: 9 × 5
## # Groups: Age_Group [9]
## Age_Group FEMALE_Fatal FEMALE_Resolved MALE_Fatal MALE_Resolved
## <chr> <int> <int> <int> <int>
## 1 20s 11 96313 23 95386
## 2 30s 25 75426 54 73226
## 3 40s 61 67147 127 59894
## 4 50s 201 59910 375 56772
## 5 60s 438 33637 862 35472
## 6 70s 887 15853 1379 15641
## 7 80s 1761 10630 1784 6926
## 8 90+ 1722 5735 947 2102
## 9 <20 5 78205 4 82328
From this code, the resulting outcome_table displays age groups as rows and four distinct columns for female fatal, female resolved, male fatal, and male resolved cases. This structure clearly organizes COVID-19 outcomes by sex within each age category and provides a straightforward way to compare recovery versus fatality rates between males and females across different ages. By restructuring the dataset in this way, I created a clean and interpretable contingency table that can now be used as the foundation for a chi-square test. A quick inspection of the table reveals several striking trends. For instance, in the 20s age group, there are only 11 female fatalities and 23 male fatalities, compared to 96,313 female and 95,386 male resolved cases, highlighting that deaths in this cohort are rare. By contrast, in the 80s age group, the numbers shift dramatically, with 1,761 female fatalities and 1,784 male fatalities versus only 10,630 female and 6,926 male resolved cases, showing that deaths represent a substantial proportion of outcomes at older ages. Similarly, in the 90+ group, fatalities (1,722 female, 947 male) are even higher and resolved cases are lower (5,735 female, 2,102 male). These comparisons underscore two broad observations: fatalities are heavily concentrated in older age groups, while younger categories overwhelmingly show resolved cases. In addition, males tend to exhibit higher fatality numbers than females in most of the older categories—for example, in the 70s group, there are 1,379 male fatalities compared to 887 female fatalities, despite similar numbers of resolved cases (15,641 male vs. 15,853 female). While these raw counts strongly suggest both age-related and sex-related trends, it remains unclear whether the observed differences are statistically significant or could be attributed to random variation or sampling biases. This is precisely what the next stage of statistical analysis, including chi-square testing and visualization, will aim to clarify.
To test whether there is a dependency between age and COVID-19 outcomes (resolved versus deceased), I will use the contingency table created in Question 1 (outcome_table). My plan is to restructure the data so that age groups are the rows (and I remove the column title) and the four categories of outcome by sex (female fatal, female resolved, male fatal, male resolved) are the columns. I will then apply the chisq.test() function, which compares the observed counts with the expected counts that would occur if there were no dependency between variables. If a dependency exists, the observed and expected counts will differ significantly, and this will be reflected in the test’s p-value. Finally, to explore patterns in more detail, I will use the corrplot() function on the standardized residuals. This visualization highlights which specific age–sex–outcome combinations contribute most strongly to the chi-square result, allowing us to identify where dependencies may exist.
# Prepare the contingency table for chi-square
overall <- outcome_table %>%
select(Age_Group, FEMALE_Fatal, FEMALE_Resolved, MALE_Fatal, MALE_Resolved) %>%
# Set Age_Group as row names
column_to_rownames("Age_Group")
# Perform chi-square test
chiCovid <- chisq.test(overall)
# Inspect results
chiCovid$observed # Observed counts
## FEMALE_Fatal FEMALE_Resolved MALE_Fatal MALE_Resolved
## 20s 11 96313 23 95386
## 30s 25 75426 54 73226
## 40s 61 67147 127 59894
## 50s 201 59910 375 56772
## 60s 438 33637 862 35472
## 70s 887 15853 1379 15641
## 80s 1761 10630 1784 6926
## 90+ 1722 5735 947 2102
## <20 5 78205 4 82328
chiCovid$expected # Expected counts if independent
## FEMALE_Fatal FEMALE_Resolved MALE_Fatal MALE_Resolved
## 20s 1111.97303 96349.820 1208.57175 93062.635
## 30s 862.57901 74740.421 937.51250 72190.488
## 40s 737.87620 63935.218 801.97658 61753.929
## 50s 680.04847 58924.584 739.12527 56914.243
## 60s 408.34342 35381.987 443.81681 34174.853
## 70s 195.79420 16965.102 212.80313 16386.301
## 80s 122.37718 10603.691 133.00826 10241.923
## 90+ 60.93051 5279.483 66.22363 5099.362
## <20 931.07798 80675.694 1011.96208 77923.266
chiCovid$residuals # Standardized residuals
## FEMALE_Fatal FEMALE_Resolved MALE_Fatal MALE_Resolved
## 20s -33.016388 -0.1186198 -34.10292 7.6160487
## 30s -28.518478 2.5077254 -28.85521 3.8540310
## 40s -24.918247 12.7021116 -23.83460 -7.4845211
## 50s -18.370016 4.0594905 -13.39343 -0.5962377
## 60s 1.467602 -9.2768606 19.85019 7.0167478
## 70s 49.397783 -8.5382019 79.94345 -5.8222520
## 80s 148.125145 0.2554879 143.15470 -32.7652669
## 90+ 212.799406 6.2691463 108.23287 -41.9740924
## <20 -30.349709 -8.6985640 -31.68561 15.7792404
# P-value is significant < 2.2e-16
chiCovid
##
## Pearson's Chi-squared test
##
## data: overall
## X-squared = 119944, df = 24, p-value < 2.2e-16
# Visualize residuals with corrplot
corrplot(chiCovid$residuals, is.corr = FALSE, cl.ratio = 1)
From this code, the object overall reorganizes the outcome table so it can be used as valid input for the chi-square test. The chi-square analysis (chiCovid) produced observed, expected, and residual values, which confirm whether outcome differences across age categories are statistically significant. The corrplot of the residuals provides an interpretable visualization: cells with strong positive or negative residuals indicate where the largest deviations between observed and expected counts occur. This allows us to see, for example, that older age groups (70+) are more strongly associated with higher fatality counts, while younger groups are associated with higher resolved case counts. These results confirm that there is indeed a dependency between age and COVID-19 outcomes, and the visualization suggests that this dependency holds for both sexes, though with some variation in the relative magnitude of male versus female fatalities in older age categories.
To better understand differences in COVID-19 outcomes between males and females across age groups, I will extend the outcome table by adding new variables that calculate percent mortality for each sex and the difference in mortality rates between them. This will be done using the mutate() function, which allows me to create additional columns within the existing table. Specifically, I will calculate female mortality as the number of female fatalities divided by the total female cases (fatal + resolved), and similarly for male mortality. Multiplying by 100 converts these proportions into percentages for easier interpretation. Finally, I will add a column for the difference in mortality rates by subtracting male mortality from female mortality. To keep the results concise and readable, I will round the values to three significant figures using the signif() function. This process will allow me to quantitatively compare mortality risks between sexes across different age categories.
# Add percent mortality and difference between sexes
outcome_table_percent <- outcome_table %>%
# Calculate percent mortality for each sex
mutate(FEMALE_Mortality = signif(FEMALE_Fatal / (FEMALE_Fatal + FEMALE_Resolved) * 100,3),
MALE_Mortality = signif(MALE_Fatal / (MALE_Fatal + MALE_Resolved) * 100,3),
Mortality_Diff = signif(FEMALE_Mortality - MALE_Mortality,3))
# View the updated table
outcome_table_percent
## # A tibble: 9 × 8
## # Groups: Age_Group [9]
## Age_Group FEMALE_Fatal FEMALE_Resolved MALE_Fatal MALE_Resolved
## <chr> <int> <int> <int> <int>
## 1 20s 11 96313 23 95386
## 2 30s 25 75426 54 73226
## 3 40s 61 67147 127 59894
## 4 50s 201 59910 375 56772
## 5 60s 438 33637 862 35472
## 6 70s 887 15853 1379 15641
## 7 80s 1761 10630 1784 6926
## 8 90+ 1722 5735 947 2102
## 9 <20 5 78205 4 82328
## # ℹ 3 more variables: FEMALE_Mortality <dbl>, MALE_Mortality <dbl>,
## # Mortality_Diff <dbl>
From this code, the output table outcome_table_percent now displays each age group alongside female and male mortality percentages and the mortality difference between them. The results clearly show that mortality rates increase with age for both sexes, with the most dramatic rises occurring in the 70s, 80s, and 90+ categories. Younger age groups (<20s, 20s, 30s) have near-zero mortality, while older groups show much higher percentages, aligning with global observations that elderly individuals are most at risk. In every age category above 30, male mortality is consistently higher than female mortality, as indicated by the negative values in the mortality difference column. These findings provide strong preliminary evidence that both age and sex play important roles in determining COVID-19 outcomes, setting the stage for further statistical testing and visualization.
I will enhance the readability and interpretability of the table generated in the previous step by applying conditional formatting using the formattable() function. First, I will use the color_tile() function to create gradient background fills, where lighter shades represent lower percentages and darker shades indicate higher percentages. Female mortality rates will be displayed on a pink-to-dark-red gradient, male mortality rates on a light-blue-to-dark-blue gradient, and the mortality difference on a dark-green-to-light-green gradient. For the mortality difference column, utilizing my observations of the trends, I will use a reversed gradient so that large negative values (indicating higher male mortality) appear darker, while values closer to zero appear lighter. This approach ensures that even though most values are negative, the magnitude of the difference is clearly visible, highlighting the most significant disparities. Next, I will explore the color_bar() function, which overlays horizontal color bars within each cell, where the bar length corresponds to the magnitude of the value. Together, these formatting techniques will provide intuitive, visually distinct cues, making it easier to detect trends and differences in mortality percentages across age groups and between sexes.
# Exploring colour_tile()
formattable(outcome_table_percent, list(
FEMALE_Mortality = color_tile("pink", "darkred"),
MALE_Mortality = color_tile("lightblue", "darkblue"),
Mortality_Diff = color_tile("darkgreen", "lightgreen")))
Age_Group | FEMALE_Fatal | FEMALE_Resolved | MALE_Fatal | MALE_Resolved | FEMALE_Mortality | MALE_Mortality | Mortality_Diff |
---|---|---|---|---|---|---|---|
20s | 11 | 96313 | 23 | 95386 | 0.01140 | 0.02410 | -0.01270 |
30s | 25 | 75426 | 54 | 73226 | 0.03310 | 0.07370 | -0.04060 |
40s | 61 | 67147 | 127 | 59894 | 0.09080 | 0.21200 | -0.12100 |
50s | 201 | 59910 | 375 | 56772 | 0.33400 | 0.65600 | -0.32200 |
60s | 438 | 33637 | 862 | 35472 | 1.29000 | 2.37000 | -1.08000 |
70s | 887 | 15853 | 1379 | 15641 | 5.30000 | 8.10000 | -2.80000 |
80s | 1761 | 10630 | 1784 | 6926 | 14.20000 | 20.50000 | -6.30000 |
90+ | 1722 | 5735 | 947 | 2102 | 23.10000 | 31.10000 | -8.00000 |
<20 | 5 | 78205 | 4 | 82328 | 0.00639 | 0.00486 | 0.00153 |
# Exploring colour_bar()
formattable(outcome_table_percent, list(
FEMALE_Mortality = color_bar("red"),
MALE_Mortality = color_bar("blue"),
Mortality_Diff = color_bar("green")))
Age_Group | FEMALE_Fatal | FEMALE_Resolved | MALE_Fatal | MALE_Resolved | FEMALE_Mortality | MALE_Mortality | Mortality_Diff |
---|---|---|---|---|---|---|---|
20s | 11 | 96313 | 23 | 95386 | 0.01140 | 0.02410 | -0.01270 |
30s | 25 | 75426 | 54 | 73226 | 0.03310 | 0.07370 | -0.04060 |
40s | 61 | 67147 | 127 | 59894 | 0.09080 | 0.21200 | -0.12100 |
50s | 201 | 59910 | 375 | 56772 | 0.33400 | 0.65600 | -0.32200 |
60s | 438 | 33637 | 862 | 35472 | 1.29000 | 2.37000 | -1.08000 |
70s | 887 | 15853 | 1379 | 15641 | 5.30000 | 8.10000 | -2.80000 |
80s | 1761 | 10630 | 1784 | 6926 | 14.20000 | 20.50000 | -6.30000 |
90+ | 1722 | 5735 | 947 | 2102 | 23.10000 | 31.10000 | -8.00000 |
<20 | 5 | 78205 | 4 | 82328 | 0.00639 | 0.00486 | 0.00153 |
After running the code, the formatted table displayed mortality percentages with clear gradient-based visual cues that highlighted key patterns. Using color_tile(), shading intensity increased with higher mortality rates, making it straightforward to observe the age-related progression of risk and the consistently higher male mortality in older age groups. The color_bar() function further emphasized these differences by overlaying proportional horizontal bars, allowing quick visual comparisons across sexes and age groups. A limitation arose with the Mortality_Diff column: because values could be either positive (higher female mortality) or negative (higher male mortality), the gradients and bars could not fully reflect the magnitude of the differences. For instance, with the current code, large positive values would appear with lighter shading, despite also representing substantial disparities. To address this in future analyses, I could customize the color scales—so that values near zero have lighter colors while large-magnitude differences, whether positive or negative, appear darker—or transform the differences using absolute values while preserving sign indicators. Despite this limitation, the green gradients still effectively highlighted the widening sex-based disparity in older populations. Overall, applying formattable() with both color_tile() and color_bar() improved the interpretability and visual accessibility of the mortality data.