Coding activity

Obtain the data

We will start with publicly available data about COIVD-19 diagnosis and outcomes in Ontario. This data is maintained by Ontario Public Health and can be directly downloaded from their webpage. I have converted it into an R object and saved it. Use load() to bring the object into the work space. I have also loaded in the libraries I will use using the library() function.

# Load libraries 
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(formattable)
library(cowplot)

## 
## Attaching package: 'cowplot'
## 
## The following object is masked from 'package:lubridate':
## 
##     stamp

library(corrplot)

## corrplot 0.95 loaded

library(scales)

## 
## Attaching package: 'scales'
## 
## The following objects are masked from 'package:formattable':
## 
##     comma, percent, scientific
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

# Load an R object
load("covid.Rdata")


# Checking the data and what is in the table
#covid
head(covid)

##   Row_ID Accurate_Episode_Date Case_Reported_Date Test_Reported_Date
## 1      1            2019-05-30         2020-05-05         2020-05-05
## 2      2            2019-11-20         2020-10-21         2020-11-21
## 3      3            2019-12-18         2021-12-20         2021-12-20
## 4      4            2020-01-01         2020-04-24         2020-04-24
## 5      5            2020-01-01         2020-05-17         2020-05-17
## 6      6            2020-01-01         2021-05-26         2021-03-31
##   Specimen_Date Age_Group Client_Gender Case_AcquisitionInfo Outcome1
## 1    2020-05-03       50s        FEMALE                   CC Resolved
## 2    2019-11-20       20s        FEMALE    NO KNOWN EPI LINK Resolved
## 3    2019-12-18       30s          MALE  MISSING INFORMATION Resolved
## 4    2020-04-23       80s          MALE    NO KNOWN EPI LINK Resolved
## 5    2020-05-15       50s          MALE                   CC Resolved
## 6    2021-03-28   UNKNOWN          MALE               TRAVEL Resolved
##   Outbreak_Related Reporting_PHU_ID                       Reporting_PHU
## 1                              2260 Simcoe Muskoka District Health Unit
## 2                              4913          Southwestern Public Health
## 3                              2270  York Region Public Health Services
## 4                              2234       Haldimand-Norfolk Health Unit
## 5                              2265   Region of Waterloo, Public Health
## 6                              2263             Timiskaming Health Unit
##           Reporting_PHU_Address Reporting_PHU_City Reporting_PHU_Postal_Code
## 1             15 Sperling Drive             Barrie                   L4M 6K9
## 2            1230 Talbot Street         St. Thomas                   N5P 1G9
## 3            17250 Yonge Street          Newmarket                   L3Y 6Z1
## 4           12 Gilbertson Drive             Simcoe                   N3Y 4N5
## 5        99 Regina Street South           Waterloo                   N2J 4V3
## 6 247 Whitewood Avenue, Unit 43       New Liskeard                   P0J 1P0
##                     Reporting_PHU_Website Reporting_PHU_Latitude
## 1             www.simcoemuskokahealth.org               44.41071
## 2                   www.swpublichealth.ca               42.77780
## 3 www.york.ca/wps/portal/yorkhome/health/               44.04802
## 4                            www.hnhu.org               42.84783
## 5                 www.regionofwaterloo.ca               43.46288
## 6                   www.timiskaminghu.com               47.50928
##   Reporting_PHU_Longitude
## 1               -79.68631
## 2               -81.15116
## 3               -79.48024
## 4               -80.30381
## 5               -80.52091
## 6               -79.68163

This data contains several dates related to infection and detection, demographic information related to the ages, genders and locations of the individuals.

Age_Group, Client_Gender, Reporting_PHU_City and others are categories. The dates are not categories they are dates, which are numeric ordinal data as the order is important. We are not considering time in this analysis.
Thinking of the processing steps we need to: 1. Ensure the data is tidy 2. Select the columns of data needed 3. Clean by removing missing data or unknowns/incomplete data 4. Make transformations of the data through pivoting and mutating.

I will use these functions and this code to assess the quality of the dataset and prepare it for analysis. The summary() function provides an overview of the dataset, helping identify potential issues such as missing or unusual values. To check for missing data, I use is.na() on specific columns and colSums(is.na()) across the dataset to count how many missing values exist in each column. If missing values were present, I could apply the filter() function to remove those incomplete entries. If the dataset does not contain missing values, I will directly use select() to retain only the relevant variables—Age_Group, Client_Gender, and Outcome1—to keep the analysis focused. Finally, I will apply filter() again to remove unresolved outcomes, unknown age groups, or unspecified genders, ensuring the dataset remains consistent and reliable. To confirm these steps worked correctly, I use head() to preview the cleaned dataset.

summary(covid)

##      Row_ID       Accurate_Episode_Date Case_Reported_Date Test_Reported_Date
##  Min.   :     1   Length:969437         Length:969437      Length:969437     
##  1st Qu.:242360   Class :character      Class :character   Class :character  
##  Median :484719   Mode  :character      Mode  :character   Mode  :character  
##  Mean   :484719                                                              
##  3rd Qu.:727078                                                              
##  Max.   :969437                                                              
##  Specimen_Date       Age_Group         Client_Gender      Case_AcquisitionInfo
##  Length:969437      Length:969437      Length:969437      Length:969437       
##  Class :character   Class :character   Class :character   Class :character    
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character    
##                                                                               
##                                                                               
##                                                                               
##    Outcome1         Outbreak_Related   Reporting_PHU_ID Reporting_PHU     
##  Length:969437      Length:969437      Min.   :2226     Length:969437     
##  Class :character   Class :character   1st Qu.:2247     Class :character  
##  Mode  :character   Mode  :character   Median :2260     Mode  :character  
##                                        Mean   :2726                       
##                                        3rd Qu.:3895                       
##                                        Max.   :5183                       
##  Reporting_PHU_Address Reporting_PHU_City Reporting_PHU_Postal_Code
##  Length:969437         Length:969437      Length:969437            
##  Class :character      Class :character   Class :character         
##  Mode  :character      Mode  :character   Mode  :character         
##                                                                    
##                                                                    
##                                                                    
##  Reporting_PHU_Website Reporting_PHU_Latitude Reporting_PHU_Longitude
##  Length:969437         Min.   :42.31          Min.   :-94.49         
##  Class :character      1st Qu.:43.46          1st Qu.:-79.74         
##  Mode  :character      Median :43.66          Median :-79.48         
##                        Mean   :43.82          Mean   :-79.59         
##                        3rd Qu.:43.97          3rd Qu.:-79.38         
##                        Max.   :49.77          Max.   :-74.74

# Check if NA values exist in a column from the Covid dataset
  summary(is.na(covid$Age_Group))

##    Mode   FALSE 
## logical  969437

# Check if NA values exist in any column from the Covid dataset
  head(colSums(is.na(covid)))

##                Row_ID Accurate_Episode_Date    Case_Reported_Date 
##                     0                     0                     0 
##    Test_Reported_Date         Specimen_Date             Age_Group 
##                     0                     0                     0

# There appears to be no NA values in the dataset 
  
# Clean and filter the data

covid_clean<-covid%>%
  select(Age_Group,Client_Gender,Outcome1)%>%
    filter(Outcome1!="Not Resolved",
           Client_Gender %in% c("MALE","FEMALE"),
           Age_Group!="UNKNOWN")
  
#covid_clean
head(covid_clean)

##   Age_Group Client_Gender Outcome1
## 1       50s        FEMALE Resolved
## 2       20s        FEMALE Resolved
## 3       30s          MALE Resolved
## 4       80s          MALE Resolved
## 5       50s          MALE Resolved
## 6       50s          MALE Resolved

The results show that there were no missing (NA) values in the dataset, which means the data is complete and ready to be worked with. After applying the select() and filter() steps, the dataset has been cleaned to include only valid observations, leaving out unresolved outcomes, unknown age groups, and unspecified genders. This ensures that only meaningful and accurate data remain in covid_clean, making it a more consistent and reliable dataset. By narrowing down the variables and removing irrelevant or incomplete information, the dataset is now properly prepared for further analysis, such as calculating mortality percentages and comparing trends between male and female patients across different age groups.

Whats the relationship between age, sex and COVID19 infection outcomes

It was noted in global data that elderly people were more susceptible to COVID19 fatalities that younger. Our goal is to test this relationship in the Ontario data. To accomplish this we need a table that can be used in a Chi sqr test. We required a table with age categories as the rows and columns of male and female cases in the categories of resolved or deceased.

Create a table of outcomes (resolved and deceased) for COVID19 patients by sex for age categories (age on as rows and four columns female/male resolved/deceased). We want to identify if within an age bin what is the affect of sex on the mortality rates.

To explore the relationship between age, sex, and COVID-19 outcomes, I will create a contingency table that organizes the data in a way that can later be used for a chi-square test. My plan is to group the dataset by Age_Group, Client_Gender, and Outcome1, which will allow me to count the number of resolved and fatal cases within each age and sex category. Starting with the covid_clean data and creating a pipe chain (%>%), I will use group_by() to structure the data by these variables, followed by count() to tally the number of cases for each subgroup. Since the resulting table will initially be in a long format, I will apply pivot_wider() to reshape it into a wider format, creating separate columns for female and male outcomes (resolved and fatal). This table will make it easier to compare mortality outcomes across sex and age bins, setting the foundation for statistical testing.

outcome_table<-covid_clean%>%
    group_by(Age_Group,Client_Gender,Outcome1)%>%
  count(Client_Gender,Outcome1)%>%
  pivot_wider(names_from = c(Client_Gender,Outcome1),
              values_from=n)
#covid_clean
outcome_table

## # A tibble: 9 × 5
## # Groups:   Age_Group [9]
##   Age_Group FEMALE_Fatal FEMALE_Resolved MALE_Fatal MALE_Resolved
##   <chr>            <int>           <int>      <int>         <int>
## 1 20s                 11           96313         23         95386
## 2 30s                 25           75426         54         73226
## 3 40s                 61           67147        127         59894
## 4 50s                201           59910        375         56772
## 5 60s                438           33637        862         35472
## 6 70s                887           15853       1379         15641
## 7 80s               1761           10630       1784          6926
## 8 90+               1722            5735        947          2102
## 9 <20                  5           78205          4         82328

From this code, the resulting outcome_table displays age groups as rows and four distinct columns for female fatal, female resolved, male fatal, and male resolved cases. This structure clearly organizes COVID-19 outcomes by sex within each age category and provides a straightforward way to compare recovery versus fatality rates between males and females across different ages. By restructuring the dataset in this way, I created a clean and interpretable contingency table that can now be used as the foundation for a chi-square test. A quick inspection of the table reveals several striking trends. For instance, in the 20s age group, there are only 11 female fatalities and 23 male fatalities, compared to 96,313 female and 95,386 male resolved cases, highlighting that deaths in this cohort are rare. By contrast, in the 80s age group, the numbers shift dramatically, with 1,761 female fatalities and 1,784 male fatalities versus only 10,630 female and 6,926 male resolved cases, showing that deaths represent a substantial proportion of outcomes at older ages. Similarly, in the 90+ group, fatalities (1,722 female, 947 male) are even higher and resolved cases are lower (5,735 female, 2,102 male). These comparisons underscore two broad observations: fatalities are heavily concentrated in older age groups, while younger categories overwhelmingly show resolved cases. In addition, males tend to exhibit higher fatality numbers than females in most of the older categories—for example, in the 70s group, there are 1,379 male fatalities compared to 887 female fatalities, despite similar numbers of resolved cases (15,641 male vs. 15,853 female). While these raw counts strongly suggest both age-related and sex-related trends, it remains unclear whether the observed differences are statistically significant or could be attributed to random variation or sampling biases. This is precisely what the next stage of statistical analysis, including chi-square testing and visualization, will aim to clarify.

Use this as a contingency table to calculate a chi square analysis determining the dependency of age on patients recovered versus deceased. Is there a dependency, does this seem different for males and females? Explore it with a corrplot to visualize where the dependency may be.

To test whether there is a dependency between age and COVID-19 outcomes (resolved versus deceased), I will use the contingency table created in Question 1 (outcome_table). My plan is to restructure the data so that age groups are the rows (and I remove the column title) and the four categories of outcome by sex (female fatal, female resolved, male fatal, male resolved) are the columns. I will then apply the chisq.test() function, which compares the observed counts with the expected counts that would occur if there were no dependency between variables. If a dependency exists, the observed and expected counts will differ significantly, and this will be reflected in the test’s p-value. Finally, to explore patterns in more detail, I will use the corrplot() function on the standardized residuals. This visualization highlights which specific age–sex–outcome combinations contribute most strongly to the chi-square result, allowing us to identify where dependencies may exist.

# Prepare the contingency table for chi-square
overall <- outcome_table %>%
  select(Age_Group, FEMALE_Fatal, FEMALE_Resolved, MALE_Fatal, MALE_Resolved) %>%
   # Set Age_Group as row names
  column_to_rownames("Age_Group") 

# Perform chi-square test
chiCovid <- chisq.test(overall)

# Inspect results
chiCovid$observed    # Observed counts

##     FEMALE_Fatal FEMALE_Resolved MALE_Fatal MALE_Resolved
## 20s           11           96313         23         95386
## 30s           25           75426         54         73226
## 40s           61           67147        127         59894
## 50s          201           59910        375         56772
## 60s          438           33637        862         35472
## 70s          887           15853       1379         15641
## 80s         1761           10630       1784          6926
## 90+         1722            5735        947          2102
## <20            5           78205          4         82328

chiCovid$expected    # Expected counts if independent

##     FEMALE_Fatal FEMALE_Resolved MALE_Fatal MALE_Resolved
## 20s   1111.97303       96349.820 1208.57175     93062.635
## 30s    862.57901       74740.421  937.51250     72190.488
## 40s    737.87620       63935.218  801.97658     61753.929
## 50s    680.04847       58924.584  739.12527     56914.243
## 60s    408.34342       35381.987  443.81681     34174.853
## 70s    195.79420       16965.102  212.80313     16386.301
## 80s    122.37718       10603.691  133.00826     10241.923
## 90+     60.93051        5279.483   66.22363      5099.362
## <20    931.07798       80675.694 1011.96208     77923.266

chiCovid$residuals   # Standardized residuals

##     FEMALE_Fatal FEMALE_Resolved MALE_Fatal MALE_Resolved
## 20s   -33.016388      -0.1186198  -34.10292     7.6160487
## 30s   -28.518478       2.5077254  -28.85521     3.8540310
## 40s   -24.918247      12.7021116  -23.83460    -7.4845211
## 50s   -18.370016       4.0594905  -13.39343    -0.5962377
## 60s     1.467602      -9.2768606   19.85019     7.0167478
## 70s    49.397783      -8.5382019   79.94345    -5.8222520
## 80s   148.125145       0.2554879  143.15470   -32.7652669
## 90+   212.799406       6.2691463  108.23287   -41.9740924
## <20   -30.349709      -8.6985640  -31.68561    15.7792404

# P-value is significant < 2.2e-16
chiCovid

## 
##  Pearson's Chi-squared test
## 
## data:  overall
## X-squared = 119944, df = 24, p-value < 2.2e-16

# Visualize residuals with corrplot
corrplot(chiCovid$residuals, is.corr = FALSE, cl.ratio = 1)

From this code, the object overall reorganizes the outcome table so it can be used as valid input for the chi-square test. The chi-square analysis (chiCovid) produced observed, expected, and residual values, which confirm whether outcome differences across age categories are statistically significant. The corrplot of the residuals provides an interpretable visualization: cells with strong positive or negative residuals indicate where the largest deviations between observed and expected counts occur. This allows us to see, for example, that older age groups (70+) are more strongly associated with higher fatality counts, while younger groups are associated with higher resolved case counts. These results confirm that there is indeed a dependency between age and COVID-19 outcomes, and the visualization suggests that this dependency holds for both sexes, though with some variation in the relative magnitude of male versus female fatalities in older age categories.

Returning to the wrangled table, add new column variables to calculate the percent mortality for females and males in each age category and the difference in mortality rates between the females and male.

To better understand differences in COVID-19 outcomes between males and females across age groups, I will extend the outcome table by adding new variables that calculate percent mortality for each sex and the difference in mortality rates between them. This will be done using the mutate() function, which allows me to create additional columns within the existing table. Specifically, I will calculate female mortality as the number of female fatalities divided by the total female cases (fatal + resolved), and similarly for male mortality. Multiplying by 100 converts these proportions into percentages for easier interpretation. Finally, I will add a column for the difference in mortality rates by subtracting male mortality from female mortality. To keep the results concise and readable, I will round the values to three significant figures using the signif() function. This process will allow me to quantitatively compare mortality risks between sexes across different age categories.

# Add percent mortality and difference between sexes
outcome_table_percent <- outcome_table %>%
  # Calculate percent mortality for each sex
  mutate(FEMALE_Mortality = signif(FEMALE_Fatal / (FEMALE_Fatal + FEMALE_Resolved) * 100,3),
    MALE_Mortality   = signif(MALE_Fatal   / (MALE_Fatal   + MALE_Resolved) * 100,3),
    Mortality_Diff   = signif(FEMALE_Mortality - MALE_Mortality,3)) 

# View the updated table
outcome_table_percent

## # A tibble: 9 × 8
## # Groups:   Age_Group [9]
##   Age_Group FEMALE_Fatal FEMALE_Resolved MALE_Fatal MALE_Resolved
##   <chr>            <int>           <int>      <int>         <int>
## 1 20s                 11           96313         23         95386
## 2 30s                 25           75426         54         73226
## 3 40s                 61           67147        127         59894
## 4 50s                201           59910        375         56772
## 5 60s                438           33637        862         35472
## 6 70s                887           15853       1379         15641
## 7 80s               1761           10630       1784          6926
## 8 90+               1722            5735        947          2102
## 9 <20                  5           78205          4         82328
## # ℹ 3 more variables: FEMALE_Mortality <dbl>, MALE_Mortality <dbl>,
## #   Mortality_Diff <dbl>

From this code, the output table outcome_table_percent now displays each age group alongside female and male mortality percentages and the mortality difference between them. The results clearly show that mortality rates increase with age for both sexes, with the most dramatic rises occurring in the 70s, 80s, and 90+ categories. Younger age groups (<20s, 20s, 30s) have near-zero mortality, while older groups show much higher percentages, aligning with global observations that elderly individuals are most at risk. In every age category above 30, male mortality is consistently higher than female mortality, as indicated by the negative values in the mortality difference column. These findings provide strong preliminary evidence that both age and sex play important roles in determining COVID-19 outcomes, setting the stage for further statistical testing and visualization.

Using formattable() function to add colour to the table generated in 3. Explore the colour_bar() and colour_tile() functions such that the percentages are colour coded on a gradient colour scale. I did not show you this try to learn it from online documents.

I will enhance the readability and interpretability of the table generated in the previous step by applying conditional formatting using the formattable() function. First, I will use the color_tile() function to create gradient background fills, where lighter shades represent lower percentages and darker shades indicate higher percentages. Female mortality rates will be displayed on a pink-to-dark-red gradient, male mortality rates on a light-blue-to-dark-blue gradient, and the mortality difference on a dark-green-to-light-green gradient. For the mortality difference column, utilizing my observations of the trends, I will use a reversed gradient so that large negative values (indicating higher male mortality) appear darker, while values closer to zero appear lighter. This approach ensures that even though most values are negative, the magnitude of the difference is clearly visible, highlighting the most significant disparities. Next, I will explore the color_bar() function, which overlays horizontal color bars within each cell, where the bar length corresponds to the magnitude of the value. Together, these formatting techniques will provide intuitive, visually distinct cues, making it easier to detect trends and differences in mortality percentages across age groups and between sexes.

# Exploring colour_tile()
formattable(outcome_table_percent, list(
  FEMALE_Mortality = color_tile("pink", "darkred"),
  MALE_Mortality   = color_tile("lightblue", "darkblue"),
  Mortality_Diff   = color_tile("darkgreen", "lightgreen")))

Age_Group	FEMALE_Fatal	FEMALE_Resolved	MALE_Fatal	MALE_Resolved	FEMALE_Mortality	MALE_Mortality	Mortality_Diff
20s	11	96313	23	95386	0.01140	0.02410	-0.01270
30s	25	75426	54	73226	0.03310	0.07370	-0.04060
40s	61	67147	127	59894	0.09080	0.21200	-0.12100
50s	201	59910	375	56772	0.33400	0.65600	-0.32200
60s	438	33637	862	35472	1.29000	2.37000	-1.08000
70s	887	15853	1379	15641	5.30000	8.10000	-2.80000
80s	1761	10630	1784	6926	14.20000	20.50000	-6.30000
90+	1722	5735	947	2102	23.10000	31.10000	-8.00000
<20	5	78205	4	82328	0.00639	0.00486	0.00153

# Exploring colour_bar()
formattable(outcome_table_percent, list(
  FEMALE_Mortality = color_bar("red"),
  MALE_Mortality   = color_bar("blue"),
  Mortality_Diff   = color_bar("green")))

Age_Group	FEMALE_Fatal	FEMALE_Resolved	MALE_Fatal	MALE_Resolved	FEMALE_Mortality	MALE_Mortality	Mortality_Diff
20s	11	96313	23	95386	0.01140	0.02410	-0.01270
30s	25	75426	54	73226	0.03310	0.07370	-0.04060
40s	61	67147	127	59894	0.09080	0.21200	-0.12100
50s	201	59910	375	56772	0.33400	0.65600	-0.32200
60s	438	33637	862	35472	1.29000	2.37000	-1.08000
70s	887	15853	1379	15641	5.30000	8.10000	-2.80000
80s	1761	10630	1784	6926	14.20000	20.50000	-6.30000
90+	1722	5735	947	2102	23.10000	31.10000	-8.00000
<20	5	78205	4	82328	0.00639	0.00486	0.00153

After running the code, the formatted table displayed mortality percentages with clear gradient-based visual cues that highlighted key patterns. Using color_tile(), shading intensity increased with higher mortality rates, making it straightforward to observe the age-related progression of risk and the consistently higher male mortality in older age groups. The color_bar() function further emphasized these differences by overlaying proportional horizontal bars, allowing quick visual comparisons across sexes and age groups. A limitation arose with the Mortality_Diff column: because values could be either positive (higher female mortality) or negative (higher male mortality), the gradients and bars could not fully reflect the magnitude of the differences. For instance, with the current code, large positive values would appear with lighter shading, despite also representing substantial disparities. To address this in future analyses, I could customize the color scales—so that values near zero have lighter colors while large-magnitude differences, whether positive or negative, appear darker—or transform the differences using absolute values while preserving sign indicators. Despite this limitation, the green gradients still effectively highlighted the widening sex-based disparity in older populations. Overall, applying formattable() with both color_tile() and color_bar() improved the interpretability and visual accessibility of the mortality data.

Assignment 3

Michael Sava

2025-10-03

Due: Fri, Oct 3, 2025

Descriptive questions with minimal coding

Coding activity

Obtain the data

Whats the relationship between age, sex and COVID19 infection outcomes