Case study: COVID-19-Positive Cases Mexico

Introduction

The objective of the present project is to perform an exploratory analysis of COVIS-19 positive cases in the Mexican population, considering those patients who are smokers, diabetics, with pneumonia and who required intubation. An analysis by sex and age in the same conditions is also performed.

Data set info

The information contained corresponds only to the data obtained from the epidemiological study of a suspected case of viral respiratory disease at the time it is identified in the medical units of the Health Sector.

According to the clinical diagnosis at admission, it is considered as an outpatient or hospitalized patient. The base does not include the evolution during their stay in the medical units, with the exception of updates upon discharge by the hospital epidemiological surveillance units or health jurisdictions in the case of deaths.

Source

This dataset was taking from Secretary of Public Health of the Mexican State and you can explore it HERE. In addition, you can browse the data dictionary HERE.

Note : However, the data set used for this case corresponds to a random sample of 1,000,000 samples from the original database.

Objectives:

Data Exploration
Data Manipulation
Exploration Data Analysis

Outline

Data Exploration.
Data Manipulation.
Exploration Data Analysis
Conclusions

Data Exploration

## Libraries 
libraries <- c("ggplot2", "dplyr", "readr", "magrittr", "tidyverse", "stringr", "plotly")
sapply(libraries, require, character.only = TRUE)

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 4.1.3

## Loading required package: dplyr

## Warning: package 'dplyr' was built under R version 4.1.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Loading required package: readr

## Warning: package 'readr' was built under R version 4.1.3

## Loading required package: magrittr

## Warning: package 'magrittr' was built under R version 4.1.3

## Loading required package: tidyverse

## Warning: package 'tidyverse' was built under R version 4.1.3

## Error: package or namespace load failed for 'tidyverse' in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]):
##  namespace 'purrr' 0.3.4 is already loaded, but >= 1.0.1 is required

## Loading required package: stringr

## Warning: package 'stringr' was built under R version 4.1.3

## Loading required package: plotly

## Warning: package 'plotly' was built under R version 4.1.3

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

##   ggplot2     dplyr     readr  magrittr tidyverse   stringr    plotly 
##      TRUE      TRUE      TRUE      TRUE     FALSE      TRUE      TRUE

setwd("E:/COVID19_Case_Study")

## Load data

data <- read.csv("COVID19MEXICO.csv", quote="\"", skipNul= TRUE)

## Random sample

data_covid <- sample_n(data, size = 1000000, replace = T)

## Data visualization

head(data_covid)

##   FECHA_ACTUALIZACION ID_REGISTRO ORIGEN SECTOR ENTIDAD_UM SEXO ENTIDAD_NAC
## 1          2022-10-04      9f58c2      2      4         20    1          20
## 2          2022-10-04      9c54cf      2      9         24    2          15
## 3          2022-10-04      4ec3e7      2     12         21    1          21
## 4          2022-10-04      b1f057      2      4         19    1          19
## 5          2022-10-04      ddc853      2      9          9    1           9
## 6          2022-10-04      4a6c25      2      9          9    1           9
##   ENTIDAD_RES MUNICIPIO_RES TIPO_PACIENTE FECHA_INGRESO FECHA_SINTOMAS
## 1          20            67             1    2022-01-04     2022-01-02
## 2          11            20             1    2022-01-06     2022-01-05
## 3          21            63             1    2022-02-10     2022-02-09
## 4          19            46             1    2022-01-15     2022-01-02
## 5           9             5             1    2022-01-26     2022-01-26
## 6           9            14             1    2022-02-19     2022-02-17
##    FECHA_DEF INTUBADO NEUMONIA EDAD NACIONALIDAD EMBARAZO HABLA_LENGUA_INDIG
## 1 9999-99-99       97        2   33            1        2                  2
## 2 9999-99-99       97        2   54            1       97                  2
## 3 9999-99-99       97        2   34            1        2                  2
## 4 9999-99-99       97        2   38            1        2                  2
## 5 9999-99-99       97        2   27            1        1                 99
## 6 9999-99-99       97        2   39            1        2                 99
##   INDIGENA DIABETES EPOC ASMA INMUSUPR HIPERTENSION OTRA_COM CARDIOVASCULAR
## 1        2        2    2    2        2            2        2              2
## 2        2        2    2    2        2            2        2              2
## 3        2        2    2    2        2            2        2              2
## 4        2        2    2    2        2            2        2              1
## 5       99        2    2    2        2            2        2              2
## 6       99        2    2    2        2            2        2              2
##   OBESIDAD RENAL_CRONICA TABAQUISMO OTRO_CASO TOMA_MUESTRA_LAB RESULTADO_LAB
## 1        2             2          2         2                2            97
## 2        2             2          2         2                2            97
## 3        1             2          2         1                2            97
## 4        2             2          2         2                2            97
## 5        2             2          2         1                2            97
## 6        2             2          2         2                2            97
##   TOMA_MUESTRA_ANTIGENO RESULTADO_ANTIGENO CLASIFICACION_FINAL MIGRANTE
## 1                     1                  2                   7       99
## 2                     1                  1                   3       99
## 3                     1                  1                   3       99
## 4                     1                  1                   3       99
## 5                     1                  2                   7       99
## 6                     1                  2                   7       99
##   PAIS_NACIONALIDAD PAIS_ORIGEN UCI
## 1           MÃ©xico          97  97
## 2           MÃ©xico          97  97
## 3           MÃ©xico          97  97
## 4           MÃ©xico          97  97
## 5           MÃ©xico          97  97
## 6           MÃ©xico          97  97

## Data shape

nrows <- nrow(data_covid)
ncols <- ncol(data_covid)
sprintf("The data set has %s samples and %s features", nrows, ncols)

## [1] "The data set has 1000000 samples and 40 features"

## Data info

str(data_covid)

## 'data.frame':    1000000 obs. of  40 variables:
##  $ FECHA_ACTUALIZACION  : chr  "2022-10-04" "2022-10-04" "2022-10-04" "2022-10-04" ...
##  $ ID_REGISTRO          : chr  "9f58c2" "9c54cf" "4ec3e7" "b1f057" ...
##  $ ORIGEN               : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ SECTOR               : int  4 9 12 4 9 9 4 12 12 12 ...
##  $ ENTIDAD_UM           : int  20 24 21 19 9 9 5 9 24 9 ...
##  $ SEXO                 : int  1 2 1 1 1 1 2 2 1 1 ...
##  $ ENTIDAD_NAC          : int  20 15 21 19 9 9 5 9 24 9 ...
##  $ ENTIDAD_RES          : int  20 11 21 19 9 9 5 9 24 9 ...
##  $ MUNICIPIO_RES        : int  67 20 63 46 5 14 18 3 28 15 ...
##  $ TIPO_PACIENTE        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ FECHA_INGRESO        : chr  "2022-01-04" "2022-01-06" "2022-02-10" "2022-01-15" ...
##  $ FECHA_SINTOMAS       : chr  "2022-01-02" "2022-01-05" "2022-02-09" "2022-01-02" ...
##  $ FECHA_DEF            : chr  "9999-99-99" "9999-99-99" "9999-99-99" "9999-99-99" ...
##  $ INTUBADO             : int  97 97 97 97 97 97 97 97 97 97 ...
##  $ NEUMONIA             : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ EDAD                 : int  33 54 34 38 27 39 54 63 27 2 ...
##  $ NACIONALIDAD         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ EMBARAZO             : int  2 97 2 2 1 2 97 97 2 2 ...
##  $ HABLA_LENGUA_INDIG   : int  2 2 2 2 99 99 2 2 99 2 ...
##  $ INDIGENA             : int  2 2 2 2 99 99 2 2 99 2 ...
##  $ DIABETES             : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ EPOC                 : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ ASMA                 : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ INMUSUPR             : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ HIPERTENSION         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ OTRA_COM             : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ CARDIOVASCULAR       : int  2 2 2 1 2 2 2 2 2 2 ...
##  $ OBESIDAD             : int  2 2 1 2 2 2 2 2 2 2 ...
##  $ RENAL_CRONICA        : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ TABAQUISMO           : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ OTRO_CASO            : int  2 2 1 2 1 2 2 2 1 2 ...
##  $ TOMA_MUESTRA_LAB     : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ RESULTADO_LAB        : int  97 97 97 97 97 97 97 97 97 97 ...
##  $ TOMA_MUESTRA_ANTIGENO: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ RESULTADO_ANTIGENO   : int  2 1 1 1 2 2 1 2 1 2 ...
##  $ CLASIFICACION_FINAL  : int  7 3 3 3 7 7 3 7 3 7 ...
##  $ MIGRANTE             : int  99 99 99 99 99 99 99 99 99 99 ...
##  $ PAIS_NACIONALIDAD    : chr  "MÃ©xico" "MÃ©xico" "MÃ©xico" "MÃ©xico" ...
##  $ PAIS_ORIGEN          : chr  "97" "97" "97" "97" ...
##  $ UCI                  : int  97 97 97 97 97 97 97 97 97 97 ...

## Null values

apply(X = is.na(data_covid), MARGIN = 2, FUN = sum)

##   FECHA_ACTUALIZACION           ID_REGISTRO                ORIGEN 
##                     0                     0                     0 
##                SECTOR            ENTIDAD_UM                  SEXO 
##                     0                     0                     0 
##           ENTIDAD_NAC           ENTIDAD_RES         MUNICIPIO_RES 
##                     0                     0                     0 
##         TIPO_PACIENTE         FECHA_INGRESO        FECHA_SINTOMAS 
##                     0                     0                     0 
##             FECHA_DEF              INTUBADO              NEUMONIA 
##                     0                     0                     0 
##                  EDAD          NACIONALIDAD              EMBARAZO 
##                     0                     0                     0 
##    HABLA_LENGUA_INDIG              INDIGENA              DIABETES 
##                     0                     0                     0 
##                  EPOC                  ASMA              INMUSUPR 
##                     1                     1                     1 
##          HIPERTENSION              OTRA_COM        CARDIOVASCULAR 
##                     1                     1                     1 
##              OBESIDAD         RENAL_CRONICA            TABAQUISMO 
##                     1                     1                     1 
##             OTRO_CASO      TOMA_MUESTRA_LAB         RESULTADO_LAB 
##                     1                     1                     1 
## TOMA_MUESTRA_ANTIGENO    RESULTADO_ANTIGENO   CLASIFICACION_FINAL 
##                     1                     1                     1 
##              MIGRANTE     PAIS_NACIONALIDAD           PAIS_ORIGEN 
##                     1                     0                     0 
##                   UCI 
##                     1

## Counting unique values 

sort(sapply(data_covid, function(x) length(unique(x))))

##   FECHA_ACTUALIZACION                ORIGEN                  SEXO 
##                     1                     2                     2 
##         TIPO_PACIENTE              NEUMONIA          NACIONALIDAD 
##                     2                     2                     2 
##    HABLA_LENGUA_INDIG              INDIGENA              DIABETES 
##                     3                     3                     3 
##      TOMA_MUESTRA_LAB TOMA_MUESTRA_ANTIGENO              INTUBADO 
##                     3                     3                     4 
##              EMBARAZO                  EPOC                  ASMA 
##                     4                     4                     4 
##              INMUSUPR          HIPERTENSION              OTRA_COM 
##                     4                     4                     4 
##        CARDIOVASCULAR              OBESIDAD         RENAL_CRONICA 
##                     4                     4                     4 
##            TABAQUISMO             OTRO_CASO    RESULTADO_ANTIGENO 
##                     4                     4                     4 
##              MIGRANTE         RESULTADO_LAB                   UCI 
##                     4                     5                     5 
##   CLASIFICACION_FINAL                SECTOR            ENTIDAD_UM 
##                     8                    14                    32 
##           ENTIDAD_RES           ENTIDAD_NAC           PAIS_ORIGEN 
##                    32                    33                    53 
##     PAIS_NACIONALIDAD             FECHA_DEF                  EDAD 
##                    94                   105                   121 
##        FECHA_SINTOMAS         FECHA_INGRESO         MUNICIPIO_RES 
##                   153                   170                   415 
##           ID_REGISTRO 
##                818214

Data Manipulation

## Filter only case positives 

positive_cases <- data_covid %>% 
  filter(CLASIFICACION_FINAL == 1 | CLASIFICACION_FINAL == 2 | CLASIFICACION_FINAL == 3)

## New data shape

nrows <- nrow(positive_cases)
ncols <- ncol(positive_cases)

sprintf("The sample data set has %s feautures and %s samples", ncols, nrows)

## [1] "The sample data set has 40 feautures and 602748 samples"

Inference: About 40% of the data were reduced.

## Rename column
    
positive_cases <- positive_cases %>% rename(POSITIVOS = CLASIFICACION_FINAL)

## Transformation of positive case values 

positive_cases <- positive_cases %>% 
  transform(POSITIVOS = ifelse(POSITIVOS != 1, 1, 1))

## Visualization DATE column

head(positive_cases$FECHA_DEF, 20)

##  [1] "9999-99-99" "9999-99-99" "9999-99-99" "9999-99-99" "9999-99-99"
##  [6] "9999-99-99" "9999-99-99" "9999-99-99" "9999-99-99" "9999-99-99"
## [11] "9999-99-99" "9999-99-99" "2022-02-09" "9999-99-99" "9999-99-99"
## [16] "9999-99-99" "9999-99-99" "9999-99-99" "9999-99-99" "9999-99-99"

Inference: The patients who didn’t die, has value “9999-99-99”, while the patitients that did die, has values 1.

## Transforming date value

positive_cases <- positive_cases %>% 
  mutate(DEFUNCION = ifelse((FECHA_DEF=="9999-99-99"),0, 1))

## Obtain the name of the States 

positive_cases <- positive_cases %>% 
  transform(ENTIDAD_RES = case_when(
  ENTIDAD_RES == 1 ~ "Aguascalientes", 
  ENTIDAD_RES == 2 ~ "Baja California", 
  ENTIDAD_RES == 3 ~ "Baja California Sur",
  ENTIDAD_RES == 4 ~ "Campeche",
  ENTIDAD_RES == 5 ~ "Coahuila",
  ENTIDAD_RES == 6 ~ "Colima",
  ENTIDAD_RES == 7 ~ "Chiapas", 
  ENTIDAD_RES == 8 ~ "Chihuahua",
  ENTIDAD_RES == 9 ~ "CDMX", 
  ENTIDAD_RES == 10 ~ "Durango", 
  ENTIDAD_RES == 11 ~ "Guanajuato",
  ENTIDAD_RES == 12 ~ "Guerrero",
  ENTIDAD_RES == 13 ~ "Hidalgo",
  ENTIDAD_RES == 14 ~ "Jalisco", 
  ENTIDAD_RES == 15 ~ "México",
  ENTIDAD_RES == 16 ~ "Michoacán",
  ENTIDAD_RES == 17 ~ "Morelos",
  ENTIDAD_RES == 18 ~ "Nayarit",
  ENTIDAD_RES == 19 ~ "Nuevo León",
  ENTIDAD_RES == 20 ~ "Oaxaca",
  ENTIDAD_RES == 21 ~ "Puebla",
  ENTIDAD_RES == 22 ~ "Querétaro",
  ENTIDAD_RES == 23 ~ "Quintana Roo",
  ENTIDAD_RES == 24 ~ "San Luis Potosí",
  ENTIDAD_RES == 25 ~ "Sinaloa",
  ENTIDAD_RES == 26 ~ "Sonora",
  ENTIDAD_RES == 27 ~ "Tabasco",
  ENTIDAD_RES == 28 ~ "Tamaulipas",
  ENTIDAD_RES == 29 ~ "Tlaxcala",
  ENTIDAD_RES == 30 ~ "Veracruz",
  ENTIDAD_RES == 31 ~ "Yucatán",
  ENTIDAD_RES == 32 ~ "Zacatecas",
  T ~ "Sin especificar"    
))

## Transforming the SEX column

positive_cases <- positive_cases %>% 
  transform(SEXO = case_when(
  SEXO == 1 ~ "Mujer",
  SEXO == 2 ~ "Hombre",
  T ~ "Sin especificar"
))

## Select only the features of interest.

positive_cases <- positive_cases %>% 
  select(POSITIVOS,
         SEXO,
         ENTIDAD_RES,
         INTUBADO,
         NEUMONIA,
         EDAD,
         DIABETES,
         TABAQUISMO,
         PAIS_NACIONALIDAD,
         DEFUNCION)

sample(positive_cases$PAIS_NACIONALIDAD, 40)

##  [1] "MÃ©xico" "MÃ©xico" "MÃ©xico" "MÃ©xico" "MÃ©xico" "MÃ©xico" "MÃ©xico"
##  [8] "MÃ©xico" "MÃ©xico" "MÃ©xico" "MÃ©xico" "MÃ©xico" "MÃ©xico" "MÃ©xico"
## [15] "MÃ©xico" "MÃ©xico" "MÃ©xico" "MÃ©xico" "MÃ©xico" "MÃ©xico" "MÃ©xico"
## [22] "MÃ©xico" "MÃ©xico" "MÃ©xico" "MÃ©xico" "MÃ©xico" "MÃ©xico" "MÃ©xico"
## [29] "MÃ©xico" "MÃ©xico" "MÃ©xico" "MÃ©xico" "MÃ©xico" "MÃ©xico" "MÃ©xico"
## [36] "MÃ©xico" "MÃ©xico" "MÃ©xico" "MÃ©xico" "MÃ©xico"

## Correction of errors in the nationality column

positive_cases$PAIS_NACIONALIDAD <- str_replace(positive_cases$PAIS_NACIONALIDAD, "MÃ©xico","México")

## Filter only mexican nationality

positive_cases_mexico <- positive_cases %>% 
  filter(positive_cases$PAIS_NACIONALIDAD == "México")

positive_cases_mexico <- positive_cases_mexico %>% 
  filter(positive_cases_mexico$SEXO == "Hombre" | positive_cases_mexico$SEXO == "Mujer", 
         positive_cases_mexico$INTUBADO == 1 | positive_cases_mexico$INTUBADO == 2, 
         positive_cases_mexico$NEUMONIA == 1 | positive_cases_mexico$NEUMONIA ==2, 
         positive_cases_mexico$DIABETES == 1 | positive_cases_mexico$DIABETES ==2, 
         positive_cases_mexico$TABAQUISMO == 1 | positive_cases_mexico$TABAQUISMO ==2)

Exploration Data Analysis (EDA)

## Positive cases by state (TOP 10)

positive_cases_by_state <- positive_cases_mexico %>% 
  group_by(ENTIDAD_RES) %>% 
  summarize(conteo = n()) %>% 
  arrange(desc(conteo)) %>% 
  top_n(10)

## Selecting by conteo

positive_cases_by_state

## # A tibble: 10 x 2
##    ENTIDAD_RES     conteo
##    <chr>            <int>
##  1 México            2646
##  2 CDMX              2544
##  3 Jalisco           1631
##  4 Nuevo León        1403
##  5 Guanajuato        1207
##  6 Veracruz           880
##  7 Coahuila           825
##  8 Chihuahua          806
##  9 Tamaulipas         717
## 10 Baja California    695

Inference: Mexico State has more positive cases.

## States with highest COVID-19 infection rates

width <- c(0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8)

plot_ly(positive_cases_by_state) %>% 
  add_bars(x = ~conteo, 
           y = ~ENTIDAD_RES, 
           width = ~width, 
           orientation = 'h') %>%
  layout(title = "States with highest COVID-19 infection rates",
         xaxis = list(title = "Count"),
         yaxis = list(title = ""))

## Positive cases by gender

positive_cases_by_gender <- positive_cases_mexico %>% 
  group_by(ENTIDAD_RES, SEXO) %>% 
  summarize(conteo = n()) %>% 
  arrange(desc(conteo))

## `summarise()` has grouped output by 'ENTIDAD_RES'. You can override using the
## `.groups` argument.

head(positive_cases_by_gender,10)

## # A tibble: 10 x 3
## # Groups:   ENTIDAD_RES [5]
##    ENTIDAD_RES SEXO   conteo
##    <chr>       <chr>   <int>
##  1 México      Hombre   1350
##  2 CDMX        Hombre   1304
##  3 México      Mujer    1296
##  4 CDMX        Mujer    1240
##  5 Jalisco     Hombre    908
##  6 Nuevo León  Hombre    742
##  7 Jalisco     Mujer     723
##  8 Nuevo León  Mujer     661
##  9 Guanajuato  Hombre    648
## 10 Guanajuato  Mujer     559

Inference:State of Mexico and CDMX have more positive cases by sex.

## Graph of positive cases by sex

plot_ly(positive_cases_by_gender) %>% 
  add_bars(x = ~ENTIDAD_RES, 
           y = ~conteo, 
           color = ~SEXO) %>%
  layout(title = "Number of positive cases by women and men in each state",
         xaxis = list(title = ""),
         yaxis = list(title = "Count"))

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Graph of positive cases by sex

positive_cases_mexico %>% 
  count(SEXO) %>%
  plot_ly( x = ~SEXO, 
           y=~n, 
           type = 'bar', 
           text = ~n, 
           textposition = 'auto',
           marker = list(color = 'rgb(158,202,225)',
                         line = list(color = 'rgb(8,48,107)',
                                     width = 1.5))) %>%
  layout(title = "Positive cases by fameles and males",
         xaxis = list(title = ""),
         yaxis = list(title = "Count"))

Inference : There are more positive cases in men than in women.

## Deaths due to COVID

positive_cases_mexico %>%  
  filter(DEFUNCION == 1) %>% 
  count(SEXO) %>%
  plot_ly( x = ~SEXO, 
           y=~n, 
           type = 'bar', 
           text = ~n, 
           textposition = 'auto',
           marker = list(color = 'rgb(158,202,225)',
                         line = list(color = 'rgb(8,48,107)',
                                     width = 1.5))) %>%
  layout(title = "Number of deaths of women and men",
         xaxis = list(title = ""),
         yaxis = list(title = "Count"))

## Graph of Deaths due to COVID

ggplot(positive_cases_mexico, aes(x= EDAD)) + 
  geom_density(color = "#15CFB5",
             fill = 4,
             alpha = 0.25) +
  labs(title = "Graph of density of COVID19 positives by age.", 
       x = "Age", 
       y="Count")

## Average age of COVID positive cases

average_age <- round(mean(positive_cases_mexico$EDAD),0)
paste("The average age of COVID-19 positives is: ", average_age, "years ago")

## [1] "The average age of COVID-19 positives is:  57 years ago"

## Average age of COVID positive cases woman 

positive_cases_by_fameles <- positive_cases_mexico %>% 
  filter(SEXO == 'Mujer')

average_age_women <- round(mean(positive_cases_by_fameles$EDAD), 0)
paste("The average age of COVID-19-positive women is: ", average_age_women, "years ago")

## [1] "The average age of COVID-19-positive women is:  54 years ago"

## Average age of COVID positive cases men 

positive_cases_by_males <- positive_cases_mexico %>% 
  filter(SEXO == 'Hombre')

average_age_men <- round(mean(positive_cases_by_males$EDAD), 0)
paste("The average age of COVID-19-positive men is: ", average_age_men, "years ago.")

## [1] "The average age of COVID-19-positive men is:  59 years ago."

## Graph of Density 

ggplot(positive_cases_mexico, 
       aes(x = EDAD, 
           colour = SEXO, 
           fill = SEXO )) + 
  geom_density() + 
  labs(title = "Graph of density of COVID-19 positive cases by gender.", 
       x = "Age", 
       y="")

## Count cases positives by Age

positive_cases_mexico %>% 
  group_by(EDAD) %>% 
  summarize(conteo = n()) %>%
  arrange(desc(conteo)) %>% 
  top_n(10)

## Selecting by conteo

## # A tibble: 10 x 2
##     EDAD conteo
##    <int>  <int>
##  1    69    435
##  2    65    423
##  3    80    416
##  4    67    406
##  5    73    404
##  6    72    396
##  7    75    396
##  8    60    395
##  9    62    387
## 10    77    381

## Count cases positives by AGE (woman)

positive_cases_by_fameles %>% 
  group_by(EDAD) %>% 
  summarize(conteo = n()) %>% 
  arrange(desc(conteo)) %>% 
  top_n(10)

## Selecting by conteo

## # A tibble: 10 x 2
##     EDAD conteo
##    <int>  <int>
##  1    69    197
##  2    65    196
##  3    80    184
##  4    75    175
##  5    62    174
##  6    77    173
##  7    73    170
##  8    60    169
##  9    67    166
## 10    68    165

## Count cases positives by AGE (men)

positive_cases_by_males %>% 
  group_by(EDAD) %>% 
  summarize(conteo = n()) %>% 
  arrange(desc(conteo)) %>% 
  top_n(10)

## Selecting by conteo

## # A tibble: 10 x 2
##     EDAD conteo
##    <int>  <int>
##  1    70    241
##  2    67    240
##  3    69    238
##  4    73    234
##  5    72    232
##  6    80    232
##  7    65    227
##  8    57    226
##  9    60    226
## 10    66    225

## Boxplot

positive_cases_mexico %>% 
  ggplot(aes(x=SEXO, 
             y=EDAD, 
             fill=SEXO)) +
  geom_boxplot(width = 0.5) + 
  theme(
    legend.position="none", 
    plot.title = element_text(size=11)) +
    ggtitle("Box plot of the ages of males and females positive for COVID-19") +
    xlab("")

## Average age of deceased patients

deceased_patients <- positive_cases_mexico %>% 
  filter(DEFUNCION == 1)

average_age_of_deceased_patients <- round(mean(deceased_patients$EDAD),0)
paste("The average age of patients who died from COVID-19 is: ", 
      average_age_of_deceased_patients, "years ago.")

## [1] "The average age of patients who died from COVID-19 is:  69 years ago."

## Average age at death of women

deceased_women <- deceased_patients %>% 
  filter(SEXO == 'Mujer')

average_age_of_deceased_women <- round(mean(deceased_women$EDAD), 0)
paste("The average age of the women who died from COVID-19 is: ", 
      average_age_of_deceased_women, "years ago.")

## [1] "The average age of the women who died from COVID-19 is:  68 years ago."

## Average age at death of men

decresed_men <- deceased_patients %>% 
  filter(SEXO == 'Hombre')

average_age_of_deceased_men <- round(mean(decresed_men$EDAD), 0)
paste("The average age of the men who died from COVID-19 is: ",
      average_age_of_deceased_men, "years ago.")

## [1] "The average age of the men who died from COVID-19 is:  69 years ago."

## Number of deaths by AGE

positive_cases_mexico %>% 
  filter(DEFUNCION == 1) %>% 
  group_by(EDAD) %>% 
  summarize(conteo = n()) %>% 
  arrange(desc(conteo)) %>% 
  top_n(10)

## Selecting by conteo

## # A tibble: 10 x 2
##     EDAD conteo
##    <int>  <int>
##  1    80    224
##  2    69    212
##  3    82    211
##  4    75    200
##  5    78    200
##  6    70    196
##  7    73    195
##  8    71    193
##  9    77    193
## 10    65    192

Inference : The ages with the highest number of deaths are: 80, 79 and 78.

## Number of deaths by women

positive_cases_by_fameles %>% 
  filter(DEFUNCION == 1) %>% 
  group_by(EDAD) %>% 
  summarize(conteo = n()) %>%
  arrange(desc(conteo)) %>% 
  top_n(10)

## Selecting by conteo

## # A tibble: 10 x 2
##     EDAD conteo
##    <int>  <int>
##  1    65     92
##  2    69     89
##  3    71     82
##  4    75     81
##  5    72     80
##  6    77     80
##  7    80     76
##  8    68     75
##  9    78     74
## 10    82     74

Inference: The ages with the highest number of deaths among women are: 68, 65 and 79.

## Number of deaths by men

positive_cases_by_males %>% 
  filter(DEFUNCION == 1) %>% 
  group_by(EDAD) %>% 
  summarize(conteo = n()) %>% 
  arrange(desc(conteo)) %>% 
  top_n(10)

## Selecting by conteo

## # A tibble: 11 x 2
##     EDAD conteo
##    <int>  <int>
##  1    80    148
##  2    82    137
##  3    70    128
##  4    78    126
##  5    81    125
##  6    69    123
##  7    73    123
##  8    76    120
##  9    75    119
## 10    77    113
## 11    79    113

## Graph of deaths

deaths <- positive_cases_mexico %>% 
  select(EDAD, DEFUNCION, SEXO) %>% 
  filter(DEFUNCION == 1)

ggplot(deaths, 
       aes(x = EDAD, 
           fill= SEXO, 
           colour = SEXO)) +
  geom_histogram(alpha = 0.5, 
                 position = "dodge") + 
  labs(title = "Histogram of women and men deceased by COVID-19.", 
       x = "Age", 
       y = "Count")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Boxplot of ages of deaths 

deaths %>% 
  ggplot( aes(x=SEXO, 
              y=EDAD, 
              fill=SEXO)) +
  geom_violin(width=1) + 
  geom_boxplot() +
  theme(
    legend.position="none", 
    plot.title = element_text(size=12)) +
    ggtitle("Boxplot of ages of deaths by cause COVID-19.") +
    xlab("")

## Number of smokers by gender

positive_cases_mexico %>% 
  count(SEXO, TABAQUISMO) %>% 
  plot_ly(x = ~SEXO, 
          y= ~n, 
          color = ~TABAQUISMO) %>% 
  add_bars() %>% 
  layout(title = "Counts of smokers and nonsmokers by gender", 
         x = "Sex", 
         y = "Count") %>% 
  layout(barmode ="stack")

## Warning: textfont.color doesn't (yet) support data arrays

## Warning: textfont.color doesn't (yet) support data arrays

## Warning: 'layout' objects don't have these attributes: 'x', 'y'
## Valid attributes include:
## '_deprecated', 'activeshape', 'annotations', 'autosize', 'autotypenumbers', 'calendar', 'clickmode', 'coloraxis', 'colorscale', 'colorway', 'computed', 'datarevision', 'dragmode', 'editrevision', 'editType', 'font', 'geo', 'grid', 'height', 'hidesources', 'hoverdistance', 'hoverlabel', 'hovermode', 'images', 'legend', 'mapbox', 'margin', 'meta', 'metasrc', 'modebar', 'newshape', 'paper_bgcolor', 'plot_bgcolor', 'polar', 'scene', 'selectdirection', 'selectionrevision', 'separators', 'shapes', 'showlegend', 'sliders', 'smith', 'spikedistance', 'template', 'ternary', 'title', 'transition', 'uirevision', 'uniformtext', 'updatemenus', 'width', 'xaxis', 'yaxis', 'barmode', 'bargap', 'mapType'

Inference: There are more COVID-19 positive male smokers than female smokers.

## Number of diabetics by gender

positive_cases_mexico %>% 
  count(SEXO, DIABETES) %>% 
  plot_ly(x = ~SEXO, 
          y= ~n, 
          color = ~DIABETES) %>% 
  add_bars() %>%
  layout(title = "Count of men and women with diabetes") %>% 
  layout(barmode ="stack")

## Warning: textfont.color doesn't (yet) support data arrays

## Warning: textfont.color doesn't (yet) support data arrays

Inference There are more COVID-19 positive diabetic men than women.

## Number of intubated patients by sex

positive_cases_mexico %>% 
  count(SEXO, INTUBADO) %>% 
  plot_ly(x = ~SEXO, 
          y= ~n, 
          color = ~INTUBADO) %>% 
  add_bars() %>%
  layout(title = "Number of intubated men and women") %>% 
  layout(barmode ="stack")

## Warning: textfont.color doesn't (yet) support data arrays

## Warning: textfont.color doesn't (yet) support data arrays

Inference: The number of patients intubated due to covid is slightly low.

## Number of deaths of smokers, pneumonia and intubated cases

death_smok_pneumo_intub  <- positive_cases_mexico %>% 
  filter(TABAQUISMO == 1 | NEUMONIA ==1, 
         INTUBADO  == 1,
         SEXO == 'Hombre' | SEXO == 'Mujer', 
         DEFUNCION == 1)

plot_ly(death_smok_pneumo_intub, 
        x = ~SEXO, 
        y = ~TABAQUISMO, 
        z = ~INTUBADO,
        type = 'bar', 
        name = 'Smoking') %>%
  add_trace(y = ~NEUMONIA,
            name = 'Pneumonia') %>% 
  add_trace(z = ~INTUBADO, 
            name = 'Intubated') %>% 
  layout(yaxis = list(title = 'Count'), 
         barmode = 'group')

## Warning: 'bar' objects don't have these attributes: 'z'
## Valid attributes include:
## '_deprecated', 'alignmentgroup', 'base', 'basesrc', 'cliponaxis', 'constraintext', 'customdata', 'customdatasrc', 'dx', 'dy', 'error_x', 'error_y', 'hoverinfo', 'hoverinfosrc', 'hoverlabel', 'hovertemplate', 'hovertemplatesrc', 'hovertext', 'hovertextsrc', 'ids', 'idssrc', 'insidetextanchor', 'insidetextfont', 'legendgroup', 'legendgrouptitle', 'legendrank', 'marker', 'meta', 'metasrc', 'name', 'offset', 'offsetgroup', 'offsetsrc', 'opacity', 'orientation', 'outsidetextfont', 'selected', 'selectedpoints', 'showlegend', 'stream', 'text', 'textangle', 'textfont', 'textposition', 'textpositionsrc', 'textsrc', 'texttemplate', 'texttemplatesrc', 'transforms', 'type', 'uid', 'uirevision', 'unselected', 'visible', 'width', 'widthsrc', 'x', 'x0', 'xaxis', 'xcalendar', 'xhoverformat', 'xperiod', 'xperiod0', 'xperiodalignment', 'xsrc', 'y', 'y0', 'yaxis', 'ycalendar', 'yhoverformat', 'yperiod', 'yperiod0', 'yperiodalignment', 'ysrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'

## Warning: 'bar' objects don't have these attributes: 'z'
## Valid attributes include:
## '_deprecated', 'alignmentgroup', 'base', 'basesrc', 'cliponaxis', 'constraintext', 'customdata', 'customdatasrc', 'dx', 'dy', 'error_x', 'error_y', 'hoverinfo', 'hoverinfosrc', 'hoverlabel', 'hovertemplate', 'hovertemplatesrc', 'hovertext', 'hovertextsrc', 'ids', 'idssrc', 'insidetextanchor', 'insidetextfont', 'legendgroup', 'legendgrouptitle', 'legendrank', 'marker', 'meta', 'metasrc', 'name', 'offset', 'offsetgroup', 'offsetsrc', 'opacity', 'orientation', 'outsidetextfont', 'selected', 'selectedpoints', 'showlegend', 'stream', 'text', 'textangle', 'textfont', 'textposition', 'textpositionsrc', 'textsrc', 'texttemplate', 'texttemplatesrc', 'transforms', 'type', 'uid', 'uirevision', 'unselected', 'visible', 'width', 'widthsrc', 'x', 'x0', 'xaxis', 'xcalendar', 'xhoverformat', 'xperiod', 'xperiod0', 'xperiodalignment', 'xsrc', 'y', 'y0', 'yaxis', 'ycalendar', 'yhoverformat', 'yperiod', 'yperiod0', 'yperiodalignment', 'ysrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'

## Warning: 'bar' objects don't have these attributes: 'z'
## Valid attributes include:
## '_deprecated', 'alignmentgroup', 'base', 'basesrc', 'cliponaxis', 'constraintext', 'customdata', 'customdatasrc', 'dx', 'dy', 'error_x', 'error_y', 'hoverinfo', 'hoverinfosrc', 'hoverlabel', 'hovertemplate', 'hovertemplatesrc', 'hovertext', 'hovertextsrc', 'ids', 'idssrc', 'insidetextanchor', 'insidetextfont', 'legendgroup', 'legendgrouptitle', 'legendrank', 'marker', 'meta', 'metasrc', 'name', 'offset', 'offsetgroup', 'offsetsrc', 'opacity', 'orientation', 'outsidetextfont', 'selected', 'selectedpoints', 'showlegend', 'stream', 'text', 'textangle', 'textfont', 'textposition', 'textpositionsrc', 'textsrc', 'texttemplate', 'texttemplatesrc', 'transforms', 'type', 'uid', 'uirevision', 'unselected', 'visible', 'width', 'widthsrc', 'x', 'x0', 'xaxis', 'xcalendar', 'xhoverformat', 'xperiod', 'xperiod0', 'xperiodalignment', 'xsrc', 'y', 'y0', 'yaxis', 'ycalendar', 'yhoverformat', 'yperiod', 'yperiod0', 'yperiodalignment', 'ysrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'

## Cases of deaths, smokers and intubated patients

death_smok_intub <- positive_cases_mexico %>% 
  filter(TABAQUISMO == 1 | INTUBADO ==1, 
         SEXO == 'Hombre' | SEXO == 'Mujer', 
         DEFUNCION == 1, 
         EDAD <= average_age_of_deceased_patients)

plot_ly(death_smok_intub, 
        x = ~SEXO, 
        y = ~TABAQUISMO, 
        type = 'bar', 
        name = 'Smoking') %>%
  add_trace(y = ~INTUBADO, 
            name = 'Intubated ') %>% 
  layout(yaxis = list(title = 'Count'), 
         barmode = 'group')

## Number of deaths 

deaths_covid_medic_cond  <- positive_cases_mexico %>% 
  select(SEXO, 
         TABAQUISMO, 
         INTUBADO,
         DIABETES,
         DEFUNCION)

deaths_covid_medic_cond <- deaths_covid_medic_cond  %>% 
  transform(TABAQUISMO = case_when(TABAQUISMO == 1 ~ "Sí",
                                   TABAQUISMO == 2 ~ "No", 
                                   T ~ ""), 
            INTUBADO = case_when(INTUBADO == 1 ~ "Sí",
                                 INTUBADO == 2 ~ "No",
                                 T ~ ""), 
            DIABETES = case_when(DIABETES == 1 ~ "Sí",
                                 DIABETES == 2 ~ "No",
                                 T ~ ""))

deaths_covid_medic_cond  <- deaths_covid_medic_cond  %>% 
  filter(SEXO == 'Mujer' | SEXO == 'Hombre',
         TABAQUISMO == "Sí" | TABAQUISMO == "No",
         INTUBADO == "Sí" | INTUBADO == "No",
         DIABETES == "Sí" | DIABETES == "No",
         DEFUNCION == 1)

## Number of deaths with medical conditions 

deaths_covid_medic_cond  <- positive_cases_mexico %>% 
  select(SEXO, 
         TABAQUISMO, 
         INTUBADO,
         DIABETES,
         DEFUNCION)

deaths_covid_medic_cond <- deaths_covid_medic_cond  %>% 
  transform(TABAQUISMO = case_when(TABAQUISMO == 1 ~ "Sí",
                                   TABAQUISMO == 2 ~ "No", 
                                   T ~ ""), 
            INTUBADO = case_when(INTUBADO == 1 ~ "Sí",
                                 INTUBADO == 2 ~ "No",
                                 T ~ ""), 
            DIABETES = case_when(DIABETES == 1 ~ "Sí",
                                 DIABETES == 2 ~ "No",
                                 T ~ ""))

deaths_covid_medic_cond  <- deaths_covid_medic_cond  %>% 
  filter(SEXO == 'Mujer' | SEXO == 'Hombre',
         TABAQUISMO == "Sí" | TABAQUISMO == "No",
         INTUBADO == "Sí" | INTUBADO == "No",
         DIABETES == "Sí" | DIABETES == "No",
         DEFUNCION == 1)

## Smoker table 

gender <- as.factor(deaths_covid_medic_cond$SEXO)
smoker <- as.factor(deaths_covid_medic_cond$TABAQUISMO)

smoker_table <- ftable(xtabs(~gender+smoker, data = deaths_covid_medic_cond))
smoker_table <- cbind(smoker_table, apply(smoker_table, 1, sum)) 
smoker_table <- rbind(smoker_table, apply(smoker_table, 2, sum))
colnames(smoker_table) <- c('Smoker', 'Non-smoker', 'Total')
rownames(smoker_table)[3] <- c('Total')

smoker_table <- as.data.frame(smoker_table)
smoker_table

##        Smoker Non-smoker Total
## Hombre   4087        483  4570
## Mujer    2797        105  2902
## Total    6884        588  7472

## Intubated table 

intubated <- as.factor(deaths_covid_medic_cond$INTUBADO)

intubated_table <- ftable(xtabs(~gender+intubated, data = deaths_covid_medic_cond))
intubated_table <- cbind(intubated_table, apply(intubated_table, 1, sum)) 
intubated_table <- rbind(intubated_table, apply(intubated_table, 2, sum))
colnames(intubated_table) <- c('Intubado_Sí', 'Intubado_No', 'Total')
rownames(intubated_table)[3] <- c('Total')

intubated_table <- as.data.frame(intubated_table)
intubated_table

##        Intubado_Sí Intubado_No Total
## Hombre        3861         709  4570
## Mujer         2468         434  2902
## Total         6329        1143  7472

## Diabetics table

diabetics <- as.factor(deaths_covid_medic_cond$DIABETES)

diabetics_table <- ftable(xtabs(~gender+diabetics, data = deaths_covid_medic_cond))
diabetics_table <- cbind(diabetics_table, apply(diabetics_table, 1, sum)) 
diabetics_table <- rbind(diabetics_table, apply(diabetics_table, 2, sum))
colnames(diabetics_table) <- c('Diabetics: Yes', 'Diabetics: No', 'Total')
rownames(diabetics_table)[3] <- c('Total')

diabetics_table <- as.data.frame(diabetics_table)
diabetics_table

##        Diabetics: Yes Diabetics: No Total
## Hombre           2877          1693  4570
## Mujer            1698          1204  2902
## Total            4575          2897  7472

Conclusions

The ages with the highest number of positive cases are: 59, 65,67 and 80.

The ages with the highest number of positve cases in women are: 65, 59, 67, 60.

The ages with the highest number of positve cases in men are: 80, 66, 73 and 74.

The average age of patients who died for COVID-19 is 69 years ago.

The average age of the women who died from COVID-19 is 68 years ago.

The average age of the women who died from COVID-19 is 69 years ago.

The ages with the highest number of deaths are: 80, 79 and 78.

The ages with the highest number of deaths among women are: 68, 65 and 79.

The ages with the highest number of deaths among men are: 80, 82 and 77.

There are more COVID-19 positive male smokers than female smokers.

There are more COVID-19 positive diabetic men than women.

The number of patients intubated due to covid is slightly low.

Case study: COVID-19-Positive Cases Mexico

Carlos Eduardo Salinas Díaz

Case study: COVID-19-Positive Cases Mexico

Introduction

Data set info

Source

Objectives:

Outline

Data Exploration

Data Manipulation

Exploration Data Analysis (EDA)

Conclusions