The objective of the present project is to perform an exploratory analysis of COVIS-19 positive cases in the Mexican population, considering those patients who are smokers, diabetics, with pneumonia and who required intubation. An analysis by sex and age in the same conditions is also performed.
The information contained corresponds only to the data obtained from the epidemiological study of a suspected case of viral respiratory disease at the time it is identified in the medical units of the Health Sector.
According to the clinical diagnosis at admission, it is considered as an outpatient or hospitalized patient. The base does not include the evolution during their stay in the medical units, with the exception of updates upon discharge by the hospital epidemiological surveillance units or health jurisdictions in the case of deaths.This dataset was taking from Secretary of Public Health of the Mexican State and you can explore it HERE. In addition, you can browse the data dictionary HERE.
Note : However, the data set used for this case corresponds to a random sample of 1,000,000 samples from the original database.
## Libraries
libraries <- c("ggplot2", "dplyr", "readr", "magrittr", "tidyverse", "stringr", "plotly")
sapply(libraries, require, character.only = TRUE)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.1.3
## Loading required package: dplyr
## Warning: package 'dplyr' was built under R version 4.1.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: readr
## Warning: package 'readr' was built under R version 4.1.3
## Loading required package: magrittr
## Warning: package 'magrittr' was built under R version 4.1.3
## Loading required package: tidyverse
## Warning: package 'tidyverse' was built under R version 4.1.3
## Error: package or namespace load failed for 'tidyverse' in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]):
## namespace 'purrr' 0.3.4 is already loaded, but >= 1.0.1 is required
## Loading required package: stringr
## Warning: package 'stringr' was built under R version 4.1.3
## Loading required package: plotly
## Warning: package 'plotly' was built under R version 4.1.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
## ggplot2 dplyr readr magrittr tidyverse stringr plotly
## TRUE TRUE TRUE TRUE FALSE TRUE TRUE
setwd("E:/COVID19_Case_Study")
## Load data
data <- read.csv("COVID19MEXICO.csv", quote="\"", skipNul= TRUE)
## Random sample
data_covid <- sample_n(data, size = 1000000, replace = T)
## Data visualization
head(data_covid)
## FECHA_ACTUALIZACION ID_REGISTRO ORIGEN SECTOR ENTIDAD_UM SEXO ENTIDAD_NAC
## 1 2022-10-04 9f58c2 2 4 20 1 20
## 2 2022-10-04 9c54cf 2 9 24 2 15
## 3 2022-10-04 4ec3e7 2 12 21 1 21
## 4 2022-10-04 b1f057 2 4 19 1 19
## 5 2022-10-04 ddc853 2 9 9 1 9
## 6 2022-10-04 4a6c25 2 9 9 1 9
## ENTIDAD_RES MUNICIPIO_RES TIPO_PACIENTE FECHA_INGRESO FECHA_SINTOMAS
## 1 20 67 1 2022-01-04 2022-01-02
## 2 11 20 1 2022-01-06 2022-01-05
## 3 21 63 1 2022-02-10 2022-02-09
## 4 19 46 1 2022-01-15 2022-01-02
## 5 9 5 1 2022-01-26 2022-01-26
## 6 9 14 1 2022-02-19 2022-02-17
## FECHA_DEF INTUBADO NEUMONIA EDAD NACIONALIDAD EMBARAZO HABLA_LENGUA_INDIG
## 1 9999-99-99 97 2 33 1 2 2
## 2 9999-99-99 97 2 54 1 97 2
## 3 9999-99-99 97 2 34 1 2 2
## 4 9999-99-99 97 2 38 1 2 2
## 5 9999-99-99 97 2 27 1 1 99
## 6 9999-99-99 97 2 39 1 2 99
## INDIGENA DIABETES EPOC ASMA INMUSUPR HIPERTENSION OTRA_COM CARDIOVASCULAR
## 1 2 2 2 2 2 2 2 2
## 2 2 2 2 2 2 2 2 2
## 3 2 2 2 2 2 2 2 2
## 4 2 2 2 2 2 2 2 1
## 5 99 2 2 2 2 2 2 2
## 6 99 2 2 2 2 2 2 2
## OBESIDAD RENAL_CRONICA TABAQUISMO OTRO_CASO TOMA_MUESTRA_LAB RESULTADO_LAB
## 1 2 2 2 2 2 97
## 2 2 2 2 2 2 97
## 3 1 2 2 1 2 97
## 4 2 2 2 2 2 97
## 5 2 2 2 1 2 97
## 6 2 2 2 2 2 97
## TOMA_MUESTRA_ANTIGENO RESULTADO_ANTIGENO CLASIFICACION_FINAL MIGRANTE
## 1 1 2 7 99
## 2 1 1 3 99
## 3 1 1 3 99
## 4 1 1 3 99
## 5 1 2 7 99
## 6 1 2 7 99
## PAIS_NACIONALIDAD PAIS_ORIGEN UCI
## 1 México 97 97
## 2 México 97 97
## 3 México 97 97
## 4 México 97 97
## 5 México 97 97
## 6 México 97 97
## Data shape
nrows <- nrow(data_covid)
ncols <- ncol(data_covid)
sprintf("The data set has %s samples and %s features", nrows, ncols)
## [1] "The data set has 1000000 samples and 40 features"
## Data info
str(data_covid)
## 'data.frame': 1000000 obs. of 40 variables:
## $ FECHA_ACTUALIZACION : chr "2022-10-04" "2022-10-04" "2022-10-04" "2022-10-04" ...
## $ ID_REGISTRO : chr "9f58c2" "9c54cf" "4ec3e7" "b1f057" ...
## $ ORIGEN : int 2 2 2 2 2 2 2 2 2 2 ...
## $ SECTOR : int 4 9 12 4 9 9 4 12 12 12 ...
## $ ENTIDAD_UM : int 20 24 21 19 9 9 5 9 24 9 ...
## $ SEXO : int 1 2 1 1 1 1 2 2 1 1 ...
## $ ENTIDAD_NAC : int 20 15 21 19 9 9 5 9 24 9 ...
## $ ENTIDAD_RES : int 20 11 21 19 9 9 5 9 24 9 ...
## $ MUNICIPIO_RES : int 67 20 63 46 5 14 18 3 28 15 ...
## $ TIPO_PACIENTE : int 1 1 1 1 1 1 1 1 1 1 ...
## $ FECHA_INGRESO : chr "2022-01-04" "2022-01-06" "2022-02-10" "2022-01-15" ...
## $ FECHA_SINTOMAS : chr "2022-01-02" "2022-01-05" "2022-02-09" "2022-01-02" ...
## $ FECHA_DEF : chr "9999-99-99" "9999-99-99" "9999-99-99" "9999-99-99" ...
## $ INTUBADO : int 97 97 97 97 97 97 97 97 97 97 ...
## $ NEUMONIA : int 2 2 2 2 2 2 2 2 2 2 ...
## $ EDAD : int 33 54 34 38 27 39 54 63 27 2 ...
## $ NACIONALIDAD : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EMBARAZO : int 2 97 2 2 1 2 97 97 2 2 ...
## $ HABLA_LENGUA_INDIG : int 2 2 2 2 99 99 2 2 99 2 ...
## $ INDIGENA : int 2 2 2 2 99 99 2 2 99 2 ...
## $ DIABETES : int 2 2 2 2 2 2 2 2 2 2 ...
## $ EPOC : int 2 2 2 2 2 2 2 2 2 2 ...
## $ ASMA : int 2 2 2 2 2 2 2 2 2 2 ...
## $ INMUSUPR : int 2 2 2 2 2 2 2 2 2 2 ...
## $ HIPERTENSION : int 2 2 2 2 2 2 2 2 2 2 ...
## $ OTRA_COM : int 2 2 2 2 2 2 2 2 2 2 ...
## $ CARDIOVASCULAR : int 2 2 2 1 2 2 2 2 2 2 ...
## $ OBESIDAD : int 2 2 1 2 2 2 2 2 2 2 ...
## $ RENAL_CRONICA : int 2 2 2 2 2 2 2 2 2 2 ...
## $ TABAQUISMO : int 2 2 2 2 2 2 2 2 2 2 ...
## $ OTRO_CASO : int 2 2 1 2 1 2 2 2 1 2 ...
## $ TOMA_MUESTRA_LAB : int 2 2 2 2 2 2 2 2 2 2 ...
## $ RESULTADO_LAB : int 97 97 97 97 97 97 97 97 97 97 ...
## $ TOMA_MUESTRA_ANTIGENO: int 1 1 1 1 1 1 1 1 1 1 ...
## $ RESULTADO_ANTIGENO : int 2 1 1 1 2 2 1 2 1 2 ...
## $ CLASIFICACION_FINAL : int 7 3 3 3 7 7 3 7 3 7 ...
## $ MIGRANTE : int 99 99 99 99 99 99 99 99 99 99 ...
## $ PAIS_NACIONALIDAD : chr "México" "México" "México" "México" ...
## $ PAIS_ORIGEN : chr "97" "97" "97" "97" ...
## $ UCI : int 97 97 97 97 97 97 97 97 97 97 ...
## Null values
apply(X = is.na(data_covid), MARGIN = 2, FUN = sum)
## FECHA_ACTUALIZACION ID_REGISTRO ORIGEN
## 0 0 0
## SECTOR ENTIDAD_UM SEXO
## 0 0 0
## ENTIDAD_NAC ENTIDAD_RES MUNICIPIO_RES
## 0 0 0
## TIPO_PACIENTE FECHA_INGRESO FECHA_SINTOMAS
## 0 0 0
## FECHA_DEF INTUBADO NEUMONIA
## 0 0 0
## EDAD NACIONALIDAD EMBARAZO
## 0 0 0
## HABLA_LENGUA_INDIG INDIGENA DIABETES
## 0 0 0
## EPOC ASMA INMUSUPR
## 1 1 1
## HIPERTENSION OTRA_COM CARDIOVASCULAR
## 1 1 1
## OBESIDAD RENAL_CRONICA TABAQUISMO
## 1 1 1
## OTRO_CASO TOMA_MUESTRA_LAB RESULTADO_LAB
## 1 1 1
## TOMA_MUESTRA_ANTIGENO RESULTADO_ANTIGENO CLASIFICACION_FINAL
## 1 1 1
## MIGRANTE PAIS_NACIONALIDAD PAIS_ORIGEN
## 1 0 0
## UCI
## 1
## Counting unique values
sort(sapply(data_covid, function(x) length(unique(x))))
## FECHA_ACTUALIZACION ORIGEN SEXO
## 1 2 2
## TIPO_PACIENTE NEUMONIA NACIONALIDAD
## 2 2 2
## HABLA_LENGUA_INDIG INDIGENA DIABETES
## 3 3 3
## TOMA_MUESTRA_LAB TOMA_MUESTRA_ANTIGENO INTUBADO
## 3 3 4
## EMBARAZO EPOC ASMA
## 4 4 4
## INMUSUPR HIPERTENSION OTRA_COM
## 4 4 4
## CARDIOVASCULAR OBESIDAD RENAL_CRONICA
## 4 4 4
## TABAQUISMO OTRO_CASO RESULTADO_ANTIGENO
## 4 4 4
## MIGRANTE RESULTADO_LAB UCI
## 4 5 5
## CLASIFICACION_FINAL SECTOR ENTIDAD_UM
## 8 14 32
## ENTIDAD_RES ENTIDAD_NAC PAIS_ORIGEN
## 32 33 53
## PAIS_NACIONALIDAD FECHA_DEF EDAD
## 94 105 121
## FECHA_SINTOMAS FECHA_INGRESO MUNICIPIO_RES
## 153 170 415
## ID_REGISTRO
## 818214
## Filter only case positives
positive_cases <- data_covid %>%
filter(CLASIFICACION_FINAL == 1 | CLASIFICACION_FINAL == 2 | CLASIFICACION_FINAL == 3)
## New data shape
nrows <- nrow(positive_cases)
ncols <- ncol(positive_cases)
sprintf("The sample data set has %s feautures and %s samples", ncols, nrows)
## [1] "The sample data set has 40 feautures and 602748 samples"
Inference: About 40% of the data were reduced.
## Rename column
positive_cases <- positive_cases %>% rename(POSITIVOS = CLASIFICACION_FINAL)
## Transformation of positive case values
positive_cases <- positive_cases %>%
transform(POSITIVOS = ifelse(POSITIVOS != 1, 1, 1))
## Visualization DATE column
head(positive_cases$FECHA_DEF, 20)
## [1] "9999-99-99" "9999-99-99" "9999-99-99" "9999-99-99" "9999-99-99"
## [6] "9999-99-99" "9999-99-99" "9999-99-99" "9999-99-99" "9999-99-99"
## [11] "9999-99-99" "9999-99-99" "2022-02-09" "9999-99-99" "9999-99-99"
## [16] "9999-99-99" "9999-99-99" "9999-99-99" "9999-99-99" "9999-99-99"
Inference: The patients who didn’t die, has value “9999-99-99”, while the patitients that did die, has values 1.
## Transforming date value
positive_cases <- positive_cases %>%
mutate(DEFUNCION = ifelse((FECHA_DEF=="9999-99-99"),0, 1))
## Obtain the name of the States
positive_cases <- positive_cases %>%
transform(ENTIDAD_RES = case_when(
ENTIDAD_RES == 1 ~ "Aguascalientes",
ENTIDAD_RES == 2 ~ "Baja California",
ENTIDAD_RES == 3 ~ "Baja California Sur",
ENTIDAD_RES == 4 ~ "Campeche",
ENTIDAD_RES == 5 ~ "Coahuila",
ENTIDAD_RES == 6 ~ "Colima",
ENTIDAD_RES == 7 ~ "Chiapas",
ENTIDAD_RES == 8 ~ "Chihuahua",
ENTIDAD_RES == 9 ~ "CDMX",
ENTIDAD_RES == 10 ~ "Durango",
ENTIDAD_RES == 11 ~ "Guanajuato",
ENTIDAD_RES == 12 ~ "Guerrero",
ENTIDAD_RES == 13 ~ "Hidalgo",
ENTIDAD_RES == 14 ~ "Jalisco",
ENTIDAD_RES == 15 ~ "México",
ENTIDAD_RES == 16 ~ "Michoacán",
ENTIDAD_RES == 17 ~ "Morelos",
ENTIDAD_RES == 18 ~ "Nayarit",
ENTIDAD_RES == 19 ~ "Nuevo León",
ENTIDAD_RES == 20 ~ "Oaxaca",
ENTIDAD_RES == 21 ~ "Puebla",
ENTIDAD_RES == 22 ~ "Querétaro",
ENTIDAD_RES == 23 ~ "Quintana Roo",
ENTIDAD_RES == 24 ~ "San Luis Potosí",
ENTIDAD_RES == 25 ~ "Sinaloa",
ENTIDAD_RES == 26 ~ "Sonora",
ENTIDAD_RES == 27 ~ "Tabasco",
ENTIDAD_RES == 28 ~ "Tamaulipas",
ENTIDAD_RES == 29 ~ "Tlaxcala",
ENTIDAD_RES == 30 ~ "Veracruz",
ENTIDAD_RES == 31 ~ "Yucatán",
ENTIDAD_RES == 32 ~ "Zacatecas",
T ~ "Sin especificar"
))
## Transforming the SEX column
positive_cases <- positive_cases %>%
transform(SEXO = case_when(
SEXO == 1 ~ "Mujer",
SEXO == 2 ~ "Hombre",
T ~ "Sin especificar"
))
## Select only the features of interest.
positive_cases <- positive_cases %>%
select(POSITIVOS,
SEXO,
ENTIDAD_RES,
INTUBADO,
NEUMONIA,
EDAD,
DIABETES,
TABAQUISMO,
PAIS_NACIONALIDAD,
DEFUNCION)
sample(positive_cases$PAIS_NACIONALIDAD, 40)
## [1] "México" "México" "México" "México" "México" "México" "México"
## [8] "México" "México" "México" "México" "México" "México" "México"
## [15] "México" "México" "México" "México" "México" "México" "México"
## [22] "México" "México" "México" "México" "México" "México" "México"
## [29] "México" "México" "México" "México" "México" "México" "México"
## [36] "México" "México" "México" "México" "México"
## Correction of errors in the nationality column
positive_cases$PAIS_NACIONALIDAD <- str_replace(positive_cases$PAIS_NACIONALIDAD, "México","México")
## Filter only mexican nationality
positive_cases_mexico <- positive_cases %>%
filter(positive_cases$PAIS_NACIONALIDAD == "México")
positive_cases_mexico <- positive_cases_mexico %>%
filter(positive_cases_mexico$SEXO == "Hombre" | positive_cases_mexico$SEXO == "Mujer",
positive_cases_mexico$INTUBADO == 1 | positive_cases_mexico$INTUBADO == 2,
positive_cases_mexico$NEUMONIA == 1 | positive_cases_mexico$NEUMONIA ==2,
positive_cases_mexico$DIABETES == 1 | positive_cases_mexico$DIABETES ==2,
positive_cases_mexico$TABAQUISMO == 1 | positive_cases_mexico$TABAQUISMO ==2)
## Positive cases by state (TOP 10)
positive_cases_by_state <- positive_cases_mexico %>%
group_by(ENTIDAD_RES) %>%
summarize(conteo = n()) %>%
arrange(desc(conteo)) %>%
top_n(10)
## Selecting by conteo
positive_cases_by_state
## # A tibble: 10 x 2
## ENTIDAD_RES conteo
## <chr> <int>
## 1 México 2646
## 2 CDMX 2544
## 3 Jalisco 1631
## 4 Nuevo León 1403
## 5 Guanajuato 1207
## 6 Veracruz 880
## 7 Coahuila 825
## 8 Chihuahua 806
## 9 Tamaulipas 717
## 10 Baja California 695
Inference: Mexico State has more positive cases.
## States with highest COVID-19 infection rates
width <- c(0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8)
plot_ly(positive_cases_by_state) %>%
add_bars(x = ~conteo,
y = ~ENTIDAD_RES,
width = ~width,
orientation = 'h') %>%
layout(title = "States with highest COVID-19 infection rates",
xaxis = list(title = "Count"),
yaxis = list(title = ""))
## Positive cases by gender
positive_cases_by_gender <- positive_cases_mexico %>%
group_by(ENTIDAD_RES, SEXO) %>%
summarize(conteo = n()) %>%
arrange(desc(conteo))
## `summarise()` has grouped output by 'ENTIDAD_RES'. You can override using the
## `.groups` argument.
head(positive_cases_by_gender,10)
## # A tibble: 10 x 3
## # Groups: ENTIDAD_RES [5]
## ENTIDAD_RES SEXO conteo
## <chr> <chr> <int>
## 1 México Hombre 1350
## 2 CDMX Hombre 1304
## 3 México Mujer 1296
## 4 CDMX Mujer 1240
## 5 Jalisco Hombre 908
## 6 Nuevo León Hombre 742
## 7 Jalisco Mujer 723
## 8 Nuevo León Mujer 661
## 9 Guanajuato Hombre 648
## 10 Guanajuato Mujer 559
Inference:State of Mexico and CDMX have more positive cases by sex.
## Graph of positive cases by sex
plot_ly(positive_cases_by_gender) %>%
add_bars(x = ~ENTIDAD_RES,
y = ~conteo,
color = ~SEXO) %>%
layout(title = "Number of positive cases by women and men in each state",
xaxis = list(title = ""),
yaxis = list(title = "Count"))
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Graph of positive cases by sex
positive_cases_mexico %>%
count(SEXO) %>%
plot_ly( x = ~SEXO,
y=~n,
type = 'bar',
text = ~n,
textposition = 'auto',
marker = list(color = 'rgb(158,202,225)',
line = list(color = 'rgb(8,48,107)',
width = 1.5))) %>%
layout(title = "Positive cases by fameles and males",
xaxis = list(title = ""),
yaxis = list(title = "Count"))
Inference : There are more positive cases in men than in women.
## Deaths due to COVID
positive_cases_mexico %>%
filter(DEFUNCION == 1) %>%
count(SEXO) %>%
plot_ly( x = ~SEXO,
y=~n,
type = 'bar',
text = ~n,
textposition = 'auto',
marker = list(color = 'rgb(158,202,225)',
line = list(color = 'rgb(8,48,107)',
width = 1.5))) %>%
layout(title = "Number of deaths of women and men",
xaxis = list(title = ""),
yaxis = list(title = "Count"))
## Graph of Deaths due to COVID
ggplot(positive_cases_mexico, aes(x= EDAD)) +
geom_density(color = "#15CFB5",
fill = 4,
alpha = 0.25) +
labs(title = "Graph of density of COVID19 positives by age.",
x = "Age",
y="Count")
## Average age of COVID positive cases
average_age <- round(mean(positive_cases_mexico$EDAD),0)
paste("The average age of COVID-19 positives is: ", average_age, "years ago")
## [1] "The average age of COVID-19 positives is: 57 years ago"
## Average age of COVID positive cases woman
positive_cases_by_fameles <- positive_cases_mexico %>%
filter(SEXO == 'Mujer')
average_age_women <- round(mean(positive_cases_by_fameles$EDAD), 0)
paste("The average age of COVID-19-positive women is: ", average_age_women, "years ago")
## [1] "The average age of COVID-19-positive women is: 54 years ago"
## Average age of COVID positive cases men
positive_cases_by_males <- positive_cases_mexico %>%
filter(SEXO == 'Hombre')
average_age_men <- round(mean(positive_cases_by_males$EDAD), 0)
paste("The average age of COVID-19-positive men is: ", average_age_men, "years ago.")
## [1] "The average age of COVID-19-positive men is: 59 years ago."
## Graph of Density
ggplot(positive_cases_mexico,
aes(x = EDAD,
colour = SEXO,
fill = SEXO )) +
geom_density() +
labs(title = "Graph of density of COVID-19 positive cases by gender.",
x = "Age",
y="")
## Count cases positives by Age
positive_cases_mexico %>%
group_by(EDAD) %>%
summarize(conteo = n()) %>%
arrange(desc(conteo)) %>%
top_n(10)
## Selecting by conteo
## # A tibble: 10 x 2
## EDAD conteo
## <int> <int>
## 1 69 435
## 2 65 423
## 3 80 416
## 4 67 406
## 5 73 404
## 6 72 396
## 7 75 396
## 8 60 395
## 9 62 387
## 10 77 381
## Count cases positives by AGE (woman)
positive_cases_by_fameles %>%
group_by(EDAD) %>%
summarize(conteo = n()) %>%
arrange(desc(conteo)) %>%
top_n(10)
## Selecting by conteo
## # A tibble: 10 x 2
## EDAD conteo
## <int> <int>
## 1 69 197
## 2 65 196
## 3 80 184
## 4 75 175
## 5 62 174
## 6 77 173
## 7 73 170
## 8 60 169
## 9 67 166
## 10 68 165
## Count cases positives by AGE (men)
positive_cases_by_males %>%
group_by(EDAD) %>%
summarize(conteo = n()) %>%
arrange(desc(conteo)) %>%
top_n(10)
## Selecting by conteo
## # A tibble: 10 x 2
## EDAD conteo
## <int> <int>
## 1 70 241
## 2 67 240
## 3 69 238
## 4 73 234
## 5 72 232
## 6 80 232
## 7 65 227
## 8 57 226
## 9 60 226
## 10 66 225
## Boxplot
positive_cases_mexico %>%
ggplot(aes(x=SEXO,
y=EDAD,
fill=SEXO)) +
geom_boxplot(width = 0.5) +
theme(
legend.position="none",
plot.title = element_text(size=11)) +
ggtitle("Box plot of the ages of males and females positive for COVID-19") +
xlab("")
## Average age of deceased patients
deceased_patients <- positive_cases_mexico %>%
filter(DEFUNCION == 1)
average_age_of_deceased_patients <- round(mean(deceased_patients$EDAD),0)
paste("The average age of patients who died from COVID-19 is: ",
average_age_of_deceased_patients, "years ago.")
## [1] "The average age of patients who died from COVID-19 is: 69 years ago."
## Average age at death of women
deceased_women <- deceased_patients %>%
filter(SEXO == 'Mujer')
average_age_of_deceased_women <- round(mean(deceased_women$EDAD), 0)
paste("The average age of the women who died from COVID-19 is: ",
average_age_of_deceased_women, "years ago.")
## [1] "The average age of the women who died from COVID-19 is: 68 years ago."
## Average age at death of men
decresed_men <- deceased_patients %>%
filter(SEXO == 'Hombre')
average_age_of_deceased_men <- round(mean(decresed_men$EDAD), 0)
paste("The average age of the men who died from COVID-19 is: ",
average_age_of_deceased_men, "years ago.")
## [1] "The average age of the men who died from COVID-19 is: 69 years ago."
## Number of deaths by AGE
positive_cases_mexico %>%
filter(DEFUNCION == 1) %>%
group_by(EDAD) %>%
summarize(conteo = n()) %>%
arrange(desc(conteo)) %>%
top_n(10)
## Selecting by conteo
## # A tibble: 10 x 2
## EDAD conteo
## <int> <int>
## 1 80 224
## 2 69 212
## 3 82 211
## 4 75 200
## 5 78 200
## 6 70 196
## 7 73 195
## 8 71 193
## 9 77 193
## 10 65 192
Inference : The ages with the highest number of deaths are: 80, 79 and 78.
## Number of deaths by women
positive_cases_by_fameles %>%
filter(DEFUNCION == 1) %>%
group_by(EDAD) %>%
summarize(conteo = n()) %>%
arrange(desc(conteo)) %>%
top_n(10)
## Selecting by conteo
## # A tibble: 10 x 2
## EDAD conteo
## <int> <int>
## 1 65 92
## 2 69 89
## 3 71 82
## 4 75 81
## 5 72 80
## 6 77 80
## 7 80 76
## 8 68 75
## 9 78 74
## 10 82 74
Inference: The ages with the highest number of deaths among women are: 68, 65 and 79.
## Number of deaths by men
positive_cases_by_males %>%
filter(DEFUNCION == 1) %>%
group_by(EDAD) %>%
summarize(conteo = n()) %>%
arrange(desc(conteo)) %>%
top_n(10)
## Selecting by conteo
## # A tibble: 11 x 2
## EDAD conteo
## <int> <int>
## 1 80 148
## 2 82 137
## 3 70 128
## 4 78 126
## 5 81 125
## 6 69 123
## 7 73 123
## 8 76 120
## 9 75 119
## 10 77 113
## 11 79 113
## Graph of deaths
deaths <- positive_cases_mexico %>%
select(EDAD, DEFUNCION, SEXO) %>%
filter(DEFUNCION == 1)
ggplot(deaths,
aes(x = EDAD,
fill= SEXO,
colour = SEXO)) +
geom_histogram(alpha = 0.5,
position = "dodge") +
labs(title = "Histogram of women and men deceased by COVID-19.",
x = "Age",
y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Boxplot of ages of deaths
deaths %>%
ggplot( aes(x=SEXO,
y=EDAD,
fill=SEXO)) +
geom_violin(width=1) +
geom_boxplot() +
theme(
legend.position="none",
plot.title = element_text(size=12)) +
ggtitle("Boxplot of ages of deaths by cause COVID-19.") +
xlab("")
## Number of smokers by gender
positive_cases_mexico %>%
count(SEXO, TABAQUISMO) %>%
plot_ly(x = ~SEXO,
y= ~n,
color = ~TABAQUISMO) %>%
add_bars() %>%
layout(title = "Counts of smokers and nonsmokers by gender",
x = "Sex",
y = "Count") %>%
layout(barmode ="stack")
## Warning: textfont.color doesn't (yet) support data arrays
## Warning: textfont.color doesn't (yet) support data arrays
## Warning: 'layout' objects don't have these attributes: 'x', 'y'
## Valid attributes include:
## '_deprecated', 'activeshape', 'annotations', 'autosize', 'autotypenumbers', 'calendar', 'clickmode', 'coloraxis', 'colorscale', 'colorway', 'computed', 'datarevision', 'dragmode', 'editrevision', 'editType', 'font', 'geo', 'grid', 'height', 'hidesources', 'hoverdistance', 'hoverlabel', 'hovermode', 'images', 'legend', 'mapbox', 'margin', 'meta', 'metasrc', 'modebar', 'newshape', 'paper_bgcolor', 'plot_bgcolor', 'polar', 'scene', 'selectdirection', 'selectionrevision', 'separators', 'shapes', 'showlegend', 'sliders', 'smith', 'spikedistance', 'template', 'ternary', 'title', 'transition', 'uirevision', 'uniformtext', 'updatemenus', 'width', 'xaxis', 'yaxis', 'barmode', 'bargap', 'mapType'
Inference: There are more COVID-19 positive male smokers than female smokers.
## Number of diabetics by gender
positive_cases_mexico %>%
count(SEXO, DIABETES) %>%
plot_ly(x = ~SEXO,
y= ~n,
color = ~DIABETES) %>%
add_bars() %>%
layout(title = "Count of men and women with diabetes") %>%
layout(barmode ="stack")
## Warning: textfont.color doesn't (yet) support data arrays
## Warning: textfont.color doesn't (yet) support data arrays
Inference There are more COVID-19 positive diabetic men than women.
## Number of intubated patients by sex
positive_cases_mexico %>%
count(SEXO, INTUBADO) %>%
plot_ly(x = ~SEXO,
y= ~n,
color = ~INTUBADO) %>%
add_bars() %>%
layout(title = "Number of intubated men and women") %>%
layout(barmode ="stack")
## Warning: textfont.color doesn't (yet) support data arrays
## Warning: textfont.color doesn't (yet) support data arrays
Inference: The number of patients intubated due to covid is slightly low.
## Number of deaths of smokers, pneumonia and intubated cases
death_smok_pneumo_intub <- positive_cases_mexico %>%
filter(TABAQUISMO == 1 | NEUMONIA ==1,
INTUBADO == 1,
SEXO == 'Hombre' | SEXO == 'Mujer',
DEFUNCION == 1)
plot_ly(death_smok_pneumo_intub,
x = ~SEXO,
y = ~TABAQUISMO,
z = ~INTUBADO,
type = 'bar',
name = 'Smoking') %>%
add_trace(y = ~NEUMONIA,
name = 'Pneumonia') %>%
add_trace(z = ~INTUBADO,
name = 'Intubated') %>%
layout(yaxis = list(title = 'Count'),
barmode = 'group')
## Warning: 'bar' objects don't have these attributes: 'z'
## Valid attributes include:
## '_deprecated', 'alignmentgroup', 'base', 'basesrc', 'cliponaxis', 'constraintext', 'customdata', 'customdatasrc', 'dx', 'dy', 'error_x', 'error_y', 'hoverinfo', 'hoverinfosrc', 'hoverlabel', 'hovertemplate', 'hovertemplatesrc', 'hovertext', 'hovertextsrc', 'ids', 'idssrc', 'insidetextanchor', 'insidetextfont', 'legendgroup', 'legendgrouptitle', 'legendrank', 'marker', 'meta', 'metasrc', 'name', 'offset', 'offsetgroup', 'offsetsrc', 'opacity', 'orientation', 'outsidetextfont', 'selected', 'selectedpoints', 'showlegend', 'stream', 'text', 'textangle', 'textfont', 'textposition', 'textpositionsrc', 'textsrc', 'texttemplate', 'texttemplatesrc', 'transforms', 'type', 'uid', 'uirevision', 'unselected', 'visible', 'width', 'widthsrc', 'x', 'x0', 'xaxis', 'xcalendar', 'xhoverformat', 'xperiod', 'xperiod0', 'xperiodalignment', 'xsrc', 'y', 'y0', 'yaxis', 'ycalendar', 'yhoverformat', 'yperiod', 'yperiod0', 'yperiodalignment', 'ysrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'
## Warning: 'bar' objects don't have these attributes: 'z'
## Valid attributes include:
## '_deprecated', 'alignmentgroup', 'base', 'basesrc', 'cliponaxis', 'constraintext', 'customdata', 'customdatasrc', 'dx', 'dy', 'error_x', 'error_y', 'hoverinfo', 'hoverinfosrc', 'hoverlabel', 'hovertemplate', 'hovertemplatesrc', 'hovertext', 'hovertextsrc', 'ids', 'idssrc', 'insidetextanchor', 'insidetextfont', 'legendgroup', 'legendgrouptitle', 'legendrank', 'marker', 'meta', 'metasrc', 'name', 'offset', 'offsetgroup', 'offsetsrc', 'opacity', 'orientation', 'outsidetextfont', 'selected', 'selectedpoints', 'showlegend', 'stream', 'text', 'textangle', 'textfont', 'textposition', 'textpositionsrc', 'textsrc', 'texttemplate', 'texttemplatesrc', 'transforms', 'type', 'uid', 'uirevision', 'unselected', 'visible', 'width', 'widthsrc', 'x', 'x0', 'xaxis', 'xcalendar', 'xhoverformat', 'xperiod', 'xperiod0', 'xperiodalignment', 'xsrc', 'y', 'y0', 'yaxis', 'ycalendar', 'yhoverformat', 'yperiod', 'yperiod0', 'yperiodalignment', 'ysrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'
## Warning: 'bar' objects don't have these attributes: 'z'
## Valid attributes include:
## '_deprecated', 'alignmentgroup', 'base', 'basesrc', 'cliponaxis', 'constraintext', 'customdata', 'customdatasrc', 'dx', 'dy', 'error_x', 'error_y', 'hoverinfo', 'hoverinfosrc', 'hoverlabel', 'hovertemplate', 'hovertemplatesrc', 'hovertext', 'hovertextsrc', 'ids', 'idssrc', 'insidetextanchor', 'insidetextfont', 'legendgroup', 'legendgrouptitle', 'legendrank', 'marker', 'meta', 'metasrc', 'name', 'offset', 'offsetgroup', 'offsetsrc', 'opacity', 'orientation', 'outsidetextfont', 'selected', 'selectedpoints', 'showlegend', 'stream', 'text', 'textangle', 'textfont', 'textposition', 'textpositionsrc', 'textsrc', 'texttemplate', 'texttemplatesrc', 'transforms', 'type', 'uid', 'uirevision', 'unselected', 'visible', 'width', 'widthsrc', 'x', 'x0', 'xaxis', 'xcalendar', 'xhoverformat', 'xperiod', 'xperiod0', 'xperiodalignment', 'xsrc', 'y', 'y0', 'yaxis', 'ycalendar', 'yhoverformat', 'yperiod', 'yperiod0', 'yperiodalignment', 'ysrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'
## Cases of deaths, smokers and intubated patients
death_smok_intub <- positive_cases_mexico %>%
filter(TABAQUISMO == 1 | INTUBADO ==1,
SEXO == 'Hombre' | SEXO == 'Mujer',
DEFUNCION == 1,
EDAD <= average_age_of_deceased_patients)
plot_ly(death_smok_intub,
x = ~SEXO,
y = ~TABAQUISMO,
type = 'bar',
name = 'Smoking') %>%
add_trace(y = ~INTUBADO,
name = 'Intubated ') %>%
layout(yaxis = list(title = 'Count'),
barmode = 'group')
## Number of deaths
deaths_covid_medic_cond <- positive_cases_mexico %>%
select(SEXO,
TABAQUISMO,
INTUBADO,
DIABETES,
DEFUNCION)
deaths_covid_medic_cond <- deaths_covid_medic_cond %>%
transform(TABAQUISMO = case_when(TABAQUISMO == 1 ~ "Sí",
TABAQUISMO == 2 ~ "No",
T ~ ""),
INTUBADO = case_when(INTUBADO == 1 ~ "Sí",
INTUBADO == 2 ~ "No",
T ~ ""),
DIABETES = case_when(DIABETES == 1 ~ "Sí",
DIABETES == 2 ~ "No",
T ~ ""))
deaths_covid_medic_cond <- deaths_covid_medic_cond %>%
filter(SEXO == 'Mujer' | SEXO == 'Hombre',
TABAQUISMO == "Sí" | TABAQUISMO == "No",
INTUBADO == "Sí" | INTUBADO == "No",
DIABETES == "Sí" | DIABETES == "No",
DEFUNCION == 1)
## Number of deaths with medical conditions
deaths_covid_medic_cond <- positive_cases_mexico %>%
select(SEXO,
TABAQUISMO,
INTUBADO,
DIABETES,
DEFUNCION)
deaths_covid_medic_cond <- deaths_covid_medic_cond %>%
transform(TABAQUISMO = case_when(TABAQUISMO == 1 ~ "Sí",
TABAQUISMO == 2 ~ "No",
T ~ ""),
INTUBADO = case_when(INTUBADO == 1 ~ "Sí",
INTUBADO == 2 ~ "No",
T ~ ""),
DIABETES = case_when(DIABETES == 1 ~ "Sí",
DIABETES == 2 ~ "No",
T ~ ""))
deaths_covid_medic_cond <- deaths_covid_medic_cond %>%
filter(SEXO == 'Mujer' | SEXO == 'Hombre',
TABAQUISMO == "Sí" | TABAQUISMO == "No",
INTUBADO == "Sí" | INTUBADO == "No",
DIABETES == "Sí" | DIABETES == "No",
DEFUNCION == 1)
## Smoker table
gender <- as.factor(deaths_covid_medic_cond$SEXO)
smoker <- as.factor(deaths_covid_medic_cond$TABAQUISMO)
smoker_table <- ftable(xtabs(~gender+smoker, data = deaths_covid_medic_cond))
smoker_table <- cbind(smoker_table, apply(smoker_table, 1, sum))
smoker_table <- rbind(smoker_table, apply(smoker_table, 2, sum))
colnames(smoker_table) <- c('Smoker', 'Non-smoker', 'Total')
rownames(smoker_table)[3] <- c('Total')
smoker_table <- as.data.frame(smoker_table)
smoker_table
## Smoker Non-smoker Total
## Hombre 4087 483 4570
## Mujer 2797 105 2902
## Total 6884 588 7472
## Intubated table
intubated <- as.factor(deaths_covid_medic_cond$INTUBADO)
intubated_table <- ftable(xtabs(~gender+intubated, data = deaths_covid_medic_cond))
intubated_table <- cbind(intubated_table, apply(intubated_table, 1, sum))
intubated_table <- rbind(intubated_table, apply(intubated_table, 2, sum))
colnames(intubated_table) <- c('Intubado_Sí', 'Intubado_No', 'Total')
rownames(intubated_table)[3] <- c('Total')
intubated_table <- as.data.frame(intubated_table)
intubated_table
## Intubado_Sí Intubado_No Total
## Hombre 3861 709 4570
## Mujer 2468 434 2902
## Total 6329 1143 7472
## Diabetics table
diabetics <- as.factor(deaths_covid_medic_cond$DIABETES)
diabetics_table <- ftable(xtabs(~gender+diabetics, data = deaths_covid_medic_cond))
diabetics_table <- cbind(diabetics_table, apply(diabetics_table, 1, sum))
diabetics_table <- rbind(diabetics_table, apply(diabetics_table, 2, sum))
colnames(diabetics_table) <- c('Diabetics: Yes', 'Diabetics: No', 'Total')
rownames(diabetics_table)[3] <- c('Total')
diabetics_table <- as.data.frame(diabetics_table)
diabetics_table
## Diabetics: Yes Diabetics: No Total
## Hombre 2877 1693 4570
## Mujer 1698 1204 2902
## Total 4575 2897 7472
- The ages with the highest number of positive cases are: 59, 65,67 and 80.
- The ages with the highest number of positve cases in women are: 65, 59, 67, 60.
- The ages with the highest number of positve cases in men are: 80, 66, 73 and 74.
- The average age of patients who died for COVID-19 is 69 years ago.
- The average age of the women who died from COVID-19 is 68 years ago.
- The average age of the women who died from COVID-19 is 69 years ago.
- The ages with the highest number of deaths are: 80, 79 and 78.
- The ages with the highest number of deaths among women are: 68, 65 and 79.
- The ages with the highest number of deaths among men are: 80, 82 and 77.
- There are more COVID-19 positive male smokers than female smokers.
- There are more COVID-19 positive diabetic men than women.
- The number of patients intubated due to covid is slightly low.