I recently completed a data scientist with R course and want to practice applying that newly gained knowledge to the COVID-19 situation in Thailand.
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("ggplot2")
library(readr)
covid <- read_csv("https://raw.githubusercontent.com/RamiKrispin/coronavirus/master/csv/coronavirus.csv")
##
## ── Column specification ────────────────────────────────────────────────────────────────────────────────
## cols(
## date = col_date(format = ""),
## province = col_character(),
## country = col_character(),
## lat = col_double(),
## long = col_double(),
## type = col_character(),
## cases = col_double()
## )
covid
The structure and summary of the covid dataset are shown below.
str(covid)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 355887 obs. of 7 variables:
## $ date : Date, format: "2020-01-22" "2020-01-22" ...
## $ province: chr NA NA NA NA ...
## $ country : chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
## $ lat : num 33.9 41.2 28 42.5 -11.2 ...
## $ long : num 67.71 20.17 1.66 1.52 17.87 ...
## $ type : chr "confirmed" "confirmed" "confirmed" "confirmed" ...
## $ cases : num 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. date = col_date(format = ""),
## .. province = col_character(),
## .. country = col_character(),
## .. lat = col_double(),
## .. long = col_double(),
## .. type = col_character(),
## .. cases = col_double()
## .. )
summary(covid)
## date province country lat
## Min. :2020-01-22 Length:355887 Length:355887 Min. :-51.80
## 1st Qu.:2020-05-11 Class :character Class :character 1st Qu.: 4.86
## Median :2020-08-29 Mode :character Mode :character Median : 21.51
## Mean :2020-08-29 Mean : 20.07
## 3rd Qu.:2020-12-17 3rd Qu.: 40.14
## Max. :2021-04-06 Max. : 71.71
## NA's :882
## long type cases
## Min. :-178.12 Length:355887 Min. : -88158.0
## 1st Qu.: -14.45 Class :character 1st Qu.: 0.0
## Median : 21.75 Mode :character Median : 0.0
## Mean : 24.79 Mean : 609.1
## 3rd Qu.: 85.24 3rd Qu.: 30.0
## Max. : 178.06 Max. :1123456.0
## NA's :882
First, check whether there are any observations from Thailand.
any(covid$country=="Thailand")
## [1] TRUE
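A small aside on this kind of check: `==` propagates missing values, so `any(x == "Thailand")` returns `NA` rather than `FALSE` when the country is absent and the column contains an `NA`. The result above is `TRUE` either way, but the sketch below shows the difference on a made-up vector (`countries` is hypothetical, not the real column).

```r
# Hypothetical toy vector standing in for a country column with a missing value.
countries <- c("Afghanistan", NA, "Albania")

# `==` yields NA for the missing entry, so any() cannot return a clean FALSE here.
eq_check <- any(countries == "Thailand")

# %in% never returns NA, so the answer is a plain TRUE/FALSE.
in_check <- "Thailand" %in% countries

print(eq_check)
print(in_check)
```

On the real data, `"Thailand" %in% covid$country` would be the equivalent call.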
Next, we can filter for only confirmed cases in Thailand, and explore them.
covid_th <- covid %>%
  filter(country == "Thailand", type == "confirmed")
summary(covid_th)
## date province country lat
## Min. :2020-01-22 Length:441 Length:441 Min. :15.87
## 1st Qu.:2020-05-11 Class :character Class :character 1st Qu.:15.87
## Median :2020-08-29 Mode :character Mode :character Median :15.87
## Mean :2020-08-29 Mean :15.87
## 3rd Qu.:2020-12-17 3rd Qu.:15.87
## Max. :2021-04-06 Max. :15.87
## long type cases
## Min. :101 Length:441 Min. : -10.00
## 1st Qu.:101 Class :character 1st Qu.: 2.00
## Median :101 Mode :character Median : 7.00
## Mean :101 Mean : 67.05
## 3rd Qu.:101 3rd Qu.: 50.00
## Max. :101 Max. :1732.00
From the summary above, I notice that at least one observation has a negative number of cases, which is impossible for a daily count. I will therefore filter those observations out and plot cases against date.
covid_th <- covid_th %>%
  filter(cases >= 0) %>%
  select(date, country, cases)
summary(covid_th)
## date country cases
## Min. :2020-01-22 Length:437 Min. : 0.00
## 1st Qu.:2020-05-10 Class :character 1st Qu.: 2.00
## Median :2020-08-28 Mode :character Median : 7.00
## Mean :2020-08-28 Mean : 67.72
## 3rd Qu.:2020-12-18 3rd Qu.: 51.00
## Max. :2021-04-06 Max. :1732.00
ggplot(covid_th, aes(x = date, y = cases)) +
  geom_line() +
  ylab("Confirmed cases in Thailand")
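Dropping the negative rows is one way to handle them, but it also changes any total computed afterwards. As a hedged illustration on made-up numbers (the `daily` data frame is hypothetical), the sketch below compares the total with negative rows removed against the total with them kept. It assumes that negative counts are downward corrections of earlier reports, which this analysis does not establish.

```r
# Hypothetical daily counts; the -10 mimics a negative entry like the one in the summary.
daily <- data.frame(
  date  = as.Date("2020-03-01") + 0:4,
  cases = c(12, 8, -10, 5, 7)
)

# Approach used in this analysis: drop rows with negative counts.
dropped <- daily[daily$cases >= 0, ]

# Alternative: keep every row, treating negatives as corrections (an assumption).
total_dropped <- sum(dropped$cases)
total_kept    <- sum(daily$cases)

# Number of rows removed by the filter, and the gap between the two totals.
n_removed <- nrow(daily) - nrow(dropped)
gap <- total_dropped - total_kept
print(c(n_removed = n_removed, gap = gap))
```

In the Thailand data, the two summaries show 441 rows before the filter and 437 after, so four rows were removed.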
Now, let’s add a new column for cumulative cases, draw a line plot of cum_cases against date, and overlay a linear regression fit using the code below.
covid_th_cum <- covid_th %>%
  mutate(cum_cases = cumsum(cases))
summary(covid_th_cum)
## date country cases cum_cases
## Min. :2020-01-22 Length:437 Min. : 0.00 Min. : 4
## 1st Qu.:2020-05-10 Class :character 1st Qu.: 2.00 1st Qu.: 3009
## Median :2020-08-28 Mode :character Median : 7.00 Median : 3412
## Mean :2020-08-28 Mean : 67.72 Mean : 6968
## 3rd Qu.:2020-12-18 3rd Qu.: 51.00 3rd Qu.: 4352
## Max. :2021-04-06 Max. :1732.00 Max. :29592
ggplot(covid_th_cum, aes(x = date, y = cum_cases)) +
  geom_line() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
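One point worth keeping in mind: `cumsum()` accumulates in row order, so the running total is only meaningful when the rows are sorted by date. The data here appear to be in date order already. The sketch below, on a hypothetical shuffled data frame, shows sorting with base R's `order()` before accumulating.

```r
# Hypothetical daily counts, deliberately out of date order.
toy <- data.frame(
  date  = as.Date(c("2020-12-03", "2020-12-01", "2020-12-02")),
  cases = c(30, 10, 20)
)

# Sort by date first, then accumulate.
toy_sorted <- toy[order(toy$date), ]
toy_sorted$cum_cases <- cumsum(toy_sorted$cases)

print(toy_sorted)
```

With dplyr, the same safeguard would be an `arrange(date)` step before `mutate()`.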
Since the end of 2020, cases in Thailand have skyrocketed. Let’s focus only on the data from December 2020 onward.
covid_th_dec <- covid_th_cum %>%
  # Compare against an explicit Date rather than relying on string coercion
  filter(date >= as.Date("2020-12-01"))
ggplot(covid_th_dec, aes(x = date, y = cum_cases)) +
  geom_line() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
Let’s take a closer look at the peak.
peak <- covid_th_dec %>%
  filter(cases == max(cases))
peak
On January 29, 2021, Thailand reached its peak of 1,732 daily confirmed cases due to the Samutsakhon cluster.
ggplot(covid_th_dec, aes(x = date, y = cum_cases)) +
  geom_line() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_vline(aes(xintercept = date), data = peak, linetype = "dashed") +
  geom_text(aes(x = date, label = "Samutsakhon cluster"), data = peak, y = 30000)
## `geom_smooth()` using formula 'y ~ x'
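A note on how the peak is located: `filter(cases == max(cases))` returns every row that ties for the maximum, and `geom_vline()` then draws one line per row. A minimal base R sketch on hypothetical data (the `toy` values are invented) shows the difference from `which.max()`, which picks only the first maximum.

```r
# Hypothetical daily counts with a tie for the maximum.
toy <- data.frame(
  date  = as.Date("2021-01-01") + 0:3,
  cases = c(100, 250, 90, 250)
)

# All rows that tie for the maximum (same idea as filter(cases == max(cases))).
all_peaks <- toy[toy$cases == max(toy$cases), ]

# which.max() returns the index of the first maximum only.
first_peak <- toy[which.max(toy$cases), ]

print(nrow(all_peaks))
print(first_peak$date)
```

Checking `nrow(peak)` before plotting is a cheap way to confirm there is a single peak day.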
Let’s build a linear model from the ‘covid_th_dec’ data set to see the trend.
covid_model <- lm(cum_cases~date, data=covid_th_dec)
summary(covid_model)
##
## Call:
## lm(formula = cum_cases ~ date, data = covid_th_dec)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3715.1 -1619.3 -934.9 1771.1 4771.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.623e+06 1.039e+05 -44.51 <2e-16 ***
## date 2.487e+02 5.566e+00 44.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2300 on 125 degrees of freedom
## Multiple R-squared: 0.9411, Adjusted R-squared: 0.9406
## F-statistic: 1996 on 1 and 125 DF, p-value: < 2.2e-16
The linear model fits the period since December 2020 well (adjusted R-squared of 0.94), but I hope that this upward trend will not continue.
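To read the coefficients: R stores a `Date` as the number of days since 1970-01-01, so the `date` slope is in cases per day, and the intercept is the fitted value extrapolated back to 1970-01-01, which is why it is so large and negative. The sketch below illustrates this on synthetic data (the numbers are invented, with a known slope of 250 per day) and is not a re-run of the model above.

```r
# Synthetic, perfectly linear data: 250 new cumulative cases per day (invented numbers).
toy <- data.frame(date = as.Date("2020-12-01") + 0:9)
toy$cum_cases <- 4000 + 250 * (0:9)

fit <- lm(cum_cases ~ date, data = toy)

# The slope is per day because a Date is numeric days under the hood.
slope <- unname(coef(fit)["date"])

# The intercept is the fitted value at day 0, i.e. 1970-01-01.
intercept <- unname(coef(fit)["(Intercept)"])
day0_value <- 4000 - 250 * as.numeric(as.Date("2020-12-01"))

# Predict one week past the last observed date.
new_day <- data.frame(date = max(toy$date) + 7)
pred <- unname(predict(fit, newdata = new_day))

print(c(slope = slope, intercept = intercept, pred = pred))
```

Read the same way, the estimated slope of 2.487e+02 in the summary above corresponds to roughly 249 additional cumulative cases per day over this period.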