What is this?

I have completed a Data Scientist with R course and want to practice applying what I recently learned to the COVID-19 situation in Thailand.

library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("ggplot2")

Import the data with the code chunk below.

library(readr)
covid <- read_csv("https://raw.githubusercontent.com/RamiKrispin/coronavirus/master/csv/coronavirus.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────────────────────────────
## cols(
##   date = col_date(format = ""),
##   province = col_character(),
##   country = col_character(),
##   lat = col_double(),
##   long = col_double(),
##   type = col_character(),
##   cases = col_double()
## )
covid

Exploring the data

The structure and summary of the covid dataset are shown below.

str(covid)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 355887 obs. of  7 variables:
##  $ date    : Date, format: "2020-01-22" "2020-01-22" ...
##  $ province: chr  NA NA NA NA ...
##  $ country : chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ lat     : num  33.9 41.2 28 42.5 -11.2 ...
##  $ long    : num  67.71 20.17 1.66 1.52 17.87 ...
##  $ type    : chr  "confirmed" "confirmed" "confirmed" "confirmed" ...
##  $ cases   : num  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   date = col_date(format = ""),
##   ..   province = col_character(),
##   ..   country = col_character(),
##   ..   lat = col_double(),
##   ..   long = col_double(),
##   ..   type = col_character(),
##   ..   cases = col_double()
##   .. )
summary(covid)
##       date              province           country               lat        
##  Min.   :2020-01-22   Length:355887      Length:355887      Min.   :-51.80  
##  1st Qu.:2020-05-11   Class :character   Class :character   1st Qu.:  4.86  
##  Median :2020-08-29   Mode  :character   Mode  :character   Median : 21.51  
##  Mean   :2020-08-29                                         Mean   : 20.07  
##  3rd Qu.:2020-12-17                                         3rd Qu.: 40.14  
##  Max.   :2021-04-06                                         Max.   : 71.71  
##                                                             NA's   :882     
##       long             type               cases          
##  Min.   :-178.12   Length:355887      Min.   : -88158.0  
##  1st Qu.: -14.45   Class :character   1st Qu.:      0.0  
##  Median :  21.75   Mode  :character   Median :      0.0  
##  Mean   :  24.79                      Mean   :    609.1  
##  3rd Qu.:  85.24                      3rd Qu.:     30.0  
##  Max.   : 178.06                      Max.   :1123456.0  
##  NA's   :882

Thailand dataset

First, check whether there are any observations from Thailand.

any(covid$country=="Thailand")
## [1] TRUE
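
One thing to keep in mind with this kind of check: if the country column contained missing values and the country were absent, `any(x == "...")` would return NA rather than FALSE. A minimal sketch on made-up data, assuming a tiny stand-in table (`toy` and its values are hypothetical, not taken from the real dataset), shows the difference and how `%in%` avoids it:

```r
library(dplyr)

# Hypothetical stand-in for the covid table; values are made up
toy <- tibble(
  country = c("Thailand", "Thailand", "Thailand", "Laos", NA),
  type    = c("confirmed", "death", "recovered", "confirmed", "confirmed")
)

# %in% never returns NA, so it is a safe membership test
has_thailand <- "Thailand" %in% toy$country
has_japan    <- "Japan" %in% toy$country

# == propagates the NA in country, so any() cannot decide
any_japan <- any(toy$country == "Japan")

# Count how many rows each case type has for Thailand
thai_types <- toy %>%
  filter(country == "Thailand") %>%
  count(type)
```

Counting by type before filtering is also a quick way to see which categories exist for a country.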

Next, we can filter for only confirmed cases in Thailand, and explore them.

covid_th <- covid %>%
  filter(country == "Thailand",type=="confirmed")
summary(covid_th)
##       date              province           country               lat       
##  Min.   :2020-01-22   Length:441         Length:441         Min.   :15.87  
##  1st Qu.:2020-05-11   Class :character   Class :character   1st Qu.:15.87  
##  Median :2020-08-29   Mode  :character   Mode  :character   Median :15.87  
##  Mean   :2020-08-29                                         Mean   :15.87  
##  3rd Qu.:2020-12-17                                         3rd Qu.:15.87  
##  Max.   :2021-04-06                                         Max.   :15.87  
##       long         type               cases        
##  Min.   :101   Length:441         Min.   : -10.00  
##  1st Qu.:101   Class :character   1st Qu.:   2.00  
##  Median :101   Mode  :character   Median :   7.00  
##  Mean   :101                      Mean   :  67.05  
##  3rd Qu.:101                      3rd Qu.:  50.00  
##  Max.   :101                      Max.   :1732.00

From the summary above, I notice that at least one observation has a negative number of cases, which cannot be a real count of new daily cases. I will therefore filter those observations out and plot cases against date.

covid_th <- covid_th %>%
  filter(cases >= 0) %>%
  select(date,country,cases)
summary(covid_th)
##       date              country              cases        
##  Min.   :2020-01-22   Length:437         Min.   :   0.00  
##  1st Qu.:2020-05-10   Class :character   1st Qu.:   2.00  
##  Median :2020-08-28   Mode  :character   Median :   7.00  
##  Mean   :2020-08-28                      Mean   :  67.72  
##  3rd Qu.:2020-12-18                      3rd Qu.:  51.00  
##  Max.   :2021-04-06                      Max.   :1732.00
ggplot(covid_th,aes(x=date,y=cases)) +
  geom_line() +
  ylab("Confirmed cases in Thailand")
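
As a side note on the negative values: it can help to look at the affected rows before discarding them, since negative daily counts in datasets like this are often corrections to earlier reports rather than random errors (that is my assumption here, not something I verified). The sketch below uses made-up data; `toy`, `negative_rows` and `toy_na` are hypothetical names and not part of the analysis above.

```r
library(dplyr)

# Hypothetical stand-in for covid_th; the values are made up
toy <- tibble(
  date  = as.Date("2021-01-01") + 0:4,
  cases = c(5, -2, 7, 0, 3)
)

# The rows that filter(cases >= 0) would drop
negative_rows <- toy %>%
  filter(cases < 0)

# An alternative to dropping: keep every date, mark the count as missing
toy_na <- toy %>%
  mutate(cases = if_else(cases < 0, NA_real_, cases))
```

Keeping the row with NA preserves an unbroken daily sequence, which may matter for later steps that assume one row per day.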

Now, let’s add a new column for cumulative cases, draw a line plot of cum_cases against date, and overlay a linear regression line with the code chunk below.

covid_th_cum <- covid_th %>%
  mutate(cum_cases = cumsum(cases))
summary(covid_th_cum)
##       date              country              cases           cum_cases    
##  Min.   :2020-01-22   Length:437         Min.   :   0.00   Min.   :    4  
##  1st Qu.:2020-05-10   Class :character   1st Qu.:   2.00   1st Qu.: 3009  
##  Median :2020-08-28   Mode  :character   Median :   7.00   Median : 3412  
##  Mean   :2020-08-28                      Mean   :  67.72   Mean   : 6968  
##  3rd Qu.:2020-12-18                      3rd Qu.:  51.00   3rd Qu.: 4352  
##  Max.   :2021-04-06                      Max.   :1732.00   Max.   :29592
ggplot(covid_th_cum, aes(x=date,y=cum_cases)) +
  geom_line() +
  geom_smooth(method="lm",se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
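
`cumsum()` adds values in row order, so the cumulative column is only meaningful when the rows are sorted by date. The data here appears to be in date order already, but a small sketch on made-up data (the `toy` table is hypothetical) shows what sorting first guards against:

```r
library(dplyr)

# Hypothetical daily counts, deliberately out of date order
toy <- tibble(
  date  = as.Date(c("2021-01-03", "2021-01-01", "2021-01-02")),
  cases = c(30, 10, 20)
)

# Without sorting, the running total follows row order, not time
unsorted <- toy %>%
  mutate(cum_cases = cumsum(cases))

# Sorting by date first gives the running total over time
sorted <- toy %>%
  arrange(date) %>%
  mutate(cum_cases = cumsum(cases))
```

Both versions end at the same total, but only the sorted one gives a curve that never decreases when plotted against date.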

Since the end of 2020, cases in Thailand have skyrocketed. Let’s focus only on the data from December 2020 onward.

covid_th_dec <- covid_th_cum %>%
  filter(date >= as.Date("2020-12-01"))  # compare Date with Date explicitly
ggplot(covid_th_dec, aes(x=date,y=cum_cases)) +
  geom_line() +
  geom_smooth(method="lm",se=FALSE) 
## `geom_smooth()` using formula 'y ~ x'

Let’s take a closer look at the peak.

peak <- covid_th_dec %>%
  filter(cases == max(cases))
peak

On January 29, 2021, Thailand reached its peak of 1,732 new confirmed cases in a single day, due to the Samutsakhon cluster.

ggplot(covid_th_dec, aes(x=date,y=cum_cases)) +
  geom_line() +
  geom_smooth(method="lm",se=FALSE) +
  geom_vline(aes(xintercept=date),data=peak,linetype="dashed") +
  geom_text(aes(x=date,label="Samutsakhon cluster"),data=peak, y=30000)
## `geom_smooth()` using formula 'y ~ x'

Let’s build a linear model from the ‘covid_th_dec’ dataset to see the trend.

covid_model <- lm(cum_cases~date, data=covid_th_dec)
summary(covid_model)
## 
## Call:
## lm(formula = cum_cases ~ date, data = covid_th_dec)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3715.1 -1619.3  -934.9  1771.1  4771.6 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.623e+06  1.039e+05  -44.51   <2e-16 ***
## date         2.487e+02  5.566e+00   44.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2300 on 125 degrees of freedom
## Multiple R-squared:  0.9411, Adjusted R-squared:  0.9406 
## F-statistic:  1996 on 1 and 125 DF,  p-value: < 2.2e-16

The model seems to explain the situation since December 2020 well (R-squared of about 0.94, with cumulative cases growing by roughly 249 per day), but I hope that this upward trend will not continue.
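
To read the coefficients: `lm()` treats a Date as the number of days since 1970-01-01, so the date coefficient is in cases per day, and the large negative intercept is just the fitted line extended back to 1970, which has no practical meaning. Here is a minimal sketch on made-up data (`toy` and `toy_model` are hypothetical and grow by exactly 10 cases per day) showing how the slope can be pulled out and used for a prediction:

```r
# Hypothetical data: cumulative cases growing by exactly 10 per day
toy <- data.frame(date = as.Date("2021-01-01") + 0:9)
toy$cum_cases <- 100 + 10 * (0:9)

toy_model <- lm(cum_cases ~ date, data = toy)

# Slope of the fitted line, in cases per day
slope_per_day <- unname(coef(toy_model)["date"])

# Predict cumulative cases one week after the last observed date
new_dates <- data.frame(date = max(toy$date) + 7)
predicted <- unname(predict(toy_model, newdata = new_dates))
```

The same `predict()` call should work with `covid_model`, although extrapolating a straight line far beyond the observed dates is unlikely to be reliable for an epidemic curve.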

Data source

https://github.com/RamiKrispin/coronavirus

THANK YOU :)