Fetching data
# coronavirus::refresh_coronavirus_jhu() -> dataframe
Loading data
readr::read_csv("C:/Users/Asus/Documents/R Pubs/R Pubs/alltis/alltis.csv") -> alltis
## Warning: Missing column names filled in: 'X1' [1]
##
## -- Column specification --------------------------------------------------------
## cols(
## X1 = col_double(),
## ds = col_date(format = ""),
## location = col_character(),
## location_type = col_character(),
## location_code = col_character(),
## location_code_type = col_character(),
## data_type = col_character(),
## value = col_double(),
## lat = col_double(),
## long = col_double()
## )
library(magrittr) # to use %>% (pipe operator)
names(alltis)
## [1] "X1" "ds" "location"
## [4] "location_type" "location_code" "location_code_type"
## [7] "data_type" "value" "lat"
## [10] "long"
alltis$X1 <- NULL
knitr::kable(head(alltis,5))
| ds | location | location_type | location_code | location_code_type | data_type | value | lat | long |
|---|---|---|---|---|---|---|---|---|
| 2020-06-25 | Afghanistan | country | AF | iso_3166_2 | deaths_new | 36 | 33.93911 | 67.70995 |
| 2020-02-20 | Afghanistan | country | AF | iso_3166_2 | cases_new | 0 | 33.93911 | 67.70995 |
| 2020-03-23 | Afghanistan | country | AF | iso_3166_2 | deaths_new | 0 | 33.93911 | 67.70995 |
| 2020-02-18 | Afghanistan | country | AF | iso_3166_2 | cases_new | 0 | 33.93911 | 67.70995 |
| 2020-02-19 | Afghanistan | country | AF | iso_3166_2 | cases_new | 0 | 33.93911 | 67.70995 |
dplyr::glimpse(alltis)
## Rows: 181,335
## Columns: 9
## $ ds <date> 2020-06-25, 2020-02-20, 2020-03-23, 2020-02-18, 20~
## $ location <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afgha~
## $ location_type <chr> "country", "country", "country", "country", "countr~
## $ location_code <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF~
## $ location_code_type <chr> "iso_3166_2", "iso_3166_2", "iso_3166_2", "iso_3166~
## $ data_type <chr> "deaths_new", "cases_new", "deaths_new", "cases_new~
## $ value <dbl> 36, 0, 0, 0, 0, 37, 10, 0, 100, 24, 0, 0, 0, 0, 0, ~
## $ lat <dbl> 33.93911, 33.93911, 33.93911, 33.93911, 33.93911, 3~
## $ long <dbl> 67.70995, 67.70995, 67.70995, 67.70995, 67.70995, 6~
Missing Data
colSums(is.na(alltis))
## ds location location_type location_code
## 0 0 0 3003
## location_code_type data_type value lat
## 2310 0 0 0
## long
## 0
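To see where missing codes come from, it can help to list the locations whose `location_code` is `NA`. Below is a minimal sketch on a made-up data frame (the `toy` object and its values are hypothetical, not rows from `alltis`). One possibility worth checking, which is an assumption and not verified here, is that a two-letter code such as Namibia's `NA` was read as a missing value, because `readr::read_csv()` treats the string "NA" as missing by default.

```r
# Hypothetical toy data; not taken from alltis
toy <- data.frame(
  location      = c("A", "B", "C"),
  location_code = c("AA", NA, "CC"),
  value         = c(1, 2, NA),
  stringsAsFactors = FALSE
)

# Number of missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Locations whose code is missing
missing_code <- unique(toy$location[is.na(toy$location_code)])
print(missing_code)
```

On the real data, the same two steps applied to `alltis` would show which locations account for the missing codes.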
The `alltis` data frame has 181,335 observations (rows), i.e. 1 lakh 81 thousand 335, and 9 variables (columns): ds, location, location_type, location_code, location_code_type, data_type, value, lat, long.
Everything looks normal at a cursory glance, apart from the missing values in location_code and location_code_type. I just have to make sure the date is in the correct format, i.e. ymd (year-month-day).
I am not interested in studying every variable in this data frame, so before starting I will use `filter()` and `select()` (from the dplyr package) to keep only the variables and observations I am interested in.
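As a small illustration of the date step, here is a base R sketch with hand-typed example strings. `as.Date()` parses year-month-day strings directly, and a column that already has class Date passes through unchanged.

```r
# Character dates in year-month-day order (example values)
raw <- c("2020-06-25", "2020-02-20")

# as.Date() parses ISO "YYYY-MM-DD" strings without a format argument
d <- as.Date(raw)
print(class(d))
print(range(d))

# Other layouts need an explicit format string
d2 <- as.Date("25/06/2020", format = "%d/%m/%Y")
print(d2)
```

The column specification printed by `read_csv()` already lists `ds` as a date, so for this file the conversion mainly acts as a safety check.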
Loading libraries
library(dplyr, warn.conflicts = FALSE) # data wrangling
library(lubridate, warn.conflicts = FALSE) # working with dates
library(ggplot2) # data visualization
library(ggthemes) # extra themes and color scales
library(prophet) # forecasting
theme_set(theme_bw()) # all graphs will follow the same theme automatically
options(scipen = 999) # no scientific notation
Filtering data for India and removing observations with a value of 0
# alltis %>% rename(ds = date) -> alltis
alltis$ds <- as.Date(alltis$ds) # make sure ds is of class Date
alltis$ds <- ymd(alltis$ds) # year-month-day
data <- alltis # copying `alltis` to `data`
data <- data %>% dplyr::filter(location == "India") %>%
  select(ds, location, data_type, value) %>%
  filter(value != 0)
data <- as_tibble(data) # converting to tibble format
data %>% head() %>% kableExtra::kable()
| ds | location | data_type | value |
|---|---|---|---|
| 2020-05-09 | India | cases_new | 3113 |
| 2020-05-03 | India | cases_new | 2806 |
| 2020-05-07 | India | cases_new | 3364 |
| 2020-04-27 | India | cases_new | 1561 |
| 2020-04-28 | India | cases_new | 1873 |
| 2020-05-04 | India | cases_new | 3932 |
How many types of cases are there in our data?
What is the time period of the observations?
How many observations are left after filtering?
unique(data$data_type); range(data$ds); dim(data)
## [1] "cases_new" "recovered_new" "deaths_new"
## [1] "2020-01-30" "2020-09-08"
## [1] 543 4
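A per-type row count is a natural follow-up to these three checks. The sketch below uses base R `table()` on three rows copied from the tables in this post; the `toy` object is only for illustration.

```r
# Three rows copied from the India tables in this post
toy <- data.frame(
  ds        = as.Date(c("2020-09-08", "2020-09-08", "2020-09-07")),
  data_type = c("cases_new", "deaths_new", "cases_new"),
  value     = c(89706, 1115, 75809),
  stringsAsFactors = FALSE
)

# Number of rows for each data_type
counts <- table(toy$data_type)
print(counts)

# Time span covered by the rows
span <- range(toy$ds)
print(span)
```

Applied to the filtered `data`, `table(data$data_type)` would show how the 543 rows are split across the three types.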
# arranging data in decreasing order of the date column
data <- data %>% arrange(desc(ds))
# first 10 rows
pander::pander(data[1:10,])
| ds | location | data_type | value |
|---|---|---|---|
| 2020-09-08 | India | recovered_new | 74894 |
| 2020-09-08 | India | cases_new | 89706 |
| 2020-09-08 | India | deaths_new | 1115 |
| 2020-09-07 | India | recovered_new | 73521 |
| 2020-09-07 | India | cases_new | 75809 |
| 2020-09-07 | India | deaths_new | 1133 |
| 2020-09-06 | India | recovered_new | 69564 |
| 2020-09-06 | India | cases_new | 90802 |
| 2020-09-06 | India | deaths_new | 1016 |
| 2020-09-05 | India | recovered_new | 73642 |
Data Visualization
ggplot(data, aes(ds, value, col = data_type)) +
  geom_line() + geom_point() +
  labs(x = "Months (2020)", y = "New Cases / Day", title = "India", subtitle = "Condition of new cases of COVID-19 in India") + theme_linedraw()
- See the effect of lockdown.
- New corona cases increased with the termination of lockdown.
- The numbers of cases_new and recovered_new are close, i.e. the death rate is about 2%.
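A ratio like this can be checked directly from the data. The sketch below uses the three most recent days listed in the table above. It is a crude same-day ratio of new deaths to new cases, not a case fatality rate, and other time windows will give other values.

```r
# cases_new and deaths_new for 2020-09-08, 2020-09-07 and 2020-09-06,
# copied from the table above
cases  <- c(89706, 75809, 90802)
deaths <- c(1115, 1133, 1016)

# New deaths as a percentage of new cases over these three days
ratio_pct <- round(100 * sum(deaths) / sum(cases), 2)
print(ratio_pct)
```

Running the same calculation over the whole period, or on cumulative totals, is the more usual way to quote a death rate.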
scale_y_log10()
ggplot(data, aes(ds, value, col = data_type)) +
  geom_line(size = 1) + geom_point(col = "black", size = .9) +
  labs(x = "Months (2020)", y = "New Cases (log10)", title = "India",
       subtitle = "Condition of new cases of COVID-19") +
  theme_classic() +
  scale_y_log10()
Now we can see a sharp increase in the number of deaths.
The recovery rate is improving.
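The log10 axis is what makes the deaths curve readable next to the much larger case counts. Below is a short base R sketch using two values from the table for 2020-09-08. It also shows why rows with a value of 0 have to be dropped before plotting on this scale.

```r
# deaths_new and cases_new on 2020-09-08, from the table above
vals <- c(deaths_new = 1115, cases_new = 89706)

# On a log10 scale the two series sit about two units apart
logged <- log10(vals)
print(round(logged, 2))
gap <- unname(logged["cases_new"] - logged["deaths_new"])
print(round(gap, 2))

# Zero has no finite logarithm, so zero rows cannot be drawn on a log axis
zero_log <- log10(0)
print(zero_log)
```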
Forecasting
New corona cases up to the next 31 days
Note: the prophet library is a general time-series forecasting tool, not an epidemiological model.
Renaming value to y (the dependent variable) and filtering data_type: cases_new. Making a new data frame for data_type: cases_new, named newcases.
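prophet expects its input data frame to have a date column named `ds` and a numeric column named `y`. Here is a base R sketch of the rename and filter, on three rows copied from the tables above (the `toy` and `newcases_toy` names are only for illustration):

```r
# Three rows copied from the India tables in this post
toy <- data.frame(
  ds        = as.Date(c("2020-09-08", "2020-09-08", "2020-09-07")),
  data_type = c("cases_new", "deaths_new", "cases_new"),
  value     = c(89706, 1115, 75809),
  stringsAsFactors = FALSE
)

# Rename value to y, the column name prophet looks for
names(toy)[names(toy) == "value"] <- "y"

# Keep one series only: a prophet model is fitted to a single time series
newcases_toy <- toy[toy$data_type == "cases_new", ]
print(newcases_toy)
```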
data %>% rename(y= value) ->data
#filtering
newcases <- data %>% dplyr::filter(data_type=="cases_new")
# first 4 rows
pander::pander(newcases[1:4,])
| ds | location | data_type | y |
|---|---|---|---|
| 2020-09-08 | India | cases_new | 89706 |
| 2020-09-07 | India | cases_new | 75809 |
| 2020-09-06 | India | cases_new | 90802 |
| 2020-09-05 | India | cases_new | 90632 |
Forecasting the new cases/day for the next 31 days
m <- prophet(df= newcases) # model to predict
futuredf1 <- make_future_dataframe(m = m, periods = 31)
forecast1 <- predict(m, futuredf1) # predicting
# Plotting
dyplot.prophet(m, forecast1)
Result
- This forecast shows that about 1 lakh 16 thousand (116,000) new cases will be added on 09 Oct.
This is an interactive graph: you can move your cursor over the line and see the actual and predicted value for any day.
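The 9 October date follows from the data ending on 2020-09-08 and `periods = 31`. Below is a base R sketch of that date arithmetic only; it does not reproduce prophet's internals.

```r
# Last observed date in the India data
last_obs <- as.Date("2020-09-08")

# 31 daily steps after the last observation
future <- seq(last_obs + 1, by = "day", length.out = 31)
print(range(future))
```

On the fitted object, `tail(forecast1[c("ds", "yhat", "yhat_lower", "yhat_upper")], 1)` should show the point forecast and its uncertainty interval for that last day.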
Probability to get infected
prophet::prophet_plot_components(m, forecast1)
Monday is the safest day and Wednesday is the most dangerous (hehe, kidding).
The least data is uploaded on Mondays, but the good thing is that reporting is done well on all other days of the week. They are doing good work!
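The weekday pattern can also be looked at without a model, by averaging the counts per day of the week. The sketch below uses the four most recent cases_new values from the table above, which is far too few for a real conclusion, and `%u` for a locale-independent weekday number.

```r
# Four most recent cases_new rows for India, from the table above
recent <- data.frame(
  ds    = as.Date(c("2020-09-08", "2020-09-07", "2020-09-06", "2020-09-05")),
  value = c(89706, 75809, 90802, 90632)
)

# ISO weekday number: 1 = Monday, ..., 7 = Sunday
recent$wday <- as.integer(format(recent$ds, "%u"))

# Mean value per weekday, then the weekday with the lowest mean
by_day <- aggregate(value ~ wday, data = recent, FUN = mean)
lowest <- by_day$wday[which.min(by_day$value)]
print(by_day)
print(lowest)
```

With the full `newcases` series in place of `recent`, the same aggregation gives a rough cross-check of the weekly component plot.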
Forecasting deaths/day for the next 31 days
#filtering
newdeath <- data %>% dplyr::filter(data_type=="deaths_new")
# first 4 rows
pander::pander(head(newdeath,4))
| ds | location | data_type | y |
|---|---|---|---|
| 2020-09-08 | India | deaths_new | 1115 |
| 2020-09-07 | India | deaths_new | 1133 |
| 2020-09-06 | India | deaths_new | 1016 |
| 2020-09-05 | India | deaths_new | 1065 |
m <- prophet(newdeath)
futuredf2 <- make_future_dataframe(m = m, periods = 31)
forecast2 <- predict(m, futuredf2)
dyplot.prophet(m, forecast2)
Note: You can see the forecast by moving the cursor over the line.
Probability of death…
prophet_plot_components(m, forecast2)
- Most deaths on Tuesday, least on Monday (or most of the death data is updated on Tuesday).
Forecasting Recovery/Day
#filtering
newrecovery <- data %>% dplyr::filter(data_type=="recovered_new")
# model making and forecasting
m <- prophet(df = newrecovery)
futuredf3 <- make_future_dataframe(m = m, periods = 31)
forecast3 <- predict(m, futuredf3)
dyplot.prophet(m, forecast3)
prophet::prophet_plot_components(m, forecast3)
India, US, Russia, new cases
data <- alltis
data %>% dplyr::filter(data_type == "cases_new",
                       location %in% c("India", "US", "Russia")) %>% # %in% keeps every row for these countries
  ggplot(aes(ds, value)) +
  geom_line(aes(col = location, linetype = location), size = 1) +
  scale_color_calc() + geom_point(aes(col = location), size = 5) +
  labs(x = "Year 2020", y = "New cases/ Day",
       title = "India, Russia, US")
- Cases are increasing in India, cases in America have started decreasing now, and the situation in Russia is stable.
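When filtering for several locations at once, `==` and `%in%` behave differently. `==` compares element by element and recycles the shorter vector, so it silently drops rows. `%in%` tests membership. Here is a base R sketch on a made-up vector:

```r
# Made-up location column with six rows
location <- c("India", "US", "Russia", "US", "Russia", "India")
wanted   <- c("India", "US", "Russia")

# == recycles `wanted` and compares position by position
eq_match <- location == wanted
# %in% asks whether each element appears anywhere in `wanted`
in_match <- location %in% wanted

print(sum(eq_match))
print(sum(in_match))
```

Every row belongs to one of the three countries, yet `==` keeps only the rows that happen to line up with the recycled vector.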
Top 7 countries in Corona Cases per Day
data <- alltis
data %>% group_by(location) %>%
  summarise(value = sum(value)) %>% # no data_type filter here, so this sums cases, recoveries and deaths together
  ungroup() %>% arrange(desc(value)) %>%
  top_n(20) %>% rename(new_cases = value) -> n
# plotting barplot for top 7 countries in `new corona cases/Day`
n %>% top_n(7) %>%
  ggplot(aes(reorder(location, desc(new_cases)), new_cases, fill = location)) + geom_col(col = "black", size = 2) +
  scale_fill_calc() +
  labs(x = "Countries", y = "New cases/Day", caption = Sys.Date())
pander::pander(n)
| location | new_cases |
|---|---|
| US | 8875773 |
| Brazil | 7861958 |
| India | 7842862 |
| Russia | 1898039 |
| Mexico | 1246485 |
| Peru | 1243802 |
| South Africa | 1223256 |
| Colombia | 1222161 |
| Argentina | 877029 |
| Chile | 834953 |
| Iran | 751068 |
| Spain | 714483 |
| Saudi Arabia | 624620 |
| Pakistan | 592524 |
| Bangladesh | 561612 |
| Turkey | 543297 |
| Italy | 526517 |
| Germany | 491672 |
| Iraq | 483559 |
| France | 460426 |
# extracting the names of top 7 countries
n %>% top_n(7) -> n
## Selecting by new_cases
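A ranking of this kind depends on which `data_type` rows go into the sum. The base R sketch below uses a made-up data frame (locations "A" and "B" and all values are hypothetical). It compares a total over every row with a total over `cases_new` rows only, and the two give different leaders.

```r
# Hypothetical toy data; not taken from alltis
toy <- data.frame(
  location  = c("A", "A", "A", "B", "B", "B"),
  data_type = c("cases_new", "recovered_new", "cases_new",
                "cases_new", "recovered_new", "cases_new"),
  value     = c(10, 50, 20, 25, 1, 15),
  stringsAsFactors = FALSE
)

# Sum over every row, regardless of data_type
all_types <- aggregate(value ~ location, data = toy, FUN = sum)

# Sum over cases_new rows only, then rank in decreasing order
cases_only <- aggregate(value ~ location,
                        data = toy[toy$data_type == "cases_new", ],
                        FUN = sum)
ranked <- cases_only[order(-cases_only$value), ]

top_all   <- as.character(all_types$location[which.max(all_types$value)])
top_cases <- as.character(ranked$location[1])
print(top_all)
print(top_cases)
```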
Trends of top 7 countries regarding new cases of COVID-19
scale_y_log10()
# removing all observations with a value of 0
data %>% dplyr::filter(value != 0) -> data
loc <- n$location
loc[1:7] # top 7 locations in new corona cases
## [1] "US" "Brazil" "India" "Russia" "Mexico"
## [6] "Peru" "South Africa"
data %>% dplyr::filter(location %in% c("US","Brazil","India","Russia","Mexico","Peru","South Africa")) %>% dplyr::filter(data_type == "cases_new") %>% ggplot(aes(ds, value)) +
  geom_line(aes(col = location, linetype = location), size = 1) +
  geom_point(aes(col = location), size = 2) + scale_y_log10() +
  scale_color_calc() +
  labs(x = "Year 2020", y = "New_cases/day (log 10)")
Conclusions
Cases are continuously increasing in India, but we have to keep in mind that India’s population is very large. In terms of cases per million population, India is still doing comparatively well.
Corona cases have started to grow quickly, and if they are not brought under control soon, the situation could become very serious.
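The per-million comparison is a simple rescaling: cases divided by population, times one million. The sketch below uses the cases_new value for 2020-09-08 from the table above. The population figure of about 1.38 billion is a rough 2020 estimate used as an assumption, and `per_million()` is a helper defined here for illustration.

```r
# Helper for illustration: cases per one million inhabitants
per_million <- function(cases, population) {
  cases / population * 1e6
}

india_new_cases  <- 89706    # cases_new on 2020-09-08, from the table above
india_population <- 1.38e9   # assumption: rough 2020 population estimate

rate <- per_million(india_new_cases, india_population)
print(round(rate, 1))
```

Comparing countries this way needs a population figure for each one, which `alltis` does not contain.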
Saving alltis in .csv, .sav, and .txt formats.
# CSV format (opens in Excel)
# write.csv(alltis, "C:/Users/Alok Pratap Singh/Documents/R Pubs/R Pubs/alltis.csv")
# .sav (SPSS) format
# haven::write_sav(alltis, "C:/Users/Alok Pratap Singh/Documents/R Pubs/R Pubs/alltis.sav")
# .txt format
# write.table(alltis, "C:/Users/Alok Pratap Singh/Documents/R Pubs/R Pubs/alltis.txt")
Thank You
Regards
Please visit my profile
Alok Pratap Singh (Research Scholar)
Linkedin (Open in New TAB)
Department of Psychology
University of Allahabad
Click on download alltis.rar and unrar to get data in .csv, .sav, & .txt format. Please open the download link in new tab
Without data you’re just another person with an opinion.