Fetching data
# coronavirus::refresh_coronavirus_jhu() -> dataframe
Loading data
readr::read_csv("C:/Users/Asus/Documents/R Pubs/R Pubs/alltis/alltis.csv") -> alltis
## Warning: Missing column names filled in: 'X1' [1]
##
## -- Column specification --------------------------------------------------------
## cols(
## X1 = col_double(),
## ds = col_date(format = ""),
## location = col_character(),
## location_type = col_character(),
## location_code = col_character(),
## location_code_type = col_character(),
## data_type = col_character(),
## value = col_double(),
## lat = col_double(),
## long = col_double()
## )
library(magrittr) # to use %>% (pipe operator)
names(alltis)
## [1] "X1" "ds" "location"
## [4] "location_type" "location_code" "location_code_type"
## [7] "data_type" "value" "lat"
## [10] "long"
alltis$X1 <- NULL
knitr::kable(head(alltis,5))
ds | location | location_type | location_code | location_code_type | data_type | value | lat | long |
---|---|---|---|---|---|---|---|---|
2020-06-25 | Afghanistan | country | AF | iso_3166_2 | deaths_new | 36 | 33.93911 | 67.70995 |
2020-02-20 | Afghanistan | country | AF | iso_3166_2 | cases_new | 0 | 33.93911 | 67.70995 |
2020-03-23 | Afghanistan | country | AF | iso_3166_2 | deaths_new | 0 | 33.93911 | 67.70995 |
2020-02-18 | Afghanistan | country | AF | iso_3166_2 | cases_new | 0 | 33.93911 | 67.70995 |
2020-02-19 | Afghanistan | country | AF | iso_3166_2 | cases_new | 0 | 33.93911 | 67.70995 |
dplyr::glimpse(alltis)
## Rows: 181,335
## Columns: 9
## $ ds <date> 2020-06-25, 2020-02-20, 2020-03-23, 2020-02-18, 20~
## $ location <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afgha~
## $ location_type <chr> "country", "country", "country", "country", "countr~
## $ location_code <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF~
## $ location_code_type <chr> "iso_3166_2", "iso_3166_2", "iso_3166_2", "iso_3166~
## $ data_type <chr> "deaths_new", "cases_new", "deaths_new", "cases_new~
## $ value <dbl> 36, 0, 0, 0, 0, 37, 10, 0, 100, 24, 0, 0, 0, 0, 0, ~
## $ lat <dbl> 33.93911, 33.93911, 33.93911, 33.93911, 33.93911, 3~
## $ long <dbl> 67.70995, 67.70995, 67.70995, 67.70995, 67.70995, 6~
Missing Data
colSums(is.na(alltis))
## ds location location_type location_code
## 0 0 0 3003
## location_code_type data_type value lat
## 2310 0 0 0
## long
## 0
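The per-column count above comes from `colSums(is.na(...))`: `is.na()` returns a logical value for every cell and `colSums()` adds up the `TRUE`s in each column. A minimal sketch on a made-up three-row data frame (the `toy` object and its values are hypothetical, not taken from alltis):

```r
# Hypothetical toy data frame; values are illustrative only
toy <- data.frame(
  location      = c("Afghanistan", "India", "US"),
  location_code = c("AF", NA, NA),
  value         = c(36, 0, NA),
  stringsAsFactors = FALSE
)

# is.na() gives TRUE/FALSE per cell; colSums() counts TRUE per column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Share of missing cells per column, as a percentage
missing_pct <- round(100 * colMeans(is.na(toy)), 1)
print(missing_pct)
```

The percentage version is often easier to read than raw counts when the data frame has many rows.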
The alltis data frame has 181,335 observations (rows) and 9 variables (columns): ds, location, location_type, location_code, location_code_type, data_type, value, lat, and long. Missing values occur only in location_code (3,003) and location_code_type (2,310).
Everything looks normal at a cursory look. I just have to make sure the date is in the correct format, i.e. ymd (year-month-day). I am not interested in studying all the variables provided in this data frame, so before starting I have to use the filter and select commands (in the dplyr package) to keep only the variables and observations that I am interested in.
Loading libraries
library(dplyr, warn.conflicts = F) # data wrangling
library(lubridate, warn.conflicts = F) # works with dates
library(ggplot2) # for data visualization
library(ggthemes) # extra themes and color scales
library(prophet) # forecasting
theme_set(theme_bw()) # all graphs will follow the same theme automatically
options(scipen = 999) # no scientific notation
Filtering data for India, removing observations containing 0
# alltis %>% rename(ds= date) ->alltis
alltis$ds <- as.Date(alltis$ds) # making sure "ds" is a Date
alltis$ds <- ymd(alltis$ds) # year-month-day
data <- alltis # duplicating `alltis` to `data`
data = data %>% dplyr::filter(location== "India") %>%
  select(ds, location, data_type, value) %>%
  filter(value!=0)
data <- as_tibble(data) # converting to tibble format
data %>% head() %>% kableExtra::kable()
ds | location | data_type | value |
---|---|---|---|
2020-05-09 | India | cases_new | 3113 |
2020-05-03 | India | cases_new | 2806 |
2020-05-07 | India | cases_new | 3364 |
2020-04-27 | India | cases_new | 1561 |
2020-04-28 | India | cases_new | 1873 |
2020-05-04 | India | cases_new | 3932 |
How many types of cases are there in our data?
What is the time period of the observations?
How many observations are left after filtering?
unique(data$data_type);range(data$ds); dim(data)
## [1] "cases_new" "recovered_new" "deaths_new"
## [1] "2020-01-30" "2020-09-08"
## [1] 543 4
# arranging data in decreasing order of the date column
data <- data %>% arrange(desc(ds))
# top 10 rows
pander::pander(data[1:10,])
ds | location | data_type | value |
---|---|---|---|
2020-09-08 | India | recovered_new | 74894 |
2020-09-08 | India | cases_new | 89706 |
2020-09-08 | India | deaths_new | 1115 |
2020-09-07 | India | recovered_new | 73521 |
2020-09-07 | India | cases_new | 75809 |
2020-09-07 | India | deaths_new | 1133 |
2020-09-06 | India | recovered_new | 69564 |
2020-09-06 | India | cases_new | 90802 |
2020-09-06 | India | deaths_new | 1016 |
2020-09-05 | India | recovered_new | 73642 |
Data Visualization
ggplot(data, aes(ds, value, col= data_type)) +
  geom_line() + geom_point() +
  labs(x= "Months (2020)", y= "New Cases/ Day", title = "India", subtitle = "Condition of new cases of COVID-19 In India") + theme_linedraw()
- See the effect of lockdown.
- New corona cases increased with the termination of lockdown.
- The numbers of cases_new and recovered_new are close, i.e. the death rate is about 2 %.
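As a rough check on the death-rate remark, the ratio of new deaths to new cases can be computed for a single day. The sketch below uses the three rows for 2020-09-08 from the table above; the variable names are my own, and a one-day ratio is only a crude indicator, not a cumulative fatality rate.

```r
# Values for India on 2020-09-08, copied from the table above
cases_new     <- 89706
deaths_new    <- 1115
recovered_new <- 74894

# New deaths as a percentage of new cases on that day
death_pct <- 100 * deaths_new / cases_new
# New recoveries as a percentage of new cases on that day
recovered_pct <- 100 * recovered_new / cases_new

round(c(death_pct = death_pct, recovered_pct = recovered_pct), 2)
```

Running the same calculation over cumulative sums instead of a single day would give a steadier estimate.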
scale_y_log10()
ggplot(data, aes(ds, value, col= data_type)) +
  geom_line(size=1) + geom_point(col= "black", size= .9) +
  labs(x= "Months (2020)", y= "New Cases (log10)", title = "India",
       subtitle = "Condition of new cases of COVID-19") +
  theme_classic() +
  scale_y_log10()
Now we can see a sharp increase in the number of deaths.
The recovery rate is improving.
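The log scale helps because daily deaths are roughly two orders of magnitude smaller than daily cases, so on a linear axis the deaths line is flattened near zero. A small sketch with the 2020-09-08 values from the earlier table (the variable names are assumptions of mine):

```r
# India, 2020-09-08, from the table shown earlier
value <- c(cases_new = 89706, recovered_new = 74894, deaths_new = 1115)

# Position on a linear axis, as a fraction of the largest value
linear_share <- value / max(value)
# Position on a log10 axis
log_value <- log10(value)

print(round(linear_share, 3))
print(round(log_value, 2))
```

On the linear axis deaths sit at under 2 % of the plot height, while on the log10 axis the three series are spread over a comparable range.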
Forecasting
New corona cases for the next 31 days
Note: the prophet library is not an epidemiological model.
Renaming my dependent variable value to y, and making a new data frame named newcases for data_type: cases_new
data %>% rename(y= value) ->data
#filtering
newcases <- data %>% dplyr::filter(data_type=="cases_new")
#now top 4 data
pander::pander(newcases[1:4,])
ds | location | data_type | y |
---|---|---|---|
2020-09-08 | India | cases_new | 89706 |
2020-09-07 | India | cases_new | 75809 |
2020-09-06 | India | cases_new | 90802 |
2020-09-05 | India | cases_new | 90632 |
Forecasting the new_cases/day for the next 31 days
# model to predict
m <- prophet(df= newcases)
futuredf1 <- make_future_dataframe(m = m, periods = 31)
# predicting
forecast1 <- predict(m, futuredf1)
# Plotting
dyplot.prophet(m, forecast1)
Result
- This forecast suggests that about 116,000 (1 lakh 16 thousand) new cases will be added on 9 October.
This is an interactive graph: move your cursor over the line to see the actual and predicted value for any day.
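The forecast date follows from the horizon: the last observation is 2020-09-08 and `periods = 31` adds 31 daily steps. A quick base R check of that date arithmetic (this is not the prophet call itself, and `last_obs` and `horizon` are names I made up):

```r
# Last observed date in the India data, from range(data$ds) above
last_obs <- as.Date("2020-09-08")

# 31 daily steps ahead, mirroring periods = 31
horizon <- seq(last_obs + 1, by = "day", length.out = 31)

# Final forecast date
format(max(horizon), "%Y-%m-%d")
```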
Probability to get infected
prophet::prophet_plot_components(m, forecast1)
Monday is the safest day and Wednesday is the most dangerous (hehe, kidding).
The fewest records are uploaded on Monday, but the good thing is that reporting is consistent on all other days of the week. They are working well!
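The weekly component is essentially an average by day of the week. A minimal sketch of that idea in base R, using only the four cases_new rows shown earlier (far too few for a real estimate; `wday` and `by_day` are names I made up):

```r
# Four India cases_new rows from the table shown earlier
ds <- as.Date(c("2020-09-08", "2020-09-07", "2020-09-06", "2020-09-05"))
y  <- c(89706, 75809, 90802, 90632)

# ISO weekday number: 1 = Monday ... 7 = Sunday (locale independent)
wday <- format(ds, "%u")

# Mean of y for each weekday present in the sample
by_day <- tapply(y, wday, mean)
print(by_day)

# Weekday with the lowest mean in this tiny sample
names(which.min(by_day))
```

Even in these four rows the Monday value is the lowest, which fits the reporting explanation rather than a real change in risk.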
Forecasting death/day for next 31 days
#filtering
newdeath <- data %>% dplyr::filter(data_type=="deaths_new")
#now top 4 data
pander::pander(head(newdeath,4))
ds | location | data_type | y |
---|---|---|---|
2020-09-08 | India | deaths_new | 1115 |
2020-09-07 | India | deaths_new | 1133 |
2020-09-06 | India | deaths_new | 1016 |
2020-09-05 | India | deaths_new | 1065 |
m <- prophet(newdeath)
futuredf2 <- make_future_dataframe(m = m, periods = 31)
forecast2 <- predict(m, futuredf2)
dyplot.prophet(m, forecast2)
Note:
You can see the forecasting by moving the cursor over the line.
Probability of death…
prophet_plot_components(m, forecast2)
- Most deaths on Tuesday, least on Monday (or most of the death data is updated on Tuesday)
Forecasting Recovery/Day
#filtering
newrecovery <- data %>% dplyr::filter(data_type=="recovered_new")
# model making and forecasting
m <- prophet(df = newrecovery)
futuredf3 <- make_future_dataframe(m = m, periods = 31)
forecast3 <- predict(m, futuredf3)
dyplot.prophet(m, forecast3)
prophet::prophet_plot_components(m, forecast3)
India, US, Russia, new cases
data <- alltis
data %>% dplyr::filter(data_type=="cases_new",
                       location %in% c("India","US","Russia")) %>%
  ggplot(aes(ds, value)) +
  geom_line(aes(col= location, linetype= location), size=1) +
  scale_color_calc() + geom_point(aes(col=location),size=5)+
  labs(x= "Year 2020", y= "New cases/ Day",
       title = "India, Russia, US")
- Cases are increasing in India, cases in America have started decreasing now, the situation in Russia is stable.
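When filtering on several locations at once, `%in%` and `==` behave differently: `==` recycles the right-hand vector element by element, so it can silently drop rows, while `%in%` tests membership. A small base R sketch with a made-up `location` vector:

```r
# Hypothetical location column with six rows
location <- c("India", "India", "India", "US", "US", "US")
wanted   <- c("India", "US", "Russia")

# == compares position by position, recycling `wanted` as
# India, US, Russia, India, US, Russia
keep_eq <- location == wanted
# %in% asks, for each row, whether its location is in `wanted`
keep_in <- location %in% wanted

sum(keep_eq)  # rows kept with ==
sum(keep_in)  # rows kept with %in%
```

With `==` only the rows that happen to line up with the recycled vector survive, so a plot built on that filter can be missing observations without any warning.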
Top 7 countries in Corona Cases per Day
data <- alltis
# note: value is summed over every data_type (cases, deaths, recoveries)
data %>% group_by(location) %>%
  summarise(value= sum(value)) %>%
  ungroup() %>% arrange(desc(value)) %>%
  top_n(20) %>% rename(new_cases= value)->n
# plotting barplot for top 7 countries in `new corona cases/Day`
n %>% top_n(7) %>%
  ggplot(aes(reorder(location, desc(new_cases)), new_cases, fill= location)) + geom_col(col="black", size=2) +
  scale_fill_calc()+
  labs(x= "Countries", y= "New cases/Day", caption = Sys.Date())
pander::pander(n)
location | new_cases |
---|---|
US | 8875773 |
Brazil | 7861958 |
India | 7842862 |
Russia | 1898039 |
Mexico | 1246485 |
Peru | 1243802 |
South Africa | 1223256 |
Colombia | 1222161 |
Argentina | 877029 |
Chile | 834953 |
Iran | 751068 |
Spain | 714483 |
Saudi Arabia | 624620 |
Pakistan | 592524 |
Bangladesh | 561612 |
Turkey | 543297 |
Italy | 526517 |
Germany | 491672 |
Iraq | 483559 |
France | 460426 |
# extracting the names of top 7 countries
n %>% top_n(7)-> n
## Selecting by new_cases
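The ranking is a grouped sum followed by a sort. A base R sketch of the same steps on a made-up data frame (all values hypothetical); it also shows that the totals, and even the order, can change depending on whether rows are first restricted to one data_type:

```r
# Hypothetical mini data set; values are illustrative only
toy <- data.frame(
  location  = c("A", "A", "A", "B", "B", "B"),
  data_type = c("cases_new", "deaths_new", "recovered_new",
                "cases_new", "deaths_new", "recovered_new"),
  value     = c(100, 5, 80, 120, 2, 10),
  stringsAsFactors = FALSE
)

# Sum over every data_type per location, largest first
all_types <- aggregate(value ~ location, data = toy, FUN = sum)
all_types <- all_types[order(-all_types$value), ]

# Same, but restricted to cases_new only
cases_only <- aggregate(value ~ location,
                        data = subset(toy, data_type == "cases_new"),
                        FUN = sum)
cases_only <- cases_only[order(-cases_only$value), ]

print(all_types)
print(cases_only)
```

In this toy case location A leads when all types are summed, but B leads on cases_new alone, so it is worth deciding which total the ranking is meant to show.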
Trends of top 7 countries regarding new cases of COVID-19
scale_y_log10()
#removing all "0" containing values
data %>% dplyr::filter(value!= 0)->data
loc <- n$location
loc[1:7] #top7 location in corona new cases
## [1] "US" "Brazil" "India" "Russia" "Mexico"
## [6] "Peru" "South Africa"
data %>% dplyr::filter(location %in% c("US","Brazil","India","Russia","Mexico","Peru","South Africa")) %>% dplyr::filter(data_type=="cases_new") %>% ggplot(aes(ds, value)) +
  geom_line(aes(col= location, linetype= location), size=1) +
  geom_point(aes(col=location),size=2) +scale_y_log10() +
  scale_color_calc() +
  labs(x= "Year 2020", y= "New_cases/day (log 10)")
Conclusions
Cases are rising continuously in India, but we have to keep in mind that India’s population is very large. In terms of cases per million population, India’s situation still compares well.
Cases have started to grow quickly, and if they are not brought under control soon, the situation could become severe.
Converting and saving alltis to .csv, .sav, and .txt formats.
# CSV (opens in Excel) format
# write.csv(alltis, "C:/Users/Alok Pratap Singh/Documents/R Pubs/R Pubs/alltis.csv")
# .sav (spss) format
# haven::write_sav(alltis, "C:/Users/Alok Pratap Singh/Documents/R Pubs/R Pubs/alltis.sav")
# in .txt format
# write.table(alltis, "C:/Users/Alok Pratap Singh/Documents/R Pubs/R Pubs/alltis.txt")
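A sketch of the CSV round trip using a temporary file, so no personal path is needed. The `toy` data frame is hypothetical. The `row.names = FALSE` argument is my addition: by default `write.csv()` writes the row names as an unnamed first column, which is a likely source of the X1 column that appeared when the file was read at the start.

```r
# Hypothetical two-row data frame standing in for alltis
toy <- data.frame(
  ds       = as.Date(c("2020-09-07", "2020-09-08")),
  location = c("India", "India"),
  value    = c(75809, 89706),
  stringsAsFactors = FALSE
)

# Write to a temporary file; row.names = FALSE skips the index column
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# Read it back and restore the Date class
back <- read.csv(path, stringsAsFactors = FALSE)
back$ds <- as.Date(back$ds)

names(back)
```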
Thank You
Regards
Alok Pratap Singh (Research Scholar)
Department of Psychology
University of Allahabad
Click on download alltis.rar and unrar to get the data in .csv, .sav, and .txt formats.
Without data you’re just another person with an opinion.