As a reminder, the instructions to this project are formatted in markdown. It will be easiest to read the instructions if you click on “Preview” in RStudio. Doing so will render the markdown to an HTML page for easier reading!
Welcome to your off-platform project focused on Data Manipulation in R! The goal of this project is to get experience working with multiple .csv files in RStudio. We’ll be creating several data frames and manipulating them to try to find new insights.
In this project we will be looking at data associated with the Coronavirus pandemic from 2020. While this topic isn’t your typical lighthearted Codecademy assignment, we felt like it is a excellent way to use your new R skills to understand the world around you. This assignment was written on March 23rd, 2020 — since you will be taking this assignment in the future, we’re hoping that this can be an interesting historical analysis rather than an ongoing issue.
We’ve provided three datasets named corona_confirmed.csv, corona_recovered.csv and corona_deaths.csv. For the entirety of this project, we’ll be using these datasets. You can find more recent versions of this data at Johns Hopkins’ data repository on GitHub.
This is a good opportunity to practice downloading data from GitHub or other repositories. We’re using files found in the csse_covid_19_time_series folder. In addition, we recommend looking at some of the projects linked on the GitHub page — that can help give you a better sense of what type of information you can extract from this data.
For now, load these three .csv files into three separate dataframes. Note that we used the read_csv() function from the readr library. If you load the csv files using a different method, you might have slightly different column names.
Inspect the dataframes to get a sense of how the data is stored (we recommend using head()).
library(dplyr)
library(readr)
# Load the data
df_confirmed <- read.csv('corona_confirmed.csv')
df_deaths <-read.csv('corona_deaths.csv')
df_recovered <- read.csv('corona_recovered.csv')
# Inspect the data
head(df_confirmed)
## Province.State Country.Region Lat Long X1.22.20 X1.23.20 X1.24.20
## 1 Thailand 15.0000 101.0000 2 3 5
## 2 Japan 36.0000 138.0000 2 1 2
## 3 Singapore 1.2833 103.8333 0 1 3
## 4 Nepal 28.1667 84.2500 0 0 0
## 5 Malaysia 2.5000 112.5000 0 0 0
## 6 British Columbia Canada 49.2827 -123.1207 0 0 0
## X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20 X1.31.20 X2.1.20
## 1 7 8 8 14 14 14 19 19
## 2 2 4 4 7 7 11 15 20
## 3 3 4 5 7 7 10 13 16
## 4 1 1 1 1 1 1 1 1
## 5 3 4 4 4 7 8 8 8
## 6 0 0 0 1 1 1 1 1
## X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20 X2.8.20 X2.9.20 X2.10.20
## 1 19 19 25 25 25 25 32 32 32
## 2 20 20 22 22 45 25 25 26 26
## 3 18 18 24 28 28 30 33 40 45
## 4 1 1 1 1 1 1 1 1 1
## 5 8 8 10 12 12 12 16 16 18
## 6 1 1 1 2 2 4 4 4 4
## X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20 X2.16.20 X2.17.20 X2.18.20
## 1 33 33 33 33 33 34 35 35
## 2 26 28 28 29 43 59 66 74
## 3 47 50 58 67 72 75 77 81
## 4 1 1 1 1 1 1 1 1
## 5 18 18 19 19 22 22 22 22
## 6 4 4 4 4 4 4 5 5
## X2.19.20 X2.20.20 X2.21.20 X2.22.20 X2.23.20 X2.24.20 X2.25.20 X2.26.20
## 1 35 35 35 35 35 35 37 40
## 2 84 94 105 122 147 159 170 189
## 3 84 84 85 85 89 89 91 93
## 4 1 1 1 1 1 1 1 1
## 5 22 22 22 22 22 22 22 22
## 6 5 5 6 6 6 6 7 7
## X2.27.20 X2.28.20 X2.29.20 X3.1.20 X3.2.20 X3.3.20 X3.4.20 X3.5.20 X3.6.20
## 1 40 41 42 42 43 43 43 47 48
## 2 214 228 241 256 274 293 331 360 420
## 3 93 93 102 106 108 110 110 117 130
## 4 1 1 1 1 1 1 1 1 1
## 5 23 23 25 29 29 36 50 50 83
## 6 7 7 8 8 8 9 12 13 21
## X3.7.20 X3.8.20 X3.9.20 X3.10.20 X3.11.20 X3.12.20 X3.13.20 X3.14.20 X3.15.20
## 1 50 50 50 53 59 70 75 82 114
## 2 461 502 511 581 639 639 701 773 839
## 3 138 150 150 160 178 178 200 212 226
## 4 1 1 1 1 1 1 1 1 1
## 5 93 99 117 129 149 149 197 238 428
## 6 21 27 32 32 39 46 64 64 73
## X3.16.20 X3.17.20 X3.18.20 X3.19.20 X3.20.20 X3.21.20 X3.22.20
## 1 147 177 212 272 322 411 599
## 2 825 878 889 924 963 1007 1086
## 3 243 266 313 345 385 432 455
## 4 1 1 1 1 1 1 2
## 5 566 673 790 900 1030 1183 1306
## 6 103 103 186 231 271 424 424
head(df_deaths)
## Province.State Country.Region Lat Long X1.22.20 X1.23.20 X1.24.20
## 1 Thailand 15.0000 101.0000 0 0 0
## 2 Japan 36.0000 138.0000 0 0 0
## 3 Singapore 1.2833 103.8333 0 0 0
## 4 Nepal 28.1667 84.2500 0 0 0
## 5 Malaysia 2.5000 112.5000 0 0 0
## 6 British Columbia Canada 49.2827 -123.1207 0 0 0
## X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20 X1.31.20 X2.1.20
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20 X2.8.20 X2.9.20 X2.10.20
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0
## X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20 X2.16.20 X2.17.20 X2.18.20
## 1 0 0 0 0 0 0 0 0
## 2 0 0 1 1 1 1 1 1
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## X2.19.20 X2.20.20 X2.21.20 X2.22.20 X2.23.20 X2.24.20 X2.25.20 X2.26.20
## 1 0 0 0 0 0 0 0 0
## 2 1 1 1 1 1 1 1 2
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## X2.27.20 X2.28.20 X2.29.20 X3.1.20 X3.2.20 X3.3.20 X3.4.20 X3.5.20 X3.6.20
## 1 0 0 0 1 1 1 1 1 1
## 2 4 4 5 6 6 6 6 6 6
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0
## X3.7.20 X3.8.20 X3.9.20 X3.10.20 X3.11.20 X3.12.20 X3.13.20 X3.14.20 X3.15.20
## 1 1 1 1 1 1 1 1 1 1
## 2 6 6 10 10 15 16 19 22 22
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 1 1 1 1 1 1 1
## X3.16.20 X3.17.20 X3.18.20 X3.19.20 X3.20.20 X3.21.20 X3.22.20
## 1 1 1 1 1 1 1 1
## 2 27 29 29 29 33 35 40
## 3 0 0 0 0 0 2 2
## 4 0 0 0 0 0 0 0
## 5 0 2 2 2 3 4 10
## 6 4 4 7 7 8 10 10
head(df_recovered)
## Province.State Country.Region Lat Long X1.22.20 X1.23.20 X1.24.20
## 1 Thailand 15.0000 101.0000 0 0 0
## 2 Japan 36.0000 138.0000 0 0 0
## 3 Singapore 1.2833 103.8333 0 0 0
## 4 Nepal 28.1667 84.2500 0 0 0
## 5 Malaysia 2.5000 112.5000 0 0 0
## 6 British Columbia Canada 49.2827 -123.1207 0 0 0
## X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20 X1.31.20 X2.1.20
## 1 0 2 2 5 5 5 5 5
## 2 0 1 1 1 1 1 1 1
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20 X2.8.20 X2.9.20 X2.10.20
## 1 5 5 5 5 5 5 10 10 10
## 2 1 1 1 1 1 1 1 1 4
## 3 0 0 0 0 0 0 2 2 2
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 1 1 1 1
## 6 0 0 0 0 0 0 0 0 0
## X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20 X2.16.20 X2.17.20 X2.18.20
## 1 10 10 12 12 12 14 15 15
## 2 9 9 9 9 12 12 12 13
## 3 9 15 15 17 18 18 24 29
## 4 0 1 1 1 1 1 1 1
## 5 3 3 3 3 7 7 7 13
## 6 0 0 0 0 0 0 0 0
## X2.19.20 X2.20.20 X2.21.20 X2.22.20 X2.23.20 X2.24.20 X2.25.20 X2.26.20
## 1 15 15 17 17 21 21 22 22
## 2 18 18 22 22 22 22 22 22
## 3 34 34 37 37 51 51 53 62
## 4 1 1 1 1 1 1 1 1
## 5 15 15 15 15 15 18 18 18
## 6 0 0 0 0 0 0 0 0
## X2.27.20 X2.28.20 X2.29.20 X3.1.20 X3.2.20 X3.3.20 X3.4.20 X3.5.20 X3.6.20
## 1 22 28 28 28 31 31 31 31 31
## 2 22 22 32 32 32 43 43 43 46
## 3 62 62 72 72 78 78 78 78 78
## 4 1 1 1 1 1 1 1 1 1
## 5 18 18 18 18 18 22 22 22 22
## 6 3 3 3 3 3 3 3 3 3
## X3.7.20 X3.8.20 X3.9.20 X3.10.20 X3.11.20 X3.12.20 X3.13.20 X3.14.20 X3.15.20
## 1 31 31 31 33 34 34 35 35 35
## 2 76 76 76 101 118 118 118 118 118
## 3 78 78 78 78 96 96 97 105 105
## 4 1 1 1 1 1 1 1 1 1
## 5 23 24 24 24 26 26 26 35 42
## 6 4 4 4 4 4 4 4 4 4
## X3.16.20 X3.17.20 X3.18.20 X3.19.20 X3.20.20 X3.21.20 X3.22.20
## 1 35 41 42 42 42 42 44
## 2 144 144 144 150 191 232 235
## 3 109 114 114 114 124 140 144
## 4 1 1 1 1 1 1 1
## 5 42 49 60 75 87 114 139
## 6 4 4 4 4 4 4 4
The format of these three data frames are all the same — each row contains information about the number of cases in a certain province, state, or country. Every column (other than the columns containing the latitude, longitude, and country name) represents a date. We have data starting on January 1st, 2020 and ending on March 22nd, 2020 (the day this assignment was written).
Let’s warm up by finding the total number of confirmed cases on March 22nd, 2020. To do this, we’ll use both the select() and sum() functions. When using the select() function, make sure to put the name of the column in backticks (usually found in the top left corner of your keyboard)! This usually isn’t necessary, but since the name of the column contains the / character, we need backticks! Make sure to do this whenever you’re working with these columns.
# Sum the total number of cases on March 22nd
confirmed.X3.22.20 <- df_confirmed %>%
select(X3.22.20) %>%
sum()
sprintf('There are %d confirmed cases on March 22nd, 2020.', confirmed.X3.22.20)
## [1] "There are 335955 confirmed cases on March 22nd, 2020."
Let’s start to filter the data a bit more. Give these three tasks a try:
0)You may want to save these results to variables as they will be useful later!
# Filter for countries in the northern hemisphere
confirmed.X3.22.20.north <- df_confirmed %>%
filter(Lat > 0) %>%
select(X3.22.20) %>%
sum()
sprintf('There were %d confirmed cases in the Northern Hemisphere on March 22nd, 2020.', confirmed.X3.22.20.north)
## [1] "There were 329794 confirmed cases in the Northern Hemisphere on March 22nd, 2020."
# Filter for Australia cases
confirmed.X3.22.20.australia <- df_confirmed %>%
filter(`Country.Region` == 'Australia') %>%
select(X3.22.20) %>%
sum()
sprintf('There were %d confirmed cases in Australia on March 22nd, 2020.', confirmed.X3.22.20.australia)
## [1] "There were 1314 confirmed cases in Australia on March 22nd, 2020."
# Display percentages
sprintf('%.2f%% of the confirmed cases were from the Northern Hemisphere on March 22nd, 2020.', confirmed.X3.22.20.north / confirmed.X3.22.20 * 100)
## [1] "98.17% of the confirmed cases were from the Northern Hemisphere on March 22nd, 2020."
sprintf('%.2f%% of the confirmed cases were from Australia on March 22nd, 2020.', confirmed.X3.22.20.australia / confirmed.X3.22.20 * 100)
## [1] "0.39% of the confirmed cases were from Australia on March 22nd, 2020."
Notice that some countries have multiple rows of data. This happens when a country has information about specific states or provinces. While this information might be useful, it makes it a bit tricky to see the total number of cases by country.
We were able to get the total number of cases in Australia by using filter(), but what if we wanted to do this for every country? It would be a pain to have to type every country name into a filter function. Use group_by() to create a new data frame containing one row for every Country/Region.
Every column of those new rows should have the sum of the total number of cases for that country for every day. We can use the summarize_at() function to do this. summarize_at() takes two parameters:
Lat, Long, and Province/State. summarize_at() only works with numbers, so we’ll want to remove the column containing country names. We also don’t need to add the latitudes and longitudes. Remember, you can use the - symbol to list columns you want to exclude. Also don’t forget to use backticks around columns with / symbols. For example -`Province/State` will exclude that column. These small details get complicated!sum.After creating this new data frame inspect it. To confirm you did the group_by() correctly, find the row for Australia and confirm the number of cases on March 22nd matches your results from the previous step
# Group by countries
df_confirmed_grouped <- df_confirmed %>%
group_by(Country.Region) %>%
summarise_at(vars(-Province.State, -Lat, -Long), sum)
df_confirmed_grouped
## # A tibble: 171 x 62
## Country.Region X1.22.20 X1.23.20 X1.24.20 X1.25.20 X1.26.20 X1.27.20 X1.28.20
## <chr> <int> <int> <int> <int> <int> <int> <int>
## 1 Afghanistan 0 0 0 0 0 0 0
## 2 Albania 0 0 0 0 0 0 0
## 3 Algeria 0 0 0 0 0 0 0
## 4 Andorra 0 0 0 0 0 0 0
## 5 Angola 0 0 0 0 0 0 0
## 6 Antigua and B~ 0 0 0 0 0 0 0
## 7 Argentina 0 0 0 0 0 0 0
## 8 Armenia 0 0 0 0 0 0 0
## 9 Australia 0 0 0 0 4 5 5
## 10 Austria 0 0 0 0 0 0 0
## # ... with 161 more rows, and 54 more variables: X1.29.20 <int>,
## # X1.30.20 <int>, X1.31.20 <int>, X2.1.20 <int>, X2.2.20 <int>,
## # X2.3.20 <int>, X2.4.20 <int>, X2.5.20 <int>, X2.6.20 <int>, X2.7.20 <int>,
## # X2.8.20 <int>, X2.9.20 <int>, X2.10.20 <int>, X2.11.20 <int>,
## # X2.12.20 <int>, X2.13.20 <int>, X2.14.20 <int>, X2.15.20 <int>,
## # X2.16.20 <int>, X2.17.20 <int>, X2.18.20 <int>, X2.19.20 <int>,
## # X2.20.20 <int>, X2.21.20 <int>, X2.22.20 <int>, X2.23.20 <int>,
## # X2.24.20 <int>, X2.25.20 <int>, X2.26.20 <int>, X2.27.20 <int>,
## # X2.28.20 <int>, X2.29.20 <int>, X3.1.20 <int>, X3.2.20 <int>,
## # X3.3.20 <int>, X3.4.20 <int>, X3.5.20 <int>, X3.6.20 <int>, X3.7.20 <int>,
## # X3.8.20 <int>, X3.9.20 <int>, X3.10.20 <int>, X3.11.20 <int>,
## # X3.12.20 <int>, X3.13.20 <int>, X3.14.20 <int>, X3.15.20 <int>,
## # X3.16.20 <int>, X3.17.20 <int>, X3.18.20 <int>, X3.19.20 <int>,
## # X3.20.20 <int>, X3.21.20 <int>, X3.22.20 <int>
# Filter the grouped data set for Australia
df_confirmed_grouped %>%
filter(Country.Region == 'Australia') %>%
select(X3.22.20)
## # A tibble: 1 x 1
## X3.22.20
## <int>
## 1 1314
Do the same process of grouping by country using the recovered dataset. What percentage of the cases in the US have recovered on March 22nd?
# Group by countries
df_recovered_grouped <- df_recovered %>%
group_by(Country.Region) %>%
summarise_at(vars(-Province.State, -Lat, -Long), sum)
df_recovered_grouped
## # A tibble: 171 x 62
## Country.Region X1.22.20 X1.23.20 X1.24.20 X1.25.20 X1.26.20 X1.27.20 X1.28.20
## <chr> <int> <int> <int> <int> <int> <int> <int>
## 1 Afghanistan 0 0 0 0 0 0 0
## 2 Albania 0 0 0 0 0 0 0
## 3 Algeria 0 0 0 0 0 0 0
## 4 Andorra 0 0 0 0 0 0 0
## 5 Angola 0 0 0 0 0 0 0
## 6 Antigua and B~ 0 0 0 0 0 0 0
## 7 Argentina 0 0 0 0 0 0 0
## 8 Armenia 0 0 0 0 0 0 0
## 9 Australia 0 0 0 0 0 0 0
## 10 Austria 0 0 0 0 0 0 0
## # ... with 161 more rows, and 54 more variables: X1.29.20 <int>,
## # X1.30.20 <int>, X1.31.20 <int>, X2.1.20 <int>, X2.2.20 <int>,
## # X2.3.20 <int>, X2.4.20 <int>, X2.5.20 <int>, X2.6.20 <int>, X2.7.20 <int>,
## # X2.8.20 <int>, X2.9.20 <int>, X2.10.20 <int>, X2.11.20 <int>,
## # X2.12.20 <int>, X2.13.20 <int>, X2.14.20 <int>, X2.15.20 <int>,
## # X2.16.20 <int>, X2.17.20 <int>, X2.18.20 <int>, X2.19.20 <int>,
## # X2.20.20 <int>, X2.21.20 <int>, X2.22.20 <int>, X2.23.20 <int>,
## # X2.24.20 <int>, X2.25.20 <int>, X2.26.20 <int>, X2.27.20 <int>,
## # X2.28.20 <int>, X2.29.20 <int>, X3.1.20 <int>, X3.2.20 <int>,
## # X3.3.20 <int>, X3.4.20 <int>, X3.5.20 <int>, X3.6.20 <int>, X3.7.20 <int>,
## # X3.8.20 <int>, X3.9.20 <int>, X3.10.20 <int>, X3.11.20 <int>,
## # X3.12.20 <int>, X3.13.20 <int>, X3.14.20 <int>, X3.15.20 <int>,
## # X3.16.20 <int>, X3.17.20 <int>, X3.18.20 <int>, X3.19.20 <int>,
## # X3.20.20 <int>, X3.21.20 <int>, X3.22.20 <int>
confirmed.X3.22.20.us <- df_confirmed_grouped %>%
filter(Country.Region == 'US') %>%
select(-Country.Region) %>%
max()
confirmed.X3.22.20.us
## [1] 33272
recovered.X3.22.20.us <- df_recovered_grouped %>%
filter(Country.Region == 'US') %>%
select(-Country.Region) %>%
max()
recovered.X3.22.20.us
## [1] 17
sprintf('%.2f%% of the confirmed cases from US on March 22nd, 2020 have recovered.', recovered.X3.22.20.us / confirmed.X3.22.20.us * 100)
## [1] "0.05% of the confirmed cases from US on March 22nd, 2020 have recovered."
You may see some surprising results — are there really zero recovered cases in the US? Let’s take a closer look at the US row in the recovered table. Either view the table in RStudio’s interface, or print the row in the cell block below. What do you notice about the values as you scroll to March 22nd?
# Filter to inspect the US row
View(df_recovered_grouped)
df_recovered_grouped %>%
filter(Country.Region == 'US')
## # A tibble: 1 x 62
## Country.Region X1.22.20 X1.23.20 X1.24.20 X1.25.20 X1.26.20 X1.27.20 X1.28.20
## <chr> <int> <int> <int> <int> <int> <int> <int>
## 1 US 0 0 0 0 0 0 0
## # ... with 54 more variables: X1.29.20 <int>, X1.30.20 <int>, X1.31.20 <int>,
## # X2.1.20 <int>, X2.2.20 <int>, X2.3.20 <int>, X2.4.20 <int>, X2.5.20 <int>,
## # X2.6.20 <int>, X2.7.20 <int>, X2.8.20 <int>, X2.9.20 <int>, X2.10.20 <int>,
## # X2.11.20 <int>, X2.12.20 <int>, X2.13.20 <int>, X2.14.20 <int>,
## # X2.15.20 <int>, X2.16.20 <int>, X2.17.20 <int>, X2.18.20 <int>,
## # X2.19.20 <int>, X2.20.20 <int>, X2.21.20 <int>, X2.22.20 <int>,
## # X2.23.20 <int>, X2.24.20 <int>, X2.25.20 <int>, X2.26.20 <int>,
## # X2.27.20 <int>, X2.28.20 <int>, X2.29.20 <int>, X3.1.20 <int>,
## # X3.2.20 <int>, X3.3.20 <int>, X3.4.20 <int>, X3.5.20 <int>, X3.6.20 <int>,
## # X3.7.20 <int>, X3.8.20 <int>, X3.9.20 <int>, X3.10.20 <int>,
## # X3.11.20 <int>, X3.12.20 <int>, X3.13.20 <int>, X3.14.20 <int>,
## # X3.15.20 <int>, X3.16.20 <int>, X3.17.20 <int>, X3.18.20 <int>,
## # X3.19.20 <int>, X3.20.20 <int>, X3.21.20 <int>, X3.22.20 <int>
It seems like the number of recovered cases is steadily increasing to 17, until March 18th, when it suddenly drops back to 0. This is surprising, and not what we expected! Perhaps we’re misunderstanding the format of the data, or perhaps there’s an error in reporting the data.
What’s the best course of action at this point? One strategy is to investigate the source of the data to truly understand what it is reporting. When we went back to Johns Hopkins’ repository, we found a note saying that the data had moved into a different file. It might be best to use that file!
If we wanted to use the current file, instead of reporting the number of confirmed and recovered cases in the US on March 22nd, we could report the maximum number of confirmed and recovered cases. Let’s give that a shot using the max() function. Make sure to also use select() to remove the Country/Region column — we can’t take the max of a row if it contains a string.
# Find the maximum number of confirmed and recovered cases
confirmed_max_us <- df_confirmed_grouped %>%
filter(Country.Region == 'US') %>%
select(-Country.Region) %>%
max()
confirmed_max_us
## [1] 33272
recovered_max_us <- df_recovered_grouped %>%
filter(Country.Region == 'US') %>%
select(-Country.Region) %>%
max()
recovered_max_us
## [1] 17
sprintf('US maximum confirmed cases - %d', confirmed_max_us)
## [1] "US maximum confirmed cases - 33272"
sprintf('US maximum recovery cases - %d', recovered_max_us)
## [1] "US maximum recovery cases - 17"
sprintf('Recovery percentage - %.2f%%', recovered_max_us / confirmed_max_us * 100)
## [1] "Recovery percentage - 0.05%"
The code that you just wrote in the last prompt is a little ugly — we first have to select a particular row by the value in a particular column, remove that column from the row, and then find the maximum value of the remaining values. It would be much easier if we rotated the data frame so rows represented dates and columns represented countries. Try drawing a sketch of what this rotated data frame might look like!
We could then find the maximum value of a country by simply selecting the appropriate column and finding the maximum value in that column. Let’s try that! The t() function will transpose the data frame so all of the columns become rows. t() unfortunately returns a matrix instead of a data frame. If you pipe the result of t() through the as.data.frame() function, your result will be the transposed data frame. Try transposing one of your data frames by piping it through both t() and as.data.frame() and call head() to investigate it.
# Transpose the data frame
df_confirmed_transposed <- df_confirmed_grouped %>%
t() %>%
as.data.frame()
head(df_confirmed_transposed)
## V1 V2 V3 V4 V5 V6
## Country.Region Afghanistan Albania Algeria Andorra Angola Antigua and Barbuda
## X1.22.20 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0
## V7 V8 V9 V10 V11 V12
## Country.Region Argentina Armenia Australia Austria Azerbaijan Bahamas, The
## X1.22.20 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0
## X1.26.20 0 0 4 0 0 0
## V13 V14 V15 V16 V17 V18 V19 V20
## Country.Region Bahrain Bangladesh Barbados Belarus Belgium Benin Bhutan Bolivia
## X1.22.20 0 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0 0 0
## V21 V22 V23 V24 V25
## Country.Region Bosnia and Herzegovina Brazil Brunei Bulgaria Burkina Faso
## X1.22.20 0 0 0 0 0
## X1.23.20 0 0 0 0 0
## X1.24.20 0 0 0 0 0
## X1.25.20 0 0 0 0 0
## X1.26.20 0 0 0 0 0
## V26 V27 V28 V29 V30
## Country.Region Cabo Verde Cambodia Cameroon Canada Cape Verde
## X1.22.20 0 0 0 0 0
## X1.23.20 0 0 0 0 0
## X1.24.20 0 0 0 0 0
## X1.25.20 0 0 0 0 0
## X1.26.20 0 0 0 1 0
## V31 V32 V33 V34 V35
## Country.Region Central African Republic Chad Chile China Colombia
## X1.22.20 0 0 0 548 0
## X1.23.20 0 0 0 643 0
## X1.24.20 0 0 0 920 0
## X1.25.20 0 0 0 1406 0
## X1.26.20 0 0 0 2075 0
## V36 V37 V38 V39
## Country.Region Congo (Brazzaville) Congo (Kinshasa) Costa Rica Cote d'Ivoire
## X1.22.20 0 0 0 0
## X1.23.20 0 0 0 0
## X1.24.20 0 0 0 0
## X1.25.20 0 0 0 0
## X1.26.20 0 0 0 0
## V40 V41 V42 V43 V44 V45 V46
## Country.Region Croatia Cruise Ship Cuba Cyprus Czechia Denmark Djibouti
## X1.22.20 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0 0
## V47 V48 V49 V50 V51 V52
## Country.Region Dominica Dominican Republic East Timor Ecuador Egypt El Salvador
## X1.22.20 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0
## V53 V54 V55 V56 V57 V58 V59
## Country.Region Equatorial Guinea Eritrea Estonia Eswatini Ethiopia Fiji Finland
## X1.22.20 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0 0
## V60 V61 V62 V63 V64 V65 V66 V67
## Country.Region France Gabon Gambia, The Georgia Germany Ghana Greece Grenada
## X1.22.20 0 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0 0
## X1.24.20 2 0 0 0 0 0 0 0
## X1.25.20 3 0 0 0 0 0 0 0
## X1.26.20 3 0 0 0 0 0 0 0
## V68 V69 V70 V71 V72 V73 V74 V75
## Country.Region Guatemala Guinea Guyana Haiti Holy See Honduras Hungary Iceland
## X1.22.20 0 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0 0 0
## V76 V77 V78 V79 V80 V81 V82 V83 V84
## Country.Region India Indonesia Iran Iraq Ireland Israel Italy Jamaica Japan
## X1.22.20 0 0 0 0 0 0 0 0 2
## X1.23.20 0 0 0 0 0 0 0 0 1
## X1.24.20 0 0 0 0 0 0 0 0 2
## X1.25.20 0 0 0 0 0 0 0 0 2
## X1.26.20 0 0 0 0 0 0 0 0 4
## V85 V86 V87 V88 V89 V90 V91
## Country.Region Jordan Kazakhstan Kenya Korea, South Kosovo Kuwait Kyrgyzstan
## X1.22.20 0 0 0 1 0 0 0
## X1.23.20 0 0 0 1 0 0 0
## X1.24.20 0 0 0 2 0 0 0
## X1.25.20 0 0 0 2 0 0 0
## X1.26.20 0 0 0 3 0 0 0
## V92 V93 V94 V95 V96 V97
## Country.Region Latvia Lebanon Liberia Liechtenstein Lithuania Luxembourg
## X1.22.20 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0
## V98 V99 V100 V101 V102 V103
## Country.Region Madagascar Malaysia Maldives Malta Martinique Mauritania
## X1.22.20 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0
## X1.25.20 0 3 0 0 0 0
## X1.26.20 0 4 0 0 0 0
## V104 V105 V106 V107 V108 V109 V110
## Country.Region Mauritius Mexico Moldova Monaco Mongolia Montenegro Morocco
## X1.22.20 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0 0
## V111 V112 V113 V114 V115 V116 V117
## Country.Region Mozambique Namibia Nepal Netherlands New Zealand Nicaragua Niger
## X1.22.20 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0
## X1.25.20 0 0 1 0 0 0 0
## X1.26.20 0 0 1 0 0 0 0
## V118 V119 V120 V121 V122 V123
## Country.Region Nigeria North Macedonia Norway Oman Pakistan Panama
## X1.22.20 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0
## V124 V125 V126 V127 V128 V129 V130
## Country.Region Papua New Guinea Paraguay Peru Philippines Poland Portugal Qatar
## X1.22.20 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0 0
## V131 V132 V133 V134
## Country.Region Romania Russia Rwanda Saint Lucia
## X1.22.20 0 0 0 0
## X1.23.20 0 0 0 0
## X1.24.20 0 0 0 0
## X1.25.20 0 0 0 0
## X1.26.20 0 0 0 0
## V135 V136 V137 V138
## Country.Region Saint Vincent and the Grenadines San Marino Saudi Arabia Senegal
## X1.22.20 0 0 0 0
## X1.23.20 0 0 0 0
## X1.24.20 0 0 0 0
## X1.25.20 0 0 0 0
## X1.26.20 0 0 0 0
## V139 V140 V141 V142 V143 V144
## Country.Region Serbia Seychelles Singapore Slovakia Slovenia Somalia
## X1.22.20 0 0 0 0 0 0
## X1.23.20 0 0 1 0 0 0
## X1.24.20 0 0 3 0 0 0
## X1.25.20 0 0 3 0 0 0
## X1.26.20 0 0 4 0 0 0
## V145 V146 V147 V148 V149 V150 V151
## Country.Region South Africa Spain Sri Lanka Sudan Suriname Sweden Switzerland
## X1.22.20 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0 0
## V152 V153 V154 V155 V156 V157
## Country.Region Syria Taiwan* Tanzania Thailand Timor-Leste Togo
## X1.22.20 0 1 0 2 0 0
## X1.23.20 0 1 0 3 0 0
## X1.24.20 0 3 0 5 0 0
## X1.25.20 0 3 0 7 0 0
## X1.26.20 0 4 0 8 0 0
## V158 V159 V160 V161 V162
## Country.Region Trinidad and Tobago Tunisia Turkey Uganda Ukraine
## X1.22.20 0 0 0 0 0
## X1.23.20 0 0 0 0 0
## X1.24.20 0 0 0 0 0
## X1.25.20 0 0 0 0 0
## X1.26.20 0 0 0 0 0
## V163 V164 V165 V166 V167
## Country.Region United Arab Emirates United Kingdom Uruguay US Uzbekistan
## X1.22.20 0 0 0 1 0
## X1.23.20 0 0 0 1 0
## X1.24.20 0 0 0 2 0
## X1.25.20 0 0 0 2 0
## X1.26.20 0 0 0 5 0
## V168 V169 V170 V171
## Country.Region Venezuela Vietnam Zambia Zimbabwe
## X1.22.20 0 0 0 0
## X1.23.20 0 2 0 0
## X1.24.20 0 2 0 0
## X1.25.20 0 2 0 0
## X1.26.20 0 2 0 0
Great! We’re almost there. But take a look at the column names. We want the column names to be the countries, but instead, the country names are in the first row. We need to figure out how to change the first row of a data frame into column names. Challenge yourself to try to figure this out on your own in the following code block. We’ll walk you through our solution in the following section.
names(df_confirmed_transposed) <- df_confirmed_transposed[1,]
df_confirmed_transposed <- df_confirmed_transposed[-1,]
head(df_confirmed_transposed)
## Afghanistan Albania Algeria Andorra Angola Antigua and Barbuda
## X1.22.20 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0
## X1.27.20 0 0 0 0 0 0
## Argentina Armenia Australia Austria Azerbaijan Bahamas, The Bahrain
## X1.22.20 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0 0
## X1.26.20 0 0 4 0 0 0 0
## X1.27.20 0 0 5 0 0 0 0
## Bangladesh Barbados Belarus Belgium Benin Bhutan Bolivia
## X1.22.20 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0 0
## X1.27.20 0 0 0 0 0 0 0
## Bosnia and Herzegovina Brazil Brunei Bulgaria Burkina Faso Cabo Verde
## X1.22.20 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0
## X1.27.20 0 0 0 0 0 0
## Cambodia Cameroon Canada Cape Verde Central African Republic Chad
## X1.22.20 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0
## X1.26.20 0 0 1 0 0 0
## X1.27.20 1 0 1 0 0 0
## Chile China Colombia Congo (Brazzaville) Congo (Kinshasa) Costa Rica
## X1.22.20 0 548 0 0 0 0
## X1.23.20 0 643 0 0 0 0
## X1.24.20 0 920 0 0 0 0
## X1.25.20 0 1406 0 0 0 0
## X1.26.20 0 2075 0 0 0 0
## X1.27.20 0 2877 0 0 0 0
## Cote d'Ivoire Croatia Cruise Ship Cuba Cyprus Czechia Denmark Djibouti
## X1.22.20 0 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0 0 0
## X1.27.20 0 0 0 0 0 0 0 0
## Dominica Dominican Republic East Timor Ecuador Egypt El Salvador
## X1.22.20 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0
## X1.27.20 0 0 0 0 0 0
## Equatorial Guinea Eritrea Estonia Eswatini Ethiopia Fiji Finland
## X1.22.20 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0 0
## X1.27.20 0 0 0 0 0 0 0
## France Gabon Gambia, The Georgia Germany Ghana Greece Grenada
## X1.22.20 0 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0 0
## X1.24.20 2 0 0 0 0 0 0 0
## X1.25.20 3 0 0 0 0 0 0 0
## X1.26.20 3 0 0 0 0 0 0 0
## X1.27.20 3 0 0 0 1 0 0 0
## Guatemala Guinea Guyana Haiti Holy See Honduras Hungary Iceland India
## X1.22.20 0 0 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0 0 0 0
## X1.27.20 0 0 0 0 0 0 0 0 0
## Indonesia Iran Iraq Ireland Israel Italy Jamaica Japan Jordan
## X1.22.20 0 0 0 0 0 0 0 2 0
## X1.23.20 0 0 0 0 0 0 0 1 0
## X1.24.20 0 0 0 0 0 0 0 2 0
## X1.25.20 0 0 0 0 0 0 0 2 0
## X1.26.20 0 0 0 0 0 0 0 4 0
## X1.27.20 0 0 0 0 0 0 0 4 0
## Kazakhstan Kenya Korea, South Kosovo Kuwait Kyrgyzstan Latvia Lebanon
## X1.22.20 0 0 1 0 0 0 0 0
## X1.23.20 0 0 1 0 0 0 0 0
## X1.24.20 0 0 2 0 0 0 0 0
## X1.25.20 0 0 2 0 0 0 0 0
## X1.26.20 0 0 3 0 0 0 0 0
## X1.27.20 0 0 4 0 0 0 0 0
## Liberia Liechtenstein Lithuania Luxembourg Madagascar Malaysia
## X1.22.20 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 3
## X1.26.20 0 0 0 0 0 4
## X1.27.20 0 0 0 0 0 4
## Maldives Malta Martinique Mauritania Mauritius Mexico Moldova Monaco
## X1.22.20 0 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0 0 0
## X1.27.20 0 0 0 0 0 0 0 0
## Mongolia Montenegro Morocco Mozambique Namibia Nepal Netherlands
## X1.22.20 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 1 0
## X1.26.20 0 0 0 0 0 1 0
## X1.27.20 0 0 0 0 0 1 0
## New Zealand Nicaragua Niger Nigeria North Macedonia Norway Oman
## X1.22.20 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0 0
## X1.27.20 0 0 0 0 0 0 0
## Pakistan Panama Papua New Guinea Paraguay Peru Philippines Poland
## X1.22.20 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0 0
## X1.27.20 0 0 0 0 0 0 0
## Portugal Qatar Romania Russia Rwanda Saint Lucia
## X1.22.20 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0
## X1.27.20 0 0 0 0 0 0
## Saint Vincent and the Grenadines San Marino Saudi Arabia Senegal
## X1.22.20 0 0 0 0
## X1.23.20 0 0 0 0
## X1.24.20 0 0 0 0
## X1.25.20 0 0 0 0
## X1.26.20 0 0 0 0
## X1.27.20 0 0 0 0
## Serbia Seychelles Singapore Slovakia Slovenia Somalia South Africa
## X1.22.20 0 0 0 0 0 0 0
## X1.23.20 0 0 1 0 0 0 0
## X1.24.20 0 0 3 0 0 0 0
## X1.25.20 0 0 3 0 0 0 0
## X1.26.20 0 0 4 0 0 0 0
## X1.27.20 0 0 5 0 0 0 0
## Spain Sri Lanka Sudan Suriname Sweden Switzerland Syria Taiwan*
## X1.22.20 0 0 0 0 0 0 0 1
## X1.23.20 0 0 0 0 0 0 0 1
## X1.24.20 0 0 0 0 0 0 0 3
## X1.25.20 0 0 0 0 0 0 0 3
## X1.26.20 0 0 0 0 0 0 0 4
## X1.27.20 0 1 0 0 0 0 0 5
## Tanzania Thailand Timor-Leste Togo Trinidad and Tobago Tunisia Turkey
## X1.22.20 0 2 0 0 0 0 0
## X1.23.20 0 3 0 0 0 0 0
## X1.24.20 0 5 0 0 0 0 0
## X1.25.20 0 7 0 0 0 0 0
## X1.26.20 0 8 0 0 0 0 0
## X1.27.20 0 8 0 0 0 0 0
## Uganda Ukraine United Arab Emirates United Kingdom Uruguay US
## X1.22.20 0 0 0 0 0 1
## X1.23.20 0 0 0 0 0 1
## X1.24.20 0 0 0 0 0 2
## X1.25.20 0 0 0 0 0 2
## X1.26.20 0 0 0 0 0 5
## X1.27.20 0 0 0 0 0 5
## Uzbekistan Venezuela Vietnam Zambia Zimbabwe
## X1.22.20 0 0 0 0 0
## X1.23.20 0 0 2 0 0
## X1.24.20 0 0 2 0 0
## X1.25.20 0 0 2 0 0
## X1.26.20 0 0 2 0 0
## X1.27.20 0 0 2 0 0
As usual, we used Google to help solve this problem. Learning how to properly search for answers to your programming questions is an essential skill. In this case, we specifically searched “r data frame set first row as column names”, which led us to a Stack Overflow page that suggested using the janitor library. That’s a cute name for a library — this janitor library will help us clean up our data frame.
First, we installed and loaded the janitor package. Then we passed our transposed data frame into the row_to_names() function and used the parameter row_number = 1.
if (!require('janitor')) {
install.packages('janitor')
}
## Loading required package: janitor
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(janitor)
# Make the first row the column titles
df_confirmed_transposed <- df_confirmed_grouped %>%
t() %>%
as.data.frame()
df_confirmed_transposed <- df_confirmed_transposed %>%
row_to_names(row_number = 1)
head(df_confirmed_transposed)
## Afghanistan Albania Algeria Andorra Angola Antigua and Barbuda
## X1.22.20 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0
## X1.27.20 0 0 0 0 0 0
## Argentina Armenia Australia Austria Azerbaijan Bahamas, The Bahrain
## X1.22.20 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0 0
## X1.26.20 0 0 4 0 0 0 0
## X1.27.20 0 0 5 0 0 0 0
## Bangladesh Barbados Belarus Belgium Benin Bhutan Bolivia
## X1.22.20 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0 0
## X1.27.20 0 0 0 0 0 0 0
## Bosnia and Herzegovina Brazil Brunei Bulgaria Burkina Faso Cabo Verde
## X1.22.20 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0
## X1.27.20 0 0 0 0 0 0
## Cambodia Cameroon Canada Cape Verde Central African Republic Chad
## X1.22.20 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0
## X1.26.20 0 0 1 0 0 0
## X1.27.20 1 0 1 0 0 0
## Chile China Colombia Congo (Brazzaville) Congo (Kinshasa) Costa Rica
## X1.22.20 0 548 0 0 0 0
## X1.23.20 0 643 0 0 0 0
## X1.24.20 0 920 0 0 0 0
## X1.25.20 0 1406 0 0 0 0
## X1.26.20 0 2075 0 0 0 0
## X1.27.20 0 2877 0 0 0 0
## Cote d'Ivoire Croatia Cruise Ship Cuba Cyprus Czechia Denmark Djibouti
## X1.22.20 0 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0 0 0
## X1.27.20 0 0 0 0 0 0 0 0
## Dominica Dominican Republic East Timor Ecuador Egypt El Salvador
## X1.22.20 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0
## X1.27.20 0 0 0 0 0 0
## Equatorial Guinea Eritrea Estonia Eswatini Ethiopia Fiji Finland
## X1.22.20 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0 0
## X1.27.20 0 0 0 0 0 0 0
## France Gabon Gambia, The Georgia Germany Ghana Greece Grenada
## X1.22.20 0 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0 0
## X1.24.20 2 0 0 0 0 0 0 0
## X1.25.20 3 0 0 0 0 0 0 0
## X1.26.20 3 0 0 0 0 0 0 0
## X1.27.20 3 0 0 0 1 0 0 0
## Guatemala Guinea Guyana Haiti Holy See Honduras Hungary Iceland India
## X1.22.20 0 0 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0 0 0 0
## X1.27.20 0 0 0 0 0 0 0 0 0
## Indonesia Iran Iraq Ireland Israel Italy Jamaica Japan Jordan
## X1.22.20 0 0 0 0 0 0 0 2 0
## X1.23.20 0 0 0 0 0 0 0 1 0
## X1.24.20 0 0 0 0 0 0 0 2 0
## X1.25.20 0 0 0 0 0 0 0 2 0
## X1.26.20 0 0 0 0 0 0 0 4 0
## X1.27.20 0 0 0 0 0 0 0 4 0
## Kazakhstan Kenya Korea, South Kosovo Kuwait Kyrgyzstan Latvia Lebanon
## X1.22.20 0 0 1 0 0 0 0 0
## X1.23.20 0 0 1 0 0 0 0 0
## X1.24.20 0 0 2 0 0 0 0 0
## X1.25.20 0 0 2 0 0 0 0 0
## X1.26.20 0 0 3 0 0 0 0 0
## X1.27.20 0 0 4 0 0 0 0 0
## Liberia Liechtenstein Lithuania Luxembourg Madagascar Malaysia
## X1.22.20 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 3
## X1.26.20 0 0 0 0 0 4
## X1.27.20 0 0 0 0 0 4
## Maldives Malta Martinique Mauritania Mauritius Mexico Moldova Monaco
## X1.22.20 0 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0 0 0
## X1.27.20 0 0 0 0 0 0 0 0
## Mongolia Montenegro Morocco Mozambique Namibia Nepal Netherlands
## X1.22.20 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 1 0
## X1.26.20 0 0 0 0 0 1 0
## X1.27.20 0 0 0 0 0 1 0
## New Zealand Nicaragua Niger Nigeria North Macedonia Norway Oman
## X1.22.20 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0 0
## X1.27.20 0 0 0 0 0 0 0
## Pakistan Panama Papua New Guinea Paraguay Peru Philippines Poland
## X1.22.20 0 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0 0
## X1.27.20 0 0 0 0 0 0 0
## Portugal Qatar Romania Russia Rwanda Saint Lucia
## X1.22.20 0 0 0 0 0 0
## X1.23.20 0 0 0 0 0 0
## X1.24.20 0 0 0 0 0 0
## X1.25.20 0 0 0 0 0 0
## X1.26.20 0 0 0 0 0 0
## X1.27.20 0 0 0 0 0 0
## Saint Vincent and the Grenadines San Marino Saudi Arabia Senegal
## X1.22.20 0 0 0 0
## X1.23.20 0 0 0 0
## X1.24.20 0 0 0 0
## X1.25.20 0 0 0 0
## X1.26.20 0 0 0 0
## X1.27.20 0 0 0 0
## Serbia Seychelles Singapore Slovakia Slovenia Somalia South Africa
## X1.22.20 0 0 0 0 0 0 0
## X1.23.20 0 0 1 0 0 0 0
## X1.24.20 0 0 3 0 0 0 0
## X1.25.20 0 0 3 0 0 0 0
## X1.26.20 0 0 4 0 0 0 0
## X1.27.20 0 0 5 0 0 0 0
## Spain Sri Lanka Sudan Suriname Sweden Switzerland Syria Taiwan*
## X1.22.20 0 0 0 0 0 0 0 1
## X1.23.20 0 0 0 0 0 0 0 1
## X1.24.20 0 0 0 0 0 0 0 3
## X1.25.20 0 0 0 0 0 0 0 3
## X1.26.20 0 0 0 0 0 0 0 4
## X1.27.20 0 1 0 0 0 0 0 5
## Tanzania Thailand Timor-Leste Togo Trinidad and Tobago Tunisia Turkey
## X1.22.20 0 2 0 0 0 0 0
## X1.23.20 0 3 0 0 0 0 0
## X1.24.20 0 5 0 0 0 0 0
## X1.25.20 0 7 0 0 0 0 0
## X1.26.20 0 8 0 0 0 0 0
## X1.27.20 0 8 0 0 0 0 0
## Uganda Ukraine United Arab Emirates United Kingdom Uruguay US
## X1.22.20 0 0 0 0 0 1
## X1.23.20 0 0 0 0 0 1
## X1.24.20 0 0 0 0 0 2
## X1.25.20 0 0 0 0 0 2
## X1.26.20 0 0 0 0 0 5
## X1.27.20 0 0 0 0 0 5
## Uzbekistan Venezuela Vietnam Zambia Zimbabwe
## X1.22.20 0 0 0 0 0
## X1.23.20 0 0 2 0 0
## X1.24.20 0 0 2 0 0
## X1.25.20 0 0 2 0 0
## X1.26.20 0 0 2 0 0
## X1.27.20 0 0 2 0 0
We’re getting closer. If you printed the head of the data frame you just created, you might have noticed that the columns are now of type <fctr>, or factor. This was one of the side effects of rotating the data frame. We want to turn all of these columns back into doubles.
To do so, pass your data frame to the apply() function. apply() takes two parameters in addition to the data set that you’re using. The MARGIN parameter should be set to 2 — this tells apply() to work with columns rather than rows.
The second parameter should be the function you want to apply to all columns. In this case, we want as.numeric.
Finally, after calling apply(), pipe the result through as.data.frame() to ensure we’re still working with a data frame. Print out the head of your result to see what the data now looks like. Look at the data type of the columns now!
# Transform the columns to numeric values
df_confirmed_transposed <- df_confirmed_transposed %>%
apply(MARGIN = 2, as.numeric) %>%
as.data.frame()
head(df_confirmed_transposed)
## Afghanistan Albania Algeria Andorra Angola Antigua and Barbuda Argentina
## 1 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0
## Armenia Australia Austria Azerbaijan Bahamas, The Bahrain Bangladesh Barbados
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 4 0 0 0 0 0 0
## 6 0 5 0 0 0 0 0 0
## Belarus Belgium Benin Bhutan Bolivia Bosnia and Herzegovina Brazil Brunei
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## Bulgaria Burkina Faso Cabo Verde Cambodia Cameroon Canada Cape Verde
## 1 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 0 1 0
## 6 0 0 0 1 0 1 0
## Central African Republic Chad Chile China Colombia Congo (Brazzaville)
## 1 0 0 0 548 0 0
## 2 0 0 0 643 0 0
## 3 0 0 0 920 0 0
## 4 0 0 0 1406 0 0
## 5 0 0 0 2075 0 0
## 6 0 0 0 2877 0 0
## Congo (Kinshasa) Costa Rica Cote d'Ivoire Croatia Cruise Ship Cuba Cyprus
## 1 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0
## Czechia Denmark Djibouti Dominica Dominican Republic East Timor Ecuador Egypt
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## El Salvador Equatorial Guinea Eritrea Estonia Eswatini Ethiopia Fiji Finland
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## France Gabon Gambia, The Georgia Germany Ghana Greece Grenada Guatemala
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 2 0 0 0 0 0 0 0 0
## 4 3 0 0 0 0 0 0 0 0
## 5 3 0 0 0 0 0 0 0 0
## 6 3 0 0 0 1 0 0 0 0
## Guinea Guyana Haiti Holy See Honduras Hungary Iceland India Indonesia Iran
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0
## Iraq Ireland Israel Italy Jamaica Japan Jordan Kazakhstan Kenya Korea, South
## 1 0 0 0 0 0 2 0 0 0 1
## 2 0 0 0 0 0 1 0 0 0 1
## 3 0 0 0 0 0 2 0 0 0 2
## 4 0 0 0 0 0 2 0 0 0 2
## 5 0 0 0 0 0 4 0 0 0 3
## 6 0 0 0 0 0 4 0 0 0 4
## Kosovo Kuwait Kyrgyzstan Latvia Lebanon Liberia Liechtenstein Lithuania
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## Luxembourg Madagascar Malaysia Maldives Malta Martinique Mauritania Mauritius
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 3 0 0 0 0 0
## 5 0 0 4 0 0 0 0 0
## 6 0 0 4 0 0 0 0 0
## Mexico Moldova Monaco Mongolia Montenegro Morocco Mozambique Namibia Nepal
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 1
## 5 0 0 0 0 0 0 0 0 1
## 6 0 0 0 0 0 0 0 0 1
## Netherlands New Zealand Nicaragua Niger Nigeria North Macedonia Norway Oman
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## Pakistan Panama Papua New Guinea Paraguay Peru Philippines Poland Portugal
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## Qatar Romania Russia Rwanda Saint Lucia Saint Vincent and the Grenadines
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## San Marino Saudi Arabia Senegal Serbia Seychelles Singapore Slovakia Slovenia
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 1 0 0
## 3 0 0 0 0 0 3 0 0
## 4 0 0 0 0 0 3 0 0
## 5 0 0 0 0 0 4 0 0
## 6 0 0 0 0 0 5 0 0
## Somalia South Africa Spain Sri Lanka Sudan Suriname Sweden Switzerland Syria
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 1 0 0 0 0 0
## Taiwan* Tanzania Thailand Timor-Leste Togo Trinidad and Tobago Tunisia Turkey
## 1 1 0 2 0 0 0 0 0
## 2 1 0 3 0 0 0 0 0
## 3 3 0 5 0 0 0 0 0
## 4 3 0 7 0 0 0 0 0
## 5 4 0 8 0 0 0 0 0
## 6 5 0 8 0 0 0 0 0
## Uganda Ukraine United Arab Emirates United Kingdom Uruguay US Uzbekistan
## 1 0 0 0 0 0 1 0
## 2 0 0 0 0 0 1 0
## 3 0 0 0 0 0 2 0
## 4 0 0 0 0 0 2 0
## 5 0 0 0 0 0 5 0
## 6 0 0 0 0 0 5 0
## Venezuela Vietnam Zambia Zimbabwe
## 1 0 0 0 0
## 2 0 2 0 0
## 3 0 2 0 0
## 4 0 2 0 0
## 5 0 2 0 0
## 6 0 2 0 0
Nice work - we got there! Let’s see what our hard work allows us to do. Let’s once again find the maximum number of cases reported in the US. To do so, simply select the US column and pipe the result to the max() function. This should be the same value that we found before we rotate the data frame.
# Find the maximum number of confirmed cases in the US
confirmed_max_us <- df_confirmed_transposed$US %>%
max()
confirmed_max_us
## [1] 33272
We put a lot of work into rotating that data frame, and you may be wondering if it was worth it. In this section, we’re going to make some basic graphs that demonstrate the value of rotating the data frame.
You may not have been introduced to visualization yet. If this is unfamiliar to you, feel free to skip this section — in the following section we offer some ideas for more data manipulation challenges using this data set. That being said, if visualization is unfamiliar to you, you can use this as an opportunity to do some independent learning! Challenge yourself to do some research to create the following graph.
Let’s build a line graph showing the number of confirmed cases over time for a particular country. To do this, we’ll first need to add a new column to our dataset to represent the date (notice that when we rotated our data frame, we lost the column containing the date). The first day in our dataset was January 22nd. Let’s represent that as day 1. January 23rd would then be day 2, and so on. We want to add a column named date that contains the numbers 1 through the total number of rows in the dataset. Use mutate() to do this. nrow() will help find the number of rows in the dataset. Print out the head of the date column to ensure this worked.
# Add the date column
df_confirmed_transposed <- df_confirmed_transposed %>%
mutate(date = as.Date('2020-01-22') + 0:(nrow(df_confirmed_transposed)-1))
df_confirmed_transposed %>%
select(date) %>%
head()
## date
## 1 2020-01-22
## 2 2020-01-23
## 3 2020-01-24
## 4 2020-01-25
## 5 2020-01-26
## 6 2020-01-27
Great! Our date column is all set. Let’s now graph the number of cases in Malaysia over the days in our dataset. We’ll use the ggplot2 library to do this. Use a combination of ggplot(), aes(), and geom_line() to make our graph! Once again, if you’re unfamiliar with graphing, try to do some online research to learn how to do this!
library(ggplot2)
# Create a line graph with date on the X axis and number of cases in Malaysia on the Y axis
df_confirmed_transposed %>%
ggplot(aes(x = date, y = Malaysia)) +
geom_line()
That line of code is pretty concise. That’s one of the reasons why we wanted to rotate the data frame. Having a column containing only the confirmed cases from a particular country made this graph relatively simple to create.
Finally, let’s do a bit of work to add a title, and to fix the names of the axes. There are a few different ways to do this, but we used the labs() function which takes x, y and title parameters.
# Add a proper title, x label, and y label
df_confirmed_transposed %>%
ggplot(aes(x = date, y = Malaysia)) +
geom_line() +
labs(x = 'Date', y = 'Confirmed Cases', title = 'Confirmed Cases in Malaysia')
In the next section we’ll give you ideas on ways to continue visualizing this data.
Great work! In this project you have truly refined your data manipulation skills using a real data set. In doing so, we hope that you’ve seen some of the difficulties you need to work through when working with data. Your data set might have annoying column names. Data may be missing or incorrect. You might need to reshape the data or change its type. Gaining the confidence to master these challenges unique to your dataset is an important part of being a data scientist.
If you’d like to continue working with this data, here are some challenges: