Objective

The goal of this project is to use R tools to transform and prepare dataset for downstream analysis work. three datasets will be selected and used in this project.

Data I: National Agricultural Statistics Service Dataset

Description : The data will be extracted using tabulizer library from the following land values report published in usda.gov website. the report include several tables. we will extract the first table : Farm Real Estate Average Value per Acre –Regions, States, and UnitedStates:2015-2019.

#Load all required packages
library(tabulizer) # This package will be used to extract a table from the pdf report
library(tidyr)
library(hrbrthemes)

## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.

##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and

##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2) 
library(tidyverse)

## ── Attaching packages ───────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ tibble  2.1.3     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ✓ purrr   0.3.3

## ── Conflicts ──────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Extract the data from the USDA PDF file report.

table1 <- extract_tables(
  "https://www.nass.usda.gov/Publications/Todays_Reports/reports/land0819.pdf",
  output = "data.frame",
  pages = c(8, 9), # Specify the table location page 8, 9 
  guess = TRUE )

Let’s clean the DATA

table2_clean <- reduce(table1, bind_rows) %>% # bind elements of list to 1 df
as_tibble() %>%
filter(!(X %in% c("Region and State", ""))) %>%
  
rename(                             # Rename header
    Region_and_State = X,
    Year_2015 = X.1,
    Year_2016 = X.2,
    Year_2017 = X.3,
    Year_2018 = X.4,
    Year_2019 = X.5,
    ) 
table3_clean <- data.frame(lapply(table2_clean, function(x) gsub("\\s\\s[.]*", "", x))) # Remove the dotted lines 2 spaces after string
table3_clean <- data.frame(lapply(table2_clean, function(x) gsub("\\s[.]*", " ", x))) # Remove the dotted lines 1 space after string
landData <- data.frame(lapply(table3_clean, function(x) gsub(",", "", x))) %>%
mutate_at(vars(Year_2015, Year_2016, Year_2017, Year_2018, Year_2019, Change), as.character) %>%
mutate_at(vars(Year_2015, Year_2016, Year_2017, Year_2018, Year_2019, Change), as.numeric) %>%
mutate_at(vars(Region_and_State), as.character)

## Warning: NAs introduced by coercion

# an error will shows up "NAs introduced by coercion" because the change column include dash symbol for NA's

The data is ready lets save it as csv file, and practice some data wrangling with it.

write.csv(landData, file = "landData.csv")

Data wrangling and ploting

Lets plot the Farm Real Estate Average Value per Acre in Utited State

# Extarct United State Data
LandUS <- filter(landData, landData$Region_and_State == "United States 2  ") 
LandUS

##    Region_and_State Year_2015 Year_2016 Year_2017 Year_2018 Year_2019 Change
## 1 United States 2        3000      2990      3030      3100      3160    1.9

# Guther Colomuns into rows and select the needed colomuns 
LandGuther <- gather(LandUS, "Year", "Value_per_Acre", 2:6) %>%
select(Year, Value_per_Acre)
LandGuther

##        Year Value_per_Acre
## 1 Year_2015           3000
## 2 Year_2016           2990
## 3 Year_2017           3030
## 4 Year_2018           3100
## 5 Year_2019           3160

#Plot the Farm Real Estate Average Values per Acre in United States

ggplot(LandGuther, aes(x=Year, y=Value_per_Acre, group =1)) +
  geom_line(color="#69b3a2") + 
  theme_ipsum()+
  geom_point() + 
  ggtitle("Farm Real Estate Average Value per Acre in United States 2015-2019")+
  labs(y= "Dollars per acre", x = "Year")

Extract Data by Region

landByRegion <- filter(landData, Region_and_State == "Northeast  " | Region_and_State == "Lake States  " | Region_and_State == "Corn Belt  " | Region_and_State == "Northern Plains  " | Region_and_State == "Appalachian  " | Region_and_State == "Southeast  " | Region_and_State == "Delta States  " | Region_and_State =="Southern Plains  " | Region_and_State == "Mountain " | Region_and_State == "Pacific  ")
landByRegion

##     Region_and_State Year_2015 Year_2016 Year_2017 Year_2018 Year_2019 Change
## 1        Northeast        5190      5270      5380      5550      5690    2.5
## 2      Lake States        4740      4730      4880      4890      4900    0.2
## 3        Corn Belt        6220      6100      5990      6110      6100   -0.2
## 4  Northern Plains        2320      2200      2150      2110      2170    2.8
## 5      Appalachian        3810      3880      3970      4030      4080    1.2
## 6        Southeast        3740      3830      3990      4050      4090    1.0
## 7     Delta States        2790      2830      2920      3000      3100    3.3
## 8  Southern Plains        1810      1810      1880      2000      2070    3.5
## 9          Mountain       1130      1140      1170      1200      1220    1.7
## 10         Pacific        4800      4960      5440      5610      5900    5.2

ggplot(landByRegion, aes(x=Region_and_State, y=Change, fill=Change))+
  geom_bar(colour="black", stat="identity")+
  theme(axis.text.x = element_text(angle=65, vjust=0.6))+
  ggtitle("Farm Real Estate Value per Acre % Change 2015-2019 by Region")+
  labs(y= "% Change", x = "Region")

Data II : Students performance in two tests in three terms

Let’s Imprort the data from the csv file

As we can see the data is untidy and following are the issues:

Multiple variables are stored in one column
Variables are stored in both rows and columns
Some column headers are values, not variable names
Multiple types of observational units are stored in the same table

StudentsPerfor <- read.csv("https://gist.githubusercontent.com/Kimmirikwa/b69d0ea134820ea52f8481991ffae93e/raw/4db7b1698035ee29885d10e1a59bd902716ae168/student_results.csv")
StudentsPerfor

##    id   name phone sex.and.age test.number term.1 term.2 term.3
## 1   1   Mike   134        m_12      test 1     76     84     87
## 2   2  Linda   270        f_13      test 1     88     90     73
## 3   3    Sam   210        m_11      test 1     78     74     80
## 4   4 Esther   617        f_12      test 1     68     75     74
## 5   5   Mary   114        f_14      test 1     65     67     64
## 6   1   Mike   134        m_12      test 2     85     80     90
## 7   2  Linda   270        f_13      test 2     87     82     94
## 8   3    Sam   210        m_11      test 2     80     87     80
## 9   4 Esther   617        f_12      test 2     70     75     78
## 10  5   Mary   114        f_14      test 2     68     70     63

1. Multiple variables are stored in one column

This data has data for both student and performance. The first thing I will do is to split these different observational units for each to have its own table. The columns id, name, phone and sex and age will be in the student table. test number, term 1, term 2 and term 3 will be in the performance table. id column will also be added to the performance table to identify the performance’s student.

students_tbl <- select(StudentsPerfor, id, name, phone, sex.and.age)
students_tbl

##    id   name phone sex.and.age
## 1   1   Mike   134        m_12
## 2   2  Linda   270        f_13
## 3   3    Sam   210        m_11
## 4   4 Esther   617        f_12
## 5   5   Mary   114        f_14
## 6   1   Mike   134        m_12
## 7   2  Linda   270        f_13
## 8   3    Sam   210        m_11
## 9   4 Esther   617        f_12
## 10  5   Mary   114        f_14

performance_tbl <- select(StudentsPerfor, id, test.number, term.1, term.2, term.3)
performance_tbl

##    id test.number term.1 term.2 term.3
## 1   1      test 1     76     84     87
## 2   2      test 1     88     90     73
## 3   3      test 1     78     74     80
## 4   4      test 1     68     75     74
## 5   5      test 1     65     67     64
## 6   1      test 2     85     80     90
## 7   2      test 2     87     82     94
## 8   3      test 2     80     87     80
## 9   4      test 2     70     75     78
## 10  5      test 2     68     70     63

2. Variables are stored in both rows and columns

The sex and age column in student_tbl has data for both sex and age. The sex and age for each student is extracted and added to the relevant column.

students_tbl2 <- separate(students_tbl, sex.and.age, c("Sex", "Age"), sep = "_")
students_tbl2

##    id   name phone Sex Age
## 1   1   Mike   134   m  12
## 2   2  Linda   270   f  13
## 3   3    Sam   210   m  11
## 4   4 Esther   617   f  12
## 5   5   Mary   114   f  14
## 6   1   Mike   134   m  12
## 7   2  Linda   270   f  13
## 8   3    Sam   210   m  11
## 9   4 Esther   617   f  12
## 10  5   Mary   114   f  14

3. Some column headers are values, not variable names

term 1, term 2 and term 3 are values but are used as column headers. We’ll make term 1, term 2 and term 3 columns in the term column and their previous values are in the marks column.

performance_tbl2 <- gather(performance_tbl, "Term", "Marks", 3:5)
performance_tbl2

##    id test.number   Term Marks
## 1   1      test 1 term.1    76
## 2   2      test 1 term.1    88
## 3   3      test 1 term.1    78
## 4   4      test 1 term.1    68
## 5   5      test 1 term.1    65
## 6   1      test 2 term.1    85
## 7   2      test 2 term.1    87
## 8   3      test 2 term.1    80
## 9   4      test 2 term.1    70
## 10  5      test 2 term.1    68
## 11  1      test 1 term.2    84
## 12  2      test 1 term.2    90
## 13  3      test 1 term.2    74
## 14  4      test 1 term.2    75
## 15  5      test 1 term.2    67
## 16  1      test 2 term.2    80
## 17  2      test 2 term.2    82
## 18  3      test 2 term.2    87
## 19  4      test 2 term.2    75
## 20  5      test 2 term.2    70
## 21  1      test 1 term.3    87
## 22  2      test 1 term.3    73
## 23  3      test 1 term.3    80
## 24  4      test 1 term.3    74
## 25  5      test 1 term.3    64
## 26  1      test 2 term.3    90
## 27  2      test 2 term.3    94
## 28  3      test 2 term.3    80
## 29  4      test 2 term.3    78
## 30  5      test 2 term.3    63

4. Multiple variables are stored in one column

test 1 and test 2 are different variables but are stored in test number column. test 1 and test 2 columns are added with their respective marks as. Also we’ll the names from the sutdents table

performance_tbl3 <- spread(performance_tbl2, test.number, Marks) %>%
rename(
   "test.1.marks" = "test 1",
    "test.2.marks" = "test 2"
) %>% 
left_join(select(students_tbl, id, name), by = "id") %>%
select(id, name, Term, test.1.marks, test.2.marks)
performance_tbl3

##    id   name   Term test.1.marks test.2.marks
## 1   1   Mike term.1           76           85
## 2   1   Mike term.1           76           85
## 3   1   Mike term.2           84           80
## 4   1   Mike term.2           84           80
## 5   1   Mike term.3           87           90
## 6   1   Mike term.3           87           90
## 7   2  Linda term.1           88           87
## 8   2  Linda term.1           88           87
## 9   2  Linda term.2           90           82
## 10  2  Linda term.2           90           82
## 11  2  Linda term.3           73           94
## 12  2  Linda term.3           73           94
## 13  3    Sam term.1           78           80
## 14  3    Sam term.1           78           80
## 15  3    Sam term.2           74           87
## 16  3    Sam term.2           74           87
## 17  3    Sam term.3           80           80
## 18  3    Sam term.3           80           80
## 19  4 Esther term.1           68           70
## 20  4 Esther term.1           68           70
## 21  4 Esther term.2           75           75
## 22  4 Esther term.2           75           75
## 23  4 Esther term.3           74           78
## 24  4 Esther term.3           74           78
## 25  5   Mary term.1           65           68
## 26  5   Mary term.1           65           68
## 27  5   Mary term.2           67           70
## 28  5   Mary term.2           67           70
## 29  5   Mary term.3           64           63
## 30  5   Mary term.3           64           63

Calculate the average

Here we we’ll cacluate a new variable (average), and select the Term column by terms

mutate(performance_tbl3, average = (test.1.marks + test.2.marks)/2) %>%
select(name, Term, average) %>%
group_by(name)

## # A tibble: 30 x 3
## # Groups:   name [5]
##    name  Term   average
##    <fct> <chr>    <dbl>
##  1 Mike  term.1    80.5
##  2 Mike  term.1    80.5
##  3 Mike  term.2    82  
##  4 Mike  term.2    82  
##  5 Mike  term.3    88.5
##  6 Mike  term.3    88.5
##  7 Linda term.1    87.5
##  8 Linda term.1    87.5
##  9 Linda term.2    86  
## 10 Linda term.2    86  
## # … with 20 more rows

Data III : Coronona Virus data

Load the data from the data source

coronaDB <- read.csv("https://raw.githubusercontent.com/CryptoKass/ncov-data/master/world.latest.bno.csv")
tbl_df(coronaDB)

## # A tibble: 26 x 6
##        X country   cases deaths notes            links                          
##    <int> <fct>     <int>  <int> <fct>            <fct>                          
##  1     0 China     59804   1365 0                https://bnonews.com/index.php/…
##  2     1 Hong Kong    50      1 4 critical, 2 s… https://edition.cnn.com/asia/l…
##  3     2 Taiwan       18      0 1 recovered      https://www.cdc.gov.tw/Bulleti…
##  4     3 Macau        10      0 1 recovered      https://news.gov.mo/detail/zh-…
##  5     4 Japan       247      0 4 serious, 4 re… https://www3.nhk.or.jp/nhkworl…
##  6     5 Singapore    50      0 8 critical, 15 … https://www.moh.gov.sg/news-hi…
##  7     6 Thailand     33      0 1 serious, 10 r… https://pr.moph.go.th/?url=pr/…
##  8     7 South Ko…    28      0 7 recovered      https://en.yna.co.kr/view/AEN2…
##  9     8 Malaysia     18      0 3 recovered      https://www.channelnewsasia.co…
## 10     9 Australia    15      0 8 recovered      https://www.smh.com.au/nationa…
## # … with 16 more rows

As we can see from the data the ‘notes’ column contained counts of individuals in critical, serious or recovered from the coronavirus. let’s extract these infromation and put it in separate columns. we have 3 type of conditions (critical, serious, recovered) seprated by comma, so we can use the seprate() function to extract the information.

corona_tbl <- separate(coronaDB, notes, c("condition1", "condition2", "condition3", NA), sep = ", ")

## Warning: Expected 4 pieces. Missing pieces filled with `NA` in 26 rows [1, 2, 3,
## 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

head(corona_tbl)

##   X   country cases deaths  condition1   condition2  condition3
## 1 0     China 59804   1365           0         <NA>        <NA>
## 2 1 Hong Kong    50      1  4 critical    2 serious 1 recovered
## 3 2    Taiwan    18      0 1 recovered         <NA>        <NA>
## 4 3     Macau    10      0 1 recovered         <NA>        <NA>
## 5 4     Japan   247      0   4 serious  4 recovered        <NA>
## 6 5 Singapore    50      0  8 critical 15 recovered        <NA>
##                                                                                                                      links
## 1                                                      https://bnonews.com/index.php/2020/02/the-latest-coronavirus-cases/
## 2         https://edition.cnn.com/asia/live-news/coronavirus-outbreak-02-12-20-intl-hnk/h_b49323f6a8ba348a79c4cda5886339c7
## 3                                                   https://www.cdc.gov.tw/Bulletin/Detail/iMZg3IYhKMeo-87fTw8hpQ?typeid=9
## 4                                                                          https://news.gov.mo/detail/zh-hant/N20BDPAzBd?3
## 5                                                                     https://www3.nhk.or.jp/nhkworld/en/news/20200213_29/
## 6 https://www.moh.gov.sg/news-highlights/details/six-more-cases-discharged-three-new-cases-of-covid-19-infection-confirmed

Now that we have three new columns for the conditions type and the number of conditions, we’ll need to separate the numbers from the health condition using space as a seprator. we’ll call the new columns NC (# of critical), NR (# of recovered), NS (# of serious conditions)

corona_tbl2 <- separate(corona_tbl, condition1, c("NC", "condition1"), sep = "\\s") %>% # NC is the number of critical cases
 separate(condition2, c("NR", "condition2"), sep = "\\s") %>% # NR is the number of Recovered cases
separate(condition3, c("NS", "condition3", NA), sep = " ") # NS is the number of serious cases

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 4 rows [1, 11,
## 19, 22].

## Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [17].

## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].

head(corona_tbl2)

##   X   country cases deaths NC condition1   NR condition2   NS condition3
## 1 0     China 59804   1365  0       <NA> <NA>       <NA> <NA>       <NA>
## 2 1 Hong Kong    50      1  4   critical    2    serious    1  recovered
## 3 2    Taiwan    18      0  1  recovered <NA>       <NA> <NA>       <NA>
## 4 3     Macau    10      0  1  recovered <NA>       <NA> <NA>       <NA>
## 5 4     Japan   247      0  4    serious    4  recovered <NA>       <NA>
## 6 5 Singapore    50      0  8   critical   15  recovered <NA>       <NA>
##                                                                                                                      links
## 1                                                      https://bnonews.com/index.php/2020/02/the-latest-coronavirus-cases/
## 2         https://edition.cnn.com/asia/live-news/coronavirus-outbreak-02-12-20-intl-hnk/h_b49323f6a8ba348a79c4cda5886339c7
## 3                                                   https://www.cdc.gov.tw/Bulletin/Detail/iMZg3IYhKMeo-87fTw8hpQ?typeid=9
## 4                                                                          https://news.gov.mo/detail/zh-hant/N20BDPAzBd?3
## 5                                                                     https://www3.nhk.or.jp/nhkworld/en/news/20200213_29/
## 6 https://www.moh.gov.sg/news-highlights/details/six-more-cases-discharged-three-new-cases-of-covid-19-infection-confirmed

Because the information in the notes column wasn’t in the same order (critical, recovered, serious), we now have mixed data in our new conditions columns. we have to fix this. so we’ll use the mutate() function to swipe these values, and put them in the right column.

corona_tbl2 %>% mutate(condition1 = ifelse(condition1 == "recovered", condition2, condition1)) %>%
mutate(condition1 = ifelse(condition1 == "serious", condition3, condition1)) %>%
mutate(condition2 = ifelse(condition2 == "serious", condition3, condition2)) %>%
mutate(condition3 = ifelse(condition3 == "recovered", "serious", condition2))

##     X        country cases deaths NC condition1   NR condition2   NS condition3
## 1   0          China 59804   1365  0       <NA> <NA>       <NA> <NA>       <NA>
## 2   1      Hong Kong    50      1  4   critical    2  recovered    1    serious
## 3   2         Taiwan    18      0  1       <NA> <NA>       <NA> <NA>       <NA>
## 4   3          Macau    10      0  1       <NA> <NA>       <NA> <NA>       <NA>
## 5   4          Japan   247      0  4       <NA>    4  recovered <NA>       <NA>
## 6   5      Singapore    50      0  8   critical   15  recovered <NA>       <NA>
## 7   6       Thailand    33      0  1       <NA>   10  recovered <NA>       <NA>
## 8   7    South Korea    28      0  7       <NA> <NA>       <NA> <NA>       <NA>
## 9   8       Malaysia    18      0  3       <NA> <NA>       <NA> <NA>       <NA>
## 10  9      Australia    15      0  8       <NA> <NA>       <NA> <NA>       <NA>
## 11 10        Germany    16      0  0       <NA> <NA>       <NA> <NA>       <NA>
## 12 11        Vietnam    16      0  7       <NA> <NA>       <NA> <NA>       <NA>
## 13 12  United States    14      0  3       <NA> <NA>       <NA> <NA>       <NA>
## 14 13         France    11      0  1       <NA>    2  recovered <NA>       <NA>
## 15 14 United Kingdom     9      0  1       <NA> <NA>       <NA> <NA>       <NA>
## 16 15         Canada     7      0  1       <NA> <NA>       <NA> <NA>       <NA>
## 17 16            UAE     8      0  1       <NA>    1  recovered <NA>       <NA>
## 18 17    Philippines     3      1  2       <NA> <NA>       <NA> <NA>       <NA>
## 19 18          India     3      0  0       <NA> <NA>       <NA> <NA>       <NA>
## 20 19          Italy     3      0  2       <NA> <NA>       <NA> <NA>       <NA>
## 21 20         Russia     2      0  2       <NA> <NA>       <NA> <NA>       <NA>
## 22 21          Spain     2      0  0       <NA> <NA>       <NA> <NA>       <NA>
## 23 22          Nepal     1      0  1       <NA> <NA>       <NA> <NA>       <NA>
## 24 23       Cambodia     1      0  1       <NA> <NA>       <NA> <NA>       <NA>
## 25 24      Sri Lanka     1      0  1       <NA> <NA>       <NA> <NA>       <NA>
## 26 25        Finland     1      0  1       <NA> <NA>       <NA> <NA>       <NA>
##                                                                                                                                                                     links
## 1                                                                                                     https://bnonews.com/index.php/2020/02/the-latest-coronavirus-cases/
## 2                                                        https://edition.cnn.com/asia/live-news/coronavirus-outbreak-02-12-20-intl-hnk/h_b49323f6a8ba348a79c4cda5886339c7
## 3                                                                                                  https://www.cdc.gov.tw/Bulletin/Detail/iMZg3IYhKMeo-87fTw8hpQ?typeid=9
## 4                                                                                                                         https://news.gov.mo/detail/zh-hant/N20BDPAzBd?3
## 5                                                                                                                    https://www3.nhk.or.jp/nhkworld/en/news/20200213_29/
## 6                                                https://www.moh.gov.sg/news-highlights/details/six-more-cases-discharged-three-new-cases-of-covid-19-infection-confirmed
## 7                                                                                                                       https://pr.moph.go.th/?url=pr/detail/2/04/138459/
## 8                                                                                                                          https://en.yna.co.kr/view/AEN20200212007600320
## 9                                                          https://www.channelnewsasia.com/news/asia/malaysia-recovered-wuhan-coronavirus-patients-china-embassy-12421116
## 10                                                               https://www.smh.com.au/national/all-four-nsw-coronavirus-patients-free-from-disease-20200213-p540i9.html
## 11                                        https://www.stmgp.bayern.de/presse/aktuelle-informationen-zur-coronavirus-lage-in-bayern-bayerisches-gesundheitsministerium-15/
## 12                                                                        https://vietnamnet.vn/vn/suc-khoe/suc-khoe-24h/nguoi-thu-16-nhiem-ncov-tai-viet-nam-615372.html
## 13                                                                                                                 https://twitter.com/BNODesk/status/1227752734486650880
## 14                                                       https://edition.cnn.com/asia/live-news/coronavirus-outbreak-02-12-20-intl-hnk/h_6ea5df590e539c8c15ab1f218588a086
## 15                                                                                                                             https://www.bbc.co.uk/news/amp/uk-51481469
## 16                                                            https://news.ontario.ca/mohltc/en/2020/02/ontario-confirms-resolved-case-of-the-2019-novel-coronavirus.html
## 17                                                                                                                https://twitter.com/mohapuae/status/1226926065563656193
## 18                                                                                                                https://twitter.com/mohapuae/status/1226552040559718400
## 19                                                                                                             https://www.pib.gov.in/PressReleseDetail.aspx?PRID=1601681
## 20 https://www.ansa.it/canale_saluteebenessere/notizie/sanita/2020/02/03/coronavirus-accertamenti-su-italiano-alla-cecchignola-_b20da9f2-57a5-4870-978f-23e9c50d6155.html
## 21                                                                 https://www.channelnewsasia.com/news/world/russia-discharges-second-china-coronavirus-patient-12426458
## 22                                                                                                          https://twitter.com/SaludPublicaEs/status/1226432655555731457
## 23                                                                                    https://www.thelancet.com/journals/laninf/article/PIIS1473-3099(20)30067-0/fulltext
## 24                                                                                                             https://twitter.com/VOD_English/status/1221769368180121603
## 25                                   https://www.reuters.com/article/us-health-china-sri-lanka/sri-lanka-confirms-first-case-of-coronavirus-health-official-idUSKBN1ZQ1WF
## 26                                                     https://yle.fi/uutiset/osasto/news/finlands_first_coronavirus_patient_released_from_hospital_symptom-free/11193661

corona_tbl3 <- select(corona_tbl2, country, cases, deaths)
head(corona_tbl3)

##     country cases deaths
## 1     China 59804   1365
## 2 Hong Kong    50      1
## 3    Taiwan    18      0
## 4     Macau    10      0
## 5     Japan   247      0
## 6 Singapore    50      0

corona_tbl4 <- gather(corona_tbl3, "corona_data", "numbers", 2:3)
head(corona_tbl4)

##     country corona_data numbers
## 1     China       cases   59804
## 2 Hong Kong       cases      50
## 3    Taiwan       cases      18
## 4     Macau       cases      10
## 5     Japan       cases     247
## 6 Singapore       cases      50

We wanted to create a stacked histogram using the transformed data, but there are several NA, and missing values.

Concluion

In this project i discovered the power of R, when handling and transforming data. by using small codes i was able to perform some complex data transformations, that may need long coding with other tools like panda in python.

Project 2

Abdellah, Ait Elmouden

3/6/2020