The goal of this project is to use R tools to transform and prepare dataset for downstream analysis work. three datasets will be selected and used in this project.
Description : The data will be extracted using tabulizer library from the following land values report published in usda.gov website. the report include several tables. we will extract the first table : Farm Real Estate Average Value per Acre –Regions, States, and UnitedStates:2015-2019.
#Load all required packages
library(tabulizer) # This package will be used to extract a table from the pdf report
library(tidyr)
library(hrbrthemes)
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
## Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
## if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble 2.1.3 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ✓ purrr 0.3.3
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
table1 <- extract_tables(
"https://www.nass.usda.gov/Publications/Todays_Reports/reports/land0819.pdf",
output = "data.frame",
pages = c(8, 9), # Specify the table location page 8, 9
guess = TRUE )
table2_clean <- reduce(table1, bind_rows) %>% # bind elements of list to 1 df
as_tibble() %>%
filter(!(X %in% c("Region and State", ""))) %>%
rename( # Rename header
Region_and_State = X,
Year_2015 = X.1,
Year_2016 = X.2,
Year_2017 = X.3,
Year_2018 = X.4,
Year_2019 = X.5,
)
table3_clean <- data.frame(lapply(table2_clean, function(x) gsub("\\s\\s[.]*", "", x))) # Remove the dotted lines 2 spaces after string
table3_clean <- data.frame(lapply(table2_clean, function(x) gsub("\\s[.]*", " ", x))) # Remove the dotted lines 1 space after string
landData <- data.frame(lapply(table3_clean, function(x) gsub(",", "", x))) %>%
mutate_at(vars(Year_2015, Year_2016, Year_2017, Year_2018, Year_2019, Change), as.character) %>%
mutate_at(vars(Year_2015, Year_2016, Year_2017, Year_2018, Year_2019, Change), as.numeric) %>%
mutate_at(vars(Region_and_State), as.character)
## Warning: NAs introduced by coercion
# an error will shows up "NAs introduced by coercion" because the change column include dash symbol for NA's
The data is ready lets save it as csv file, and practice some data wrangling with it.
write.csv(landData, file = "landData.csv")
# Extarct United State Data
LandUS <- filter(landData, landData$Region_and_State == "United States 2 ")
LandUS
## Region_and_State Year_2015 Year_2016 Year_2017 Year_2018 Year_2019 Change
## 1 United States 2 3000 2990 3030 3100 3160 1.9
# Guther Colomuns into rows and select the needed colomuns
LandGuther <- gather(LandUS, "Year", "Value_per_Acre", 2:6) %>%
select(Year, Value_per_Acre)
LandGuther
## Year Value_per_Acre
## 1 Year_2015 3000
## 2 Year_2016 2990
## 3 Year_2017 3030
## 4 Year_2018 3100
## 5 Year_2019 3160
#Plot the Farm Real Estate Average Values per Acre in United States
ggplot(LandGuther, aes(x=Year, y=Value_per_Acre, group =1)) +
geom_line(color="#69b3a2") +
theme_ipsum()+
geom_point() +
ggtitle("Farm Real Estate Average Value per Acre in United States 2015-2019")+
labs(y= "Dollars per acre", x = "Year")
landByRegion <- filter(landData, Region_and_State == "Northeast " | Region_and_State == "Lake States " | Region_and_State == "Corn Belt " | Region_and_State == "Northern Plains " | Region_and_State == "Appalachian " | Region_and_State == "Southeast " | Region_and_State == "Delta States " | Region_and_State =="Southern Plains " | Region_and_State == "Mountain " | Region_and_State == "Pacific ")
landByRegion
## Region_and_State Year_2015 Year_2016 Year_2017 Year_2018 Year_2019 Change
## 1 Northeast 5190 5270 5380 5550 5690 2.5
## 2 Lake States 4740 4730 4880 4890 4900 0.2
## 3 Corn Belt 6220 6100 5990 6110 6100 -0.2
## 4 Northern Plains 2320 2200 2150 2110 2170 2.8
## 5 Appalachian 3810 3880 3970 4030 4080 1.2
## 6 Southeast 3740 3830 3990 4050 4090 1.0
## 7 Delta States 2790 2830 2920 3000 3100 3.3
## 8 Southern Plains 1810 1810 1880 2000 2070 3.5
## 9 Mountain 1130 1140 1170 1200 1220 1.7
## 10 Pacific 4800 4960 5440 5610 5900 5.2
ggplot(landByRegion, aes(x=Region_and_State, y=Change, fill=Change))+
geom_bar(colour="black", stat="identity")+
theme(axis.text.x = element_text(angle=65, vjust=0.6))+
ggtitle("Farm Real Estate Value per Acre % Change 2015-2019 by Region")+
labs(y= "% Change", x = "Region")
Let’s Imprort the data from the csv file
As we can see the data is untidy and following are the issues:
StudentsPerfor <- read.csv("https://gist.githubusercontent.com/Kimmirikwa/b69d0ea134820ea52f8481991ffae93e/raw/4db7b1698035ee29885d10e1a59bd902716ae168/student_results.csv")
StudentsPerfor
## id name phone sex.and.age test.number term.1 term.2 term.3
## 1 1 Mike 134 m_12 test 1 76 84 87
## 2 2 Linda 270 f_13 test 1 88 90 73
## 3 3 Sam 210 m_11 test 1 78 74 80
## 4 4 Esther 617 f_12 test 1 68 75 74
## 5 5 Mary 114 f_14 test 1 65 67 64
## 6 1 Mike 134 m_12 test 2 85 80 90
## 7 2 Linda 270 f_13 test 2 87 82 94
## 8 3 Sam 210 m_11 test 2 80 87 80
## 9 4 Esther 617 f_12 test 2 70 75 78
## 10 5 Mary 114 f_14 test 2 68 70 63
1. Multiple variables are stored in one column
This data has data for both student and performance. The first thing I will do is to split these different observational units for each to have its own table. The columns id, name, phone and sex and age will be in the student table. test number, term 1, term 2 and term 3 will be in the performance table. id column will also be added to the performance table to identify the performance’s student.
students_tbl <- select(StudentsPerfor, id, name, phone, sex.and.age)
students_tbl
## id name phone sex.and.age
## 1 1 Mike 134 m_12
## 2 2 Linda 270 f_13
## 3 3 Sam 210 m_11
## 4 4 Esther 617 f_12
## 5 5 Mary 114 f_14
## 6 1 Mike 134 m_12
## 7 2 Linda 270 f_13
## 8 3 Sam 210 m_11
## 9 4 Esther 617 f_12
## 10 5 Mary 114 f_14
performance_tbl <- select(StudentsPerfor, id, test.number, term.1, term.2, term.3)
performance_tbl
## id test.number term.1 term.2 term.3
## 1 1 test 1 76 84 87
## 2 2 test 1 88 90 73
## 3 3 test 1 78 74 80
## 4 4 test 1 68 75 74
## 5 5 test 1 65 67 64
## 6 1 test 2 85 80 90
## 7 2 test 2 87 82 94
## 8 3 test 2 80 87 80
## 9 4 test 2 70 75 78
## 10 5 test 2 68 70 63
2. Variables are stored in both rows and columns
The sex and age column in student_tbl has data for both sex and age. The sex and age for each student is extracted and added to the relevant column.
students_tbl2 <- separate(students_tbl, sex.and.age, c("Sex", "Age"), sep = "_")
students_tbl2
## id name phone Sex Age
## 1 1 Mike 134 m 12
## 2 2 Linda 270 f 13
## 3 3 Sam 210 m 11
## 4 4 Esther 617 f 12
## 5 5 Mary 114 f 14
## 6 1 Mike 134 m 12
## 7 2 Linda 270 f 13
## 8 3 Sam 210 m 11
## 9 4 Esther 617 f 12
## 10 5 Mary 114 f 14
3. Some column headers are values, not variable names
term 1, term 2 and term 3 are values but are used as column headers. We’ll make term 1, term 2 and term 3 columns in the term column and their previous values are in the marks column.
performance_tbl2 <- gather(performance_tbl, "Term", "Marks", 3:5)
performance_tbl2
## id test.number Term Marks
## 1 1 test 1 term.1 76
## 2 2 test 1 term.1 88
## 3 3 test 1 term.1 78
## 4 4 test 1 term.1 68
## 5 5 test 1 term.1 65
## 6 1 test 2 term.1 85
## 7 2 test 2 term.1 87
## 8 3 test 2 term.1 80
## 9 4 test 2 term.1 70
## 10 5 test 2 term.1 68
## 11 1 test 1 term.2 84
## 12 2 test 1 term.2 90
## 13 3 test 1 term.2 74
## 14 4 test 1 term.2 75
## 15 5 test 1 term.2 67
## 16 1 test 2 term.2 80
## 17 2 test 2 term.2 82
## 18 3 test 2 term.2 87
## 19 4 test 2 term.2 75
## 20 5 test 2 term.2 70
## 21 1 test 1 term.3 87
## 22 2 test 1 term.3 73
## 23 3 test 1 term.3 80
## 24 4 test 1 term.3 74
## 25 5 test 1 term.3 64
## 26 1 test 2 term.3 90
## 27 2 test 2 term.3 94
## 28 3 test 2 term.3 80
## 29 4 test 2 term.3 78
## 30 5 test 2 term.3 63
4. Multiple variables are stored in one column
test 1 and test 2 are different variables but are stored in test number column. test 1 and test 2 columns are added with their respective marks as. Also we’ll the names from the sutdents table
performance_tbl3 <- spread(performance_tbl2, test.number, Marks) %>%
rename(
"test.1.marks" = "test 1",
"test.2.marks" = "test 2"
) %>%
left_join(select(students_tbl, id, name), by = "id") %>%
select(id, name, Term, test.1.marks, test.2.marks)
performance_tbl3
## id name Term test.1.marks test.2.marks
## 1 1 Mike term.1 76 85
## 2 1 Mike term.1 76 85
## 3 1 Mike term.2 84 80
## 4 1 Mike term.2 84 80
## 5 1 Mike term.3 87 90
## 6 1 Mike term.3 87 90
## 7 2 Linda term.1 88 87
## 8 2 Linda term.1 88 87
## 9 2 Linda term.2 90 82
## 10 2 Linda term.2 90 82
## 11 2 Linda term.3 73 94
## 12 2 Linda term.3 73 94
## 13 3 Sam term.1 78 80
## 14 3 Sam term.1 78 80
## 15 3 Sam term.2 74 87
## 16 3 Sam term.2 74 87
## 17 3 Sam term.3 80 80
## 18 3 Sam term.3 80 80
## 19 4 Esther term.1 68 70
## 20 4 Esther term.1 68 70
## 21 4 Esther term.2 75 75
## 22 4 Esther term.2 75 75
## 23 4 Esther term.3 74 78
## 24 4 Esther term.3 74 78
## 25 5 Mary term.1 65 68
## 26 5 Mary term.1 65 68
## 27 5 Mary term.2 67 70
## 28 5 Mary term.2 67 70
## 29 5 Mary term.3 64 63
## 30 5 Mary term.3 64 63
Calculate the average
Here we we’ll cacluate a new variable (average), and select the Term column by terms
mutate(performance_tbl3, average = (test.1.marks + test.2.marks)/2) %>%
select(name, Term, average) %>%
group_by(name)
## # A tibble: 30 x 3
## # Groups: name [5]
## name Term average
## <fct> <chr> <dbl>
## 1 Mike term.1 80.5
## 2 Mike term.1 80.5
## 3 Mike term.2 82
## 4 Mike term.2 82
## 5 Mike term.3 88.5
## 6 Mike term.3 88.5
## 7 Linda term.1 87.5
## 8 Linda term.1 87.5
## 9 Linda term.2 86
## 10 Linda term.2 86
## # … with 20 more rows
Load the data from the data source
coronaDB <- read.csv("https://raw.githubusercontent.com/CryptoKass/ncov-data/master/world.latest.bno.csv")
tbl_df(coronaDB)
## # A tibble: 26 x 6
## X country cases deaths notes links
## <int> <fct> <int> <int> <fct> <fct>
## 1 0 China 59804 1365 0 https://bnonews.com/index.php/…
## 2 1 Hong Kong 50 1 4 critical, 2 s… https://edition.cnn.com/asia/l…
## 3 2 Taiwan 18 0 1 recovered https://www.cdc.gov.tw/Bulleti…
## 4 3 Macau 10 0 1 recovered https://news.gov.mo/detail/zh-…
## 5 4 Japan 247 0 4 serious, 4 re… https://www3.nhk.or.jp/nhkworl…
## 6 5 Singapore 50 0 8 critical, 15 … https://www.moh.gov.sg/news-hi…
## 7 6 Thailand 33 0 1 serious, 10 r… https://pr.moph.go.th/?url=pr/…
## 8 7 South Ko… 28 0 7 recovered https://en.yna.co.kr/view/AEN2…
## 9 8 Malaysia 18 0 3 recovered https://www.channelnewsasia.co…
## 10 9 Australia 15 0 8 recovered https://www.smh.com.au/nationa…
## # … with 16 more rows
As we can see from the data the ‘notes’ column contained counts of individuals in critical, serious or recovered from the coronavirus. let’s extract these infromation and put it in separate columns. we have 3 type of conditions (critical, serious, recovered) seprated by comma, so we can use the seprate() function to extract the information.
corona_tbl <- separate(coronaDB, notes, c("condition1", "condition2", "condition3", NA), sep = ", ")
## Warning: Expected 4 pieces. Missing pieces filled with `NA` in 26 rows [1, 2, 3,
## 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
head(corona_tbl)
## X country cases deaths condition1 condition2 condition3
## 1 0 China 59804 1365 0 <NA> <NA>
## 2 1 Hong Kong 50 1 4 critical 2 serious 1 recovered
## 3 2 Taiwan 18 0 1 recovered <NA> <NA>
## 4 3 Macau 10 0 1 recovered <NA> <NA>
## 5 4 Japan 247 0 4 serious 4 recovered <NA>
## 6 5 Singapore 50 0 8 critical 15 recovered <NA>
## links
## 1 https://bnonews.com/index.php/2020/02/the-latest-coronavirus-cases/
## 2 https://edition.cnn.com/asia/live-news/coronavirus-outbreak-02-12-20-intl-hnk/h_b49323f6a8ba348a79c4cda5886339c7
## 3 https://www.cdc.gov.tw/Bulletin/Detail/iMZg3IYhKMeo-87fTw8hpQ?typeid=9
## 4 https://news.gov.mo/detail/zh-hant/N20BDPAzBd?3
## 5 https://www3.nhk.or.jp/nhkworld/en/news/20200213_29/
## 6 https://www.moh.gov.sg/news-highlights/details/six-more-cases-discharged-three-new-cases-of-covid-19-infection-confirmed
Now that we have three new columns for the conditions type and the number of conditions, we’ll need to separate the numbers from the health condition using space as a seprator. we’ll call the new columns NC (# of critical), NR (# of recovered), NS (# of serious conditions)
corona_tbl2 <- separate(corona_tbl, condition1, c("NC", "condition1"), sep = "\\s") %>% # NC is the number of critical cases
separate(condition2, c("NR", "condition2"), sep = "\\s") %>% # NR is the number of Recovered cases
separate(condition3, c("NS", "condition3", NA), sep = " ") # NS is the number of serious cases
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 4 rows [1, 11,
## 19, 22].
## Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [17].
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
head(corona_tbl2)
## X country cases deaths NC condition1 NR condition2 NS condition3
## 1 0 China 59804 1365 0 <NA> <NA> <NA> <NA> <NA>
## 2 1 Hong Kong 50 1 4 critical 2 serious 1 recovered
## 3 2 Taiwan 18 0 1 recovered <NA> <NA> <NA> <NA>
## 4 3 Macau 10 0 1 recovered <NA> <NA> <NA> <NA>
## 5 4 Japan 247 0 4 serious 4 recovered <NA> <NA>
## 6 5 Singapore 50 0 8 critical 15 recovered <NA> <NA>
## links
## 1 https://bnonews.com/index.php/2020/02/the-latest-coronavirus-cases/
## 2 https://edition.cnn.com/asia/live-news/coronavirus-outbreak-02-12-20-intl-hnk/h_b49323f6a8ba348a79c4cda5886339c7
## 3 https://www.cdc.gov.tw/Bulletin/Detail/iMZg3IYhKMeo-87fTw8hpQ?typeid=9
## 4 https://news.gov.mo/detail/zh-hant/N20BDPAzBd?3
## 5 https://www3.nhk.or.jp/nhkworld/en/news/20200213_29/
## 6 https://www.moh.gov.sg/news-highlights/details/six-more-cases-discharged-three-new-cases-of-covid-19-infection-confirmed
Because the information in the notes column wasn’t in the same order (critical, recovered, serious), we now have mixed data in our new conditions columns. we have to fix this. so we’ll use the mutate() function to swipe these values, and put them in the right column.
corona_tbl2 %>% mutate(condition1 = ifelse(condition1 == "recovered", condition2, condition1)) %>%
mutate(condition1 = ifelse(condition1 == "serious", condition3, condition1)) %>%
mutate(condition2 = ifelse(condition2 == "serious", condition3, condition2)) %>%
mutate(condition3 = ifelse(condition3 == "recovered", "serious", condition2))
## X country cases deaths NC condition1 NR condition2 NS condition3
## 1 0 China 59804 1365 0 <NA> <NA> <NA> <NA> <NA>
## 2 1 Hong Kong 50 1 4 critical 2 recovered 1 serious
## 3 2 Taiwan 18 0 1 <NA> <NA> <NA> <NA> <NA>
## 4 3 Macau 10 0 1 <NA> <NA> <NA> <NA> <NA>
## 5 4 Japan 247 0 4 <NA> 4 recovered <NA> <NA>
## 6 5 Singapore 50 0 8 critical 15 recovered <NA> <NA>
## 7 6 Thailand 33 0 1 <NA> 10 recovered <NA> <NA>
## 8 7 South Korea 28 0 7 <NA> <NA> <NA> <NA> <NA>
## 9 8 Malaysia 18 0 3 <NA> <NA> <NA> <NA> <NA>
## 10 9 Australia 15 0 8 <NA> <NA> <NA> <NA> <NA>
## 11 10 Germany 16 0 0 <NA> <NA> <NA> <NA> <NA>
## 12 11 Vietnam 16 0 7 <NA> <NA> <NA> <NA> <NA>
## 13 12 United States 14 0 3 <NA> <NA> <NA> <NA> <NA>
## 14 13 France 11 0 1 <NA> 2 recovered <NA> <NA>
## 15 14 United Kingdom 9 0 1 <NA> <NA> <NA> <NA> <NA>
## 16 15 Canada 7 0 1 <NA> <NA> <NA> <NA> <NA>
## 17 16 UAE 8 0 1 <NA> 1 recovered <NA> <NA>
## 18 17 Philippines 3 1 2 <NA> <NA> <NA> <NA> <NA>
## 19 18 India 3 0 0 <NA> <NA> <NA> <NA> <NA>
## 20 19 Italy 3 0 2 <NA> <NA> <NA> <NA> <NA>
## 21 20 Russia 2 0 2 <NA> <NA> <NA> <NA> <NA>
## 22 21 Spain 2 0 0 <NA> <NA> <NA> <NA> <NA>
## 23 22 Nepal 1 0 1 <NA> <NA> <NA> <NA> <NA>
## 24 23 Cambodia 1 0 1 <NA> <NA> <NA> <NA> <NA>
## 25 24 Sri Lanka 1 0 1 <NA> <NA> <NA> <NA> <NA>
## 26 25 Finland 1 0 1 <NA> <NA> <NA> <NA> <NA>
## links
## 1 https://bnonews.com/index.php/2020/02/the-latest-coronavirus-cases/
## 2 https://edition.cnn.com/asia/live-news/coronavirus-outbreak-02-12-20-intl-hnk/h_b49323f6a8ba348a79c4cda5886339c7
## 3 https://www.cdc.gov.tw/Bulletin/Detail/iMZg3IYhKMeo-87fTw8hpQ?typeid=9
## 4 https://news.gov.mo/detail/zh-hant/N20BDPAzBd?3
## 5 https://www3.nhk.or.jp/nhkworld/en/news/20200213_29/
## 6 https://www.moh.gov.sg/news-highlights/details/six-more-cases-discharged-three-new-cases-of-covid-19-infection-confirmed
## 7 https://pr.moph.go.th/?url=pr/detail/2/04/138459/
## 8 https://en.yna.co.kr/view/AEN20200212007600320
## 9 https://www.channelnewsasia.com/news/asia/malaysia-recovered-wuhan-coronavirus-patients-china-embassy-12421116
## 10 https://www.smh.com.au/national/all-four-nsw-coronavirus-patients-free-from-disease-20200213-p540i9.html
## 11 https://www.stmgp.bayern.de/presse/aktuelle-informationen-zur-coronavirus-lage-in-bayern-bayerisches-gesundheitsministerium-15/
## 12 https://vietnamnet.vn/vn/suc-khoe/suc-khoe-24h/nguoi-thu-16-nhiem-ncov-tai-viet-nam-615372.html
## 13 https://twitter.com/BNODesk/status/1227752734486650880
## 14 https://edition.cnn.com/asia/live-news/coronavirus-outbreak-02-12-20-intl-hnk/h_6ea5df590e539c8c15ab1f218588a086
## 15 https://www.bbc.co.uk/news/amp/uk-51481469
## 16 https://news.ontario.ca/mohltc/en/2020/02/ontario-confirms-resolved-case-of-the-2019-novel-coronavirus.html
## 17 https://twitter.com/mohapuae/status/1226926065563656193
## 18 https://twitter.com/mohapuae/status/1226552040559718400
## 19 https://www.pib.gov.in/PressReleseDetail.aspx?PRID=1601681
## 20 https://www.ansa.it/canale_saluteebenessere/notizie/sanita/2020/02/03/coronavirus-accertamenti-su-italiano-alla-cecchignola-_b20da9f2-57a5-4870-978f-23e9c50d6155.html
## 21 https://www.channelnewsasia.com/news/world/russia-discharges-second-china-coronavirus-patient-12426458
## 22 https://twitter.com/SaludPublicaEs/status/1226432655555731457
## 23 https://www.thelancet.com/journals/laninf/article/PIIS1473-3099(20)30067-0/fulltext
## 24 https://twitter.com/VOD_English/status/1221769368180121603
## 25 https://www.reuters.com/article/us-health-china-sri-lanka/sri-lanka-confirms-first-case-of-coronavirus-health-official-idUSKBN1ZQ1WF
## 26 https://yle.fi/uutiset/osasto/news/finlands_first_coronavirus_patient_released_from_hospital_symptom-free/11193661
corona_tbl3 <- select(corona_tbl2, country, cases, deaths)
head(corona_tbl3)
## country cases deaths
## 1 China 59804 1365
## 2 Hong Kong 50 1
## 3 Taiwan 18 0
## 4 Macau 10 0
## 5 Japan 247 0
## 6 Singapore 50 0
corona_tbl4 <- gather(corona_tbl3, "corona_data", "numbers", 2:3)
head(corona_tbl4)
## country corona_data numbers
## 1 China cases 59804
## 2 Hong Kong cases 50
## 3 Taiwan cases 18
## 4 Macau cases 10
## 5 Japan cases 247
## 6 Singapore cases 50
We wanted to create a stacked histogram using the transformed data, but there are several NA, and missing values.
Concluion
In this project i discovered the power of R, when handling and transforming data. by using small codes i was able to perform some complex data transformations, that may need long coding with other tools like panda in python.