Overview: We need to choose 3 sets of data in the Week 5 Discussion to tidy and transform the dat, and then to perform analysis. I have chosen the following 3 sets:
Coronavirus data from Philip Tanofsky: https://github.com/CryptoKass/ncov-data/blob/master/world.latest.bno.csv
New York City school level College Board AP results for 2010 from Sung Lee: https://raw.githubusercontent.com/ferrysany/cuny607p2/master/2010__AP__College_Board__School_Level_Results.csv
Example of MELT and others from Angel Claudio: https://raw.githubusercontent.com/rodrigomf5/Tidydata/master/relinc.csv
Load all the required packages.
library(tidyverse)
library(readr)
As they are all individual tables, I just read data from csv without creating any database
theUrl1 <- "https://raw.githubusercontent.com/CryptoKass/ncov-data/master/world.latest.bno.csv"
theUrl2 <- "https://raw.githubusercontent.com/ferrysany/cuny607p2/master/2010__AP__College_Board__School_Level_Results.csv"
theUrl3 <- "https://raw.githubusercontent.com/rodrigomf5/Tidydata/master/relinc.csv"
virus <- read_csv(file=theUrl1, na = c("", "NA"))
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## country = col_character(),
## cases = col_double(),
## deaths = col_double(),
## notes = col_character(),
## links = col_character()
## )
ap <- read_csv(file=theUrl2, na = c("", "NA"))
## Parsed with column specification:
## cols(
## DBN = col_character(),
## SchoolName = col_character(),
## `AP Test Takers` = col_double(),
## `Total Exams Taken` = col_double(),
## `Number of Exams with scores 3 4 or 5` = col_double()
## )
religion <- read_csv(file=theUrl3, na = c("", "NA"))
## Parsed with column specification:
## cols(
## religion = col_character(),
## `<10k` = col_double(),
## `10-20k` = col_double(),
## `20-30k` = col_double(),
## `30-40k` = col_double(),
## `40-50k` = col_double(),
## `50-75k` = col_double(),
## `75-100k` = col_double(),
## `100-150k` = col_double(),
## `>150k` = col_double(),
## refused = col_double()
## )
virus
## # A tibble: 26 x 6
## X1 country cases deaths notes links
## <dbl> <chr> <dbl> <dbl> <chr> <chr>
## 1 0 China 59804 1365 0 https://bnonews.com/index.…
## 2 1 Hong Kong 50 1 4 critical, 2 … https://edition.cnn.com/as…
## 3 2 Taiwan 18 0 1 recovered https://www.cdc.gov.tw/Bul…
## 4 3 Macau 10 0 1 recovered https://news.gov.mo/detail…
## 5 4 Japan 247 0 4 serious, 4 r… https://www3.nhk.or.jp/nhk…
## 6 5 Singapore 50 0 8 critical, 15… https://www.moh.gov.sg/new…
## 7 6 Thailand 33 0 1 serious, 10 … https://pr.moph.go.th/?url…
## 8 7 South Ko… 28 0 7 recovered https://en.yna.co.kr/view/…
## 9 8 Malaysia 18 0 3 recovered https://www.channelnewsasi…
## 10 9 Australia 15 0 8 recovered https://www.smh.com.au/nat…
## # … with 16 more rows
ap
## # A tibble: 258 x 5
## DBN SchoolName `AP Test Takers` `Total Exams Ta… `Number of Exams…
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 01M448 UNIVERSITY N… 39 49 10
## 2 01M450 EAST SIDE CO… 19 21 NA
## 3 01M515 LOWER EASTSI… 24 26 24
## 4 01M539 NEW EXPLORAT… 255 377 191
## 5 02M296 High School … NA NA NA
## 6 02M298 Pace High Sc… 21 21 NA
## 7 02M300 Urban Assemb… 99 117 10
## 8 02M303 Facing Histo… 42 44 NA
## 9 02M305 Urban Assemb… 25 37 15
## 10 02M308 Lower Manhat… NA NA NA
## # … with 248 more rows
religion
## # A tibble: 18 x 11
## religion `<10k` `10-20k` `20-30k` `30-40k` `40-50k` `50-75k` `75-100k`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Agnostic 27 34 60 81 76 137 122
## 2 Atheist 12 27 37 52 35 70 73
## 3 Buddhist 27 21 30 34 33 58 62
## 4 Catholic 418 617 732 670 638 1116 949
## 5 refused 15 14 15 11 10 35 21
## 6 Evangel… 575 869 1064 982 881 1486 949
## 7 Hindu 1 9 7 9 11 34 47
## 8 Histori… 228 244 236 238 197 223 131
## 9 Jehovah… 20 27 24 24 21 30 15
## 10 Jewish 19 19 25 25 30 95 69
## 11 Mainlin… 289 495 619 655 651 1107 939
## 12 Mormon 29 40 48 51 56 112 85
## 13 Muslim 6 7 9 10 9 23 16
## 14 Orthodox 13 17 23 32 32 47 38
## 15 Other C… 9 7 11 13 13 14 18
## 16 Other F… 20 33 40 46 49 63 46
## 17 Other W… 5 2 3 4 2 7 3
## 18 Unaffil… 217 299 374 365 341 528 407
## # … with 3 more variables: `100-150k` <dbl>, `>150k` <dbl>, refused <dbl>
Dataset 1: Tidy, transform and analyse data set “virus”
virus1 <- virus %>%
#Make sure all cells have "critical", "serious" and "recovered" case number
mutate(notes = case_when(
(str_detect(notes, "critical") & str_detect(notes, "serious") & str_detect(notes, "recovered")) ~ notes,
(str_detect(notes, "critical") & str_detect(notes, "serious")) ~ (str_c(notes, ", 0 recovered,")),
(str_detect(notes, "serious") & str_detect(notes, "recovered")) ~ (str_c("0 critical, ", notes)),
(str_detect(notes, "critical") & str_detect(notes, "recovered")) ~ (str_c(str_sub(notes, 0, 12), "0 serious,", str_sub(notes, 12, 24))), #can't get the str_sub works properly
str_detect(notes, "critical") ~ (str_c(notes, ", 0 serious, 0 recovered")),
str_detect(notes, "serious") ~ (str_c("0 critical, ", notes, ", 0 recovered")),
str_detect(notes, "recovered") ~ (str_c("0 critical, 0 serious, ", notes)),
)) %>%
#Separate the column notes1 into 3 columns
separate(col=notes, into=c("critical", "serious", "recovered"), sep = ",") %>%
#Parse column "critical", "serious" and "recovered" to number
mutate_at(5:7, parse_number)
virus1
## # A tibble: 26 x 8
## X1 country cases deaths critical serious recovered links
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 0 China 59804 1365 NA NA NA https://bnonews.…
## 2 1 Hong Ko… 50 1 4 2 1 https://edition.…
## 3 2 Taiwan 18 0 0 0 1 https://www.cdc.…
## 4 3 Macau 10 0 0 0 1 https://news.gov…
## 5 4 Japan 247 0 0 4 4 https://www3.nhk…
## 6 5 Singapo… 50 0 8 0 15 https://www.moh.…
## 7 6 Thailand 33 0 0 1 10 https://pr.moph.…
## 8 7 South K… 28 0 0 0 7 https://en.yna.c…
## 9 8 Malaysia 18 0 0 0 3 https://www.chan…
## 10 9 Austral… 15 0 0 0 8 https://www.smh.…
## # … with 16 more rows
There is nothing much to create plot as the origin of the virus is China which has the maximum number of cases and deaths.
Dataset 2: NY City school level college board AP results
ap1 <- ap %>%
rename("noExam345" = "Number of Exams with scores 3 4 or 5", "totalExams" = "Total Exams Taken") %>%
#filter(!is.na("noExam345")|!is.na("totalExams"))
mutate(ratio= noExam345 / totalExams) %>%
mutate(district=str_sub(DBN, 1 ,2))
ap1
## # A tibble: 258 x 7
## DBN SchoolName `AP Test Takers` totalExams noExam345 ratio district
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 01M448 UNIVERSIT… 39 49 10 0.204 01
## 2 01M450 EAST SIDE… 19 21 NA NA 01
## 3 01M515 LOWER EAS… 24 26 24 0.923 01
## 4 01M539 NEW EXPLO… 255 377 191 0.507 01
## 5 02M296 High Scho… NA NA NA NA 02
## 6 02M298 Pace High… 21 21 NA NA 02
## 7 02M300 Urban Ass… 99 117 10 0.0855 02
## 8 02M303 Facing Hi… 42 44 NA NA 02
## 9 02M305 Urban Ass… 25 37 15 0.405 02
## 10 02M308 Lower Man… NA NA NA NA 02
## # … with 248 more rows
Compare Number of Exam with Score 3-5 for each school
ggplot(data=ap1) +
geom_col(mapping= aes(x = SchoolName, y=ratio))+
labs(title="Ratio for Exam with Score 3-5 Out of Total Exam Taken for Each School")
## Warning: Removed 107 rows containing missing values (position_stack).
Group schools according to district and plot again
ap1Group <- ap1 %>%
group_by(district)%>%
summarise(meanRatio=mean(ratio, na.rm=TRUE))
ap1Group
## # A tibble: 32 x 2
## district meanRatio
## <chr> <dbl>
## 1 01 0.545
## 2 02 0.467
## 3 03 0.398
## 4 04 0.570
## 5 05 0.638
## 6 06 0.401
## 7 07 0.441
## 8 08 0.323
## 9 09 0.393
## 10 10 0.397
## # … with 22 more rows
ggplot(data=ap1Group) +
geom_col(mapping= aes(x = district, y=meanRatio))+
labs(title="Ratio for Exam with Score 3-5 Out of Total Exam Taken for Each District")
## Warning: Removed 5 rows containing missing values (position_stack).
It looks like district 05, 04 and 01 are the districts with top 3 ratio.
religion1 <- religion %>%
gather(key="salary", value="value", 2:10)
religion1
## # A tibble: 162 x 4
## religion refused salary value
## <chr> <dbl> <chr> <dbl>
## 1 Agnostic 96 <10k 27
## 2 Atheist 76 <10k 12
## 3 Buddhist 54 <10k 27
## 4 Catholic 1489 <10k 418
## 5 refused 116 <10k 15
## 6 Evangelical Prot 1529 <10k 575
## 7 Hindu 37 <10k 1
## 8 Historically Black Prot 339 <10k 228
## 9 Jehovah's Witness 37 <10k 20
## 10 Jewish 162 <10k 19
## # … with 152 more rows
Plot a histogram for salary distribution
ggplot(data=religion1) +
geom_col(mapping= aes(x = salary, y=value, fill=religion))+
labs(title="")
The highest number of salary is at 50-75K and the portion is similar for each religion.