Data Acquisition and Management Week 6 Assignment

Overview: We need to choose 3 sets of data in the Week 5 Discussion to tidy and transform the dat, and then to perform analysis. I have chosen the following 3 sets:

Coronavirus data from Philip Tanofsky: https://github.com/CryptoKass/ncov-data/blob/master/world.latest.bno.csv
New York City school level College Board AP results for 2010 from Sung Lee: https://raw.githubusercontent.com/ferrysany/cuny607p2/master/2010__AP__College_Board__School_Level_Results.csv
Example of MELT and others from Angel Claudio: https://raw.githubusercontent.com/rodrigomf5/Tidydata/master/relinc.csv

Load all the required packages.

library(tidyverse)
library(readr)

As they are all individual tables, I just read data from csv without creating any database

theUrl1 <- "https://raw.githubusercontent.com/CryptoKass/ncov-data/master/world.latest.bno.csv"
theUrl2 <- "https://raw.githubusercontent.com/ferrysany/cuny607p2/master/2010__AP__College_Board__School_Level_Results.csv"
theUrl3 <- "https://raw.githubusercontent.com/rodrigomf5/Tidydata/master/relinc.csv"

virus <- read_csv(file=theUrl1,  na = c("", "NA"))

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   country = col_character(),
##   cases = col_double(),
##   deaths = col_double(),
##   notes = col_character(),
##   links = col_character()
## )

ap <- read_csv(file=theUrl2, na = c("", "NA"))

## Parsed with column specification:
## cols(
##   DBN = col_character(),
##   SchoolName = col_character(),
##   `AP Test Takers` = col_double(),
##   `Total Exams Taken` = col_double(),
##   `Number of Exams with scores 3 4 or 5` = col_double()
## )

religion <- read_csv(file=theUrl3, na = c("", "NA"))

## Parsed with column specification:
## cols(
##   religion = col_character(),
##   `<10k` = col_double(),
##   `10-20k` = col_double(),
##   `20-30k` = col_double(),
##   `30-40k` = col_double(),
##   `40-50k` = col_double(),
##   `50-75k` = col_double(),
##   `75-100k` = col_double(),
##   `100-150k` = col_double(),
##   `>150k` = col_double(),
##   refused = col_double()
## )

virus

## # A tibble: 26 x 6
##       X1 country   cases deaths notes           links                      
##    <dbl> <chr>     <dbl>  <dbl> <chr>           <chr>                      
##  1     0 China     59804   1365 0               https://bnonews.com/index.…
##  2     1 Hong Kong    50      1 4 critical, 2 … https://edition.cnn.com/as…
##  3     2 Taiwan       18      0 1 recovered     https://www.cdc.gov.tw/Bul…
##  4     3 Macau        10      0 1 recovered     https://news.gov.mo/detail…
##  5     4 Japan       247      0 4 serious, 4 r… https://www3.nhk.or.jp/nhk…
##  6     5 Singapore    50      0 8 critical, 15… https://www.moh.gov.sg/new…
##  7     6 Thailand     33      0 1 serious, 10 … https://pr.moph.go.th/?url…
##  8     7 South Ko…    28      0 7 recovered     https://en.yna.co.kr/view/…
##  9     8 Malaysia     18      0 3 recovered     https://www.channelnewsasi…
## 10     9 Australia    15      0 8 recovered     https://www.smh.com.au/nat…
## # … with 16 more rows

ap

## # A tibble: 258 x 5
##    DBN    SchoolName    `AP Test Takers` `Total Exams Ta… `Number of Exams…
##    <chr>  <chr>                    <dbl>            <dbl>             <dbl>
##  1 01M448 UNIVERSITY N…               39               49                10
##  2 01M450 EAST SIDE CO…               19               21                NA
##  3 01M515 LOWER EASTSI…               24               26                24
##  4 01M539 NEW EXPLORAT…              255              377               191
##  5 02M296 High School …               NA               NA                NA
##  6 02M298 Pace High Sc…               21               21                NA
##  7 02M300 Urban Assemb…               99              117                10
##  8 02M303 Facing Histo…               42               44                NA
##  9 02M305 Urban Assemb…               25               37                15
## 10 02M308 Lower Manhat…               NA               NA                NA
## # … with 248 more rows

religion

## # A tibble: 18 x 11
##    religion `<10k` `10-20k` `20-30k` `30-40k` `40-50k` `50-75k` `75-100k`
##    <chr>     <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>     <dbl>
##  1 Agnostic     27       34       60       81       76      137       122
##  2 Atheist      12       27       37       52       35       70        73
##  3 Buddhist     27       21       30       34       33       58        62
##  4 Catholic    418      617      732      670      638     1116       949
##  5 refused      15       14       15       11       10       35        21
##  6 Evangel…    575      869     1064      982      881     1486       949
##  7 Hindu         1        9        7        9       11       34        47
##  8 Histori…    228      244      236      238      197      223       131
##  9 Jehovah…     20       27       24       24       21       30        15
## 10 Jewish       19       19       25       25       30       95        69
## 11 Mainlin…    289      495      619      655      651     1107       939
## 12 Mormon       29       40       48       51       56      112        85
## 13 Muslim        6        7        9       10        9       23        16
## 14 Orthodox     13       17       23       32       32       47        38
## 15 Other C…      9        7       11       13       13       14        18
## 16 Other F…     20       33       40       46       49       63        46
## 17 Other W…      5        2        3        4        2        7         3
## 18 Unaffil…    217      299      374      365      341      528       407
## # … with 3 more variables: `100-150k` <dbl>, `>150k` <dbl>, refused <dbl>

Dataset 1: Tidy, transform and analyse data set “virus”

virus1 <- virus %>%
  
  #Make sure all cells have "critical", "serious" and "recovered" case number
  mutate(notes = case_when(
    (str_detect(notes, "critical") & str_detect(notes, "serious") & str_detect(notes, "recovered")) ~ notes,
    (str_detect(notes, "critical") & str_detect(notes, "serious")) ~ (str_c(notes, ", 0 recovered,")),
    (str_detect(notes, "serious") & str_detect(notes, "recovered")) ~ (str_c("0 critical, ", notes)),
    (str_detect(notes, "critical") & str_detect(notes, "recovered")) ~ (str_c(str_sub(notes, 0, 12), "0 serious,", str_sub(notes, 12, 24))), #can't get the str_sub works properly
    str_detect(notes, "critical") ~ (str_c(notes, ", 0 serious, 0 recovered")),
    str_detect(notes, "serious") ~ (str_c("0 critical, ", notes, ", 0 recovered")),
    str_detect(notes, "recovered") ~ (str_c("0 critical, 0 serious, ", notes)),
  )) %>%
  
  #Separate the column notes1 into 3 columns
  separate(col=notes, into=c("critical", "serious", "recovered"), sep = ",") %>%
  
  #Parse column "critical", "serious" and "recovered" to number 
  mutate_at(5:7, parse_number)

virus1

## # A tibble: 26 x 8
##       X1 country  cases deaths critical serious recovered links            
##    <dbl> <chr>    <dbl>  <dbl>    <dbl>   <dbl>     <dbl> <chr>            
##  1     0 China    59804   1365       NA      NA        NA https://bnonews.…
##  2     1 Hong Ko…    50      1        4       2         1 https://edition.…
##  3     2 Taiwan      18      0        0       0         1 https://www.cdc.…
##  4     3 Macau       10      0        0       0         1 https://news.gov…
##  5     4 Japan      247      0        0       4         4 https://www3.nhk…
##  6     5 Singapo…    50      0        8       0        15 https://www.moh.…
##  7     6 Thailand    33      0        0       1        10 https://pr.moph.…
##  8     7 South K…    28      0        0       0         7 https://en.yna.c…
##  9     8 Malaysia    18      0        0       0         3 https://www.chan…
## 10     9 Austral…    15      0        0       0         8 https://www.smh.…
## # … with 16 more rows

There is nothing much to create plot as the origin of the virus is China which has the maximum number of cases and deaths.

Dataset 2: NY City school level college board AP results

ap1 <- ap %>%
  rename("noExam345" = "Number of Exams with scores 3 4 or 5", "totalExams" = "Total Exams Taken") %>%
  #filter(!is.na("noExam345")|!is.na("totalExams"))
  mutate(ratio= noExam345 / totalExams) %>%
  mutate(district=str_sub(DBN, 1 ,2))
ap1

## # A tibble: 258 x 7
##    DBN    SchoolName `AP Test Takers` totalExams noExam345   ratio district
##    <chr>  <chr>                 <dbl>      <dbl>     <dbl>   <dbl> <chr>   
##  1 01M448 UNIVERSIT…               39         49        10  0.204  01      
##  2 01M450 EAST SIDE…               19         21        NA NA      01      
##  3 01M515 LOWER EAS…               24         26        24  0.923  01      
##  4 01M539 NEW EXPLO…              255        377       191  0.507  01      
##  5 02M296 High Scho…               NA         NA        NA NA      02      
##  6 02M298 Pace High…               21         21        NA NA      02      
##  7 02M300 Urban Ass…               99        117        10  0.0855 02      
##  8 02M303 Facing Hi…               42         44        NA NA      02      
##  9 02M305 Urban Ass…               25         37        15  0.405  02      
## 10 02M308 Lower Man…               NA         NA        NA NA      02      
## # … with 248 more rows

Compare Number of Exam with Score 3-5 for each school

ggplot(data=ap1) + 
  geom_col(mapping= aes(x = SchoolName, y=ratio))+
  labs(title="Ratio for Exam with Score 3-5 Out of Total Exam Taken for Each School")

## Warning: Removed 107 rows containing missing values (position_stack).

Group schools according to district and plot again

ap1Group <- ap1 %>%
  group_by(district)%>%
  summarise(meanRatio=mean(ratio, na.rm=TRUE))
ap1Group

## # A tibble: 32 x 2
##    district meanRatio
##    <chr>        <dbl>
##  1 01           0.545
##  2 02           0.467
##  3 03           0.398
##  4 04           0.570
##  5 05           0.638
##  6 06           0.401
##  7 07           0.441
##  8 08           0.323
##  9 09           0.393
## 10 10           0.397
## # … with 22 more rows

ggplot(data=ap1Group) + 
  geom_col(mapping= aes(x = district, y=meanRatio))+
  labs(title="Ratio for Exam with Score 3-5 Out of Total Exam Taken for Each District")

## Warning: Removed 5 rows containing missing values (position_stack).

It looks like district 05, 04 and 01 are the districts with top 3 ratio.

Example of MELT and others from Angel Claudio:

religion1 <- religion %>%
  gather(key="salary", value="value", 2:10)
religion1

## # A tibble: 162 x 4
##    religion                refused salary value
##    <chr>                     <dbl> <chr>  <dbl>
##  1 Agnostic                     96 <10k      27
##  2 Atheist                      76 <10k      12
##  3 Buddhist                     54 <10k      27
##  4 Catholic                   1489 <10k     418
##  5 refused                     116 <10k      15
##  6 Evangelical Prot           1529 <10k     575
##  7 Hindu                        37 <10k       1
##  8 Historically Black Prot     339 <10k     228
##  9 Jehovah's Witness            37 <10k      20
## 10 Jewish                      162 <10k      19
## # … with 152 more rows

Plot a histogram for salary distribution

ggplot(data=religion1) + 
  geom_col(mapping= aes(x = salary, y=value, fill=religion))+
  labs(title="")

The highest number of salary is at 50-75K and the portion is similar for each religion.

Data Acquisition and Management Week 6 Assignment - Project 2

Chun San Yip

2020/03/08