Data607 Tidying and transformation Project

In this project, I am asked to select 3 ‘wide’ datasets and use tidyr and dplyr as needed to tidy and transform your data.

Library Required

Below libraries used. tidyr -> provides two important functions gather() and spread() dplyr -> provides select(), filter, group_by, summarize and mutate

library(readr)
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)
library(ggplot2)
library(knitr)

DataSet 1 : Major League Baseball Team

This data set can be found on the https://www.baseball-reference.com/leagues/MLB/ and it includes every MLB team from all time, there wins, and the games played in that year.

Dataset is available in csv format to download.

Below snippet will read the csv file and print first few samples.

mlbData <- data.frame(read_csv("https://raw.githubusercontent.com/charlsjoseph/CUNY-Data607/master/week5/project/dataset/MLB.csv"))
## Parsed with column specification:
## cols(
##   .default = col_integer()
## )
## See spec(...) for full column specifications.
head(mlbData)
##   Year   G ARI ATL BLA BAL BOS CHC CHW CIN CLE COL DET HOU KCR LAA LAD MIA
## 1 2018 163  82  90  NA  47 108  95  62  67  91  91  64 103  58  80  92  63
## 2 2017 162  93  72  NA  75  93  92  67  68 102  87  64 101  80  80 104  77
## 3 2016 162  69  68  NA  89  93 103  78  68  94  75  86  84  81  74  91  79
## 4 2015 162  79  67  NA  81  78  97  76  64  81  68  74  86  95  85  92  71
## 5 2014 162  64  79  NA  96  71  73  73  76  85  66  90  70  89  98  94  77
## 6 2013 163  81  96  NA  85  97  66  63  90  92  74  93  51  86  78  92  62
##   MIL MIN NYM NYY OAK PHI PIT SDP SFG SEA STL TBR TEX TOR WSN
## 1  96  78  77 100  97  80  82  66  73  89  88  90  67  73  82
## 2  86  85  70  91  75  66  75  71  64  78  83  80  78  76  97
## 3  73  59  87  84  69  71  78  68  87  86  86  68  95  89  95
## 4  68  83  90  87  68  63  98  74  84  76 100  80  88  93  83
## 5  82  70  79  84  88  73  88  77  88  87  90  77  67  83  96
## 6  74  66  74  85  96  73  94  76  76  71  97  92  91  74  86

Transform the wide format into Long format.

Below snippet transform the dataset from wide format to long.

mlbData <- mlbData %>% gather(team, wins, 3:length(colnames(mlbData)))
head(mlbData)
##   Year   G team wins
## 1 2018 163  ARI   82
## 2 2017 162  ARI   93
## 3 2016 162  ARI   69
## 4 2015 162  ARI   79
## 5 2014 162  ARI   64
## 6 2013 163  ARI   81

Tranform with dplyr funcions.

Since the number of Games in each year varies, the points can be used for comparison purpose, So have to create a new column for win percentage(win_pct) and is calculated by total number of wins/ total no of Games.

mlbData <- mlbData %>% mutate(win_pct = wins/G)
head(mlbData)
##   Year   G team wins   win_pct
## 1 2018 163  ARI   82 0.5030675
## 2 2017 162  ARI   93 0.5740741
## 3 2016 162  ARI   69 0.4259259
## 4 2015 162  ARI   79 0.4876543
## 5 2014 162  ARI   64 0.3950617
## 6 2013 163  ARI   81 0.4969325

Print the topper and their points since 2000. used join function to join two dataset to get the top team name and its max score

mlbData <- mlbData %>% filter(Year >= 2000)

mlbData_byyear_byteam = mlbData %>% group_by(Year, team)

mlbData_byyear = mlbData %>% group_by(Year) %>% summarise( max_wins = max(wins, na.rm = TRUE))  %>% arrange(-Year)

topper <- inner_join(mlbData_byyear, mlbData_byyear_byteam,   c("Year" = "Year", "max_wins" = "wins")  ) %>% select(Year, max_wins , team)
kable(topper)
Year max_wins team
2018 108 BOS
2017 104 LAD
2016 103 CHC
2015 100 STL
2014 98 LAA
2013 97 BOS
2013 97 STL
2012 98 WSN
2011 102 PHI
2010 97 PHI
2009 103 NYY
2008 100 LAA
2007 96 BOS
2007 96 CLE
2006 97 NYM
2006 97 NYY
2005 100 STL
2004 105 STL
2003 101 ATL
2003 101 NYY
2002 103 NYY
2002 103 OAK
2001 116 SEA
2000 97 SFG

DataSet 2 : Sample dataset of babies height from year 2011 to year 2015.

Below snippet will read the csv file and print first few samples.

babiesHeightData <- data.frame(read_csv("https://raw.githubusercontent.com/charlsjoseph/CUNY-Data607/master/week5/project/dataset/babies_heights.csv"))
## Parsed with column specification:
## cols(
##   name = col_character(),
##   gender = col_character(),
##   year2011 = col_double(),
##   year2012 = col_double(),
##   year2013 = col_double(),
##   year2014 = col_double(),
##   year2015 = col_double()
## )
head(babiesHeightData)
##             name gender year2011 year2012 year2013 year2014 year2015
## 1  Jackson Smith      M    74.69    84.99    91.73   105.11   111.04
## 2 Sophia Johnson      F       NA       NA       NA       NA    75.89
## 3  Emma Williams      F       NA       NA    75.74    86.50    91.50
## 4    Aiden Jones      M       NA       NA       NA    71.89    81.42
## 5     Liam Brown      M    88.24    96.91   101.85   108.13   112.45
## 6    Lucas Davis      M    70.60    83.78    94.17   100.03   106.35

Format the column name to year

colnames(babiesHeightData)[3:7] <- c("2011", "2012" , "2013" , "2014" , "2015")
colnames(babiesHeightData)
## [1] "name"   "gender" "2011"   "2012"   "2013"   "2014"   "2015"

Transform from wide to long dataset

Using tidyr gather() method, transform the wide format to long. Year column value becomes row data value.

babiesHeightData <- babiesHeightData %>% gather(year, height, 3:7)

head(babiesHeightData)
##             name gender year height
## 1  Jackson Smith      M 2011  74.69
## 2 Sophia Johnson      F 2011     NA
## 3  Emma Williams      F 2011     NA
## 4    Aiden Jones      M 2011     NA
## 5     Liam Brown      M 2011  88.24
## 6    Lucas Davis      M 2011  70.60

Average heights by year and by age.

using dplyr( using pipe), calculate the average heights by year and by age.

heights_by_gender_by_year = babiesHeightData %>% group_by(year,gender) %>% summarise(mean_height = mean(height, na.rm = TRUE))

heights_by_gender_by_year
## # A tibble: 10 x 3
## # Groups:   year [?]
##    year  gender mean_height
##    <chr> <chr>        <dbl>
##  1 2011  F             76.8
##  2 2011  M             77.8
##  3 2012  F             88.6
##  4 2012  M             88.6
##  5 2013  F             89.3
##  6 2013  M             95.9
##  7 2014  F             98.8
##  8 2014  M             96.3
##  9 2015  F             90.4
## 10 2015  M            103.

Plot average height for each year( for Male and Female)

Below R snippet plot a bar chart showing year on the X-axis and average heights on the Y-axis

ggplot(heights_by_gender_by_year, aes(year, mean_height)) + 
  geom_bar(aes(fill = gender), position = "dodge", stat = "identity") + 
  ylab("Average Height")

DataSet 3 : High School and Beyond Data Set

This dataset is found in https://rdrr.io/cran/blorr/man/hsb2.html Below snippet will read the csv file and print first few samples.

id - id of the student

female - gender of the student

race - ethnic background of the student

ses - socio-economic status of the student

schtyp - school type

prog - program type

read scores from test of reading

write - scores from test of writing

math - scores from test of math

science - scores from test of science

socst - scores from test of social studies

hsb2Data <- data.frame(read_csv("https://raw.githubusercontent.com/charlsjoseph/CUNY-Data607/master/week5/project/dataset/hsb2.csv"))
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   female = col_integer(),
##   race = col_integer(),
##   ses = col_integer(),
##   schtyp = col_integer(),
##   prog = col_integer(),
##   read = col_integer(),
##   write = col_integer(),
##   math = col_integer(),
##   science = col_integer(),
##   socst = col_integer()
## )
head(hsb2Data)
##    id female race ses schtyp prog read write math science socst
## 1  70      0    4   1      1    1   57    52   41      47    57
## 2 121      1    4   2      1    3   68    59   53      63    61
## 3  86      0    4   3      1    1   44    33   54      58    31
## 4 141      0    4   3      1    3   63    44   47      53    56
## 5 172      0    4   2      1    2   47    52   57      53    61
## 6 113      0    4   2      1    2   44    52   51      63    61

Transform from wide to long dataset

Using tidyr gather() method, transform the wide format to long. Year column value becomes row data value.

hsb2Data <- hsb2Data %>% gather(ability, score, read:socst)

head(hsb2Data)
##    id female race ses schtyp prog ability score
## 1  70      0    4   1      1    1    read    57
## 2 121      1    4   2      1    3    read    68
## 3  86      0    4   3      1    1    read    44
## 4 141      0    4   3      1    3    read    63
## 5 172      0    4   2      1    2    read    47
## 6 113      0    4   2      1    2    read    44

Tranform using dplyr

create a bar plot with socio-economic status of the student in X axis and average score of different abilities to understand if socio-economic status has any relation with their ability.

For this we can make out that students with low socio-economic status having the score < 50 and students with high socio-economic status has comparatively high average score.

hsb2Data$ses.f = factor(hsb2Data$ses, labels=c("low", "middle", "high")) 

meanScore_byses.f_byability = hsb2Data %>% group_by(ses.f, ability) %>% summarise(mean_score = mean(score, na.rm = TRUE))

head(meanScore_byses.f_byability)
## # A tibble: 6 x 3
## # Groups:   ses.f [2]
##   ses.f  ability mean_score
##   <fct>  <chr>        <dbl>
## 1 low    math          49.2
## 2 low    read          48.3
## 3 low    science       47.7
## 4 low    socst         47.3
## 5 low    write         50.6
## 6 middle math          52.2
ggplot(meanScore_byses.f_byability, aes(ses.f, mean_score)) + 
  geom_bar(aes(fill = ability), position = "dodge", stat = "identity") + 
  ylab("Average Score")