In this project, I am asked to select 3 ‘wide’ datasets and use tidyr and dplyr as needed to tidy and transform your data.
Below libraries used. tidyr -> provides two important functions gather() and spread() dplyr -> provides select(), filter, group_by, summarize and mutate
library(readr)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(ggplot2)
library(knitr)
This data set can be found on the https://www.baseball-reference.com/leagues/MLB/ and it includes every MLB team from all time, there wins, and the games played in that year.
Dataset is available in csv format to download.
Below snippet will read the csv file and print first few samples.
mlbData <- data.frame(read_csv("https://raw.githubusercontent.com/charlsjoseph/CUNY-Data607/master/week5/project/dataset/MLB.csv"))
## Parsed with column specification:
## cols(
## .default = col_integer()
## )
## See spec(...) for full column specifications.
head(mlbData)
## Year G ARI ATL BLA BAL BOS CHC CHW CIN CLE COL DET HOU KCR LAA LAD MIA
## 1 2018 163 82 90 NA 47 108 95 62 67 91 91 64 103 58 80 92 63
## 2 2017 162 93 72 NA 75 93 92 67 68 102 87 64 101 80 80 104 77
## 3 2016 162 69 68 NA 89 93 103 78 68 94 75 86 84 81 74 91 79
## 4 2015 162 79 67 NA 81 78 97 76 64 81 68 74 86 95 85 92 71
## 5 2014 162 64 79 NA 96 71 73 73 76 85 66 90 70 89 98 94 77
## 6 2013 163 81 96 NA 85 97 66 63 90 92 74 93 51 86 78 92 62
## MIL MIN NYM NYY OAK PHI PIT SDP SFG SEA STL TBR TEX TOR WSN
## 1 96 78 77 100 97 80 82 66 73 89 88 90 67 73 82
## 2 86 85 70 91 75 66 75 71 64 78 83 80 78 76 97
## 3 73 59 87 84 69 71 78 68 87 86 86 68 95 89 95
## 4 68 83 90 87 68 63 98 74 84 76 100 80 88 93 83
## 5 82 70 79 84 88 73 88 77 88 87 90 77 67 83 96
## 6 74 66 74 85 96 73 94 76 76 71 97 92 91 74 86
Below snippet transform the dataset from wide format to long.
mlbData <- mlbData %>% gather(team, wins, 3:length(colnames(mlbData)))
head(mlbData)
## Year G team wins
## 1 2018 163 ARI 82
## 2 2017 162 ARI 93
## 3 2016 162 ARI 69
## 4 2015 162 ARI 79
## 5 2014 162 ARI 64
## 6 2013 163 ARI 81
Since the number of Games in each year varies, the points can be used for comparison purpose, So have to create a new column for win percentage(win_pct) and is calculated by total number of wins/ total no of Games.
mlbData <- mlbData %>% mutate(win_pct = wins/G)
head(mlbData)
## Year G team wins win_pct
## 1 2018 163 ARI 82 0.5030675
## 2 2017 162 ARI 93 0.5740741
## 3 2016 162 ARI 69 0.4259259
## 4 2015 162 ARI 79 0.4876543
## 5 2014 162 ARI 64 0.3950617
## 6 2013 163 ARI 81 0.4969325
Print the topper and their points since 2000. used join function to join two dataset to get the top team name and its max score
mlbData <- mlbData %>% filter(Year >= 2000)
mlbData_byyear_byteam = mlbData %>% group_by(Year, team)
mlbData_byyear = mlbData %>% group_by(Year) %>% summarise( max_wins = max(wins, na.rm = TRUE)) %>% arrange(-Year)
topper <- inner_join(mlbData_byyear, mlbData_byyear_byteam, c("Year" = "Year", "max_wins" = "wins") ) %>% select(Year, max_wins , team)
kable(topper)
| Year | max_wins | team |
|---|---|---|
| 2018 | 108 | BOS |
| 2017 | 104 | LAD |
| 2016 | 103 | CHC |
| 2015 | 100 | STL |
| 2014 | 98 | LAA |
| 2013 | 97 | BOS |
| 2013 | 97 | STL |
| 2012 | 98 | WSN |
| 2011 | 102 | PHI |
| 2010 | 97 | PHI |
| 2009 | 103 | NYY |
| 2008 | 100 | LAA |
| 2007 | 96 | BOS |
| 2007 | 96 | CLE |
| 2006 | 97 | NYM |
| 2006 | 97 | NYY |
| 2005 | 100 | STL |
| 2004 | 105 | STL |
| 2003 | 101 | ATL |
| 2003 | 101 | NYY |
| 2002 | 103 | NYY |
| 2002 | 103 | OAK |
| 2001 | 116 | SEA |
| 2000 | 97 | SFG |
Below snippet will read the csv file and print first few samples.
babiesHeightData <- data.frame(read_csv("https://raw.githubusercontent.com/charlsjoseph/CUNY-Data607/master/week5/project/dataset/babies_heights.csv"))
## Parsed with column specification:
## cols(
## name = col_character(),
## gender = col_character(),
## year2011 = col_double(),
## year2012 = col_double(),
## year2013 = col_double(),
## year2014 = col_double(),
## year2015 = col_double()
## )
head(babiesHeightData)
## name gender year2011 year2012 year2013 year2014 year2015
## 1 Jackson Smith M 74.69 84.99 91.73 105.11 111.04
## 2 Sophia Johnson F NA NA NA NA 75.89
## 3 Emma Williams F NA NA 75.74 86.50 91.50
## 4 Aiden Jones M NA NA NA 71.89 81.42
## 5 Liam Brown M 88.24 96.91 101.85 108.13 112.45
## 6 Lucas Davis M 70.60 83.78 94.17 100.03 106.35
colnames(babiesHeightData)[3:7] <- c("2011", "2012" , "2013" , "2014" , "2015")
colnames(babiesHeightData)
## [1] "name" "gender" "2011" "2012" "2013" "2014" "2015"
Using tidyr gather() method, transform the wide format to long. Year column value becomes row data value.
babiesHeightData <- babiesHeightData %>% gather(year, height, 3:7)
head(babiesHeightData)
## name gender year height
## 1 Jackson Smith M 2011 74.69
## 2 Sophia Johnson F 2011 NA
## 3 Emma Williams F 2011 NA
## 4 Aiden Jones M 2011 NA
## 5 Liam Brown M 2011 88.24
## 6 Lucas Davis M 2011 70.60
using dplyr( using pipe), calculate the average heights by year and by age.
heights_by_gender_by_year = babiesHeightData %>% group_by(year,gender) %>% summarise(mean_height = mean(height, na.rm = TRUE))
heights_by_gender_by_year
## # A tibble: 10 x 3
## # Groups: year [?]
## year gender mean_height
## <chr> <chr> <dbl>
## 1 2011 F 76.8
## 2 2011 M 77.8
## 3 2012 F 88.6
## 4 2012 M 88.6
## 5 2013 F 89.3
## 6 2013 M 95.9
## 7 2014 F 98.8
## 8 2014 M 96.3
## 9 2015 F 90.4
## 10 2015 M 103.
Below R snippet plot a bar chart showing year on the X-axis and average heights on the Y-axis
ggplot(heights_by_gender_by_year, aes(year, mean_height)) +
geom_bar(aes(fill = gender), position = "dodge", stat = "identity") +
ylab("Average Height")
This dataset is found in https://rdrr.io/cran/blorr/man/hsb2.html Below snippet will read the csv file and print first few samples.
id - id of the student
female - gender of the student
race - ethnic background of the student
ses - socio-economic status of the student
schtyp - school type
prog - program type
read scores from test of reading
write - scores from test of writing
math - scores from test of math
science - scores from test of science
socst - scores from test of social studies
hsb2Data <- data.frame(read_csv("https://raw.githubusercontent.com/charlsjoseph/CUNY-Data607/master/week5/project/dataset/hsb2.csv"))
## Parsed with column specification:
## cols(
## id = col_integer(),
## female = col_integer(),
## race = col_integer(),
## ses = col_integer(),
## schtyp = col_integer(),
## prog = col_integer(),
## read = col_integer(),
## write = col_integer(),
## math = col_integer(),
## science = col_integer(),
## socst = col_integer()
## )
head(hsb2Data)
## id female race ses schtyp prog read write math science socst
## 1 70 0 4 1 1 1 57 52 41 47 57
## 2 121 1 4 2 1 3 68 59 53 63 61
## 3 86 0 4 3 1 1 44 33 54 58 31
## 4 141 0 4 3 1 3 63 44 47 53 56
## 5 172 0 4 2 1 2 47 52 57 53 61
## 6 113 0 4 2 1 2 44 52 51 63 61
Using tidyr gather() method, transform the wide format to long. Year column value becomes row data value.
hsb2Data <- hsb2Data %>% gather(ability, score, read:socst)
head(hsb2Data)
## id female race ses schtyp prog ability score
## 1 70 0 4 1 1 1 read 57
## 2 121 1 4 2 1 3 read 68
## 3 86 0 4 3 1 1 read 44
## 4 141 0 4 3 1 3 read 63
## 5 172 0 4 2 1 2 read 47
## 6 113 0 4 2 1 2 read 44
create a bar plot with socio-economic status of the student in X axis and average score of different abilities to understand if socio-economic status has any relation with their ability.
For this we can make out that students with low socio-economic status having the score < 50 and students with high socio-economic status has comparatively high average score.
hsb2Data$ses.f = factor(hsb2Data$ses, labels=c("low", "middle", "high"))
meanScore_byses.f_byability = hsb2Data %>% group_by(ses.f, ability) %>% summarise(mean_score = mean(score, na.rm = TRUE))
head(meanScore_byses.f_byability)
## # A tibble: 6 x 3
## # Groups: ses.f [2]
## ses.f ability mean_score
## <fct> <chr> <dbl>
## 1 low math 49.2
## 2 low read 48.3
## 3 low science 47.7
## 4 low socst 47.3
## 5 low write 50.6
## 6 middle math 52.2
ggplot(meanScore_byses.f_byability, aes(ses.f, mean_score)) +
geom_bar(aes(fill = ability), position = "dodge", stat = "identity") +
ylab("Average Score")