Your task is to: (1) Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets: ??? Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below. ??? Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!] ??? Perform the analysis requested in the discussion item. ??? Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. (2) Please include in your homework submission, for each of the three chosen datasets: ??? The URL to the .Rmd file in your GitHub repository, and ??? The URL for your rpubs.com web page.
To do the analysis, I first downloaded the data from the websites to csv files and loaded it into R, the data files can be downloaded from the link below and then save it your working directory: [link] https://github.com/ravi-kothari/DATA-607/
Also we will need the following package to do the analysis:
dplyr
tidyr
ggplot2(optional)
install.packages("tidyr")
install.packages("dplyr")
library("tidyr")
library("dplyr")
library("Downloader")
We start by loading all the datasets for the analysis. I picked Walt Wells dataset for the walking dead ratings, then I took the dataset i suggested for English premiere league. I wanted to see how the spin off of the walking dead tv series called Fear the walking dead ratings are compared to original walking dead, so i picked that as my third dataset.
getwd()
## [1] "C:/Users/rkothari/Documents/GitHub/DATA-607"
# load dataset into a dataframe.
df_soccer <- read.csv("~/MSDA/DATA 607/EPL.csv",header=TRUE,stringsAsFactors = FALSE)
df_soccer
## X Team P W D L GF GA GD Pts PPG Wh Dh Lh GFh GAh Wa Da
## 1 1 Manchester City 7 6 0 1 18 7 11 18 2.57 3 0 0 9 2 3 0
## 2 2 Tottenham 7 5 2 0 12 3 9 17 2.43 3 1 0 5 1 2 1
## 3 3 Arsenal 7 5 1 1 16 7 9 16 2.29 2 0 1 8 5 3 1
## 4 4 Liverpool 7 5 1 1 18 10 8 16 2.29 2 0 0 9 2 3 1
## 5 5 Everton 7 4 2 1 11 5 6 14 2.00 2 2 0 6 3 2 0
## 6 6 Manchester Utd 7 4 1 2 13 8 5 13 1.86 2 1 1 8 4 2 0
## 7 7 Chelsea 7 4 1 2 12 9 3 13 1.86 2 0 1 6 3 2 1
## 8 8 Crystal Palace 7 3 2 2 11 8 3 11 1.57 1 1 1 5 3 2 1
## 9 9 West Bromwich 7 2 3 2 8 7 1 9 1.29 1 1 1 5 4 1 2
## 10 10 Southampton 7 2 3 2 7 6 1 9 1.29 1 2 0 3 2 1 1
## 11 11 Watford 7 2 2 3 12 13 -1 8 1.14 1 1 2 7 8 1 1
## 12 12 Leicester City 7 2 2 3 8 11 -3 8 1.14 2 2 0 5 1 0 0
## 13 13 Bournemouth 7 2 2 3 6 11 -5 8 1.14 2 0 1 3 3 0 2
## 14 14 Burnley 7 2 1 4 5 9 -4 7 1.00 2 1 2 5 3 0 0
## 15 15 Hull City 7 2 1 4 7 14 -7 7 1.00 1 0 3 3 8 1 1
## 16 16 Middlesbrough 7 1 3 3 7 10 -3 6 0.86 0 1 2 3 5 1 2
## 17 17 Swansea City 7 1 1 5 6 12 -6 4 0.57 0 1 3 4 9 1 0
## 18 18 West Ham Utd 7 1 1 5 8 17 -9 4 0.57 1 1 2 4 8 0 0
## 19 19 Stoke City 7 0 3 4 5 16 -11 3 0.43 0 1 2 2 9 0 2
## 20 20 Sunderland 7 0 2 5 6 13 -7 2 0.29 0 1 3 4 9 0 1
## La GFa GAa
## 1 1 9 5
## 2 0 7 2
## 3 0 8 2
## 4 1 9 8
## 5 1 5 2
## 6 1 5 4
## 7 1 6 6
## 8 1 6 5
## 9 1 3 3
## 10 2 4 4
## 11 1 5 5
## 12 3 3 10
## 13 2 3 8
## 14 2 0 6
## 15 1 4 6
## 16 1 4 5
## 17 2 2 3
## 18 3 4 9
## 19 2 3 7
## 20 2 2 4
df_wd <- read.csv("~/MSDA/DATA 607/Walking_Dead.csv",header=TRUE,stringsAsFactors = FALSE, na.strings = "-")
df_wd
## Season X1 X2 X3 X4 X5 X6 X7 X8 X9
## 1 1 5.350 4.710 5.070 4.750 5.56 5.970 NA NA NA
## 2 2 7.260 6.700 6.100 6.290 6.12 6.080 6.620 8.100 6.890
## 3 3 10.970 9.550 10.510 9.270 10.37 9.210 10.430 10.480 12.260
## 4 4 16.111 13.950 12.920 13.310 12.20 12.000 11.290 12.050 15.760
## 5 5 17.300 15.143 13.801 14.518 13.53 14.068 13.330 14.807 15.643
## 6 6 14.633 12.183 13.143 13.339 12.44 12.871 13.224 13.981 13.742
## X10 X11 X12 X13 X14 X15 X16
## 1 NA NA NA NA NA NA NA
## 2 7.040 6.770 6.890 8.990 NA NA NA
## 3 11.050 11.010 11.300 11.460 10.840 10.990 12.420
## 4 13.340 13.120 12.610 12.650 12.870 13.470 15.680
## 5 12.267 13.438 14.430 14.534 13.781 13.757 15.784
## 6 13.483 12.794 12.812 12.530 12.686 12.384 14.193
df_fear_wd <- read.csv("~/MSDA/DATA 607/Fear_the_Walking_Dead.csv",header=TRUE,stringsAsFactors = FALSE, na.strings = " ")
df_fear_wd
## Season X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
## 1 1 10.13 8.18 7.19 6.62 6.66 6.86 NA NA NA NA NA NA NA
## 2 2 6.67 5.58 4.73 4.80 4.41 4.49 4.39 3.86 3.66 2.99 3.4 3.62 3.49
## X14 X15
## 1 NA NA
## 2 3.67 3.05
Lets analyze the soccer dataset first. As we can see although the data is in wide format its clean as there is no duplicate values in the table. One thing we can do is to renmae some column names to be more descriptive. Also we would like to remove the extra columns not required for our analysis, We will use dplyr function select to perform this.
library("tidyr")
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df_soccer <- tbl_df(df_soccer)#wide format
glimpse(df_soccer)
## Observations: 20
## Variables: 21
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ Team <chr> " Manchester City ", " Tottenham ", " Arsenal ", " Liverp...
## $ P <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7
## $ W <int> 6, 5, 5, 5, 4, 4, 4, 3, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 0, 0
## $ D <int> 0, 2, 1, 1, 2, 1, 1, 2, 3, 3, 2, 2, 2, 1, 1, 3, 1, 1, 3, 2
## $ L <int> 1, 0, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 3, 5, 5, 4, 5
## $ GF <int> 18, 12, 16, 18, 11, 13, 12, 11, 8, 7, 12, 8, 6, 5, 7, 7, ...
## $ GA <int> 7, 3, 7, 10, 5, 8, 9, 8, 7, 6, 13, 11, 11, 9, 14, 10, 12,...
## $ GD <int> 11, 9, 9, 8, 6, 5, 3, 3, 1, 1, -1, -3, -5, -4, -7, -3, -6...
## $ Pts <int> 18, 17, 16, 16, 14, 13, 13, 11, 9, 9, 8, 8, 8, 7, 7, 6, 4...
## $ PPG <dbl> 2.57, 2.43, 2.29, 2.29, 2.00, 1.86, 1.86, 1.57, 1.29, 1.2...
## $ Wh <int> 3, 3, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 0, 0, 1, 0, 0
## $ Dh <int> 0, 1, 0, 0, 2, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 1, 1, 1
## $ Lh <int> 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 2, 0, 1, 2, 3, 2, 3, 2, 2, 3
## $ GFh <int> 9, 5, 8, 9, 6, 8, 6, 5, 5, 3, 7, 5, 3, 5, 3, 3, 4, 4, 2, 4
## $ GAh <int> 2, 1, 5, 2, 3, 4, 3, 3, 4, 2, 8, 1, 3, 3, 8, 5, 9, 8, 9, 9
## $ Wa <int> 3, 2, 3, 3, 2, 2, 2, 2, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0
## $ Da <int> 0, 1, 1, 1, 0, 0, 1, 1, 2, 1, 1, 0, 2, 0, 1, 2, 0, 0, 2, 1
## $ La <int> 1, 0, 0, 1, 1, 1, 1, 1, 1, 2, 1, 3, 2, 2, 1, 1, 2, 3, 2, 2
## $ GFa <int> 9, 7, 8, 9, 5, 5, 6, 6, 3, 4, 5, 3, 3, 0, 4, 4, 2, 4, 3, 2
## $ GAa <int> 5, 2, 2, 8, 2, 4, 6, 5, 3, 4, 5, 10, 8, 6, 6, 5, 3, 9, 7, 4
clean_soccer <- df_soccer %>%
select(Team:L,Pts,Wh:Lh,Wa:La) %>%
mutate(Played_Home = Wh + Dh + Lh, Played_Away = P - Played_Home) %>%
rename(Played=P,Won=W, Draw=D, Lost=L, Won_Home = Wh, Draw_Home=Dh, Lost_Home = Lh, Won_Away=Wa, Draw_Away=Da, Lost_Away=La)
We used dplyr functions to analyse the data and calculate home won percentage. Also we arranged the teams based on their current points and also home win percent to see the performance vs the home advantage.
# Home win percent
(clean_soccer %>%
select(Team,Played,Pts,Won_Home,Played_Home,Played_Away) %>%
group_by(Team) %>%
summarise(home_win_p = (Won_Home/Played_Home)*100, Pts) %>%
mutate(rank = rank(desc(Pts))) %>%
arrange(rank))
## # A tibble: 20 × 4
## Team home_win_p Pts rank
## <chr> <dbl> <int> <dbl>
## 1 Manchester City 100.00000 18 1.0
## 2 Tottenham 75.00000 17 2.0
## 3 Arsenal 66.66667 16 3.5
## 4 Liverpool 100.00000 16 3.5
## 5 Everton 50.00000 14 5.0
## 6 Chelsea 66.66667 13 6.5
## 7 Manchester Utd 50.00000 13 6.5
## 8 Crystal Palace 33.33333 11 8.0
## 9 Southampton 33.33333 9 9.5
## 10 West Bromwich 33.33333 9 9.5
## 11 Bournemouth 66.66667 8 12.0
## 12 Leicester City 50.00000 8 12.0
## 13 Watford 25.00000 8 12.0
## 14 Burnley 40.00000 7 14.5
## 15 Hull City 25.00000 7 14.5
## 16 Middlesbrough 0.00000 6 16.0
## 17 Swansea City 0.00000 4 17.5
## 18 West Ham Utd 25.00000 4 17.5
## 19 Stoke City 0.00000 3 19.0
## 20 Sunderland 0.00000 2 20.0
Lets analyze the walking dead dataset now. For the analysis we took the dataset which is wide and dirty so we will use tidyr first and dplyr functions to clean and rename column names. We used gather to reshape the data into long form for further analysis.
library("tidyr")
library("dplyr")
df_wd <- tbl_df(df_wd)#wide format
glimpse(df_wd)
## Observations: 6
## Variables: 17
## $ Season <int> 1, 2, 3, 4, 5, 6
## $ X1 <dbl> 5.350, 7.260, 10.970, 16.111, 17.300, 14.633
## $ X2 <dbl> 4.710, 6.700, 9.550, 13.950, 15.143, 12.183
## $ X3 <dbl> 5.070, 6.100, 10.510, 12.920, 13.801, 13.143
## $ X4 <dbl> 4.750, 6.290, 9.270, 13.310, 14.518, 13.339
## $ X5 <dbl> 5.56, 6.12, 10.37, 12.20, 13.53, 12.44
## $ X6 <dbl> 5.970, 6.080, 9.210, 12.000, 14.068, 12.871
## $ X7 <dbl> NA, 6.620, 10.430, 11.290, 13.330, 13.224
## $ X8 <dbl> NA, 8.100, 10.480, 12.050, 14.807, 13.981
## $ X9 <dbl> NA, 6.890, 12.260, 15.760, 15.643, 13.742
## $ X10 <dbl> NA, 7.040, 11.050, 13.340, 12.267, 13.483
## $ X11 <dbl> NA, 6.770, 11.010, 13.120, 13.438, 12.794
## $ X12 <dbl> NA, 6.890, 11.300, 12.610, 14.430, 12.812
## $ X13 <dbl> NA, 8.990, 11.460, 12.650, 14.534, 12.530
## $ X14 <dbl> NA, NA, 10.840, 12.870, 13.781, 12.686
## $ X15 <dbl> NA, NA, 10.990, 13.470, 13.757, 12.384
## $ X16 <dbl> NA, NA, 12.420, 15.680, 15.784, 14.193
df_wd
## # A tibble: 6 × 17
## Season X1 X2 X3 X4 X5 X6 X7 X8 X9
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 5.350 4.710 5.070 4.750 5.56 5.970 NA NA NA
## 2 2 7.260 6.700 6.100 6.290 6.12 6.080 6.620 8.100 6.890
## 3 3 10.970 9.550 10.510 9.270 10.37 9.210 10.430 10.480 12.260
## 4 4 16.111 13.950 12.920 13.310 12.20 12.000 11.290 12.050 15.760
## 5 5 17.300 15.143 13.801 14.518 13.53 14.068 13.330 14.807 15.643
## 6 6 14.633 12.183 13.143 13.339 12.44 12.871 13.224 13.981 13.742
## # ... with 7 more variables: X10 <dbl>, X11 <dbl>, X12 <dbl>, X13 <dbl>,
## # X14 <dbl>, X15 <dbl>, X16 <dbl>
clean_wd <- df_wd %>%
select(Season,Episode = starts_with("X")) %>%
mutate_each(funs(round(.,1)), -Season)
# The arguments to gather():
# - data: Data object
# - key: Name of new key column (made from names of data columns)
# - value: Name of new value column
# - ...: Names of source columns that contain values
# - factor_key: Treat the new key column as a factor (instead of character vector)
# The arguments to spread():
# - data: Data object
# - key: Name of column containing the new column names
# - value: Name of column containing values
wd_long <- gather(clean_wd, Episodes, Ratings, -Season)
wd_wide <- spread(wd_long, Season, Ratings)
clean_wd <- wd_wide %>%
setNames(paste0('Season_', names(.))) %>%
rename(Episodes = Season_Episodes) %>%
arrange(Episodes)
For the analysis part we analyzed the walking dead rating over each season and how the ratings changed over the course of the seasons. To perform this analysis we calculated average ratings for each season and we can see that the populatity grew season over season but it dropped a little for the last season.
library(ggplot2)
x1 <- wd_long %>%
group_by(Season) %>%
summarise(average_ratings = mean(Ratings, na.rm = TRUE)) %>%
arrange(average_ratings) %>%
mutate(rank = rank(desc(average_ratings))) %>%
arrange(rank)
x1
## # A tibble: 6 × 3
## Season average_ratings rank
## <int> <dbl> <dbl>
## 1 5 14.375000 1
## 2 4 13.337500 2
## 3 6 13.143750 3
## 4 3 10.768750 4
## 5 2 6.915385 5
## 6 1 5.250000 6
library("tidyr")
library("dplyr")
df_fear_wd <- tbl_df(df_fear_wd)#wide format
glimpse(df_fear_wd)
## Observations: 2
## Variables: 16
## $ Season <int> 1, 2
## $ X1 <dbl> 10.13, 6.67
## $ X2 <dbl> 8.18, 5.58
## $ X3 <dbl> 7.19, 4.73
## $ X4 <dbl> 6.62, 4.80
## $ X5 <dbl> 6.66, 4.41
## $ X6 <dbl> 6.86, 4.49
## $ X7 <dbl> NA, 4.39
## $ X8 <dbl> NA, 3.86
## $ X9 <dbl> NA, 3.66
## $ X10 <dbl> NA, 2.99
## $ X11 <dbl> NA, 3.4
## $ X12 <dbl> NA, 3.62
## $ X13 <dbl> NA, 3.49
## $ X14 <dbl> NA, 3.67
## $ X15 <dbl> NA, 3.05
df_fear_wd
## # A tibble: 2 × 16
## Season X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 10.13 8.18 7.19 6.62 6.66 6.86 NA NA NA NA NA
## 2 2 6.67 5.58 4.73 4.80 4.41 4.49 4.39 3.86 3.66 2.99 3.4
## # ... with 4 more variables: X12 <dbl>, X13 <dbl>, X14 <dbl>, X15 <dbl>
clean_fear_wd <- df_fear_wd %>%
select(Season,Episode = starts_with("X")) %>%
mutate_each(funs(round(.,1)), -Season)
fear_wd_long <- gather(clean_fear_wd, Episodes, Ratings, -Season)
fear_wd_wide <- spread(fear_wd_long, Season, Ratings)
clean_fear_wd <- fear_wd_wide %>%
setNames(paste0('Season_', names(.))) %>%
rename(Episodes = Season_Episodes) %>%
arrange(Episodes)
x2 <- fear_wd_long %>%
group_by(Season) %>%
summarise(average_ratings = mean(Ratings, na.rm = TRUE)) %>%
arrange(average_ratings) %>%
mutate(rank = rank(desc(average_ratings))) %>%
arrange(rank)
We will compare the ratings of original walking dead with fear the walking dead ratings, We can see that the ratings for the spin off series are not that great.
library(ggplot2)
x1
## # A tibble: 6 × 3
## Season average_ratings rank
## <int> <dbl> <dbl>
## 1 5 14.375000 1
## 2 4 13.337500 2
## 3 6 13.143750 3
## 4 3 10.768750 4
## 5 2 6.915385 5
## 6 1 5.250000 6
x2
## # A tibble: 2 × 3
## Season average_ratings rank
## <int> <dbl> <dbl>
## 1 1 7.616667 1
## 2 2 4.193333 2
qplot(Season, weight = average_ratings, data = x1, geom = "bar")
qplot(Season, weight = average_ratings, data = x2, geom = "bar")