Project 2

Setup

To do the analysis, I first downloaded the data from the websites to csv files and loaded it into R, the data files can be downloaded from the link below and then save it your working directory: [link] https://github.com/ravi-kothari/DATA-607/

Also we will need the following package to do the analysis:

dplyr
tidyr
ggplot2(optional)

install.packages("tidyr")
install.packages("dplyr")
library("tidyr")
library("dplyr")
library("Downloader")

Step 1: Loading the datasets.

We start by loading all the datasets for the analysis. I picked Walt Wells dataset for the walking dead ratings, then I took the dataset i suggested for English premiere league. I wanted to see how the spin off of the walking dead tv series called Fear the walking dead ratings are compared to original walking dead, so i picked that as my third dataset.

getwd()

## [1] "C:/Users/rkothari/Documents/GitHub/DATA-607"

# load dataset into a dataframe.

df_soccer <- read.csv("~/MSDA/DATA 607/EPL.csv",header=TRUE,stringsAsFactors = FALSE)
df_soccer

##     X              Team P W D L GF GA  GD Pts  PPG Wh Dh Lh GFh GAh Wa Da
## 1   1  Manchester City  7 6 0 1 18  7  11  18 2.57  3  0  0   9   2  3  0
## 2   2        Tottenham  7 5 2 0 12  3   9  17 2.43  3  1  0   5   1  2  1
## 3   3          Arsenal  7 5 1 1 16  7   9  16 2.29  2  0  1   8   5  3  1
## 4   4        Liverpool  7 5 1 1 18 10   8  16 2.29  2  0  0   9   2  3  1
## 5   5          Everton  7 4 2 1 11  5   6  14 2.00  2  2  0   6   3  2  0
## 6   6   Manchester Utd  7 4 1 2 13  8   5  13 1.86  2  1  1   8   4  2  0
## 7   7          Chelsea  7 4 1 2 12  9   3  13 1.86  2  0  1   6   3  2  1
## 8   8   Crystal Palace  7 3 2 2 11  8   3  11 1.57  1  1  1   5   3  2  1
## 9   9    West Bromwich  7 2 3 2  8  7   1   9 1.29  1  1  1   5   4  1  2
## 10 10      Southampton  7 2 3 2  7  6   1   9 1.29  1  2  0   3   2  1  1
## 11 11          Watford  7 2 2 3 12 13  -1   8 1.14  1  1  2   7   8  1  1
## 12 12   Leicester City  7 2 2 3  8 11  -3   8 1.14  2  2  0   5   1  0  0
## 13 13      Bournemouth  7 2 2 3  6 11  -5   8 1.14  2  0  1   3   3  0  2
## 14 14          Burnley  7 2 1 4  5  9  -4   7 1.00  2  1  2   5   3  0  0
## 15 15        Hull City  7 2 1 4  7 14  -7   7 1.00  1  0  3   3   8  1  1
## 16 16    Middlesbrough  7 1 3 3  7 10  -3   6 0.86  0  1  2   3   5  1  2
## 17 17     Swansea City  7 1 1 5  6 12  -6   4 0.57  0  1  3   4   9  1  0
## 18 18     West Ham Utd  7 1 1 5  8 17  -9   4 0.57  1  1  2   4   8  0  0
## 19 19       Stoke City  7 0 3 4  5 16 -11   3 0.43  0  1  2   2   9  0  2
## 20 20       Sunderland  7 0 2 5  6 13  -7   2 0.29  0  1  3   4   9  0  1
##    La GFa GAa
## 1   1   9   5
## 2   0   7   2
## 3   0   8   2
## 4   1   9   8
## 5   1   5   2
## 6   1   5   4
## 7   1   6   6
## 8   1   6   5
## 9   1   3   3
## 10  2   4   4
## 11  1   5   5
## 12  3   3  10
## 13  2   3   8
## 14  2   0   6
## 15  1   4   6
## 16  1   4   5
## 17  2   2   3
## 18  3   4   9
## 19  2   3   7
## 20  2   2   4

df_wd <- read.csv("~/MSDA/DATA 607/Walking_Dead.csv",header=TRUE,stringsAsFactors = FALSE, na.strings = "-")
df_wd

##   Season     X1     X2     X3     X4    X5     X6     X7     X8     X9
## 1      1  5.350  4.710  5.070  4.750  5.56  5.970     NA     NA     NA
## 2      2  7.260  6.700  6.100  6.290  6.12  6.080  6.620  8.100  6.890
## 3      3 10.970  9.550 10.510  9.270 10.37  9.210 10.430 10.480 12.260
## 4      4 16.111 13.950 12.920 13.310 12.20 12.000 11.290 12.050 15.760
## 5      5 17.300 15.143 13.801 14.518 13.53 14.068 13.330 14.807 15.643
## 6      6 14.633 12.183 13.143 13.339 12.44 12.871 13.224 13.981 13.742
##      X10    X11    X12    X13    X14    X15    X16
## 1     NA     NA     NA     NA     NA     NA     NA
## 2  7.040  6.770  6.890  8.990     NA     NA     NA
## 3 11.050 11.010 11.300 11.460 10.840 10.990 12.420
## 4 13.340 13.120 12.610 12.650 12.870 13.470 15.680
## 5 12.267 13.438 14.430 14.534 13.781 13.757 15.784
## 6 13.483 12.794 12.812 12.530 12.686 12.384 14.193

df_fear_wd <- read.csv("~/MSDA/DATA 607/Fear_the_Walking_Dead.csv",header=TRUE,stringsAsFactors = FALSE, na.strings = " ")
df_fear_wd

##   Season    X1   X2   X3   X4   X5   X6   X7   X8   X9  X10 X11  X12  X13
## 1      1 10.13 8.18 7.19 6.62 6.66 6.86   NA   NA   NA   NA  NA   NA   NA
## 2      2  6.67 5.58 4.73 4.80 4.41 4.49 4.39 3.86 3.66 2.99 3.4 3.62 3.49
##    X14  X15
## 1   NA   NA
## 2 3.67 3.05

Step 2a: Tidy and Transform of soccer data:

Lets analyze the soccer dataset first. As we can see although the data is in wide format its clean as there is no duplicate values in the table. One thing we can do is to renmae some column names to be more descriptive. Also we would like to remove the extra columns not required for our analysis, We will use dplyr function select to perform this.

library("tidyr")
library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

df_soccer <- tbl_df(df_soccer)#wide format
glimpse(df_soccer)

## Observations: 20
## Variables: 21
## $ X    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ Team <chr> " Manchester City ", " Tottenham ", " Arsenal ", " Liverp...
## $ P    <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7
## $ W    <int> 6, 5, 5, 5, 4, 4, 4, 3, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 0, 0
## $ D    <int> 0, 2, 1, 1, 2, 1, 1, 2, 3, 3, 2, 2, 2, 1, 1, 3, 1, 1, 3, 2
## $ L    <int> 1, 0, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 3, 5, 5, 4, 5
## $ GF   <int> 18, 12, 16, 18, 11, 13, 12, 11, 8, 7, 12, 8, 6, 5, 7, 7, ...
## $ GA   <int> 7, 3, 7, 10, 5, 8, 9, 8, 7, 6, 13, 11, 11, 9, 14, 10, 12,...
## $ GD   <int> 11, 9, 9, 8, 6, 5, 3, 3, 1, 1, -1, -3, -5, -4, -7, -3, -6...
## $ Pts  <int> 18, 17, 16, 16, 14, 13, 13, 11, 9, 9, 8, 8, 8, 7, 7, 6, 4...
## $ PPG  <dbl> 2.57, 2.43, 2.29, 2.29, 2.00, 1.86, 1.86, 1.57, 1.29, 1.2...
## $ Wh   <int> 3, 3, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 0, 0, 1, 0, 0
## $ Dh   <int> 0, 1, 0, 0, 2, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 1, 1, 1
## $ Lh   <int> 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 2, 0, 1, 2, 3, 2, 3, 2, 2, 3
## $ GFh  <int> 9, 5, 8, 9, 6, 8, 6, 5, 5, 3, 7, 5, 3, 5, 3, 3, 4, 4, 2, 4
## $ GAh  <int> 2, 1, 5, 2, 3, 4, 3, 3, 4, 2, 8, 1, 3, 3, 8, 5, 9, 8, 9, 9
## $ Wa   <int> 3, 2, 3, 3, 2, 2, 2, 2, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0
## $ Da   <int> 0, 1, 1, 1, 0, 0, 1, 1, 2, 1, 1, 0, 2, 0, 1, 2, 0, 0, 2, 1
## $ La   <int> 1, 0, 0, 1, 1, 1, 1, 1, 1, 2, 1, 3, 2, 2, 1, 1, 2, 3, 2, 2
## $ GFa  <int> 9, 7, 8, 9, 5, 5, 6, 6, 3, 4, 5, 3, 3, 0, 4, 4, 2, 4, 3, 2
## $ GAa  <int> 5, 2, 2, 8, 2, 4, 6, 5, 3, 4, 5, 10, 8, 6, 6, 5, 3, 9, 7, 4

clean_soccer <- df_soccer %>% 
                        select(Team:L,Pts,Wh:Lh,Wa:La) %>%
                        mutate(Played_Home = Wh + Dh + Lh, Played_Away = P - Played_Home) %>%
                        rename(Played=P,Won=W, Draw=D, Lost=L, Won_Home = Wh, Draw_Home=Dh, Lost_Home = Lh, Won_Away=Wa, Draw_Away=Da, Lost_Away=La)

Step: Data Analysis for soccer data

We used dplyr functions to analyse the data and calculate home won percentage. Also we arranged the teams based on their current points and also home win percent to see the performance vs the home advantage.

# Home win percent

(clean_soccer %>%
            select(Team,Played,Pts,Won_Home,Played_Home,Played_Away) %>%
            group_by(Team) %>%
            summarise(home_win_p = (Won_Home/Played_Home)*100, Pts) %>%
            mutate(rank = rank(desc(Pts))) %>%
            arrange(rank))

## # A tibble: 20 × 4
##                 Team home_win_p   Pts  rank
##                <chr>      <dbl> <int> <dbl>
## 1   Manchester City   100.00000    18   1.0
## 2         Tottenham    75.00000    17   2.0
## 3           Arsenal    66.66667    16   3.5
## 4         Liverpool   100.00000    16   3.5
## 5           Everton    50.00000    14   5.0
## 6           Chelsea    66.66667    13   6.5
## 7    Manchester Utd    50.00000    13   6.5
## 8    Crystal Palace    33.33333    11   8.0
## 9       Southampton    33.33333     9   9.5
## 10    West Bromwich    33.33333     9   9.5
## 11      Bournemouth    66.66667     8  12.0
## 12   Leicester City    50.00000     8  12.0
## 13          Watford    25.00000     8  12.0
## 14          Burnley    40.00000     7  14.5
## 15        Hull City    25.00000     7  14.5
## 16    Middlesbrough     0.00000     6  16.0
## 17     Swansea City     0.00000     4  17.5
## 18     West Ham Utd    25.00000     4  17.5
## 19       Stoke City     0.00000     3  19.0
## 20       Sunderland     0.00000     2  20.0

Step: Tidy and Transform Walking dead data:

Lets analyze the walking dead dataset now. For the analysis we took the dataset which is wide and dirty so we will use tidyr first and dplyr functions to clean and rename column names. We used gather to reshape the data into long form for further analysis.

library("tidyr")
library("dplyr")

df_wd <- tbl_df(df_wd)#wide format
glimpse(df_wd)

## Observations: 6
## Variables: 17
## $ Season <int> 1, 2, 3, 4, 5, 6
## $ X1     <dbl> 5.350, 7.260, 10.970, 16.111, 17.300, 14.633
## $ X2     <dbl> 4.710, 6.700, 9.550, 13.950, 15.143, 12.183
## $ X3     <dbl> 5.070, 6.100, 10.510, 12.920, 13.801, 13.143
## $ X4     <dbl> 4.750, 6.290, 9.270, 13.310, 14.518, 13.339
## $ X5     <dbl> 5.56, 6.12, 10.37, 12.20, 13.53, 12.44
## $ X6     <dbl> 5.970, 6.080, 9.210, 12.000, 14.068, 12.871
## $ X7     <dbl> NA, 6.620, 10.430, 11.290, 13.330, 13.224
## $ X8     <dbl> NA, 8.100, 10.480, 12.050, 14.807, 13.981
## $ X9     <dbl> NA, 6.890, 12.260, 15.760, 15.643, 13.742
## $ X10    <dbl> NA, 7.040, 11.050, 13.340, 12.267, 13.483
## $ X11    <dbl> NA, 6.770, 11.010, 13.120, 13.438, 12.794
## $ X12    <dbl> NA, 6.890, 11.300, 12.610, 14.430, 12.812
## $ X13    <dbl> NA, 8.990, 11.460, 12.650, 14.534, 12.530
## $ X14    <dbl> NA, NA, 10.840, 12.870, 13.781, 12.686
## $ X15    <dbl> NA, NA, 10.990, 13.470, 13.757, 12.384
## $ X16    <dbl> NA, NA, 12.420, 15.680, 15.784, 14.193

df_wd

## # A tibble: 6 × 17
##   Season     X1     X2     X3     X4    X5     X6     X7     X8     X9
##    <int>  <dbl>  <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1      1  5.350  4.710  5.070  4.750  5.56  5.970     NA     NA     NA
## 2      2  7.260  6.700  6.100  6.290  6.12  6.080  6.620  8.100  6.890
## 3      3 10.970  9.550 10.510  9.270 10.37  9.210 10.430 10.480 12.260
## 4      4 16.111 13.950 12.920 13.310 12.20 12.000 11.290 12.050 15.760
## 5      5 17.300 15.143 13.801 14.518 13.53 14.068 13.330 14.807 15.643
## 6      6 14.633 12.183 13.143 13.339 12.44 12.871 13.224 13.981 13.742
## # ... with 7 more variables: X10 <dbl>, X11 <dbl>, X12 <dbl>, X13 <dbl>,
## #   X14 <dbl>, X15 <dbl>, X16 <dbl>

clean_wd <- df_wd %>% 
                  select(Season,Episode  = starts_with("X")) %>%
                  mutate_each(funs(round(.,1)), -Season)

# The arguments to gather():
# - data: Data object
# - key: Name of new key column (made from names of data columns)
# - value: Name of new value column
# - ...: Names of source columns that contain values
# - factor_key: Treat the new key column as a factor (instead of character vector)

# The arguments to spread():
# - data: Data object
# - key: Name of column containing the new column names
# - value: Name of column containing values

wd_long <- gather(clean_wd, Episodes, Ratings, -Season)
wd_wide <- spread(wd_long, Season, Ratings)

clean_wd <- wd_wide %>%
                    setNames(paste0('Season_', names(.))) %>%
                    rename(Episodes = Season_Episodes) %>%
                    arrange(Episodes)

Step: Data Analysis for walking dead dataset

For the analysis part we analyzed the walking dead rating over each season and how the ratings changed over the course of the seasons. To perform this analysis we calculated average ratings for each season and we can see that the populatity grew season over season but it dropped a little for the last season.

library(ggplot2)
  x1 <- wd_long %>%
            group_by(Season) %>%
            summarise(average_ratings = mean(Ratings, na.rm = TRUE)) %>%
            arrange(average_ratings) %>%
            mutate(rank = rank(desc(average_ratings))) %>%
            arrange(rank)

x1

## # A tibble: 6 × 3
##   Season average_ratings  rank
##    <int>           <dbl> <dbl>
## 1      5       14.375000     1
## 2      4       13.337500     2
## 3      6       13.143750     3
## 4      3       10.768750     4
## 5      2        6.915385     5
## 6      1        5.250000     6

Step: Cleaning up of the third dataset(spin off of Walking Dead)

library("tidyr")
library("dplyr")

df_fear_wd <- tbl_df(df_fear_wd)#wide format
glimpse(df_fear_wd)

## Observations: 2
## Variables: 16
## $ Season <int> 1, 2
## $ X1     <dbl> 10.13, 6.67
## $ X2     <dbl> 8.18, 5.58
## $ X3     <dbl> 7.19, 4.73
## $ X4     <dbl> 6.62, 4.80
## $ X5     <dbl> 6.66, 4.41
## $ X6     <dbl> 6.86, 4.49
## $ X7     <dbl> NA, 4.39
## $ X8     <dbl> NA, 3.86
## $ X9     <dbl> NA, 3.66
## $ X10    <dbl> NA, 2.99
## $ X11    <dbl> NA, 3.4
## $ X12    <dbl> NA, 3.62
## $ X13    <dbl> NA, 3.49
## $ X14    <dbl> NA, 3.67
## $ X15    <dbl> NA, 3.05

df_fear_wd

## # A tibble: 2 × 16
##   Season    X1    X2    X3    X4    X5    X6    X7    X8    X9   X10   X11
##    <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1      1 10.13  8.18  7.19  6.62  6.66  6.86    NA    NA    NA    NA    NA
## 2      2  6.67  5.58  4.73  4.80  4.41  4.49  4.39  3.86  3.66  2.99   3.4
## # ... with 4 more variables: X12 <dbl>, X13 <dbl>, X14 <dbl>, X15 <dbl>

clean_fear_wd <- df_fear_wd %>% 
                  select(Season,Episode  = starts_with("X")) %>%
                  mutate_each(funs(round(.,1)), -Season)


fear_wd_long <- gather(clean_fear_wd, Episodes, Ratings, -Season)
fear_wd_wide <- spread(fear_wd_long, Season, Ratings)

clean_fear_wd <- fear_wd_wide %>%
                    setNames(paste0('Season_', names(.))) %>%
                    rename(Episodes = Season_Episodes) %>%
                    arrange(Episodes)

Step: Data analysis

 x2 <-   fear_wd_long %>%
            group_by(Season) %>%
            summarise(average_ratings = mean(Ratings, na.rm = TRUE)) %>%
            arrange(average_ratings) %>%
            mutate(rank = rank(desc(average_ratings))) %>%
            arrange(rank)

Comparison

We will compare the ratings of original walking dead with fear the walking dead ratings, We can see that the ratings for the spin off series are not that great.

library(ggplot2)
x1

## # A tibble: 6 × 3
##   Season average_ratings  rank
##    <int>           <dbl> <dbl>
## 1      5       14.375000     1
## 2      4       13.337500     2
## 3      6       13.143750     3
## 4      3       10.768750     4
## 5      2        6.915385     5
## 6      1        5.250000     6

x2

## # A tibble: 2 × 3
##   Season average_ratings  rank
##    <int>           <dbl> <dbl>
## 1      1        7.616667     1
## 2      2        4.193333     2

qplot(Season, weight = average_ratings, data = x1, geom = "bar")

qplot(Season, weight = average_ratings, data = x2, geom = "bar")