In this assignment, you’ll practice collaborating around a code project with GitHub. You could consider our collective work as building out a book of examples on how to use TidyVerse functions.

GitHub repository: https://github.com/acatlin/SPRING2020TIDYVERSE

FiveThirtyEight.com datasets.

Kaggle datasets.

Your task here is to Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset. (25 points)

Libraries

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Dataset

I will be using the spi_matches dataset

I downloaded this dataset from FiveThirtyEight.com and uploaded the CSV file to GitHub.

Capabilities

  1. read_csv
spi_matches <- read_csv("https://raw.githubusercontent.com/nathtrish334/Data-607/main/spi_matches.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   date = col_date(format = ""),
##   league = col_character(),
##   team1 = col_character(),
##   team2 = col_character()
## )
## i Use `spec()` for the full column specifications.
head(spi_matches)
## # A tibble: 6 x 23
##   season date       league_id league team1 team2  spi1  spi2 prob1 prob2 probtie
##    <dbl> <date>         <dbl> <chr>  <chr> <chr> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1   2016 2016-07-09      7921 FA Wo~ Live~ Read~  51.6  50.4 0.439 0.277   0.284
## 2   2016 2016-07-10      7921 FA Wo~ Arse~ Nott~  46.6  54.0 0.357 0.361   0.282
## 3   2016 2016-07-10      7921 FA Wo~ Chel~ Birm~  59.8  54.6 0.480 0.249   0.271
## 4   2016 2016-07-16      7921 FA Wo~ Live~ Nott~  53    52.4 0.429 0.270   0.301
## 5   2016 2016-07-17      7921 FA Wo~ Chel~ Arse~  59.4  61.0 0.412 0.316   0.272
## 6   2016 2016-07-24      7921 FA Wo~ Read~ Birm~  50.8  55.0 0.382 0.32    0.298
## # ... with 12 more variables: proj_score1 <dbl>, proj_score2 <dbl>,
## #   importance1 <dbl>, importance2 <dbl>, score1 <dbl>, score2 <dbl>,
## #   xg1 <dbl>, xg2 <dbl>, nsxg1 <dbl>, nsxg2 <dbl>, adj_score1 <dbl>,
## #   adj_score2 <dbl>
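The column specification printed above is what read_csv guessed from the file. As a minimal sketch, the same types can be stated explicitly through the col_types argument. The inline CSV text and the team and league names below are made up for illustration; only the column names come from spi_matches.

```r
library(readr)

# Hypothetical two-row sample that reuses the first columns of spi_matches.
# A string containing newlines is read as literal data rather than a path.
sample_csv <- "season,date,league,team1,team2
2016,2016-07-09,League A,Team 1,Team 2
2016,2016-07-10,League A,Team 3,Team 4
"

# Declare the column types up front instead of letting read_csv guess them.
sample_matches <- read_csv(
  sample_csv,
  col_types = cols(
    season = col_double(),
    date   = col_date(format = ""),
    league = col_character(),
    team1  = col_character(),
    team2  = col_character()
  )
)

# spec() prints the column specification that was applied.
spec(sample_matches)
```

Stating the types this way also silences the column specification message and makes a changed input file fail early rather than quietly changing a column's type.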
  2. select
    Select and display only a subset of the columns
spi_matches_select <- select(spi_matches, season, league, team1, team2, prob1, prob2, probtie, score1, score2)
head(spi_matches_select)
## # A tibble: 6 x 9
##   season league         team1      team2       prob1 prob2 probtie score1 score2
##    <dbl> <chr>          <chr>      <chr>       <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1   2016 FA Women's Su~ Liverpool~ Reading     0.439 0.277   0.284      2      0
## 2   2016 FA Women's Su~ Arsenal W~ Notts Coun~ 0.357 0.361   0.282      2      0
## 3   2016 FA Women's Su~ Chelsea F~ Birmingham~ 0.480 0.249   0.271      1      1
## 4   2016 FA Women's Su~ Liverpool~ Notts Coun~ 0.429 0.270   0.301      0      0
## 5   2016 FA Women's Su~ Chelsea F~ Arsenal Wo~ 0.412 0.316   0.272      1      2
## 6   2016 FA Women's Su~ Reading    Birmingham~ 0.382 0.32    0.298      1      1
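Listing every column by name works, but select also accepts helper functions and ranges. The following is a small sketch on a made-up tibble (the values and league names are hypothetical; the column names mirror spi_matches).

```r
library(dplyr)

# Hypothetical two-row tibble standing in for spi_matches.
matches <- tibble(
  season  = c(2016, 2020),
  league  = c("League A", "League B"),
  prob1   = c(0.44, 0.28),
  prob2   = c(0.28, 0.47),
  probtie = c(0.28, 0.25),
  score1  = c(2, 0),
  score2  = c(0, 2),
  xg1     = c(1.5, 0.7)
)

# starts_with("prob") picks prob1, prob2 and probtie;
# score1:score2 picks the range of adjacent columns.
picked <- select(matches, season, league, starts_with("prob"), score1:score2)

# A minus sign drops a column and keeps everything else.
without_xg <- select(matches, -xg1)

names(picked)
names(without_xg)
```

Helpers such as starts_with are handy with wide tables like this one, where related columns share a prefix.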
  3. filter
    I am going to keep only the UEFA Champions League matches from the 2020 season onwards
spi_matches_filter <- filter(spi_matches_select, season >= 2020 & league == "UEFA Champions League")
head(spi_matches_filter)
## # A tibble: 6 x 9
##   season league        team1       team2      prob1  prob2 probtie score1 score2
##    <dbl> <chr>         <chr>       <chr>      <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
## 1   2020 UEFA Champio~ Dynamo Kiev Juventus   0.283 0.473    0.244      0      2
## 2   2020 UEFA Champio~ Zenit St P~ Club Brug~ 0.565 0.182    0.253      1      2
## 3   2020 UEFA Champio~ Lazio       Borussia ~ 0.275 0.479    0.246      3      1
## 4   2020 UEFA Champio~ Stade Renn~ FC Krasno~ 0.501 0.224    0.274      1      1
## 5   2020 UEFA Champio~ Barcelona   Ferencvar~ 0.865 0.0218   0.113      5      1
## 6   2020 UEFA Champio~ Chelsea     Sevilla FC 0.500 0.248    0.252      0      0
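The same filter can be written with comma-separated conditions, which filter combines with AND, and %in% can match several leagues at once. This is a sketch on a made-up tibble; the rows and the second league name are assumptions for illustration, not values checked against the dataset.

```r
library(dplyr)

# Hypothetical rows; only the column names come from spi_matches.
matches <- tibble(
  season = c(2019, 2020, 2020, 2021),
  league = c("UEFA Champions League", "UEFA Champions League",
             "UEFA Europa League", "Italy Serie A"),
  score1 = c(1, 0, 3, 2)
)

# Conditions separated by commas must all be TRUE for a row to be kept.
ucl_2020 <- filter(matches, season >= 2020, league == "UEFA Champions League")

# %in% keeps rows whose league matches any value in the vector.
uefa_2020 <- filter(
  matches,
  season >= 2020,
  league %in% c("UEFA Champions League", "UEFA Europa League")
)

nrow(ucl_2020)
nrow(uefa_2020)
```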
  4. summarise
    I am going to count how many matches each league has in the dataset; count() is a shortcut for group_by() followed by summarise()
spi_matches_count <- spi_matches_select %>% count(league, name = "Count", sort = TRUE)
head(spi_matches_count)
## # A tibble: 6 x 2
##   league                      Count
##   <chr>                       <int>
## 1 English League Championship  2223
## 2 Barclays Premier League      1900
## 3 French Ligue 1               1900
## 4 Italy Serie A                1900
## 5 Spanish Primera Division     1900
## 6 Spanish Segunda Division     1865
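To show what count does behind the scenes, here is a minimal sketch that produces the same table with group_by, summarise and arrange. The tibble and its league names are made up for illustration.

```r
library(dplyr)

# Hypothetical league column with repeated values.
matches <- tibble(
  league = c("League A", "League B", "League A",
             "League C", "League A", "League B")
)

# One-step version, as used above.
with_count <- matches %>%
  count(league, name = "Count", sort = TRUE)

# Longer version: group, count the rows per group with n(),
# then sort from largest to smallest.
with_summarise <- matches %>%
  group_by(league) %>%
  summarise(Count = n(), .groups = "drop") %>%
  arrange(desc(Count))

with_count
with_summarise
```

The longer form is worth knowing because summarise can compute other statistics in the same call, for example a mean of score1 per league, which count alone cannot do.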

Conclusion

I have demonstrated four tidyverse capabilities: reading a CSV file with read_csv from the readr package, and selecting columns, filtering rows and summarising counts with select, filter and count from the dplyr package.