Week 1 Assignment - DATA 607 Data Acquisition & Management

## Silma Khan, Spring 2025

## Introduction

For this Week 1 assignment in the DATA 607 Data Acquisition & Management course, we have to choose one of the datasets provided on FiveThirtyEight (https://data.fivethirtyeight.com/), read more about that dataset, and then:

–> “Take the data, and create one or more code blocks. You should finish with a data frame that contains a subset of the columns in your selected dataset. If there is an obvious target (aka predictor or independent) variable, you should include this in your set of columns. You should include (or add if necessary) meaningful column names and replace (if necessary) any non-intuitive abbreviations used in the data that you selected.”

After looking through the datasets provided, I decided to base my assignment on the “2022-23 NBA Predictions” dataset.

According to its GitHub repo, this dataset includes a file, “nba_elo.csv”, which contains game-by-game Elo ratings and forecasts going back to 1946.

–> At first I was not familiar with what Elo ratings were. After looking into it, I now know that an Elo rating is a number that measures the skill level of a player or team, whether in basketball, chess, baseball, or any other competition.

–> The higher the rating, the more skilled the player or team is considered to be and the more likely they are to win. It measures skill on a quantitative level.
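To make the link between rating and winning chance concrete, here is a minimal sketch of the standard Elo expected-score formula in base R. The function name `elo_win_prob` and its `home_adv` argument are hypothetical names I chose for illustration. The 100-point home bonus is an assumption on my part (it also assumes team1 is the home team); it happens to be consistent with the elo_prob1 values printed in the first rows of the data in Step #3, where two teams rated 1300 give team1 a chance of roughly 0.64.

```r
# Standard Elo expected-score formula: probability that team 1 beats team 2.
# `home_adv` is an assumed rating bonus added to team 1 (hypothetical parameter).
elo_win_prob <- function(elo1, elo2, home_adv = 0) {
  1 / (1 + 10^((elo2 - (elo1 + home_adv)) / 400))
}

# Equal ratings and no bonus: an even matchup
even <- elo_win_prob(1300, 1300)

# Equal ratings with an assumed 100-point home bonus: about 0.64
home <- elo_win_prob(1300, 1300, home_adv = 100)

# A 200-point rating edge makes team 1 a clear favorite: about 0.76
favorite <- elo_win_prob(1500, 1300)

print(round(c(even = even, home = home, favorite = favorite), 3))
```

The formula only depends on the rating difference, so a 1500 vs 1300 matchup gives the same probability as 1700 vs 1500.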

Using this dataset, I want to create a new target variable that identifies the winning team of each game.

In order to do this, I will:

  1. Load the tidyverse library
  2. Pull in the CSV file (https://projects.fivethirtyeight.com/nba-model/nba_elo.csv)
  3. See all the column names and examine which ones I want to keep for the analysis and which ones I do not need in the new dataset
  4. Rename the columns that I choose to keep to more descriptive ones & create a new column for the target variable
  5. Check the new data

## Step #1: Load Tidyverse

I will load the “tidyverse” library, which bundles many different R packages for manipulating data. I need these later when I create the subset.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

## Step #2: Pull in the CSV File

As mentioned, I chose the 2022-23 NBA Predictions dataset from FiveThirtyEight and decided to use its “nba_elo.csv” file, as it contains data going back to 1946.

nba_link <- "https://projects.fivethirtyeight.com/nba-model/nba_elo.csv"
nba_data <- read.csv(nba_link, stringsAsFactors = FALSE)
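One thing to note about this step is that read.csv() reads the date column in as plain character strings. The sketch below shows one way the column could be converted to a real Date type if dates are needed later. It uses a few rows copied from the head() output in Step #3, reduced to six columns so that it runs without a download; the object names `csv_text` and `sample_games` are my own.

```r
# A few rows copied from the head() output, reduced to six columns
csv_text <- "date,season,team1,team2,score1,score2
1946-11-01,1947,TRH,NYK,66,68
1946-11-02,1947,PRO,BOS,59,53
1946-11-02,1947,STB,PIT,56,51"

sample_games <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# read.csv() keeps ISO-style dates as character strings
print(class(sample_games$date))

# Convert to Date so the column can be sorted, filtered, or differenced
sample_games$date <- as.Date(sample_games$date, format = "%Y-%m-%d")
print(class(sample_games$date))
```

The same as.Date() call should work on nba_data$date, assuming every value in the full file follows the YYYY-MM-DD pattern seen in the first rows.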

## Step #3: See all the Column Names - Which to Keep

After pulling in the data, I can view the column names and see what the dataset looks like.

print(names(nba_data))
##  [1] "date"           "season"         "neutral"        "playoff"       
##  [5] "team1"          "team2"          "elo1_pre"       "elo2_pre"      
##  [9] "elo_prob1"      "elo_prob2"      "elo1_post"      "elo2_post"     
## [13] "carm.elo1_pre"  "carm.elo2_pre"  "carm.elo_prob1" "carm.elo_prob2"
## [17] "carm.elo1_post" "carm.elo2_post" "raptor1_pre"    "raptor2_pre"   
## [21] "raptor_prob1"   "raptor_prob2"   "score1"         "score2"        
## [25] "quality"        "importance"     "total_rating"
head(nba_data)
##         date season neutral playoff team1 team2 elo1_pre elo2_pre elo_prob1
## 1 1946-11-01   1947       0           TRH   NYK     1300 1300.000 0.6400650
## 2 1946-11-02   1947       0           PRO   BOS     1300 1300.000 0.6400650
## 3 1946-11-02   1947       0           STB   PIT     1300 1300.000 0.6400650
## 4 1946-11-02   1947       0           CHS   NYK     1300 1306.723 0.6311012
## 5 1946-11-02   1947       0           DTF   WSC     1300 1300.000 0.6400650
## 6 1946-11-03   1947       0           CLR   TRH     1300 1293.277 0.6489322
##   elo_prob2 elo1_post elo2_post carm.elo1_pre carm.elo2_pre carm.elo_prob1
## 1 0.3599350  1293.277  1306.723            NA            NA             NA
## 2 0.3599350  1305.154  1294.846            NA            NA             NA
## 3 0.3599350  1304.691  1295.309            NA            NA             NA
## 4 0.3688988  1309.652  1297.071            NA            NA             NA
## 5 0.3599350  1279.619  1320.381            NA            NA             NA
## 6 0.3510678  1307.123  1286.153            NA            NA             NA
##   carm.elo_prob2 carm.elo1_post carm.elo2_post raptor1_pre raptor2_pre
## 1             NA             NA             NA          NA          NA
## 2             NA             NA             NA          NA          NA
## 3             NA             NA             NA          NA          NA
## 4             NA             NA             NA          NA          NA
## 5             NA             NA             NA          NA          NA
## 6             NA             NA             NA          NA          NA
##   raptor_prob1 raptor_prob2 score1 score2 quality importance total_rating
## 1           NA           NA     66     68       0         NA           NA
## 2           NA           NA     59     53       0         NA           NA
## 3           NA           NA     56     51       0         NA           NA
## 4           NA           NA     63     47       0         NA           NA
## 5           NA           NA     33     50       0         NA           NA
## 6           NA           NA     71     60       0         NA           NA

Looking at the column names, along with the first 6 rows of the dataset, I can see that some columns are not needed for my ultimate goal of seeing which team won, lost, or tied. For the target variable I am trying to create (winner), I want to create a subset using these columns:

- date, season, team1, team2, score1 (team 1's score), score2 (team 2's score), quality (game quality)

Although the team and score columns alone would suffice to create the new column, I would like to keep the others in case I decide to do further research.

## Step #4: Rename Columns & Create New df With New Target Variable

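# Keep a subset of the columns, renaming each one as it is selected
nba_winner <- nba_data %>%
  select(
    Date = date,
    Season = season,
    Team1 = team1,
    Team2 = team2,
    Team1_Score = score1,
    Team2_Score = score2,
    Game_Quality = quality
  ) %>%
  # Add the target variable: the team with the higher score wins
  mutate(winner = case_when(
    Team1_Score > Team2_Score ~ Team1,
    Team1_Score < Team2_Score ~ Team2,
    TRUE ~ "Tie"  # equal scores; also catches any row with a missing score
  ))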

Using this, I was able to create a new data frame, which I titled “nba_winner”.

–> The pipe operator (%>%) lets me use nba_data as the starting point for this subset.

–> The select() function lets me choose the columns I want to include in my new df and rename them at the same time; the underlying data stays the same.

–> The mutate() function lets me add a new column to this df, which I decided to call “winner” because it records who won each game. I also used the case_when() function to set the conditions for this new column.
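One detail of case_when() is worth a closer look: a comparison involving a missing score evaluates to NA, so such a row would fall through to the final TRUE branch and be labeled “Tie”. I have not checked whether nba_elo.csv actually contains rows with missing scores, so the following is only a sketch of how that case could be handled explicitly. The first two rows of the toy data come from the head() output; the third row (teams “AAA” and “BBB”) is hypothetical and added only to show the missing-score case.

```r
library(dplyr)

# Toy data: rows 1-2 are from the head() output; row 3 is a hypothetical
# unplayed game with missing scores, added for illustration only.
toy_games <- tibble(
  team1  = c("TRH", "PRO", "AAA"),
  team2  = c("NYK", "BOS", "BBB"),
  score1 = c(66, 59, NA),
  score2 = c(68, 53, NA)
)

toy_winner <- toy_games %>%
  select(
    Team1 = team1,
    Team2 = team2,
    Team1_Score = score1,
    Team2_Score = score2
  ) %>%
  mutate(winner = case_when(
    # Check for missing scores first so the fallback does not catch them
    is.na(Team1_Score) | is.na(Team2_Score) ~ NA_character_,
    Team1_Score > Team2_Score ~ Team1,
    Team1_Score < Team2_Score ~ Team2,
    TRUE ~ "Tie"
  ))

print(toy_winner)
```

Because case_when() uses the first condition that is TRUE, the order matters: the missing-score check has to come before the fallback branch.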

## Step #5: Check New DataFrame

head(nba_winner)
##         Date Season Team1 Team2 Team1_Score Team2_Score Game_Quality winner
## 1 1946-11-01   1947   TRH   NYK          66          68            0    NYK
## 2 1946-11-02   1947   PRO   BOS          59          53            0    PRO
## 3 1946-11-02   1947   STB   PIT          56          51            0    STB
## 4 1946-11-02   1947   CHS   NYK          63          47            0    CHS
## 5 1946-11-02   1947   DTF   WSC          33          50            0    WSC
## 6 1946-11-03   1947   CLR   TRH          71          60            0    CLR

## Findings & Recommendations

By creating the new variable “winner”, we get a better sense of which teams won which games and against whom. In further analysis, this information could show which teams have won the most, which teams have lost the most, and in which seasons specific teams performed best. With outside sources, we could also explore why certain teams performed the way they did in those seasons.
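As a starting point for that kind of follow-up, wins and losses per team could be tallied with count(). This is only a sketch: it runs on the six rows shown in the head(nba_winner) output rather than the full data, and the `loser` column and the object names `wins_by_team` and `losses_by_team` are my own additions.

```r
library(dplyr)

# Rows copied from the head(nba_winner) output (four columns only)
toy_winner <- tibble(
  Season = c(1947, 1947, 1947, 1947, 1947, 1947),
  Team1  = c("TRH", "PRO", "STB", "CHS", "DTF", "CLR"),
  Team2  = c("NYK", "BOS", "PIT", "NYK", "WSC", "TRH"),
  winner = c("NYK", "PRO", "STB", "CHS", "WSC", "CLR")
)

# Drop ties, then record the loser as whichever team is not the winner
decided <- toy_winner %>%
  filter(winner != "Tie") %>%
  mutate(loser = if_else(winner == Team1, Team2, Team1))

# Wins and losses per team within each season, largest counts first
wins_by_team   <- decided %>% count(Season, winner, name = "wins", sort = TRUE)
losses_by_team <- decided %>% count(Season, loser, name = "losses", sort = TRUE)

print(wins_by_team)
print(losses_by_team)
```

On the real data, the same two count() calls could be run on nba_winner directly, and grouping by Season keeps each team's record separate from year to year.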