##Silma Khan SPRING 2025
##Introduction
For the Week 1 assignment in the DATA 607 Data Acquisition & Management course, we have to choose one of the datasets provided on FiveThirtyEight (https://data.fivethirtyeight.com/), read more about that dataset, and then:
—> “Take the data, and create one or more code blocks. You should finish with a data frame that contains a subset of the columns in your selected dataset. If there is an obvious target (aka predictor or independent) variable, you should include this in your set of columns. You should include (or add if necessary) meaningful column names and replace (if necessary) any non-intuitive abbreviations used in the data that you selected.”
After looking further into the datasets provided, I decided to do my assignment using the “2022-23 NBA Predictions” dataset.
Taking a closer look, this dataset's GitHub repo contains an “nba_elo.csv” file, which holds game-by-game Elo ratings and forecasts going back to 1946.
–> At first I was not familiar with what Elo ratings were. After looking into it, I now know that an Elo rating is a number that measures a player's or team's skill level, whether for basketball, chess, baseball, or any other sport as well.
–> The higher the rating, the more skilled the player or team is and the more likely they are to win. It measures skill on a quantitative level.
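To make the link between rating and win probability concrete, here is a minimal sketch of the standard Elo expected-score formula. The function name `elo_win_prob` and the 100-point bonus for team1 are my own assumptions, not something documented above. I picked the bonus because it appears to reproduce the `elo_prob1` values in the first rows of the file printed in Step #3.

```r
# Standard Elo expected-score formula (logistic curve, base 10, scale 400).
# home_adv is an ASSUMED rating bonus for team1, not a documented value.
elo_win_prob <- function(elo1, elo2, home_adv = 100) {
  1 / (1 + 10^(-(elo1 + home_adv - elo2) / 400))
}

# Two equally rated teams: team1 is favored only through the assumed bonus
elo_win_prob(1300, 1300)                 # about 0.640

# Without any bonus, equal ratings give an even game
elo_win_prob(1300, 1300, home_adv = 0)   # 0.5
```

A larger rating gap pushes the probability closer to 1 for the stronger team, which is the "higher rating, more likely to win" idea in numeric form.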
Using this dataset, I want to create a new target variable that shows the winning team of each game.
In order to do this, I will:
##Step #1: Install Tidyverse
I will install (if needed) and load the “tidyverse” library, which contains many different R packages that allow me to manipulate data. I need this once I create the subset.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##Step #2: Pull in the CSV File
As noted above, I chose the 2022-23 NBA Predictions dataset from FiveThirtyEight and decided to use the “nba_elo.csv” file, as it contains data going back to 1946.
nba_link <- "https://projects.fivethirtyeight.com/nba-model/nba_elo.csv"
nba_data <- read.csv(nba_link, stringsAsFactors = FALSE)
##Step #3: See all the Column names - Which to keep
After pulling in the data, I am now able to view the column names and see what the dataset looks like.
print(names(nba_data))
## [1] "date" "season" "neutral" "playoff"
## [5] "team1" "team2" "elo1_pre" "elo2_pre"
## [9] "elo_prob1" "elo_prob2" "elo1_post" "elo2_post"
## [13] "carm.elo1_pre" "carm.elo2_pre" "carm.elo_prob1" "carm.elo_prob2"
## [17] "carm.elo1_post" "carm.elo2_post" "raptor1_pre" "raptor2_pre"
## [21] "raptor_prob1" "raptor_prob2" "score1" "score2"
## [25] "quality" "importance" "total_rating"
head(nba_data)
## date season neutral playoff team1 team2 elo1_pre elo2_pre elo_prob1
## 1 1946-11-01 1947 0 TRH NYK 1300 1300.000 0.6400650
## 2 1946-11-02 1947 0 PRO BOS 1300 1300.000 0.6400650
## 3 1946-11-02 1947 0 STB PIT 1300 1300.000 0.6400650
## 4 1946-11-02 1947 0 CHS NYK 1300 1306.723 0.6311012
## 5 1946-11-02 1947 0 DTF WSC 1300 1300.000 0.6400650
## 6 1946-11-03 1947 0 CLR TRH 1300 1293.277 0.6489322
## elo_prob2 elo1_post elo2_post carm.elo1_pre carm.elo2_pre carm.elo_prob1
## 1 0.3599350 1293.277 1306.723 NA NA NA
## 2 0.3599350 1305.154 1294.846 NA NA NA
## 3 0.3599350 1304.691 1295.309 NA NA NA
## 4 0.3688988 1309.652 1297.071 NA NA NA
## 5 0.3599350 1279.619 1320.381 NA NA NA
## 6 0.3510678 1307.123 1286.153 NA NA NA
## carm.elo_prob2 carm.elo1_post carm.elo2_post raptor1_pre raptor2_pre
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## raptor_prob1 raptor_prob2 score1 score2 quality importance total_rating
## 1 NA NA 66 68 0 NA NA
## 2 NA NA 59 53 0 NA NA
## 3 NA NA 56 51 0 NA NA
## 4 NA NA 63 47 0 NA NA
## 5 NA NA 33 50 0 NA NA
## 6 NA NA 71 60 0 NA NA
Looking at the column names printed above, along with the first 6 rows of the dataset, I can see that some columns are not needed for my ultimate goal of seeing which team won, lost, or had a tie. For the target variable I am trying to create (winner), I want to create a subset using the columns:
- date, season, team1, team2, score1, score2, quality (the last three are renamed to Team1_Score, Team2_Score, and Game_Quality in the next step)
Although the team and score columns alone would be enough to create the new column, I would like to keep the others in case I decide to do further research.
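One way to back up this choice is to count the missing values in each column. The sketch below uses a small hand-made data frame, `sample_games`, which is a hypothetical stand-in that copies a few values from the output above rather than the full file.

```r
# Hypothetical stand-in for nba_data, with values copied from the rows above
sample_games <- data.frame(
  team1 = c("TRH", "PRO", "STB"),
  team2 = c("NYK", "BOS", "PIT"),
  score1 = c(66, 59, 56),
  score2 = c(68, 53, 51),
  raptor1_pre = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(sample_games))
na_counts

# Columns that are entirely missing in this sample
names(na_counts)[na_counts == nrow(sample_games)]
```

Running the same two steps on nba_data would show how sparse each column is across the whole file, not just in the first six rows.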
##Step #4: Rename Columns & Create new df With New Target Variable
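nba_winner <- nba_data %>%
  # Keep only the needed columns and give them clearer names
  select(
    Date = date,
    Season = season,
    Team1 = team1,
    Team2 = team2,
    Team1_Score = score1,
    Team2_Score = score2,
    Game_Quality = quality
  ) %>%
  mutate(winner = case_when(
    # Games with no recorded score (if any) stay NA instead of becoming "Tie"
    is.na(Team1_Score) | is.na(Team2_Score) ~ NA_character_,
    Team1_Score > Team2_Score ~ Team1,
    Team1_Score < Team2_Score ~ Team2,
    TRUE ~ "Tie"
  ))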
Using this, I was able to create a new subset in a new data frame that I titled “nba_winner”.
–> The pipe operator (%>%) allows me to use nba_data as the starting point for this subset.
–> The select() function lets me choose the columns I want to include in my new df and rename them at the same time, while keeping the same data.
–> The mutate() function allows me to add a new column to this new df, which I decided to call “winner” because I am trying to see who the winner is. Inside it, I used the case_when() function to set the conditions for this new column.
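One detail worth knowing about case_when() is that it checks its conditions in order and uses the first one that is TRUE. A comparison with NA is NA rather than TRUE, so a catch-all `TRUE ~ "Tie"` on its own would also pick up games where a score is missing. The sketch below shows this on `toy_games`, a hypothetical data frame with made-up teams and scores, along with one way to keep missing scores as NA.

```r
library(dplyr)

# Hypothetical games: one normal result and one with no score recorded
toy_games <- data.frame(
  Team1 = c("AAA", "CCC"),
  Team2 = c("BBB", "DDD"),
  Team1_Score = c(100, NA),
  Team2_Score = c(95, NA),
  stringsAsFactors = FALSE
)

toy_result <- toy_games %>%
  mutate(
    # Catch-all only: the unscored game falls through to "Tie"
    winner_basic = case_when(
      Team1_Score > Team2_Score ~ Team1,
      Team1_Score < Team2_Score ~ Team2,
      TRUE ~ "Tie"
    ),
    # Missing scores are handled first, so they stay NA
    winner_checked = case_when(
      is.na(Team1_Score) | is.na(Team2_Score) ~ NA_character_,
      Team1_Score > Team2_Score ~ Team1,
      Team1_Score < Team2_Score ~ Team2,
      TRUE ~ "Tie"
    )
  )

toy_result
```

In this toy example, winner_basic labels the unscored game "Tie", while winner_checked leaves it as NA.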
##Step #5: Check new DataFrame
head(nba_winner)
## Date Season Team1 Team2 Team1_Score Team2_Score Game_Quality winner
## 1 1946-11-01 1947 TRH NYK 66 68 0 NYK
## 2 1946-11-02 1947 PRO BOS 59 53 0 PRO
## 3 1946-11-02 1947 STB PIT 56 51 0 STB
## 4 1946-11-02 1947 CHS NYK 63 47 0 CHS
## 5 1946-11-02 1947 DTF WSC 33 50 0 WSC
## 6 1946-11-03 1947 CLR TRH 71 60 0 CLR
##Findings & Recommendations
By creating the new variable “winner” in this assignment, we are able to get a better feel for which teams won which games and against whom. In further analysis, this information could show which teams have won the most, which teams have lost the most, and in which seasons specific teams performed the best. Using more outside sources, we could also see why certain teams performed the way they did during those seasons.
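As a starting point for that follow-up analysis, wins per team can be counted directly from the winner column. The sketch below runs on `sample_winner`, a hypothetical stand-in built from the six rows shown in Step #5. The same calls should apply to nba_winner itself.

```r
library(dplyr)

# Hypothetical stand-in for nba_winner, copied from the six rows in Step #5
sample_winner <- data.frame(
  Season = rep(1947, 6),
  winner = c("NYK", "PRO", "STB", "CHS", "WSC", "CLR"),
  stringsAsFactors = FALSE
)

# Total wins per team, most wins first
wins_by_team <- sample_winner %>%
  count(winner, name = "wins", sort = TRUE)

# Wins per team within each season
wins_by_season <- sample_winner %>%
  count(Season, winner, name = "wins")

wins_by_team
```

With only six games every team has one win, so the ranking only becomes meaningful on the full data frame.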