Process
(Data Cleaning and Manipulation)
After importing the datasets, it’s essential to start by examining
their structure: the variables, their data types, and the number of rows
and columns.
Each dataframe contains 220 rows and 24 columns.
# Preview each dataframe
glimpse(nba_championships.df)
## Rows: 220
## Columns: 24
## $ Year <dbl> 1980, 1980, 1980, 1980, 1980, 1980, 1981, 1981, 1981, 1981, 1981,…
## $ Team <chr> "Lakers", "Lakers", "Lakers", "Lakers", "Lakers", "Lakers", "Celt…
## $ Game <dbl> 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4,…
## $ Win <dbl> 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1,…
## $ Home <dbl> 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,…
## $ MP <dbl> 240, 240, 240, 240, 240, 240, 240, 240, 240, 240, 240, 240, 240, …
## $ FG <dbl> 48, 48, 44, 44, 41, 45, 41, 41, 40, 35, 41, 43, 49, 35, 50, 45, 4…
## $ FGA <dbl> 89, 95, 92, 93, 91, 92, 95, 82, 89, 74, 94, 78, 93, 83, 91, 97, 1…
## $ FGP <dbl> 0.539, 0.505, 0.478, 0.473, 0.451, 0.489, 0.432, 0.500, 0.449, 0.…
## $ TP <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ TPA <dbl> 0, 1, 1, 0, 0, 2, 1, 3, 3, 3, 3, 4, 0, 5, 1, 1, 2, 0, 0, 1, 1, 1,…
## $ TPP <dbl> NA, 0.000, 0.000, NA, NA, 0.000, 0.000, 0.000, 0.667, 0.000, 0.00…
## $ FT <dbl> 13, 8, 23, 14, 26, 33, 16, 8, 12, 16, 27, 15, 26, 24, 28, 21, 8, …
## $ FTA <dbl> 15, 12, 30, 19, 33, 35, 20, 13, 19, 24, 35, 18, 35, 37, 47, 29, 1…
## $ FTP <dbl> 0.867, 0.667, 0.767, 0.737, 0.788, 0.943, 0.800, 0.615, 0.632, 0.…
## $ ORB <dbl> 12, 15, 22, 18, 19, 17, 25, 14, 16, 17, 19, 9, 19, 17, 17, 16, 26…
## $ DRB <dbl> 31, 37, 34, 31, 37, 35, 29, 34, 28, 30, 35, 28, 31, 22, 31, 33, 2…
## $ TRB <dbl> 43, 52, 56, 49, 56, 52, 54, 48, 44, 47, 54, 37, 50, 39, 48, 49, 4…
## $ AST <dbl> 30, 32, 20, 23, 28, 27, 23, 17, 24, 22, 25, 26, 34, 25, 30, 35, 3…
## $ STL <dbl> 5, 12, 5, 12, 7, 14, 6, 6, 12, 5, 5, 6, 11, 11, 15, 10, 5, 12, 11…
## $ BLK <dbl> 9, 7, 5, 6, 6, 4, 5, 7, 6, 6, 8, 0, 7, 6, 5, 4, 9, 11, 13, 6, 2, …
## $ TOV <dbl> 17, 26, 20, 19, 21, 17, 19, 22, 11, 22, 14, 13, 22, 18, 18, 12, 2…
## $ PF <dbl> 24, 27, 25, 22, 27, 22, 21, 27, 25, 22, 23, 21, 26, 21, 30, 21, 2…
## $ PTS <dbl> 109, 104, 111, 102, 108, 123, 98, 90, 94, 86, 109, 102, 124, 94, …
glimpse(nba_runnerUps.df)
## Rows: 220
## Columns: 24
## $ Year <dbl> 1980, 1980, 1980, 1980, 1980, 1980, 1981, 1981, 1981, 1981, 1981,…
## $ Team <chr> "Sixers", "Sixers", "Sixers", "Sixers", "Sixers", "Sixers", "Rock…
## $ Game <dbl> 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4,…
## $ Win <dbl> 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ Home <dbl> 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1,…
## $ MP <dbl> 240, 240, 240, 240, 240, 240, 240, 240, 240, 240, 240, 240, 240, …
## $ FG <dbl> 40, 43, 45, 41, 42, 47, 42, 34, 24, 37, 30, 36, 48, 49, 39, 44, 5…
## $ FGA <dbl> 90, 85, 93, 79, 94, 89, 99, 85, 79, 103, 84, 86, 98, 93, 88, 91, …
## $ FGP <dbl> 0.444, 0.506, 0.484, 0.519, 0.447, 0.528, 0.424, 0.400, 0.304, 0.…
## $ TP <dbl> 0, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0, 0, 3, 0, 1, 0, 0, 0, 0, 2, 0, 1,…
## $ TPA <dbl> 2, 1, 4, 0, 3, 6, 2, 2, 1, 3, 1, 2, 6, 2, 1, 3, 0, 2, 2, 7, 3, 3,…
## $ TPP <dbl> 0.000, 0.000, 0.250, NA, 0.000, 0.000, 0.000, 1.000, 0.000, 0.333…
## $ FT <dbl> 22, 21, 10, 23, 19, 13, 11, 22, 23, 16, 20, 19, 18, 12, 29, 13, 2…
## $ FTA <dbl> 28, 27, 17, 26, 24, 22, 14, 32, 31, 22, 33, 23, 23, 21, 40, 20, 3…
## $ FTP <dbl> 0.786, 0.778, 0.588, 0.885, 0.792, 0.591, 0.786, 0.688, 0.742, 0.…
## $ ORB <dbl> 14, 5, 13, 5, 13, 7, 19, 13, 19, 28, 15, 18, 18, 20, 14, 11, 13, …
## $ DRB <dbl> 26, 29, 24, 29, 29, 29, 23, 22, 29, 21, 26, 23, 23, 32, 29, 29, 2…
## $ TRB <dbl> 40, 34, 37, 34, 42, 36, 42, 35, 48, 49, 41, 41, 41, 52, 43, 40, 3…
## $ AST <dbl> 28, 34, 34, 31, 32, 27, 23, 16, 10, 22, 15, 22, 28, 28, 25, 32, 3…
## $ STL <dbl> 12, 14, 12, 5, 9, 4, 15, 6, 6, 8, 8, 4, 11, 4, 10, 3, 14, 11, 7, …
## $ BLK <dbl> 13, 11, 8, 10, 7, 11, 3, 8, 10, 2, 9, 2, 7, 9, 8, 7, 13, 7, 8, 10…
## $ TOV <dbl> 14, 20, 13, 14, 12, 18, 10, 9, 21, 10, 17, 12, 18, 21, 19, 16, 11…
## $ PF <dbl> 17, 21, 25, 20, 25, 27, 20, 17, 19, 20, 24, 21, 26, 30, 36, 23, 1…
## $ PTS <dbl> 102, 107, 101, 105, 103, 107, 95, 92, 71, 91, 80, 91, 117, 110, 1…
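Since the rest of this stage relies on the two datasets sharing one structure, that assumption can be checked directly rather than by eye. The sketch below uses two tiny stand-in tibbles (`champs_toy` and `runners_toy` are hypothetical names with made-up values); the same three comparisons could be run on the real dataframes.

```r
library(tibble)

# Toy stand-ins for the two datasets (values are illustrative only)
champs_toy  <- tibble(Year = 1980, Team = "Lakers", Game = 1, TPP = NA_real_)
runners_toy <- tibble(Year = 1980, Team = "Sixers", Game = 1, TPP = 0)

# Same column names, in the same order?
same_names <- identical(names(champs_toy), names(runners_toy))

# Same column types?
same_types <- identical(sapply(champs_toy, class), sapply(runners_toy, class))

# Same number of rows and columns?
same_dims <- identical(dim(champs_toy), dim(runners_toy))

c(same_names, same_types, same_dims)
```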
Now that we’ve got a clearer picture of how these datasets are
organized, it’s time to tackle the Data Cleaning and Manipulation phase!
Given the similar structures of both datasets, it’s a good idea to clean
and prepare them separately before merging them later on.
We’ll start by working on the NBA Championship
dataset!
To begin our cleaning stage, we should first locate any NA values
from the NBA Championship dataset.
# Return every row in which at least one variable is NA
nba_championships.df %>% filter_all(any_vars(is.na(.)))
After filtering through the dataset, we discover there are 6 rows with
missing data, all in the Three Point Percentages column (TPP):
- Lakers 1980 Finals Games 1, 4, and 5
- Lakers 1982 Finals Games 1 and 6
- Sixers 1983 Finals Game 1
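A per-column count of missing values is a quick way to confirm that only one column is affected. This is a minimal sketch on a made-up tibble (`toy` is a hypothetical stand-in, not the real data). It assumes a dplyr version that provides `across()` and `if_any()`, which are the current replacements for `filter_all()` and `any_vars()`.

```r
library(dplyr)
library(tibble)

# Toy data: one missing TPP value where no three-pointers were attempted
toy <- tibble(
  Year = c(1980, 1980, 1980),
  Game = c(1, 2, 3),
  TPA  = c(0, 1, 1),
  TPP  = c(NA, 0, 0)
)

# Number of missing values in each column
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Rows with at least one missing value
na_rows <- toy %>%
  filter(if_any(everything(), is.na))

na_counts
na_rows
```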
A visualization makes these gaps easier to see.
# Creating visualization of Three Point Percentages between 1980 - 1983
nba_championships.df %>%
  filter(between(Year, 1980, 1983)) %>%
  ggplot(aes(Game, Team, color = factor(TPP))) +
  geom_point(na.rm = FALSE, size = 3.5) +
  facet_wrap(~Year, ncol = 1) +
  scale_x_continuous(breaks = c(1, 2, 3, 4, 5, 6)) +
  labs(
    x = "Game Number",
    color = "Three Point Percentages",
    title = "Three Point Percentages of NBA Championship Teams"
  )

To fix this, we can use the fill() function from the tidyr package,
which fills each NA with a neighboring non-missing value.
The right replacement value is 0 for all of them, since each of these
rows shows 0 Three Point Attempts (TPA).
With the direction “updown”, fill() first fills upward, so each NA takes
the next non-missing value below it, and then fills downward for any NA
that is still left. In this dataset the next non-missing TPP below every
NA is 0, so all six NA values become 0. Keep in mind that fill() copies a
neighbor rather than picking the closest value, so this works only
because of how the rows happen to be ordered.
Lastly, we save these changes into a new data set named
“nba_championships_cleaned.df” to avoid altering the original
dataset.
nba_championships_cleaned.df <-
  nba_championships.df %>%
  fill(TPP, .direction = "updown")
print(nba_championships_cleaned.df)
## # A tibble: 220 × 24
## Year Team Game Win Home MP FG FGA FGP TP TPA TPP FT
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1980 Lake… 1 1 1 240 48 89 0.539 0 0 0 13
## 2 1980 Lake… 2 0 1 240 48 95 0.505 0 1 0 8
## 3 1980 Lake… 3 1 0 240 44 92 0.478 0 1 0 23
## 4 1980 Lake… 4 0 0 240 44 93 0.473 0 0 0 14
## 5 1980 Lake… 5 1 1 240 41 91 0.451 0 0 0 26
## 6 1980 Lake… 6 1 0 240 45 92 0.489 0 2 0 33
## 7 1981 Celt… 1 1 1 240 41 95 0.432 0 1 0 16
## 8 1981 Celt… 2 0 1 240 41 82 0.5 0 3 0 8
## 9 1981 Celt… 3 1 0 240 40 89 0.449 2 3 0.667 12
## 10 1981 Celt… 4 0 0 240 35 74 0.473 0 3 0 16
## # ℹ 210 more rows
## # ℹ 11 more variables: FTA <dbl>, FTP <dbl>, ORB <dbl>, DRB <dbl>, TRB <dbl>,
## # AST <dbl>, STL <dbl>, BLK <dbl>, TOV <dbl>, PF <dbl>, PTS <dbl>
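To see what fill() actually does with “updown”, it helps to try it on a small made-up column where a neighbor is not 0. The sketch below uses hypothetical toy values and also shows a rule-based alternative (an assumption on my part, not the approach used above) that sets TPP to 0 only where no three-pointers were attempted.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy data: rows 1 and 4 have no attempts, so their percentage is missing
toy <- tibble(
  TPA = c(0, 1, 3, 0, 4),
  TPP = c(NA, 0, 0.667, NA, 0.25)
)

# "updown" fills upward first: each NA takes the next non-missing value below it
filled <- toy %>% fill(TPP, .direction = "updown")
filled$TPP
# values: 0, 0, 0.667, 0.25, 0.25 (row 4 copies 0.25 from row 5)

# Rule-based alternative: use 0 only where TPP is missing and TPA is 0
by_rule <- toy %>%
  mutate(TPP = if_else(is.na(TPP) & TPA == 0, 0, TPP))
by_rule$TPP
# values: 0, 0, 0.667, 0, 0.25
```

The rule-based version does not depend on row order, which makes it a safer choice when a missing value sits next to a nonzero percentage.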
Now that we have a cleaner dataset with the NA values filled in,
the next step in this data cleaning and manipulation stage is to find
any duplications or errors.
If we create a simple visualization of how many times each NBA team
shows up in this dataset, we see that the Warriors and the Heat each
appear under a second, misspelled label: “Warriorrs” and “'Heat'”.
# Visualization of NBA teams
nba_championships.df %>%
  ggplot(aes(Team)) +
  geom_bar() +
  coord_flip()

An easy fix is to apply if_else() to the Team variable: wherever
Warriorrs or 'Heat' is found, it is replaced with the correct NBA team
name (Warriors or Heat).
# Check the Team column and replace Warriorrs with Warriors and 'Heat' with Heat
nba_championships_cleaned.df <-
  nba_championships_cleaned.df %>%
  mutate(Team = if_else(Team == "Warriorrs", "Warriors", Team)) %>%
  mutate(Team = if_else(Team == "'Heat'", "Heat", Team))
# Create same visualization with updated team names
nba_championships_cleaned.df %>%
  ggplot(aes(Team)) +
  geom_bar() +
  coord_flip()
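If more misspellings turn up, chaining one if_else() per name gets long. One alternative is a named lookup vector, sketched here on made-up data. The names `toy` and `team_fixes` are hypothetical, and count() is shown as a table-based way to spot the extra labels.

```r
library(dplyr)
library(tibble)

# Toy data containing the two misspelled labels
toy <- tibble(Team = c("Warriors", "Warriorrs", "Heat", "'Heat'", "Lakers"))

# Tabulate team names; misspellings show up as extra rows with low counts
toy %>% count(Team)

# Lookup vector: misspelled name -> correct name
team_fixes <- c("Warriorrs" = "Warriors", "'Heat'" = "Heat")

# Names missing from the lookup return NA, so coalesce() keeps the original
fixed <- toy %>%
  mutate(Team = coalesce(unname(team_fixes[Team]), Team))

fixed %>% count(Team)
```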

After fixing the duplicate team names, I spent a few hours checking
for any remaining common errors, such as duplicated team points or
games in which both competing teams are recorded as winners or both as
losers.
I found several errors during my search of the Championship dataset:
- 1984 Celtics, Game 1: “PTS” should be changed from 115 to 109 and
“Win” should be set to 0 since they lost.
- 1984 Celtics, Game 2: “Win” should be set to 1 since they won.
- 1996 Bulls, Game 6: “Home” should be set to 1 since they were the
home team.
- 2012 Heat, Game 1: two rows carry this label. For the second one
(row 187), “Year” should be set to 2013.
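Some of these checks can be automated instead of done by hand. The sketch below runs on made-up toy tibbles (all names and values here are hypothetical). It shows one way to find games that appear twice and games where both teams have the same result.

```r
library(dplyr)
library(tibble)

# Toy data: one duplicated game, and one game where both teams "won"
champs_toy <- tibble(
  Year = c(1984, 1984, 2012, 2012),
  Game = c(1, 2, 1, 1),
  Win  = c(1, 0, 1, 1)
)
runners_toy <- tibble(
  Year = c(1984, 1984, 2012),
  Game = c(1, 2, 1),
  Win  = c(1, 1, 0)
)

# Games that appear more than once within one dataset
dup_games <- champs_toy %>%
  count(Year, Game) %>%
  filter(n > 1)

# Games where both teams have the same result (both won or both lost)
conflicts <- champs_toy %>%
  distinct(Year, Game, .keep_all = TRUE) %>%
  inner_join(runners_toy, by = c("Year", "Game"),
             suffix = c("_champ", "_runner")) %>%
  filter(Win_champ == Win_runner)

dup_games
conflicts
```

A check like this would not catch a game where both results are swapped, so a manual comparison against the real box scores is still needed.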
# 1984 Celtics Game 1
nba_championships_cleaned.df[23, 24] <- 109
nba_championships_cleaned.df[23,4] <- 0
# 1984 Celtics Game 2
nba_championships_cleaned.df[24,4] <- 1
# 1996 Bulls Game 6
nba_championships_cleaned.df[97,5] <- 1
# 2012 Heat Game 1
nba_championships_cleaned.df[187, 1] <- 2013
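Row and column numbers are easy to get wrong and silently point elsewhere if the data is ever re-sorted. As an alternative sketch, the same corrections can be written against the game itself (Year, Team, and Game). The tibble below is a made-up stand-in, and any value not named in the list above is a placeholder. This runs while Win and Home are still numeric.

```r
library(dplyr)
library(tibble)

# Toy rows standing in for the three games being corrected
toy <- tibble(
  Year = c(1984, 1984, 1996),
  Team = c("Celtics", "Celtics", "Bulls"),
  Game = c(1, 2, 6),
  Win  = c(1, 0, 1),
  Home = c(1, 1, 0),
  PTS  = c(115, 124, 87)   # 124 and 87 are placeholder values
)

fixed <- toy %>%
  mutate(
    # 1984 Celtics Game 1: points 115 -> 109
    PTS = if_else(Year == 1984 & Team == "Celtics" & Game == 1, 109, PTS),
    # 1984 Celtics Game 1 was a loss, Game 2 was a win
    Win = case_when(
      Year == 1984 & Team == "Celtics" & Game == 1 ~ 0,
      Year == 1984 & Team == "Celtics" & Game == 2 ~ 1,
      TRUE ~ Win
    ),
    # 1996 Bulls Game 6 was a home game
    Home = if_else(Year == 1996 & Team == "Bulls" & Game == 6, 1, Home)
  )

fixed
```

The mislabeled 2012 row can’t be fixed this way, because its Year, Team, and Game match another row; that one still needs a positional fix.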
All NA values and incorrect pieces of information within the dataset
have now been resolved. A useful next step is to convert the variables
represented by binary values into categorical variables. This makes
the data easier to understand and communicate.
The variables that can be changed to categorical variables are
Home and Win since they contain only 1’s and 0’s.
# Converting Home and Win variables into factors (categorical variables)
nba_championships_cleaned.df <-
  nba_championships_cleaned.df %>%
  mutate(Home = as.factor(Home)) %>%
  mutate(Win = as.factor(Win))
nba_championships_cleaned.df %>%
  select(Win, Home)
With the Home and Win variables successfully converted into
categorical variables, it’s best to replace the 0’s and 1’s in each
column with a label that’s more readable.
For the Home column we can replace 0’s with the label “Away Team” and
1’s with “Home Team”.
For the Win column, 0’s can be replaced with “Loss” and 1’s with
“Win”.
From there, let’s display a preview of how the columns look now that
they have gone from numeric values to readable labels.
# Home variable: relabel factor levels "0" and "1" as Away Team and Home Team
home_levels <- c("Away Team", "Home Team")
levels(nba_championships_cleaned.df$Home) <- home_levels
# Win variable: relabel factor levels "0" and "1" as Loss and Win
win_levels <- c("Loss", "Win")
levels(nba_championships_cleaned.df$Win) <- win_levels
nba_championships_cleaned.df %>% head(5)
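Assigning to levels() relies on the existing levels being in the order "0", "1". A slightly more explicit variant, sketched here on a made-up tibble (`games` is a hypothetical name), states the mapping from value to label in one factor() call and starts from the numeric columns.

```r
library(dplyr)
library(tibble)

# Toy data with numeric 0/1 indicators
games <- tibble(
  Home = c(1, 0, 0, 1),
  Win  = c(1, 0, 1, 0)
)

# Convert and label in one step: 0 -> first label, 1 -> second label
labelled <- games %>%
  mutate(
    Home = factor(Home, levels = c(0, 1), labels = c("Away Team", "Home Team")),
    Win  = factor(Win,  levels = c(0, 1), labels = c("Loss", "Win"))
  )

# Cross-tabulate to confirm the labels landed where expected
labelled %>% count(Home, Win)
```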
Alright, we’ve just finished the data cleaning and manipulation stage
for our NBA Championship dataset! However, we have one more dataset to
go: our NBA Runner-Ups dataset. Since we saw earlier that both
datasets are similarly structured, we can carry out the same steps as
for the NBA Championships dataset.
As with the NBA Championship dataset, let’s first check the
runner-ups dataset for any missing values.
nba_runnerUps.df %>% filter_all(any_vars(is.na(.)))
After filtering all rows for any missing values, it appears that the
same column, Three Point Percentages (TPP), has missing values, this
time occurring for:
- Sixers 1980 Finals Game 4
- Sixers 1982 Finals Game 5
- Lakers 1984 Finals Game 4
To better illustrate this, let’s create a visualization.
# Inspect the rows between 1980 and 1984
nba_runnerUps.df %>%
  filter(between(Year, 1980, 1984))
# Visualize Three Point Percentages for the same years
nba_runnerUps.df %>%
  filter(between(Year, 1980, 1984)) %>%
  ggplot(aes(Game, fill = factor(TPP))) +
  geom_bar(position = "fill") +
  facet_wrap(Year ~ Team) +
  scale_x_continuous(breaks = c(1, 2, 3, 4, 5, 6)) +
  labs(
    x = "Game Number",
    fill = "Three Point Percentages",
    title = "Three Point Percentages of NBA Runner Up Teams"
  )

Now that we know where the NA values are located in the NBA runner-ups
dataset, we can replace each of them with a 0, since, as in the NBA
Championships dataset, each of these games has 0 Three Point Attempts
(TPA).
However, unlike last time, we shouldn’t rely on fill(TPP, .direction),
because an NA value here can sit next to an actual nonzero three point
percentage. For example, the 1980 Game 4 NA comes right after a TPP of
0.25, so copying a neighboring value could insert the wrong number.
Instead, we can replace the NA values with 0 directly, since we know
where they are located and that TPP is the only column with missing
values.
# Replace every NA in the data frame with 0 (only TPP contains NAs)
nba_runnerUps_cleaned.df <- nba_runnerUps.df %>% replace(is.na(.), 0)
print(nba_runnerUps_cleaned.df)
## # A tibble: 220 × 24
## Year Team Game Win Home MP FG FGA FGP TP TPA TPP FT
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1980 Sixe… 1 0 0 240 40 90 0.444 0 2 0 22
## 2 1980 Sixe… 2 1 0 240 43 85 0.506 0 1 0 21
## 3 1980 Sixe… 3 0 1 240 45 93 0.484 1 4 0.25 10
## 4 1980 Sixe… 4 1 1 240 41 79 0.519 0 0 0 23
## 5 1980 Sixe… 5 0 0 240 42 94 0.447 0 3 0 19
## 6 1980 Sixe… 6 0 1 240 47 89 0.528 0 6 0 13
## 7 1981 Rock… 1 0 0 240 42 99 0.424 0 2 0 11
## 8 1981 Rock… 2 1 0 240 34 85 0.4 2 2 1 22
## 9 1981 Rock… 3 0 1 240 24 79 0.304 0 1 0 23
## 10 1981 Rock… 4 1 1 240 37 103 0.359 1 3 0.333 16
## # ℹ 210 more rows
## # ℹ 11 more variables: FTA <dbl>, FTP <dbl>, ORB <dbl>, DRB <dbl>, TRB <dbl>,
## # AST <dbl>, STL <dbl>, BLK <dbl>, TOV <dbl>, PF <dbl>, PTS <dbl>
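The blanket replace() above zero-fills every NA in the data frame, which is fine here because TPP is the only column with missing values. If another column ever had missing values, a targeted replacement would be safer. The sketch below uses tidyr’s replace_na() on a made-up tibble; the missing FTP value is a hypothetical addition just to show the difference.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy data: a missing TPP, plus a hypothetical missing FTP
toy <- tibble(
  TPA = c(4, 0, 3),
  TPP = c(0.25, NA, 0),
  FTP = c(0.786, NA, 0.792)
)

# Targeted: only TPP is filled, FTP keeps its NA
tpp_only <- toy %>% replace_na(list(TPP = 0))

# Blanket: every NA in every column becomes 0
all_zeroed <- toy %>% replace(is.na(.), 0)

tpp_only
all_zeroed
```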
Let’s run a simple visualization of every NBA team to make sure that
each team is spelled correctly and there are no duplicates.
nba_runnerUps.df %>%
  ggplot(aes(Team)) +
  geom_bar() +
  coord_flip()

Next, I spent some time looking for the same kinds of errors that I
found in the NBA Championship dataset.
Errors I found:
- 1982 Sixers, Game 2: “Home” should be set to 1 instead of 0 since
they’re the home team.
- 1984 Lakers, Game 1: “Win” should be set to 1 since they won and
“Home” should be set to 0 since they’re the away team.
- 1987 Celtics, Game 3: “MP” should be set to 240 instead of 40.
- 1998 Jazz, Game 5: “Home” should be set to 0 since they’re the
away team.
# 1982 Sixers Game 2
nba_runnerUps_cleaned.df[14,5] <- 1
# 1984 Lakers Game 1
nba_runnerUps_cleaned.df[23,4] <- 1
nba_runnerUps_cleaned.df[23,5] <- 0
# 1987 Celtics Game 3
nba_runnerUps_cleaned.df[44,6] <- 240
# 1998 Jazz Game 5
nba_runnerUps_cleaned.df[108,5] <- 0
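An error like the 40 in “MP” can also be caught with simple range checks instead of a manual search. This is a minimal sketch on made-up rows (all values besides the 40 are placeholders). It assumes a full game is at least 240 team minutes, that made shots can’t exceed attempts, and that percentages lie between 0 and 1.

```r
library(dplyr)
library(tibble)

# Toy rows: Game 3 has an impossible minutes-played value
toy <- tibble(
  Year = c(1987, 1987, 1987),
  Game = c(2, 3, 4),
  MP   = c(240, 40, 240),
  FG   = c(40, 44, 38),
  FGA  = c(90, 88, 85),
  FGP  = c(0.444, 0.5, 0.447)
)

# Flag rows that break any of the basic sanity rules
suspicious <- toy %>%
  filter(
    MP < 240 |             # fewer minutes than a regulation game
      FG > FGA |           # more makes than attempts
      !between(FGP, 0, 1)  # percentage outside [0, 1]
  )

suspicious
```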
Once all missing, duplicate, and incorrect data have been fixed, it’s
time to convert the same variables (Home and Win) into categorical
variables.
# Converting Home and Win variables into factors (categorical variables)
nba_runnerUps_cleaned.df <-
  nba_runnerUps_cleaned.df %>%
  mutate(Home = as.factor(Home)) %>%
  mutate(Win = as.factor(Win))
nba_runnerUps_cleaned.df %>%
  select(Win, Home)
Now that these variables have been converted into categorical
variables, we can apply the same labels to each of them:
- Home: 0’s will be replaced by “Away Team” and 1’s by “Home Team”.
- Win: 0’s will be replaced by “Loss” and 1’s by “Win”.
# Home variable: relabel factor levels "0" and "1" as Away Team and Home Team
runnerUps_home_levels <- c("Away Team", "Home Team")
levels(nba_runnerUps_cleaned.df$Home) <- runnerUps_home_levels
# Win variable: relabel factor levels "0" and "1" as Loss and Win
runnerUps_win_levels <- c("Loss", "Win")
levels(nba_runnerUps_cleaned.df$Win) <- runnerUps_win_levels
nba_runnerUps_cleaned.df %>% head(5)
We have now completed the Data Cleaning and Manipulation Stage!
Up till now our data has been:
- Processed: datasets are read in and saved into our global
environment.
- Cleaned: missing, duplicate, and incorrect information in the
datasets has been fixed.
- Manipulated: variables have been changed into categorical variables,
and the values in those columns replaced with more readable labels.
We can now begin our Analysis stage, where we dive into the
initial questions stated at the beginning of the project and uncover
meaningful insights.