In this document, we will explore the foundation of a typical data analysis workflow in R: importing, cleaning, and visualising data. These are essential skills for any data analyst or researcher, as they allow us to turn raw datasets into meaningful insights.
We’ll begin by learning how to import datasets from
external sources into R. Then, we’ll move on to data
manipulation, where we tidy the data by handling missing
values, selecting key variables, and creating new ones. Finally, we’ll
create our first visualisations using
ggplot2, one of R’s most powerful packages for data
exploration.
Before we can analyse or visualise any data, we first need to
import it into R. In most real-world cases, datasets
are stored externally in CSV files, Excel spreadsheets, or online
databases. CSV files can be imported with the
read.csv() function, which ships with every R installation
and therefore doesn’t require any additional
packages. Other formats, such as Excel spreadsheets, generally need an add-on package.
my_data <- read.csv("data/my_dataset.csv")
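As a minimal, self-contained sketch of what read.csv() does, the example below writes a tiny CSV to a temporary file and reads it back. The file contents and the column names (team, pts) are made up for illustration and are not part of any real dataset.

```r
# Write a small, made-up CSV to a temporary file (illustrative data only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("team,pts", "A,101", "B,95", "C,110"), csv_path)

# Read it back; the header row becomes the column names
demo_data <- read.csv(csv_path)

str(demo_data)   # compact overview of the column types
head(demo_data)  # prints the first rows
```

Running str() straight after an import is a quick way to confirm that numeric columns were read as numbers rather than text.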
However, R also provides access to many pre-made
datasets through packages. In this example, we’ll use the
nbastatR package, which gives access to historical and
current NBA statistics directly from reliable data sources such as
Basketball Reference and the NBA API.
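Before using it, make sure the package is installed. At the time of writing, nbastatR is distributed through GitHub rather than CRAN, so install.packages("nbastatR") may not find it. Installing from the GitHub repository with the remotes package is the usual route:
install.packages("remotes")
remotes::install_github("abresler/nbastatR")
Once the package is installed, load it:
library(nbastatR)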
Most R packages also have a GitHub page or similar
website where you can access additional help, documentation, and
examples.
These resources often include detailed explanations of available
functions, usage examples, and community discussions.
Once the package is loaded, we can use its functions to directly
fetch basketball statistics from official sources.
For example, we can retrieve team-level game logs for a particular season
using the game_logs() function.
# Import team-level game data from the 2024 NBA Regular Season
nba_data <- game_logs(
  seasons = 2024,
  result_types = "team",
  season_types = "Regular Season"
)
# This creates a dataframe called 'nba_data' containing team-level statistics such as points, rebounds, assists, and shooting percentages for each game.
# View the first few rows of the dataset
head(nba_data)
Now you have successfully downloaded your first dataset!
In the Environment tab (usually located in the
top-right corner of RStudio), you should see an object called
nba_data.
On the right-hand side, RStudio will display something like “2460
obs. of 47 variables” (the exact numbers will vary).
This means the import worked: R has recognised the data
as a dataframe and stored it in your session, ready for analysis.
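The same check can be done from the console. The sketch below uses a small stand-in dataframe with made-up values (demo_logs is a hypothetical name) so that it runs without an internet connection. With the real nba_data the calls are the same, although the object may also report tibble classes.

```r
# Stand-in for nba_data with made-up values, so the example runs offline
demo_logs <- data.frame(
  nameTeam = c("Team A", "Team B", "Team A", "Team B"),
  ptsTeam  = c(110, 102, 98, 120)
)

class(demo_logs)  # confirms the object is a data.frame
dim(demo_logs)    # rows, then columns: the counts shown in the Environment tab
str(demo_logs)    # each column's type and first few values
```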
Before we can analyse or visualise any data, it’s important to
clean and prepare it for use.
Raw datasets, even those from packages, often contain unnecessary
variables, missing values, or inconsistent names that make analysis
harder.
In R, we typically use the dplyr package (part of the
tidyverse) to perform data manipulation.
This package provides a set of intuitive “verbs” for transforming data,
such as:
select() – choose specific columns
filter() – keep or remove certain rows
mutate() – create new variables
arrange() – sort your data
summarise() – calculate summary statistics
Before we begin, make sure the package is installed and loaded:
install.packages("dplyr")
library(dplyr)
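To make the five verbs concrete, here is a small sketch on made-up data. The dataframe games and its columns (team, pts, ast) are illustrative assumptions and do not come from nbastatR.

```r
library(dplyr)

# Made-up game rows; column names are illustrative only
games <- data.frame(
  team = c("A", "A", "B", "B"),
  pts  = c(100, 110, 90, 120),
  ast  = c(20, 25, 18, 30)
)

high_scoring <- games %>%
  select(team, pts, ast) %>%           # choose columns
  filter(pts >= 100) %>%               # keep games with at least 100 points
  mutate(pts_per_ast = pts / ast) %>%  # create a new variable
  arrange(desc(pts))                   # sort from highest to lowest points

team_means <- games %>%
  group_by(team) %>%
  summarise(mean_pts = mean(pts))      # collapse to one row per team

high_scoring
team_means
```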
When you first import a dataset like this, it can look overwhelming.
To make the analysis more manageable, it’s helpful to start by
identifying a few key columns that are most relevant to
what we want to explore.
In this example, we’ll focus on team-level performance metrics such as
points, assists, rebounds, steals, blocks, and shooting percentages.
If you ever get lost or forget what the column names are, you can use
the colnames() function to list them all.
colnames(nba_data)
We’re now going to use the select() function from
dplyr to keep only the variables we need.
We’ll also use the pipe operator (%>%),
which lets us link multiple functions together in a clear, readable
way.
In RStudio, you can insert the pipe with the keyboard shortcut Command + Shift
+ M on a Mac, or Ctrl + Shift + M on Windows and Linux.
By combining these tools, we can create a new dataframe that contains only our selected columns.
nba_data_clean <- nba_data %>%
  select(
    nameTeam, slugOpponent,
    fg2mTeam, fg2aTeam, pctFG2Team,
    fg3mTeam, fg3aTeam, pctFG3Team,
    ftmTeam, ftaTeam,
    orebTeam, drebTeam, trebTeam,
    astTeam, stlTeam, blkTeam, tovTeam, pfTeam,
    ptsTeam, plusminusTeam
  )
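To see what the pipe is doing, it can help to compare it with the equivalent nested call. The sketch below uses a made-up dataframe (scores is a hypothetical name) and shows that both forms give the same result.

```r
library(dplyr)

# Made-up data for illustration
scores <- data.frame(
  team  = c("A", "B", "C"),
  pts   = c(95, 120, 101),
  fouls = c(20, 18, 25)
)

# Nested calls: read from the inside out
nested <- arrange(select(scores, team, pts), desc(pts))

# Piped calls: read from top to bottom
piped <- scores %>%
  select(team, pts) %>%
  arrange(desc(pts))

identical(nested, piped)
```

From R 4.1 onwards, base R also offers a native pipe, |>, which works the same way for simple chains like this one.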
Now that we have a cleaner dataframe containing all games played by each team, we can go a step further and summarise the season as a whole.
Each row in our current dataset represents a single game, but what we want is a season-level summary for each team — combining all of their game statistics into total season values.
This allows us to see overall team performance rather than game-by-game data.
To begin, we’ll first sum up the total statistics for each team without any rounding or formatting (just yet).
This allows us to see the raw totals clearly before we tidy the output in the next step.
We’ll use group_by() to group all games by team name,
and summarise() to add up each numeric column we care
about.
The n() function counts how many games each team played,
and sum() totals each volume statistic (like points and
rebounds).
The mean() function is used to calculate average
values, such as a team’s average shooting percentages (for
2-point, 3-point, and free-throw accuracy) across all games in the
season.
nba_team_totals_2024 <- nba_data_clean %>%
  group_by(nameTeam) %>%
  summarise(
    G = n(),
    fg2m_total = sum(fg2mTeam, na.rm = TRUE),
    fg2a_total = sum(fg2aTeam, na.rm = TRUE),
    fg3m_total = sum(fg3mTeam, na.rm = TRUE),
    fg3a_total = sum(fg3aTeam, na.rm = TRUE),
    ftm_total = sum(ftmTeam, na.rm = TRUE),
    fta_total = sum(ftaTeam, na.rm = TRUE),
    oreb_total = sum(orebTeam, na.rm = TRUE),
    dreb_total = sum(drebTeam, na.rm = TRUE),
    treb_total = sum(trebTeam, na.rm = TRUE),
    ast_total = sum(astTeam, na.rm = TRUE),
    stl_total = sum(stlTeam, na.rm = TRUE),
    blk_total = sum(blkTeam, na.rm = TRUE),
    tov_total = sum(tovTeam, na.rm = TRUE),
    pf_total = sum(pfTeam, na.rm = TRUE),
    pts_total = sum(ptsTeam, na.rm = TRUE),
    plusminus_total = sum(plusminusTeam, na.rm = TRUE),
    pctFG2_avg = mean(pctFG2Team, na.rm = TRUE),
    pctFG3_avg = mean(pctFG3Team, na.rm = TRUE),
    pctFT_avg = mean(ftmTeam / ftaTeam, na.rm = TRUE)
  )
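One point worth keeping in mind: the average of per-game percentages is not the same number as the season percentage computed from totals, because games with few attempts count as much as games with many. The sketch below uses two made-up games to show the difference.

```r
# Two made-up games for one team: free throws made and attempted
ftm <- c(1, 18)
fta <- c(2, 20)

avg_rate    <- mean(ftm / fta)      # average of per-game rates: (0.5 + 0.9) / 2
pooled_rate <- sum(ftm) / sum(fta)  # season rate from totals: 19 / 22

avg_rate     # 0.7
pooled_rate  # about 0.864
```

Either figure can be useful, but the pooled rate is the one that matches a conventional season shooting percentage.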
Now that we’ve created a table of total statistics for each team, the next step is to make the output easier to interpret.
We can do this by renaming, rounding, and rearranging some of the columns.
We’ll first rename the columns to make them shorter and easier to
read, then create percentage columns using mutate(), and
finally reorder the table neatly using select() and
arrange().
When summarising, we can name the new columns directly. Names such as
2P or 3P start with a digit, and +/- contains special characters, so R
requires backticks around them. These short labels make
our column names cleaner and easier to interpret.
nba_team_totals_2024 <- nba_data_clean %>%
  group_by(nameTeam) %>%
  summarise(
    G = n(),
    `2P` = sum(fg2mTeam, na.rm = TRUE),
    `2PA` = sum(fg2aTeam, na.rm = TRUE),
    `3P` = sum(fg3mTeam, na.rm = TRUE),
    `3PA` = sum(fg3aTeam, na.rm = TRUE),
    FT = sum(ftmTeam, na.rm = TRUE),
    FTA = sum(ftaTeam, na.rm = TRUE),
    ORB = sum(orebTeam, na.rm = TRUE),
    DRB = sum(drebTeam, na.rm = TRUE),
    TRB = sum(trebTeam, na.rm = TRUE),
    AST = sum(astTeam, na.rm = TRUE),
    STL = sum(stlTeam, na.rm = TRUE),
    BLK = sum(blkTeam, na.rm = TRUE),
    TOV = sum(tovTeam, na.rm = TRUE),
    PF = sum(pfTeam, na.rm = TRUE),
    PTS = sum(ptsTeam, na.rm = TRUE),
    `+/-` = sum(plusminusTeam, na.rm = TRUE)
  )
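Column names like these need backticks not only when they are created but also whenever they are referred to later. The following sketch, using a made-up dataframe (shots is a hypothetical name), shows both cases.

```r
library(dplyr)

# Made-up two-point shooting data for two games
shots <- data.frame(made = c(30, 28), attempts = c(60, 70))

# Backticks let us create names that start with a digit
totals <- shots %>%
  summarise(`2P` = sum(made), `2PA` = sum(attempts))

names(totals)  # "2P" "2PA"

# Backticks are needed again when using $, while [[ ]] takes a quoted string
totals$`2P`
totals[["2P"]] / totals[["2PA"]]
```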
Next, we’ll calculate each team’s shooting accuracy — two-point, three-point, and free-throw percentage — and round them to three decimal places.
nba_team_totals_2024 <- nba_team_totals_2024 %>%
  mutate(
    `2P%` = round(`2P` / `2PA`, 3),
    `3P%` = round(`3P` / `3PA`, 3),
    `FT%` = round(FT / FTA, 3)
  )
Finally, we’ll rename the nameTeam column to
Team, arrange the statistics into a logical order, and sort
the teams alphabetically.
nba_team_totals_2024 <- nba_team_totals_2024 %>%
  select(
    Team = nameTeam,
    G, `2P`, `2PA`, `2P%`,
    `3P`, `3PA`, `3P%`,
    FT, FTA, `FT%`,
    ORB, DRB, TRB, AST, STL, BLK, TOV, PF, PTS, `+/-`
  ) %>%
  arrange(Team)
Now we have a clean season totals table for every team, organised alphabetically and ready for analysis.
This table provides a clear summary of each team’s overall performance, including points, assists, rebounds, and shooting percentages, across the entire 2023/24 NBA Regular Season.
With our data now tidy and structured, we can move on to the final step: visualising the results to uncover patterns and insights.
Now that our dataset is cleaned and summarised, we can begin to
visualise the results.
Visualisations help us quickly identify trends, comparisons, and
patterns that may not be obvious from tables alone.
We’ll start with a basic bar chart using the
ggplot2 package — one of R’s most popular and powerful
tools for data visualisation.
In this example, we’ll look at the total points scored by each NBA team in the 2024 season.
# Load ggplot2 (run install.packages("ggplot2") first if it is not installed)
library(ggplot2)
# Create a bar chart showing total points scored by each team
ggplot(nba_team_totals_2024, aes(x = reorder(Team, PTS), y = PTS)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Total Points Scored by Each NBA Team (2024 Season)",
    x = "Team",
    y = "Total Points"
  ) +
  theme_minimal()
The function ggplot() is used to initialise the graph
and specify the dataset and variables being plotted. Inside the
aes() function, Team is placed on the x-axis and PTS (total
points) on the y-axis. The reorder() function automatically
sorts the bars based on total points, so the highest-scoring teams are
displayed last (and therefore appear at the top once flipped).
The geom_col() function creates the bars themselves,
while the fill argument sets their colour to “steelblue”. The
coord_flip() command swaps the x- and y-axes so the bars run horizontally, making
long team names easier to read.
The labs() function adds the chart title and axis
labels, ensuring the plot is self-explanatory. Finally,
theme_minimal() applies a clean and simple design, removing
unnecessary background elements.
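The effect of reorder() can be checked on its own, without drawing a plot. In this sketch the team labels and point totals are made up; the factor levels show the order in which ggplot2 would place the bars.

```r
# Made-up teams and point totals
team <- c("A", "B", "C")
pts  <- c(9500, 9100, 9800)

# Default factor levels are alphabetical
levels(factor(team))  # "A" "B" "C"

# reorder() sorts the levels by the second argument, lowest first
ordered_team <- reorder(team, pts)
levels(ordered_team)  # "B" "A" "C"
```

To put the highest-scoring team first instead, one option is to reorder by the negated value, for example reorder(team, -pts).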
Overall, this visualisation provides a clear and effective way to compare how many total points each NBA team scored throughout the 2024 season.
Let’s now explore whether teams that score the most points also tend
to commit more turnovers.
Turnovers can be a sign of aggressive offense — teams that handle the
ball more may also make more mistakes.
We’ll focus on the top 10 scoring teams and use a scatter plot to examine this relationship.
# Filter the top 10 teams by total points
top10_teams <- nba_team_totals_2024 %>%
  arrange(desc(PTS)) %>%
  slice_head(n = 10)
Here we create a new dataset that only includes the top 10 highest-scoring teams from the 2024 season.
The arrange(desc(PTS)) function sorts all teams in
descending order based on total points scored, and
slice_head(n = 10) keeps just the first 10 rows — giving us
the top 10 teams.
This smaller dataset, called top10_teams, will be used for our next visualisation.
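As an aside, dplyr also provides slice_max(), which combines the sorting and slicing steps. The sketch below compares the two approaches on made-up data (teams is a hypothetical dataframe). One difference to be aware of is that slice_max() keeps tied rows by default, so it can return more than n rows when totals are equal.

```r
library(dplyr)

# Made-up season totals for four teams
teams <- data.frame(
  Team = c("A", "B", "C", "D"),
  PTS  = c(9000, 9800, 9400, 9100)
)

# Approach used in this tutorial: sort, then keep the first rows
top2_sorted <- teams %>%
  arrange(desc(PTS)) %>%
  slice_head(n = 2)

# Alternative: pick the rows with the largest PTS directly
top2_max <- teams %>%
  slice_max(PTS, n = 2)

top2_sorted$Team  # "B" "C"
top2_max$Team     # "B" "C"
```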
# Create scatter plot of turnovers vs points
ggplot(top10_teams, aes(x = TOV, y = PTS, label = Team)) +
  geom_point(color = "darkorange", size = 3) +
  geom_text(vjust = -0.6, size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed") +
  labs(
    title = "Relationship Between Turnovers and Points (Top 10 Scoring Teams, 2024)",
    x = "Total Turnovers",
    y = "Total Points Scored"
  ) +
  theme_minimal()
This code creates a scatter plot that shows how turnovers relate to
total points scored by the top ten teams. The plot is built using the
ggplot() function again, which sets up the data
(top10_teams) and defines the x- and y-variables with
aes(x = TOV, y = PTS).
The geom_point() layer plots each team as an orange dot,
helping us visualise their position based on turnovers and points.
The geom_text() function adds team names above each
point so we can identify which team each dot represents.
Next, geom_smooth(method = "lm") adds a dashed trend
line that fits a simple linear model, allowing us to see the overall
direction of the relationship between turnovers and points.
The labs() function is used to add a descriptive title
and axis labels, while theme_minimal() applies a clean and
simple design to the chart.
Altogether, these elements work together to create a clear, readable visual that highlights whether teams with more turnovers tend to score more or fewer points.
Now that we’ve created and interpreted two different visualisations, you’ve successfully produced your first data plots in R! These visuals not only make the data easier to understand, but also help uncover important insights that might not be obvious from a table alone.
From our scatter plot, we can observe a downward-sloping dashed trend line, which suggests a negative relationship between turnovers and total points among these ten teams: as turnovers increase, total points tend to decrease. In basketball terms, teams that turn the ball over more often give up scoring opportunities. With only ten teams in the plot, this is best read as a descriptive pattern rather than a firm conclusion.
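The dashed line drawn by geom_smooth(method = "lm") is a simple linear regression, so its slope can also be read off numerically. The sketch below uses made-up totals for five teams (not real NBA figures) to show how lm() and cor() describe the direction and strength of such a relationship. With the real data, demo would be replaced by top10_teams.

```r
# Made-up totals for five teams (not real NBA figures)
demo <- data.frame(
  TOV = c(1000, 1050, 1100, 1150, 1200),
  PTS = c(9600, 9500, 9450, 9300, 9250)
)

# Fit the same kind of model that geom_smooth(method = "lm") draws
fit <- lm(PTS ~ TOV, data = demo)

slope <- unname(coef(fit)["TOV"])  # change in points per extra turnover
r     <- cor(demo$TOV, demo$PTS)   # direction and strength of the linear relationship

slope  # negative for this made-up data
r      # also negative, and close to -1 here
```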
Overall, by visualising the data in different ways, we can begin to interpret key patterns, identify relationships, and draw meaningful conclusions.
In this document, we’ve walked through the full process of importing, cleaning, and visualising data in R.
Starting with raw game logs from the nbastatR package,
we used dplyr to clean and summarise the data into
meaningful team statistics, and then used ggplot2 to create
clear, insightful visualisations.
Through this workflow, we turned raw data into interpretable insights and have learnt to import a dataset, clean and summarise it with dplyr, and visualise the results with ggplot2.
Our visualisations demonstrated how effective data exploration can reveal important findings.
Overall, this exercise demonstrates how R can be used as a powerful tool for data analysis.