1 Introduction

In this document, we will explore the foundation of a typical data analysis workflow in R: importing, cleaning, and visualising data. These are essential skills for any data analyst or researcher, as they allow us to turn raw datasets into meaningful insights.

We’ll begin by learning how to import datasets from external sources into R. Then, we’ll move on to data manipulation, where we tidy the data by handling missing values, selecting key variables, and creating new ones. Finally, we’ll create our first visualisations using ggplot2, one of R’s most powerful packages for data exploration.

2 Importing Data

Before we can analyse or visualise any data, we first need to import it into R. In most real-world cases, datasets are stored externally in CSV files, Excel spreadsheets, or online databases. CSV files can be imported with the read.csv() function from base R, which means it’s built into R by default and doesn’t require any additional packages; other formats, such as Excel spreadsheets, generally need an additional package.

my_data <- read.csv("data/my_dataset.csv")
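To try read.csv() without a file of your own, the following minimal sketch writes a small table to a temporary CSV file and reads it back. The column names and values are made up for illustration; with a real dataset you would pass its file path instead.

```r
# Write a tiny, made-up table to a temporary CSV file
tmp_file <- tempfile(fileext = ".csv")
example <- data.frame(
  team = c("A", "B", "C"),
  points = c(101, 95, 110)
)
write.csv(example, tmp_file, row.names = FALSE)

# Read it back in, exactly as we would with a real file path
my_data <- read.csv(tmp_file)

# Check the structure: column names, types, and first values
str(my_data)
```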

However, R also provides access to many pre-made datasets through packages. In this example, we’ll use the nbastatR package, which gives access to historical and current NBA statistics directly from reliable data sources such as Basketball Reference and the NBA API.

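Before using it, make sure the package is installed. At the time of writing, nbastatR is distributed through GitHub (abresler/nbastatR) rather than CRAN, so install.packages("nbastatR") on its own may fail. The remotes package can install it instead:

install.packages("remotes")
remotes::install_github("abresler/nbastatR")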

After the package is installed, we can now load it.

library(nbastatR)

Most R packages also have a GitHub page or similar website where you can access additional help, documentation, and examples.
These resources often include detailed explanations of available functions, usage examples, and community discussions.

Once the package is loaded, we can use its functions to directly fetch basketball statistics from official sources.
For example, we can retrieve game logs for a particular season using the game_logs() function; the call below requests team-level results.

# Import team-level game data from the 2024 NBA Regular Season
nba_data <- game_logs(
  seasons = 2024,
  result_types = "team",
  season_types = "Regular Season"
)

# This creates a dataframe called 'nba_data' containing team-level statistics such as points, rebounds, assists, and shooting percentages for each game.

# View the first few rows of the dataset
head(nba_data)

Now you have successfully downloaded your first dataset!
In the Environment tab (usually located in the top-right corner of RStudio), you should see an object called nba_data.

On the right-hand side, RStudio will display something like “2460 obs. of 47 variables” (the exact numbers will vary).
This means you have successfully imported the data: R has recognised it as a dataframe and stored it in your session, ready for analysis.
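The information shown in the Environment tab can also be checked from the console. The sketch below uses a small made-up dataframe (toy_games, a stand-in for nba_data) so that it runs without downloading anything.

```r
# Made-up stand-in for nba_data, for illustration only
toy_games <- data.frame(
  nameTeam = c("Team A", "Team B", "Team A", "Team B"),
  ptsTeam  = c(110, 102, 98, 121)
)

class(toy_games)  # the type of object R has stored
dim(toy_games)    # number of rows (observations) and columns (variables)
str(toy_games)    # compact overview of each column's type and first values
```

On the real data, the same calls would be dim(nba_data) and str(nba_data).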

3 Cleaning & Manipulating Data

Before we can analyse or visualise any data, it’s important to clean and prepare it for use.
Raw datasets, even those from packages, often contain unnecessary variables, missing values, or inconsistent names that make analysis harder.

In R, we typically use the dplyr package (part of the tidyverse) to perform data manipulation.
This package provides a set of intuitive “verbs” for transforming data, such as:

select() – choose specific columns
filter() – keep or remove certain rows
mutate() – create new variables
arrange() – sort your data
summarise() – calculate summary statistics

Before we begin, make sure the package is installed and loaded:

install.packages("dplyr")

library(dplyr)
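With dplyr loaded, the verbs listed above can be tried on a small example before applying them to the NBA data. This is only a sketch: the toy_games dataframe and its values are made up, and group_by() (introduced later in this document) is used so that summarise() returns one row per team.

```r
library(dplyr)

# Made-up game data, for illustration only
toy_games <- data.frame(
  team = c("A", "A", "B", "B"),
  pts  = c(110, 98, 102, 121),
  ast  = c(25, 20, 22, 30)
)

toy_summary <- toy_games %>%
  select(team, pts, ast) %>%             # choose specific columns
  filter(pts >= 100) %>%                 # keep games with 100 or more points
  mutate(pts_per_ast = pts / ast) %>%    # create a new variable
  group_by(team) %>%                     # treat each team separately
  summarise(
    games = n(),                         # number of games kept per team
    avg_pts = mean(pts)                  # average points in those games
  ) %>%
  arrange(desc(avg_pts))                 # sort from highest to lowest average

toy_summary
```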

When you first import a dataset like this, it can look overwhelming. To make the analysis more manageable, it’s helpful to start by identifying a few key columns that are most relevant to what we want to explore.
In this example, we’ll focus on team-level performance metrics such as points, assists, rebounds, steals, blocks, and shooting percentages.

If you ever get lost or forget what the column names are, you can use the colnames() function to list them all.

colnames(nba_data)

We’re now going to use the select() function from dplyr to keep only the variables we need.
We’ll also use the pipe operator (%>%), which lets us link multiple functions together in a clear, readable way.
In RStudio, you can insert the pipe with the shortcut Command + Shift + M on a Mac, or Ctrl + Shift + M on Windows and Linux.
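To see what the pipe does, the short sketch below compares nested function calls with the equivalent piped version. The scores vector is made up for illustration.

```r
library(dplyr)  # dplyr makes the %>% pipe available

scores <- c(110, 98, 102, 121)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(scores)), 1)

# The pipe passes the result on its left as the first argument of the next call
piped <- scores %>% sqrt() %>% mean() %>% round(1)

# Both approaches give the same result
identical(nested, piped)
```

R 4.1 and later also include a native pipe, |>, which behaves similarly in simple chains like this one.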

By combining these tools, we can create a new dataframe that contains only our selected columns.

nba_data_clean <- nba_data %>%
  select(
    nameTeam, slugOpponent,
    fg2mTeam, fg2aTeam, pctFG2Team,
    fg3mTeam, fg3aTeam, pctFG3Team,
    ftmTeam, ftaTeam,
    orebTeam, drebTeam, trebTeam,
    astTeam, stlTeam, blkTeam, tovTeam, pfTeam,
    ptsTeam, plusminusTeam
  )
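Before summarising, it can be worth checking how many missing values each selected column contains. The following is a minimal sketch on made-up data (toy_clean and its values are invented); on the real data you would pass nba_data_clean instead.

```r
# Made-up stand-in for nba_data_clean, with a few missing values
toy_clean <- data.frame(
  nameTeam = c("A", "A", "B"),
  ptsTeam  = c(110, NA, 102),
  astTeam  = c(25, 20, NA)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy_clean))
na_counts

# Keep only the rows that have no missing values at all
complete_rows <- toy_clean[complete.cases(toy_clean), ]
nrow(complete_rows)
```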

Now that we have a cleaner dataframe containing all games played by each team, we can go a step further and summarise the season as a whole.

Each row in our current dataset represents a single game, but what we want is a season-level summary for each team — combining all of their game statistics into total season values.

This allows us to see overall team performance rather than game-by-game data.

To begin, we’ll first sum up the total statistics for each team without any rounding or formatting (just yet).

This allows us to see the raw totals clearly before we tidy the output in the next step.

We’ll use group_by() to group all games by team name, and summarise() to add up each numeric column we care about.
The n() function counts how many games each team played, and sum() totals each volume statistic (like points and rebounds).
The mean() function is used to calculate average values, such as a team’s average shooting percentages (for 2-point, 3-point, and free-throw accuracy) across all games in the season.

nba_team_totals_2024 <- nba_data_clean %>%
  group_by(nameTeam) %>%
  summarise(
    G = n(),
    fg2m_total = sum(fg2mTeam, na.rm = TRUE),
    fg2a_total = sum(fg2aTeam, na.rm = TRUE),
    fg3m_total = sum(fg3mTeam, na.rm = TRUE),
    fg3a_total = sum(fg3aTeam, na.rm = TRUE),
    ftm_total = sum(ftmTeam, na.rm = TRUE),
    fta_total = sum(ftaTeam, na.rm = TRUE),
    oreb_total = sum(orebTeam, na.rm = TRUE),
    dreb_total = sum(drebTeam, na.rm = TRUE),
    treb_total = sum(trebTeam, na.rm = TRUE),
    ast_total = sum(astTeam, na.rm = TRUE),
    stl_total = sum(stlTeam, na.rm = TRUE),
    blk_total = sum(blkTeam, na.rm = TRUE),
    tov_total = sum(tovTeam, na.rm = TRUE),
    pf_total = sum(pfTeam, na.rm = TRUE),
    pts_total = sum(ptsTeam, na.rm = TRUE),
    plusminus_total = sum(plusminusTeam, na.rm = TRUE),
    pctFG2_avg = mean(pctFG2Team, na.rm = TRUE),
    pctFG3_avg = mean(pctFG3Team, na.rm = TRUE),
    pctFT_avg = mean(ftmTeam / ftaTeam, na.rm = TRUE)
  )
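One detail worth noting is that the average of per-game percentages is not the same as a season percentage calculated from totals. The sketch below uses two made-up games to show the difference, and also shows what na.rm = TRUE does. The numbers are invented for illustration.

```r
# Two made-up games for one team: free throws made and attempted
ftm <- c(1, 18)
fta <- c(2, 20)

# Average of the per-game percentages: each game counts equally
per_game_avg <- mean(ftm / fta)        # (0.5 + 0.9) / 2

# Season percentage from totals: each attempt counts equally
from_totals <- sum(ftm) / sum(fta)     # 19 / 22

per_game_avg
from_totals

# na.rm = TRUE drops missing values before the calculation
total_with_na <- sum(c(10, NA, 5), na.rm = TRUE)
total_with_na
```

Step 2 below calculates the shooting percentages from season totals, which is the more common way to report them.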

Now that we’ve created a table of total statistics for each team, the next step is to make the output easier to interpret.

We can do this by renaming, rounding, and rearranging some of the columns.

We’ll first rename the columns to make them shorter and easier to read, then create percentage columns using mutate(), and finally reorder the table neatly using select() and arrange().

3.0.1 Step 1: Rename columns inside summarise

When summarising, we can name the new columns directly. Names such as 2P or 3P start with a digit, and +/- contains symbols, so R requires backticks around them. These shorter names match common box-score abbreviations and make the table easier to read.

nba_team_totals_2024 <- nba_data_clean %>%
  group_by(nameTeam) %>%
  summarise(
    G = n(),
    `2P` = sum(fg2mTeam, na.rm = TRUE),
    `2PA` = sum(fg2aTeam, na.rm = TRUE),
    `3P` = sum(fg3mTeam, na.rm = TRUE),
    `3PA` = sum(fg3aTeam, na.rm = TRUE),
    FT = sum(ftmTeam, na.rm = TRUE),
    FTA = sum(ftaTeam, na.rm = TRUE),
    ORB = sum(orebTeam, na.rm = TRUE),
    DRB = sum(drebTeam, na.rm = TRUE),
    TRB = sum(trebTeam, na.rm = TRUE),
    AST = sum(astTeam, na.rm = TRUE),
    STL = sum(stlTeam, na.rm = TRUE),
    BLK = sum(blkTeam, na.rm = TRUE),
    TOV = sum(tovTeam, na.rm = TRUE),
    PF = sum(pfTeam, na.rm = TRUE),
    PTS = sum(ptsTeam, na.rm = TRUE),
    `+/-` = sum(plusminusTeam, na.rm = TRUE)
  )
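Backticks are also needed whenever these columns are referred to later. The following base R sketch, using made-up totals, shows how such names behave; check.names = FALSE is set so that data.frame() keeps the names exactly as written.

```r
# check.names = FALSE stops data.frame() from altering names such as 2P
toy_totals <- data.frame(
  Team = c("A", "B"),
  `2P` = c(2500, 2400),
  `2PA` = c(4800, 4700),
  check.names = FALSE
)

# The names are stored without backticks
names(toy_totals)

# Backticks are required when referring to them in code
two_point_pct <- toy_totals$`2P` / toy_totals$`2PA`
two_point_pct
```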

3.0.2 Step 2: Create shooting percentage columns

Next, we’ll calculate each team’s shooting accuracy — two-point, three-point, and free-throw percentage — and round them to three decimal places.

nba_team_totals_2024 <- nba_team_totals_2024 %>%
  mutate(
    `2P%` = round(`2P` / `2PA`, 3),
    `3P%` = round(`3P` / `3PA`, 3),
    `FT%` = round(FT / FTA, 3)
  )

3.0.3 Step 3: Reorder and tidy the columns

Finally, we’ll rename the nameTeam column to Team, arrange the statistics into a logical order, and sort the teams alphabetically.

nba_team_totals_2024 <- nba_team_totals_2024 %>%
  select(
    Team = nameTeam,
    G, `2P`, `2PA`, `2P%`,
    `3P`, `3PA`, `3P%`,
    FT, FTA, `FT%`,
    ORB, DRB, TRB, AST, STL, BLK, TOV, PF, PTS, `+/-`
  ) %>%
  arrange(Team)
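The new_name = old_name form inside select() renames a column while selecting it. A small sketch with made-up values shows the effect in isolation:

```r
library(dplyr)

# Made-up totals, deliberately out of alphabetical order
toy_totals <- data.frame(
  nameTeam = c("B", "A"),
  PTS = c(9000, 9500),
  G = c(82, 82)
)

tidy <- toy_totals %>%
  select(Team = nameTeam, G, PTS) %>%  # rename and reorder in one step
  arrange(Team)                        # sort alphabetically by team

tidy
```

If you only want to rename a column without dropping or reordering the others, dplyr's rename() function uses the same new_name = old_name form.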

Now we have a clean season totals table for every team, organised alphabetically and ready for analysis.

This table provides a clear summary of each team’s overall performance (including points, assists, rebounds, and shooting percentages) across the entire 2023/24 NBA Regular Season.

With our data now tidy and structured, we can move on to the final step: visualising the results to uncover patterns and insights.

4 Visualising Data

Now that our dataset is cleaned and summarised, we can begin to visualise the results.
Visualisations help us quickly identify trends, comparisons, and patterns that may not be obvious from tables alone.

We’ll start with a basic bar chart using the ggplot2 package — one of R’s most popular and powerful tools for data visualisation.

In this example, we’ll look at the total points scored by each NBA team in the 2024 season.

# Load ggplot2 (run install.packages("ggplot2") first if it is not installed)
library(ggplot2)

# Create a bar chart showing total points scored by each team
ggplot(nba_team_totals_2024, aes(x = reorder(Team, PTS), y = PTS)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Total Points Scored by Each NBA Team (2024 Season)",
    x = "Team",
    y = "Total Points"
  ) +
  theme_minimal()

The function ggplot() is used to initialise the graph and specify the dataset and variables being plotted. Inside the aes() function, Team is placed on the x-axis and PTS (total points) on the y-axis. The reorder() function automatically sorts the bars based on total points, so the highest-scoring teams are displayed last (and therefore appear at the top once flipped).

The geom_col() function creates the bars themselves, while the fill argument sets their colour to “steelblue”. The coord_flip() command swaps the x- and y-axes so that the bars run horizontally, making long team names easier to read.

The labs() function adds the chart title and axis labels, ensuring the plot is self-explanatory. Finally, theme_minimal() applies a clean and simple design, removing unnecessary background elements.

Overall, this visualisation provides a clear and effective way to compare how many total points each NBA team scored throughout the 2024 season.
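To see how reorder() controls the order of the bars, it helps to look at what it returns on its own. In this sketch the team labels and point totals are made up; reorder() is part of base R's stats package, so no extra packages are needed.

```r
# Made-up teams and point totals
teams  <- c("A", "B", "C")
points <- c(9500, 8800, 9100)

# reorder() returns a factor whose levels are sorted by the second argument,
# from lowest to highest
ordered_teams <- reorder(teams, points)
levels(ordered_teams)

# Negating the values reverses the order (highest first)
reversed_teams <- reorder(teams, -points)
levels(reversed_teams)
```

ggplot2 draws categories in the order of their factor levels, which is why the lowest-scoring team is placed first and the highest-scoring team last.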

4.1 Scatter Plot – Turnovers vs Points (Top 10 Teams)

Let’s now explore whether teams that score the most points also tend to commit more turnovers.
Turnovers can be a sign of aggressive offence: teams that handle the ball more may also make more mistakes.

We’ll focus on the top 10 scoring teams and use a scatter plot to examine this relationship.

# Filter the top 10 teams by total points
top10_teams <- nba_team_totals_2024 %>%
  arrange(desc(PTS)) %>%
  slice_head(n = 10)

Here we create a new dataset that only includes the top 10 highest-scoring teams from the 2024 season.

The arrange(desc(PTS)) function sorts all teams in descending order based on total points scored, and slice_head(n = 10) keeps just the first 10 rows — giving us the top 10 teams.
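As an aside, dplyr also provides slice_max(), which selects the rows with the largest values directly. The sketch below compares the two approaches on made-up totals (four teams, keeping the top two).

```r
library(dplyr)

# Made-up season totals
toy_totals <- data.frame(
  Team = c("A", "B", "C", "D"),
  PTS  = c(9500, 8800, 9100, 9300)
)

# Approach used in the text: sort, then keep the first rows
top2_sorted <- toy_totals %>%
  arrange(desc(PTS)) %>%
  slice_head(n = 2)

# Alternative: pick the rows with the largest PTS values
top2_max <- toy_totals %>%
  slice_max(PTS, n = 2)

top2_sorted$Team
top2_max$Team
```

One difference to be aware of: by default, slice_max() keeps tied rows, so it can return more than n rows when teams share the same total.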

This smaller dataset, called top10_teams, will be used for our next visualisation.

# Create scatter plot of turnovers vs points
ggplot(top10_teams, aes(x = TOV, y = PTS, label = Team)) +
  geom_point(color = "darkorange", size = 3) +
  geom_text(vjust = -0.6, size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed") +
  labs(
    title = "Relationship Between Turnovers and Points (Top 10 Scoring Teams, 2024)",
    x = "Total Turnovers",
    y = "Total Points Scored"
  ) +
  theme_minimal()

This code creates a scatter plot that shows how turnovers relate to total points scored by the top ten teams. The plot is built using the ggplot() function again, which sets up the data (top10_teams) and defines the x- and y-variables with aes(x = TOV, y = PTS).

The geom_point() layer plots each team as an orange dot, helping us visualise their position based on turnovers and points.

The geom_text() function adds team names above each point so we can identify which team each dot represents.

Next, geom_smooth(method = "lm") adds a dashed trend line that fits a simple linear model, allowing us to see the overall direction of the relationship between turnovers and points.

The labs() function is used to add a descriptive title and axis labels, while theme_minimal() applies a clean and simple design to the chart.

Altogether, these elements work together to create a clear, readable visual that highlights whether teams with more turnovers tend to score more or fewer points.

4.2 Interpretations

Now that we’ve created two different visualisations, we’ve successfully produced our first data plots in R! These visuals not only make the data easier to understand, but also help uncover important insights that might not be obvious from a table alone.

From our scatter plot, we can observe a downward-sloping dashed trend line, which indicates a negative relationship between turnovers and total points among these ten teams. In other words, as turnovers increase, total points tend to decrease. In basketball terms, this is consistent with the idea that teams that turn the ball over more often lose scoring opportunities, although ten data points show an association rather than proof of cause.
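The direction of a trend line can also be checked numerically. The following sketch uses made-up totals for five teams (not the real NBA figures) to show how a correlation coefficient and the slope of a simple linear model can be calculated in base R.

```r
# Made-up season totals for five teams (illustration only)
tov <- c(1000, 1050, 1100, 1150, 1200)
pts <- c(9800, 9700, 9650, 9500, 9450)

# Correlation coefficient: negative values indicate a downward trend
r <- cor(tov, pts)

# Slope of a straight line fitted by least squares,
# the same kind of line that geom_smooth(method = "lm") draws
fit <- lm(pts ~ tov)
slope <- coef(fit)[["tov"]]

r
slope
```

On the real data, the equivalent call would be cor(top10_teams$TOV, top10_teams$PTS).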

Overall, by visualising the data in different ways, we can begin to interpret key patterns, identify relationships, and draw meaningful conclusions.

5 Conclusion

In this document, we’ve walked through the full process of importing, cleaning, and visualising data in R.

Starting with raw game logs from the nbastatR package, we used dplyr to clean and summarise the data into meaningful team statistics, and then used ggplot2 to create clear, insightful visualisations.

Through this workflow, we turned raw data into interpretable insights and have learnt to:

Import data from an R package
Select and summarise key variables
Create visualisations that highlight patterns and relationships

Our visualisations demonstrated how data exploration can reveal patterns, such as the negative relationship between turnovers and points among the top 10 scoring teams.

Overall, this exercise demonstrates how R can be used as a powerful tool for data analysis.

Importing, Manipulating & Visualising Data

Nabil Menai

2025-10-31