Beachhead Assignment- Code

Author

Ciara Bonnett

Best Selling Albums

This assignment is meant to through and over a public dataset that is available on Kaggle about the best-selling music albums of all time. The data set uses sales, artist, and release year to help to explore trends.

Dataset source: Kaggle – Best-Selling Albums

Approach:

I will begin by loading a dataset that I have accessible in a GitHub repository. I plan on researching the data and creating subsets that is relatable to the CSV such as album title, artist, release year, and sales. The variables can and will be renamed to improve the clarity and consistency. Once cleaned and organized I should be able to explore basic analysis to better understand the patterns in sales across artists and time periods. One anticipated challenge is ensuring that numeric variables such as sales figures are consistently formatted and free of non-numeric characters. The dataset could also have missing values that will require some cleaning.

# Load tidyverse (includes readr, dplyr, ggplot2)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Read the CSV directly from GitHub
albums_raw <- read_csv(
  "https://raw.githubusercontent.com/CiaraBonn12/Data607-First-Assignment/main/best_selling_albums.csv"
)
Rows: 49 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Album, Artist, Country, Genre
dbl (3): Year, Sales_Millions, Decade

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Check the first few rows
head(albums_raw)
# A tibble: 6 × 7
  Album                         Artist  Year Sales_Millions Decade Country Genre
  <chr>                         <chr>  <dbl>          <dbl>  <dbl> <chr>   <chr>
1 Thriller                      Micha…  1982             70   1980 USA     Pop  
2 Back in Black                 AC/DC   1980             50   1980 Austra… Rock 
3 The Dark Side of the Moon     Pink …  1973             45   1970 UK      Rock 
4 The Bodyguard                 Whitn…  1992             45   1990 USA     Pop  
5 Bat Out of Hell               Meat …  1977             43   1970 USA     Rock 
6 Their Greatest Hits (1971–19… Eagles  1976             42   1970 USA     Rock 
# 🔹 ADD THIS LINE TO SEE EXACT COLUMN NAMES
colnames(albums_raw)
[1] "Album"          "Artist"         "Year"           "Sales_Millions"
[5] "Decade"         "Country"        "Genre"         

Data Acquisition https://raw.githubusercontent.com/CiaraBonn12/Data607-First-Assignment/main/best_selling_albums.csv

In this section, the steps for obtaining the raw data are discussed. For the analysis to be reproducible, it is important to show that the data source is publicly available, eliminating the need for local data access.

Data Preparation and Transformation

In this section I clean the dataset by ensuring sales and year are numeric and removing rows with missing values in those fields. I then create a smaller dataset (top_albums) containing the 10 highest-selling albums, which makes the visualization and interpretation easier.

# Clean and prepare the dataset

albums <- albums_raw %>%
  mutate(
    Sales_Millions = as.numeric(Sales_Millions),
    Year = as.integer(Year)
  ) %>%
  drop_na(Sales_Millions, Year)

# Select top 10 albums by sales
top_albums <- albums %>%
  slice_max(Sales_Millions, n = 10)

# View result
top_albums
# A tibble: 10 × 7
   Album                        Artist  Year Sales_Millions Decade Country Genre
   <chr>                        <chr>  <int>          <dbl>  <dbl> <chr>   <chr>
 1 Thriller                     Micha…  1982             70   1980 USA     Pop  
 2 Back in Black                AC/DC   1980             50   1980 Austra… Rock 
 3 The Dark Side of the Moon    Pink …  1973             45   1970 UK      Rock 
 4 The Bodyguard                Whitn…  1992             45   1990 USA     Pop  
 5 Bat Out of Hell              Meat …  1977             43   1970 USA     Rock 
 6 Their Greatest Hits (1971–1… Eagles  1976             42   1970 USA     Rock 
 7 Hotel California             Eagles  1976             42   1970 USA     Rock 
 8 Come On Over                 Shani…  1997             40   1990 Canada  Coun…
 9 Rumours                      Fleet…  1977             40   1970 UK      Rock 
10 Saturday Night Fever         Bee G…  1977             40   1970 UK      Disco

Visualization

The visualization shows a small set of albums that dominate total sales compared to the rest of the dataset. This type of “top-heavy” distribution suggests that a handful of releases account for a large share of total sales, which may reflect cultural reach, marketing, and the size of the music market in different decades.

ggplot(top_albums, aes(x = reorder(Album, Sales_Millions), y = Sales_Millions)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 10 Best-Selling Albums",
    x = "Album",
    y = "Sales (Millions)"
  )

Additional Summary Analysis

This summary provides additional context by examining average album sales across decades, which helps identify broader trends beyond the top-selling albums.

albums %>%
  group_by(Decade) %>%
  summarize(
    avg_sales = mean(Sales_Millions, na.rm = TRUE),
    album_count = n()
  ) %>%
  arrange(desc(avg_sales))
# A tibble: 7 × 3
  Decade avg_sales album_count
   <dbl>     <dbl>       <int>
1   1970      34.7          12
2   1980      33.2          13
3   1990      31.1           9
4   1960      29             2
5   2010      25.2           4
6   2000      25             8
7   1950      20             1

Conclusions and Recommendations

This dataset was successfully acquired from a public source (Kaggle) and made reproducible by loading it through a GitHub raw URL. After converting key variables to numeric and removing missing values, I was able to identify the top-selling albums and visualize the results.

To extend this work, I could compare sales across decades using the Decade column, examine whether certain genres dominate high sales, or verify/update the sales values against another reference source to ensure accuracy.

Use of Generative AI

Generative AI tools were used to assist with debugging R and Quarto syntax, understanding error messages, and refining the overall structure of the analysis. All code and analysis were reviewed, understood, and executed by the author.

OpenAI. (2026). ChatGPT (Version 5.2) [Large language model]. https://chat.openai.com. Accessed February 2026.