This assignment is meant to through and over a public dataset that is available on Kaggle about the best-selling music albums of all time. The data set uses sales, artist, and release year to help to explore trends.
I will begin by loading a dataset that I have accessible in a GitHub repository. I plan on researching the data and creating subsets that is relatable to the CSV such as album title, artist, release year, and sales. The variables can and will be renamed to improve the clarity and consistency. Once cleaned and organized I should be able to explore basic analysis to better understand the patterns in sales across artists and time periods. One anticipated challenge is ensuring that numeric variables such as sales figures are consistently formatted and free of non-numeric characters. The dataset could also have missing values that will require some cleaning.
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Read the CSV directly from GitHubalbums_raw <-read_csv("https://raw.githubusercontent.com/CiaraBonn12/Data607-First-Assignment/main/best_selling_albums.csv")
Rows: 49 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Album, Artist, Country, Genre
dbl (3): Year, Sales_Millions, Decade
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Check the first few rowshead(albums_raw)
# A tibble: 6 × 7
Album Artist Year Sales_Millions Decade Country Genre
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
1 Thriller Micha… 1982 70 1980 USA Pop
2 Back in Black AC/DC 1980 50 1980 Austra… Rock
3 The Dark Side of the Moon Pink … 1973 45 1970 UK Rock
4 The Bodyguard Whitn… 1992 45 1990 USA Pop
5 Bat Out of Hell Meat … 1977 43 1970 USA Rock
6 Their Greatest Hits (1971–19… Eagles 1976 42 1970 USA Rock
# 🔹 ADD THIS LINE TO SEE EXACT COLUMN NAMEScolnames(albums_raw)
Data Acquisition https://raw.githubusercontent.com/CiaraBonn12/Data607-First-Assignment/main/best_selling_albums.csv
In this section, the steps for obtaining the raw data are discussed. For the analysis to be reproducible, it is important to show that the data source is publicly available, eliminating the need for local data access.
Data Preparation and Transformation
In this section I clean the dataset by ensuring sales and year are numeric and removing rows with missing values in those fields. I then create a smaller dataset (top_albums) containing the 10 highest-selling albums, which makes the visualization and interpretation easier.
# Clean and prepare the datasetalbums <- albums_raw %>%mutate(Sales_Millions =as.numeric(Sales_Millions),Year =as.integer(Year) ) %>%drop_na(Sales_Millions, Year)# Select top 10 albums by salestop_albums <- albums %>%slice_max(Sales_Millions, n =10)# View resulttop_albums
# A tibble: 10 × 7
Album Artist Year Sales_Millions Decade Country Genre
<chr> <chr> <int> <dbl> <dbl> <chr> <chr>
1 Thriller Micha… 1982 70 1980 USA Pop
2 Back in Black AC/DC 1980 50 1980 Austra… Rock
3 The Dark Side of the Moon Pink … 1973 45 1970 UK Rock
4 The Bodyguard Whitn… 1992 45 1990 USA Pop
5 Bat Out of Hell Meat … 1977 43 1970 USA Rock
6 Their Greatest Hits (1971–1… Eagles 1976 42 1970 USA Rock
7 Hotel California Eagles 1976 42 1970 USA Rock
8 Come On Over Shani… 1997 40 1990 Canada Coun…
9 Rumours Fleet… 1977 40 1970 UK Rock
10 Saturday Night Fever Bee G… 1977 40 1970 UK Disco
Visualization
The visualization shows a small set of albums that dominate total sales compared to the rest of the dataset. This type of “top-heavy” distribution suggests that a handful of releases account for a large share of total sales, which may reflect cultural reach, marketing, and the size of the music market in different decades.
This summary provides additional context by examining average album sales across decades, which helps identify broader trends beyond the top-selling albums.
albums %>%group_by(Decade) %>%summarize(avg_sales =mean(Sales_Millions, na.rm =TRUE),album_count =n() ) %>%arrange(desc(avg_sales))
This dataset was successfully acquired from a public source (Kaggle) and made reproducible by loading it through a GitHub raw URL. After converting key variables to numeric and removing missing values, I was able to identify the top-selling albums and visualize the results.
To extend this work, I could compare sales across decades using the Decade column, examine whether certain genres dominate high sales, or verify/update the sales values against another reference source to ensure accuracy.
Use of Generative AI
Generative AI tools were used to assist with debugging R and Quarto syntax, understanding error messages, and refining the overall structure of the analysis. All code and analysis were reviewed, understood, and executed by the author.
OpenAI. (2026). ChatGPT (Version 5.2) [Large language model]. https://chat.openai.com. Accessed February 2026.