library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(readr)

Introduction

In this mid-term project, we analyze the Spotify Songs dataset to understand what makes a song popular and how attributes such as danceability and energy relate to a song’s success. The results should be useful to music enthusiasts, artists, and the music industry.

Problem Statement

The main problem is to identify the key factors that contribute to the popularity of songs on Spotify.

Approach

To address this problem, we will conduct exploratory data analysis (EDA) and visualize the relationships between song attributes and popularity. We will use various plots and statistical analysis to draw meaningful conclusions.

Stakeholders

Stakeholders include music artists, record labels, and music streaming platforms looking to improve song recommendations and understand user preferences.

Exploratory Data Analysis (EDA)

To answer the questions about song popularity and its correlation with various attributes, we explore the Spotify Songs dataset along several avenues:

Correlations: We’ll begin by calculating correlations between song attributes (such as danceability, energy, and valence) and song popularity. This will help us identify which attributes are strongly associated with a song’s popularity.

Grouping and Aggregation: To gain a deeper understanding, we’ll group songs by various factors, including genre, artist, and release date. This approach will allow us to discern patterns in popularity within these specific groups.

Time Series Analysis: Tracking trends in song popularity over time is crucial. We’ll analyze the data by aggregating it based on release dates, helping us identify whether newer songs tend to be more popular.
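The correlation and time-trend steps above can be sketched as follows. This is a minimal illustration, not the analysis itself: `toy_songs` holds made-up values, and the release-date column name `track_album_release_date` is an assumption, because the column listing printed when the file is loaded is truncated. In the report, the cleaned Spotify data would take the place of `toy_songs`.

```r
library(dplyr)

# Toy data with made-up values, for illustration only.
# ASSUMPTION: the release-date column is named `track_album_release_date`
# and every value starts with a four-digit year.
toy_songs <- data.frame(
  track_popularity = c(10, 35, 50, 62, 78, 90),
  danceability     = c(0.40, 0.55, 0.60, 0.70, 0.75, 0.85),
  energy           = c(0.90, 0.50, 0.70, 0.40, 0.80, 0.60),
  valence          = c(0.30, 0.45, 0.20, 0.65, 0.50, 0.70),
  track_album_release_date = c("2015-03-01", "2015-11-20", "2017",
                               "2017-06-09", "2019-01-15", "2019-08-30")
)

# Correlations: Pearson correlation of each attribute with popularity
attribute_cols <- c("danceability", "energy", "valence")
popularity_cor <- cor(toy_songs[attribute_cols], toy_songs$track_popularity)[, 1]
sort(popularity_cor, decreasing = TRUE)

# Time trend: average popularity per release year
popularity_by_year <- toy_songs %>%
  mutate(release_year = as.integer(substr(track_album_release_date, 1, 4))) %>%
  group_by(release_year) %>%
  summarize(avg_popularity = mean(track_popularity)) %>%
  arrange(release_year)
popularity_by_year
```

Taking only the first four characters of the date keeps rows whose release date is recorded as a bare year, which would otherwise fail to parse as a full date.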

To present our findings visually and make the data more accessible, we’ll use several types of plots and tables:

Scatter Plots: We’ll create scatter plots to visualize the relationship between two numeric variables, such as danceability vs. popularity or energy vs. popularity.

Tables: To provide a clear summary of our findings, we’ll generate tables with key statistics, such as means, medians, and standard deviations for various attributes. These tables will be crucial in comparing statistics across different genres or artists.
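A summary table of the kind described above might be built as in the sketch below. The values in `toy_songs` are made up for illustration; in the report, the cleaned Spotify data would be used instead, and the same pattern applies to any other numeric attribute.

```r
library(dplyr)

# Toy data with made-up values, for illustration only.
toy_songs <- data.frame(
  playlist_genre   = c("pop", "pop", "pop", "rock", "rock", "rock"),
  track_popularity = c(60, 70, 80, 30, 50, 40)
)

# One row per genre with the count, mean, median, and standard deviation
genre_summary <- toy_songs %>%
  group_by(playlist_genre) %>%
  summarize(
    n_songs           = n(),
    mean_popularity   = mean(track_popularity),
    median_popularity = median(track_popularity),
    sd_popularity     = sd(track_popularity)
  ) %>%
  arrange(desc(mean_popularity))

genre_summary
```

Reporting the song count next to each statistic makes it easier to judge how much weight a genre's average deserves.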

This analysis may require us to learn some new techniques:

Advanced Statistical Analysis: If we choose to integrate advanced statistical models, we may need to explore techniques like linear regression, multiple regression, or even machine learning models to predict popularity accurately.

Advanced Visualization: For more intricate visualizations, we may need to explore advanced data visualization libraries and techniques that allow us to convey our findings with precision.

Data analysis is iterative: we begin with straightforward visualizations and progressively incorporate more complex techniques as needed.
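As a first step toward the regression techniques mentioned above, a multiple linear regression could look like the sketch below. It uses base R's `lm()` on made-up values; the choice of danceability and energy as predictors is only an example, and `new_song` is a hypothetical observation.

```r
# Toy data with made-up values, for illustration only.
toy_songs <- data.frame(
  track_popularity = c(10, 35, 50, 62, 78, 90, 45, 70),
  danceability     = c(0.40, 0.55, 0.60, 0.70, 0.75, 0.85, 0.58, 0.72),
  energy           = c(0.90, 0.50, 0.70, 0.40, 0.80, 0.60, 0.65, 0.55)
)

# Multiple linear regression: popularity explained by two attributes
popularity_model <- lm(track_popularity ~ danceability + energy, data = toy_songs)

# Coefficients, standard errors, and R-squared
summary(popularity_model)

# Predicted popularity for a hypothetical new song
new_song <- data.frame(danceability = 0.65, energy = 0.60)
predicted_popularity <- predict(popularity_model, newdata = new_song)
predicted_popularity
```

Each coefficient estimates the change in popularity for a one-unit change in that attribute with the other held fixed. That makes the model a useful complement to pairwise correlations, though not evidence of causation.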

spotify_data <- read_csv("spotify_songs.csv")
## Rows: 32833 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
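# Count the columns and the total number of missing values in the raw data
num_variables <- ncol(spotify_data)
missing_data <- sum(is.na(spotify_data))

# Drop every row that contains at least one missing value
spotify_cleaned_data <- spotify_data %>%
  drop_na()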

# Compute the average popularity of songs in each playlist genre
popularity_by_genre <- spotify_cleaned_data %>%
  group_by(playlist_genre) %>%
  summarize(avg_popularity = mean(track_popularity, na.rm = TRUE))

# Sort the genres by average popularity in descending order
popularity_by_genre <- popularity_by_genre %>%
  arrange(desc(avg_popularity))

# Create the bar plot, with bars ordered from most to least popular genre
ggplot(popularity_by_genre, aes(x = reorder(playlist_genre, -avg_popularity), y = avg_popularity)) +
  geom_col(fill = "skyblue") +
  labs(title = "Average Popularity of Songs by Playlist Genre",
       x = "Playlist Genre",
       y = "Average Popularity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

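# Scatter plot of danceability against energy; semi-transparent points reduce overplotting
ggplot(spotify_cleaned_data, aes(x = danceability, y = energy)) +
  geom_point(alpha = 0.3) +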
  labs(title = "Danceability vs. Energy",
       x = "Danceability",
       y = "Energy")
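The scatter plot above compares two attributes with each other. To relate an attribute to popularity, as outlined in the EDA plan, a similar plot can put `track_popularity` on the y-axis. The sketch below uses made-up values in `toy_songs` and adds a linear trend line; in the report, the cleaned Spotify data would be passed to `ggplot()` instead.

```r
library(ggplot2)

# Toy data with made-up values, for illustration only.
toy_songs <- data.frame(
  danceability     = c(0.40, 0.55, 0.60, 0.70, 0.75, 0.85, 0.58, 0.72),
  track_popularity = c(10, 35, 50, 62, 78, 90, 45, 70)
)

# Scatter plot with a fitted straight line to show the overall trend
popularity_scatter <- ggplot(toy_songs, aes(x = danceability, y = track_popularity)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Danceability vs. Popularity",
       x = "Danceability",
       y = "Track Popularity")

popularity_scatter
```

With tens of thousands of songs, the trend line is often easier to read than the point cloud itself.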