Week 6 Confidence Intervals

set.seed(123)

dataset <- read.csv("spotify-2023.csv")

names(dataset)

##  [1] "track_name"           "artist.s._name"       "artist_count"        
##  [4] "released_year"        "released_month"       "released_day"        
##  [7] "in_spotify_playlists" "in_spotify_charts"    "streams"             
## [10] "in_apple_playlists"   "in_apple_charts"      "in_deezer_playlists" 
## [13] "in_deezer_charts"     "in_shazam_charts"     "bpm"                 
## [16] "key"                  "mode"                 "danceability_."      
## [19] "valence_."            "energy_."             "acousticness_."      
## [22] "instrumentalness_."   "liveness_."           "speechiness_."

dataset$in_deezer_playlists <- as.integer(dataset$in_deezer_playlists)

## Warning: NAs introduced by coercion

print(typeof(dataset$in_deezer_playlists))

## [1] "integer"

dataset$in_shazam_charts <- as.integer(dataset$in_shazam_charts)

## Warning: NAs introduced by coercion

print(typeof(dataset$in_shazam_charts))

## [1] "integer"
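The coercion warnings above mean that some entries could not be parsed as integers and became NA. One common cause, assumed here rather than confirmed from the file, is thousands separators such as "1,959". A minimal sketch on a made-up vector (`raw_counts` and `to_integer_clean` are hypothetical names, not part of the original analysis) that lists the offending entries and strips commas before converting:

```r
# Hypothetical raw column, as read.csv() might return it (character).
raw_counts <- c("45", "1,959", "", "730", "12,367")

# Entries that are non-empty but fail plain integer coercion.
bad <- !is.na(raw_counts) & raw_counts != "" &
  is.na(suppressWarnings(as.integer(raw_counts)))
print(raw_counts[bad])

# Strip thousands separators, then coerce; empty strings still become NA.
to_integer_clean <- function(x) {
  as.integer(gsub(",", "", x, fixed = TRUE))
}
clean_counts <- to_integer_clean(raw_counts)
print(clean_counts)
```

Inspecting the failing entries first shows whether the NAs are genuinely missing values or just formatting that can be repaired.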

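# Note: in_shazam_charts is a chart column rather than a playlist count; it is part of this sum as originally computed.
dataset$total_playlist_inclusions <- dataset$in_spotify_playlists + dataset$in_apple_playlists + dataset$in_deezer_playlists + dataset$in_shazam_charts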

dataset$average_chart_position <- (dataset$in_spotify_charts + dataset$in_apple_charts + dataset$in_deezer_charts + dataset$in_shazam_charts) / 4
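Because `+` propagates NA, any row with a missing chart value gets an NA average in the calculation above. A possible alternative, sketched on a small made-up data frame (`charts` and its values are illustrative only), is `rowMeans()` with `na.rm = TRUE`, which averages over the platforms that are present:

```r
# Made-up chart values for three songs; the second has no Shazam value.
charts <- data.frame(
  in_spotify_charts = c(10, 0, 5),
  in_apple_charts   = c(20, 4, 7),
  in_deezer_charts  = c(2, 0, 1),
  in_shazam_charts  = c(8, NA, 3)
)

# Strict mean: NA if any platform is missing (matches the sum-then-divide approach).
strict_mean <- rowMeans(charts)

# Lenient mean: average over the non-missing platforms only.
lenient_mean <- rowMeans(charts, na.rm = TRUE)

print(strict_mean)
print(lenient_mean)
```

Whether the lenient mean is appropriate depends on why the values are missing; it is shown only as one option.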

plot(dataset$total_playlist_inclusions, dataset$streams, xlab = "Total Playlist Inclusions", ylab = "Streams", main = "Relationship between Total Playlist Inclusions and Streams")

There appears to be a positive correlation between total playlist inclusions and streams: songs included in more playlists tend to have more streams.
No outliers are readily apparent in the graph.

Points to Consider:

The positive correlation is consistent with playlist inclusion helping a song’s popularity on Spotify, plausibly because playlists expose songs to a wider audience and so lead to more listens.
Correlation does not imply causation, however. A song being in more playlists does not by itself mean it will have more streams, and the direction could run the other way: songs that are already popular may be added to more playlists. Other factors, such as genre, artist popularity, and song characteristics, likely influence streams as well.

plot(dataset$average_chart_position, dataset$streams, xlab = "Average Chart Position", ylab = "Streams", main = "Relationship between Average Chart Position and Streams")

There appears to be a weak negative correlation between average chart position and streams. This means that songs with a higher average chart position (meaning they charted higher on average across different platforms) tend to have slightly fewer streams.
There are a few outliers in the upper left portion of the graph. These outliers represent songs with a high average chart position (possibly charting at #1 on some platforms) but with lower streams.

Points to Consider:

The weak negative correlation is interesting and suggests that charting highly might not directly translate to high streams on Spotify. It’s possible that songs chart high on other platforms but don’t necessarily get added to many Spotify playlists, where users tend to discover new music.
The outliers could be due to several factors. For instance, a song might be popular on radio or music videos but not streamed as much on Spotify. Alternatively, a recently released song might have charted highly initially but not yet gained widespread streams.

print(sum(is.na(dataset$total_playlist_inclusions)))

## [1] 123

print(sum(is.na(dataset$average_chart_position)))

## [1] 57

print(sum(is.na(dataset$streams)))

## [1] 0

dataset$total_playlist_inclusions <- ifelse(is.na(dataset$total_playlist_inclusions), 0, dataset$total_playlist_inclusions)

dataset$average_chart_position <- ifelse(is.na(dataset$average_chart_position), 0, dataset$average_chart_position)

# Calculating correlation coefficients
cor1 <- cor(dataset$total_playlist_inclusions, dataset$streams)
cor2 <- cor(dataset$average_chart_position, dataset$streams)

# Printing correlation coefficients
cat("Correlation coefficient between Total Playlist Inclusions and Streams:", cor1, "\n")

## Correlation coefficient between Total Playlist Inclusions and Streams: 0.2636325

cat("Correlation coefficient between Average Chart Position and Streams:", cor2, "\n")

## Correlation coefficient between Average Chart Position and Streams: 0.08899804
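Replacing NA with 0 before calling `cor()` treats a missing value as a true zero, which can shift the coefficient. A minimal sketch on synthetic data (the vectors `x` and `y` are made up, not the Spotify values) comparing that approach with `use = "complete.obs"`, which drops incomplete pairs instead:

```r
# Synthetic data for illustration only; not the Spotify values.
x <- c(10, 20, NA, 40, 50, NA, 70)
y <- c(12, 25, 90, 41, 55, 95, 68)

# Approach used above: treat missing x as 0.
x_zero <- ifelse(is.na(x), 0, x)
cor_zero <- cor(x_zero, y)

# Alternative: use only the rows where both values are present.
cor_complete <- cor(x, y, use = "complete.obs")

cat("Zero-imputed:  ", cor_zero, "\n")
cat("Complete cases:", cor_complete, "\n")
```

In this toy example the two choices give coefficients of opposite sign, so it may be worth checking how sensitive the reported values are to the zero imputation.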

1. Total Playlist Inclusions vs. Streams (correlation coefficient: 0.2636)

Visualization: The graph showed a positive correlation, with songs in more playlists generally having higher streams.
Correlation Coefficient: A coefficient of 0.2636 indicates a weak positive correlation. This aligns with the visualization, where the data points show an upward trend but with considerable scatter.

2. Average Chart Position vs. Streams (correlation coefficient: 0.0890)

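Visualization: The graph suggested a weak negative trend, with songs charting higher on average having slightly fewer streams.
Correlation Coefficient: A coefficient of 0.0890 is slightly positive and very close to zero, indicating almost no linear relationship. It does not support the weak negative trend read from the plot, which shows no clear linear pattern. Note also that the plot omitted rows with missing values, whereas the coefficient was computed after missing values were replaced with 0, so the two are not based on exactly the same data.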

Why the Values Make Sense:

The weak positive correlation between total playlist inclusions and streams makes sense because being featured in more playlists exposes songs to a wider audience, potentially leading to more listens. However, the weakness of the correlation suggests other factors besides playlist inclusions also influence streams.
The near-zero correlation between average chart position and streams is interesting. While charting high might indicate some level of popularity, it doesn’t necessarily translate directly to high streams on Spotify. The visualization supported this with outliers where some songs charted highly but had lower streams, possibly due to factors like genre preference on Spotify or being new releases.

Overall, the coefficient for playlist inclusions matches the upward trend seen in the plot: there is a weak positive association between playlist inclusions and streams. For average chart position, the coefficient is slightly positive and near zero, so the weak negative trend read from the plot is not borne out; in this data its linear relationship with streams is negligible.
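A point estimate from `cor()` carries no measure of uncertainty. As a sketch of how a confidence interval could be attached to a correlation, `cor.test()` from base R's stats package reports one alongside a p-value. The data below are simulated for illustration (`playlists` and `streams` here are synthetic stand-ins, not the dataset columns):

```r
# Synthetic data for illustration only; not the Spotify values.
set.seed(123)
n <- 200
playlists <- rexp(n, rate = 1 / 5000)
streams <- 1e5 * playlists + rnorm(n, sd = 1e9)

# Pearson correlation with a 95% confidence interval and p-value.
result <- cor.test(playlists, streams, method = "pearson", conf.level = 0.95)

cat("Estimate:", result$estimate, "\n")
cat("95% CI:  ", result$conf.int, "\n")
cat("p-value: ", result$p.value, "\n")
```

If the interval for a coefficient includes zero, the data are compatible with no linear relationship at that confidence level.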

# Build confidence intervals for response variables
# Assuming 'streams' as the response variable
confidence_interval <- t.test(dataset$streams)$conf.int
cat("Confidence Interval for Streams:", confidence_interval, "\n")

## Confidence Interval for Streams: 477566048 549629814

Conclusion:

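The 95% confidence interval for the mean number of streams runs from approximately 478 million to 550 million. Strictly, this does not mean there is a 95% chance that the true mean lies in this particular range; it means that intervals built this way capture the true mean in about 95% of repeated samples. The interval describes the average across songs, not the streams of a typical individual song, and it generalizes to all songs on Spotify in 2023 only to the extent that this dataset is a random sample of them.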
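The interval reported by `t.test()` can be reproduced by hand as the sample mean plus or minus a t critical value times the standard error of the mean. A minimal sketch on a simulated right-skewed sample (the vector `x` is synthetic, not the streams column):

```r
# Synthetic, right-skewed sample for illustration only.
set.seed(123)
x <- rlnorm(500, meanlog = 19, sdlog = 1)

n <- length(x)
se <- sd(x) / sqrt(n)             # standard error of the mean
t_crit <- qt(0.975, df = n - 1)   # two-sided 95% critical value
manual_ci <- mean(x) + c(-1, 1) * t_crit * se

# Built-in interval for comparison.
builtin_ci <- as.numeric(t.test(x)$conf.int)

print(manual_ci)
print(builtin_ci)
```

The width of the interval shrinks with the square root of the sample size, which is why a large dataset can give a narrow interval for the mean even when individual values vary widely.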

Important Considerations:

This confidence interval is specific to the data I analyzed. It may not perfectly represent the entire population of songs on Spotify, which includes songs from all years.
The analysis focused on the relationship between total playlist inclusions and streams. While playlist inclusions likely play a role in streams, other factors certainly influence a song’s popularity on Spotify. These factors, not accounted for in this analysis, could contribute to the variation in streams observed in the data.
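The t-based interval leans on the sample mean being approximately normal. If stream counts are strongly right-skewed (an assumption here, not something checked above), a percentile bootstrap is one way to sanity-check it. This is a different technique from the `t.test()` interval used in this report, sketched on simulated data:

```r
# Synthetic, right-skewed sample for illustration only.
set.seed(123)
x <- rlnorm(500, meanlog = 19, sdlog = 1)

# Resample with replacement and record the mean of each resample.
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))

# Percentile bootstrap 95% interval for the mean.
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))

print(boot_ci)
print(as.numeric(t.test(x)$conf.int))
```

If the two intervals are close, the skew is probably not distorting the t-based result much.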

Overall, the confidence interval provides a valuable estimate of the average number of streams for songs on Spotify in 2023, along with a measure of uncertainty associated with that estimate.

Gagan

2024-03-20