I’m importing data about music genres from https://www.kaggle.com/vicsuperman/prediction-of-music-genre. First, I check whether there are any NA entries. There are 5 NA entries, so I’m omitting them.
## [1] TRUE
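The NA check and removal can be sketched in base R as follows. The data frame `music_toy` and its values are made up for illustration and are not the Kaggle data.

```r
# Hypothetical toy data frame standing in for the imported data
music_toy <- data.frame(
  popularity  = c(27, NA, 61, 45),
  music_genre = c("Anime", "Rock", "Rock", NA)
)

# TRUE if at least one NA is present anywhere in the data frame
has_na <- anyNA(music_toy)

# Number of rows that contain at least one NA
n_incomplete <- sum(!complete.cases(music_toy))

# Drop every row that contains an NA
music_clean <- na.omit(music_toy)
```

On this toy data two rows are incomplete, so `music_clean` keeps the remaining two.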
We can see that there is an equal number of entries for each music genre. In this part of the report I’ll be focusing on the popularity of the music genres. I’m creating a histogram of song popularity and calculating the mean and standard deviation (sd) of the popularity of the songs.
## [1] 44.22042
## [1] 15.54201
I’m calculating the coefficient of variation (sd divided by mean). It’s equal to about 35%, which is quite high. I’m plotting popularity by genre to see what may be causing this high coefficient of variation.
## [1] 0.3514668
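As a quick check, the coefficient of variation can be recomputed from the mean and sd printed earlier. The variable names here are only illustrative, and the inputs are the rounded values from the output, so the result agrees only up to rounding.

```r
# Mean and sd of popularity, copied from the printed output
pop_mean <- 44.22042
pop_sd   <- 15.54201

# Coefficient of variation: standard deviation relative to the mean
cv <- pop_sd / pop_mean
round(cv, 4)
```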
As we can see, the popularity values differ vastly between genres. Let’s take a closer look at two of those genres, anime and rock.
## [1] 24.2716
## [1] 7.35911
Let’s calculate the minimal sample size for the anime and rock music genres needed to estimate the mean with a precision of 1.
##
## sample.size.mean object: Sample size for mean estimate
## With finite population correction: N=5000, precision e=1 and standard deviation S=9.6752
##
## Sample size needed: 336
##
## sample.size.mean object: Sample size for mean estimate
## With finite population correction: N=5000, precision e=1 and standard deviation S=7.3591
##
## Sample size needed: 200
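To see where these numbers come from, here is a minimal base R sketch of the usual sample size formula for a mean with finite population correction. The function `min_sample_size` is a hypothetical helper written for this illustration, not the function that produced the output, and it assumes a 95% confidence level.

```r
# Hypothetical helper: sample size for estimating a mean
# S = standard deviation, e = precision, N = population size
min_sample_size <- function(S, e, N, level = 0.95) {
  z  <- qnorm(1 - (1 - level) / 2)  # about 1.96 for 95%
  n0 <- (z * S / e)^2               # size without finite population correction
  ceiling(n0 / (1 + n0 / N))        # apply the correction and round up
}

n_anime <- min_sample_size(S = 9.6752, e = 1, N = 5000)
n_rock  <- min_sample_size(S = 7.3591, e = 1, N = 5000)
```

With the S and N values from the output, this gives 336 for anime and 200 for rock, the same as the reported sample sizes.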
As we can see, our minimal sample size must be at least 336 (the larger of the two), so let’s set n = 340. We also need to remember that the sample size shouldn’t be larger than 10% of the population. In our case population_size * 10% = 500 > n.
n <- 340
samp_mean_anime <- rep(NA, 50)
samp_sd_anime <- rep(NA, 50)
samp_mean_rock <- rep(NA, 50)
samp_sd_rock <- rep(NA, 50)
Now we calculate the sample means in 50 iterations of a loop.
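for (i in 1:50) {
  # Draw n anime songs without replacement and store the sample mean
  samp_anime <- sample(anime$popularity, n)
  samp_mean_anime[i] <- mean(samp_anime)
  # Note: this is the sd of the full anime data, not of the drawn sample
  samp_sd_anime[i] <- sd(anime$popularity)
  # Same steps for rock
  samp_rock <- sample(rock$popularity, n)
  samp_mean_rock[i] <- mean(samp_rock)
  # Note: this is the sd of the full rock data, not of the drawn sample
  samp_sd_rock[i] <- sd(rock$popularity)
}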
Then we create confidence intervals at the 95% confidence level.
lower_vector_anime <- samp_mean_anime - 1.96 * samp_sd_anime / sqrt(n)
upper_vector_anime <- samp_mean_anime + 1.96 * samp_sd_anime / sqrt(n)
lower_vector_rock <- samp_mean_rock - 1.96 * samp_sd_rock / sqrt(n)
upper_vector_rock <- samp_mean_rock + 1.96 * samp_sd_rock / sqrt(n)
Let’s peek at the first interval for each genre:
c(lower_vector_anime[1], upper_vector_anime[1])
## [1] 24.02745 26.08431
c(lower_vector_rock[1], upper_vector_rock[1])
## [1] 59.44423 61.00871
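The intervals in this report use the sd of the full genre data, so every interval for a genre has the same width. A common variant uses the sd of each drawn sample instead. Below is a minimal sketch of that variant on synthetic data: `toy_popularity`, the seed, and the normal distribution are assumptions made for illustration, not the Kaggle data.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical stand-in for one genre's popularity column (5000 values)
toy_popularity <- rnorm(5000, mean = 24, sd = 10)

n_toy <- 340
n_rep <- 50
z <- qnorm(0.975)  # about 1.96 for a 95% interval

toy_mean <- numeric(n_rep)
toy_sd   <- numeric(n_rep)
for (i in seq_len(n_rep)) {
  samp <- sample(toy_popularity, n_toy)
  toy_mean[i] <- mean(samp)
  toy_sd[i]   <- sd(samp)  # sd of the sample itself
}

toy_lower <- toy_mean - z * toy_sd / sqrt(n_toy)
toy_upper <- toy_mean + z * toy_sd / sqrt(n_toy)

# Share of the 50 intervals that contain the mean of the synthetic population
true_mean <- mean(toy_popularity)
coverage <- mean(toy_lower <= true_mean & true_mean <= toy_upper)
```

With 95% intervals we would expect `coverage` to be close to 0.95, although with only 50 repetitions it can deviate noticeably.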
Finally, we plot the intervals and calculate the estimated means for anime and rock.
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 24.375323 -0.004347
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 59.643061 0.001652
## [1] 24.26447
## [1] 59.68518
In the end, our estimated means for anime and rock are:
anime_estimated_mean = mean(samp_mean_anime)
anime_estimated_mean
## [1] 24.26447
rock_estimated_mean = mean(samp_mean_rock)
rock_estimated_mean
## [1] 59.68518
And the estimated 95% confidence intervals, averaged over the 50 samples, for anime and rock respectively are:
c(mean(lower_vector_anime), mean(upper_vector_anime))
## [1] 23.23604 25.29290
c(mean(lower_vector_rock), mean(upper_vector_rock))
## [1] 58.90293 60.46742