The data

I’m importing data about music genres from https://www.kaggle.com/vicsuperman/prediction-of-music-genre. First, I check whether there are any NA entries. There are 5 NA entries, so I’m omitting them.

## [1] TRUE
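
A minimal sketch of the missing-value check, using a small made-up data frame in place of the Kaggle file. The object name `music`, the column names and the toy values are assumptions for illustration, not the code used in this report.

```r
# Toy stand-in for the imported data (hypothetical values)
music <- data.frame(
  popularity  = c(27, NA, 55, 61),
  music_genre = c("Anime", "Anime", "Rock", NA)
)

anyNA(music)                    # TRUE if at least one value is missing
sum(!complete.cases(music))     # number of rows with a missing value
music_clean <- na.omit(music)   # drop every row that contains an NA
nrow(music_clean)               # rows left after omitting
```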

Descriptive Statistics

We can see that there is an equal number of entries for each music genre. In this part of the report I’ll be focusing on the popularity of the music genres. I’m creating a histogram of song popularity and calculating its mean and standard deviation.

## [1] 44.22042
## [1] 15.54201

I’m calculating the coefficient of variation (the standard deviation divided by the mean). It’s equal to about 35%, which is quite high. I’m plotting the popularity of each music genre to see what may be causing this high coefficient of variation.

## [1] 0.3514668
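
As a quick check, the coefficient of variation can be recomputed from the mean and standard deviation reported above. The variable names `pop_mean`, `pop_sd` and `cv` are placeholders chosen for this sketch.

```r
# Values reported above for song popularity
pop_mean <- 44.22042
pop_sd   <- 15.54201

# Coefficient of variation: standard deviation relative to the mean
cv <- pop_sd / pop_mean
round(cv, 4)
```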

As we can see, popularity differs greatly between genres. Let’s take a closer look at two of those genres, anime and rock.

## [1] 24.2716
## [1] 7.35911
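
A per-genre summary like the one behind this comparison could be sketched as follows. The data frame is a toy example with made-up values, and the column names `music_genre` and `popularity` are assumptions about the layout of the data.

```r
# Toy data (hypothetical values); the real data has many more rows and genres
music <- data.frame(
  music_genre = rep(c("Anime", "Rock"), each = 4),
  popularity  = c(20, 25, 30, 25, 55, 60, 65, 60)
)

# Mean and standard deviation of popularity within each genre
genre_mean <- tapply(music$popularity, music$music_genre, mean)
genre_sd   <- tapply(music$popularity, music$music_genre, sd)
cbind(mean = genre_mean, sd = genre_sd)
```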

Sampling means

Let’s calculate the minimal sample size needed for the anime and rock genres to reach a precision of 1.

## 
## sample.size.mean object: Sample size for mean estimate
## With finite population correction: N=5000, precision e=1 and standard deviation S=9.6752
## 
## Sample size needed: 336
## 
## sample.size.mean object: Sample size for mean estimate
## With finite population correction: N=5000, precision e=1 and standard deviation S=7.3591
## 
## Sample size needed: 200
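
To make the calculation explicit, here is a base R sketch of the standard sample size formula for a mean with finite population correction. The helper name `min_sample_size` is hypothetical, and this is not necessarily how `sample.size.mean` is implemented internally; it is only a formula that is consistent with the output above.

```r
# Minimal sample size for estimating a mean with precision e,
# given population size N and standard deviation S
min_sample_size <- function(S, e, N, level = 0.95) {
  z  <- qnorm(1 - (1 - level) / 2)  # about 1.96 for a 95% level
  n0 <- (z * S / e)^2               # size needed for an infinite population
  ceiling(n0 / (1 + n0 / N))        # finite population correction, rounded up
}

n_anime <- min_sample_size(S = 9.6752, e = 1, N = 5000)
n_rock  <- min_sample_size(S = 7.3591, e = 1, N = 5000)
c(anime = n_anime, rock = n_rock)
```

With the standard deviations reported above, this gives 336 for anime and 200 for rock, the same values as in the output.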

As we can see, the sample size must be at least 336 to satisfy both genres, so let’s set n = 340. We also need to remember that the sample size shouldn’t be larger than 10% of the population. In our case population_size * 10% = 500 > n.

n <- 340

samp_mean_anime <- rep(NA, 50)
samp_sd_anime <- rep(NA, 50)

samp_mean_rock <- rep(NA, 50)
samp_sd_rock <- rep(NA, 50)

Now we calculate the sample means over 50 iterations of a loop.

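for (i in 1:50) {
  # Draw n songs without replacement and store the sample mean
  samp_anime <- sample(anime$popularity, n)
  samp_mean_anime[i] <- mean(samp_anime)
  # SD of the whole genre, not of the sample, so it is the same in every iteration
  samp_sd_anime[i] <- sd(anime$popularity)

  samp_rock <- sample(rock$popularity, n)
  samp_mean_rock[i] <- mean(samp_rock)
  samp_sd_rock[i] <- sd(rock$popularity)
}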

Then we construct a 95% confidence interval around each sample mean.

lower_vector_anime <- samp_mean_anime - 1.96 * samp_sd_anime / sqrt(n) 
upper_vector_anime <- samp_mean_anime + 1.96 * samp_sd_anime / sqrt(n)

lower_vector_rock <- samp_mean_rock - 1.96 * samp_sd_rock / sqrt(n) 
upper_vector_rock <- samp_mean_rock + 1.96 * samp_sd_rock / sqrt(n)

Let’s peek at the first interval for each genre.

c(lower_vector_anime[1], upper_vector_anime[1])
## [1] 24.02745 26.08431
c(lower_vector_rock[1], upper_vector_rock[1])
## [1] 59.44423 61.00871
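
One way to sanity-check intervals like these is to count how many of them contain the true mean. The sketch below does this on simulated data, since the real `anime` and `rock` objects are not reproduced here; the simulated population and the names `population`, `reps` and `covered` are assumptions for illustration only.

```r
set.seed(1)  # reproducible toy example

# Simulated stand-in for one genre's popularity scores (hypothetical data)
population <- rnorm(5000, mean = 25, sd = 10)
n    <- 340
reps <- 50

# Draw repeated samples without replacement and store each sample mean
samp_mean <- replicate(reps, mean(sample(population, n)))

# 95% intervals using the SD of the whole population, as in the code above
half_width <- 1.96 * sd(population) / sqrt(n)
lower <- samp_mean - half_width
upper <- samp_mean + half_width

# Share of intervals that contain the true population mean
covered <- lower <= mean(population) & mean(population) <= upper
mean(covered)
```

For a 95% level, most of the 50 intervals should cover the population mean, with only a few misses expected.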

Finally, we plot the intervals and calculate the estimated means for anime and rock.

## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##   24.375323    -0.004347

## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##   59.643061     0.001652
## [1] 24.26447
## [1] 59.68518

Results

In the end, our estimated means for anime and rock are:

anime_estimated_mean <- mean(samp_mean_anime)
anime_estimated_mean
## [1] 24.26447
rock_estimated_mean <- mean(samp_mean_rock)
rock_estimated_mean
## [1] 59.68518

The 95% confidence intervals, averaged over the 50 samples, are as follows for anime and rock respectively:

c(sum(lower_vector_anime) / 50, sum(upper_vector_anime) / 50)
## [1] 23.23604 25.29290
c(sum(lower_vector_rock) / 50, sum(upper_vector_rock) / 50)
## [1] 58.90293 60.46742