library(dplyr)

TRM<-read.csv("top_rated_movies.csv", header=TRUE, sep =",")
TRM
# Summary statistics of run_time for all N movies
sumRT = summary(TRM$run_time)
sumRT
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    45.0   109.0   129.0   130.7   149.0   238.0       1
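# mean(sumRT) averages the seven numbers printed by summary()
# (Min., quartiles, Mean, Max. and the NA count), not the run times themselves
N_mean<-mean(sumRT)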

N_mean
## [1] 114.5347

The 1st quartile of the entire data set is 109 minutes, the median is 129 minutes, and the 3rd quartile is 149 minutes. The shortest movie was 45 minutes long and the longest was 238 minutes. The average length was 130.7 minutes, and one movie has no recorded run time.
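The difference between averaging the `summary()` output and averaging the run times can be checked with a small sketch. The `run_time` vector here is a made-up stand-in (an assumption, not the real `TRM$run_time` column), and the names `sum_rt`, `mean_of_summary`, `pop_mean` and `mean_from_summary` are only illustrative.

```r
# Hypothetical stand-in for TRM$run_time: 249 made-up run times plus one NA
set.seed(1)
run_time <- c(round(rnorm(249, mean = 130, sd = 30)), NA)

sum_rt <- summary(run_time)

# Averages the seven summary entries, including the NA count
mean_of_summary <- mean(sum_rt)

# Averages the run times themselves, skipping the missing value
pop_mean <- mean(run_time, na.rm = TRUE)

# The "Mean" entry of summary() is the same quantity as pop_mean
mean_from_summary <- as.numeric(sum_rt["Mean"])

c(mean_of_summary = mean_of_summary, pop_mean = pop_mean)
```

With the real data, `mean(TRM$run_time, na.rm = TRUE)` would play the role of `pop_mean`.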

#1. Draw 100 samples of n = 20 and calculate the sample mean of the run_time values

set.seed(26871011) 


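# Draw a sample of 20 run times with replacement, 100 times over.
# Each pass overwrites `sample`, so only the final draw is kept.
for (i in 1:100)
{
  sample <- sample(TRM$run_time,20,replace = TRUE)
}

# Mean of that final sample
n_mean<-mean(sample)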
n_mean
## [1] 131.65
N_mean
## [1] 114.5347
# Difference between the sample mean and N_mean, scaled by 249 (N - 1)
nvariance<-(n_mean-N_mean)/249
nvariance
## [1] 0.0687361
# Repeat the same steps for n = 40
set.seed(359177) 


for (i in 1:100)
{
  sample <- sample(TRM$run_time,40,replace = TRUE)
}

n_mean4<-mean(sample)
n_mean4
## [1] 125
N_mean
## [1] 114.5347
nvariance4<-(n_mean4-N_mean)/249
nvariance4
## [1] 0.04202928
means <- c(n_mean, n_mean4, N_mean)  # the three values plotted in the histogram

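The sample with n = 20 had a mean of 131.65 minutes and the sample with n = 40 had a mean of 125 minutes. N_mean came out as 114.53, but that number is the average of the summary() output, not of the run times; the mean run time reported by summary() is 130.7 minutes. Measured against N_mean, the n = 40 mean is closer than the n = 20 mean, and the nvariance values follow the same order (0.069 for n = 20, 0.042 for n = 40). Measured against 130.7, it is the n = 20 mean that is closer. Since each value comes from a single sample, this comparison shows sampling noise more than an effect of n. Note that nvariance is a scaled difference between two means, not a variance.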
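To look at the sampling distribution itself, the mean of every one of the 100 samples has to be stored. The following is a minimal sketch of one way to do that with `replicate()`. The `run_time` vector is simulated stand-in data (an assumption), and `sample_means` is a hypothetical helper, not part of the original code.

```r
set.seed(26871011)

# Hypothetical stand-in for the non-missing values of TRM$run_time
run_time <- round(rnorm(250, mean = 130, sd = 30))

# Return the means of `reps` samples of size n, drawn with replacement
sample_means <- function(x, n, reps = 100) {
  replicate(reps, mean(sample(x, n, replace = TRUE)))
}

means_20 <- sample_means(run_time, 20)
means_40 <- sample_means(run_time, 40)

# Centre and spread of each set of 100 sample means
c(mean = mean(means_20), var = var(means_20))
c(mean = mean(means_40), var = var(means_40))
```

With the real data, the first argument could be `TRM$run_time[!is.na(TRM$run_time)]`, so that the missing run time cannot turn a sample mean into `NA`.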

#2. Examine the distribution using a histogram to find the mean of n (20), N (250), and the variance of N

hist(means, main="Means", xlab="Average Length of Movies", col="red", breaks=seq(40, 250, by=10))

I could not figure out how to change the individual bar colors. The right bar is n = 20, the middle bar is n = 40, and the left bar is N. The breaks run from 40 to 250 in steps of 10 so that each value falls in its own bin and the bars don’t overlap. With only three values, the plot shows where each mean sits; seeing sample means close in on N would need the means of many samples in the histogram.
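One way to get a different color per bar is to plot the three values with `barplot()`, which applies a vector passed to `col` to the bars in order. This is only a sketch of an alternative to the histogram: the numbers are the ones reported above, and the names `bar_cols` and `mids` and the color choices are illustrative.

```r
# Values reported earlier in this write-up
means <- c("n = 20" = 131.65, "n = 40" = 125, "N_mean" = 114.5347)

# One color per bar, in the same order as `means`
bar_cols <- c("red", "steelblue", "darkgreen")

# barplot() returns the x positions of the bar midpoints
mids <- barplot(means,
                col = bar_cols,
                main = "Means",
                ylab = "Average Length of Movies (minutes)")
```

The bar names come from the names of the `means` vector, so no separate legend is needed.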

#3. Compare to the Central Limit Theorem and what it tells us about sampling distributions (n -> inf.)

set.seed(26871011) 


for (i in 1:100)
{
  sample <- sample(TRM$run_time,30,replace = TRUE)
}

n_meanNEW<-mean(sample)
n_meanNEW
## [1] 129.7333
N_mean
## [1] 114.5347
nvarianceNEW<-(n_meanNEW - N_mean)/249
nvarianceNEW
## [1] 0.06103865

Let's test the Central Limit Theorem using this code. First, to restate the means: n = 20 gave 131.65, n = 40 gave 125, and N_mean is 114.53. I reused the seed from the n = 20 run to see the difference as n changes. I first set n to 5, so the code grabbed five random movies from the data set, 100 times; the new mean was 164, which is very far from N_mean. With n changed to 30, as in the chunk above, the mean becomes 129.73, which is closer. The nvariance value also got smaller each time n increased, which is expected because it is computed from the distance between the sample mean and N_mean.

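The code seems not to work after changing n to a bigger value: the mean somehow ends up further away from N_mean, although it behaves when n increases in small amounts. The likely reason is that the loop overwrites sample on every pass, so each result above is the mean of one random sample. The Central Limit Theorem describes the distribution of many sample means, whose spread shrinks in proportion to 1/sqrt(n); any single sample mean can still land further away at a larger n.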
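What the Central Limit Theorem predicts can be checked by comparing the spread of many sample means with the standard error sigma / sqrt(n) for several values of n. The sketch below uses simulated stand-in data (an assumption, not the movie data) and 1000 repetitions per n, which is also an arbitrary choice; `sizes`, `reps`, `sd_of_means`, `pop_sd` and `predicted` are illustrative names.

```r
set.seed(359177)

# Hypothetical stand-in for the non-missing values of TRM$run_time
run_time <- round(rnorm(250, mean = 130, sd = 30))

sizes <- c(5, 20, 30, 40)
reps <- 1000

# For each n, the standard deviation of `reps` sample means
sd_of_means <- sapply(sizes, function(n) {
  sd(replicate(reps, mean(sample(run_time, n, replace = TRUE))))
})

# Predicted standard error: population sd (divisor N) over sqrt(n)
pop_sd <- sqrt(mean((run_time - mean(run_time))^2))
predicted <- pop_sd / sqrt(sizes)

data.frame(n = sizes, simulated = sd_of_means, predicted = predicted)
```

The simulated and predicted columns should track each other closely and shrink as n grows, even though a single sample mean at n = 40 can still fall further from the population mean than one at n = 20.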