sample = rnorm(10)
print(sample)
## [1] -1.5632682 -2.8713066 0.8154888 0.5660628 0.2383023 -0.4079712
## [7] 1.4777754 -0.2598601 -0.6642029 -1.3547532
Repeat the same command. Do you get the same output? Why? It is because the rnorm command is randomly drawing 10 samples from the standard normal distribution, so each time it is different.
Use rnorm() to sample 100 values from the standard normal distribution. This time, instead of simply printing the output of rnorm, assign it to a variable named r100.
r100 = rnorm(100)
print(r100)
## [1] -0.376600076 -1.528664633 -1.001159096 -0.302222185 -1.203761111
## [6] -1.750541534 1.579443454 0.950184380 -1.714790849 -2.630846722
## [11] 0.719878275 -0.720273274 -0.420644133 0.541252577 0.912361347
## [16] 0.543248976 -0.485050091 -0.949393919 0.270831061 -1.163556100
## [21] -0.731706822 0.517906813 0.411664447 0.309613686 -0.683389445
## [26] 0.121864402 1.107029454 -0.685165056 0.108912765 -0.211736420
## [31] -0.931273536 -1.498330050 -0.536996750 0.640512449 -0.018249322
## [36] -0.607302126 0.609105241 0.041778274 -0.737323270 1.815448189
## [41] -0.802833057 0.215583928 0.420523297 -1.537215213 2.316228300
## [46] -1.503882422 -1.297731765 -2.432626202 -1.185538629 0.476947317
## [51] -0.550488853 1.190627134 -1.333594099 0.013663901 0.906153087
## [56] 0.416008377 1.826552030 -0.589089511 -0.470591971 0.052809090
## [61] 0.295372437 -1.355960522 -0.479325624 0.446903732 -0.058697091
## [66] -0.517200236 -0.317234555 1.098972430 0.075388213 0.359691381
## [71] -0.540431923 -1.082516288 0.229762706 0.299952902 2.033323627
## [76] -0.669358984 -2.521084470 0.233512599 -0.313664158 -0.822032366
## [81] 1.607563610 0.517033446 0.178974984 0.512128856 -0.004759043
## [86] -1.376598203 -1.487668242 -1.062408032 -0.390806945 -1.233702541
## [91] -0.165539077 0.100235826 0.191070147 -0.468308275 0.966417645
## [96] -0.516561234 -0.347311864 -0.542999580 -0.100262051 0.674651520
4.Compute the mean of the variable r100 using the function mean(). What value do you expect? Does the sample mean that you just computed differ from it? Why?
mean100 = mean(r100)
print(mean100)
## [1] -0.2010991
The expected value should be 0, the sample mean I just computed is -0.051 (to 3dp), which is different. This is because the computed mean is an estimation based on 100 samples from the true normal distribution. The estimation should be closer to the ground truth with a larger sample size.
n = length(r100)
s = sum(r100)
mean = s/n
print(mean)
## [1] -0.2010991
Yes, I got exactly the same answer.
sd = sd(r100)
print(sd)
## [1] 0.9669434
The expected value is 1,the S.D. I just computed is 1.037. They are different because the S.D. I computed is an estimation based on only 100 samples.
var = var(r100)
print(var)
## [1] 0.9349795
print(sd**2)
## [1] 0.9349795
variance is the square root of standard deviation of s.d. It holds exactly.
r10k = rnorm(10000)
mean10k = mean(r10k)
sd10k = sd(r10k)
print(mean10k)
## [1] 0.007719336
print(sd10k)
## [1] 1.003161
the computed mean and s.d. are 0.0008238 (to 3 s.f.) and 0.99063 respectively. It is closer to the true mean and s.d.. This is because the estimation used a much larger sample so it is more representative of the distribution.
hist(r10k)
yes it looks normal.
plot.ecdf(r10k)
ecdf shows the cumulative frequency distribution. Around the area in the
histogram where there are not many samples, the corresponding ecdf area
is much flatter, as the values don’t increase much. In contrast, around
the area in the histogram where there are many samples (around the
center), the corresponding ecdf area is much steeper, as there are many
data points so the cumulative values increase much faster.
r10k = rnorm(10000)
r10k.m1 = rnorm(10000, mean = 1)
plot(ecdf(r10k), main = "ECDF plot")
lines(ecdf(r10k.m1), col = "red")
The steepest increase for the norm dist with mean = 1 (red line) happens
at x = 1, whereas that of mean = 0 happens at x = 0. This is because in
normal distribution, the density of data points are highest around the
mean.
r10k.s2 = rnorm(10000, mean = 0, sd = 2)
r10k = rnorm(10000)
r10k.m1 = rnorm(10000, mean = 1)
plot(ecdf(r10k), main = "ECDF plot")
lines(ecdf(r10k.m1), col = "red")
lines(ecdf(r10k.s2), col = "blue")
the slope of the ECDF of sd = 2 is more smooth and gradual. This is
because data points are more spread out so the increase in y-value (cdf)
is more even.
r10k.lt0 = r10k < 0
print(head(r10k, n = 10))
## [1] 1.8273974 0.4196995 -0.4003270 -0.9726189 -0.3492060 0.4706674
## [7] -0.8689653 1.3721277 -0.5363281 -0.1996623
print(head(r10k.lt0, n = 10))
## [1] FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE
table(r10k.lt0)
## r10k.lt0
## FALSE TRUE
## 4958 5042
Yes; because r10k is a normal distribution with a mean of 0, meaning that the number of samples more or less than zero should be roughly the same.
pnorm(0)
## [1] 0.5
Yes, half of the samples are less than 0. This is because this is a normal distribution with a mean of 0.
r10k = rnorm(10000)
plot(ecdf(r10k), main = "ECDF plot")
abline(v = 1)
Judging by eyes, around 1000 out of 10,000 values are greater than
1.
r10k.gt1 = r10k > 1
table(r10k.gt1)
## r10k.gt1
## FALSE TRUE
## 8411 1589
Yes.
pnorm(1) # this is the expected fraction of values less than 1
## [1] 0.8413447
1 - pnorm(1) # this is the expected fraction of values greater than 1
## [1] 0.1586553
around a fraction of 0.159 of values are greater than 1
r10k.s2 = rnorm(10000, mean = 0, sd = 2)
1 - pnorm(10, mean = 0, sd = 2)
## [1] 2.866516e-07
around 2.87e-07 fraction of values are expected to be larger than 10