Assignment1

Use the function rnorm() to sample 10 values from the standard normal distribution. What are the true mean and true standard deviation for the standard normal distribution? The true mean and s.d. for standard normal distribution is 0 and 1.

sample = rnorm(10)
print(sample)

##  [1] -1.5632682 -2.8713066  0.8154888  0.5660628  0.2383023 -0.4079712
##  [7]  1.4777754 -0.2598601 -0.6642029 -1.3547532

Repeat the same command. Do you get the same output? Why? It is because the rnorm command is randomly drawing 10 samples from the standard normal distribution, so each time it is different.
Use rnorm() to sample 100 values from the standard normal distribution. This time, instead of simply printing the output of rnorm, assign it to a variable named r100.

r100 = rnorm(100)
print(r100)

##   [1] -0.376600076 -1.528664633 -1.001159096 -0.302222185 -1.203761111
##   [6] -1.750541534  1.579443454  0.950184380 -1.714790849 -2.630846722
##  [11]  0.719878275 -0.720273274 -0.420644133  0.541252577  0.912361347
##  [16]  0.543248976 -0.485050091 -0.949393919  0.270831061 -1.163556100
##  [21] -0.731706822  0.517906813  0.411664447  0.309613686 -0.683389445
##  [26]  0.121864402  1.107029454 -0.685165056  0.108912765 -0.211736420
##  [31] -0.931273536 -1.498330050 -0.536996750  0.640512449 -0.018249322
##  [36] -0.607302126  0.609105241  0.041778274 -0.737323270  1.815448189
##  [41] -0.802833057  0.215583928  0.420523297 -1.537215213  2.316228300
##  [46] -1.503882422 -1.297731765 -2.432626202 -1.185538629  0.476947317
##  [51] -0.550488853  1.190627134 -1.333594099  0.013663901  0.906153087
##  [56]  0.416008377  1.826552030 -0.589089511 -0.470591971  0.052809090
##  [61]  0.295372437 -1.355960522 -0.479325624  0.446903732 -0.058697091
##  [66] -0.517200236 -0.317234555  1.098972430  0.075388213  0.359691381
##  [71] -0.540431923 -1.082516288  0.229762706  0.299952902  2.033323627
##  [76] -0.669358984 -2.521084470  0.233512599 -0.313664158 -0.822032366
##  [81]  1.607563610  0.517033446  0.178974984  0.512128856 -0.004759043
##  [86] -1.376598203 -1.487668242 -1.062408032 -0.390806945 -1.233702541
##  [91] -0.165539077  0.100235826  0.191070147 -0.468308275  0.966417645
##  [96] -0.516561234 -0.347311864 -0.542999580 -0.100262051  0.674651520

4.Compute the mean of the variable r100 using the function mean(). What value do you expect? Does the sample mean that you just computed differ from it? Why?

mean100 = mean(r100)
print(mean100)

## [1] -0.2010991

The expected value should be 0, the sample mean I just computed is -0.051 (to 3dp), which is different. This is because the computed mean is an estimation based on 100 samples from the true normal distribution. The estimation should be closer to the ground truth with a larger sample size.

Combine the functions length() and sum() to compute the mean of r100 in an alternative way. Do you get exactly the same answer?

n = length(r100)
s = sum(r100)
mean = s/n
print(mean)

## [1] -0.2010991

Yes, I got exactly the same answer.

Compute the standard deviation of the variable r100 using the function sd(). What value do you expect? Does the standard deviation that you just computed differ from it? Why?

sd = sd(r100)
print(sd)

## [1] 0.9669434

The expected value is 1,the S.D. I just computed is 1.037. They are different because the S.D. I computed is an estimation based on only 100 samples.

Use the function var() to compute the variance of r100. What relationship do you expect between the variance and standard deviation of r100? Does it hold exactly?

var = var(r100)
print(var)

## [1] 0.9349795

print(sd**2)

## [1] 0.9349795

variance is the square root of standard deviation of s.d. It holds exactly.

Generate a set of 10,000 standard normal random variates and store it as r10K. Compute the mean and standard deviation. How do the values differ from what you got for r100? Why do you think that is?

r10k = rnorm(10000)
mean10k = mean(r10k)
sd10k = sd(r10k)
print(mean10k)

## [1] 0.007719336

print(sd10k)

## [1] 1.003161

the computed mean and s.d. are 0.0008238 (to 3 s.f.) and 0.99063 respectively. It is closer to the true mean and s.d.. This is because the estimation used a much larger sample so it is more representative of the distribution.

Plot a histogram of r10K using hist(). Describe the shape. Does it look normal?

hist(r10k)

yes it looks normal.

Plot an ECDF (empirical cumulative distribution function) of r10K using plot.ecdf(). How does the shape of this ECDF relate to that of the histogram for the same data?

plot.ecdf(r10k)

ecdf shows the cumulative frequency distribution. Around the area in the histogram where there are not many samples, the corresponding ecdf area is much flatter, as the values don’t increase much. In contrast, around the area in the histogram where there are many samples (around the center), the corresponding ecdf area is much steeper, as there are many data points so the cumulative values increase much faster.

Generate a set of 10,000 random normal variates with an expected mean equal to 1. Call it r10K.m1. Add an ECDF for r10K.m1 to the current plot and color it red (see the help page for plot.ecdf). How do they differ?

r10k = rnorm(10000)
r10k.m1 = rnorm(10000, mean = 1)
plot(ecdf(r10k), main = "ECDF plot")
lines(ecdf(r10k.m1), col = "red")

The steepest increase for the norm dist with mean = 1 (red line) happens at x = 1, whereas that of mean = 0 happens at x = 0. This is because in normal distribution, the density of data points are highest around the mean.

Generate a set of 10,000 random normal variates with an expected mean equal to zero and an expected standard deviation equal to 2. Call it r10K.s2. Add its ECDF to the current plot and color it blue. How does this ECDF differ?

r10k.s2 = rnorm(10000, mean = 0, sd = 2)
r10k = rnorm(10000)
r10k.m1 = rnorm(10000, mean = 1)
plot(ecdf(r10k), main = "ECDF plot")
lines(ecdf(r10k.m1), col = "red")
lines(ecdf(r10k.s2), col = "blue")

the slope of the ECDF of sd = 2 is more smooth and gradual. This is because data points are more spread out so the increase in y-value (cdf) is more even.

Create a logical array r10K.lt0 that tells us whether each object in r10K is less than zero. Use head() to look at the first few values of r10K, and then the corresponding values of r10K.lt0.

r10k.lt0 = r10k < 0
print(head(r10k, n = 10))

##  [1]  1.8273974  0.4196995 -0.4003270 -0.9726189 -0.3492060  0.4706674
##  [7] -0.8689653  1.3721277 -0.5363281 -0.1996623

print(head(r10k.lt0, n = 10))

##  [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE

Use table() to summarize r10K.lt0. Is this what you expected? Motivate your answer.

table(r10k.lt0)

## r10k.lt0
## FALSE  TRUE 
##  4958  5042

Yes; because r10k is a normal distribution with a mean of 0, meaning that the number of samples more or less than zero should be roughly the same.

Use pnorm() to obtain the theoretically expected fraction of values less than 0. Is this what you would expect?

pnorm(0)

## [1] 0.5

Yes, half of the samples are less than 0. This is because this is a normal distribution with a mean of 0.

Redraw the ECDF for r10K only, and then use abline() to draw a vertical line at x = 1. Estimate (by eye) the proportion of values in r10K that are greater than 1. What is your estimate? How many values (out of 10,000) do you expect to be > 1?

r10k = rnorm(10000)
plot(ecdf(r10k), main = "ECDF plot") 
abline(v = 1)

Judging by eyes, around 1000 out of 10,000 values are greater than 1.

Create a logical array r10K.gt1 that tells us whether each object in r10K is greater than 1.

r10k.gt1 = r10k > 1

Use table() to summarize r10K.gt1. Is this what you expected? Motivate your answer.

table(r10k.gt1)

## r10k.gt1
## FALSE  TRUE 
##  8411  1589

Yes.

Use pnorm() to obtain the theoretically expected fraction of values greater than 1. It may require a little bit of thinking to get this right.

pnorm(1) # this is the expected fraction of values less than 1

## [1] 0.8413447

1 - pnorm(1) # this is the expected fraction of values greater than 1

## [1] 0.1586553

around a fraction of 0.159 of values are greater than 1

What fraction of values of r10.s2 do you expect to be larger than 10?

r10k.s2 = rnorm(10000, mean = 0, sd = 2)
1 - pnorm(10, mean = 0, sd = 2)

## [1] 2.866516e-07

around 2.87e-07 fraction of values are expected to be larger than 10

Assignment1

Xiaoyun Li

2024-01-22