Standardize data and estimate the population

Question 1

a) Create a normal distribution (mean=940, sd=190) and standardize it (let’s call it rnorm_std)

d1 <- rnorm(n=500, mean=940, sd=190)
rnorm_std <- (d1-mean(d1))/sd(d1)

a-i) What should we expect the mean and standard deviation of rnorm_std to be, and why?

mean(rnorm_std);sd(rnorm_std)

## [1] -1.965709e-16

## [1] 1

We should expect the mean of the standardized data to be 0 and the standard deviation to be 1. Subtracting the sample mean centers the data at 0, and dividing by the sample standard deviation rescales it so that one unit equals one standard deviation. The reported mean of -1.965709e-16 is zero up to floating-point rounding error.
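As a quick check of this reasoning, the sketch below standardizes a simulated sample by hand and again with base R's scale(), then compares the two. The seed and the names x, z_manual and z_scale are assumptions introduced for illustration; they are not part of the original assignment.

```r
set.seed(42)  # assumed seed, used only so the sketch is reproducible
x <- rnorm(n = 500, mean = 940, sd = 190)

# Manual standardization, as in the answer above
z_manual <- (x - mean(x)) / sd(x)

# scale() centers on the mean and divides by the sample sd by default;
# as.numeric() drops the matrix shape and attributes it returns
z_scale <- as.numeric(scale(x))

# The two versions agree up to floating-point error
all.equal(z_manual, z_scale)

# The mean is zero up to rounding error and the sd is 1
abs(mean(z_manual)) < 1e-12
all.equal(sd(z_manual), 1)
```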

a-ii) What should the distribution (shape) of rnorm_std look like, and why?

plot(density(rnorm_std), col="blue", lwd=2, 
     main = "Distribution 1")

The distribution (shape) of rnorm_std should look like a bell-shaped curve, just like the original data.

When we standardize the data, the shape of the distribution should not change. Standardization only affects the location and scale of the data, but it does not change the shape of the distribution.
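One way to see that standardization leaves the shape alone is to compare a shape statistic before and after. The sketch below uses a deliberately skewed stand-in sample (an assumption, not the assignment's data) and a hypothetical helper skew() that computes the sample skewness.

```r
set.seed(1)  # assumed seed for reproducibility
x <- rexp(1000, rate = 1 / 100)  # skewed stand-in data
z <- (x - mean(x)) / sd(x)

# Hypothetical helper: sample skewness (third standardized moment)
skew <- function(v) {
  centered <- v - mean(v)
  mean(centered^3) / mean(centered^2)^1.5
}

# Skewness is unchanged by shifting and rescaling
all.equal(skew(x), skew(z))

# Standardization is a linear transformation, so the correlation is 1
cor(x, z)
```

Because every value is shifted and rescaled by the same constants, the ordering of the values and the relative distances between them are preserved, which is why the density plot keeps its shape.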

a-iii) What do we generally call distributions that are normal and standardized?

Distributions that are both normal and standardized are typically referred to as standard normal distributions or standard Gaussian distributions.

b) Create a standardized version of minday discussed in question 3 (let’s call it minday_std)

minday_std <- (minday-mean(minday))/sd(minday)

b-i) What should we expect the mean and standard deviation of minday_std to be, and why?

mean(minday_std);sd(minday_std)

## [1] -4.25589e-17

## [1] 1

As with rnorm_std, we should expect the mean of minday_std to be 0 (here -4.25589e-17, which is zero up to rounding error) and its standard deviation to be 1, because standardization subtracts the mean and divides by the standard deviation regardless of the data's distribution.

b-ii) What should the distribution of minday_std look like compared to minday, and why?

plot(density(minday_std), col="blue", lwd=2, 
     main = "Distribution minday_std")

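Standardization only shifts and rescales the data, so the distribution of minday_std has the same shape as that of minday. Only the axis values change: the center moves to 0 and one unit now equals one standard deviation.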

Question 2

Install the compstatslib package from Github (see class notes) and run the plot_sample_ci() function that simulates samples drawn randomly from a population.

require("remotes")

## Loading required package: remotes

## Warning: package 'remotes' was built under R version 4.2.2

remotes::install_github("soumyaray/compstatslib")

## Skipping install of 'compstatslib' from a github remote, the SHA1 (cc90eb60) has not changed since last install.
##   Use `force = TRUE` to force installation

library("compstatslib")

a) Simulate 100 samples (each of size 100), from a normally distributed population of 10,000:

plot_sample_ci(num_samples = 100, sample_size = 100, pop_size=10000, 
               distr_func=rnorm, mean=20, sd=3)

a-i) How many samples do we expect to NOT include the population mean in its 95% CI?

As long as the sample size is sufficiently large (typically n >= 30), the sampling distribution of the sample means will be approximately normal, and the 95% confidence interval will have a coverage probability of approximately 95%. Therefore, we would expect that approximately 5% of the samples would not include the population mean in their 95% confidence interval.

That is, we expect about 5 samples (100 samples * 5%) to NOT include the population mean in their 95% CI.

a-ii) How many samples do we expect to NOT include the population mean in their 99% CI?

By the same reasoning, we expect about 1 sample (100 samples * 1%) to NOT include the population mean in its 99% CI.
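These expected miss rates can be checked with a small simulation in base R. This is only a sketch: it assumes t-based intervals (how plot_sample_ci() builds its intervals is not shown here), and the seed, the helper ci_misses() and the number of repetitions are all assumptions chosen for illustration.

```r
set.seed(123)  # assumed seed for reproducibility
population <- rnorm(10000, mean = 20, sd = 3)
pop_mean <- mean(population)

# Hypothetical helper: draw one sample and report whether its CI
# at the given confidence level misses the population mean
ci_misses <- function(sample_size, level) {
  s <- sample(population, sample_size)
  se <- sd(s) / sqrt(sample_size)
  crit <- qt(1 - (1 - level) / 2, df = sample_size - 1)
  abs(mean(s) - pop_mean) > crit * se
}

# Proportion of intervals that miss, over many repetitions
miss_95 <- mean(replicate(5000, ci_misses(100, 0.95)))
miss_99 <- mean(replicate(5000, ci_misses(100, 0.99)))
c(miss_95 = miss_95, miss_99 = miss_99)
```

The two proportions should land near 0.05 and 0.01. In any single run of 100 samples the actual count of misses will vary around 5 and 1.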

b) Rerun the previous simulation with the same number of samples, but larger sample size (sample_size=300):

plot_sample_ci(num_samples = 100, sample_size = 300, pop_size=10000, 
               distr_func=rnorm, mean=20, sd=3)

b-i) Now that the size of each sample has increased, do we expect their 95% and 99% CI to become wider or narrower than before?

We expect both the 95% and 99% confidence intervals to become narrower than before. Increasing the sample size decreases the standard error (the standard deviation divided by the square root of n), which narrows the interval. Going from n = 100 to n = 300 shrinks both intervals by the same factor, sqrt(100/300), or about 0.58. At any given sample size the 99% interval remains wider than the 95% interval, because it uses a larger critical value.
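To make the effect of sample size on interval width concrete, the sketch below computes the half-width of a normal-theory interval for both sample sizes and both confidence levels. It assumes the population sd of 3 is known and uses qnorm() critical values; the helper half_width() is hypothetical.

```r
sd_pop <- 3  # population sd used in the simulation above

# Hypothetical helper: half-width of a z-based confidence interval
half_width <- function(n, level) {
  qnorm(1 - (1 - level) / 2) * sd_pop / sqrt(n)
}

# Rows are confidence levels, columns are sample sizes
widths <- sapply(c(n100 = 100, n300 = 300), function(n) {
  c(ci95 = half_width(n, 0.95), ci99 = half_width(n, 0.99))
})
widths

# Both levels shrink by the same factor, sqrt(100 / 300)
widths[, "n300"] / widths[, "n100"]
```

The ratio is identical for the two confidence levels, while within each column the 99% half-width stays larger than the 95% one.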

b-ii) This time, how many samples (out of the 100) would we expect to NOT include the population mean in its 95% CI?

Increasing the sample size from 100 to 300 makes the estimate of the population mean more precise and each confidence interval narrower. However, the sample means also cluster more tightly around the population mean, and the two effects cancel: a 95% CI is constructed to capture the population mean about 95% of the time at any sample size.

Therefore, we still expect about 5 of the 100 samples (100 samples * 5%) to NOT include the population mean in their 95% CI.

The exact count in any one simulation varies randomly around this expectation.

c) If we ran the above two examples (a and b) using a uniformly distributed population, how do you expect your answers to (a) and (b) to change, and why?

plot_sample_ci(num_samples = 100, sample_size = 100, pop_size=10000, 
               distr_func=runif)

plot_sample_ci(num_samples = 100, sample_size = 300, pop_size=10000, 
               distr_func=runif)

If we change the population distribution from normal to uniform, our answers about how many samples miss the population mean should barely change. With samples of size 100 or 300, the central limit theorem makes the sampling distribution of the mean approximately normal even though the population is not, so we still expect about 5 of the 100 samples to miss the population mean in their 95% CI and about 1 to miss it in their 99% CI.

What does change is the width of the intervals, and that depends on the population's standard deviation rather than on its shape. The calls above use runif with its default range of 0 to 1, which has a standard deviation of sqrt(1/12), or about 0.29. This is far smaller than the sd of 3 used for the normal population, so the intervals are much narrower in absolute terms.

Increasing the sample size from 100 to 300 narrows the intervals just as before. The standard error is the standard deviation divided by sqrt(n) for any population with finite variance, so it shrinks by a factor of sqrt(100/300), or about 0.58, in both cases.

Therefore, for a uniformly distributed population with the same mean and variance as the normal one, we would expect 95% and 99% confidence intervals of about the same width and about the same number of misses. The exact number of misses in any one run still varies randomly from simulation to simulation.
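A simulation can check how coverage and interval width behave when the population is uniform. The sketch below assumes t-based 95% intervals and the default runif range of 0 to 1; the seed, the helper simulate() and the number of repetitions are assumptions for illustration.

```r
set.seed(2024)  # assumed seed for reproducibility
population <- runif(10000)  # default min = 0, max = 1
pop_mean <- mean(population)

# Hypothetical helper: average miss rate and average CI width
# over many samples of the given size
simulate <- function(sample_size, reps = 5000) {
  res <- replicate(reps, {
    s <- sample(population, sample_size)
    half <- qt(0.975, df = sample_size - 1) * sd(s) / sqrt(sample_size)
    c(miss = abs(mean(s) - pop_mean) > half, width = 2 * half)
  })
  rowMeans(res)
}

result <- rbind(n100 = simulate(100), n300 = simulate(300))
result
```

The miss column should stay close to 0.05 for both sample sizes, while the width column should shrink by roughly sqrt(100/300) when moving from n = 100 to n = 300.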

Question 3

bookings <- read.table("C:/R-language/BACS/first_bookings_datetime_sample.txt", header=TRUE)
bookings$datetime[1:9]

## [1] "4/16/2014 17:30"  "1/11/2014 20:00"  "3/24/2013 12:00"  "8/8/2013 12:00"  
## [5] "2/16/2013 18:00"  "5/25/2014 15:00"  "12/18/2013 19:00" "12/23/2012 12:00"
## [9] "10/18/2013 20:00"

# Parse the datetime strings once, then convert each booking to minutes since midnight
dt     <- as.POSIXlt(bookings$datetime, format="%m/%d/%Y %H:%M")
hours  <- dt$hour
mins   <- dt$min
minday <- hours*60 + mins
plot(density(minday), main="Minute (of the day) of first ever booking", col="blue", lwd=2)

a) What is the “average” booking time for new members making their first restaurant booking? (use minday)

mean(minday)

## [1] 942.4964

a-i) Use traditional statistical methods to estimate the population mean of minday.

mean_md <- mean(minday);mean_md

## [1] 942.4964

sde_md <- sd(minday) / sqrt(length(minday));sde_md

## [1] 0.5997673

# 95% confidence interval (CI) of the sampling means
po_ci <- mean_md + 1.96 * sde_md
ne_ci <- mean_md - 1.96 * sde_md
cat("the population mean of minday might be", ne_ci, "to", po_ci)

## the population mean of minday might be 941.3208 to 943.6719
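The interval above uses the normal critical value 1.96. As a hedged comparison, the sketch below shows how the same interval could be built with a t critical value or taken directly from t.test(). The bookings data is not reproduced here, so x is a simulated stand-in (an assumption) and its numbers will not match the output above.

```r
set.seed(7)  # assumed seed for reproducibility
x <- rnorm(1000, mean = 940, sd = 190)  # hypothetical stand-in for minday

n <- length(x)
se <- sd(x) / sqrt(n)

# Normal-based interval (qnorm(0.975) is approximately 1.96)
ci_z <- mean(x) + c(-1, 1) * qnorm(0.975) * se

# t-based interval with n - 1 degrees of freedom
ci_t <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se

# t.test() reports the same t-based interval
ci_tt <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

rbind(ci_z, ci_t, ci_tt)
```

For a large sample the t and normal critical values are nearly equal, so the two intervals differ only slightly, with the t-based one marginally wider.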

a-ii) Bootstrap to produce 2000 new samples from the original sample.

compute_sample_mean <- function(sample0) {
  resample <- sample(sample0, length(sample0), replace=TRUE)
  mean(resample)
}
btstp <- replicate(2000,compute_sample_mean(minday))

a-iii) Visualize the means of the 2000 bootstrapped samples.

plot(density(btstp), col="blue", lwd=2)

a-iv) Estimate the 95% CI of the bootstrapped means using the quantile function.

quantile(btstp, probs=c(0.025, 0.975))

##     2.5%    97.5% 
## 941.3642 943.7016
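The bootstrap steps above can be generalized into one helper that accepts any statistic, which also makes it easy to compare the percentile interval with the traditional one. This is a sketch on simulated stand-in data (an assumption, since the bookings file is not included here); boot_stat() is a hypothetical name.

```r
set.seed(99)  # assumed seed for reproducibility
x <- rnorm(1000, mean = 940, sd = 190)  # hypothetical stand-in for minday

# Hypothetical helper: resample with replacement `reps` times and
# apply `stat` to each resample
boot_stat <- function(data, stat, reps = 2000) {
  replicate(reps, stat(sample(data, length(data), replace = TRUE)))
}

boot_means <- boot_stat(x, mean)

# Percentile bootstrap 95% CI
ci_boot <- quantile(boot_means, probs = c(0.025, 0.975))

# Traditional 95% CI from the standard error
ci_trad <- mean(x) + c(-1.96, 1.96) * sd(x) / sqrt(length(x))

rbind(bootstrap = unname(ci_boot), traditional = ci_trad)
```

Passing median instead of mean to boot_stat() gives the bootstrapped medians in the same way. For the mean of a large sample the two intervals should be close, as they are in the output above.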

b) By what time of day have half of the day's new members already arrived at their restaurant?

require(lubridate)

## Loading required package: lubridate

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

med_md <- median(minday)
time_md <- dminutes(med_md)
format(as_datetime(time_md),"%H:%M:%S")

## [1] "17:20:00"
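The same conversion can be done without lubridate using integer division and the remainder. The helper minutes_to_hhmm() below is a hypothetical name introduced for illustration; it truncates any fractional minutes.

```r
# Hypothetical helper: convert minutes since midnight to "HH:MM"
minutes_to_hhmm <- function(m) {
  m <- as.integer(m)  # drops any fractional minutes
  sprintf("%02d:%02d", m %/% 60L, m %% 60L)
}

minutes_to_hhmm(1040)  # "17:20", since 1040 = 17 * 60 + 20
```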

b-i) Estimate the median of minday.

median(minday)

## [1] 1040

b-ii) Visualize the medians of the 2000 bootstrapped samples.

compute_sample_med <- function(sample0) {
  resample <- sample(sample0, length(sample0), replace=TRUE)
  median(resample)
}
btstp1 <- replicate(2000,compute_sample_med(minday))
plot(density(btstp1), col="blue", lwd=2)

b-iii) Estimate the 95% CI of the bootstrapped medians using the quantile function.

quantile(btstp1, probs=c(0.025, 0.975))

##  2.5% 97.5% 
##  1020  1050