d1 <- rnorm(n=500, mean=940, sd=190)
rnorm_std <- (d1-mean(d1))/sd(d1)
mean(rnorm_std);sd(rnorm_std)
## [1] -1.965709e-16
## [1] 1
We expect the standardized data to have a mean of 0 and a standard deviation of 1. Subtracting the mean centers the data at 0, and dividing by the standard deviation rescales it so that one unit equals one standard deviation. The reported mean of -1.965709e-16 is zero up to floating-point rounding error.
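As a cross-check, base R's `scale()` performs the same centering and scaling. The sketch below uses a hypothetical vector `x` (not the original `d1`) and a fixed seed; both are assumptions added for reproducibility.

```r
# Hypothetical data; the seed is an assumption added for reproducibility
set.seed(42)
x <- rnorm(n = 500, mean = 940, sd = 190)

# Manual standardization, as in the text
x_std <- (x - mean(x)) / sd(x)

# scale() centers and scales the data; as.numeric() drops the matrix attributes
x_scaled <- as.numeric(scale(x))

# The two versions agree up to floating-point error
all.equal(x_std, x_scaled)

# The mean is zero and the sd is one, up to floating-point error
abs(mean(x_std)) < 1e-12
abs(sd(x_std) - 1) < 1e-12
```

Comparing against a small tolerance rather than testing `mean(x_std) == 0` avoids being misled by rounding error.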
plot(density(rnorm_std), col="blue", lwd=2,
main = "Distribution 1")
The distribution (shape) of rnorm_std should be a bell-shaped curve, just like that of the original data.
When we standardize the data, the shape of the distribution should not change. Standardization only affects the location and scale of the data, but it does not change the shape of the distribution.
Distributions that are both normal and standardized are typically referred to as standard normal distributions or standard Gaussian distributions.
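The claim that standardization leaves the shape unchanged can be checked numerically with a shape measure such as skewness. This is a minimal sketch: the `skewness` helper and the right-skewed vector `y` are hypothetical and chosen only for illustration.

```r
# Sample skewness from central moments (a measure of shape, not location or scale)
skewness <- function(v) {
  m2 <- mean((v - mean(v))^2)
  m3 <- mean((v - mean(v))^3)
  m3 / m2^1.5
}

# Hypothetical right-skewed data; the seed is an assumption for reproducibility
set.seed(1)
y <- rexp(1000, rate = 1 / 100)
y_std <- (y - mean(y)) / sd(y)

# Location and scale change ...
c(mean = mean(y), sd = sd(y))
c(mean = mean(y_std), sd = sd(y_std))

# ... but the skewness is identical before and after standardization
c(original = skewness(y), standardized = skewness(y_std))

# Standardization is a linear map, so the two vectors are perfectly correlated
cor(y, y_std)
```

Because skewed data stays skewed after standardization, standardized data is standard normal only when the original data was normal.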
minday_std <- (minday-mean(minday))/sd(minday)
mean(minday_std);sd(minday_std)
## [1] -4.25589e-17
## [1] 1
As before, we expect minday_std to have a mean of 0 and a standard deviation of 1, because standardization centers the data at 0 and rescales it so that one unit equals one standard deviation. The reported mean of -4.25589e-17 is zero up to floating-point rounding error.
plot(density(minday_std), col="blue", lwd=2,
main = "Distribution minday_std")
Standardization only changes the location and scale of the data, so the shape of the distribution does not change.
Therefore, the distribution of minday_std has the same shape as the distribution of minday; only the values on the horizontal axis differ.
require("remotes")
## Loading required package: remotes
## Warning: package 'remotes' was built under R version 4.2.2
remotes::install_github("soumyaray/compstatslib")
## Skipping install of 'compstatslib' from a github remote, the SHA1 (cc90eb60) has not changed since last install.
## Use `force = TRUE` to force installation
library("compstatslib")
plot_sample_ci(num_samples = 100, sample_size = 100, pop_size=10000,
distr_func=rnorm, mean=20, sd=3)
Because the population here is normal, the sampling distribution of the sample mean is normal as well; for other populations it is approximately normal once the sample size is sufficiently large (typically n >= 30). A 95% confidence interval therefore covers the population mean about 95% of the time, so we expect about 5% of the samples to produce a 95% confidence interval that does not include the population mean.
That is, we expect about 5 samples (100 samples * 5%) whose 95% CI does not include the population mean.
By the same reasoning, we expect about 1 sample (100 samples * 1%) whose 99% CI does not include the population mean.
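The expected counts can be checked with a small simulation in base R. This sketch makes several assumptions: it draws each sample directly from a normal distribution rather than from a finite population of 10,000, it uses t-based intervals from `t.test()`, and it is not a reproduction of how `plot_sample_ci` computes its intervals. The helper `ci_misses` is a hypothetical name.

```r
# Assumed settings mirroring the call in the text; the seed is for reproducibility
set.seed(123)
pop_mean <- 20
pop_sd <- 3
num_samples <- 100
sample_size <- 100

# TRUE when the t-based CI at the given confidence level misses pop_mean
ci_misses <- function(x, level) {
  ci <- t.test(x, conf.level = level)$conf.int
  pop_mean < ci[1] || pop_mean > ci[2]
}

# Draw the samples once so both confidence levels are judged on the same data
samples <- replicate(num_samples, rnorm(sample_size, pop_mean, pop_sd),
                     simplify = FALSE)

miss95 <- sum(sapply(samples, ci_misses, level = 0.95))
miss99 <- sum(sapply(samples, ci_misses, level = 0.99))
c(miss95 = miss95, miss99 = miss99)
```

The counts are random, so a single run will rarely give exactly 5 and 1; they should only be close to those values on average.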
plot_sample_ci(num_samples = 100, sample_size = 300, pop_size=10000,
distr_func=rnorm, mean=20, sd=3)
Yes, we expect both the 95% and the 99% confidence intervals to become narrower than before. The standard error is sd/sqrt(n), so increasing the sample size from 100 to 300 shrinks it by a factor of sqrt(100/300), about 0.58, and both intervals shrink by that same factor. The 99% interval remains wider than the 95% interval because it uses a larger critical value.
As the sample size increases, the estimate of the population mean becomes more precise and the confidence interval becomes narrower, but the coverage probability does not change: the sample means also fall closer to the population mean.
Therefore, we still expect about 5 of the 100 samples to have a 95% CI that does not include the population mean, and about 1 to have a 99% CI that does not include it.
The exact number varies from run to run because the samples are drawn at random.
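The effect of sample size on interval width follows from the half-width formula, critical value times sd / sqrt(n). The sketch below evaluates it for the settings in the text, assuming a known population sd of 3 and normal critical values; `ci_half_width` is a hypothetical helper.

```r
# Half-width of a z-based CI: critical value times standard error
ci_half_width <- function(sd, n, level) {
  z <- qnorm(1 - (1 - level) / 2)
  z * sd / sqrt(n)
}

w95_100 <- ci_half_width(sd = 3, n = 100, level = 0.95)
w95_300 <- ci_half_width(sd = 3, n = 300, level = 0.95)
w99_100 <- ci_half_width(sd = 3, n = 100, level = 0.99)
w99_300 <- ci_half_width(sd = 3, n = 300, level = 0.99)

# Both levels shrink by the same factor, sqrt(100 / 300), about 0.577
c(ratio95 = w95_300 / w95_100, ratio99 = w99_300 / w99_100)

# At either sample size the 99% interval is wider than the 95% interval
c(w95_100 = w95_100, w99_100 = w99_100, w95_300 = w95_300, w99_300 = w99_300)
```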
plot_sample_ci(num_samples = 100, sample_size = 100, pop_size=10000,
distr_func=runif)
plot_sample_ci(num_samples = 100, sample_size = 300, pop_size=10000,
distr_func=runif)
If we change the population distribution from normal to uniform, the answers to the previous questions stay essentially the same. With sample sizes of 100 and 300, the central limit theorem makes the sampling distribution of the sample mean approximately normal even though the population itself is not.
The standard error is still sd/sqrt(n); what changes is the population standard deviation. Assuming plot_sample_ci calls runif with its defaults (min = 0, max = 1), the population standard deviation is sqrt(1/12), about 0.29, compared with 3 for the normal population above, so the confidence intervals here are narrower in absolute terms, not wider.
Increasing the sample size from 100 to 300 narrows the intervals by the same factor of sqrt(100/300) as before, because the standard error scales with 1/sqrt(n) for any population with finite variance.
Therefore, with a uniformly distributed population we still expect about 5 of the 100 samples to have a 95% CI that does not include the population mean, and about 1 to have a 99% CI that does not include it, at either sample size. The exact counts vary from run to run because the samples are drawn at random.
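A quick simulation can check how often confidence intervals miss the mean when the population is uniform. This is a sketch under stated assumptions: samples come from `runif()` with its default range of 0 to 1 (population mean 0.5), intervals are t-based, and `miss_rate` is a hypothetical helper that is not part of compstatslib.

```r
# The seed is an assumption added for reproducibility
set.seed(2024)

# Standard deviation of Uniform(0, 1) and the resulting standard errors
unif_sd <- sqrt(1 / 12)
c(se_n100 = unif_sd / sqrt(100), se_n300 = unif_sd / sqrt(300))

# Share of t-based CIs that miss the true mean of 0.5
miss_rate <- function(n, level, reps = 2000) {
  misses <- replicate(reps, {
    x <- runif(n)
    ci <- t.test(x, conf.level = level)$conf.int
    0.5 < ci[1] || 0.5 > ci[2]
  })
  mean(misses)
}

rate95_n100 <- miss_rate(n = 100, level = 0.95)
rate95_n300 <- miss_rate(n = 300, level = 0.95)
rate99_n100 <- miss_rate(n = 100, level = 0.99)
c(rate95_n100 = rate95_n100, rate95_n300 = rate95_n300, rate99_n100 = rate99_n100)
```

Using 2,000 repetitions instead of 100 makes the estimated miss rates much less noisy than the counts read off a single plot.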
bookings <- read.table("C:/R-language/BACS/first_bookings_datetime_sample.txt", header=TRUE)
bookings$datetime[1:9]
## [1] "4/16/2014 17:30" "1/11/2014 20:00" "3/24/2013 12:00" "8/8/2013 12:00"
## [5] "2/16/2013 18:00" "5/25/2014 15:00" "12/18/2013 19:00" "12/23/2012 12:00"
## [9] "10/18/2013 20:00"
# Parse each timestamp once, then convert it to minutes since midnight
datetimes <- as.POSIXlt(bookings$datetime, format="%m/%d/%Y %H:%M")
hours <- datetimes$hour
mins <- datetimes$min
minday <- hours*60 + mins
plot(density(minday), main="Minute (of the day) of first ever booking", col="blue", lwd=2)
mean(minday)
## [1] 942.4964
mean_md <- mean(minday);mean_md
## [1] 942.4964
sde_md <- sd(minday) / sqrt(length(minday));sde_md
## [1] 0.5997673
# 95% confidence interval (CI) for the population mean
po_ci <- mean_md + 1.96 * sde_md
ne_ci <- mean_md - 1.96 * sde_md
cat("the population mean of minday might be", ne_ci, "to", po_ci)
## the population mean of minday might be 941.3208 to 943.6719
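The interval above uses the normal critical value 1.96. As a hedged comparison, the sketch below computes the same kind of interval with the exact normal quantile, with a t quantile, and with `t.test()`. The vector `minutes` is a hypothetical stand-in for `minday`, since the bookings file is not available here.

```r
# Hypothetical stand-in for minday: 5,000 minute-of-day values
set.seed(7)
minutes <- sample(0:1439, size = 5000, replace = TRUE)

m <- mean(minutes)
se <- sd(minutes) / sqrt(length(minutes))

# Normal-based interval, as in the text (qnorm(0.975) is about 1.96)
z_ci <- m + c(-1, 1) * qnorm(0.975) * se

# t-based interval with n - 1 degrees of freedom
t_ci <- m + c(-1, 1) * qt(0.975, df = length(minutes) - 1) * se

# t.test() returns the same t-based interval
tt_ci <- as.numeric(t.test(minutes)$conf.int)

rbind(z_ci, t_ci, tt_ci)
```

With thousands of observations the t and normal critical values are nearly identical, so the choice between them barely changes the interval.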
# Draw one bootstrap resample (same size, with replacement) and return its mean
compute_sample_mean <- function(sample0) {
  resample <- sample(sample0, length(sample0), replace=TRUE)
  mean(resample)
}
btstp <- replicate(2000,compute_sample_mean(minday))
plot(density(btstp), col="blue", lwd=2)
quantile(btstp, probs=c(0.025, 0.975))
## 2.5% 97.5%
## 941.3642 943.7016
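The bootstrap percentile interval and the formula-based interval agree closely because the spread of the bootstrap means estimates the same standard error. The sketch below illustrates this with a hypothetical stand-in vector `minutes` rather than the original `minday`.

```r
# Hypothetical stand-in for minday: 1,000 minute-of-day values
set.seed(99)
minutes <- sample(0:1439, size = 1000, replace = TRUE)

# Standard error of the mean from the formula
se_formula <- sd(minutes) / sqrt(length(minutes))

# Bootstrap: resample with replacement and record each resample's mean
boot_means <- replicate(2000, mean(sample(minutes, replace = TRUE)))

# The standard deviation of the bootstrap means estimates the same quantity
se_boot <- sd(boot_means)
c(formula = se_formula, bootstrap = se_boot)

# Percentile interval next to the normal-approximation interval
quantile(boot_means, probs = c(0.025, 0.975))
mean(minutes) + c(-1, 1) * 1.96 * se_formula
```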
require(lubridate)
## Loading required package: lubridate
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
med_md <- median(minday)
time_md <- dminutes(med_md)
format(as_datetime(time_md),"%H:%M:%S")
## [1] "17:20:00"
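The same conversion from minutes since midnight to a clock time can be sketched in base R with integer division and remainder. `minutes_to_hhmm` is a hypothetical helper, not part of lubridate.

```r
# Convert minutes since midnight to an "HH:MM" string using base R only
minutes_to_hhmm <- function(m) {
  m <- as.integer(round(m))
  sprintf("%02d:%02d", m %/% 60L, m %% 60L)
}

minutes_to_hhmm(1040)            # "17:20"
minutes_to_hhmm(c(0, 942, 1439)) # "00:00" "15:42" "23:59"
```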
median(minday)
## [1] 1040
# Draw one bootstrap resample (same size, with replacement) and return its median
compute_sample_med <- function(sample0) {
  resample <- sample(sample0, length(sample0), replace=TRUE)
  median(resample)
}
btstp1 <- replicate(2000,compute_sample_med(minday))
plot(density(btstp1), col="blue", lwd=2)
quantile(btstp1, probs=c(0.025, 0.975))
## 2.5% 97.5%
## 1020 1050