Preliminaries:
The final assignment should include a hard copy of the knitted RMarkdown output with all text, R code, relevant plots, and answers to all questions.
Overview:
The goal of this assignment is for you to get your feet wet with R, and at the same time, to learn some basics of probability. To do this, you will create some data sets with random numbers by sampling from random variables with different probability density functions using R. You will use these data sets to help you think about samples, populations, and how we use probability models to characterize the frequency distribution of data (random variables) and processes. We will consider one discrete probability model, and one continuous probability model.
x1 <- rpois(1000,lambda=6) #create the dataset for question 1
hist(x1,ylab="frequency", xlab="days per month",col="red",main="Distribution
of the number of rainfall events per month in Boston")
populationmean <- 6
samplemean <- mean(x1) #sample mean
samplemean
## [1] 6.083
populationmean
## [1] 6
populationmean==samplemean #compare the two values
## [1] FALSE
Explain. The sample mean is not the same as the population mean. We calculate the sample mean from the values we happened to draw, not from every element of the population, so it only estimates the population mean and will usually differ from it slightly.
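To put the gap between 6.083 and 6 in context, the sketch below expresses it in units of the standard error of the mean. It assumes the Poisson model with lambda = 6 (so the population variance is also 6) and reuses the sample mean printed above; the variable names are illustrative and not part of the assignment code.

```r
# Illustrative sketch: how far is the observed sample mean from the population mean?
lambda <- 6             # population mean; for a Poisson this is also the variance
n <- 1000               # size of the simulated data set x1
observed_mean <- 6.083  # sample mean printed above

se_mean <- sqrt(lambda / n)              # standard error of the mean
z <- (observed_mean - lambda) / se_mean  # difference in standard-error units
se_mean
z
```

A difference of roughly one standard error is ordinary sampling variation, which is why `==` returns FALSE even though the sample was drawn from this population.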
pro4 <- dpois(4,lambda=6)
pro4
## [1] 0.1338526
pro10 <- dpois(10,lambda=samplemean)
pro10
## [1] 0.04361158
The equation I use for the Poisson probability is P(X = x) = (lambda^x * e^(-lambda)) / x!. The parameter lambda is the average number of times an event occurs in a given period. In the code, I use two different values of lambda (the population mean, 6, and the sample mean) to calculate two probabilities: one for x = 4 and the other for x = 10.
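As a quick check of the formula, the sketch below evaluates it by hand and compares the result with `dpois()`. The helper `poisson_pmf` is a hypothetical name introduced only for this illustration.

```r
# Sketch: evaluate the Poisson probability mass function by hand
# and compare it with R's built-in dpois().
poisson_pmf <- function(x, lambda) {
  lambda^x * exp(-lambda) / factorial(x)
}

manual4 <- poisson_pmf(4, lambda = 6)
builtin4 <- dpois(4, lambda = 6)
manual4
builtin4
all.equal(manual4, builtin4)  # compares with a numerical tolerance
```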
y1 <- NULL
for (i in 1:5) y1 <- c(y1,mean(sample(x1,20)))
y1
## [1] 5.50 6.40 6.05 5.95 4.75
Explain what you find. The five sample means (4.75 to 6.40) scatter around the population mean of 6 without matching it exactly. When the sample size increases, the estimate of the population mean gets closer to 6, the given population mean.
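The effect of sample size can be illustrated with a small simulation. This is a sketch under assumptions: `pop` stands in for `x1`, the seed and the sample sizes 20 and 200 are arbitrary choices, and `spread_of_means` is a hypothetical helper.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
pop <- rpois(1000, lambda = 6)  # stand-in for x1

# Standard deviation of `reps` sample means for a given sample size
spread_of_means <- function(size, reps = 500) {
  sd(replicate(reps, mean(sample(pop, size))))
}

spread20 <- spread_of_means(20)
spread200 <- spread_of_means(200)
c(n20 = spread20, n200 = spread200)
```

The means of the larger samples should cluster much more tightly around 6 than the means of the samples of 20.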
s1 <- sample(rpois(1000,lambda=6),100) #one random sample of 100
hist(s1,ylab="frequency", xlab="one random sample of 100 from the Poisson distribution",
col="blue",main="100 Samples Distribution")
y2 <- NULL
for (i in 1:1000) y2 <- c(y2,mean(sample(rpois(1000,lambda=6),100)))
hist(y2,ylab="frequency", xlab="mean",col="blue",main="Mean Distribution")
The x-axis of the s1 histogram shows the number of rainfall events per month in one sample of 100, which ranges from about 0 to 14. The x-axis of the y2 histogram shows the 1000 sample means, which range from about 5.0 to 7.0. The s1 histogram follows the Poisson pattern of the population, while the histogram of sample means is much narrower, roughly symmetric, and centered near 6, as the central limit theorem predicts.
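The narrow spread of the sample means can be compared with the value sqrt(lambda / n) that the central limit theorem suggests. The sketch below is illustrative only: it draws `rpois(n, lambda)` directly, which is assumed here to be equivalent to sampling 100 values from a larger Poisson draw, and the seed is arbitrary.

```r
set.seed(2)  # arbitrary seed for reproducibility
lambda <- 6
n <- 100

# 1000 means, each computed from n independent Poisson draws
means <- replicate(1000, mean(rpois(n, lambda)))

observed_sd <- sd(means)
predicted_sd <- sqrt(lambda / n)  # standard error of the mean for a Poisson
c(observed = observed_sd, predicted = predicted_sd)
```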
A continuous density function: the univariate normal distribution. Assume that the distribution for annual total rainfall in Boston is Gaussian. Let’s assume that the mean annual precipitation in Boston is 1000 mm, with a standard deviation of 100 mm (i.e., these are your population values).
x2 <-rnorm(1000,mean=1000,sd=100) #create the dataset for question 2
hist(x2,col="blue",ylab="frequency", xlab="Annual precipitation (mm)",main="Distribution of annual precipitation in Boston")
sd2=100
populationmean2=1000
populationvariance2=sd2^2 #population variance is the squared standard deviation
samplemean2=mean(x2)
samplevariance2=var(x2)
populationmean2==samplemean2
## [1] FALSE
populationvariance2==samplevariance2
## [1] FALSE
Are they equal? Why or why not? No, they are not equal. We calculate the sample mean and sample variance from only the values we drew, not from the whole population, so they estimate the population values without matching them exactly.
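For reference, the sample variance that `var()` reports can be reproduced by hand. The sketch below is illustrative: `x` stands in for `x2`, the seed is arbitrary, and the point is only that the statistic is computed from the sample with an n - 1 denominator, so it will land near 100^2 but not exactly on it.

```r
set.seed(3)  # arbitrary seed for reproducibility
x <- rnorm(1000, mean = 1000, sd = 100)  # stand-in for x2
n <- length(x)

# Sample variance with the n - 1 denominator, as used by var()
manual_var <- sum((x - mean(x))^2) / (n - 1)
c(manual = manual_var, builtin = var(x), population = 100^2)
```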
p2=pnorm(800,mean=1000,sd=100)
p2
## [1] 0.02275013
ysd100=NULL
for (i in 1:5) ysd100 <- c(ysd100,mean(sample(x2,20)))
ysd100
## [1] 972.1097 972.4360 965.4682 1024.5405 991.2550
Explain. The sample means fluctuate around 1000, the population mean; here they range from about 965 to 1025.
ysd50=NULL
for (i in 1:5) ysd50 <- c(ysd50,mean(sample(rnorm(1000,mean=1000,sd=50),20)))
ysd50
## [1] 1000.2568 1001.4322 1012.6153 1001.1425 991.9023
ysd150=NULL
for (i in 1:5) ysd150 <- c(ysd150,mean(sample(rnorm(1000,mean=1000,sd=150),20)))
ysd150
## [1] 1010.1731 975.1043 1019.2247 1014.8209 988.9999
E=c(ysd50,ysd100,ysd150)
EM= matrix(E, ncol=3)
EM
##           [,1]      [,2]      [,3]
## [1,] 1000.2568  972.1097 1010.1731
## [2,] 1001.4322  972.4360  975.1043
## [3,] 1012.6153  965.4682 1019.2247
## [4,] 1001.1425 1024.5405 1014.8209
## [5,]  991.9023  991.2550  988.9999
Explain. The columns hold the sample means for standard deviations of 50, 100, and 150. The larger the standard deviation, the more widely the sample means vary. I also expect that with a smaller sample size the sample means fluctuate more, and with a larger sample size they fluctuate less.
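With only five means per column the pattern is noisy, so a larger simulation makes it easier to see. The sketch below is an illustration, not part of the assignment: it repeats the sampling 2000 times per standard deviation (an arbitrary choice) and compares the spread of the means with sd / sqrt(n).

```r
set.seed(4)  # arbitrary seed for reproducibility
n <- 20
sds <- c(50, 100, 150)

# For each sd, draw 2000 samples of size n and record the spread of their means
simulated <- sapply(sds, function(s) {
  sd(replicate(2000, mean(rnorm(n, mean = 1000, sd = s))))
})
theoretical <- sds / sqrt(n)  # standard error of the mean

rbind(sd = sds, simulated = simulated, theoretical = theoretical)
```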
n=100 # sample size
mu=1000 # population mean
d50 <-rnorm(n,mean=mu,sd=50) # normal sample with sd = 50
d100 <-rnorm(n,mean=mu,sd=100) # normal sample with sd = 100
d150 <-rnorm(n,mean=mu,sd=150) # normal sample with sd = 150
d50var <-var(d50)
d100var <-var(d100)
d150var <-var(d150)
se50 <- sd(d50)/sqrt(n)
se100 <- sd(d100)/sqrt(n)
se150 <- sd(d150)/sqrt(n)
dm50=mean(d50)
dm100=mean(d100)
dm150=mean(d150)
right50 <- dm50+se50
left50 <- dm50-se50
left50
## [1] 1005.925
right50
## [1] 1015.884
right100 <- dm100+se100
left100 <- dm100-se100
left100
## [1] 980.847
right100
## [1] 1003.483
right150 <- dm150+se150
left150 <- dm150-se150
left150
## [1] 998.1701
right150
## [1] 1028.962
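Explain. When the standard deviation increases, the standard error increases proportionally, because the standard error is sd/sqrt(n) and n is fixed at 100. As a result, the interval of the mean plus or minus one standard error (roughly a 68% confidence interval for the population mean) gets wider: about 10 mm for sd = 50, 23 mm for sd = 100, and 31 mm for sd = 150. The variance is the square of the standard deviation, so it increases together with the standard error.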
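The intervals above extend one standard error on each side of the mean. As a sketch of how a 95% interval could be computed instead, the hypothetical helper `mean_ci` below uses the t distribution; the function name, the seed, and the example sample are assumptions made for illustration.

```r
# Hypothetical helper: confidence interval for the mean based on the t distribution
mean_ci <- function(x, level = 0.95) {
  n <- length(x)
  se <- sd(x) / sqrt(n)                          # standard error of the mean
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)  # two-sided critical value
  c(lower = mean(x) - t_crit * se, upper = mean(x) + t_crit * se)
}

set.seed(5)  # arbitrary seed for reproducibility
d <- rnorm(100, mean = 1000, sd = 100)
mean_ci(d)
```

The result should agree with the interval reported by `t.test(d)$conf.int`, which is a convenient cross-check.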