GE516 Assignment 1: Basic Probability Theory

Preliminaries:

The final assignment should include a hard copy of the knitted RMarkdown output with all text, R code, relevant plots, and answers to all questions.
Some useful R commands:
- rpois, dpois, ppois
- rnorm, dnorm, pnorm
- mean, sample, hist, var
Some important tips:
- Create a unique folder where you work and store your script files.
- Tell R where to work by setting the working directory (“setwd”).
- Use a script file to do your work - don’t forget to save it before exiting!

Overview: The goal of this assignment is for you to get your feet wet with R, and at the same time, to learn some basics of probability. To do this, you will create some data sets with random numbers by sampling from random variables with different probability density functions using R. You will use these data sets to help you think about samples, populations, and how we use probability models to characterize the frequency distribution of data (random variables) and processes. We will consider one discrete probability model, and one continuous probability model.

Problem 1

A discrete probability density function: the Poisson distribution. In this first problem, we will examine a discrete PDF. Specifically, let’s assume that the number of rainfall events per month in Boston is described by a Poisson PDF, and that the population mean is six events per month.

Create a random sample of 1000 values that has the rainfall properties for Boston, as described above. This will be your “data set.”

x1 <- rpois(1000,lambda=6)   #create the dataset for question 1

Plot the histogram showing the frequency distribution for the number of rain events per month in Boston, based on these data. Make sure to add meaningful axis labels and a title to the histogram.

hist(x1,ylab="frequency", xlab="rainfall events per month",col="red",
     main="Distribution of the number of rainfall events per month in Boston")

What is the average number of events in your sample? Is it the same as the population mean? Why or why not?

populationmean <- 6
samplemean <- mean(x1)  #sample mean
samplemean

## [1] 6.083

populationmean

## [1] 6

populationmean==samplemean  #compare these two values

## [1] FALSE

Explain. It is not the same as the population mean. The sample mean is calculated from the 1000 values we happened to draw, not from the entire population, so it carries sampling error and does not necessarily equal the true mean of 6.

What is the probability of Boston receiving 4 rainfall events per month? Using your data, estimate the probability of 10 rainfall events per month. Show how you compute this analytically using R and explain how you calculate this probability using the PDF for a Poisson process (i.e., give the equation and explain the parameters).

pro4 <- dpois(4,lambda=6)  
pro4

## [1] 0.1338526

pro10 <- dpois(10,lambda=samplemean)
pro10

## [1] 0.04361158

The equation I use is the Poisson probability P(X = x) = (lambda^x * e^(-lambda)) / x!, where x is the number of events and the parameter lambda is the average number of events in a given period. In the code, I use two different values of lambda: the population mean (6) for x = 4, and the sample mean for x = 10.
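
As a check on the formula above, the following sketch evaluates the Poisson probability by hand and compares it with dpois. It also shows a purely empirical alternative for P(X = 10): the fraction of simulated months with exactly 10 events. The helper name pois_pmf, the variable x_sim, and the seed are assumptions made for this illustration and are not part of the assignment.

```r
# Hand-written Poisson probability: lambda^x * e^(-lambda) / x!
# (pois_pmf is a hypothetical helper name used only for this sketch)
pois_pmf <- function(x, lambda) {
  lambda^x * exp(-lambda) / factorial(x)
}

by_hand  <- pois_pmf(4, lambda = 6)
built_in <- dpois(4, lambda = 6)
all.equal(by_hand, built_in)   # the two calculations agree

# Empirical alternative: fraction of simulated months with exactly 10 events
set.seed(516)                  # arbitrary seed, only for reproducibility
x_sim <- rpois(1000, lambda = 6)
p10_empirical <- mean(x_sim == 10)
p10_theory    <- dpois(10, lambda = 6)
```

The empirical fraction and the dpois value should be close but not identical, for the same reason the sample mean differs from the population mean.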

Use the “sample” command to extract 5 different sub-samples of 20 from your set of 1000. Compute the mean for each. Explain what you find. What is the relationship between the size of your sample and your estimate of the population mean?

y1 <- NULL
for (i in 1:5) y1 <- c(y1,mean(sample(x1,20)))
y1

## [1] 5.50 6.40 6.05 5.95 4.75

Explain what you find. The five sub-sample means range from 4.75 to 6.40, scattering around the population mean of 6. As the sample size increases, the estimate of the population mean gets closer to 6, the given population mean.
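
The relationship between sample size and the estimate can be made concrete with a small simulation. This is a minimal sketch, assuming a simulated stand-in (x_sim) for the data set and an arbitrary choice of sub-sample sizes and 500 repetitions per size. It measures how much the sub-sample means vary at each size.

```r
set.seed(516)                       # arbitrary seed for reproducibility
x_sim <- rpois(1000, lambda = 6)    # stand-in for the data set x1

# Standard deviation of 500 sub-sample means for each sub-sample size
sizes <- c(5, 20, 100, 500)
spread <- sapply(sizes, function(n) {
  sd(replicate(500, mean(sample(x_sim, n))))
})
names(spread) <- sizes
spread   # the spread of the means shrinks as the sub-sample size grows
```

Larger sub-samples give means that sit closer to the mean of the full data set, which is why an estimate based on n = 20 can land as far away as 4.75.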

Write a short loop to compute the mean for each of 1000 different random samples of 100 events drawn from a Poisson distribution with a rate parameter equal to 6. Then, plot the histogram for one random sample of 100 from the Poisson distribution, and plot the histogram of the 1000 means. Describe and interpret your results.

s1 <- rpois(100,lambda=6)   #one random sample of 100 drawn directly from the Poisson distribution
hist(s1,ylab="frequency", xlab="rainfall events per month",
     col="blue",main="One random sample of 100 from the Poisson distribution")

y2 <- numeric(1000)   #preallocate space for the 1000 sample means
for (i in 1:1000) y2[i] <- mean(rpois(100,lambda=6))
hist(y2,ylab="frequency", xlab="sample mean",col="blue",main="Distribution of 1000 sample means (n = 100)")

The x-axis of the s1 histogram is the number of rainfall events per month, which ranges from about 0 to 14 in this sample. The x-axis of the y2 histogram is the sample mean, which ranges from about 5.0 to 7.0 across the 1000 samples. The single sample follows the right-skewed Poisson pattern. The 1000 sample means do not: they cluster tightly and roughly symmetrically around 6 and are approximately normally distributed, as the central limit theorem predicts.
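
The spread of the sample means can be compared with what theory predicts. The sketch below assumes the same setup as the question (lambda = 6, samples of 100) with an arbitrary seed. Because the variance of a Poisson variable equals lambda, the standard deviation of the sample mean should be close to sqrt(lambda / n).

```r
set.seed(516)                         # arbitrary seed for reproducibility
lambda <- 6
n <- 100

# Mean of each of 1000 independent samples of size n
means <- numeric(1000)
for (i in 1:1000) means[i] <- mean(rpois(n, lambda))

sd_observed <- sd(means)              # spread of the simulated means
sd_theory   <- sqrt(lambda / n)       # Poisson variance is lambda, so SE = sqrt(lambda / n)
c(observed = sd_observed, theory = sd_theory)
```

Drawing the histogram with hist(means, freq = FALSE) and then calling curve(dnorm(x, mean = lambda, sd = sd_theory), add = TRUE) would overlay the predicted normal curve on the simulated means.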

Problem 2

A continuous density function: the univariate normal distribution. Assume that the distribution for annual total rainfall in Boston is Gaussian. Let’s assume the mean annual precipitation in Boston is 1000 mm, with a standard deviation of 100 mm (i.e., these are your population values).

Use R to create a random sample of 1000 data values drawn from this population.

x2 <-rnorm(1000,mean=1000,sd=100)      #create the dataset for question 2

Plot the histogram showing the frequency distribution for annual precipitation amounts in Boston based on your sample.

hist(x2,col="blue",ylab="frequency", xlab="Annual precipitation (mm)",main="Distribution of annual precipitation in Boston")

What is the mean and variance for your data set? Are these equal to the population mean and variance that you used to generate the data? Why or why not?

sd2=100
populationmean2=1000
populationvariance2=sd2^2   #population variance = sd^2 = 10000
samplemean2=mean(x2)
samplevariance2=var(x2)

populationmean2==samplemean2

## [1] FALSE

populationvariance2==samplevariance2

## [1] FALSE

Are they equal? Why or why not? No, they are not equal. The sample mean and sample variance are calculated from only the 1000 values we drew, not from the entire population, so they are estimates that differ from the population values by sampling error.
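
To see what var actually computes, and how far the estimates typically fall from the population values, here is a minimal sketch. It uses a freshly simulated sample (x_sim) and an arbitrary seed, so the numbers will not match the ones reported above.

```r
set.seed(516)                            # arbitrary seed for reproducibility
x_sim <- rnorm(1000, mean = 1000, sd = 100)

n <- length(x_sim)
var_by_hand <- sum((x_sim - mean(x_sim))^2) / (n - 1)   # sample variance with n - 1 divisor
all.equal(var_by_hand, var(x_sim))       # var() uses the same n - 1 divisor

# Sampling error: difference between the estimates and the population values
mean_error <- mean(x_sim) - 1000         # typical size is sd / sqrt(n), about 3.16 here
var_error  <- var(x_sim) - 100^2
c(mean_error = mean_error, var_error = var_error)
```

Because the data are continuous, an exact match with == is essentially impossible. Comparing the size of the error with sd / sqrt(n) is a more informative check.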

Use the cumulative frequency distribution and your population parameters to calculate the probability that mean annual precipitation is 800 mm or less.

p2=pnorm(800,mean=1000,sd=100)
p2

## [1] 0.02275013
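
The same probability can be reached by standardizing first. This sketch converts 800 mm to a z-score and evaluates the standard normal CDF, then compares the result with the fraction of a simulated sample at or below 800 mm. The variable names and the seed are assumptions for illustration.

```r
mu    <- 1000   # population mean (mm)
sigma <- 100    # population standard deviation (mm)

z <- (800 - mu) / sigma          # 800 mm is 2 standard deviations below the mean
p_z      <- pnorm(z)             # standard normal CDF at z = -2
p_direct <- pnorm(800, mean = mu, sd = sigma)
all.equal(p_z, p_direct)         # both routes give the same probability

# Empirical check on a simulated sample (seed chosen arbitrarily)
set.seed(516)
x_sim <- rnorm(1000, mean = mu, sd = sigma)
p_empirical <- mean(x_sim <= 800)
```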

Use the “sample” command to extract 5 sub-samples of 20 from your set of 1000, and compute the mean for each. Explain what you find.

ysd100=NULL
for (i in 1:5) ysd100 <- c(ysd100,mean(sample(x2,20)))
ysd100

## [1]  972.1097  972.4360  965.4682 1024.5405  991.2550

Explain. The sample means fluctuate around the population mean of 1000 mm, here ranging from about 965 to 1025.

Repeat part E two more times, setting the standard deviation to be 50 and 150 mm (note, this may result in negative rainfall amounts - for the purposes of this exercise don’t worry about this). Compute the mean for each sample and explain the patterns that you observe. If you varied the size of your sub-samples (i.e., n=10, 20, 50, 100), how would you expect the results to change?

ysd50=NULL
for (i in 1:5) ysd50 <- c(ysd50,mean(sample(rnorm(1000,mean=1000,sd=50),20)))
ysd50

## [1] 1000.2568 1001.4322 1012.6153 1001.1425  991.9023

ysd150=NULL
for (i in 1:5) ysd150 <- c(ysd150,mean(sample(rnorm(1000,mean=1000,sd=150),20)))
ysd150

## [1] 1010.1731  975.1043 1019.2247 1014.8209  988.9999

E=c(ysd50,ysd100,ysd150)
EM= matrix(E, ncol=3)   
EM

##           [,1]      [,2]      [,3]
## [1,] 1000.2568  972.1097 1010.1731
## [2,] 1001.4322  972.4360  975.1043
## [3,] 1012.6153  965.4682 1019.2247
## [4,] 1001.1425 1024.5405 1014.8209
## [5,]  991.9023  991.2550  988.9999

Explain. The larger the standard deviation, the more widely the sample means vary. With a smaller sub-sample size I expect the sample means to fluctuate more, and with a larger one I expect them to fluctuate less.
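
Both effects follow from the standard deviation of a sample mean, sigma / sqrt(n). The sketch below tabulates that quantity for the standard deviations and sub-sample sizes named in the question. The table name se_table and its labels are assumptions for illustration, and the finite size of the 1000-value pool is ignored.

```r
sds   <- c(50, 100, 150)       # population standard deviations (mm)
sizes <- c(10, 20, 50, 100)    # sub-sample sizes

# Theoretical standard deviation of the sample mean: sigma / sqrt(n)
se_table <- outer(sds, sizes, function(s, n) s / sqrt(n))
dimnames(se_table) <- list(paste0("sd=", sds), paste0("n=", sizes))
round(se_table, 1)
```

Reading across a row shows the effect of sample size, and reading down a column shows the effect of the population standard deviation. For sd = 100 and n = 20 the expected spread is about 22.4 mm.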

Estimate the standard error and confidence intervals on mean precipitation based on random samples drawn with the same population mean (1000 mm), and standard deviations equal to 50, 100, and 150 mm, with n=100. Explain what you find. Based on your results, how do the size and variance in your sample affect uncertainty in your estimate of the population mean?

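n=100
popmean=1000   #population mean; named popmean to avoid confusion with the mean() function
d50 <-rnorm(n,mean=popmean,sd=50)    #normal sample with sd = 50 (not a standard normal)
d100 <-rnorm(n,mean=popmean,sd=100)
d150 <-rnorm(n,mean=popmean,sd=150)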
d50var <-var(d50)
d100var <-var(d100)
d150var <-var(d150)

se50 <- sd(d50)/sqrt(n)
se100 <- sd(d100)/sqrt(n)
se150 <- sd(d150)/sqrt(n)

dm50=mean(d50)
dm100=mean(d100)
dm150=mean(d150)

right50 <- dm50+se50
left50 <- dm50-se50
left50

## [1] 1005.925

right50

## [1] 1015.884

right100 <- dm100+se100
left100 <- dm100-se100
left100

## [1] 980.847

right100

## [1] 1003.483

right150 <- dm150+se150
left150 <- dm150-se150
left150

## [1] 998.1701

right150

## [1] 1028.962

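Explain. When the standard deviation increases, the standard error goes up proportionally, because SE = sd/sqrt(n). The intervals above are the sample mean plus or minus one standard error, so they also get wider as the standard deviation increases, meaning the estimate of the population mean is more uncertain. The sample variance changes along with the standard error, since the standard error is the square root of the variance divided by n. For a fixed standard deviation, a larger sample size n gives a smaller standard error and a narrower interval.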
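
An interval of plus or minus one standard error covers the population mean only about 68% of the time. A conventional 95% confidence interval multiplies the standard error by a critical value from the t distribution. The following is a minimal sketch of that calculation, not the method used above. The helper name ci95 and the seed are assumptions for illustration.

```r
set.seed(516)                               # arbitrary seed for reproducibility
n <- 100

# ci95 is a hypothetical helper: mean +/- t * SE for a 95% confidence interval
ci95 <- function(x) {
  se <- sd(x) / sqrt(length(x))
  t_crit <- qt(0.975, df = length(x) - 1)   # about 1.98 for n = 100
  c(lower = mean(x) - t_crit * se, mean = mean(x),
    upper = mean(x) + t_crit * se, se = se)
}

# One sample of size n for each population standard deviation
results <- t(sapply(c(50, 100, 150), function(s) ci95(rnorm(n, mean = 1000, sd = s))))
rownames(results) <- c("sd=50", "sd=100", "sd=150")
round(results, 2)
```

The built-in t.test(x)$conf.int returns the same 95% interval for a single sample x, which is a convenient cross-check.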

Chen Tang

Due in class October 1
