Lab - Sampling distributions

rm(list = ls())

Lab report

Load data:

#load(url("https://stat.duke.edu/~mc301/data/ames.RData"))
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")

Set a seed:

set.seed(132323)

Exercises:

Exercise 1:

area<-ames$Gr.Liv.Area
price<-ames$SalePrice
summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642

hist(area)

The distribution is unimodal and right skewed. The mean (1500) is larger than the median (1442), the peak sits to the left of the middle of the range rather than in the center, and a long tail stretches to the right toward the maximum of 5642. The overall shape is roughly bell-like, but the histogram is not smooth.
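The skew described above can also be checked numerically. The following is only a minimal sketch: `area_sim` is a hypothetical stand-in drawn from a log-normal distribution (an assumption, chosen so the code runs without the download), and `upper_tail` and `lower_tail` are illustrative names. Substituting `area` for `area_sim` would run the same checks on the real data.

```r
set.seed(1)  # arbitrary seed, only so this sketch is reproducible

# Hypothetical stand-in for `area`: 2930 right-skewed values (assumption)
area_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

upper_tail <- max(area_sim) - median(area_sim)  # distance from the median to the maximum
lower_tail <- median(area_sim) - min(area_sim)  # distance from the minimum to the median

# Two common signs of right skew:
mean(area_sim) > median(area_sim)  # the long right tail pulls the mean above the median
upper_tail > lower_tail            # the right tail is longer than the left tail
```

For the real data, the summary output already supports both checks: the mean is above the median, and the maximum is much farther from the median than the minimum is.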

Exercise 2:

sampl<-sample(area, 50)
summary(sampl)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     864    1148    1442    1584    1970    3672

hist(sampl)

The distribution of this sample is less regular than the population's and shows no clear bell curve. Its middle is more spread out (the quartiles are 1148 and 1970, compared with 1126 and 1743 for the population). There is a slight right skew, and the data do not appear to be normally distributed.

Exercise 3:

mean(sampl)

## [1] 1584.18

samp2<-sample(area, 50)
mean(samp2)

## [1] 1614.9

The mean of “samp2” (1614.9 square feet) is somewhat higher than the mean of “sampl” (1584.18 square feet). Both are above the true population mean of 1499.69 square feet, so in this run “sampl” happens to be the closer estimate. If we took samples of size 100 and 1000, I think the sample of size 1000 would give the more accurate estimate, because the Law of Large Numbers states that the sample mean tends to get closer to the population mean as the sample size grows.
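The effect of sample size can be sketched by repeating the sampling many times and averaging the size of the error. This is an illustration under assumptions, not part of the lab: `pop_sim` is a simulated log-normal population standing in for `area`, and `avg_error` is a hypothetical helper.

```r
set.seed(2)  # arbitrary seed for this sketch

# Hypothetical population standing in for `area` (assumption)
pop_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)
pop_mean <- mean(pop_sim)

# Average absolute error of the sample mean over `reps` repeated samples of size n
avg_error <- function(n, reps = 2000) {
  mean(replicate(reps, abs(mean(sample(pop_sim, n)) - pop_mean)))
}

errors <- sapply(c(50, 100, 1000), avg_error)
names(errors) <- c("n = 50", "n = 100", "n = 1000")
errors  # the typical error shrinks as n grows
```

Any single sample of 50 can still land closer to the population mean than a single sample of 1000 by chance; the claim is about what happens on average.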

Exercise 4:

sample_means50<-rep(NA, 5000)
for(i in 1:5000){
  samp<-sample(area, 50)
  sample_means50[i]<-mean(samp)
}
hist(sample_means50)

hist(sample_means50, breaks = 25)

length(sample_means50)

## [1] 5000

There are 5000 elements in “sample_means50”. The sampling distribution is approximately normal, with a bell curve shape centered at around 1500. If we instead collected 50,000 sample means, I think the distribution would keep about the same shape, center, and spread. The histogram would just look smoother, because more simulated means trace out the same sampling distribution in finer detail.
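A quick way to test that expectation is to build both sets of means and compare their center and spread. The sketch below assumes a simulated population `pop_sim` in place of `area`, and the names `means_5k` and `means_50k` are hypothetical.

```r
set.seed(3)  # arbitrary seed for this sketch

# Hypothetical population standing in for `area` (assumption)
pop_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Same sample size (50) in both cases; only the number of simulated means differs
means_5k  <- replicate(5000,  mean(sample(pop_sim, 50)))
means_50k <- replicate(50000, mean(sample(pop_sim, 50)))

c(center_5k = mean(means_5k), center_50k = mean(means_50k))
c(spread_5k = sd(means_5k),   spread_50k = sd(means_50k))
```

The number of simulated means controls how smooth the histogram looks, while the sample size inside `sample()` controls how wide the sampling distribution is.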

Exercise 5:

# The first three iterations written out by hand, to show what the loop automates
sample_means_small <- rep(NA, 100)
samp <- sample(area, 50)
sample_means_small[1] <- mean(samp)
samp <- sample(area, 50)
sample_means_small[2] <- mean(samp)
samp <- sample(area, 50)
sample_means_small[3] <- mean(samp)
# The same steps as a loop over all 100 positions
sample_means_small <- rep(NA, 100)
for(i in 1:100){
  samp <- sample(area, 50)
  sample_means_small[i] <- mean(samp)
}

length(sample_means_small)

## [1] 100

There are 100 elements in “sample_means_small”. Each element is the mean of one random sample of 50 values drawn from the “area” data column.
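The same vector of means can be built without pre-allocating and indexing, using base R's `replicate()`. This is an alternative to the loop in the lab, not the lab's own method. `pop_sim` and `sample_means_small_sim` are hypothetical names, and the population is simulated so the sketch runs on its own.

```r
set.seed(4)  # arbitrary seed for this sketch

# Hypothetical population standing in for `area` (assumption)
pop_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Evaluate the expression 100 times and collect the results in a vector
sample_means_small_sim <- replicate(100, mean(sample(pop_sim, 50)))

length(sample_means_small_sim)  # one mean per sample
```

The loop makes each step explicit, which suits a first look at simulation; `replicate()` is shorter once the idea is familiar.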

Exercise 6:

sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}
par(mfrow=c(1, 1))
xlimits<-range(sample_means10)
hist(sample_means10, breaks = 20, xlim = xlimits)

hist(sample_means50, breaks = 20, xlim=xlimits)

hist(sample_means100, breaks = 20, xlim=xlimits)

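As the sample size increases, the spread of the sampling distribution decreases and the histogram becomes narrower, while the center stays at about the same value. Plotted on the same x-axis limits, the bars near the center are also taller, because more of the 5000 sample means fall close to the population mean.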
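The narrowing can be put in numbers by comparing the standard deviation of the simulated means with the theoretical value. The sketch below assumes a simulated population `pop_sim` in place of `area`. Because `sample()` draws without replacement by default, the theoretical value used here is the population standard deviation divided by the square root of n, multiplied by a finite population correction.

```r
set.seed(5)  # arbitrary seed for this sketch

# Hypothetical population standing in for `area` (assumption)
pop_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)
N <- length(pop_sim)
sigma <- sqrt(mean((pop_sim - mean(pop_sim))^2))  # population SD (divides by N, not N - 1)

sizes <- c(10, 50, 100)

# SD of 5000 simulated sample means for each sample size
simulated <- sapply(sizes, function(n) sd(replicate(5000, mean(sample(pop_sim, n)))))

# Theory for sampling without replacement: sigma / sqrt(n) times the correction factor
theory <- sigma / sqrt(sizes) * sqrt((N - sizes) / (N - 1))

round(rbind(n = sizes, simulated, theory), 1)
```

The two rows should agree closely, and both shrink roughly in proportion to one over the square root of the sample size.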

On your own:

1:

priceSample<-sample(price, 50)
mean(priceSample)

## [1] 175717.2

The best point estimate of the population mean from this sample is the sample mean, 175,717.2.
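A point estimate is more informative when it is reported with a rough measure of its uncertainty. As a sketch only, the standard error can be estimated from a single sample as the sample standard deviation divided by the square root of the sample size. `price_sim` is a simulated stand-in for `price`, and the other names are hypothetical.

```r
set.seed(6)  # arbitrary seed for this sketch

# Hypothetical stand-in for `price`: 2930 right-skewed values (assumption)
price_sim <- rlnorm(2930, meanlog = 12, sdlog = 0.4)
price_sample_sim <- sample(price_sim, 50)

point_estimate <- mean(price_sample_sim)                                  # estimate of the population mean
std_error <- sd(price_sample_sim) / sqrt(length(price_sample_sim))       # estimated SD of the sample mean

c(estimate = point_estimate, std_error = std_error)
```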

2:

sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}

par(mfrow=c(1,1))
hist(sample_means50, breaks = 20)

mean(sample_means50)

## [1] 180548.3

mean(price)

## [1] 180796.1

The sampling distribution stored in “sample_means50” is approximately normal. It is unimodal with a bell curve shape, and its center is around 180,000. Based on this sampling distribution, I would estimate the mean home price of the population to be about 180,548.3 (the mean of the sample means), which is close to the true population mean of 180,796.1.

3:

sample_means150 <- rep(NA, 5000)
for(i in 1:5000){
  samp <- sample(price, 150)
  sample_means150[i] <- mean(samp)
}
par(mfrow=c(1,1))
xlimits <- range(sample_means50, sample_means150)  # cover both sets of means so neither histogram is clipped
hist(sample_means50,breaks = 30, xlim = xlimits)

hist(sample_means150,breaks = 20, xlim = xlimits)

mean(sample_means50)

## [1] 180548.3

mean(sample_means150)

## [1] 180979.2

The sampling distribution in “sample_means150” is approximately normal, with a bell curve shape, although it looks slightly less smooth toward the upper end, where the bar heights don’t form a perfect curve. The distribution in “sample_means50” looks smoother, with the same bell shape and the same center around 180,000. Because each mean in “sample_means150” comes from a larger sample, those means are more likely to fall close to the true mean sale price of homes in Ames. Based on this distribution, I would estimate that mean to be about 180,979.2, the mean of “sample_means150”.

4:

The sampling distribution from 3, “sample_means150”, has the smaller spread, because each of its means comes from a larger sample (150 homes rather than 50). When we want estimates that are close to the true value, we prefer the sampling distribution with the smaller spread, since a typical sample mean then lands closer to the population mean.