knitr::opts_chunk$set(echo = TRUE)
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
library(DATA606)
library(psych)
The population distribution of living area is unimodal and right-skewed. A typical house has a living area of about 1,442 (the median), somewhat below the mean of about 1,500.
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
describe(area)
## vars n mean sd median trimmed mad min max range skew
## X1 1 2930 1499.69 505.51 1442 1452.25 461.09 334 5642 5308 1.27
## kurtosis se
## X1 4.12 9.34
qqnorm(area)
qqline(area)
The distribution of this sample is also unimodal and right-skewed, with a typical area of about 1,343 (the median).
set.seed(100)
samp1 <- sample(area, 50)
describe(samp1)
## vars n mean sd median trimmed mad min max range skew
## X1 1 50 1441.52 448.68 1343 1381.65 360.27 729 2673 1944 1.05
## kurtosis se
## X1 0.4 63.45
qqnorm(samp1)
qqline(samp1)
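If we keep taking samples of 50, I would expect about 95% of the sample means to fall within two standard errors of the true mean, where the standard error is the population standard deviation divided by the square root of the sample size (roughly 505.51 / sqrt(50), or about 71.5). As the sample size grows, the sample means should fall closer to the true mean.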
set.seed(200)
samp2 <- sample(area, 50)
samp3 <- sample(area, 100)
samp4 <- sample(area, 1000)
describe(samp2)
## vars n mean sd median trimmed mad min max range skew
## X1 1 50 1486.7 508.94 1446 1466.7 614.54 572 2501 1929 0.19
## kurtosis se
## X1 -0.96 71.97
describe(samp3)
## vars n mean sd median trimmed mad min max range skew
## X1 1 100 1547.4 491.44 1498 1509.84 435.88 599 3672 3073 1.01
## kurtosis se
## X1 2.44 49.14
describe(samp4)
## vars n mean sd median trimmed mad min max range skew
## X1 1 1000 1492.12 510.1 1427.5 1441.42 454.42 438 4676 4238 1.28
## kurtosis se
## X1 3.39 16.13
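The expectation that roughly 95% of sample means land within two standard errors of the true mean can be checked by simulation. The sketch below is not part of the original lab: it uses a synthetic right-skewed population (`pop`, drawn with `rgamma`) as a stand-in so that it runs without `ames.RData`. Replacing `pop` with `area` should apply the same check to the Ames data. The names `pop`, `se`, `means`, and `within_2se` are illustrative.

```r
set.seed(1)

# Synthetic right-skewed stand-in for the population (an assumption, not the Ames data)
pop <- rgamma(3000, shape = 9, scale = 165)

n  <- 50
mu <- mean(pop)

# Standard error of the mean for samples of size n
se <- sd(pop) / sqrt(n)

# Draw 5,000 samples of size n (without replacement) and record each sample mean
means <- replicate(5000, mean(sample(pop, n)))

# Proportion of sample means within two standard errors of the population mean
within_2se <- mean(abs(means - mu) <= 2 * se)
within_2se
```

If the proportion comes out near 0.95, the two-standard-error rule of thumb holds reasonably well even though the population itself is skewed.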
There are 5,000 elements in sample_means50. The sampling distribution is close to normal: unimodal with little skew. It is centered at its mean, around 1,500. The distribution would look even more normal as the sample size increases.
Each of the 100 elements represents the mean area of one sample of 50 houses.
sample_means_small = rep(0,100)
for(i in 1:100){
samp <- sample(area, 50)
sample_means_small[i] <- mean(samp)
}
length(sample_means_small)
## [1] 100
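The loop above can also be written with `replicate()`, which evaluates an expression a fixed number of times and collects the results in a vector. This is a sketch of an equivalent approach, not the lab's own code. `pop` is a synthetic stand-in for `area` so the block runs on its own, and `sample_means_alt` is a hypothetical name.

```r
set.seed(1)

# Synthetic stand-in for `area` (an assumption, not the Ames data)
pop <- rgamma(3000, shape = 9, scale = 165)

# 100 sample means, each from a sample of 50 drawn without replacement
sample_means_alt <- replicate(100, mean(sample(pop, 50)))

length(sample_means_alt)
```

This avoids pre-allocating the result vector and indexing into it, at the cost of hiding the loop counter.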
The center behaves as it does in a normal distribution: most of the sample means fall close to the overall mean. The shape is unimodal and less skewed.
sampq1 = sample(price,50)
mean(sampq1)
## [1] 160984.7
The shape of the sampling distribution is approximately normal, with only slight right skew (0.25). I would estimate the mean home price of the population to be about 180,000.
set.seed(2500)
sample_means50 = rep(NA,5000)
for(i in 1:5000){
samp <- sample(price, 50)
sample_means50[i] <- mean(samp)
}
hist(sample_means50, breaks =30)
qqnorm(sample_means50)
qqline(sample_means50)
describe(sample_means50)
## vars n mean sd median trimmed mad min max
## X1 1 5000 180769.7 11129.42 180149.5 180490.9 11055.2 134775.5 223689.5
## range skew kurtosis se
## X1 88913.98 0.25 0.14 157.39
mean(price)
## [1] 180796.1
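The spread of a sampling distribution of means is approximately the population standard deviation divided by the square root of the sample size. A minimal sketch of that relationship follows, using a synthetic skewed population rather than the Ames prices; `pop` and the `rlnorm` parameters are assumptions chosen only for illustration.

```r
set.seed(1)

# Synthetic right-skewed "prices" (assumed parameters, not fitted to the Ames data)
pop <- rlnorm(3000, meanlog = 12, sdlog = 0.4)

n <- 50
means <- replicate(5000, mean(sample(pop, n)))

observed_sd    <- sd(means)          # spread of the simulated sampling distribution
theoretical_se <- sd(pop) / sqrt(n)  # sigma / sqrt(n), ignoring the finite-population correction

c(observed = observed_sd, theoretical = theoretical_se)
```

With the lab's data, the analogous comparison would be `sd(sample_means50)` against `sd(price) / sqrt(50)`.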
This sampling distribution has the same shape and center as the previous one, with a mean again around 180,000, but it has fewer extreme values: the sample means cluster more tightly around the mean.
set.seed(2500)
sample_means150 = rep(NA,5000)
for(i in 1:5000){
samp <- sample(price, 150)
sample_means150[i] <- mean(samp)
}
hist(sample_means150, breaks =30)
qqnorm(sample_means150)
qqline(sample_means150)
describe(sample_means150)
## vars n mean sd median trimmed mad min max
## X1 1 5000 180854.6 6403.95 180819.7 180791.4 6291.15 158473.6 203272.4
## range skew kurtosis se
## X1 44798.75 0.08 0 90.57
mean(price)
## [1] 180796.1
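The two `describe()` outputs above allow a quick check of how the spread scales with sample size. If the standard error is proportional to 1 / sqrt(n), tripling the sample size from 50 to 150 should shrink the spread by a factor of about sqrt(3). The sketch below simply does that arithmetic on the reported standard deviations; the variable names are illustrative.

```r
# Standard deviations of the two sampling distributions, as reported by describe() above
sd_n50  <- 11129.42
sd_n150 <- 6403.95

observed_ratio <- sd_n50 / sd_n150   # about 1.74
expected_ratio <- sqrt(150 / 50)     # sqrt(3), about 1.73

c(observed = observed_ratio, expected = expected_ratio)
```

The two ratios agree closely, which is consistent with the spread shrinking in proportion to the square root of the sample size.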
Distribution 3 has the smaller spread. If we want estimates that are more often close to the true value, we would prefer this smaller spread, because nearly all of its sample means fall near the population mean.