STAT 360 Lab 14: Resampling Methods (Part 2)

Name: Rebecca Lewis

Set Up Work Space

Load R libraries

library(rmarkdown)
library(knitr)

Set the Seed

set.seed(33)

Load the Kansas City property values data

propertyValues <- read.csv(url("https://www.dropbox.com/s/kbzr00qy4b9kks3/STAT_360_-_Property_Values.csv?dl=1"))
attach(propertyValues)

Estimating the Sampling Distribution Theoretically

Exercise 1:

Use sample() to take a random sample of 10 Selling Prices from the 21613 available properties, assign it to the object originalSample, and calculate the corresponding sample statistic (i.e., the sample mean).

originalSample<-sample(SellingPrice,10, replace = FALSE)
Samplemean<-mean(originalSample)
Samplemean

## [1] 581350

Exercise 2:

Use originalSample (i.e., the sample) to estimate the theoretical standard error for the sampling distribution of Selling Price for a sample of 10 and use this to construct a 95% Confidence Interval around the mean of originalSample. Hint: you can use the standard deviation of the sample as a proxy for the standard deviation of the population.

theostanderror<-sd(originalSample)/sqrt(10)
theostanderror

## [1] 174674.2

Lower <- Samplemean - 1.96*theostanderror
Upper <- Samplemean + 1.96*theostanderror
Lower

## [1] 238988.6

Upper

## [1] 923711.4
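With only 10 observations, the normal multiplier 1.96 tends to give an interval that is somewhat too narrow, and a t critical value with n - 1 degrees of freedom is a common alternative. The sketch below compares the two on a hypothetical vector `prices` (illustrative values, not the lab data); all object names in it are assumptions made for the example.

```r
# Hypothetical sample of 10 selling prices (illustrative values, not the lab data)
prices <- c(372500, 450000, 289000, 615000, 1250000,
            330000, 498000, 2100000, 405000, 525000)

n <- length(prices)
standardError <- sd(prices) / sqrt(n)

# Critical values: normal versus t with n - 1 degrees of freedom
zCritical <- qnorm(0.975)
tCritical <- qt(0.975, df = n - 1)

# Each interval is the sample mean plus or minus (critical value * standard error)
zInterval <- mean(prices) + c(-1, 1) * zCritical * standardError
tInterval <- mean(prices) + c(-1, 1) * tCritical * standardError

zInterval
tInterval
```

The t-based interval is always the wider of the two for a sample this small, because the t critical value exceeds 1.96.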

Jackknifing the Sampling Distribution

Exercise 3:

Construct a minus-one resample of originalSample by removing the first value, calculate the resampled mean, and assign it to the object jackknife1.

resample<-(originalSample[2:10])
jackknife1<-mean(resample)
jackknife1

## [1] 604555.6

Exercise 4:

Construct all other possible minus-one resamples of originalSample (other than the one constructed in Exercise 3), calculate the corresponding resampled means, and assign them to the objects jackknife2, jackknife3, etc.

jackknife2<-mean(originalSample[c(1,3:10)])
jackknife3<-mean(originalSample[c(1:2,4:10)])
jackknife4<-mean(originalSample[c(1:3,5:10)])
jackknife5<-mean(originalSample[c(1:4,6:10)])
jackknife6<-mean(originalSample[c(1:5,7:10)])
jackknife7<-mean(originalSample[c(1:6,8:10)])
jackknife8<-mean(originalSample[c(1:7,9:10)])
jackknife9<-mean(originalSample[c(1:8,10)])
jackknife10<-mean(originalSample[c(1:9)])
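The ten minus-one means can also be produced in one step with negative indexing, which avoids typing each index range by hand. This is a minimal sketch on a hypothetical vector `prices` (illustrative values, not the lab data); `looMeans` is a name chosen for the example.

```r
# Hypothetical sample of 10 selling prices (illustrative values, not the lab data)
prices <- c(372500, 450000, 289000, 615000, 1250000,
            330000, 498000, 2100000, 405000, 525000)

# prices[-i] drops the i-th element, so this returns one minus-one mean per observation
looMeans <- sapply(seq_along(prices), function(i) mean(prices[-i]))

looMeans
```

The first element matches the mean of elements 2 through 10, the second matches the mean with element 2 removed, and so on.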

Exercise 5:

Use c() to concatenate (i.e., combine) jackknife1, jackknife2, etc. into a single object named jackknifedMeans.

jackknifedMeans<-c(jackknife1,jackknife2,jackknife3,jackknife4,jackknife5,jackknife6,jackknife7,jackknife8,jackknife9,jackknife10)

Exercise 6:

Use hist() to generate a histogram of jackknifedMeans (i.e., the resampled sampling distribution) and describe the shape of the resulting distribution.

hist(jackknifedMeans)

The histogram is unimodal and asymmetric: it is left-skewed, with an outlier.

Exercise 7:

Use quantile() to calculate the range of the middle 95% of values in jackknifedMeans (i.e., a jackknifed confidence interval). Hint: quantile() produces values based on percentiles, so you will need to determine which percentiles would yield the middle 95% of values.

quantile(jackknifedMeans,.025)

##     2.5% 
## 448444.4

quantile(jackknifedMeans,.975)

##    97.5% 
## 620930.6
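Both percentiles can be requested in a single call by passing a vector of probabilities to quantile(). A small sketch, assuming a hypothetical vector `resampledMeans` of made-up values standing in for jackknifedMeans:

```r
# Hypothetical resampled means (made-up values, not the lab results)
resampledMeans <- c(448000, 560000, 575000, 580000, 590000,
                    595000, 600000, 605000, 610000, 621000)

# A vector of probabilities returns both ends of the middle 95% at once
middle95 <- quantile(resampledMeans, c(0.025, 0.975))
middle95
```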

Exercise 8:

How does your answer to Exercise 7 compare to the 95% Confidence Interval that was calculated in Exercise 2?

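The jackknifed 95% range from Exercise 7 is 448444.4 to 620930.6, while the theoretical 95% confidence interval from Exercise 2 is 238988.6 to 923711.4. The jackknifed interval is much narrower. This is expected: every minus-one resample shares 9 of its 10 values with every other resample, so the resampled means vary far less than the means of independent samples would.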
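Because the minus-one resamples overlap so heavily, the usual jackknife standard error rescales the spread of the resampled means by (n - 1)/n times their sum of squared deviations. For the sample mean, this rescaled value equals sd/sqrt(n) exactly. The sketch below checks that on a hypothetical vector `prices` (illustrative values, not the lab data); the object names are assumptions made for the example.

```r
# Hypothetical sample of 10 selling prices (illustrative values, not the lab data)
prices <- c(372500, 450000, 289000, 615000, 1250000,
            330000, 498000, 2100000, 405000, 525000)

n <- length(prices)

# One minus-one mean per observation
looMeans <- sapply(seq_len(n), function(i) mean(prices[-i]))

# Jackknife standard error: rescale the squared deviations of the minus-one means
jackknifeSE <- sqrt((n - 1) / n * sum((looMeans - mean(looMeans))^2))

# Theoretical standard error of the mean, using the sample standard deviation
theoreticalSE <- sd(prices) / sqrt(n)

c(jackknife = jackknifeSE, theoretical = theoreticalSE)
```

An interval built from this standard error would have the same width as the theoretical one, which suggests the narrow range in Exercise 7 reflects the overlap between resamples rather than extra precision.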

Shuffling the Sampling Distribution

Exercise 9:

Calculate the difference between the average Selling Price of properties with and without a Waterfront and assign it to the object sampleDifference. Hint: this sample difference can be used as a test statistic in a two-sample t-test.

sampleDifference <- mean(SellingPrice[Waterfront==1]) - mean(SellingPrice[Waterfront==0])
sampleDifference

## [1] 1130871

Exercise 10:

Use sample() to randomly reorder the values of Waterfront and assign the resulting values to shuffledWaterfront.

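# Draws sum(Waterfront==1) Selling Prices at random to act as the shuffled "waterfront" group,
# which is equivalent to shuffling the Waterfront labels
shuffledWaterfront<- sample(SellingPrice, (sum(Waterfront==1)))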

Exercise 11:

Calculate the difference between the average Selling Price of properties based on the delineation of shuffledWaterfront (i.e., the reshuffled sample difference). Hint: placing brackets at the end of a vector (along with comparative commands such as <, >, or ==) can be used to specify which elements you wish to keep, similar to the subset() function - you will need this layout for Exercise 12.

(sum(shuffledWaterfront)/(sum(Waterfront==1)))-(sum(SellingPrice)-sum(shuffledWaterfront))/(sum(Waterfront==0))

## [1] -695.0819
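The prompts for Exercises 10 and 11 describe shuffling the Waterfront labels themselves and then splitting the prices with bracket subsetting. A minimal sketch of that layout on hypothetical toy vectors (made-up values, not the lab data; the `toy` names and the seed are assumptions made for the example):

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical toy data (not the lab data): 1 = waterfront, 0 = not waterfront
toyWaterfront <- c(1, 1, 0, 0, 0, 0, 0, 0)
toyPrice <- c(900, 1100, 400, 500, 450, 600, 550, 500)

# sample() with no size argument returns a random permutation of the labels
toyShuffled <- sample(toyWaterfront)

# Bracket subsetting splits the prices by the shuffled labels
toyShuffledDifference <- mean(toyPrice[toyShuffled == 1]) - mean(toyPrice[toyShuffled == 0])
toyShuffledDifference
```

Shuffling keeps the number of 1s and 0s fixed, so the two group sizes are the same as in the original data.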

Exercise 12:

Use replicate() and sample() to randomly reorder the values of Waterfront and use the resulting delineation to calculate the difference between the average Selling Price of the two groups 10,000 times and assign the resulting differences to the object shuffledDifferences. Hint: you will need to calculate the reshuffled sample difference within the replicate() function.

shuffledDifferences <- replicate(10000, {
  # Draw once per replicate so both group means come from the same shuffle
  shuffledPrices <- sample(SellingPrice, sum(Waterfront==1))
  sum(shuffledPrices)/sum(Waterfront==1) - (sum(SellingPrice) - sum(shuffledPrices))/sum(Waterfront==0)
})

Exercise 13:

Calculate the proportion of shuffledDifferences that were greater than or equal to the sampleDifference. Hint: you just manually calculated a P-value by comparing a sample statistic to a reshuffled sampling distribution.

mean(shuffledDifferences>=sampleDifference)

## [1] 0
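A proportion of exactly 0 only says that none of the shuffles reached the observed difference. One common convention reports (count + 1)/(replicates + 1) instead, so the estimated P-value is never exactly zero. The sketch below shows both versions on hypothetical toy vectors (made-up values, not the lab data; the names, seed, and replicate count are assumptions made for the example).

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical toy data (not the lab data): 1 = waterfront, 0 = not waterfront
toyWaterfront <- c(1, 1, 0, 0, 0, 0, 0, 0)
toyPrice <- c(900, 1100, 400, 500, 450, 600, 550, 500)

observed <- mean(toyPrice[toyWaterfront == 1]) - mean(toyPrice[toyWaterfront == 0])

reps <- 2000
toyDifferences <- replicate(reps, {
  shuffled <- sample(toyWaterfront)  # one shuffle per replicate
  mean(toyPrice[shuffled == 1]) - mean(toyPrice[shuffled == 0])
})

# Raw proportion, and the adjusted version that cannot be exactly zero
rawP <- mean(toyDifferences >= observed)
adjustedP <- (sum(toyDifferences >= observed) + 1) / (reps + 1)

c(raw = rawP, adjusted = adjustedP)
```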

Exercise 14:

How does your answer to Exercise 13 compare to the P-value for a two-sample t-test? Hint: you can use t.test() to perform both one-sample and two-sample t-tests.

t.test(propertyValues$SellingPrice[propertyValues$Waterfront==1],propertyValues$SellingPrice[propertyValues$Waterfront==0],alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  propertyValues$SellingPrice[propertyValues$Waterfront == 1] and propertyValues$SellingPrice[propertyValues$Waterfront == 0]
## t = 12.882, df = 162.23, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  985645.3      Inf
## sample estimates:
## mean of x mean of y 
## 1662524.2  531653.4

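The two answers agree. In Exercise 13, none of the 10,000 shuffled differences were as large as the observed difference, so the reshuffled P-value is estimated as 0 (less than 1 in 10,000). The t-test P-value is less than 2.2e-16. Both are far below .05, so in both cases we reject the null hypothesis and conclude that the average Selling Price of waterfront properties is greater than that of non-waterfront properties. This is a statement about the averages; it does not mean that every waterfront property sells for more than every non-waterfront property.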