library(tidyverse)
## Warning: package 'dplyr' was built under R version 4.4.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Welcome to Lab 3 (Week 7). In this lab, you will

  1. implement a permutation test;

  2. implement a T-test;

  3. illustrate the difference between a T-distribution and the standard normal distribution.

We will return to using the Craigslist data from San Francisco, which we’ve used in previous classes and labs.

Read in data:

craigslist <- read.csv("craigslist.csv", header = TRUE)

Replicate

To repeat things in R, there is a more convenient and efficient way than writing a for loop. The function replicate is designed to evaluate an expression repeatedly. For example, to get a vector of length 50 where each element is independently generated as the sum of five normally distributed samples:

set.seed(20170209)
example1 <- replicate(50, sum(rnorm(5)))

Go to the console window to have a look at the ‘example1’ vector. We don’t want or need the results to show up in this markdown file.

The equivalent of the above code using a for loop would be:

# create a vector of length 0 to store the results
example2 <- c()
# repeat the calculation 50 times
for (i in 1:50){
  # append the calculated sum of normal samples to the end of the `example2` vector
  example2 <- c(example2, sum(rnorm(5)))
}

Go to the console window to have a look at the ‘example2’ vector.
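As an aside (not required for this lab), if you do use a for loop, growing a vector with c() inside the loop forces R to copy the vector on every iteration. A sketch of the more efficient idiom is to pre-allocate the result vector first (example2b is just an illustrative name):

# pre-allocate a numeric vector of length 50, then fill it in place
example2b <- numeric(50)
for (i in 1:50) {
  example2b[i] <- sum(rnorm(5))
}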

The first argument of the replicate function is the number of replications. The second argument is an expression that will be evaluated repeatedly, and it usually involves random sampling or simulation. (Otherwise, you would get a vector of identical numbers. That’s useless!) Sometimes the computation you want to replicate does not fit on one line; in that case, wrap it in curly braces, as in the following. Each element of the example3 vector is the value of the last expression inside the braces. (Since we set the same seed for example1 and example3, you may want to compare their values in the console window to see what happened.)

set.seed(20170209)
example3 <- replicate(50, {
  a <- rnorm(5)
  b <- 5
  sum(a) + b    # return the value 
})
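Since example1 and example3 use the same seed, each element of example3 should equal the corresponding element of example1 plus 5. A quick optional check you could run in the console (a sketch, not part of the lab):

example1[1:3]
example3[1:3]
all.equal(example3, example1 + 5)   # should be TRUE: same draws, plus the constant 5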

Permutation test

Exercise 1

You will do a permutation test to compare the average one bedroom apartment rent price in Berkeley and Palo Alto.

  a. Subset the dataset to only consider one bedroom apartments in Berkeley and Palo Alto.
subsetdata <- craigslist %>%
  filter(brs == 1 & (grepl("berkeley", location) | grepl("palo alto", location)))
head(subsetdata)
##                  time price size brs
## 1 2016-09-27 18:53:00  1090  400   1
## 2 2016-09-27 17:29:00  3275  706   1
## 3 2016-09-27 17:25:00  2491  735   1
## 4 2016-09-27 17:25:00  2825  716   1
## 5 2016-09-27 17:14:00  2825  716   1
## 6 2016-09-27 17:13:00  2925  787   1
##                                                             title
## 1                     Private/small studio ALL UTILITIES INCLUDED
## 2       Luxury  Convenience all at Parker Movein special Hurry in
## 3                  One Bedroom Ready For You 500 off Move In Cost
## 4                                  Ready Now for Immediate Movein
## 5               Super Sharp 1 bed/1 bath Rooftop Lounge Panoramic
## 6 Large 1 bedroom with Den  in beautiful LEEDCertified Green Bldg
##                       link location
## 1 /eby/apa/5802540177.html berkeley
## 2 /eby/apa/5802443907.html berkeley
## 3 /eby/apa/5802429182.html berkeley
## 4 /eby/apa/5802414465.html berkeley
## 5 /eby/apa/5797264384.html berkeley
## 6 /eby/apa/5768826232.html berkeley
tail(subsetdata)
##                    time price size brs
## 415 2016-10-05 14:16:00  2650  650   1
## 416 2016-10-05 11:36:00  3250  810   1
## 417 2016-10-05 11:22:00  1900   NA   1
## 418 2016-10-04 22:14:00  3000  860   1
## 419 2016-10-04 17:36:00  2695   NA   1
## 420 2016-10-04 12:07:00  2000   NA   1
##                                                                 title
## 415                                        Downtown Palo Alto 1BR 1BA
## 416     1 Bdrm Available Soon Minutes away from StanfordPet Friendly 
## 417                      Your Bungalow Awaits Home Without The Hassle
## 418                            PALO ALTO Furnished 1Bedroom Apartment
## 419 Charming Duplex Near Mid Town Washer  Dryer in unit BT Properties
## 420                       Fully Furnished Communal Nurse in a 3Br/2Ba
##                         link  location
## 415 /pen/apa/5814759365.html palo alto
## 416 /pen/apa/5814470622.html palo alto
## 417 /pen/apa/5814444097.html palo alto
## 418 /pen/apa/5813643062.html palo alto
## 419 /pen/apa/5813410731.html palo alto
## 420 /pen/apa/5807514271.html palo alto
subsetdata <- subsetdata %>%
  filter(!is.na(price))
  b. Calculate the number of postings for Berkeley in the subsetted data frame.
no.berkeley <- subsetdata %>%
  filter(location == "berkeley")
nrow(no.berkeley)
## [1] 178
head(no.berkeley)
##                  time price size brs
## 1 2016-09-27 18:53:00  1090  400   1
## 2 2016-09-27 17:29:00  3275  706   1
## 3 2016-09-27 17:25:00  2491  735   1
## 4 2016-09-27 17:25:00  2825  716   1
## 5 2016-09-27 17:14:00  2825  716   1
## 6 2016-09-27 17:13:00  2925  787   1
##                                                             title
## 1                     Private/small studio ALL UTILITIES INCLUDED
## 2       Luxury  Convenience all at Parker Movein special Hurry in
## 3                  One Bedroom Ready For You 500 off Move In Cost
## 4                                  Ready Now for Immediate Movein
## 5               Super Sharp 1 bed/1 bath Rooftop Lounge Panoramic
## 6 Large 1 bedroom with Den  in beautiful LEEDCertified Green Bldg
##                       link location
## 1 /eby/apa/5802540177.html berkeley
## 2 /eby/apa/5802443907.html berkeley
## 3 /eby/apa/5802429182.html berkeley
## 4 /eby/apa/5802414465.html berkeley
## 5 /eby/apa/5797264384.html berkeley
## 6 /eby/apa/5768826232.html berkeley
no.berkeley <- nrow(no.berkeley)   # store the number of Berkeley postings

In the body of this markdown or via a code chunk, write a sentence with the results of the number of postings for Berkeley.

There are 178 postings for Berkeley.

  c. Calculate the observed statistic, i.e., the absolute difference between the mean Berkeley and mean Palo Alto one bedroom rent prices.
# mean price for listings whose title mentions "Berkeley"
mean_berkeley <- subsetdata %>%
  filter(grepl("Berkeley", title)) %>%
  summarise(mean_price = mean(price, na.rm = TRUE)) %>%
  pull(mean_price)

# mean price for listings whose title mentions "Palo Alto"
mean_palo_alto <- subsetdata %>%
  filter(grepl("Palo Alto", title)) %>%
  summarise(mean_price = mean(price, na.rm = TRUE)) %>%
  pull(mean_price)

# observed statistic: absolute difference between the two means
observed_statistic <- abs(mean_berkeley - mean_palo_alto)
observed_statistic
## [1] 338.9298
mean_berkeley
## [1] 2243.333
mean_palo_alto
## [1] 2582.263
stat.obs <- observed_statistic   # store the observed statistic

In the body of this markdown or via a code chunk, write a sentence with the observed statistic.

The mean one-bedroom rent price is $2243.33 for Berkeley and $2582.26 for Palo Alto, so the observed statistic (the absolute difference in means) is $338.93.

  d. Calculate the permuted statistic (absolute mean difference), repeating 1000 times. HINT: use the sample function to sample from a group of observations. You can use either a for loop or the replicate function introduced above; however, it is good practice to try replicate.
set.seed(20172828)
permuted_stat <- function() {
  # shuffle the prices across all listings
  permuted_prices <- sample(subsetdata$price)
  subsetdata$permuted_price <- permuted_prices

  # mean permuted rent for listings whose title mentions "Berkeley"
  mean_berkeley_perm <- subsetdata %>%
    filter(grepl("Berkeley", title)) %>%
    summarise(mean_price = mean(permuted_price, na.rm = TRUE)) %>%
    pull(mean_price)

  # mean permuted rent for listings whose title mentions "Palo Alto"
  mean_palo_alto_perm <- subsetdata %>%
    filter(grepl("Palo Alto", title)) %>%
    summarise(mean_price = mean(permuted_price, na.rm = TRUE)) %>%
    pull(mean_price)

  # absolute difference in permuted means
  abs(mean_berkeley_perm - mean_palo_alto_perm)
}

# 1000 permutations and calculate the permuted statistics
permuted_statistics <- replicate(1000, permuted_stat())

# first few permuted statistics
head(permuted_statistics)
## [1] 130.90100  13.76817  82.90351 111.34962  40.24311 107.61153
stat.bootstrap <- permuted_statistics   # store the permuted statistics
  e. Calculate the p-value of our observed statistic using our approximation of the sampling distribution from part (d). We are testing whether there is any difference between the two means. Our null hypothesis is that there is no difference between the two means. Our alternative hypothesis is that there is a difference.
observed_statistic <- abs(mean_berkeley - mean_palo_alto)
p_value <- mean(permuted_statistics >= observed_statistic)
p_value 
## [1] 0.008
p.value <- p_value   # store the permutation p-value

In the body of this markdown or via a code chunk:

  - State the null hypothesis of your test.
  - State the alternative hypothesis of your test.
  - Write a conclusion (reject or fail to reject) at a 5% significance level. Be sure to reference your p.value.

Null Hypothesis: There is no difference between the mean rent prices for one-bedroom apartments in Berkeley and Palo Alto.

Alternative Hypothesis: There is a difference between the mean rent prices for one-bedroom apartments in Berkeley and Palo Alto.

Conclusion: We reject the null hypothesis at the 5% significance level because the p-value (0.008) is less than 0.05, indicating that there is a difference between the mean prices of one-bedroom apartments in Berkeley and Palo Alto.
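As an optional check, not required by the lab, you could visualize the permutation distribution and mark the observed statistic on it; a sketch using the objects computed above:

# histogram of the 1000 permuted statistics with the observed value marked in red
ggplot(data.frame(stat = permuted_statistics), aes(x = stat)) +
  geom_histogram(bins = 30) +
  geom_vline(xintercept = observed_statistic, colour = "red") +
  labs(x = "Permuted absolute difference in means", y = "Count",
       title = "Permutation distribution with observed statistic")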

T-test

With the function t.test, implementing t-tests is easy in R. We will illustrate using a simulated example:

# Generate 100 samples from N(0, 1)
group1 <- rnorm(100, mean = 0, sd = 1)
# Generate 100 samples from N(0.2, 1)
group2 <- rnorm(100, mean = 0.2, sd = 1)
# perform t test
t.test(group1, group2)
## 
##  Welch Two Sample t-test
## 
## data:  group1 and group2
## t = -2.5037, df = 197.99, p-value = 0.0131
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.62601481 -0.07436485
## sample estimates:
##   mean of x   mean of y 
## -0.06727757  0.28291226

t.test prints its results. To obtain the value of the t statistic or the p-value, it is generally not wise to copy and paste from the printed output. Imagine you need to run 1000 t-tests at once; copying and pasting every time would be impossible. Instead, we need to understand the object that t.test returns:

ttest.result <- t.test(group1, group2)

If you check the ttest.result object in your Environment window, you will find that it is a list of 10. Each element of the list stores a piece of the test output.

names(ttest.result)
##  [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
##  [6] "null.value"  "stderr"      "alternative" "method"      "data.name"

To get the p-value, you can use either the dollar sign or double square brackets, the two usual ways to extract elements from a list.

ttest.result$p.value    # with dollar sign, you can do tab completion
## [1] 0.01309847
ttest.result[['p.value']]
## [1] 0.01309847
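To see why programmatic extraction matters, here is a small sketch, not part of the lab, that combines replicate with the p.value element to run 1000 simulated two-sample t-tests (the object name pvals is just for illustration):

# keep only the p-values from 1000 simulated t-tests
pvals <- replicate(1000, t.test(rnorm(100), rnorm(100, mean = 0.2))$p.value)
# proportion of tests rejecting at the 5% level (estimated power for a 0.2 mean shift)
mean(pvals < 0.05)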

Exercise 2.

Now apply a t-test to compare the mean rent of one bedroom listings in Palo Alto and Berkeley.

# relabel location: listings whose title mentions "Berkeley" are labelled Berkeley,
# and all remaining listings are treated as Palo Alto
subsetdata$location <- ifelse(grepl("Berkeley", subsetdata$title), "Berkeley", "Palo Alto")
# Welch two sample t-test of price by location
t_test_result <- t.test(
  price ~ location,
  data = subsetdata
)
t_test_result
## 
##  Welch Two Sample t-test
## 
## data:  price by location
## t = -3.8583, df = 51.912, p-value = 0.0003172
## alternative hypothesis: true difference in means between group Berkeley and group Palo Alto is not equal to 0
## 95 percent confidence interval:
##  -574.1646 -181.2639
## sample estimates:
##  mean in group Berkeley mean in group Palo Alto 
##                2243.333                2621.048
rent.1b.tstat <- t_test_result$statistic
rent.1b.pvalue <- t_test_result$p.value
rent.1b.pvalue
## [1] 0.0003172079

In the body of this markdown or via a code chunk, write a sentence with the results of the t-test. What conclusion would you draw at a 5% significance level? Is the t-test valid for use in this situation? Justify your response with a graph.

ggplot(subsetdata, aes(x = price, fill = location)) +
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5, position = "identity", binwidth = 500) +
  geom_density(alpha = 0.2) +
  facet_wrap(~location) +
  labs(title = "Rent Price Distribution for Berkeley and Palo Alto", 
       x = "Rent Price", 
       y = "Density") +
  theme_minimal()

Explanation:

The p-value (0.0003172) is well below 0.05, so we reject the null hypothesis: there is a difference in the mean one-bedroom prices between Berkeley and Palo Alto. The graphs above show that both price distributions are roughly bell-shaped; although each has a mild skew, it is not severe enough to invalidate the t-test, so the test is appropriate here.
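Another common diagnostic for the normality assumption, not asked for in the lab, is a normal QQ plot for each city; a sketch:

# QQ plots of price by location; points close to the line suggest approximate normality
ggplot(subsetdata, aes(sample = price)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~location) +
  labs(title = "Normal QQ plots of rent price by location") +
  theme_minimal()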

T-distribution vs Normal distribution

The T-distribution has mean 0 and variance df/(df - 2) (for df > 2), where df is its single parameter, the degrees of freedom. For practical purposes, the degrees of freedom equal the sample size minus 1 for a one-sample T-test. For a two-sample T-test, as done above, the degrees of freedom are calculated through a formula (that you don’t need to know).
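You can check the variance formula empirically by simulating from a t-distribution with rt; a quick sketch, using 10 degrees of freedom as an example:

set.seed(1)
# empirical variance of t-distributed draws vs the theoretical df / (df - 2)
var(rt(100000, df = 10))   # should be close to 10 / (10 - 2) = 1.25
10 / (10 - 2)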

The T-distribution is similar in shape to the standard normal distribution, but the relative difference in the tail regions of the two distributions is substantial. This means that for hypothesis tests, the p-value we compute can be quite different if we use the wrong distribution as the sampling distribution for the test statistic.

The following code finds the tail probability for a T-score (or T-statistic / T-value) of -1, with 30 degrees of freedom.

pt(-1, df=30)
## [1] 0.1626543

We can see that this is different from the tail probability of a Z-score (standard normal distribution) of -1:

pnorm(-1)
## [1] 0.1586553

The difference is larger when we change the degrees of freedom to 10:

pt(-1, df=10)
## [1] 0.1704466
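To see where these tail differences come from, you could overlay the density curves of the standard normal and two t-distributions; a sketch, not required for the lab (the objects x and dens are just for illustration):

# compare the standard normal density with t densities at 30 and 10 degrees of freedom
x <- seq(-4, 4, length.out = 200)
dens <- data.frame(
  x = rep(x, 3),
  density = c(dnorm(x), dt(x, df = 30), dt(x, df = 10)),
  distribution = rep(c("N(0, 1)", "t, df = 30", "t, df = 10"), each = 200)
)
ggplot(dens, aes(x = x, y = density, colour = distribution)) +
  geom_line() +
  labs(title = "Standard normal vs t densities")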

Exercise 3.

  a. Find the tail probability of a T-score of -2 with 30 degrees of freedom. Divide this by the tail probability of a Z-score of -2.
t_tail_prob <- pt(-2, df = 30)
z_tail_prob <- pnorm(-2)
e3a <- t_tail_prob / z_tail_prob
cat("T-tail probability:", t_tail_prob, "\n")
## T-tail probability: 0.02731252
cat("Z-tail probability:", z_tail_prob, "\n")
## Z-tail probability: 0.02275013
cat("The ratio (e3a) is:", e3a, "\n")
## The ratio (e3a) is: 1.200543

Display the results of your tail probability using either a statement in the markdown body and/or a code chunk.

The T tail probability is 0.02731252, the Z tail probability is 0.02275013, and the ratio (e3a) is 1.200543.

  b. Repeat part (a) but for 10 degrees of freedom.
t_tail_prob2 <- pt(-2, df = 10)
z_tail_prob2 <- pnorm(-2)
e3b <- t_tail_prob2 / z_tail_prob2
cat("T-tail probability:", t_tail_prob2, "\n")
## T-tail probability: 0.03669402
cat("Z-tail probability:", z_tail_prob2, "\n")
## Z-tail probability: 0.02275013
cat("The ratio (e3b) is:", e3b, "\n")
## The ratio (e3b) is: 1.612914

Display the results of your tail probability using either a statement in the markdown body and/or a code chunk.

The T tail probability is 0.03669402, the Z tail probability is 0.02275013, and the ratio (e3b) is 1.612914.

It is common to run a hypothesis test with a significance level of 5 percent, or 0.05. This corresponds to rejecting the null hypothesis if we are roughly more than 2 standard errors from the mean, i.e., a Z-score of less than -2 or more than 2. However, if we do not know the true standard error and instead use the sample standard deviation to estimate it, then we must use the T-distribution. You can see from the calculations above that in this case the true p-value is much greater than it would appear under the standard normal distribution. This is the effect of the uncertainty that comes from not knowing the true population standard deviation.
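To make this concrete, here is a small sketch comparing the two-sided p-values for a test statistic of 2 under the two distributions:

# two-sided p-values for a test statistic of 2
2 * pnorm(-2)        # standard normal: about 0.046, just below 0.05
2 * pt(-2, df = 10)  # t with 10 df: about 0.073, above 0.05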