library(tidyverse)
## Warning: package 'dplyr' was built under R version 4.4.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Welcome to Lab 3 (Week 7). In this lab, you will:
implement a permutation test;
implement a T-test;
illustrate the difference between a T-distribution and the standard normal distribution.
We will return to using the Craigslist data from the San Francisco Bay Area, which we’ve used in previous classes and labs.
Read in data:
craigslist <- read.csv("craigslist.csv",
                       header = TRUE)
To repeat things in R, there is a more convenient and efficient way than writing a for loop. The function replicate is designed to evaluate an expression repeatedly. For example, to get a vector of length 50 where each element is independently generated as the sum of five standard normal samples:
set.seed(20170209)
example1 <- replicate(50, sum(rnorm(5)))
Go to the console window to have a look at the ‘example1’ vector. We don’t want or need the results to show up in this markdown file.
The equivalent of the above code using a for loop would be:
# create a vector of length 0 to store the results
example2 <- c()
# repeat the calculation 50 times
for (i in 1:50) {
  # append the calculated sum of normal samples to the end of the `example2` vector
  example2 <- c(example2, sum(rnorm(5)))
}
Go to the console window to have a look at the ‘example2’ vector.
The first argument of the replicate function is the number of replications, and the second argument is an expression that will be evaluated repeatedly. That expression usually involves random sampling or simulation; otherwise, you would just get the same number repeated, which is useless. Sometimes the expression you want to replicate cannot fit on one line, in which case you can wrap it in curly braces, as in the following example. Each element of the example3 vector will be the value of the last expression inside the braces. (Since we set the same seed for example1 and example3, you may want to compare their values in the console window to see what happened.)
set.seed(20170209)
example3 <- replicate(50, {
  a <- rnorm(5)
  b <- 5
  sum(a) + b # return the value
})
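To see the point about randomness above: if the expression is deterministic, replicate just repeats the same value (a throwaway illustration, not part of the lab):
replicate(5, sum(1:5))  # 15 15 15 15 15 -- the same value five times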
Exercise 1
You will do a permutation test to compare the average one bedroom apartment rent price in Berkeley and Palo Alto.
subsetdata <- craigslist %>%
  filter(brs == 1 & (grepl("berkeley", location) | grepl("palo alto", location)))
head(subsetdata)
## time price size brs
## 1 2016-09-27 18:53:00 1090 400 1
## 2 2016-09-27 17:29:00 3275 706 1
## 3 2016-09-27 17:25:00 2491 735 1
## 4 2016-09-27 17:25:00 2825 716 1
## 5 2016-09-27 17:14:00 2825 716 1
## 6 2016-09-27 17:13:00 2925 787 1
## title
## 1 Private/small studio ALL UTILITIES INCLUDED
## 2 Luxury Convenience all at Parker Movein special Hurry in
## 3 One Bedroom Ready For You 500 off Move In Cost
## 4 Ready Now for Immediate Movein
## 5 Super Sharp 1 bed/1 bath Rooftop Lounge Panoramic
## 6 Large 1 bedroom with Den in beautiful LEEDCertified Green Bldg
## link location
## 1 /eby/apa/5802540177.html berkeley
## 2 /eby/apa/5802443907.html berkeley
## 3 /eby/apa/5802429182.html berkeley
## 4 /eby/apa/5802414465.html berkeley
## 5 /eby/apa/5797264384.html berkeley
## 6 /eby/apa/5768826232.html berkeley
tail(subsetdata)
## time price size brs
## 415 2016-10-05 14:16:00 2650 650 1
## 416 2016-10-05 11:36:00 3250 810 1
## 417 2016-10-05 11:22:00 1900 NA 1
## 418 2016-10-04 22:14:00 3000 860 1
## 419 2016-10-04 17:36:00 2695 NA 1
## 420 2016-10-04 12:07:00 2000 NA 1
## title
## 415 Downtown Palo Alto 1BR 1BA
## 416 1 Bdrm Available Soon Minutes away from StanfordPet Friendly
## 417 Your Bungalow Awaits Home Without The Hassle
## 418 PALO ALTO Furnished 1Bedroom Apartment
## 419 Charming Duplex Near Mid Town Washer Dryer in unit BT Properties
## 420 Fully Furnished Communal Nurse in a 3Br/2Ba
## link location
## 415 /pen/apa/5814759365.html palo alto
## 416 /pen/apa/5814470622.html palo alto
## 417 /pen/apa/5814444097.html palo alto
## 418 /pen/apa/5813643062.html palo alto
## 419 /pen/apa/5813410731.html palo alto
## 420 /pen/apa/5807514271.html palo alto
subsetdata <- subsetdata %>%
  filter(!is.na(price))
subset.no.berkeley <- subsetdata %>%
  filter(location == "berkeley")
nrow(subset.no.berkeley)
## [1] 178
head(subset.no.berkeley)
## time price size brs
## 1 2016-09-27 18:53:00 1090 400 1
## 2 2016-09-27 17:29:00 3275 706 1
## 3 2016-09-27 17:25:00 2491 735 1
## 4 2016-09-27 17:25:00 2825 716 1
## 5 2016-09-27 17:14:00 2825 716 1
## 6 2016-09-27 17:13:00 2925 787 1
## title
## 1 Private/small studio ALL UTILITIES INCLUDED
## 2 Luxury Convenience all at Parker Movein special Hurry in
## 3 One Bedroom Ready For You 500 off Move In Cost
## 4 Ready Now for Immediate Movein
## 5 Super Sharp 1 bed/1 bath Rooftop Lounge Panoramic
## 6 Large 1 bedroom with Den in beautiful LEEDCertified Green Bldg
## link location
## 1 /eby/apa/5802540177.html berkeley
## 2 /eby/apa/5802443907.html berkeley
## 3 /eby/apa/5802429182.html berkeley
## 4 /eby/apa/5802414465.html berkeley
## 5 /eby/apa/5797264384.html berkeley
## 6 /eby/apa/5768826232.html berkeley
no.berkeley <- nrow(subset.no.berkeley)  # store the answer: the number of Berkeley postings
In the body of this markdown or via a code chunk, write a sentence with the results of the number of postings for Berkeley.
There are 178 postings for Berkeley.
# mean rent for listings whose title mentions "Berkeley"
mean_berkeley <- subsetdata %>%
  filter(grepl("Berkeley", title)) %>%
  summarise(mean_price = mean(price, na.rm = TRUE)) %>%
  pull(mean_price)
# mean rent for listings whose title mentions "Palo Alto"
mean_palo_alto <- subsetdata %>%
  filter(grepl("Palo Alto", title)) %>%
  summarise(mean_price = mean(price, na.rm = TRUE)) %>%
  pull(mean_price)
observed_statistic <- abs(mean_berkeley - mean_palo_alto)
observed_statistic
## [1] 338.9298
mean_berkeley
## [1] 2243.333
mean_palo_alto
## [1] 2582.263
stat.obs <- observed_statistic  # store the observed test statistic
In the body of this markdown or via a code chunk, write a sentence with the results of the number of postings for Berkeley.
There are 178 postings for Berkeley, with a mean price of $2243.33.
You will need the sample function to sample from a group of observations. You can use either a for loop or the replicate function introduced above; however, it is good practice to try replicate. (A brief illustration of sample appears right after the solution code below.)
set.seed(20172828)
permuted_stat <- function() {
  # shuffle the prices across all listings
  permuted_prices <- sample(subsetdata$price)
  # attach the permuted prices to the data
  subsetdata$permuted_price <- permuted_prices
  # mean rent for Berkeley (permuted)
  mean_berkeley_perm <- subsetdata %>%
    filter(grepl("Berkeley", title)) %>%
    summarise(mean_price = mean(permuted_price, na.rm = TRUE)) %>%
    pull(mean_price)
  # mean rent for Palo Alto (permuted)
  mean_palo_alto_perm <- subsetdata %>%
    filter(grepl("Palo Alto", title)) %>%
    summarise(mean_price = mean(permuted_price, na.rm = TRUE)) %>%
    pull(mean_price)
  # absolute difference in means
  abs(mean_berkeley_perm - mean_palo_alto_perm)
}
# 1000 permutations and calculate the permuted statistics
permuted_statistics <- replicate(1000, permuted_stat())
# first few permuted statistics
head(permuted_statistics)
## [1] 130.90100 13.76817 82.90351 111.34962 40.24311 107.61153
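As an aside (a throwaway illustration, not part of the lab solution): when sample is given only a vector, it returns a random permutation of that vector, which is exactly the shuffling step used inside permuted_stat above.
sample(letters[1:5])  # e.g. "d" "a" "c" "e" "b" -- the order changes from run to run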
stat.bootstrap <- permuted_statistics  # store the permutation (null) distribution
observed_statistic <- abs(mean_berkeley - mean_palo_alto)
p_value <- mean(permuted_statistics >= observed_statistic)
p_value
## [1] 0.008
p.value <- p_value  # store the permutation test p-value
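To see where the observed statistic falls relative to the permutation distribution, here is an optional plotting sketch (not required by the lab; it reuses the permuted_statistics and observed_statistic objects from above):
ggplot(data.frame(stat = permuted_statistics), aes(x = stat)) +
  geom_histogram(bins = 30) +
  geom_vline(xintercept = observed_statistic, colour = "red") +
  labs(x = "Permuted |difference in means|", y = "Count",
       title = "Permutation distribution with the observed statistic in red") +
  theme_minimal()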
In the body of this markdown or via a code chunk: - State the null hypothesis of your test. - State the alternative hypothesis of your test. - Write a conclusion (reject or fail to reject) at a 5% significance level. Be sure to reference your p.value.
Null Hypothesis: There is no difference between the mean rent prices for one-bedroom apartments in Berkeley and Palo Alto.
Alternative Hypothesis: There is a difference between the mean rent prices for one-bedroom apartments in Berkeley and Palo Alto.
Conclusion: We reject the null hypothesis at the 5% significance level because the p-value (0.008) is less than 0.05, indicating that there is a difference between the mean prices of one-bedroom apartments in Berkeley and Palo Alto.
With the function t.test, implementing t-tests is easy
in R. We will illustrate using a simulated example:
# Generate 100 samples from N(0, 1)
group1 <- rnorm(100, mean = 0, sd = 1)
# Generate 100 samples from N(0.2, 1)
group2 <- rnorm(100, mean = 0.2, sd = 1)
# perform t test
t.test(group1, group2)
##
## Welch Two Sample t-test
##
## data: group1 and group2
## t = -2.5037, df = 197.99, p-value = 0.0131
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.62601481 -0.07436485
## sample estimates:
## mean of x mean of y
## -0.06727757 0.28291226
t.test prints its results to the console. To obtain the value of the t statistic or the p-value, it is generally not wise to copy and paste from the printed output. Imagine you needed to run 1000 t-tests at once: copying and pasting each result would be impractical. Instead, we need to understand the object that t.test returns:
ttest.result <- t.test(group1, group2)
If you check the ttest.result object in your Environment window, you will find that it is a list of 10, and each element of the list stores a piece of information about the test.
names(ttest.result)
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "stderr" "alternative" "method" "data.name"
To get the p-value, you can use either the dollar sign or double square brackets, which are the two usual ways to extract elements from a list.
ttest.result$p.value # with dollar sign, you can do tab completion
## [1] 0.01309847
ttest.result[['p.value']]
## [1] 0.01309847
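This programmatic extraction is what makes large simulations practical. Below is a minimal sketch (freshly simulated data; the object names are illustrative, not part of the lab) that combines replicate with $p.value to collect 1000 p-values at once:
set.seed(1)
# run 1000 simulated two-sample t-tests, keeping only the p-value from each
many_pvalues <- replicate(1000, {
  g1 <- rnorm(100, mean = 0, sd = 1)
  g2 <- rnorm(100, mean = 0.2, sd = 1)
  t.test(g1, g2)$p.value
})
mean(many_pvalues < 0.05)  # proportion of tests that reject at the 5% level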
Exercise 2.
Now apply a t-test to compare the mean rent of one-bedroom listings in Palo Alto and Berkeley.
# label each listing by whether its title mentions "Berkeley"; all other listings are treated as Palo Alto
subsetdata$location <- ifelse(grepl("Berkeley", subsetdata$title), "Berkeley", "Palo Alto")
t_test_result <- t.test(
  price ~ location,
  data = subsetdata
)
t_test_result
##
## Welch Two Sample t-test
##
## data: price by location
## t = -3.8583, df = 51.912, p-value = 0.0003172
## alternative hypothesis: true difference in means between group Berkeley and group Palo Alto is not equal to 0
## 95 percent confidence interval:
## -574.1646 -181.2639
## sample estimates:
## mean in group Berkeley mean in group Palo Alto
## 2243.333 2621.048
rent.1b.tstat <- t_test_result$statistic  # store the t statistic
rent.1b.pvalue <- t_test_result$p.value
rent.1b.pvalue
## [1] 0.0003172079
In the body of this markdown or via a code chunk, write a sentence with the results of the t-test. What conclusion would you draw at a 5% significance level? Is the t-test valid for use in this situation? Justify your response with a graph.
ggplot(subsetdata, aes(x = price, fill = location)) +
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5, position = "identity", binwidth = 500) +
  geom_density(alpha = 0.2) +
  facet_wrap(~location) +
  labs(title = "Rent Price Distribution for Berkeley and Palo Alto",
       x = "Rent Price",
       y = "Density") +
  theme_minimal()
The p-value (0.0003172) indicates that we should reject the null hypothesis: there is a difference in the mean prices between Berkeley and Palo Alto. The graph above shows that the distributions are roughly bell-shaped, and although there is a mild skew, it is not drastic enough to invalidate the t-test, so the t-test is appropriate here.
The T-distribution has mean 0 and variance df/(df - 2) (for df > 2). It is parameterized by just one parameter, the degrees of freedom (df). For practical purposes, the degrees of freedom is equal to the sample size minus 1 for a one-sample T-test. For a two-sample T-test, as done above, the degrees of freedom is calculated through a formula (that you don’t need to know).
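As a quick numerical sanity check of the variance claim (a sketch; 5 degrees of freedom is an arbitrary choice):
set.seed(123)
deg_free <- 5
var(rt(1e6, df = deg_free))  # sample variance of a million t draws, close to the value below
deg_free / (deg_free - 2)    # theoretical variance: 1.666667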
The T-distribution is similar in shape to the standard normal distribution, but the relative difference in the tail regions of the two distributions is substantial. This means that if we use the wrong distribution as the sampling distribution of our test statistic, the p-value we compute can be quite different.
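One way to see the tail difference directly is to overlay the two density curves (a sketch; 5 degrees of freedom is chosen to make the heavier tails visible):
x <- seq(-4, 4, length.out = 400)
dens <- data.frame(
  x = rep(x, 2),
  density = c(dnorm(x), dt(x, df = 5)),
  dist = rep(c("N(0, 1)", "t, 5 df"), each = length(x))
)
ggplot(dens, aes(x = x, y = density, colour = dist)) +
  geom_line() +
  labs(title = "Standard normal vs. t-distribution (5 df)", colour = NULL) +
  theme_minimal()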
The following code finds the tail probability for a T-score (or T-statistic / T-value) of -1, with 30 degrees of freedom.
pt(-1, df=30)
## [1] 0.1626543
We can see that this is different from the corresponding probability for a Z-score (standard normal distribution) of -1:
pnorm(-1)
## [1] 0.1586553
The difference is larger when we change the degrees of freedom to 10
pt(-1, df=10)
## [1] 0.1704466
Exercise 3.
t_tail_prob <- pt(-2, df = 30)
z_tail_prob <- pnorm(-2)
e3a <- t_tail_prob / z_tail_prob
cat("T-tail probability:", t_tail_prob, "\n")
## T-tail probability: 0.02731252
cat("Z-tail probability:", z_tail_prob, "\n")
## Z-tail probability: 0.02275013
cat("The ratio (e3a) is:", e3a, "\n")
## The ratio (e3a) is: 1.200543
Display the results of your tail probability using either a statement in the markdown body and/or a code chunk.
T-tail probability: 0.02731252; Z-tail probability: 0.02275013; the ratio (e3a) is 1.200543.
t_tail_prob2 <- pt(-2, df = 10)
z_tail_prob2 <- pnorm(-2)
e3b <- t_tail_prob2 / z_tail_prob2
cat("T-tail probability:", t_tail_prob2, "\n")
## T-tail probability: 0.03669402
cat("Z-tail probability:", z_tail_prob2, "\n")
## Z-tail probability: 0.02275013
cat("The ratio (e3b) is:", e3b, "\n")
## The ratio (e3b) is: 1.612914
Display the results of your tail probability using either a statement in the markdown body and/or a code chunk.
T-tail probability: 0.03669402; Z-tail probability: 0.02275013; the ratio (e3b) is 1.612914.
It is common to run a hypothesis test at a significance level of 5 percent, or 0.05. This corresponds to rejecting the null hypothesis when we are roughly 2 standard errors from the mean, i.e. a Z-score below -2 or above 2. However, if we do not know the true standard error and instead estimate it with the sample standard deviation, then we must use the T-distribution. You can see from the calculations above that, in this case, the true p-value is noticeably larger than we would think it is under the standard normal distribution. This is the effect of the extra uncertainty that comes from not knowing the true population standard deviation.
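To make this concrete, compare the two-sided 5% critical values, and the actual tail area beyond +/- 2, under each distribution (10 degrees of freedom is used purely as an illustration):
qnorm(0.975)         # about 1.96: normal cutoff for a two-sided 5% test
qt(0.975, df = 10)   # about 2.23: with a t-distribution you must be further from 0 to reject
2 * pnorm(-2)        # about 0.046: two-sided tail area beyond +/- 2 under the normal
2 * pt(-2, df = 10)  # about 0.073: the same tail area under the t with 10 degrees of freedom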