We’ll be using the AirBnB data set from New York City in 2019. It lists every apartment available in New York City in 2019.
The relevant columns are:
In this homework, you’ll write a function to calculate a confidence interval for the difference in average price between apartments in Manhattan vs Brooklyn. Then using that function, the simulate collecting 10,000 different samples and calculating a 95% confidence interval for each sample and see if almost 95% of the intervals contain the true difference in average price between the two boroughs.
Save a data set named abnb_clean that is:
abnb_full |> 
  # Keeping the specified rows
  filter(
    room_type == "Entire home/apt",
    neighbourhood_group %in% c("Brooklyn", "Manhattan"),
    availability_365 >= 10,
    price <= 1000
  ) |> 
  # Keeping the specified columns
  dplyr::select(
    id, 
    host_id,
    host_name,
    borough = neighbourhood_group,
    price, 
    minimum_nights
  ) ->
  
  abnb_clean
# Displaying the results
tibble(abnb_clean)
## # A tibble: 12,976 × 6
##       id host_id host_name        borough   price minimum_nights
##    <int>   <int> <chr>            <chr>     <int>          <int>
##  1  2595    2845 Jennifer         Manhattan   225              1
##  2  3831    4869 LisaRoxanne      Brooklyn     89              1
##  3  5099    7322 Chris            Manhattan   200              3
##  4  5238    7549 Ben              Manhattan   150              1
##  5  6848   15991 Allen & Irina    Brooklyn    140              2
##  6  7097   17571 Jane             Brooklyn    215              2
##  7  7726   20950 Adam And Charity Brooklyn     99              3
##  8  7750   17985 Sing             Manhattan   190              7
##  9  8490   25183 Nathalie         Brooklyn    120              2
## 10  9357   30193 Tommi            Manhattan   150             10
## # ℹ 12,966 more rows
The code below will take a random sample of 5% of each Manhattan and Brooklyn apartments (called stratified sampling) and save it as abnb_sample. Use is as a test case for question 2 below, with price as y and borough for the groups
# Making sure the sample is the same for everyone
RNGversion("4.1.0"); set.seed(1234)
# Taking the stratified sample
abnb_clean |> 
  slice_sample(
    by = borough,
    prop = 0.01
  ) ->
  abnb_sample
# Checking the results:
abnb_sample |> 
  summarize(
    .by = borough,
    apts = n(),
    price_avg = mean(price) |> scales::dollar(),
    price_sd  = sd(price)   |> scales::dollar()
  ) |> 
  gt::gt()
| borough | apts | price_avg | price_sd | 
|---|---|---|---|
| Manhattan | 75 | $254.71 | $197.08 | 
| Brooklyn | 54 | $177.09 | $94.10 | 
You’ll write a function named two_sample_CI that
takes two vectors (y & groups) and calculates a confidence interval
for the difference between the two means.
The function will have 3 arguments:
y = the vector of numeric values to calculate the
confidence interval fromgroups = the vector that indicates which group the
values in y belong toCL = the confidence level of the confidence interval in
proportions (default = 0.95)The function should return the following 4 objects (with the correct names)
averages = a vector with the mean of each group: \(\bar{y}_1\) and \(\bar{y}_2\)diff_avg = difference in means: \(\bar{y}_1 - \bar{y}_2\)diff_SE = the standard error of the difference in
means: \(\textrm{SE}(\bar{y}_1 - \bar{y}_2) =
\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\)diff_CI = a vector that has the confidence interval for
the difference in means:\[(\bar{y}_1 - \bar{y}_2) \pm z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\]
To calculate \(z_{\alpha/2}\), use
z = qnorm(p = (1-CL)/2, lower.tail = F)
two_sample_CI <-
  function(y, groups, CL = 0.95){
    ## Calculating the summary stats
    # Joining y and groups as a data.frame to use the dplyr verbs
    data.frame(
      y,
      groups 
    ) |> 
      # Calculating n, mean, and sd for each group
      summarize(
        .by = groups,
        n = n(),
        y_bar = mean(y),
        sd = sd(y)
      ) ->
      summary_stats
    
    # Calculating the needed values for the CI
    summary_stats |> 
      # Calculating the variance for each group
      mutate(variance = sd^2/n) |> 
      
      # Calculating the difference in means and the standard error
      summarize(
        avg_diff = abs(diff(y_bar)),
        diff_SE = sqrt(sum(variance))
      ) |> 
      # Calculating the lower and upper bound of the CI
      mutate(
        ci_lower = avg_diff - qnorm(p = (1-CL)/2, lower.tail = F)*diff_SE,
        ci_upper = avg_diff + qnorm(p = (1-CL)/2, lower.tail = F)*diff_SE
      ) ->
      ci_values
    
    return(
      list(
        averages = summary_stats$y_bar,
        diff_avg = ci_values$avg_diff,
        diff_SE  = ci_values$diff_SE,
        diff_CI  = c(ci_values$ci_lower, ci_values$ci_upper)
      )
    )
  }
Use the code chunk below to check the results:
# 90% CI
two_sample_CI(
  y = abnb_sample$price, 
  groups = abnb_sample$borough, 
  CL = 0.90
)
## $averages
## [1] 254.7067 177.0926
## 
## $diff_avg
## [1] 77.61407
## 
## $diff_SE
## [1] 26.11239
## 
## $diff_CI
## [1]  34.66302 120.56513
# 95% CI
two_sample_CI(
  y = abnb_sample$price, 
  groups = abnb_sample$borough
)
## $averages
## [1] 254.7067 177.0926
## 
## $diff_avg
## [1] 77.61407
## 
## $diff_SE
## [1] 26.11239
## 
## $diff_CI
## [1]  26.43474 128.79341
# 99% CI
two_sample_CI(
  y = abnb_sample$price, 
  groups = abnb_sample$borough, 
  CL = 0.99
)
## $averages
## [1] 254.7067 177.0926
## 
## $diff_avg
## [1] 77.61407
## 
## $diff_SE
## [1] 26.11239
## 
## $diff_CI
## [1]  10.35302 144.87513
Write a simulation that will take a random sample of 1% from the abnb_clean NYC apartments data set and use the function from question 2 to calculate the 90% CI. Repeat the process a total of 10,000 times
Store the results in a data.frame named sim_results that has 10,000 rows and six columns:
The results should match what is in the solutions in Brightspace
# How many loops to perform
m <- 10000
# Create the data frame to store the results
sim_results <- 
  data.frame(
    Manhattan_avg = rep(-1, m),
    Brooklyn_avg  = rep(-1, m),
    diff_avg      = rep(-1, m),
    diff_SE       = rep(-1, m),
    CI_lower      = rep(-1, m),
    CI_upper      = rep(-1, m)
  )
tibble(sim_results)
## # A tibble: 10,000 × 6
##    Manhattan_avg Brooklyn_avg diff_avg diff_SE CI_lower CI_upper
##            <dbl>        <dbl>    <dbl>   <dbl>    <dbl>    <dbl>
##  1            -1           -1       -1      -1       -1       -1
##  2            -1           -1       -1      -1       -1       -1
##  3            -1           -1       -1      -1       -1       -1
##  4            -1           -1       -1      -1       -1       -1
##  5            -1           -1       -1      -1       -1       -1
##  6            -1           -1       -1      -1       -1       -1
##  7            -1           -1       -1      -1       -1       -1
##  8            -1           -1       -1      -1       -1       -1
##  9            -1           -1       -1      -1       -1       -1
## 10            -1           -1       -1      -1       -1       -1
## # ℹ 9,990 more rows
Run the simulation where each loop should:
You can check the results in Brightspace to see if your results match
# Making sure the results are the same every time
RNGversion("4.1.0"); set.seed(2870)
# Write the simulation below:
for (i in 1:m){
  # Getting the sample from abnb_clean
    loop_sample <- 
      abnb_clean |> 
      slice_sample(
        by = borough,
        prop = 0.01
      )
    
    # Using the function to create the summary stats
    loop_results <- 
      two_sample_CI(
        y = loop_sample$price,
        groups = loop_sample$borough,
        CL = 0.90
      )
    
    ## Storing the results in the columns of sim_results
    # The individual means
    sim_results[i, 1:2] <- loop_results$averages 
    
    # The difference in means
    sim_results[i, 3] <- loop_results$diff_avg
    
    # The SE of the difference in means
    sim_results[i, 4] <- loop_results$diff_SE
    
    # The confidence interval
    sim_results[i, 5:6] <- loop_results$diff_CI
}
tibble(sim_results)
## # A tibble: 10,000 × 6
##    Manhattan_avg Brooklyn_avg diff_avg diff_SE CI_lower CI_upper
##            <dbl>        <dbl>    <dbl>   <dbl>    <dbl>    <dbl>
##  1          251.         196.     54.5    24.7   13.9       95.2
##  2          244.         186.     58.3    26.4   14.9      102. 
##  3          211.         181.     30.1    21.6   -5.45      65.7
##  4          249.         229.     20.8    33.0  -33.6       75.1
##  5          253.         178.     75.2    23.9   35.9      115. 
##  6          222.         139.     82.9    15.1   58.0      108. 
##  7          245.         162.     82.9    19.7   50.5      115. 
##  8          252.         212.     40.0    24.5   -0.415     80.3
##  9          223.         186.     37.7    21.5    2.34      73.0
## 10          228.         167.     61.3    17.3   32.8       89.7
## # ℹ 9,990 more rows
Calculate, save, and display the true difference in the average price between apartments in Manhattan vs Brooklyn in the abnb_clean data set, then calculate the percentage of the 10,000 intervals created in question 3B.
# Calculating the true difference
abnb_clean |> 
  # The individual means
  summarize(
    .by = borough,
    price_avg = mean(price)
  ) |> 
  # The difference
  summarize(diff_avg = price_avg |> diff() |> abs()) |> 
  pull(diff_avg) ->
  diff_in_means
diff_in_means
## [1] 64.6262
# Calculating the percentage of intervals that contain the true difference
sim_results |> 
  mutate(contains_true = CI_lower <= diff_in_means & CI_upper >= diff_in_means) |> 
  summarize(interval_prop = mean(contains_true))
##   interval_prop
## 1        0.9009
sim_results |> 
  filter(CI_lower <= diff_in_means & CI_upper >= diff_in_means) |> 
  nrow()
## [1] 9009
Does the simulation indicate that the 90% confidence intervals are accurate?
Create the graph in Brightspace that illustrates how accurate the last 100 confidence intervals are. The graph should have:
sim_results |> 
  
  # Picking the last 100 rows
  slice_tail(
    n = 100
  ) |> 
  
  # Specifying the colors to use if the CI contains the parameter or not
  mutate(
    line_color = if_else(CI_lower < diff_in_means & CI_upper > diff_in_means,
                         "steelblue",
                         "tomato"),
    row_num = row_number()
  ) |> 
  
  ## Creating the graph
  ggplot(
    mapping = aes(
      y = row_num
    )
  ) + 
  
  # Drawing the line segment for the CI
  geom_segment(
    mapping = aes(
      x = CI_lower,
      y = row_num,
      xend = CI_upper,
      yend = row_num,
      color = line_color
    ),
    linewidth = 1
  ) + 
  
  # Adding the difference in means
  geom_point(
    mapping = aes(
      x = diff_avg
    ),
    size = 2
  ) + 
  
  # Adding the vertical line at the true difference
  geom_vline(
    xintercept = diff_in_means,
    linetype = "dashed",
    linewidth = 1
  ) +
  
  # Changing the labels on the graph
  labs(
    title = "Results of 90% Confidence intervals for a difference in means",
    x = "Difference in average price of apartments in Manhattan and Brooklyn",
    y = NULL
  ) +
  
  # Making the colors match the value in the column
  scale_color_identity() + 
 
  # Removing the values on the y-axis
  scale_y_continuous(
    breaks = NULL,
    expand = c(0.01, 0)
  ) +
   
  # Changing the breaks on the x-axis
  scale_x_continuous(
    breaks = seq(-75, 200, 25),
    minor_breaks = NULL,
    labels = scales::label_dollar()
  ) +
  
  theme_bw() + 
  
  theme(
    plot.title = element_text(hjust = 0.5),
#    axis.ticks.y = element_blank()
  )