Data Description

We’ll be using the AirBnB data set from New York City in 2019. It lists every apartment available in New York City in 2019.

The relevant columns are:

In this homework, you’ll write a function to calculate a confidence interval for the difference in average price between apartments in Manhattan vs Brooklyn. Then using that function, the simulate collecting 10,000 different samples and calculating a 95% confidence interval for each sample and see if almost 95% of the intervals contain the true difference in average price between the two boroughs.

Question 1: Data Cleaning and sampling

Save a data set named abnb_clean that is:

abnb_full |> 
  # Keeping the specified rows
  filter(
    room_type == "Entire home/apt",
    neighbourhood_group %in% c("Brooklyn", "Manhattan"),
    availability_365 >= 10,
    price <= 1000
  ) |> 
  # Keeping the specified columns
  dplyr::select(
    id, 
    host_id,
    host_name,
    borough = neighbourhood_group,
    price, 
    minimum_nights
  ) ->
  
  abnb_clean

# Displaying the results
tibble(abnb_clean)
## # A tibble: 12,976 × 6
##       id host_id host_name        borough   price minimum_nights
##    <int>   <int> <chr>            <chr>     <int>          <int>
##  1  2595    2845 Jennifer         Manhattan   225              1
##  2  3831    4869 LisaRoxanne      Brooklyn     89              1
##  3  5099    7322 Chris            Manhattan   200              3
##  4  5238    7549 Ben              Manhattan   150              1
##  5  6848   15991 Allen & Irina    Brooklyn    140              2
##  6  7097   17571 Jane             Brooklyn    215              2
##  7  7726   20950 Adam And Charity Brooklyn     99              3
##  8  7750   17985 Sing             Manhattan   190              7
##  9  8490   25183 Nathalie         Brooklyn    120              2
## 10  9357   30193 Tommi            Manhattan   150             10
## # ℹ 12,966 more rows

The code below will take a random sample of 5% of each Manhattan and Brooklyn apartments (called stratified sampling) and save it as abnb_sample. Use is as a test case for question 2 below, with price as y and borough for the groups

# Making sure the sample is the same for everyone
RNGversion("4.1.0"); set.seed(1234)

# Taking the stratified sample
abnb_clean |> 
  slice_sample(
    by = borough,
    prop = 0.01
  ) ->
  abnb_sample

# Checking the results:
abnb_sample |> 
  summarize(
    .by = borough,
    apts = n(),
    price_avg = mean(price) |> scales::dollar(),
    price_sd  = sd(price)   |> scales::dollar()
  ) |> 
  gt::gt()
borough apts price_avg price_sd
Manhattan 75 $254.71 $197.08
Brooklyn 54 $177.09 $94.10

Question 2: Writing a function to calculate a two-sample z-interval

You’ll write a function named two_sample_CI that takes two vectors (y & groups) and calculates a confidence interval for the difference between the two means.

The function will have 3 arguments:

  1. y = the vector of numeric values to calculate the confidence interval from
  2. groups = the vector that indicates which group the values in y belong to
  3. CL = the confidence level of the confidence interval in proportions (default = 0.95)

The function should return the following 4 objects (with the correct names)

\[(\bar{y}_1 - \bar{y}_2) \pm z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\]

To calculate \(z_{\alpha/2}\), use z = qnorm(p = (1-CL)/2, lower.tail = F)

two_sample_CI <-
  function(y, groups, CL = 0.95){
    ## Calculating the summary stats
    # Joining y and groups as a data.frame to use the dplyr verbs
    data.frame(
      y,
      groups 
    ) |> 
      # Calculating n, mean, and sd for each group
      summarize(
        .by = groups,
        n = n(),
        y_bar = mean(y),
        sd = sd(y)
      ) ->
      summary_stats
    
    # Calculating the needed values for the CI
    summary_stats |> 
      # Calculating the variance for each group
      mutate(variance = sd^2/n) |> 
      
      # Calculating the difference in means and the standard error
      summarize(
        avg_diff = abs(diff(y_bar)),
        diff_SE = sqrt(sum(variance))
      ) |> 
      # Calculating the lower and upper bound of the CI
      mutate(
        ci_lower = avg_diff - qnorm(p = (1-CL)/2, lower.tail = F)*diff_SE,
        ci_upper = avg_diff + qnorm(p = (1-CL)/2, lower.tail = F)*diff_SE
      ) ->
      ci_values
    
    return(
      list(
        averages = summary_stats$y_bar,
        diff_avg = ci_values$avg_diff,
        diff_SE  = ci_values$diff_SE,
        diff_CI  = c(ci_values$ci_lower, ci_values$ci_upper)
      )
    )
  }

Use the code chunk below to check the results:

# 90% CI
two_sample_CI(
  y = abnb_sample$price, 
  groups = abnb_sample$borough, 
  CL = 0.90
)
## $averages
## [1] 254.7067 177.0926
## 
## $diff_avg
## [1] 77.61407
## 
## $diff_SE
## [1] 26.11239
## 
## $diff_CI
## [1]  34.66302 120.56513
# 95% CI
two_sample_CI(
  y = abnb_sample$price, 
  groups = abnb_sample$borough
)
## $averages
## [1] 254.7067 177.0926
## 
## $diff_avg
## [1] 77.61407
## 
## $diff_SE
## [1] 26.11239
## 
## $diff_CI
## [1]  26.43474 128.79341
# 99% CI
two_sample_CI(
  y = abnb_sample$price, 
  groups = abnb_sample$borough, 
  CL = 0.99
)
## $averages
## [1] 254.7067 177.0926
## 
## $diff_avg
## [1] 77.61407
## 
## $diff_SE
## [1] 26.11239
## 
## $diff_CI
## [1]  10.35302 144.87513

Question 3: Write a simulation to calculate 10,000 90% confidence intervals

Write a simulation that will take a random sample of 1% from the abnb_clean NYC apartments data set and use the function from question 2 to calculate the 90% CI. Repeat the process a total of 10,000 times

Question 3A) Create a data frame to store the results

Store the results in a data.frame named sim_results that has 10,000 rows and six columns:

  • Col 1 = The mean for group 1: Manhattan
  • Col 2 = The mean for group 2: Brooklyn
  • Col 3 = The difference in means: Manhattan - Brooklyn
  • Col 4 = The standard error of the difference
  • Col 5 = The lower boundary of the 90% confidence interval
  • Col 6 = The upper boundary of the 90% confidence interval

The results should match what is in the solutions in Brightspace

# How many loops to perform
m <- 10000

# Create the data frame to store the results
sim_results <- 
  data.frame(
    Manhattan_avg = rep(-1, m),
    Brooklyn_avg  = rep(-1, m),
    diff_avg      = rep(-1, m),
    diff_SE       = rep(-1, m),
    CI_lower      = rep(-1, m),
    CI_upper      = rep(-1, m)
  )

tibble(sim_results)
## # A tibble: 10,000 × 6
##    Manhattan_avg Brooklyn_avg diff_avg diff_SE CI_lower CI_upper
##            <dbl>        <dbl>    <dbl>   <dbl>    <dbl>    <dbl>
##  1            -1           -1       -1      -1       -1       -1
##  2            -1           -1       -1      -1       -1       -1
##  3            -1           -1       -1      -1       -1       -1
##  4            -1           -1       -1      -1       -1       -1
##  5            -1           -1       -1      -1       -1       -1
##  6            -1           -1       -1      -1       -1       -1
##  7            -1           -1       -1      -1       -1       -1
##  8            -1           -1       -1      -1       -1       -1
##  9            -1           -1       -1      -1       -1       -1
## 10            -1           -1       -1      -1       -1       -1
## # ℹ 9,990 more rows

Question 3B) Running the simulation

Run the simulation where each loop should:

  1. Take a stratified sample of 1% of each apartment location from the abnb_clean data set
    • See the code chunk at the end of question 1
  2. Use the function created in question 2 to calculate the 90% confidence interval and store the results in the correct location in sim_results data.frame

You can check the results in Brightspace to see if your results match

# Making sure the results are the same every time
RNGversion("4.1.0"); set.seed(2870)

# Write the simulation below:
for (i in 1:m){
  # Getting the sample from abnb_clean
    loop_sample <- 
      abnb_clean |> 
      slice_sample(
        by = borough,
        prop = 0.01
      )
    
    # Using the function to create the summary stats
    loop_results <- 
      two_sample_CI(
        y = loop_sample$price,
        groups = loop_sample$borough,
        CL = 0.90
      )
    
    ## Storing the results in the columns of sim_results
    # The individual means
    sim_results[i, 1:2] <- loop_results$averages 
    
    # The difference in means
    sim_results[i, 3] <- loop_results$diff_avg
    
    # The SE of the difference in means
    sim_results[i, 4] <- loop_results$diff_SE
    
    # The confidence interval
    sim_results[i, 5:6] <- loop_results$diff_CI
}

tibble(sim_results)
## # A tibble: 10,000 × 6
##    Manhattan_avg Brooklyn_avg diff_avg diff_SE CI_lower CI_upper
##            <dbl>        <dbl>    <dbl>   <dbl>    <dbl>    <dbl>
##  1          251.         196.     54.5    24.7   13.9       95.2
##  2          244.         186.     58.3    26.4   14.9      102. 
##  3          211.         181.     30.1    21.6   -5.45      65.7
##  4          249.         229.     20.8    33.0  -33.6       75.1
##  5          253.         178.     75.2    23.9   35.9      115. 
##  6          222.         139.     82.9    15.1   58.0      108. 
##  7          245.         162.     82.9    19.7   50.5      115. 
##  8          252.         212.     40.0    24.5   -0.415     80.3
##  9          223.         186.     37.7    21.5    2.34      73.0
## 10          228.         167.     61.3    17.3   32.8       89.7
## # ℹ 9,990 more rows

Question 3C) Percentage of intervals that contain the true difference

Calculate, save, and display the true difference in the average price between apartments in Manhattan vs Brooklyn in the abnb_clean data set, then calculate the percentage of the 10,000 intervals created in question 3B.

# Calculating the true difference
abnb_clean |> 
  # The individual means
  summarize(
    .by = borough,
    price_avg = mean(price)
  ) |> 
  # The difference
  summarize(diff_avg = price_avg |> diff() |> abs()) |> 
  pull(diff_avg) ->
  diff_in_means

diff_in_means
## [1] 64.6262
# Calculating the percentage of intervals that contain the true difference
sim_results |> 
  mutate(contains_true = CI_lower <= diff_in_means & CI_upper >= diff_in_means) |> 
  summarize(interval_prop = mean(contains_true))
##   interval_prop
## 1        0.9009
sim_results |> 
  filter(CI_lower <= diff_in_means & CI_upper >= diff_in_means) |> 
  nrow()
## [1] 9009

Does the simulation indicate that the 90% confidence intervals are accurate?

Question 3D) Graph of the last 100 simulations

Create the graph in Brightspace that illustrates how accurate the last 100 confidence intervals are. The graph should have:

  • A point for the difference in means
  • A line that starts at the lower end of the CI and ends at the upper end of the CI
  • The line should be a shade of blue if it contains the true difference and a shade of red/orange if it does not
  • A vertical, dashed line that displays the true difference in means
sim_results |> 
  
  # Picking the last 100 rows
  slice_tail(
    n = 100
  ) |> 
  
  # Specifying the colors to use if the CI contains the parameter or not
  mutate(
    line_color = if_else(CI_lower < diff_in_means & CI_upper > diff_in_means,
                         "steelblue",
                         "tomato"),
    row_num = row_number()
  ) |> 
  
  ## Creating the graph
  ggplot(
    mapping = aes(
      y = row_num
    )
  ) + 
  
  # Drawing the line segment for the CI
  geom_segment(
    mapping = aes(
      x = CI_lower,
      y = row_num,
      xend = CI_upper,
      yend = row_num,
      color = line_color
    ),
    linewidth = 1
  ) + 
  
  # Adding the difference in means
  geom_point(
    mapping = aes(
      x = diff_avg
    ),
    size = 2
  ) + 
  
  # Adding the vertical line at the true difference
  geom_vline(
    xintercept = diff_in_means,
    linetype = "dashed",
    linewidth = 1
  ) +
  
  # Changing the labels on the graph
  labs(
    title = "Results of 90% Confidence intervals for a difference in means",
    x = "Difference in average price of apartments in Manhattan and Brooklyn",
    y = NULL
  ) +
  
  # Making the colors match the value in the column
  scale_color_identity() + 
 
  # Removing the values on the y-axis
  scale_y_continuous(
    breaks = NULL,
    expand = c(0.01, 0)
  ) +
   
  # Changing the breaks on the x-axis
  scale_x_continuous(
    breaks = seq(-75, 200, 25),
    minor_breaks = NULL,
    labels = scales::label_dollar()
  ) +
  
  theme_bw() + 
  
  theme(
    plot.title = element_text(hjust = 0.5),
#    axis.ticks.y = element_blank()
  )