Build at least two pairs of numeric variables

Pair 1: Price vs. Minimum Nights

  • Variables: price_per_min_night (calculated response) vs. minimum_nights (explanatory).

    library(tidyverse)
    ## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
    ## ✔ dplyr     1.1.4     ✔ readr     2.1.5
    ## ✔ forcats   1.0.0     ✔ stringr   1.5.1
    ## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
    ## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
    ## ✔ purrr     1.0.2     
    ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
    ## ✖ dplyr::filter() masks stats::filter()
    ## ✖ dplyr::lag()    masks stats::lag()
    ## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
    data <- read_csv("AB_NYC_2019.csv")
    ## Rows: 48895 Columns: 16
    ## ── Column specification ────────────────────────────────────────────────────────
    ## Delimiter: ","
    ## chr   (5): name, host_name, neighbourhood_group, neighbourhood, room_type
    ## dbl  (10): id, host_id, latitude, longitude, price, minimum_nights, number_o...
    ## date  (1): last_review
    ## 
    ## ℹ Use `spec()` to retrieve the full column specification for this data.
    ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    # Create the calculated column 'price_per_min_night'
    data <- data %>% mutate(price_per_min_night = price / minimum_nights)
    
    # Display a quick view of the dataset to ensure the new column is created
    head(data)
    ## # A tibble: 6 × 17
    ##      id name        host_id host_name neighbourhood_group neighbourhood latitude
    ##   <dbl> <chr>         <dbl> <chr>     <chr>               <chr>            <dbl>
    ## 1  2539 Clean & qu…    2787 John      Brooklyn            Kensington        40.6
    ## 2  2595 Skylit Mid…    2845 Jennifer  Manhattan           Midtown           40.8
    ## 3  3647 THE VILLAG…    4632 Elisabeth Manhattan           Harlem            40.8
    ## 4  3831 Cozy Entir…    4869 LisaRoxa… Brooklyn            Clinton Hill      40.7
    ## 5  5022 Entire Apt…    7192 Laura     Manhattan           East Harlem       40.8
    ## 6  5099 Large Cozy…    7322 Chris     Manhattan           Murray Hill       40.7
    ## # ℹ 10 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
    ## #   minimum_nights <dbl>, number_of_reviews <dbl>, last_review <date>,
    ## #   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
    ## #   availability_365 <dbl>, price_per_min_night <dbl>
  • Insight: price_per_min_night (price divided by minimum_nights) expresses each listing's price relative to its minimum stay requirement, which this analysis treats as a per-night rate. Comparing it with minimum_nights can show whether properties with higher minimum stay requirements typically charge more or less per night than those with lower requirements.

  • Significance: Understanding this relationship can help visitors with specific stay-length requirements anticipate the nightly cost they are likely to face under a given minimum stay constraint.

  • Further Questions:

    How does price per minimum night differ between neighborhoods?
    Do different property types show different pricing patterns as the minimum night requirement changes?
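One way to start on these further questions is a grouped summary. The following is only a minimal sketch: the `toy` tibble and its values are made up for illustration, and it assumes that `neighbourhood_group` and `room_type` are the columns of interest.

```r
library(dplyr)

# Hypothetical toy rows standing in for AB_NYC_2019.csv (values are made up)
toy <- tibble(
  neighbourhood_group = c("Brooklyn", "Brooklyn", "Manhattan", "Manhattan"),
  room_type           = c("Private room", "Entire home/apt",
                          "Private room", "Entire home/apt"),
  price               = c(60, 150, 90, 240),
  minimum_nights      = c(1, 3, 2, 4)
)

# Median price per minimum night for each neighbourhood group and room type
summary_tbl <- toy %>%
  mutate(price_per_min_night = price / minimum_nights) %>%
  group_by(neighbourhood_group, room_type) %>%
  summarise(
    listings = n(),
    median_price_per_min_night = median(price_per_min_night),
    .groups = "drop"
  )

summary_tbl
```

The median is used here rather than the mean because ratios like this are often skewed; that choice is an assumption, not something the report prescribes.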

Pair 2: Price per Review vs. Number of Reviews

  • Variables: price_per_review (calculated response, price / number_of_reviews) vs. number_of_reviews (explanatory).

# Recode zero review counts as NA so the division below does not divide by zero
data <- data %>% mutate(number_of_reviews = ifelse(number_of_reviews == 0, NA, number_of_reviews))

# Create the calculated column 'price_per_review'
data <- data %>% mutate(price_per_review = price / number_of_reviews)


head(data)
## # A tibble: 6 × 18
##      id name        host_id host_name neighbourhood_group neighbourhood latitude
##   <dbl> <chr>         <dbl> <chr>     <chr>               <chr>            <dbl>
## 1  2539 Clean & qu…    2787 John      Brooklyn            Kensington        40.6
## 2  2595 Skylit Mid…    2845 Jennifer  Manhattan           Midtown           40.8
## 3  3647 THE VILLAG…    4632 Elisabeth Manhattan           Harlem            40.8
## 4  3831 Cozy Entir…    4869 LisaRoxa… Brooklyn            Clinton Hill      40.7
## 5  5022 Entire Apt…    7192 Laura     Manhattan           East Harlem       40.8
## 6  5099 Large Cozy…    7322 Chris     Manhattan           Murray Hill       40.7
## # ℹ 11 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
## #   minimum_nights <dbl>, number_of_reviews <dbl>, last_review <date>,
## #   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## #   availability_365 <dbl>, price_per_min_night <dbl>, price_per_review <dbl>
  • Insights: Using this pair, we can investigate the relationship between a listing's price and its popularity, as measured by the number of reviews. A high price per review might indicate either an expensive property or one that has attracted few reviews, possibly because of lower demand or quality.

  • Significance: Analyzing this relationship makes it easier to distinguish listings that appear to be good value and heavily reviewed from those that are expensive or attract little guest feedback.

  • Further questions:

    What is the average price per review in each neighborhood, and how does it relate to the listing price itself?
    Does the relationship between price per review and number of reviews change seasonally or over time?

Plot a visualization for each relationship, and draw some conclusions based on the plot

Plot 1: price_per_min_night vs. minimum_nights

# Scatter plot for price_per_min_night vs minimum_nights
library(ggplot2)

ggplot(data, aes(x = minimum_nights, y = price_per_min_night)) +
  geom_point(alpha = 0.5) +
  labs(title = "Price per Minimum Night vs Minimum Nights",
       x = "Minimum Nights",
       y = "Price per Minimum Night") +
  theme_minimal()
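Because both variables are likely to be heavily right-skewed, most points in a plot like this can end up crowded near the origin. A possible variant, sketched here on made-up data, puts both axes on a log scale. The `toy` data frame is hypothetical; in the report the same layers would be applied to `data`, where any listing with a price of 0 would be dropped with a warning because its logarithm is not finite.

```r
library(ggplot2)

# Hypothetical toy data standing in for `data` (values are simulated)
set.seed(1)
toy <- data.frame(
  minimum_nights = sample(1:365, 200, replace = TRUE),
  price          = round(runif(200, min = 20, max = 500))
)
toy$price_per_min_night <- toy$price / toy$minimum_nights

# Same scatter plot as above, with log10 scales on both axes
p_log <- ggplot(toy, aes(x = minimum_nights, y = price_per_min_night)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "Price per Minimum Night vs Minimum Nights (log scales)",
       x = "Minimum Nights (log scale)",
       y = "Price per Minimum Night (log scale)") +
  theme_minimal()

p_log
```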

Plot 2: Price per Review vs Number of Reviews

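# Replace NA values in number_of_reviews and price_per_review with 0.
# Note: this overwrites `data`, so listings with no reviews are plotted at (0, 0)
# and are also included in the correlation and confidence interval computed later.
data <- data %>%
  replace_na(list(number_of_reviews = 0, price_per_review = 0))

# Scatter plot for price_per_review vs number_of_reviews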
ggplot(data, aes(x = number_of_reviews, y = price_per_review)) +
  geom_point(alpha = 0.5) +
  labs(title = "Price per Review vs Number of Reviews",
       x = "Number of Reviews",
       y = "Price per Review") +
  theme_minimal()

Calculate Correlation Coefficient

Correlation for Price per minimum night vs. Minimum Nights

cor_price_per_min_night <- cor(data$price_per_min_night, data$minimum_nights, use = "complete.obs")
cor_price_per_min_night
## [1] -0.1053576

Insights

Weak Negative Relationship: The correlation of about -0.105 suggests that the nightly rate tends to drop slightly as the required minimum stay rises. This might be the result of hosts lowering the nightly charge in an effort to attract longer-term visitors. Some negative association is also expected by construction, because minimum_nights is the denominator of price_per_min_night.

Significance


Budgeting for Guests: Guests who are planning longer stays may find that listings with longer minimum stays offer a slightly better nightly rate.
Host Pricing Strategy: To stay competitive and attract longer bookings, hosts may use this information to adjust their pricing, for example by offering discounted nightly rates on properties with higher minimum night requirements.
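When a response variable is built by dividing by the explanatory variable, part of any negative correlation comes from the arithmetic alone. The simulation below is a small sketch of that effect under made-up assumptions: `price` is drawn independently of `minimum_nights`, and the ranges used are arbitrary rather than taken from the Airbnb data.

```r
set.seed(42)
n <- 5000

# Simulated values; price is generated independently of minimum_nights
minimum_nights <- sample(1:30, n, replace = TRUE)
price          <- runif(n, min = 50, max = 300)

# Correlation of the raw price with minimum_nights (close to zero by design)
cor_raw <- cor(price, minimum_nights)

# Correlation of the ratio with its own denominator (negative by construction)
cor_ratio <- cor(price / minimum_nights, minimum_nights)

c(raw = cor_raw, ratio = cor_ratio)
```

Comparing the ratio-based correlation with `cor(data$price, data$minimum_nights)` on the real data would be one way to judge how much of the observed value reflects host pricing rather than the construction of the variable.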



Correlation for Price per Review vs Number of Reviews

cor_price_per_review_reviews <- cor(data$price_per_review, data$number_of_reviews, use = "complete.obs")
cor_price_per_review_reviews
## [1] -0.128673

Insights: The correlation coefficient of -0.1287 indicates a weak negative relationship between price per review and number of reviews: price per review tends to fall slightly as the number of reviews increases. Part of this is expected by construction, since number_of_reviews is the denominator of price_per_review, and listings with no reviews enter the calculation with both values set to 0. The pattern may also suggest that listings with many reviews are seen as offering better value.

Significance:
This finding raises questions about the relationship between cost and guest satisfaction. Listings with a high number of reviews and a low price per review could be more popular or offer better value, attracting more visitors and reviews in the process.
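Since listings without reviews were filled in with 0 before this correlation was computed, it may be worth checking how sensitive the result is to that choice. The sketch below uses a small made-up vector, not the Airbnb data, to compare the two treatments; all values and variable names are hypothetical.

```r
# Hypothetical toy values: three listings have no reviews
reviews <- c(0, 0, 0, 1, 2, 5, 10, 40)
price   <- c(80, 120, 200, 100, 90, 150, 60, 120)

# Treat zero reviews as missing before dividing
reviews_na <- ifelse(reviews == 0, NA, reviews)
price_per_review <- price / reviews_na

# Option 1: drop listings without reviews
cor_dropped <- cor(price_per_review, reviews_na, use = "complete.obs")

# Option 2: fill missing values with 0, as in the plotting step above
price_per_review_filled <- ifelse(is.na(price_per_review), 0, price_per_review)
cor_filled <- cor(price_per_review_filled, reviews)

c(dropped = cor_dropped, filled = cor_filled)
```

In this toy example the two coefficients differ noticeably, which suggests reporting which treatment was used alongside the correlation.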

Confidence Interval for price per minimum night

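# Mean and standard error of price_per_min_night
# (nrow(data) equals the sample size only if the column has no missing values)
mean_price_per_min_night <- mean(data$price_per_min_night, na.rm = TRUE)
se_price_per_min_night <- sd(data$price_per_min_night, na.rm = TRUE) / sqrt(nrow(data))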

# 95% Confidence Interval
ci_price_per_min_night <- c(
  mean_price_per_min_night - 1.96 * se_price_per_min_night,
  mean_price_per_min_night + 1.96 * se_price_per_min_night
)

ci_price_per_min_night
## [1] 68.77712 71.57138

Insights

  • The 95% confidence interval places the mean price per minimum night between about 68.78 and 71.57. The interval is narrow mainly because the sample is large (48,895 listings); it does not by itself show that pricing is consistent across listings.
  • The interval describes uncertainty about the average, not the spread of individual listings. Listings priced well above it may be premium offerings, while those well below it may be budget-friendly options.

Significance

  • For Guests: This interval gives travelers an estimate of the average price per minimum night to budget against.
  • For Hosts: Hosts can compare their own rate with this average and price near it to stay competitive, or above it to emphasize exclusivity.
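The interval above uses the normal critical value 1.96 and divides by `sqrt(nrow(data))`. A slightly more defensive version, sketched here as a hypothetical helper named `mean_ci`, counts only non-missing values and uses the t distribution; with a sample as large as this one the two approaches should give nearly the same interval. The `toy` vector is made up for illustration.

```r
# Hypothetical helper: confidence interval for a mean, ignoring missing values
mean_ci <- function(x, level = 0.95) {
  x    <- x[!is.na(x)]                         # keep non-missing values only
  n    <- length(x)                            # sample size actually used
  se   <- sd(x) / sqrt(n)                      # standard error of the mean
  crit <- qt(1 - (1 - level) / 2, df = n - 1)  # t critical value
  c(lower = mean(x) - crit * se, upper = mean(x) + crit * se)
}

# Made-up values with one missing entry
toy <- c(50, 75, NA, 120, 60, 90)

ci <- mean_ci(toy)

# Cross-check against base R's one-sample t-test interval
ci_check <- as.numeric(t.test(toy)$conf.int)

ci
```

In the report this could be called as `mean_ci(data$price_per_min_night)`.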

Confidence Interval for Price Per Review

# Calculate mean and standard error
# (listings with no reviews were set to 0 in the plotting step, so they are included here)
mean_price_per_review <- mean(data$price_per_review, na.rm = TRUE)
se_price_per_review <- sd(data$price_per_review, na.rm = TRUE) / sqrt(nrow(data))

# Confidence interval (95%)
ci_price_per_review <- c(
  mean_price_per_review - 1.96 * se_price_per_review,
  mean_price_per_review + 1.96 * se_price_per_review
)
ci_price_per_review
## [1] 29.42625 31.26067