Assignment

Assignment 1 ETC1010

Madeline Hill

Part 1

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(here)
here() starts at /Users/maddyhill/Documents/Monash University/2025/Semester 2/ETC1010/Assigment 1
business <- read_csv(here("Data", "business_filtered.csv"))
Rows: 7239 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): business_id, name, city, state, postal_code, categories_raw
dbl (7): latitude, longitude, stars_business, review_count_business, is_open...
lgl (9): accepts_credit_cards, good_for_kids, by_appointment_only, bike_park...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(business, 6)
# A tibble: 6 × 22
  business_id    name  city  state postal_code latitude longitude stars_business
  <chr>          <chr> <chr> <chr> <chr>          <dbl>     <dbl>          <dbl>
1 LMXa9Vi82nHJs… Oish… Olds… FL    34677           28.0     -82.7            4.5
2 -8x48OWU5z7m_… Caro… Tucs… AZ    85713           32.2    -111.             5  
3 xpT_P_9MdqVUN… Aspe… Indi… IN    46237           39.7     -86.1            1.5
4 U_gjhQ9phwKEq… Chad… Chad… PA    19317           39.9     -75.6            3.5
5 AH5JQnN6mIPee… Tamp… Tampa FL    33607           28.0     -82.5            4  
6 z66yFUpXMDxCP… Rele… Nash… TN    37204           36.1     -86.8            4.5
# ℹ 14 more variables: review_count_business <dbl>, is_open <dbl>,
#   categories_raw <chr>, accepts_credit_cards <lgl>, good_for_kids <lgl>,
#   by_appointment_only <lgl>, bike_parking <lgl>, restaurants_takeout <lgl>,
#   restaurants_delivery <lgl>, outdoor_seating <lgl>,
#   restaurants_reservations <lgl>, hours_per_week <dbl>,
#   open_on_weekends <lgl>, n_days_open <dbl>
n_rows <- nrow(business)
n_rows
[1] 7239
n_vars <- ncol(business)
n_vars
[1] 22
mean_rating <- round(mean(business$stars_business, na.rm = TRUE), 2)
mean_rating
[1] 3.61
mean_reviews_5star <- business %>%
  filter(stars_business == 5) %>%
  summarise(val = round(mean(review_count_business, na.rm = TRUE), 2)) %>%
  pull(val)

mean_reviews_5star
[1] 15.35

Part 2

Section i

summary_stats <- summary(business$stars_business)
summary_stats
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   3.000   4.000   3.608   4.500   5.000 
library(ggplot2)

ggplot(business, aes(x = stars_business)) +
  geom_histogram(binwidth = 0.5, fill = "grey", color = "black") +
  labs(title = "Distribution of Yelp Star Ratings",
       x = "Star Rating",
       y = "Count") +
  theme_minimal()

The histogram shows that the distribution of Yelp star ratings is left-skewed, with most businesses rated between 3.5 and 5 stars. The summary statistics are consistent with this: the minimum is 1, the median is 4, and the mean (3.61) sits below the median, pulled down by the longer lower tail. There is also a ceiling effect at 5 stars, the maximum possible rating. The distribution is therefore not symmetric, as businesses cluster at the top end of the scale.
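The direction of the skew can also be checked numerically. The sketch below is only an illustration: the knitted output does not print counts per rating for the full data set, so it rebuilds the Arizona ratings from the counts in `state_table` (Section ii) and applies a simple moment-based skewness. The `skewness()` helper is a hypothetical function written for this example, not part of base R or the tidyverse.

```r
# Counts per star rating for Arizona, copied from state_table (Section ii)
stars <- seq(1, 5, by = 0.5)
az_n  <- c(8, 14, 27, 46, 69, 96, 105, 69, 52)

# Expand the counts back into one rating per business
az_ratings <- rep(stars, times = az_n)

# Hypothetical helper: moment-based skewness
# (mean cubed deviation divided by the 1.5 power of the mean squared deviation)
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

az_skew <- skewness(az_ratings)
az_skew  # a negative value indicates a longer lower (left) tail
```

With the `business` data loaded, the same helper could be applied as `skewness(business$stars_business)`.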

Section ii

state_table <- business %>%
  filter(state %in% c("AZ", "CA")) %>%
  group_by(state, stars_business) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(state) %>%
  mutate(percentage = round(100 * n / sum(n), 2))

state_table
# A tibble: 18 × 4
# Groups:   state [2]
   state stars_business     n percentage
   <chr>          <dbl> <int>      <dbl>
 1 AZ               1       8       1.65
 2 AZ               1.5    14       2.88
 3 AZ               2      27       5.56
 4 AZ               2.5    46       9.47
 5 AZ               3      69      14.2 
 6 AZ               3.5    96      19.8 
 7 AZ               4     105      21.6 
 8 AZ               4.5    69      14.2 
 9 AZ               5      52      10.7 
10 CA               1       3       1.04
11 CA               1.5     3       1.04
12 CA               2      13       4.51
13 CA               2.5    11       3.82
14 CA               3      13       4.51
15 CA               3.5    41      14.2 
16 CA               4      65      22.6 
17 CA               4.5    56      19.4 
18 CA               5      83      28.8 
state_reviews <- business %>%
  filter(state %in% c("AZ", "CA")) %>%
  group_by(state, stars_business) %>%
  summarise(mean_reviews = round(mean(review_count_business, na.rm = TRUE), 2),
            .groups = "drop")

state_reviews
# A tibble: 18 × 3
   state stars_business mean_reviews
   <chr>          <dbl>        <dbl>
 1 AZ               1           8.88
 2 AZ               1.5        28.9 
 3 AZ               2          21.1 
 4 AZ               2.5        32.3 
 5 AZ               3          35.0 
 6 AZ               3.5        45.0 
 7 AZ               4          57.7 
 8 AZ               4.5        36.9 
 9 AZ               5          13.2 
10 CA               1           7.67
11 CA               1.5        44   
12 CA               2          22.7 
13 CA               2.5        30.6 
14 CA               3          39   
15 CA               3.5        63.2 
16 CA               4          98.5 
17 CA               4.5        67.4 
18 CA               5          15.6 

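California businesses have a much higher proportion of 4–5 star ratings than Arizona businesses (about 70.8% versus 46.5%). Mean review counts are also higher in California at seven of the nine rating levels, which may point to larger customer bases. Arizona's ratings are spread more evenly and peak at 4 stars, whereas California's peak at 5 stars, suggesting more mixed perceptions of quality in Arizona. These differences could reflect market size and competition, although the California sample is smaller (288 businesses against 486).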
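The share of 4–5 star ratings in each state can be recomputed directly from the counts in `state_table`. This is a minimal sketch in base R so that it runs without the CSV file; the counts are typed in by hand from the table above, and the object names `counts` and `share_4_plus` are introduced only for this example.

```r
# Counts per star rating, copied from state_table
stars <- seq(1, 5, by = 0.5)
counts <- cbind(
  AZ = c(8, 14, 27, 46, 69, 96, 105, 69, 52),
  CA = c(3, 3, 13, 11, 13, 41, 65, 56, 83)
)
rownames(counts) <- stars

# Percentage of businesses in each state rated 4 stars or higher
share_4_plus <- round(100 * colSums(counts[stars >= 4, ]) / colSums(counts), 1)
share_4_plus
```

Working from counts rather than summing the rounded percentages avoids accumulating rounding error.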

Section iii

library(stringr)

restaurant_businesses <- business %>%
  filter(str_detect(str_to_lower(categories_raw), "restaurant"))
open_closed_table <- restaurant_businesses %>%
  group_by(is_open, stars_business) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(is_open) %>%
  mutate(percentage = round(100 * n / sum(n), 2))

open_closed_table
# A tibble: 18 × 4
# Groups:   is_open [2]
   is_open stars_business     n percentage
     <dbl>          <dbl> <int>      <dbl>
 1       0            1       5       0.64
 2       0            1.5    10       1.27
 3       0            2      33       4.19
 4       0            2.5    80      10.2 
 5       0            3     154      19.6 
 6       0            3.5   189      24.0 
 7       0            4     189      24.0 
 8       0            4.5   104      13.2 
 9       0            5      23       2.92
10       1            1       5       0.31
11       1            1.5    68       4.28
12       1            2     116       7.3 
13       1            2.5   146       9.18
14       1            3     191      12.0 
15       1            3.5   324      20.4 
16       1            4     405      25.5 
17       1            4.5   279      17.6 
18       1            5      56       3.52

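Among restaurants, those that remain open have a higher share of 4–5 star ratings (about 47% versus 40% for closed venues). Closed venues are concentrated in the middle of the scale, with about 54% rated 2.5–3.5 compared with about 42% of open restaurants. Open restaurants also have the larger share of ratings at 2 stars or below (about 12% versus 6%), so the pattern is not one-directional. This is consistent with better-rated restaurants being somewhat more likely to stay open, but the difference is modest and other factors are likely to matter too.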
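Collapsing the nine rating levels into three bands makes the open versus closed comparison easier to read. The sketch below is one possible way to do this from the counts in `open_closed_table`; the band cut-offs (2 and 3.5) and the names `band`, `band_counts` and `band_share` are choices made for this illustration, not part of the assignment.

```r
# Counts per star rating, copied from open_closed_table
stars    <- seq(1, 5, by = 0.5)
closed_n <- c(5, 10, 33, 80, 154, 189, 189, 104, 23)   # is_open == 0
open_n   <- c(5, 68, 116, 146, 191, 324, 405, 279, 56) # is_open == 1

# Assumed bands: low (1-2), middle (2.5-3.5), high (4-5)
band <- cut(stars, breaks = c(0, 2, 3.5, 5),
            labels = c("1-2", "2.5-3.5", "4-5"))

# Sum the counts within each band, one column per group
band_counts <- rowsum(cbind(closed = closed_n, open = open_n), group = band)

# Convert to column percentages so each group sums to roughly 100
band_share <- round(100 * prop.table(band_counts, margin = 2), 1)
band_share
```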

Section iv

weekend_effect <- restaurant_businesses %>%
  group_by(open_on_weekends) %>%
  summarise(mean_rating = round(mean(stars_business, na.rm = TRUE), 2),
            .groups = "drop")

weekend_effect
# A tibble: 3 × 2
  open_on_weekends mean_rating
  <lgl>                  <dbl>
1 FALSE                   4.14
2 TRUE                    3.52
3 NA                      3.31

Restaurants that are closed on weekends have the higher mean rating: 4.14, compared with 3.52 for those open on weekends and 3.31 where weekend opening is not recorded. One possible explanation is that weekend trading coincides with peak demand, and busier service and a wider mix of customers may pull average ratings down. Restaurants that close on weekends may also be a small or specialised group. The table does not report how many restaurants fall in each group, so this difference should be interpreted with caution.
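A group mean is easier to interpret when the group size is reported next to it. The sketch below shows how a count column could be added to the same kind of summary. It uses a small hypothetical data frame (`toy_restaurants`, with invented values that are not taken from `business_filtered.csv`) purely so that the code runs on its own.

```r
library(dplyr)

# Hypothetical toy data, NOT from business_filtered.csv
toy_restaurants <- tibble(
  open_on_weekends = c(TRUE, TRUE, TRUE, TRUE, FALSE, NA),
  stars_business   = c(3.5, 4.0, 3.0, 3.5, 4.5, 3.0)
)

# Same summary as weekend_effect, with the group size added
weekend_summary <- toy_restaurants %>%
  group_by(open_on_weekends) %>%
  summarise(
    n = n(),  # number of restaurants in each group
    mean_rating = round(mean(stars_business, na.rm = TRUE), 2),
    .groups = "drop"
  )

weekend_summary
```

Applied to `restaurant_businesses`, the `n` column would show whether the closed-on-weekends group is large enough for its mean to be compared meaningfully with the others.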