── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(here)
here() starts at /Users/maddyhill/Documents/Monash University/2025/Semester 2/ETC1010/Assigment 1
business <- read_csv("Data/business_filtered.csv")
Rows: 7239 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): business_id, name, city, state, postal_code, categories_raw
dbl (7): latitude, longitude, stars_business, review_count_business, is_open...
lgl (9): accepts_credit_cards, good_for_kids, by_appointment_only, bike_park...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 3.000 4.000 3.608 4.500 5.000
library(ggplot2)

ggplot(business, aes(x = stars_business)) +
  geom_histogram(binwidth = 0.5, fill = "grey", color = "black") +
  labs(
    title = "Distribution of Yelp Star Ratings",
    x = "Star Rating",
    y = "Count"
  ) +
  theme_minimal()
The histogram shows that the distribution of Yelp star ratings is left-skewed, with most businesses rated between 3.5 and 5 stars. The summary statistics are consistent with this: the minimum is 1, the median is 4, and the mean (3.61) sits below the median, as expected when a long lower tail pulls the mean down. There is also a ceiling effect at 5 stars, the highest possible rating. The distribution is therefore not symmetric: businesses cluster at the high end of the scale.
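The link between left skew and a mean below the median can be checked numerically. The sketch below is only an illustration: `ratings` is a small hypothetical vector standing in for `business$stars_business`, and `sample_skewness()` is a helper written here (a simple moment-based estimator), not a function from any package used in the report.

```r
# Hypothetical ratings standing in for business$stars_business
ratings <- c(1, 2.5, 3, 3.5, 4, 4, 4.5, 4.5, 5, 5)

# Moment-based skewness: mean cubed deviation divided by the
# cubed (population) standard deviation
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  deviations <- x - mean(x)
  mean(deviations^3) / mean(deviations^2)^1.5
}

skew <- sample_skewness(ratings)

# A negative value indicates a longer lower tail (left skew)
skew < 0

# Left-skewed data typically has its mean below its median
mean_below_median <- mean(ratings) < median(ratings)
mean_below_median
```

Applied to the real column, a negative result would agree with the summary output, where the mean (3.608) is below the median (4.000).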
Section ii
state_table <- business %>%
  filter(state %in% c("AZ", "CA")) %>%
  group_by(state, stars_business) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(state) %>%
  mutate(percentage = round(100 * n / sum(n), 2))

state_table
# A tibble: 18 × 4
# Groups: state [2]
state stars_business n percentage
<chr> <dbl> <int> <dbl>
1 AZ 1 8 1.65
2 AZ 1.5 14 2.88
3 AZ 2 27 5.56
4 AZ 2.5 46 9.47
5 AZ 3 69 14.2
6 AZ 3.5 96 19.8
7 AZ 4 105 21.6
8 AZ 4.5 69 14.2
9 AZ 5 52 10.7
10 CA 1 3 1.04
11 CA 1.5 3 1.04
12 CA 2 13 4.51
13 CA 2.5 11 3.82
14 CA 3 13 4.51
15 CA 3.5 41 14.2
16 CA 4 65 22.6
17 CA 4.5 56 19.4
18 CA 5 83 28.8
# A tibble: 18 × 3
state stars_business mean_reviews
<chr> <dbl> <dbl>
1 AZ 1 8.88
2 AZ 1.5 28.9
3 AZ 2 21.1
4 AZ 2.5 32.3
5 AZ 3 35.0
6 AZ 3.5 45.0
7 AZ 4 57.7
8 AZ 4.5 36.9
9 AZ 5 13.2
10 CA 1 7.67
11 CA 1.5 44
12 CA 2 22.7
13 CA 2.5 30.6
14 CA 3 39
15 CA 3.5 63.2
16 CA 4 98.5
17 CA 4.5 67.4
18 CA 5 15.6
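California businesses have a higher proportion of 4–5 star ratings than Arizona businesses (about 71% versus 47%). The average number of reviews is also higher in California at most rating levels, most clearly between 3.5 and 4.5 stars, which may suggest larger customer bases. Arizona's ratings are spread more evenly across the scale and peak at 4 stars, whereas California's are concentrated at the top, with 5 stars the most common rating; this may indicate more variation in perceived quality in Arizona. These patterns could reflect differences in market size and competition.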
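The code behind the `mean_reviews` table is not shown in the output. A minimal sketch of how such a table might be produced is given below. It assumes the review counts come from the `review_count_business` column listed in the column specification; `toy_business` is hypothetical data used only so the example runs on its own.

```r
library(dplyr)

# Hypothetical rows; only the column names match the real data set
toy_business <- data.frame(
  state = c("AZ", "AZ", "AZ", "CA", "CA", "NV"),
  stars_business = c(4, 4, 5, 4, 5, 3),
  review_count_business = c(10, 30, 8, 100, 20, 50)
)

# Mean review count for each state and star rating (AZ and CA only)
review_table <- toy_business %>%
  filter(state %in% c("AZ", "CA")) %>%
  group_by(state, stars_business) %>%
  summarise(mean_reviews = mean(review_count_business), .groups = "drop")

review_table
```

Replacing `toy_business` with `business` should give one row per state and star rating, in the same layout as the table above.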
Section iii
library(stringr)

restaurant_businesses <- business %>%
  filter(str_detect(str_to_lower(categories_raw), "restaurant"))
Among restaurants, those that remain open tend to have a higher share of 4–5 star ratings, while closed venues are more heavily represented at lower star levels. This suggests that better-rated restaurants are more likely to stay open, which is consistent with customer preference driving business sustainability.
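One way to quantify the comparison between open and closed restaurants is to compute the share of high ratings within each group. The sketch below is an assumption about how this could be done, not the code used for the report: it assumes `is_open` is coded 1 for open and 0 for closed, treats 4 stars and above as a high rating, and uses the hypothetical `toy_restaurants` data in place of `restaurant_businesses`.

```r
library(dplyr)

# Hypothetical restaurant rows; is_open is assumed to be 1 = open, 0 = closed
toy_restaurants <- data.frame(
  is_open = c(1, 1, 1, 1, 0, 0, 0, 0),
  stars_business = c(4.5, 4, 5, 3, 2.5, 3, 4, 2)
)

# Group size and percentage of 4-5 star ratings for each open status
open_summary <- toy_restaurants %>%
  mutate(high_rating = stars_business >= 4) %>%
  group_by(is_open) %>%
  summarise(
    n = n(),
    pct_high = round(100 * mean(high_rating), 1),
    .groups = "drop"
  )

open_summary
```

Reporting `n` alongside the percentage makes it easier to judge whether a gap between the two groups rests on enough businesses to be meaningful.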
# A tibble: 3 × 2
open_on_weekends mean_rating
<lgl> <dbl>
1 FALSE 4.14
2 TRUE 3.52
3 NA 3.31
Restaurants open on weekends have a lower average rating (3.52) than those closed on weekends (4.14), and restaurants with no weekend information have the lowest average (3.31). One possible explanation is that weekend opening coincides with peak customer demand, when crowding and longer waits may lead to more critical reviews, although the table alone cannot confirm this. Restaurants that close on weekends may also be a small or specialised group, so the difference should be interpreted with caution.