Haig Bedros - HW3 Problem Statement:
How does the number of locations of a restaurant chain impact its
inspection scores over the years?
This question investigates whether there is a relationship
between the number of locations a restaurant chain has and its
inspection scores over time.
By analyzing the dataset, we try to determine whether
restaurant chains with more locations tend to have higher or lower
inspection scores than those with fewer locations. My hypothesis is
that chains with more locations are more likely to be overlooked
during inspections and therefore tend to have higher overall
inspection scores than chains with fewer locations.
Additionally, we will explore how inspection scores change for
restaurant chains with different numbers of locations as the years
progress.
This analysis could provide valuable insights into how the scale and
growth of restaurant chains might influence their overall compliance
with food safety and hygiene standards.
Load required libraries & Read the CSV file:
# Load required libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(readr)
# make sure to run install.packages("readr") if the package is not already installed
# BONUS: Read the CSV file from github
github_csv_url <- "https://raw.githubusercontent.com/hbedros/R_HW3/main/restaurant_inspections.csv"
restaurant_data <- read_csv(github_csv_url)
## New names:
## • `` -> `...1`
## Rows: 27178 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): business_name
## dbl (4): ...1, inspection_score, Year, NumberofLocations
## lgl (1): Weekend
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Step 1: Data Exploration:
This should include summary statistics, means, medians, quantiles,
or any other relevant information about the data set. Please include
some conclusions in the R Markdown text.
# Introduction:
# In this analysis, we explored the restaurant inspection data to understand various aspects of the inspection scores and the impact of the number of locations on the scores over the years.
# Data Summary:
# We began by summarizing the data, calculating the mean, median, and quartiles for the inspection scores. We also identified the number of unique businesses present in the dataset.
# Summary statistics
summary(restaurant_data)
## ...1 business_name inspection_score Year
## Min. : 1 Length:27178 Min. : 66.00 Min. :2000
## 1st Qu.: 6795 Class :character 1st Qu.: 90.00 1st Qu.:2006
## Median :13590 Mode :character Median : 95.00 Median :2009
## Mean :13590 Mean : 93.64 Mean :2010
## 3rd Qu.:20384 3rd Qu.:100.00 3rd Qu.:2016
## Max. :27178 Max. :100.00 Max. :2019
## NumberofLocations Weekend
## Min. : 1.00 Mode :logical
## 1st Qu.: 27.00 FALSE:26968
## Median : 41.00 TRUE :210
## Mean : 64.77
## 3rd Qu.: 71.00
## Max. :646.00
# Preview the first five rows of the data
head(restaurant_data, 5)
## # A tibble: 5 × 6
## ...1 business_name inspection_score Year NumberofLocations Weekend
## <dbl> <chr> <dbl> <dbl> <dbl> <lgl>
## 1 1 MCGINLEYS PUB 94 2017 9 FALSE
## 2 2 VILLAGE INN #1 86 2015 66 FALSE
## 3 3 RONNIE SUSHI 2 80 2016 79 FALSE
## 4 4 FRED MEYER - RETAIL FI… 96 2003 86 FALSE
## 5 5 PHO GRILL 83 2017 53 FALSE
# Calculate summary statistics for inspection_score
summary_stats <- restaurant_data %>%
  summarize(
    Mean_Inspection_Score = mean(inspection_score),
    Median_Inspection_Score = median(inspection_score),
    Q1_Inspection_Score = quantile(inspection_score, 0.25),
    Q3_Inspection_Score = quantile(inspection_score, 0.75)
  )
# Calculate the number of unique businesses from the business_name column
unique_businesses <- restaurant_data %>%
  distinct(business_name) %>%
  nrow()
# View the results
print("Summary Statistics for Inspection Score:")
## [1] "Summary Statistics for Inspection Score:"
print(summary_stats)
## # A tibble: 1 × 4
## Mean_Inspection_Score Median_Inspection_Score Q1_Inspection_Score
## <dbl> <dbl> <dbl>
## 1 93.6 95 90
## # ℹ 1 more variable: Q3_Inspection_Score <dbl>
cat("\nNumber of Unique Businesses:", unique_businesses, "\n")
##
## Number of Unique Businesses: 1618
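The count above treats every distinct `business_name` as a separate business, yet names such as `VILLAGE INN #1` in the preview look like individual outlets of a chain. The following is a minimal sketch of how outlet names might be collapsed into chain names. It assumes outlets are marked by a trailing `#<number>`, which has not been checked against the full dataset; `chain_name` is a hypothetical helper and the names in `toy_names` are illustrative only.

```r
# Hypothetical helper: strip a trailing "#<digits>" outlet suffix from a name.
# Assumption: outlets of one chain differ only by this suffix.
chain_name <- function(x) {
  trimws(sub("[[:space:]]*#[[:space:]]*[0-9]+[[:space:]]*$", "", x))
}

# Toy names for illustration only (not a sample of the real data)
toy_names <- c("VILLAGE INN #1", "VILLAGE INN #2", "MCGINLEYS PUB", "PHO GRILL")

toy_chains <- chain_name(toy_names)

# Number of distinct chains after collapsing outlet suffixes
n_chains <- length(unique(toy_chains))
print(toy_chains)
print(n_chains)
```

Names that mark outlets differently (for example `RONNIE SUSHI 2`) are left unchanged by this rule, so any real use would need a closer look at the naming patterns.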
# Extract unique years of inspections from the "Year" column and order them in ascending order
unique_years <- restaurant_data %>%
  distinct(Year) %>%
  arrange(Year) %>%
  pull(Year)
# View the unique years
cat("\nUnique Years of Inspections:", unique_years, "\n")
##
## Unique Years of Inspections: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2015 2016 2017 2018 2019
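The list of years above is not contiguous, which matters for any "over the years" comparison. As a small sketch, the gap can be made explicit by comparing the observed years with the full range between the first and last year. The vector `observed_years` is copied from the output above; the other names are illustrative.

```r
# Years copied from the output above
observed_years <- c(2000:2009, 2015:2019)

# Every calendar year between the first and last observed year
full_range <- seq(min(observed_years), max(observed_years))

# Years in the range that have no inspections in the data
missing_years <- setdiff(full_range, observed_years)
print(missing_years)
```

In the analysis itself, `unique_years` could be used in place of `observed_years`.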
Step 3: Meaningful question for analysis:
How does the number of locations of a restaurant chain impact its
inspection scores over the years?
# Group the data by Year and business_name and count the number of inspections for each combination
inspection_counts <- restaurant_data %>%
  group_by(Year, business_name) %>%
  summarise(Num_of_Inspections = n())
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
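# For each year, find the business(es) with the highest count of inspections.
# inspection_counts is still grouped by Year, so max() is evaluated within each
# year and every year is returned (repeated when businesses tie)
most_inspected_year <- inspection_counts %>%
  filter(Num_of_Inspections == max(Num_of_Inspections)) %>%
  pull(Year)
# View the result (one entry per year rather than a single year)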
cat("Year with the most inspections for business_names:", most_inspected_year, "\n")
## Year with the most inspections for business_names: 2000 2001 2002 2003 2004 2005 2005 2006 2007 2008 2009 2015 2016 2017 2018 2018 2019
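The output above lists every year because `inspection_counts` is grouped by `Year`, so the maximum is taken within each year. If the goal is the single year with the most inspections overall, one option is to count rows per year directly. The sketch below uses base R on a toy data frame (the rows in `toy` are invented for illustration and are not inspection records).

```r
# Toy data for illustration only
toy <- data.frame(
  Year = c(2017, 2017, 2018, 2018, 2018, 2019),
  business_name = c("A", "B", "A", "A", "C", "B")
)

# Count inspections per year, ignoring business_name
counts <- table(toy$Year)

# Year(s) with the highest total; ties would return more than one year
busiest_year <- as.numeric(names(counts)[as.vector(counts) == max(counts)])
print(busiest_year)
```

With dplyr, the same idea would be `count(Year)` followed by a filter on the maximum count.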
# We select the most recent inspection year from our data set: 2019
restaurant_data_2019 <- restaurant_data %>%
  filter(Year == 2019)
# Group the data by business_name and calculate the mean and median inspection scores, and NumberofLocations for each business
new_dataset_2019 <- restaurant_data_2019 %>%
  group_by(business_name) %>%
  summarise(
    Mean_Inspection_Score = mean(inspection_score),
    Median_Inspection_Score = median(inspection_score),
    NumberofLocations = last(NumberofLocations)
  )
# View the new dataset for the year 2019
print(new_dataset_2019)
## # A tibble: 418 × 4
## business_name Mean_Inspection_Score Median_Inspection_Sc…¹ NumberofLocations
## <chr> <dbl> <dbl> <dbl>
## 1 3M3R LLC DBA … 95 95 13
## 2 907 ALEHOUSE 96 96 15
## 3 907 ALEHOUSE … 100 100 6
## 4 A SLICE OF HE… 100 100 8
## 5 A WHOLE LATTE… 98 98 12
## 6 ABBOTT LOOP S… 100 100 36
## 7 AFC SUSHI #11… 94 94 29
## 8 AFC SUSHI #18… 98 98 51
## 9 AFC SUSHI #66… 99 99 26
## 10 AFC SUSHI @ C… 96 96 4
## # ℹ 408 more rows
## # ℹ abbreviated name: ¹​Median_Inspection_Score
# We also select an earlier year for comparison: 2009 (the last year before the 2010-2014 gap in the data)
restaurant_data_2009 <- restaurant_data %>%
  filter(Year == 2009)
# Group the data by business_name and calculate the mean and median inspection scores, and NumberofLocations for each business
new_dataset_2009 <- restaurant_data_2009 %>%
  group_by(business_name) %>%
  summarise(
    Mean_Inspection_Score = mean(inspection_score),
    Median_Inspection_Score = median(inspection_score),
    NumberofLocations = last(NumberofLocations)
  )
# View the new dataset for the year 2009
print(new_dataset_2009)
## # A tibble: 706 × 4
## business_name Mean_Inspection_Score Median_Inspection_Sc…¹ NumberofLocations
## <chr> <dbl> <dbl> <dbl>
## 1 10TH & M SEAF… 100 100 17
## 2 3 LITTLE PIGS… 100 100 19
## 3 5TH AVENUE DE… 91 90 70
## 4 88TH ST PIZZA 98 98 43
## 5 ABBOTT LOOP S… 100 100 36
## 6 AFC SUSHI #11… 100 100 29
## 7 AFC SUSHI #18… 88 88 51
## 8 AFC SUSHI #66… 93 93 26
## 9 AFC SUSHI #71… 100 100 28
## 10 AFC SUSHI @ F… 91.8 89 51
## # ℹ 696 more rows
## # ℹ abbreviated name: ¹​Median_Inspection_Score
Step 4: Graphics:
Please make sure to display at least one scatter plot, box plot and
histogram. Don’t be limited to this. Please explore the many other
options in R packages such as ggplot2.
Data visualization for year 2019
# To visualize the relationship between each business's average inspection_score and its NumberofLocations, we use a scatter plot. Each point represents one business, with NumberofLocations on the x-axis and the average inspection_score on the y-axis.
# Data visualization - Scatter plot to visualize correlation between average inspection_score and NumberofLocations
scatter_plot_2019 <- ggplot(new_dataset_2019, aes(x = NumberofLocations, y = Mean_Inspection_Score)) +
  geom_point(size = 3, color = "steelblue") +
  labs(x = "Number of Locations", y = "Average Inspection Score",
       title = "Correlation between Average Inspection Score and Number of Locations",
       subtitle = "Year: 2019") +
  theme_minimal()
# Data visualization - Box plot to compare Inspection Scores across NumberofLocations
box_plot_2019 <- ggplot(new_dataset_2019, aes(x = factor(NumberofLocations), y = Mean_Inspection_Score)) +
  geom_boxplot(fill = "steelblue", color = "black") +
  labs(x = "Number of Locations", y = "Average Inspection Score",
       title = "Comparison of Inspection Scores across Number of Locations",
       subtitle = "Year: 2019") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Data visualization - Histogram to show the distribution of Inspection Scores
histogram_2019 <- ggplot(new_dataset_2019, aes(x = Mean_Inspection_Score)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  labs(x = "Average Inspection Score", y = "Frequency",
       title = "Distribution of Average Inspection Scores",
       subtitle = "Year: 2019", caption = "Binwidth = 5") +
  theme_minimal()
# Display the scatter plot, box plot, and histogram
print(scatter_plot_2019)

print(box_plot_2019)

print(histogram_2019)
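The box plot uses `factor(NumberofLocations)`, which draws one box per distinct location count and can crowd the x-axis. One possible alternative is to bin the counts into a few ranges first. The sketch below uses `cut()` with break points taken from the quartiles of `NumberofLocations` in the summary output (27, 41, 71); the choice of breaks, the labels, and the column name `Location_Bin` are assumptions. The five rows are the preview rows shown earlier, used only to show the mechanics.

```r
# The five preview rows shown earlier (single inspections, for illustration)
toy <- data.frame(
  NumberofLocations = c(9, 66, 79, 86, 53),
  inspection_score  = c(94, 86, 80, 96, 83)
)

# Bin location counts at the dataset quartiles reported by summary()
toy$Location_Bin <- cut(
  toy$NumberofLocations,
  breaks = c(0, 27, 41, 71, Inf),
  labels = c("1-27", "28-41", "42-71", "72+")
)

# Number of rows falling in each bin
print(table(toy$Location_Bin))
```

The binned column could then replace `factor(NumberofLocations)` in the x aesthetic, giving four boxes instead of one per location count.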

Data visualization for year 2009
# Data visualization - Scatter plot to visualize correlation between average inspection_score and NumberofLocations
scatter_plot_2009 <- ggplot(new_dataset_2009, aes(x = NumberofLocations, y = Mean_Inspection_Score)) +
  geom_point(size = 3, color = "steelblue") +
  labs(x = "Number of Locations", y = "Average Inspection Score",
       title = "Correlation between Average Inspection Score and Number of Locations",
       subtitle = "Year: 2009") +
  theme_minimal()
# Data visualization - Box plot to compare Inspection Scores across NumberofLocations
box_plot_2009 <- ggplot(new_dataset_2009, aes(x = factor(NumberofLocations), y = Mean_Inspection_Score)) +
  geom_boxplot(fill = "steelblue", color = "black") +
  labs(x = "Number of Locations", y = "Average Inspection Score",
       title = "Comparison of Inspection Scores across Number of Locations",
       subtitle = "Year: 2009") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Data visualization - Histogram to show the distribution of Inspection Scores
histogram_2009 <- ggplot(new_dataset_2009, aes(x = Mean_Inspection_Score)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  labs(x = "Average Inspection Score", y = "Frequency",
       title = "Distribution of Average Inspection Scores",
       subtitle = "Year: 2009", caption = "Binwidth = 5") +
  theme_minimal()
# Display the scatter plot, box plot, and histogram
print(scatter_plot_2009)

print(box_plot_2009)

print(histogram_2009)

Step 5: Compare the results of our analysis
# Create a new column to indicate the year in each dataset
new_dataset_2009$Year <- "2009"
new_dataset_2019$Year <- "2019"
# Combine the datasets
combined_dataset <- rbind(new_dataset_2009, new_dataset_2019)
# Plot the combined scatter plot
combined_scatter_plot <- ggplot(combined_dataset, aes(x = NumberofLocations, y = Mean_Inspection_Score, color = Year)) +
  geom_point(size = 3) +
  labs(x = "Number of Locations", y = "Average Inspection Score",
       title = "Correlation between Average Inspection Score and Number of Locations",
       subtitle = "Comparison between 2009 and 2019") +
  theme_minimal()
# Display the plot
print(combined_scatter_plot)
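The problem statement also asks how scores change over the years for chains of different sizes. A rough sketch of one way to tabulate this is to split businesses into two size groups and average the scores per year and group. The split at 41 locations (the dataset median from the summary output), the column name `Size_Group`, and all values in `toy` are assumptions made for illustration.

```r
# Toy data for illustration only (invented values)
toy <- data.frame(
  Year = c(2009, 2009, 2009, 2009, 2019, 2019, 2019, 2019),
  NumberofLocations = c(5, 10, 80, 120, 5, 10, 80, 120),
  inspection_score = c(90, 94, 88, 92, 96, 98, 91, 95)
)

# Assumed split: more than 41 locations counts as "large"
toy$Size_Group <- ifelse(toy$NumberofLocations > 41, "large", "small")

# Mean inspection score for each Year and Size_Group combination
yearly_means <- aggregate(inspection_score ~ Year + Size_Group, data = toy, FUN = mean)
print(yearly_means)
```

Applied to `restaurant_data`, the same table would cover every available year rather than only 2009 and 2019.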

Step 6: Analysis Conclusion
We conducted a preliminary comparison of the graphs for the years
2009 and 2019 to evaluate our hypothesis that restaurant chains with
more locations tend to have higher inspection scores.
Based on the graphs, we do not observe a clear relationship, in
either direction, between the number of locations and the average
inspection score in either year. The data therefore do not appear to
support our initial hypothesis. This assessment is visual only: no
correlation coefficient or significance test was computed.
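A visual check could be complemented by a correlation estimate. The sketch below uses Spearman's rank correlation via `cor.test()`, chosen here as an assumption because inspection scores are capped at 100 and unlikely to be normally distributed. It runs on the five preview rows shown earlier purely to demonstrate the call; the resulting numbers say nothing about the full dataset.

```r
# The five preview rows shown earlier (illustration only, not a result)
toy <- data.frame(
  NumberofLocations = c(9, 66, 79, 86, 53),
  inspection_score  = c(94, 86, 80, 96, 83)
)

# Spearman rank correlation; exact = FALSE avoids warnings when ranks tie
spearman_result <- cor.test(
  toy$NumberofLocations,
  toy$inspection_score,
  method = "spearman",
  exact = FALSE
)

# Correlation estimate (rho) and p-value
rho <- unname(spearman_result$estimate)
p_value <- spearman_result$p.value
print(rho)
print(p_value)
```

For the analysis itself, the same call could be applied to `NumberofLocations` and `Mean_Inspection_Score` in `new_dataset_2009` and `new_dataset_2019`.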