Haig Bedros - HW3 Problem Statement:

How does the number of locations of a restaurant chain impact its inspection scores over the years?
This question investigates whether the number of locations a restaurant chain has is related to its inspection scores over time.
By analyzing the dataset, we try to determine whether restaurant chains with more locations tend to have higher or lower inspection scores than those with fewer locations. My hypothesis is that chains with more locations tend to be overlooked and therefore have higher overall inspection scores than chains with fewer locations.
Additionally, we will explore how inspection scores change for restaurant chains with different numbers of locations as the years progress.
This analysis could provide valuable insights into how the scale and growth of restaurant chains might influence their overall compliance with food safety and hygiene standards.

Load required libraries & Read the CSV file:

# Load required libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(readr)

# make sure to run install.packages("readr") if the package is not already installed

# BONUS: Read the CSV file from github
github_csv_url <- "https://raw.githubusercontent.com/hbedros/R_HW3/main/restaurant_inspections.csv"
restaurant_data <- read_csv(github_csv_url)
## New names:
## • `` -> `...1`
## Rows: 27178 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): business_name
## dbl (4): ...1, inspection_score, Year, NumberofLocations
## lgl (1): Weekend
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
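
The "New names" message above shows that the first column of the CSV had no header and was renamed to `...1`; judging by the summary output it looks like a row index. A minimal sketch of dropping it, using a toy data frame with made-up values as a stand-in for `restaurant_data`:

```r
library(dplyr)

# Toy stand-in for restaurant_data (hypothetical values). The first column
# mimics the unnamed index column that read_csv() renamed to `...1`.
toy_data <- data.frame(
  idx = 1:3,
  business_name = c("A", "B", "C"),
  inspection_score = c(94, 86, 80)
)
names(toy_data)[1] <- "...1"

# any_of() drops the column if it exists and does nothing otherwise
toy_clean <- toy_data %>%
  select(-any_of("...1"))

print(names(toy_clean))
```

The same `select()` call could be applied to `restaurant_data` right after reading it, assuming the index column is not needed later.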

Step 1: Data Exploration:

This should include summary statistics, means, medians, quantiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.
# Introduction:
# In this analysis, we explored the restaurant inspection data to understand various aspects of the inspection scores and the impact of the number of locations on the scores over the years.

# Data Summary:
# We began by summarizing the data, calculating the mean, median, and quartiles for the inspection scores. We also identified the number of unique businesses present in the dataset.

# Summary statistics
summary(restaurant_data)
##       ...1       business_name      inspection_score      Year     
##  Min.   :    1   Length:27178       Min.   : 66.00   Min.   :2000  
##  1st Qu.: 6795   Class :character   1st Qu.: 90.00   1st Qu.:2006  
##  Median :13590   Mode  :character   Median : 95.00   Median :2009  
##  Mean   :13590                      Mean   : 93.64   Mean   :2010  
##  3rd Qu.:20384                      3rd Qu.:100.00   3rd Qu.:2016  
##  Max.   :27178                      Max.   :100.00   Max.   :2019  
##  NumberofLocations  Weekend       
##  Min.   :  1.00    Mode :logical  
##  1st Qu.: 27.00    FALSE:26968    
##  Median : 41.00    TRUE :210      
##  Mean   : 64.77                   
##  3rd Qu.: 71.00                   
##  Max.   :646.00
# Insight on the data
head(restaurant_data, 5)
## # A tibble: 5 × 6
##    ...1 business_name           inspection_score  Year NumberofLocations Weekend
##   <dbl> <chr>                              <dbl> <dbl>             <dbl> <lgl>  
## 1     1 MCGINLEYS PUB                         94  2017                 9 FALSE  
## 2     2 VILLAGE INN #1                        86  2015                66 FALSE  
## 3     3 RONNIE SUSHI 2                        80  2016                79 FALSE  
## 4     4 FRED MEYER - RETAIL FI…               96  2003                86 FALSE  
## 5     5 PHO GRILL                             83  2017                53 FALSE
# Calculate summary statistics for inspection_score
summary_stats <- restaurant_data %>%
  summarize(
    Mean_Inspection_Score = mean(inspection_score),
    Median_Inspection_Score = median(inspection_score),
    Q1_Inspection_Score = quantile(inspection_score, 0.25),
    Q3_Inspection_Score = quantile(inspection_score, 0.75)
  )

# Calculate the number of unique businesses from the business_name column
unique_businesses <- restaurant_data %>%
  distinct(business_name) %>%
  nrow()

# View the results
print("Summary Statistics for Inspection Score:")
## [1] "Summary Statistics for Inspection Score:"
print(summary_stats)
## # A tibble: 1 × 4
##   Mean_Inspection_Score Median_Inspection_Score Q1_Inspection_Score
##                   <dbl>                   <dbl>               <dbl>
## 1                  93.6                      95                  90
## # ℹ 1 more variable: Q3_Inspection_Score <dbl>
cat("\nNumber of Unique Businesses:", unique_businesses, "\n")
## 
## Number of Unique Businesses: 1618
# Extract unique years of inspections from the "Year" column and order them in ascending order
unique_years <- restaurant_data %>%
  distinct(Year) %>%
  arrange(Year) %>%
  pull(Year)

# View the unique years
cat("\nUnique Years of Inspections:", unique_years, "\n")
## 
## Unique Years of Inspections: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2015 2016 2017 2018 2019
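
The list of years above jumps from 2009 to 2015. A short base R sketch that makes the gap explicit, with the years copied from that output:

```r
# Years copied from the "Unique Years of Inspections" output
unique_years <- c(2000:2009, 2015:2019)

# Every year between the first and the last inspection year
all_years <- seq(min(unique_years), max(unique_years))

# Years inside that range with no inspections recorded
missing_years <- setdiff(all_years, unique_years)
print(missing_years)
```

Any comparison "over the years" should keep this gap in mind, since no inspections are recorded for the missing years.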

Step 2: Data wrangling:

Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example – if it makes sense you could sum two columns together)
# In this section, we create a subset of the restaurant inspection data with one row per unique business and compute key inspection statistics for each. The goal is to compare the performance of different businesses and identify trends in their inspection scores and operational characteristics.

# Create a subset with unique businesses and their inspection statistics
subset_data <- restaurant_data %>%
  group_by(business_name) %>%
  summarise(
    Mean_Inspection_Score = mean(inspection_score),
    Median_Inspection_Score = median(inspection_score),
    Last_NumberofLocations = last(NumberofLocations),
    Most_Recent_Year = max(Year)
  )

# View the subset data
print(subset_data)
## # A tibble: 1,618 × 5
##    business_name               Mean_Inspection_Score Median_Inspection_Score
##    <chr>                                       <dbl>                   <dbl>
##  1 10TH & M SEAFOODS                            96                        96
##  2 12-100 COFFEE & COMMUNITIES                  96.6                      97
##  3 3 LITTLE PIGS - S                            99.7                     100
##  4 3M3R LLC DBA YAMA SUSHI                      95                        95
##  5 49TH STATE BREWERY                           90.7                      88
##  6 49TH STATE BREWERY - BAR                     96                        96
##  7 5TH AVENUE DELI                              90.5                      90
##  8 88TH ST PIZZA                                95.7                      95
##  9 907                                          96                        96
## 10 907 ALEHOUSE                                 91.7                      90
## # ℹ 1,608 more rows
## # ℹ 2 more variables: Last_NumberofLocations <dbl>, Most_Recent_Year <dbl>
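
Note that `last()` returns the value from the final row of each group in whatever order the rows happen to be stored. If `Last_NumberofLocations` is meant to come from the most recent inspection year, one option is to sort by `Year` first. A minimal sketch on toy data (hypothetical values, not taken from the dataset):

```r
library(dplyr)

# Toy stand-in for restaurant_data; rows are deliberately not in year order
toy_data <- data.frame(
  business_name = c("A", "A", "B"),
  Year = c(2019, 2009, 2017),
  NumberofLocations = c(7, 5, 12)
)

latest_locations <- toy_data %>%
  arrange(Year) %>%                  # oldest first, so last() is the newest
  group_by(business_name) %>%
  summarise(
    Last_NumberofLocations = last(NumberofLocations),
    Most_Recent_Year = max(Year),
    .groups = "drop"
  )

print(latest_locations)
```

If the number of locations never changes for a given business, the sorting step makes no difference.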

Step 3: Meaningful question for analysis:

How does the number of locations of a restaurant chain impact its inspection scores over the years?
# Group the data by Year and business_name and count the number of inspections for each combination
inspection_counts <- restaurant_data %>%
  group_by(Year, business_name) %>%
  summarise(Num_of_Inspections = n())
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
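# inspection_counts is still grouped by Year, so this filter keeps the most-inspected
# business(es) within each year; the result therefore contains every year, not a single one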
most_inspected_year <- inspection_counts %>%
  filter(Num_of_Inspections == max(Num_of_Inspections)) %>%
  pull(Year)

# View the result
cat("Year with the most inspections for business_names:", most_inspected_year, "\n")
## Year with the most inspections for business_names: 2000 2001 2002 2003 2004 2005 2005 2006 2007 2008 2009 2015 2016 2017 2018 2018 2019
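
Because `inspection_counts` is still grouped by `Year` after `summarise()` (see the message about grouped output), the filter above runs within each year, which is why the output lists every year. If the goal is the single year with the most inspections overall, a sketch along these lines may be closer to it. The data frame is a toy stand-in with made-up rows:

```r
library(dplyr)

# Toy stand-in for restaurant_data (hypothetical rows)
toy_data <- data.frame(
  Year = c(2009, 2009, 2009, 2019, 2019),
  business_name = c("A", "A", "B", "A", "C")
)

# One row per year with the total number of inspections in that year
inspections_per_year <- toy_data %>%
  count(Year, name = "Num_of_Inspections")

# Keep the year with the largest total (ties would return several years)
most_inspected_year <- inspections_per_year %>%
  slice_max(Num_of_Inspections, n = 1) %>%
  pull(Year)

print(most_inspected_year)
```
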
# We select the most recent inspection year from our data set: 2019
restaurant_data_2019 <- restaurant_data %>%
  filter(Year == 2019)

# Group the data by business_name and calculate the mean and median inspection scores, and NumberofLocations for each business
new_dataset_2019 <- restaurant_data_2019 %>%
  group_by(business_name) %>%
  summarise(
    Mean_Inspection_Score = mean(inspection_score),
    Median_Inspection_Score = median(inspection_score),
    NumberofLocations = last(NumberofLocations)
  )

# View the new dataset for the year 2019
print(new_dataset_2019)
## # A tibble: 418 × 4
##    business_name  Mean_Inspection_Score Median_Inspection_Sc…¹ NumberofLocations
##    <chr>                          <dbl>                  <dbl>             <dbl>
##  1 3M3R LLC DBA …                    95                     95                13
##  2 907 ALEHOUSE                      96                     96                15
##  3 907 ALEHOUSE …                   100                    100                 6
##  4 A SLICE OF HE…                   100                    100                 8
##  5 A WHOLE LATTE…                    98                     98                12
##  6 ABBOTT LOOP S…                   100                    100                36
##  7 AFC SUSHI #11…                    94                     94                29
##  8 AFC SUSHI #18…                    98                     98                51
##  9 AFC SUSHI #66…                    99                     99                26
## 10 AFC SUSHI @ C…                    96                     96                 4
## # ℹ 408 more rows
## # ℹ abbreviated name: ¹​Median_Inspection_Score
# We also select a second, earlier year of inspections for comparison: 2009
restaurant_data_2009 <- restaurant_data %>%
  filter(Year == 2009)

# Group the data by business_name and calculate the mean and median inspection scores, and NumberofLocations for each business
new_dataset_2009 <- restaurant_data_2009 %>%
  group_by(business_name) %>%
  summarise(
    Mean_Inspection_Score = mean(inspection_score),
    Median_Inspection_Score = median(inspection_score),
    NumberofLocations = last(NumberofLocations)
  )

# View the new dataset for the year 2009
print(new_dataset_2009)
## # A tibble: 706 × 4
##    business_name  Mean_Inspection_Score Median_Inspection_Sc…¹ NumberofLocations
##    <chr>                          <dbl>                  <dbl>             <dbl>
##  1 10TH & M SEAF…                 100                      100                17
##  2 3 LITTLE PIGS…                 100                      100                19
##  3 5TH AVENUE DE…                  91                       90                70
##  4 88TH ST PIZZA                   98                       98                43
##  5 ABBOTT LOOP S…                 100                      100                36
##  6 AFC SUSHI #11…                 100                      100                29
##  7 AFC SUSHI #18…                  88                       88                51
##  8 AFC SUSHI #66…                  93                       93                26
##  9 AFC SUSHI #71…                 100                      100                28
## 10 AFC SUSHI @ F…                  91.8                     89                51
## # ℹ 696 more rows
## # ℹ abbreviated name: ¹​Median_Inspection_Score
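
The 2019 and 2009 blocks above differ only in the year, so the shared steps could be wrapped in a small function. This is a sketch, not part of the original analysis; `summarise_year` is a hypothetical helper name and the data are toy values:

```r
library(dplyr)

# Hypothetical helper: per-business score summary for one inspection year
summarise_year <- function(data, year) {
  data %>%
    filter(Year == year) %>%
    group_by(business_name) %>%
    summarise(
      Mean_Inspection_Score = mean(inspection_score),
      Median_Inspection_Score = median(inspection_score),
      NumberofLocations = last(NumberofLocations),
      .groups = "drop"
    )
}

# Toy stand-in for restaurant_data (hypothetical values)
toy_data <- data.frame(
  business_name = c("A", "A", "B", "A"),
  inspection_score = c(90, 100, 80, 70),
  Year = c(2019, 2019, 2019, 2009),
  NumberofLocations = c(5, 5, 12, 5)
)

toy_2019 <- summarise_year(toy_data, 2019)
print(toy_2019)
```

With the real data, `summarise_year(restaurant_data, 2019)` and `summarise_year(restaurant_data, 2009)` would be expected to reproduce the two tables above.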

Step 4: Graphics:

Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.

Data visualization for year 2019

# To visualize the relationship between each business's average inspection_score and its NumberofLocations, we use a scatter plot. Each point represents a unique business, with NumberofLocations on the x-axis and the average inspection_score on the y-axis.

# Data visualization - Scatter plot to visualize correlation between average inspection_score and NumberofLocations
scatter_plot_2019 <- ggplot(new_dataset_2019, aes(x = NumberofLocations, y = Mean_Inspection_Score)) +
  geom_point(size = 3, color = "steelblue") +
  labs(x = "Number of Locations", y = "Average Inspection Score", title = "Correlation between Average Inspection Score and Number of Locations",
       subtitle = "Year: 2019") +
  theme_minimal()

# Data visualization - Box plot to compare Inspection Scores across NumberofLocations
box_plot_2019 <- ggplot(new_dataset_2019, aes(x = factor(NumberofLocations), y = Mean_Inspection_Score)) +
  geom_boxplot(fill = "steelblue", color = "black") +
  labs(x = "Number of Locations", y = "Average Inspection Score", title = "Comparison of Inspection Scores across Number of Locations",
       subtitle = "Year: 2019") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Data visualization - Histogram to show the distribution of Inspection Scores
histogram_2019 <- ggplot(new_dataset_2019, aes(x = Mean_Inspection_Score)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  labs(x = "Average Inspection Score", y = "Frequency", title = "Distribution of Average Inspection Scores",
       subtitle = "Year: 2019", caption = "Binwidth = 5") +
  theme_minimal()

# Display the scatter plot, box plot, and histogram
print(scatter_plot_2019)

print(box_plot_2019)

print(histogram_2019)
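
With `factor(NumberofLocations)` on the x-axis, the box plot draws one box per distinct location count, so many boxes are based on only a few businesses. One alternative is to bin the counts into a few groups first. The sketch below uses toy data, and the bin edges (10 and 50) are arbitrary assumptions rather than values derived from the dataset:

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for new_dataset_2019 (hypothetical values)
toy_2019 <- data.frame(
  business_name = c("A", "B", "C", "D", "E", "F"),
  Mean_Inspection_Score = c(95, 100, 88, 92, 97, 90),
  NumberofLocations = c(4, 8, 26, 36, 51, 120)
)

# Bin the location counts; intervals are (0,10], (10,50], (50,Inf]
toy_binned <- toy_2019 %>%
  mutate(Location_Group = cut(
    NumberofLocations,
    breaks = c(0, 10, 50, Inf),
    labels = c("1-10", "11-50", "51+")
  ))

# One box per location group instead of one per distinct count
binned_box_plot <- ggplot(toy_binned, aes(x = Location_Group, y = Mean_Inspection_Score)) +
  geom_boxplot(fill = "steelblue", color = "black") +
  labs(x = "Number of Locations (binned)", y = "Average Inspection Score") +
  theme_minimal()

print(table(toy_binned$Location_Group))
```

Calling `print(binned_box_plot)` would display the plot in the same way as the plots above.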

Data visualization for year 2009

# Data visualization - Scatter plot to visualize correlation between average inspection_score and NumberofLocations
scatter_plot_2009 <- ggplot(new_dataset_2009, aes(x = NumberofLocations, y = Mean_Inspection_Score)) +
  geom_point(size = 3, color = "steelblue") +
  labs(x = "Number of Locations", y = "Average Inspection Score", title = "Correlation between Average Inspection Score and Number of Locations",
       subtitle = "Year: 2009") +
  theme_minimal()

# Data visualization - Box plot to compare Inspection Scores across NumberofLocations
box_plot_2009 <- ggplot(new_dataset_2009, aes(x = factor(NumberofLocations), y = Mean_Inspection_Score)) +
  geom_boxplot(fill = "steelblue", color = "black") +
  labs(x = "Number of Locations", y = "Average Inspection Score", title = "Comparison of Inspection Scores across Number of Locations",
       subtitle = "Year: 2009") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Data visualization - Histogram to show the distribution of Inspection Scores
histogram_2009 <- ggplot(new_dataset_2009, aes(x = Mean_Inspection_Score)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  labs(x = "Average Inspection Score", y = "Frequency", title = "Distribution of Average Inspection Scores",
       subtitle = "Year: 2009", caption = "Binwidth = 5") +
  theme_minimal()

# Display the scatter plot, box plot, and histogram
print(scatter_plot_2009)

print(box_plot_2009)

print(histogram_2009)

Step 5: Compare the results of our analysis

# Create a new column to indicate the year in each dataset
new_dataset_2009$Year <- "2009"
new_dataset_2019$Year <- "2019"

# Combine the datasets
combined_dataset <- rbind(new_dataset_2009, new_dataset_2019)

# Plot the combined scatter plot
combined_scatter_plot <- ggplot(combined_dataset, aes(x = NumberofLocations, y = Mean_Inspection_Score, color = Year)) +
  geom_point(size = 3) +
  labs(x = "Number of Locations", y = "Average Inspection Score", title = "Correlation between Average Inspection Score and Number of Locations",
       subtitle = "Comparison between 2009 and 2019") +
  theme_minimal()

# Display the plot
print(combined_scatter_plot)

Step 6: Analysis Conclusion

We conducted a preliminary comparison of the graphs for 2009 and 2019 to evaluate our hypothesis that restaurant chains with more locations would tend to have higher inspection scores. Based on the graphs, we do not observe a clear relationship between the number of locations and the inspection score in either year, in either direction. Our initial hypothesis is therefore probably not true, although this conclusion rests on visual inspection only and not on a statistical test.
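
The conclusion above is based on reading the plots. A rank correlation is one way to put a number on the same relationship. This was not part of the original analysis; the sketch below uses toy values as a stand-in for `new_dataset_2019`, and the choice of Spearman's method is an assumption made because inspection scores are capped at 100 and unlikely to be normally distributed:

```r
# Toy stand-in for new_dataset_2019 (hypothetical values)
toy_2019 <- data.frame(
  NumberofLocations = c(4, 8, 26, 36, 51, 120),
  Mean_Inspection_Score = c(95, 100, 88, 92, 97, 90)
)

# Spearman rank correlation between location count and average score
rho <- cor(
  toy_2019$NumberofLocations,
  toy_2019$Mean_Inspection_Score,
  method = "spearman"
)

# The accompanying test reports a p-value for the null hypothesis rho = 0
spearman_test <- cor.test(
  toy_2019$NumberofLocations,
  toy_2019$Mean_Inspection_Score,
  method = "spearman"
)

print(rho)
print(spearman_test$p.value)
```

On the real data, many businesses share the same score, so `cor.test()` would likely warn that it cannot compute an exact p-value with ties and fall back to an approximation.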