# Project Summary

Vivendo, a fast-food chain operating across Brazil, often handles compensation claims related to food poisoning. To support improvements in customer satisfaction, this analysis explores how long it takes to close claims at four legal office locations: Recife, São Luís, Fortaleza, and Natal. The dataset contains 2,000 records and 8 fields and was cleaned and preprocessed before analysis. Cleaning addressed missing values in amount_paid and linked_cases and inconsistent labeling in the cause field.

The findings revealed that Recife handles the highest number of claims, while Natal receives the fewest. Through visualizations such as histograms and boxplots, it became evident that most claims are closed within 200 days, though some outliers exceed 300 days. When comparing the claim closure times across locations (after removing outliers), São Luís demonstrated the most consistent and quickest response times, followed closely by Recife. In contrast, Natal exhibited greater variability, suggesting inconsistency in claim handling. These insights can guide the legal team in identifying best practices and improving efficiency in offices with longer or inconsistent closure times.

# Set knitr chunk options and the working directory
knitr::opts_chunk$set(echo = TRUE)
# Note: this path is specific to the author's machine
setwd("C:/Users/Yvonne/Downloads")

# Load required package
library(ggplot2)

# Load the dataset

claim <- read.csv("food_claims_2212_cleaned.csv", header = TRUE, sep = ",")

# Preview first few rows
head(claim)
##   claim_id time_to_close claim_amount amount_paid  location
## 1        1           317     74474.55    51231.37    RECIFE
## 2        2           195     52137.83    42111.30 FORTALEZA
## 3        3           183     24447.20    23986.30  SAO LUIS
## 4        4           186     29006.28    27942.72 FORTALEZA
## 5        5           138     19520.60    16251.06    RECIFE
## 6        6           183     47529.14    38011.98     NATAL
##   individuals_on_claim linked_cases     cause
## 1                   15        FALSE   unknown
## 2                   12         TRUE   unknown
## 3                   10         TRUE      meat
## 4                   11        FALSE      meat
## 5                   11        FALSE vegetable
## 6                   11        FALSE   unknown
# Plot: Total claims per location
# The bar chart below shows the total number of claims filed at each of the four legal offices: Recife, São Luís, Fortaleza, and Natal. Recife has the highest number of claims, followed by São Luís, Fortaleza, and finally Natal.

# The claims are not evenly distributed across offices: the count for the busiest office (Recife) is almost 3 times the count for the quietest office (Natal). The legal team should therefore investigate why so many claims are reported in Recife.

ggplot(data = claim, aes(x = location)) +
  geom_bar(stat = "count") +
  ggtitle("TOTAL CLAIMS PER LOCATION") +
  theme(plot.title = element_text(hjust = 0.5))
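
The ranking and the "almost 3 times" ratio described above can also be checked numerically instead of being read off the chart. The following is a minimal sketch that uses a small made-up vector in place of claim$location; the counts are illustrative and are not the real data.

```r
# Hypothetical stand-in for claim$location; these values are illustrative only.
location <- c(rep("RECIFE", 9), rep("SAO LUIS", 5),
              rep("FORTALEZA", 4), rep("NATAL", 3))

# Count claims per office and sort from busiest to quietest.
claims_per_location <- sort(table(location), decreasing = TRUE)
print(claims_per_location)

# Ratio of the busiest office's count to the quietest office's count.
count_ratio <- max(claims_per_location) / min(claims_per_location)
print(count_ratio)
```

With the real data, replacing the toy vector with claim$location should give the counts behind the bar chart.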

# Plot: Distribution of time to close
# For customers to be compensated quickly, the number of days it takes to close a claim also matters: the longer a claim stays open, the later compensation is likely to be paid. The histogram below shows the distribution of the number of days it takes to close claims.

# The majority of claims are closed in fewer than 200 days; the rest take longer. The histogram shows that the distribution of time to close is right-skewed, with outliers exceeding 300 days.
ggplot(data = claim, aes(x = time_to_close)) +
  geom_histogram(binwidth = 10) +
  ggtitle("DISTRIBUTION OF TIME TO CLOSE CLAIMS") +
  theme(plot.title = element_text(hjust = 0.5))
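
The statement that most claims close in under 200 days can be quantified directly. The sketch below assumes a numeric vector of closing times; the values are made up for illustration, and the variable names share_under_200 and n_over_300 are hypothetical.

```r
# Illustrative values only; with the real data use claim$time_to_close.
days <- c(120, 150, 175, 180, 195, 210, 250, 320)

# Share of claims closed in fewer than 200 days.
share_under_200 <- mean(days < 200)
print(share_under_200)

# Number of claims that took more than 300 days to close.
n_over_300 <- sum(days > 300)
print(n_over_300)
```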

# Plot: Time to close by location
# So far, we know that the Recife legal office has received the most food-poisoning claims. To compare the offices, we also need to look at the number of days each one takes to close claims.

# A boxplot is useful for comparing the distribution of a continuous variable (time_to_close) across categories (the 4 locations).

# In the boxplot, the interquartile ranges and medians of the four locations are close to one another, which makes comparison difficult. This is due to the presence of outliers.

ggplot(data = claim, aes(x = location, y = time_to_close)) +
  geom_boxplot() +
  ggtitle("TIME TO CLOSE BY LOCATION") +
  theme(plot.title = element_text(hjust = 0.5))

# Remove outliers using IQR method
# To make the comparison easier, the outliers were removed using the IQR method and a new boxplot was plotted, as shown below.

# The interquartile range (IQR) of each distribution indicates that São Luís has the smallest spread, followed by Recife, then Fortaleza, with Natal having the highest IQR.

# The IQR of time to close in Recife is lower than that of the Fortaleza and Natal offices, even though Recife has the highest number of claims. This suggests that Recife's closing times vary less than those of the other two offices. Because the IQR measures spread rather than level, the medians should be compared to judge which office closes claims faster. A higher IQR means the middle 50% of values are spread out, so closing times are less consistent; a lower IQR means the middle values are clustered, so closing times are more consistent. Here the offices with the lower IQRs are Recife and São Luís.

Q1 <- quantile(claim$time_to_close, 0.25)
Q3 <- quantile(claim$time_to_close, 0.75)
IQR_val <- Q3 - Q1

cleaned_claim <- subset(claim, time_to_close > (Q1 - 1.5 * IQR_val) & time_to_close < (Q3 + 1.5 * IQR_val))
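
To make the filtering rule concrete, here is a small worked sketch of the same 1.5 * IQR fences applied to a toy vector. The helper name iqr_filter and the sample values are hypothetical and are not part of the original script.

```r
# Keep only the values strictly inside the k * IQR fences.
# `iqr_filter` is a hypothetical helper name used for illustration.
iqr_filter <- function(x, k = 1.5) {
  q1 <- quantile(x, 0.25, names = FALSE)
  q3 <- quantile(x, 0.75, names = FALSE)
  iqr <- q3 - q1
  lower <- q1 - k * iqr  # lower fence
  upper <- q3 + k * iqr  # upper fence
  x[x > lower & x < upper]
}

# Illustrative values only (not taken from the claims data).
days <- c(150, 160, 170, 180, 190, 200, 400)
kept <- iqr_filter(days)
print(kept)
```

For this toy vector the quartiles are 165 and 195, so the fences are 120 and 240 and only the value 400 is dropped. As in the script above, the fences are computed once over all values rather than separately for each location.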

# Check the cleaned dataset
dim(cleaned_claim)
## [1] 1877    8
summary(cleaned_claim)
##     claim_id    time_to_close  claim_amount    amount_paid   
##  Min.   :   2   Min.   : 90   Min.   : 1760   Min.   : 1517  
##  1st Qu.: 508   1st Qu.:156   1st Qu.:13203   1st Qu.:10824  
##  Median :1008   Median :177   Median :23832   Median :19406  
##  Mean   :1006   Mean   :178   Mean   :25775   Mean   :20582  
##  3rd Qu.:1506   3rd Qu.:196   3rd Qu.:36693   3rd Qu.:29225  
##  Max.   :2000   Max.   :272   Max.   :76107   Max.   :52499  
##    location         individuals_on_claim linked_cases       cause          
##  Length:1877        Min.   : 1.000       Mode :logical   Length:1877       
##  Class :character   1st Qu.: 4.000       FALSE:1396      Class :character  
##  Mode  :character   Median : 8.000       TRUE :481       Mode  :character  
##                     Mean   : 7.806                                         
##                     3rd Qu.:11.000                                         
##                     Max.   :15.000
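
The per-location IQR comparison discussed earlier is read off the boxplot; it could also be computed directly. The sketch below is one possible way to do that with tapply, using a made-up data frame whose column names mirror the claims data. The values are illustrative only.

```r
# Illustrative data frame; column names mirror the claims data, values are made up.
toy <- data.frame(
  location = rep(c("NATAL", "SAO LUIS"), each = 5),
  time_to_close = c(140, 160, 180, 200, 220,  # NATAL: widely spread
                    170, 175, 180, 185, 190)  # SAO LUIS: tightly clustered
)

# Median and IQR of time_to_close for each location.
medians <- tapply(toy$time_to_close, toy$location, median)
iqrs <- tapply(toy$time_to_close, toy$location, IQR)
print(medians)
print(iqrs)
```

In this toy example both locations have the same median but different IQRs, which is why spread and typical closing time are worth reporting separately. With the real data, cleaned_claim would take the place of toy.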
# Plot: Cleaned time to close by location

# Based on the boxplot below, the legal team should study the locations that close claims in fewer than 200 days, and also examine those locations that take more than 200 days. This will help it find ways to reduce the time it takes to close claims, which in turn may reduce the time customers wait for compensation.
ggplot(data = cleaned_claim, aes(x = location, y = time_to_close)) +
  geom_boxplot() +
  ggtitle("TIME TO CLOSE BY LOCATION (CLEANED)") +
  theme(plot.title = element_text(hjust = 0.5))

# Export cleaned dataset
write.csv(cleaned_claim, file = "cleaned_claim.csv", row.names = FALSE)