This is an individual assignment and you must work on it on your own. Collaboration on the assignment constitutes collusion. For more on collusion and misconduct please see this webpage.
This assignment is designed to simulate a scenario in which you are taking over someone’s existing work and continuing with it to draw some further insights. The dataset you will use has already gone through preliminary cleaning, and it will be your job to use this cleaned dataset to answer questions.
You have just joined a large insurance company as a data analyst. Your first job is to help the fraud team understand the characteristics of fraudulent insurance claims. To get you started in this new role, your manager has asked you to perform a short EDA on a snippet of customer data that the fraud team has compiled. You are to communicate your findings to the chief analytics leader. This is not a formal report, but rather something you are giving to your manager that describes the data with some interesting insights.
Please make sure you read the hints throughout the assignment to help guide you on the tasks.
The points allocated for each of the elements in the assignment are marked in the questions and in certain cases, code scaffolding has been provided. However, if you feel this scaffolding is unhelpful, you are not obliged to use it.
This assignment is out of 50 and will be worth 10% of your total grade. Due on: Thursday, 22 August 2024, 11:59 PM (Melbourne time).
For this assignment, you will need to upload the following into Moodle:
The rendered html file saved as a pdf. The assignment will only be marked if the pdf is uploaded in Moodle. The submitted assignment pdf file must have all the code and output visible.
To complete the assignment, you will need to fill in the blanks
with appropriate R code for some questions. These sections are marked
with ___. For other questions, you will need to write the
entire R code chunk.
At a minimum, your assignment should be able to be
“knitted” using the Knit button for your Rmarkdown
document, so that you can produce an html file that you will save as a pdf
file and upload to Moodle.
If you want to look at what the assignment looks like as you progress,
remember that you can set the R chunk options to
eval = FALSE like so to ensure that you can knit the
file:
```{r this-chunk-will-not-run, eval = FALSE}
a <- 1 + 2
```
If you use eval = FALSE or
echo = FALSE, please remember to set them back
to eval = TRUE and echo = TRUE when you submit
the assignment, to ensure that all your R code runs.
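Chunk options can also be set once for the whole document rather than chunk by chunk. The following is a minimal sketch, assuming the `knitr` package is installed (it is what the Knit button relies on). Placing it in a setup chunk at the top of the Rmd file is a common convention, not something the assignment requires.

```r
# Set document-wide defaults so every chunk runs and shows its code
knitr::opts_chunk$set(eval = TRUE, echo = TRUE)

# Check the defaults that later chunks will inherit
eval_default <- knitr::opts_chunk$get("eval")
echo_default <- knitr::opts_chunk$get("echo")
print(c(eval = eval_default, echo = echo_default))
```

Individual chunks can still override these defaults in their own header, as in the eval = FALSE example above.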
IMPORTANT: You must use R code to answer all the questions in the report.
This assignment is due on Thursday, 22 August 2024, 11:55 PM. You will submit the knitted html file saved as a pdf via Moodle. Please make sure you add your name on the YAML part of the Rmd file before you knit it and save it as pdf.
Remember, you can look up the help file for functions by typing:
?function_name. For example, ?mean.
library(tidyverse)
fraud <- read.csv("D:/DATA ANA/data/fraud.csv")
Display the first 7 rows of the data set (1pt). Hint: Check ?head in your R console.
head(fraud, 7)
## customer Claim.Amount coverage education employment_status gender
## 1 BU79786 2,764 Basic Bachelor Employed F
## 2 QZ44356 6,980 Extended Bachelor Unemployed F
## 3 AI49188 12,887 Premium Bachelor Employed F
## 4 WW63253 7,646 Basic Bachelor Unemployed M
## 5 HB64268 2,814 Basic Bachelor Employed M
## 6 OC83172 8,256 Basic Bachelor Employed F
## 7 XZ87318 5,381 Basic College Employed F
## Annual.Income.... location_category marital_status age monthly_premium
## 1 56274 Suburban Married 59 43
## 2 0 Suburban Single 29 96
## 3 48767 Suburban Married 42 108
## 4 0 Suburban Married 36 116
## 5 43836 Rural Single 23 83
## 6 62902 Rural Married 33 79
## 7 55350 Suburban Married 17 107
## months_since_last_claim vehicle_size reported_to_police
## 1 32 Medsize n
## 2 13 Medsize y
## 3 18 Medsize n
## 4 18 Medsize y
## 5 12 Medsize n
## 6 14 Medsize n
## 7 0 Medsize n
## witness_details_provided weekend car_driveable fraud_detected
## 1 y n y 0
## 2 y n y 0
## 3 y n y 0
## 4 y n y 0
## 5 y y y 0
## 6 y n y 0
## 7 n y y 1
How many observations and variables does the data set fraud have? Use the inline code to complete the sentence below (4pts)
dim(fraud)
## [1] 9134 18
Your inline code should look like the following. Please paste the inline code in the code chunk below for us to mark.
#The number of observations are `r nrow(fraud)` (2pt) and the number of variables are `r ncol(fraud)` (2pt)
Using the fraud data set, rename the Claim Amount
variable to claim_amt and the
Annual Income ($) variable to ann_inc. Save
this new data frame as fraud (4pts).
HINT: use ” ” for the original variable names
fraud <- dplyr::rename(fraud, claim_amt="Claim.Amount", ann_inc = "Annual.Income....")
fraud <- fraud %>%
mutate(claim_amt = as.factor(claim_amt)) %>%
mutate(claim_amt = as.numeric(claim_amt))
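The raw Claim.Amount values are text with thousands separators (for example "2,764"). Note that `as.numeric()` applied to a factor returns the integer level codes, not the printed amounts, which is why customer AI49188 shows 12,887 in the raw data but 449 after the conversion above. The sketch below contrasts the two routes. It assumes commas are the only non-numeric characters, and the vector `raw_amt` is made up for illustration.

```r
# Hypothetical sample of comma-formatted amounts (not the real data)
raw_amt <- c("2,764", "6,980", "12,887")

# Factor route: returns level codes (positions in the sorted levels), not amounts
codes <- as.numeric(as.factor(raw_amt))

# Strip the thousands separator first, then convert the text to numbers
amounts <- as.numeric(gsub(",", "", raw_amt, fixed = TRUE))

print(codes)
print(amounts)
```

With the tidyverse loaded, `readr::parse_number()` is another way to do the same conversion. Any results computed from level codes, such as a filter on claim_amt > 5000, would need to be re-run after switching to parsed amounts.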
Display the first 6 rows and 7 columns corresponding to the customers with “Premium coverage” (1pt). HINT: use c(,) in head()
fraud %>% filter(coverage=="Premium") %>% head(c(6,7))
## customer claim_amt coverage education employment_status gender
## 1 AI49188 449 Premium Bachelor Employed F
## 2 CF85061 4469 Premium Master Unemployed M
## 3 DP39365 5304 Premium Master Employed M
## 4 FL50705 4977 Premium High School or Below Employed F
## 5 US89481 2491 Premium Bachelor Unemployed F
## 6 GE62437 452 Premium College Employed F
## ann_inc
## 1 48767
## 2 0
## 3 77026
## 4 66140
## 5 0
## 6 86584
Using the fraud data set and the function filter,
extract all claims where fraud was detected and the claim amount was
greater than $5,000. Call this new data set high_fraud.
(4pts)
# fraud_detected is coded 0/1, so compare against 1; the comma in filter() means AND
high_fraud <- fraud %>%
  filter(fraud_detected == 1, claim_amt > 5000)
How many variables and observations are in the high_fraud
data frame? (2pts) HINT: you can use
count or nrow to complete
this.
num_variable <- ncol(high_fraud)
num_observation <- nrow(high_fraud)
Use the high_fraud dataframe to answer the following questions.
Which education types have been detected as committing high levels of fraud? Display the top three education groups. (5pts)
high_fraud %>%
  group_by(education) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  head(3)
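The group, count, sort and truncate steps above can also be written more compactly. This is a hedged sketch on made-up data, assuming the `dplyr` package is installed; `toy` and `top_groups` are illustrative names, not part of the assignment.

```r
library(dplyr)

# Hypothetical toy data (not the assignment data)
toy <- tibble(
  education = c("Bachelor", "College", "Bachelor", "Master",
                "College", "Bachelor", "Doctor")
)

# count() combines group_by() and summarise(n = n());
# sort = TRUE orders the result by n, largest first
top_groups <- toy %>%
  count(education, sort = TRUE) %>%
  head(2)

top_groups
```

Both forms return one row per group with a count column `n`, so the choice between them is mostly a matter of readability.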
Which insurance coverage has the highest level of high insurance fraud detected? (2pts)
high_fraud_coverage <- high_fraud %>%
  group_by(coverage) %>%
  summarise(high_fraud_count = n()) %>%
  arrange(desc(high_fraud_count))
high_fraud_coverage
## # A tibble: 3 × 2
## coverage high_fraud_count
## <chr> <int>
## 1 Basic 43
## 2 Extended 25
## 3 Premium 8
Are males or females more likely to commit high levels of fraud? (2pts)
high_fraud_gender <- high_fraud %>%
  group_by(gender) %>%
  summarise(high_fraud_count = n()) %>%
  arrange(desc(high_fraud_count))
high_fraud_gender
## # A tibble: 2 × 2
## gender high_fraud_count
## <chr> <int>
## 1 F 43
## 2 M 33
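A raw count can favour whichever group files more claims overall. One way to read "more likely" is as a rate: the share of each gender's claims that are high fraud. The sketch below illustrates the difference on made-up data, assuming the `dplyr` package is installed; `toy`, the `high_fraud` flag column and `toy_rates` are hypothetical names.

```r
library(dplyr)

# Hypothetical toy data (not the assignment data): one row per claim
toy <- tibble(
  gender     = c("F", "F", "F", "F", "F", "F", "M", "M", "M", "M"),
  high_fraud = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE,
                 TRUE, TRUE, FALSE, FALSE)
)

# Count of high-fraud claims and the share of each gender's claims they represent
toy_rates <- toy %>%
  group_by(gender) %>%
  summarise(
    claims           = n(),
    high_fraud_count = sum(high_fraud),
    high_fraud_rate  = mean(high_fraud)
  )

toy_rates
```

In this toy example one group has more high-fraud claims but both groups have the same rate, so the count alone would overstate the difference.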
Use the high_fraud dataframe to answer the following questions.
Report the sample size, average, median, standard deviation and
inter-quartile range of claim amount, grouped by vehicle size. HINT:
create a tibble named temp.
(5pts)
temp <- high_fraud %>%
  group_by(vehicle_size) %>%
  summarise(
    sample_size        = n(),
    Average            = mean(claim_amt),
    Median             = median(claim_amt),
    standard_deviation = sd(claim_amt),
    IQR                = IQR(claim_amt)
  )
# Print the summary so it appears in the knitted output
temp
In three (3) or four (4) sentences:
- Explain what these values tell us about the shape of the distribution of claim amounts by vehicle size (3 pts)
- Based upon this, use the most appropriate measure of central tendency to rank the vehicle size groups by claim amount from lowest to highest (no R code is required) (2pts)
The summary shows the claim amounts for each vehicle size. In every group the average and the median are close to each other, which indicates that the distributions are roughly symmetric and not noticeably skewed. Larger vehicles tend to have a wider spread in claim amount, as seen in their higher standard deviation and IQR. Because the distributions are roughly symmetric, the average is an appropriate measure of central tendency, and ranking the groups from lowest to highest average claim amount gives Small = 5252.33, Medsize = 5295.60, and Large = 5298.
HINT: think about the relationship between the mean and the median
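The relationship between the mean and the median can be seen in a small example. This is a sketch on made-up numbers, not the assignment data; `symmetric` and `right_skewed` are illustrative names.

```r
# Hypothetical claim amounts (illustration only)
symmetric    <- c(5000, 5100, 5200, 5300, 5400)
right_skewed <- c(5000, 5100, 5200, 5300, 9400)

# Mean and median agree when the values are symmetric around the centre
sym_stats <- c(mean = mean(symmetric), median = median(symmetric))

# A single large claim pulls the mean above the median (right skew)
skew_stats <- c(mean = mean(right_skewed), median = median(right_skewed))

print(sym_stats)
print(skew_stats)
```

When the mean sits well above the median the distribution is likely right-skewed and the median is usually the safer summary. When the two are close, either measure describes the centre reasonably well.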
Use a boxplot to display the amount claimed by different education levels. Properly label the axes of the boxplot and assign different colours to each group. (5 pts)
ggplot(high_fraud, aes(x = education, y = claim_amt, fill = education)) +
  geom_boxplot() +
  labs(title = "Claim Amount by Education Level",
       x = "Education Level",
       y = "Claim Amount") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")
The boxplot shows that customers with a high school education or below have the widest spread of claim amounts of all groups, indicating the highest variability. This group also has the largest interquartile range, meaning the middle half of its claims spans a wider range than in the other education levels.
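The interquartile range read off a boxplot is the height of the box, that is, the distance between the first and third quartiles. The sketch below shows this on made-up numbers, using base R only; `claims`, `q1`, `q3` and `box_width` are illustrative names.

```r
# Hypothetical claim amounts for one group (illustration only)
claims <- c(5010, 5050, 5120, 5200, 5260, 5330, 5400, 5480)

# Quartiles: 25% of claims fall below q1 and 25% fall above q3
q1 <- quantile(claims, 0.25, names = FALSE)
q3 <- quantile(claims, 0.75, names = FALSE)

# The IQR is the range covered by the middle 50% of claims
box_width <- q3 - q1
print(box_width)
print(IQR(claims))
```

A taller box therefore means the middle half of the group's claims is more spread out, independent of any outliers drawn beyond the whiskers.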
Create a scatter plot of annual income against amount claimed (be sure to get the x and y axes correct). (2pts)
ggplot(high_fraud, aes(x = ann_inc, y = claim_amt)) +
  geom_point() +
  labs(title = "Annual Income vs Claim Amount",
       x = "Annual Income",
       y = "Claim Amount") +
  theme_minimal()
Do you see any pattern in the scatterplot? Explain in one or two sentences. (5pts)
The scatterplot above shows only a weak correlation between annual income and claim amount. Claim amount may increase slightly as annual income increases, but the relationship is not strong, as indicated by the wide scatter of points in the graph.
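A visual impression of weak correlation can be backed up with a number. The following is a minimal sketch using the Pearson correlation from base R on made-up values; `income` and `claim` are hypothetical vectors, not the assignment data.

```r
# Hypothetical income and claim values (illustration only)
income <- c(0, 20000, 35000, 48000, 56000, 63000, 77000, 86000)
claim  <- c(5300, 5100, 5450, 5050, 5500, 5150, 5400, 5250)

# Pearson correlation: values near 0 indicate a weak linear relationship,
# values near -1 or 1 a strong one
r <- cor(income, claim)
print(r)
```

On the real data the equivalent call would be `cor(high_fraud$ann_inc, high_fraud$claim_amt)`, assuming both columns are numeric.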