This is an individual assignment and you must work on it on your own. Collaboration on the assignment constitutes collusion. For more on collusion and misconduct, please see this webpage.
This assignment is designed to simulate a scenario in which you are taking over someone’s existing work and continuing with it to draw some further insights. The dataset you will use has already gone through preliminary cleaning, and it will be your job to use this cleaned dataset to answer questions.
You have just joined a large insurance company as a data analyst. Your first job is to help the fraud team understand the characteristics of fraudulent insurance claims. To get you started in this new role, your manager has asked you to perform a short EDA on a snippet of customer data that the fraud team has compiled. You are to communicate your findings to the chief analytics leader. This is not a formal report, but rather something you are giving to your manager that describes the data with some interesting insights.
Please make sure you read the hints throughout the assignment to help guide you on the tasks.
The points allocated for each element of the assignment are marked in the questions, and in certain cases code scaffolding has been provided. However, if you feel this scaffolding is unhelpful, you are not obliged to use it.
This assignment is out of 50 and will be worth 10% of your total grade. Due on: Thursday, 22 August 2024, 11:59 PM (Melbourne time).
For this assignment, you will need to upload the following into Moodle:
The rendered HTML file saved as a PDF. The assignment will only be marked if the PDF is uploaded to Moodle. The submitted PDF must have all the code and output visible.
To complete the assignment, you will need to fill in the blanks with appropriate R code for some questions. These sections are marked with `___`. For other questions, you will need to write the entire R code chunk.
At a minimum, your assignment should be able to be “knitted” using the Knit button for your R Markdown document, so that you can produce an HTML file that you will save as a PDF file and upload to Moodle.
If you want to see what the assignment looks like as you progress, remember that you can set the R chunk option `eval = FALSE`, like so, to ensure that you can knit the file:
```{r this-chunk-will-not-run, eval = FALSE}
a <- 1 + 2
```
If you use `eval = FALSE` or `echo = FALSE`, please remember to set them back to `eval = TRUE` and `echo = TRUE` before you submit the assignment, so that all your R code runs and is visible.
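One way to reduce the risk of leaving a per-chunk override in place is to set the defaults once, in a setup chunk near the top of the document. The following is a minimal sketch, assuming the knitr package that R Markdown already relies on; the option values shown are simply the ones this assignment asks for at submission time.

```r
# Set document-wide chunk defaults (typically placed in a setup chunk).
# Individual chunks can still override these in their own headers.
library(knitr)
opts_chunk$set(echo = TRUE, eval = TRUE)

# Inspect the current defaults to confirm they were applied.
opts_chunk$get("echo")
opts_chunk$get("eval")
```

Any chunk that still carries `eval = FALSE` in its header will override these defaults, so it is worth searching the document for leftover overrides before knitting the final version.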
IMPORTANT: You must use R code to answer all the questions in the report.
This assignment is due on Thursday, 22 August 2024, 11:55 PM. You will submit the knitted HTML file saved as a PDF via Moodle. Please make sure you add your name to the YAML header of the Rmd file before you knit it and save it as a PDF.
Remember, you can look up the help file for a function by typing `?function_name`. For example, `?mean`.
library(tidyverse)
fraud <- read_csv("D:/ETC1010/fraud.csv")
Display the first 7 rows of the data set (1pt). Hint: Check ?head in your R console.
head(fraud, 7)
## # A tibble: 7 × 18
## customer `Claim Amount` coverage education employment_status gender
## <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 BU79786 2764 Basic Bachelor Employed F
## 2 QZ44356 6980 Extended Bachelor Unemployed F
## 3 AI49188 12887 Premium Bachelor Employed F
## 4 WW63253 7646 Basic Bachelor Unemployed M
## 5 HB64268 2814 Basic Bachelor Employed M
## 6 OC83172 8256 Basic Bachelor Employed F
## 7 XZ87318 5381 Basic College Employed F
## # ℹ 12 more variables: `Annual Income ($)` <dbl>, location_category <chr>,
## # marital_status <chr>, age <dbl>, monthly_premium <dbl>,
## # months_since_last_claim <dbl>, vehicle_size <chr>,
## # reported_to_police <chr>, witness_details_provided <chr>, weekend <chr>,
## # car_driveable <chr>, fraud_detected <dbl>
How many observations and variables does the data set fraud have? Use inline code to complete the sentence below. (4pts)
[Put the inline sentence here]
Your inline code should look like the following. Please paste the inline code in the code chunk below for us to mark.
#The number of observations is `r nrow(fraud)` (2pt) and the number of variables is `r ncol(fraud)` (2pt)
nrow(fraud)
## [1] 9134
ncol(fraud)
## [1] 18
Using the fraud data set, rename the `Claim Amount` variable to `claim_amt` and the `Annual Income ($)` variable to `ann_inc`. Save this new data frame as `fraud` (4pts).
HINT: use " " for the original variable names
fraud <- dplyr::rename(fraud, claim_amt="Claim Amount", ann_inc = "Annual Income ($)")
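The renaming pattern can be tried on a small stand-in table first. This is a sketch on hypothetical toy values (the tibble `toy` is not part of the assignment data); it shows that `rename()` takes `new_name = old_name` and that either backticks or quotes can be used for old names containing spaces or symbols.

```r
library(dplyr)
library(tibble)

# Toy data with non-syntactic column names (hypothetical values).
toy <- tibble(`Claim Amount`      = c(2764, 6980),
              `Annual Income ($)` = c(56274, 0))

# rename() uses new_name = old_name; backticks or quotes both work
# for old names that contain spaces or symbols.
toy <- rename(toy,
              claim_amt = `Claim Amount`,
              ann_inc   = "Annual Income ($)")
names(toy)
```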
Display the first 6 rows and 7 columns corresponding to the customers with “Premium” coverage (1pt). HINT: use c(,) in head()
head(fraud[fraud$coverage == "Premium", c(1:7)])
## # A tibble: 6 × 7
## customer claim_amt coverage education employment_status gender ann_inc
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 AI49188 12887 Premium Bachelor Employed F 48767
## 2 CF85061 7216 Premium Master Unemployed M 0
## 3 DP39365 8799 Premium Master Employed M 77026
## 4 FL50705 8163 Premium High School or B… Employed F 66140
## 5 US89481 3946 Premium Bachelor Unemployed F 0
## 6 GE62437 12903 Premium College Employed F 86584
Using the fraud data set and the function `filter`, extract all claims where fraud was detected and the claim amount was greater than $5,000. Call this new data set `high_fraud`. (4pts)
# fraud_detected is numeric (<dbl>), so compare it with 1 rather than TRUE
high_fraud <- filter(fraud, fraud_detected == 1, claim_amt > 5000)
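How `filter()` combines several conditions can be checked on a small example. This is a sketch on hypothetical toy values, assuming fraud is coded as 1 and no fraud as 0; conditions separated by commas are combined with a logical AND, so the two calls below select the same rows.

```r
library(dplyr)
library(tibble)

toy <- tibble(
  claim_amt      = c(2764, 6980, 12887, 7646),
  fraud_detected = c(0, 1, 1, 0)   # assumed coding: 1 = fraud, 0 = none
)

# Conditions separated by commas are combined with AND.
by_comma <- filter(toy, fraud_detected == 1, claim_amt > 5000)
by_and   <- filter(toy, fraud_detected == 1 & claim_amt > 5000)

nrow(by_comma)                      # rows that satisfy both conditions
isTRUE(all.equal(by_comma, by_and)) # both forms give the same result
```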
How many variables and observations are in the `high_fraud` data frame? (2pts) HINT: you can use `count` or `nrow` to complete this.
nrow(high_fraud)
## [1] 542
ncol(high_fraud)
## [1] 18
#The number of observations is 542 and the number of variables is 18.
Use the `high_fraud` data frame to answer the following questions.
Which education types have been detected as committing high levels of fraud? Display the top three education groups. (5pts)
high_fraud %>%
group_by(education) %>%
summarise(n=n()) %>%
arrange(desc(n)) %>%
head(3)
## # A tibble: 3 × 2
## education n
## <chr> <int>
## 1 Bachelor 165
## 2 High School or Below 162
## 3 College 153
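The group, tally, and sort steps above can also be written more compactly with `count()`. This is an alternative sketch on hypothetical toy values, not the scaffolded solution; `sort = TRUE` orders the result by descending `n`.

```r
library(dplyr)
library(tibble)

# Hypothetical education labels for six claims.
toy <- tibble(education = c("Bachelor", "College", "Bachelor",
                            "Master", "College", "Bachelor"))

# count(sort = TRUE) groups, tallies, and orders by descending n in one step.
top_two <- toy %>%
  count(education, sort = TRUE) %>%
  head(2)
top_two
```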
Which insurance coverage has the highest level of high insurance fraud detected? (2pts)
high_fraud %>%
group_by(coverage) %>%
summarise(n=n()) %>%
arrange(desc(n)) %>%
head(1)
## # A tibble: 1 × 2
## coverage n
## <chr> <int>
## 1 Basic 305
# The Basic insurance coverage has the highest level of high insurance fraud.
Are males or females more likely to commit high levels of fraud? (2pts)
high_fraud %>%
group_by(gender) %>%
summarise(n=n()) %>%
arrange(desc(n))
## # A tibble: 2 × 2
## gender n
## <chr> <int>
## 1 M 290
## 2 F 252
#Males are more likely to commit high levels of fraud.
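Raw counts show which group appears more often among flagged claims, but a statement about likelihood also depends on how many customers of each gender are in the full data. The sketch below, on hypothetical toy values, shows one way to compare rates instead of counts; the column `high_fraud` here is an illustrative 0/1 flag, not a variable from the assignment data.

```r
library(dplyr)
library(tibble)

# Hypothetical data: 1 = flagged as high fraud, 0 = not flagged.
toy <- tibble(
  gender     = c("M", "M", "M", "M", "F", "F"),
  high_fraud = c(1, 1, 0, 0, 1, 0)
)

# Count flagged claims and compute the flagged share within each gender.
rates <- toy %>%
  group_by(gender) %>%
  summarise(n_flagged = sum(high_fraud),
            n_total   = n(),
            rate      = mean(high_fraud))
rates
```

In this toy example, males have more flagged claims (2 versus 1), yet the rate is 0.5 for both genders, so the counts alone would overstate the difference.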
Use the `high_fraud` data frame to answer the following questions.
Report the sample size, average, median, standard deviation and inter-quartile range of claim amount, grouped by vehicle size. HINT: create a tibble named `temp`. (5pts)
temp <- high_fraud %>%
group_by(vehicle_size) %>%
summarise(sample_size = n(),
average = mean(claim_amt, na.rm = TRUE),
median = median(claim_amt, na.rm = TRUE),
sd = sd(claim_amt, na.rm = TRUE),
IQR = IQR(claim_amt, na.rm = TRUE)
)
temp
## # A tibble: 3 × 6
## vehicle_size sample_size average median sd IQR
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Large 50 10182. 7605 6434. 3914.
## 2 Medsize 386 11357. 8589 7636. 6282.
## 3 Small 106 10800. 8058 6815. 6738
In three (3) or four (4) sentences:
- Explain what these values tell us about the shape of the distribution of claim amounts by vehicle size (3pts)
- Based on this, use the most appropriate measure of central tendency to rank the vehicle size groups by claim amount from lowest to highest (no R code is required) (2pts)
HINT: think about the relationship between the mean and the median
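The relationship in the hint can be illustrated with a small worked example. This is a sketch on hypothetical claim amounts, not values from the assignment data: a single large claim pulls the mean well above the median.

```r
# Hypothetical right-skewed claim amounts: one large claim pulls the mean up.
claims <- c(5200, 6100, 7000, 8300, 30000)

mean(claims)    # 11320
median(claims)  # 7000

# A mean well above the median suggests a right (positive) skew,
# in which case the median is the more robust measure of centre.
mean(claims) > median(claims)
```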
Use a boxplot to display the amount claimed by different education levels. Properly label the axes of the boxplot and assign different colours to each group. (5 pts)
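high_fraud %>%
  ggplot(aes(x = education, y = claim_amt, fill = education)) +
  geom_boxplot() +
  # Label the axes explicitly, as the question requires
  labs(x = "Education level", y = "Claim amount ($)", fill = "Education") +
  theme_minimal()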
#The ‘Doctors’ group has the least variability: its box is the shortest, indicating a smaller interquartile range, and it shows fewer outliers.
Create a scatter plot of annual income against amount claimed (be sure to get the x and y axes correct). (2pts)
ggplot(high_fraud, aes(x = ann_inc, y = claim_amt)) +
geom_point() +
theme_minimal()
Do you see any pattern in the scatterplot? Explain in one or two sentences. (5pts)
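A visual impression of a scatter plot can be complemented by a correlation coefficient. This is a minimal sketch on hypothetical values (the vectors below only stand in for `ann_inc` and `claim_amt`); `cor()` returns Pearson's r, which lies between -1 and 1, with values near 0 suggesting little linear association.

```r
# Hypothetical stand-ins for annual income and claim amount.
ann_inc   <- c(0, 20000, 40000, 60000, 80000)
claim_amt <- c(9000, 7000, 11000, 6500, 9500)

# Pearson correlation between the two variables.
r <- cor(ann_inc, claim_amt)
r
```

A coefficient only summarises linear association, so it should be read alongside the plot, not instead of it.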