Instructions to Students

This is an individual assignment and you must work on it on your own. Collaboration on the assignment constitutes collusion. For more on collusion and misconduct please see this webpage.

This assignment is designed to simulate a scenario in which you take over someone’s existing work and continue it to draw further insights. The dataset you will use has already gone through preliminary cleaning, and it will be your job to use this cleaned dataset to answer questions.

You have just joined a large insurance company as a data analyst. Your first job is to help the fraud team understand the characteristics of fraudulent insurance claims. To get you started in this new role, your manager has asked you to perform a short EDA on a snippet of customer data that the fraud team has compiled. You are to communicate your findings to the chief analytics leader. This is not a formal report, but rather something you are giving to your manager that describes the data with some interesting insights.

Please make sure you read the hints throughout the assignment to help guide you on the tasks.

The points allocated for each element of the assignment are marked in the questions, and in certain cases code scaffolding has been provided. However, if you feel this scaffolding is unhelpful, you are not obliged to use it.

Marking + Grades

This assignment is out of 50 and will be worth 10% of your total grade. Due on: Thursday, 22 August 2024, 11:59 PM (Melbourne time).

For this assignment, you will need to upload the following into Moodle:

If you want to see what the assignment looks like as you progress, remember that you can set the R chunk option eval = FALSE, as shown here, so that the file still knits:

```{r this-chunk-will-not-run, eval = FALSE} `r''`
a <- 1 + 2
```

If you use eval = FALSE or echo = FALSE, please remember to set them back to eval = TRUE and echo = TRUE before you submit the assignment, so that all your R code runs and is displayed.

IMPORTANT: You must use R code to answer all the questions in the report.

Due Date

This assignment is due on Thursday, 22 August 2024, 11:55 PM. You will submit the knitted HTML file, saved as a PDF, via Moodle. Please make sure you add your name to the YAML header of the Rmd file before you knit it and save it as a PDF.

How to find help from R functions?

Remember, you can look up the help file for functions by typing: ?function_name. For example, ?mean.

Load all the libraries that you need here

library(tidyverse)
library(dplyr)
library(ggplot2)

Read in the data

fraud <- read_csv("fraud (1).csv")

Question 1

Display the first 7 rows of the data set (1pt). Hint: Check ?head in your R console.

head(fraud,n=7)
## # A tibble: 7 × 18
##   customer `Claim Amount` coverage education employment_status gender
##   <chr>             <dbl> <chr>    <chr>     <chr>             <chr> 
## 1 BU79786            2764 Basic    Bachelor  Employed          F     
## 2 QZ44356            6980 Extended Bachelor  Unemployed        F     
## 3 AI49188           12887 Premium  Bachelor  Employed          F     
## 4 WW63253            7646 Basic    Bachelor  Unemployed        M     
## 5 HB64268            2814 Basic    Bachelor  Employed          M     
## 6 OC83172            8256 Basic    Bachelor  Employed          F     
## 7 XZ87318            5381 Basic    College   Employed          F     
## # ℹ 12 more variables: `Annual Income ($)` <dbl>, location_category <chr>,
## #   marital_status <chr>, age <dbl>, monthly_premium <dbl>,
## #   months_since_last_claim <dbl>, vehicle_size <chr>,
## #   reported_to_police <chr>, witness_details_provided <chr>, weekend <chr>,
## #   car_driveable <chr>, fraud_detected <dbl>

Question 2

How many observations and variables does the data set fraud have? Use the inline code to complete the sentence below (4pts)

#The number of observations is:
nrow(fraud) 
## [1] 9134
#and the number of variables is: 
ncol(fraud)
## [1] 18
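
The question asks for inline code, whereas the answer above uses a regular chunk. A minimal sketch of the inline form is shown below, assuming a small stand-in data frame named `toy` (hypothetical, not the real `fraud` data). The commented line shows what would be typed in the Rmd body, and the chunk builds the same sentence so it can be checked.

```r
# Toy stand-in for the fraud tibble (hypothetical values, for illustration only)
toy <- data.frame(id = 1:3, amount = c(10, 20, 30))

# In the Rmd body (outside a chunk), the counts can be written inline as:
#   The data set has `r nrow(toy)` observations and `r ncol(toy)` variables.

# The equivalent sentence built inside a chunk:
sentence <- paste0(
  "The data set has ", nrow(toy), " observations and ",
  ncol(toy), " variables."
)
sentence
```

With the real data, `toy` would be replaced by `fraud` in the inline expressions.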

Question 3

Using the fraud data set, rename the Claim Amount variable to claim_amt and the Annual Income ($) variable to ann_inc. Save this new data frame as fraud (4pts). HINT: use ” ” for the original variable names

?rename

fraud <- dplyr::rename(fraud, claim_amt="Claim Amount", ann_inc = "Annual Income ($)")

Question 4

Display the first 6 rows and 7 columns corresponding to the customers with “Premium coverage” (1pt). HINT: use c(,) in head()

premium_coverage<-filter(fraud,coverage=="Premium")
head(premium_coverage, c(6,7))
## # A tibble: 6 × 7
##   customer claim_amt coverage education         employment_status gender ann_inc
##   <chr>        <dbl> <chr>    <chr>             <chr>             <chr>    <dbl>
## 1 AI49188      12887 Premium  Bachelor          Employed          F        48767
## 2 CF85061       7216 Premium  Master            Unemployed        M            0
## 3 DP39365       8799 Premium  Master            Employed          M        77026
## 4 FL50705       8163 Premium  High School or B… Employed          F        66140
## 5 US89481       3946 Premium  Bachelor          Unemployed        F            0
## 6 GE62437      12903 Premium  College           Employed          F        86584

Question 5

Using the fraud data set and the function filter, extract all claims where fraud was detected and the claim amount was greater than $5,000. Call this new data set high_fraud. (4pts)

high_fraud <- filter(fraud,claim_amt>5000,fraud_detected==1)

Question 6

How many variables and observations are in the high_fraud data frame? (2pts) HINT: you can use count or nrow to complete this.

#The number of observations is:
nrow(high_fraud) 
## [1] 542
#and the number of variables is: 
ncol(high_fraud)
## [1] 18

Question 7

Use the high_fraud dataframe to answer the following questions.

Question 7.1

Which education types have been detected as committing high levels of fraud? Display the top three education groups. (5pts)

high_fraud %>% group_by(education) %>% summarise(n=n()) %>% arrange(desc(n)) %>% head(3) 
## # A tibble: 3 × 2
##   education                n
##   <chr>                <int>
## 1 Bachelor               165
## 2 High School or Below   162
## 3 College                153

Question 7.2

Which insurance coverage has the highest level of high insurance fraud detected? (2pts)

high_fraud %>% group_by(coverage) %>% summarise(n=n()) %>% arrange(desc(n)) %>% head(1)
## # A tibble: 1 × 2
##   coverage     n
##   <chr>    <int>
## 1 Basic      305

Question 7.3

Are males or females more likely to commit high levels of fraud? (2pts)

high_fraud %>% group_by(gender)%>% summarise(n=n())
## # A tibble: 2 × 2
##   gender     n
##   <chr>  <int>
## 1 F        252
## 2 M        290
#Within high_fraud, more claims were made by males (290) than by females (252), so males are more likely to commit high levels of fraud.
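
The counts above compare group sizes within high_fraud only. "More likely" can also be read as a rate, that is, the share of each gender's claims that are high fraud. A minimal sketch of that calculation follows, assuming a toy tibble with made-up values in place of the full `fraud` data. The column names match the data set, but the numbers are hypothetical.

```r
library(dplyr)

# Toy stand-in for the full fraud data (hypothetical values, illustration only)
toy <- tibble(
  gender = c("F", "F", "F", "F", "M", "M"),
  claim_amt = c(6000, 1000, 2000, 3000, 7000, 8000),
  fraud_detected = c(1, 0, 0, 0, 1, 0)
)

# Share of each gender's claims that are fraudulent and above $5,000
rates <- toy %>%
  group_by(gender) %>%
  summarise(
    n_claims = n(),
    n_high_fraud = sum(fraud_detected == 1 & claim_amt > 5000),
    rate = n_high_fraud / n_claims
  )
rates
```

If the two genders make very different numbers of claims overall, the rate and the raw count can point in different directions, so it may be worth reporting both.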

Question 8

Use the high_fraud dataframe to answer the following questions.

Question 8.1

Report the sample size, average, median, standard deviation and inter-quartile range of claim amount, grouped by vehicle size. HINT: create a tibble named temp. (5pts)

temp <- high_fraud %>%
  group_by(vehicle_size) %>%
  summarise(
    average = mean(claim_amt, na.rm = TRUE),
    median = median(claim_amt, na.rm = TRUE),
    std_dev = sd(claim_amt, na.rm = TRUE),
    iqr = IQR(claim_amt, na.rm = TRUE)
  )
temp
## # A tibble: 3 × 5
##   vehicle_size average median std_dev   iqr
##   <chr>          <dbl>  <dbl>   <dbl> <dbl>
## 1 Large         10182.   7605   6434. 3914.
## 2 Medsize       11357.   8589   7636. 6282.
## 3 Small         10800.   8058   6815. 6738
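
The question also asks for the sample size, which the summary above does not include. A minimal sketch of how a per-group count could be added with `n()` is given below, assuming a toy tibble with hypothetical values rather than the real `high_fraud` data.

```r
library(dplyr)

# Toy stand-in for high_fraud (hypothetical values, illustration only)
toy <- tibble(
  vehicle_size = c("Large", "Large", "Small", "Small", "Small"),
  claim_amt = c(6000, 8000, 5500, 7000, 9000)
)

# n() counts the rows in each group, giving the sample size per vehicle size
toy_summary <- toy %>%
  group_by(vehicle_size) %>%
  summarise(
    sample_size = n(),
    average = mean(claim_amt, na.rm = TRUE),
    median = median(claim_amt, na.rm = TRUE)
  )
toy_summary
```

The same `sample_size = n()` line could be placed at the top of the `summarise()` call that creates `temp`.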

Question 8.2

In three (3) or four (4) sentences:
- Explain what these values tell us about the shape of the distribution of claim amounts by vehicle size (3 pts)
- Based upon this, use the most appropriate measure of central tendency to rank the vehicle size groups by claim amount from lowest to highest (no R code is required) (2pts)

HINT: think about the relationship between the mean and the median

#These values indicate that the distribution of claim amounts is positively (right) skewed for every vehicle size, because the mean is well above the median in each group. This suggests that a small number of very large claims are pulling the mean upward. The median is therefore the most appropriate measure of central tendency, as it is not affected by outliers. Ranked by median from lowest to highest, the groups are Large (7605), Small (8058) and Medsize (8589).
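
The link between skew and the gap between mean and median can be seen in a small example. The sketch below uses five made-up claim amounts (hypothetical, not taken from the data), one of which is much larger than the rest.

```r
# Five hypothetical claim amounts with one very large value
x <- c(5200, 5600, 6100, 7000, 30000)

mean_x <- mean(x)      # pulled upward by the single large claim
median_x <- median(x)  # the middle value, unaffected by the large claim

# A mean well above the median is a sign of positive (right) skew
mean_x > median_x
```

Here the single large claim raises the mean far above the typical value, while the median stays at the middle observation.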

Question 8.3

Use a boxplot to display the amount claimed by different education levels. Properly label the axes of the boxplot and assign different colours to each group. (5 pts)

high_fraud %>%
  ggplot(aes(x = education, y = claim_amt, fill = education)) +
  geom_boxplot() +
  labs(title = "Claim Amounts by Education Levels", x = "Education Level", y = "Amount Claimed")

Question 8.4

  1. By visual analysis, which group has the highest and lowest variability? (2pts)
#Using visual analysis, the Doctor level of education has the lowest variability of claim amounts and High School or Below has the highest variability of claim amounts.
  2. Which education level has the highest interquartile range of fraudulent claims? (1pt)
#College has the highest interquartile range of fraudulent claims.
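
A visual reading of box heights can be cross-checked numerically. The sketch below computes the interquartile range and standard deviation per education group and sorts by IQR, assuming a toy tibble with hypothetical values rather than the real `high_fraud` data.

```r
library(dplyr)

# Toy stand-in for high_fraud (hypothetical values, illustration only)
toy <- tibble(
  education = rep(c("Bachelor", "College"), each = 4),
  claim_amt = c(5000, 6000, 7000, 8000, 5000, 9000, 13000, 17000)
)

# Spread of claim amounts per group, widest interquartile range first
spread <- toy %>%
  group_by(education) %>%
  summarise(iqr = IQR(claim_amt), std_dev = sd(claim_amt)) %>%
  arrange(desc(iqr))
spread
```

The first row of the result is the group with the widest box in the corresponding boxplot.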

Question 8.5

Create a scatter plot of annual income against amount claimed (be sure to get the x and y axes correct). (2pts)

high_fraud %>%
  ggplot(aes(x = ann_inc, y = claim_amt)) +
  geom_point() +
  labs(title = "Annual Income Against Amount Claimed", x = "Annual Income", y = "Claim Amount")

Question 8.6

Do you see any pattern in the scatterplot? Explain in one or two sentences. (5pts)

#The scatterplot shows that the points are concentrated at lower annual incomes, so the data are positively skewed and the majority of high fraud occurs among customers with lower annual incomes. It also shows that most claims are less than $20,000.