Instructions to Students

This is an individual assignment and you must work on it on your own. Collaboration on the assignment constitutes collusion. For more on collusion and misconduct please see this webpage.

This assignment is designed to simulate a scenario in which you take over someone’s existing work and continue with it to draw further insights. The dataset you will use has already gone through preliminary cleaning, and it will be your job to use this cleaned dataset to answer questions.

You have just joined a large insurance company as a data analyst. Your first job is to help the fraud team understand the characteristics of fraudulent insurance claims. To get you started in this new role, your manager has asked you to perform a short EDA on a snippet of customer data that the fraud team has compiled. You are to communicate your findings to the chief analytics leader. This is not a formal report, but rather something you are giving to your manager that describes the data with some interesting insights.

Please make sure you read the hints throughout the assignment to help guide you on the tasks.

The points allocated for each of the elements in the assignment are marked in the questions and in certain cases, code scaffolding has been provided. However, if you feel this scaffolding is unhelpful, you are not obliged to use it.

Marking + Grades

This assignment is out of 50 and will be worth 10% of your total grade. Due on: Thursday, 22 August 2024, 11:59 PM (Melbourne time).

For this assignment, you will need to upload the following into Moodle:

The rendered html file saved as a pdf. The assignment will be only marked if the pdf is uploaded in Moodle. The submitted assignment pdf file must have all the code and output visible.
To complete the assignment, you will need to fill in the blanks with appropriate R code for some questions. These sections are marked with ___. For other questions, you will need to write the entire R code chunk.
At a minimum, your assignment should be able to be “knitted” using the Knit button for your Rmarkdown document so that you can produce a html file that you will save as pdf file and upload it into Moodle.

If you want to see what the assignment looks like as you progress, remember that you can set the R chunk option eval = FALSE, like so, to ensure that you can knit the file:

```{r this-chunk-will-not-run, eval = FALSE}
a <- 1 + 2
```

If you use eval = FALSE or echo = FALSE, please remember to set them back to eval = TRUE and echo = TRUE before you submit the assignment, so that all your R code runs and is visible.

IMPORTANT: You must use R code to answer all the questions in the report.

Due Date

This assignment is due on Thursday, 22 August 2024, 11:55 PM. You will submit the knitted html file saved as a pdf via Moodle. Please make sure you add your name on the YAML part of the Rmd file before you knit it and save it as pdf.

How to find help from R functions?

Remember, you can look up the help file for functions by typing: ?function_name. For example, ?mean.

Load all the libraries that you need here

library(tidyverse)

Read in the data

fraud <- read_csv("D:/ETC1010/fraud.csv")

Question 1

Display the first 7 rows of the data set (1pt). Hint: Check ?head in your R console.

head(fraud, 7)

## # A tibble: 7 × 18
##   customer `Claim Amount` coverage education employment_status gender
##   <chr>             <dbl> <chr>    <chr>     <chr>             <chr> 
## 1 BU79786            2764 Basic    Bachelor  Employed          F     
## 2 QZ44356            6980 Extended Bachelor  Unemployed        F     
## 3 AI49188           12887 Premium  Bachelor  Employed          F     
## 4 WW63253            7646 Basic    Bachelor  Unemployed        M     
## 5 HB64268            2814 Basic    Bachelor  Employed          M     
## 6 OC83172            8256 Basic    Bachelor  Employed          F     
## 7 XZ87318            5381 Basic    College   Employed          F     
## # ℹ 12 more variables: `Annual Income ($)` <dbl>, location_category <chr>,
## #   marital_status <chr>, age <dbl>, monthly_premium <dbl>,
## #   months_since_last_claim <dbl>, vehicle_size <chr>,
## #   reported_to_police <chr>, witness_details_provided <chr>, weekend <chr>,
## #   car_driveable <chr>, fraud_detected <dbl>

Question 2

How many observations and variables does the data set fraud have? Use the inline code to complete the sentence below (4pts)

The number of observations is 9134 and the number of variables is 18.

Your inline code should look like the following. Please paste the inline code in the code chunk below for us to mark.

# The number of observations is `r nrow(fraud)` (2pt) and the number of variables is `r ncol(fraud)` (2pt)
nrow(fraud)

## [1] 9134

ncol(fraud)

## [1] 18

Question 3

Using the fraud data set, rename the Claim Amount variable to claim_amt and the Annual Income ($) variable to ann_inc. Save this new data frame as fraud (4pts). HINT: use " " for the original variable names

fraud <- dplyr::rename(fraud, claim_amt="Claim Amount", ann_inc = "Annual Income ($)")

Question 4

Display the first 6 rows and 7 columns corresponding to the customers with “Premium” coverage (1pt). HINT: use c(,) in head()

head(fraud[fraud$coverage == "Premium", c(1:7)])

## # A tibble: 6 × 7
##   customer claim_amt coverage education         employment_status gender ann_inc
##   <chr>        <dbl> <chr>    <chr>             <chr>             <chr>    <dbl>
## 1 AI49188      12887 Premium  Bachelor          Employed          F        48767
## 2 CF85061       7216 Premium  Master            Unemployed        M            0
## 3 DP39365       8799 Premium  Master            Employed          M        77026
## 4 FL50705       8163 Premium  High School or B… Employed          F        66140
## 5 US89481       3946 Premium  Bachelor          Unemployed        F            0
## 6 GE62437      12903 Premium  College           Employed          F        86584

Question 5

Using the fraud data set and the function filter, extract all claims where fraud was detected and the claim amount was greater than $5,000. Call this new data set high_fraud. (4pts)

# fraud_detected is stored as a number (<dbl>), so compare it with 1
high_fraud <- filter(fraud, fraud_detected == 1, claim_amt > 5000)
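As a quick check of how `filter()` combines its conditions, the sketch below runs the same kind of call on a small made-up tibble. The object `toy` and its values are invented for illustration and are not taken from `fraud.csv`. Conditions separated by commas are combined with a logical AND, and the comparison `> 5000` is strict, so a claim of exactly 5000 is excluded.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only
toy <- tibble(
  customer       = c("A1", "B2", "C3", "D4"),
  claim_amt      = c(2764, 6980, 12887, 5000),
  fraud_detected = c(1, 0, 1, 1)
)

# Comma-separated conditions are combined with AND:
# keep rows where fraud was detected and the claim exceeds 5000
toy_high <- filter(toy, fraud_detected == 1, claim_amt > 5000)
toy_high
```

Only customer C3 satisfies both conditions here: A1 and D4 fail the amount condition, and B2 fails the fraud condition.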

Question 6

How many variables and observations are in the high_fraud data frame? (2pts) HINT: you can use count or nrow to complete this.

nrow(high_fraud)

## [1] 542

ncol(high_fraud)

## [1] 18

# The number of observations is 542 and the number of variables is 18.

Question 7

Use the `high_fraud` dataframe to answer the following questions.

Question 7.1

Which education types have been detected as committing high levels of fraud? Display the top three education groups. (5pts)

high_fraud %>%
  group_by(education) %>% 
  summarise(n=n()) %>% 
  arrange(desc(n)) %>% 
  head(3)

## # A tibble: 3 × 2
##   education                n
##   <chr>                <int>
## 1 Bachelor               165
## 2 High School or Below   162
## 3 College                153
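The group, summarise and arrange steps above can also be written with `count()`, which dplyr provides as a shorthand for counting rows per group. The sketch below compares the two forms on invented data (`toy` is hypothetical), so it shows only that they agree, not anything about the fraud data.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only
toy <- tibble(
  education = c("Bachelor", "College", "Bachelor", "Master",
                "College", "Bachelor", "Doctor")
)

# Long form, as used in the answer above
long_form <- toy %>%
  group_by(education) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  head(3)

# Short form: count() groups, tallies and (with sort = TRUE) orders by n
top3 <- toy %>%
  count(education, sort = TRUE) %>%
  head(3)

top3
```

Either form is acceptable; `count()` simply saves a few lines when the only summary needed is the number of rows per group.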

Question 7.2

Which insurance coverage has the highest level of high insurance fraud detected? (2pts)

high_fraud %>% 
  group_by(coverage) %>%
  summarise(n=n()) %>%
  arrange(desc(n)) %>% 
  head(1)

## # A tibble: 1 × 2
##   coverage     n
##   <chr>    <int>
## 1 Basic      305

# The Basic insurance coverage has the highest level of high insurance fraud.

Question 7.3

Are males or females more likely to commit high levels of fraud? (2pts)

high_fraud %>% 
  group_by(gender) %>% 
  summarise(n=n()) %>%
  arrange(desc(n))

## # A tibble: 2 × 2
##   gender     n
##   <chr>  <int>
## 1 M        290
## 2 F        252

# More of the high-fraud claims come from males (290) than from females (252), so within this subset males account for more high-level fraud.
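Counts show which gender appears more often among high-fraud claims. Whether one gender is "more likely" to commit high-level fraud also depends on how many claims each gender lodged overall. A minimal sketch of that comparison, on invented data (`toy` is hypothetical and does not reflect `fraud.csv`), computes the share of each gender's claims that meet the high-fraud definition from Question 5:

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only
toy <- tibble(
  gender         = c("M", "M", "M", "M", "F", "F"),
  claim_amt      = c(6000, 7000, 3000, 2000, 8000, 1000),
  fraud_detected = c(1, 1, 0, 0, 1, 0)
)

rates <- toy %>%
  group_by(gender) %>%
  summarise(
    n_claims     = n(),                                        # all claims
    n_high_fraud = sum(fraud_detected == 1 & claim_amt > 5000), # high fraud
    rate         = n_high_fraud / n_claims                     # share
  )

rates
```

In this toy data males have twice as many high-fraud claims as females, yet the rate is the same for both groups because males also lodged twice as many claims. Running the same summary on the full `fraud` data would show whether the gap of 290 versus 252 reflects a real difference in rates.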

Question 8

Use the `high_fraud` dataframe to answer the following questions.

Question 8.1

Report the sample size, average, median, standard deviation and inter-quartile range of claim amount, grouped by vehicle size. HINT: create a tibble named temp. (5pts)

temp <- high_fraud %>% 
  group_by(vehicle_size) %>% 
  summarise(sample_size = n(),
    average = mean(claim_amt, na.rm = TRUE),
    median = median(claim_amt, na.rm = TRUE),
    sd = sd(claim_amt, na.rm = TRUE),
    IQR = IQR(claim_amt, na.rm = TRUE)
  )
temp

## # A tibble: 3 × 6
##   vehicle_size sample_size average median    sd   IQR
##   <chr>              <int>   <dbl>  <dbl> <dbl> <dbl>
## 1 Large                 50  10182.   7605 6434. 3914.
## 2 Medsize              386  11357.   8589 7636. 6282.
## 3 Small                106  10800.   8058 6815. 6738

Question 8.2

In three (3) or four (4) sentences:
- Explain what these values tell us about the shape of the distribution of claim amounts by vehicle size (3pts)
- Based upon this, use the most appropriate measure of central tendency to rank the vehicle size groups by claim amount, from lowest to highest (no R code is required) (2pts)

HINT: think about the relationship between the mean and the median
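The hint can be checked numerically. The sketch below is not required by the question; it copies the rounded averages and medians from the Question 8.1 output into a small data frame (the name `temp_check` is introduced here for illustration) and looks at the gap between mean and median, then orders the groups by the median.

```r
# Summary values copied (rounded) from the Question 8.1 output
temp_check <- data.frame(
  vehicle_size = c("Large", "Medsize", "Small"),
  average      = c(10182, 11357, 10800),
  median       = c(7605, 8589, 8058)
)

# A mean well above the median suggests a right-skewed distribution,
# because a few very large claims pull the mean upwards
temp_check$mean_minus_median <- temp_check$average - temp_check$median

# With skewed data the median is the more robust measure of centre,
# so rank the groups from lowest to highest median
ranked <- temp_check[order(temp_check$median), ]
ranked
```

The mean exceeds the median by more than 2,500 in every group, which is consistent with right skew, and ordering by median gives Large, then Small, then Medsize.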

Question 8.3

Use a boxplot to display the amount claimed by different education levels. Properly label the axes of the boxplot and assign different colours to each group. (5 pts)

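high_fraud %>%
  ggplot(aes(x = education, y = claim_amt, fill = education)) +
  geom_boxplot() +
  # Label both axes and the legend, as the question asks
  labs(x = "Education level", y = "Claim amount ($)", fill = "Education level") +
  theme_minimal()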

Question 8.4

By visual analysis, which group has the highest and lowest variability? (2pts)
Which education level has the highest interquartile range of fraudulent claims? (1pt)

a) The ‘College’ group has the greatest variability. This is demonstrated by a large range between the upper and lower quartiles and several outliers.

The ‘Doctors’ group has the least variability. Its box is the narrowest, indicating a smaller range between the quartiles, and it shows fewer outliers.

b) Groups with the highest interquartile range (IQR) will have the widest boxes in the box plot. Here the “College” group meets that condition and has the highest interquartile range.
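A visual reading of box widths can be confirmed by computing the spread per group. The sketch below does this on invented data (`toy` is hypothetical, with only two education levels), so the numbers say nothing about the real claims; replacing `toy` with `high_fraud` would give the actual values.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only
toy <- tibble(
  education = rep(c("Bachelor", "College"), each = 5),
  claim_amt = c(5100, 5500, 6000, 6500, 7000,
                5200, 7000, 9000, 12000, 20000)
)

# IQR corresponds to the height of each box; sd is a second measure of spread
spread <- toy %>%
  group_by(education) %>%
  summarise(iqr = IQR(claim_amt), sd = sd(claim_amt)) %>%
  arrange(desc(iqr))

spread
```

The group listed first has the largest interquartile range, which corresponds to the tallest box in the boxplot.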

Question 8.5

Create a scatter plot of annual income against amount claimed (be sure to get the x and y axes correct). (2pts)

ggplot(high_fraud, aes(x = ann_inc, y = claim_amt)) +
  geom_point() +
  theme_minimal()

Question 8.6

Do you see any pattern in the scatterplot? Explain in one or two sentences. (5pts)

Overall, the data are fairly dispersed, with no strong linear pattern. The scatterplot shows that most claims are concentrated at lower levels of annual income, particularly around zero. There are a few outliers with higher claim amounts.

ETC1010-5510: Introduction to Data Analysis

Yan Liu
