ETC1010-5510: Introduction to Data Analysis

Instructions to Students

This is an individual assignment and you must work on it on your own. Collaboration on the assignment constitutes collusion. For more on collusion and misconduct please see this webpage.

This assignment is designed to simulate a scenario in which you are taking over someone’s existing work and continuing with it to draw some further insights. The dataset you will use has already gone through preliminary cleaning, and it will be your job to use this cleaned dataset to answer questions.

You have just joined a large insurance company as a data analyst. Your first job is to help the fraud team understand the characteristics of fraudulent insurance claims. To get you started in this new role, your manager has asked you to perform a short EDA on a snippet of customer data that the fraud team has compiled. You are to communicate your findings to the chief analytics leader. This is not a formal report, but rather something you are giving to your manager that describes the data with some interesting insights.

Please make sure you read the hints throughout the assignment to help guide you on the tasks.

The points allocated for each of the elements in the assignment are marked in the questions and in certain cases, code scaffolding has been provided. However, if you feel this scaffolding is unhelpful, you are not obliged to use it.

Marking + Grades

This assignment is out of 50 and will be worth 10% of your total grade. Due on: Thursday, 22 August 2024, 11:59 PM (Melbourne time).

For this assignment, you will need to upload the following into Moodle:

The rendered html file saved as a pdf. The assignment will be only marked if the pdf is uploaded in Moodle. The submitted assignment pdf file must have all the code and output visible.
To complete the assignment, you will need to fill in the blanks with appropriate R code for some questions. These sections are marked with ___. For other questions, you will need to write the entire R code chunk.
At a minimum, your assignment should be able to be “knitted” using the Knit button for your Rmarkdown document, so that you can produce an html file, save it as a pdf file, and upload it to Moodle.

If you want to look at what the assignment looks like as you progress, remember that you can set the R chunk options to eval = FALSE like so to ensure that you can knit the file:

```{r this-chunk-will-not-run, eval = FALSE} `r''`
a <- 1 + 2
```

If you use eval = FALSE or echo = FALSE, please remember to set them back to eval = TRUE and echo = TRUE when you submit the assignment, to ensure all your R code runs and is visible.

IMPORTANT: You must use R code to answer all the questions in the report.

Due Date

This assignment is due on Thursday, 22 August 2024, 11:55 PM. You will submit the knitted html file saved as a pdf via Moodle. Please make sure you add your name on the YAML part of the Rmd file before you knit it and save it as pdf.

How to find help from R functions?

Remember, you can look up the help file for functions by typing: ?function_name. For example, ?mean.

Load all the libraries that you need here

library(tidyverse)

Read in the data

fraud <- read.csv("D:/DATA ANA/data/fraud.csv")

Question 1

Display the first 7 rows of the data set (1pt). Hint: Check ?head in your R console.

head(fraud, 7)

##   customer Claim.Amount coverage education employment_status gender
## 1  BU79786        2,764    Basic  Bachelor          Employed      F
## 2  QZ44356        6,980 Extended  Bachelor        Unemployed      F
## 3  AI49188       12,887  Premium  Bachelor          Employed      F
## 4  WW63253        7,646    Basic  Bachelor        Unemployed      M
## 5  HB64268        2,814    Basic  Bachelor          Employed      M
## 6  OC83172        8,256    Basic  Bachelor          Employed      F
## 7  XZ87318        5,381    Basic   College          Employed      F
##   Annual.Income.... location_category marital_status age monthly_premium
## 1             56274          Suburban        Married  59              43
## 2                 0          Suburban         Single  29              96
## 3             48767          Suburban        Married  42             108
## 4                 0          Suburban        Married  36             116
## 5             43836             Rural         Single  23              83
## 6             62902             Rural        Married  33              79
## 7             55350          Suburban        Married  17             107
##   months_since_last_claim vehicle_size reported_to_police
## 1                      32      Medsize                  n
## 2                      13      Medsize                  y
## 3                      18      Medsize                  n
## 4                      18      Medsize                  y
## 5                      12      Medsize                  n
## 6                      14      Medsize                  n
## 7                       0      Medsize                  n
##   witness_details_provided weekend car_driveable fraud_detected
## 1                        y       n             y              0
## 2                        y       n             y              0
## 3                        y       n             y              0
## 4                        y       n             y              0
## 5                        y       y             y              0
## 6                        y       n             y              0
## 7                        n       y             y              1

Question 2

How many observations and variables does the data set fraud have? Use the inline code to complete the sentence below (4pts)

dim(fraud)

## [1] 9134   18

Your inline code should look like the following. Please paste the inline code in the code chunk below for us to mark.

#The number of observations is `r nrow(fraud)` (2pt) and the number of variables is `r ncol(fraud)` (2pt)

Question 3

Using the fraud data set, rename the Claim Amount variable to claim_amt and the Annual Income ($) variable to ann_inc. Save this new data frame as fraud (4pts). HINT: use ” ” for the original variable names

fraud <- dplyr::rename(fraud, claim_amt="Claim.Amount", ann_inc = "Annual.Income....")
fraud <- fraud %>%
  mutate(claim_amt = as.factor(claim_amt)) %>%
  mutate(claim_amt = as.numeric(claim_amt))
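
One point worth checking in this conversion: calling as.numeric() on a factor returns the integer level codes, not the numbers shown in the labels. That would explain why customer AI49188's claim of 12,887 in the Question 1 output appears as 449 in the Question 4 output. The sketch below uses a small made-up vector (toy_claims is hypothetical, not read from fraud.csv) to show the difference, together with one possible alternative that assumes the commas are thousands separators and strips them before converting.

```r
# Toy vector of comma-formatted amounts (hypothetical values for illustration)
toy_claims <- c("2,764", "12,887", "6,980")

# Route 1: character -> factor -> numeric returns the factor level codes
codes <- as.numeric(as.factor(toy_claims))

# Route 2: remove the commas, then convert the digits themselves
amounts <- as.numeric(gsub(",", "", toy_claims, fixed = TRUE))

print(codes)    # small integers indexing the factor levels
print(amounts)  # the actual amounts
```

Here amounts holds 2764, 12887 and 6980, while codes only holds values between 1 and 3. readr::parse_number(), which is loaded with the tidyverse, is another way to parse numbers that contain grouping marks.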

Question 4

Display the first 6 rows and 7 columns corresponding to the customers with “Premium coverage” (1pt). HINT: use c(,) in head()

fraud %>% filter(coverage=="Premium") %>% head(c(6,7))

##   customer claim_amt coverage            education employment_status gender
## 1  AI49188       449  Premium             Bachelor          Employed      F
## 2  CF85061      4469  Premium               Master        Unemployed      M
## 3  DP39365      5304  Premium               Master          Employed      M
## 4  FL50705      4977  Premium High School or Below          Employed      F
## 5  US89481      2491  Premium             Bachelor        Unemployed      F
## 6  GE62437       452  Premium              College          Employed      F
##   ann_inc
## 1   48767
## 2       0
## 3   77026
## 4   66140
## 5       0
## 6   86584

Question 5

Using the fraud data set and the function filter, extract all claims where fraud was detected and the claim amount was greater than $5,000. Call this new data set high_fraud. (4pts)

high_fraud <-  fraud %>%
  filter(fraud_detected == TRUE, claim_amt > 5000)

Question 6

How many variables and observations are in the high_fraud data frame? (2pts) HINT: you can use count or nrow to complete this.

num_variable <- ncol(high_fraud)
num_observation <- nrow(high_fraud)

Question 7

Use the `high_fraud` dataframe to answer the following questions.

Question 7.1

Which education types have been detected as committing high levels of fraud? Display the top three education groups. (5pts)

high_fraud %>% group_by(education) %>% 
  summarise(n=n()) %>% 
  arrange(desc(n)) %>% 
  head(3)

Question 7.2

Which insurance coverage has the highest level of high insurance fraud detected? (2pts)

high_fraud_coverage <- high_fraud %>%
  group_by(coverage) %>%
  summarise(high_fraud_count = n()) %>%
  arrange(desc(high_fraud_count))

high_fraud_coverage

## # A tibble: 3 × 2
##   coverage high_fraud_count
##   <chr>               <int>
## 1 Basic                  43
## 2 Extended               25
## 3 Premium                 8

Question 7.3

Are males or females more likely to commit high levels of fraud? (2pts)

high_fraud_gender <- high_fraud %>%
  group_by(gender) %>%
  summarise(high_fraud_count = n()) %>%
  arrange(desc(high_fraud_count))
high_fraud_gender

## # A tibble: 2 × 2
##   gender high_fraud_count
##   <chr>             <int>
## 1 F                    43
## 2 M                    33
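
The counts above show which gender has more high-value fraud cases. Whether one group is "more likely" is a statement about rates, which also depend on how many customers of each gender are in the full data. A minimal sketch on made-up data (toy and its flagged column are hypothetical) of comparing counts with rates:

```r
# Hypothetical data: 10 female and 4 male customers,
# flagged = 1 marks a high-value fraudulent claim
toy <- data.frame(
  gender  = c(rep("F", 10), rep("M", 4)),
  flagged = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
              1, 1, 0, 0)
)

# Number of flagged claims per gender
counts <- tapply(toy$flagged, toy$gender, sum)

# Share of each gender's customers that were flagged
rates <- tapply(toy$flagged, toy$gender, mean)

print(counts)
print(rates)
```

In this toy example F has more flagged claims (3 against 2) but a lower rate (0.3 against 0.5), so the two comparisons can point in different directions.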

Question 8

Use the `high_fraud` dataframe to answer the following questions.

Question 8.1

Report the sample size, average, median, standard deviation and inter-quartile range of claim amount, grouped by vehicle size. HINT: create a tibble named temp.* (5pts)

# Summary statistics of claim amount for each vehicle size
temp <- high_fraud %>%
  group_by(vehicle_size) %>%
  summarise(
    sample_size = n(),
    Average = mean(claim_amt),
    Median = median(claim_amt),
    standard_deviation = sd(claim_amt),
    IQR = IQR(claim_amt)
  )

Question 8.2

In three (3) or four (4) sentences:
- Explain what these values tell us about the shape of the distribution of claim amounts by vehicle size (3 pts)
- Based upon this, use the most appropriate measure of central tendency to rank the vehicle size groups by claim amount from lowest to highest (no R code is required) (2pts)

The summary describes the claim amounts for each vehicle size. Within each group the average and the median are close to each other, which indicates that the distributions are roughly symmetric and not noticeably skewed. Larger vehicles tend to have a wider spread in claim amounts, as seen in their higher standard deviation and IQR. Since the distributions are roughly symmetric, the average is an appropriate measure of central tendency, and ranking the vehicle sizes from lowest to highest gives Small = 5252.33, Medsize = 5295.60, and Large = 5298.

HINT: think about the relationship between the mean and the median
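
The relationship between the mean and the median can be illustrated on made-up numbers (both vectors below are hypothetical and not drawn from high_fraud). In a roughly symmetric sample the two are close; a single very large value pulls the mean above the median, which signals right skew.

```r
# Hypothetical claim amounts: one symmetric sample, one with a large outlier
symmetric_claims <- c(5100, 5200, 5300, 5400, 5500)
skewed_claims    <- c(5100, 5200, 5300, 5400, 9500)

# Gap between mean and median in each sample
sym_gap  <- mean(symmetric_claims) - median(symmetric_claims)
skew_gap <- mean(skewed_claims) - median(skewed_claims)

print(sym_gap)
print(skew_gap)
```

The gap is 0 for the symmetric sample and 800 for the skewed one. When the gap is large, the median is usually the safer summary because it is not affected by extreme values.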

Question 8.3

Use a boxplot to display the amount claimed by different education levels. Properly label the axes of the boxplot and assign different colours to each group. (5 pts)

ggplot(high_fraud, aes(x = education, y = claim_amt, fill = education)) +
  geom_boxplot() +
  labs(title = "Claim Amount by Education Level",
       x = "Education Level",
       y = "Claim Amount") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")

Question 8.4

By visual analysis, which group has the highest and lowest variability? (2pts)
Which education level has the highest interquartile range of fraudulent claims. (1pt)

The boxplot shows that customers with a high school education or below have the widest spread of claim amounts, which indicates the highest variability. This group also has the largest interquartile range, meaning the middle half of its claims spans a wider range than in the other education levels.
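
A visual reading of box heights can be cross-checked numerically. The sketch below is one possible way to do this, shown on made-up data (toy is hypothetical); on the real data the same pipeline would start from high_fraud instead.

```r
library(dplyr)

# Hypothetical claim amounts for two education levels
toy <- data.frame(
  education = rep(c("Bachelor", "College"), each = 5),
  claim_amt = c(5000, 5100, 5200, 5300, 5400,
                5000, 5500, 6000, 6500, 7000)
)

# IQR of claim amount per education level, largest first
iqr_by_group <- toy %>%
  group_by(education) %>%
  summarise(iqr = IQR(claim_amt)) %>%
  arrange(desc(iqr))

print(iqr_by_group)
```

Here College comes first with an IQR of 1000, against 200 for Bachelor. The last row of the sorted table gives the group with the smallest IQR.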

Question 8.5

Create a scatter plot of annual income against amount claimed (be sure to get the x and y axes correct). (2pts)

ggplot(high_fraud, aes(x = ann_inc, y = claim_amt)) +
  geom_point() +
  labs(title = "Annual Income vs Claim Amount",
       x = "Annual Income",
       y = "Claim Amount") +
  theme_minimal()

Question 8.6

Do you see any pattern in the scatterplot? Explain in one or two sentences. (5pts)

The scatterplot above shows only a weak correlation between annual income and claim amount. As annual income increases, the claim amount may increase slightly. However, the relationship between the two variables is not strong, as indicated by the wide scatter of points in the graph.
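
A visual impression of a weak relationship can be backed up with a number. The sketch below computes the Pearson correlation with base R's cor() on made-up data (toy_income and toy_claim are hypothetical). On the real data the equivalent call would be cor(high_fraud$ann_inc, high_fraud$claim_amt), assuming both columns are numeric.

```r
# Hypothetical annual incomes and claim amounts
toy_income <- c(0, 20000, 40000, 60000, 80000)
toy_claim  <- c(5200, 5100, 5400, 5150, 5350)

# Pearson correlation: values near 0 mean a weak linear relationship,
# values near 1 or -1 mean a strong one
r <- cor(toy_income, toy_claim)

print(r)
```

A coefficient well below 1 in absolute value is consistent with points that are widely scattered around any trend line.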

HOANG GIA PHAT-35325453
