Description and Source of Data

This data set contains 1000 observations on 39 variables concerning insurance claims made by motorists after incidents such as collisions, damage to a parked car, and vehicle theft. Data can be downloaded from: https://www.kaggle.com/roshansharma/insurance-claim.

Age and Education Level of Drivers

The median and mean of driver age are very close. This tells me that the histogram will most likely have a peak in the middle and be roughly symmetric rather than strongly skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.00   32.00   38.00   38.95   44.00   64.00
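As a rough check on symmetry, the six summary values above can be plugged into a quartile-based skewness measure (Bowley's coefficient). The following is a minimal sketch in Python that uses only the numbers printed above; the function and variable names are hypothetical and not part of the original analysis.

```python
# Values copied from the summary output above (driver age, in years).
age_summary = {
    "min": 19.00, "q1": 32.00, "median": 38.00,
    "mean": 38.95, "q3": 44.00, "max": 64.00,
}

def quartile_skewness(q1, median, q3):
    """Bowley's quartile skewness: 0 means the quartiles are balanced around the median."""
    return (q3 + q1 - 2 * median) / (q3 - q1)

bowley = quartile_skewness(age_summary["q1"], age_summary["median"], age_summary["q3"])
mean_minus_median = age_summary["mean"] - age_summary["median"]
lower_tail = age_summary["median"] - age_summary["min"]  # median down to the minimum
upper_tail = age_summary["max"] - age_summary["median"]  # median up to the maximum

print(f"quartile skewness: {bowley:.2f}")
print(f"mean - median:     {mean_minus_median:.2f}")
print(f"lower tail: {lower_tail:.0f} years, upper tail: {upper_tail:.0f} years")
```

The quartiles sit 6 years either side of the median, so the middle half of the ages is balanced. The upper tail (26 years) is somewhat longer than the lower tail (19 years), which is consistent with the mean sitting slightly above the median.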

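As you can see from this histogram, my prediction of a peak in the middle was correct: ages are concentrated around the late thirties and thin out toward both ends. I believe this may challenge the common perception that younger drivers are more reckless and cause more accidents than older drivers, although claim counts alone do not account for how many drivers of each age are insured.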

## 
##   Associate     College High School          JD     Masters          MD 
##         145         122         160         161         143         144 
##         PhD 
##         125

Collisions and Car Manufacturers

## 
##     Accura       Audi        BMW  Chevrolet      Dodge       Ford      Honda 
##         68         69         72         76         80         72         55 
##       Jeep   Mercedes     Nissan       Saab     Suburu     Toyota Volkswagen 
##         67         65         78         80         80         70         68

This is an analysis of the total claim amount by the make of the car involved in the claim. The manufacturer that stood out the most was Toyota, as most of its claims are lower than those of the other manufacturers.

Age and Claim Amounts

This shows little, if any, difference in claim amounts across ages.

Univariate Analysis, Total Claim Amount

## [1]    100  41775  58055  70595 114920

The five values above summarize the total claim amounts, from a minimum of $100 to a maximum of $114,920. This raises the question: why are some claims so low and others so high, even over $100,000?
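One way to judge whether the extremes are unusual is Tukey's 1.5 × IQR rule. The sketch below is in Python and assumes the five printed values are the minimum, lower quartile, median, upper quartile, and maximum, in that order. That reading is an assumption about the output, and the variable names are hypothetical.

```python
# Five values copied from the output above, assumed to be:
# minimum, lower quartile, median, upper quartile, maximum.
minimum, q1, median, q3, maximum = 100, 41775, 58055, 70595, 114920

iqr = q3 - q1                 # spread of the middle half of the claims
lower_fence = q1 - 1.5 * iqr  # Tukey's rule: below this is a potential low outlier
upper_fence = q3 + 1.5 * iqr  # above this is a potential high outlier

print(f"IQR: {iqr}")
print(f"fences: [{lower_fence}, {upper_fence}]")
print(f"minimum below lower fence: {minimum < lower_fence}")
print(f"maximum above upper fence: {maximum > upper_fence}")
```

Under that assumption the IQR is 28,820 and the fences are -1,455 and 113,825. The $100 minimum is not flagged as unusually low because the lower fence falls below zero, while the $114,920 maximum sits just above the upper fence.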

Bivariate Analysis, Type of Incident Vs Claim Amount

One possibility that I will test is whether the type of incident affects the total claim amount.

This box plot supports this theory, as it shows that multi-vehicle and single-vehicle collisions typically have much higher claim amounts than parked-car and vehicle-theft incidents.
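The comparison behind the box plot can also be made numerically by taking the median claim amount per incident type. This is a minimal sketch in Python, not the code used to produce the plot. The column names `incident_type` and `total_claim_amount` are assumptions about the file, and the sample records are placeholders rather than rows from the data set.

```python
from collections import defaultdict
from statistics import median

def median_claim_by_incident(rows, group_key="incident_type",
                             value_key="total_claim_amount"):
    """Return {incident type: median claim amount} for a list of dict records."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(float(row[value_key]))
    return {name: median(values) for name, values in groups.items()}

# Placeholder records for illustration only; these are NOT rows from the data set.
example_rows = [
    {"incident_type": "Single Vehicle Collision", "total_claim_amount": 60000},
    {"incident_type": "Single Vehicle Collision", "total_claim_amount": 70000},
    {"incident_type": "Parked Car", "total_claim_amount": 5000},
    {"incident_type": "Parked Car", "total_claim_amount": 7000},
    {"incident_type": "Parked Car", "total_claim_amount": 6000},
]

medians = median_claim_by_incident(example_rows)
# Print the groups from highest to lowest median claim.
for name, value in sorted(medians.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {value:,.0f}")
```

With the real data, `rows` could come from `csv.DictReader` applied to the downloaded file, provided the column names match those assumed here.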

Analysis Critique

A hypothetical data analyst created the following graph to help him figure out whether certain types of incidents tend to occur more often at certain times of the day.

  1. The two variables in this analysis are the type of incident and the time of day at which it occurred.
  2. This graph does a pretty good job of answering the analyst's question because it shows the average time of day at which each type of incident occurs, but it also shows that each type could happen at any time. The graph shows that most collisions happen in the afternoon or evening, and that parked car and vehicle theft incidents are more common very early in the morning.

One alternative method that could be used to address this question would be to plot a density curve of the time of day for each incident type, with each color corresponding to an incident type.
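The density idea can be approximated without a plotting library by computing, for each incident type, the share of its incidents that fall in each hour of the day. The sketch below is in Python. The function name is hypothetical, the hour is assumed to be recorded as an integer from 0 to 23, and the sample pairs are placeholders rather than real observations.

```python
from collections import Counter, defaultdict

def hourly_density_by_type(records):
    """Map each incident type to 24 proportions (one per hour) that sum to 1."""
    hours_by_type = defaultdict(list)
    for incident_type, hour in records:
        hours_by_type[incident_type].append(hour)

    densities = {}
    for incident_type, hours in hours_by_type.items():
        counts = Counter(hours)
        total = len(hours)
        densities[incident_type] = [counts.get(h, 0) / total for h in range(24)]
    return densities

# Placeholder (incident type, hour of day) pairs for illustration only; NOT real data.
example_records = [
    ("Vehicle Theft", 2), ("Vehicle Theft", 3),
    ("Vehicle Theft", 3), ("Vehicle Theft", 23),
    ("Multi-vehicle Collision", 16), ("Multi-vehicle Collision", 17),
]

densities = hourly_density_by_type(example_records)
for incident_type, density in densities.items():
    # On ties, max() keeps the earliest hour.
    peak_hour = max(range(24), key=lambda h: density[h])
    print(f"{incident_type}: peak hour {peak_hour}, share {density[peak_hour]:.2f}")
```

In a plot, each incident type's 24 values would be drawn as one colored curve. Because every curve sums to 1, incident types with very different total counts can be compared on the same scale.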