During this holiday season, Americans will be driving to see new places, Christmas decorations and visit their loved ones. As a frequent driver myself, I get insulted by people I know from other states about how I am a terrible driver because I am a New Yorker. According to NY Magazine some of the reasons that New Yorkers get the label of being a "bad driver" is because “people that drive in very congested areas lose sight of the rules of the road” or “people in states with large metropolitan areas tend to take public transit more often and lose some of their driving skills.” [...] We’re just not driving as much as everyone else”. Fivethirtyeight uses three determining factors to see if a state’s driver is terrible or not: the number of car crashes, how much insurance companies pay out, and how much insurance companies charge drivers. In this analysis, I will be checking to see if driver behavior predicts the cost of car insurance. My prediction is, the higher the rate of collisions in a state, the higher insurance premiums would be there.
Context on the data collection: Data is collected by Mona Chalabi from FiveThirtyEight (National Highway Traffic Safety Administration 2009- 2012 (NHTSA) (https://www-fars.nhtsa.dot.gov/Main/index.aspx) and National Association of Insurance Commissioners 2010 & 2011): https://github.com/fivethirtyeight/data/tree/master/bad-drivers This is an observational study.
Description of the dependent variable (what is being measured): For this linear regression analysis, the dependent variable is car insurance premiums. Car insurance is different for each state/area and there are various factors that determine what the average insurance premium is.
Description of the independent variable: The independent variable is the number of drivers involved in fatal collisions per billion miles. Even though this isn’t actually an independent variable in a bigger picture because fatal collisions can depend on multiple factors such as driving after drinking alcohol and driving with other distractions. However, for this analysis I am looking into it as a possible cause for high car insurance rates.
Research Question: Does driver behavior predict car insurance? The worse the driver (higher collisions) a state has, would the car insurance premium be higher?
head(drivers, n=1)
## State Number.of.drivers.involved.in.fatal.collisions.per.billion.miles
## 1 Alabama 18.8
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Speeding
## 1 39
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Alcohol.Impaired
## 1 30
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Not.Distracted
## 1 96
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Had.Not.Been.Involved.In.Any.Previous.Accidents
## 1 80
## Car.Insurance.Premiums....
## 1 784.55
## Losses.incurred.by.insurance.companies.for.collisions.per.insured.driver....
## 1 145.08
dim(drivers)
## [1] 51 8
colnames(drivers)
## [1] "State"
## [2] "Number.of.drivers.involved.in.fatal.collisions.per.billion.miles"
## [3] "Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Speeding"
## [4] "Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Alcohol.Impaired"
## [5] "Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Not.Distracted"
## [6] "Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Had.Not.Been.Involved.In.Any.Previous.Accidents"
## [7] "Car.Insurance.Premiums...."
## [8] "Losses.incurred.by.insurance.companies.for.collisions.per.insured.driver...."
#premium<-drivers$Car.Insurance.Premiums
#print(premium)
#fatalCollisions<-drivers$Number.of.drivers.involved.in.fatal.collisions.per.billion.miles
#state<-drivers$State
drivers <- drivers %>%
rename(premium= Car.Insurance.Premiums....)
colnames(drivers)
## [1] "State"
## [2] "Number.of.drivers.involved.in.fatal.collisions.per.billion.miles"
## [3] "Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Speeding"
## [4] "Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Alcohol.Impaired"
## [5] "Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Not.Distracted"
## [6] "Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Had.Not.Been.Involved.In.Any.Previous.Accidents"
## [7] "premium"
## [8] "Losses.incurred.by.insurance.companies.for.collisions.per.insured.driver...."
#grades_updated <- grades_updated %>%
# rename(Term = Col2)
drivers <- drivers %>%
rename(NumOfDriversFatalCollisions= Number.of.drivers.involved.in.fatal.collisions.per.billion.miles)
head(drivers)
## State NumOfDriversFatalCollisions
## 1 Alabama 18.8
## 2 Alaska 18.1
## 3 Arizona 18.6
## 4 Arkansas 22.4
## 5 California 12.0
## 6 Colorado 13.6
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Speeding
## 1 39
## 2 41
## 3 35
## 4 18
## 5 35
## 6 37
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Alcohol.Impaired
## 1 30
## 2 25
## 3 28
## 4 26
## 5 28
## 6 28
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Not.Distracted
## 1 96
## 2 90
## 3 84
## 4 94
## 5 91
## 6 79
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Had.Not.Been.Involved.In.Any.Previous.Accidents
## 1 80
## 2 94
## 3 96
## 4 95
## 5 89
## 6 95
## premium
## 1 784.55
## 2 1053.48
## 3 899.47
## 4 827.34
## 5 878.41
## 6 835.50
## Losses.incurred.by.insurance.companies.for.collisions.per.insured.driver....
## 1 145.08
## 2 133.93
## 3 110.35
## 4 142.39
## 5 165.63
## 6 139.91
plot(x = drivers$NumOfDriversFatalCollisions, y = drivers$Car.Insurance.Premiums,
xlab = "Car Insurance",
ylab = "Collisions",
main = "Car Insurance Vs. Collisions",
col="purple"
)
#Syntax: cor(x, y, method)
drivers %>%
summarise(cor(drivers$premium, drivers$NumOfDriversFatalCollisions, use = "complete.obs"))
## cor(drivers$premium, drivers$NumOfDriversFatalCollisions, use = "complete.obs")
## 1 -0.1997019
The correlation coefficient of -0.1997 shows how there is a weak negative linear relationship (out of 1). A negative correlation indicates two variables that tend to move in opposite directions: a positive change in one variable will be accompanied by a negative change in the other variable. A positive correlation indicates that the variables move in the same direction: a positive change in one variable will tend to accompany a positive change in the other variable. Since it is 0.2 out of 1, it is not considered as "strong".
#plot_ss(x, y, data, showSquares = FALSE, leastSquares = FALSE)
DATA606::plot_ss(x = drivers$NumOfDriversFatalCollisions, y = drivers$premium , showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## 1023.354 -8.638
##
## Sum of Squares: 1526088
The showsquares=TRUE part, shows the residuals are the difference between the observed values and the values predicted by the line. The residual sum of squares shows the level of variance in the error term (residuals), of a regression model. The smaller the residual sum of squares, the better your model will fit your data. When the residual sum of squares gets large, the poorer your model fits your data. A value of zero means your model is a perfect fit.
#lm( fitting_formula, dataframe )
#(Y~X)
m1 <- lm(drivers$premium~drivers$NumOfDriversFatalCollisions, data = drivers)
print(m1)
##
## Call:
## lm(formula = drivers$premium ~ drivers$NumOfDriversFatalCollisions,
## data = drivers)
##
## Coefficients:
## (Intercept) drivers$NumOfDriversFatalCollisions
## 1023.354 -8.638
summary(m1)
##
## Call:
## lm(formula = drivers$premium ~ drivers$NumOfDriversFatalCollisions,
## data = drivers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -249.23 -136.43 -22.29 133.45 435.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1023.354 98.748 10.363 6.08e-14 ***
## drivers$NumOfDriversFatalCollisions -8.638 6.055 -1.427 0.16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 176.5 on 49 degrees of freedom
## Multiple R-squared: 0.03988, Adjusted R-squared: 0.02029
## F-statistic: 2.035 on 1 and 49 DF, p-value: 0.16
Equation: y=1023.35 - 8.638* premium
Slope: Each additional amount of fatal collisions that have happened in a state, we would expect the premium to increase by -8.638.
Intercept: The amount of car insurance premium with no amount of change based on fatal accidents is 1023.354 (y intercept). Multiple R squared tells us the measure of how well observed outcomes are replicated by the model. My multiple R squared value is 0.0398.
Residual: Having a negative residual means that the predicted value is too high, similarly if you have a positive residual it means that the predicted value was too low. The aim of a regression line is to minimise the sum of residuals.
My independent variable does not seem to predict my dependent one better as my R Square value is less than premium and NumOfDriversFatalCollisions.
ggplot(data = drivers, aes(x = drivers$NumOfDriversFatalCollisions, y = drivers$premium)) +
geom_point() +
stat_smooth(method = "lm", se = FALSE)
## Warning: Use of `drivers$NumOfDriversFatalCollisions` is discouraged. Use
## `NumOfDriversFatalCollisions` instead.
## Warning: Use of `drivers$premium` is discouraged. Use `premium` instead.
## Warning: Use of `drivers$NumOfDriversFatalCollisions` is discouraged. Use
## `NumOfDriversFatalCollisions` instead.
## Warning: Use of `drivers$premium` is discouraged. Use `premium` instead.
## `geom_smooth()` using formula 'y ~ x'
geom_smooth creates the line by fitting a linear model, which is used in the above code.
This line can be used to predict y at any value of x. When predictions are made for values of x that are beyond the range of the observed data, it is referred to as extrapolation and is not usually recommended. However, predictions made within the range of the data are more reliable. They’re also used to compute the residuals.
ggplot(data = m1, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
We saw above if the relationship is linear or not with the scatterplot. We should also verify this condition with a plot of the residuals vs. fitted (predicted) values.
#Residuals per count
ggplot(data = m1, aes(x = .resid)) +
geom_histogram(binwidth = 20) +
xlab("Residuals")
#graphical technique for determining if two data sets come from populations with a common distribution
ggplot(data = m1, aes(sample = .resid)) +
stat_qq()
We see both the normal probability plot and histogram show that the distribution of these data are nearly normal (almost a normal curve).
#Sorting states in ascending order of collisions
drivers %>%
select(State) %>%
arrange((drivers$NumOfDriversFatalCollisions))
## State
## 1 District of Columbia
## 2 Massachusetts
## 3 Minnesota
## 4 Washington
## 5 Connecticut
## 6 Rhode Island
## 7 New Jersey
## 8 Utah
## 9 New Hampshire
## 10 California
## 11 New York
## 12 Maryland
## 13 Virginia
## 14 Illinois
## 15 Oregon
## 16 Colorado
## 17 Vermont
## 18 Wisconsin
## 19 Michigan
## 20 Ohio
## 21 Indiana
## 22 Nevada
## 23 Nebraska
## 24 Maine
## 25 Idaho
## 26 Georgia
## 27 Iowa
## 28 Missouri
## 29 Delaware
## 30 North Carolina
## 31 Wyoming
## 32 Hawaii
## 33 Mississippi
## 34 Kansas
## 35 Florida
## 36 Alaska
## 37 Pennsylvania
## 38 New Mexico
## 39 Arizona
## 40 Alabama
## 41 South Dakota
## 42 Texas
## 43 Tennessee
## 44 Oklahoma
## 45 Louisiana
## 46 Kentucky
## 47 Montana
## 48 Arkansas
## 49 West Virginia
## 50 North Dakota
## 51 South Carolina
#NJ is the highest, Idaho is the lowest
drivers %>%
mutate(state = fct_reorder(drivers$State, drivers$premium)) %>%
ggplot(., aes(x = drivers$State,y = drivers$premium, fill = state)) +
geom_bar(position = "stack", stat="identity") +
ylab("Car Insurance Premium Per State in the United States") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
coord_flip()
#Reading in the data I collected
survey_data<-read.csv('https://raw.githubusercontent.com/Sangeetha-007/Worst-Drivers-in-the-US/main/NY%20Drivers_Road%20Experience%20(Responses)%20-%20Form%20Responses%201%20(1).csv')
head(survey_data, n=2)
## Timestamp X Gender Age Which.state.are.you.from
## 1 12/5/2022 18:19:29 NA Female 21-25 New York
## 2 12/5/2022 20:43:14 NA Male 25-30 New York
## Do.you.think.New.Yorkers.drivers.from.a.populated.city.are.terrible.drivers.from.your.experience..from.driving.or.being.in.a.car.or.even.from.crossing.the.road..
## 1 No
## 2 Yes
## Whether.you.chose..yes..or..no...why.do.you.feel.that.way.
## 1 I don't feel like New York drivers are worse than drivers in other states. Many of them stop for people crossing the road. It is just more crowded in the city so there's more problems because of that.
## 2 Unnecessary honking. Traffic violations. e.g. frequent running of red lights, not waiting for turns during stop signs. Switching of lanes is very awful as well.
#Bar graph shows age groups
ggplot(data = survey_data, aes(x = Age)) + geom_bar(fill ='#9F2B68')
#Word cloud on their reviews
thoughts<-survey_data$Whether.you.chose..yes..or..no...why.do.you.feel.that.way.
wordcloud(words = thoughts, min.freq = 1, max.words=200, random.order=FALSE, rot.per=0.35,colors=brewer.pal(8, "Dark2"))
## Loading required namespace: tm
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents
Although I believed car insurance rates should increase according to driver behavior (more collisions), the data shows otherwise. The sum of squares was extremely high (1,526,088), and the higher the sum of squares, the worse the fit of the model on the data. My standard error was 98.75 for the intercept and 6.1 for NumOfDriversFatalCollisions. A high standard error shows that the sample means are widely spread apart from the population mean, thus my sample does not represent the population. My residual sum also was too large, therefore my model didn’t fit the data well either. I think the reason my numbers were so high is because the number of collisions in a state doesn't have a direct correlation to how high car insurance premiums are (thus my weak negative correlation). There are many determining factors towards car insurance premiums such as demography, how close you are to a major city, gender, your driving background, your age, how many people are attached to your premium, the brand of your car, etc.