1. Introduction
2. Question for analysis
3. Description of data used
3.1 Load data and summary of the data
3.2 Format
3.3 Applying mean(), median() functions
4. Transforming data
4.1 New dataset grouped by state
4.2 New dataset grouped by year
5. Visualize the data
5.1 Scatter Plot for main dataset
5.2 Pie charts
5.3 Box plot
5.4 Histograms and scatterplots for subsets
5.4.1 Data per Year
5.4.2 Data per State
6. Conclusion
This is the final project for the Bridge course, demonstrating the skills learned during the three-week course. The analysis uses the dataset “Drunk Driving Laws and Traffic Deaths”. The main tools applied are the functions mean(), median(), and summary(), a table()-based calculation of the mode, and graphics from the ggplot2 library.
Does the fatality rate due to drunk driving depend on economic well-being?
The dataset “Drunk Driving Laws and Traffic Deaths” is a panel of 48 US states observed annually from 1982 to 1988, for 336 observations in total.
Source: https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Fatality.csv
File location: https://raw.githubusercontent.com/ex-pr/Data-set-week-3/main/Fatality.csv
The dataset was loaded from the GitHub repository. Only columns 2 to 11 are used; the first column was removed because it distracts from the main information. The columns were renamed to make their contents easier to understand. The next step is to show a summary of the dataset to get a sense of the values it contains.
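library(dplyr) # %>%, group_by(), summarize(), arrange(), count(), mutate()
library(ggplot2) # ggplot() and the geoms used in section 5
fatality <- read.csv("https://raw.githubusercontent.com/ex-pr/Data-set-week-3/main/Fatality.csv", header=TRUE, sep=",")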
fatality <- fatality[, 2:11]
colnames(fatality) <- c("State", "Year", "Fatality_rate", "Beer_tax", "Legal_drink_age", "Jail_sentence", "Community_service", "Miles_per_driver", "Unemployment_rate", "Income")
summary(fatality)
## State Year Fatality_rate Beer_tax
## Length:336 Min. :1982 Min. :0.8212 Min. :0.04331
## Class :character 1st Qu.:1983 1st Qu.:1.6237 1st Qu.:0.20885
## Mode :character Median :1985 Median :1.9560 Median :0.35259
## Mean :1985 Mean :2.0404 Mean :0.51326
## 3rd Qu.:1987 3rd Qu.:2.4179 3rd Qu.:0.65157
## Max. :1988 Max. :4.2178 Max. :2.72076
## Legal_drink_age Jail_sentence Community_service Miles_per_driver
## Min. :18.00 Length:336 Length:336 Min. : 4.576
## 1st Qu.:20.00 Class :character Class :character 1st Qu.: 7.183
## Median :21.00 Mode :character Mode :character Median : 7.796
## Mean :20.46 Mean : 7.891
## 3rd Qu.:21.00 3rd Qu.: 8.504
## Max. :21.00 Max. :26.148
## Unemployment_rate Income
## Min. : 2.400 Min. : 9514
## 1st Qu.: 5.475 1st Qu.:12086
## Median : 7.000 Median :13763
## Mean : 7.347 Mean :13880
## 3rd Qu.: 8.900 3rd Qu.:15175
## Max. :18.000 Max. :22193
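The summary reports 336 rows, which matches 48 states times 7 years. A minimal sketch of how that balance could be checked is shown here, assuming one row per state and year. The helper `is_balanced` and the toy data frame are hypothetical and are not part of the original analysis.

```r
# Toy stand-in for the loaded data: 3 states x 2 years
# (made-up values, not taken from Fatality.csv)
toy <- data.frame(
  State = rep(c("AA", "BB", "CC"), each = 2),
  Year = rep(c(1982, 1983), times = 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when every State/Year combination occurs exactly once
is_balanced <- function(df) {
  counts <- table(df$State, df$Year)
  all(counts == 1)
}

is_balanced(toy)        # complete panel
is_balanced(toy[-1, ])  # one State/Year row missing
```

On the real data the same call would be `is_balanced(fatality)`.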
The data frame contains:
State - state postal code
Year - year from 1982 to 1988
Fatality_rate - traffic fatality rate (deaths per 10000)
Beer_tax - tax on case of beer
Legal_drink_age - minimum legal drinking age
Jail_sentence - mandatory jail sentence?
Community_service - mandatory community service?
Miles_per_driver - average miles per driver
Unemployment_rate - unemployment rate
Income - per capita personal income
The goal is to calculate the mean and median of the fatality rate, beer tax, miles per driver, unemployment rate, and per capita income across the 48 states over the 7 years, and to find the most common value (the mode) of the jail sentence, community service, and legal drinking age columns.
mean_fatality <- mean(fatality[,3])
median_fatality <- median(fatality[,3])
mean_beertax <- mean(fatality[,4])
median_beertax <- median(fatality[,4])
mean_miles <- mean(fatality[,8])
median_miles <- median(fatality[,8])
mean_unempl <- mean(fatality[,9])
median_unempl <- median(fatality[,9])
mean_income <- mean(fatality[,10])
median_income <- median(fatality[,10])
mode_jail <- names(which.max(table(fatality[,6])))
mode_community <- names(which.max(table(fatality[,7])))
mode_drinkage <- names(which.max(table(fatality[,5])))
print(sprintf("Mean fatality rate is %f, median fatality rate is %f, mean beer tax is %f, median beer tax is %f, mean miles per driver is %f, median miles per driver is %f", mean_fatality, median_fatality, mean_beertax, median_beertax,mean_miles, median_miles))
## [1] "Mean fatality rate is 2.040444, median fatality rate is 1.955955, mean beer tax is 0.513256, median beer tax is 0.352589, mean miles per driver is 7.890754, median miles per driver is 7.796219"
print(sprintf("Mean unemployment rate is %f, median unemployment rate is %f, mean income per family is %f, median income per family is %f", mean_unempl, median_unempl, mean_income, median_income))
## [1] "Mean unemployment rate is 7.346726, median unemployment rate is 7.000000, mean income per family is 13880.184533, median income per family is 13763.128906"
print(sprintf("Most common is %s to jail sentence for drunk driving. Most common is %s to community service for drunk driving. The most common drink age is %s", mode_jail, mode_community, mode_drinkage))
## [1] "Most common is no to jail sentence for drunk driving. Most common is no to community service for drunk driving. The most common drink age is 21"
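Base R's `mode()` returns the storage mode of an object, not the most frequent value, which is why the code above uses `table()` with `which.max()`. A small sketch of that idiom wrapped in a function follows. The name `stat_mode` and the toy vector are illustrative assumptions.

```r
# Hypothetical helper: most frequent value of a vector, returned as a character string.
# On ties, which.max() picks the first level in sorted order.
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

ages <- c(21, 21, 18, 19, 21, 18)  # toy values, not from Fatality.csv

stat_mode(ages)  # "21"
mode(ages)       # "numeric", the storage mode rather than the statistical mode
```

With the data frame from section 3.1, `stat_mode(fatality$Legal_drink_age)` would be equivalent to the `mode_drinkage` line above.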
The new data set combines the 7 yearly observations for each state into one row. Each row includes the state code, the mean fatality rate over the 7 years, mean income, mean beer tax, mean miles per driver, mean unemployment rate, and the sum of the fatality rates over the 7 years. The rows are sorted in descending order, with the state with the highest summed fatality rate at the top and the lowest at the bottom.
fatality_state <- data.frame(fatality %>%
group_by(State)%>%
summarize(fatal_state = mean(Fatality_rate), income_state = mean(Income), beertax_state=mean(Beer_tax), miles_state=mean(Miles_per_driver), unempl_state=mean(Unemployment_rate), sum_fatal=sum(Fatality_rate)))
fatality_state <- data.frame(fatality_state%>%
arrange(desc(sum_fatal)))
summary(fatality_state)
## State fatal_state income_state beertax_state
## Length:48 Min. :1.110 Min. : 9951 Min. :0.04817
## Class :character 1st Qu.:1.661 1st Qu.:12054 1st Qu.:0.21161
## Mode :character Median :1.974 Median :13616 Median :0.37183
## Mean :2.040 Mean :13880 Mean :0.51326
## 3rd Qu.:2.402 3rd Qu.:15144 3rd Qu.:0.64710
## Max. :3.653 Max. :19516 Max. :2.44051
## miles_state unempl_state sum_fatal
## Min. : 5.130 Min. : 4.100 Min. : 7.771
## 1st Qu.: 7.361 1st Qu.: 5.861 1st Qu.:11.630
## Median : 7.904 Median : 7.271 Median :13.815
## Mean : 7.891 Mean : 7.347 Mean :14.283
## 3rd Qu.: 8.263 3rd Qu.: 8.639 3rd Qu.:16.814
## Max. :10.593 Max. :13.200 Max. :25.572
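For readers without dplyr, the same kind of grouping can be sketched in base R with `aggregate()`. This is an illustrative equivalent on a toy panel, not the code used for the results in this report. It also shows why sorting by the sum and sorting by the mean give the same order: in a balanced panel, the sum is simply the mean times the number of years.

```r
# Toy panel: 2 states x 2 years (made-up values, not from Fatality.csv)
toy <- data.frame(
  State = c("AA", "AA", "BB", "BB"),
  Year = c(1982, 1983, 1982, 1983),
  Fatality_rate = c(2.0, 3.0, 1.0, 1.5),
  Income = c(10000, 11000, 15000, 16000),
  stringsAsFactors = FALSE
)

# Mean of each numeric column per state
by_state <- aggregate(cbind(Fatality_rate, Income) ~ State, data = toy, FUN = mean)
names(by_state) <- c("State", "fatal_state", "income_state")

# Sum of the fatality rate per state (same group order as above)
by_state$sum_fatal <- aggregate(Fatality_rate ~ State, data = toy, FUN = sum)$Fatality_rate

# Sort with the highest summed fatality rate first
by_state <- by_state[order(-by_state$sum_fatal), ]
print(by_state)

# With 2 years per state, sum = 2 * mean
all.equal(by_state$sum_fatal, 2 * by_state$fatal_state)
```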
To find the states with the maximum mean fatality rate, summed fatality rate, income, beer tax, and unemployment rate, the code below is used:
fatality_state[which.max(fatality_state$fatal_state),]
## State fatal_state income_state beertax_state miles_state unempl_state
## 1 NM 3.653197 11682.84 0.3824027 9.239428 8.785714
## sum_fatal
## 1 25.57238
fatality_state[which.max(fatality_state$sum_fatal),]
## State fatal_state income_state beertax_state miles_state unempl_state
## 1 NM 3.653197 11682.84 0.3824027 9.239428 8.785714
## sum_fatal
## 1 25.57238
fatality_state[which.max(fatality_state$income_state),]
## State fatal_state income_state beertax_state miles_state unempl_state
## 42 CT 1.463509 19515.82 0.231545 7.247288 4.642857
## sum_fatal
## 42 10.24456
fatality_state[which.max(fatality_state$beertax_state),]
## State fatal_state income_state beertax_state miles_state unempl_state
## 13 GA 2.401569 13316.89 2.440507 9.089214 6.428571
## sum_fatal
## 13 16.81098
fatality_state[which.max(fatality_state$unempl_state),]
## State fatal_state income_state beertax_state miles_state unempl_state
## 16 WV 2.300624 10812.34 0.4272953 6.585874 13.2
## sum_fatal
## 16 16.10437
To find the states with the minimum mean fatality rate, income, beer tax, and unemployment rate, the code below is used:
fatality_state[which.min(fatality_state$fatal_state),]
## State fatal_state income_state beertax_state miles_state unempl_state
## 48 RI 1.110077 14713.41 0.1562781 6.007946 5.657143
## sum_fatal
## 48 7.77054
fatality_state[which.min(fatality_state$income_state),]
## State fatal_state income_state beertax_state miles_state unempl_state
## 5 MS 2.761846 9950.87 1.047007 7.368382 10.71429
## sum_fatal
## 5 19.33292
fatality_state[which.min(fatality_state$beertax_state),]
## State fatal_state income_state beertax_state miles_state unempl_state
## 2 WY 3.217534 13452.9 0.04816791 10.59269 7.357143
## sum_fatal
## 2 22.52274
fatality_state[which.min(fatality_state$unempl_state),]
## State fatal_state income_state beertax_state miles_state unempl_state
## 31 NH 1.798824 16281.71 0.6470018 7.917257 4.1
## sum_fatal
## 31 12.59177
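The repeated `which.max()` and `which.min()` calls can also be collapsed into a single pass over the columns. The sketch below is one possible way to do that. To stay self-contained it uses only the ten rows printed in the next listing (the ten states with the highest fatality rate), so its answers differ from the full 48-state results above.

```r
# Ten rows copied from the printed fatality_state listing (a subset, not all 48 states)
top10 <- data.frame(
  State = c("NM", "WY", "MT", "SC", "MS", "NV", "AZ", "ID", "FL", "AR"),
  fatal_state = c(3.653197, 3.217534, 2.903021, 2.821669, 2.761846,
                  2.745260, 2.705900, 2.571667, 2.477799, 2.435336),
  income_state = c(11682.84, 13452.90, 12044.86, 11394.35, 9950.87,
                   15685.50, 13535.99, 11551.72, 14737.28, 11066.19),
  beertax_state = c(0.38240269, 0.04816791, 0.32661333, 1.84964757, 1.04700737,
                    0.20064095, 0.31104035, 0.36125929, 1.11561312, 0.59057525),
  unempl_state = c(8.785714, 7.357143, 7.828572, 7.285714, 10.714286,
                   7.600000, 7.128571, 8.171429, 6.442857, 8.857143),
  stringsAsFactors = FALSE
)

num_cols <- c("fatal_state", "income_state", "beertax_state", "unempl_state")

# For each column, look up the state holding the largest and the smallest value
extremes <- data.frame(
  column = num_cols,
  max_state = sapply(num_cols, function(col) top10$State[which.max(top10[[col]])]),
  min_state = sapply(num_cols, function(col) top10$State[which.min(top10[[col]])]),
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(extremes)
```

Replacing `top10` with `fatality_state` would apply the same lookup to all 48 states.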
The 10 states with the highest summed fatality rate:
fatality_state[1:10, ]
## State fatal_state income_state beertax_state miles_state unempl_state
## 1 NM 3.653197 11682.84 0.38240269 9.239428 8.785714
## 2 WY 3.217534 13452.90 0.04816791 10.592690 7.357143
## 3 MT 2.903021 12044.86 0.32661333 9.270159 7.828572
## 4 SC 2.821669 11394.35 1.84964757 8.200333 7.285714
## 5 MS 2.761846 9950.87 1.04700737 7.368382 10.714286
## 6 NV 2.745260 15685.50 0.20064095 8.023373 7.600000
## 7 AZ 2.705900 13535.99 0.31104035 7.742038 7.128571
## 8 ID 2.571667 11551.72 0.36125929 8.000849 8.171429
## 9 FL 2.477799 14737.28 1.11561312 7.824188 6.442857
## 10 AR 2.435336 11066.19 0.59057525 7.411801 8.857143
## sum_fatal
## 1 25.57238
## 2 22.52274
## 3 20.32115
## 4 19.75168
## 5 19.33292
## 6 19.21682
## 7 18.94130
## 8 18.00167
## 9 17.34459
## 10 17.04735
The new data set combines the observations for the 48 states into one row per year. Each row includes the year, the mean fatality rate across the 48 states in that year, mean income, mean beer tax, mean miles per driver, mean unemployment rate, and the sum of the fatality rates across the 48 states in that year. The rows are sorted in descending order, with the year with the highest summed fatality rate at the top and the lowest at the bottom.
fatality_year <- data.frame(fatality %>%
group_by(Year)%>%
summarize(fatal_year = mean(Fatality_rate), income_year = mean(Income), beertax_year=mean(Beer_tax), miles_year=mean(Miles_per_driver), unempl_year=mean(Unemployment_rate), sum_fatality=sum(Fatality_rate)))
fatality_year <- data.frame(fatality_year%>%
arrange(desc(sum_fatality)))
summary(fatality_year)
## Year fatal_year income_year beertax_year
## Min. :1982 Min. :1.974 Min. :12998 Min. :0.4798
## 1st Qu.:1984 1st Qu.:2.012 1st Qu.:13345 1st Qu.:0.5019
## Median :1985 Median :2.061 Median :13843 Median :0.5169
## Mean :1985 Mean :2.040 Mean :13880 Mean :0.5133
## 3rd Qu.:1986 3rd Qu.:2.067 3rd Qu.:14368 3rd Qu.:0.5299
## Max. :1988 Max. :2.089 Max. :14894 Max. :0.5324
## miles_year unempl_year sum_fatality
## Min. :7.227 Min. :5.456 Min. : 94.74
## 1st Qu.:7.563 1st Qu.:6.570 1st Qu.: 96.60
## Median :7.960 Median :7.060 Median : 98.91
## Mean :7.891 Mean :7.347 Mean : 97.94
## 3rd Qu.:8.153 3rd Qu.:8.250 3rd Qu.: 99.23
## Max. :8.616 Max. :9.271 Max. :100.28
To find the years with the maximum mean fatality rate, income, beer tax, and unemployment rate, the code below is used:
fatality_year[which.max(fatality_year$fatal_year),]
## Year fatal_year income_year beertax_year miles_year unempl_year sum_fatality
## 1 1982 2.089106 12998.26 0.5302734 7.227225 9.266667 100.2771
fatality_year[which.max(fatality_year$income_year),]
## Year fatal_year income_year beertax_year miles_year unempl_year sum_fatality
## 2 1988 2.069594 14893.53 0.4798154 8.61583 5.45625 99.34052
fatality_year[which.max(fatality_year$beertax_year),]
## Year fatal_year income_year beertax_year miles_year unempl_year sum_fatality
## 6 1983 2.007846 13108.08 0.532393 7.384729 9.270833 96.37663
fatality_year[which.max(fatality_year$unempl_year),]
## Year fatal_year income_year beertax_year miles_year unempl_year sum_fatality
## 6 1983 2.007846 13108.08 0.532393 7.384729 9.270833 96.37663
To find the years with the minimum mean fatality rate, income, beer tax, and unemployment rate, the code below is used:
fatality_year[which.min(fatality_year$fatal_year),]
## Year fatal_year income_year beertax_year miles_year unempl_year sum_fatality
## 7 1985 1.973671 13842.82 0.5169272 7.740698 7.060417 94.7362
fatality_year[which.min(fatality_year$income_year),]
## Year fatal_year income_year beertax_year miles_year unempl_year sum_fatality
## 1 1982 2.089106 12998.26 0.5302734 7.227225 9.266667 100.2771
fatality_year[which.min(fatality_year$beertax_year),]
## Year fatal_year income_year beertax_year miles_year unempl_year sum_fatality
## 2 1988 2.069594 14893.53 0.4798154 8.61583 5.45625 99.34052
fatality_year[which.min(fatality_year$unempl_year),]
## Year fatal_year income_year beertax_year miles_year unempl_year sum_fatality
## 2 1988 2.069594 14893.53 0.4798154 8.61583 5.45625 99.34052
Years in descending order of summed fatality rate:
fatality_year[,]
## Year fatal_year income_year beertax_year miles_year unempl_year sum_fatality
## 1 1982 2.089106 12998.26 0.5302734 7.227225 9.266667 100.27708
## 2 1988 2.069594 14893.53 0.4798154 8.615830 5.456250 99.34052
## 3 1986 2.065071 14186.30 0.5086639 8.016382 6.918750 99.12341
## 4 1987 2.060696 14549.79 0.4951288 8.290278 6.220833 98.91339
## 5 1984 2.017122 13582.51 0.5295902 7.960133 7.233333 96.82188
## 6 1983 2.007846 13108.08 0.5323930 7.384729 9.270833 96.37663
## 7 1985 1.973671 13842.82 0.5169272 7.740698 7.060417 94.73620
The scatter plots were built to check whether the fatality rate depends on the other columns in the data set.
The first plot shows the relationship between per capita income and the fatality rate: as income increases, the fatality rate goes down.
In the second plot, the relationship between the unemployment rate and the fatality rate is not linear, but a general trend is visible: the fatality rate increases as the unemployment rate goes up.
In the third plot, the beer tax does not appear to affect the fatality rate. The beer tax stays below 1 for most observations, whatever the fatality rate.
In the fourth plot, the relationship appears roughly logarithmic: the more miles per driver, the higher the fatality rate.
ggplot(fatality, aes(x=Fatality_rate, y=Income)) + geom_point(color="cornflowerblue", size = 2, alpha=.8) + scale_x_continuous("Fatality rate") + scale_y_continuous("Income") + theme_minimal()
ggplot(fatality, aes(x=Fatality_rate, y=Unemployment_rate)) + geom_point(color="red", size = 3, alpha=.7) + scale_x_continuous("Fatality rate") + scale_y_continuous("Unemployment rate")
ggplot(fatality, aes(x=Fatality_rate, y=Beer_tax)) + geom_point(color="green", size = 2, alpha=.8) +scale_x_continuous("Fatality rate") + scale_y_continuous("Beer tax") + theme_minimal()
ggplot(fatality, aes(x=Fatality_rate, y=Miles_per_driver)) + geom_point(color="blue", size = 1, alpha=.8) + scale_x_continuous("Fatality rate") + scale_y_continuous("Miles per driver")
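The visual impressions from the scatter plots could be backed up with correlation coefficients. The sketch below is one way to do that, not a step from the original analysis. The helper `cor_with` is hypothetical, and the demo data are only the ten state rows printed in section 4.1, so the resulting numbers describe that subset and should not be read as results for the full data set.

```r
# Hypothetical helper: correlation of one target column with every other numeric column
cor_with <- function(df, target, method = "pearson") {
  others <- setdiff(names(df)[sapply(df, is.numeric)], target)
  sapply(others, function(col) cor(df[[target]], df[[col]], method = method))
}

# Ten rows copied from the printed fatality_state listing (a subset, not all 48 states)
top10 <- data.frame(
  State = c("NM", "WY", "MT", "SC", "MS", "NV", "AZ", "ID", "FL", "AR"),
  fatal_state = c(3.653197, 3.217534, 2.903021, 2.821669, 2.761846,
                  2.745260, 2.705900, 2.571667, 2.477799, 2.435336),
  income_state = c(11682.84, 13452.90, 12044.86, 11394.35, 9950.87,
                   15685.50, 13535.99, 11551.72, 14737.28, 11066.19),
  beertax_state = c(0.38240269, 0.04816791, 0.32661333, 1.84964757, 1.04700737,
                    0.20064095, 0.31104035, 0.36125929, 1.11561312, 0.59057525),
  miles_state = c(9.239428, 10.592690, 9.270159, 8.200333, 7.368382,
                  8.023373, 7.742038, 8.000849, 7.824188, 7.411801),
  unempl_state = c(8.785714, 7.357143, 7.828572, 7.285714, 10.714286,
                   7.600000, 7.128571, 8.171429, 6.442857, 8.857143),
  stringsAsFactors = FALSE
)

r_pearson <- cor_with(top10, "fatal_state")
r_spearman <- cor_with(top10, "fatal_state", method = "spearman")  # rank-based, suits non-linear trends
print(round(r_pearson, 2))
print(round(r_spearman, 2))
```

On the full data the call would be `cor_with(fatality, "Fatality_rate")`. The Spearman variant is included because two of the relationships are described as non-linear.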
The pie charts show that most state-year observations had no mandatory jail sentence or community service for drunk driving between 1982 and 1988, so in most states people could drive drunk without real consequences unless they were involved in a fatal crash.
plotdata <- fatality %>%
count(Jail_sentence) %>%
arrange(desc(Jail_sentence)) %>%
mutate(prop = round(n*100/sum(n), 1),
lab.ypos = cumsum(prop) - 0.5*prop)
plotdata$label <- paste0(plotdata$Jail_sentence, "\n",
round(plotdata$prop), "%")
ggplot(plotdata,
aes(x = "",
y = prop,
fill = Jail_sentence)) +
geom_bar(width = 1,
stat = "identity",
color = "black") +
geom_text(aes(y = lab.ypos, label = label),
color = "black") +
coord_polar("y",
start = 0,
direction = -1) +
theme_void() +
theme(legend.position = "none") +
labs(title = "Jail sentence?")
plotdata <- fatality %>%
count(Community_service) %>%
arrange(desc(Community_service)) %>%
mutate(prop = round(n*100/sum(n), 1),
lab.ypos = cumsum(prop) - 0.5*prop)
plotdata$label <- paste0(plotdata$Community_service, "\n",
round(plotdata$prop), "%")
ggplot(plotdata,
aes(x = "",
y = prop,
fill = Community_service)) +
geom_bar(width = 1,
stat = "identity",
color = "black") +
geom_text(aes(y = lab.ypos, label = label),
color = "black") +
coord_polar("y",
start = 0,
direction = -1) +
theme_void() +
theme(legend.position = "none") +
labs(title = "Community service?")
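The label position in the pie chart code, `lab.ypos = cumsum(prop) - 0.5*prop`, places each label at the midpoint of its slice along the 0 to 100 axis. A small base R walk-through of that arithmetic is sketched below on a toy column. The values are made up for illustration and are not the shares in the data set.

```r
# Toy column: 3 "yes" and 7 "no" (made-up values, not from Fatality.csv)
jail <- c(rep("yes", 3), rep("no", 7))

counts <- table(jail)

# Percentage share of each category
prop <- as.vector(round(100 * counts / sum(counts), 1))
names(prop) <- names(counts)

# Same ordering as arrange(desc(Jail_sentence)): "yes" before "no"
prop <- prop[order(names(prop), decreasing = TRUE)]

# End of each slice minus half its width = midpoint of the slice
lab_ypos <- cumsum(prop) - 0.5 * prop

print(data.frame(category = names(prop), prop = prop, lab_ypos = lab_ypos, row.names = NULL))
```

Here the "yes" slice spans 0 to 30, so its label sits at 15, and the "no" slice spans 30 to 100, so its label sits at 65.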
In the box plot, the median fatality rate changes little from year to year.
boxplot(Fatality_rate~Year,data=fatality, main="Fatality per year Data",
xlab="Year", ylab="Fatality per year")
The goal is to check how the data changed from 1982 to 1988.
The first graph shows that the summed fatality rate across the 48 states stays at almost the same level from year to year.
The second graph shows an increase in per capita income over the 7 years, and the third graph shows the beer tax going down. Similarly, the unemployment rate goes down in the fourth graph. These positive changes likely reflect an improving economy.
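The size of these changes can be put into numbers with a simple percent-change calculation. The sketch below uses the 1982 and 1988 rows of the `fatality_year` table printed in section 4.2. The vector names `y1982`, `y1988`, and `pct_change` are introduced here for illustration only.

```r
# Yearly means for 1982 and 1988, copied from the printed fatality_year table
y1982 <- c(fatal_year = 2.089106, income_year = 12998.26,
           beertax_year = 0.5302734, unempl_year = 9.266667)
y1988 <- c(fatal_year = 2.069594, income_year = 14893.53,
           beertax_year = 0.4798154, unempl_year = 5.456250)

# Percent change from 1982 to 1988 for each column
pct_change <- 100 * (y1988 - y1982) / y1982
print(round(pct_change, 1))
```

On these figures the mean fatality rate moves by less than 1 percent, income rises by roughly 15 percent, the beer tax falls by roughly 10 percent, and the unemployment rate falls by roughly 41 percent.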
ggplot(fatality_year, aes(factor(Year, labels = c("1982",
"1983",
"1984",
"1985",
"1986",
"1987",
"1988")), sum_fatality)) + geom_bar(stat = "identity", fill = "cornflowerblue") + scale_x_discrete("Year") + scale_y_continuous("Fatality per year") + coord_flip() + theme_minimal()
ggplot(fatality_year, aes(factor(Year, labels = c("1982",
"1983",
"1984",
"1985",
"1986",
"1987",
"1988")), income_year)) + geom_bar(stat = "identity", fill = "pink") + scale_x_discrete("Year") + scale_y_continuous("Income") + theme_minimal()
ggplot(fatality_year, aes(factor(Year, labels = c("1982",
"1983",
"1984",
"1985",
"1986",
"1987",
"1988")), beertax_year)) + geom_bar(stat = "identity", fill = "lightgreen") + scale_x_discrete("Year") + scale_y_continuous("Beer Tax")
ggplot(fatality_year, aes(factor(Year, labels = c("1982",
"1983",
"1984",
"1985",
"1986",
"1987",
"1988")), unempl_year)) + geom_bar(stat = "identity", fill = "blue") +scale_x_discrete("Year") + scale_y_continuous("Unemployment rate")
As discussed in 4.1, the 10 states with the highest fatality rate (mean over 7 years) are NM, WY, MT, SC, MS, NV, AZ, ID, FL, AR; they stand out in the first graph.
Graph 2 lets us check whether the states with the highest fatality rate also have the lowest income. The states with the lowest mean per capita income are MS, WV, AR, UT, AL, SC, KY, ID, NM, SD. Five of them (SC, MS, ID, NM, AR) are also among the ten states with the highest fatality rate, so states with below-average income tend to have a higher fatality rate.
The third graph shows how the beer tax varies from state to state. The states with the lowest beer tax are WY, NJ, CA, NY, WI, DE, RI, IL, CO, KY, where alcohol is cheaper to buy.
The fourth graph shows the unemployment rate by state. The states with the highest unemployment rate are WV, LA, MI, MS, AL, KY, OH, IL, WA, AR. Several of these states also appear above: MS and AR are in all three lists, and WV, AL, and KY are also among the lowest-income states.
ggplot(fatality_state, aes(State, fatal_state)) + geom_bar(stat = "identity", fill = "cornflowerblue") + scale_x_discrete("State") + scale_y_continuous("Fatality per state") + theme(axis.text.x = element_text(angle = 90))
ggplot(fatality_state, aes(State, income_state)) + geom_bar(stat = "identity", fill = "darkblue") + scale_x_discrete("State") + scale_y_continuous("Income") + theme(axis.text.x = element_text(angle = 90))
ggplot(fatality_state, aes(State, beertax_state)) + geom_bar(stat = "identity", fill = "lightgreen") + scale_x_discrete("State") + scale_y_continuous("Beer Tax") + theme(axis.text.x = element_text(angle = 90))
ggplot(fatality_state, aes(State, unempl_state)) + geom_bar(stat = "identity", fill = "lightblue") + scale_x_discrete("State") + scale_y_continuous("Unemployment rate") + theme(axis.text.x = element_text(angle = 90))
We can check how the fatality rate depends on the values in the other columns.
In the first graph, we see that as income decreases, the fatality rate goes up.
In the second graph, the higher the unemployment rate, the higher the fatality rate.
In the third graph, there is only a slight dependence between the fatality rate and the beer tax. The fatality rate varies widely while the beer tax stays between 0 and 1, but most of the high fatality rates occur in states with a low beer tax.
In the fourth graph, we see that the more miles per driver, the higher the fatality rate.
ggplot(fatality_state, aes(x=fatal_state, y=income_state)) + geom_point(color="cornflowerblue", size = 2, alpha=.8) + scale_x_continuous("Fatality rate per state") + scale_y_continuous("Income") + theme_minimal()
ggplot(fatality_state, aes(x=fatal_state, y=unempl_state)) + geom_point(color="cornflowerblue", size = 2, alpha=.8) + scale_x_continuous("Fatality rate per state") + scale_y_continuous("Unemployment rate") + theme_minimal()
ggplot(fatality_state, aes(x=fatal_state, y=beertax_state)) + geom_point(color="cornflowerblue", size = 2, alpha=.8) + scale_x_continuous("Fatality rate per state") + scale_y_continuous("Beer tax") + theme_minimal()
ggplot(fatality_state, aes(x=fatal_state, y=miles_state)) + geom_point(color="cornflowerblue", size = 2, alpha=.8) + scale_x_continuous("Fatality rate per state") + scale_y_continuous("Miles per driver") + theme_minimal()
After analyzing the columns of the data set, we found that the fatality rate is related to the unemployment rate, income, the beer tax, and miles per driver. All of the columns examined appear to affect the fatality rate. Columns such as the unemployment rate and income reflect the well-being of a state. In the states with a worse economic situation, more people died from drunk driving. Also, the fact that most states had almost no mandatory punishment may have left drivers with little deterrent. Even though the data used for the analysis is more than 30 years old, the results can still be applied to current times with minor adjustments.