This is an R Markdown document containing Sean Amato’s work for the week 2 bridge homework.
Question 1: Use the summary function to gain an overview of the data set (USSeatBelts.csv). Display the mean and median for 2 columns (fatalities and age).
Summary:
us_seatbelts <-
read.csv('https://raw.githubusercontent.com/samato0624/R/main/USSeatBelts.csv')
summary(us_seatbelts)
## X state year miles
## Min. : 1 Length:765 Min. :1983 Min. : 3099
## 1st Qu.:192 Class :character 1st Qu.:1986 1st Qu.: 11401
## Median :383 Mode :character Median :1990 Median : 30319
## Mean :383 Mean :1990 Mean : 41448
## 3rd Qu.:574 3rd Qu.:1994 3rd Qu.: 52312
## Max. :765 Max. :1997 Max. :285612
##
## fatalities seatbelt speed65 speed70
## Min. :0.008327 Min. :0.0600 Length:765 Length:765
## 1st Qu.:0.017341 1st Qu.:0.4200 Class :character Class :character
## Median :0.021199 Median :0.5500 Mode :character Mode :character
## Mean :0.021490 Mean :0.5289
## 3rd Qu.:0.024774 3rd Qu.:0.6500
## Max. :0.045470 Max. :0.8700
## NA's :209
## drinkage alcohol income age
## Length:765 Length:765 Min. : 8372 Min. :28.23
## Class :character Class :character 1st Qu.:14266 1st Qu.:34.39
## Mode :character Mode :character Median :17624 Median :35.39
## Mean :17993 Mean :35.14
## 3rd Qu.:21080 3rd Qu.:36.13
## Max. :35863 Max. :39.17
##
## enforce
## Length:765
## Class :character
## Mode :character
##
##
##
##
Mean and Median for Fatalities and Age:
mean(us_seatbelts$fatalities)
## [1] 0.02148951
median(us_seatbelts$fatalities)
## [1] 0.02119896
mean(us_seatbelts$age)
## [1] 35.13719
median(us_seatbelts$age)
## [1] 35.39177
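The four separate calls above can also be collapsed into one step. The following is a minimal sketch, assuming a small made-up data frame (`toy`, with invented values) in place of `us_seatbelts`; it applies `mean()` and `median()` to each column of interest with `sapply()`.

```r
# Toy data standing in for us_seatbelts (values are made up for illustration)
toy <- data.frame(
  fatalities = c(0.010, 0.020, 0.030, 0.024),
  age        = c(30, 34, 36, 40)
)

# For each selected column, return a named vector holding its mean and median;
# sapply() binds the results into a matrix with one column per variable
stats <- sapply(toy[c("fatalities", "age")],
                function(x) c(mean = mean(x), median = median(x)))
print(stats)
```

The same call should work on the real data by replacing `toy` with `us_seatbelts`, provided the selected columns contain no missing values (otherwise `na.rm = TRUE` would be needed).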
Question 2: Create a new data frame with a subset of columns (state, fatalities, age, income) and rows (income >= 21080).
library(dplyr)
# Keep four columns, then keep only the rows whose income is at or above
# the 3rd quartile (21080) reported by summary() in question 1
filtered_df <- us_seatbelts %>%
  select(state, fatalities, age, income) %>%
  filter(income >= 21080)
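The income cutoff does not have to be typed in by hand. As a rough sketch, assuming a made-up data frame (`toy`) rather than the real data, the third quartile can be computed with `quantile()` and then used directly in the row filter. The names `toy`, `cutoff`, and `top_quarter` are hypothetical.

```r
# Toy data standing in for us_seatbelts (values are made up for illustration)
toy <- data.frame(
  state  = c("AA", "BB", "CC", "DD", "EE"),
  income = c(10000, 15000, 20000, 25000, 30000)
)

# 75th percentile of income, using quantile()'s default method
cutoff <- quantile(toy$income, probs = 0.75, names = FALSE)

# Keep only the rows at or above the computed cutoff
top_quarter <- toy[toy$income >= cutoff, ]
print(top_quarter)
```

Computing the cutoff this way keeps the filter in sync with the data if the CSV file is ever updated.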
Question 3: Create new names for each column.
colnames(filtered_df) <- c("STATE", "FATALITIES", "AGE", "INCOME")
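If the only change needed is capitalization, the new names can be derived from the old ones instead of being listed. This is a small sketch under the assumption that the columns are already named `state`, `fatalities`, `age`, and `income`; the one-row data frame `toy` is made up.

```r
# Toy one-row data frame with lowercase column names (values are made up)
toy <- data.frame(state = "AA", fatalities = 0.02, age = 35, income = 22000)

# Upper-case every existing column name in place
names(toy) <- toupper(names(toy))
print(names(toy))
```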
Question 4: Repeat question 1 with the new data frame (filtered_df). Checking for a difference in fatalities versus income; the data is filtered to the top 25% of observations by income.
Summary:
summary(filtered_df)
## STATE FATALITIES AGE INCOME
## Length:192 Min. :0.008327 Min. :29.83 Min. :21080
## Class :character 1st Qu.:0.013494 1st Qu.:35.41 1st Qu.:22287
## Mode :character Median :0.015744 Median :36.11 Median :23634
## Mean :0.016023 Mean :35.95 Mean :24456
## 3rd Qu.:0.018054 3rd Qu.:36.76 3rd Qu.:25632
## Max. :0.030117 Max. :39.17 Max. :35863
Mean and Median for Fatalities and Age:
mean(filtered_df$FATALITIES)
## [1] 0.01602266
median(filtered_df$FATALITIES)
## [1] 0.01574439
mean(filtered_df$AGE)
## [1] 35.95432
median(filtered_df$AGE)
## [1] 36.10976
Conclusions: The mean fatality rate in the full data set is 0.0215, while the mean fatality rate for the top 25% of observations by income is 0.0160, a decrease of approximately 25%. I can speculate that people with more money could afford vehicles with more safety features (e.g., seatbelts) and could more easily afford the maintenance required to keep their vehicles safe. However, it’s tough to say what’s causing the difference without investigating further. The mean age increased by only about 2.3% (35.1 vs. 36.0), so I can’t say how much age contributes to the fatality rate, but typically the older you are the more money you make, and that could contribute to the increase in mean age.
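The percent changes discussed in the conclusions can be recomputed from the means printed in questions 1 and 4. The helper `pct_change()` below is a hypothetical name introduced only for this illustration.

```r
# Percent change from a baseline value to a new value
pct_change <- function(from, to) (to - from) / from * 100

# Means copied from the output in questions 1 and 4
fatality_change <- pct_change(from = 0.02148951, to = 0.01602266)
age_change      <- pct_change(from = 35.13719,   to = 35.95432)

print(round(c(fatalities = fatality_change, age = age_change), 1))
# fatalities: -25.4, age: 2.3
```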
Question 5: Rename three distinct values in a column (State column: AK -> Alaska, AZ -> Arizona, CA -> California).
# AK is the postal abbreviation for Alaska (Arkansas is AR)
df <- filtered_df %>%
  mutate(STATE = ifelse(STATE == 'AK', 'Alaska',
    ifelse(STATE == 'AZ', 'Arizona',
    ifelse(STATE == 'CA', 'California', STATE))))
Question 6: Display enough rows to show examples of all the changes made to the data set in steps 1-5.
The data set loaded in question 1 had 13 columns including an index; in
question 2 I reduced the columns to 4 and kept only the rows with income >=
the 3rd quartile ($21080) of the original data set;
in question 3 I capitalized the column names;
in question 4 there was no change to the data;
in question 5 I changed 3 of the state abbreviations to their actual
names;
in question 6 I show only 20 rows, which is enough to illustrate that CO
was not renamed.
top_20_rows <- head(df, n = 20)
print(top_20_rows)
## STATE FATALITIES AGE INCOME
## 1 Alaska 0.02511813 29.82771 21496
## 2 Alaska 0.02811768 30.21070 22073
## 3 Alaska 0.03011741 30.46439 22711
## 4 Alaska 0.02048193 30.75657 23417
## 5 Alaska 0.02110114 31.17860 23971
## 6 Alaska 0.01968408 31.44535 24310
## 7 Alaska 0.01755186 31.60147 24969
## 8 Arizona 0.02186659 35.70044 21998
## 9 California 0.02005206 33.83672 21363
## 10 California 0.01817223 33.79849 21491
## 11 California 0.01596660 33.89958 22191
## 12 California 0.01563016 33.98206 22430
## 13 California 0.01556208 34.05400 22953
## 14 California 0.01516802 34.17737 23983
## 15 California 0.01434670 34.29433 25142
## 16 California 0.01291262 34.48209 26314
## 17 CO 0.01708540 34.55241 22117
## 18 CO 0.01738614 34.78416 23019
## 19 CO 0.01839808 34.98942 24304
## 20 CO 0.01707202 35.19139 25627
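Nested `ifelse()` calls like the ones in question 5 get harder to read as more states are added. One alternative, sketched here on a made-up vector of codes, is a named lookup vector; `state_names`, `codes`, and `recoded` are hypothetical names, and only two abbreviations are mapped to keep the example short.

```r
# Lookup table: names are the abbreviations, values are the full state names
state_names <- c(AZ = "Arizona", CA = "California")

# Made-up vector of state codes; CO is deliberately absent from the lookup
codes <- c("AZ", "CA", "CO", "CA")

# Indexing by name returns NA for codes that are not in the lookup table
full <- unname(state_names[codes])

# Fall back to the original code wherever no full name was found
recoded <- ifelse(is.na(full), codes, full)
print(recoded)
```

Adding another state then only requires one more entry in `state_names`, and unmapped codes such as CO pass through unchanged.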
Question 7: Have this program read the CSV file from GitHub.
#Used in question 1.
us_seatbelts <-
read.csv('https://raw.githubusercontent.com/samato0624/R/main/USSeatBelts.csv')