Question 1: Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes of your data.
I will be using the nassCDS data frame in the DAAG package. The description of the data is: “Airbag and other influences on accident fatalities.” Below, I am displaying the first 20 rows of the data.
library(DAAG)
library(dplyr)  # provides summarize(), group_by(), and %>% used below
data(nassCDS, package = 'DAAG')
# Display the first 20 rows of the nassCDS data.
head(nassCDS, n = 20)
Below, I have used the summary function from base R to gain an overview of the data.
# Generate an overview of the data
summary(nassCDS)
 dvcat           weight             dead          airbag         seatbelt       frontal          sex
 1-9km/h:  686   Min.   :    0.00   alive:25037   none  :11798   none  : 7644   Min.   :0.0000   f:12248
 10-24  :12848   1st Qu.:   32.47   dead : 1180   airbag:14419   belted:18573   1st Qu.:0.0000   m:13969
 25-39  : 8214   Median :   86.99                                               Median :1.0000
 40-54  : 2977   Mean   :  462.81                                               Mean   :0.6433
 55+    : 1492   3rd Qu.:  364.72                                               3rd Qu.:1.0000
                 Max.   :57871.59                                               Max.   :1.0000
 ageOFocc        yearacc        yearVeh        abcat              occRole            deploy          injSeverity
 Min.   :16.00   Min.   :1997   Min.   :1953   Length:26217       Length:26217       Min.   :0.000   Min.   :0.000
 1st Qu.:22.00   1st Qu.:1998   1st Qu.:1989   Class :character   Class :character   1st Qu.:0.000   1st Qu.:1.000
 Median :33.00   Median :2000   Median :1994   Mode  :character   Mode  :character   Median :0.000   Median :2.000
 Mean   :37.21   Mean   :2000   Mean   :1993                                         Mean   :0.337   Mean   :1.716
 3rd Qu.:48.00   3rd Qu.:2001   3rd Qu.:1997                                         3rd Qu.:1.000   3rd Qu.:3.000
 Max.   :97.00   Max.   :2002   Max.   :2003                                         Max.   :1.000   Max.   :6.000
                                NA's   :1                                                            NA's   :153
 caseid
 Length:26217
 Class :character
 Mode  :character
Below, I have calculated the mean and median of occupant age (ageOFocc) and injury severity (injSeverity) using the summarize function from dplyr. Missing severity values are dropped with na.rm = TRUE.
summarize(nassCDS, 'Average Age' = mean(ageOFocc), 'Median Age' = median(ageOFocc), 'Average Severity' = mean(injSeverity, na.rm = TRUE), 'Median Severity' = median(injSeverity, na.rm = TRUE))
I have generated the same mean and median using the pipe operator (%>%).
# Using the pipe operator
nassCDS %>% summarize('Average Age' = mean(ageOFocc), 'Median Age' = median(ageOFocc), 'Average Severity' = mean(injSeverity, na.rm = TRUE), 'Median Severity' = median(injSeverity, na.rm = TRUE))
Below are the mean and median grouped by impact speed (dvcat). The data show that the average injury severity increases with impact speed.
# Mean and median grouped by impact speed (dvcat), using the pipe operator
nassCDS %>% group_by(dvcat) %>% summarize('Average Age' = mean(ageOFocc), 'Median Age' = median(ageOFocc), 'Average Severity' = mean(injSeverity, na.rm = TRUE), 'Median Severity' = median(injSeverity, na.rm = TRUE) )
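The grouped summary above can be cross-checked without dplyr. The following is a minimal sketch using base R's aggregate() on a small made-up data frame (`toy` and its values are hypothetical stand-ins, not rows from nassCDS).

```r
# Hypothetical toy data standing in for nassCDS (values are made up).
toy <- data.frame(
  dvcat       = c("10-24", "10-24", "25-39", "25-39", "55+"),
  injSeverity = c(0, 2, 3, NA, 4)
)

# Mean and median injury severity per impact-speed group.
# With the formula interface, rows containing NA are dropped by default
# (na.action = na.omit), which plays the role of na.rm = TRUE.
group_means   <- aggregate(injSeverity ~ dvcat, data = toy, FUN = mean)
group_medians <- aggregate(injSeverity ~ dvcat, data = toy, FUN = median)

print(group_means)
print(group_medians)
```

On the real data, the same two calls with data = nassCDS should give one row per dvcat level.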
Question 2: Create a new data frame with a subset of the columns AND rows. There are several ways to do this so feel free to try a couple if you want. Make sure to rename the new data set so it simply just doesn’t write it over.
The new data frame below is a subset of the data containing all occupants at or above the mean occupant age (ageOFocc). I will be selecting impact speed (dvcat), age of occupant (ageOFocc), and injury severity (injSeverity).
First, I will demonstrate the subset method.
# Create a new data frame using the subset function: the first argument filters the rows, and select lists the columns to keep
nassCDS_Older_Adults_DF <- nassCDS %>% subset(ageOFocc >= mean(ageOFocc), select = c(dvcat, ageOFocc, injSeverity))
head(nassCDS_Older_Adults_DF, n = 2500) %>% arrange(ageOFocc)
Second, I will select the same subset as above, but this time as a data.table instead of a data.frame, using the select and filter functions from dplyr.
library(dplyr)
library(magrittr)
library(data.table)
nassCDS_Older_Adults_DT <- nassCDS %>% data.table %>% select(dvcat, ageOFocc, injSeverity) %>% filter(ageOFocc >= mean(ageOFocc))
head(nassCDS_Older_Adults_DT, n = 2500) %>% arrange(ageOFocc)
Third, I will use the $ operator to extract the columns from the data frame.
library(dplyr)
# Select the desired columns (all rows), naming them explicitly
nassCDS_Older_Adults_DF2 <- data.frame(dvcat = nassCDS$dvcat, ageOFocc = nassCDS$ageOFocc, injSeverity = nassCDS$injSeverity)
# Keep only rows at or above the mean age of occupant
nassCDS_Older_Adults_DF2 <- nassCDS_Older_Adults_DF2 %>% filter(ageOFocc >= mean(ageOFocc))
# Display the first 2500 rows
head(nassCDS_Older_Adults_DF2, n = 2500)
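Since the approaches above are meant to produce the same rows and columns, it can be worth confirming that on a small example. This is a sketch in base R only, comparing subset() with bracket indexing on a made-up data frame (`toy`, `keep_cols`, and the values are hypothetical).

```r
# Hypothetical toy data (values are made up, not from nassCDS).
toy <- data.frame(
  dvcat       = c("10-24", "25-39", "55+", "10-24"),
  ageOFocc    = c(20, 40, 60, 80),
  injSeverity = c(1, 2, 3, NA),
  weight      = c(10, 20, 30, 40)
)
keep_cols <- c("dvcat", "ageOFocc", "injSeverity")

# Approach 1: subset() filters rows and selects columns in one call.
by_subset <- subset(toy, ageOFocc >= mean(ageOFocc), select = keep_cols)

# Approach 2: bracket indexing with a logical row filter and column names.
by_bracket <- toy[toy$ageOFocc >= mean(toy$ageOFocc), keep_cols]

# Both should hold the same two rows (ages 60 and 80) and three columns.
print(all.equal(by_subset, by_bracket))
```

The same all.equal() check could be applied to the data frames built above, after converting the data.table version with as.data.frame().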
Question 3: Create new column names for each column in the new data frame created in step 2.
# Create a vector of the new column names (same order as the existing columns)
colNames <- c('Impact Speed', 'Age', 'Injury Severity')
# Rename columns
names(nassCDS_Older_Adults_DF) <- colNames
# Display 5 rows
head(nassCDS_Older_Adults_DF, n = 5)
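Assigning a whole vector to names() depends on the columns being in the expected order. As a hedged alternative, the sketch below maps old names to new names explicitly, so the result does not depend on column position (`toy` and `new_names` are hypothetical names introduced for illustration).

```r
# Hypothetical one-row data frame with the three selected columns,
# deliberately in a different order than above.
toy <- data.frame(injSeverity = 2, dvcat = "10-24", ageOFocc = 45)

# Named vector: names are the old column names, values are the new ones.
new_names <- c(
  dvcat       = "Impact Speed",
  ageOFocc    = "Age",
  injSeverity = "Injury Severity"
)

# Look up each existing column name in the mapping.
names(toy) <- unname(new_names[names(toy)])
print(names(toy))
```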
Use the summary function to gain an overview of the new data frame.
summary(nassCDS_Older_Adults_DF)
 Impact Speed   Age             Injury Severity
 1-9km/h: 276   Min.   :38.00   Min.   :0.000
 10-24  :5540   1st Qu.:43.00   1st Qu.:1.000
 25-39  :3382   Median :51.00   Median :2.000
 40-54  :1155   Mean   :54.88   Mean   :1.817
 55+    : 487   3rd Qu.:65.00   3rd Qu.:3.000
                Max.   :97.00   Max.   :6.000
                                NA's   :46
Question 4: Use the summary function to create an overview of your new data frame created in step 2. Then print the mean and median for the same two attributes. Please compare (i.e. tell me how the values changed and why).
Below are the mean and median of occupant age and injury severity for the new data frame. As expected, the mean age (37.21 to 54.88) and the median age (33 to 51) both increased, since all occupants below the mean age were removed. The mean severity increased for older adults (1.716 to 1.817), but the median stayed at 2. The higher mean severity could imply that older occupants suffer more severe injuries in accidents, possibly because of preexisting illness. The unchanged median reflects that injury severity takes only a few integer values (0 to 6) and that many observations share the value 2, so the middle of the sorted data still falls on 2 in both data sets.
# Using the pipe operator
nassCDS_Older_Adults_DF %>% summarize('Average Age' = mean(Age), 'Median Age' = median(Age), 'Average Severity' = mean(`Injury Severity`, na.rm = TRUE), 'Median Severity' = median(`Injury Severity`, na.rm = TRUE))
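To see how a median can stay fixed while the mean moves, here is a small sketch with made-up severity codes (`all_ages` and `older` are hypothetical vectors, not values taken from nassCDS).

```r
# Hypothetical severity codes for a full group and for an "older" subgroup.
all_ages <- c(0, 0, 1, 1, 2, 2, 2, 3, 3, 4)
older    <- c(1, 2, 2, 2, 3, 3, 4)

# The mean shifts upward once the low codes are removed...
print(c(mean_all = mean(all_ages), mean_older = mean(older)))

# ...but the middle of the sorted values is 2 in both cases,
# because several observations share that code.
print(c(median_all = median(all_ages), median_older = median(older)))
```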
Question 5: For at least 3 different/distinct values in a column please rename so that every value in that column is renamed. For example, change the letter “e” to “excellent”, the letter “a” to “average” and the word “bad” to “terrible”.
Here I am replacing the injury severity codes with labels: 0 = none, 1 = possible injury, 2 = no incapacity, 3 = incapacity, 4 = killed, 5 = unknown, 6 = prior death.
library(dplyr)
nassCDS_Older_Adults_DF$`Injury Severity`[nassCDS_Older_Adults_DF$`Injury Severity` == 0] <- "none"
nassCDS_Older_Adults_DF$`Injury Severity`[nassCDS_Older_Adults_DF$`Injury Severity` == 1] <- "possible injury"
nassCDS_Older_Adults_DF$`Injury Severity`[nassCDS_Older_Adults_DF$`Injury Severity` == 2] <- "no incapacity"
nassCDS_Older_Adults_DF$`Injury Severity`[nassCDS_Older_Adults_DF$`Injury Severity` == 3] <- "incapacity"
nassCDS_Older_Adults_DF$`Injury Severity`[nassCDS_Older_Adults_DF$`Injury Severity` == 4] <- "killed"
nassCDS_Older_Adults_DF$`Injury Severity`[nassCDS_Older_Adults_DF$`Injury Severity` == 5] <- "unknown"
nassCDS_Older_Adults_DF$`Injury Severity`[nassCDS_Older_Adults_DF$`Injury Severity` == 6] <- "prior death"
head(nassCDS_Older_Adults_DF, n = 2500)
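The seven assignments above could also be written as a single lookup. The sketch below shows the idea on a short made-up vector of codes (`codes` and `severity_labels` are hypothetical names). It assumes the codes are still numeric, so it would replace the assignments above rather than follow them.

```r
# Labels in code order: position 1 corresponds to code 0, position 7 to code 6.
severity_labels <- c(
  "none", "possible injury", "no incapacity", "incapacity",
  "killed", "unknown", "prior death"
)

# Hypothetical numeric severity codes, including a missing value.
codes <- c(0, 3, NA, 6, 4)

# Codes start at 0 and R indexes from 1, hence the + 1.
# Missing codes stay NA in the result.
labelled <- severity_labels[codes + 1]
print(labelled)
```

Applied to the data frame, the right-hand side would index severity_labels with the numeric `Injury Severity` column plus 1.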
Question 7: BONUS – place the original .csv in a github file and have R read from the link. This should be your own github – not the file source. This will be a very useful skill as you progress in your data science education and career.
# Raw-file URL of the CSV in my GitHub repository
gitURL <- 'https://raw.githubusercontent.com/hawa1983/R_Bridge/main/nassCDS.csv'
accident.data <- read.csv(file = gitURL, header = TRUE, sep = ",")
head(accident.data, n = 10)