#1 Description of my Data Set
The data set I chose covers railroad accidents in the U.S. in 2010. The data was collected via safetydata.fra.dot.gov. The categorical variables in this data set are: Accident ID #, Railroad Label, month, day, state, county, track type (yard, siding, industry, not reported), track maintenance, type of accident (derailment, collision, other), and railroad equipment. The numerical variables are equipment damage, track damage, number of people killed, number of people injured, speed, number of locomotives derailed, and number of cars derailed. This data set has a handful of possible response variables, so I’m going to see which explanatory variables correlate best with number of people killed, number of people injured, number of locomotives derailed, and number of cars derailed. My guess is that speed and equipment damage will correlate most strongly with these response variables. We can also check whether month correlates with the number of deaths, injuries, and so on. Before even digging into the data, it looks like the set has enough information to answer these questions.
#Setup
#1 Pairwise Scatterplot
#Pairwise scatterplot of all variables
library(psych)
dev.new(width = 12, height = 12) # opens a new graphics device on any platform (windows() only exists on Windows)
pairs.panels(myData[, -c(1,5,6)], pch=21)
pairs.panels(myData[, -c(2,8,11)], pch=21, main="Pairwise Scatter Plot of Numerical Variables") #to show color grouping
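The pairwise plots show the correlations visually. As a rough numerical complement, the sketch below ranks candidate explanatory variables by the absolute value of their correlation with one response. It is only an illustration: the data frame `demo` is simulated and stands in for `myData`, and the column names `Speed` and `TrkDamg` are assumptions (only `CarsDer` and `EqpDamg` appear in the code of this report).

```r
# Hypothetical stand-in for myData: simulated values, NOT the FRA records.
set.seed(1)
n <- 200
speed <- runif(n, 0, 60)
cars_derailed <- rpois(n, lambda = 1 + speed / 10)
demo <- data.frame(
  Speed   = speed,                                      # assumed column name
  EqpDamg = 5000 * cars_derailed + rnorm(n, sd = 3000),
  TrkDamg = 2000 * cars_derailed + rnorm(n, sd = 4000), # assumed column name
  CarsDer = cars_derailed
)

# Pearson correlation of each candidate explanatory variable with the response
response <- "CarsDer"
predictors <- setdiff(names(demo), response)
r <- cor(demo[, predictors], demo[[response]], use = "complete.obs")

# cor() returns a one-column matrix here; rank by absolute correlation
ranked <- sort(abs(r[, 1]), decreasing = TRUE)
print(ranked)
```

Swapping `response` for another column name repeats the ranking for a different response variable.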
#2 Least Squares Regression
Cars_derailed <- myData$CarsDer
Equipment_Damage <- myData$EqpDamg
plot(Cars_derailed, Equipment_Damage, pch = 21, col = "red",
     main = "Relationship between Cars Derailed and Equipment Damage")
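The scatterplot shows the raw relationship. One way to fit and overlay the least squares line is sketched below with base R's `lm()` and `abline()`. The two vectors are simulated here so that the block runs on its own; they are placeholders for the real `myData$CarsDer` and `myData$EqpDamg`, and the numbers they produce say nothing about the actual accident data.

```r
# Hypothetical stand-in data (simulated), used only so the sketch is runnable.
set.seed(42)
Cars_derailed <- rpois(150, lambda = 4)
Equipment_Damage <- 10000 + 8000 * Cars_derailed + rnorm(150, sd = 5000)

# Fit the least squares line: Equipment_Damage = b0 + b1 * Cars_derailed
fit <- lm(Equipment_Damage ~ Cars_derailed)
print(coef(fit))               # intercept (b0) and slope (b1)
print(summary(fit)$r.squared)  # share of variance explained by the line

# Redraw the scatterplot and overlay the fitted line
plot(Cars_derailed, Equipment_Damage, pch = 21, col = "red",
     main = "Relationship between Cars Derailed and Equipment Damage")
abline(fit, col = "blue", lwd = 2)
```

With the real data, the first two lines of the report's own code (`myData$CarsDer`, `myData$EqpDamg`) would replace the simulated vectors, and `summary(fit)` would give the full coefficient table.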