All the documents and code used in this project can be found in [this](https://github.com/gredoy/RDATA_P-2) GitHub repository. If the link does not work, please copy and paste the following URL into your browser: https://github.com/gredoy/RDATA_P-2
In this analysis project, we explored the aftermath of severe weather events in the US over roughly six decades (1950-2011). We focus on the events that caused the most harm to human health (in terms of injuries and fatalities) and the greatest economic impact (in terms of property and crop damage). After acquiring the data of about one million records, we pre-processed it by considering 28 well-defined major weather events and classifying the data accordingly. Any event that our search and match algorithm failed to identify was labelled as “other”. Next, we counted the occurrences of each event, the total injuries and fatalities it caused, and the resulting total property and crop damage. We summed the injuries and fatalities for a quantitative estimate of the effect on human lives, and determined the total economic cost of each event by adding up the property and crop damage costs. Our analysis revealed that tornadoes were the most devastating event, accounting for about 13,049 injuries and fatalities combined. As for the economy, floods had the greatest effect, amounting to about 4,768 trillion dollars. In addition, we determined that the most frequent event in the United States was storms, with over 96,000 counts.
Please note that, unless stated otherwise, none of the R code in the Data Processing section of this report is evaluated when knitting the Rmd file, in order to avoid the lengthy processing it requires.
Starting at this point, the following R code chunks are evaluated by knitr when knitting the document.
Read in the cleaned up data from the cleanData.csv file into R
newData <- read.csv("cleanData.csv")
The columns and data in the newData dataframe
names(newData)
## [1] "X" "weather" "incidentTotal" "fatalitiesTotal"
## [5] "injuriesTotal" "propDmgTotal" "cropDmgTotal" "damageTotal"
head(newData)
## X weather incidentTotal fatalitiesTotal injuriesTotal propDmgTotal
## 1 1 rain 5269 16 47 347421930
## 2 2 storm 96513 238 3707 11372508690
## 3 3 sun 19 33 129 144062000
## 4 4 cloud 2382 0 0 65100
## 5 5 hot 0 0 0 0
## 6 6 cold 1333 155 23 121883050
## cropDmgTotal damageTotal
## 1 126948800 2.187006e+12
## 2 6170401300 1.300294e+15
## 3 20000 1.727477e+09
## 4 0 6.023260e+07
## 5 0 0.000000e+00
## 6 9855050 1.643400e+11
Create a new data frame newData1 with an additional column that contains the total human injuries and fatalities due to each event
injuriesFatalitiesTotal <- newData$injuriesTotal + newData$fatalitiesTotal
newData1 <- cbind(newData[,2:3], newData$injuriesTotal,
newData$fatalitiesTotal, injuriesFatalitiesTotal,
newData[,6:8])
Re-name the columns in newData1
names(newData1) <- c("weather", "incidentCount", "injuries",
"fatalities", "injuryFatalityTotal",
"propertyDamage", "cropDamage", "damageTotal")
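The two steps above (binding the columns by position, then renaming them) can also be done in a single step that refers to columns by name. The following is only a sketch on a made-up stand-in for newData; the names `toy` and `toy1` and all the values are hypothetical.

```r
# Toy stand-in for newData (made-up values, same column names as above)
toy <- data.frame(X = 1:2,
                  weather = c("rain", "storm"),
                  incidentTotal = c(10, 20),
                  fatalitiesTotal = c(1, 2),
                  injuriesTotal = c(3, 4),
                  propDmgTotal = c(100, 200),
                  cropDmgTotal = c(10, 20),
                  damageTotal = c(110, 220))

# Select by name and rename in one step, so the result
# does not depend on the position of any column
toy1 <- data.frame(weather = toy$weather,
                   incidentCount = toy$incidentTotal,
                   injuries = toy$injuriesTotal,
                   fatalities = toy$fatalitiesTotal,
                   injuryFatalityTotal = toy$injuriesTotal + toy$fatalitiesTotal,
                   propertyDamage = toy$propDmgTotal,
                   cropDamage = toy$cropDmgTotal,
                   damageTotal = toy$damageTotal)

names(toy1)
```

Referring to columns by name keeps the code working even if the column order of the input file changes.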
The columns and data in the newData1 dataframe
names(newData1)
## [1] "weather" "incidentCount" "injuries"
## [4] "fatalities" "injuryFatalityTotal" "propertyDamage"
## [7] "cropDamage" "damageTotal"
head(newData1)
## weather incidentCount injuries fatalities injuryFatalityTotal
## 1 rain 5269 47 16 63
## 2 storm 96513 3707 238 3945
## 3 sun 19 129 33 162
## 4 cloud 2382 0 0 0
## 5 hot 0 0 0 0
## 6 cold 1333 23 155 178
## propertyDamage cropDamage damageTotal
## 1 347421930 126948800 2.187006e+12
## 2 11372508690 6170401300 1.300294e+15
## 3 144062000 20000 1.727477e+09
## 4 65100 0 6.023260e+07
## 5 0 0 0.000000e+00
## 6 121883050 9855050 1.643400e+11
To determine the events that are most harmful with respect to human health, we consider the injury and fatality data.
Pie chart of the top 5 events corresponding to the effect on injuries and fatalities combined
Initialize the margin and plotting parameters
par(mar=c(2,2,3,1))
par(mfrow=c(1,1))
Install the 3D pie chart package “plotrix” (if necessary) and then load it
#install.packages("plotrix")
library(plotrix)
## Warning: package 'plotrix' was built under R version 3.1.2
Plot the 3D pie chart of the top 5 events that caused the most harm to humans and their health (in terms of injuries and fatalities) in the past
X <- newData1[order(newData1$injuryFatalityTotal, decreasing=TRUE),][1:5,]
labels <- paste(X$weather, "\n", X$injuryFatalityTotal)
pie3D(X$injuryFatalityTotal, labels = labels, main="Consequences to Human Health and Lives\n due to the Top 5 Weather Events", labelcex=1.0, labelrad=1.75)
Based on the numbers above, the most devastating/harmful events for human health are tornadoes, followed by floods, others, storms, and lastly, hurricanes.
Please note that “other” includes a variety of weather events that were not categorized into specific events due to the data analyst’s lack of skills in doing so.
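The "top 5" selection used for the chart is an ordering followed by taking the first rows. A minimal sketch of that step as a reusable function is shown here; `topN` is a hypothetical helper that is not part of the original analysis, and the toy values are made up.

```r
# Hypothetical helper: return the n rows with the largest values in `column`
topN <- function(df, column, n = 5) {
  ranked <- df[order(df[[column]], decreasing = TRUE), ]
  head(ranked, n)
}

# Made-up data for illustration only
toy <- data.frame(weather = c("rain", "storm", "flood", "hail"),
                  injuryFatalityTotal = c(63, 3945, 500, 20))

top2 <- topN(toy, "injuryFatalityTotal", n = 2)
top2$weather
```

Unlike indexing with `[1:5,]`, `head()` does not pad the result with NA rows when the data frame has fewer than n rows.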
To determine the events with the greatest economic consequences, we consider the property and crop damage data.
Pie chart of the top 5 events with the greatest economic consequences
Plot the 3D pie chart of the top 5 events that had the most effect on the economy in the past
Y <- newData1[order(newData1$damageTotal, decreasing=TRUE),][1:5,]
damageInTrillions <- round(Y$damageTotal/1e+12)
labels <- paste(Y$weather, "\n", damageInTrillions)
pie3D(damageInTrillions, labels = labels, main="Consequences of the Top 5\n Weather Events on the Economy (in Trillions of Dollars)", labelcex=1.0, labelrad=2.0)
Based on the numbers above, the most costly/economically consequential events are floods, followed by storms, hail, others, and lastly, tornadoes.
Please note that “other” includes a variety of weather events that were not categorized into specific events due to the data analyst’s lack of skills in doing so.
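The conversion to trillions of dollars used for the chart labels is a division by 1e+12 followed by rounding. The sketch below, using three damageTotal values from the head(newData1) output shown earlier, illustrates how the number of digits passed to `round()` affects small values; the variable names are hypothetical.

```r
# damageTotal values for storm, rain and cold, taken from the head() output above
damage <- c(1.300294e+15, 2.187006e+12, 1.643400e+11)

# Rounding to whole trillions, as in the chart code
wholeTrillions <- round(damage / 1e+12)

# Keeping one decimal place preserves values below half a trillion
oneDecimal <- round(damage / 1e+12, 1)

wholeTrillions
oneDecimal
```

With whole-number rounding, any total below 0.5 trillion is shown as 0, which matters only for events outside the top of the ranking.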
Pie chart of the top 5 events with the highest frequency
Plot the 3D pie chart of the top 5 events that occurred the most in the past
Z <- newData1[order(newData1$incidentCount, decreasing=TRUE),][1:5,]
labels <- paste(Z$weather, "\n", Z$incidentCount)
pie3D(Z$incidentCount, labels = labels, main="5 Highest Occurring Weather Events in the\n United States between 1950 and 2011", labelcex=1.25, labelrad=1.5)
Based on the numbers above, the events that happened the most in the past are storms, followed by hail, floods, others, and lastly, tornadoes.
Please note that “other” includes a variety of weather events that were not categorized into specific events due to the data analyst’s lack of skills in doing so.
Read in the database file into R
fileName <- "repdata-data-StormData.csv.bz2"
data <- read.csv(fileName)
Create a new data frame with the columns EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP from data
data1 <- data.frame(data$EVTYPE, data$FATALITIES, data$INJURIES,
data$PROPDMG, data$PROPDMGEXP, data$CROPDMG, data$CROPDMGEXP)
Re-name the columns in the data1 data frame
names(data1) <- tolower(c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG","PROPDMGEXP", "CROPDMG", "CROPDMGEXP"))
Turn the elements in the evtype, propdmgexp and cropdmgexp columns from factor into character
data1[,1] <- as.character(data1$evtype)
data1[,5] <- as.character(data1$propdmgexp)
data1[,7] <- as.character(data1$cropdmgexp)
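The column selection and factor-to-character conversion above can also be written by column name rather than by position. This is a sketch on a made-up stand-in for the raw data; `raw`, `keep` and `subsetData` are hypothetical names, and `stringsAsFactors = TRUE` is set only to reproduce the factor columns that older versions of R create by default.

```r
# Toy stand-in for the raw storm data (made-up rows, real column names)
raw <- data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                  FATALITIES = c(1, 0),
                  INJURIES = c(5, 2),
                  PROPDMG = c(2.5, 10),
                  PROPDMGEXP = c("K", "M"),
                  CROPDMG = c(0, 1),
                  CROPDMGEXP = c("K", "K"),
                  STATE = c("AL", "TX"),
                  stringsAsFactors = TRUE)

# Keep only the columns needed for the analysis, selected by name
keep <- c("EVTYPE", "FATALITIES", "INJURIES",
          "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
subsetData <- raw[, keep]
names(subsetData) <- tolower(names(subsetData))

# Convert every factor column to character in one pass
isFactor <- sapply(subsetData, is.factor)
subsetData[isFactor] <- lapply(subsetData[isFactor], as.character)

str(subsetData)
```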
Please note that we ignore the multipliers for property damage (PROPDMGEXP) and crop damage (CROPDMGEXP) unless they are one of h/H (hundred), k/K (thousand), m/M (million) or b/B (billion).
There are two reasons for doing so:
1. We could not determine the meaning of the other values (such as “0”, “1”, “2” etc.) in the PROPDMGEXP and CROPDMGEXP columns.
2. Including multipliers other than h, k, m or b in our analysis drastically increases the processing time, which we could not afford given our computer’s limited processing power and the project deadline.
Therefore, please be advised that the analysis and interpretation of results that follow may not be precise or entirely correct.
28 weather types are considered (please note that the following list of weather conditions is not exhaustive). Any unrecognized weather condition is counted under “other”.
weather <- c("rain", "storm", "sun", "cloud", "hot", "cold",
"dry", "wet", "windy", "hurricane", "typhoon",
"sand storms", "snow storms", "tornado", "humid",
"fog", "snow", "thundersnow", "hail", "sleet",
"drought", "wildfire", "blizzard", "avalanche",
"mist", "freez", "dust", "flood", "other")
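To illustrate how an event type can be assigned to one of these categories, here is a minimal sketch of a first-match keyword search. The function `classify` and the shortened `keywords` vector are hypothetical and are used for illustration only.

```r
# Shortened keyword list for illustration; the report uses the full vector above
keywords <- c("rain", "storm", "tornado", "flood")

# Hypothetical helper: label an event type with the first keyword it contains,
# or "other" when no keyword matches
classify <- function(evtype, keywords) {
  instance <- tolower(evtype)
  for (keyword in keywords) {
    # fixed = TRUE treats the keyword as plain text, not a regular expression
    if (grepl(keyword, instance, fixed = TRUE)) return(keyword)
  }
  "other"
}

classify("THUNDERSTORM WIND", keywords)
classify("VOLCANIC ASH", keywords)
```

Because the first matching keyword wins, the order of the vector matters: a keyword such as "sand storms" that comes after "storm" is never reached, since any text containing it also contains "storm".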
Create a new data frame - dataSummary - to record the aggregate results of weather incident counts, fatalities, injuries, property damage, crop damage and total damage for each of the above weather conditions.
incidentTotal <- vector(mode="numeric", length=length(weather))
fatalitiesTotal <-vector(mode="numeric", length=length(weather))
injuriesTotal <-vector(mode="numeric", length=length(weather))
propDmgTotal <-vector(mode="numeric", length=length(weather))
cropDmgTotal <-vector(mode="numeric", length=length(weather))
damageTotal <-vector(mode="numeric", length=length(weather))
dataSummary <- data.frame(weather, incidentTotal,
fatalitiesTotal, injuriesTotal,
propDmgTotal, cropDmgTotal, damageTotal)
Function to determine the multiplier’s numerical value from the propdmgexp and cropdmgexp columns in the data1 dataframe
multiplier <- function(letter) {
#Return the multiplier letter's numerical value
#If the multiplier is not one of h/H, k/K, m/M, and b/B,
#the function returns 1
multiplier <- 1
if (grepl('h', tolower(letter))) multiplier <- 100
else if (grepl('k', tolower(letter))) multiplier <- 1000
else if (grepl('m', tolower(letter))) multiplier <- 1000000
else if (grepl('b', tolower(letter))) multiplier <- 1000000000
multiplier
}
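The same letter-to-number mapping can be expressed as a lookup table that handles a whole column at once. This is a sketch of one alternative, not the function used in this report; `expValues` and `multiplierVec` are hypothetical names. It matches each code exactly, whereas the function above uses grepl and therefore also accepts any string that merely contains one of the letters.

```r
# Named vector mapping each recognised exponent letter to its numerical value
expValues <- c(h = 1e2, k = 1e3, m = 1e6, b = 1e9)

# Hypothetical vectorised counterpart of multiplier():
# unrecognised or empty codes fall back to 1
multiplierVec <- function(codes) {
  values <- unname(expValues[tolower(codes)])
  values[is.na(values)] <- 1
  values
}

multiplierVec(c("K", "m", "B", "h", "", "5"))
```

Applied to a full column, for example `multiplierVec(data1$propdmgexp)`, this avoids calling the function once per row.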
Populate the dataSummary data frame with the aggregate results for each of the categories - fatalities, injuries, property damage, crop damage and total damage - for each of the 28 weather conditions.
The major simplifying assumption in our analysis is to ignore the rows in data1 where either of the following holds:
* propdmgexp is not one of h/H, k/K, m/M, and b/B
* cropdmgexp is not one of h/H, k/K, m/M, and b/B
for (row in seq(data1$evtype)) {
instance <- tolower(data1$evtype[row])
propMultiplier <- multiplier(data1$propdmgexp[row])
cropMultiplier <- multiplier(data1$cropdmgexp[row])
#print(row)
if (propMultiplier== 1 || cropMultiplier==1) {
next
}
iteration <- 0
for (index in seq(dataSummary$weather)) {
condition <- dataSummary$weather[index]
iteration <- iteration + 1
found = grepl(condition, instance)
if (found) {
data1$evtype[row] = condition
dataSummary$incidentTotal[index] = dataSummary$incidentTotal[index] + 1
dataSummary$fatalitiesTotal[index] = dataSummary$fatalitiesTotal[index] + data1$fatalities[row]
dataSummary$injuriesTotal[index] = dataSummary$injuriesTotal[index] + data1$injuries[row]
dataSummary$propDmgTotal[index] = dataSummary$propDmgTotal[index] + (data1$propdmg[row] * propMultiplier)
dataSummary$cropDmgTotal[index] = dataSummary$cropDmgTotal[index] + (data1$cropdmg[row] * cropMultiplier)
#Note: this adds the category's running property and crop totals, not only this row's damage
dataSummary$damageTotal[index] = dataSummary$damageTotal[index] + (dataSummary$propDmgTotal[index] + dataSummary$cropDmgTotal[index])
#print(condition)
break
}
else if (!found && iteration == nrow(dataSummary)) {
data1$evtype[row] = condition
dataSummary$incidentTotal[index] = dataSummary$incidentTotal[index] + 1
dataSummary$fatalitiesTotal[index] = dataSummary$fatalitiesTotal[index] + data1$fatalities[row]
dataSummary$injuriesTotal[index] = dataSummary$injuriesTotal[index] + data1$injuries[row]
dataSummary$propDmgTotal[index] = dataSummary$propDmgTotal[index] + (data1$propdmg[row] * propMultiplier)
dataSummary$cropDmgTotal[index] = dataSummary$cropDmgTotal[index] + (data1$cropdmg[row] * cropMultiplier)
#Note: this adds the category's running property and crop totals, not only this row's damage
dataSummary$damageTotal[index] = dataSummary$damageTotal[index] + (dataSummary$propDmgTotal[index] + dataSummary$cropDmgTotal[index])
#print(condition)
}
}
}
Please be advised that the above chunk of code requires a large amount of time to execute and produce an output. Therefore, it was not evaluated when generating the Rmd document.
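Much of the running time comes from updating the summary one row at a time. As a sketch of a faster alternative, the totals can be computed with `aggregate()` once every row has a category and numeric multipliers. The toy data and the column names `propMult`, `cropMult`, `propDollars`, `cropDollars`, `damage` and `incidents` are hypothetical. Here each category's total damage is defined as the sum of the per-row property and crop damage, which is an assumption of this sketch and differs from the running-total update in the loop above.

```r
# Toy rows, assumed to be already classified and to carry numeric multipliers
rows <- data.frame(weather = c("flood", "storm", "flood", "storm"),
                   fatalities = c(1, 0, 2, 1),
                   injuries = c(3, 1, 0, 4),
                   propdmg = c(2, 5, 1, 10),
                   propMult = c(1e6, 1e3, 1e3, 1e3),
                   cropdmg = c(1, 0, 4, 2),
                   cropMult = c(1e3, 1e3, 1e3, 1e3))

# Per-row damage in dollars
rows$propDollars <- rows$propdmg * rows$propMult
rows$cropDollars <- rows$cropdmg * rows$cropMult
rows$damage <- rows$propDollars + rows$cropDollars

# Each row counts as one incident
rows$incidents <- 1

# Sum every measure within each weather category
totals <- aggregate(cbind(incidents, fatalities, injuries,
                          propDollars, cropDollars, damage) ~ weather,
                    data = rows, FUN = sum)
totals
```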
Save dataSummary, which contains the aggregate data, to a .csv file so that the processed data can be reloaded without repeating the time-consuming processing. This is especially handy when knitting the Rmd document.
write.csv(file='cleanData.csv', x=dataSummary)
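By default, write.csv also writes the row names, which is why an extra "X" column appears when cleanData.csv is read back in. A small sketch of how that column can be avoided is shown here on made-up data; `toySummary`, `tmp` and `roundTrip` are hypothetical names, and a temporary file is used so that the real cleanData.csv is not touched.

```r
# Toy summary standing in for dataSummary (made-up values)
toySummary <- data.frame(weather = c("rain", "storm"),
                         incidentTotal = c(10, 20))

# Use a temporary file so the real cleanData.csv is left untouched
tmp <- tempfile(fileext = ".csv")

# row.names = FALSE keeps the row numbers out of the file,
# so no extra "X" column appears when the file is read back
write.csv(toySummary, file = tmp, row.names = FALSE)
roundTrip <- read.csv(tmp)

names(roundTrip)
```

Note that the positional indexing used elsewhere in this report (for example `newData[,2:3]`) assumes the "X" column is present, so it would need to be adjusted if the file were written this way.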