The following analysis takes data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database and tries to answer which types of events cause the greatest harm to population health and which cause the greatest economic loss.
In the analysis, a raw CSV file with storm event data was loaded into R. Since the aim was to answer questions about the health and economic losses caused by events, only the columns relevant to them were retained. The event type of each observation was identified and stored in a new column. Economic losses (property and crop) were calculated from the corresponding value and exponent columns. Health and economic losses were then summed per event type and stored in new variables, and the event-wise losses were plotted. According to the data, tornadoes caused the highest health loss and floods caused the highest economic loss in the US.
The data was available as repdata-data-StormData.csv.bz2 and has 902297 observations and 37 columns.
The code begins by reading the data. The function read.csv was used to load the data into the variable data. Although repdata-data-StormData.csv.bz2 is compressed, read.csv reads it directly without manual decompression. The data was then subset so that only the columns for event type, health losses (fatalities and injuries), and economic losses (property/crop damages and their exponents) were retained. Since loading the data took a considerable amount of time, cache=TRUE was used.
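data <- read.csv("repdata-data-StormData.csv.bz2")
# Keep column 8 (EVTYPE) and columns 23 to 28 (FATALITIES, INJURIES,
# PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
data <- data[,c(8,23:28)]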
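As a small illustration of why no manual decompression is needed, the sketch below writes a toy CSV into a bzip2-compressed temporary file and reads it back with read.csv. The toy columns and values are made up for the example and are not taken from the storm database; selecting columns by name is shown as one possible alternative to positional indices.

```r
# Toy data standing in for the real storm file (illustrative values only)
toy <- data.frame(EVTYPE = c("TORNADO", "FLASH FLOOD"),
                  FATALITIES = c(1, 0))

# Write the toy data to a bzip2-compressed CSV in a temporary location
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv opens the path in read mode, where compression is detected
# and the content is decompressed on the fly
toyRead <- read.csv(path, stringsAsFactors = FALSE)

# Selecting by column name is less fragile than selecting by position
toyRead <- toyRead[, c("EVTYPE", "FATALITIES")]
print(toyRead)
```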
The following functions were written to process the data. They helped keep the code simple and readable.
Since the data contained many kinds of events, stored under differing names in the EVTYPE column, a function was written to identify which category of event an observation belonged to. The function takes a string as input and returns a string with the event category. For the analysis, the following categories of events were identified: THUNDERSTORM, FLOOD, RAIN, TORNADO, TIDE, AVALANCHE, BLIZZARD, DROUGHT, WIND, HAIL, SNOW, STORM, LIGHTENING, WINTER and OTHERS.
The function uses grepl to match EVTYPE against each category pattern, in the order in which the patterns appear in the function. Since EVTYPE contains both upper-case and lower-case letters, the matching is case-insensitive. A few observations may match more than one pattern; for those, the first matching category was stored.
eventType <- function(a){
  # Categories are tested in this order; the first match wins.
  # "LIGHTENING" is kept as written; the result tables below contain
  # no such category, so this pattern matched no observation.
  categories <- c("THUNDERSTORM", "FLOOD", "RAIN", "TORNADO", "TIDE",
                  "AVALANCHE", "BLIZZARD", "DROUGHT", "WIND", "HAIL",
                  "SNOW", "STORM", "LIGHTENING", "WINTER")
  for(category in categories){
    # ignore.case = TRUE matches upper-case and lower-case spellings
    if(grepl(category, a, ignore.case = TRUE)){
      return(category)
    }
  }
  # No pattern matched
  "OTHERS"
}
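The first-match rule described above can also be applied to a whole vector of event names at once. The following is only a sketch under that assumption: classifyEvents is a hypothetical name, not part of the original analysis, and the category spellings are copied from eventType.

```r
# Hypothetical vectorised variant of eventType(): classifies a whole
# character vector at once instead of one string per call.
classifyEvents <- function(evtype){
  # Same categories, in the same priority order, as in eventType()
  categories <- c("THUNDERSTORM", "FLOOD", "RAIN", "TORNADO", "TIDE",
                  "AVALANCHE", "BLIZZARD", "DROUGHT", "WIND", "HAIL",
                  "SNOW", "STORM", "LIGHTENING", "WINTER")
  result <- rep("OTHERS", length(evtype))
  # Walk the categories in reverse so that earlier categories overwrite
  # later ones, which reproduces the first-match-wins rule.
  for(category in rev(categories)){
    result[grepl(category, evtype, ignore.case = TRUE)] <- category
  }
  result
}

classified <- classifyEvents(c("THUNDERSTORM WINDS", "Flash Flood",
                               "TSTM WIND", "HEAT"))
print(classified)
# "THUNDERSTORM" "FLOOD" "WIND" "OTHERS"
```

"THUNDERSTORM WINDS" matches the THUNDERSTORM, WIND and STORM patterns, and is assigned THUNDERSTORM because that category comes first in the order.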
For economic loss, there were two kinds of columns: one with a value and one with an exponent. The exponent column had values such as “H” and “k”, which referred to hundred and thousand respectively. The function takes a string as input and returns a number. The following categories of exponents were identified: “H”/“h” (exponent 2), “K”/“k” (3), “M”/“m” (6), “B”/“b” (9), the digits “1” to “9” (the digit itself), and anything else (0).
The function returns 10 raised to the power of the identified exponent.
expAsNumeric <- function(a){
  # Map the exponent code to a power of ten
  if(a %in% c("K", "k")){
    b <- 3            # thousand
  } else if(a %in% c("M", "m")){
    b <- 6            # million
  } else if(a %in% c("B", "b")){
    b <- 9            # billion
  } else if(a %in% c("H", "h")){
    b <- 2            # hundred
  } else if(a %in% as.character(1:9)){
    # Assumes a is a character string: as.numeric() on a factor returns
    # the level code rather than the digit.
    b <- as.numeric(a)
  } else {
    b <- 0            # blank or unrecognised code: multiplier of 1
  }
  10^b
}
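The exponent mapping can be expressed as a lookup table, which also makes the arithmetic easy to check by hand. This is a minimal sketch, not the author's implementation: expMultiplier is a hypothetical name, and the conversion with as.character is an added assumption so that factor input is handled as text.

```r
# Hypothetical vectorised counterpart of expAsNumeric(): a named lookup
# table maps each exponent code to its power of ten.
expMultiplier <- function(code){
  # Treat factors as text and ignore case ("k" and "K" are equivalent)
  code <- toupper(as.character(code))
  powers <- c(H = 2, K = 3, M = 6, B = 9, setNames(1:9, 1:9))
  p <- powers[code]      # NA for blank or unrecognised codes
  p[is.na(p)] <- 0       # those get exponent 0, i.e. a multiplier of 1
  unname(10^p)
}

multipliers <- expMultiplier(c("K", "m", "", "5", "?"))
print(multipliers)
# multipliers: 1000, 1000000, 1, 100000, 1

# A damage value of 25 with exponent "K" is 25 * 10^3
damage <- 25 * expMultiplier("K")
print(damage)
# 25000
```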
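A loop was run over all the observations, using the functions defined above. In the loop, the following tasks are performed: the event type is identified and stored in EVENTTYPE, and the property and crop damages are calculated from their value and exponent columns and stored in PROPDAMAGE and CROPDAMAGE.
Since processing the data took a considerable amount of time, cache=TRUE was used.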
for(i in seq_len(nrow(data)))
{
  data$EVENTTYPE[i] <- eventType(data$EVTYPE[i])
  data$PROPDAMAGE[i] <- expAsNumeric(data$PROPDMGEXP[i])*data$PROPDMG[i]
  data$CROPDAMAGE[i] <- expAsNumeric(data$CROPDMGEXP[i])*data$CROPDMG[i]
}
Now that the event type of each observation has been identified and the economic losses have been calculated, the data was summarised. The data frame health stores the number of fatalities and injuries; the function aggregate was used to calculate the sum for each event. Similarly, eco stores the property and crop damages.
health <- merge(aggregate(FATALITIES~EVENTTYPE,data,FUN=sum),aggregate(INJURIES~EVENTTYPE,data,FUN=sum),by="EVENTTYPE")
eco <- merge(aggregate(PROPDAMAGE~EVENTTYPE,data,FUN=sum),aggregate(CROPDAMAGE~EVENTTYPE,data,FUN=sum),by="EVENTTYPE")
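The per-event summation can be illustrated on a toy data frame. The sketch below uses made-up values (not taken from the storm database) and shows one possible alternative to merging two aggregate results: placing cbind on the left-hand side of the formula sums both columns in a single call.

```r
# Toy data (illustrative values only)
toy <- data.frame(
  EVENTTYPE  = c("FLOOD", "TORNADO", "FLOOD", "TORNADO"),
  FATALITIES = c(1, 3, 0, 2),
  INJURIES   = c(4, 10, 1, 5)
)

# cbind() on the left-hand side sums both columns per event in one call
healthToy <- aggregate(cbind(FATALITIES, INJURIES) ~ EVENTTYPE, toy, FUN = sum)
print(healthToy)
# FLOOD:   1 fatality,   5 injuries
# TORNADO: 5 fatalities, 15 injuries
```

One difference to keep in mind: with the formula interface, a row with a missing value in either column is dropped from both sums, whereas two separate aggregate calls drop it only from the affected column.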
After the sums of health losses and economic losses were calculated, we were ready to look at the summary of the data and answer the questions we had at the beginning. For this, graphs of health and economic losses were plotted using ggplot2. The libraries ggplot2 (for plots) and xtable (for printing tables) were loaded.
library(ggplot2)
library(xtable)
The following graph helps determine the event that caused the maximum damage to population health. For each event, the numbers of fatalities and injuries were added and plotted.
g <- ggplot(health,aes(EVENTTYPE))
g + geom_point(aes(y = INJURIES+FATALITIES, col = "Health Loss"),size=5,pch=17)+labs(title = "Harm to population health",x="Events",y="Number of fatalities and injuries")+theme(axis.text.x = element_text(angle = 90))
The above graph shows the number of fatalities and injuries for different events.
The following table lists the health losses caused by the various events. The code also determines which event caused the maximum health loss.
maxHealthLoss <- max(health$FATALITIES+health$INJURIES)
maxHealthEvent <- health[which(health$FATALITIES+health$INJURIES==maxHealthLoss),1]
health <- health[with(health,order(-FATALITIES-INJURIES)),]
rownames(health) <- 1:14
healthTable <- xtable(health)
maxHealthLoss <- round(maxHealthLoss/(10^3),2)
print(healthTable,type="html")
| | EVENTTYPE | FATALITIES | INJURIES |
|---|---|---|---|
| 1 | TORNADO | 5661.00 | 91407.00 |
| 2 | OTHERS | 5433.00 | 20454.00 |
| 3 | WIND | 1216.00 | 9059.00 |
| 4 | FLOOD | 1525.00 | 8604.00 |
| 5 | STORM | 410.00 | 4191.00 |
| 6 | THUNDERSTORM | 210.00 | 2479.00 |
| 7 | HAIL | 15.00 | 1371.00 |
| 8 | SNOW | 159.00 | 1120.00 |
| 9 | BLIZZARD | 101.00 | 805.00 |
| 10 | WINTER | 61.00 | 538.00 |
| 11 | RAIN | 113.00 | 305.00 |
| 12 | AVALANCHE | 224.00 | 171.00 |
| 13 | DROUGHT | 6.00 | 19.00 |
| 14 | TIDE | 11.00 | 5.00 |
We can see that TORNADO caused the maximum damage to population health, i.e. 97.07 thousand fatalities and injuries.
The following graph helps determine the event that caused the maximum damage to the economy. For each event, total property damages and crop damages were added and plotted.
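The lookup of the most harmful event and the ordering of the table can be sketched on toy data as follows. The values are illustrative only; which.max and seq_len(nrow(...)) are shown as possible alternatives that avoid comparing against the maximum explicitly and hard-coding the number of rows.

```r
# Toy summary table (illustrative values only)
healthToy <- data.frame(
  EVENTTYPE  = c("FLOOD", "TORNADO", "WIND"),
  FATALITIES = c(10, 50, 20),
  INJURIES   = c(100, 900, 80)
)

# Combined loss per event
total <- healthToy$FATALITIES + healthToy$INJURIES

# which.max() returns the position of the first maximum
topEvent <- as.character(healthToy$EVENTTYPE[which.max(total)])
topLoss  <- total[which.max(total)]

# Sort from largest to smallest combined loss and renumber the rows
# without assuming a fixed number of categories
healthToy <- healthToy[order(-total), ]
rownames(healthToy) <- seq_len(nrow(healthToy))
print(healthToy)
```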
g <- ggplot(eco,aes(EVENTTYPE))
g + geom_point(aes(y = PROPDAMAGE+CROPDAMAGE, col = "Economic Loss"),size=5,pch=17)+labs(title = "Economic damages",x="Events",y="Damages (Property and Crop)")+theme(axis.text.x = element_text(angle = 90))
The above graph shows the total property and crop damage for different events.
The following table lists the economic losses caused by the various events. The code also determines which event caused the maximum economic loss.
maxEcoLoss <- max(eco$PROPDAMAGE+eco$CROPDAMAGE)
maxEcoEvent <- eco[which(eco$PROPDAMAGE+eco$CROPDAMAGE==maxEcoLoss),1]
eco <- eco[with(eco,order(-PROPDAMAGE-CROPDAMAGE)),]
rownames(eco) <- 1:14
ecoTable <- xtable(eco)
maxEcoLoss <- round(maxEcoLoss/(10^12),2)
print(ecoTable,type="html")
| | EVENTTYPE | PROPDAMAGE | CROPDAMAGE |
|---|---|---|---|
| 1 | FLOOD | 68415029215834.91 | 12380079100.00 |
| 2 | THUNDERSTORM | 20869552594005.10 | 653005388.00 |
| 3 | TORNADO | 1080593097926.50 | 417461520.00 |
| 4 | HAIL | 315974043512.70 | 3046837473.00 |
| 5 | OTHERS | 267547951169.50 | 10408668370.00 |
| 6 | STORM | 61677950661.00 | 5747558500.00 |
| 7 | SNOW | 18011019750.00 | 134663100.00 |
| 8 | DROUGHT | 1046306000.00 | 13972621780.00 |
| 9 | WIND | 10904848618.00 | 1409224150.00 |
| 10 | TIDE | 4650933150.00 | 850000.00 |
| 11 | RAIN | 3254758190.00 | 806162800.00 |
| 12 | BLIZZARD | 659913950.00 | 112060000.00 |
| 13 | WINTER | 27298000.00 | 15000000.00 |
| 14 | AVALANCHE | 8721800.00 | 0.00 |
We can see that FLOOD caused the maximum damage, i.e. 68.43 trillion USD in the form of property and crop damages.