The U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The events in the database start in the year 1950 and end in November 2011. The top 10 event types by total fatalities, total injuries, and total damage are calculated separately. The total damage comprises both property and crop damage. Pareto charts are then plotted to address two questions: which types of events are most harmful with respect to population health, and which have the greatest economic consequences across the United States. The event types with the most fatalities are “TORNADO”, “EXCESSIVE HEAT” and “FLASH FLOOD”. The event types with the most injuries are “TORNADO”, “TSTM WIND” and “FLOOD”. The event types with the highest damage are “FLOOD”, “HURRICANE/TYPHOON” and “TORNADO”.
dat <- read.csv(bzfile("repdata-data-StormData.csv.bz2"), stringsAsFactors = FALSE)
The raw data, a bzip2-compressed CSV file containing the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, is located in repdata-data-StormData.csv.bz2. It is loaded into R using read.csv() with stringsAsFactors = FALSE, so that text columns such as EVTYPE are read as character vectors rather than factors.
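As a quick illustration of the loading step, the sketch below writes a tiny CSV to a temporary bzip2-compressed file and reads it back the same way. The file contents and the object names `tmp` and `toy` are made up for the example and are not part of the storm data.

```r
# Write a tiny CSV to a temporary bzip2-compressed file (illustrative data only).
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c("EVTYPE,FATALITIES", "A,1", "B,2"), con)
close(con)

# Read it back through a bzfile() connection, as done for the storm data.
toy <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)

# EVTYPE should be a character column, not a factor.
str(toy)

# Remove the temporary file.
unlink(tmp)
```

Checking the column types with str() right after loading is a cheap way to confirm that stringsAsFactors = FALSE had the intended effect.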
total_fatalities <- as.data.frame.table(sort(tapply(dat$FATALITIES, dat$EVTYPE, sum, na.rm = TRUE), decreasing = TRUE)[1:10])
names(total_fatalities) <- c("evtype", "fatalities")
total_injuries <- as.data.frame.table(sort(tapply(dat$INJURIES, dat$EVTYPE, sum, na.rm = TRUE), decreasing = TRUE)[1:10])
names(total_injuries) <- c("evtype", "injuries")
The total number of fatalities and the total number of injuries are calculated for each event type over the entire dataset, ignoring missing values. Each set of totals is then sorted in decreasing order, and the top 10 event types for fatalities and for injuries are stored in separate data frames.
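The steps above can be traced on a small example. This is a minimal sketch with made-up event names and counts (not NOAA figures). For clarity it builds the final data frame explicitly from the names and values rather than with as.data.frame.table().

```r
# Toy data; the event names and counts are illustrative only.
toy <- data.frame(
  EVTYPE     = c("A", "B", "A", "C", "B", "A"),
  FATALITIES = c(1, 5, 2, NA, 1, 0),
  stringsAsFactors = FALSE
)

# Sum fatalities per event type, ignoring missing values.
totals <- tapply(toy$FATALITIES, toy$EVTYPE, sum, na.rm = TRUE)

# Sort in decreasing order and keep the top 2 event types.
top <- sort(totals, decreasing = TRUE)[1:2]

# Store the result in a data frame; the factor levels keep the ranked order,
# which is what a ranked plot along the x-axis relies on.
top_df <- data.frame(
  evtype     = factor(names(top), levels = names(top)),
  fatalities = as.vector(top)
)
print(top_df)
```

Note that an event type whose only values are missing (type "C" here) sums to 0 under na.rm = TRUE rather than NA, so it is ranked as if it had no fatalities.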
dat$ACTUALPROPDMG <- dat$PROPDMG
dat$ACTUALPROPDMG[dat$PROPDMGEXP %in% c("K", "k")] <- dat$ACTUALPROPDMG[dat$PROPDMGEXP %in% c("K", "k")] * 1000
dat$ACTUALPROPDMG[dat$PROPDMGEXP %in% c("M", "m")] <- dat$ACTUALPROPDMG[dat$PROPDMGEXP %in% c("M", "m")] * 1000000
dat$ACTUALPROPDMG[dat$PROPDMGEXP %in% c("B", "b")] <- dat$ACTUALPROPDMG[dat$PROPDMGEXP %in% c("B", "b")] * 1000000000
dat$ACTUALPROPDMG[dat$PROPDMGEXP %in% 1:9] <- dat$ACTUALPROPDMG[dat$PROPDMGEXP %in% 1:9] * 10 ^ as.integer(dat$PROPDMGEXP[dat$PROPDMGEXP %in% 1:9])
dat$ACTUALCROPDMG <- dat$CROPDMG
dat$ACTUALCROPDMG[dat$CROPDMGEXP %in% c("K", "k")] <- dat$ACTUALCROPDMG[dat$CROPDMGEXP %in% c("K", "k")] * 1000
dat$ACTUALCROPDMG[dat$CROPDMGEXP %in% c("M", "m")] <- dat$ACTUALCROPDMG[dat$CROPDMGEXP %in% c("M", "m")] * 1000000
dat$ACTUALCROPDMG[dat$CROPDMGEXP %in% c("B", "b")] <- dat$ACTUALCROPDMG[dat$CROPDMGEXP %in% c("B", "b")] * 1000000000
dat$ACTUALCROPDMG[dat$CROPDMGEXP %in% 1:9] <- dat$ACTUALCROPDMG[dat$CROPDMGEXP %in% 1:9] * 10 ^ as.integer(dat$CROPDMGEXP[dat$CROPDMGEXP %in% 1:9])
dat$TOTALDMG <- dat$ACTUALPROPDMG + dat$ACTUALCROPDMG
total_damage <- as.data.frame.table(sort(tapply(dat$TOTALDMG, dat$EVTYPE, sum, na.rm = TRUE), decreasing = TRUE)[1:10])
names(total_damage) <- c("evtype", "damage")
The actual property and crop damage for each record is calculated by scaling the base value (PROPDMG, CROPDMG) by its exponent code (PROPDMGEXP, CROPDMGEXP): “K” or “k” multiplies by 1,000, “M” or “m” by 1,000,000, “B” or “b” by 1,000,000,000, and a digit from 1 to 9 by the corresponding power of ten. Any other code leaves the base value unchanged. The total damage for each record is then the sum of the actual property and crop damage. These totals are summed for each event type over the entire dataset, ignoring missing values, and sorted in decreasing order of damage. The top 10 event types with the highest total damage are then stored in a data frame.
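The exponent handling can also be written as a single lookup function. The sketch below is one possible formulation that follows the same rules as the code above. The helper name `exp_multiplier` and the toy values are hypothetical and do not appear in the original analysis.

```r
# Hypothetical helper: map an exponent code to a numeric multiplier,
# following the same rules as the step-by-step code above.
exp_multiplier <- function(code) {
  code <- toupper(code)
  mult <- rep(1, length(code))            # unrecognised codes leave the base unchanged
  mult[code %in% "K"] <- 1e3
  mult[code %in% "M"] <- 1e6
  mult[code %in% "B"] <- 1e9
  digit <- code %in% as.character(1:9)    # digits 1-9 are powers of ten
  mult[digit] <- 10 ^ as.integer(code[digit])
  mult
}

# Illustrative values only (not NOAA records).
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 3, 7),
  PROPDMGEXP = c("K", "m", "B", "2", ""),
  stringsAsFactors = FALSE
)
toy$ACTUALPROPDMG <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
print(toy)
```

Keeping the mapping in one function means the same rules can be applied to both the property and the crop columns without repeating each condition.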
library(lattice)
xyplot(fatalities ~ evtype, total_fatalities, scales=list(x=list(rot=45)), main = "Top 10 events causing fatalities across the United States", xlab = "Types of events", ylab = "Total number of fatalities")
The Pareto chart above shows the top 10 event types with the highest total number of fatalities, ranked from left to right. All events in the database, from 1950 through November 2011, are considered. The event types with the most fatalities are “TORNADO”, “EXCESSIVE HEAT” and “FLASH FLOOD”. This addresses the question of which types of events are most harmful with respect to population health (fatalities) across the United States.
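A Pareto chart conventionally pairs the ranked values with a cumulative percentage. As a minimal sketch of that calculation, assuming the totals are already sorted in decreasing order, the example below uses made-up counts (not NOAA figures). The column names `share_pct` and `cumulative_pct` are chosen for illustration.

```r
# Made-up totals, already ranked in decreasing order (not NOAA figures).
ranked <- data.frame(
  evtype     = c("A", "B", "C", "D"),
  fatalities = c(50, 30, 15, 5),
  stringsAsFactors = FALSE
)

# Share of each event type and the running (cumulative) share, in percent.
# The denominator here is the sum over the rows shown, so for a top-10 table
# the shares are relative to the top 10 only, not to all event types.
ranked$share_pct      <- 100 * ranked$fatalities / sum(ranked$fatalities)
ranked$cumulative_pct <- cumsum(ranked$share_pct)
print(ranked)
```

The cumulative column shows how quickly the leading event types account for most of the total, which is the point a Pareto chart is meant to make.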
xyplot(injuries ~ evtype, total_injuries, scales=list(x=list(rot=45)), main = "Top 10 events causing injuries across the United States", xlab = "Types of events", ylab = "Total number of injuries")
The Pareto chart above shows the top 10 event types with the highest total number of injuries, ranked from left to right. All events in the database, from 1950 through November 2011, are considered. The event types with the most injuries are “TORNADO”, “TSTM WIND” and “FLOOD”. This addresses the question of which types of events are most harmful with respect to population health (injuries) across the United States.
xyplot(damage ~ evtype, total_damage, scales=list(x=list(rot=45)), main = "Top 10 events causing damage across the United States", xlab = "Types of events", ylab = "Total damage ($)")
The Pareto chart above shows the top 10 event types with the highest total damage, ranked from left to right. The total damage comprises both property and crop damage. All events in the database, from 1950 through November 2011, are considered. The event types with the highest damage are “FLOOD”, “HURRICANE/TYPHOON” and “TORNADO”. This addresses the question of which types of events have the greatest economic consequences across the United States.