This report analyzes data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which documents severe storm and weather events in the United States between 1950 and November 2011. The data consist of characteristics of these events, including but not limited to location, date and duration, fatalities, injuries, and damages. This report has 2 objectives: (1) to identify the event type that causes the greatest harm to population health, and (2) to identify the event type that causes the greatest economic loss.
We start by requiring the packages we will need later for descriptive analyses.
library(ggplot2)
library(knitr)
library(plyr)
Now we download the raw data, a bzip2-compressed CSV file. The file is obtained from the URL below, as provided by the Coursera course ‘Reproducible Research’. We download the 46.9 MB file into the local working directory and import it into a data frame.
StormURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(StormURL, destfile = "StormData.csv.bz2")
Storm <- read.csv("StormData.csv.bz2", header=TRUE)
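Because the archive is fairly large, it can help to skip the download when the file is already present. The following is a minimal sketch under that assumption; `download_once` and its `fetch` argument are hypothetical names introduced here for illustration and are not part of the original analysis.

```r
# Hypothetical helper: download only when the destination file is missing,
# so that re-running the report does not fetch the archive again.
download_once <- function(url, destfile, fetch = utils::download.file) {
  if (!file.exists(destfile)) {
    # mode = "wb" writes the compressed archive in binary mode
    fetch(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

# Intended usage with the objects defined above (requires network access):
# download_once(StormURL, "StormData.csv.bz2")
# Storm <- read.csv("StormData.csv.bz2", header = TRUE)
```

The `fetch` argument exists only so the helper can be exercised without a network connection; by default it calls `download.file`.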
We check the dimension of the data imported.
dim(Storm)
## [1] 902297 37
The data contains 37 columns and 902297 rows.
We then check the first 2 rows and the column names.
head(Storm,2)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14 100 3 0 0
## 2 NA 0 2 150 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
names(Storm)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
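Only seven of the 37 columns are used in the rest of this report. One option, sketched here on a made-up data frame, is to keep just those columns; `toy`, `keep`, and `toy_small` are illustrative names, and the values in `toy` are invented.

```r
# Columns referenced by the analysis in this report.
keep <- c("EVTYPE", "FATALITIES", "INJURIES",
          "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")

# Toy stand-in for the imported data (made-up values, one extra column).
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD"),
  FATALITIES = c(0, 1),
  INJURIES = c(15, 0),
  PROPDMG = c(25, 2.5),
  PROPDMGEXP = c("K", "M"),
  CROPDMG = c(0, 1),
  CROPDMGEXP = c("", "K"),
  REMARKS = c("", "")
)

# Stop early if an expected column is missing, then drop the others.
stopifnot(all(keep %in% names(toy)))
toy_small <- toy[, keep]
dim(toy_small)
```

On the real data the equivalent would be `Storm[, keep]`, which reduces memory use without affecting any of the later steps.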
To find the event type that causes the greatest harm, we focus on the characteristics INJURIES and FATALITIES separately.
Let us first take a look at the total number of injuries by event type. Having loaded the ‘plyr’ package earlier, we use its ‘ddply’ function to sum the number of injuries by event type, and place the result in a new data frame which we name ‘Storm.Inj’.
Storm.Inj <- ddply(Storm,.(EVTYPE),summarize,Total.Injuries = sum(INJURIES,na.rm=TRUE))
We then order the number of injuries from highest to lowest, and display the 5 event types with the highest number of injuries.
Storm.Inj <- Storm.Inj[order(Storm.Inj$Total.Injuries,decreasing = TRUE), ]
head(Storm.Inj,5)
## EVTYPE Total.Injuries
## 834 TORNADO 91346
## 856 TSTM WIND 6957
## 170 FLOOD 6789
## 130 EXCESSIVE HEAT 6525
## 464 LIGHTNING 5230
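The sum-then-rank step above can also be sketched in base R without ‘plyr’. The data frame `toy` below is made up for illustration, and `aggregate` is swapped in for `ddply`; this is an alternative under those assumptions, not the method used to produce the table above.

```r
# Toy stand-in for the Storm data frame (made-up values, including an NA).
toy <- data.frame(
  EVTYPE   = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  INJURIES = c(15, 2, 0, 4, NA)
)

# Sum injuries within each event type; na.rm = TRUE is passed on to sum().
toy_inj <- aggregate(toy$INJURIES, by = list(EVTYPE = toy$EVTYPE),
                     FUN = sum, na.rm = TRUE)
names(toy_inj)[2] <- "Total.Injuries"

# Rank from highest to lowest.
toy_inj <- toy_inj[order(toy_inj$Total.Injuries, decreasing = TRUE), ]
head(toy_inj, 5)
```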
We see that Tornado is the event type with the highest number of injuries: 91346. It is followed, in decreasing number of injuries, by TSTM Wind, Flood, Excessive Heat, and Lightning.
We visualize these totals with a horizontal bar plot displaying the top 5 event types. Note that the ‘geom_bar’ function is given the argument “stat = ‘identity’”, so that the bar lengths are taken directly from the values of Total.Injuries rather than from a count of rows; this is suitable here because EVTYPE is categorical and Total.Injuries already holds the summed count for each event type.
PlotInj <- ggplot(Storm.Inj[1:5,],aes(EVTYPE,Total.Injuries,fill=EVTYPE))
PlotInj + geom_bar(stat='identity') + xlab('Type of Event') + ylab ('Total Number of Injuries')+
ggtitle('Highest Number of Injuries by Event Type') + coord_flip()
The plot reaffirms our finding that Tornado causes the highest number of injuries, out of all event types.
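By default ggplot2 orders a categorical axis alphabetically, so the bars do not necessarily appear in ranked order. A possible refinement, sketched here with the five totals from the table above, is to reorder the factor levels by value before plotting. The names `top5` and `p` are illustrative, and the plotting call is guarded in case ggplot2 is not installed.

```r
# The five injury totals reported above.
top5 <- data.frame(
  EVTYPE = c("TORNADO", "TSTM WIND", "FLOOD", "EXCESSIVE HEAT", "LIGHTNING"),
  Total.Injuries = c(91346, 6957, 6789, 6525, 5230)
)

# reorder() sorts the factor levels by increasing Total.Injuries; after
# coord_flip() the last level (the largest total) is drawn at the top.
top5$EVTYPE <- reorder(factor(top5$EVTYPE), top5$Total.Injuries)
levels(top5$EVTYPE)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(top5, aes(EVTYPE, Total.Injuries, fill = EVTYPE)) +
    geom_bar(stat = "identity") +
    xlab("Type of Event") + ylab("Total Number of Injuries") +
    coord_flip()
}
```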
Next, we look at the total number of fatalities by event type, in the same way as for injuries.
Storm.Fatal <- ddply(Storm,.(EVTYPE),summarize,Total.Fatalities = sum(FATALITIES,na.rm = TRUE))
Storm.Fatal <- Storm.Fatal[order(Storm.Fatal$Total.Fatalities,decreasing = TRUE), ]
head(Storm.Fatal,5)
## EVTYPE Total.Fatalities
## 834 TORNADO 5633
## 130 EXCESSIVE HEAT 1903
## 153 FLASH FLOOD 978
## 275 HEAT 937
## 464 LIGHTNING 816
We see that Tornado causes not only the highest number of injuries, but also the highest number of fatalities. We again visualize this in a horizontal bar plot:
PlotFatal <- ggplot(Storm.Fatal[1:5,],aes(EVTYPE,Total.Fatalities,fill=EVTYPE))
PlotFatal + geom_bar(stat='identity') + xlab('Type of Event') + ylab ('Total Fatalities')+
ggtitle('Highest Number of Fatalities by Event Type') + coord_flip()
We first identify the event type that causes the most damage to crops, then the one that causes the most damage to property. Finally, we look at the total damage to crops and property combined, by event type.
We start by examining the crop damage. Since the column ‘CROPDMG’ contains the numerical value, while the column ‘CROPDMGEXP’ specifies the magnitude (units) of that value, we can create a new column that holds the monetary value in dollars by multiplying the two together. To do this, let us start by viewing the available units in ‘CROPDMGEXP’.
unique(Storm$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
Two of the levels, ‘m’ and ‘k’, are lowercase but mean exactly the same as their uppercase counterparts, ‘M’ and ‘K’. We clean up the column so that the units are recorded uniformly. We first convert the column from class ‘factor’ to ‘character’, so that we can then convert all levels to upper case using the ‘toupper’ function.
Storm$CROPDMGEXP <- as.character(Storm$CROPDMGEXP)
Storm$CROPDMGEXP <- toupper(Storm$CROPDMGEXP)
This is followed by replacing the levels that carry no unit information (‘?’ and ‘’) with the character ‘0’.
Storm$CROPDMGEXP[Storm$CROPDMGEXP %in% c('','?')] <- '0'
unique(Storm$CROPDMGEXP) # check column has been cleaned to format desired
## [1] "0" "M" "K" "B" "2"
Now we substitute the character levels ‘K’ (for thousands), ‘M’ (for millions), and ‘B’ (for billions) with their powers of ten: 3, 6, and 9, respectively. Note that the character ‘2’ remains unchanged, since it is already an exponent (10^2, i.e. hundreds, the equivalent of ‘H’). Then, we convert the column to class ‘numeric’ and raise 10 to that power, so that the column holds the actual multiplier.
Storm$CROPDMGEXP[Storm$CROPDMGEXP %in% c('M')] <- '6'
Storm$CROPDMGEXP[Storm$CROPDMGEXP %in% c('K')] <- '3'
Storm$CROPDMGEXP[Storm$CROPDMGEXP %in% c('B')] <- '9'
Storm$CROPDMGEXP <- 10^(as.numeric(Storm$CROPDMGEXP))
unique(Storm$CROPDMGEXP) # check
## [1] 1e+00 1e+06 1e+03 1e+09 1e+02
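The same cleaning rules can be collected into a single lookup, which avoids repeating the substitution line for each letter. The function below is a hypothetical sketch (`exp_to_multiplier` is not part of the original analysis); it follows the rules used in this report: letters map to powers of ten, single digits are read as exponents, and anything else (‘’, ‘?’, ‘+’, ‘-’) is treated as exponent 0.

```r
# Hypothetical helper: map exponent codes to numeric multipliers.
exp_to_multiplier <- function(code) {
  code <- toupper(as.character(code))
  powers <- c(H = 2, K = 3, M = 6, B = 9)

  is_letter <- code %in% names(powers)
  is_digit  <- grepl("^[0-9]$", code)

  # Default exponent 0 (multiplier 1) for blank or unrecognised codes.
  exponent <- rep(0, length(code))
  exponent[is_letter] <- unname(powers[code[is_letter]])
  exponent[is_digit]  <- as.numeric(code[is_digit])

  10^exponent
}

exp_to_multiplier(c("K", "m", "", "?", "2", "B"))
```

Applied to the real data, this would be used as `Storm$CROPDMG * exp_to_multiplier(Storm$CROPDMGEXP)`, and likewise for the property columns.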
Before multiplying the 2 columns together, we check to ensure the class of ‘CROPDMG’ is also numeric.
class(Storm$CROPDMG)
## [1] "numeric"
We create this new column ‘Crop.Damage’, formed by multiplication of the 2 columns, that clearly depicts the crop damage. We merge the column with the rest of the ‘Storm’ data imported.
Crop.Damage <- Storm$CROPDMG * Storm$CROPDMGEXP # new column
Storm <- cbind(Storm,Crop.Damage) # merge column with Storm data
Next, by event type, we sum the total crop damage. This information is extracted into a new data frame which we call ‘Event.Crop.Dmg’. The top 5 event types with the highest damage are shown.
Event.Crop.Dmg <- ddply(Storm, .(EVTYPE), summarize, Total.Crop.Dmg = sum(Crop.Damage, na.rm = TRUE))
Event.Crop.Dmg <- Event.Crop.Dmg[order(Event.Crop.Dmg$Total.Crop.Dmg, decreasing = TRUE), ]
head(Event.Crop.Dmg,5)
## EVTYPE Total.Crop.Dmg
## 95 DROUGHT 13972566000
## 170 FLOOD 5661968450
## 590 RIVER FLOOD 5029459000
## 427 ICE STORM 5022113500
## 244 HAIL 3025954473
We see that Drought results in the greatest amount of financial damage to crops.
We then look at property damages in a similar way.
unique(Storm$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
Storm$PROPDMGEXP <-as.character(Storm$PROPDMGEXP)
This is followed by converting all levels to their upper case counterparts, and replacing the levels that carry no unit information (‘+’, ‘-’, ‘?’, and ‘’) with the character ‘0’.
Storm$PROPDMGEXP <- toupper(Storm$PROPDMGEXP)
Storm$PROPDMGEXP[Storm$PROPDMGEXP %in% c('+','-','?','')] <- '0'
unique(Storm$PROPDMGEXP) # check column has been cleaned to format desired
## [1] "K" "M" "0" "B" "5" "6" "4" "2" "3" "H" "7" "1" "8"
Storm$PROPDMGEXP[Storm$PROPDMGEXP %in% c('M')] <- '6'
Storm$PROPDMGEXP[Storm$PROPDMGEXP %in% c('K')] <- '3'
Storm$PROPDMGEXP[Storm$PROPDMGEXP %in% c('B')] <- '9'
Storm$PROPDMGEXP[Storm$PROPDMGEXP %in% c('H')] <- '2'
Storm$PROPDMGEXP <- 10^(as.numeric(Storm$PROPDMGEXP))
unique(Storm$PROPDMGEXP) # check
## [1] 1e+03 1e+06 1e+00 1e+09 1e+05 1e+04 1e+02 1e+07 1e+01 1e+08
Prop.Damage <- Storm$PROPDMG * Storm$PROPDMGEXP # new column
Storm <- cbind(Storm,Prop.Damage) # merge column with Storm data
Event.Prop.Dmg <- ddply(Storm, .(EVTYPE), summarize, Total.Prop.Dmg = sum(Prop.Damage, na.rm = TRUE))
Event.Prop.Dmg <- Event.Prop.Dmg[order(Event.Prop.Dmg$Total.Prop.Dmg, decreasing = TRUE), ]
head(Event.Prop.Dmg,5)
## EVTYPE Total.Prop.Dmg
## 170 FLOOD 144657709807
## 411 HURRICANE/TYPHOON 69305840000
## 834 TORNADO 56947380676
## 670 STORM SURGE 43323536000
## 153 FLASH FLOOD 16822673978
We see that Flood results in the highest damage to properties.
Now, we look at the total damages (both crop and property). We create a column of total damages named ‘Total.Dmg’, then merge this column into the Storm data frame.
Total.Dmg <- Storm$Crop.Damage + Storm$Prop.Damage
head(Total.Dmg)
## [1] 25000 2500 25000 2500 2500 2500
Storm <- cbind(Storm,Total.Dmg)
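As a worked check of the arithmetic, the first two rows of the data have PROPDMG values of 25.0 and 2.5 with exponent ‘K’, and a CROPDMG of 0. The sketch below recomputes their totals by hand; the variable names are illustrative, and the crop multiplier is assumed to be 1 (it has no effect here since the crop value is 0).

```r
# Values for the first two records, as shown in the head() output earlier.
prop_value <- c(25.0, 2.5)
prop_mult  <- c(10^3, 10^3)   # 'K' means thousands
crop_value <- c(0, 0)
crop_mult  <- c(10^0, 10^0)   # assumed: blank exponent treated as 10^0

# Total damage = property damage + crop damage, each value times its multiplier.
total <- prop_value * prop_mult + crop_value * crop_mult
total
```

The result agrees with the first two entries printed by `head(Total.Dmg)`, 25000 and 2500.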
We then rank the event types by total damages and display the six with the highest totals.
Total.Damage <- ddply(Storm,.(EVTYPE),summarize,TotalDamage = sum(Total.Dmg,na.rm=TRUE))
Total.Damage <- Total.Damage[order(Total.Damage$TotalDamage, decreasing = TRUE), ]
head(Total.Damage)
## EVTYPE TotalDamage
## 170 FLOOD 150319678257
## 411 HURRICANE/TYPHOON 71913712800
## 834 TORNADO 57362333946
## 670 STORM SURGE 43323541000
## 244 HAIL 18761221986
## 153 FLASH FLOOD 18243991078
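A quick consistency check on these figures: for FLOOD, the only event type that appears in all three tables, the crop and property totals reported earlier should add up to the total above. The variable names below are illustrative.

```r
# FLOOD totals as printed in the crop, property, and combined tables.
flood_crop  <- 5661968450
flood_prop  <- 144657709807
flood_total <- 150319678257

# All three are whole numbers well below 2^53, so the sum is exact in
# double-precision arithmetic and can be compared with ==.
flood_crop + flood_prop == flood_total
```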
Let us visualize the total damage by event type in a horizontal bar plot:
PlotDmg <- ggplot(Total.Damage[1:5,],aes(EVTYPE,TotalDamage,fill=EVTYPE))
PlotDmg + geom_bar(stat='identity') + xlab('Type of Event') + ylab ('Total Damages')+
ggtitle('Top 5 Events Ranked by Total Damages') + coord_flip()
We see that FLOOD causes the greatest economic damage overall.
- Addressing Objective 1, TORNADO causes the greatest harm to population health, in terms of both injuries and fatalities. EXCESSIVE HEAT comes second overall, since it is 4th highest in total injuries and 2nd highest in total fatalities. The majority of the resources for rescue efforts should be allocated to TORNADO events.
- Addressing Objective 2, FLOOD causes the greatest economic loss, since it is the event type that causes the greatest damage to property and the 2nd greatest damage to crops.