Severe weather events have caused substantial economic damage and many deaths and injuries. Overall, the most harmful weather events have been tornados. Tornados have caused, by far, the most property damage and the most human casualties (injuries + deaths): more than 10 times the casualties of the next most harmful event type, excessive heat. Hail has caused the most crop damage. In total damage (property + crop), tornados have caused roughly twice the loss of the next most damaging event type, flash flooding. We transform the Storm Data dataset into several summarised datasets, then use figures to visualize the data.
We load the necessary packages; the Storm Data file is read from the working directory.
library(ggplot2)
library(dplyr)
library(grid)
First we read in the data (repdata-data-StormData.csv.bz2 is saved in the working directory).
stormData <- read.csv("repdata-data-StormData.csv.bz2")
stormData <- tbl_df(stormData)
Look at the dimensions and first few rows of data.
dim(stormData)
## [1] 902297 37
print(stormData)
## Source: local data frame [902,297 x 37]
##
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## (dbl) (fctr) (fctr) (fctr) (dbl) (fctr) (fctr)
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## 7 1 11/16/1951 0:00:00 0100 CST 9 BLOUNT AL
## 8 1 1/22/1952 0:00:00 0900 CST 123 TALLAPOOSA AL
## 9 1 2/13/1952 0:00:00 2000 CST 125 TUSCALOOSA AL
## 10 1 2/13/1952 0:00:00 2000 CST 57 FAYETTE AL
## .. ... ... ... ... ... ... ...
## Variables not shown: EVTYPE (fctr), BGN_RANGE (dbl), BGN_AZI (fctr),
## BGN_LOCATI (fctr), END_DATE (fctr), END_TIME (fctr), COUNTY_END (dbl),
## COUNTYENDN (lgl), END_RANGE (dbl), END_AZI (fctr), END_LOCATI (fctr),
## LENGTH (dbl), WIDTH (dbl), F (int), MAG (dbl), FATALITIES (dbl),
## INJURIES (dbl), PROPDMG (dbl), PROPDMGEXP (fctr), CROPDMG (dbl),
## CROPDMGEXP (fctr), WFO (fctr), STATEOFFIC (fctr), ZONENAMES (fctr),
## LATITUDE (dbl), LONGITUDE (dbl), LATITUDE_E (dbl), LONGITUDE_ (dbl),
## REMARKS (fctr), REFNUM (dbl)
The transformations and their justifications will accompany the results in the following section.
We create a new variable CASUALTIES = FATALITIES + INJURIES, then consider the cumulative CASUALTIES for each EVTYPE.
stormData1 <- stormData %>%
mutate(CASUALTIES = FATALITIES + INJURIES) %>%
group_by(EVTYPE) %>%
summarise(TOTAL_CASUALTIES = sum(CASUALTIES, na.rm = TRUE)) %>%
arrange(desc(TOTAL_CASUALTIES)) %>%
slice(1:20) %>%
print()
## Source: local data frame [20 x 2]
##
## EVTYPE TOTAL_CASUALTIES
## (fctr) (dbl)
## 1 TORNADO 96979
## 2 EXCESSIVE HEAT 8428
## 3 TSTM WIND 7461
## 4 FLOOD 7259
## 5 LIGHTNING 6046
## 6 HEAT 3037
## 7 FLASH FLOOD 2755
## 8 ICE STORM 2064
## 9 THUNDERSTORM WIND 1621
## 10 WINTER STORM 1527
## 11 HIGH WIND 1385
## 12 HAIL 1376
## 13 HURRICANE/TYPHOON 1339
## 14 HEAVY SNOW 1148
## 15 WILDFIRE 986
## 16 THUNDERSTORM WINDS 972
## 17 BLIZZARD 906
## 18 FOG 796
## 19 RIP CURRENT 600
## 20 WILD/FOREST FIRE 557
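As a quick arithmetic check on the table above, the gap between tornados and the runner-up can be computed directly. The sketch below hard-codes the two totals printed above rather than recomputing them from stormData1; the variable names are illustrative only.

```r
# Totals copied from the table above (not recomputed from the raw data)
tornado_casualties <- 96979
excessive_heat_casualties <- 8428

# Ratio of tornado casualties to those of the next most harmful event type
casualty_ratio <- tornado_casualties / excessive_heat_casualties
print(round(casualty_ratio, 1))
```

The ratio comes out at roughly 11.5, consistent with "more than 10 times".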
We visualize the top 20 events most hazardous to human population health in the following plot.
p <- ggplot(stormData1, aes(x = EVTYPE, y = TOTAL_CASUALTIES, fill = TOTAL_CASUALTIES)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ymin <- min(stormData1$TOTAL_CASUALTIES)
ymax <- max(stormData1$TOTAL_CASUALTIES)
p <- p + coord_cartesian(ylim=c(ymin, 0.1*ymax)) +
scale_fill_gradientn(colours=rainbow(5)) +
xlab("") +
ylab("Total Casualties") +
ggtitle("Top 20 Weather Events Harmful to Human Health")
print(p)
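The figure indicates that tornados cause by far the highest number of casualties. The number of casualties inflicted by tornados is 96,979, more than 10 times that of any other event in the top 20. Note that the y-axis is cut off at 10% of the maximum (via coord_cartesian) so that the smaller bars remain visible; the tornado bar therefore runs off the top of the plot.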
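The bars in the plot are presumably drawn in alphabetical order of EVTYPE, since ggplot2 orders a discrete axis by factor level. If a ranking by bar height is preferred, one option is to reorder the factor levels before plotting. A minimal sketch on a three-row toy data frame (values copied from the table above; toyData is a hypothetical name, not part of the analysis):

```r
# Toy subset of the casualty table (values copied from the output above)
toyData <- data.frame(
  EVTYPE = c("FLOOD", "TORNADO", "HEAT"),
  TOTAL_CASUALTIES = c(7259, 96979, 3037)
)

# reorder() sorts levels by ascending score, so negate the totals
# to get the largest total first
toyData$EVTYPE <- reorder(toyData$EVTYPE, -toyData$TOTAL_CASUALTIES)
print(levels(toyData$EVTYPE))
```

In the plotting call itself, the same effect should come from using aes(x = reorder(EVTYPE, -TOTAL_CASUALTIES), ...).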
Note that the factor levels of PROPDMGEXP and CROPDMGEXP are not self-explanatory:
levels(stormData$PROPDMGEXP)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
levels(stormData$CROPDMGEXP)
## [1] "" "?" "0" "2" "B" "k" "K" "m" "M"
We recode the factor levels according to the convention adopted at https://rstudio-pubs-static.s3.amazonaws.com/58957_37b6723ee52b455990e149edde45e5b6.html, then convert PROPDMGEXP and CROPDMGEXP to type numeric in stormData. We form new variables for the total property damage, total crop damage, and their sum, named totalPropDmg, totalCropDmg, and totalDamage, respectively.
levels(stormData$PROPDMGEXP) <- c(0, 0, 0, 1, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10^9, 10^2, 10^2, 10^3, 10^6, 10^6)
levels(stormData$CROPDMGEXP) <- c(0, 0, 10, 10, 10^9, 10^3, 10^3, 10^6, 10^6)
stormData <- stormData %>%
mutate(PROPDMGEXP = as.numeric(PROPDMGEXP),
CROPDMGEXP = as.numeric(CROPDMGEXP),
totalPropDmg = PROPDMG*PROPDMGEXP,
totalCropDmg = CROPDMG*CROPDMGEXP,
totalDamage = totalPropDmg + totalCropDmg)
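One caveat with the conversion above: in R, as.numeric() applied to a factor returns the internal integer level codes rather than the numbers written in the level labels. A minimal sketch of the difference on a toy factor (expCodes is a hypothetical name), together with the usual as.character() workaround:

```r
# Toy factor of exponent codes; levels sort alphabetically as "B", "K", "M"
expCodes <- factor(c("K", "M", "K", "B"))

# Relabel the levels with multipliers, in level order: B, K, M
levels(expCodes) <- c(10^9, 10^3, 10^6)

# Direct conversion yields the level codes, not the multipliers
codes <- as.numeric(expCodes)

# Going through character first recovers the numeric labels
multipliers <- as.numeric(as.character(expCodes))

print(codes)
print(multipliers)
```

If the totals are meant to be in dollars, the mutate() call would need the as.numeric(as.character(...)) form. The outputs shown in this report were produced with the code as written.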
The following summarised dataset is used later to fill in missing values before plotting.
stormDataSummarised <- stormData %>%
group_by(EVTYPE) %>%
summarise(propDamageSum = sum(totalPropDmg), cropDamageSum = sum(totalCropDmg)) %>%
print()
## Source: local data frame [985 x 3]
##
## EVTYPE propDamageSum cropDamageSum
## (fctr) (dbl) (dbl)
## 1 HIGH SURF ADVISORY 1200 0
## 2 COASTAL FLOOD 0 0
## 3 FLASH FLOOD 300 0
## 4 LIGHTNING 0 0
## 5 TSTM WIND 656 0
## 6 TSTM WIND (G45) 48 0
## 7 WATERSPOUT 0 0
## 8 WIND 0 0
## 9 ? 30 0
## 10 ABNORMAL WARMTH 0 0
## .. ... ... ...
Let us process stormData by summarising the total economic damages inflicted by each type of event. Here we create a summary data frame for each type of damage, then merge the two.
stormDataProp <- stormData %>%
group_by(EVTYPE) %>%
summarise(propDamageSum = sum(totalPropDmg)) %>%
arrange(desc(propDamageSum)) %>%
slice(1:20)
stormDataCrop <- stormData %>%
group_by(EVTYPE) %>%
summarise(cropDamageSum = sum(totalCropDmg)) %>%
arrange(desc(cropDamageSum)) %>%
slice(1:20)
stormDataCropProp <- merge(x = stormDataProp, y = stormDataCrop, all = TRUE)
print(stormDataCropProp)
## EVTYPE propDamageSum cropDamageSum
## 1 DROUGHT NA 148044.10
## 2 EXTREME COLD NA 25772.70
## 3 FLASH FLOOD 8532395.2 718045.20
## 4 FLASH FLOODING 171242.5 20514.20
## 5 FLOOD 5420630.1 677650.95
## 6 FROST/FREEZE NA 29224.70
## 7 HAIL 4144326.7 2320784.95
## 8 HEAVY RAIN 305696.9 45214.20
## 9 HEAVY SNOW 734318.0 NA
## 10 HIGH WIND 1951880.8 69754.75
## 11 HIGH WINDS 333988.3 NA
## 12 HURRICANE NA 24096.55
## 13 HURRICANE/TYPHOON NA 20286.58
## 14 ICE STORM 399736.9 NA
## 15 LIGHTNING 3619930.4 14330.96
## 16 STRONG WIND 378075.2 NA
## 17 THUNDERSTORM WIND 5263238.0 267514.20
## 18 THUNDERSTORM WINDS 2660174.6 74727.95
## 19 TORNADO 19321049.9 400069.49
## 20 TROPICAL STORM 293039.8 24269.60
## 21 TSTM WIND 8018780.8 437255.65
## 22 TSTM WIND/HAIL NA 17487.00
## 23 URBAN/SML STREAM FLD 156343.9 NA
## 24 WILD/FOREST FIRE 237530.6 16860.87
## 25 WILDFIRE 510398.3 17748.20
## 26 WINTER STORM 797867.9 NA
Before visualizing the data, we fill in the missing values. Since we chose the top 20 events by property and crop damage individually, there are now 26 event types in EVTYPE. We replace the missing values in the second and third columns with those found above in stormDataSummarised.
events <- stormDataCropProp$EVTYPE
newLogical <- stormDataSummarised$EVTYPE %in% events
newProp <- stormDataSummarised$propDamageSum[newLogical]
newCrop <- stormDataSummarised$cropDamageSum[newLogical]
stormDataCropProp$propDamageSum <- newProp
stormDataCropProp$cropDamageSum <- newCrop
print(stormDataCropProp)
## EVTYPE propDamageSum cropDamageSum
## 1 DROUGHT 25637.35 148044.10
## 2 EXTREME COLD 46005.38 25772.70
## 3 FLASH FLOOD 8532395.22 718045.20
## 4 FLASH FLOODING 171242.45 20514.20
## 5 FLOOD 5420630.06 677650.95
## 6 FROST/FREEZE 5819.64 29224.70
## 7 HAIL 4144326.74 2320784.95
## 8 HEAVY RAIN 305696.89 45214.20
## 9 HEAVY SNOW 734317.99 8795.50
## 10 HIGH WIND 1951880.76 69754.75
## 11 HIGH WINDS 333988.35 7077.40
## 12 HURRICANE 99229.65 24096.55
## 13 HURRICANE/TYPHOON 38709.09 20286.58
## 14 ICE STORM 399736.88 6771.25
## 15 LIGHTNING 3619930.38 14330.96
## 16 STRONG WIND 378075.22 6531.00
## 17 THUNDERSTORM WIND 5263238.05 267514.20
## 18 THUNDERSTORM WINDS 2660174.61 74727.95
## 19 TORNADO 19321049.94 400069.49
## 20 TROPICAL STORM 293039.76 24269.60
## 21 TSTM WIND 8018780.83 437255.65
## 22 TSTM WIND/HAIL 50265.00 17487.00
## 23 URBAN/SML STREAM FLD 156343.93 11180.90
## 24 WILD/FOREST FIRE 237530.65 16860.87
## 25 WILDFIRE 510398.26 17748.20
## 26 WINTER STORM 797867.88 7940.95
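The replacement above works by position: it assumes that the rows selected from stormDataSummarised come out in the same order as the rows of stormDataCropProp. That holds when both are sorted by EVTYPE, but a lookup by name avoids relying on it. A minimal base R sketch on a toy data frame (summarised, events, and filled are hypothetical names; the values are copied from the tables above):

```r
# Toy stand-in for stormDataSummarised (values copied from the output above)
summarised <- data.frame(
  EVTYPE = c("ABNORMAL WARMTH", "DROUGHT", "HAIL", "TORNADO"),
  propDamageSum = c(0, 25637.35, 4144326.74, 19321049.94),
  cropDamageSum = c(0, 148044.10, 2320784.95, 400069.49)
)

# Event types to look up, deliberately in a different order
events <- c("TORNADO", "DROUGHT")

# match() finds each event's row by name, so row order cannot drift
filled <- summarised[match(events, summarised$EVTYPE), ]
print(filled)
```

Each row of filled is tied to its event by name, so the result stays correct even if one of the two tables is re-sorted.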
Now we visualize the events that caused the most property damage and the most crop damage; total damage (the sum of the two) is examined afterwards.
p4 <- ggplot(stormDataCropProp, aes(x = EVTYPE, y = propDamageSum/(10^6))) +
geom_point(aes(color = "Property Damage", shape = "Property Damage")) +
geom_point(aes(x = EVTYPE, y = cropDamageSum/(10^6), color = "Crop Damage", shape = "Crop Damage")) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("") +
ylab("Economic Loss ($millions)") +
ggtitle("Economic Consequences of Weather Events")
print(p4)
The figure above shows the economic losses caused by weather events. It includes every event type that ranks in the top 20 for either crop damage or property damage, 26 in total. Tornados are clearly the largest source of property damage, while hail is the largest source of crop damage.
Let’s look at the 20 weather events that have caused the most total damage (property + crop).
top20 <- stormData %>%
select(EVTYPE, totalDamage) %>%
group_by(EVTYPE) %>%
summarise(totalDamageSum = sum(totalDamage)) %>%
arrange(desc(totalDamageSum)) %>%
slice(1:20)
print(top20)
## Source: local data frame [20 x 2]
##
## EVTYPE totalDamageSum
## (fctr) (dbl)
## 1 TORNADO 19721119.4
## 2 FLASH FLOOD 9250440.4
## 3 TSTM WIND 8456036.5
## 4 HAIL 6465111.7
## 5 FLOOD 6098281.0
## 6 THUNDERSTORM WIND 5530752.2
## 7 LIGHTNING 3634261.3
## 8 THUNDERSTORM WINDS 2734902.6
## 9 HIGH WIND 2021635.5
## 10 WINTER STORM 805808.8
## 11 HEAVY SNOW 743113.5
## 12 WILDFIRE 528146.5
## 13 ICE STORM 406508.1
## 14 STRONG WIND 384606.2
## 15 HEAVY RAIN 350911.1
## 16 HIGH WINDS 341065.8
## 17 TROPICAL STORM 317309.4
## 18 WILD/FOREST FIRE 254391.5
## 19 FLASH FLOODING 191756.6
## 20 DROUGHT 173681.5
p3 <- ggplot(top20, aes(x = EVTYPE, y = totalDamageSum/(10^6))) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p3 <- p3 +
ylab("Total Damage ($millions)") +
xlab("") +
ggtitle("Weather Events and Damage")
print(p3)
The figure above shows the 20 events that caused the highest total economic damage (property + crop). Unsurprisingly, property damage accounts for the bulk of the total damage, which is why the pattern in this figure looks so similar to that of the blue triangles in the previous figure.
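To put a number on "the bulk of the total damage", the property share can be computed from the totals already printed. The sketch below hard-codes figures from the tables above for two event types, plus the tornado and flash flood totals; the variable names are illustrative only.

```r
# Property and total damage sums copied from the tables above
propDamage  <- c(TORNADO = 19321049.94, HAIL = 4144326.74)
totalDamage <- c(TORNADO = 19721119.4,  HAIL = 6465111.7)

# Share of each event's total damage that is property damage
propShare <- propDamage / totalDamage
print(round(propShare, 2))

# Total damage of tornados relative to flash floods, the runner-up
tornadoVsFlashFlood <- 19721119.4 / 9250440.4
print(round(tornadoVsFlashFlood, 1))
```

Property damage is about 98% of the tornado total but only about 64% of the hail total, so the similarity between the two figures is strongest for events with little crop damage. The tornado total is about 2.1 times the flash flood total.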