The analysis to be conducted with this report ahall answer following two questions from the available NOAA data:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
Following libraries are loaded to analyse the data set:
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
##
## The following objects are masked from 'package:dplyr':
##
## combine, src, summarize
##
## The following objects are masked from 'package:plyr':
##
## is.discrete, summarize
##
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
The data is available at https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2 and read in and stored as data table “storm_data”.
storm_data <- read.csv("/Users/markjack/data/repdata-data-StormData.csv")
The data set is quickly reviewed for its column names and a general summary of its content.
colnames(storm_data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
str(storm_data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
We select as key indicators (columns) to address the above two questions following columns from the “storm_data” data set: STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, and CROPDMG.
storm_damage <- select(storm_data, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG)
head(storm_damage)
## STATE EVTYPE FATALITIES INJURIES PROPDMG CROPDMG
## 1 AL TORNADO 0 15 25.0 0
## 2 AL TORNADO 0 0 2.5 0
## 3 AL TORNADO 0 2 25.0 0
## 4 AL TORNADO 0 2 2.5 0
## 5 AL TORNADO 0 2 2.5 0
## 6 AL TORNADO 0 6 2.5 0
Across United States, health impacts by weather: The two columns FATALITIES and INJURIES are individually summed with respect to different event types listed in the column EVTYPE. The summation is done across all the states listed in column STATE to obtain the total fatalities and injuries for each weather event in the Uinted States. Different tools from the R packages dplyr, tidyr and plyr are being used to first melt the existing data, then sum across the new row values “FATALITIES” and “INJURIES” in a new column named “variable”. The new variable “sum” is then cast back into two columns FATALITIES and INJURIES to separate the summed results for those two categories. The data is then sorted with the command ‘arrange’ to first show the total number of fatalities for each weather type in descending order which is then repeated with the total number of injuries for each weather event in descending order. The largest 10 values are printed in each case.
storm_damage <- group_by(storm_damage, STATE, EVTYPE)
storm_damage_health <- melt(storm_damage, id.vars = c("EVTYPE"), measure.vars = c("FATALITIES", "INJURIES"), na.rm = TRUE)
storm_damage_order_health <- group_by(storm_damage_health, EVTYPE, variable)
storm_damage_sum_health <- ddply(storm_damage_order_health, c("EVTYPE", "variable"), summarise, sum = sum(value, na.rm = TRUE))
storm_damage_sum_health <- arrange(storm_damage_sum_health, desc(sum))
storm_damage_sum_health2 <- dcast(storm_damage_sum_health, EVTYPE ~ variable)
## Using sum as value column: use value.var to override.
storm_damage_sum_health3 <- arrange(storm_damage_sum_health2, desc(FATALITIES), desc(INJURIES))
head(storm_damage_sum_health3, 10)
## EVTYPE FATALITIES INJURIES
## 1 TORNADO 5633 91346
## 2 EXCESSIVE HEAT 1903 6525
## 3 FLASH FLOOD 978 1777
## 4 HEAT 937 2100
## 5 LIGHTNING 816 5230
## 6 TSTM WIND 504 6957
## 7 FLOOD 470 6789
## 8 RIP CURRENT 368 232
## 9 HIGH WIND 248 1137
## 10 AVALANCHE 224 170
storm_damage_sum_health4 <- arrange(storm_damage_sum_health2, desc(INJURIES), desc(FATALITIES))
head(storm_damage_sum_health4, 10)
## EVTYPE FATALITIES INJURIES
## 1 TORNADO 5633 91346
## 2 TSTM WIND 504 6957
## 3 FLOOD 470 6789
## 4 EXCESSIVE HEAT 1903 6525
## 5 LIGHTNING 816 5230
## 6 HEAT 937 2100
## 7 ICE STORM 89 1975
## 8 FLASH FLOOD 978 1777
## 9 THUNDERSTORM WIND 133 1488
## 10 HAIL 15 1361
The largest number of fatalities and injuries in the United States can be clearly attributed to tornados as seen from the table. Tornados as a cause of bodily harm to people by far leads in the number of total fatalities listed with excessive heat following as the second most important cause in case of fatalities. For injuries, TSTM winds and floods and lightning strikes are equally important as second leading causes.
This is further visualized with these two bar plots of the total number of fatalities and injuries for each weather event type in descending order:
par(mfrow = c(1,2))
barplot(storm_damage_sum_health3$FATALITIES[1:10], xlab="EVTYPE", col="darkblue")
barplot(storm_damage_sum_health4$INJURIES[1:10], xlab="EVTYPE", col="red")
Across the United States, economic impacts by weather: Analogously to the analysis shown above on the health impacts of weather related incidents, the above procedure is repeated exactly but with columns FATALITIES and INJURIES from the original data set storm_data replaced by the columns PROPDMG and CROPDMG. The two columns summarize any property damage and crop damage inflicted by a weather-related event. Following the exact same procedure as above we also obtain a listing of the 10 most significant weather causes of property damage and crop damage in the United States.
storm_damage <- group_by(storm_damage, STATE, EVTYPE)
storm_damage_econ <- melt(storm_damage, id.vars = c("EVTYPE"), measure.vars = c("PROPDMG", "CROPDMG"), na.rm = TRUE)
storm_damage_order_econ <- group_by(storm_damage_econ, EVTYPE, variable)
storm_damage_sum_econ <- ddply(storm_damage_order_econ, c("EVTYPE", "variable"), summarise, sum = sum(value, na.rm = TRUE))
storm_damage_sum_econ <- arrange(storm_damage_sum_econ, desc(sum))
storm_damage_sum_econ2 <- dcast(storm_damage_sum_econ, EVTYPE ~ variable)
## Using sum as value column: use value.var to override.
storm_damage_sum_econ3 <- arrange(storm_damage_sum_econ2, desc(PROPDMG), desc(CROPDMG))
head(storm_damage_sum_econ3, 10)
## EVTYPE PROPDMG CROPDMG
## 1 TORNADO 3212258.2 100018.52
## 2 FLASH FLOOD 1420124.6 179200.46
## 3 TSTM WIND 1335965.6 109202.60
## 4 FLOOD 899938.5 168037.88
## 5 THUNDERSTORM WIND 876844.2 66791.45
## 6 HAIL 688693.4 579596.28
## 7 LIGHTNING 603351.8 3580.61
## 8 THUNDERSTORM WINDS 446293.2 18684.93
## 9 HIGH WIND 324731.6 17283.21
## 10 WINTER STORM 132720.6 1978.99
storm_damage_sum_econ4 <- arrange(storm_damage_sum_econ2, desc(CROPDMG), desc(PROPDMG))
head(storm_damage_sum_econ4, 10)
## EVTYPE PROPDMG CROPDMG
## 1 HAIL 688693.38 579596.28
## 2 FLASH FLOOD 1420124.59 179200.46
## 3 FLOOD 899938.48 168037.88
## 4 TSTM WIND 1335965.61 109202.60
## 5 TORNADO 3212258.16 100018.52
## 6 THUNDERSTORM WIND 876844.17 66791.45
## 7 DROUGHT 4099.05 33898.62
## 8 THUNDERSTORM WINDS 446293.18 18684.93
## 9 HIGH WIND 324731.56 17283.21
## 10 HEAVY RAIN 50842.14 11122.80
The results show again tornados as leading cause for weather-inflicted property damage in the US at more than 3 million in damages, while hail is the most important contributor to crop damage at more than 500,000 in damages. the next 10 major weather reasons for property and crop damage are then listed for each table. We have further added bar plots of the total amount in property and crop damages for each weather event type with values listed in descending order in each case:
par(mfrow = c(1,2))
barplot(storm_damage_sum_econ3$PROPDMG[1:10], xlab="EVTYPE", col="darkblue")
barplot(storm_damage_sum_econ4$CROPDMG[1:10], xlab="EVTYPE", col="red")
Health impact and economic impacts by weather for each state: Below we simply repeated the above analysis but for each state individually and combined the analysis for property damage and crop damage while fatalities and injuries were investigated deparately.
storm_damage <- group_by(storm_damage, STATE, EVTYPE)
storm_damage_melt <- melt(storm_damage, id.vars = c("STATE", "EVTYPE"), measure.vars = c("PROPDMG", "CROPDMG"), na.rm = TRUE)
storm_damage_order <- group_by(storm_damage_melt, STATE, EVTYPE, variable)
storm_damage_sum <- ddply(storm_damage_order, c("STATE", "EVTYPE", "variable"), summarise, sum = sum(value, na.rm = TRUE))
storm_damage_sum <- arrange(storm_damage_sum, desc(sum))
head(storm_damage_sum, 10)
## STATE EVTYPE variable sum
## 1 TX TORNADO PROPDMG 283097.2
## 2 NE HAIL CROPDMG 201031.1
## 3 MS TORNADO PROPDMG 187840.9
## 4 AL TORNADO PROPDMG 167816.2
## 5 OK TORNADO PROPDMG 165167.9
## 6 FL TORNADO PROPDMG 159752.6
## 7 IA TORNADO PROPDMG 152142.8
## 8 GA TORNADO PROPDMG 151349.5
## 9 TX TSTM WIND PROPDMG 144959.0
## 10 KS TORNADO PROPDMG 143209.9
storm_fat_melt <- melt(storm_damage, id.vars = c("STATE", "EVTYPE"), measure.vars = c("FATALITIES"), na.rm = TRUE)
storm_fat_order <- group_by(storm_fat_melt, STATE, EVTYPE, variable)
storm_fat_sum <- ddply(storm_fat_order, c("STATE", "EVTYPE", "variable"), summarise, sum = sum(value, na.rm = TRUE))
storm_fat_sum <- arrange(storm_fat_sum, desc(sum))
head(storm_fat_sum, 10)
## STATE EVTYPE variable sum
## 1 IL HEAT FATALITIES 653
## 2 AL TORNADO FATALITIES 617
## 3 TX TORNADO FATALITIES 538
## 4 MS TORNADO FATALITIES 450
## 5 MO TORNADO FATALITIES 388
## 6 AR TORNADO FATALITIES 379
## 7 TN TORNADO FATALITIES 368
## 8 PA EXCESSIVE HEAT FATALITIES 359
## 9 IL EXCESSIVE HEAT FATALITIES 330
## 10 OK TORNADO FATALITIES 296
storm_inj_melt <- melt(storm_damage, id.vars = c("STATE", "EVTYPE"), measure.vars = c("INJURIES"), na.rm = TRUE)
storm_inj_order <- group_by(storm_inj_melt, STATE, EVTYPE, variable)
storm_inj_sum <- ddply(storm_inj_order, c("STATE", "EVTYPE", "variable"), summarise, sum = sum(value, na.rm = TRUE))
storm_inj_sum <- arrange(storm_inj_sum, desc(sum))
head(storm_inj_sum, 10)
## STATE EVTYPE variable sum
## 1 TX TORNADO INJURIES 8207
## 2 AL TORNADO INJURIES 7929
## 3 TX FLOOD INJURIES 6338
## 4 MS TORNADO INJURIES 6244
## 5 AR TORNADO INJURIES 5116
## 6 OK TORNADO INJURIES 4829
## 7 TN TORNADO INJURIES 4748
## 8 OH TORNADO INJURIES 4438
## 9 MO TORNADO INJURIES 4330
## 10 IN TORNADO INJURIES 4224