The data for this analysis were obtained from the U.S. National Weather Service. Documentation on the dataset can be found on the Storm Data Documentation website.
In this report, we study data on storm-related events in the USA between 1950 and November 2011. Because the event labels contain typos and multiple spellings for the same type of event, we perform a quick-and-dirty clean-up. We then show the 10 most harmful storm-related event types in three categories, summed over the full history of the dataset:
1. Number of fatalities
2. Number of injuries
3. Total economic damage (crops + property)
sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 14393)
##
## locale:
## [1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252
## [3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C
## [5] LC_TIME=Dutch_Netherlands.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] backports_1.0.5 magrittr_1.5 rprojroot_1.2 tools_3.3.2
## [5] htmltools_0.3.5 Rcpp_0.12.9 stringi_1.1.2 rmarkdown_1.3
## [9] knitr_1.15.1 stringr_1.1.0 digest_0.6.12 evaluate_0.10
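# Download the compressed dataset once; mode = "wb" keeps the binary file intact on Windows
if (!file.exists("weather.csv.bz2")) {
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                "weather.csv.bz2", mode = "wb")
}
# read.csv decompresses .bz2 files transparently
weather_raw <- read.csv("weather.csv.bz2", header = TRUE, sep = ",", stringsAsFactors = FALSE)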
After downloading the dataset, we load it into memory as weather_raw and inspect the data frame:
# Inspect dataset and check for missing values
str(weather_raw)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
colSums(is.na(weather_raw))
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME
## 0 0 0 0 0 0
## STATE EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## 0 0 0 0 0 0
## END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI
## 0 0 902297 0 0 0
## LENGTH WIDTH F MAG FATALITIES INJURIES
## 0 0 843563 0 0 0
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC
## 0 0 0 0 0 0
## ZONENAMES LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## 0 47 0 40 0 0
## REFNUM
## 0
Missing values occur solely in the columns COUNTYENDN, F, LATITUDE and LATITUDE_E. None of these variables is used in the analysis below.
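The statement above can also be checked programmatically rather than by reading the output. The following is a minimal sketch on a toy data frame, not the real dataset; the vector `used_cols` is an assumption based on the categories named in the introduction, not a list taken from the analysis itself.

```r
# Toy stand-in for weather_raw (hypothetical values, not the real data)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TSTM WIND", "HAIL"),
  FATALITIES = c(0, 1, 0),
  INJURIES   = c(15, 0, 2),
  F          = c(3L, NA, NA),
  stringsAsFactors = FALSE
)

# Columns assumed to be needed for the analysis (an assumption)
used_cols <- c("EVTYPE", "FATALITIES", "INJURIES")

# Count missing values per column and keep the names of columns that have any
na_counts <- colSums(is.na(toy))
cols_with_na <- names(na_counts[na_counts > 0])

# TRUE when none of the columns with missing values is needed
no_overlap <- length(intersect(cols_with_na, used_cols)) == 0
print(cols_with_na)
print(no_overlap)
```

On the real data, `toy` would be replaced by `weather_raw` and `used_cols` by the columns the analysis actually reads.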
head(grep("wind", unique(weather_raw$EVTYPE), ignore.case = TRUE, value = TRUE), 6)
## [1] "TSTM WIND" "HURRICANE OPAL/HIGH WINDS"
## [3] "THUNDERSTORM WINDS" "THUNDERSTORM WIND"
## [5] "HIGH WINDS" "THUNDERSTORM WINDS LIGHTNING"
The result of this grep call reveals a problem with the EVTYPE values in the dataset. Searching for "wind" (case-insensitively), we see that “TSTM WIND”, “THUNDERSTORM WINDS” and “THUNDERSTORM WIND” refer to the same event type but are spelled differently. This inconsistency splits one event type across several labels and distorts any per-type totals, so we first try to clean it up.
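To illustrate why the inconsistent spelling matters for a ranking, here is a minimal sketch on a toy data frame with made-up fatality counts (not values from the dataset). The column name `EVTYPE_CLEAN` is introduced only for this illustration.

```r
# Toy data (hypothetical values): one event type under three spellings, plus hail
toy <- data.frame(
  EVTYPE     = c("TSTM WIND", "THUNDERSTORM WINDS", "THUNDERSTORM WIND", "HAIL"),
  FATALITIES = c(2, 3, 1, 4),
  stringsAsFactors = FALSE
)

# Totals per raw label: the thunderstorm wind fatalities are split three ways
raw_totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# Harmonise the labels: lowercase, expand "tstm", drop the plural "s"
toy$EVTYPE_CLEAN <- tolower(toy$EVTYPE)
toy$EVTYPE_CLEAN <- gsub("tstm", "thunderstorm", toy$EVTYPE_CLEAN)
toy$EVTYPE_CLEAN <- gsub("winds", "wind", toy$EVTYPE_CLEAN)

# Totals per harmonised label: a single thunderstorm wind row remains
clean_totals <- aggregate(FATALITIES ~ EVTYPE_CLEAN, data = toy, FUN = sum)
print(raw_totals)
print(clean_totals)
```

In this toy example, hail leads when the raw labels are used, because no single thunderstorm spelling exceeds its total; once the labels are harmonised, thunderstorm wind moves to the top.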
To make grouping the events easier, we perform the following steps. This clean-up is far from ideal; a more precise analysis would require more time spent on it.
# Work on a copy so the raw data stays untouched
weather <- weather_raw
# Lowercase all EVTYPE labels to remove duplicates that differ only in capitalization
weather$EVTYPE <- tolower(weather$EVTYPE)
# Expand the abbreviation "tstm" and normalize plural and gerund forms
weather$EVTYPE <- gsub("tstm", "thunderstorm", weather$EVTYPE)
weather$EVTYPE <- gsub("winds", "wind", weather$EVTYPE)
weather$EVTYPE <- gsub("raining", "rain", weather$EVTYPE)
weather$EVTYPE <- gsub("floods|flooding", "flood", weather$EVTYPE)
# Collapse everything from "thu" up to the last "wind" (greedy match) into "thunderstorm wind"
weather$EVTYPE <- gsub("thu.*wind", "thunderstorm wind", weather$EVTYPE)
# Drop the qualifiers "record", "excessive" and "heavy" together with any trailing spaces
weather$EVTYPE <- gsub("record[ ]*", "", weather$EVTYPE)
weather$EVTYPE <- gsub("excessive[ ]*", "", weather$EVTYPE)
weather$EVTYPE <- gsub("heavy[ ]*", "", weather$EVTYPE)