This analysis takes data from the U.S. National Oceanic and Atmospheric Administration (NOAA) and explores severe weather types and how they affected the US from 1950 to 2011.
The main purpose of this analysis is to address two questions raised in the Coursera Reproducible Research Course Project Assignment: (1) Which types of weather events are most harmful with respect to population health? (2) Which types of weather events have the greatest economic consequences?
We answer both questions graphically: the health impacts are presented in two graphs, and the economic impacts in a single graph.
First we read only the first 1000 lines of the file, to get a brief insight into its structure and to decide which columns we actually need for further analysis.
df <- read.csv("./data/repdata_data_StormData.csv.bz2", nrows = 1000)
str(df)
## 'data.frame': 1000 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 449 levels "1/10/1972 0:00:00",..: 182 182 86 344 35 35 36 9 81 81 ...
## $ BGN_TIME : int 130 145 1600 900 1500 2000 100 900 2000 2000 ...
## $ TIME_ZONE : Factor w/ 2 levels "CST","EST": 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 68 levels "","AUTAUGA","BALDWIN",..: 50 3 30 46 23 40 6 63 64 30 ...
## $ STATE : Factor w/ 1 level "AL": 1 1 1 1 1 1 1 1 1 1 ...
## $ EVTYPE : Factor w/ 3 levels "HAIL","TORNADO",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : logi NA NA NA NA NA NA ...
## $ BGN_LOCATI: logi NA NA NA NA NA NA ...
## $ END_DATE : logi NA NA NA NA NA NA ...
## $ END_TIME : logi NA NA NA NA NA NA ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : logi NA NA NA NA NA NA ...
## $ END_LOCATI: logi NA NA NA NA NA NA ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 3 levels "","K","M": 2 2 2 2 2 2 2 2 2 2 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: logi NA NA NA NA NA NA ...
## $ WFO : logi NA NA NA NA NA NA ...
## $ STATEOFFIC: logi NA NA NA NA NA NA ...
## $ ZONENAMES : logi NA NA NA NA NA NA ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : logi NA NA NA NA NA NA ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
To answer the questions raised in the assignment we need only a few columns, so we import just those:
EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP.
cls <- c(rep("NULL", 7), "factor", rep("NULL", 14),
"numeric", "numeric", "numeric", "factor",
"numeric", "factor", rep("NULL", 9))
df_storm <- read.csv("./data/repdata_data_StormData.csv.bz2",
sep = ",", check.names = FALSE,
colClasses = cls)
str(df_storm)
## 'data.frame': 902297 obs. of 7 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
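As a side illustration of the column selection used in the import above: an entry of "NULL" in colClasses tells read.csv to skip that column entirely, while any other entry sets the class of the column that is kept. The following is a minimal sketch on a made-up four-column CSV (the column names ID and REMARKS and the values are hypothetical, chosen only for this example):

```r
# Hypothetical inline CSV with four columns; only the middle two are wanted
csv_text <- "ID,EVTYPE,FATALITIES,REMARKS
1,TORNADO,0,none
2,HAIL,1,none"

# "NULL" (as a string) drops a column; other entries fix the column class
toy <- read.csv(text = csv_text,
                colClasses = c("NULL", "factor", "numeric", "NULL"))

str(toy)  # only EVTYPE (factor) and FATALITIES (numeric) remain
```

Skipping columns at read time keeps memory use down, which matters for a file with roughly 900,000 rows and 37 columns.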
The str() output of df_storm shows that the factor levels are messy and do not match the official event types and multipliers.
For example, EVTYPE has 985 levels, while there are only 48 official event types. The multiplier columns also contain punctuation characters and digits, while the only acceptable levels are B, M, K and H.
First we clean up the event names: each name in the EVTYPE column is matched phonetically against the official event names, and the result is stored in a new 'events' column that contains only official event types.
# Create vector of official event names taken from the Storm Data preparation document
events_off <- c("ASTRONOMICAL LOW TIDE","AVALANCHE","BLIZZARD",
"COASTAL FLOOD","COLD/WIND CHILL","DEBRIS FLOW",
"DENSE FOG","DENSE SMOKE","DROUGHT","DUST DEVIL",
"DUST STORM","EXCESSIVE HEAT","EXTREME COLD/WIND CHILL",
"FLASH FLOOD","FLOOD","FREEZING FOG","FROST/FREEZE","FUNNEL CLOUD",
"HAIL","HEAT","HEAVY RAIN","HEAVY SNOW","HIGH SURF","HIGH WIND",
"HURRICANE/TYPHOON","ICE STORM","LAKESHORE FLOOD","LAKE-EFFECT SNOW",
"LIGHTNING","MARINE HAIL","MARINE HIGH WIND","MARINE STRONG WIND",
"MARINE THUNDERSTORM WIND","RIP CURRENT","SEICHE","SLEET",
"STORM TIDE","STRONG WIND","THUNDERSTORM WIND","TORNADO",
"TROPICAL DEPRESSION","TROPICAL STORM","TSUNAMI","VOLCANIC ASH",
"WATERSPOUT","WILDFIRE","WINTER STORM","WINTER WEATHER")
# Change the class of EVTYPE variable
df_storm$EVTYPE <- as.character(df_storm$EVTYPE)
# Create new variable 'events' by phonetically matching EVTYPE names against the
# official event names (Soundex: similar-sounding strings get equal codes)
library(dplyr)
library(stringdist)
df_storm <- df_storm %>%
  mutate(events = events_off[amatch(EVTYPE, events_off,
                                    method = "soundex", maxDist = 2)])
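One property of this matching step is worth checking on toy data. As the stringdist documentation describes it, the soundex distance is 0 when two strings share a Soundex code and 1 otherwise, and amatch returns the first entry with the smallest distance. Under that reading, maxDist = 2 never rejects a candidate, so an EVTYPE with no phonetic match would be assigned the first element of the lookup table. The sketch below uses a hypothetical two-entry table (official) and made-up labels (raw); maxDist = 0 is shown only for comparison and is not the setting used in this analysis.

```r
library(stringdist)

official <- c("HAIL", "TORNADO")   # toy stand-in for events_off
raw      <- c("TORNADOS", "XYZ")   # "XYZ" is a made-up label with no phonetic match

# maxDist = 2: the soundex distance never exceeds 1, so every label is
# assigned some entry, including labels that do not sound alike
loose <- official[amatch(raw, official, method = "soundex", maxDist = 2)]

# maxDist = 0: only identical Soundex codes count as a match; the rest become NA
strict <- official[amatch(raw, official, method = "soundex", maxDist = 0)]

print(data.frame(raw, loose, strict))
```

If the same holds for the full data set, totals reported for the first official event type ("ASTRONOMICAL LOW TIDE") may include records that simply had no phonetic match, so that row should be read with caution.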
levels(df_storm$PROPDMGEXP)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
levels(df_storm$CROPDMGEXP)
## [1] "" "?" "0" "2" "B" "k" "K" "m" "M"
Next we remove digits and punctuation characters from the multiplier levels and convert the remaining levels to uppercase.
levels(df_storm$PROPDMGEXP) <- gsub(pattern = "\\d|[[:punct:]]",
replacement = "", x = levels(df_storm$PROPDMGEXP))
levels(df_storm$PROPDMGEXP) <- toupper(levels(df_storm$PROPDMGEXP))
levels(df_storm$CROPDMGEXP) <- gsub(pattern = "\\d|[[:punct:]]",
replacement = "", x = levels(df_storm$CROPDMGEXP))
levels(df_storm$CROPDMGEXP) <- toupper(levels(df_storm$CROPDMGEXP))
The multiplier levels after cleaning look as follows:
levels(df_storm$PROPDMGEXP)
## [1] "" "B" "H" "K" "M"
levels(df_storm$CROPDMGEXP)
## [1] "" "B" "K" "M"
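The cleaning above relies on a feature of R factors: assigning a vector with repeated values to levels() merges the corresponding levels, so all digit and punctuation levels collapse into "" and "k"/"K" become one level. A small sketch on a made-up factor (the values in x are hypothetical and only mimic the kind of symbols found in the data):

```r
# Made-up multiplier symbols, with levels given explicitly so the order is fixed
x <- c("k", "K", "5", "?", "M", "m", "")
f <- factor(x, levels = c("", "?", "5", "k", "K", "m", "M"))

# Strip digits and punctuation from the level names, then uppercase them;
# duplicated level names are merged automatically on assignment
levels(f) <- toupper(gsub("\\d|[[:punct:]]", "", levels(f)))

levels(f)        # "" "K" "M"
as.character(f)  # "K" "K" "" "" "M" "M" ""
```

Working on the levels rather than on the 902,297 individual values means the regular expression runs only once per distinct symbol.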
Now we replace each multiplier symbol with the corresponding power of ten and calculate the total damage (property plus crop) for every record.
df_storm <- mutate(df_storm,
prop_exp = ifelse(PROPDMGEXP == 'B', 9,
ifelse(PROPDMGEXP == 'M', 6,
ifelse(PROPDMGEXP == 'K', 3,
ifelse(PROPDMGEXP == 'H', 2, 0) ) ) ),
crop_exp = ifelse(CROPDMGEXP == 'B', 9,
ifelse(CROPDMGEXP == 'M', 6,
ifelse(CROPDMGEXP == 'K', 3,
ifelse(CROPDMGEXP == 'H', 2, 0) ) ) ),
total = PROPDMG * (10^prop_exp) + CROPDMG * (10^crop_exp)
)
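The arithmetic in this step is value × 10^exponent, with exponent 9 for B, 6 for M, 3 for K, 2 for H and 0 for anything else. The sketch below reproduces that mapping on a few made-up records using a lookup with match() instead of nested ifelse(); the vector names (dmg_value, dmg_sym, symbols, powers) are hypothetical and not part of the analysis.

```r
# Made-up damage figures and their multiplier symbols
dmg_value <- c(25, 2.5, 10, 3)
dmg_sym   <- c("K", "M", "", "B")

# Lookup table: symbol -> power of ten
symbols <- c("B", "M", "K", "H")
powers  <- c(9, 6, 3, 2)

# match() gives the position of each symbol in the table (NA if absent);
# symbols without an entry get exponent 0, i.e. the value is left unchanged
exponent <- powers[match(dmg_sym, symbols)]
exponent[is.na(exponent)] <- 0

dollars <- dmg_value * 10^exponent
dollars  # 25000 2500000 10 3e+09
```

A lookup table scales more gracefully than nested ifelse() if further symbols ever need to be handled, but both approaches give the same result for the four symbols used here.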
fatal <- df_storm %>% group_by(events) %>%
summarise(sum_fatal = sum(FATALITIES)) %>%
arrange(desc(sum_fatal))
injur <- df_storm %>% group_by(events) %>%
summarise(sum_inj = sum(INJURIES)) %>%
arrange(desc(sum_inj))
The fatality and injury totals show that TORNADO is the event type most harmful to population health.
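The grouped sums behind the fatality and injury rankings can be illustrated on a toy table. This sketch uses base R (tapply and sort) rather than the dplyr pipeline of the analysis, and the four records in it are made up for the example:

```r
# Made-up records: two tornado events, one flood, one heat event
toy <- data.frame(events     = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
                  FATALITIES = c(5, 2, 3, 4))

# Sum fatalities per event type, then order from largest to smallest
totals <- sort(tapply(toy$FATALITIES, toy$events, sum), decreasing = TRUE)

totals  # TORNADO 8, HEAT 4, FLOOD 2
```

The event type at the top of the sorted result is the one reported as most harmful.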
total_tbl <- df_storm %>% group_by(events) %>%
summarise(sum_total = sum(total)) %>%
arrange(desc(sum_total))
head(total_tbl)
## Source: local data frame [6 x 2]
##
## events sum_total
## (chr) (dbl)
## 1 FLOOD 150319678257
## 2 HURRICANE/TYPHOON 90271472810
## 3 TORNADO 58959393949
## 4 STORM TIDE 48213406740
## 5 ASTRONOMICAL LOW TIDE 19738579082
## 6 HAIL 18758864016
These totals show that FLOOD is the event type with the greatest economic consequences.
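A single bar chart is one way to present this ranking graphically. The sketch below is not the original plotting code; it draws the six totals from the head(total_tbl) output with base R's barplot, and the title and axis label are assumptions of this example.

```r
# Totals copied from the head(total_tbl) output, converted to billions
top_damage <- c("FLOOD"                 = 150319678257,
                "HURRICANE/TYPHOON"     =  90271472810,
                "TORNADO"               =  58959393949,
                "STORM TIDE"            =  48213406740,
                "ASTRONOMICAL LOW TIDE" =  19738579082,
                "HAIL"                  =  18758864016) / 1e9

# las = 2 rotates the labels so the long event names stay readable
mids <- barplot(top_damage, las = 2, cex.names = 0.6,
                ylab = "Total damage (billions)",
                main = "Property and crop damage by event type")
```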