Synopsis

In this report the author aims to describe the most harmful weather events with regards to health and economic consequence in the united states between the years 1996 and 2011. To identify harmful weather events the author obtained and explored the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. Though a longer time series data was available, it was not complete and the research had to be limited to the time frame stated. From this data it was found out that tornado is the weather event that is most harmful for overall health, while Excessive Heat is the cause of most fatalities. On the other hand flooding is the weather event type with the greatest economic consequence.

Data

Before proceeding with any of the analysis it is vital that all needed packages and their dependencies should be loaded. The following code silently loads the necessary packages.

NeededPackages <- c("lubridate", "dplyr", "ggplot2", "pdftools", "tesseract", "stringr", "ggpubr")
lapply(NeededPackages, require, character.only = TRUE)

The data for this assignment comes in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. It can be download from the course web site. The code below downloads and reads in the data to an object called Data.

urlData <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
#download.file(urlData, destfile = "Data.csv.bz2")
Data <- read.csv(file = "Data.csv.bz2", header = T, sep = ",")

Documentation

There is also some documentation of the database available. Here it can be found how some of the variables are constructed/defined.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

National Weather Service Storm Data Documentation

urldoc <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf"
#download.file(urldoc, "NOAA_Data_Documentation.pdf", method = "curl")

National Climatic Data Center Storm Events FAQ

urlFAQ <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf"
#download.file(urlFAQ, "NOAA_Data_FAQ.pdf", method = "curl")

Research Questions

The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. The database above is used to answer the questions below and the code for the entire analysis is provided for reproducibility.

The data analysis address the following questions:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

Data Processing

To answer the research questions set forth above, the analysis has to

Determine the different types of event.
Identify measures of harm with respect to health
Identify measures of economic harm.

1. Determine the different types of events.

The need for an important data processing to determine the different types of events, steams from the disparities between the 48 official types of events and the 985 unique events types in the data set. Essentially, the disparities come from typing error in the data set. In this part of the data processing, the full official event type list is first extracted from the documentation file. The code below processes the documentation and reports the full type of events.

# Read the whole PDF
EVTYPE_names <- pdf_text(pdf = "NOAA_Data_Documentation.pdf")

# Extract the data for page 6
EVTYPE_names <- EVTYPE_names[6]

# Split the page data by row
text2 <- strsplit(EVTYPE_names, "\r")

col1 <- NULL
col3 <- NULL
col4 <- NULL
for(i in 9:32){
  text <- text2[[1]][i]
  col1[i] <- substring(word(text, start = 1L, end =1L, sep = fixed("                ")), 2)
  col3[i] <- word(text, start = 3L, end =3L, sep = fixed("              "))
  col4[i] <- word(text, start = 4L, end =4L, sep = fixed("              "))

}

col1 <- col1[c(9:32)]
col3 <- col3[!str_detect(col3, "  ", negate = FALSE) & !is.na(col4)&  nchar(col3) > 3]
col4 <- col4[!str_detect(col4, "  ", negate = FALSE) & !is.na(col4)& nchar(col4) > 3]

fulltype <- c(col1, col3, col4)

rm(EVTYPE_names,text, text2,col1, col3, col4, i)

fulltype

##  [1] "Astronomical Low Tide"    "Avalanche"               
##  [3] "Blizzard"                 "Coastal Flood"           
##  [5] "Cold/Wind Chill"          "Debris Flow"             
##  [7] "Dense Fog"                "Dense Smoke"             
##  [9] "Drought"                  "Dust Devil"              
## [11] "Dust Storm"               "Excessive Heat"          
## [13] "Extreme Cold/Wind Chill"  "Flash Flood"             
## [15] "Flood"                    "Frost/Freeze"            
## [17] "Funnel Cloud"             "Freezing Fog"            
## [19] "Hail"                     "Heat"                    
## [21] "Heavy Rain"               "Heavy Snow"              
## [23] "High Surf"                "High Wind"               
## [25] "Hurricane (Typhoon)"      "Lakeshore Flood"         
## [27] "Lightning"                "Sleet"                   
## [29] "Storm Surge/Tide"         "Tornado"                 
## [31] "Tropical Depression"      "Tropical Storm"          
## [33] "Ice Storm"                "Lake-Effect Snow"        
## [35] "Marine Hail"              "Marine High Wind"        
## [37] "Marine Strong Wind"       "Marine Thunderstorm Wind"
## [39] "Rip Current"              "Seiche"                  
## [41] "Strong Wind"              "Thunderstorm Wind"       
## [43] "Tsunami"                  "Volcanic Ash"            
## [45] "Waterspout"               "Wildfire"                
## [47] "Winter Storm"             "Winter Weather"

Once a vector of the full official event types has been created, it can used to partial match with the event types in the data set. However, given the size of the data set with 902297 observations, it would not be efficient to work directly with the data set. Furthermore, according to NOAA the data recording start from January 1950, however, only from January 1996 do they record all events type. Since the objective of this analysis is comparing the effects of different weather events, only the data points starting January 1996 will be evaluated. Additionally, from the documentation and the additional information provided at the end of this document, the variables that are needed to address the research question are a subset of the original Data.

The code below, first creates a smaller data set, Data1, which only has the date, EVTYPE, FATALITIES, INJURIES, CROPDMG, CROPDMGEXP, PROPDMG, and PROPDMGEXP variables. Then it subsets the observation starting 1996 and extracts the unique values of EVTYPE from the data set to be partial matched with the full type vector created above. Additionally, a copy of the EVTYPE variable (EVTPE1) is made, to be gradually replaced with the right event type from the official list. This way the end data frame has both the new event classification and the old version for comparison.

Data1 <- Data %>% 
  select(BGN_DATE, EVTYPE, FATALITIES, INJURIES, CROPDMG, CROPDMGEXP, PROPDMG, PROPDMGEXP)

Data1$BGN_DATE <- lubridate::mdy_hms(Data1$BGN_DATE)

Data1 <- Data1 %>%
         filter(BGN_DATE >= as.Date("1996-01-01"))

Data1$EVTYPE1 <- as.character(Data1$EVTYPE)

testNames <- as.character(levels(as.factor(Data1$EVTYPE)))

head(testNames)

## [1] "   HIGH SURF ADVISORY" " COASTAL FLOOD"        " FLASH FLOOD"         
## [4] " LIGHTNING"            " TSTM WIND"            " TSTM WIND (G45)"

Partial matching is now almost possible between the unique elements of EVTYPE in the testNames vector and official event types in fulltype vector. However, an initial inspection of the testNames reveals that few EVTYPEs could be removed at this point. The code below removes this from the testNames vector.

testNames <- testNames[testNames != "NONE"]
testNames <- testNames[testNames != "OTHER" & testNames != "Other"]

In the data processing that follows a partial match for the unique elements of EVTYPE is searched for in the official full event types. The maximum distance between the matches is initially set at 1 to allow as close a match as possible. In subsequent steps this is then relaxed to reprocess EVTYPEs for which no match was found.

matches <- NULL
for (i in 1:length(testNames)){
  matches[i] <- list(agrep(testNames[i], fulltype, ignore.case = T, max.distance = 1))

}

The partial matching results in one of three scenarios.

1. Where there is one match for the unique EVTYPE in the official full event types
1. Where there are multiple matches for the unique EVTYPE in the official full event types
1. Where there are no matches for the unique EVTYPE in the official full event types

# subset the list that have 1 match, no match, and multiple match


done <- NULL
index <- NULL
for (i in 1:length(matches)){
  index[i] <- i
  done[i] <- length(matches[[i]])
}

check1 <- as.data.frame(cbind(index, done))

perfect1 <- check1$index[check1$done == 1]
nomatch <- check1$index[check1$done == 0]
multiplematch <- check1$index[check1$done > 1]

As stated above the matching resulted in 90 matches with one event type from the official list, 43 matches with multiple event types from the official list and 849 matches that had no match in the official event type list.

The data processing in this steps addresses this different scenarios in the following manner. Where there is just one match, the matching is scrutinized and reported to confirm the matching is appropriate, then observations in Data 1 are updated accordingly.Where there are multiple matches, the matching is scrutinized and a decision as to the best match is made by the author. The final matching is then reported and observations in Data 1 are appropriately updated. EVTYPEs from the data set where there was no match from the official list are subsetted and passed on as a matching vector for the next level of matching with increased maximum distance.

One match

x <- list()

for (i in 1:length(testNames[perfect1])){
  Ps <- fulltype[matches[[perfect1[i]]]]
  x[[testNames[perfect1][i]]] <- Ps
  
}

head(x)

## $` COASTAL FLOOD`
## [1] "Coastal Flood"
## 
## $` FLASH FLOOD`
## [1] "Flash Flood"
## 
## $` LIGHTNING`
## [1] "Lightning"
## 
## $` WATERSPOUT`
## [1] "Waterspout"
## 
## $`ASTRONOMICAL LOW TIDE`
## [1] "Astronomical Low Tide"
## 
## $AVALANCE
## [1] "Avalanche"

While the code above reports the matching, the next lines of code update the EVTYPE1 in data 1.

xname <- NULL
xfulltype <- NULL

for (i in 1:length(x)){
  xname[i] <- names(x[i])
  xfulltype[i] <- x[[i]]
}


for (i in 1:length(xname)){
  Data1$EVTYPE1[Data1$EVTYPE1 == xname[i]] <- xfulltype[i]
}

Multiple matches

The next line of code extracts the EVTYPEs that had more than one match from the official event type list.

x <- list()

for (i in 1:length(testNames[multiplematch])){
    Ps <- fulltype[matches[[multiplematch[i]]]]
    x[[paste0(testNames[multiplematch][i])]] <- Ps
    
}

head(x)

## $` WIND`
## [1] "Cold/Wind Chill"          "Extreme Cold/Wind Chill" 
## [3] "High Wind"                "Marine High Wind"        
## [5] "Marine Strong Wind"       "Marine Thunderstorm Wind"
## [7] "Strong Wind"              "Thunderstorm Wind"       
## 
## $`?`
##  [1] "Astronomical Low Tide"    "Avalanche"               
##  [3] "Blizzard"                 "Coastal Flood"           
##  [5] "Cold/Wind Chill"          "Debris Flow"             
##  [7] "Dense Fog"                "Dense Smoke"             
##  [9] "Drought"                  "Dust Devil"              
## [11] "Dust Storm"               "Excessive Heat"          
## [13] "Extreme Cold/Wind Chill"  "Flash Flood"             
## [15] "Flood"                    "Frost/Freeze"            
## [17] "Funnel Cloud"             "Freezing Fog"            
## [19] "Hail"                     "Heat"                    
## [21] "Heavy Rain"               "Heavy Snow"              
## [23] "High Surf"                "High Wind"               
## [25] "Hurricane (Typhoon)"      "Lakeshore Flood"         
## [27] "Lightning"                "Sleet"                   
## [29] "Storm Surge/Tide"         "Tornado"                 
## [31] "Tropical Depression"      "Tropical Storm"          
## [33] "Ice Storm"                "Lake-Effect Snow"        
## [35] "Marine Hail"              "Marine High Wind"        
## [37] "Marine Strong Wind"       "Marine Thunderstorm Wind"
## [39] "Rip Current"              "Seiche"                  
## [41] "Strong Wind"              "Thunderstorm Wind"       
## [43] "Tsunami"                  "Volcanic Ash"            
## [45] "Waterspout"               "Wildfire"                
## [47] "Winter Storm"             "Winter Weather"          
## 
## $Cold
## [1] "Cold/Wind Chill"         "Extreme Cold/Wind Chill"
## 
## $COLD
## [1] "Cold/Wind Chill"         "Extreme Cold/Wind Chill"
## 
## $`COLD/WIND CHILL`
## [1] "Cold/Wind Chill"         "Extreme Cold/Wind Chill"
## 
## $`COLD/WINDS`
## [1] "Cold/Wind Chill"         "Extreme Cold/Wind Chill"

This was then carefully assessed to make the decision as to which event type was appropriate match. The next line of code makes the selection and updates EVTYPE1 from data 1.

# multiple matches - index of actual elements that have a match 
  
mmelementindex <- seq(1:43)[-c(1,2,10,15,38,39,40,42,43)]

x <- x[mmelementindex] 


# Selection index for multiple matches list

mmselectionindex <- c(1,1,1,1,3,3,3,1,1,3,2,1,1,1,7,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1)

# the multiple matches can now be resolved and the mmfinal list is used to update EVTYPE1 from data 1 in much the same way as before.

mmfinal <- list()
for (i in 1:length(x)){
  Ps <- x[[i]][mmselectionindex[i]]
  mmfinal[[names(x)[i]]] <- Ps
}

mmfinalname <- NULL
mmfinalfulltype <- NULL

for (i in 1:length(mmfinal)){
  mmfinalname[i] <- names(mmfinal[i])
  mmfinalfulltype[i] <- mmfinal[[i]]
}

for (i in 1:length(mmfinalname)){
  Data1$EVTYPE1[Data1$EVTYPE1 == mmfinalname[i]] <- mmfinalfulltype[i]
}

No matches

As reported above the EVTYPEs that had no match were passed to a level 2 analysis where the whole data processing from above was performed again with a relaxed matching criteria (maximum distance of 2). EVTYPEs with one and multiple matches were processed in the same way and the no matches where again passed to 3rd level analysis that repeated the whole step with maximum distance of 3. The next code performs the level 2 and level 3 analysis in one go for space reasons.

####### Level 2 Analysis
remain <- testNames[nomatch]

matches1 <- NULL
for (i in 1:length(remain)){
  matches1[i] <- list(agrep(remain[i], fulltype, ignore.case = T, max.distance = 2))
  
}

done <- NULL
index <- NULL
for (i in 1:length(matches1)){
  index[i] <- i
  done[i] <- length(matches1[[i]])
}

check1 <- as.data.frame(cbind(index, done))

perfect1 <- check1$index[check1$done == 1]
nomatch <- check1$index[check1$done == 0]
multiplematch <- check1$index[check1$done > 1]


# level 2 one match -look at the lists that have a perfect match just to see if it is done. 

x <- list()
for (i in 1:length(remain[perfect1])){
  Ps <- fulltype[matches1[[perfect1[i]]]]
  x[[paste0(remain[perfect1][i])]] <- Ps
  
}

# From a closer look at the perfect match list the following elements were 
# approved to be a close approximation of their match in the full type. 

x <- x[c(5,8,9,12 )]

# Now prepare the final matches for replacement.
xname <- NULL
xfulltype <- NULL
for (i in 1:length(x)){
  xname[i] <- names(x[i])
  xfulltype[i] <- x[[i]]
}

# changing the main thing
for (i in 1:length(xname)){
  Data1$EVTYPE1[Data1$EVTYPE1 == xname[i]] <- xfulltype[i]
}

######  multiple matches level 2
x <- list()
for (i in 1:length(remain[multiplematch])){
  Ps <- fulltype[matches1[[multiplematch[i]]]]
  x[[paste0(remain[multiplematch][i])]] <- Ps
}


# Based up a closer look at the multiple matches the following were judged to have a fairly close match.

x <- x[seq(1:24)[-c(1,2,5)]]

# Based upon a closer inspection of this multiple match list, the following selection were 
# made. 
mmselectionindex <- c(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 14, 2, 2)

# the multiple matches can now be resolved and the mmfinal list reports the final matching
mmfinal <- list()
for (i in 1:length(x)){
  Ps <- x[[i]][mmselectionindex[i]]
  mmfinal[[names(x)[i]]] <- Ps
}

# The final matching from the multiple matches can now be used to extend the fixed labels
mmfinalname <- NULL
mmfinalfulltype <- NULL
for (i in 1:length(mmfinal)){
  mmfinalname[i] <- names(mmfinal[i])
  mmfinalfulltype[i] <- mmfinal[[i]]
}
# Changing the main thing
for (i in 1:length(mmfinalname)){
  Data1$EVTYPE1[Data1$EVTYPE1 == mmfinalname[i]] <- mmfinalfulltype[i]
}


## level 3
remain1 <- remain[nomatch]
matches2 <- NULL
for (i in 1:length(remain1)){
  matches2[i] <- list(agrep(remain1[i], fulltype, ignore.case = T, max.distance = 3))
  
}
done <- NULL
index <- NULL
for (i in 1:length(matches2)){
  index[i] <- i
  done[i] <- length(matches2[[i]])
}

check1 <- as.data.frame(cbind(index, done))

perfect1 <- check1$index[check1$done == 1]
nomatch <- check1$index[check1$done == 0]
multiplematch <- check1$index[check1$done > 1]


# level 3 one match - look at the lists that have a perfect match just to see if it is correct. 

x <- list()
for (i in 1:length(remain1[perfect1])){
  Ps <- fulltype[matches2[[perfect1[i]]]]
  x[[paste0(remain1[perfect1][i])]] <- Ps
}

# From a closer look at the perfect match list the following elements were 
# approved to be a close approximation of their match in the full type. 

x <- x[c(2,3,4,7,19,20,21, 22)]

# Now prepare the final matches for replacement.

xname <- NULL
xfulltype <- NULL
for (i in 1:length(x)){
  xname[i] <- names(x[i])
  xfulltype[i] <- x[[i]]
}

# changing the main thing

for (i in 1:length(xname)){
  Data1$EVTYPE1[Data1$EVTYPE1 == xname[i]] <- xfulltype[i]
}

# level 3 - multiple matches

x <- list()
for (i in 1:length(remain1[multiplematch])){
  Ps <- fulltype[matches2[[multiplematch[i]]]]
  x[[paste0(remain1[multiplematch][i])]] <- Ps
}

# Based up a closer look at the multiple matches the following were judged to have a fairly close match.
x <- x[c(1,3,4,5,6,11,12,13,21,22,23,24,25,26,27,28,29,30,31,32,33)]


# Based upon a closer inspection of this multiple match list, the following selection were made.

mmselectionindex <- c(2,3,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,16)

# the multiple matches can now be resolved

mmfinal <- list()
for (i in 1:length(x)){
  Ps <- x[[i]][mmselectionindex[i]]
  mmfinal[[names(x)[i]]] <- Ps
}


# The final matching from the multiple matches can now be used to extend the fixed labels
mmfinalname <- NULL
mmfinalfulltype <- NULL
for (i in 1:length(mmfinal)){
  mmfinalname[i] <- names(mmfinal[i])
  mmfinalfulltype[i] <- mmfinal[[i]]
}

#Changing the main thing
for (i in 1:length(mmfinalname)){
  Data1$EVTYPE1[Data1$EVTYPE1 == mmfinalname[i]] <- mmfinalfulltype[i]
}

After three levels of matching a major chunk of the data has been cleaned. As a last step the EVTYPEs that were not matched in any of the three level matching were subsetted and aggregated to more manually match what can be further matched. The next line of code performs this analysis.

Datanot <- Data1 %>% 
            filter(!EVTYPE1 %in% fulltype) %>% 
            mutate(count = 1)

Datanot1 <- Datanot %>% 
            group_by(EVTYPE1) %>% 
            summarise(SC = sum(count)) %>% 
            arrange(desc(SC))

lastList <- Datanot1$EVTYPE1[1:20]

# Subet the last List to what can be replaced

lastList <- lastList[c(1,2, 3,4,9,10, 17)]


LLSindex <- c(38, 14, 46, 48, 23, 13, 14)

x <- list()

for (i in 1:length(lastList)){
    Ps <- fulltype[LLSindex[i]]
    x[[paste0(lastList[i])]] <- Ps
    
}
x

## $`MARINE TSTM WIND`
## [1] "Marine Thunderstorm Wind"
## 
## $`URBAN/SML STREAM FLD`
## [1] "Flash Flood"
## 
## $`WILD/FOREST FIRE`
## [1] "Wildfire"
## 
## $`WINTER WEATHER/MIX`
## [1] "Winter Weather"
## 
## $`HEAVY SURF/HIGH SURF`
## [1] "High Surf"
## 
## $`EXTREME WINDCHILL`
## [1] "Extreme Cold/Wind Chill"
## 
## $`RIVER FLOOD`
## [1] "Flash Flood"

Based upon the matching reported above, EVTYPE1 from data 1 is updated and the performance of the data processing is measured.

xname <- NULL
xfulltype <- NULL

for (i in 1:length(x)){
  xname[i] <- names(x[i])
  xfulltype[i] <- x[[i]]
}

DataTypeFinal <- Data1 %>%
          filter(EVTYPE1 %in% fulltype)

Cleaned <- scales::percent(1-(length(Data1$BGN_DATE)-length(DataTypeFinal$BGN_DATE))/length(Data1$BGN_DATE))

rm(list=setdiff(ls(), c("Data", "DataTypeFinal", "Cleaned")))

While all the EVTYPEs could not be matched 97.3% of the data has been cleaned. The cleaned data can now be passed to the next step so the measures of harm can be processed..

2. Identify measures of harm with respect to health

Based upon the documentations of the data that are provided below, the concept of most harmful with respect to population health will be quantified by looking at fatalities and injuries associated with different event types. Both this variables have been subsetted at the start of the data processing and do not need any further processing.

3. Identify measures of economic harm

The measures of economic harm are represented by the following four variables, CROPDMG, CROPDMGEXP, PROPDMG, and PROPDMGEXP.While the first two depict the extent of crop damage caused by different weather types the last two variables show the extent of property damaged caused by different weather events.

According to the extra assignment information provided below, the ‘CROPDMGEXP’ is the exponent values for ‘CROPDMG’ (crop damage). In the same way, ‘PROPDMGEXP’ is the exponent values for ‘PROPDMG’ (property damage). The variables should be used in combination to get the total values for crops and property damage. (B or b = Billion, M or m = Million, K or k = Thousand, H or h = Hundred, 0 is a 10 multiplier). There is more information about other values of the exponent values but this are incidents that do not occur in the subsetted data and would not be further processed for this work. Instead the next line of code creates a CropDamage and PropertyDamage measures as described above, rearranges the columns for better readability and saves the new complete data in a data frame called TidyData.

TidyData <- DataTypeFinal %>% 
                 mutate(CropDamage = ifelse(CROPDMGEXP == "B", CROPDMG*1000000000, 
                                    ifelse(CROPDMGEXP == "M", CROPDMG*1000000,
                                           ifelse(CROPDMGEXP == "K",CROPDMG*1000,CROPDMG)))) %>%
                 mutate(PropertyDamage = ifelse(PROPDMGEXP == "B", PROPDMG*1000000000, 
                                    ifelse(PROPDMGEXP == "M", PROPDMG*1000000, 
                                           ifelse(PROPDMGEXP == "K",PROPDMG*1000,
                                                  ifelse(PROPDMGEXP == "0",PROPDMG*10,PROPDMG))))) %>% 
                select(BGN_DATE, EVTYPE, EVTYPE1, FATALITIES, INJURIES, CROPDMG, CROPDMGEXP, CropDamage,
                       PROPDMG, PROPDMGEXP, PropertyDamage)

#rm(list=setdiff(ls(), "TidyData"))

head(TidyData)

##     BGN_DATE       EVTYPE           EVTYPE1 FATALITIES INJURIES CROPDMG
## 1 1996-01-06 WINTER STORM      Winter Storm          0        0      38
## 2 1996-01-11      TORNADO           Tornado          0        0       0
## 3 1996-01-11    TSTM WIND Thunderstorm Wind          0        0       0
## 4 1996-01-11    TSTM WIND Thunderstorm Wind          0        0       0
## 5 1996-01-11    TSTM WIND Thunderstorm Wind          0        0       0
## 6 1996-01-18         HAIL              Hail          0        0       0
##   CROPDMGEXP CropDamage PROPDMG PROPDMGEXP PropertyDamage
## 1          K      38000     380          K         380000
## 2                     0     100          K         100000
## 3                     0       3          K           3000
## 4                     0       5          K           5000
## 5                     0       2          K           2000
## 6                     0       0                         0

Results

Most harmful weather events with respect to population health

The processed data can now be used to answer the research questions stated above. With regards, to most harmful events with respect to population health three types of analysis are performed. That is to look at most harmful events with respect to fatalities, most harmful events with respect to injuries and finally most harmful events with respect to both.

# Total Fatalities Data set and Plot                    
DataTF <- TidyData %>% 
              select( EVTYPE1, FATALITIES) %>% 
              group_by(EVTYPE1) %>% 
              summarise(TotalFatalities = sum(FATALITIES)) %>% 
              arrange(desc(TotalFatalities)) %>% 
              top_n(10, TotalFatalities)

TF <- ggplot(data = DataTF, aes(x =reorder(EVTYPE1, -TotalFatalities), y = TotalFatalities)) +
  geom_bar(stat = 'identity', position = 'dodge')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.x=element_blank())+
  labs(title = "Total Fatalities By Weather Event Type", 
       caption = "Total fatalities that were caused by the top 10 most
       harmful weather event types in the united states")

# Total Injuries Data set and Plot
DataTI <- TidyData %>% 
  select( EVTYPE1, INJURIES) %>% 
  group_by(EVTYPE1) %>% 
  summarise(TotalInjuries = sum(INJURIES)) %>% 
  arrange(desc(TotalInjuries)) %>% 
  top_n(10, TotalInjuries)

TI <- ggplot(data = DataTI, aes(x =reorder(EVTYPE1, -TotalInjuries), y = TotalInjuries)) +
  geom_bar(stat = 'identity', position = 'dodge')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.x=element_blank())+
  labs(title = "Total Injuries By Weather Event Type", 
       caption = "Total injuries that were caused by the top 10 most
       harmful weather event types in the united states")


# Total Health harm Data set and Plot
DataHH <- TidyData %>% 
  select( EVTYPE1, INJURIES, FATALITIES) %>% 
  mutate(HealthHarm = INJURIES + FATALITIES) %>% 
  group_by(EVTYPE1) %>% 
  summarise(TotalHealthHarm = sum(HealthHarm)) %>% 
  arrange(desc(TotalHealthHarm)) %>% 
  top_n(10, TotalHealthHarm)

HH <- ggplot(data = DataHH, aes(x =reorder(EVTYPE1, -TotalHealthHarm), y = TotalHealthHarm)) +
  geom_bar(stat = 'identity', position = 'dodge')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.x=element_blank())+
  labs(title = "Tota Health Harm By Weather Event Type", 
       caption = "Total health harm (fatalities + injuries) that was caused by 
       the top 10 most harmful weather event types in the united states")

# Most harmful events with regards to health over time

HarmfulEvents <- DataHH$EVTYPE1

DataHHTime <- TidyData %>% 
  filter(EVTYPE1 %in% HarmfulEvents) %>% 
  mutate(Year = lubridate::year(BGN_DATE)) %>% 
  select(Year ,EVTYPE1, INJURIES, FATALITIES) %>% 
  mutate(HealthHarm = INJURIES + FATALITIES) %>% 
  group_by(Year, EVTYPE1) %>% 
  summarise(TotalHealthHarm = sum(HealthHarm)) %>% 
  arrange(desc(TotalHealthHarm))

HHTime <- ggplot(data = DataHHTime, aes(x =Year, y = TotalHealthHarm, color = EVTYPE1)) +
  geom_point()+
  geom_line()+
  labs(title = "Top 10 Harmful Events Over Time (Health)", 
       caption = "The total health harm (fatalities + injuries) that was caused in the united states
       by the top 10 most harmful weather event types over time.")+
  theme(legend.title = element_text(size = 10),
        legend.text = element_text(size = 8), 
        legend.position= c(0.5, 0.6))
# Final plot
ggarrange(TF, TI, HH, HHTime, ncol = 2, nrow = 2)

The time series visualization of the top 10 most harmful events shows that the impact of tornado has been consistently high. However, in the late 1990s flooding had a high total health harm, this was then considerably reduced in the following years. Extrapolating on this observation and noting the extremely high total health harm of tornado at the end of 2011, it can be expected that the total harm of tornado will subside in the coming years. This can of course be linked to policy response to high heath harm from a weather event. It can be expected that regulations and response strategies would be more suited to future tornado events so that they do not cause as high health harm as they did at 2011.

Most harmful weather events with the greatest economic consequences

Similar to the results above the analysis With regards to the greatest economic consequences, three types of analysis are performed. That is to look at most harmful events with respect to crop damage, most harmful events with respect to property damage and finally most harmful events with respect to both.

# Total Crop Damage Data set and Plot                    
DataCD <- TidyData %>% 
              select( EVTYPE1, CropDamage) %>% 
              group_by(EVTYPE1) %>% 
              summarise(TotalCropDamage = sum(CropDamage)) %>% 
              arrange(desc(TotalCropDamage)) %>% 
              top_n(10, TotalCropDamage)



TCD <- ggplot(data = DataCD, aes(x =reorder(EVTYPE1, -TotalCropDamage), y = TotalCropDamage)) +
  geom_bar(stat = 'identity', position = 'dodge')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.x=element_blank())+
  labs(title = "Tota Crop Damage By Weather Event Type", 
       caption = "Total crop damage that was caused by the top 10 most
       harmful weather event types in the united states")

# Total Property Damage Data set and Plot
DataPD <- TidyData %>% 
  select( EVTYPE1, PropertyDamage) %>% 
  group_by(EVTYPE1) %>% 
  summarise(TotalPropertyDamage = sum(PropertyDamage)) %>% 
  arrange(desc(TotalPropertyDamage)) %>% 
  top_n(10, TotalPropertyDamage)



TPD <- ggplot(data = DataPD, aes(x =reorder(EVTYPE1, -TotalPropertyDamage), y = TotalPropertyDamage)) +
  geom_bar(stat = 'identity', position = 'dodge')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.x=element_blank())+
  labs(title = "Total Property Damage By Weather Event Type", 
       caption = "Total property damage that was caused by the top 10 most
       harmful weather event types in the united states")


# Total Economic harm Data set and Plot
DataEH <- TidyData %>% 
  select( EVTYPE1, CropDamage, PropertyDamage) %>% 
  mutate(EconomicHarm = CropDamage + PropertyDamage) %>% 
  group_by(EVTYPE1) %>% 
  summarise(TotalEconomicHarm = sum(EconomicHarm)) %>% 
  arrange(desc(TotalEconomicHarm)) %>% 
  top_n(10, TotalEconomicHarm)



EH <- ggplot(data = DataEH, aes(x =reorder(EVTYPE1, -TotalEconomicHarm), y = TotalEconomicHarm)) +
  geom_bar(stat = 'identity', position = 'dodge')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.x=element_blank())+
  labs(title = "Total Economic Harm By Weather Event Type", 
       caption = "Total economic harm (crop damage + property damage) that was caused by 
       the top 10 most harmful weather event types in the united states")


# Most harmful events with regards to economic consequence over time

HarmfulEvents <- DataEH$EVTYPE1


DataEHTime <- TidyData %>% 
  filter(EVTYPE1 %in% HarmfulEvents) %>% 
  mutate(Year = lubridate::year(BGN_DATE)) %>% 
  select(Year ,EVTYPE1, CropDamage, PropertyDamage) %>% 
  mutate(EconomicHarm = CropDamage + PropertyDamage) %>% 
  group_by(Year, EVTYPE1) %>% 
  summarise(TotalEconomicHarm = sum(EconomicHarm)) %>% 
  arrange(desc(TotalEconomicHarm))


HETime <-
  ggplot(data = DataEHTime, aes(x =Year, y = TotalEconomicHarm, color = EVTYPE1)) +
  geom_point()+
  geom_line()+
  labs(title = "Top 10 Harmful Events Over Time (Economic)", 
       caption = "The total economic harm (crop damage + property damage) that was caused in the
       united states by the top 10 most harmful weather event types over time.")+
  theme(legend.title = element_text(size = 8),
        legend.text = element_text(size = 7), 
        legend.position= c(0.2, 0.6))



ggarrange(TCD, TPD, EH, HETime, ncol = 2, nrow =2)

Apendix

Additional information

Most variable names are self-explanatory but few are vague and difficult to find in the Strom Data Documentation or in FAQ, I’ll give more details for those ones.
The ‘CROPDMGEXP’ is the exponent values for ‘CROPDMG’ (crop damage). In the same way, ‘PROPDMGEXP’ is the exponent values for ‘PROPDMG’ (property damage). You should use both to get the total values for crops and property damage. (B or b = Billion, M or m = Million, K or k = Thousand, H or h = Hundred). The number from one to ten represent the power of ten (10^The number). The symbols “-”, “+” and “?” refers to less than, greater than and low certainty. You have the option to ignore these three symbols altogether. Also, there is a nice work entitled: How To Handle Exponent Value of PROPDMGEXP and CROPDMGEXP. It discusses that issue in more depth.
WFO = Weather Forecast Office, F = Fujita tornado intensity scale (F-Scale), MAG = Magnitude, or Strength, of the event. It is required by NOAA for Wind and Hail events if it is known. Wind Events are in KNOTS. Hail is in INCHES and TENTHS without the decimal (one and one-half are150).
STATE__ = State FIPS number, you can download a full list of FIPS in excel sheet from here. LENGTH = Path length of a tornado (in miles and tenths of miles). WIDTH = Path width of a tornado, in yards. According to NOAA documentation the length and width used for all tornadoes, including each member of families of tornadoes, or for all segments of multi-segmented tornadoes.
According to NOAA the data recording start from Jan. 1950. At that time they recorded one event type, tornado. They add more events gradually and only from Jan. 1996 they start recording all events type. Since our objective is comparing the effects of different weather events, do we need to include all years, even it has only single event type?
The official events type are 48. However, if you use ‘unique’ function on ‘EVTYPE’ column you will get near one thousand events! All that is just typo. The regular expression (‘regex’, grepl, regexpr, gregexpr) and ‘tolower’, and ‘toupper’ functions can be a great help here. But cleaning up all that mess can take several days. However, reducing the size of data first will make a great difference here. Subset your data first, think of years, total value, frequency … anything else that can be used to reduce the number of rows. Also, there are 37 variables, you do not need all of that for your analysis, remove as much as you can to make the analysis faster.
Speaking about data subset, can you rely on ‘quantile’ to get the top (some) percentage? or can we just get a random sample from the data for our analysis?
Back to ‘EVTYPE’ column, one way to fix the typo is loading the official list of storm event types and manually mapping all (or most) list of unique recorded events with a typo. This way is the most painful one. Another way is using ‘match’ (from ‘base’ package) or even better you may use ‘amatch’ function from ‘stringdist’ package for Approximate String Matching. In that case, you will need to experiments with ‘maxDist’ option to get good accuracy.

Harmful Weather Events in the United States

1996 - 2011

Fikir Worku Edossa

2/14/2020