Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Loading Necessay Libraries

library(ggplot2)
library(knitr)
library(printr)
library(plyr)
library(dplyr)
library(lubridate)
library(gridExtra)
library(reshape2)
options(scipen = 1)  # Turn off scientific notations for numbers

Data Processing

The data for this project come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. Data can be downloaded from the following link :

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

We have downloaded and unzipped the data in the folder named “Data” in the project working directory. We now load the data to R using the read.csv method.

if (!"data" %in% ls()) {
        data <- read.csv("Data/repdata_data_StormData.csv", sep=",")
        }

In order to get a better sense of the data, we first look at the structure of the dataset :

str(data)
## 'data.frame':    902297 obs. of  39 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ date      : POSIXct, format: "1950-04-18" "1950-04-18" ...
##  $ year      : num  1950 1950 1951 1951 1951 ...

Therefore, there are 902297 row and 39 in the dataset. The class of teh variables as well as the first few records were also present the output.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

In order to verify this fact, we create a histogram of year of events.

data$date <- mdy_hms(data$BGN_DATE)     # change begin date to a date format 

data$year<- year(data$date)        # extract year from the begin date

ggplot (data, aes(year))+
        geom_histogram(binwidth=3, alpha=.5, position="identity", fill="blue", col="blue")+
        ggtitle ("Histogram of the start year of events")+
        xlab("Year")+
        ylab("Frequency")

Histogram clearly indicates that the number of events in the dataset increase after 1990.

Result

In this section, we separately explore the impact of the storms and severe weather events on population health and economy. Specifically we are interested to know which type of events are most harmful with respect to the population health,and economy.

Analysis of Impact On Population Health

In this section we explore that across the United States, which types of events are most harmful with respect to population health.

To answer this question we focus on the total number of FATALITIES and INJURIES caused by the storm and events. Therefore we create a frequency table of event type and the total number of fatalities and injuries.

Due to the large number of event types we focus only on the top 15 most harmful event in relation to fatality and injury.

by_event <- group_by(data, EVTYPE)
summary_data <- summarise (by_event,     
                           Num_event = n(),
                           Total_injury = sum(INJURIES),
                           Total_fatality = sum(FATALITIES),
                           Total_injury_fatality = sum(INJURIES+FATALITIES)
                           )
arranged_data <- arrange(summary_data, desc(Total_injury, Total_fatality, Total_injury_fatality))
subset25 <- slice(arranged_data, 1:15)
subset25
EVTYPE Num_event Total_injury Total_fatality Total_injury_fatality
TORNADO 60652 91346 5633 96979
TSTM WIND 219940 6957 504 7461
FLOOD 25326 6789 470 7259
EXCESSIVE HEAT 1678 6525 1903 8428
LIGHTNING 15754 5230 816 6046
HEAT 767 2100 937 3037
ICE STORM 2006 1975 89 2064
FLASH FLOOD 54277 1777 978 2755
THUNDERSTORM WIND 82563 1488 133 1621
HAIL 288661 1361 15 1376
WINTER STORM 11433 1321 206 1527
HURRICANE/TYPHOON 88 1275 64 1339
HIGH WIND 20212 1137 248 1385
HEAVY SNOW 15708 1021 127 1148
WILDFIRE 2761 911 75 986

\(\;\)

The table is arranged in descending order of total injury, total fatality and total injury and fatality combined. The table clearly shows that TORNADO is the most harmful event. After that TSTM WIND and FLOOD are in the second and third position.

Bar plot below is also created to visualize the frequency of the total injury and fatality by event type for the top 15 events. Again we can observe that TORNADO is clearly indicated as the most hamful event.

subset25 <- transform(subset25, EVTYPE = reorder(EVTYPE, Total_injury_fatality))

ggplot(data=subset25, aes(x=EVTYPE, y=Total_injury, fill=Total_injury_fatality)) +
        geom_bar(stat="identity", position='dodge', size=0.1, alpha=.85)+
        theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+
        geom_text(aes(label=Total_injury_fatality), hjust=-0.05 ,position=position_dodge(.9), size=3)+
        theme_bw( base_family = "Avenir", base_size = 12)+ 
        labs(title = "Top 15 most harmful events to public health") + 
        labs(x="Event type", y="Total injury and fatality" )+
        coord_flip()+
        theme(legend.position="none")

Analysis of Impact On Economy

In this section of the report we explore that across the United States, which types of events have the greatest economic consequences.

To answer this question we focus on the number of property damage (PROPDMG) and crop damage (CROPDMG) in the dataset. However there is a multiplier associated with each observation of property and crop damage. Specifically CROPDMGEXP and PROPDMGEXP record the value of multipliers where we have hundred (H), Thousand (K), Million (M), Billion (B) and some Na values (+,?,…). Therefore, in order to find a meaningful value for the total damage, we need to first code the multipliers to a numerical value.

First we subset the appropriate variables and take a look at the levels of PROPDMGEXP and CROPDMGEXP variables.

subData <- select(data, EVTYPE, PROPDMG,CROPDMG,CROPDMGEXP,PROPDMGEXP)   # selecting a subset of variables

levels <- union (levels(subData$CROPDMGEXP), levels(subData$PROPDMGEXP)) %>%
        sort(decreasing = TRUE)
levels
##  [1] "M" "m" "K" "k" "H" "h" "B" "8" "7" "6" "5" "4" "3" "2" "1" "0" "+"
## [18] "?" "-" ""

The following function is written to to code the factor values to numeric values. We decided to code NA values as 1.

recodeFunction <- function(x){
        
        result <- toupper(x)
        result <- mapvalues(result, 
                            from = c("B", "M", "K", "H" , "1", "2", "3", "4", "5", "6", "7", "8", "0", "+" ,"?", "-" ,""), 
                            to = c(1000000000, 1000000, 1000, 100, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1, 1, 1, 1, 1))
        
        result <- as.numeric(result)
        
        #the value of NAs will be coded as zero
        #result[is.na(result)] <- 0
        
        result
}

Now we apply the function to create the total property damage and total crop damage by event type. We create a table to list the top 15 most harmful events with respect to the total cost of crop damage, total cost of property damage and the total cost of crop and property damage combined.

subData <- mutate (subData, PROPIndex= recodeFunction(PROPDMGEXP), CROPIndex=recodeFunction(CROPDMGEXP) )
## The following `from` values were not present in `x`: H, 1, 3, 4, 5, 6, 7, 8, +, -
subData <- mutate (subData, Prop_damage =PROPIndex*PROPDMG , Crop_damage=CROPIndex*CROPDMG)

summary_data <- group_by(subData,EVTYPE )%>%
                summarise (Num_event = n(),
                           Total_property_damage = sum(Prop_damage),
                           Total_crop_damage = sum(Crop_damage),
                           Total_crop_property_damage = sum(Prop_damage+Prop_damage)
                           )%>%
                arrange(desc(Total_property_damage,Total_crop_damage,Total_crop_property_damage )) %>%
                slice(1:15)

# top_n(20, Total_crop_property_damage)

kable(summary_data, align = "c")
EVTYPE Num_event Total_property_damage Total_crop_damage Total_crop_property_damage
FLOOD 25326 144657709807 5661968450 289315419614
HURRICANE/TYPHOON 88 69305840000 2607872800 138611680000
TORNADO 60652 56947380676 414953270 113894761353
STORM SURGE 261 43323536000 5000 86647072000
FLASH FLOOD 54277 16822673978 1421317100 33645347957
HAIL 288661 15735267513 3025954473 31470535025
HURRICANE 174 11868319010 2741910000 23736638020
TROPICAL STORM 690 7703890550 678346000 15407781100
WINTER STORM 11433 6688497251 26944000 13376994502
HIGH WIND 20212 5270046295 638571300 10540092590
RIVER FLOOD 173 5118945500 5029459000 10237891000
WILDFIRE 2761 4765114000 295472800 9530228000
STORM SURGE/TIDE 148 4641188000 850000 9282376000
TSTM WIND 219940 4484928495 554007350 8969856990
ICE STORM 2006 3944927860 5022113500 7889855720

\(\;\)

Below we create a bar plot to better visualize the top 15 most harmful events with respect to the cost of crop damage and property damage separately.

barplot_data <- select(summary_data, EVTYPE, Total_property_damage, Total_crop_damage)

# fisrt we need to reshape the data
barplot_data <- melt(barplot_data, id="EVTYPE", variable.name="Damage", value.name="Total_cost")

ggplot(barplot_data, aes(x=reorder(EVTYPE, Total_cost), y=Total_cost, fill=Damage)) +
        geom_bar(stat="identity", position='dodge', size=0.1, alpha=.85)+
        theme_bw( base_family = "Avenir", base_size = 12)+ 
        scale_fill_manual(values=c("#9ECAE1", "#3282BD")) +
        theme(axis.text.x=element_text(angle=45,hjust=1,vjust=1))+
        labs(title = "Top 15 harmful events with higher economic impact") + 
        labs(x="Event type", y="Cost of damage " )

\(\;\)

Therefore, as we can clearly see in the bar plot that the FLOOD is the most harmful event with respect to the property damage. With respect to the crop damage, the FLOOD has still the highest cost, however the RIVER FLOOD and ICE STORM are in the next positions with very close numbers.

We also create a bar plot to visualize the top 15 most harmful events with respect to the total crop and property cost combined.

ggplot(summary_data, aes(x=reorder(EVTYPE, Total_crop_property_damage ), y=Total_crop_property_damage , fill=Total_crop_property_damage )) +
        geom_bar(stat="identity", position='dodge', size=0.1, alpha=.85)+
        theme_bw( base_family = "Avenir", base_size = 12)+ 
        theme(axis.text.x=element_text(angle=90,hjust=1,vjust=1))+
        geom_text(aes(label=Total_crop_property_damage), hjust=-0.05 ,position=position_dodge(.9), size=3)+
        labs(title = "Top 15 harmful events with higher economic impact ") + 
        labs(x="Event type", y="Total cost of crop and property damage " )+
        coord_flip()+
        theme(legend.position="none")

\(\;\)

As we can see in the bar plot of the total crop and property damage, FLOOD is the weather event with highest economic impact on crop and property costing a total of 289315419614 US Dollars .