SYNOPSIS

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many events result in fatalities, injuries and property damage, and preventing such outcomes is a key concern for governments. This analysis explores the NOAA Storm Database to understand the impact of weather events on public health and the economy in the United States. To that end, we analyse the number of fatalities and injuries for each event type on the one hand, and the overall economic impact in terms of property and crop damage on the other. The NOAA dataset is publicly available and covers weather events in the United States between 1950 and 2011. The whole analysis was performed in the R programming language.


DATA PROCESSING

Load libraries

The following R libraries are required in order to perform the analysis.

library(R.utils) #to extract the NOAA archive.
library(dplyr)   #to manipulate data.
library(ggplot2)   #to plot data.


Setting local directory & downloading files for the analysis

The NOAA dataset is publicly available. It can be downloaded manually from the URL used in the code below, or by running that code, which saves it in the working directory.

#Set working directory for the analysis
setwd("C:/Users/Marco/Dropbox/Coursera/Data Science Specialization - JHU/Reproducible Research/week4")

#Download the dataset, which is compressed with the bzip2 algorithm. Once downloaded, we use the bunzip2 function from the R.utils library to extract the .csv file.
if(!file.exists("storm_data.csv")) {
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","storm_data.csv.bz2")
  bunzip2("storm_data.csv.bz2", "storm_data.csv", overwrite=TRUE, remove=FALSE)
}

#Download related documentation for the data (mode="wb" forces a binary transfer so the PDFs stay intact on Windows)
if(!file.exists("nsw_storm_sata_docuentation.pdf")) {
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf","nsw_storm_sata_docuentation.pdf", mode="wb")
}

if(!file.exists("storm_events_faq.pdf")) {
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf","storm_events_faq.pdf", mode="wb")
}

#Check that both data and documentation are available in the local folder
dir()
## [1] "Ass2_Storm.Rmd"                  "nsw_storm_sata_docuentation.pdf"
## [3] "proj2_storm.html"                "proj2_storm.Rmd"                
## [5] "storm_data.csv"                  "storm_data.csv.bz2"             
## [7] "storm_events_faq.pdf"
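
As an aside to the download step above, the explicit decompression is optional: base R's read.csv() opens bzip2-compressed files transparently. The sketch below demonstrates this on a small temporary file with made-up rows (the file and its values are illustrative assumptions, not taken from the NOAA data).

```r
# Write a tiny CSV through a bzip2 connection (made-up rows, for illustration only)
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,0", "HAIL,1"), con)
close(con)

# read.csv() detects the compression and reads the file without a bunzip2() step
toy <- read.csv(tmp)
print(toy)
```

Keeping the bunzip2() step, as this report does, has the advantage of leaving a plain .csv on disk for inspection with other tools.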


Read the dataset and explore it

With the code below we import the dataset into the R environment. The dataset consists of 902297 rows and 37 columns. Some of the variables are not necessary for the goal of our analysis, hence we will drop them in the next step. We can also see that the data covers 985 distinct event types (values of EVTYPE).

if(!exists("storm_data")) {  
storm_data<-read.csv("storm_data.csv")
}

dim(storm_data)
## [1] 902297     37
names(storm_data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
head(storm_data)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6
#There are 985 distinct event types in the dataset.
length(unique(storm_data$EVTYPE))
## [1] 985
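
The count above is based on the raw EVTYPE labels, so any case or whitespace variants of the same label would be counted separately. The toy example below uses made-up labels (an assumption for illustration, not values checked against the dataset) to show how such variants inflate a distinct count. This report does not apply any such normalisation.

```r
# Made-up labels: three variants of one event type plus a second type
labels <- c("TORNADO", "Tornado", " TORNADO", "HAIL")

# Raw distinct count treats every variant as its own event type
n_raw <- length(unique(labels))

# Trimming whitespace and upper-casing collapses the variants
n_clean <- length(unique(toupper(trimws(labels))))

print(c(raw = n_raw, clean = n_clean))
```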


Subset only necessary variables

After reading the documentation, we conclude that we can reduce the size of the dataset by keeping only the variables needed for this analysis.

  • To evaluate the impact on public health we will use the variables “FATALITIES” and “INJURIES”.
  • To evaluate economic consequences we will analyse property and crop damage using the following variables: “PROPDMG”, “PROPDMGEXP”, “CROPDMG”, “CROPDMGEXP”.
  • To identify the type of weather event we will use the variable “EVTYPE”.

Below we subset our variables of interest.

sub_storm_data<- storm_data %>%
      select(EVTYPE,FATALITIES,INJURIES,PROPDMG:CROPDMGEXP)
#Let's also put the column names in lower case.
names(sub_storm_data)<-tolower(names(sub_storm_data))
names(sub_storm_data)
## [1] "evtype"     "fatalities" "injuries"   "propdmg"    "propdmgexp"
## [6] "cropdmg"    "cropdmgexp"
summary(sub_storm_data)
##                evtype         fatalities          injuries        
##  HAIL             :288661   Min.   :  0.0000   Min.   :   0.0000  
##  TSTM WIND        :219940   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  THUNDERSTORM WIND: 82563   Median :  0.0000   Median :   0.0000  
##  TORNADO          : 60652   Mean   :  0.0168   Mean   :   0.1557  
##  FLASH FLOOD      : 54277   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  FLOOD            : 25326   Max.   :583.0000   Max.   :1700.0000  
##  (Other)          :170878                                         
##     propdmg          propdmgexp        cropdmg          cropdmgexp    
##  Min.   :   0.00          :465934   Min.   :  0.000          :618413  
##  1st Qu.:   0.00   K      :424665   1st Qu.:  0.000   K      :281832  
##  Median :   0.00   M      : 11330   Median :  0.000   M      :  1994  
##  Mean   :  12.06   0      :   216   Mean   :  1.527   k      :    21  
##  3rd Qu.:   0.50   B      :    40   3rd Qu.:  0.000   0      :    19  
##  Max.   :5000.00   5      :    28   Max.   :990.000   B      :     9  
##                    (Other):    84                     (Other):     9
sum(is.na(sub_storm_data))
## [1] 0
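#The dataset contains no NA values, hence we can use all observations for our analysis. Note that the blank exponent entries shown in the summary are empty strings, not NA, so they are not counted here.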
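
One caveat about the check above: is.na() counts only true NA values, while blank entries such as the empty level of the exponent columns in the summary are ordinary empty strings. The following sketch, on a made-up vector, shows the difference.

```r
# Made-up exponent values: one blank string and one true NA
exponent <- c("K", "", "M", NA)

# is.na() only sees the true NA
n_missing <- sum(is.na(exponent))

# Blank strings have to be counted explicitly
n_blank <- sum(exponent == "", na.rm = TRUE)

print(c(missing = n_missing, blank = n_blank))
```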


Clean property and crop variables

As stated in the documentation (page 12), the property and crop damage values need to be scaled using the exponent variables “propdmgexp” and “cropdmgexp”: “Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.”
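
As a minimal worked example of this rule, the sketch below scales a few made-up damage values. The data frame and its column names (dmg, dmgexp) are hypothetical; the “H” = 100 entry follows the lookup used later in this report rather than the sentence quoted above, and unrecognised codes are set to 0 as in the report's code.

```r
# Multipliers for the documented magnitude codes ("H" follows this report's own lookup)
multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Made-up damage records; "?" stands for an undocumented code
toy <- data.frame(dmg = c(25, 2.5, 1.2, 10),
                  dmgexp = c("K", "M", "B", "?"),
                  stringsAsFactors = FALSE)

# Look up each code; unknown codes contribute a cost of 0
key <- toupper(toy$dmgexp)
toy$cost <- ifelse(key %in% names(multiplier),
                   toy$dmg * unname(multiplier[key]),
                   0)
print(toy)
```

For instance, a value of 25 with exponent “K” becomes 25000 dollars.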

First we check the possible values of the exponent variables and clean them where necessary.

unique(sub_storm_data$propdmgexp)
##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(sub_storm_data$cropdmgexp)
## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M
#We can see that there are other values besides the ones stated in the documentation. Since we have no information about how to decode the unknown values, we keep only the rows where at least one of the two exponents is explicitly documented, that is "H", "K", "M" or "B".
fil_sub_storm_data <- filter(sub_storm_data, propdmgexp %in% c("H","K","M","B") | cropdmgexp %in% c("H","K","M","B"))

#Let's now create two new variables to measure the cost of property and crop damage, based on the exponent values.
#The lookup vector is named exp_multiplier so that it does not shadow the name of the base function exp().
exp_multiplier <- c('H' = 100, 'K' = 1000, 'M' = 1000000, 'B' = 1000000000)
fil_sub_storm_data$CropCost <- ifelse(toupper(fil_sub_storm_data$cropdmgexp) %in% names(exp_multiplier), fil_sub_storm_data$cropdmg * exp_multiplier[toupper(fil_sub_storm_data$cropdmgexp)], 0)

fil_sub_storm_data$PropCost <- ifelse(toupper(fil_sub_storm_data$propdmgexp) %in% names(exp_multiplier), fil_sub_storm_data$propdmg * exp_multiplier[toupper(fil_sub_storm_data$propdmgexp)], 0)

#Finally we sum property and crop cost to calculate the total economic impact.
fil_sub_storm_data <- fil_sub_storm_data %>%
      mutate(economic_impact = PropCost + CropCost)

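To be consistent throughout the rest of the analysis, we will use the new dataset named “fil_sub_storm_data” to answer both questions about the public health and economic impact of weather events. Note that, because of the filter above, the fatality and injury totals only include events with a documented damage exponent.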


Summarize the data by our variables of interest

Below we summarize fatalities, injuries and economic impact by event type. We will use this summarized dataset to show the results in the next section.

summarized_df<-fil_sub_storm_data %>%
      group_by(evtype) %>%
            summarize(tot_fatalities=sum(fatalities),tot_injuries=sum(injuries),
                      tot_economic_impact=sum(economic_impact))
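
The grouped totals above can be cross-checked with base R. The sketch below applies aggregate() to a small made-up data frame (the rows are illustrative assumptions, not NOAA records) and produces one row of sums per event type, which is the same shape as summarized_df.

```r
# Made-up event records for illustration
toy <- data.frame(evtype = c("TORNADO", "FLOOD", "TORNADO", "HAIL"),
                  fatalities = c(1, 0, 2, 0),
                  injuries = c(15, 3, 2, 0),
                  economic_impact = c(25000, 1e6, 2500, 500),
                  stringsAsFactors = FALSE)

# Sum every measure within each event type
totals <- aggregate(cbind(fatalities, injuries, economic_impact) ~ evtype,
                    data = toy, FUN = sum)
print(totals)
```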


RESULTS

In the plots that follow, we show the 15 most severe event types in the US for fatalities, injuries and economic impact.

Plot fatalities

plot_fatalities <- arrange(summarized_df,desc(tot_fatalities)) %>%
      head(15) %>%
            ggplot(aes(x=reorder(evtype,tot_fatalities), y=tot_fatalities)) +
            geom_bar(fill="red",stat="identity")  + 
            coord_flip() + 
            ylab("Total number of fatalities") + xlab("Event") +
            ggtitle("Most severe weather events for public health - fatalities") 

plot_fatalities

Plot injuries

plot_injuries <- arrange(summarized_df,desc(tot_injuries)) %>%
      head(15) %>%
            ggplot(aes(x=reorder(evtype,tot_injuries), y=tot_injuries)) +
            geom_bar(fill="orange",stat="identity")  + 
            coord_flip() + 
            ylab("Total number of injuries") + xlab("Event") +
            ggtitle("Most severe weather events for public health - injuries") 
plot_injuries 

Plot economic impact

plot_economic_impact <- arrange(summarized_df,desc(tot_economic_impact)) %>%
      head(15) %>%
            ggplot(aes(x=reorder(evtype,tot_economic_impact), y=tot_economic_impact)) +
            geom_bar(fill="black",stat="identity")  + 
            coord_flip() + 
            ylab("Total economic impact $") + xlab("Event") +
            ggtitle("Most severe weather events for the economy") 
plot_economic_impact 
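
To read exact values rather than bar lengths, the same ranking can be printed as a table. The sketch below shows the idea in base R on made-up totals; the data frame toy_totals and the cutoff top_n are hypothetical stand-ins for summarized_df and the top 15 used in the plots.

```r
# Made-up totals per event type, for illustration only
toy_totals <- data.frame(evtype = c("FLOOD", "HAIL", "TORNADO", "HEAT"),
                         tot_fatalities = c(5, 0, 12, 9),
                         stringsAsFactors = FALSE)

# Sort in decreasing order and keep the first top_n rows
top_n <- 3
ranked <- toy_totals[order(toy_totals$tot_fatalities, decreasing = TRUE), ]
top <- head(ranked, top_n)
print(top)
```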


Tornadoes are by far the most harmful weather event for population health, both in terms of fatalities and injuries caused.

In terms of economic damage, the results show that floods have the greatest economic consequences for the US economy, followed by hurricane/typhoon and tornado events.