Based on data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database

Synopsis:

The data analysis looks briefly at weather events impact to the U.S. population. It is based on data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database from 1950 to 2011. Two main areas are investigated. Firstly, which types of weather events are the most harmful to the United States population and secondly which types of events have the greatest economic impact.

Data Processing

Firstly the raw data was loaded in from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. Some supporting documentation was also downloaded to understand the most relevant variables. There are fewer events recorded for the earlier years (most likely due to a lack of good records) and as a result more recent years records should be considered more complete.

        library(dplyr)
        library(data.table)
        library(ggplot2)
        library(grDevices)
        library(RColorBrewer)
        library(plotly)

        ## Setup the working directory where the data is located
        setwd("C:/Users/paddy/Documents/Coursera/Assignments/Reproducable Research/Week 4/")
        #setwd("C:/Users/paddy/Documents/Coursera/Data Science/Course Material/5. Reproducable Research/Assignments/Week 2/Reproducible Research Course Project 1")
        ## Download the raw archive file from the web: 
        fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
        download.file(fileUrl, destfile = "./repdata%2Fdata%2FStormData.csv.bz2")
        file2Url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf"
        download.file(file2Url, destfile = "./repdata%2Fpeer2_doc%2Fpd01016005curr.pdf",mode = "wb")
        file3Url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf"
        download.file(file3Url, destfile = "./repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf",mode = "wb")
        dateDownloaded <- date()
        dateDownloaded
       # unzip(zipfile = "./repdata%2Fdata%2FStormData.csv.bz2")

As read.csv function can handle the .bz2 format it was read in directly.

        ## Load the test data
        data <- read.csv("./repdata%2Fdata%2FStormData.csv.bz2")
        ## Take a quick look at the data
        head(data)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6
  1. Some small post processing of the data was carried out to facilitate later analysis.
        ## Load data into dplyr
        data1 <- tbl_df(data)
        rm(data)

To investigate the most harmful weather events with respect to population health the FATALITIES variable was investigated and summarised per event type (EVTYPE).
This analysis was carried out using the dplyr package to determine the most impactful weather events.

        ## summarise FATALITIES by EVTYPE in dplyr
        pop_health<- 
               data1%>%
                select(BGN_DATE,BGN_TIME,EVTYPE,FATALITIES,INJURIES)%>%        
                filter(FATALITIES>0) %>%
                group_by(EVTYPE) %>%
                summarise(total_fatalities=sum(FATALITIES),total_injuries=sum(INJURIES))%>%
                arrange(desc(total_fatalities))

                TopN_fatalities<-head(pop_health,20)

These results were then plotted using ggplot2 package and can be seen in the results section.

      p1<-ggplot(TopN_fatalities,
               aes(x=reorder(EVTYPE,-total_fatalities), y=total_fatalities)) +
                geom_col(colour="steelblue2",fill="lightsteelblue1",width = 0.75) +
                xlab("Weather Event Type") +
                ylab(expression("Total Number of Fatalities")) +
                ggtitle("U.S. Top 20 Weather Events for Total Fatalities 1950-2011") +
                geom_text(aes(label = round(total_fatalities,0)),colour = "indianred3",nudge_y = 200,size=3)+
                theme_minimal()+
                theme(axis.text.y = element_text(size = 8,angle=0,color = "indianred3"))+
                theme(axis.text.x = element_text(size = 8,angle=90,color = "indianred3",vjust=0,hjust = 1)) # changes axis labels

Following on from this, in order to continue looking at the weather events most harmful to population health, the INJURIES variable was also investigated and again summarised per event type (EVTYPE). This analysis was again carried out using the dplyr package and the results were plotted using ggplot2 package.

        ## Load data into dplyr
        pop_health_i<- 
               data1%>%
                select(BGN_DATE,BGN_TIME,EVTYPE,FATALITIES,INJURIES)%>%        
                filter(INJURIES>0) %>%
                group_by(EVTYPE) %>%
                summarise(total_fatalities=sum(FATALITIES),total_injuries=sum(INJURIES)) %>%
                arrange(desc(total_injuries))                
                
                TopN_injuries<-head(pop_health_i,20)
#        pop_health$INJURIES<-as.numeric(pop_health$INJURIES)

The resulting Top 20 most impactful events for injury can be seen in the results section later on in the report.

        p2<- ggplot(TopN_injuries,
               aes(x=reorder(EVTYPE,-total_injuries), y=total_injuries)) +
                geom_col(colour="darkslateblue",fill="steelblue",width = 0.75) +
                xlab("Weather Event Type") +
                ylab(expression("Total Number of Injuries")) +
                ggtitle("U.S. Top 20 Weather Events for Total Injuries 1950-2011") +
                geom_text(aes(label = round(total_fatalities,0)),colour = "indianred3",nudge_y = 2500,size=3)+
                theme_minimal()+
                theme(axis.text.y = element_text(size = 8,angle=0,color = "indianred3"))+
                theme(axis.text.x = element_text(size = 8,angle=90,color = "indianred3",vjust=0,hjust = 1)) # changes axis labels

A function found on this page was used to combine 2x ggplots onto a single figure.

## Multiple plot function

        ## ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
        ## - cols:   Number of columns in layout
        ## - layout: A matrix specifying the layout. If present, 'cols' is ignored.

        ## If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
        ## then plot 1 will go in the upper left, 2 will go in the upper right, and
        ## 3 will go all the way across the bottom.

        multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
                        require(grid)
                          # Make a list from the ... arguments and plotlist
                        plots <- c(list(...), plotlist)
                        numPlots = length(plots)
                        # If layout is NULL, then use 'cols' to determine layout
                        if (is.null(layout)) {
                                # Make the panel
                                # ncol: Number of columns of plots
                                # nrow: Number of rows needed, calculated from # of cols
                        layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                                ncol = cols, nrow = ceiling(numPlots/cols))
                        }

                        if (numPlots==1) {
                                print(plots[[1]])

                        } 
                        else {
                                # Set up the page
                                grid.newpage()
                        pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

                        # Make each plot, in the correct location
                        for (i in 1:numPlots) {
                        # Get the i,j matrix positions of the regions that contain this subplot
                        matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

                        print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
                        }
                        }
        }

The second part of the analysis attmpted to understand which weather events had the the greatest economic consequences across the United States between 1950 and 2011.
For damage there are 2x relevant variables (PROPDMG,PROPDMG) corresponding to property damage and crop damage respectively. According to the support documentation from NOAA the damage estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions. These characters were recorded in the associated (PROPDMGEXP,CROPDMGEXP) columns.

First the economic extent of damage inflicted on property was analysed. Entries with zero value in the property damage (PROPDMG) variable were removed as they would not contribute to the total cost.
The entries with incorrect values for PROPDMGEXP were checked to see their contribution. As they only represent 0.13% of the total samples they were ignored as they were not deemed to have a major influence on the end result.

        ##remove the zero value rows from data1 and summarise property damage by EVTYPE
        prop<- 
               data1%>%
                select(BGN_DATE,BGN_TIME,EVTYPE,PROPDMG,PROPDMGEXP)%>%        
                filter(PROPDMG>0)
                
                #as.character(prop$PROPDMGEXP)
                prop$PROPDMGEXP <- toupper(prop$PROPDMGEXP)
                unique(prop$PROPDMGEXP)
##  [1] "K" "M" "B" "+" "0" ""  "5" "6" "4" "H" "2" "7" "3" "-"
        ##review the non-styandard entries of PROPDMGEXP:
        prop_typo<-prop%>%
                select(BGN_DATE,BGN_TIME,EVTYPE,PROPDMG,PROPDMGEXP)%>%        
                filter(PROPDMGEXP=="+"|PROPDMGEXP=="0"|PROPDMGEXP==""|PROPDMGEXP=="5"|PROPDMGEXP=="6"|PROPDMGEXP=="4"|PROPDMGEXP=="2"|PROPDMGEXP=="7"|PROPDMGEXP=="-"|PROPDMGEXP=="3")
        
        100*dim(prop_typo)/dim(prop) ##to understand % of entries impactped by typos
## [1]   0.1337938 100.0000000
        prop_edit<-prop%>%
                select(BGN_DATE,BGN_TIME,EVTYPE,PROPDMG,PROPDMGEXP)%>%        
                filter(!(PROPDMGEXP=="+"|PROPDMGEXP=="0"|PROPDMGEXP==""|PROPDMGEXP=="5"|PROPDMGEXP=="6"|PROPDMGEXP=="4"|PROPDMGEXP=="2"|PROPDMGEXP=="7"|PROPDMGEXP=="-"|PROPDMGEXP=="3"))
        dim(prop_edit)
## [1] 238854      5
        unique(prop_edit$PROPDMGEXP)
## [1] "K" "M" "B" "H"
        ##check the distribution of correct entries to understand the makeup
        PROPDMGEXP_count<- 
               prop_edit%>%
                select(BGN_DATE,BGN_TIME,EVTYPE,PROPDMGEXP)%>%        
                group_by(PROPDMGEXP)%>%
                summarise(count_PROPDMGEXP=length(PROPDMGEXP))
           
        PROPDMGEXP_count
## # A tibble: 4 x 2
##   PROPDMGEXP count_PROPDMGEXP
##   <chr>                 <int>
## 1 B                        40
## 2 H                         7
## 3 K                    227481
## 4 M                     11326
        #class(prop_edit$PROPDMGEXP)
        #as.character(prop_edit$PROPDMGEXP)
        prop_edit$PROPDMGEXP <- toupper(prop_edit$PROPDMGEXP)
        unique(prop_edit$PROPDMGEXP)   
## [1] "K" "M" "B" "H"
        ##replace the PROPDMGEXP variable with a numeric equvalent:
        prop_edit$PROPDMGEXP[prop_edit$PROPDMGEXP == "B"] <- as.numeric("1000000000")*1
        prop_edit$PROPDMGEXP[prop_edit$PROPDMGEXP == "M"] <- as.numeric("1000000")*1
        prop_edit$PROPDMGEXP[prop_edit$PROPDMGEXP == "K"] <- as.numeric("1000")*1
        prop_edit$PROPDMGEXP[prop_edit$PROPDMGEXP == "H"] <- as.numeric("100")*1
       
        
        prop_edit$PROPDMGEXP<-as.numeric(prop_edit$PROPDMGEXP)
        class(prop_edit$PROPDMGEXP)
## [1] "numeric"
        unique(prop_edit$PROPDMGEXP)
## [1] 1e+03 1e+06 1e+09 1e+02
         prop_dmg<-prop_edit%>%
                select(BGN_DATE,BGN_TIME,EVTYPE,PROPDMG,PROPDMGEXP)%>% 
                group_by(EVTYPE)%>%
                summarise(Tot_PROPDMG=sum(PROPDMG*PROPDMGEXP))%>%
                arrange(desc(Tot_PROPDMG)) 

This was then followed by similar analysis of the ecomonic impact to crop damage. Entries with zero value in the crop damage (CROPDMG) variable were removed as again they would not contribute to the total cost. The entries with incorrect entries for CROPDMGEXP were also checked and found to represent 0.016% of the overall samples so were ignored and not deemed to have significant impact in the end result.

 ##remove the zero value rows from data1 and summarise crop damage by EVTYPE
        crop<- 
               data1%>%
                select(BGN_DATE,BGN_TIME,EVTYPE,CROPDMG,CROPDMGEXP)%>%         
                filter(CROPDMG>0)
                
                crop$CROPDMGEXP <- toupper(crop$CROPDMGEXP)
                unique(crop$CROPDMGEXP)
## [1] "M" "K" "B" "0" ""
        ##review the non-styandard entries of PROPDMGEXP:
        crop_typo<-crop%>%
                select(BGN_DATE,BGN_TIME,EVTYPE,CROPDMG,CROPDMGEXP)%>%        
                filter(CROPDMGEXP==""|CROPDMGEXP=="0")
        
        100*dim(crop_typo)/dim(crop) ##to understand % of entries impactped by typos
## [1]   0.06787637 100.00000000
        crop_edit<-crop%>%
                select(BGN_DATE,BGN_TIME,EVTYPE,CROPDMG,CROPDMGEXP)%>%        
                filter(!(CROPDMGEXP==""|CROPDMGEXP=="0"))
        
                dim(crop_edit)
## [1] 22084     5
                unique(crop_edit$CROPDMGEXP)
## [1] "M" "K" "B"
        ##check the distribution of correct entries to understand the makeup
        CROPDMGEXP_count<- 
               crop_edit%>%
                select(BGN_DATE,BGN_TIME,EVTYPE,CROPDMGEXP)%>%        
                group_by(CROPDMGEXP)%>%
                summarise(count_CROPDMGEXP=length(CROPDMGEXP))
           
  
                class(crop_edit$CROPDMGEXP)
## [1] "character"
        crop_edit$CROPDMGEXP <- toupper(crop_edit$CROPDMGEXP)
        unique(crop_edit$CROPDMGEXP)   
## [1] "M" "K" "B"
        ##replace the PROPDMGEXP variable with a numeric equvalent:
        crop_edit$CROPDMGEXP[crop_edit$CROPDMGEXP == "B"] <- "1000000000"
        crop_edit$CROPDMGEXP[crop_edit$CROPDMGEXP == "M"] <- "1000000"
        crop_edit$CROPDMGEXP[crop_edit$CROPDMGEXP == "K"] <- "1000"
       
        
        crop_edit$CROPDMGEXP<-as.numeric(crop_edit$CROPDMGEXP)
        class(crop_edit$CROPDMGEXP)
## [1] "numeric"
         crop_dmg<-crop_edit%>%
                select(BGN_DATE,BGN_TIME,EVTYPE,CROPDMG,CROPDMGEXP)%>% 
                group_by(EVTYPE)%>%
                summarise(Tot_CROPDMG=sum(CROPDMG*CROPDMGEXP))%>%
                arrange(desc(Tot_CROPDMG)) 

The total economic cost was based on the sum of both the crop damage and property damage.

        Tot_dmg <- merge(crop_dmg, prop_dmg, by.x = "EVTYPE", by.y = "EVTYPE")
        Total_dmg<-
                Tot_dmg%>%
                select(EVTYPE,Tot_CROPDMG,Tot_PROPDMG)%>% 
                group_by(EVTYPE)%>%
                summarise(Tot_DMG=sum(Tot_CROPDMG,Tot_PROPDMG))%>%
                arrange(desc(Tot_DMG))
        

        Total_dmg_summary<-
                Total_dmg%>%
                select(EVTYPE,Tot_DMG)%>% 
                group_by(EVTYPE)%>%
                arrange(desc(Tot_DMG))%>%
                mutate("Cost (Billion$)"=format(paste("$",round(Tot_DMG / 1e9, 1), "B"), trim = TRUE))%>%
                select(EVTYPE,"Cost (Billion$)")

A pie chart was then created to show which weather events impacted most economically. This chart itself can be seen in the results section below.

dfD <- data.frame(CategoryD = Total_dmg$EVTYPE, TotD=Total_dmg$Tot_DMG)

       ##dfD$TotD<- format(paste("$",round(dfD$TotD / 1e9, 1), "B"), trim = TRUE)

colors <- c('rgb(211,94,96)', 'rgb(128,133,133)', 'rgb(144,103,167)', 'rgb(171,104,87)', 'rgb(114,147,203)')

pie <- plot_ly(dfD, labels = ~CategoryD, values = ~TotD, type = 'pie',
        textposition = 'inside',
        textinfo = 'label+percent',
        insidetextfont = list(color = '#FFFFFF'),
        hoverinfo = 'text',
        text = format(paste("$",round(dfD$TotD / 1e9, 1), "B"), trim = TRUE),
        marker = list(colors = colors,
                      line = list(color = '#FFFFFF', width = 1)),
                      #The 'pull' attribute can also be used to create space between the sectors
        showlegend = FALSE) %>%
  layout(title = 'United States Economic Impact per Weather Event',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

Results:

The results of the analysis can be seen in the plots below.
In summary from a population health impact, tornadoes have been found to have the most significant impact. From the economic impact, floods have the most impact.

Population Health Impact Summary:

The tables below show to most impactful weather events on population health. These are shown by fatalities and injuries. In both instances tornadoes come out on top.

        TopN_fatalities
## # A tibble: 20 x 3
##    EVTYPE                  total_fatalities total_injuries
##    <fct>                              <dbl>          <dbl>
##  1 TORNADO                             5633          60187
##  2 EXCESSIVE HEAT                      1903           4791
##  3 FLASH FLOOD                          978            641
##  4 HEAT                                 937           1420
##  5 LIGHTNING                            816            649
##  6 TSTM WIND                            504            646
##  7 FLOOD                                470           2679
##  8 RIP CURRENT                          368             87
##  9 HIGH WIND                            248            308
## 10 AVALANCHE                            224             79
## 11 WINTER STORM                         206            599
## 12 RIP CURRENTS                         204             67
## 13 HEAT WAVE                            172            269
## 14 EXTREME COLD                         160            199
## 15 THUNDERSTORM WIND                    133            130
## 16 HEAVY SNOW                           127            225
## 17 EXTREME COLD/WIND CHILL              125             15
## 18 STRONG WIND                          103             40
## 19 BLIZZARD                             101            718
## 20 HIGH SURF                            101             39
        TopN_injuries
## # A tibble: 20 x 3
##    EVTYPE             total_fatalities total_injuries
##    <fct>                         <dbl>          <dbl>
##  1 TORNADO                        5227          91346
##  2 TSTM WIND                       199           6957
##  3 FLOOD                           104           6789
##  4 EXCESSIVE HEAT                  402           6525
##  5 LIGHTNING                       283           5230
##  6 HEAT                             73           2100
##  7 ICE STORM                        35           1975
##  8 FLASH FLOOD                     171           1777
##  9 THUNDERSTORM WIND                54           1488
## 10 HAIL                              3           1361
## 11 WINTER STORM                     85           1321
## 12 HURRICANE/TYPHOON                32           1275
## 13 HIGH WIND                       102           1137
## 14 HEAVY SNOW                       51           1021
## 15 WILDFIRE                         55            911
## 16 THUNDERSTORM WINDS               25            908
## 17 BLIZZARD                         48            805
## 18 FOG                              38            734
## 19 WILD/FOREST FIRE                  7            545
## 20 DUST STORM                       19            440
        ##plot histograms in a single figure
        multiplot(p1, p2, cols=1)
## Loading required package: grid

Economic Impact Summary:

Combining both the calcuations from both property and crop damage variables it was possible to determine the total cost of damage per weather event. The table below shows the impact of the top 20 events. Taking this data we can get an understanding of the overall impact of the main weather events contributing to the ecomonomic impact in the United States during the measurement period. The top 5 events account for 72% of the ecomonic damage.

        head(Total_dmg_summary,20)
## # A tibble: 20 x 2
## # Groups:   EVTYPE [20]
##    EVTYPE             `Cost (Billion$)`
##    <fct>              <chr>            
##  1 FLOOD              $ 150.3 B        
##  2 HURRICANE/TYPHOON  $ 71.9 B         
##  3 TORNADO            $ 57.4 B         
##  4 STORM SURGE        $ 43.3 B         
##  5 HAIL               $ 18.8 B         
##  6 FLASH FLOOD        $ 17.6 B         
##  7 DROUGHT            $ 15 B           
##  8 HURRICANE          $ 14.6 B         
##  9 RIVER FLOOD        $ 10.1 B         
## 10 ICE STORM          $ 9 B            
## 11 TROPICAL STORM     $ 8.4 B          
## 12 WINTER STORM       $ 6.7 B          
## 13 HIGH WIND          $ 5.9 B          
## 14 WILDFIRE           $ 5.1 B          
## 15 TSTM WIND          $ 5 B            
## 16 STORM SURGE/TIDE   $ 4.6 B          
## 17 THUNDERSTORM WIND  $ 3.9 B          
## 18 HURRICANE OPAL     $ 3.2 B          
## 19 WILD/FOREST FIRE   $ 3.1 B          
## 20 THUNDERSTORM WINDS $ 1.9 B
        pie