Synopsis

The data analysis presented in this document is for Course Project 2 in the Reproducible Reasearch course, provided by John Hopkins University on Coursera platform.The database used for this analysis was provided by U.S. National Oceanic and Atmospheric Administration (NOAA) and it tracks major weather events, including where and when they occur, as well as estimation on fatalities, injuries, and economic figures.The main goals of this project were to evaluate what type wether events would have a higher impact on population health as well as economic consequences.To evaluate the impact on population health, injury and fatality values were used. Property and Crop damage costs were the variables used to evaluate the economic consequences. It was found that Tornados and Excessive Heat are the most impactful events on population health, while Drought and Hurricane/Typhoons were the events causing larger economic consequences.

Data Processing

As mentioned, the data used for this project was obtained from NOAA’s database. The funcion read.csv was used to read the data into R Studio, and then stored in a variable called StormData.

        # Open csv file with raw data
                StormData <- read.csv("repdata_data_StormData.csv.bz2") 

StormData has a total of 902297 rows and 37 columns. The time frame of the data stored in StormData variable ranges from 1950 until 2011. Therefore, in order to help reducing the size of data to analyse, only data after 1990 was considered and stored in a new variable called SampleStormData. Further pre-processing included change date format of values in column BGN_DATE, and keep columns that contain relevant information for the data analysis.

        # Change date format of column BGN_DATE to Date and Save it in new column BGN_DATE_YMD
                StormData$BGN_DATE <- as.Date(format(strptime(as.character(StormData$BGN_DATE), 
                                                              "%m/%d/%Y"), 
                                                     "%Y-%m-%d"))

        # Filter StormData for year > 1990
                StormData$YEAR <- year(StormData$BGN_DATE)
                SampleStormData <- StormData[StormData$YEAR > "1990", ]
        
        # Columns to keep and filter dataset SampleStormData to keep just the columns selected
                keepVar <- c("BGN_DATE","YEAR","STATE","EVTYPE","FATALITIES","INJURIES",
                             "PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP","LATITUDE",
                             "LONGITUDE","REFNUM")
                SampleStormData <- SampleStormData[,keepVar]
        
        # Set all Events to Capital letters
                SampleStormData$EVTYPE <- factor(toupper(SampleStormData$EVTYPE))

Results

Type of Events most harmful to population health

In order to know what type of events were most harmful to the population health, between 1991 and 2011, we will group the variable EVTYPE, and then count the total of Injuries and Fatalities per event. Addionaly, those values will then be transformed into percentages.

  1. Totaly Injuries
        # Total Injuries per Event Type
                totalInjperEvType <- aggregate.data.frame(SampleStormData$INJURIES,
                                                          by = list(SampleStormData$EVTYPE),
                                                          FUN = sum)
                colnames(totalInjperEvType) <- c("EVTYPE","Total_Injuries")
                
        # Order Total Injuries in descend order
                totalInjperEvType <- totalInjperEvType[order(totalInjperEvType$Total_Injuries, 
                                                             decreasing = TRUE),]  
        
        # Show total Injuries in Percentage
                totalInjperEvType$PercTotalInj <- totalInjperEvType$Total_Injuries/sum(totalInjperEvType$Total_Injuries)*100
                totalInjperEvType$PercTotalInj <- round(totalInjperEvType$PercTotalInj, 2)
  1. Total Fatalities
        # Total Fatalities per Event Type
                totalFatalperEvType <- aggregate.data.frame(SampleStormData$FATALITIES, 
                                                            by = list(SampleStormData$EVTYPE), 
                                                            FUN = sum)
                colnames(totalFatalperEvType) <- c("EVTYPE","Total_Fatalities")
                
        # Order Total Injuries in descend order
                totalFatalperEvType <- totalFatalperEvType[order(totalFatalperEvType$Total_Fatalities, 
                                                                 decreasing = TRUE),]  
        
        # Show total Injuries in Percentage
                totalFatalperEvType$PercTotalInj <- totalFatalperEvType$Total_Fatalities/sum(totalFatalperEvType$Total_Fatalities)*100
                totalFatalperEvType$PercTotalInj <- round(totalFatalperEvType$PercTotalInj, 2)
  1. Visualization
        # SUMMARY RESULTS IN A FORM OF BAR PLOT IN PANEL
                plot1 <- ggplot(data = totalInjperEvType[totalInjperEvType$PercTotalInj > 5,],
                                aes(EVTYPE, PercTotalInj))+
                        labs(x = NULL, y = "Total Injuries (%)", 
                             title = "Most harmful (>5%) events in terms of injury, 1991-2011")+
                        theme(plot.title = element_text(size=9),
                              axis.title = element_text(size = 9))+
                        geom_bar(stat = "Identity")+
                        geom_text(aes(label=PercTotalInj), vjust=-0.3, size=2)
        
                plot2 <- ggplot(data = totalFatalperEvType[totalFatalperEvType$PercTotalInj > 5,], 
                                aes(EVTYPE, PercTotalInj))+
                        labs(x = "Event Type ", y = "Total Fatalities (%)", 
                             title = "Most harmful (>5%) events in terms of fatality, 1991-2011")+
                        theme(plot.title = element_text(size=9),
                              axis.title = element_text(size = 9))+
                        geom_bar(stat = "Identity")+
                        geom_text(aes(label=PercTotalInj), vjust=-0.3, size=2)
                
                multiplot(plot1,plot2, cols=1)

Fig. 1 - Most harmful events (rate > %5) for Population Heath, 1991-2011.

Results presented in Fig.1 show that, generally, TORNADO is the weather event that causes more Injuries in the population, in a total of 35.5% of all injuries, while Excessive Heat is repsonsible for 17.3% of total fatalities, between 1991 and 2011.

Further data analysis was done intending to address the following questions:

  1. Which Tornado caused the largest amounth of injuries, per state and year?
        # TORNADO
                subsetTornado_INJ <- aggregate.data.frame(SampleStormData$INJURIES,
                                                          by = list(SampleStormData$EVTYPE == "TORNADO",
                                                                    SampleStormData$STATE,
                                                                    SampleStormData$BGN_DATE),
                                                          FUN = sum)
                subsetTornado_INJ <- subsetTornado_INJ[subsetTornado_INJ$Group.1 =="TRUE",]
                subsetTornado_INJ$Group.1 <- NULL
                colnames(subsetTornado_INJ) <- c("STATE", "DATE","Total_Injuries")
                
                subsetTornado_INJ <- subsetTornado_INJ[order(subsetTornado_INJ$Total_Injuries,
                                        decreasing = TRUE),]
                
                subsetTornado_INJ[1:3,]  
##        STATE       DATE Total_Injuries
## 104123    AL 2011-04-27           2092
## 104678    MO 2011-05-22           1152
## 104162    TN 2011-04-27            857

Above are shown the top 3 tornados that caused most injuries. this data suggests that in 2011 a tornado caused large amouth of injuries (almost 3000) throughout the states of Alabama and Tennessie (AL and TN, respectively). Official data has been found supporting this cause, and a tornado outbreak in the Deep South has been reported.

  1. Which Excessive Heat event caused the largest amounth of fatalities, per state and year?

To answer this question, data was filtered based on event type, and then groped by state and year. A sum function was applied to the variable FATALITIES.

        # EXCESSIVE HEAT
                subsetEXCHEAT <- aggregate.data.frame(SampleStormData$FATALITIES,
                                                      by = list(SampleStormData$EVTYPE == "EXCESSIVE HEAT",
                                                                SampleStormData$STATE,
                                                                SampleStormData$YEAR),
                                                      FUN = sum)
                subsetEXCHEAT <- subsetEXCHEAT[subsetEXCHEAT$Group.1 =="TRUE",]
                subsetEXCHEAT$Group.1 <- NULL
                colnames(subsetEXCHEAT) <- c("STATE", "YEAR", "Total_Fatalities")
                
                subsetEXCHEAT <- subsetEXCHEAT[order(subsetEXCHEAT$Total_Fatalities,
                                                     decreasing = TRUE),]
                
                subsetEXCHEAT[1:3,]
##     STATE YEAR Total_Fatalities
## 531    IL 1999              138
## 571    PA 1999               88
## 550    MO 1999               77

The results presented above show that large majoritie of fatalities due to Extreme Heat were in the states of Illinois, Pennsylvania and Montana, which is consistent with official records of a Summer heat wave during 1999.

Type of Events with greater Economic consequences

To understand what weather events had the highest economic impact, we have to analyse the crop damage (CROPDMG) and property damaga (PROPDMG). The values in these columns had to be multiplied by the corresponding CROPDMGEXP and PROPDMGEXP to produce the actual monetary value. After doing so and check the data, one event in particular stood out as a possible outlier (REFNUM = 605943). It was then decided to eliminate it from the dataset.

        # Replace 'K', 'M' and 'B' with their numeric equivalents. After that, replace NA with 1. Applies to PROPDMGEXP and CROPDMGEXP
                SampleStormData$PROPDMGEXP <- recode(SampleStormData$PROPDMGEXP, 'K' = 1000, 'M'=1000000, 'B'=1000000000)
                SampleStormData$CROPDMGEXP <- recode(SampleStormData$CROPDMGEXP, 'K' = 1000, 'M'=1000000, 'B'=1000000000)
                SampleStormData$PROPDMGEXP[is.na(SampleStormData$PROPDMGEXP)] <- 1
                SampleStormData$CROPDMGEXP[is.na(SampleStormData$CROPDMGEXP)] <- 1
        # Calculate damage amounts for PROPDMG,CROPDMG and Total Damage Amount
                SampleStormData$PROPDMGAMT <- SampleStormData$PROPDMG*SampleStormData$PROPDMGEXP
                SampleStormData$CROPDMGAMT <- SampleStormData$CROPDMG*SampleStormData$CROPDMGEXP
                SampleStormData$TOTDMGAMT <- SampleStormData$CROPDMGAMT + SampleStormData$PROPDMGAMT 

No Official Weather Events of type FLOOD to be reported in CA on 2016-01-01. Most likely this entry is wrong. We delete it.

        # Remove possible outlier
                SampleStormData<- SampleStormData[!(SampleStormData$REFNUM == "605943"),]
  1. Crop Damage data assessment.

To assess which weather event caused more crop damage, we sum the CROPDMGAMT by EVTYPE, and then calculate the percentage of each relative to the total crop costs.

        # Crop Damaga data processing
                TotalCropDmgPerEVTYPE <- aggregate.data.frame(SampleStormData$CROPDMGAMT,
                                                      by = list(SampleStormData$EVTYPE),
                                                      FUN = sum)
                colnames(TotalCropDmgPerEVTYPE) <- c("EVTYPE","TOTALCROPDMGAMT")
                
        # Sort data in descending order
                TotalCropDmgPerEVTYPE <- TotalCropDmgPerEVTYPE[order(TotalCropDmgPerEVTYPE$TOTALCROPDMGAMT, 
                                                                     decreasing = TRUE),]
        # Show Crop Damages in Percentage
                TotalCropDmgPerEVTYPE$PercTotal <- TotalCropDmgPerEVTYPE$TOTALCROPDMGAMT/sum(TotalCropDmgPerEVTYPE$TOTALCROPDMGAMT)*100
                TotalCropDmgPerEVTYPE$PercTotal <- round(TotalCropDmgPerEVTYPE$PercTotal, 2)
  1. Total Property Damage data assessment.

For total property damage costs we do the same data analysis as for crop damage.

        # Total Property Damages per Event Type
                TotalPropDmgPerEVTYPE <- aggregate.data.frame(SampleStormData$PROPDMGAMT,
                                                              by = list(SampleStormData$EVTYPE),
                                                              FUN = sum)
                colnames(TotalPropDmgPerEVTYPE) <- c("EVTYPE","TOTALPROPDMGAMT")
        
        # Sort data in descending order
                TotalPropDmgPerEVTYPE <- TotalPropDmgPerEVTYPE[order(TotalPropDmgPerEVTYPE$TOTALPROPDMGAMT,
                                                                     decreasing = TRUE),]
                        
        # Show total Injuries in Percentage
                TotalPropDmgPerEVTYPE$PercTotal <- TotalPropDmgPerEVTYPE$TOTALPROPDMGAMT/
                        sum(TotalPropDmgPerEVTYPE$TOTALPROPDMGAMT)*100
                TotalPropDmgPerEVTYPE$PercTotal <- round(TotalPropDmgPerEVTYPE$PercTotal, 2)
  1. Results visualization

We plot the results of total crop damage amount (TotalCropDmgPerEVTYPE) and total property damage amount (TotalPropDmgPerEVTYPE) for values above 5%. This is just for the purpose of better visualizationof those results who are relevant for this analysis.

        # SUMMARY RESULTS IN A FORM OF BAR PLOT IN PANEL
                plot3 <- ggplot(data = TotalCropDmgPerEVTYPE[totalInjperEvType$PercTotal > 5,],
                                aes(EVTYPE, PercTotal))+
                        labs(x = NULL, y = "Total Crop Costs (%)", 
                             title = "Events with higher (> 5%) Total Crop Costs, 1991-2011")+
                        theme(plot.title = element_text(size=9),
                              axis.title = element_text(size = 9))+
                        geom_bar(stat = "Identity")+
                        geom_text(aes(label=PercTotal), vjust=-0.3, size=2)
                
                plot4 <- ggplot(data = TotalPropDmgPerEVTYPE[TotalPropDmgPerEVTYPE$PercTotal > 5,],
                                aes(EVTYPE, PercTotal))+
                        labs(x = "Event Type ", y = "Total Prop. Costs (%)", 
                             title = "Events with higher (> 5%) Total Property Costs , 1991-2011")+
                        theme(plot.title = element_text(size=9),
                              axis.title = element_text(size = 9))+
                        geom_bar(stat = "Identity",
                                 )+
                        geom_text(aes(label=PercTotal), vjust=-0.3, size=2)
                
                multiplot(plot3,plot4, cols=1)

Fig. 2 - Weather events with higher (rate > %5) Economic consequences, 1991-2011.

Highest crop damage costs were caused by DROUGHT about 28.5% of the total costs reported between 1991 and 2011, while HURRICAN/TYPHOON events were responsible for 24.4% of the total property damage costs, as reported on FIG.2.

Further data analysis was done intending to address the following questions:

  1. Which DROUGHT event caused the largest amounth of crop damages, per state and year?
        # Further Investication for Total Crop Damage
                subsetDROUGHT <- SampleStormData[which(SampleStormData$EVTYPE == "DROUGHT"),
                                                 names(SampleStormData) %in% 
                                                         c("BGN_DATE_YMD","YEAR","STATE","CROPDMGAMT")
                                                 ]
                                                   
                subsetDROUGHT <- subsetDROUGHT[order(subsetDROUGHT$CROPDMGAMT,
                                                     decreasing = TRUE),]
                subsetDROUGHT$CROPDMGAMT <- format(subsetDROUGHT$CROPDMGAMT, scientific = TRUE)
                subsetDROUGHT[1:3,]
##        YEAR STATE CROPDMGAMT
## 639347 2006    TX 1.0000e+09
## 422676 2001    IA 5.7885e+08
## 410175 2000    TX 5.1500e+08

In 2006, in Texas, it was recorded the DROUGHT event with highest amounth of crop damages, in a total of $1B dollars. According to official recordings, Wildfires have been reported during this time in Texas.

  1. Which HURRICANE/TYPHOON event caused the largest amount of property damages, per state and year?
        # Further Investication for Total Property Damage
                subsetHURTYPH <- SampleStormData[which(SampleStormData$EVTYPE == "HURRICANE/TYPHOON"),
                                                 names(SampleStormData) %in% 
                                                         c("BGN_DATE_YMD","YEAR","STATE","PROPDMGAMT")
                                                 ]
                
                subsetHURTYPH <- subsetHURTYPH[order(subsetHURTYPH$PROPDMGAMT,
                                                     decreasing = TRUE),]
                
                subsetHURTYPH[1:3,]
##        YEAR STATE PROPDMGAMT
## 577675 2005    LA  1.693e+10
## 569308 2005    FL  1.000e+10
## 581533 2005    MS  7.350e+09

2005 was a devastating year regarding Hurricanes; Katrina, for example. Total Property costs reflect that exacly, as we can above. In the state of Luisianna it has been reported approx. $17B dollars of property costs. Both Louisianna and Mississippi were sevearly hit by Katrina. Results also show that in the state of Florida, total property damage costs have been reported to be $10B dollars, these were due to Hurricane Wilma. These results show how devastating these weather events can be, not just in terms of lives but also in property costs.

Final Conclusions

With this project, it was possible to concluded which weather events have a higher impact on population health, as well as economic consequences. Data analysed was restricted to events reported between 1991 and 2011. In this time frame, Tornados were responsible for the majority of injuries, while excessive heat caused the highest number of fatalities. In terms of economic consequences, it was found that drought events are responsible for larger crop damage costs, while hurricanes/typhoon events cause the highest property damage costs.