Synopsis

This document analyses the impact of natural disasters on property and public health in the USA. The data used for the analysis come from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database and cover events recorded between 1950 and 2011. Due to significant data storage constraints, the data have been processed and aggregated with a focus on property damage and public health. Further, as the data contained many inconsistent inputs, an expert manual approach was taken to map the 985 recorded inputs onto the 48 original event types. The conclusion of this analysis is that thunderstorm winds have by far the highest impact on property and public health. However, this conclusion depends heavily on the way the inputs are matched to the original event type categories.

Data Processing

The first step of the analysis was to download the CSV file from the NOAA URL; the code below shows the exact URL. The approach taken for the download allows anyone to download the file without having to update the code with the location of the file.

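# Download the compressed CSV to a temporary file (binary mode) and read it
temp <- tempfile()
download.file( "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", temp, mode = "wb" )
data <- read.csv( temp )
unlink( temp )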
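
The download step relies on read.csv reading a bzip2-compressed file directly, without a separate decompression step. A minimal offline sketch of that behaviour, using a small made-up file rather than the NOAA data (the file contents and the name toy are illustrative assumptions):

```r
# Write a tiny bzip2-compressed CSV (made-up rows, not NOAA data)
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,3", "HAIL,0"), con)
close(con)

# read.csv detects the compression and reads the file directly
toy <- read.csv(tmp)
unlink(tmp)
```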

The next step was to restrict the data used in the rest of the analysis, as the original data set has a large number of variables, of which only a handful are relevant. Thus, we created a new data set that has only the variables EVTYPE (event type), FATALITIES (human fatalities), INJURIES (human injuries), PROPDMG (property damage) and CROPDMG (crop damage).

data_new <- data.frame( Property = data$PROPDMG, EvType = data$EVTYPE,
                        Fatal = data$FATALITIES, Inj = data$INJURIES, Crop = data$CROPDMG )

The data set is then cleaned further by retaining only complete cases and removing rows with a missing event type.

data_new <- data_new[ complete.cases( data_new ), ]
data_new <- data_new[ !is.na( data_new$EvType ), ]
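
To illustrate what the cleaning step does, here is a small sketch on made-up data (the toy data frame is an assumption for illustration only). complete.cases keeps a row only if none of its variables is NA, so after this filter no missing event types remain either.

```r
# Made-up rows: the second has a missing event type, the third a missing count
toy <- data.frame( EvType = c("FLOOD", NA, "HAIL"),
                   Fatal  = c(1, 2, NA),
                   Inj    = c(0, 4, 5) )

# Keep only rows in which every variable is present
toy_clean <- toy[ complete.cases( toy ), ]
```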

Finally, two new variables were added: Health, which represents the sum of injuries and fatalities, i.e. the number of people impacted by the event, and Damage, which represents the sum of crop and property damages, i.e. the economic damage caused by the event.

data_new$Health <- data_new$Fatal + data_new$Inj
data_new$Damage <- data_new$Crop + data_new$Property

As mentioned in the synopsis, the main part of the data processing is mapping the inputs that were actually recorded to the original set of event types provided by NOAA. After a careful look through the list of distinct inputs, we concluded that the best approach is a qualitative one. The mapping we prepared is available in the underlying code; we do not show it here, as the list has 985 inputs. The code that follows refers to the mapped event type as Event. If you need further details on the decision-making process, please contact the authors.
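
For readers who want a sense of what such a mapping can look like, here is a minimal sketch using a named lookup vector. It is not the mapping used in this analysis: the three entries, the input strings and the names event_map, raw and mapped are hypothetical, and the real mapping covers all 985 inputs.

```r
# Hypothetical lookup: names are raw inputs, values are official event types
event_map <- c( "TSTM WIND"          = "Thunderstorm Wind",
                "THUNDERSTORM WINDS" = "Thunderstorm Wind",
                "FLASH FLOODING"     = "Flash Flood" )

# Made-up raw inputs with inconsistent case and whitespace
raw <- c("TSTM WIND", " thunderstorm winds", "FLASH FLOODING", "UNKNOWN THING")

# Normalise, then look up; inputs without an entry become NA for manual review
normalized <- toupper( trimws( raw ) )
mapped     <- unname( event_map[ normalized ] )
```

Inputs that come back as NA are the ones that still need a manual decision, which is where the qualitative judgment described above comes in.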

We would like to emphasize that this is a crucial segment of the data processing and analysis, as it affects the aggregation, which in turn affects the conclusions.

The last part of the data processing is aggregating the data across regions and dates, so that we can compare events solely by their impact on public health and economic wellbeing, disregarding the year or the region in which they occurred. As we are only interested in events that caused fatalities, injuries, crop or property damage, we also exclude the events that have a value of 0 for the variables Health and Damage (the aggregated sum is the column x in the code).

# Keep only records with non-zero damage or health impact
data_dmg    <- data_new[ which(data_new$Damage != 0 ), ]
data_health <- data_new[ which(data_new$Health != 0 ), ]

# Sum the impact per mapped event type; aggregate() names the summed column x
data_dmg    <- aggregate( data_dmg$Damage, by = list(Category = data_dmg$Event), FUN = sum )
data_health <- aggregate( data_health$Health, by = list(Category = data_health$Event), FUN = sum )

# Drop event types whose total is zero
data_dmg    <- data_dmg[ which(data_dmg$x != 0 ), ]
data_health <- data_health[ which(data_health$x != 0 ), ]
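
The filter-then-aggregate pattern above can be checked on a small made-up example (the toy data frame and its values are assumptions for illustration). Note that aggregate, when given a plain vector, names the summed column x, which is why the code refers to x.

```r
# Made-up records: two floods (one without damage), a tornado and a hail event
toy <- data.frame( Event  = c("Flood", "Tornado", "Flood", "Hail"),
                   Damage = c(10, 5, 0, 0) )

# Drop zero-damage records, then sum damage per event type
toy_dmg <- toy[ which( toy$Damage != 0 ), ]
toy_agg <- aggregate( toy_dmg$Damage, by = list(Category = toy_dmg$Event), FUN = sum )
```

Hail disappears from the result because its only record has no damage.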

Results

To find the events that have the biggest impact on public health and property, we first performed an exploratory analysis. The first plot shows one of the exploratory analyses performed: the property damage across events.

library(ggplot2)

g <- ggplot(data_dmg, aes(Category, x) ) + ggtitle( "Property Damages across Events") + xlab("Event") + ylab( "Property Damage ($)")
g + geom_bar( stat = "identity" )

With 47 events, the x-axis is very difficult to read, but it is clear that only a few events have values above 1 million. In the next step we therefore limit the plot to those few.

data_dmg_mm   <- data_dmg
data_dmg_mm$x <- data_dmg$x/1000000
data_dmg_mm   <- data_dmg_mm[ which(data_dmg_mm$x > 1 ), ]
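
A fixed threshold works here because only a few events exceed it. An alternative, shown only as a sketch and not used in this analysis, is to rank the aggregated totals and keep the top n events. The data values and the names toy_agg, top_n_events, ranked and top are hypothetical.

```r
# Made-up aggregated totals in dollars
toy_agg <- data.frame( Category = c("Flood", "Hail", "Tornado", "Drought"),
                       x        = c(2.5e6, 4.0e5, 9.1e6, 1.2e6) )

top_n_events <- 2   # hypothetical choice of how many events to keep

# Rank by total, keep the largest, and rescale to millions
ranked <- toy_agg[ order( toy_agg$x, decreasing = TRUE ), ]
top    <- head( ranked, top_n_events )
top$x  <- top$x / 1e6
```

This avoids choosing a cut-off by eye, at the cost of fixing the number of events shown in advance.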

The next plot shows the four events that caused the most crop and/or property damage between 1950 and 2011 in the USA. Thunderstorm winds were by far the most damaging natural event for crops and property in the USA during this period.

g <- ggplot(data_dmg_mm, aes(Category, x) ) + ggtitle( "Property Damage in $millions for 4 Main Natural Disasters")+ xlab("Event") + ylab( "Property Damage ($mm)")
g + geom_bar( stat="identity" )

A similar exploratory analysis was performed on the Health variable, but in this case it showed that we should focus on events that caused more than 7 thousand fatalities and injuries.

The next plot shows the four events that caused the most fatalities and injuries between 1950 and 2011 in the USA. Again, thunderstorm winds were by far the most harmful natural event for public health in the USA during this period.

data_health_kk   <- data_health
data_health_kk$x <- data_health$x/1000
data_health_kk   <- data_health_kk[ which(data_health_kk$x > 7 ), ]
g <- ggplot(data_health_kk, aes(Category, x ) ) + ggtitle( "Fatalities and Injuries in thousands for 4 Main Natural Disasters")+ xlab("Event") + ylab( "Fatalities and Injuries (thousands)")
g + geom_bar( stat="identity" )