Reproducible Research Project Report

Synopsis

In this report, the Storm Data set from the U.S. National Oceanic and Atmospheric Administration (NOAA) storm database is explored to identify the effects of different weather events on public health and economic welfare. The events in the database start in 1950 and run through November 2011.

Data Processing

The report uses the dplyr package for data processing, so it is loaded first.

library(dplyr)

Reading the data

rawdata <- read.csv("repdata_data_Stormdata.csv/repdata_data_Stormdata.csv", 
                    header = TRUE, colClasses = "character")
copyrawdata <- rawdata

Here, the data is loaded in from the working directory. This analysis will be broken down into two sections. They will cover:

1) The effect of natural disaster events on public and population health across the United States.
2) The economic consequences of natural disaster events across the United States.

To streamline the process of analysing the data for both questions individually, a copy of the raw data was saved into a new variable, copyrawdata.

Processing for Task 1

rawdata$EVTYPE <- factor(rawdata$EVTYPE)
rawdata$FATALITIES <- as.numeric(rawdata$FATALITIES)
rawdata$INJURIES <- as.numeric(rawdata$INJURIES)
FatalitiesGroups <- rawdata %>% group_by(EVTYPE)
InjuriesGroups <- rawdata %>% group_by(EVTYPE)
FatalitiesSummary <- FatalitiesGroups %>% summarise(Total.Fatalities = sum(FATALITIES))
InjuriesSummary <- InjuriesGroups %>% summarise(Total.Injuries = sum(INJURIES))
FatalitiesSummary <- FatalitiesSummary[FatalitiesSummary$Total.Fatalities != 0, ]
InjuriesSummary <- InjuriesSummary[InjuriesSummary$Total.Injuries != 0, ]
InjuriesSummary <- InjuriesSummary %>% arrange(desc(Total.Injuries))
FatalitiesSummary <- FatalitiesSummary %>% arrange(desc(Total.Fatalities))
mergedframe <- merge.data.frame(FatalitiesSummary, InjuriesSummary, by = "EVTYPE")
finalmergedframe <- mutate(mergedframe,Injuries.and.Fatalities = Total.Injuries + Total.Fatalities )
finalmergedframe <- finalmergedframe %>% arrange(desc(Injuries.and.Fatalities))
totalframe <- finalmergedframe %>% select(EVTYPE, Injuries.and.Fatalities)
mergedframe$EVTYPE <- factor(mergedframe$EVTYPE)

In this data processing section, the columns are first converted to the correct data types, and the data is then grouped by type of natural disaster in two data frames, one for injuries and one for fatalities. The fatalities and injuries are summed for each type of natural disaster, and any totals equal to 0 are removed. The two data frames are sorted in descending order and merged by event type, so the merged frame keeps only the event types that have both a non-zero fatality total and a non-zero injury total. A new data frame containing only the event type and the combined total of injuries and fatalities is then created, which allows an ordered list to be generated.
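The grouping, summing, filtering and ordering steps described above can also be written as one pipeline. The following is a minimal sketch on a made-up data frame (the values in toy are invented for illustration and are not taken from the NOAA file); it is an alternative formulation, not the code used to produce the results in this report.

```r
library(dplyr)

# Toy data (made-up values, not taken from the NOAA file)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HAIL", "FLOOD"),
  FATALITIES = c(2, 1, 0, 0, 3),
  INJURIES   = c(10, 5, 4, 0, 1)
)

health <- toy %>%
  group_by(EVTYPE) %>%
  # One summarise() call produces both totals, so no merge is needed
  summarise(Total.Fatalities = sum(FATALITIES),
            Total.Injuries   = sum(INJURIES)) %>%
  # Keep only event types where both totals are non-zero,
  # which mirrors the effect of the merge in the report
  filter(Total.Fatalities != 0, Total.Injuries != 0) %>%
  mutate(Injuries.and.Fatalities = Total.Injuries + Total.Fatalities) %>%
  arrange(desc(Injuries.and.Fatalities))

health
```

Computing both totals in the same summarise() call keeps the two columns aligned by event type without a separate merge step.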

Processing for Task 2

DataofInterest <- copyrawdata %>% select("EVTYPE","PROPDMG","PROPDMGEXP","CROPDMG", "CROPDMGEXP")

lettertonum <- function(columnname) {
  
columnname <- as.numeric(gsub( "K", "1000", 
                        gsub("M", "1000000", gsub("B", "1000000000", columnname))))

}

DataofInterest$PROPDMGEXP <- lettertonum(DataofInterest$PROPDMGEXP)
## Warning in lettertonum(DataofInterest$PROPDMGEXP): NAs introduced by coercion
DataofInterest$CROPDMGEXP <- lettertonum(DataofInterest$CROPDMGEXP)
## Warning in lettertonum(DataofInterest$CROPDMGEXP): NAs introduced by coercion
DataofInterest <- DataofInterest[!is.na(DataofInterest$PROPDMGEXP),]
DataofInterest <- DataofInterest[!is.na(DataofInterest$CROPDMGEXP),]
DataofInterest$PROPDMG <- as.numeric(DataofInterest$PROPDMG)
DataofInterest$CROPDMG <- as.numeric(DataofInterest$CROPDMG)
DataofInterest$COMBINEDPROPDMG <- DataofInterest$PROPDMG * DataofInterest$PROPDMGEXP
DataofInterest$COMBINEDCROPDMG <- DataofInterest$CROPDMG * DataofInterest$CROPDMGEXP
DataofInterest <- DataofInterest[!(DataofInterest$COMBINEDPROPDMG == 0 | DataofInterest$COMBINEDCROPDMG == 0),]
DataofInterest$TotalDamage <- DataofInterest$COMBINEDPROPDMG + DataofInterest$COMBINEDCROPDMG
DataofInterest$EVTYPE <- factor(DataofInterest$EVTYPE)
groups <- DataofInterest %>% group_by(EVTYPE)
summary <- groups %>% summarise(EventTotal = sum(TotalDamage))
summary <- summary %>% arrange(desc(EventTotal))

First, a new data frame called DataofInterest, containing only EVTYPE and the property and crop damage columns, is created and saved.

A new function called lettertonum was also created to convert the exponent letters to their corresponding numbers. The value of each letter (K for thousands, M for millions, B for billions) was obtained from the data documentation, and the function was used to convert the PROPDMGEXP and CROPDMGEXP columns. Entries that could not be converted to a number, such as blank exponent codes, became NA, and those rows were removed in the following step. Once this was done, the DMG value was multiplied by the EXP value to give the actual damage amount. Rows in which either the property damage or the crop damage came to 0 were removed.
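The letter-to-number conversion can also be expressed with a lookup table, which makes it explicit which codes are recognised. The sketch below is illustrative only: exp_to_multiplier is a hypothetical helper name, the sample codes and amounts are made up, and treating lowercase letters the same as uppercase ones is an assumption rather than something stated in this report. Unlike lettertonum, it returns NA for numeric codes instead of passing them through as numbers.

```r
# Hypothetical helper: map exponent codes to numeric multipliers
exp_to_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  # Assumption: lowercase codes mean the same as uppercase ones.
  # Codes not in the lookup table (including blanks) come back as NA.
  unname(lookup[toupper(code)])
}

# Made-up sample values for illustration
codes   <- c("K", "m", "B", "", "?")
amounts <- c(25, 2.5, 1, 10, 3)

multiplier <- exp_to_multiplier(codes)
damage     <- amounts * multiplier

data.frame(codes, amounts, multiplier, damage)
```

Rows whose damage is NA here correspond to the rows that the report drops after the conversion step.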

After this, the rows were grouped by event type, and the total damage for each event type was summed into a new column called EventTotal in a data frame called summary. This was then sorted in descending order as well.

Results

In this section the results obtained from the data processing will be used to address the effects of the natural disaster events on population health and economic welfare.

Results and analysis for Task 1

totalframe[1:5, ]
##           EVTYPE Injuries.and.Fatalities
## 1        TORNADO                   96979
## 2 EXCESSIVE HEAT                    8428
## 3      TSTM WIND                    7461
## 4          FLOOD                    7259
## 5      LIGHTNING                    6046
FatalitiesSummary[1:5,]
## # A tibble: 5 × 2
##   EVTYPE         Total.Fatalities
##   <fct>                     <dbl>
## 1 TORNADO                    5633
## 2 EXCESSIVE HEAT             1903
## 3 FLASH FLOOD                 978
## 4 HEAT                        937
## 5 LIGHTNING                   816
InjuriesSummary[1:5,]
## # A tibble: 5 × 2
##   EVTYPE         Total.Injuries
##   <fct>                   <dbl>
## 1 TORNADO                 91346
## 2 TSTM WIND                6957
## 3 FLOOD                    6789
## 4 EXCESSIVE HEAT           6525
## 5 LIGHTNING                5230

These tables show the five natural disaster event types with the highest i) combined fatalities and injuries, ii) fatalities alone, and iii) injuries alone.

model <- lm(log10(mergedframe$Total.Injuries) ~ log10(mergedframe$Total.Fatalities))
summary(model)
## 
## Call:
## lm(formula = log10(mergedframe$Total.Injuries) ~ log10(mergedframe$Total.Fatalities))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.41451 -0.47632 -0.06813  0.55550  1.45787 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          0.47632    0.10042   4.743 6.71e-06 ***
## log10(mergedframe$Total.Fatalities)  1.02004    0.07017  14.537  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6379 on 104 degrees of freedom
## Multiple R-squared:  0.6702, Adjusted R-squared:  0.667 
## F-statistic: 211.3 on 1 and 104 DF,  p-value: < 2.2e-16
plot(log10(mergedframe$Total.Fatalities), log10(mergedframe$Total.Injuries), pch = 19,   
     col = rainbow(length(levels(mergedframe$EVTYPE))), xlab = "log( Total Fatalities )",
     ylab = "log( Total Injuries )")
abline(model, col = "black", lwd = 3)

Figure 1 - Scatter plot of log(Injuries) against log(Fatalities) for each of the different types of natural disaster events from the data set.

This figure aims to identify whether there is any relationship between injuries and fatalities when analysing the impact of each natural disaster on public health, i.e. whether an event type that caused more injuries is also likely to have caused more fatalities. A linear regression model is fitted to the log-transformed totals and drawn on the scatter plot, and its summary is shown above. The fitted slope of about 1.02 and the R-squared of about 0.67 indicate a positive association between the two totals across event types.
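Because the model is fitted on a log10 scale, a prediction has to be back-transformed with a power of 10 to be read in numbers of injuries. The sketch below uses the intercept and slope reported in the model summary above; predict_injuries is a hypothetical helper name, and the toy data frame in the second half is invented to show how the same back-transformation works with predict(). The predictions apply to event-type totals, not to individual storms.

```r
# Coefficients copied from the model summary in this report
intercept <- 0.47632
slope     <- 1.02004

# Hypothetical helper: back-transform a log10-log10 prediction
predict_injuries <- function(fatalities) {
  10^(intercept + slope * log10(fatalities))
}

predict_injuries(c(10, 100, 1000))

# The same idea with lm(): supplying data = lets predict() accept new values.
# Toy data (made up) in which injuries are exactly three times fatalities.
toy <- data.frame(Total.Fatalities = c(1, 10, 100, 1000))
toy$Total.Injuries <- 3 * toy$Total.Fatalities

fit <- lm(log10(Total.Injuries) ~ log10(Total.Fatalities), data = toy)
coef(fit)

# Back-transform the prediction for a new fatality total
10^predict(fit, newdata = data.frame(Total.Fatalities = 50))
```

Writing the formula with column names and a data argument, rather than mergedframe$ references, is what allows predict() to evaluate the model on new data.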

Results and analysis for Task 2

summary(DataofInterest$COMBINEDCROPDMG)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 5.000e+01 5.000e+03 1.500e+04 1.724e+06 1.000e+05 5.000e+09
summary(DataofInterest$COMBINEDPROPDMG)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 1.00e+01 5.00e+03 2.00e+04 1.11e+07 1.00e+05 1.15e+11
summary[1:10, ]
## # A tibble: 10 × 2
##    EVTYPE              EventTotal
##    <fct>                    <dbl>
##  1 FLOOD             126044533500
##  2 HURRICANE/TYPHOON  29348117800
##  3 HURRICANE          10498188000
##  4 RIVER FLOOD        10108369000
##  5 ICE STORM           5108614000
##  6 FLASH FLOOD         4309101392
##  7 HAIL                3838339690
##  8 TORNADO             2335763950
##  9 HURRICANE OPAL      2157000000
## 10 HIGH WIND           1918571300

These show the summary statistics of the combined crop damage and the combined property damage for the rows that were kept, as well as a list of the 10 event types that caused the most total (property plus crop) damage.

hist(summary$EventTotal, col = "blue", ylim = c(0, 100),
     main = "Histogram of Total Expense for\nEach Natural Disaster Event",
     xlab = "Total Expense")

Figure 2 - Histogram of the total expense (crop damage and property damage) for all the event types.
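The totals span several orders of magnitude, so on the raw scale most event types fall into the lowest bin of a histogram. One possible alternative, sketched here with only the ten totals listed in the table above rather than the full summary$EventTotal column, is to plot the log10 of the totals. The variable names event_total, raw_hist and log_hist are introduced for illustration.

```r
# Top-ten totals copied from the table in this report
event_total <- c(126044533500, 29348117800, 10498188000, 10108369000,
                 5108614000, 4309101392, 3838339690, 2335763950,
                 2157000000, 1918571300)

# Raw scale: compute the bins without drawing, to inspect the counts
raw_hist <- hist(event_total, plot = FALSE)
raw_hist$counts

# log10 scale: the values spread over several bins
log_hist <- hist(log10(event_total), col = "blue",
                 main = "Histogram of log10(Total Expense)",
                 xlab = "log10(Total Expense)")
log_hist$counts
```

Comparing raw_hist$counts with log_hist$counts shows how the transformation changes the distribution of values across bins.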