Introduction

Several factors affect different aspects of a community, including public health and the economy. Some of these factors occur naturally, e.g. storms and other severe weather. Although it may be almost impossible to prevent such natural disasters, the damage they cause (including loss of life, injuries and property damage) can be mitigated.

The purpose of this project was to process and analyze storm data to identify the types of environmental events (e.g. rain, flooding, hurricanes) that caused the most damage to population health (i.e. fatalities and injuries) and to the economy (i.e. the cost of both property and crop damage) between 1950 and 2011.

Data Processing

  • First, the storm data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database was loaded.
link <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(link, destfile = "StormData.csv.bz2")
stormdata <- read.csv("./StormData.csv.bz2", stringsAsFactors = FALSE)

##To check if data was loaded properly
head(stormdata)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
## 4         0                                               0         NA
## 5         0                                               0         NA
## 6         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
## 4         0                       0.0   100 2   0          0        2     2.5
## 5         0                       0.0   150 2   0          0        2     2.5
## 6         0                       1.5   177 2   0          0        6     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
## 4          K       0                                         3458      8626
## 5          K       0                                         3412      8642
## 6          K       0                                         3450      8748
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3
## 4          0          0              4
## 5          0          0              5
## 6          0          0              6
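
Re-running the report repeats the download each time. A minimal sketch of one way to avoid that is shown here; `fetch_once` is a hypothetical helper name, not part of the original analysis, and it assumes the file on the server does not change between runs.

```r
# Hypothetical helper (not in the original report): download a file only
# when no local copy exists yet, then return the local path.
fetch_once <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" keeps the compressed file intact on Windows
    download.file(url, destfile = destfile, mode = "wb")
  }
  destfile
}
```

With this helper, the loading step could be written as `read.csv(fetch_once(link, "StormData.csv.bz2"), stringsAsFactors = FALSE)`, since `read.csv` can read the bz2-compressed file directly.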

An exploratory data analysis was performed to give a brief overview of the whole data set. As not all the columns in the original data set are needed for this project, a new data set was created with only the columns/variables of interest.

##load needed package
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
  • Selected the columns of interest, which are:
    1. EVTYPE: The type of event recorded, e.g. wind, storm, snow, etc.
    2. FATALITIES: Number of fatalities caused by the recorded event
    3. INJURIES: Number of injuries caused by the recorded event
    4. PROPDMG: Base figure for the dollar amount of property damage caused by an event, to be scaled by PROPDMGEXP
    5. PROPDMGEXP: Magnitude identifier for PROPDMG (K = “thousands” and so on)
    6. CROPDMG: Base figure for the dollar amount of crop damage caused by an event, to be scaled by CROPDMGEXP
    7. CROPDMGEXP: Magnitude identifier for CROPDMG.
up_data <- stormdata %>%
            select("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")

##Check the new data
head(up_data)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          K       0           
## 2 TORNADO          0        0     2.5          K       0           
## 3 TORNADO          0        2    25.0          K       0           
## 4 TORNADO          0        2     2.5          K       0           
## 5 TORNADO          0        2     2.5          K       0           
## 6 TORNADO          0        6     2.5          K       0

Checked the new data set to find the number of events recorded during the study period.

##Count all records and the distinct event types
Total_events <- length(up_data$EVTYPE)
Unique_events <- length(unique(up_data$EVTYPE))

The data set contains a total of 902297 recorded events, but only 985 distinct event types. Some of the event types recorded in the data set can be grouped under a single event. For example, hurricane and typhoon events can be put together under a single event, Hurricane.

up_data$EVTYPE[grepl("tornado", up_data$EVTYPE, ignore.case = TRUE)] <- "Tornado"
up_data$EVTYPE[grepl("FLOOD",up_data$EVTYPE, ignore.case = TRUE)] <- "Flooding"
up_data$EVTYPE[grepl("hurricane|typhoon",up_data$EVTYPE, ignore.case = TRUE)] <- "Hurricane"
up_data$EVTYPE<-factor(up_data$EVTYPE)
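
To see the effect of this consolidation on a small scale, the sketch below applies the same three patterns to a made-up vector of labels. The labels and the `consolidate` helper are illustrative assumptions, not values or functions from the storm database.

```r
# Made-up event labels, for illustration only
evtype <- c("TORNADO", " Tornado F1", "FLASH FLOOD", "flooding",
            "HURRICANE ERIN", "Typhoon", "HAIL")

# Hypothetical helper applying the same three patterns as the report
consolidate <- function(x) {
  x <- trimws(x)  # extra step: drop stray leading/trailing spaces
  x[grepl("tornado", x, ignore.case = TRUE)] <- "Tornado"
  x[grepl("flood", x, ignore.case = TRUE)] <- "Flooding"
  x[grepl("hurricane|typhoon", x, ignore.case = TRUE)] <- "Hurricane"
  x
}

length(unique(evtype))               # distinct labels before
length(unique(consolidate(evtype)))  # distinct labels after
table(consolidate(evtype))
```

Labels that match none of the patterns (here "HAIL") are left untouched, which is why many near-duplicate labels can still remain after this step.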

##Check data
head(up_data)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 Tornado          0       15    25.0          K       0           
## 2 Tornado          0        0     2.5          K       0           
## 3 Tornado          0        2    25.0          K       0           
## 4 Tornado          0        2     2.5          K       0           
## 5 Tornado          0        2     2.5          K       0           
## 6 Tornado          0        6     2.5          K       0
  • To find the events most harmful to population health, a subset of the data set was created containing only the three columns needed (event type, fatalities and injuries). Most event types are recorded more than once over the course of the data collection. To account for that, all records with the same event type were grouped together before calculating the total harm to health.
##Total the fatalities and injuries for each event type, most harmful first
most_harmful <- up_data %>%
  group_by(EVTYPE) %>%
  summarise(Total_fatalities = sum(FATALITIES), Total_injuries = sum(INJURIES)) %>%
  arrange(desc(Total_fatalities + Total_injuries))

##Check the data
head(most_harmful)
## # A tibble: 6 x 3
##   EVTYPE         Total_fatalities Total_injuries
##   <fct>                     <dbl>          <dbl>
## 1 Tornado                    5661          91407
## 2 Flooding                   1525           8604
## 3 EXCESSIVE HEAT             1903           6525
## 4 TSTM WIND                   504           6957
## 5 LIGHTNING                   816           5230
## 6 HEAT                        937           2100
  • From the sub-data above, the top 5 events with the most casualties (i.e. the sum of the fatalities and injuries recorded) were picked and saved in a variable for later use.
##Compute the total casualties (fatalities + injuries) for each event type
Total_data <- aggregate(Total_fatalities + Total_injuries ~ EVTYPE, data = most_harmful, FUN = sum)

## Rename the second column of the total data
names(Total_data)[2] <- "Causalties"
## order the total harm column in descending order to get the top events
Total_data <- Total_data[order(-Total_data$Causalties), ]

top5 <- Total_data[1:5, ]
print(top5)
##             EVTYPE Causalties
## 728        Tornado      97068
## 137       Flooding      10129
## 113 EXCESSIVE HEAT       8428
## 741      TSTM WIND       7461
## 387      LIGHTNING       6046
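
The aggregate, sort and subset steps above can be traced on a toy example. The sketch below uses a small made-up data frame (not storm data) and base R only; the column name `CASUALTIES` is an illustrative choice.

```r
# Made-up records, for illustration only
toy <- data.frame(
  EVTYPE     = c("Tornado", "Flooding", "Tornado", "HAIL", "Flooding"),
  FATALITIES = c(2, 1, 3, 0, 4),
  INJURIES   = c(10, 0, 5, 1, 2)
)

# Casualties per record, then summed per event type
toy$CASUALTIES <- toy$FATALITIES + toy$INJURIES
totals <- aggregate(CASUALTIES ~ EVTYPE, data = toy, FUN = sum)

# Sort in descending order and keep the top 2 event types
totals <- totals[order(-totals$CASUALTIES), ]
head(totals, 2)
```

Here the two Tornado records contribute 12 and 8 casualties, so Tornado ranks first with 20, ahead of Flooding with 7.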

The data set also records the economic magnitude of both property and crop damage, in the PROPDMGEXP and CROPDMGEXP columns respectively. The damage figures in PROPDMG and CROPDMG are not full dollar amounts on their own; each must be scaled by its magnitude identifier. The identifiers include:

  1. K: Amount of damage in thousands of dollars
  2. M: Amount of damage in millions of dollars
  3. B: Amount of damage in billions of dollars.

Converted the identifiers to numeric multipliers in order to find the events with the greatest economic consequences. Any other identifier was given a multiplier of 1.
Also calculated the dollar value of the damage for each record by multiplying the damage figure by its multiplier.

library(dplyr)
library(tidyr)
##Replace each magnitude identifier with its numeric multiplier (anything other than K, M or B becomes 1)

up_data$PROPDMGEXP<-dplyr::recode(up_data$PROPDMGEXP,'K'=1000,'M'=1000000,'B'=1000000000,.default=1)
up_data$CROPDMGEXP<-dplyr::recode(up_data$CROPDMGEXP,'K'=1000,'M'=1000000,'B'=1000000000,.default=1)

##calculate the total amount of damage
up_data$PROPVAL <- up_data$PROPDMG * up_data$PROPDMGEXP
up_data$CROPVAL <- up_data$CROPDMG * up_data$CROPDMGEXP
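
Because the recoding above matches only uppercase K, M and B, any lowercase identifier would fall through to the default multiplier of 1. If the raw columns contain lowercase codes, a case-insensitive lookup is one possible alternative. The sketch below is an assumption-labelled variant, not the method used for the results in this report; `exp_to_multiplier` is a hypothetical helper name.

```r
# Hypothetical helper: map magnitude identifiers to multipliers,
# ignoring case; unknown or empty identifiers become 1.
exp_to_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(code)])
  mult[is.na(mult)] <- 1
  mult
}

# Made-up damage figures and identifiers, for illustration only
dmg  <- c(25, 2.5, 1.2, 40, 3)
code <- c("K", "m", "B", "", "?")
dmg * exp_to_multiplier(code)
```

Using this variant on the real data could change the totals reported below, so it would need to be applied before the damage values are calculated.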
  • A sub-data set with the total damage on all accounts (i.e. property damage + crop damage) was created. The data was also grouped by EVTYPE so that records of the same event type are summed together.
library(dplyr)
library(magrittr)
cost_data <- up_data %>%
  group_by(EVTYPE) %>%
  summarise(tot_prop = sum(PROPVAL), tot_crop = sum(CROPVAL)) %>%
  arrange(desc(tot_prop + tot_crop))

head(cost_data)
## # A tibble: 6 x 3
##   EVTYPE           tot_prop    tot_crop
##   <fct>               <dbl>       <dbl>
## 1 Flooding    167529740932. 12380099110
## 2 Hurricane    85336410030   5506117810
## 3 Tornado      58581598040.   417461520
## 4 STORM SURGE  43323536000         5000
## 5 HAIL         15727367053.  3025537890
## 6 DROUGHT       1046106000  13972566000
  • From the sub-data above, the top 5 events with the greatest economic consequences (i.e. the sum of the total property and crop damage) were picked and saved in a variable for later use.
##Compute the total damage (property + crop) for each event type
most_cost <- aggregate(tot_prop + tot_crop ~ EVTYPE, data = cost_data, FUN = sum)

## Rename the second column of the total data
names(most_cost)[2] <- "TOTDMGEXP"
## order the total damage column in descending order to get the costliest events
most_cost <- most_cost[order(-most_cost$TOTDMGEXP), ]

top5a <- most_cost[1:5, ]
print(top5a)
##          EVTYPE    TOTDMGEXP
## 137    Flooding 179909840042
## 340   Hurricane  90842527840
## 728     Tornado  58999059560
## 574 STORM SURGE  43323541000
## 194        HAIL  18752904943
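
Twelve-digit dollar totals are hard to compare at a glance. As a small sketch, the three largest totals from the table above can be expressed in billions of dollars; the vector name `totals_usd` is an illustrative choice.

```r
# Three largest totals, copied from the top5a table above (US dollars)
totals_usd <- c(Flooding = 179909840042,
                Hurricane = 90842527840,
                Tornado = 58999059560)

# Convert to billions of dollars, rounded to one decimal place
totals_bn <- round(totals_usd / 1e9, 1)
totals_bn
```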

Results

library(ggplot2)

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

- Created a visual representation (using base R's barplot) of the types of events that are most harmful to population health (i.e. have the highest total of recorded fatalities and injuries). Only the top 5 events were plotted to make the figure less cumbersome and easier to interpret. A couple of steps were taken to create the final plot: the graphical parameters were set with par, then the bar chart was drawn.
par(tcl = 0.5, mgp = c(4, 0, 0), las = 1,
    mar = c(6.1, 6.1, 5.1, 2.1), 
    family = 'serif')               
##Plot graph
barplot(top5$Causalties, col = "Coral", xlab = "Type of Event Recorded", ylab = "Total Number of Casualties", main = "Top 5 Events with the Most Harmful Effects on Population Health", sub = "(Between 1950-2011)", names.arg = top5$EVTYPE, las = 1)

Figure 1: Plot of the top 5 events with the most harmful effect on population health (fatalities and injuries), as recorded between the years 1950-2011
From the plot above, tornadoes had the most harmful effect on population health, causing 97068 recorded casualties (fatalities and injuries combined), far more than any other event type.
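
The same chart could also be drawn with ggplot2, which this report loads. The sketch below is one possible version, assuming ggplot2 is installed. The data frame `top5_demo` is a stand-in built from the totals printed earlier, not the `top5` object itself.

```r
library(ggplot2)

# Stand-in for top5, with values copied from the table printed earlier
top5_demo <- data.frame(
  EVTYPE = c("Tornado", "Flooding", "EXCESSIVE HEAT", "TSTM WIND", "LIGHTNING"),
  Casualties = c(97068, 10129, 8428, 7461, 6046)
)

# reorder() keeps the bars in descending order of casualties
p <- ggplot(top5_demo, aes(x = reorder(EVTYPE, -Casualties), y = Casualties)) +
  geom_col(fill = "coral") +
  labs(x = "Type of Event Recorded",
       y = "Total Number of Casualties",
       title = "Top 5 Events with the Most Harmful Effects on Population Health",
       subtitle = "(Between 1950-2011)")
print(p)
```

Without `reorder()`, ggplot2 would arrange the bars alphabetically by event name instead of by size.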

2. Across the United States, which types of events have the greatest economic consequences?

par(tcl = 0.5, mgp = c(4, 0, 0), las = 1,
    mar = c(6.1, 6.1, 5.1, 2.1), 
    family = 'serif')

##Plot graph
barplot(top5a$TOTDMGEXP, col = "Coral", xlab = "Type of Event Recorded", ylab = "Total Damage (US dollars)", main = "Top 5 Events with the Most Harmful Effects on the Economy", sub = "(Between 1950-2011)", names.arg = top5a$EVTYPE, las = 1)

Figure 2: Plot of the top 5 events with the most negative effect on the economy (with respect to property and crop damage), as recorded between the years 1950-2011
From the plot above, flooding had the most negative effect on the economy, with about 180 billion dollars in combined property and crop damage.