Synopsis

In this report we aim to answer two research questions:

1- Which types of events are most harmful with respect to population health across the U.S.?

2- Which types of events have the greatest economic consequences across the U.S.?

To answer these two questions, we performed a simple analysis of the storm data provided on the Coursera website. We found that the event type most harmful to population health across the U.S. is TORNADO, and that the event type with the greatest economic consequences is TROPICAL STORM GORDON.

Getting and Processing Data

I downloaded the data from the Reproducible Research course website and read it into the storm_df data frame. To keep this step reproducible, both the download and the read are done by the code below.

# Loading some packages for data manipulation and visualization
library(ggplot2); library(dplyr); library(tidyr); library(readr)

data_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

## Download the file containing the data and extract it in the same 
## working directory, if the .csv.bz2 file already exists then skip downloading

# Name of the compressed data file in the working directory
data_file <- "repdata-data-StormData.csv.bz2"

# This checks if this R script was run before:
# 'storm_df' is the data frame where I store the "StormData.csv" contents
if (!exists("storm_df")) {

        # Download the compressed file only if it is not already in the
        # working directory; mode = "wb" keeps the binary archive intact
        if (!file.exists(data_file)) {
                download.file(data_url, data_file, mode = "wb")
        }

        # Read it into the R environment; read_csv() decompresses .bz2 files
        # based on the file extension and already returns a tibble
        storm_df <- read_csv(data_file)
}

Data Preparation for Analysis and Plotting

First, let’s have a look at storm_df to get a sense of the data:

storm_df
## Source: local data frame [902,297 x 37]
## 
##    STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
##      (dbl)              (chr)    (int)     (chr)  (dbl)      (chr) (chr)
## 1        1  4/18/1950 0:00:00      130       CST     97     MOBILE    AL
## 2        1  4/18/1950 0:00:00      145       CST      3    BALDWIN    AL
## 3        1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4        1   6/8/1951 0:00:00      900       CST     89    MADISON    AL
## 5        1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6        1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
## 7        1 11/16/1951 0:00:00      100       CST      9     BLOUNT    AL
## 8        1  1/22/1952 0:00:00      900       CST    123 TALLAPOOSA    AL
## 9        1  2/13/1952 0:00:00     2000       CST    125 TUSCALOOSA    AL
## 10       1  2/13/1952 0:00:00     2000       CST     57    FAYETTE    AL
## ..     ...                ...      ...       ...    ...        ...   ...
## Variables not shown: EVTYPE (chr), BGN_RANGE (dbl), BGN_AZI (lgl),
##   BGN_LOCATI (lgl), END_DATE (lgl), END_TIME (lgl), COUNTY_END (dbl),
##   COUNTYENDN (lgl), END_RANGE (dbl), END_AZI (lgl), END_LOCATI (lgl),
##   LENGTH (dbl), WIDTH (dbl), F (int), MAG (dbl), FATALITIES (dbl),
##   INJURIES (dbl), PROPDMG (dbl), PROPDMGEXP (chr), CROPDMG (dbl),
##   CROPDMGEXP (lgl), WFO (lgl), STATEOFFIC (lgl), ZONENAMES (lgl), LATITUDE
##   (dbl), LONGITUDE (dbl), LATITUDE_E (dbl), LONGITUDE_ (dbl), REMARKS
##   (lgl), REFNUM (dbl)
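
The printout above reports several text-like columns, CROPDMGEXP among them, as logical (lgl). By default read_csv() guesses each column's type from an initial sample of rows (the guess_max argument, 1000 by default), so the lgl type may reflect type guessing on empty early rows rather than a genuinely empty column. The following is a minimal sketch of declaring the type explicitly with col_types; the small CSV written to a temporary file and the name toy are hypothetical and only stand in for the storm file.

```r
library(readr)

# Hypothetical three-column CSV standing in for the storm file
tmp <- tempfile(fileext = ".csv")
writeLines(c("EVTYPE,CROPDMG,CROPDMGEXP",
             "TORNADO,0,",
             "FLOOD,5,K"), tmp)

# Declare CROPDMGEXP as character instead of relying on type guessing
toy <- read_csv(tmp, col_types = cols(
        EVTYPE     = col_character(),
        CROPDMG    = col_double(),
        CROPDMGEXP = col_character()
))

# Remove the temporary file once the data is in memory
unlink(tmp)

# Inspect the resulting column types
sapply(toy, class)
```

Applied to the real file, the same col_types argument could be passed to the read_csv() call that creates storm_df, listing only the columns whose guessed type looks suspicious.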

Now, we need to analyze the data to answer the following questions:

1- Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

2- Across the United States, which types of events have the greatest economic consequences?

To achieve that, first we need to prepare the data in a way that makes answering those questions as easy as possible.

In the next code chunk, I’ll use the dplyr package to preprocess the data.

# Processing PROPDMGEXP by mapping each character to its corresponding numeric value
# This will take 'h', 'H', 'K', 'm', 'M', 'B' from the PROPDMGEXP variable and map them to
# 10^2, 10^2, 10^3, 10^6, 10^6, 10^9 respectively (the empty string '' maps to 1)

PROPDMGEX_FILTERED <- 
        storm_df$PROPDMGEXP[storm_df$PROPDMGEXP %in% c('', 'h', 'H', 'K', 'm', 'M', 'B')]

PROPDMGEXP_VALUES <- 
        as.numeric(plyr::mapvalues(PROPDMGEX_FILTERED,
                             from=c('', 'h', 'H', 'K', 'm', 'M', 'B'),
                             to=c(1, 10^2, 10^2, 10^3, 10^6, 10^6, 10^9)))


# As for CROPDMG, there is no need to do the same: in storm_df as read above,
# the CROPDMGEXP variable contains only NAs.
# Check using all(is.na(storm_df$CROPDMGEXP))

# Now I will use dplyr to create a new df , storm_ready, to make the analysis easier.

# 1- Select from storm_df the variables of interest:
#    EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG
# 2- Exclude rows where PROPDMGEXP contains values other than:
#    '', 'h', 'H', 'K', 'm', 'M', 'B'. 
#    (Values other than these have no specific meaning)
# 3- Group by EVTYPE
# 4- Make 2 summaries:
#    TOTDMG: that's the sum of (PROPDMG * PROPDMGEXP_VALUES + CROPDMG)
#    HARMED_PEOPLE, that's the sum of (INJURIES + FATALITIES)
# 5- Arrange in descending order of HARMED_PEOPLE, then TOTDMG
# 6- Keep only event types with both health harm and economic damage
#    (TOTDMG > 0 and HARMED_PEOPLE > 0; the rest are not relevant to our analysis)
# 7- Add a new variable, EVENT_INDEX, that just records the row index of the resulting df

storm_ready <- storm_df %>%
        select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG) %>%
        filter(PROPDMGEXP %in% c('', 'h', 'H', 'K', 'm', 'M', 'B')) %>%
        group_by(EVTYPE) %>%
        summarise(TOTDMG = sum(PROPDMG * PROPDMGEXP_VALUES + CROPDMG),
                  HARMED_PEOPLE = sum(INJURIES + FATALITIES)) %>%
        arrange(desc(HARMED_PEOPLE), desc(TOTDMG))%>%
        filter(TOTDMG > 0 & HARMED_PEOPLE > 0) %>%
        mutate(EVENT_INDEX = 1:n())
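
One point worth re-checking in the pipeline above: PROPDMGEXP_VALUES is a single vector covering all filtered rows, whereas inside summarise() the columns PROPDMG and CROPDMG hold only the rows of the current EVTYPE group. The two vectors therefore differ in length, and R's recycling rule, rather than row position, decides which multiplier meets which damage figure. The following is a minimal sketch of a row-aligned alternative that attaches the multiplier to each row before grouping. It is not the code that produced the results in this report; toy_df and its values, exp_codes, exp_mults and PROP_MULT are hypothetical names introduced for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for storm_df (values invented for illustration)
toy_df <- data.frame(
        EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HAIL", "HAIL"),
        FATALITIES = c(1, 0, 2, 0, 0),
        INJURIES   = c(3, 1, 0, 0, 4),
        PROPDMG    = c(25, 2.5, 1, 5, 7),
        PROPDMGEXP = c("K", "M", "B", "", "?"),
        CROPDMG    = c(0, 10, 0, 0, 0),
        stringsAsFactors = FALSE
)

# Exponent codes and their multipliers, in matching positions
exp_codes <- c("", "h", "H", "K", "m", "M", "B")
exp_mults <- c(1, 10^2, 10^2, 10^3, 10^6, 10^6, 10^9)

toy_ready <- toy_df %>%
        # keep only rows whose exponent code has a known meaning
        filter(PROPDMGEXP %in% exp_codes) %>%
        # attach the multiplier to each row BEFORE grouping, so it stays aligned
        mutate(PROP_MULT = exp_mults[match(PROPDMGEXP, exp_codes)]) %>%
        group_by(EVTYPE) %>%
        summarise(TOTDMG = sum(PROPDMG * PROP_MULT + CROPDMG),
                  HARMED_PEOPLE = sum(INJURIES + FATALITIES)) %>%
        arrange(desc(HARMED_PEOPLE), desc(TOTDMG))

toy_ready
```

Because PROP_MULT is an ordinary column, it is subset together with PROPDMG in every group, so each product pairs a damage figure with its own exponent code.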

Now the storm_ready data frame is clean and ready for analysis.

Take a look at storm_ready to get a sense of it:

storm_ready
## Source: local data frame [163 x 4]
## 
##               EVTYPE       TOTDMG HARMED_PEOPLE EVENT_INDEX
##                (chr)        (dbl)         (dbl)       (int)
## 1            TORNADO 2.056802e+12         96951           1
## 2     EXCESSIVE HEAT 3.423919e+10          8428           2
## 3          TSTM WIND 1.084023e+11          7461           3
## 4              FLOOD 1.123072e+12          7259           4
## 5          LIGHTNING 2.068139e+12          6046           5
## 6               HEAT 4.412652e+09          3037           6
## 7        FLASH FLOOD 5.143075e+11          2755           7
## 8          ICE STORM 1.678672e+12          2064           8
## 9  THUNDERSTORM WIND 7.685333e+11          1621           9
## 10      WINTER STORM 1.260491e+12          1527          10
## ..               ...          ...           ...         ...

Results

Addressing The First Question:

To find the most harmful event type with respect to population health, we pick the first element of the EVTYPE variable in the storm_ready data frame. Because storm_ready is already arranged in descending order of HARMED_PEOPLE, this first row has the largest number of harmed people (injured + dead).

most_harmful <- storm_ready$EVTYPE[1]

TORNADO turns out to be the most harmful event type.

We can see how many people were affected by TORNADO:

harmed_by_tornado <- storm_ready$HARMED_PEOPLE[1]

harmed_by_tornado
## [1] 96951

The plot below shows EVENT_INDEX on the x-axis and the number of people harmed, on a \(\log_{10}\) scale, on the y-axis.

I chose the \(\log_{10}\) scale to compensate for the huge dispersion in the data and make the plot easier to read.

g <- ggplot(data = storm_ready, aes(EVENT_INDEX, log10(HARMED_PEOPLE)))
# labs() takes the labels as named arguments
g <- g + 
        geom_bar(stat = "identity", fill = 'salmon', color = 'black') + 
        labs(title = "Harmful Effect of Different Events across The U.S.",
             x = "Event Index",
             y = "No. of Affected People (scaled to log10)")
g
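
To make the effect of the \(\log_{10}\) scale concrete, here is a minimal sketch using three counts taken from the storm_ready printout plus the value 1, the smallest count that can pass the HARMED_PEOPLE > 0 filter. The names harmed, raw_ratio, log_harmed and log_range are hypothetical and used only for this illustration.

```r
# Three counts from the storm_ready printout, plus 1 as the smallest
# value that can pass the HARMED_PEOPLE > 0 filter
harmed <- c(TORNADO = 96951, `EXCESSIVE HEAT` = 8428,
            `WINTER STORM` = 1527, smallest = 1)

# On the raw scale the largest count dwarfs the smallest one
raw_ratio <- max(harmed) / min(harmed)

# On the log10 scale the same values fall within a few units of each other
log_harmed <- log10(harmed)
log_range  <- diff(range(log_harmed))

print(round(log_harmed, 2))
```

Since log10(1) is 0, an event type with exactly one harmed person gets a zero-height bar in a plot of log10(HARMED_PEOPLE), which is worth keeping in mind when reading the right-hand end of the chart.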

Addressing The Second Question:

To find the event with highest economic consequences:

most_harmful_economic <- subset(storm_ready, TOTDMG == max(TOTDMG))$EVTYPE

most_harmful_economic
## [1] "TROPICAL STORM GORDON"

The plot below shows EVENT_INDEX on the x-axis and the total economic damage on the y-axis.

g <- ggplot(data = storm_ready, aes(EVENT_INDEX, TOTDMG))
# labs() takes the labels as named arguments
g <- g + 
        geom_bar(stat = "identity", fill = 'salmon') + 
        labs(title = "Total Economic Damage of Different Events across The U.S.",
             x = "Event Index",
             y = "Economic Damage")
g
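
Because EVENT_INDEX follows the ordering by HARMED_PEOPLE, the x-axis of the plot above neither ranks the event types by damage nor names them. The following is a minimal sketch of one possible alternative that ranks by TOTDMG and labels the bars with EVTYPE. The toy data frame, its values, and the helper top_n_damage are hypothetical and introduced only for illustration.

```r
library(ggplot2)

# Hypothetical toy summary standing in for storm_ready (values invented)
toy_ready <- data.frame(
        EVTYPE = c("A", "B", "C", "D"),
        TOTDMG = c(5e6, 2e9, 3e3, 7e8),
        stringsAsFactors = FALSE
)

# Hypothetical helper: keep the n most damaging event types, largest first
top_n_damage <- function(df, n) {
        ranked <- df[order(df$TOTDMG, decreasing = TRUE), ]
        head(ranked, n)
}

top3 <- top_n_damage(toy_ready, 3)

# reorder() sorts the bars by damage; coord_flip() keeps long names readable
g_top <- ggplot(top3, aes(x = reorder(EVTYPE, TOTDMG), y = TOTDMG)) +
        geom_col(fill = "salmon") +
        coord_flip() +
        labs(title = "Top Event Types by Total Economic Damage",
             x = "Event Type",
             y = "Economic Damage")
g_top
```

With the real data, calling top_n_damage(storm_ready, 10) would give the ten event types with the largest TOTDMG, which could then be passed to the same plotting code.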