Reproducible Research Week 4 Course Project 2

Synopsis

-This is an exploration of the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database.

-This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, what type of event they are, and estimates of the resulting fatalities, injuries, and various forms of damage.

-The dataset used in this project is provided by the U.S. National Oceanic and Atmospheric Administration (NOAA).

-The analysis uses a subset of the data: EVTYPE, FATALITIES, INJURIES, CROPDMG, and PROPDMG, plus two derived columns, CROPEXP and PROPEXP, which were computed from CROPDMGEXP and PROPDMGEXP respectively.

-The CROPDMG and PROPDMG values are paired with exponent codes (columns CROPDMGEXP and PROPDMGEXP), so the codes were converted to numeric multipliers and multiplied with the damage columns to obtain the actual damage values.

-Graphs were plotted with the ggplot2 package (arranged side by side with gridExtra), and the data were summarised with the dplyr package.

-The analysis found that tornadoes are responsible for the highest number of both fatalities and injuries.

-The analysis also found that floods are responsible for the most property damage, while droughts cause the most crop damage.

-Objective: Explore the NOAA Storm Database to help answer important questions about severe weather events.

Data Processing

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:

Storm Data
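Because the file is bzip2-compressed, it may help to see that base R can read such a file without a separate decompression step. The sketch below is only an illustration on a made-up two-row data frame: the names toy, bz_path, and toy_back are hypothetical, and the temporary file stands in for the real storm data file.

```r
# Toy stand-in for the storm file (made-up values, not real NOAA data).
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                  FATALITIES = c(3, 1),
                  stringsAsFactors = FALSE)

# Write the toy data through a bzip2 connection to a temporary file.
bz_path <- tempfile(fileext = ".csv.bz2")  # hypothetical path
con <- bzfile(bz_path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() detects the compression and decompresses while reading.
toy_back <- read.csv(bz_path, stringsAsFactors = FALSE)
print(toy_back)
```

The same call pattern should apply to the downloaded file, provided the path passed to read.csv points at the compressed file.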

Load the required libraries, read the raw data with read.csv, and get an overview of the data with summary:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
raw_data <- read.csv("repdata-data-StormData.csv")

summary(raw_data)
##     STATE__       BGN_DATE           BGN_TIME          TIME_ZONE        
##  Min.   : 1.0   Length:902297      Length:902297      Length:902297     
##  1st Qu.:19.0   Class :character   Class :character   Class :character  
##  Median :30.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :31.2                                                           
##  3rd Qu.:45.0                                                           
##  Max.   :95.0                                                           
##                                                                         
##      COUNTY       COUNTYNAME           STATE              EVTYPE         
##  Min.   :  0.0   Length:902297      Length:902297      Length:902297     
##  1st Qu.: 31.0   Class :character   Class :character   Class :character  
##  Median : 75.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :100.6                                                           
##  3rd Qu.:131.0                                                           
##  Max.   :873.0                                                           
##                                                                          
##    BGN_RANGE          BGN_AZI           BGN_LOCATI          END_DATE        
##  Min.   :   0.000   Length:902297      Length:902297      Length:902297     
##  1st Qu.:   0.000   Class :character   Class :character   Class :character  
##  Median :   0.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :   1.484                                                           
##  3rd Qu.:   1.000                                                           
##  Max.   :3749.000                                                           
##                                                                             
##    END_TIME           COUNTY_END COUNTYENDN       END_RANGE       
##  Length:902297      Min.   :0    Mode:logical   Min.   :  0.0000  
##  Class :character   1st Qu.:0    NA's:902297    1st Qu.:  0.0000  
##  Mode  :character   Median :0                   Median :  0.0000  
##                     Mean   :0                   Mean   :  0.9862  
##                     3rd Qu.:0                   3rd Qu.:  0.0000  
##                     Max.   :0                   Max.   :925.0000  
##                                                                   
##    END_AZI           END_LOCATI            LENGTH              WIDTH         
##  Length:902297      Length:902297      Min.   :   0.0000   Min.   :   0.000  
##  Class :character   Class :character   1st Qu.:   0.0000   1st Qu.:   0.000  
##  Mode  :character   Mode  :character   Median :   0.0000   Median :   0.000  
##                                        Mean   :   0.2301   Mean   :   7.503  
##                                        3rd Qu.:   0.0000   3rd Qu.:   0.000  
##                                        Max.   :2315.0000   Max.   :4400.000  
##                                                                              
##        F               MAG            FATALITIES          INJURIES        
##  Min.   :0.0      Min.   :    0.0   Min.   :  0.0000   Min.   :   0.0000  
##  1st Qu.:0.0      1st Qu.:    0.0   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  Median :1.0      Median :   50.0   Median :  0.0000   Median :   0.0000  
##  Mean   :0.9      Mean   :   46.9   Mean   :  0.0168   Mean   :   0.1557  
##  3rd Qu.:1.0      3rd Qu.:   75.0   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  Max.   :5.0      Max.   :22000.0   Max.   :583.0000   Max.   :1700.0000  
##  NA's   :843563                                                           
##     PROPDMG         PROPDMGEXP           CROPDMG         CROPDMGEXP       
##  Min.   :   0.00   Length:902297      Min.   :  0.000   Length:902297     
##  1st Qu.:   0.00   Class :character   1st Qu.:  0.000   Class :character  
##  Median :   0.00   Mode  :character   Median :  0.000   Mode  :character  
##  Mean   :  12.06                      Mean   :  1.527                     
##  3rd Qu.:   0.50                      3rd Qu.:  0.000                     
##  Max.   :5000.00                      Max.   :990.000                     
##                                                                           
##      WFO             STATEOFFIC         ZONENAMES            LATITUDE   
##  Length:902297      Length:902297      Length:902297      Min.   :   0  
##  Class :character   Class :character   Class :character   1st Qu.:2802  
##  Mode  :character   Mode  :character   Mode  :character   Median :3540  
##                                                           Mean   :2875  
##                                                           3rd Qu.:4019  
##                                                           Max.   :9706  
##                                                           NA's   :47    
##    LONGITUDE        LATITUDE_E     LONGITUDE_       REMARKS         
##  Min.   :-14451   Min.   :   0   Min.   :-14455   Length:902297     
##  1st Qu.:  7247   1st Qu.:   0   1st Qu.:     0   Class :character  
##  Median :  8707   Median :   0   Median :     0   Mode  :character  
##  Mean   :  6940   Mean   :1452   Mean   :  3509                     
##  3rd Qu.:  9605   3rd Qu.:3549   3rd Qu.:  8735                     
##  Max.   : 17124   Max.   :9706   Max.   :106220                     
##                   NA's   :40                                        
##      REFNUM      
##  Min.   :     1  
##  1st Qu.:225575  
##  Median :451149  
##  Mean   :451149  
##  3rd Qu.:676723  
##  Max.   :902297  
## 

The variables used in this analysis are:

- EVTYPE (event type or calamity type)
- FATALITIES
- INJURIES
- CROPDMG
- CROPDMGEXP
- PROPDMG
- PROPDMGEXP

Results

Question 1

Across the United States, which types of events are most harmful with respect to population health?

Answer

As noted above, the FATALITIES and INJURIES variables, summed by event type (the EVTYPE column), answer this question: they show which event types are most harmful to population health.

Select the EVTYPE and FATALITIES columns from the raw data and total the fatalities per event type to obtain the 10 event types with the most fatalities:

#library(dplyr)

fatalities <- raw_data %>%
              select(EVTYPE, FATALITIES) %>%
              group_by(EVTYPE) %>%
              summarise(FATALITIES = sum(FATALITIES))
## `summarise()` ungrouping output (override with `.groups` argument)
fatalities_top_10 <- fatalities[order(-fatalities$FATALITIES), ][1:10, ]

fatalities_top_10
## # A tibble: 10 x 2
##    EVTYPE         FATALITIES
##    <chr>               <dbl>
##  1 TORNADO              5633
##  2 EXCESSIVE HEAT       1903
##  3 FLASH FLOOD           978
##  4 HEAT                  937
##  5 LIGHTNING             816
##  6 TSTM WIND             504
##  7 FLOOD                 470
##  8 RIP CURRENT           368
##  9 HIGH WIND             248
## 10 AVALANCHE             224
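The group, sum, and rank steps used above can also be sketched in base R, which may make the logic easier to follow. This is a minimal illustration on a made-up toy data frame (toy, totals, and top_events are hypothetical names, and the numbers are not from the NOAA data); it is an alternative formulation, not the code that produced the table above.

```r
# Toy data: several records per event type (made-up values).
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HAIL"),
  FATALITIES = c(5, 2, 7, 4, 1, 0),
  stringsAsFactors = FALSE
)

# Step 1: total the fatalities for each event type.
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# Step 2: sort by the totals in descending order and keep the first 3 rows.
top_events <- head(totals[order(-totals$FATALITIES), ], 3)
print(top_events)
```

Replacing 3 with 10 gives the same kind of top-10 ranking as in the report.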

Next, select the EVTYPE and INJURIES columns from the raw data and total the injuries per event type to obtain the 10 event types with the most injuries:

#library(dplyr)

injuries <- raw_data %>%
            select(EVTYPE, INJURIES) %>%
            group_by(EVTYPE) %>%
            summarise(INJURIES = sum(INJURIES))
## `summarise()` ungrouping output (override with `.groups` argument)
injuries_top_10 <- injuries[order(-injuries$INJURIES), ][1:10, ]

injuries_top_10
## # A tibble: 10 x 2
##    EVTYPE            INJURIES
##    <chr>                <dbl>
##  1 TORNADO              91346
##  2 TSTM WIND             6957
##  3 FLOOD                 6789
##  4 EXCESSIVE HEAT        6525
##  5 LIGHTNING             5230
##  6 HEAT                  2100
##  7 ICE STORM             1975
##  8 FLASH FLOOD           1777
##  9 THUNDERSTORM WIND     1488
## 10 HAIL                  1361

For a clearer picture, we plot the two rankings side by side using the ggplot2 and gridExtra libraries:

#library(ggplot2)
#library(gridExtra) #to plot side by side

fatalities_plot <- ggplot(fatalities_top_10, aes(reorder(EVTYPE, FATALITIES), FATALITIES)) +
     geom_bar(stat = "identity") + 
     theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
     xlab("Event Type") + ylab("Fatalities")

injuries_plot <- ggplot(injuries_top_10, aes(reorder(EVTYPE, INJURIES), INJURIES)) +
  geom_bar(stat = "identity") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) + 
  xlab("Event Type") + ylab("Injuries")

grid.arrange(fatalities_plot, injuries_plot, ncol = 2)

The plots show clearly that TORNADO is the most harmful event type, causing the most fatalities and the most injuries in the US.

Question 2

Across the United States, which types of events have the greatest economic consequences?

Answer

The economic damage caused by these events is recorded in two columns, CROPDMG and PROPDMG, which hold the damage figures for crops and property respectively. Finding out which event types caused the most crop and property damage answers this question.

A closer look at the data reveals a complication: each damage column is paired with a second column, CROPDMGEXP or PROPDMGEXP, which holds the exponent (magnitude) of the damage value. Ignoring these columns would misrepresent the data and lead to a faulty analysis.

So we first convert the exponent codes into numeric multipliers and multiply them with the damage values. This gives the actual damage values, on which the rest of the analysis is based.

Converting the exponent codes in the CROPDMGEXP column:

raw_data$CROPEXP[raw_data$CROPDMGEXP == "M"] <- 1e+06
raw_data$CROPEXP[raw_data$CROPDMGEXP == "K"] <- 1000
raw_data$CROPEXP[raw_data$CROPDMGEXP == "m"] <- 1e+06
raw_data$CROPEXP[raw_data$CROPDMGEXP == "B"] <- 1e+09
raw_data$CROPEXP[raw_data$CROPDMGEXP == "0"] <- 1
raw_data$CROPEXP[raw_data$CROPDMGEXP == "k"] <- 1000
raw_data$CROPEXP[raw_data$CROPDMGEXP == "2"] <- 100
raw_data$CROPEXP[raw_data$CROPDMGEXP == ""] <- 1


raw_data$CROPEXP[raw_data$CROPDMGEXP == "?"] <- 0

raw_data$CROPDMGVAL <- raw_data$CROPDMG * raw_data$CROPEXP

Converting the exponent codes in the PROPDMGEXP column:

raw_data$PROPEXP[raw_data$PROPDMGEXP == "K"] <- 1000
raw_data$PROPEXP[raw_data$PROPDMGEXP == "M"] <- 1e+06
raw_data$PROPEXP[raw_data$PROPDMGEXP == ""] <- 1
raw_data$PROPEXP[raw_data$PROPDMGEXP == "B"] <- 1e+09
raw_data$PROPEXP[raw_data$PROPDMGEXP == "m"] <- 1e+06
raw_data$PROPEXP[raw_data$PROPDMGEXP == "0"] <- 1
raw_data$PROPEXP[raw_data$PROPDMGEXP == "5"] <- 1e+05
raw_data$PROPEXP[raw_data$PROPDMGEXP == "6"] <- 1e+06
raw_data$PROPEXP[raw_data$PROPDMGEXP == "4"] <- 10000
raw_data$PROPEXP[raw_data$PROPDMGEXP == "2"] <- 100
raw_data$PROPEXP[raw_data$PROPDMGEXP == "3"] <- 1000
raw_data$PROPEXP[raw_data$PROPDMGEXP == "h"] <- 100
raw_data$PROPEXP[raw_data$PROPDMGEXP == "7"] <- 1e+07
raw_data$PROPEXP[raw_data$PROPDMGEXP == "H"] <- 100
raw_data$PROPEXP[raw_data$PROPDMGEXP == "1"] <- 10
raw_data$PROPEXP[raw_data$PROPDMGEXP == "8"] <- 1e+08
raw_data$PROPEXP[raw_data$PROPDMGEXP == "+"] <- 0
raw_data$PROPEXP[raw_data$PROPDMGEXP == "-"] <- 0
raw_data$PROPEXP[raw_data$PROPDMGEXP == "?"] <- 0

raw_data$PROPDMGVAL <- raw_data$PROPDMG * raw_data$PROPEXP
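The one-assignment-per-code conversions above could also be expressed with a single named lookup vector. The sketch below assumes the same mapping as the assignments above (letters in either case, digits as powers of ten, blank as 1, and "+", "-", "?" as 0); exp_to_multiplier, codes, dmg, and values are hypothetical names, and the toy inputs are made up.

```r
# Hypothetical helper: map an exponent code to a numeric multiplier.
exp_to_multiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9,
              "0" = 1,   "1" = 1e1, "2" = 1e2, "3" = 1e3, "4" = 1e4,
              "5" = 1e5, "6" = 1e6, "7" = 1e7, "8" = 1e8,
              "+" = 0, "-" = 0, "?" = 0)
  # toupper() folds "k", "m", "h" onto "K", "M", "H".
  mult <- unname(lookup[toupper(code)])
  # A blank code cannot be used as a lookup name, so handle it separately.
  mult[code == ""] <- 1
  mult
}

# Toy example (made-up values): damage figures and their exponent codes.
codes  <- c("K", "m", "", "B", "?", "2")
dmg    <- c(25, 2.5, 10, 1, 5, 3)
values <- dmg * exp_to_multiplier(codes)
print(values)
```

Any code missing from the lookup comes back as NA, which makes unexpected codes easy to spot before summing.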

The newly constructed columns now hold the converted multipliers (CROPEXP and PROPEXP) and the final damage values (CROPDMGVAL and PROPDMGVAL):

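# Select the new columns by name rather than by position
head(raw_data[, c("CROPEXP", "CROPDMGVAL", "PROPEXP", "PROPDMGVAL")])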
##   CROPEXP CROPDMGVAL PROPEXP PROPDMGVAL
## 1       1          0    1000      25000
## 2       1          0    1000       2500
## 3       1          0    1000      25000
## 4       1          0    1000       2500
## 5       1          0    1000       2500
## 6       1          0    1000       2500

With the newly created columns and the EVTYPE column we can now answer the question. First, add CROPDMGVAL and PROPDMGVAL to get the total damage for each record:

raw_data$TOTALDMG <- raw_data$CROPDMGVAL + raw_data$PROPDMGVAL

Select the EVTYPE and TOTALDMG columns (the latter computed above) and total the damage per event type to obtain the 10 event types that caused the most combined crop and property damage:

#library(dplyr)
total_dmg <- raw_data %>%
            select(EVTYPE, TOTALDMG) %>%
            group_by(EVTYPE) %>%
            summarise(TOTALDMG = sum(TOTALDMG))
## `summarise()` ungrouping output (override with `.groups` argument)
total_dmg_top10 <- total_dmg[order(-total_dmg$TOTALDMG), ][1:10, ]

total_dmg_top10
## # A tibble: 10 x 2
##    EVTYPE                 TOTALDMG
##    <chr>                     <dbl>
##  1 FLOOD             150319678257 
##  2 HURRICANE/TYPHOON  71913712800 
##  3 TORNADO            57362333886.
##  4 STORM SURGE        43323541000 
##  5 HAIL               18761221986.
##  6 FLASH FLOOD        18243991078.
##  7 DROUGHT            15018672000 
##  8 HURRICANE          14610229010 
##  9 RIVER FLOOD        10148404500 
## 10 ICE STORM           8967041360
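The synopsis reports property and crop damage separately, while the table above ranks their sum. A minimal sketch of how the two components could be ranked on their own is shown here, using made-up toy figures rather than the NOAA data; toy, by_event, top_property, and top_crop are hypothetical names.

```r
# Toy data with already-converted damage values (made-up figures).
toy <- data.frame(
  EVTYPE     = c("FLOOD", "DROUGHT", "FLOOD", "HAIL", "DROUGHT"),
  PROPDMGVAL = c(5e6, 1e5, 2e6, 3e5, 0),
  CROPDMGVAL = c(1e5, 4e6, 2e5, 6e5, 3e6),
  stringsAsFactors = FALSE
)

# Sum both damage columns per event type in one call.
by_event <- aggregate(cbind(PROPDMGVAL, CROPDMGVAL) ~ EVTYPE,
                      data = toy, FUN = sum)

# Pick the event type with the largest total in each column.
top_property <- by_event$EVTYPE[which.max(by_event$PROPDMGVAL)]
top_crop     <- by_event$EVTYPE[which.max(by_event$CROPDMGVAL)]
print(by_event)
```

Applied to raw_data in place of toy, the same pattern would give one ranking per damage type.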

Plotting the above data:

economic_loss_plot <- ggplot(total_dmg_top10, aes(reorder(EVTYPE, TOTALDMG), TOTALDMG)) +
  geom_bar(stat = "identity") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) + 
  xlab("Event Type") + ylab("Total Damage (USD)")

economic_loss_plot

Final Observations:

The first pair of plots shows that the event type most harmful to population health is TORNADO, as it is responsible for the highest number of both fatalities and injuries.

The last plot shows that floods account for the largest share of the combined damage to crops and property, so we conclude that the most economically destructive event type is FLOOD.