Reproducible Research. Course Project 2

Analysis of the characteristics of major storms and weather events in the United States.

Synopsis

Storm Data is an official publication of the National Oceanic and Atmospheric Administration (NOAA). It documents the occurrence of storms and other significant weather phenomena with sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce. For each event it records when and where it occurred, along with estimates of any fatalities, injuries and property damage. This analysis visualises the impact that various types of severe weather had on two major aspects of people's lives in the affected areas. It is split into two parts: the first measures the effect of weather-related events on the number of fatalities and injuries, and the second gives a view of the economic impact (expressed in USD) of such events over the period covered by the data set.

The analysis sets out to answer two questions:

  1. Across the United States, which types of events are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

The Storm Data file that is the subject of this analysis is downloaded from the URL given in the Data processing section.

The Storm Data documentation is available here

Data processing

Downloading the Storm Data file and reading it into R

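# Download the compressed file only if it is not already present.
if (!file.exists("./2FStormData.csv.bz2")) {
  url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  # mode must be the string "wb" so the archive is written as binary.
  download.file(url, destfile = "./2FStormData.csv.bz2", mode = "wb", method = "curl")
}
# read.csv reads the .bz2 file directly, without a separate decompression step.
df <- read.csv("./2FStormData.csv.bz2", stringsAsFactors = FALSE)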

Loading required packages

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(knitr)
library(gridExtra)

## Warning: package 'gridExtra' was built under R version 3.2.4

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

Data set dimensions

dim(df)

## [1] 902297     37

Summary of data set

summary(df)

##     STATE__       BGN_DATE           BGN_TIME          TIME_ZONE        
##  Min.   : 1.0   Length:902297      Length:902297      Length:902297     
##  1st Qu.:19.0   Class :character   Class :character   Class :character  
##  Median :30.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :31.2                                                           
##  3rd Qu.:45.0                                                           
##  Max.   :95.0                                                           
##                                                                         
##      COUNTY       COUNTYNAME           STATE              EVTYPE         
##  Min.   :  0.0   Length:902297      Length:902297      Length:902297     
##  1st Qu.: 31.0   Class :character   Class :character   Class :character  
##  Median : 75.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :100.6                                                           
##  3rd Qu.:131.0                                                           
##  Max.   :873.0                                                           
##                                                                          
##    BGN_RANGE          BGN_AZI           BGN_LOCATI       
##  Min.   :   0.000   Length:902297      Length:902297     
##  1st Qu.:   0.000   Class :character   Class :character  
##  Median :   0.000   Mode  :character   Mode  :character  
##  Mean   :   1.484                                        
##  3rd Qu.:   1.000                                        
##  Max.   :3749.000                                        
##                                                          
##    END_DATE           END_TIME           COUNTY_END COUNTYENDN    
##  Length:902297      Length:902297      Min.   :0    Mode:logical  
##  Class :character   Class :character   1st Qu.:0    NA's:902297   
##  Mode  :character   Mode  :character   Median :0                  
##                                        Mean   :0                  
##                                        3rd Qu.:0                  
##                                        Max.   :0                  
##                                                                   
##    END_RANGE          END_AZI           END_LOCATI       
##  Min.   :  0.0000   Length:902297      Length:902297     
##  1st Qu.:  0.0000   Class :character   Class :character  
##  Median :  0.0000   Mode  :character   Mode  :character  
##  Mean   :  0.9862                                        
##  3rd Qu.:  0.0000                                        
##  Max.   :925.0000                                        
##                                                          
##      LENGTH              WIDTH                F               MAG         
##  Min.   :   0.0000   Min.   :   0.000   Min.   :0.0      Min.   :    0.0  
##  1st Qu.:   0.0000   1st Qu.:   0.000   1st Qu.:0.0      1st Qu.:    0.0  
##  Median :   0.0000   Median :   0.000   Median :1.0      Median :   50.0  
##  Mean   :   0.2301   Mean   :   7.503   Mean   :0.9      Mean   :   46.9  
##  3rd Qu.:   0.0000   3rd Qu.:   0.000   3rd Qu.:1.0      3rd Qu.:   75.0  
##  Max.   :2315.0000   Max.   :4400.000   Max.   :5.0      Max.   :22000.0  
##                                         NA's   :843563                    
##    FATALITIES          INJURIES            PROPDMG       
##  Min.   :  0.0000   Min.   :   0.0000   Min.   :   0.00  
##  1st Qu.:  0.0000   1st Qu.:   0.0000   1st Qu.:   0.00  
##  Median :  0.0000   Median :   0.0000   Median :   0.00  
##  Mean   :  0.0168   Mean   :   0.1557   Mean   :  12.06  
##  3rd Qu.:  0.0000   3rd Qu.:   0.0000   3rd Qu.:   0.50  
##  Max.   :583.0000   Max.   :1700.0000   Max.   :5000.00  
##                                                          
##   PROPDMGEXP           CROPDMG         CROPDMGEXP       
##  Length:902297      Min.   :  0.000   Length:902297     
##  Class :character   1st Qu.:  0.000   Class :character  
##  Mode  :character   Median :  0.000   Mode  :character  
##                     Mean   :  1.527                     
##                     3rd Qu.:  0.000                     
##                     Max.   :990.000                     
##                                                         
##      WFO             STATEOFFIC         ZONENAMES            LATITUDE   
##  Length:902297      Length:902297      Length:902297      Min.   :   0  
##  Class :character   Class :character   Class :character   1st Qu.:2802  
##  Mode  :character   Mode  :character   Mode  :character   Median :3540  
##                                                           Mean   :2875  
##                                                           3rd Qu.:4019  
##                                                           Max.   :9706  
##                                                           NA's   :47    
##    LONGITUDE        LATITUDE_E     LONGITUDE_       REMARKS         
##  Min.   :-14451   Min.   :   0   Min.   :-14455   Length:902297     
##  1st Qu.:  7247   1st Qu.:   0   1st Qu.:     0   Class :character  
##  Median :  8707   Median :   0   Median :     0   Mode  :character  
##  Mean   :  6940   Mean   :1452   Mean   :  3509                     
##  3rd Qu.:  9605   3rd Qu.:3549   3rd Qu.:  8735                     
##  Max.   : 17124   Max.   :9706   Max.   :106220                     
##                   NA's   :40                                        
##      REFNUM      
##  Min.   :     1  
##  1st Qu.:225575  
##  Median :451149  
##  Mean   :451149  
##  3rd Qu.:676723  
##  Max.   :902297  
## 

List of variables

names(df)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Part I of the data processing produces the quantitative data needed to answer the following question:

Across the United States, which types of events are most harmful with respect to population health?

After reviewing all the variables available in the data set, two of them, FATALITIES and INJURIES, were selected as the measures of harm to population health.

From the original data set we select three variables (EVTYPE, FATALITIES, INJURIES). We then add FATALITIES and INJURIES together into a new Fatalities_and_Injuries column and arrange the rows by that column in descending order. The result is stored in the df_x data frame.

df_x <- df %>%
  select(EVTYPE, FATALITIES, INJURIES) %>% 
  mutate(Fatalities_and_Injuries = FATALITIES + INJURIES) %>%
  arrange(desc(Fatalities_and_Injuries))

In the next step, we aggregate the Fatalities_and_Injuries variable from df_x by EVTYPE using the sum function. We arrange the totals in descending order, keep the top 15 observations, and store the result in the df_y data frame.

df_y <- aggregate(x=df_x$Fatalities_and_Injuries, by = list(df_x$EVTYPE ), FUN = sum) %>% 
  select(EVTYPE = Group.1, Fatalities_and_Injuries = x) %>% 
  arrange( desc(Fatalities_and_Injuries)) %>% 
  head(15)

The df_y data frame is the final output of the Part I processing and identifies which types of events are most harmful with respect to population health.

Part II of the data processing produces the quantitative data needed to answer the following question:

Across the United States, which types of events have the greatest economic consequences?

After reviewing all the variables available in the data set, two of them, PROPDMG and CROPDMG, were selected as the measures of economic consequences.
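The Results section plots a data frame named combined_dmg with a DAMAGEUSD column, but the steps that build it are not shown in this report. The following is a minimal sketch of one way such a table could be produced, not the original processing. It assumes that PROPDMGEXP and CROPDMGEXP hold magnitude codes (H, K, M, B for hundreds, thousands, millions and billions) and that any other code means a multiplier of 1. The helper names exp_to_multiplier and summarise_damage, and the toy data, are hypothetical.

```r
# Hypothetical helper: turn an exponent code into a numeric multiplier.
# Assumption: H/K/M/B (any case) are the only meaningful codes; everything
# else, including the empty string, is treated as a multiplier of 1.
exp_to_multiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(code)]
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Hypothetical helper: total property + crop damage in USD per event type,
# sorted in descending order and cut to the top n rows.
summarise_damage <- function(data, n = 15) {
  damage_usd <- data$PROPDMG * exp_to_multiplier(data$PROPDMGEXP) +
    data$CROPDMG * exp_to_multiplier(data$CROPDMGEXP)
  totals <- aggregate(damage_usd, by = list(EVTYPE = data$EVTYPE), FUN = sum)
  names(totals)[2] <- "DAMAGEUSD"
  totals <- totals[order(-totals$DAMAGEUSD), ]
  head(totals, n)
}

# Toy data for illustration only; these are not values from Storm Data.
toy <- data.frame(
  EVTYPE     = c("FLOOD", "HAIL", "FLOOD", "TORNADO"),
  PROPDMG    = c(2, 5, 1, 10),
  PROPDMGEXP = c("M", "K", "B", ""),
  CROPDMG    = c(0, 3, 0, 0),
  CROPDMGEXP = c("", "k", "", ""),
  stringsAsFactors = FALSE
)
toy_dmg <- summarise_damage(toy)
```

On the full data set the corresponding call would be combined_dmg <- summarise_damage(df), with the default n = 15 mirroring the cut-off used in Part I.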

Results

Across the United States, which types of events are most harmful with respect to population health?

Answer:

g <- ggplot(df_y, aes(EVTYPE, Fatalities_and_Injuries, color = EVTYPE, fill = EVTYPE))
g + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1))

According to the results visualised above, the top three events that are most harmful with respect to population health are:

  1. TORNADO
  2. EXCESSIVE HEAT
  3. TSTM WIND
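Because EVTYPE is a character column, ggplot2 places the bars in alphabetical order rather than by size. A minimal sketch, assuming a ranked display is wanted, is to convert EVTYPE to a factor ordered by the totals using reorder() before plotting. The toy_y data frame below is a made-up stand-in for df_y, and ranked_plot is a hypothetical name.

```r
# Toy stand-in for df_y; the numbers are illustrative only.
toy_y <- data.frame(
  EVTYPE = c("FLOOD", "TORNADO", "HEAT"),
  Fatalities_and_Injuries = c(30, 100, 60),
  stringsAsFactors = FALSE
)

# reorder() returns a factor whose levels are sorted by the second argument;
# negating the totals puts the largest total first on the x axis.
toy_y$EVTYPE <- reorder(toy_y$EVTYPE, -toy_y$Fatalities_and_Injuries)

# Build the plot only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  ranked_plot <- ggplot(toy_y, aes(EVTYPE, Fatalities_and_Injuries, fill = EVTYPE)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}
```

The same reorder() call applied to df_y$EVTYPE would rank the bars of the existing plot without changing any of the totals.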

Across the United States, which types of events have the greatest economic consequences?

Answer:

combined_dmg_plot <- ggplot(combined_dmg, aes(EVTYPE, DAMAGEUSD, color = EVTYPE, fill = EVTYPE)) +
  geom_bar(stat = "identity") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Combined property and crop related damages in USD")
combined_dmg_plot

According to the results visualised above, the top three events that have the greatest economic consequences are:

  1. FLOOD
  2. HURRICANE/TYPHOON
  3. HAIL