Severe weather events and their public consequences across the United States from 1950 to November 2011

Synopsis

Public resources are often limited, so it is important to prioritize weather events by the threat they pose to the health of citizens and to their property. In this report we explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to help government or municipal managers prepare for severe weather events. To that end we try to answer the following questions:

Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

Data loading and Processing

Data loading and initial reading

First we load the necessary packages and define some useful URLs.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(lattice)
fileurl<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
fileurl1<-"https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf"
fileurl2<-"https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf"

The U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database can be downloaded from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2. The database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The records start in 1950 and end in November 2011.

if(!file.exists("stormData.zip")){
   download.file(fileurl,destfile = "stormData.zip",method="curl")}
stormdata <- read.csv("stormData.zip")
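The download is a bzip2 archive even though the local file is named stormData.zip. The following is a minimal sketch of why read.csv() still works: R's file connections detect compression from the file contents, not from the extension. The toy data frame and the temporary file are hypothetical and only serve the illustration.

```r
# Write a tiny CSV through a bzip2 connection, deliberately using a misleading
# ".zip" extension as in the report (the temp file is hypothetical).
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0))
path <- tempfile(fileext = ".zip")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path with file(), which inspects the contents,
# so the bzip2 data is decompressed transparently.
roundtrip <- read.csv(path, stringsAsFactors = FALSE)
print(roundtrip)
```

Naming the local file StormData.csv.bz2 would make the format more obvious to a reader, but it is not required for the code to run.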

Now that we have the data, we call str() to see some information about the variables.

str(stormdata)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Data filtering

In our analysis we want to focus on the events that caused fatalities, injuries, or crop or property damage. For this reason we filter the initial data and create two separate data sets, one for public health concerns and one for economic consequences.

For public health we keep the rows where FATALITIES or INJURIES is greater than zero.

phstormdata<-filter(stormdata,FATALITIES>0|INJURIES>0)

For economic consequences we keep the rows where CROPDMG or PROPDMG is greater than zero.

ecstormdata<-filter(stormdata,CROPDMG>0|PROPDMG>0)
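Both filters use `|` (or), so a row is kept when either measure is positive. A small sketch on made-up rows, written in base R so it runs without extra packages, shows how this differs from requiring both measures to be positive:

```r
# Toy rows (invented values) to contrast "either" with "both".
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD", "HEAT"),
  FATALITIES = c(2, 0, 0, 1),
  INJURIES   = c(5, 0, 3, 0)
)

# Equivalent to filter(toy, FATALITIES > 0 | INJURIES > 0)
either <- toy[toy$FATALITIES > 0 | toy$INJURIES > 0, ]

# Stricter alternative: both measures must be positive
both <- toy[toy$FATALITIES > 0 & toy$INJURIES > 0, ]

print(either$EVTYPE)
print(both$EVTYPE)
```

With `|`, events that only injured or only killed are retained, which is what the later per-measure sums need.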

Results

Question 1: Across the United States, which types of events are most harmful with respect to population health?

For the first question we take the phstormdata data set created above and group it by EVTYPE. Then we sum the fatalities for each EVTYPE.

fatalities<-group_by(phstormdata,EVTYPE) %>%
    summarize(Total_fatalities=sum(FATALITIES))

Now we arrange the data in descending order of fatalities and keep the first 20 EVTYPE values.

fatalities<-arrange(fatalities,desc(Total_fatalities))
fatalities2<-fatalities[1:20,]

The question that arises is whether the subset created above covers a satisfactory share of all fatalities caused by severe weather events over the whole period. To answer that we sum Total_fatalities in the two data sets (one with all EVTYPE values and one with the top 20) and divide the top-20 sum by the overall sum.

percen<-sum(fatalities2$Total_fatalities)/sum(fatalities$Total_fatalities)
percentageF<-paste(round(percen*100,2),"%",sep="")
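The same coverage check is repeated later for injuries and for both kinds of damage, so it could be wrapped in a small helper. This is only a sketch: `top_share` is a hypothetical name, and the toy numbers are invented.

```r
# Share of the grand total that the n largest groups account for.
# `top_share` is a hypothetical helper, not part of the original analysis.
top_share <- function(values, groups, n = 20) {
  # as.character() avoids NA totals for unused factor levels
  totals <- sort(tapply(values, as.character(groups), sum), decreasing = TRUE)
  sum(head(totals, n)) / sum(totals)
}

toy_type  <- c("A", "A", "B", "C", "D")
toy_fatal <- c(50, 30, 15, 4, 1)

# Group totals are A = 80, B = 15, C = 4, D = 1; the top 2 cover 95 of 100.
share <- top_share(toy_fatal, toy_type, n = 2)
print(paste0(round(share * 100, 2), "%"))
```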

The percentage is 89.22%. At almost 90%, we are confident that by isolating these events we capture the deadliest ones.

Now we repeat the procedure for the variable INJURIES.

injuries<-group_by(phstormdata,EVTYPE)%>%
    summarize(Total_Injuries=sum(INJURIES))
injuries<-arrange(injuries,desc(Total_Injuries))
injuries2<-injuries[1:20,]
percen<-sum(injuries2$Total_Injuries)/sum(injuries$Total_Injuries)
percentageI<-paste(round(percen*100,2),"%",sep="")

For injuries the percentage is 95.81%, which gives us even greater confidence than in the fatalities case that by isolating these events we capture the ones most harmful to population health. Now we merge the two data sets, fatalities and injuries, to create a data set containing the most harmful events.

populationhealth<-merge(fatalities2,injuries2,by="EVTYPE",all=TRUE)
EVTYPE1<-populationhealth$EVTYPE

Some of the events appear in only one of the two top-20 lists (injuries or fatalities), so the merge produces NA values. We replace the NAs with zeros (a zero here means the event fell outside the top 20 for that measure, not that it caused no such harm) and then add injuries and fatalities to obtain the total number of incidents in a new variable called Total_FAT_INJ. At this point we also notice, by looking at the event names

AVALANCHE, BLIZZARD, DUST STORM, EXCESSIVE HEAT, EXTREME COLD, EXTREME COLD/WIND CHILL, FLASH FLOOD, FLOOD, FOG, HAIL, HEAT, HEAT WAVE, HEAVY SNOW, HIGH SURF, HIGH WIND, HURRICANE/TYPHOON, ICE STORM, LIGHTNING, RIP CURRENT, RIP CURRENTS, STRONG WIND, THUNDERSTORM WIND, THUNDERSTORM WINDS, TORNADO, TSTM WIND, WILD/FOREST FIRE, WILDFIRE, WINTER STORM

that some events are duplicated because of inconsistent spelling and the use of abbreviations. These are RIP CURRENT and RIP CURRENTS; THUNDERSTORM WIND, THUNDERSTORM WINDS and TSTM WIND; and WILDFIRE and WILD/FOREST FIRE. To determine the correct names we use the National Weather Service Storm Data Documentation located at https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf. In the chunk that follows we add up the duplicated rows and use the correct name for each event.

populationhealth$Total_fatalities[is.na(populationhealth$Total_fatalities)]<-0
populationhealth$Total_Injuries[is.na(populationhealth$Total_Injuries)]<-0
populationhealth[19:20,]=data.frame("RIP CURRENT",
    populationhealth$Total_fatalities[19]+populationhealth$Total_fatalities[20],
    populationhealth$Total_Injuries[19]+populationhealth$Total_Injuries[20])
populationhealth[c(22,23,25),]<-data.frame("THUNDERSTORM WIND",
    populationhealth$Total_fatalities[22]+
    populationhealth$Total_fatalities[23]+populationhealth$Total_fatalities[25]
    ,populationhealth$Total_Injuries[22]+populationhealth$Total_Injuries[23]+
    populationhealth$Total_Injuries[25])
populationhealth[26:27,]=data.frame("WILDFIRE",
    populationhealth$Total_fatalities[26]+populationhealth$Total_fatalities[27],
    populationhealth$Total_Injuries[26]+populationhealth$Total_Injuries[27])
populationhealth<-unique(populationhealth)
populationhealth<-mutate(populationhealth,
    Total_FAT_INJ=Total_fatalities+Total_Injuries)
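The chunk above addresses the duplicated rows by position, which only works while the merged table keeps exactly this row order. As a hedged alternative, the variant names could be recoded through a lookup and the rows summed by name. The sketch below uses invented numbers; `canonical` and `recode_evtype` are hypothetical names.

```r
# Hypothetical lookup: variant spelling -> name used in this report.
canonical <- c(
  "RIP CURRENTS"       = "RIP CURRENT",
  "THUNDERSTORM WINDS" = "THUNDERSTORM WIND",
  "TSTM WIND"          = "THUNDERSTORM WIND",
  "WILD/FOREST FIRE"   = "WILDFIRE"
)

# Replace known variants, leave every other name untouched.
recode_evtype <- function(x) {
  x <- as.character(x)
  hit <- x %in% names(canonical)
  x[hit] <- unname(canonical[x[hit]])
  x
}

# Toy table with invented counts.
toy <- data.frame(
  EVTYPE = c("TSTM WIND", "THUNDERSTORM WINDS", "THUNDERSTORM WIND",
             "RIP CURRENTS", "RIP CURRENT", "TORNADO"),
  Total_fatalities = c(5, 2, 3, 4, 6, 20),
  Total_Injuries   = c(50, 20, 30, 1, 2, 200)
)

toy$EVTYPE <- recode_evtype(toy$EVTYPE)

# Sum both measures per canonical event name.
merged <- aggregate(cbind(Total_fatalities, Total_Injuries) ~ EVTYPE,
                    data = toy, FUN = sum)
print(merged)
```

Applying such a recoding before taking the top 20 would also let the variants count together when the ranking is computed, although that would change the percentages reported here.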

We arrange the populationhealth data set created above by total fatalities.

populationhealth<-arrange(populationhealth,desc(Total_fatalities),desc(Total_FAT_INJ))

Then we reorder the levels of the factor variable EVTYPE, which will help us plot the results.

populationhealth$EVTYPE<-factor(populationhealth$EVTYPE,levels =populationhealth$EVTYPE[order(populationhealth$Total_fatalities,
decreasing = TRUE)])
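Reordering the levels matters because plotting functions such as barchart() draw the categories of a factor in level order, which is alphabetical by default. A minimal sketch with invented counts:

```r
# Toy data with invented counts.
toy <- data.frame(
  EVTYPE = c("FLOOD", "HEAT", "TORNADO"),
  Total_fatalities = c(10, 30, 20)
)

# Default factor levels are sorted alphabetically.
default_levels <- levels(factor(toy$EVTYPE))

# Levels sorted by decreasing fatalities, as in the chunk above.
toy$EVTYPE <- factor(
  toy$EVTYPE,
  levels = toy$EVTYPE[order(toy$Total_fatalities, decreasing = TRUE)]
)

print(default_levels)
print(levels(toy$EVTYPE))
```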

We plot the results using the lattice plotting system.

barchart(Total_Injuries+Total_fatalities~EVTYPE,data=populationhealth,stack=FALSE,origin=0,
    main=list(label="Fatalities & injuries caused by weather events from 1950 until Nov 2011 for the U.S.",cex=1.8),
    ylab=list("Incidents of Fatalities and injuries",cex=1.5),xlab=list("Severe weather events",cex=1.5),
    auto.key = list(space='top',cex=1,text=c('Total_Injuries','Total_fatalities')),
    scales=list(x=list(rot=90,cex=1)))

As a final conclusion, we notice that tornadoes are by far the most dangerous event for public health.

Question 2: Across the United States, which types of events have the greatest economic consequences?

For the second question we take the ecstormdata data set created in the section “Data loading and Processing”. Besides the obvious CROPDMG and PROPDMG variables, we also have to use the variables CROPDMGEXP and PROPDMGEXP to answer the question. These two variables contain alphabetical characters that signify the magnitude of CROPDMG and PROPDMG: “K” for thousands, “M” for millions, and “B” for billions, according to the National Weather Service Storm Data Documentation located at https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf. The data set also contains other values in these columns, which we do not interpret because we are not sure what they represent. At the end of the following chunk we calculate the damage each event caused, in millions of dollars, in two new variables by multiplying each DMG variable by its converted DMGEXP multiplier.

ecstormdata$CROPDMGEXP<-as.character(ecstormdata$CROPDMGEXP)
ecstormdata$CROPDMGEXP[grepl("[Kk]",ecstormdata$CROPDMGEXP)] <-10^-3
ecstormdata$CROPDMGEXP[grepl("[Mm]",ecstormdata$CROPDMGEXP)] <-1
ecstormdata$CROPDMGEXP[grepl("[Bb]",ecstormdata$CROPDMGEXP)] <-10^3
ecstormdata$PROPDMGEXP<-as.character(ecstormdata$PROPDMGEXP)
ecstormdata$PROPDMGEXP[grepl("[Kk]",ecstormdata$PROPDMGEXP)] <-10^-3
ecstormdata$PROPDMGEXP[grepl("[Mm]",ecstormdata$PROPDMGEXP)] <-1
ecstormdata$PROPDMGEXP[grepl("[Bb]",ecstormdata$PROPDMGEXP)] <-10^3
ecstormdata$CROPDMGEXP<-as.numeric(ecstormdata$CROPDMGEXP)

## Warning: NAs introduced by coercion

ecstormdata$PROPDMGEXP<-as.numeric(ecstormdata$PROPDMGEXP)

## Warning: NAs introduced by coercion

ecstormdata$CROPDMGEXP[is.na(ecstormdata$CROPDMGEXP)]<-0
ecstormdata$PROPDMGEXP[is.na(ecstormdata$PROPDMGEXP)]<-0
ecstormdata<-mutate(ecstormdata,CROPDMG_IN_M=CROPDMGEXP*CROPDMG)
ecstormdata<-mutate(ecstormdata,PROPDMG_IN_M=PROPDMGEXP*PROPDMG)
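The conversion to millions can also be written as a single lookup. The sketch below is an alternative under one stated assumption: every code other than K, M or B (in either case) counts as unknown and contributes zero. This is stricter than the chunk above, where a purely numeric code such as "5" survives as.numeric() and is used directly as a multiplier. `exp_to_millions` is a hypothetical name and the toy values are invented.

```r
# Map an exponent code to a multiplier that expresses damage in millions.
# Assumption: anything other than K/M/B (any case) is treated as unknown -> 0.
exp_to_millions <- function(code) {
  lookup <- c(K = 1e-3, M = 1, B = 1e3)
  mult <- unname(lookup[toupper(as.character(code))])
  mult[is.na(mult)] <- 0
  mult
}

# Toy rows with invented values.
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 7, 3),
  PROPDMGEXP = c("K", "m", "B", "5", ""),
  stringsAsFactors = FALSE
)

# 25 thousand = 0.025 million, 2.5 million, 1 billion = 1000 million,
# and the two unrecognized codes give 0.
toy$PROPDMG_IN_M <- toy$PROPDMG * exp_to_millions(toy$PROPDMGEXP)
print(toy)
```

Which treatment of the numeric codes is right depends on how they were meant to be read, which the report leaves open.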

We sum crop and property damage separately by EVTYPE, then arrange the two new data sets in descending order of economic damage and keep the first 20 events for each type of damage.

EconDmgP<-group_by(ecstormdata,EVTYPE) %>%
    summarize(Prop_Economic_Damage_in_M=sum(PROPDMG_IN_M))
EconDmgP<-arrange(EconDmgP,desc(Prop_Economic_Damage_in_M))
EconDmgC<-group_by(ecstormdata,EVTYPE) %>%
    summarize(Crop_Economic_Damage_in_M=sum(CROPDMG_IN_M))
EconDmgC<-arrange(EconDmgC,desc(Crop_Economic_Damage_in_M))
EconDmgC2<-EconDmgC[1:20,]
EconDmgP2<-EconDmgP[1:20,]

Again we calculate the share of the total that the top 20 EVTYPE values account for.

percenP<-sum(EconDmgP2$Prop_Economic_Damage_in_M)/sum(EconDmgP$Prop_Economic_Damage_in_M)
percenC<-sum(EconDmgC2$Crop_Economic_Damage_in_M)/sum(EconDmgC$Crop_Economic_Damage_in_M)
percentageP<-paste(round(percenP*100,2),"%",sep="")
percentageC<-paste(round(percenC*100,2),"%",sep="")

For crop damage we calculated 95.59% and for property damage 96.98%. With both above 95%, we are confident that the EVTYPE values we kept cause the biggest economic problems across the U.S.

We continue by merging the two data sets “EconDmgC2” and “EconDmgP2”.

EconDmgT<-merge(EconDmgC2,EconDmgP2,by="EVTYPE",all = TRUE)
EconDmgT$Crop_Economic_Damage_in_M[is.na(EconDmgT$Crop_Economic_Damage_in_M)]<-0
EconDmgT$Prop_Economic_Damage_in_M[is.na(EconDmgT$Prop_Economic_Damage_in_M)]<-0
EVTYPE2<-EconDmgT$EVTYPE
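The merge uses all = TRUE, a full outer join, so an event that is in only one of the two top-20 lists still gets a row, with NA in the column it is missing from. A minimal sketch with invented numbers and shortened column names:

```r
# Toy top lists (invented values); FLOOD is the only event in both.
crop <- data.frame(EVTYPE = c("DROUGHT", "FLOOD"), Crop = c(10, 5))
prop <- data.frame(EVTYPE = c("FLOOD", "TORNADO"), Prop = c(100, 50))

# Full outer join: keeps events present in either table.
both <- merge(crop, prop, by = "EVTYPE", all = TRUE)
na_before <- sum(is.na(both))

# Fill the gaps with zero, as in the chunk above.
both$Crop[is.na(both$Crop)] <- 0
both$Prop[is.na(both$Prop)] <- 0
print(both)
```

A zero after this step means the event was outside the top 20 for that kind of damage, not necessarily that it caused none.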

Now we check the EVTYPE names to see whether we have duplicated events caused by inconsistent spelling and abbreviations, as in the first question.

DROUGHT, EXCESSIVE HEAT, EXTREME COLD, FLASH FLOOD, FLOOD, FREEZE, FROST/FREEZE, HAIL, HEAT, HEAVY RAIN, HEAVY RAIN/SEVERE WEATHER, HIGH WIND, HURRICANE, HURRICANE OPAL, HURRICANE/TYPHOON, ICE STORM, RIVER FLOOD, STORM SURGE, STORM SURGE/TIDE, THUNDERSTORM WIND, THUNDERSTORM WINDS, TORNADO, TROPICAL STORM, TSTM WIND, WILD/FOREST FIRE, WILDFIRE, WINTER STORM

The duplicated events are THUNDERSTORM WIND, THUNDERSTORM WINDS and TSTM WIND, and WILDFIRE and WILD/FOREST FIRE. To determine the correct names we use the National Weather Service Storm Data Documentation located at https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf. In the chunk that follows we add up the duplicated rows and use the correct name for each event.

EconDmgT[c(20,21,24),]<-data.frame("THUNDERSTORM WIND",EconDmgT$Crop_Economic_Damage_in_M[20]+
    EconDmgT$Crop_Economic_Damage_in_M[21]+
    EconDmgT$Crop_Economic_Damage_in_M[24],EconDmgT$Prop_Economic_Damage_in_M[20]+
    EconDmgT$Prop_Economic_Damage_in_M[21]+
    EconDmgT$Prop_Economic_Damage_in_M[24])
EconDmgT[25:26,]=data.frame("WILDFIRE",
    EconDmgT$Crop_Economic_Damage_in_M[25]+EconDmgT$Crop_Economic_Damage_in_M[26],
    EconDmgT$Prop_Economic_Damage_in_M[25]+EconDmgT$Prop_Economic_Damage_in_M[26])
EconDmgT<-unique(EconDmgT)

We continue by creating a new variable in EconDmgT containing the sum of the crop and property damage, and then we arrange the data set by this new variable. In this chunk we also reorder the levels of the factor variable EVTYPE, which will help us plot the results.

EconDmgT<-mutate(EconDmgT,Total_Economic_Damage_in_M=
    Crop_Economic_Damage_in_M+Prop_Economic_Damage_in_M)
EconDmgT<-arrange(EconDmgT,desc(Total_Economic_Damage_in_M))
EconDmgT$EVTYPE<-factor(EconDmgT$EVTYPE,levels =
    EconDmgT$EVTYPE[order(EconDmgT$Total_Economic_Damage_in_M,decreasing = TRUE)])

Finally we plot the results using the lattice plotting system.

barchart(Crop_Economic_Damage_in_M+Prop_Economic_Damage_in_M~EVTYPE,data=EconDmgT,stack=FALSE,origin=0,
    main=list(label="Crop & property losses caused by weather events from 1950 until Nov 2011 for the U.S.",cex=1.8),
    ylab=list("Economic damage in Millions of U.S. Dollars",cex=1.5),xlab=list("Severe weather events",cex=1.5),
    auto.key = list(space='top',cex=1,text=c('Crop losses','Property losses')),
    scales=list(x=list(rot=90,cex=1)))

The final conclusions for question 2 are the following:

Economic consequences from property damage are much bigger than those from crop damage.
The events that cause the biggest damage differ by category: flood for property damage and drought for crop damage.