Title:

NOAA Storm Database Severe Weather Analysis

Synopsis:

NOAA maintains a database of severe weather events. The database goes back to the 1950s, but record keeping was only standardized in 1996. The data covers major weather events and their impact on human lives and the economy, along with location data. By analyzing the data, one can gain an understanding of the impact that different weather events have. Government agencies and private individuals can then use this information to prepare for future events and to understand which events have the greatest impact. This analysis aims to determine which types of events have the greatest impact on population health and which have the greatest economic consequences.

Data Processing

This section walks through the steps taken to process the data and create the plots used in the analysis.

Start by loading the libraries.

library(plyr)
## Warning: package 'plyr' was built under R version 3.2.2

Now we need to read in the data.

StormData <- read.csv("repdata-data-StormData.csv.bz2")
head(StormData)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

After examining the data structure and reading the information at http://www.ncdc.noaa.gov/stormevents/details.jsp (found via the course forums), it is clear that we should only investigate incidents from 1996 onward. Otherwise tornadoes will show an unusually high number of injuries and amount of damage, because the earlier records cover only a few event types. This still leaves many years of data to work with; time permitting, we can look back at the pre-1996 data and see whether it is roughly in line with the post-1996 data.

To extract the data from 1996 onward, first convert the dates and check that everything converted properly.

StormData$BGN_DATE <- as.Date(StormData$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")
class(StormData$BGN_DATE)
## [1] "Date"
sum(is.na(StormData$BGN_DATE))
## [1] 0

Now subset the data to the date range we want to work with.

WorkingData <- subset(StormData, BGN_DATE >= "1996-01-01")
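
The conversion and date filter above can be checked on a few hand-written rows. This is a minimal sketch: the `toy` data frame and its values are made up for illustration, and the cutoff is wrapped in `as.Date()` so that the comparison does not rely on implicit conversion of the string.

```r
# Hypothetical rows in the same layout as the BGN_DATE column
toy <- data.frame(
  EVTYPE   = c("TORNADO", "HAIL", "FLOOD"),
  BGN_DATE = c("4/18/1950 0:00:00", "1/6/1996 0:00:00", "12/31/1995 0:00:00"),
  stringsAsFactors = FALSE
)

# Parse month/day/year; the time part is matched by the format and then dropped
toy$BGN_DATE <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")

# Any string that failed to parse would show up as NA here
stopifnot(!anyNA(toy$BGN_DATE))

# Keep events that began on or after 1 January 1996
recent <- subset(toy, BGN_DATE >= as.Date("1996-01-01"))
print(recent)
```

Only the HAIL row survives the filter, since the other two events began before the cutoff.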

First we want to investigate which events are most harmful to population health, i.e. which cause the most fatalities and injuries.

HumanFatal <- aggregate(FATALITIES ~ EVTYPE,WorkingData,sum)
HumanInjury <- aggregate(INJURIES ~ EVTYPE,WorkingData,sum)
HumanImpact <- merge(HumanInjury, HumanFatal, by = "EVTYPE")
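
The aggregate-then-merge step can be tried on a small made-up data frame. The sketch below uses hypothetical values and also shows a one-step alternative with `cbind()` on the left-hand side of the formula, the same form used later for the damage columns. The two forms can differ when either column has missing values, because the formula interface drops incomplete rows.

```r
# Hypothetical event records
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HAIL"),
  FATALITIES = c(1, 2, 0, 0),
  INJURIES   = c(10, 5, 3, 0),
  stringsAsFactors = FALSE
)

# Two-step version, as in the analysis above
fatal    <- aggregate(FATALITIES ~ EVTYPE, toy, sum)
injury   <- aggregate(INJURIES ~ EVTYPE, toy, sum)
two_step <- merge(injury, fatal, by = "EVTYPE")

# One-step alternative: sum both columns in a single aggregate() call
one_step <- aggregate(cbind(INJURIES, FATALITIES) ~ EVTYPE, toy, sum)

print(two_step)
print(one_step)
```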

We now have a data set of the event types and the number of injuries and fatalities each has caused. Looking at the data with View(HumanImpact) in RStudio, we can see that many event types have 0 for fatalities or injuries. So let's subset the data further, keeping only event types that have at least one injury and at least one fatality.

HumanImpact <- subset(HumanImpact, (INJURIES > 0 & FATALITIES > 0))
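
The choice of `&` in this filter matters: it drops any event type that has fatalities but no injuries, or the reverse. The following sketch, with made-up counts, compares it against a looser `|` filter. It is only an illustration of the difference, not a change to the analysis.

```r
# Hypothetical aggregated counts
impact <- data.frame(
  EVTYPE     = c("TORNADO", "RIP CURRENT", "HAIL", "DENSE FOG"),
  INJURIES   = c(15, 0, 4, 0),
  FATALITIES = c(3, 2, 0, 0),
  stringsAsFactors = FALSE
)

# As in the analysis: keep rows where both counts are positive
both <- subset(impact, INJURIES > 0 & FATALITIES > 0)

# Looser alternative: keep rows where at least one count is positive
either <- subset(impact, INJURIES > 0 | FATALITIES > 0)

print(both$EVTYPE)
print(either$EVTYPE)
```

Because the ranking uses the combined total, event types removed by the stricter filter can only matter if their single nonzero count is large enough to reach the top ten.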

Next, compute a combined total of injuries and fatalities to sort on.

HumanImpact$Combined <- HumanImpact$INJURIES + HumanImpact$FATALITIES
HumanImpact <- HumanImpact[order(-HumanImpact$Combined),]

Looking at the data, we can clearly see that tornadoes have quite the impact!

head(HumanImpact)
##             EVTYPE INJURIES FATALITIES Combined
## 426        TORNADO    20667       1511    22178
## 81  EXCESSIVE HEAT     6391       1797     8188
## 102          FLOOD     6758        414     7172
## 224      LIGHTNING     4141        651     4792
## 434      TSTM WIND     3629        241     3870
## 98     FLASH FLOOD     1674        887     2561

Let's plot the data in a stacked bar chart, which is surprisingly hard! First we grab the top ten events and re-factor the event names.

plotdata <- HumanImpact[1:10, 1:4]
plotdata$EVTYPE <- factor(plotdata$EVTYPE)

We then need to reshape the data so that it can be drawn as a bar chart. For a stacked bar chart, barplot() needs a matrix, so first create a new data frame.

Injury <- c(plotdata$INJURIES)
Death <- c(plotdata$FATALITIES)
HumanDF <- data.frame(Injury, Death)

Then transpose the data frame with t(), which turns it into a matrix, and apply the event names as column names so that the x-axis is properly labeled. HumanDF will be used to plot the data in the Results section.

HumanDF <- t(HumanDF)
colnames(HumanDF) <- plotdata$EVTYPE
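
The same matrix shape can also be built directly with `rbind()`, skipping the intermediate data frame and the transpose. This is a sketch of an alternative, not the method used above; the three rows of counts are copied from the head(HumanImpact) output, and the names `top` and `stacked` are made up for the example.

```r
# First three rows of the head(HumanImpact) output shown earlier
top <- data.frame(
  EVTYPE     = c("TORNADO", "EXCESSIVE HEAT", "FLOOD"),
  INJURIES   = c(20667, 6391, 6758),
  FATALITIES = c(1511, 1797, 414),
  stringsAsFactors = FALSE
)

# rbind() stacks the two vectors as rows: one row per series, one column per event
stacked <- rbind(Injury = top$INJURIES, Death = top$FATALITIES)
colnames(stacked) <- top$EVTYPE

print(stacked)
```

Passing a matrix like `stacked` to barplot() draws one bar per column, with one segment per row.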

Now we move on to the second question, regarding economic consequences. First we need to take a look at the data:

head(WorkingData,1)
##        STATE__   BGN_DATE    BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 248768       1 1996-01-06 08:00:00 PM       CST      1 ALZ001>038    AL
##              EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI         END_DATE
## 248768 WINTER STORM         0                    1/7/1996 0:00:00
##           END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI
## 248768 03:00:00 PM          0         NA         0                   
##        LENGTH WIDTH  F MAG FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 248768      0     0 NA   0          0        0     380          K      38
##        CROPDMGEXP WFO       STATEOFFIC
## 248768          K BMX ALABAMA, Central
##                                                                                                                                                                                                                                                                                                                                                                                                     ZONENAMES
## 248768 LAUDERDALE - LAUDERDALE - COLBERT - FRANKLIN - LAWRENCE - LIMESTONE - MADISON - MORGAN - MARSHALL - JACKSON - DEKALB - MARION - LAMAR - FAYETTE - WINSTON - WALKER - CULLMAN - BLOUNT - ETOWAH - CALHOUN - CHEROKEE - CLEBURNE - PICKENS - TUSCALOOSA - JEFFERSON - SHELBY - ST. CLAIR - TALLADEGA - CLAY - RANDOLPH - SUMTER - GREENE - HALE - PERRY - BIBB - CHILTON - COOSA - TALLAPOOSA - CHAMBERS
##        LATITUDE LONGITUDE LATITUDE_E LONGITUDE_
## 248768        0         0          0          0
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            REMARKS
## 248768 A winter storm brought a mixture of freezing rain, sleet, and snow to the northern two-thirds of Alabama.  Precipitation began as freezing rain and sleet but quickly changed to snow.  The precipitation coated roads and caused serious travel problems across the northern sections of thestate that lasted into Monday morning (the 8th).  Some higher elevations of the northeast corner of Alabama had travel problems into Tuesday.  Amounts were generally light with the highest snowfall reported at Huntsville International Airport with 2 inches.  Most other locations across North Alabama reported one-quarter of an inch to an inch and a half.  On Sunday the 7th, one fatality occurred in an automobile/train collision in Calhoun County that was attributed to icy roads.  The teenage driver of the car was not wearing a seat belt and was thrown from the vehicle.
##        REFNUM
## 248768 248768

From this we can identify four columns that we need in addition to EVTYPE: PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP. Let's extract these columns into a new working variable called EconData.

EconData = data.frame(factor(WorkingData$EVTYPE),WorkingData$PROPDMG,WorkingData$PROPDMGEXP,WorkingData$CROPDMG,WorkingData$CROPDMGEXP)
colnames(EconData) <- c("EVTYPE","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")
head(EconData)
##         EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 WINTER STORM     380          K      38          K
## 2      TORNADO     100          K       0           
## 3    TSTM WIND       3          K       0           
## 4    TSTM WIND       5          K       0           
## 5    TSTM WIND       2          K       0           
## 6         HAIL       0                  0

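We can see lots of zero values, so let's filter these out. Note that this filter keeps only rows where both property damage and crop damage are nonzero, so events that caused only one kind of damage are excluded from the totals below.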

EconData <- subset(EconData, (PROPDMG > 0 & CROPDMG > 0))

According to the Storm Data Export document, the EXP columns hold a multiplier (if any) to apply to the damage amount: K for thousands, M for millions and B for billions. Let's convert these codes using the mapvalues function in plyr, then convert the result to numeric.

EconData$PROPDMGEXP <- as.character(EconData$PROPDMGEXP)
EconData$CROPDMGEXP <- as.character(EconData$CROPDMGEXP)

EconData$PROPDMGEXP <- mapvalues(EconData$PROPDMGEXP,c("K","M","B"),c(1000,1000000,1000000000))
EconData$CROPDMGEXP <- mapvalues(EconData$CROPDMGEXP,c("K","M","B"),c(1000,1000000,1000000000))

EconData$PROPDMGEXP <- as.numeric(EconData$PROPDMGEXP)
EconData$CROPDMGEXP <- as.numeric(EconData$CROPDMGEXP)
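
For comparison, the same code-to-multiplier mapping can be sketched in base R with a named lookup vector. The sample codes and amounts below are hypothetical. Treating a blank or unrecognized code as a multiplier of 1 is an assumption made for this example, not something the Storm Data documentation is quoted as saying here.

```r
# Multipliers for the three exponent codes handled above
exp_lookup <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical exponent codes and damage amounts
codes  <- c("K", "m", "B", "")
amount <- c(25, 1.5, 2, 10)

# Upper-case first so "m" matches "M"; unmatched codes come back as NA
mult <- unname(exp_lookup[toupper(codes)])

# Assumption: a blank or unrecognized code means "no multiplier"
mult[is.na(mult)] <- 1

damage <- amount * mult
print(damage)
```

A lookup vector keeps the conversion numeric throughout, so no character-to-numeric step is needed afterwards.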

Now find the total value of the damage by multiplying the damage values by the multipliers in PROPDMGEXP and CROPDMGEXP.

EconData$PROPDMG <- EconData$PROPDMG * EconData$PROPDMGEXP
EconData$CROPDMG <- EconData$CROPDMG * EconData$CROPDMGEXP
head(EconData)
##          EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1  WINTER STORM  380000       1000   38000       1000
## 58         HAIL    2000       1000    1000       1000
## 59         HAIL    2000       1000    2000       1000
## 60         HAIL   15000       1000   10000       1000
## 62         HAIL    5000       1000    2000       1000
## 64  FLASH FLOOD  200000       1000   25000       1000

We will now repeat the same process used to prepare the injury data.

  1. Re-factor EVTYPE to drop the unused levels

  2. Aggregate the damages by EVTYPE

  3. Add up the property and crop damages to get a total to sort on

  4. Sort by total damage, largest first

  5. Grab the top 10 items for plotting

EconData$EVTYPE <- factor(EconData$EVTYPE)
EconPlot <- aggregate(cbind(PROPDMG,CROPDMG) ~ EVTYPE, EconData, sum)
EconPlot$TotalDMG <- EconPlot$PROPDMG + EconPlot$CROPDMG
EconPlot <- EconPlot[order(-EconPlot$TotalDMG),]
EconPlot <- EconPlot[1:10, ]
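
The aggregate, total, sort and top-N steps can be followed on a small made-up data set. In this sketch the `econ` values are hypothetical, and `head()` is used for the last step because, unlike indexing with 1:10, it does not pad the result with NA rows when fewer rows are available.

```r
# Hypothetical damage records, already converted to dollars
econ <- data.frame(
  EVTYPE  = c("FLOOD", "HAIL", "FLOOD", "TORNADO", "HAIL"),
  PROPDMG = c(200000, 2000, 50000, 100000, 15000),
  CROPDMG = c(25000, 1000, 5000, 0, 10000),
  stringsAsFactors = FALSE
)

# Sum both damage columns per event type
totals <- aggregate(cbind(PROPDMG, CROPDMG) ~ EVTYPE, econ, sum)

# Combined damage, used only for ranking
totals$TotalDMG <- totals$PROPDMG + totals$CROPDMG

# Largest total first
totals <- totals[order(totals$TotalDMG, decreasing = TRUE), ]

# Keep the top two event types
top2 <- head(totals, 2)
print(top2)
```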

Then we make a data frame consisting of PROPDMG and CROPDMG, transpose it, and then add the event names to the columns so they appear on the bar chart.

EconDF <- data.frame(EconPlot$PROPDMG,EconPlot$CROPDMG)
EconDF <- t(EconDF)
colnames(EconDF) <- EconPlot$EVTYPE

Results

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Plotting the data by event type, we can see that tornadoes are the biggest threat to population health.

par(mar=c(7,5,1,1))
barplot(as.matrix(HumanDF),cex.names = .6,col=c("black","red"),las = 2,ylab = "Count of Injuries/Fatalities", legend = c("Injuries","Fatalities"),mgp=c(3,.5,0))

Looking at the data, it is easy to see that tornadoes on their own are the biggest threat. However, a tornado is part of a severe storm, which can also bring lightning, flash flooding and thunderstorm winds, all of which also appear in the top 10. So in general, storms appear to be the biggest threat to life. They are hard to prepare for because they are difficult to predict, and there is little advance planning one can do for a tornado. Most regions in “tornado alley” are well prepared and have warning systems in place for tornadoes.

As far as planning goes, excessive heat is one area where harm could actually be prevented and the human impact reduced. Limiting outdoor activity and requiring breaks for outdoor workers could lower this number. More analysis of the specifics in the data is required.

  2. Across the United States, which types of events have the greatest economic consequences?

par(mar=c(7,7,1,1))
barplot(as.matrix(EconDF),cex.names = .6,col=c("black","red"),las = 2,ylab = "Nominal Sum of Damages USD$", legend = c("Property Damage","Crop Damage"),mgp=c(5,.5,0))

For economic consequences, flooding is the clear “winner”. Flood damage far exceeds that of the next two items, which in turn are often a cause of flooding themselves! Preparing for flooding really boils down to identifying flood zones and preventing or restricting building in those areas. Human nature being what it is, people will continue to live in flood zones; however, specific precautions can be taken to prevent damage where possible.

Conclusions

The above report could be broken down some more. The specific areas that could be addressed are:

  1. Aggregating EVTYPE even further. There are instances of “HURRICANE/TYPHOON” and “HURRICANE” which could, and probably should, be combined. This is debatable, as I filtered the data set to post-1996 data, which should have corrected this; yet looking at the unique values we can see that HURRICANE still stands alone, and it could be corrected and then combined. Others could be combined as well (WINTER STORM and HEAVY SNOW); however, the decision to filter to post-1996 data was meant to prevent this issue and gain a greater understanding of what is causing the greatest harm.

unique(EconData$EVTYPE)
##  [1] WINTER STORM         HAIL                 FLASH FLOOD         
##  [4] TORNADO              TSTM WIND            TSTM WIND/HAIL      
##  [7] HIGH WIND            URBAN/SML STREAM FLD WILD/FOREST FIRE    
## [10] Heavy Rain/High Surf FLOOD                HEAVY RAIN          
## [13] HURRICANE            STORM SURGE          River Flooding      
## [16] DROUGHT              HEAVY SNOW           LIGHTNING           
## [19] ICE STORM            RIVER FLOOD          Frost/Freeze        
## [22] BLIZZARD             EXTREME COLD         TYPHOON             
## [25] TROPICAL STORM       GUSTY WINDS          DUST STORM          
## [28] FREEZE               DRY MICROBURST       SMALL HAIL          
## [31] WILDFIRE             HURRICANE/TYPHOON    STRONG WIND         
## [34] LANDSLIDE            EXCESSIVE HEAT       THUNDERSTORM WIND   
## [37] FROST/FREEZE         STORM SURGE/TIDE     TSUNAMI             
## 39 Levels: BLIZZARD DROUGHT DRY MICROBURST DUST STORM ... WINTER STORM

  2. Investigating the REMARKS field could yield more information about specific events, such as the EXCESSIVE HEAT item. During my initial data exploration there appeared to be a wealth of information regarding the hows and whys of the injuries and deaths. This data could be explored to find the actual cause of death per EVTYPE, which would give a better idea of where to focus planning efforts for these events.
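
As a rough sketch of the EVTYPE consolidation raised in the first item above, labels can be normalized for case and whitespace and then passed through a small synonym map. The sample labels come from the unique(EconData$EVTYPE) output. The `synonyms` map is hypothetical, and which labels should be merged is a judgment call that this report does not settle.

```r
# Labels taken from the unique(EconData$EVTYPE) output above
evtype <- c("TSTM WIND", "THUNDERSTORM WIND", "HURRICANE",
            "HURRICANE/TYPHOON", "Frost/Freeze", "FROST/FREEZE", "HAIL")

# Step 1: normalize case and surrounding whitespace
clean <- toupper(trimws(evtype))

# Step 2: hypothetical synonym map (old label = merged label)
synonyms <- c("TSTM WIND" = "THUNDERSTORM WIND",
              "HURRICANE" = "HURRICANE/TYPHOON")

# Replace only the labels that appear in the map
hit <- clean %in% names(synonyms)
clean[hit] <- unname(synonyms[clean[hit]])

print(table(clean))
```

Applying a cleanup like this before the aggregate() calls would pool the counts for labels that differ only in spelling or case.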