Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, property damage, and crop damage and preventing such outcomes to the extent possible is a key concern.

So, the goal of this report is to show which types of storms and other severe weather events are most harmful to population health and have greatest economic consequences across the United Sates in the year 2010. The U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database was used for this report and specifically obtained data for the year 2010 (the most recent complete year available).

Data showed that among storms and other weather events in the U.S., most fatalities were linked to Flash Flood followed by Rip Current while most injuries were caused by Tornado. Furthermore, results indicated that the combined estimated amount of Property Damage and Crop Damage in 2010 was due to Flood and Hail.

Loading and Processing the Raw Data

Load the file from the course web site

setwd("C:/Users/jggomez/Desktop/Data Science/Course 5")
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
    "Week4/StormData.csv")
setwd("C:/Users/jggomez/Desktop/Data Science/Course 5/Week4")

Read the raw file, which is a CSV file, included in the zip archive. Blank lines and all comments are ignored.

StormData = read.csv("StormData.csv", na.strings = "",
      blank.lines.skip = TRUE, comment.char = "")
str(StormData) ##Check the internal structure of StormData dataset
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29600 levels "5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13512 1872 4597 10591 4371 10093 1972 23872 24417 4597 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 34 levels "  N"," NW","E",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ BGN_LOCATI: Factor w/ 54428 levels "- 1 N Albion",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_DATE  : Factor w/ 6662 levels "1/1/1993 0:00:00",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_TIME  : Factor w/ 3646 levels " 0900CST"," 200CST",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 23 levels "E","ENE","ESE",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_LOCATI: Factor w/ 34505 levels "- .5 NNW","- 11 ESE Jay",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 18 levels "-","?","+","0",..: 16 16 16 16 16 16 16 16 16 16 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 8 levels "?","0","2","B",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ WFO       : Factor w/ 541 levels " CI","$AC","$AG",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ STATEOFFIC: Factor w/ 249 levels "ALABAMA, Central",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ ZONENAMES : Factor w/ 25111 levels "                                                                                                                               "| __truncated__,..: NA NA NA NA NA NA NA NA NA NA ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436780 levels "-2 at Deer Park\n",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

It was noted that events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

With that, events from 1 Jan 2010 to 31 Dec 2010 were used for this report to capture the most recent complete year available. The BGN_DATE variable was used as date of event.

StormData$BGN_DATE <- as.POSIXct(strptime(x = as.character(StormData$BGN_DATE),
       format = "%m/%d/%Y")) ##transform BGN_DATE to date class
class(StormData$BGN_DATE)
## [1] "POSIXct" "POSIXt"
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Storm2010 <- filter(StormData, BGN_DATE > "2009-12-31" & 
    BGN_DATE < "2011-01-01") ## filter 2010 events

Results

Show which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health. Estimates of fatalities and injuries were used (indicated in the FATALITIES AND INJURIES variables, respectively).

Top 10 storms and weather events with most cases of fatalilties

library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
Fatalities=ddply(Storm2010,~EVTYPE,summarise,
    SumFatalities=sum(FATALITIES,na.rm=TRUE)) 
        ##summarize fatalities by type of events
HighFatalities=arrange(Fatalities,desc(SumFatalities))
HighFatalities$Percentage <- round(HighFatalities$SumFatalities / 
    sum(HighFatalities$SumFatalities) * 100,2)
HighFatalities = head(HighFatalities,10)
HighFatalities
##             EVTYPE SumFatalities Percentage
## 1      FLASH FLOOD            67      15.76
## 2      RIP CURRENT            49      11.53
## 3          TORNADO            45      10.59
## 4             HEAT            44      10.35
## 5            FLOOD            41       9.65
## 6   EXCESSIVE HEAT            39       9.18
## 7        LIGHTNING            29       6.82
## 8  COLD/WIND CHILL            26       6.12
## 9        AVALANCHE            22       5.18
## 10       HIGH SURF            17       4.00

Show the top 10 storms and weather events with most number of fatalities in bar plot.

a=HighFatalities$EVTYPE
b=HighFatalities$SumFatalities
c = matrix (b, nrow=1, byrow=TRUE)
colnames(c)=a
y1=round(max(b),-1) ## to be set as y-axis limit

par(mar = c(7, 4, 2, 2) + 0.2)
end_point = 0.5 + ncol(c) + ncol(c)-1
barplot(b,main = "Storms and Weather Events with High Fatalities in 2010",
    xlab = "Type of Event", ylab="Fatalities",col = c("red"), 
    ylim = c(0,y1), cex.axis=0.7,cex.names=0.7,width=1.75)

text(seq(1.5,end_point,by=2), par("usr")[3]-0.25, 
     srt = 60, adj= 1, xpd = TRUE,
     labels = paste(colnames(c)), cex=0.5)  ##60 deg x-axis label

FLASH FLOOD showed the highest fatalities with 67 cases or 15.76% of total fatalities in 2010.

This is followed by RIP CURRENT with 49 cases or 11.53% of total fatalities.

Top 10 storms and weather events with most cases of injuries

Injuries=ddply(Storm2010,~EVTYPE,summarise,
    SumInjuries=sum(INJURIES,na.rm=TRUE)) 
        ##summarize injuries by type of events
HighInjuries=arrange(Injuries,desc(SumInjuries))
HighInjuries$Percentage <- round(HighInjuries$SumInjuries / 
        sum(HighInjuries$SumInjuries) * 100,2)
HighInjuries = head(HighInjuries,10)
HighInjuries
##               EVTYPE SumInjuries Percentage
## 1            TORNADO         699      37.68
## 2  THUNDERSTORM WIND         323      17.41
## 3        FLASH FLOOD         188      10.13
## 4          LIGHTNING         182       9.81
## 5              FLOOD         127       6.85
## 6     EXCESSIVE HEAT          63       3.40
## 7        RIP CURRENT          45       2.43
## 8               HAIL          42       2.26
## 9               HEAT          39       2.10
## 10       STRONG WIND          28       1.51

Show Top 10 storms and weather events with most cases of injuries in bar plot

d=HighInjuries$EVTYPE
e=HighInjuries$SumInjuries
f = matrix (e, nrow=1, byrow=TRUE)
colnames(f)=d
y2=round(max(f),-2) ## to be set as y-axis limit

par(mar = c(7, 4, 2, 2) + 0.2)
end_point = 0.5 + ncol(f) + ncol(f)-1
barplot(e,main = "Storms and Weather Events with High Injuries in 2010",
    xlab = "Type of Event", ylab="Injuries",col = c("orange"), 
    ylim = c(0,y2), cex.axis=0.7,cex.names=0.7,width=1.65)

text(seq(1.5,end_point,by=2), par("usr")[3]-0.25, 
     srt = 60, adj= 1, xpd = TRUE,
     labels = paste(colnames(f)), cex=0.5)##60 deg x-axis label

TORNADO showed the highest injuries with 699 cases or 37.68% of total injuries in 2010.

This was followed by THUNDERSTORM WIND with 323 cases or 17.41% of total Injuries.

Top 10 storms and weather events with most cases of combined fatalities and injuries

FatalitiesInjuries <- merge(Fatalities,Injuries,by="EVTYPE")
FatalitiesInjuries$Combined <- (FatalitiesInjuries$SumFatalities + 
    FatalitiesInjuries$SumInjuries)
HighFatalitiesInjuries <- arrange(FatalitiesInjuries,desc(Combined))
HighFatalitiesInjuries$Percentage <- round(HighFatalitiesInjuries$Combined / 
        sum(HighFatalitiesInjuries$Combined) * 100,2)
HighFatalitiesInjuries=head(HighFatalitiesInjuries,10)
HighFatalitiesInjuries
##               EVTYPE SumFatalities SumInjuries Combined Percentage
## 1            TORNADO            45         699      744      32.63
## 2  THUNDERSTORM WIND            15         323      338      14.82
## 3        FLASH FLOOD            67         188      255      11.18
## 4          LIGHTNING            29         182      211       9.25
## 5              FLOOD            41         127      168       7.37
## 6     EXCESSIVE HEAT            39          63      102       4.47
## 7        RIP CURRENT            49          45       94       4.12
## 8               HEAT            44          39       83       3.64
## 9          AVALANCHE            22          20       42       1.84
## 10              HAIL             0          42       42       1.84

TORNADO showed the highest combined fatalities and injuries with 744 cases or 32.63% of total combined fatalities and injuries in 2010.

Show which types of events (as indicated in the EVTYPE variable) have the greatest economic consequences across the United States.

According to the National Weather Service Storm Data Documentation, the estimated property damage amount in dollar should be entered as actual dollar amount and estimates should be rounded off to three significant digits (PROPDMG) followed by an alphabetical character signifying the magnitude of the number “K” for thousands, “M” for Millions, “B” for billions (PROPDMGEXP).

Now, check if all data entries in PROPDMG are in three significant digits as stated in the documentation.

g=filter(Storm2010, PROPDMG > 999.99)
g$PROPDMG
## [1] 5000 5000 1000 3200 1000 1000 1000 1000
g$PROPDMGEXP
## [1] K K K K K K K K
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
nrow(g) / nrow(Storm2010) ## % of invalid entries
## [1] 0.0001661095

Remove invalid PROPDMG entries since total frequency is insignificant. Then, estimate the amount of property damage

PropertyDamage = filter(Storm2010, PROPDMG < 1000)
h = data.frame(table(PropertyDamage$PROPDMGEXP))
sum(h$Freq) ## check if tallies with the nrows of PropertyDamage
## [1] 48153
nrow(PropertyDamage)
## [1] 48153
h$PROPDMGEXP=c(0,0,0,0,0,0,0,0,0,0,0,0,1000000000,0,0,1000,1000000,1000000)
colnames(h)=c("PROPDMGEXP","f","PropDmgValue")
i <- (merge(PropertyDamage, h, by = 'PROPDMGEXP')) ##to add column to PropertyDamage that indicates the corresponding values of PROPDMGEXP
summary(PropertyDamage$PROPDMGEXP)
##     -     ?     +     0     1     2     3     4     5     6     7     8 
##     0     0     0     0     0     0     0     0     0     0     0     0 
##     B     h     H     K     m     M 
##     2     0     0 47735     0   416
data.frame(table(i$PropDmgValue)) ##check if results tally
##    Var1  Freq
## 1  1000 47735
## 2 1e+06   416
## 3 1e+09     2
nrow(PropertyDamage)
## [1] 48153
nrow(i)
## [1] 48153
i$PropDmg = i$PROPDMG*i$PropDmgValue/1000000 ## to show amount in million dollars
summary(i$PropDmg)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0000    0.0000    0.0000    0.1916    0.0020 1800.0000

Process CROPDMG and CROPDMGEXP to estimate amount of Crop Damage in the same manner in which PROPDMG and PROPDMGEXP were processed

j=filter(Storm2010, CROPDMG > 999.99)
j$CROPDMG
## numeric(0)
CropDamage=Storm2010 ##since there is no invalid CROPDMG entry amount
k = data.frame(table(CropDamage$CROPDMGEXP))
sum(k$Freq) ## check if tallies with the nrows of CropDamage
## [1] 48161
nrow(CropDamage)
## [1] 48161
k$CROPDMGEXP=c(0,0,0,1000000000,1000,1000,1000000,1000000)
colnames(k)=c("CROPDMGEXP","f","CropDmgValue")
l <- (merge(CropDamage, k, by = 'CROPDMGEXP')) 
summary(CropDamage$CROPDMGEXP)
##     ?     0     2     B     k     K     m     M 
##     0     0     0     0     0 48021     0   140
data.frame(table(l$CropDmgValue)) ##check of result tallies
##    Var1  Freq
## 1  1000 48021
## 2 1e+06   140
nrow(CropDamage)
## [1] 48161
nrow(l)
## [1] 48161
l$CropDmg = l$CROPDMG*l$CropDmgValue/1000000 ## to show amount in million dollars
summary(l$CropDmg)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##   0.00000   0.00000   0.00000   0.03707   0.00000 286.00000

Results

First 10 storms and weather events with highest amount (in million dollars) of property damage

library(plyr)
PropertyDamage=ddply(i,~EVTYPE,summarise,
    SumPropDmg=sum(PropDmg,na.rm=TRUE)) 
        ##summarize Property Damage by type of events
HighPropertyDamage=arrange(PropertyDamage,desc(SumPropDmg))
HighPropertyDamage$Percentage <- round(HighPropertyDamage$SumPropDmg / 
    sum(HighPropertyDamage$SumPropDmg) * 100,2)
HighPropertyDamage = head(HighPropertyDamage,10)
HighPropertyDamage
##               EVTYPE SumPropDmg Percentage
## 1               HAIL 3404.10995      36.89
## 2              FLOOD 3094.02854      33.53
## 3            TORNADO 1135.92045      12.31
## 4        FLASH FLOOD  826.04090       8.95
## 5           WILDFIRE  244.88867       2.65
## 6  THUNDERSTORM WIND  213.20100       2.31
## 7          LIGHTNING   70.14870       0.76
## 8       WINTER STORM   61.94800       0.67
## 9          HIGH SURF   49.11150       0.53
## 10         HIGH WIND   27.65992       0.30

HAIL showed the highest amount of Property Damage of $3404.10995M or 36.89% of total Property Damage in 2010.

First 10 storms and weather events with highest amount (in million dollars) of crop damage

CropDamage=ddply(l,~EVTYPE,summarise,
    SumCropDmg=sum(CropDmg,na.rm=TRUE)) 
        ##summarize Crop Damage by type of events
HighCropDamage=arrange(CropDamage,desc(SumCropDmg))
HighCropDamage$Percentage <- round(HighCropDamage$SumCropDmg / 
    sum(HighCropDamage$SumCropDmg) * 100,2)
HighCropDamage = head(HighCropDamage,10)
HighCropDamage
##               EVTYPE SumCropDmg Percentage
## 1              FLOOD   1104.709      61.88
## 2       FROST/FREEZE    354.240      19.84
## 3        FLASH FLOOD    152.680       8.55
## 4               HAIL    101.980       5.71
## 5            TORNADO     27.645       1.55
## 6     WINTER WEATHER     15.000       0.84
## 7  THUNDERSTORM WIND     10.812       0.61
## 8            DROUGHT      5.807       0.33
## 9     TROPICAL STORM      5.010       0.28
## 10        HEAVY RAIN      4.101       0.23

FLOOD showed the highest amount of Crop Damage of $1104.709M or 61.88% of total Crop Damage in 2010.

First 10 storms and weather events with highest amount (in million dollars) of combined Property Damage and Crop Damage

PropertyCropDamage <- merge(PropertyDamage,CropDamage,by="EVTYPE")
PropertyCropDamage$Combined <- (PropertyCropDamage$SumPropDmg + 
    PropertyCropDamage$SumCropDmg)
HighPropertyCropDamage <- arrange(PropertyCropDamage,desc(Combined))
HighPropertyCropDamage$Percentage <- round(HighPropertyCropDamage$Combined / 
        sum(HighPropertyCropDamage$Combined) * 100,2)
HighPropertyCropDamage=head(HighPropertyCropDamage,10)
HighPropertyCropDamage
##               EVTYPE SumPropDmg SumCropDmg  Combined Percentage
## 1              FLOOD  3094.0285   1104.709 4198.7375      38.12
## 2               HAIL  3404.1100    101.980 3506.0900      31.83
## 3            TORNADO  1135.9205     27.645 1163.5655      10.56
## 4        FLASH FLOOD   826.0409    152.680  978.7209       8.89
## 5       FROST/FREEZE     3.1750    354.240  357.4150       3.25
## 6           WILDFIRE   244.8887      2.050  246.9387       2.24
## 7  THUNDERSTORM WIND   213.2010     10.812  224.0130       2.03
## 8          LIGHTNING    70.1487      0.452   70.6007       0.64
## 9       WINTER STORM    61.9480      0.000   61.9480       0.56
## 10         HIGH SURF    49.1115      0.000   49.1115       0.45

FLOOD showed the highest amount of combined Property Damage and Crop Damage of $4198.73754M or 38.12% of total combined Property Damage and Crop Damage in 2010.

Show first 10 storms and weather events with highest amount (in million dollars) of combined Property Damage and Crop Damage in stacked bar plot

m=HighPropertyCropDamage$EVTYPE
PropertyDmg=HighPropertyCropDamage$SumPropDmg
CropDmg=HighPropertyCropDamage$SumCropDmg
n=rbind(PropertyDmg,CropDmg)
colnames(n)=m
y3=max(HighPropertyCropDamage$Combined)+1000 ## to be set as y-axis limit
par(mar = c(7, 6, 2, 2) + 0.2)
barplot(n, main = "Storms and Whether Events with Highest Economic Impact in 2010",
    xlab = "Type  of Event", ylab="Combined Property Damage and Crops Damage 
    in MM", col = c("red","orange"), ylim=c(0,y3),cex.axis=0.5,
    cex.names=0.6, cex.main = 0.8, width=1.70, xaxt='n', ann=FALSE)

text(seq(1.5,end_point,by=2), par("usr")[3]-0.25, srt = 60, adj= 1, 
    xpd = TRUE, labels = paste(colnames(n)), cex=0.5)

legend("topright", c("Property Damage","Crop Damage"), fill = c("red","orange"),
    cex = 0.6)