Project1 - Data Analysis on Population and Property Loss in US due to Various Calamities

Synopsis

In this report we will look at the infliction of various calamities on population and economic(property+crop) damage over the years from 1950 to 2011. Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

The question we are interested in answering are as follows:

Across the United States, which types of events(EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

Loading and Preprocessing the Raw Data

The dataset was taken from National Weather Service Storm Data. There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

Loading libraries required for our analysis.

library(plyr)
library(dplyr)
library(ggplot2)

Reading and Viewing its Properties

data<-read.csv("repdata_data_StormData.csv.bz2")
str(data)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

So we see there many unwanted variables that are redundant and we don’t need them at least not for our analysis. ### Data transformation

#variables we need to solve this analysis
dframe<-data[,c("STATE","EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
dframe$CROPDMGEXP<-as.character(dframe$CROPDMGEXP)
dframe$PROPDMGEXP<-as.character(dframe$PROPDMGEXP)
head(dframe)

##   STATE  EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1    AL TORNADO          0       15    25.0          K       0           
## 2    AL TORNADO          0        0     2.5          K       0           
## 3    AL TORNADO          0        2    25.0          K       0           
## 4    AL TORNADO          0        2     2.5          K       0           
## 5    AL TORNADO          0        2     2.5          K       0           
## 6    AL TORNADO          0        6     2.5          K       0

Now we add two more variables ,one would be total population damage variable showing the sum of injuries and fatalities caused by each calamity which will show total population damage and second variable would be total economic damage which was inflicted upon by the calamity.

dframe$POPU.DAMAGE<-dframe[,"FATALITIES"]+dframe[,"INJURIES"]    #first variable
dframe$TOTAL.PROP.DAMAGE<-rep(0,902297)                          #second variable initialized with zero in all rows

Now we will subset from the dframe data frame.The rows in PROPDMGEXP and CROPDMGEXP have various factors of which we will only include denomination marked as K(for Thousand),M(for Million),B(for Billion) and rest levels would be droped.

deno<-c("K","M","B")
df<-filter(dframe,PROPDMGEXP %in% deno & CROPDMGEXP %in% deno)

We have our subset in df dataframe and now we can fill our TOTAL.PROP.DAMAGE column by adding columns CROPDMG and PROPDMG according to the denominations as indicated by CROPDMGEXP and PROPDMGEXP respectively.

for(i in 1:nrow(df)){

     if(df$PROPDMGEXP[i] == "K"&&df$CROPDMGEXP[i] == "K")
        df$TOTAL.PROP.DAMAGE[i]<- df$PROPDMG[i]*1000 + df$CROPDMG[i]*1000
  
     else if(df$PROPDMGEXP[i]=="K"&&df$CROPDMGEXP[i] == "M")
        df$TOTAL.PROP.DAMAGE[i]<- df$PROPDMG[i]*1000 + df$CROPDMG[i]*1000000
  
     else if(df$PROPDMGEXP[i]=="K"&&df$CROPDMGEXP[i] == "B")
       df$TOTAL.PROP.DAMAGE[i]<- df$PROPDMG[i]*1000 + df$CROPDMG[i]*1000000000
     
     else if(df$PROPDMGEXP[i] == "M"&&df$CROPDMGEXP[i] == "K")
       df$TOTAL.PROP.DAMAGE[i]<- df$PROPDMG[i]*1000000 + df$CROPDMG[i]*1000
     
     else if(df$PROPDMGEXP[i]=="M"&&df$CROPDMGEXP[i] == "M")
       df$TOTAL.PROP.DAMAGE[i]<-  df$PROPDMG[i]*1000000 + df$CROPDMG[i]*1000000
     
     else if(df$PROPDMGEXP[i]=="M"&& df$CROPDMGEXP[i] == "B")
       df$TOTAL.PROP.DAMAGE[i]<-  df$PROPDMG[i]*1000000 + df$CROPDMG[i]*1000000000
     
     else if(df$PROPDMGEXP[i] == "B"&&df$CROPDMGEXP[i] == "K")
       df$TOTAL.PROP.DAMAGE[i]<- df$PROPDMG[i]*1000000000 + df$CROPDMG[i]*1000
     
     else if(df$PROPDMGEXP[i]=="B"&&df$CROPDMGEXP[i] == "M")
       df$TOTAL.PROP.DAMAGE[i]<-  df$PROPDMG[i]*1000000000 + df$CROPDMG[i]*1000000
     
     else if(df$PROPDMGEXP[i]=="B"&& df$CROPDMGEXP[i] == "B")
       df$TOTAL.PROP.DAMAGE[i]<-  df$PROPDMG[i]*1000000000 + df$CROPDMG[i]*1000000000
     
}

Results

Now we have prepared our dataframe df for further analysis.

head(df)

##   STATE                    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 1    AL HURRICANE OPAL/HIGH WINDS          2        0     0.1          B
## 2    AL        THUNDERSTORM WINDS          0        0     5.0          M
## 3    AL            HURRICANE ERIN          0        0    25.0          M
## 4    AL            HURRICANE OPAL          0        0    48.0          M
## 5    AL        THUNDERSTORM WINDS          0        0    50.0          K
## 6    AL        THUNDERSTORM WINDS          0        2   500.0          K
##   CROPDMG CROPDMGEXP POPU.DAMAGE TOTAL.PROP.DAMAGE
## 1      10          M           2           1.1e+08
## 2     500          K           0           5.5e+06
## 3       1          M           0           2.6e+07
## 4       4          M           0           5.2e+07
## 5      50          K           0           1.0e+05
## 6      50          K           2           5.5e+05

Filter df on the basis of EVTYPE and reassign it to df.

df1<-df%>%group_by(EVTYPE)%>%summarise(TOT.POPU.DAMAGE.PER.EVTYPE=sum(POPU.DAMAGE,na.rm = T),TOT.ECONOMIC.DAMAGE.PER.EVTYPE=sum(TOTAL.PROP.DAMAGE,na.rm = T))
summary(df1)

##                     EVTYPE    TOT.POPU.DAMAGE.PER.EVTYPE
##  ASTRONOMICAL HIGH TIDE:  1   Min.   :    0.0           
##  ASTRONOMICAL LOW TIDE :  1   1st Qu.:    0.0           
##  AVALANCHE             :  1   Median :    1.0           
##  BLIZZARD              :  1   Mean   :  272.9           
##  COASTAL FLOOD         :  1   3rd Qu.:   39.5           
##  COASTAL FLOODING      :  1   Max.   :13024.0           
##  (Other)               :117                             
##  TOT.ECONOMIC.DAMAGE.PER.EVTYPE
##  Min.   :0.000e+00             
##  1st Qu.:2.776e+05             
##  Median :4.264e+06             
##  Mean   :2.125e+09             
##  3rd Qu.:1.555e+08             
##  Max.   :1.380e+11             
##

After seeing summary we see that max fatalities in US is 13000+ and still mean is just 200+ so we will divide this dataset into 2 parts one part with total fatalities >1000 and second set with fatalaties<1000 to get a clear picture. Since we are interested in knowing which event was most harmful to population so we will only look fatalities >1000

df2<-filter(df1,TOT.POPU.DAMAGE.PER.EVTYPE>=1000)
df2

## # A tibble: 8 x 3
##   EVTYPE            TOT.POPU.DAMAGE.PER.EVTYPE TOT.ECONOMIC.DAMAGE.PER.EVTYPE
##   <fct>                                  <dbl>                          <dbl>
## 1 EXCESSIVE HEAT                          1070                      493803200
## 2 FLASH FLOOD                             1054                     8715295130
## 3 FLOOD                                   6756                   138007444500
## 4 HEAT                                    1376                        2390000
## 5 ICE STORM                               1629                     5925147300
## 6 LIGHTNING                               1183                      320786130
## 7 THUNDERSTORM WIND                       1542                     3813647990
## 8 TORNADO                                13024                    16520148150

Plot 1 for population damage event wise

g1<-ggplot(df2,aes(EVTYPE,TOT.POPU.DAMAGE.PER.EVTYPE,fill=EVTYPE))
g1+geom_bar(stat = "identity",show.legend = F)+theme_bw() +geom_text(aes(label=TOT.POPU.DAMAGE.PER.EVTYPE),vjust=-0.3)+ggtitle("POPULATION DAMAGE(>1000)")+ylab("POPULATION DAMAGE PER EVTYPE")+ theme(axis.text.x = element_text(angle = 60, hjust = 1))

Here we can see that max damage is caused by tornado and is followed by flood and then ice storm. Therefore population wise ranking of top 5 event causing max damage is:

TORNADO(13024)
FLOOD(6756)
ICE STORM(1629)
THUNDERSTORM WIND(1542)
HEAT(1376)

Conclusion-1

Data analysis on this dataset shows that Tornado is by far population damage wise the most devastating calamity in US when compared to other calamities. When we sum up the economic loss of all calamities other than Tornado, this figure is no where close to loss of lives inflicted by Tornado.

So government should be prepared for Tornado and should invest in proper technolohgy and infrastructure that could inform people beforehand about such calamities (esp. Tornado) and reduce the Tornado’s impact on human lives.

Now in the TOT.ECONOMIC.DAMAGE.PER.EVTYPE column we see alot variation and since we are interested in answering which types of events have the greatest economic consequences so well filter out the rows showing damage greater than 2B$.

df3<-filter(df1,TOT.ECONOMIC.DAMAGE.PER.EVTYPE>2000000000)
df3

## # A tibble: 13 x 3
##    EVTYPE            TOT.POPU.DAMAGE.PER.EVTYPE TOT.ECONOMIC.DAMAGE.PER.EVTYPE
##    <fct>                                  <dbl>                          <dbl>
##  1 FLASH FLOOD                             1054                     8715295130
##  2 FLOOD                                   6756                   138007444500
##  3 HAIL                                     305                    10019978590
##  4 HIGH WIND                                418                     3057666640
##  5 HURRICANE                                 64                    12405268000
##  6 HURRICANE OPAL                             1                     2157000000
##  7 HURRICANE/TYPHOON                        949                    29348167800
##  8 ICE STORM                               1629                     5925147300
##  9 RIVER FLOOD                                0                    10108369000
## 10 STORM SURGE/TIDE                          16                     4641493000
## 11 THUNDERSTORM WIND                       1542                     3813647990
## 12 TORNADO                                13024                    16520148150
## 13 WILDFIRE                                 614                     3684468370

Plot 2 for total property loss(in Billion dollars) event wise

g2<-ggplot(df3,aes(EVTYPE,TOT.ECONOMIC.DAMAGE.PER.EVTYPE/1000000000,fill=EVTYPE))
g2+geom_bar(stat = "identity",show.legend = F)+theme_bw() +geom_text(aes(label=round(TOT.ECONOMIC.DAMAGE.PER.EVTYPE/1000000000,2)),vjust=-0.3)+ggtitle("ECONOMIC DAMAGE(>2B$)")+ylab("ECONOMIC DAMAGE PER EVTYPE(in Billion Dollars)")+ theme(axis.text.x = element_text(angle = 60, hjust = 1))

Max economic destruction caused by flood with 138.01B dollars followed by hurricane/typhoon with 29.34B dollars loss 3rd place is occupied by tornado with 16.52B$ loss of property. Therefore property wise ranking of top 5 event causing max damage in dollar are:

FLOOD(138.01B)
HURRICANE/TYPHOON(29.35B)
TORNADO(16.52B)
HURRICANE(12.41B)
RIVER FLOOD(10.11B)

Conclusion-2

Data analysis on this dataset shows that flood is by far economically most devastating calamity in US when compared to other calamities. When we sum up the economic loss of all calamities other than flood , this figure is no where close to economic loss inflicted by flood.

So government should be prepared for flood more than anything and should invest in proper infrastructure that could reduce the flood’s economic impact.