Population and Economic Consequences of Weather Events in the United States (1950-2011)

Synopsis

The objective of this report is to study the (population and econmic) impact of weather events in the United States between 1950 and 2011. More specifically, the goal of the analysis is to understand the set of events that caused the most impact, on average. Towards this end, we classify the events in the dataset obtained from US Ocenanic and Atomosperic Administration (NOAA) into different categories. Then, we analyse the effects of the events in each category as follows. First, we compute the impact on the population in terms of fatalities and injuries. Then, we compute the impact on the economy in terms of property and crop damages (in USD). We summarise the effects by calculating the total, mean, standard deviation, median, and maximum of each of the parameters. Finally, we plot the top 2 categories that caused the most impact (on average) that provides a confidence interval of the damages for these categories.

Loading and Processing the Raw Data

First, we load the storm and weather events database obtained from US Ocenanic Atomosperic Administration (NOAA) into R. The dataset includes events in the United States between the years 1950 and 2011.

# the results from this chunk are hidden
datafile.bz2 <- "data/repdata-data-StormData.csv.bz2"
datafile <- "data/repdata-data-StormData.csv"
if (file.exists(datafile.bz2) == TRUE) {
    library(R.utils)
    bunzip2(datafile.bz2, destname = datafile, overwrite = TRUE, remove = FALSE)
} else {
    stop(paste(dst, "missing"))
}

## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.6.1 (2014-01-04) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.18.0 (2014-02-22) successfully loaded. See ?R.oo for help.
## 
## Attaching package: 'R.oo'
## 
## The following objects are masked from 'package:methods':
## 
##     getClasses, getMethods
## 
## The following objects are masked from 'package:base':
## 
##     attach, detach, gc, load, save
## 
## R.utils v1.32.4 (2014-05-14) successfully loaded. See ?R.utils for help.
## 
## Attaching package: 'R.utils'
## 
## The following object is masked from 'package:utils':
## 
##     timestamp
## 
## The following objects are masked from 'package:base':
## 
##     cat, commandArgs, getOption, inherits, isOpen, parse, warnings

After extracting the data file, we load and process the data.

stormData <- read.csv(datafile, header = TRUE)
names(stormData)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

We check the first few observations/rows of the dataset. (There are 902297 observerations in the dataset in total.) We show the observerations for population impact and economic impact separately.

dim(stormData)

## [1] 902297     37

head(stormData[, c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES")])

##             BGN_DATE  EVTYPE FATALITIES INJURIES
## 1  4/18/1950 0:00:00 TORNADO          0       15
## 2  4/18/1950 0:00:00 TORNADO          0        0
## 3  2/20/1951 0:00:00 TORNADO          0        2
## 4   6/8/1951 0:00:00 TORNADO          0        2
## 5 11/15/1951 0:00:00 TORNADO          0        2
## 6 11/15/1951 0:00:00 TORNADO          0        6

head(stormData[, c("BGN_DATE", "EVTYPE", "PROPDMG", "PROPDMGEXP", "CROPDMG", 
    "CROPDMGEXP")])

##             BGN_DATE  EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1  4/18/1950 0:00:00 TORNADO    25.0          K       0           
## 2  4/18/1950 0:00:00 TORNADO     2.5          K       0           
## 3  2/20/1951 0:00:00 TORNADO    25.0          K       0           
## 4   6/8/1951 0:00:00 TORNADO     2.5          K       0           
## 5 11/15/1951 0:00:00 TORNADO     2.5          K       0           
## 6 11/15/1951 0:00:00 TORNADO     2.5          K       0

The events identified in EVTYPE column does not conform to NOAA documentation guidelines. NOAA documentation indicates that there are 48 weather events in the dataset. However, there are 985 events in the dataset. We clean the dataset to map these events into the events identified in NOAA documentation. Then, we group these 48 events into different categories. The events in NOAA documentation are as follows.

eventTypes <- toupper(c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", 
    "Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke", "Drought", 
    "Dust Devil", "Dust Storm", "Excessive Heat", "Extreme Cold / Wind Chill", 
    "Flash Flood", "Flood", "Frost/Freeze", "Funnel Cloud", "Freezing Fog", 
    "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf", "High Wind", "Hurricane (Typhoon)", 
    "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood", "Lightning", "Marine Hail", 
    "Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind", "Rip Current", 
    "Seiche", "Sleet", "Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", 
    "Tornado", "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", 
    "Waterspout", "Wildfire", "Winter Storm", "Winter Weather", "Removed"))
eventCategories <- c("T", "W", "W", "F", "W", "F", "S", "S", "S", "S", "S", 
    "S", "W", "F", "F", "W", "H", "W", "W", "S", "R", "W", "T", "T", "H", "W", 
    "W", "F", "R", "W", "T", "T", "R", "T", "T", "W", "T", "T", "R", "H", "H", 
    "H", "O", "O", "T", "S", "W", "W", "Z")
abbreviate(eventTypes)

##     ASTRONOMICAL LOW TIDE                 AVALANCHE 
##                    "ASLT"                    "AVAL" 
##                  BLIZZARD             COASTAL FLOOD 
##                    "BLIZ"                    "COAF" 
##           COLD/WIND CHILL               DEBRIS FLOW 
##                    "COLC"                    "DEBF" 
##                 DENSE FOG               DENSE SMOKE 
##                    "DENF"                    "DENS" 
##                   DROUGHT                DUST DEVIL 
##                    "DROU"                    "DUSD" 
##                DUST STORM            EXCESSIVE HEAT 
##                    "DUSS"                    "EXCH" 
## EXTREME COLD / WIND CHILL               FLASH FLOOD 
##                   "EC/WC"                    "FLAF" 
##                     FLOOD              FROST/FREEZE 
##                    "FLOO"                    "FROS" 
##              FUNNEL CLOUD              FREEZING FOG 
##                    "FUNC"                    "FREF" 
##                      HAIL                      HEAT 
##                    "HAIL"                    "HEAT" 
##                HEAVY RAIN                HEAVY SNOW 
##                    "HEAR"                    "HEAS" 
##                 HIGH SURF                 HIGH WIND 
##                    "HIGS"                    "HIGW" 
##       HURRICANE (TYPHOON)                 ICE STORM 
##                    "HUR("                    "ICES" 
##          LAKE-EFFECT SNOW           LAKESHORE FLOOD 
##                    "LAKS"                    "LAKF" 
##                 LIGHTNING               MARINE HAIL 
##                    "LIGH"                    "MARH" 
##          MARINE HIGH WIND        MARINE STRONG WIND 
##                    "MAHW"                    "MASW" 
##  MARINE THUNDERSTORM WIND               RIP CURRENT 
##                    "MATW"                    "RIPC" 
##                    SEICHE                     SLEET 
##                    "SEIC"                    "SLEE" 
##          STORM SURGE/TIDE               STRONG WIND 
##                    "STOS"                    "STRW" 
##         THUNDERSTORM WIND                   TORNADO 
##                    "THUW"                    "TORN" 
##       TROPICAL DEPRESSION            TROPICAL STORM 
##                    "TROD"                    "TROS" 
##                   TSUNAMI              VOLCANIC ASH 
##                    "TSUN"                    "VOLA" 
##                WATERSPOUT                  WILDFIRE 
##                    "WATE"                    "WILD" 
##              WINTER STORM            WINTER WEATHER 
##                    "WINS"                    "WINW" 
##                   REMOVED 
##                    "REMO"

unique(eventCategories)

## [1] "T" "W" "F" "S" "H" "R" "O" "Z"

In our analysis, we use 8 categories of events. The events are mapped into these categories are as follows.

ecs <- unique(as.character(eventCategories))
evtypesstr = sapply(1:length(ecs), function(i) paste(abbreviate(eventTypes[eventCategories == 
    ecs[i]]), collapse = ", "))
for (i in 1:length(ecs)) {
    print(paste0(ecs[i], ": ", evtypesstr[i]))
}

## [1] "T: ASLT, HIGS, HIGW, MAHW, MASW, RIPC, SEIC, STOS, STRW, WATE"
## [1] "W: AVAL, BLIZ, COLC, EC/WC, FROS, FREF, HAIL, HEAS, ICES, LAKS, MARH, SLEE, WINS, WINW"
## [1] "F: COAF, DEBF, FLAF, FLOO, LAKF"
## [1] "S: DENF, DENS, DROU, DUSD, DUSS, EXCH, HEAT, WILD"
## [1] "H: FUNC, HUR(, TORN, TROD, TROS"
## [1] "R: HEAR, LIGH, MATW, THUW"
## [1] "O: TSUN, VOLA"
## [1] "Z: REMO"

Next, we map the events in the dataset to one of the 48 identified above and subsequently classify them into appropriate category.

events <- toupper(as.character(levels(stormData$EVTYPE)))
origEvents <- events
library(stringr)
events <- str_trim(events)
eventGroups <- rep(NA, length(events))
eventGroupCats <- rep(NA, length(events))

events <- gsub("\\s*RECORD\\s*", "", events)

rows <- grep("torn|whirl|wall", events, ignore.case = TRUE)
events[rows] <- toupper("Tornado")

rows <- grep("wintry|winter|northern", events, ignore.case = TRUE)
events[rows] <- toupper("Winter Weather")

coastal <- grep("coastal", events, ignore.case = TRUE)
rows <- grep("dam\\s+|drown|rising|flood", events, ignore.case = TRUE)
events[setdiff(rows, coastal)] <- toupper("Flood")
events[coastal] <- toupper("Coastal Flood")

rows <- grep("dust", events, ignore.case = TRUE)
events[rows] <- toupper("Dust Storm")

rows <- grep("thu|tund|TSTM|storm", events, ignore.case = TRUE)
events[rows] <- toupper("Thunderstorm Wind")

rows <- grep("fire|smoke", events, ignore.case = TRUE)
events[rows] <- toupper("Wildfire")

rows <- grep("coastal", events, ignore.case = TRUE)
events[rows] <- toupper("Coastal Flood")

rows <- grep("spout", events, ignore.case = TRUE)
events[rows] <- toupper("Waterspout")

rows <- grep("snow", events, ignore.case = TRUE)
events[rows] <- toupper("Heavy Snow")

rows <- grep("hurri|typh|floyd", events, ignore.case = TRUE)
events[rows] <- toupper("Hurricane (Typhoon)")

rows <- grep("warm|heat|hot", events, ignore.case = TRUE)
events[rows] <- toupper("Heat")

rows <- grep("freez|frost", events, ignore.case = TRUE)
events[rows] <- toupper("Frost/Freeze")

rows <- grep("chill|thermia", events, ignore.case = TRUE)
events[rows] <- toupper("Extreme Cold /Wind Chill")

rows <- grep("^wind|wnd|High Wind|HIGH  WINDS", events, ignore.case = TRUE)
events[rows] <- toupper("High Wind")

rows <- grep("funnel", events, ignore.case = TRUE)
events[rows] <- toupper("Funnel Cloud")

rows <- grep("cold|cool|low", events, ignore.case = TRUE)
events[rows] <- toupper("Cold/Wind Chill")

rows <- grep("rain|preci|wet|shower|slide", events, ignore.case = TRUE)
events[rows] <- toupper("Heavy Rain")

rows <- grep("gust|gradi|burst|turbu|Strong Wind", events, ignore.case = TRUE)
events[rows] <- toupper("Strong Wind")

rows <- grep("dry|dri", events, ignore.case = TRUE)
events[rows] <- toupper("Excessive Heat")

rows <- grep("swell|tide|wave|sea|beach|HIGH WATER", events, ignore.case = TRUE)
events[rows] <- toupper("Storm Surge/Tide")

rows <- grep("urban|flash|stream|slump", events, ignore.case = TRUE)
events[rows] <- toupper("Flash Flood")

rows <- grep("tempe", events, ignore.case = TRUE)
events[rows] <- toupper("Heat")

rows <- grep("ice|icy|glaze", events, ignore.case = TRUE)
events[rows] <- toupper("Sleet")

rows <- grep("volca", events, ignore.case = TRUE)
events[rows] <- toupper("Volcanic Ash")

rows <- grep("surf", events, ignore.case = TRUE)
events[rows] <- toupper("High Surf")

rows <- grep("fog|vog", events, ignore.case = TRUE)
events[rows] <- toupper("Dense Fog")

rows <- grep("avalan|blizz", events, ignore.case = TRUE)
events[rows] <- toupper("Blizzard")

rows <- grep("hail", events, ignore.case = TRUE)
events[rows] <- toupper("Hail")

rows <- grep("rip curr", events, ignore.case = TRUE)
events[rows] <- toupper("Rip Current")

rows <- grep("lightning|ligntning|lighting", events, ignore.case = TRUE)
events[rows] <- toupper("Lightning")

for (i in 1:length(events)) {
    if (events[i] %in% eventTypes) {
        eventGroups[i] <- eventTypes[which(events[i] == eventTypes)]
    } else {
        ## check if there is substring of events in eventTypes
        if (length(grep("?", events[i]) > 0)) {
            eventGroups[i] = "REMOVED"
        }
    }
    eventGroupCats[i] <- eventCategories[which(eventGroups[i] == eventTypes)]
}

stormData <- transform(stormData, ET = eventGroups[stormData$EVTYPE], EC = eventGroupCats[stormData$EVTYPE])
length(unique(as.character(stormData$ET)))

## [1] 32

length(unique(as.character(stormData$EC)))

## [1] 8

data.frame(eventCategory = ecs, NoOfObservations = sapply(1:length(ecs), function(i) sum(stormData$EC == 
    ecs[i])))

##   eventCategory NoOfObservations
## 1             T            32374
## 2             W           334784
## 3             F            86169
## 4             S            11744
## 5             H            68063
## 6             R           368969
## 7             O               49
## 8             Z              145

Next, we remove categories “O” (for lack of large number of observations) and “Z” (as proper categorization is not possible).

o_rows <- which(stormData$EC == "O")
z_rows <- which(stormData$EC == "Z")
stormData <- stormData[-union(z_rows, o_rows), ]

Finally, we transform the BGN_DATE column into POSIX format and filter only columns of interest.

stormData$EVDATE <- strptime(stormData$BGN_DATE, "%m/%d/%Y %H:%M:%S")
stormData <- stormData[, c("EVDATE", "EVTYPE", "EC", "FATALITIES", "INJURIES", 
    "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
names(stormData)

## [1] "EVDATE"     "EVTYPE"     "EC"         "FATALITIES" "INJURIES"  
## [6] "PROPDMG"    "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP"

dim(stormData)

## [1] 902103      9

The cleaned dataset includes 902103 observations and 9 variables.

Results

Population Health Impact

First, we filter the cleaned dataset to include only observations that had an impact on the population. In other words, we consider observations that had non-zero fatalities or non-zero injuries.

fatal_impact <- stormData[stormData$FATALITIES > 0, c("EVDATE", "EC", "FATALITIES")]
injury_impact <- stormData[stormData$INJURIES > 0, c("EVDATE", "EC", "INJURIES")]
dim(fatal_impact)

## [1] 6969    3

dim(injury_impact)

## [1] 17598     3

totfatalities <- sum(fatal_impact$FATALITIES)
totinjuries <- sum(injury_impact$INJURIES)

Total fatalities across all weather events in the United States between 1950-2011: 15104. Total injuries across all weather events in the United States between 1950-2011: 140380.

Here is a brief summary of total fatalities and total injuries across all categories.

summary(fatal_impact$FATALITIES)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     1.0     1.0     2.2     2.0   583.0

summary(injury_impact$INJURIES)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       1       2       8       4    1700

Next, we use the plyr package in R to summarise fatalities and injuries across different categories. We compute the total, mean, standard deviation, median, and maximum fatalities/injuries for different categories.

fatal_data <- transform(fatal_impact, EVDATE = as.character(fatal_impact$EVDATE), 
    Category = as.character(fatal_impact$EC), FATALITIES = fatal_impact$FATALITIES)
injury_data <- transform(injury_impact, EVDATE = as.character(injury_impact$EVDATE), 
    Category = as.character(injury_impact$EC), INJURIES = injury_impact$INJURIES)

library(plyr)
fatal_df <- ddply(fatal_data, "Category", summarise, tot.obs = length(Category), 
    tot.fatal = sum(FATALITIES), mean.fatal = mean(FATALITIES), sd.fatal = sd(FATALITIES), 
    median.fatal = median(FATALITIES), max.fatal = max(FATALITIES))
injury_df <- ddply(injury_data, "Category", summarise, tot.obs = length(Category), 
    tot.injury = sum(INJURIES), mean.injury = mean(INJURIES), sd.injury = sd(INJURIES), 
    median.injury = median(INJURIES), max.injury = max(INJURIES))
fatal_df <- fatal_df[order(fatal_df$mean.fatal, decreasing = TRUE), ]
injury_df <- injury_df[order(injury_df$mean.injury, decreasing = TRUE), ]

Here is a list of statistics on the number of fatalities and injuries: total, mean, standard deviation, median, maximum. The categories are orded according to the mean value.

fatal_df

##   Category tot.obs tot.fatal mean.fatal sd.fatal median.fatal max.fatal
## 4        S     886      3348      3.779  20.6801            1       583
## 2        H    1653      5797      3.507   7.8800            1       158
## 1        F    1014      1559      1.537   1.4170            1        20
## 6        W     899      1291      1.436   1.0921            1        14
## 3        R    1530      1899      1.241   1.1153            1        22
## 5        T     987      1210      1.226   0.7011            1        10

injury_df

##   Category tot.obs tot.injury mean.injury sd.injury median.injury
## 4        S     659      11931      18.105    46.416             3
## 1        F     586       8685      14.821    71.840             2
## 2        H    7744      92743      11.976    47.742             3
## 6        W     886       6250       7.054    20.110             2
## 5        T     999       2735       2.738     5.113             1
## 3        R    6724      18036       2.682    19.677             1
##   max.injury
## 4        519
## 1        800
## 2       1700
## 6        385
## 5         89
## 3       1568

The category that has the highest average number of fatalities in the United States between 1950 and 2011 is S. The list of events included in this category are:

top_fatal_EC = as.character(fatal_df[1, "Category"])
eventTypes[eventCategories == top_fatal_EC]

## [1] "DENSE FOG"      "DENSE SMOKE"    "DROUGHT"        "DUST DEVIL"    
## [5] "DUST STORM"     "EXCESSIVE HEAT" "HEAT"           "WILDFIRE"

The category that has the highest average number of injuries in the United States between 1950 and 2011 is S. The list of events included in this category are:

top_injury_EC = as.character(injury_df[1, "Category"])
eventTypes[eventCategories == top_injury_EC]

## [1] "DENSE FOG"      "DENSE SMOKE"    "DROUGHT"        "DUST DEVIL"    
## [5] "DUST STORM"     "EXCESSIVE HEAT" "HEAT"           "WILDFIRE"

Here are the boxplots that includes the top 2 categories for fatalities and injuries. This figure shows the mean and the confidence interval of the statistic (in log scale) across some categories.

ec <- union(as.character(fatal_df[1:2, "Category"]), as.character(injury_df[1:2, 
    "Category"]))
pop_ri <- data.frame()
for (i in 1:length(ec)) {
    top_i_fatal <- fatal_impact[fatal_impact$EC == ec[i], ]
    pop_f <- data.frame(type = rep("Fatalities", dim(top_i_fatal)[1]), Category = top_i_fatal$EC, 
        EVDATE = top_i_fatal$EVDATE, log2Impact = log2(top_i_fatal$FATALITIES))
    pop_ri <- rbind(pop_ri, pop_f)
    top_i_injury <- injury_impact[injury_impact$EC == ec[i], ]
    pop_i <- data.frame(type = rep("Injuries", dim(top_i_injury)[1]), Category = top_i_injury$EC, 
        EVDATE = top_i_injury$EVDATE, log2Impact = log2(top_i_injury$INJURIES))
    pop_ri <- rbind(pop_ri, pop_i)
}
library(ggplot2)
plotData <- ggplot(pop_ri, aes(EVDATE, log2Impact)) + facet_grid(type ~ Category) + 
    geom_boxplot(aes(fill = Category)) + labs(x = "Date", y = expression(log[2] * 
    "(population impact)")) + labs(title = "Boxplot of fatalities and injuries (includes top 2 categories)") + 
    theme_bw(10)
print(plotData)

plot of chunk pop-boxplot

Economic Impact

First, we filter the cleaned dataset to include only observations that had an impact on the economy. In other words, we consider observations that had non-zero property damages or non-zero crop damages. We use the PROPDMGEXP and CROPDMGEXP variables to compute the actual value of the damages. These variables provide a multiplier for PROPDMG and CROPDMG respectively.

prop_impact <- stormData[stormData$PROPDMG > 0, c("EVDATE", "EC", "PROPDMG", 
    "PROPDMGEXP")]
crop_impact <- stormData[stormData$CROPDMG > 0, c("EVDATE", "EC", "CROPDMG", 
    "CROPDMGEXP")]
dim(prop_impact)

## [1] 239143      4

dim(crop_impact)

## [1] 22068     4

dmg_codes <- union(unique(as.character(prop_impact$PROPDMGEXP)), unique(as.character(crop_impact$CROPDMGEXP)))
multiplier <- c(10^3, 10^6, 10^9, 10^6, 1, 1, 1, 10^5, 10^6, 10^4, 10^2, 10^2, 
    10^7, 10^3, 10^2, 1, 10^3)
data.frame(Code = dmg_codes, Multiplier = multiplier)

##    Code Multiplier
## 1     K      1e+03
## 2     M      1e+06
## 3     B      1e+09
## 4     m      1e+06
## 5     +      1e+00
## 6     0      1e+00
## 7            1e+00
## 8     5      1e+05
## 9     6      1e+06
## 10    4      1e+04
## 11    h      1e+02
## 12    2      1e+02
## 13    7      1e+07
## 14    3      1e+03
## 15    H      1e+02
## 16    -      1e+00
## 17    k      1e+03

prop_impact$PROPDMGVAL <- sapply(1:dim(prop_impact)[1], function(i) prop_impact$PROPDMG[i] * 
    multiplier[which(dmg_codes == prop_impact$PROPDMGEXP[i])])
crop_impact$CROPDMGVAL <- sapply(1:dim(crop_impact)[1], function(i) crop_impact$CROPDMG[i] * 
    multiplier[which(dmg_codes == crop_impact$CROPDMGEXP[i])])
totPropDamages <- (sum(prop_impact$PROPDMGVAL))/10^9
totCropDamages <- (sum(crop_impact$CROPDMGVAL))/10^9

Total property damages across all weather events in the United States between 1950-2011: $ 428.08 Billions. Total crop damages across all weather events in the United States between 1950-2011: $ 49.1 Billions.

Here is a brief summary of property and crop damages across all categories of events.

summary(prop_impact$PROPDMGVAL)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.00e+00 3.00e+03 1.00e+04 1.79e+06 5.00e+04 1.15e+11

summary(crop_impact$CROPDMGVAL)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 3.00e+00 5.00e+03 1.50e+04 2.23e+06 1.00e+05 5.00e+09

Next, we use the plyr package to summarise property and crop damages across different categories. We compute the total, mean, standard deviation, median, and maximum property/crop damages for different categories.

prop_data <- transform(prop_impact, EVDATE = as.character(prop_impact$EVDATE), 
    Category = as.character(prop_impact$EC), PROPDMGVAL = prop_impact$PROPDMGVAL)
crop_data <- transform(crop_impact, EVDATE = as.character(crop_impact$EVDATE), 
    Category = as.character(crop_impact$EC), CROPDMGVAL = crop_impact$CROPDMGVAL)

library(plyr)
prop_df <- ddply(prop_data, "Category", summarise, tot.obs = length(Category), 
    tot.prop = round(sum(PROPDMGVAL)/10^6, 3), mean.prop = round(mean(PROPDMGVAL)/10^6, 
        3), sd.prop = round(sd(PROPDMGVAL)/10^6, 3), median.prop = round(median(PROPDMGVAL)/10^6, 
        3), max.prop = round(max(PROPDMGVAL)/10^6, 3))
crop_df <- ddply(crop_data, "Category", summarise, tot.obs = length(Category), 
    tot.crop = round(sum(CROPDMGVAL)/10^6, 3), mean.crop = round(mean(CROPDMGVAL)/10^6, 
        3), sd.crop = round(sd(CROPDMGVAL)/10^6, 3), median.crop = round(median(CROPDMGVAL)/10^6, 
        3), max.crop = round(max(CROPDMGVAL)/10^6, 3))
prop_df <- prop_df[order(prop_df$mean.prop, decreasing = TRUE), ]
crop_df <- crop_df[order(crop_df$mean.crop, decreasing = TRUE), ]
prop_df <- transform(prop_df, tot.prop = as.character(tot.prop), mean.prop = as.character(mean.prop), 
    median.prop = as.character(median.prop), max.prop = as.character(max.prop))
crop_df <- transform(crop_df, tot.crop = as.character(tot.crop), mean.crop = as.character(mean.crop), 
    median.crop = as.character(median.crop), max.crop = as.character(max.crop))

Here is a list of statistics on property/crop damages: total, mean, standard deviation, median, maximum across different categories. All values are in Millions of USD.

prop_df

##   Category tot.obs   tot.prop mean.prop sd.prop median.prop max.prop
## 4        S    1286    9590.99     7.458   65.30         0.1     1500
## 1        F   32264 168273.424     5.216  641.38       0.025   115000
## 2        H   39328 143961.672     3.661  124.96       0.025    16930
## 6        W   27243  24382.954     0.895   33.99       0.008     5000
## 5        T    9643   6322.843     0.656   18.35        0.01     1300
## 3        R  129379  75546.999     0.584   94.67       0.005    31300

crop_df

##   Category tot.obs  tot.crop mean.crop sd.crop median.crop max.crop
## 4        S     410 15280.327    37.269 102.459         0.5     1000
## 2        H    1597  5933.579     3.715  45.463       0.015     1510
## 1        F    4253 12388.597     2.913  78.003        0.05     5000
## 5        T     337   757.342     2.247  11.734       0.025      175
## 3        R    5801   7973.41     1.374  65.888        0.01     5000
## 6        W    9670  6769.882       0.7   8.812       0.015      596

The category that caused the highest average property damages in the United States between 1950 and 2011 is S. The list of events included in this category are:

top_prop_EC = as.character(prop_df[1, "Category"])
eventTypes[eventCategories == top_prop_EC]

## [1] "DENSE FOG"      "DENSE SMOKE"    "DROUGHT"        "DUST DEVIL"    
## [5] "DUST STORM"     "EXCESSIVE HEAT" "HEAT"           "WILDFIRE"

The category that caused the highest average crop damages in the United States between 1950 and 2011 is S. The list of events included in this category are:

top_crop_EC = as.character(crop_df[1, "Category"])
eventTypes[eventCategories == top_crop_EC]

## [1] "DENSE FOG"      "DENSE SMOKE"    "DROUGHT"        "DUST DEVIL"    
## [5] "DUST STORM"     "EXCESSIVE HEAT" "HEAT"           "WILDFIRE"

Here are the boxplots that includes the top 2 categories for property and crop damages. This figure shows the mean and the confidence interval of the statistic (in log scale) across some categories.

ec <- union(as.character(prop_df[1:2, "Category"]), as.character(crop_df[1:2, 
    "Category"]))
econ_ri <- data.frame()
for (i in 1:length(ec)) {
    top_i_prop <- prop_impact[prop_impact$EC == ec[i], ]
    econ_p <- data.frame(type = rep("Property Damages", dim(top_i_prop)[1]), 
        Category = top_i_prop$EC, EVDATE = top_i_prop$EVDATE, log2Impact = log2(top_i_prop$PROPDMGVAL))
    econ_ri <- rbind(econ_ri, econ_p)
    top_i_crop <- crop_impact[crop_impact$EC == ec[i], ]
    econ_c <- data.frame(type = rep("Crop Damages", dim(top_i_crop)[1]), Category = top_i_crop$EC, 
        EVDATE = top_i_crop$EVDATE, log2Impact = log2(top_i_crop$CROPDMGVAL))
    econ_ri <- rbind(econ_ri, econ_c)
}
library(ggplot2)
plotData <- ggplot(econ_ri, aes(EVDATE, log2Impact)) + facet_grid(type ~ Category) + 
    geom_boxplot(aes(fill = Category)) + labs(x = "Date", y = expression(log[2] * 
    "(economic impact)")) + labs(title = "Boxplot of Property and Crop damages (includes top 2 categories)") + 
    theme_bw(10)
print(plotData)

plot of chunk econ-boxplot