Synopsis

In this data analysis of weather events, we begin by loading the data frame into R and viewing its structure. Then, we proceed to data cleaning and pre-processing for analysis. We create uniform variable names and subset the main data frame using only the columns we might need for analysis. Next, we address the two analysis questions by creating new, calculated variables for our question-specific data frames. We aggregate our data frames by event type and arrange our damage data in descending order. Finally, we take the top 10 results from each question-specific data frame and create bar plots to address which storm events are the most harmful in the US by either population health or economic measures.

Data Processing

First, we load and view the data in R.

res <- read.csv("repdata%2Fdata%2FStormData.csv.bz2", na.strings="")
head(res)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0    <NA>       <NA>     <NA>     <NA>          0
## 2 TORNADO         0    <NA>       <NA>     <NA>     <NA>          0
## 3 TORNADO         0    <NA>       <NA>     <NA>     <NA>          0
## 4 TORNADO         0    <NA>       <NA>     <NA>     <NA>          0
## 5 TORNADO         0    <NA>       <NA>     <NA>     <NA>          0
## 6 TORNADO         0    <NA>       <NA>     <NA>     <NA>          0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0    <NA>       <NA>   14.0   100 3   0          0
## 2         NA         0    <NA>       <NA>    2.0   150 2   0          0
## 3         NA         0    <NA>       <NA>    0.1   123 2   0          0
## 4         NA         0    <NA>       <NA>    0.0   100 2   0          0
## 5         NA         0    <NA>       <NA>    0.0   150 2   0          0
## 6         NA         0    <NA>       <NA>    1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP  WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0       <NA> <NA>       <NA>      <NA>
## 2        0     2.5          K       0       <NA> <NA>       <NA>      <NA>
## 3        2    25.0          K       0       <NA> <NA>       <NA>      <NA>
## 4        2     2.5          K       0       <NA> <NA>       <NA>      <NA>
## 5        2     2.5          K       0       <NA> <NA>       <NA>      <NA>
## 6        6     2.5          K       0       <NA> <NA>       <NA>      <NA>
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806    <NA>      1
## 2     3042      8755          0          0    <NA>      2
## 3     3340      8742          0          0    <NA>      3
## 4     3458      8626          0          0    <NA>      4
## 5     3412      8642          0          0    <NA>      5
## 6     3450      8748          0          0    <NA>      6

We are told that the events in the database start in the year 1950 and end in Nov 2011. There are fewer events recorded in the earlier years in general. More recent years are more complete.

Let’s look at the structure of res.

str(res)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29600 levels "5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13512 1872 4597 10591 4371 10093 1972 23872 24417 4597 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 34 levels "  N"," NW","E",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ BGN_LOCATI: Factor w/ 54428 levels " Christiansburg",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_DATE  : Factor w/ 6662 levels "1/1/1993 0:00:00",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_TIME  : Factor w/ 3646 levels " 0900CST"," 200CST",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 23 levels "E","ENE","ESE",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_LOCATI: Factor w/ 34505 levels " CANTON"," TULIA",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 18 levels "-","?","+","0",..: 16 16 16 16 16 16 16 16 16 16 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 8 levels "?","0","2","B",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ WFO       : Factor w/ 541 levels " CI","%SD","$AC",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ STATEOFFIC: Factor w/ 249 levels "ALABAMA, Central",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ ZONENAMES : Factor w/ 25111 levels "                                                                                                                               "| __truncated__,..: NA NA NA NA NA NA NA NA NA NA ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436780 levels "\t","\t\t","\t\t\t\t",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Upon further inspection, we see a mix of lower and uppercase letters in some of our variables’ factor levels. We’re going to clean up some of those variables.

head(table(res$EVTYPE), 12)
## 
##    HIGH SURF ADVISORY         COASTAL FLOOD           FLASH FLOOD 
##                     1                     1                     1 
##             LIGHTNING             TSTM WIND       TSTM WIND (G45) 
##                     1                     4                     1 
##            WATERSPOUT                  WIND                     ? 
##                     1                     1                     1 
##       ABNORMAL WARMTH        ABNORMALLY DRY        ABNORMALLY WET 
##                     4                     2                     1
tail(table(res$EVTYPE), 12)
## 
##  WINTER STORM/HIGH WIND WINTER STORM/HIGH WINDS           WINTER STORMS 
##                       1                       1                       3 
##          Winter Weather          WINTER WEATHER      WINTER WEATHER MIX 
##                      19                    7026                       6 
##      WINTER WEATHER/MIX             WINTERY MIX              Wintry mix 
##                    1104                       2                       3 
##              Wintry Mix              WINTRY MIX                     WND 
##                       1                      90                       1
res$EVTYPE      <- as.factor(tolower(res$EVTYPE))
table(res$PROPDMGEXP)
## 
##      -      ?      +      0      1      2      3      4      5      6 
##      1      8      5    216     25     13      4      4     28      4 
##      7      8      B      h      H      K      m      M 
##      5      1     40      1      6 424665      7  11330
res$PROPDMGEXP  <- as.factor(toupper(res$PROPDMGEXP))
table(res$CROPDMGEXP)
## 
##      ?      0      2      B      k      K      m      M 
##      7     19      1      9     21 281832      1   1994
res$CROPDMGEXP  <- as.factor(toupper(res$CROPDMGEXP))

Next, let’s subset only those variables which we might be using to answer our analysis questions.

sd <- subset(res, select=c(STATE, WFO, EVTYPE, BGN_DATE, END_DATE, FATALITIES, INJURIES, 
                             PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, REMARKS))
summary(sd)
##      STATE             WFO                       EVTYPE      
##  TX     : 83728   OUN    : 17393   hail             :288661  
##  KS     : 53440   JAN    : 13889   tstm wind        :219942  
##  OK     : 46802   LWX    : 13174   thunderstorm wind: 82564  
##  MO     : 35648   PHI    : 12551   tornado          : 60652  
##  IA     : 31069   TSA    : 12483   flash flood      : 54277  
##  NE     : 30271   (Other):690738   flood            : 25327  
##  (Other):621339   NA's   :142069   (Other)          :170874  
##               BGN_DATE                   END_DATE        FATALITIES      
##  5/25/2011 0:00:00:  1202   4/27/2011 0:00:00:  1214   Min.   :  0.0000  
##  4/27/2011 0:00:00:  1193   5/25/2011 0:00:00:  1196   1st Qu.:  0.0000  
##  6/9/2011 0:00:00 :  1030   6/9/2011 0:00:00 :  1021   Median :  0.0000  
##  5/30/2004 0:00:00:  1016   4/4/2011 0:00:00 :  1007   Mean   :  0.0168  
##  4/4/2011 0:00:00 :  1009   5/30/2004 0:00:00:   998   3rd Qu.:  0.0000  
##  4/2/2006 0:00:00 :   981   (Other)          :653450   Max.   :583.0000  
##  (Other)          :895866   NA's             :243411                     
##     INJURIES            PROPDMG          PROPDMGEXP        CROPDMG       
##  Min.   :   0.0000   Min.   :   0.00   K      :424665   Min.   :  0.000  
##  1st Qu.:   0.0000   1st Qu.:   0.00   M      : 11337   1st Qu.:  0.000  
##  Median :   0.0000   Median :   0.00   0      :   216   Median :  0.000  
##  Mean   :   0.1557   Mean   :  12.06   B      :    40   Mean   :  1.527  
##  3rd Qu.:   0.0000   3rd Qu.:   0.50   5      :    28   3rd Qu.:  0.000  
##  Max.   :1700.0000   Max.   :5000.00   (Other):    77   Max.   :990.000  
##                                        NA's   :465934                    
##  CROPDMGEXP                                              REMARKS      
##  ?   :     7                                                 : 24013  
##  0   :    19   Trees down.\n                                 :  1110  
##  2   :     1   Several trees were blown down.\n              :   568  
##  B   :     9   Trees were downed.\n                          :   446  
##  K   :281853   Large trees and power lines were blown down.\n:   432  
##  M   :  1995   (Other)                                       :588295  
##  NA's:618413   NA's                                          :287433

Question 1. Which types of events are most harmful to population health?

First, we copy the sd data frame into a new data frame called ph. Then, we create a new column called CASUALTIES that equals the sum of the number of INJURIES and FATALITIES recorded for each storm. Next, we aggregate the data frame by the total number of people historically harmed by each event type using the sum function. Finally, we arrange the data frame in descending order using the dplyr package to see which events are most harmful to human health. We store the top ten results for population health in a variable called p10.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
ph <- sd
ph$CASUALTIES <- ph$INJURIES + ph$FATALITIES
phAgg <- aggregate(CASUALTIES ~ EVTYPE, ph, FUN=sum)
phArr <- arrange(phAgg, desc(CASUALTIES))
ph10 <- head(phArr, 10)
ph10$EVTYPE <- factor(ph10$EVTYPE)

Question 2. Which types of events have the greatest economic consequences?

We’re going to use the PROPDMG/PROPDMGEXP and CROPDMG/CROPDMGEXP variables to answer this question. Let’s again take a closer look at our data frame. What are our factor levels for the PROPDMGEXP column?

table(sd$PROPDMGEXP)
## 
##      -      ?      +      0      1      2      3      4      5      6 
##      1      8      5    216     25     13      4      4     28      4 
##      7      8      B      H      K      M 
##      5      1     40      7 424665  11337

Our levels for our property damage exponent are -, ?, +, 0-8, B, H, K, M. Because the symbolic levels are unclear in the provided documentation, we will discard them by creating a new subsetted data frame, sdp. Then, we will calculate the appropriate integer relating to each of the exponent variables and store them in a new variable, PROPDMGFULL.

sdp <- subset(sd[!(sd$PROPDMGEXP=="-" | sd$PROPDMGEXP=="?" | sd$PROPDMGEXP=="+"),])

sdp$PROPDMGFULL <- with(sdp, ifelse(sdp$PROPDMGEXP == "H", 1e2,   ## H stands for hundred
                           ifelse(sdp$PROPDMGEXP == "K", 1e3,     ## K stands for thousand
                           ifelse(sdp$PROPDMGEXP == "M", 1e6,     ## M stands for million
                           ifelse(sdp$PROPDMGEXP == "B", 1e9,     ## B stands for billion
                                ifelse(sdp$PROPDMGEXP == 1, 1e1,
                                ifelse(sdp$PROPDMGEXP == 2, 1e2,
                                ifelse(sdp$PROPDMGEXP == 3, 1e3,
                                ifelse(sdp$PROPDMGEXP == 4, 1e4,
                                ifelse(sdp$PROPDMGEXP == 5, 1e5,
                                ifelse(sdp$PROPDMGEXP == 6, 1e6,
                                ifelse(sdp$PROPDMGEXP == 7, 1e7,
                                ifelse(sdp$PROPDMGEXP == 8, 1e8,
                                ifelse(sdp$PROPDMGEXP == 0, 1e0, NA))))))))))))))

Next, we repeat the above process with the crop variables.

table(sdp$CROPDMGEXP)
## 
##      ?      0      2      B      K      M 
##      5     16      0      5 277988   1552

Our levels for our crop damage exponent are ?, 0, 2, B, K, M. There are only five instances where CROPDMGEXP=="?", so we will ignore them. We will copy the sdp data frame to a new one for our next calculations and calcluate the appropriate integer from the exponent column into a new variable called CROPDMGFULL.

sdc <- sdp

sdc$CROPDMGFULL <- with(sdc, ifelse(sdc$CROPDMGEXP == "H", 1e2,
                        ifelse(sdc$CROPDMGEXP == "K", 1e3,
                        ifelse(sdc$CROPDMGEXP == "M", 1e6,
                        ifelse(sdc$CROPDMGEXP == "B", 1e9, 
                        ifelse(sdc$CROPDMGEXP == 2, 1e2,
                        ifelse(sdc$CROPDMGEXP == 0, 1e0, NA)))))))

Next, we calculate the total damage caused by each of the storms by adding together the complete cases of property damage and crop damage. To do this, we first create a new, logical variable that contains the complete cases of both PROPDMG and PROPDMGFULL. We then calculate the full amount of property damage by multiplying together the PROPDMG and PROPDMGFULL columns and storing the result in the new PROPTOTAL column. The analagous process is done for the crop variables, too. Finally, we add together the PROPTOTAL and CROPTOTAL variables and store them in our new variable, DMGTOTAL.

We begin this code chunk by copying the latest data frame into a new one called dat.

dat <- sdc
dat$CCPROP      <- complete.cases(dat$PROPDMG + dat$PROPDMGFULL)
dat$PROPTOTAL   <- with(dat, ifelse(dat$CCPROP == TRUE, dat$PROPDMG * dat$PROPDMGFULL, 0))

dat$CCCROP      <- complete.cases(dat$CROPDMG + dat$CROPDMGFULL)
dat$CROPTOTAL   <- with(dat, ifelse(dat$CCCROP == TRUE, dat$CROPDMG * dat$CROPDMGFULL, 0))

dat$DMGTOTAL    <- with(dat, ifelse(!is.na(dat$PROPTOTAL) & !is.na(dat$CROPTOTAL),
                                 dat$PROPTOTAL + dat$CROPTOTAL, NA))

Lastly, to answer the second question, we aggregate the total damage by event type using the sum function and store the resulting data frame in a variable called ec. Then, using the dplyr package, we arrange the aggregated data frame in descending order by total damage. Finally, we store the top ten results for economic damage in a variable called ec10.

ec <- aggregate(DMGTOTAL ~ EVTYPE, dat, FUN=sum)
ecArr <- arrange(ec, desc(DMGTOTAL))
ec10 <- head(ecArr, 10) 

Results

To answer question 1, we create a barplot of the top ten most historically harmful storm events to US Population Health.

print(ph10)
##               EVTYPE CASUALTIES
## 1            tornado      96979
## 2     excessive heat       8428
## 3          tstm wind       7461
## 4              flood       7259
## 5          lightning       6046
## 6               heat       3037
## 7        flash flood       2755
## 8          ice storm       2064
## 9  thunderstorm wind       1621
## 10      winter storm       1527
## 1st barplot code
par(mar = c(6.7, 4.9, 3.0, 3.5), mgp= c(3.8, 0.5, 0))
barplot(ph10$CASUALTIES, names=ph10$EVTYPE, main="Most Harmful Storm Types to US Population", ylab = "Total Casualties", las=2, col=heat.colors(10), cex.axis=0.8, cex.names=0.8)
mtext("Historical Data 1950-2011")

Fig 1. We can see that tornadoes are by far the event that caused the most casualties in the US during 1950-2011. According to the data, tornadoes were responsible for 96,979 casualties in that time frame. The next storm event, ‘excessive heat’, falls almost 90,000 casualties short of tornadoes for the same time period at 8428 casualties.

In answering the second question, we build a barplot of the top ten most economically harmful storm events to US property and crops.

print(ec10)
##               EVTYPE     DMGTOTAL
## 1              flood 149828665250
## 2  hurricane/typhoon  71913712800
## 3            tornado  57350760234
## 4        storm surge  43323541000
## 5        flash flood  18210702822
## 6               hail  17789075356
## 7          hurricane  14557229010
## 8        river flood  10147679500
## 9          ice storm   8967041360
## 10    tropical storm   8155601550
## 2nd barplot code
par(mar = c(6.7, 4.9, 3.0, 3.5), mgp= c(3.8, 0.5, 0))
barplot(ec10$DMGTOTAL, names=ec10$EVTYPE, main="Most Economically Harmful Storm Types in US", ylab = "Total Damage in USD", las=2, col=heat.colors(10), cex.axis=0.8, cex.names=0.8)
mtext("Historical Data 1950-2011")

Fig 2. We see that floods were the most economically harmful event in the US during 1950-2011, causing the most combined property and crop damage. The total damage, according to the data, is almost $150 billion. The next highest event, hurricanes/typhoons, caused almost $72 billion in damages in the same time period. Tornadoes, while historically responsible for the most storm casualties in the US, fell into third place at $57 billion in property and crop damages during this time frame.