## Synopsis

1. Using data from the National Oceanic and Atmospheric Administration’s (NOAA) database a summary dataset is constructed by aggregating up weather events over event types and decades.

2. Early in the analysis it is found that there is a non-random relationship between events recorded in the database and time and one must therefore be careful to simply aggregate up all events over time.

3. The poor classificaton of the data by event type is a major hurdle for conducting the research properly, but, a simple computerized replacement strategy leads to a considerable reduction in this problem.

4. Overall and in view to the data cleaning process it is maintained that the conclusions reached are quite valid and especially for the period where data coverage is good - namely the 1990s and 2000s.

5. Four dependent variables are then constructed from the dataset: total harm (injuries and casualties), total harm per event, total cost (to property and crop) and total cost per event.

6. The findings are that tornadoes indeed are the worst weather events in the US simply because they render the highest number of human casualties overall.

7. However, these events are followed closely by other serious events such as hurricanes which occur much less frequently, but they are very costly both in terms of human lives and their repercussions for the economy and livelihoods of the implicated victims when they happen.

8. Floods are also very costly, both in terms of total harm and cost, but this is mostly due to the fact that they happen frequently as do tornadoes.

9. More can be done about floods both at the individual and societal level than about tornadoes.

10. Further investigation could look into the geographical pattern of these events.

## Data Processing

The data comes from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The events in the database start in the year 1950 and end in November 2011. The data is more complete for recent time. Metadata is available from the following webpages and online services * National Weather Service Storm Data Documentation: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf * National Climatic Data Center Storm Events FAQ: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf

The dataset is downloaded using the url for future reproducibility (as long as the url is valid) storing the large dataset temporarily in the document (working directory) folder:

download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile="rawstormdata.csv.bz2")

Necessary libraries for making the assignment are invoked and the raw dataset is converted into R format and the basic dimensions of the data are inspected with the ‘str’ command:

storm_data <- read.csv(file="rawstormdata.csv.bz2", header=TRUE, sep=",")
str(storm_data)
## 'data.frame':    902297 obs. of  37 variables:
##  $STATE__ : num 1 1 1 1 1 1 1 1 1 1 ... ##$ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ... ##$ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $COUNTY : num 97 3 57 89 43 77 9 123 125 57 ... ##$ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ... ##$ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ... ##$ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ... ##$ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ... ##$ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $COUNTYENDN: logi NA NA NA NA NA NA ... ##$ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ... ##$ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ... ##$ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $F : int 3 2 2 2 2 2 2 1 3 3 ... ##$ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ... ##$ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ... ##$ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ... ##$ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ... ##$ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $LATITUDE : num 3040 3042 3340 3458 3412 ... ##$ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $LATITUDE_E: num 3051 0 0 0 0 ... ##$ LONGITUDE_: num  8806 0 0 0 0 ...
##  $REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ... ##$ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

From the above we can see that the dataset has 37 variables and 902,297 observations. As only some variables (10 vars) are needed for the analysis, the relevant variables are selected and stored in a new dataset while deleting the raw data file to save disk space.

The following variables are relevant for the research (full meta data is available from https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf):

1. BGN_DATE (Factor var.)- The date of the observation (the date and time that weather event took place)
2. STATE (Factor var.)- The state where the event took place
3. EVTYPE (Factor var.)- Type of event
4. FATALITIEs (Numeric var.) - Number of people that died in the event
5. INJURIES (Numeric var.) - Number of people that were injured in the event
6. PROPDMG (Numeric var.) - Gives information about the first part of the property damage expenditure (significant digits), the number of 0’s to be added to this number are given in 7.
7. PROPDMGEXP (Factor var.) - Factor variable, denominating the number of 0’s to be added to the number in 6. K=1000, M=1,000,0000, B=1,000,000,000 etc.
8. CROPDMG (Numeric var.) - As for 6, but related with crops.
9. CROPDMGEXP (Factor var.) - As for 7, but related with crops.
10. REFNUM (Numberic) - Original order of observations in the raw dataset, stored by date and within dates by states in alphanumeric order.

We select these 10 vars, store them in a new dataset called storms and remove the storm_data file from the disk:

storms <- select(storm_data, BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, REFNUM)
rm(storm_data)

A major barrier to completing this assignment succesfully is the messy state of the raw data. It will take several days to properly correct all the errors made by NOAA whilst classifying the events. On Page 6 in the metadata file we can see that there should only be a total of 48 different event types. Aggregating the data up (not shown) it is verified that there are a total of 898 different categories in the dataset at the moment. This means that 850 have to be reclassified into the original event types. I would put this data in Excel and store the old and new EVTYPE as documentation. However, this task seems a bit beyond the scope of the assignment. As a substitute I found on this recoding done by another student Robert Reynolds (https://rpubs.com/robertreynolds77/142333) and I decided to use it as a quickfix (correcting for some wrong reclassifications using page 6 in the metadata file) for my data and see how many of the 850 mistakes would be erradicated:

storms$EVTYPE[grep("AVALAN",storms$EVTYPE)]                     <- "AVALANCHE"
storms$EVTYPE[grep("BLIZZARD|WINTER",storms$EVTYPE)]            <- "BLIZZARD"
storms$EVTYPE[grep("FLOOD|FLDG|FLD|URBAN|FLOOO",storms$EVTYPE)] <- "FLOOD"
storms$EVTYPE[grep("RAIN",storms$EVTYPE)]                       <- "HEAVY RAIN"
storms$EVTYPE[grep("FREEZING RAIN",storms$EVTYPE)]              <- "EXTREME COLD/WIND CHILL"
storms$EVTYPE[grep("ICE|ICY",storms$EVTYPE)]                    <- "ICE STORM"
storms$EVTYPE[grep("SNOW",storms$EVTYPE)]                       <- "HEAVY SNOW"
storms$EVTYPE[grep("HAIL",storms$EVTYPE)]                       <- "HAIL"
storms$EVTYPE[grep("SURF",storms$EVTYPE)]                       <- "HIGH SURF"
storms$EVTYPE[grep("HIGH WIND",storms$EVTYPE)]                  <- "HIGH WIND"
storms$EVTYPE[grep("HURRICANE|TYPHOON",storms$EVTYPE)]          <- "HURRICANE"
storms$EVTYPE[grep("TROPICAL",storms$EVTYPE)]                   <- "TROPICAL STORM"
storms$EVTYPE[grep("THUNDER|TSTM",storms$EVTYPE)]               <- "THUNDERSTORM WIND"
storms$EVTYPE[grep("LIGHTN|LIGHTI|LIGNT",storms$EVTYPE)]        <- "LIGHTNING"
storms$EVTYPE[grep("WIND|WND",storms$EVTYPE)]                   <- "THUNDERSTORM WIND"
storms$EVTYPE[grep("TORN|FUNNEL",storms$EVTYPE)]                <- "TORNADO"
storms$EVTYPE[grep("FIRE",storms$EVTYPE)]                       <- "WILDFIRE"
storms$EVTYPE[grep("HEAT|HOT|HIGH TEMP|DRI",storms$EVTYPE)]         <- "DROUGHT"
storms$EVTYPE[grep("SURGE",storms$EVTYPE)]                      <- "STORM SURGE/TIDE"
storms$EVTYPE[grep("RIP",storms$EVTYPE)]                        <- "RIP CURRENT"
storms$EVTYPE[grep("SWELL|SEA|TIDE",storms$EVTYPE)]             <- "STORM SURGE/TIDE"
storms$EVTYPE[grep("SPOUT",storms$EVTYPE)]                      <- "WATERSPOUT"
storms$EVTYPE[grep("SLIDE",storms$EVTYPE)]                      <- "STORM SURGE/TIDE"
storms$EVTYPE[grep("DUST",storms$EVTYPE)]                       <- "DUST STORM"

This helped a lot as the problem of wrongly classified events now dropped to 234-48=186 - which is a major drop in wrong classifcation (from 850 before).

Some additional minor data transformations are necessary, we make the computer do all of them so that the research is 100% reproducible. The BGN_DATE is read in as date format and 5 new vars are created (TOTAL_HUMAn_HARM, PROPERTY_DAMAGE, CROP_DAMAGE, TOTAL_ECONOMIC_DAMAGE, DECADE). We will need these variables when next summing up the data for analysis.

tbl_df(storms)
## # A tibble: 902,297 × 10
##              BGN_DATE  STATE  EVTYPE FATALITIES INJURIES PROPDMG
##                <fctr> <fctr>  <fctr>      <dbl>    <dbl>   <dbl>
## 1   4/18/1950 0:00:00     AL TORNADO          0       15    25.0
## 2   4/18/1950 0:00:00     AL TORNADO          0        0     2.5
## 3   2/20/1951 0:00:00     AL TORNADO          0        2    25.0
## 4    6/8/1951 0:00:00     AL TORNADO          0        2     2.5
## 5  11/15/1951 0:00:00     AL TORNADO          0        2     2.5
## 6  11/15/1951 0:00:00     AL TORNADO          0        6     2.5
## 7  11/16/1951 0:00:00     AL TORNADO          0        1     2.5
## 8   1/22/1952 0:00:00     AL TORNADO          0        0     2.5
## 9   2/13/1952 0:00:00     AL TORNADO          1       14    25.0
## 10  2/13/1952 0:00:00     AL TORNADO          0        0    25.0
## # ... with 902,287 more rows, and 4 more variables: PROPDMGEXP <fctr>,
## #   CROPDMG <dbl>, CROPDMGEXP <fctr>, REFNUM <dbl>
storms <- mutate(storms, BGN_DATE=as.Date(BGN_DATE, "%m/%d/%Y %H:%M:%S"))
storms <- mutate(storms, YEAR=year(BGN_DATE))
storms <- mutate(storms, TOTAL_HUMAN_HARM=FATALITIES+INJURIES)
storms <- mutate(storms, PROPERTY_DAMAGE = ifelse(PROPDMGEXP =='K', PROPDMG*1000, ifelse(PROPDMGEXP =='M', PROPDMG*1000000, ifelse(PROPDMGEXP == 'B', PROPDMG*1000000000, ifelse(PROPDMGEXP == 'H', PROPDMG*100, PROPDMG)))))
storms <- mutate(storms, CROP_DAMAGE = ifelse(CROPDMGEXP =='K', CROPDMG*1000, ifelse(CROPDMGEXP =='M', CROPDMG*1000000, ifelse(CROPDMGEXP =='B', CROPDMG*1000000000, ifelse(CROPDMGEXP =='H', CROPDMG*100, CROPDMG)))))
storms <- mutate(storms, TOTAL_ECONOMIC_DAMAGE=PROPERTY_DAMAGE+CROP_DAMAGE)
storms <- mutate(storms, EVTYPE=toupper(EVTYPE), PROPDMGEXP=toupper(PROPDMGEXP), CROPDMGEXP=toupper(CROPDMGEXP))
storms <- mutate(storms, DECADE = ifelse(YEAR %in% 1950:1959, "1950", ifelse(YEAR %in% 1960:1969, "1960", ifelse(YEAR %in% 1970:1979, "1970", ifelse(YEAR %in% 1980:1989, "1980", ifelse(YEAR %in% 1990:1999, "1990", "2000"))))))
storms$DECADE <- as.numeric(storms$DECADE)

All set to go - we can now finally start to analyse the data :-)

# RESEARCH QUESTIONS

The research questions to be adressed in this study are:

• Across the US, which type of events are most harmful w.r.t population health?
• Across the US, which type of events have the greatest economic consequences?

# INITIAL EXPLORATION AND SOME DESCRIPTIVE STATISTICS

Of course we can just aggregate all the numbers up in the database by eventtype in terms of total harm and economic damages.

However, this ignores some fundamental questions about this dataset.I identified two of those that I would like to address a bit in this paper (one could also have focused on a lot of other things, such as difference between property and crop damage, injury vs. fatalities etc.):

1.Is there an issue with the time dimensionality of the data that can challenge the validity of just aggregating up this data? E.g. is it true that back in time some events got more coverage than others and that there is a tendency for more and more weather event categories to emerge over time? If so, how will it possibly affect the findings if we want to convey the longitudinal perspective?

To answer this we have to create a new dataset where we borrow from what we have learned in course project 1. Because we need to aggregate up the number of events by event type per decade (I am also calculating some other summary stats by decade and event type that I will need a bit later in the plots). First I did everything by year but there are too many observations to get an overview of the trellis plots, so I switched to decades:

tbl_df(storms)
## # A tibble: 902,297 × 16
##      BGN_DATE  STATE  EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
##        <date> <fctr>   <chr>      <dbl>    <dbl>   <dbl>      <chr>
## 1  1950-04-18     AL TORNADO          0       15    25.0          K
## 2  1950-04-18     AL TORNADO          0        0     2.5          K
## 3  1951-02-20     AL TORNADO          0        2    25.0          K
## 4  1951-06-08     AL TORNADO          0        2     2.5          K
## 5  1951-11-15     AL TORNADO          0        2     2.5          K
## 6  1951-11-15     AL TORNADO          0        6     2.5          K
## 7  1951-11-16     AL TORNADO          0        1     2.5          K
## 8  1952-01-22     AL TORNADO          0        0     2.5          K
## 9  1952-02-13     AL TORNADO          1       14    25.0          K
## 10 1952-02-13     AL TORNADO          0        0    25.0          K
## # ... with 902,287 more rows, and 9 more variables: CROPDMG <dbl>,
## #   CROPDMGEXP <chr>, REFNUM <dbl>, YEAR <dbl>, TOTAL_HUMAN_HARM <dbl>,
## #   PROPERTY_DAMAGE <dbl>, CROP_DAMAGE <dbl>, TOTAL_ECONOMIC_DAMAGE <dbl>,
## #   DECADE <dbl>
storms1 <- storms%>%
summarize(count=n(), HARM=sum(TOTAL_HUMAN_HARM), COST=sum(TOTAL_ECONOMIC_DAMAGE))

The aggregated data has 295 observations. Many of these are still due to errors in the variable EVTYPE. Ideally we should reduce the number of categories the proper way. However, a quick strategy around this problem is to accept that we loose some observations. So I create a new dataet out of storms1 that only include event types by decade with more than 1 observation and events that had some kind of human or economic cost:

storms2 <- filter(storms1, count>1, HARM>0, COST>0)
storms3 <- filter(storms2, DECADE>1980)

We are now down to 60 observations by decades (55 if only focusing on the period after 1989).

Now we can make a lattice plot to check whether there is a systematic relationship between event counts and decades or whehter this appears to be random and hence unimportant in the dataset.

library(lattice)
attach(storms2)
xyplot(count~DECADE|EVTYPE, par.strip.text=list(cex=0.5), main="Event counts by decade and event type")

detach(storms2)

The trellis plot shows that the identification of event type is not random across decades. Some events are simply better covered over time than others (especially tornadoes, or as we can also see from the header of storms1 - exactly hail, tornadoes and thunderstorm wind are the only type of events recorded until the 1990s - and both hail and thunderstorm wind is of no recorded conomic or human consequence during most of this period and therefore drops out of the storms2 dataset).

Due to the very low number of observations on this list (storms2 data) until the 1990s and the fact that the lattice plots are very difficult (near impossible!) to handle in terms of legends, we finally settle for the storms3 dataset which focuses only on data from 1990 onwards.

2.What is the correct dependent variable to judge what is most harmful? Is it the total damange by event type or the damage per event by type? There is a difference between contemplating this problem at the level of individuals and the level of society.

At the individual level we are interested in knowing what type of weather events that are more likely to jeopardize our wellbeing, then we can design strategies to avoid them such as not choosing to live in particular places.

At societal level we are maybe more interested in knowing about events that we can or should seek to prevent in the perspective that large singular but very harmful events to society need more attention.

For example, tornadoes are likely to cause many minor events and therefore in the aggregate lead to a lot of damage. But it is almost impossible to address this problem at the societal level because their occurrences are more random.

Oppositely are major floods such as Katrina more likely to happen only quite seldomly, but when they do they have enormous cost both for livelihoods and health and they may have major repercussions at the societal level for a very long time.

In conclusion the very worst events both for our wellbeing and the economy must be those that cause harm to society and indviduals at the same time. So those are the events I will be looking for in my analysis now…

But first we need to calcuate two additional variables due to the above observations, that is HARM per event and COST per event over decades in the dataset storms2:

storms3 <- mutate(storms3, HARM_per_event=HARM/count, COST_per_event=COST/count)

# STATISTICAL ANALYSIS AND RESULTS

Now the analysis can be conducted by using two simple plots. One showing the plot of total human harm and harm_per_event by event type and another showing the plot of total cost and cost_per_event by event type. We observe based on these two plots and then conclude at the end. Some concluding observations are given towards the end.

## RESULTS - HUMAN HARM

library(ggplot2)

qplot(x=HARM, y=HARM_per_event, color=EVTYPE, data=storms3, geom="point")

The plot shows that in terms of human harm, the tornadoes, floods, droughts and thunderstorms (in that order) were the type of events that cost the highest number of casulties and injuries. In terms of harm per event the picture is a bit different - here hurricanes, tsunamis, glazes, droughts, ice storms and dust storms were the most dangerous in terms of casualties. Hence there is little overlap as we can also see from the plot and overall it must be concluded that droughts are perhaps worse than we think because they occur frequently enough to cause a quite high level of aggregate harm and per event they are also relatively lethal.

## RESULTS - ECONOMIC DAMAGE

qplot(x=COST, y=COST_per_event, color=EVTYPE, data=storms3, geom="point")

The plot shows that cost from severe weather events are highest for floods, hurricanes, storm surges and tornados (in that order), whereas cost per event are highest for hurricanes, storm surges, freezes and tropical storms (in that order), Hence in terms of cost there is higher correlation between overall damage and cost per event than there was for human harm, with both hurricanes and storm surges having very large repercussions for the livelihoods of the implicated families.

Floods are also overall very expensive, however, much less so per event. But they are still worthwhile strategising up against such as by building dams and other special fences due to their frequent occurrence. They are also likely to hurt particular vulnerable geographies. As with tornadoes there is also the possibility to plan against locating where they come amongst indviduals.

# CONCLUSION

The analysis gives a quite multifaceted picture of the impact weather has on wellbeing and welfare in the US. On the top list of all these events are tornadoes (simply because they take the most lives), followed by hurricanes (that are costly to US livelihood in several ways), Floods bear the highest cost on society overall and also lead to a high number of total deaths and injuries. Tsunamis are much less frequent but when they strike they have a high casualty rate. Droughts also have severe cost and are in some sense the worst because they are both relatively frequent and bear also a relatively high cost in terms of casualties per event.