An analysis of the U.S. National Oceanic and Atmospheric Administration’s storm event database, covering events recorded between 1950 and 2011, was performed with emphasis on the major causes of harm to personal health and of damage to crops and property. Summing the totals over the full range of years showed that tornadoes are the single largest contributor to both injuries and fatalities in the United States. The economic impact of weather events was summed over the same years. Tornadoes ranked first in property damage, and hail caused the largest damage to crops.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
Storm Data [47Mb]
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
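As noted above, the data arrive as a bzip2-compressed CSV file. The sketch below illustrates that `read.csv()` can read such a file without a separate decompression step. The temporary file and the three toy rows are placeholders made up for illustration, not the NOAA data.

```r
# Minimal sketch: read.csv() reads a bzip2-compressed CSV directly.
# The file and values here are toy placeholders, not the NOAA data.
tmp <- tempfile(fileext = ".csv.bz2")

toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD"),
  FATALITIES = c(1, 0, 2),
  INJURIES   = c(15, 0, 3)
)

# Write the toy table through a bzip2 connection
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read it back; the compression is detected when the file is opened
toy_read <- read.csv(tmp, stringsAsFactors = FALSE)
dim(toy_read)
```

The same call pattern applies to the storm data file once it has been downloaded to the working directory.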
The code chunk below performs the required processing of the data for publication. It prepares the neccessary data tables to be displayed in the Result section.
setwd("C:/Albert/Coursera/Reproduceable Research/Project 2")
library(ggplot2)
library(scales)
library(dplyr)
library(data.table)
library(tidyr)
library(lubridate)
library(magrittr)
library("devtools")
library(rCharts)
require(knitr)
devtools::install_github('jbryer/DataCache')
## Skipping install of 'DataCache' from a github remote, the SHA1 (c1889dab) has not changed since last install.
## Use `force = TRUE` to force installation
require(markdown) # required for md to html
library("knitr")
library('DataCache')
SD<-read.csv("repdata%2Fdata%2FStormData.csv.bz2") ## read in the data set
dim(SD) ##rows and columns
## [1] 902297 37
str(SD)##show data fields, object type, sample fields ... etc.
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
library(lattice)
library(dplyr) ## Kludge to prevent select from hiccuping
p<-select(SD,FATALITIES,INJURIES,EVTYPE) # fatalities injuries and events from DB
p<-filter(p,FATALITIES>0|INJURIES>0) #filter out zero values
pF<-select(p,EVTYPE,FATALITIES)#split into Fatalities
pI<-select(p,EVTYPE,INJURIES)# and Injuries
pF<-filter(pF,FATALITIES>0) # remove any residual zeros
pI<-filter(pI,INJURIES>0) # ditto
names(pI)<-c("EVTYPE","number") #change column names
names(pF)<-c("EVTYPE","number")
I <- aggregate(x = list(number = pI$number), by = list(EVTYPE = pI$EVTYPE),
FUN = sum, na.rm = TRUE) # take sum of all the values for each event
F <- aggregate(x = list(number = pF$number), by = list(EVTYPE = pF$EVTYPE),
FUN = sum, na.rm = TRUE) #Ditto
F<-mutate(F,lognum=log10(number)) # take the base 10 log of the total number of fatalities
I<-mutate(I,lognum=log10(number))# take the base 10 log of the total number of injuries
names(I)<-c("EVTYPE","number","lognum") #add lognum name for the column
names(F)<-c("EVTYPE","number","lognum") #ditto
F<-data.frame(F,type="Fatalities") # add the type name to data frame fatalities
I<-data.frame(I,type="Injuries") # same for injuries
SP<-rbind(F,I) #bind the 2 frames together for the plot
SP<-filter(SP,lognum>2.47748) #remove totals <= 300
P<-select(SD,EVTYPE,PROPDMG)# same thing all the way down
C<-select(SD,EVTYPE,CROPDMG)
P<-filter(P,PROPDMG>0)
C<-filter(C,CROPDMG>0)
names(P)<-c("EVTYPE","number")
names(C)<-c("EVTYPE","number")
PD <- aggregate(x = list(number = P$number), by = list(EVTYPE = P$EVTYPE),
FUN = sum, na.rm = TRUE)
CD<- aggregate(x = list(number = C$number), by = list(EVTYPE = C$EVTYPE),
FUN = sum, na.rm = TRUE)
PD<-mutate(PD,lognum=log10(number))
CD<-mutate(CD,lognum=log10(number))
CD<-data.frame(CD,type="Crop Damage")
PD<-data.frame(PD,type="Property Damage")
SX<-rbind(PD,CD)
SX<-filter(SX,lognum>3.47718) # remove totals <= roughly 3000 (log10(3000) is about 3.4771)
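The sums above add the raw PROPDMG and CROPDMG figures as stored. The `str(SD)` output also lists PROPDMGEXP and CROPDMGEXP columns, which the Storm Data documentation describes as magnitude codes (K for thousands, M for millions, B for billions). The sketch below shows one possible way to scale the figures to dollars before summing. The toy rows, the helper `exp_multiplier`, and the column `PROPDMG_USD` are hypothetical names introduced for illustration, and leaving every other code unscaled is a simplifying assumption.

```r
# Toy rows only; the real PROPDMGEXP column has 19 levels (see str(SD) above)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HAIL"),
  PROPDMG    = c(25, 2.5, 1.2, 500),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map a magnitude code to a numeric multiplier.
# Assumption: K = 1e3, M = 1e6, B = 1e9; any other code is left unscaled.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(code))]
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Scale each record, then sum by event type
toy$PROPDMG_USD <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
totals <- aggregate(PROPDMG_USD ~ EVTYPE, data = toy, FUN = sum)
totals[order(-totals$PROPDMG_USD), ]
```

Scaling before aggregation matters because a single record coded B can outweigh thousands of records coded K, so a ranking built on scaled values may differ from one built on the raw figures.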
In the tables below, the 12 largest storm sources of property damage and crop damage, respectively, are provided. Note that TSTM is the database's abbreviation for thunderstorm, and that to the right of the number column there is a column called lognum. The number column holds the summed PROPDMG or CROPDMG values for each event type, and lognum holds the base 10 logarithm of that sum, which is used as the ordinate in the figure.
The figure clearly delineates the sources of both crop damage and property damage.
head(PD[order(-PD[,2]),],12) ## put PROPERTY DAMAGE in descending order
## EVTYPE number lognum type
## 334 TORNADO 3212258.16 6.506810 Property Damage
## 51 FLASH FLOOD 1420124.59 6.152326 Property Damage
## 348 TSTM WIND 1335965.61 6.125795 Property Damage
## 64 FLOOD 899938.48 5.954213 Property Damage
## 296 THUNDERSTORM WIND 876844.17 5.942922 Property Damage
## 106 HAIL 688693.38 5.838026 Property Damage
## 209 LIGHTNING 603351.78 5.780571 Property Damage
## 309 THUNDERSTORM WINDS 446293.18 5.649620 Property Damage
## 159 HIGH WIND 324731.56 5.511524 Property Damage
## 399 WINTER STORM 132720.59 5.122938 Property Damage
## 133 HEAVY SNOW 122251.99 5.087256 Property Damage
## 389 WILDFIRE 84459.34 4.926648 Property Damage
head(CD[order(-CD[,2]),],12) ## put CROP DAMAGE in descending order
## EVTYPE number lognum type
## 42 HAIL 579596.28 5.763126 Crop Damage
## 23 FLASH FLOOD 179200.46 5.253339 Crop Damage
## 27 FLOOD 168037.88 5.225407 Crop Damage
## 115 TSTM WIND 109202.60 5.038233 Crop Damage
## 107 TORNADO 100018.52 5.000080 Crop Damage
## 94 THUNDERSTORM WIND 66791.45 4.824721 Crop Damage
## 10 DROUGHT 33898.62 4.530182 Crop Damage
## 97 THUNDERSTORM WINDS 18684.93 4.271491 Crop Damage
## 60 HIGH WIND 17283.21 4.237624 Crop Damage
## 54 HEAVY RAIN 11122.80 4.046214 Crop Damage
## 37 FROST/FREEZE 7034.14 3.847211 Crop Damage
## 19 EXTREME COLD 6121.14 3.786832 Crop Damage
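The property damage table lists TSTM WIND, THUNDERSTORM WIND, and THUNDERSTORM WINDS as separate rows. The sketch below shows one way such label variants could be merged before summing. It reuses four rows copied from that table; the helper `normalize_evtype` and its regular expression are hypothetical, and whether these labels should be treated as one event type is a judgment for the analyst.

```r
# Rows copied from the property damage table above (summed PROPDMG values)
PD_toy <- data.frame(
  EVTYPE = c("TORNADO", "TSTM WIND", "THUNDERSTORM WIND", "THUNDERSTORM WINDS"),
  number = c(3212258.16, 1335965.61, 876844.17, 446293.18),
  stringsAsFactors = FALSE
)

# Hypothetical helper: trim, upper-case, and map thunderstorm wind
# variants to a single label
normalize_evtype <- function(x) {
  x <- toupper(trimws(as.character(x)))
  sub("^(TSTM|THUNDERSTORM) WINDS?$", "THUNDERSTORM WIND", x)
}

PD_toy$EVTYPE <- normalize_evtype(PD_toy$EVTYPE)

# Re-aggregate so the merged labels share one total
merged <- aggregate(number ~ EVTYPE, data = PD_toy, FUN = sum)
merged[order(-merged$number), ]
```

In a full analysis the normalization would be applied to SD$EVTYPE before the aggregation step rather than to the summary table.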
# one panel per damage type; bar height is the base 10 log of the summed damage
xyplot(lognum ~ EVTYPE | type, data = SX,
       main = "Types of events that have the greatest economic consequences",
       xlab = "Source of Damage",
       ylab = "Base 10 Log of Total Cost in US Dollars",
       layout = c(1, 2), type = "h",
       scales = list(x = list(rot = 90)))