Synopsis

In this document we analyse the NOAA Storm Data, a time series from 1950 to 2011 recording the effects of atmospheric events on population health and the damage they cause to crops and property. We attempt to determine which events are the most harmful. To avoid the effects of changes in currency value and other factors linked to the passage of time, we carry out the analysis on an annual basis. The data show an inflection point around 1994. Before that date the most harmful event, for both health and the economy, was wind, mainly in the form of tornadoes. From 1994 to 2011 other events emerge as the most harmful of the year, particularly heat for fatalities, flood for property and drought for crops.

Data Processing

The data we process are the NOAA data made available on the Coursera web site. The file is a CSV compressed with bzip2. Once the file has been downloaded to the working directory, we read it directly from the compressed archive.

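    # stringsAsFactors=TRUE keeps the factor columns that the levels() calls below rely on (the default changed in R 4.0)
    raw_data=read.csv(bzfile("./repdata_data_StormData.csv.bz2"), stringsAsFactors=TRUE)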
    names(raw_data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

The dataset comprises 37 variables and 902297 observations. The variables we are interested in are BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP.
Because damage costs to crops or property from different years may not be comparable, we also need the year of each event, so we create a new variable YEAR.

    raw_data$YEAR=as.numeric(format(as.Date(raw_data$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"),'%Y'))
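
As a quick check of the conversion above, the same expression can be applied to a couple of timestamps written in the layout the format string expects. This is only a sketch: the two values in `bgn_date` are illustrative and are not quoted from the dataset.

```r
# Two sample timestamps in the "%m/%d/%Y %H:%M:%S" layout (illustrative values)
bgn_date <- c("4/18/1950 0:00:00", "11/28/2011 0:00:00")

# Parse to Date, format back as a four-digit year, convert to numeric
year <- as.numeric(format(as.Date(bgn_date, format = "%m/%d/%Y %H:%M:%S"), "%Y"))
print(year)
```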

The number of events reported shows a clear ascending trend, with an inflection point around 1994. The meaning of this trend would be an interesting subject for further studies.

    plot(table(raw_data$YEAR),type='l', ylab='Number of events',xlab='Year')

[Figure: number of events reported per year]

The NOAA recognizes 48 event types, but the EVTYPE variable of the dataset has 985 levels. Some levels are meaningless (such as APACHE COUNTY), and there are synonyms, mixed upper and lower case and misspellings, so we have to try to fix this. The standard event types are

    standard_ev=c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill", "Dense Fog", "Debris Flow", "Dense Smoke", "Drought","Dust Devil", "Dust Storm", "Extreme Heat", "Extreme Cold/Wind Chill","Flash Flood", "Flood", "Frost/Freeze", "Funnel Cloud", "Freezing Fog","Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf", "High Wind", "Hurricane", "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood", "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet","Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", "Winter Weather")

Then we process the levels of EVTYPE

#remove leading and trailing blank spaces, misspellings, etc.
test=toupper(raw_data$EVTYPE)
test=gsub('^\\s+','',test)
test=gsub('\\s+$','',test)
test=gsub('(RECORD)[ /]','',test)
test=gsub('STORMS|STORMW','STORM',test)
test=gsub('EXCESSIVE','EXTREME',test)
test=gsub('FLOODING|FLOODS|FLD"','FLOOD',test)
test=gsub('WINDS|WND','WIND',test)
test=gsub('TSTM|THUNDEERSTORM|THUNDERESTORM|TUNDERSTORM','THUNDERSTORM',test)
test=gsub('AVALANCE','AVALANCHE',test)
test=gsub('WINTERY|WINTRY','WINTER WEATHER',test)
test=gsub('^WATER SPOUT','WATERSPOUT',test)
test=gsub('^THUNDERSNOW','THUNDERSTORM',test)
test=gsub('^LIGHTING|LIGNTNING','LIGHTNING',test)
test=gsub('^HEAVY SURF|HEAVY SEAS|HAZARDOUS SURF','HIGH SURF',test)
test=gsub('^HEAVY PRECIPATATION|HEAVY PRECIPITATION','HEAVY RAIN',test)
#harmonizing denominations
pattern=c('THUNDER.*WIND','^(HURRICANE[ ,/]|TYPHOON)','^TROPICAL STORM[ ,/]','FLASH','BLIZZARD','(WILD|BRUSH).*(FIRE|FIRES)','(COAST|CSTL|BEACH).*FLOOD','^COLD','^STRONG.*WIND','^EXTREME.*(COLD|WINDCHILL|WIND CHILL)|^BLOWING SNOW','DUST (DEVIL|DEVEL)','^FLOOD','^HAIL','DROUGHT|DRY','(FROST|FREEZE|FREEZING)','LAKE.*FLOOD','^TEMPERATURE')
response=c('THUNDERSTORM WIND','HURRICANE','TROPICAL STORM','FLASH FLOOD','BLIZZARD','WILDFIRE','COASTAL FLOOD','COLD/WIND CHILL','HIGH WIND','EXTREME COLD/WIND CHILL','DUST DEVIL','FLOOD','HAIL','DROUGHT','FROST/FREEZE','LAKESHORE FLOOD','HEAT')
for (i in seq_along(pattern)){
    test[grep(pattern[i],test)]=response[i]
}
test[grepl('FLOOD',test) & !grepl('FLASH|COASTAL|LAKE',test)]='FLOOD'
test[grepl('SURGE|TIDE',test) & !grepl('ASTRO|BLOW-OUT',test)]='STORM SURGE/TIDE'

#applying standard names
for (event in standard_ev) { 
    rows=grep(paste("^", toupper(event), sep=""), test)
    if (length(rows) > 0) {
        test[rows]=event
    }
}

#creating a tidy dataset
#join the standard names into a single alternation pattern
cad=paste(standard_ev,collapse='|')
tidy_data=raw_data
tidy_data$EVTYPE=test
tidy_data=tidy_data[grep(cad,test),]
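
The cleaning steps above can be illustrated on a handful of labels. The sketch below is a reduced version under stated assumptions: `raw_labels` and the three-entry `standard` vector are hypothetical samples, only two of the substitutions are reproduced, and rows are kept with an exact `%in%` match, which is stricter than the `grep` filter used in the report.

```r
# Hypothetical sample labels, chosen to show each kind of fix
raw_labels <- c("TSTM WIND", " Flash Flooding", "tornado F1", "APACHE COUNTY")
standard   <- c("Flash Flood", "Thunderstorm Wind", "Tornado")

# Normalise case and surrounding whitespace, then apply two substitutions
clean <- trimws(toupper(raw_labels))
clean <- gsub("TSTM", "THUNDERSTORM", clean)
clean[grepl("FLASH", clean)] <- "FLASH FLOOD"

# Replace every label that starts with a standard name by that name
for (event in standard) {
    clean[grepl(paste0("^", toupper(event)), clean)] <- event
}

# Share of labels that ended up as a standard event type
kept <- clean %in% standard
print(clean)
print(100 * mean(kept))
```

In this toy case the meaningless label is the only one dropped; on the full dataset the same ratio is what the retained-rows percentage measures.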

After this processing we retain 99.1361% of the rows.
We set up two datasets, one for each study:

health=tidy_data[,c(8,23,24,38)]
health=health[health$FATALITIES>0 | health$INJURIES>0,]
economy=tidy_data[,c(8,25,26,27,28,38)]

The economy dataset requires more processing. First, we only want records that report some amount of damage. Second, each amount is expressed as a number (CROPDMG, PROPDMG) and a power of 10 (CROPDMGEXP, PROPDMGEXP). These powers of 10 are encoded as a set of symbols.

economy=economy[economy$CROPDMG>0 | economy$PROPDMG>0,]
#map each exponent symbol to its power of 10; unlisted symbols become NA
levels(economy$PROPDMGEXP)=list('0'=c('','-','?','+','0'),'1'='1','2'=c('2','h','H'),'3'=c('3','K'),'4'='4','5'='5','6'=c('m','M'),'7'='7','8'='8', '9'='B')
#as.character() first: as.numeric() on a factor returns the level codes, not the labels
economy$PROPVAL=economy$PROPDMG*10^as.numeric(as.character(economy$PROPDMGEXP))
levels(economy$CROPDMGEXP)=list('0'=c('','-','?','+','0'),'1'='1','2'=c('2','h','H'),'3'=c('3','K'),'4'='4','5'='5','6'=c('m','M'),'7'='7','8'='8', '9'='B')
economy$CROPVAL=economy$CROPDMG*10^as.numeric(as.character(economy$CROPDMGEXP))
#amounts with an unrecognized exponent symbol count as 0
economy$PROPVAL[is.na(economy$PROPVAL)]=0
economy$CROPVAL[is.na(economy$CROPVAL)]=0
economy$TOTAL=economy$PROPVAL+economy$CROPVAL
#remove useless variables
economy=economy[,-(2:5)]
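
One detail worth checking when decoding exponent symbols is that `as.numeric()` applied to a factor returns the internal level codes rather than the numbers shown as labels. The sketch below uses made-up values and a hypothetical lookup vector `exp_map` (an assumption of this example, covering only a few symbols) to show a label-based conversion next to the factor pitfall.

```r
# Hypothetical sample: damage figures with their exponent symbols
dmg    <- c(2.5, 10, 3, 1)
dmgexp <- factor(c("K", "M", "B", ""))

# Named lookup from symbol to power of ten (assumed for this sketch only)
exp_map <- c("H" = 2, "K" = 3, "M" = 6, "B" = 9)

# Look up by label; blank or unknown symbols give NA, treated as 10^0
power <- exp_map[toupper(as.character(dmgexp))]
power[is.na(power)] <- 0
value <- dmg * 10^unname(power)
print(value)

# The pitfall: numeric-looking factor levels versus their integer codes
f <- factor(c("3", "6", "9"))
codes  <- as.numeric(f)                # level codes
labels <- as.numeric(as.character(f))  # labels read as numbers
print(codes)
print(labels)
```

Converting through `as.character()` first keeps the exponent equal to the label, whichever order the levels happen to be stored in.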

Results

First we analyse the health data, on an annual basis. The following code shows, for each year, the event type with the most fatalities and the event type with the most injuries, in that order.

health_total=aggregate(cbind(FATALITIES,INJURIES)~EVTYPE+YEAR,data=health,sum)
max_fat=by(health_total$FATALITIES,health_total$YEAR,max)
max_inj=by(health_total$INJURIES,health_total$YEAR,max)
year=levels(as.factor(health_total$YEAR))
for (i in 1:length(year)){
    temp=health_total[health_total$YEAR==year[i],]
    fat=temp[which(temp$FATALITIES==max_fat[i]),1]
    inj=temp[which(temp$INJURIES==max_inj[i]),1]
    print(paste(year[i],fat,inj,sep='  '))
    
}
## [1] "1950  Tornado  Tornado"
## [1] "1951  Tornado  Tornado"
## [1] "1952  Tornado  Tornado"
## [1] "1953  Tornado  Tornado"
## [1] "1954  Tornado  Tornado"
## [1] "1955  Tornado  Tornado"
## [1] "1956  Tornado  Tornado"
## [1] "1957  Tornado  Tornado"
## [1] "1958  Tornado  Tornado"
## [1] "1959  Tornado  Tornado"
## [1] "1960  Tornado  Tornado"
## [1] "1961  Tornado  Tornado"
## [1] "1962  Tornado  Tornado"
## [1] "1963  Tornado  Tornado"
## [1] "1964  Tornado  Tornado"
## [1] "1965  Tornado  Tornado"
## [1] "1966  Tornado  Tornado"
## [1] "1967  Tornado  Tornado"
## [1] "1968  Tornado  Tornado"
## [1] "1969  Tornado  Tornado"
## [1] "1970  Tornado  Tornado"
## [1] "1971  Tornado  Tornado"
## [1] "1972  Tornado  Tornado"
## [1] "1973  Tornado  Tornado"
## [1] "1974  Tornado  Tornado"
## [1] "1975  Tornado  Tornado"
## [1] "1976  Tornado  Tornado"
## [1] "1977  Tornado  Tornado"
## [1] "1978  Tornado  Tornado"
## [1] "1979  Tornado  Tornado"
## [1] "1980  Tornado  Tornado"
## [1] "1981  Tornado  Tornado"
## [1] "1982  Tornado  Tornado"
## [1] "1983  Tornado  Tornado"
## [1] "1984  Tornado  Tornado"
## [1] "1985  Tornado  Tornado"
## [1] "1986  Thunderstorm Wind  Tornado"
## [1] "1987  Tornado  Tornado"
## [1] "1988  Tornado  Tornado"
## [1] "1989  Tornado  Tornado"
## [1] "1990  Tornado  Tornado"
## [1] "1991  Tornado  Tornado"
## [1] "1992  Tornado  Tornado"
## [1] "1993  High Wind  Tornado"         "1993  Thunderstorm Wind  Tornado"
## [1] "1994  Lightning  Ice Storm"
## [1] "1995  Heat  Heat"    "1995  Heat  Tornado"
## [1] "1996  Flash Flood  Tornado"
## [1] "1997  Extreme Heat  Tornado"
## [1] "1998  Extreme Heat  Flood"
## [1] "1999  Extreme Heat  Tornado"
## [1] "2000  Extreme Heat  Tornado"
## [1] "2001  Extreme Heat  Tornado"
## [1] "2002  Extreme Heat  Tornado"
## [1] "2003  Flash Flood  Tornado"
## [1] "2004  Flash Flood  Hurricane"
## [1] "2005  Extreme Heat  Tornado"
## [1] "2006  Extreme Heat  Extreme Heat"
## [1] "2007  Tornado  Tornado"
## [1] "2008  Tornado  Tornado"
## [1] "2009  Rip Current  Tornado"
## [1] "2010  Flash Flood  Tornado"
## [1] "2011  Tornado  Tornado"
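
The per-year maximum can also be written with `split()` and `which.max()`. The following is a sketch on invented totals (the data frame `totals` and its numbers are purely illustrative); note that `which.max()` returns only the first maximum, so a tied year would show a single event type instead of the several lines printed above.

```r
# Toy yearly totals (invented numbers, for illustration only)
totals <- data.frame(
    EVTYPE     = c("Tornado", "Heat", "Tornado", "Heat"),
    YEAR       = c(1994, 1994, 1995, 1995),
    FATALITIES = c(10, 4, 3, 20),
    stringsAsFactors = FALSE
)

# For each year, keep the event type with the largest total
top <- sapply(split(totals, totals$YEAR),
              function(d) d$EVTYPE[which.max(d$FATALITIES)])
print(top)
```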

From 1950 to 1993 the most harmful event was wind, mostly in the form of tornadoes, for both injuries and fatalities. 1994 is a special year: lightning caused the most fatalities and ice storms the most injuries. The 1995-2006 series is especially interesting: the most fatalities were caused by heat and flash floods (is global warming playing a part here?), while tornadoes remained the main cause of injuries in most years. In the last years (2007-2011) tornadoes, flash floods and rip currents caused the most fatalities, and tornadoes the most injuries. The conclusion would be that the tornado is the most harmful event overall, although in recent years heat and floods have grown in importance.

The economic data are also analysed on an annual basis. The results are presented in this order: year, event with the highest property cost, event with the highest crop cost and event with the highest total cost.

economy_total=aggregate(cbind(PROPVAL,CROPVAL,TOTAL)~EVTYPE+YEAR,data=economy,sum)
max_prop=by(economy_total$PROPVAL,economy_total$YEAR,max)
max_crop=by(economy_total$CROPVAL,economy_total$YEAR,max)
max_total=by(economy_total$TOTAL,economy_total$YEAR,max)
year=levels(as.factor(economy_total$YEAR))
for (i in 1:length(year)){
    temp=economy_total[economy_total$YEAR==year[i],]
    prop=temp[which(temp$PROPVAL==max_prop[i]),1]
    crop=temp[which(temp$CROPVAL==max_crop[i]),1]
    total=temp[which(temp$TOTAL==max_total[i]),1]
    print(paste(year[i],prop,crop,total,sep='  '))
    
}
## [1] "1950  Tornado  Tornado  Tornado"
## [1] "1951  Tornado  Tornado  Tornado"
## [1] "1952  Tornado  Tornado  Tornado"
## [1] "1953  Tornado  Tornado  Tornado"
## [1] "1954  Tornado  Tornado  Tornado"
## [1] "1955  Tornado  Tornado  Tornado"
## [1] "1956  Tornado  Tornado  Tornado"
## [1] "1957  Tornado  Tornado  Tornado"
## [1] "1958  Tornado  Tornado  Tornado"
## [1] "1959  Tornado  Tornado  Tornado"
## [1] "1960  Tornado  Tornado  Tornado"
## [1] "1961  Tornado  Tornado  Tornado"
## [1] "1962  Tornado  Tornado  Tornado"
## [1] "1963  Tornado  Tornado  Tornado"
## [1] "1964  Tornado  Tornado  Tornado"
## [1] "1965  Tornado  Tornado  Tornado"
## [1] "1966  Tornado  Tornado  Tornado"
## [1] "1967  Tornado  Tornado  Tornado"
## [1] "1968  Tornado  Tornado  Tornado"
## [1] "1969  Tornado  Tornado  Tornado"
## [1] "1970  Tornado  Tornado  Tornado"
## [1] "1971  Tornado  Tornado  Tornado"
## [1] "1972  Tornado  Tornado  Tornado"
## [1] "1973  Tornado  Tornado  Tornado"
## [1] "1974  Tornado  Tornado  Tornado"
## [1] "1975  Tornado  Tornado  Tornado"
## [1] "1976  Tornado  Tornado  Tornado"
## [1] "1977  Tornado  Tornado  Tornado"
## [1] "1978  Tornado  Tornado  Tornado"
## [1] "1979  Tornado  Tornado  Tornado"
## [1] "1980  Tornado  Tornado  Tornado"
## [1] "1981  Tornado  Tornado  Tornado"
## [1] "1982  Tornado  Tornado  Tornado"
## [1] "1983  Tornado  Tornado  Tornado"
## [1] "1984  Tornado  Tornado  Tornado"
## [1] "1985  Tornado  Tornado  Tornado"
## [1] "1986  Tornado  Tornado  Tornado"
## [1] "1987  Tornado  Tornado  Tornado"
## [1] "1988  Tornado  Tornado  Tornado"
## [1] "1989  Tornado  Tornado  Tornado"
## [1] "1990  Tornado  Tornado  Tornado"
## [1] "1991  Tornado  Tornado  Tornado"
## [1] "1992  Tornado  Tornado  Tornado"
## [1] "1993  Flood  Flood  Flood"
## [1] "1994  Flash Flood  Ice Storm  Ice Storm"
## [1] "1995  Hurricane  Flood  Hurricane"
## [1] "1996  Hurricane  Drought  Hurricane"
## [1] "1997  Flood  Drought  Flood"
## [1] "1998  Hurricane  Drought  Hurricane"
## [1] "1999  Hurricane  Hurricane  Hurricane"
## [1] "2000  Wildfire  Drought  Drought"
## [1] "2001  Tropical Storm  Drought  Tropical Storm"
## [1] "2002  Tornado  Drought  Hurricane"
## [1] "2003  Wildfire  Drought  Wildfire"
## [1] "2004  Hurricane  Hurricane  Hurricane"
## [1] "2005  Hurricane  Hurricane  Hurricane"
## [1] "2006  Flood  Drought  Flood"
## [1] "2007  Tornado  Flood  Tornado"
## [1] "2008  Storm Surge/Tide  Flood  Storm Surge/Tide"
## [1] "2009  Hail  Hail  Hail"
## [1] "2010  Hail  Flood  Flood"
## [1] "2011  Tornado  Flood  Tornado"

From 1950 to 1992 the costliest event every year was the tornado, and in 1993 it was flood. From 1994 to 2011 the most harmful events for property were wind events (tornado, hurricane and tropical storm), floods, storm surge, hail and wildfire; for crops they were drought, hurricane, flood, hail and ice storm.