In this document we analyse the NOAA Storm Data, a time series from 1950 to 2011 that records the effects of atmospheric events on population health and the damage they cause to crops and property. We attempt to determine which events are the most harmful. To limit the effect of changes in currency value and other factors linked to the passage of time, the analysis is done on an annual basis. The data show an inflection point around 1994. Before that date the most harmful event, for both health and the economy, was wind, mainly in the form of tornadoes. From 1994 to 2011 other events emerge as the most harmful of the year, particularly heat for fatalities, floods for property and drought for crops.
The data we process are the NOAA data made available on the Coursera web site. The file is a CSV compressed with bzip2. Once the file has been downloaded to the working directory, we read it directly from the compressed archive.
raw_data=read.csv(bzfile("./repdata_data_StormData.csv.bz2"))
names(raw_data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
The dataset comprises 37 variables and 902297 observations. The variables we are interested in are BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP.
We are interested in the year of each event because the costs of damage to crops or property may not be comparable across years, so we create a new variable, YEAR.
raw_data$YEAR=as.numeric(format(as.Date(raw_data$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"),'%Y'))
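As a quick check of the date handling, the same conversion can be tried on a few sample strings. This is only a sketch: the values in `bgn_date` are made up for illustration, written in the month/day/year layout that the format string above expects.

```r
# Made-up sample dates in the layout assumed by the format string
bgn_date <- c("4/18/1950 0:00:00", "1/6/1994 0:00:00", "11/30/2011 0:00:00")

# Parse the date part, then keep only the four-digit year as a number
years <- as.numeric(format(as.Date(bgn_date, format = "%m/%d/%Y %H:%M:%S"), "%Y"))
print(years)
```

The time of day is parsed but discarded by `as.Date`, so only the calendar date affects the result.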
The number of reported events shows a clear upward trend, with an inflection point around 1994. The cause of this trend would be an interesting subject for further study.
plot(table(raw_data$YEAR),type='l', ylab='Number of events',xlab='Year')
The NOAA recognizes 48 event types, but the EVTYPE variable of the dataset has 985 levels. There are meaningless levels such as APACHE COUNTY, as well as synonyms, mixed upper and lower case and misspellings, so we have to try to fix this. The standard event types are
standard_ev=c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill", "Dense Fog", "Debris Flow", "Dense Smoke", "Drought","Dust Devil", "Dust Storm", "Extreme Heat", "Extreme Cold/Wind Chill","Flash Flood", "Flood", "Frost/Freeze", "Funnel Cloud", "Freezing Fog","Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf", "High Wind", "Hurricane", "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood", "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet","Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", "Winter Weather")
Then we process the levels of EVTYPE
#remove leading blank spaces, misspellings, etc.
test=toupper(raw_data$EVTYPE)
test=gsub('^\\s+','',test)
test=gsub('(RECORD)[ /]','',test)
test=gsub('STORMS|STORMW','STORM',test)
test=gsub('EXCESSIVE','EXTREME',test)
test=gsub('FLOODING|FLOODS|FLD','FLOOD',test)
test=gsub('WINDS|WND','WIND',test)
test=gsub('TSTM|THUNDEERSTORM|THUNDERESTORM|TUNDERSTORM','THUNDERSTORM',test)
test=gsub('AVALANCE','AVALANCHE',test)
test=gsub('WINTERY|WINTRY','WINTER WEATHER',test)
test=gsub('^WATER SPOUT','WATERSPOUT',test)
test=gsub('^THUNDERSNOW','THUNDERSTORM',test)
test=gsub('^LIGHTING|LIGNTNING','LIGHTNING',test)
test=gsub('^HEAVY SURF|HEAVY SEAS|HAZARDOUS SURF','HIGH SURF',test)
test=gsub('^HEAVY PRECIPATATION|HEAVY PRECIPITATION','HEAVY RAIN',test)
#harmonizing denominations
pattern=c('THUNDER.*WIND','^(HURRICANE[ ,/]|TYPHOON)','^TROPICAL STORM[ ,/]','FLASH','BLIZZARD','(WILD|BRUSH).*(FIRE|FIRES)','(COAST|CSTL|BEACH).*FLOOD','^COLD','^STRONG.*WIND','^EXTREME.*(COLD|WINDCHILL|WIND CHILL)|^BLOWING SNOW','DUST (DEVIL|DEVEL)','^FLOOD','^HAIL','DROUGHT|DRY','(FROST|FREEZE|FREEZING)','LAKE.*FLOOD','^TEMPERATURE')
response=c('THUNDERSTORM WIND','HURRICANE','TROPICAL STORM','FLASH FLOOD','BLIZZARD','WILDFIRE','COASTAL FLOOD','COLD/WIND CHILL','HIGH WIND','EXTREME COLD/WIND CHILL','DUST DEVIL','FLOOD','HAIL','DROUGHT','FROST/FREEZE','LAKESHORE FLOOD','HEAT')
for (i in 1:length(pattern)){
test[grep(pattern[i],test)]=response[i]
}
test[grepl('FLOOD',test) & !grepl('FLASH|COASTAL|LAKE',test)]='FLOOD'
test[grepl('SURGE|TIDE',test) & !grepl('ASTRO|BLOW-OUT',test)]='STORM SURGE/TIDE'
#applying standard names
for (event in standard_ev) {
rows=grep(paste("^", toupper(event), sep=""), test)
if (length(rows) > 0) {
test[rows]=event
}
}
#creating a tidy dataset
#join the standard names into a single alternation pattern
cad=paste(standard_ev,collapse='|')
tidy_data=raw_data
tidy_data$EVTYPE=test
tidy_data=tidy_data[grep(cad,test),]
After this processing we retain 99.1361% of the rows.
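To make the clean-up easier to follow, here is a reduced sketch of the same steps on a handful of labels. The vectors `raw_labels` and `standard` are invented for this illustration, and only a few of the substitutions are reproduced.

```r
# Made-up raw labels, including one that is not a weather event
raw_labels <- c("  TSTM WIND", "Thunderstorm Winds", "FLASH FLOODING",
                "wild fires", "APACHE COUNTY")
# A small subset of the standard event names
standard <- c("Flash Flood", "Thunderstorm Wind", "Wildfire")

# Step 1: upper case, strip leading blanks, fix plurals and abbreviations
x <- toupper(raw_labels)
x <- gsub("^\\s+", "", x)
x <- gsub("WINDS", "WIND", x)
x <- gsub("TSTM", "THUNDERSTORM", x)

# Step 2: harmonize denominations with pattern/response pairs
x[grep("FLASH", x)] <- "FLASH FLOOD"
x[grep("(WILD|BRUSH).*(FIRE|FIRES)", x)] <- "WILDFIRE"

# Step 3: replace anything starting with a standard name by that name
for (event in standard) {
  x[grep(paste0("^", toupper(event)), x)] <- event
}

# Step 4: keep only the labels that now contain a standard name
cleaned <- x[grepl(paste(standard, collapse = "|"), x)]
print(cleaned)
```

In this toy case the first four labels collapse to three standard names and APACHE COUNTY is dropped, which mirrors how the small fraction of unmatched rows is discarded from the full dataset.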
We will set up two datasets, one for each study:
health=tidy_data[,c(8,23,24,38)]
health=health[health$FATALITIES>0 | health$INJURIES>0,]
economy=tidy_data[,c(8,25,26,27,28,38)]
The economy dataset requires more processing. First, we only want records that report some amount of damage. Second, each amount is expressed as a number (CROPDMG, PROPDMG) and a power of 10 (CROPDMGEXP, PROPDMGEXP). These powers of 10 are represented by a set of symbols.
economy=economy[economy$CROPDMG>0 | economy$PROPDMG>0,]
#map each symbol to its power of 10; symbols not listed become NA
exp_levels=list('0'=c('','-','?','+','0'),'1'='1','2'=c('2','h','H'),'3'=c('3','K'),'4'='4','5'='5','6'=c('m','M'),'7'='7','8'='8', '9'='B')
economy$PROPDMGEXP=factor(economy$PROPDMGEXP)
levels(economy$PROPDMGEXP)=exp_levels
#as.character first: as.numeric on a factor returns level codes, not labels
economy$PROPVAL=economy$PROPDMG*10^as.numeric(as.character(economy$PROPDMGEXP))
economy$CROPDMGEXP=factor(economy$CROPDMGEXP)
levels(economy$CROPDMGEXP)=exp_levels
economy$CROPVAL=economy$CROPDMG*10^as.numeric(as.character(economy$CROPDMGEXP))
economy[is.na(economy$PROPVAL),7]=0
economy[is.na(economy$CROPVAL),8]=0
economy$TOTAL=economy$PROPVAL+economy$CROPVAL
#remove useless variables
economy=economy[,-(2:5)]
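As a worked example of the number-and-exponent encoding, the sketch below converts three made-up records into dollar values. It uses a named lookup vector (`exp_map`, a name introduced here) rather than the factor relabelling above, and covers only a few of the symbols.

```r
# Hypothetical lookup from exponent symbol to power of 10
exp_map <- c("0" = 0, "H" = 2, "K" = 3, "M" = 6, "B" = 9)

# Made-up damage figures and their exponent symbols
dmg <- c(25, 2.5, 10)
sym <- c("K", "B", "m")

# Upper-case the symbol, look up its exponent, and scale the figure
value <- unname(dmg * 10^exp_map[toupper(sym)])
print(value)
```

Here 25 with K stands for 25,000, 2.5 with B for 2.5 billion, and 10 with m for 10 million. A symbol that is missing from the lookup yields NA, which would then need the same zero-filling as in the code above.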
First we analyse the health data, on an annual basis. The following code shows, for each year, the event type with the most fatalities and the event type with the most injuries, in that order.
health_total=aggregate(cbind(FATALITIES,INJURIES)~EVTYPE+YEAR,data=health,sum)
max_fat=by(health_total$FATALITIES,health_total$YEAR,max)
max_inj=by(health_total$INJURIES,health_total$YEAR,max)
year=levels(as.factor(health_total$YEAR))
for (i in 1:length(year)){
temp=health_total[health_total$YEAR==year[i],]
fat=temp[which(temp$FATALITIES==max_fat[i]),1]
inj=temp[which(temp$INJURIES==max_inj[i]),1]
print(paste(year[i],fat,inj,sep=' '))
}
## [1] "1950 Tornado Tornado"
## [1] "1951 Tornado Tornado"
## [1] "1952 Tornado Tornado"
## [1] "1953 Tornado Tornado"
## [1] "1954 Tornado Tornado"
## [1] "1955 Tornado Tornado"
## [1] "1956 Tornado Tornado"
## [1] "1957 Tornado Tornado"
## [1] "1958 Tornado Tornado"
## [1] "1959 Tornado Tornado"
## [1] "1960 Tornado Tornado"
## [1] "1961 Tornado Tornado"
## [1] "1962 Tornado Tornado"
## [1] "1963 Tornado Tornado"
## [1] "1964 Tornado Tornado"
## [1] "1965 Tornado Tornado"
## [1] "1966 Tornado Tornado"
## [1] "1967 Tornado Tornado"
## [1] "1968 Tornado Tornado"
## [1] "1969 Tornado Tornado"
## [1] "1970 Tornado Tornado"
## [1] "1971 Tornado Tornado"
## [1] "1972 Tornado Tornado"
## [1] "1973 Tornado Tornado"
## [1] "1974 Tornado Tornado"
## [1] "1975 Tornado Tornado"
## [1] "1976 Tornado Tornado"
## [1] "1977 Tornado Tornado"
## [1] "1978 Tornado Tornado"
## [1] "1979 Tornado Tornado"
## [1] "1980 Tornado Tornado"
## [1] "1981 Tornado Tornado"
## [1] "1982 Tornado Tornado"
## [1] "1983 Tornado Tornado"
## [1] "1984 Tornado Tornado"
## [1] "1985 Tornado Tornado"
## [1] "1986 Thunderstorm Wind Tornado"
## [1] "1987 Tornado Tornado"
## [1] "1988 Tornado Tornado"
## [1] "1989 Tornado Tornado"
## [1] "1990 Tornado Tornado"
## [1] "1991 Tornado Tornado"
## [1] "1992 Tornado Tornado"
## [1] "1993 High Wind Tornado" "1993 Thunderstorm Wind Tornado"
## [1] "1994 Lightning Ice Storm"
## [1] "1995 Heat Heat" "1995 Heat Tornado"
## [1] "1996 Flash Flood Tornado"
## [1] "1997 Extreme Heat Tornado"
## [1] "1998 Extreme Heat Flood"
## [1] "1999 Extreme Heat Tornado"
## [1] "2000 Extreme Heat Tornado"
## [1] "2001 Extreme Heat Tornado"
## [1] "2002 Extreme Heat Tornado"
## [1] "2003 Flash Flood Tornado"
## [1] "2004 Flash Flood Hurricane"
## [1] "2005 Extreme Heat Tornado"
## [1] "2006 Extreme Heat Extreme Heat"
## [1] "2007 Tornado Tornado"
## [1] "2008 Tornado Tornado"
## [1] "2009 Rip Current Tornado"
## [1] "2010 Flash Flood Tornado"
## [1] "2011 Tornado Tornado"
From 1950 to 1993 the most harmful event was wind, mostly in the form of tornadoes. This holds for both injuries and fatalities. 1994 is a special year, since most fatalities were caused by lightning and most injuries by ice storms. The 1995-2006 series is especially interesting: most fatalities were caused by heat and flash floods (is global warming playing a role here?), while tornadoes remain the main cause of injuries. In the last years (2007-2011) tornadoes and floods were the main causes of fatalities and injuries. The conclusion would be that tornadoes are the most harmful event, although in recent years heat and floods have grown in importance.
The economic data will also be analysed on an annual basis. The results are presented in this order: year, property cost, crop cost and total cost.
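The per-year selection behind this listing can be sketched on a toy data frame. The data in `toy` are invented, and this variant uses `split` with `which.max` instead of the `by` loop, so it keeps only the first event when two events tie within a year.

```r
# Made-up records: one event type appears twice in 1995
toy <- data.frame(
  EVTYPE = c("Tornado", "Heat", "Tornado", "Tornado", "Heat"),
  YEAR = c(1994, 1994, 1995, 1995, 1995),
  FATALITIES = c(3, 5, 4, 5, 2)
)

# Sum fatalities per event type and year
totals <- aggregate(FATALITIES ~ EVTYPE + YEAR, data = toy, sum)

# For each year, keep the row with the largest total
top <- do.call(rbind, lapply(split(totals, totals$YEAR), function(d) {
  d[which.max(d$FATALITIES), ]
}))
print(top)
```

In this toy case heat leads in 1994 with 5 fatalities and tornado leads in 1995 with 9, after the two 1995 tornado records are summed.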
economy_total=aggregate(cbind(PROPVAL,CROPVAL,TOTAL)~EVTYPE+YEAR,data=economy,sum)
max_prop=by(economy_total$PROPVAL,economy_total$YEAR,max)
max_crop=by(economy_total$CROPVAL,economy_total$YEAR,max)
max_total=by(economy_total$TOTAL,economy_total$YEAR,max)
year=levels(as.factor(economy_total$YEAR))
for (i in 1:length(year)){
temp=economy_total[economy_total$YEAR==year[i],]
prop=temp[which(temp$PROPVAL==max_prop[i]),1]
crop=temp[which(temp$CROPVAL==max_crop[i]),1]
total=temp[which(temp$TOTAL==max_total[i]),1]
print(paste(year[i],prop,crop,total,sep=' '))
}
## [1] "1950 Tornado Tornado Tornado"
## [1] "1951 Tornado Tornado Tornado"
## [1] "1952 Tornado Tornado Tornado"
## [1] "1953 Tornado Tornado Tornado"
## [1] "1954 Tornado Tornado Tornado"
## [1] "1955 Tornado Tornado Tornado"
## [1] "1956 Tornado Tornado Tornado"
## [1] "1957 Tornado Tornado Tornado"
## [1] "1958 Tornado Tornado Tornado"
## [1] "1959 Tornado Tornado Tornado"
## [1] "1960 Tornado Tornado Tornado"
## [1] "1961 Tornado Tornado Tornado"
## [1] "1962 Tornado Tornado Tornado"
## [1] "1963 Tornado Tornado Tornado"
## [1] "1964 Tornado Tornado Tornado"
## [1] "1965 Tornado Tornado Tornado"
## [1] "1966 Tornado Tornado Tornado"
## [1] "1967 Tornado Tornado Tornado"
## [1] "1968 Tornado Tornado Tornado"
## [1] "1969 Tornado Tornado Tornado"
## [1] "1970 Tornado Tornado Tornado"
## [1] "1971 Tornado Tornado Tornado"
## [1] "1972 Tornado Tornado Tornado"
## [1] "1973 Tornado Tornado Tornado"
## [1] "1974 Tornado Tornado Tornado"
## [1] "1975 Tornado Tornado Tornado"
## [1] "1976 Tornado Tornado Tornado"
## [1] "1977 Tornado Tornado Tornado"
## [1] "1978 Tornado Tornado Tornado"
## [1] "1979 Tornado Tornado Tornado"
## [1] "1980 Tornado Tornado Tornado"
## [1] "1981 Tornado Tornado Tornado"
## [1] "1982 Tornado Tornado Tornado"
## [1] "1983 Tornado Tornado Tornado"
## [1] "1984 Tornado Tornado Tornado"
## [1] "1985 Tornado Tornado Tornado"
## [1] "1986 Tornado Tornado Tornado"
## [1] "1987 Tornado Tornado Tornado"
## [1] "1988 Tornado Tornado Tornado"
## [1] "1989 Tornado Tornado Tornado"
## [1] "1990 Tornado Tornado Tornado"
## [1] "1991 Tornado Tornado Tornado"
## [1] "1992 Tornado Tornado Tornado"
## [1] "1993 Flood Flood Flood"
## [1] "1994 Flash Flood Ice Storm Ice Storm"
## [1] "1995 Hurricane Flood Hurricane"
## [1] "1996 Hurricane Drought Hurricane"
## [1] "1997 Flood Drought Flood"
## [1] "1998 Hurricane Drought Hurricane"
## [1] "1999 Hurricane Hurricane Hurricane"
## [1] "2000 Wildfire Drought Drought"
## [1] "2001 Tropical Storm Drought Tropical Storm"
## [1] "2002 Tornado Drought Hurricane"
## [1] "2003 Wildfire Drought Wildfire"
## [1] "2004 Hurricane Hurricane Hurricane"
## [1] "2005 Hurricane Hurricane Hurricane"
## [1] "2006 Flood Drought Flood"
## [1] "2007 Tornado Flood Tornado"
## [1] "2008 Storm Surge/Tide Flood Storm Surge/Tide"
## [1] "2009 Hail Hail Hail"
## [1] "2010 Hail Flood Flood"
## [1] "2011 Tornado Flood Tornado"
From 1950 to 1992 the most costly event was the tornado. From 1994 to 2011 the most harmful events for property were winds (tornadoes and hurricanes), floods, hail and wildfire; for crops they were drought, hurricanes, floods and hail/ice storms.