This document analyses storms and other severe weather events, which cause both public health and economic problems for communities and municipalities. The project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The objective of the project is to explore storm and weather-related events with respect to their impact on population health and the economy. To identify the impact of weather events on population health, a composite index is introduced: a weighted sum of the fatalities and injuries attributed to each weather event type.
The impact on the economy is assessed using an index that sums the recorded property damage and crop damage. The analysis finds that the event type “Tornado” has the biggest impact on both population health and economic loss.
The first step in data processing is to set the working directory for R. This is done as follows:
setwd("D:/R/Reproducible research")
The data set for the analysis is available at the following URL: “https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2”
The data is downloaded and read using the following commands:
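## setInternet2(TRUE) is Windows-only and was needed only on older versions of R; it is left out here
fileurl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
## mode = "wb" keeps the compressed file intact when downloading on Windows
download.file(fileurl, destfile = "StormData.csv.bz2", mode = "wb")
## read.csv() reads the bz2-compressed file directly
stormdata <- read.csv("StormData.csv.bz2")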
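Downloading the archive on every run is slow, so it can help to fetch it only when it is missing. The following is a minimal sketch, not part of the original analysis; the helper name `fetch_if_missing` is hypothetical.

```r
# Hypothetical helper (not part of the original analysis): download a file
# only when it is not already present on disk.
fetch_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" keeps the compressed archive intact on Windows
    download.file(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

fileurl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

# Usage (commented out so the sketch runs without network access):
# fetch_if_missing(fileurl, "StormData.csv.bz2")
# stormdata <- read.csv("StormData.csv.bz2")
```

The helper returns the destination path invisibly, so it can be passed straight to `read.csv()` if preferred.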
The data set has the following variables
names(stormdata)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
Inspecting the variables shows that “FATALITIES” and “INJURIES” are the ones relevant for assessing the impact on population health. Similarly, “PROPDMG” and “CROPDMG”, which stand for property damage and crop damage, are the pertinent variables for assessing economic loss.
Let us first assess the impact of weather types on population health.
As a first step, the “stormdata” data set is subset to keep only the columns relevant to the population health analysis.
pophealth <- stormdata[,c("COUNTY","COUNTYNAME","STATE","EVTYPE","BGN_DATE","FATALITIES","INJURIES")]
head(pophealth)
## COUNTY COUNTYNAME STATE EVTYPE BGN_DATE FATALITIES INJURIES
## 1 97 MOBILE AL TORNADO 4/18/1950 0:00:00 0 15
## 2 3 BALDWIN AL TORNADO 4/18/1950 0:00:00 0 0
## 3 57 FAYETTE AL TORNADO 2/20/1951 0:00:00 0 2
## 4 89 MADISON AL TORNADO 6/8/1951 0:00:00 0 2
## 5 43 CULLMAN AL TORNADO 11/15/1951 0:00:00 0 2
## 6 77 LAUDERDALE AL TORNADO 11/15/1951 0:00:00 0 6
As seen from the data set, “FATALITIES” and “INJURIES” are the two variables that indicate the impact of a weather type on population health. In order to capture the total impact of these two indicators, we will introduce a new variable called “healthindex”. This variable is a weighted composite score derived from the two variables: a higher weight of 7 is given to “FATALITIES” and a lower weight of 3 is assigned to “INJURIES”. The composite variable is derived as follows.
pophealth$healthindex <- pophealth$FATALITIES*7 + pophealth$INJURIES*3
head(pophealth)
## COUNTY COUNTYNAME STATE EVTYPE BGN_DATE FATALITIES INJURIES
## 1 97 MOBILE AL TORNADO 4/18/1950 0:00:00 0 15
## 2 3 BALDWIN AL TORNADO 4/18/1950 0:00:00 0 0
## 3 57 FAYETTE AL TORNADO 2/20/1951 0:00:00 0 2
## 4 89 MADISON AL TORNADO 6/8/1951 0:00:00 0 2
## 5 43 CULLMAN AL TORNADO 11/15/1951 0:00:00 0 2
## 6 77 LAUDERDALE AL TORNADO 11/15/1951 0:00:00 0 6
## healthindex
## 1 45
## 2 0
## 3 6
## 4 6
## 5 6
## 6 18
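As a small worked check of the weighting, the sketch below recomputes the index on a toy data frame. The first three rows mirror the fatality and injury counts printed above; the fourth row is a hypothetical record added to show the effect of the fatality weight. The names `w_fatal` and `w_injury` are introduced here for illustration only.

```r
# Toy rows: the first three mirror the records shown above,
# the fourth is hypothetical (2 fatalities, 1 injury)
toy <- data.frame(FATALITIES = c(0, 0, 0, 2), INJURIES = c(15, 0, 2, 1))

# Weights used in this analysis: 7 per fatality, 3 per injury
w_fatal <- 7
w_injury <- 3
toy$healthindex <- toy$FATALITIES * w_fatal + toy$INJURIES * w_injury

toy$healthindex
# 45 0 6 17
```

With these weights one fatality counts slightly more than two injuries, so the ranking of event types depends on this choice.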
The next step is to rank the event types by their impact on population health. This is done by summing the composite score for each value of the “EVTYPE” variable. The sum gives the composite index of an event type across all states and all time periods. The aggregation is done as follows, and the first and last rows of the result are listed.
aggpophealth <- aggregate(pophealth$healthindex,by=list(pophealth$EVTYPE),FUN=sum,na.rm=T)
names(aggpophealth) <- c("event","index")
head(aggpophealth)
## event index
## 1 HIGH SURF ADVISORY 0
## 2 COASTAL FLOOD 0
## 3 FLASH FLOOD 0
## 4 LIGHTNING 0
## 5 TSTM WIND 0
## 6 TSTM WIND (G45) 0
tail(aggpophealth)
## event index
## 980 WINTER WEATHER/MIX 412
## 981 WINTERY MIX 0
## 982 Wintry mix 0
## 983 Wintry Mix 0
## 984 WINTRY MIX 238
## 985 WND 0
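The output above shows the same event recorded under several spellings (“Wintry mix”, “Wintry Mix”, “WINTRY MIX”), and the first rows appear to carry leading whitespace, since they sort ahead of the alphabetical entries. Each variant is aggregated as a separate event type. The sketch below is not part of the original analysis; it shows one possible cleaning rule (trim and upper-case) on toy data, where the only figure taken from the output above is the 238 for “WINTRY MIX” and the rest are made up.

```r
# Toy data with the kind of label variants visible in the output above;
# the index values other than 238 are made up for illustration
toy <- data.frame(
  EVTYPE = c("Wintry mix", "Wintry Mix", "WINTRY MIX", "  TSTM WIND", "TSTM WIND"),
  healthindex = c(0, 0, 238, 0, 10),
  stringsAsFactors = FALSE
)

# Assumed cleaning rule: trim surrounding whitespace and upper-case the label
toy$event <- toupper(trimws(toy$EVTYPE))

# Aggregate on the cleaned label instead of the raw EVTYPE
agg <- aggregate(healthindex ~ event, data = toy, FUN = sum)
agg
```

This rule merges only case and whitespace variants. Synonyms such as “TSTM WIND” and “THUNDERSTORM WIND” would still need an explicit mapping.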
Once we have summed up the composite score, the next step is to sort the data to identify the event types that have the highest impact on population health. After sorting, the top 10 event types are listed. The results are detailed below.
library(data.table)
sorted <- aggpophealth[order(aggpophealth[,"index"],decreasing=T),]
select <- sorted[1:10,]
data.table(select)
## event index
## 1: TORNADO 313469
## 2: EXCESSIVE HEAT 32896
## 3: TSTM WIND 24399
## 4: FLOOD 23657
## 5: LIGHTNING 21402
## 6: HEAT 12859
## 7: FLASH FLOOD 12177
## 8: ICE STORM 6548
## 9: WINTER STORM 5405
## 10: THUNDERSTORM WIND 5395
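The sort-then-slice step can be wrapped in a small function. This is a sketch under stated assumptions: the helper name `top_by` is hypothetical, and it uses `head()` rather than `[1:10, ]` because indexing past the last row of a data frame pads the result with NA rows.

```r
# Hypothetical helper: return the n rows with the largest values in `col`
top_by <- function(df, col, n = 10) {
  ranked <- df[order(df[[col]], decreasing = TRUE), ]
  head(ranked, n)  # head() does not pad with NA rows when nrow(df) < n
}

# Toy data with made-up values
toy <- data.frame(event = c("A", "B", "C"), index = c(5, 20, 1))

top_by(toy, "index", n = 2)   # rows B and A, in that order
top_by(toy, "index", n = 10)  # all three rows, no NA padding
```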
A bar plot of the top 10 event types is produced below. The x axis shows the event type and the y axis shows the severity index.
library(ggplot2)
popplot <- ggplot(select,aes(event,index,fill=event))
popplot + geom_bar(stat="identity")+labs(title = "Event type v/s Severity to Population Health ")+labs(x="Event type",y="Severity Index")
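By default ggplot2 orders a discrete axis alphabetically, so the bars do not appear in rank order. One possible variation, sketched below on three rows copied from the table above, ranks the bars with `reorder()` and flips the axes so that long event names stay readable. It is an illustration only and is not the plot used in this analysis.

```r
# Three rows copied from the top-10 table above
toy <- data.frame(
  event = c("EXCESSIVE HEAT", "TORNADO", "TSTM WIND"),
  index = c(32896, 313469, 24399)
)

# reorder() sets the factor levels by ascending index, so bars are ranked
toy$event <- reorder(toy$event, toy$index)
levels(toy$event)

# Draw the chart only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(event, index)) +
    geom_bar(stat = "identity") +
    coord_flip() +  # horizontal bars keep long labels readable
    labs(x = "Event type", y = "Severity Index")
  print(p)
}
```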
From the analysis it is evident that the weather event “Tornado” has the biggest impact on population health. Let us go a step further and identify the states and counties across the country that are most impacted by this event type. To achieve this, let us take only the data relevant to “Tornado” from the data set.
tordata <- subset(pophealth,pophealth$EVTYPE=="TORNADO")
head(tordata)
## COUNTY COUNTYNAME STATE EVTYPE BGN_DATE FATALITIES INJURIES
## 1 97 MOBILE AL TORNADO 4/18/1950 0:00:00 0 15
## 2 3 BALDWIN AL TORNADO 4/18/1950 0:00:00 0 0
## 3 57 FAYETTE AL TORNADO 2/20/1951 0:00:00 0 2
## 4 89 MADISON AL TORNADO 6/8/1951 0:00:00 0 2
## 5 43 CULLMAN AL TORNADO 11/15/1951 0:00:00 0 2
## 6 77 LAUDERDALE AL TORNADO 11/15/1951 0:00:00 0 6
## healthindex
## 1 45
## 2 0
## 3 6
## 4 6
## 5 6
## 6 18
tail(tordata)
## COUNTY COUNTYNAME STATE EVTYPE BGN_DATE FATALITIES
## 901814 61 JASPER MS TORNADO 11/16/2011 0:00:00 0
## 901815 143 PITTSYLVANIA VA TORNADO 11/16/2011 0:00:00 0
## 901821 117 ORANGE IN TORNADO 11/14/2011 0:00:00 0
## 901826 487 WILBARGER TX TORNADO 11/7/2011 0:00:00 0
## 901827 65 JACKSON OK TORNADO 11/7/2011 0:00:00 0
## 901829 81 LEE AL TORNADO 11/16/2011 0:00:00 0
## INJURIES healthindex
## 901814 0 0
## 901815 0 0
## 901821 0 0
## 901826 0 0
## 901827 0 0
## 901829 2 6
Once the “Tornado” data has been subset, the next step is to summarise the composite index by state and by county. The summarised data is then sorted, and the top ten states and counties across the country most affected by “Tornado” are listed. The results are shown below.
library(plyr)
## summarising based on county name
aggtorcon <- ddply(tordata,c("STATE","COUNTYNAME"),summarise,index=sum(healthindex))
## sorting the data according to the severity index
consort <- aggtorcon[order(aggtorcon[,"index"],decreasing=T),]
Top 10 counties across the country with the biggest impact of Tornado
data.table(consort[1:10,])
## STATE COUNTYNAME index
## 1: TX WICHITA 5892
## 2: AL JEFFERSON 5336
## 3: MO JASPER 4915
## 4: MA WORCESTER 4403
## 5: OH GREENE 4084
## 6: AL TUSCALOOSA 3584
## 7: MI GENESEE 3573
## 8: AL MADISON 2702
## 9: TX MCLENNAN 2628
## 10: IL COOK 2583
Identifying the states across the country with the biggest impact of Tornado
aggtorsta <- ddply(tordata,c("STATE"),summarise,index=sum(healthindex))
stasort <- aggtorsta[order(aggtorsta[,"index"],decreasing=T),]
Top 10 states across the country with the biggest impact of Tornado
data.table(stasort[1:10,])
## STATE index
## 1: TX 28387
## 2: AL 28106
## 3: MS 21882
## 4: AR 18001
## 5: TN 16820
## 6: OK 16559
## 7: MO 15706
## 8: OH 14651
## 9: IN 14436
## 10: IL 13856
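The county summary groups by both “STATE” and “COUNTYNAME” because county names repeat across states; JASPER, for example, appears above in both MS and MO. The same grouping can be sketched in base R with `aggregate()`, shown here on toy data with made-up index values. This is a possible alternative for illustration, not the code used to produce the tables above.

```r
# Toy tornado records; the index values are made up for illustration
toy <- data.frame(
  STATE = c("AL", "AL", "AL", "TX"),
  COUNTYNAME = c("MADISON", "MADISON", "MOBILE", "WICHITA"),
  healthindex = c(6, 10, 45, 20),
  stringsAsFactors = FALSE
)

# Sum the index within each state/county pair, then sort in decreasing order
by_county <- aggregate(healthindex ~ STATE + COUNTYNAME, data = toy, FUN = sum)
by_county <- by_county[order(by_county$healthindex, decreasing = TRUE), ]

# Sum the index within each state, then sort in decreasing order
by_state <- aggregate(healthindex ~ STATE, data = toy, FUN = sum)
by_state <- by_state[order(by_state$healthindex, decreasing = TRUE), ]

by_county
by_state
```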
Let us now assess the impact of weather types on the economy. As a first step, the “stormdata” data set is again subset, this time to keep only the columns relevant to the economic analysis.
ecovalue <- stormdata[,c("COUNTY","COUNTYNAME","STATE","EVTYPE","BGN_DATE","PROPDMG","CROPDMG")]
As seen from the data set, “PROPDMG” and “CROPDMG” are the two variables that indicate the impact of a weather type on the economy. In order to capture the total impact of these two indicators, we will introduce a new variable called “ecoindex”, obtained by summing the property damage and crop damage variables. The sum uses the recorded magnitudes as they stand; the unit columns “PROPDMGEXP” and “CROPDMGEXP” are not applied, so the index is a relative score rather than a dollar amount. The ecoindex variable is derived as follows.
ecovalue$ecoindex <- ecovalue$PROPDMG + ecovalue$CROPDMG
head(ecovalue)
## COUNTY COUNTYNAME STATE EVTYPE BGN_DATE PROPDMG CROPDMG
## 1 97 MOBILE AL TORNADO 4/18/1950 0:00:00 25.0 0
## 2 3 BALDWIN AL TORNADO 4/18/1950 0:00:00 2.5 0
## 3 57 FAYETTE AL TORNADO 2/20/1951 0:00:00 25.0 0
## 4 89 MADISON AL TORNADO 6/8/1951 0:00:00 2.5 0
## 5 43 CULLMAN AL TORNADO 11/15/1951 0:00:00 2.5 0
## 6 77 LAUDERDALE AL TORNADO 11/15/1951 0:00:00 2.5 0
## ecoindex
## 1 25.0
## 2 2.5
## 3 25.0
## 4 2.5
## 5 2.5
## 6 2.5
tail(ecovalue)
## COUNTY COUNTYNAME STATE EVTYPE
## 902292 21 TNZ001>004 - 019>021 - 048>055 - 088 TN WINTER WEATHER
## 902293 7 WYZ007 - 017 WY HIGH WIND
## 902294 9 MTZ009 - 010 MT HIGH WIND
## 902295 213 AKZ213 AK HIGH WIND
## 902296 202 AKZ202 AK BLIZZARD
## 902297 6 ALZ006 AL HEAVY SNOW
## BGN_DATE PROPDMG CROPDMG ecoindex
## 902292 11/28/2011 0:00:00 0 0 0
## 902293 11/30/2011 0:00:00 0 0 0
## 902294 11/10/2011 0:00:00 0 0 0
## 902295 11/8/2011 0:00:00 0 0 0
## 902296 11/9/2011 0:00:00 0 0 0
## 902297 11/28/2011 0:00:00 0 0 0
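The variable list earlier in this document also includes “PROPDMGEXP” and “CROPDMGEXP”, which in the NOAA documentation hold unit codes such as K, M and B for the damage figures. The sketch below shows one way those codes could be turned into multipliers. It is not part of the original analysis: the helper name `exp_multiplier` and the column `ecodollars` are hypothetical, the toy figures are made up, and treating every code other than K, M and B as a multiplier of 1 is a simplifying assumption.

```r
# Hypothetical helper: turn a damage unit code into a numeric multiplier.
# Assumption: K = thousand, M = million, B = billion; any other code
# (blank, digits, symbols) is treated as 1, which is a simplification.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(code))]
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Toy rows; the figures are made up for illustration
toy <- data.frame(
  PROPDMG = c(25, 2.5, 1),
  PROPDMGEXP = c("K", "M", ""),
  CROPDMG = c(0, 1, 0),
  CROPDMGEXP = c("", "K", ""),
  stringsAsFactors = FALSE
)

# Scale each magnitude by its unit before adding property and crop damage
toy$ecodollars <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP) +
  toy$CROPDMG * exp_multiplier(toy$CROPDMGEXP)
toy$ecodollars
# 25000 2501000 1
```

In the toy data a raw sum would rank the first row (25) above the second (3.5), while the scaled figures reverse that order.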
The next step is to rank the event types by the loss they impose on the economy. This is done by summing the ecoindex score for each value of the “EVTYPE” variable. The sum gives the composite index of an event type across all states and all time periods. The aggregation is done as follows, and the first and last rows of the result are listed.
aggecoind <- aggregate(ecovalue$ecoindex,by=list(ecovalue$EVTYPE),FUN=sum,na.rm=T)
names(aggecoind) <- c("event","index")
head(aggecoind)
## event index
## 1 HIGH SURF ADVISORY 200
## 2 COASTAL FLOOD 0
## 3 FLASH FLOOD 50
## 4 LIGHTNING 0
## 5 TSTM WIND 108
## 6 TSTM WIND (G45) 8
tail(aggecoind)
## event index
## 980 WINTER WEATHER/MIX 4873.5
## 981 WINTERY MIX 0.0
## 982 Wintry mix 0.0
## 983 Wintry Mix 2.5
## 984 WINTRY MIX 10.0
## 985 WND 0.0
Once we have summed up the ecoindex score, the next step is to sort the data to identify the event types that have the biggest impact on the economy. After sorting, the top 10 event types are listed. The results are detailed below.
library(data.table)
ecosorted <- aggecoind[order(aggecoind[,"index"],decreasing=T),]
ecosel <- ecosorted[1:10,]
data.table(ecosel)
## event index
## 1: TORNADO 3312277
## 2: FLASH FLOOD 1599325
## 3: TSTM WIND 1445168
## 4: HAIL 1268290
## 5: FLOOD 1067976
## 6: THUNDERSTORM WIND 943636
## 7: LIGHTNING 606932
## 8: THUNDERSTORM WINDS 464978
## 9: HIGH WIND 342015
## 10: WINTER STORM 134700
A bar plot of the top 10 event types is produced below. The x axis shows the event type and the y axis shows the economic index.
library(ggplot2)
ecoplot <- ggplot(ecosel,aes(event,index,fill=event))
ecoplot + geom_bar(stat="identity")+labs(title = "Event type v/s Economic Consequence Index ")+labs(x="Event type",y="Consequence Index")
As with population health, it is evident that the weather event “Tornado” is the biggest contributor to economic loss. Let us go a step further and identify the states and counties across the country that are most impacted by this event type. To achieve this, let us take only the data relevant to “Tornado” from the data set.
ecotordata <- subset(ecovalue,ecovalue$EVTYPE=="TORNADO")
Once the “Tornado” data has been subset, the next step is to summarise the economic index by state and by county. The summarised data is then sorted, and the top ten states and counties across the country most affected by “Tornado” are listed. The results are shown below.
library(plyr)
library(data.table)
## summarising based on state and county name
ecotorcon <- ddply(ecotordata,c("STATE","COUNTYNAME"),summarise,index=sum(ecoindex))
ecotorsort <- ecotorcon[order(ecotorcon[,"index"],decreasing=T),]
## summarising based on state name
ecotorsta <- ddply(ecotordata,c("STATE"),summarise,index=sum(ecoindex))
ecostasort <- ecotorsta[order(ecotorsta[,"index"],decreasing=T),]
Top 10 counties across the country with the biggest economic loss due to Tornado
data.table(ecotorsort[1:10,])
## STATE COUNTYNAME index
## 1: TX HARRIS 14137
## 2: AL JEFFERSON 8570
## 3: FL BREVARD 8559
## 4: FL HILLSBOROUGH 8516
## 5: NE THAYER 8215
## 6: FL PINELLAS 7634
## 7: MS SMITH 6922
## 8: FL POLK 6555
## 9: TN RUTHERFORD 6380
## 10: KS MITCHELL 6371
Top 10 states across the country with the biggest economic loss due to Tornado
data.table(ecostasort[1:10,])
## STATE index
## 1: TX 287963
## 2: MS 212805
## 3: AL 169469
## 4: OK 165774
## 5: FL 159901
## 6: IA 156895
## 7: GA 155142
## 8: KS 148692
## 9: MO 134446
## 10: LA 134320
The results from the analysis can be summarised as follows
1. Top 10 event types with the biggest impact on population health
data.table(select)
## event index
## 1: TORNADO 313469
## 2: EXCESSIVE HEAT 32896
## 3: TSTM WIND 24399
## 4: FLOOD 23657
## 5: LIGHTNING 21402
## 6: HEAT 12859
## 7: FLASH FLOOD 12177
## 8: ICE STORM 6548
## 9: WINTER STORM 5405
## 10: THUNDERSTORM WIND 5395
2. Top 10 states across the country with the biggest impact of Tornado on population health
data.table(stasort[1:10,])
## STATE index
## 1: TX 28387
## 2: AL 28106
## 3: MS 21882
## 4: AR 18001
## 5: TN 16820
## 6: OK 16559
## 7: MO 15706
## 8: OH 14651
## 9: IN 14436
## 10: IL 13856
3. Top 10 counties across the country with the biggest impact of Tornado on population health
data.table(consort[1:10,])
## STATE COUNTYNAME index
## 1: TX WICHITA 5892
## 2: AL JEFFERSON 5336
## 3: MO JASPER 4915
## 4: MA WORCESTER 4403
## 5: OH GREENE 4084
## 6: AL TUSCALOOSA 3584
## 7: MI GENESEE 3573
## 8: AL MADISON 2702
## 9: TX MCLENNAN 2628
## 10: IL COOK 2583
4. Top 10 event types with the largest loss to the economy
data.table(ecosel)
## event index
## 1: TORNADO 3312277
## 2: FLASH FLOOD 1599325
## 3: TSTM WIND 1445168
## 4: HAIL 1268290
## 5: FLOOD 1067976
## 6: THUNDERSTORM WIND 943636
## 7: LIGHTNING 606932
## 8: THUNDERSTORM WINDS 464978
## 9: HIGH WIND 342015
## 10: WINTER STORM 134700
5. Top 10 states across the country with the largest economic loss due to Tornado
data.table(ecostasort[1:10,])
## STATE index
## 1: TX 287963
## 2: MS 212805
## 3: AL 169469
## 4: OK 165774
## 5: FL 159901
## 6: IA 156895
## 7: GA 155142
## 8: KS 148692
## 9: MO 134446
## 10: LA 134320
6. Top 10 counties across the country with the largest economic loss due to Tornado
data.table(ecotorsort[1:10,])
## STATE COUNTYNAME index
## 1: TX HARRIS 14137
## 2: AL JEFFERSON 8570
## 3: FL BREVARD 8559
## 4: FL HILLSBOROUGH 8516
## 5: NE THAYER 8215
## 6: FL PINELLAS 7634
## 7: MS SMITH 6922
## 8: FL POLK 6555
## 9: TN RUTHERFORD 6380
## 10: KS MITCHELL 6371