This assignment consists of analysing the data contained in the file repdata_data_StormData.csv.bz2.
This file contains the storm database of the U.S. National Oceanic and Atmospheric Administration (NOAA), which covers events from 1950 through November 2011.
The goal of this assignment is to answer questions about storm events, such as:
- which types of events have the greatest economic consequences?
- which types of events are most harmful with respect to population health?
The dataset is downloaded if it is not already present in the working directory and is then read into the R environment. read.csv decompresses the .bz2 file as it reads, so no separate unzip step is needed.
if (!file.exists("./repdata_data_StormData.csv.bz2")) {
  ## mode = "wb" keeps the compressed file intact on Windows
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                "./repdata_data_StormData.csv.bz2", mode = "wb")
}
## Read data as data frame and normalise event types to upper case
storm <- read.csv("./repdata_data_StormData.csv.bz2", header = TRUE)
storm$EVTYPE <- toupper(storm$EVTYPE)
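read.csv can read a bzip2-compressed file directly, which is why the code above needs no explicit decompression step. The following is a minimal sketch of that behaviour on a temporary toy file; the file and its two rows are made up for illustration and are not part of the storm data.

```r
# Write a tiny CSV through a bzip2 connection (toy data, not the NOAA file)
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, "w")
write.csv(data.frame(EVTYPE = c("Tornado", "hail"),
                     FATALITIES = c(1, 0)),
          con, row.names = FALSE)
close(con)

# read.csv detects the compression from the file itself
toy <- read.csv(tmp, header = TRUE)

# Same normalisation step as in the report
toy$EVTYPE <- toupper(toy$EVTYPE)
print(toy)

unlink(tmp)
```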
We retrieve the dimensions of the data set.
dim(storm)
## [1] 902297 37
Let’s look at the contents of the file:
head(storm)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
Next, we merge event types that are variant names for the same kind of event:
storm$EVTYPE[storm$EVTYPE == "HURRICANE/TYPHOON"] <- "HURRICANE-TYPHOON"
storm$EVTYPE[storm$EVTYPE == "HURRICANE"] <- "HURRICANE-TYPHOON"
storm$EVTYPE[storm$EVTYPE == "RIVER FLOOD"] <- "FLOOD"
storm$EVTYPE[storm$EVTYPE == "THUNDERSTORM WINDS"] <- "THUNDERSTORM WIND"
storm$EVTYPE[storm$EVTYPE == "TSTM WIND"] <- "THUNDERSTORM WIND"
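The same renaming can be expressed with a lookup table, which may be easier to extend if more variant names need merging. This is only a sketch on a toy vector; the names evtype, canonical and hit are illustrative and do not appear in the report.

```r
# Toy event-type vector standing in for storm$EVTYPE (already upper case)
evtype <- c("TSTM WIND", "HURRICANE", "RIVER FLOOD", "TORNADO",
            "THUNDERSTORM WINDS", "HURRICANE/TYPHOON")

# Lookup table: names are the variant labels, values the merged label
canonical <- c("HURRICANE/TYPHOON"  = "HURRICANE-TYPHOON",
               "HURRICANE"          = "HURRICANE-TYPHOON",
               "RIVER FLOOD"        = "FLOOD",
               "THUNDERSTORM WINDS" = "THUNDERSTORM WIND",
               "TSTM WIND"          = "THUNDERSTORM WIND")

# Replace only the entries that have a merged label; leave the rest as-is
hit <- evtype %in% names(canonical)
evtype[hit] <- unname(canonical[evtype[hit]])

print(evtype)
```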
The data contain four types of damage: 1) FATALITIES (fatalities), 2) INJURIES (injuries), 3) PROPDMG (property damage) and 4) CROPDMG (crop damage).
We now aggregate the fatalities by event type and rank the event types in decreasing order.
fatalities <- aggregate(FATALITIES ~ EVTYPE, storm, sum)
fatalities <- fatalities[fatalities$FATALITIES > 0, ]
fatalities_desc <- fatalities[order(fatalities$FATALITIES, decreasing = TRUE), ]
head(fatalities_desc)
## EVTYPE FATALITIES
## 755 TORNADO 5633
## 116 EXCESSIVE HEAT 1903
## 138 FLASH FLOOD 978
## 243 HEAT 937
## 417 LIGHTNING 816
## 683 THUNDERSTORM WIND 701
As the output above shows, tornadoes and excessive heat are the two event types that have caused the most fatalities since 1950.
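To make the aggregate, filter and rank steps concrete, here is a small sketch on made-up data (the toy data frame is not taken from the storm file). It also shows why the printed row names, such as 755 and 116 above, are not consecutive: they are the row positions in the aggregated table, which aggregate returns sorted by event type, and they are kept when the rows are reordered.

```r
# Made-up records: one row per event
toy <- data.frame(EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "HAIL", "HEAT"),
                  FATALITIES = c(3, 1, 2, 0, 1))

# Sum fatalities per event type (result is sorted by EVTYPE)
totals <- aggregate(FATALITIES ~ EVTYPE, toy, sum)

# Drop event types with no fatalities, then rank in decreasing order
totals <- totals[totals$FATALITIES > 0, ]
ranked <- totals[order(totals$FATALITIES, decreasing = TRUE), ]

print(ranked)
```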
Next, we summarise the injury data in the same way.
injuries <- aggregate(INJURIES ~ EVTYPE, storm, sum)
injuries <- injuries[injuries$INJURIES > 0, ]
injuries_desc <- injuries[order(injuries$INJURIES, decreasing = TRUE), ]
head(injuries_desc)
## EVTYPE INJURIES
## 755 TORNADO 91346
## 683 THUNDERSTORM WIND 9353
## 154 FLOOD 6791
## 116 EXCESSIVE HEAT 6525
## 417 LIGHTNING 5230
## 243 HEAT 2100
Tornadoes again rank first, but the second place differs: thunderstorm wind causes the second-most injuries, whereas excessive heat, which is second for fatalities, ranks fourth here. We now draw two bar plots, one for fatalities and one for injuries, each showing the top ten event types.
par(mfrow = c(1, 1))
barplot(fatalities_desc[1:10, 2], col = rainbow(10), legend.text = fatalities_desc[1:10, 1], ylab = "Fatality", main = "Top 10 events that caused most Fatality")
par(mfrow = c(1, 1))
barplot(injuries_desc[1:10, 2], col = rainbow(10), legend.text = injuries_desc[1:10, 1], ylab = "Injured people", main = "Top 10 events that caused most Injuries")
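The calls above use par(mfrow = c(1, 1)), so each bar plot is drawn as its own figure. If a single two-panel figure is preferred, a minimal sketch could look like the following. It uses only the top three values printed earlier, labels the bars on the axis instead of using a legend, and draws to a null PDF device so that it runs without a display; the margin and text-size settings are assumptions chosen for illustration.

```r
# Top three totals copied from the output printed earlier in the report
fatal <- c("TORNADO" = 5633, "EXCESSIVE HEAT" = 1903, "FLASH FLOOD" = 978)
injur <- c("TORNADO" = 91346, "THUNDERSTORM WIND" = 9353, "FLOOD" = 6791)

pdf(NULL)  # null device: nothing is written to disk

# One row, two columns; a wide bottom margin leaves room for rotated labels
old_par <- par(mfrow = c(1, 2), mar = c(9, 4, 3, 1))

mid_f <- barplot(fatal, las = 2, cex.names = 0.7,
                 ylab = "Fatalities", main = "Fatalities")
mid_i <- barplot(injur, las = 2, cex.names = 0.7,
                 ylab = "Injured people", main = "Injuries")

par(old_par)  # restore the previous layout settings
dev.off()
```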
We can also find which event types appear in the top ten for both fatalities and injuries.
intersect(fatalities_desc[1:10, 1], injuries_desc[1:10, 1])
## [1] "TORNADO" "EXCESSIVE HEAT" "FLASH FLOOD"
## [4] "HEAT" "LIGHTNING" "THUNDERSTORM WIND"
## [7] "FLOOD"
Before we can rank events by property and crop damage, we must clean the exponent columns PROPDMGEXP and CROPDMGEXP. They contain unknown symbols such as “?”, blanks and single digits, and the letters appear in both upper and lower case: “K” stands for thousands, “M” for millions and “B” for billions, and the code below also treats “H” as hundreds.
We therefore convert these codes into numeric multipliers, treating every unrecognised code, including a blank, as 0.
storm$PROPDMGEXP<-as.character(storm$PROPDMGEXP)
storm$CROPDMGEXP<-as.character(storm$CROPDMGEXP)
## Find column numbers for property and crop damage units
propcol<-grep("PROPDMGEXP",colnames(storm))
cropcol<-grep("CROPDMGEXP",colnames(storm))
## Convert property and crop damage units for calculations later
storm[storm$PROPDMGEXP=="",propcol]<-"0"
storm[storm$PROPDMGEXP=="-",propcol]<-"0"
storm[storm$PROPDMGEXP=="?",propcol]<-"0"
storm[storm$PROPDMGEXP=="+",propcol]<-"0"
storm[storm$PROPDMGEXP=="1",propcol]<-"0"
storm[storm$PROPDMGEXP=="2",propcol]<-"0"
storm[storm$PROPDMGEXP=="3",propcol]<-"0"
storm[storm$PROPDMGEXP=="4",propcol]<-"0"
storm[storm$PROPDMGEXP=="5",propcol]<-"0"
storm[storm$PROPDMGEXP=="6",propcol]<-"0"
storm[storm$PROPDMGEXP=="7",propcol]<-"0"
storm[storm$PROPDMGEXP=="8",propcol]<-"0"
storm[storm$PROPDMGEXP=="h",propcol]<-"100"
storm[storm$PROPDMGEXP=="H",propcol]<-"100"
storm[storm$PROPDMGEXP=="K",propcol]<-"1000"
storm[storm$PROPDMGEXP=="m",propcol]<-"1000000"
storm[storm$PROPDMGEXP=="M",propcol]<-"1000000"
storm[storm$PROPDMGEXP=="B",propcol]<-"1000000000"
storm[storm$CROPDMGEXP=="",cropcol]<-"0"
storm[storm$CROPDMGEXP=="?",cropcol]<-"0"
storm[storm$CROPDMGEXP=="2",cropcol]<-"0"
storm[storm$CROPDMGEXP=="k",cropcol]<-"1000"
storm[storm$CROPDMGEXP=="K",cropcol]<-"1000"
storm[storm$CROPDMGEXP=="m",cropcol]<-"1000000"
storm[storm$CROPDMGEXP=="M",cropcol]<-"1000000"
storm[storm$CROPDMGEXP=="B",cropcol]<-"1000000000"
## Convert property and crop damage units to numeric data
storm$PROPDMGEXP<-as.numeric(storm$PROPDMGEXP)
storm$CROPDMGEXP<-as.numeric(storm$CROPDMGEXP)
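The conversion above can also be written as a single lookup. The sketch below works on a toy vector and is not the report's own code; the names exp_code, multiplier and mult are illustrative. It follows the same convention as the report: letters map to powers of ten regardless of case, and everything else maps to 0.

```r
# Toy exponent codes standing in for storm$PROPDMGEXP
exp_code <- c("K", "m", "", "B", "?", "5", "h")

# Multipliers for the recognised letters, keyed by upper-case code
multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Case-insensitive lookup; codes without an entry come back as NA
mult <- unname(multiplier[toupper(exp_code)])

# Unrecognised codes (blank, "?", digits, ...) count as 0, as in the report
mult[is.na(mult)] <- 0

print(mult)
```

Keeping the mapping in one named vector means property and crop exponents can share the same table.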
Now, we calculate the total property loss and total crop loss by multiplying each damage value by its multiplier.
## Calculate total property and crop damage losses for an event
storm$PropLoss<-storm$PROPDMG*storm$PROPDMGEXP
storm$CropLoss<-storm$CROPDMG*storm$CROPDMGEXP
Then, we calculate total economic loss due to a storm event by summing up property and crop loss figures.
## Calculate total economic losses for an event
storm$EconomicLoss<-storm$PropLoss+storm$CropLoss
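As a worked example of this arithmetic, take the first two rows shown by head(storm): property damage of 25.0 and 2.5 with exponent “K”, and crop damage of 0 with a blank exponent. Under the conversion used in this report, “K” becomes 1000 and a blank becomes 0. The variable names in this sketch are illustrative.

```r
# Values from the first two rows of head(storm)
prop_dmg  <- c(25.0, 2.5)
prop_mult <- c(1000, 1000)   # "K" -> 1000
crop_dmg  <- c(0, 0)
crop_mult <- c(0, 0)         # blank exponent -> 0 under the report's convention

prop_loss <- prop_dmg * prop_mult
crop_loss <- crop_dmg * crop_mult
economic_loss <- prop_loss + crop_loss

print(economic_loss)  # 25000 and 2500 dollars
```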
Now we create smaller data frames so that aggregation is simple and clear. After this, we exclude the rows with NA values.
## Find relevant column numbers for human loss calculations
col1<-grep("EVTYPE",colnames(storm))
col2<-grep("FATALITIES",colnames(storm))
## Create small data set
HumanLoss<-storm[,c(col1,col2)]
## Omit NA values
HumanLoss<-na.omit(HumanLoss)
## Find relevant column numbers for economic loss calculations
col1<-grep("EVTYPE",colnames(storm))
col2<-grep("EconomicLoss",colnames(storm))
## Create small data set
EcoLoss<-storm[,c(col1,col2)]
## Omit NA values
EcoLoss<-na.omit(EcoLoss)
Now we aggregate the economic losses by event type, sort the result in decreasing order of loss, and keep only the ten event types with the largest losses.
## Aggregate total economic losses over all events
EcoLoss<-aggregate(EconomicLoss~EVTYPE,EcoLoss,sum,na.rm=TRUE)
## Sort economic losses data in decreasing order of losses
EcoLoss<-EcoLoss[order(EcoLoss$EconomicLoss,decreasing = TRUE),]
## Create small dataset for plotting only the top 10 events
EcoLossTop<-EcoLoss[1:10,]
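Indexing with 1:10 works here because the data contain far more than ten event types. As a side note, the sketch below (toy data, illustrative names) shows how row indexing and head differ when fewer rows are available than requested: indexing pads the result with NA rows, while head returns only the rows that exist.

```r
# Toy table with only three event types and made-up losses
small <- data.frame(EVTYPE       = c("FLOOD", "TORNADO", "HAIL"),
                    EconomicLoss = c(3e9, 2e9, 1e9))

by_index <- small[1:5, ]    # asks for 5 rows: the last two are all NA
by_head  <- head(small, 5)  # returns the 3 rows that exist

print(nrow(by_index))
print(nrow(by_head))
```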
Now we plot the economic loss for these ten event types.
## Bar plot of the top 10 events and their losses
barplot(EcoLossTop[1:10, 2], col = rainbow(10), legend.text = EcoLossTop[1:10, 1], ylab = "Economic Loss ($)", main = "Top 10 Storm Events to Cause Max Economic Losses")
From the results above, we can conclude that tornadoes cause by far the most harm to population health: they rank first for both fatalities and injuries. For fatalities they are followed by excessive heat and flash floods; for injuries, by thunderstorm wind and floods.
Floods cause the greatest economic loss, followed by hurricanes and typhoons (HURRICANE-TYPHOON) and then tornadoes.