This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. We analyze the effects of weather events in the United States, specifically the effects of adverse weather on population health and on the economy. The analysis addresses two questions: which types of events are most harmful to population health, and which have the greatest economic consequences.
The raw data run from 1950 to November 2011 and contain 902297 records. After selecting the fields required for the analysis, we produced plots showing the event types whose total health impact lies above the 97.5th percentile of all event types. Likewise, to understand which storms had the greatest economic consequences, we plotted the event types whose combined crop and property damage lies above the 75th percentile.
More information about this dataset is available in the storm data documentation: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf
Create a directory named "data" in your working directory to store the raw data.
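The directory can also be created from within R. The following is a minimal sketch, assuming the working directory is writable; it is not part of the original analysis.

```r
# Create ./data only if it is not already present, so reruns are harmless
if (!dir.exists("data")) {
  dir.create("data")
}
```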
library(dplyr)
library(ggplot2)
library(knitr)
library(xtable)
To make the program reusable, only the URL and the name under which the file is stored need to be changed before calling the function that loads the file.
fileURL<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
fileStoredName<-"WeatherMonitorData"
The function below downloads the file if needed and loads the data into R when it is called. To make the code more resistant to errors, the read is wrapped in tryCatch.
getData <- function(fileOpen = fileStoredName, direct = "R_programming") {
  # Note: 'direct' is accepted but not used in the body
  print(paste("The present working directory is:", workingDir <- getwd()))
  # Tag the stored file name with the current year
  date <- format(Sys.Date(), "%Y")
  newFileName <- paste(fileOpen, date, sep = "-")
  actualDoc <- paste("./data/", newFileName, ".bz2", sep = "")
  # Download only when ./data exists and the file is not already there
  if ("data" %in% list.files()) {
    if (!file.exists(actualDoc)) {
      download.file(fileURL, method = "curl", destfile = actualDoc)
    }
  } else {
    print("You seem to be in a different directory")
  }
  output <- tryCatch(
    {
      # Keep text columns as factors, as the later steps expect
      # (the read.csv default changed to FALSE in R 4.0)
      read.csv(bzfile(actualDoc), stringsAsFactors = TRUE)
    },
    error = function(cond) {
      message(paste("file does not seem to exist:", actualDoc))
      message("Here's the original error message:")
      message(cond)
      return(2)  # code returned in place of the data on error
    },
    warning = function(cond) {
      message(paste("file caused a warning:", actualDoc))
      message("Here's the original warning message:")
      message(cond)
      return(3)  # code returned in place of the data on warning
    },
    finally = {
      # Runs whether or not the read succeeded
      message(paste("Processed file: ", actualDoc, " was opened successfully"))
    }
  )
  output
}
weatherData<-getData()
## Processed file: ./data/WeatherMonitorData-2015.bz2 was opened successfully
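Because getData() returns the code 2 or 3 instead of a data frame when the read fails, it can help to check the result before continuing. The sketch below uses a hypothetical helper, checkLoaded(), which is not part of the original analysis.

```r
# Hypothetical helper: stop early if the loader returned an error code
# instead of a data frame.
checkLoaded <- function(x) {
  if (!is.data.frame(x)) {
    stop("Data were not loaded; the loader returned: ", x)
  }
  invisible(x)
}

# A data frame passes through silently; a code such as 2 raises an error
checkLoaded(data.frame(a = 1))
```

In the analysis this would be called as checkLoaded(weatherData) right after loading.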
The raw data have many columns that are not needed for the purpose at hand, so we select only the columns needed to assess the health and economic effects of different weather types. We use the columns:
STATE : State in which the event was recorded
BGN_DATE : Date on which the event began
EVTYPE : Type of weather event
FATALITIES : Number of fatalities caused by the weather type
INJURIES : Number of injuries caused by the weather type
PROPDMG : Property damage value, to be scaled by PROPDMGEXP
PROPDMGEXP : Magnitude code (power of ten) for the property damage
CROPDMG : Crop damage value, to be scaled by CROPDMGEXP
CROPDMGEXP : Magnitude code (power of ten) for the crop damage
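To illustrate how a damage value and its magnitude code combine into a dollar amount, here is a minimal sketch using made-up values and a hypothetical lookup vector named multiplier. It assumes "K", "M" and "B" stand for thousands, millions and billions.

```r
# Hypothetical lookup from magnitude code to multiplier
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Made-up example values, not taken from the dataset
propdmg    <- c(25, 2.5, 250)
propdmgexp <- c("K", "K", "M")

# Scale each value by the multiplier its code points to
dollars <- unname(propdmg * multiplier[propdmgexp])
dollars
```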
We keep only the observations with at least one injury or fatality and convert the date column to a proper date type. The data are then checked for missing values, which may induce errors in the analysis; the results are displayed below.
stormData<-select(weatherData, STATE,
BGN_DATE,EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)
stormData<-stormData%>%filter(INJURIES!=0 | FATALITIES!=0)
str(stormData,5)
## 'data.frame': 21929 obs. of 9 variables:
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 4242 11116 2224 2224 2260 3980 3980 3980 3980 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 826 826 826 826 826 826 826 826 826 826 ...
## $ FATALITIES: num 0 0 0 0 0 0 1 0 0 1 ...
## $ INJURIES : num 15 2 2 2 6 1 14 3 3 26 ...
## $ PROPDMG : num 25 25 2.5 2.5 2.5 2.5 25 2.5 2.5 250 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","+","-","0",..: 16 16 16 16 16 16 16 17 17 16 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","0","2","?",..: 1 1 1 1 1 1 1 1 1 1 ...
stormData<-mutate(stormData,BGN_DATE=as.Date(strptime(BGN_DATE,"%d/%m/%Y %H:%M:%S")))
sum(is.na(stormData))
## [1] 12895
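When a date conversion is followed by a count of missing values, it is worth confirming that the format string matches the order of the date fields, since strptime returns NA for any value it cannot parse. The sketch below uses two sample dates. The first is illustrative and the second is a level shown in the str output. It assumes the raw values are written month first, which should be verified against the data.

```r
# Sample values; a month-first layout is an assumption to verify
sampleDates <- c("4/18/1950 0:00:00", "1/1/1966 0:00:00")

# Day-first parsing fails when the second field exceeds 12
dayFirst <- as.Date(strptime(sampleDates, "%d/%m/%Y %H:%M:%S"))

# Month-first parsing handles both values
monthFirst <- as.Date(strptime(sampleDates, "%m/%d/%Y %H:%M:%S"))

is.na(dayFirst)
monthFirst
```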
The health impact on the population is the sum of the injuries and the fatalities caused by the weather. The HEALTH column in the stormData data frame contains this sum.
stormData<-mutate(stormData,HEALTH=(FATALITIES+INJURIES))
The property and crop damage data are each split across two columns: one contains the value and the other contains a code for the power of ten by which that value is multiplied. The codes are given as letters, which were converted to their corresponding numeric exponents: “H”, “K”, “M” and “B” were replaced by 2, 3, 6 and 9, and each quantity was multiplied by the corresponding power of 10. Empty codes indicate missing data and were treated as 0 for simplicity. Only records with both nonzero property damage and nonzero crop damage are retained, and because stormDataEco is derived from stormData, it contains only events with at least one injury or fatality. The variable ECO_DAMAGE in the stormDataEco data frame contains the sum of the property damage and the crop damage.
stormDataEco<-stormData%>%filter(PROPDMG!=0 & CROPDMG !=0)
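values <- grep("EXP", names(stormDataEco))
for (i in values) {
  levelNames <- levels(stormDataEco[, i])
  # Digits 1 to 10 and "+" are mapped to an exponent of 1
  levelNames[levelNames %in% c(as.character(1:10), "+")] <- "1"
  # Letter codes are mapped to exponents: B = 9, K = 3, M = 6, H = 2
  a <- gsub("b", 9, levelNames, ignore.case = TRUE)
  b <- gsub("k", 3, a, ignore.case = TRUE)
  c <- gsub("M", 6, b, ignore.case = TRUE)
  d <- gsub("h", 2, c, ignore.case = TRUE)
  # Empty, "-" and "?" codes are treated as an exponent of 0
  d[d == "" | d == "-" | d == "?"] <- 0
  levels(stormDataEco[, i]) <- d
}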
stormDataEco<-stormDataEco%>%mutate(ECO_DAMAGE=(PROPDMG*(10^as.numeric(PROPDMGEXP))) +
(CROPDMG*(10^as.numeric(CROPDMGEXP))))
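When exponents are stored as factor labels, it is worth confirming how they are turned into numbers, which may matter for the conversion above. In R, as.numeric() applied to a factor returns the internal level codes, whereas converting through as.character() first recovers the values written in the labels. A minimal sketch with made-up exponents:

```r
# Made-up exponent labels stored as a factor
expCodes <- factor(c("3", "6", "9", "3"))

# Internal level codes (positions in the sorted level set)
codes <- as.numeric(expCodes)

# Values written in the labels
labelValues <- as.numeric(as.character(expCodes))

codes
labelValues
```

Here codes is 1 2 3 1, while labelValues is 3 6 9 3.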
The health effects were summed by weather event type. The output was then reduced to the event types whose total exceeds the 97.5th percentile of the per-type totals.
worstType<-stormData%>%group_by(EVTYPE)%>%summarize(HEALTH_EFFECT=sum(HEALTH))%>%arrange(desc(HEALTH_EFFECT))
numberOfRows<-nrow(worstType)
quant<-quantile(worstType$HEALTH_EFFECT,0.975)
mostHarmful<-filter(worstType,HEALTH_EFFECT>quant)
percent<-nrow(mostHarmful)*100/numberOfRows
print(mostHarmful)
## Source: local data frame [6 x 2]
##
## EVTYPE HEALTH_EFFECT
## 1 TORNADO 96979
## 2 EXCESSIVE HEAT 8428
## 3 TSTM WIND 7461
## 4 FLOOD 7259
## 5 LIGHTNING 6046
## 6 HEAT 3037
The table above lists the top 2.73% of event types (6 of 220) causing health-related issues. Each of these event types caused more health issues than the 97.5th percentile of the per-type totals, which is 2903.05 combined fatalities and injuries.
The figure below shows a histogram of the total health effects of the most harmful event types. TORNADO caused the most harm by far, with 9.6979 × 10^4 (96,979) reported fatalities and injuries. There is a very large gap between the most hazardous event type and the remaining ones.
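The cutoff used here is a percentile of the per-type totals, not a share of the overall total. The toy sketch below, with made-up event names and values, shows the same quantile-then-filter pattern at the 75th percentile.

```r
# Made-up per-type totals
totals <- data.frame(
  EVTYPE = c("A", "B", "C", "D", "E"),
  HEALTH_EFFECT = c(1, 2, 3, 4, 100)
)

# 75th percentile of the per-type totals
cutoff <- quantile(totals$HEALTH_EFFECT, 0.75)

# Keep only the types whose total exceeds the cutoff
kept <- totals[totals$HEALTH_EFFECT > cutoff, ]
kept
```

With these values the cutoff is 4, so only type "E" is kept, even though types "A" to "D" also contribute to the overall total.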
qplot(data=mostHarmful,HEALTH_EFFECT,fill=EVTYPE, main = "Histogram of Total Health Effects")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
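As an alternative view, a bar chart with one bar per event type may make the per-type totals easier to compare than a histogram of the totals. The sketch below rebuilds the six rows of the table above by hand and uses ggplot2's geom_col. The object names healthTotals and p, and the axis labels, are illustrative choices, not part of the original analysis.

```r
library(ggplot2)

# Per-type totals copied from the table above
healthTotals <- data.frame(
  EVTYPE = c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND",
             "FLOOD", "LIGHTNING", "HEAT"),
  HEALTH_EFFECT = c(96979, 8428, 7461, 7259, 6046, 3037)
)

# One bar per event type, ordered from largest to smallest total
p <- ggplot(healthTotals,
            aes(x = reorder(EVTYPE, -HEALTH_EFFECT), y = HEALTH_EFFECT)) +
  geom_col() +
  labs(x = "Event type", y = "Fatalities + injuries",
       title = "Total Health Effects by Event Type")
p
```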
The damages were summed by weather event type. The output was then reduced to the event types whose total exceeds the 75th percentile of the per-type property and crop damage totals.
worstEcoType<-stormDataEco%>%group_by(EVTYPE)%>%summarize(ECONOMIC_EFFECT=sum(ECO_DAMAGE))%>%
arrange(desc(ECONOMIC_EFFECT))
numberOfRowsECO<-nrow(worstEcoType)
quantECO<-quantile(worstEcoType$ECONOMIC_EFFECT,0.75)
mostHarmfulECO<-filter(worstEcoType,ECONOMIC_EFFECT>quantECO)
percentECO<-nrow(mostHarmfulECO)*100/numberOfRowsECO
print(mostHarmfulECO)
## Source: local data frame [12 x 2]
##
## EVTYPE ECONOMIC_EFFECT
## 1 TORNADO 5461089000
## 2 HURRICANE 2132201000
## 3 FLOOD 1760381500
## 4 HIGH WIND 1386280000
## 5 FLASH FLOOD 1224105000
## 6 TSTM WIND 931725000
## 7 TROPICAL STORM 841434000
## 8 HURRICANE/TYPHOON 764406810
## 9 HAIL 562550000
## 10 THUNDERSTORM WINDS 424615500
## 11 WILDFIRE 398011040
## 12 WINTER STORM 238330000
The table above lists the top 26.09% of event types (12 of 46) causing economic damage. Each of these event types caused more damage than the 75th percentile of the per-type totals, which is 2.3720625 × 10^8 dollars.
The figure below shows a histogram of the total economic damage of the costliest event types. TORNADO caused the most damage, with 5.461089 × 10^9 dollars in combined property and crop damage. There is a very large gap between the most damaging event type and the remaining ones.
qplot(data=mostHarmfulECO,ECONOMIC_EFFECT,fill=EVTYPE,main = "Histogram of Total Economic Effects")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.