Synopsis

This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. We analyze the effects of weather events in the United States, specifically the effects of adverse weather conditions on the population’s health and on the economy. The analysis answers two questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

The raw data run from 1950 to November 2011 and contain 902297 records. After selecting the fields required for the analysis, we plotted the event types whose total health impact lies above the 97.5th percentile. Likewise, to understand which storms had the greatest economic consequences, we plotted the event types whose combined crop and property damage lies above the 75th percentile.

More information about this dataset can be found in the NOAA storm data documentation: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf

Requirements to run

Create a directory named "data" in your working directory to store the raw data.
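The directory can also be created from R. The helper below is a minimal sketch, not part of the original analysis; the function name ensureDataDir is hypothetical. The demonstration call uses a temporary location so that it does not touch the working directory; calling ensureDataDir() with no arguments would create ./data.

```r
# Hypothetical helper: create the directory if it is missing,
# then report whether it exists.
ensureDataDir <- function(path = "data") {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  dir.exists(path)
}

# Demonstration in a temporary location
demoDir <- file.path(tempdir(), "data")
created <- ensureDataDir(demoDir)
```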

Data Processing

Loading the libraries

library(dplyr)
library(ggplot2)
library(knitr)
library(xtable)

Specifying the URL and name of the file to be stored

To make the program reusable, only the URL and the name of the stored file need to be changed before calling the function that loads the file from the URL.

fileURL<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
fileStoredName<-"WeatherMonitorData"

Defining a function to load the data with try cases

The function below loads the data into R when it is called. To make the code robust to errors, the read is wrapped in tryCatch.

getData<-function(fileOpen=fileStoredName,direct="R_programming"){
      # fileOpen: base name of the stored file; direct is currently unused
      print(paste("The present working directory is:",workingDir<-getwd()))
      # Tag the stored file with the current year, e.g. WeatherMonitorData-2015
      date<-format(Sys.Date(),"%Y")
      newFileName<-paste(fileOpen,date,sep = "-")
      actualDoc<-paste("./data/",newFileName,".bz2",sep="")
      # Download only when ./data exists and the file is not already there
      if(dir.exists("./data")){
            if(!file.exists(actualDoc)){
                  download.file(fileURL,method="curl",destfile=actualDoc)
            }
      }
      else
            print("You seem to be in a different directory")
      output<- tryCatch(
            {
                  read.csv(bzfile(actualDoc))
            },
            # On error, report the problem and return the code 2
            error=function(cond){
                  message(paste("file does not seem to exist:", actualDoc))
                  message("Here's the original error message:")
                  message(cond)
                  return(2)
            },
            # On warning, report the problem and return the code 3
            warning=function(cond) {
                  message(paste("file caused a warning:", actualDoc))
                  message("Here's the original warning message:")
                  message(cond)
                  return(3)
            },
            # finally runs whether or not the read succeeded
            finally={
                  message(paste("Processed file: ", actualDoc," was opened successfully"))
            }
      )
      output
}

Calling the function to load the data

weatherData<-getData()
## Processed file:  ./data/WeatherMonitorData-2015.bz2  was opened successfully
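The tryCatch pattern used in getData can be seen in isolation in the sketch below. It is a simplified illustration rather than the function above: the name safeRead is hypothetical, it returns NULL on failure instead of a numeric code, and it handles only errors. Note that the finally expression runs on both success and failure, so its message cannot by itself confirm that a file was read.

```r
# Hypothetical helper: read a CSV file, returning NULL if the read fails.
safeRead <- function(path) {
  tryCatch(
    read.csv(path),
    error = function(cond) {
      message("Could not read ", path, ": ", conditionMessage(cond))
      NULL
    },
    # Runs after the read attempt, whether it succeeded or not
    finally = message("Finished attempt to read ", path)
  )
}

# A file that exists
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), tmp, row.names = FALSE)
ok <- safeRead(tmp)

# A file that does not exist; the accompanying warning is suppressed
missingResult <- suppressWarnings(
  safeRead(file.path(tempdir(), "no-such-file.csv"))
)
```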

The raw data has many columns that are not needed for the purpose at hand, so we select only the columns required to assess the health and economic effects of different weather types. We use the columns
STATE : State in which the event was recorded
BGN_DATE : Date on which the event began
EVTYPE : Type of weather event
FATALITIES : Number of fatalities caused by the weather event
INJURIES : Number of injuries caused by the weather event
PROPDMG : Property damage value, to be scaled by the exponent in PROPDMGEXP
PROPDMGEXP : Exponent code (power of ten) for the property damage value
CROPDMG : Crop damage value, to be scaled by the exponent in CROPDMGEXP
CROPDMGEXP : Exponent code (power of ten) for the crop damage value

We keep only the observations with at least one injury or fatality and convert BGN_DATE to a proper date. The data is then checked for missing values, which may introduce errors into the analysis, and the count is displayed below.

stormData<-select(weatherData, STATE,
                  BGN_DATE,EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)
stormData<-stormData%>%filter(INJURIES!=0 | FATALITIES!=0)
str(stormData,5)
## 'data.frame':    21929 obs. of  9 variables:
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 4242 11116 2224 2224 2260 3980 3980 3980 3980 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 826 826 826 826 826 826 826 826 826 826 ...
##  $ FATALITIES: num  0 0 0 0 0 0 1 0 0 1 ...
##  $ INJURIES  : num  15 2 2 2 6 1 14 3 3 26 ...
##  $ PROPDMG   : num  25 25 2.5 2.5 2.5 2.5 25 2.5 2.5 250 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","+","-","0",..: 16 16 16 16 16 16 16 17 17 16 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","0","2","?",..: 1 1 1 1 1 1 1 1 1 1 ...
stormData<-mutate(stormData,BGN_DATE=as.Date(strptime(BGN_DATE,"%d/%m/%Y %H:%M:%S")))
sum(is.na(stormData))
## [1] 12895
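The format string passed to strptime determines how BGN_DATE is read, and a mismatch shows up as missing values. The sketch below assumes the timestamps are written month/day/year; the level "1/1/1966 0:00:00" shown above cannot confirm this on its own, and the first timestamp in the sketch is an illustrative value. Under that assumption, a day-first format returns NA for every day after the 12th, which may account for part of the missing-value count.

```r
# Example timestamps in the layout of BGN_DATE (first value is illustrative)
stamps <- c("4/18/1950 0:00:00", "1/1/1966 0:00:00")

# Month-first parsing reads both timestamps
monthFirst <- as.Date(strptime(stamps, "%m/%d/%Y %H:%M:%S"))

# Day-first parsing fails on the first one, because there is no month 18
dayFirst <- as.Date(strptime(stamps, "%d/%m/%Y %H:%M:%S"))

naMonthFirst <- sum(is.na(monthFirst))
naDayFirst <- sum(is.na(dayFirst))
```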

Setting the correct values of the variables

The health impact on the population is measured as the sum of the injuries and the fatalities caused by the weather. The HEALTH column in the stormData data frame contains this sum.

stormData<-mutate(stormData,HEALTH=(FATALITIES+INJURIES))

The property and crop damage data are each split across two columns: one column contains the value and the other contains the exponent (power of ten) by which that value is multiplied. The exponents are given as letters, which we converted to their corresponding numerical values. Each of the “K”, “M”, and “B” units was replaced by its numeric counterpart, and each quantity was multiplied by the corresponding power of 10. Blank values indicate missing data and were treated as 0 for simplicity. The variable ECO_DAMAGE in the stormDataEco data frame contains the sum of the property damage and the crop damage.

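# Keep only events with both property and crop damage recorded
stormDataEco<-stormData%>%filter(PROPDMG!=0 & CROPDMG !=0)
# Column indices of the exponent columns (PROPDMGEXP, CROPDMGEXP)
values<-grep("EXP",names(stormDataEco))
for(i in values){
      levelNames<-levels(stormDataEco[,i])
      # Map the digit codes 1-10 and "+" to an exponent of 1
      for(j in 1:10){
            levelNames[levelNames==as.character(j)| levelNames=="+"]<-"1"
      }
      # Map the letter codes to powers of ten: B=9, K=3, M=6, H=2
      a<-gsub("b",9,levelNames,ignore.case = TRUE)
      b<-gsub("k",3,a,ignore.case = TRUE)
      c<-gsub("M",6,b,ignore.case = TRUE)
      d<-gsub("h",2,c,ignore.case = TRUE)
      # Treat blank, "-", and "?" as an exponent of 0
      d[d==""|d=="-"|d=="?"]<-0
      levels(stormDataEco[,i])<-d
}
# ECO_DAMAGE = PROPDMG*10^as.numeric(PROPDMGEXP) + CROPDMG*10^as.numeric(CROPDMGEXP)
stormDataEco<-stormDataEco%>%mutate(ECO_DAMAGE=(PROPDMG*(10^as.numeric(PROPDMGEXP))) +
                                          (CROPDMG*(10^as.numeric(CROPDMGEXP))))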
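The conversion from exponent codes to multipliers can also be written as a lookup table. The sketch below is an illustration on made-up values, not the code used for the results in this report: the names expCode, damage, and powerOfTen are hypothetical, and every code outside H, K, M, and B is treated as an exponent of 0, which is a simplification of the rules described above. It also shows a detail worth checking when reproducing this step: as.numeric applied to a factor returns the integer level codes, whereas as.numeric(as.character(...)) returns the values written in the labels.

```r
# Made-up exponent codes and damage values, for illustration only
expCode <- factor(c("K", "M", "B", "", "K"))
damage <- c(25, 2.5, 1, 10, 0.5)

# Lookup table from letter code to power of ten
powerOfTen <- c(H = 2, K = 3, M = 6, B = 9)

# Convert the factor to text first, then look each code up;
# codes not in the table get an exponent of 0 (simplifying assumption)
code <- toupper(as.character(expCode))
exponent <- ifelse(code %in% names(powerOfTen), powerOfTen[code], 0)
dollars <- damage * 10^exponent

# Factor conversion: level codes versus label values
f <- factor(c("3", "6", "9"))
levelCodes <- as.numeric(f)                 # 1 2 3
labelValues <- as.numeric(as.character(f))  # 3 6 9
```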

Results

Below are the event types whose total health impact lies above the 97.5th percentile.

The health effects of the weather events were summed by event type. The output was then reduced to the event types whose totals exceed the 97.5th percentile of these per-type totals.

worstType<-stormData%>%group_by(EVTYPE)%>%summarize(HEALTH_EFFECT=sum(HEALTH))%>%arrange(desc(HEALTH_EFFECT))
numberOfRows<-nrow(worstType)
quant<-quantile(worstType$HEALTH_EFFECT,0.975)
mostHarmful<-filter(worstType,HEALTH_EFFECT>quant)
percent<-nrow(mostHarmful)*100/numberOfRows
print(mostHarmful)
## Source: local data frame [6 x 2]
## 
##           EVTYPE HEALTH_EFFECT
## 1        TORNADO         96979
## 2 EXCESSIVE HEAT          8428
## 3      TSTM WIND          7461
## 4          FLOOD          7259
## 5      LIGHTNING          6046
## 6           HEAT          3037

The table above lists the top 2.73% of the event types that caused health-related issues. Each of these event types has a total above the 97.5th percentile of the per-type totals, which is 2903.05 fatalities and injuries.
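The percentile selection can be checked by hand on a small example. The sketch below uses made-up totals for five hypothetical event types and a 75th percentile cut-off; it illustrates the technique only and does not reproduce the figures in this report.

```r
# Made-up per-type totals, for illustration only
totals <- data.frame(
  EVTYPE = c("A", "B", "C", "D", "E"),
  HEALTH_EFFECT = c(1, 2, 3, 4, 100)
)

# Cut-off at the 75th percentile of the per-type totals
cutoff <- quantile(totals$HEALTH_EFFECT, 0.75)

# Keep only the event types strictly above the cut-off
top <- totals[totals$HEALTH_EFFECT > cutoff, ]

# Share of event types that were kept, in percent
share <- nrow(top) * 100 / nrow(totals)
```

With R's default quantile method, the cut-off here is 4, so only event type E is kept and it makes up 20% of the event types.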

The figure below shows a histogram of the total harm to human health by storm event type. TORNADO caused the most harm, with 9.6979 × 10^4 (96979) reported fatalities and injuries. There is a very large gap between the most hazardous event type and the remaining ones.

qplot(data=mostHarmful,HEALTH_EFFECT,fill=EVTYPE, main = "Histogram of Total Health Effects")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

[Figure: Histogram of Total Health Effects]
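Because each event type contributes a single total, a bar chart is one alternative to the histogram for comparing event types directly. The sketch below is not part of the original analysis: it rebuilds the six rows of the health table above by hand in a data frame with the hypothetical name mostHarmfulTable, and draws one bar per event type with geom_bar and stat = "identity".

```r
library(ggplot2)

# The six rows of the health table above, entered by hand
mostHarmfulTable <- data.frame(
  EVTYPE = c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND",
             "FLOOD", "LIGHTNING", "HEAT"),
  HEALTH_EFFECT = c(96979, 8428, 7461, 7259, 6046, 3037)
)

# One bar per event type, ordered by total, drawn horizontally
healthBars <- ggplot(mostHarmfulTable,
                     aes(x = reorder(EVTYPE, HEALTH_EFFECT),
                         y = HEALTH_EFFECT)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Event type",
       y = "Fatalities and injuries",
       title = "Total Health Effects by Event Type")
```

Printing healthBars draws the chart.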

Below are the event types whose total economic damage lies above the 75th percentile.

The damages caused by the weather events were summed by event type. The output was then reduced to the event types whose combined property and crop damage exceeds the 75th percentile of these per-type totals.

worstEcoType<-stormDataEco%>%group_by(EVTYPE)%>%summarize(ECONOMIC_EFFECT=sum(ECO_DAMAGE))%>%
      arrange(desc(ECONOMIC_EFFECT))
numberOfRowsECO<-nrow(worstEcoType)
quantECO<-quantile(worstEcoType$ECONOMIC_EFFECT,0.75)
mostHarmfulECO<-filter(worstEcoType,ECONOMIC_EFFECT>quantECO)
percentECO<-nrow(mostHarmfulECO)*100/numberOfRowsECO
print(mostHarmfulECO)
## Source: local data frame [12 x 2]
## 
##                EVTYPE ECONOMIC_EFFECT
## 1             TORNADO      5461089000
## 2           HURRICANE      2132201000
## 3               FLOOD      1760381500
## 4           HIGH WIND      1386280000
## 5         FLASH FLOOD      1224105000
## 6           TSTM WIND       931725000
## 7      TROPICAL STORM       841434000
## 8   HURRICANE/TYPHOON       764406810
## 9                HAIL       562550000
## 10 THUNDERSTORM WINDS       424615500
## 11           WILDFIRE       398011040
## 12       WINTER STORM       238330000

The table above lists the top 26.09% of the event types that caused economic damage. Each of these event types has a total above the 75th percentile of the per-type totals, which is 2.3720625 × 10^8 dollars.

The figure below shows a histogram of the total economic damage by storm event type. TORNADO caused the most damage, at 5.461089 × 10^9 dollars. There is a large gap between the most damaging event type and the remaining ones.

qplot(data=mostHarmfulECO,ECONOMIC_EFFECT,fill=EVTYPE,main = "Histogram of Total Economic Effects")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

[Figure: Histogram of Total Economic Effects]