Severe weather events can have catastrophic consequences for the economy and public health of communities and countries. Preventing these effects, or at least reducing them, is a key concern. The U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database records the occurrence of storms and other significant weather phenomena intense enough to cause damage in the United States. This information is collected from a variety of sources around the country.
In the present project, we use the NOAA database to help prevent these catastrophic events. We address the following two questions: 1) Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health? 2) Across the United States, which types of events have the greatest economic consequences?
The project contains 2 sections: 1) A data processing part, where we describe the dataset and carry out a preliminary exploratory data analysis. 2) A results part, where we present the features most relevant to anticipating these weather events. We focus on the event types that cause the most damage to the economy and to public health.
Our study shows that for population health, tornado, flood and heat are the main concerns. We therefore recommend developing a preventive plan to help people protect themselves in case of tornado, flood and excessive heat. These events are the main causes of fatalities and injuries in the United States during the last 60 years. The events with the greatest economic consequences, mainly through property damage, are flood, hurricane and tornado. It is important to pay more attention to protecting properties against these events. For the particular case of crop damage, drought and flood are the main concerns. We recommend developing programs to mitigate the effect of drought on crop production. Flood is hard to predict; the impact of drought, however, can be reduced with long-term programs that provide water to keep crop production at a good rate.
First, we need to understand the NOAA dataset. As a first step, we load only the first rows together with the column names to identify the most relevant information for our problem. Since the dataset is very large, we use this step to decide which columns to filter out for the present project. Note that the dataset used in this project is not the most recent data available on the NOAA webpage. For more details about the dataset, see http://www.ncdc.noaa.gov/stormevents/
options(scipen = 1, digits = 5)
# Checking the zip file is already there
if (!file.exists("repdata_data_StormData.csv.bz2")) {
download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
method = "curl", destfile = "repdata_data_StormData.csv.bz2", quiet = TRUE)
}
# Reading the raw csv file
data <- read.csv("./repdata_data_StormData.csv.bz2",header = TRUE, stringsAsFactors = FALSE, strip.white = TRUE, na.strings = c("NA",""),nrow=2)
str(data)
## 'data.frame': 2 obs. of 37 variables:
## $ STATE__ : num 1 1
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00"
## $ BGN_TIME : int 130 145
## $ TIME_ZONE : chr "CST" "CST"
## $ COUNTY : num 97 3
## $ COUNTYNAME: chr "MOBILE" "BALDWIN"
## $ STATE : chr "AL" "AL"
## $ EVTYPE : chr "TORNADO" "TORNADO"
## $ BGN_RANGE : num 0 0
## $ BGN_AZI : logi NA NA
## $ BGN_LOCATI: logi NA NA
## $ END_DATE : logi NA NA
## $ END_TIME : logi NA NA
## $ COUNTY_END: num 0 0
## $ COUNTYENDN: logi NA NA
## $ END_RANGE : num 0 0
## $ END_AZI : logi NA NA
## $ END_LOCATI: logi NA NA
## $ LENGTH : num 14 2
## $ WIDTH : num 100 150
## $ F : int 3 2
## $ MAG : num 0 0
## $ FATALITIES: num 0 0
## $ INJURIES : num 15 0
## $ PROPDMG : num 25 2.5
## $ PROPDMGEXP: chr "K" "K"
## $ CROPDMG : num 0 0
## $ CROPDMGEXP: logi NA NA
## $ WFO : logi NA NA
## $ STATEOFFIC: logi NA NA
## $ ZONENAMES : logi NA NA
## $ LATITUDE : num 3040 3042
## $ LONGITUDE : num 8812 8755
## $ LATITUDE_E: num 3051 0
## $ LONGITUDE_: num 8806 0
## $ REMARKS : logi NA NA
## $ REFNUM : num 1 2
info <- dim(data)
The dataset contains 37 columns, so we keep only the variables relevant to our problem. To do that, we first need to identify all the variables and look for those related to our project. Since we do not have a codebook for this dataset, we rely on the detailed information about the fields/columns available at the following link: http://www.ncdc.noaa.gov/stormevents/ftp.jsp. The variables we are going to use are BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP; the other columns are dropped while reading the raw data.
Now, we read the raw data with the relevant columns:
# Reading the raw csv file
data <- read.csv("./repdata_data_StormData.csv.bz2",colClasses =c("NULL",NA, rep("NULL", 4),NA,NA, rep("NULL", 14), rep(NA,6), rep("NULL", 9)), header = TRUE, stringsAsFactors = FALSE, strip.white = TRUE, na.strings = c("NA",""),nrow=902297)
str(data)
## 'data.frame': 902297 obs. of 9 variables:
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr NA NA NA NA ...
info <- dim(data)
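The positional `colClasses` vector used above is easy to get wrong if the column order changes. A minimal sketch of a name-based alternative follows, assuming a tiny inline CSV (`sample_csv`, with illustrative values only) in place of the real file; the names `keep`, `header` and `col_classes` are hypothetical helpers introduced here.

```r
# Tiny stand-in for the raw file; the values are illustrative, not real records
sample_csv <- "STATE__,BGN_DATE,COUNTY,EVTYPE,FATALITIES
1,4/18/1950 0:00:00,97,TORNADO,0
1,4/18/1950 0:00:00,3,TORNADO,0"

# Columns we want to keep, selected by name instead of by position
keep <- c("BGN_DATE", "EVTYPE", "FATALITIES")

# Read a single row just to recover the column names
header <- names(read.csv(text = sample_csv, nrows = 1))

# "NULL" drops a column; NA lets read.csv pick the type automatically
col_classes <- rep("NULL", length(header))
col_classes[header %in% keep] <- NA

demo <- read.csv(text = sample_csv, colClasses = col_classes,
                 stringsAsFactors = FALSE)
names(demo)
```

With the real file, the same idea would apply by replacing `text = sample_csv` with the path to the compressed CSV.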
The dataset contains 9 columns and 902297 rows. The variables FATALITIES and INJURIES are relevant to public health, while PROPDMG and CROPDMG show the consequences of these events for the economy. The other variables may be useful for plotting the data (e.g. FATALITIES vs date) or for transforming variables. Since we are studying the data with respect to public health and the economy, we keep only the records where at least one of these four variables is different from zero. In other words, we remove the rows where all of these variables are zero.
First, we check how many rows contain only zeros for these variables:
dim(data[which(data$FATALITIES == 0 & data$INJURIES == 0 & data$PROPDMG ==0 & data$CROPDMG ==0 ) ,])
## [1] 647664 9
We need to remove these rows from our dataset. The rows we want to keep are:
dim(data[which(!(data$FATALITIES == 0 & data$INJURIES == 0 & data$PROPDMG ==0 & data$CROPDMG ==0) ) ,])
## [1] 254633 9
We then end up with a smaller, more relevant dataset:
dataDMG <- data[which(!(data$FATALITIES == 0 & data$INJURIES == 0 & data$PROPDMG ==0 & data$CROPDMG ==0) ) ,]
rm(data)
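The same row filter can be written more compactly. The sketch below is one possible formulation, assuming a small toy data frame (`toy`, illustrative values) and assuming, as in our data, that the four impact columns contain no missing values; `impact_cols` and `has_impact` are hypothetical names.

```r
# Toy data: the first row has no recorded impact and should be dropped
toy <- data.frame(FATALITIES = c(0, 1, 0),
                  INJURIES   = c(0, 0, 0),
                  PROPDMG    = c(0, 0, 2.5),
                  CROPDMG    = c(0, 0, 0))

impact_cols <- c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")

# TRUE when at least one impact column is non-zero
# (assumes no NA values; with NAs, rowSums would need na.rm = TRUE)
has_impact <- rowSums(toy[, impact_cols] != 0) > 0

toy_dmg <- toy[has_impact, ]
nrow(toy_dmg)
```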
The date variable BGN_DATE needs to be converted from the original character format of the date to the date format of R.
dataDMG$BGN_DATE <-as.Date(dataDMG$BGN_DATE, format="%m/%d/%Y")
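As a quick check of the conversion, the sketch below parses one of the raw date strings shown earlier. It relies on `as.Date` ignoring the trailing time part once the format has been matched; `raw_date` and `parsed` are names used only for this illustration.

```r
# One raw value as it appears in the BGN_DATE column
raw_date <- "4/18/1950 0:00:00"

# The trailing " 0:00:00" is ignored because the format only covers the date
parsed <- as.Date(raw_date, format = "%m/%d/%Y")

format(parsed, "%Y-%m-%d")  # "1950-04-18"
format(parsed, "%Y")        # "1950"
```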
The variables PROPDMGEXP and CROPDMGEXP encode the units of the variables PROPDMG and CROPDMG. They contain the following symbols:
unique(dataDMG$PROPDMGEXP)
## [1] "K" "M" NA "B" "m" "+" "0" "5" "6" "4" "h" "2" "7" "3" "H" "-"
unique(dataDMG$CROPDMGEXP)
## [1] NA "M" "K" "m" "B" "?" "0" "k"
After reading the documentation of the most recent dataset (see link above), we consider it reasonable to assume that B or b means billions, M or m millions, K or k thousands, and H or h hundreds of dollars of damage to properties and crops. We therefore need to rescale the corresponding variables PROPDMG and CROPDMG accordingly.
library(plyr)
ddply(dataDMG,.(PROPDMGEXP), summarize, length(PROPDMG))
## PROPDMGEXP ..1
## 1 - 1
## 2 + 5
## 3 0 210
## 4 2 1
## 5 3 1
## 6 4 4
## 7 5 18
## 8 6 3
## 9 7 3
## 10 B 40
## 11 h 1
## 12 H 6
## 13 K 231428
## 14 m 7
## 15 M 11320
## 16 <NA> 11585
ddply(dataDMG,.(CROPDMGEXP), summarize, length(CROPDMG))
## CROPDMGEXP ..1
## 1 ? 6
## 2 0 17
## 3 B 7
## 4 k 21
## 5 K 99932
## 6 m 1
## 7 M 1985
## 8 <NA> 152664
removed_valuesPROP <- sum( ddply(dataDMG,.(PROPDMGEXP), summarize, length(PROPDMG))[1:9,2] )/ sum( ddply(dataDMG,.(PROPDMGEXP), summarize, length(PROPDMG))[,2] ) *100.
removed_valuesCROP <- sum( ddply(dataDMG,.(CROPDMGEXP), summarize, length(CROPDMG))[1:2,2] )/ sum( ddply(dataDMG,.(CROPDMGEXP), summarize, length(CROPDMG))[,2] ) *100.
There are no missing values in the property damage and crop damage variables. The variables that hold their units do contain many missing values (see results above); these were blank in the original dataset and leave the damage value unchanged. The other symbols (digits, +, - and ?) that appear in these variables are rare, and we decided to remove the corresponding rows from the dataset. For the property damage variable these rows are only 0.09661 % of the dataset, and for the crop damage variable they are only 0.00903 % of the total number of rows.
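The share of rows with unrecognised unit codes can also be computed directly from the code vector, without relying on the row order of a summary table. A minimal sketch, assuming a short illustrative vector `exp_codes` and a hypothetical `valid` set of accepted codes:

```r
# Illustrative unit codes, including a missing value and some odd symbols
exp_codes <- c("K", "M", NA, "B", "m", "+", "0", "5", "K", "K")

# Codes we know how to interpret (hundreds, thousands, millions, billions)
valid <- c("H", "h", "K", "k", "M", "m", "B", "b")

# A row is dropped when its code is present but not in the valid set;
# missing codes are kept, as in the analysis above
dropped <- !is.na(exp_codes) & !(exp_codes %in% valid)

pct_dropped <- 100 * sum(dropped) / length(exp_codes)
pct_dropped
```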
We now rescale the property and crop damage using the unit variables studied above. First, we replace the missing unit values with the placeholder “NO” to avoid problems when transforming the variables. We also add a new variable called year to simplify the plots later on.
dataDMG$PROPDMGEXP <- ifelse(is.na(dataDMG$PROPDMGEXP), "NO", dataDMG$PROPDMGEXP)
dataDMG$CROPDMGEXP <- ifelse(is.na(dataDMG$CROPDMGEXP), "NO", dataDMG$CROPDMGEXP)
dataDMG <- dataDMG[ dataDMG$CROPDMGEXP != "?" & dataDMG$CROPDMGEXP != "0" & dataDMG$PROPDMGEXP != "-" & dataDMG$PROPDMGEXP != "+" & dataDMG$PROPDMGEXP != "0" & dataDMG$PROPDMGEXP != "2" & dataDMG$PROPDMGEXP != "3" & dataDMG$PROPDMGEXP != "4" & dataDMG$PROPDMGEXP != "5" & dataDMG$PROPDMGEXP != "6" & dataDMG$PROPDMGEXP != "7" , ]
dataDMG$PROPDMG <- ifelse(dataDMG$PROPDMG> 0 & dataDMG$PROPDMGEXP =="M" ,1000000.0*dataDMG$PROPDMG ,dataDMG$PROPDMG)
dataDMG$PROPDMG <- ifelse(dataDMG$PROPDMG > 0 & dataDMG$PROPDMGEXP =="m" ,1000000.0*dataDMG$PROPDMG ,dataDMG$PROPDMG)
dataDMG$PROPDMG <- ifelse(dataDMG$PROPDMG > 0 & dataDMG$PROPDMGEXP =="B" ,1000000000.0*dataDMG$PROPDMG ,dataDMG$PROPDMG)
dataDMG$PROPDMG <- ifelse(dataDMG$PROPDMG > 0 & dataDMG$PROPDMGEXP =="H" ,100.*dataDMG$PROPDMG ,dataDMG$PROPDMG)
dataDMG$PROPDMG <- ifelse(dataDMG$PROPDMG > 0 & dataDMG$PROPDMGEXP =="h" ,100.*dataDMG$PROPDMG ,dataDMG$PROPDMG)
dataDMG$PROPDMG <- ifelse(dataDMG$PROPDMG > 0 & dataDMG$PROPDMGEXP =="K" ,1000.*dataDMG$PROPDMG ,dataDMG$PROPDMG)
dataDMG$PROPDMG <- ifelse(dataDMG$PROPDMG > 0 & dataDMG$PROPDMGEXP =="k" ,1000.*dataDMG$PROPDMG ,dataDMG$PROPDMG)
dataDMG$CROPDMG <- ifelse(dataDMG$CROPDMG> 0 & dataDMG$CROPDMGEXP =="M" ,1000000.0*dataDMG$CROPDMG ,dataDMG$CROPDMG)
dataDMG$CROPDMG <- ifelse(dataDMG$CROPDMG > 0 & dataDMG$CROPDMGEXP =="m" ,1000000.0*dataDMG$CROPDMG ,dataDMG$CROPDMG)
dataDMG$CROPDMG <- ifelse(dataDMG$CROPDMG > 0 & dataDMG$CROPDMGEXP =="B" ,1000000000.0*dataDMG$CROPDMG ,dataDMG$CROPDMG)
dataDMG$CROPDMG <- ifelse(dataDMG$CROPDMG > 0 & dataDMG$CROPDMGEXP =="H" ,100.*dataDMG$CROPDMG ,dataDMG$CROPDMG)
dataDMG$CROPDMG <- ifelse(dataDMG$CROPDMG > 0 & dataDMG$CROPDMGEXP =="h" ,100.*dataDMG$CROPDMG ,dataDMG$CROPDMG)
dataDMG$CROPDMG <- ifelse(dataDMG$CROPDMG > 0 & dataDMG$CROPDMGEXP =="K" ,1000.*dataDMG$CROPDMG ,dataDMG$CROPDMG)
dataDMG$CROPDMG <- ifelse(dataDMG$CROPDMG > 0 & dataDMG$CROPDMGEXP =="k" ,1000.*dataDMG$CROPDMG ,dataDMG$CROPDMG)
dataDMG$year <- as.factor(strftime(dataDMG$BGN_DATE, "%Y"))
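The chain of `ifelse` calls above can be expressed as a single lookup. The sketch below is an alternative formulation, not the code used for the results in this report; it assumes the same unit meanings (H, K, M, B) and the “NO” placeholder, and `multiplier`, `toy` and `PROPDMG_USD` are hypothetical names.

```r
# Multiplier for each (upper-cased) unit code; "NO" leaves the value unchanged
multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9, NO = 1)

# Toy data with mixed-case codes, as found in the raw dataset
toy <- data.frame(PROPDMG    = c(25, 2.5, 1, 3, 7),
                  PROPDMGEXP = c("K", "m", "B", "NO", "h"),
                  stringsAsFactors = FALSE)

# toupper() folds "m"/"h"/"k" onto "M"/"H"/"K" before the lookup
toy$PROPDMG_USD <- toy$PROPDMG * unname(multiplier[toupper(toy$PROPDMGEXP)])
toy
```

Writing the result to a new column also keeps the original value available, which avoids any risk of scaling the same value twice.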
First, we address the question: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health? We first take FATALITIES as the variable that measures how harmful a particular weather event is to population health (later we will also use the variable INJURIES). Since the variable “EVTYPE” contains 484 unique events, a plot of FATALITIES vs event helps to decide which events are most harmful. In Figure 1 we plot the total fatalities over all years against the index of the event type to visualize the most catastrophic events.
library(plyr)
Ftype <- ddply(dataDMG, .(EVTYPE), summarise, total = sum(FATALITIES))
plot(Ftype$total,xlab =" index", ylab='Total of fatalities',main= "Figure 1")
From this plot we observe that only a few events account for a large number of fatalities. These are easy to identify: let us order the events and see which ones are the most catastrophic:
ordertotal <- Ftype[order(-Ftype$total),]
head(ordertotal,10)
## EVTYPE total
## 402 TORNADO 5630
## 60 EXCESSIVE HEAT 1903
## 72 FLASH FLOOD 978
## 148 HEAT 937
## 253 LIGHTNING 816
## 418 TSTM WIND 504
## 84 FLOOD 470
## 301 RIP CURRENT 368
## 196 HIGH WIND 246
## 11 AVALANCHE 224
FatPerc <- sum(ordertotal[1:10,2])/sum(ordertotal[,2])*100.
FatTornado <- sum(ordertotal[1:1,2])/sum(ordertotal[,2])*100.
FatExcHeat <- sum(ordertotal[1:2,2])/sum(ordertotal[,2])*100.
The previous list shows the 10 most important events in terms of fatalities. These 10 events are responsible for 79.78857% of the fatalities in the United States from 1950 to 2011. In fact, the first one, tornado, is by itself responsible for 37.19855% of all fatalities. From these results, we can say that the most relevant events with respect to fatalities are TORNADO and EXCESSIVE HEAT, which together account for 49.77205% of all fatalities.
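The ranking and the percentage shares quoted above follow a simple pattern: sum by event type, sort in decreasing order, and divide by the grand total. A minimal base-R sketch on toy data (illustrative values; `totals`, `share_pct` and `cum_pct` are hypothetical names) might look like this:

```r
# Toy records: event type and fatalities per record
toy <- data.frame(EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
                  FATALITIES = c(5, 2, 3, 1, 1),
                  stringsAsFactors = FALSE)

# Total fatalities per event type
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# Sort from most to least harmful
totals <- totals[order(-totals$FATALITIES), ]

# Share of each event and running share of the top events
totals$share_pct <- 100 * totals$FATALITIES / sum(totals$FATALITIES)
totals$cum_pct   <- cumsum(totals$share_pct)
totals
```

The `cum_pct` column gives, row by row, the share covered by the top 1, top 2, ... events, which is how figures such as "the first two events account for about half of all fatalities" can be read off.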
In the top plot of Figure 2 we show FATALITIES for every year, to see whether these totals are driven by a single large catastrophe in a particular year or by events that happen yearly. We plot the total fatalities for all events together with those of particular events. Here we take the leading events in the list above, namely TORNADO, HEAT and FLOOD. Note that we group all event types containing the word HEAT, and likewise all those containing FLOOD, into a single event for a better understanding of heat and flood effects.
year_Fat_allevent <- ddply(dataDMG, .(year), summarise, count = sum(FATALITIES))
year_Fat_tornado <- ddply(dataDMG[dataDMG$EVTYPE == "TORNADO",], .(year), summarise, sum = sum(FATALITIES))
year_Fat_FLOOD_all <- ddply(dataDMG[grep("FLOOD", dataDMG$EVTYPE),], .(year), summarise, sum = sum(FATALITIES))
year_Fat_heat_all <- ddply(dataDMG[grep("HEAT", dataDMG$EVTYPE),], .(year), summarise, sum = sum(FATALITIES))
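The pattern-based grouping used here can be illustrated on a small example. The sketch below uses base R and toy data (illustrative values) to show how every event type containing "FLOOD" is pooled before summing by year; `is_flood` and `flood_by_year` are hypothetical names.

```r
# Toy records spanning two years and several flood-like event types
toy <- data.frame(year       = c("1995", "1995", "1996", "1996"),
                  EVTYPE     = c("FLASH FLOOD", "FLOOD", "RIVER FLOOD", "TORNADO"),
                  FATALITIES = c(2, 1, 4, 9),
                  stringsAsFactors = FALSE)

# TRUE for any event type whose name contains the literal string "FLOOD"
is_flood <- grepl("FLOOD", toy$EVTYPE, fixed = TRUE)

# Yearly fatalities for the pooled flood events
flood_by_year <- aggregate(FATALITIES ~ year, data = toy[is_flood, ], FUN = sum)
flood_by_year
```

Note that this kind of matching is case-sensitive, so an event recorded in lower case would not be pooled unless the names are upper-cased first.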
In this plot, the small black rectangles show the total fatalities for each year, the red circles are the fatalities due to tornado, the green circles are fatalities from heat-related events and the blue circles are fatalities from flood-related events. From this plot we conclude that for the years 1950-1992 tornado was the most relevant event for population health; after 1992 other weather events, such as flood and heat, start to appear in the record. Another important point is that fewer event types were recorded in the earlier years, most likely because record keeping was less complete. So, in principle, we do not know whether flood and heat were relevant in the earlier years. However, following our observations, we conclude that nowadays it is important to pay more attention to protecting the population against tornado, flood and heat. We can also conclude that these weather events usually happen every year. The increase in fatalities due to heat may also be related to the rise in global temperatures in recent years, i.e. global warming. Again, we recommend improving protection against heat, mainly during the summer season when heat peaks. We are not going to study heat by season in this project; however, it seems a necessary point to study in the future.
Now we investigate injuries, following the same analysis as for fatalities. First, we aggregate the data by event type:
FtypeIJ <- ddply(dataDMG, .(EVTYPE), summarise, total = sum(INJURIES))
ordertotalIJ <- FtypeIJ[order(-FtypeIJ$total),]
InjPer <- sum(ordertotalIJ[1:10,2])/sum(ordertotalIJ[,2])*100.
InjTornado <- sum(ordertotalIJ[1:1,2])/sum(ordertotalIJ[,2])*100.
head(ordertotalIJ,10)
## EVTYPE total
## 402 TORNADO 91321
## 418 TSTM WIND 6957
## 84 FLOOD 6789
## 60 EXCESSIVE HEAT 6525
## 253 LIGHTNING 5230
## 148 HEAT 2100
## 233 ICE STORM 1975
## 72 FLASH FLOOD 1777
## 359 THUNDERSTORM WIND 1488
## 131 HAIL 1360
The previous list shows the 10 most important events in terms of injuries. These 10 events are responsible for 89.35604% of the injuries in the United States from 1950 to 2011. In fact, the first one, tornado, is by itself responsible for 65.00918% of all injuries. So, tornado is the most relevant weather event for population health. However, it is also important to pay attention to other events such as flood, heat, wind and lightning.
year_Fat_alleventIJ <- ddply(dataDMG, .(year), summarise, mean.count = sum(INJURIES))
year_Fat_tornadoIJ <- ddply(dataDMG[dataDMG$EVTYPE == "TORNADO",], .(year), summarise, sum = sum(INJURIES))
year_Fat_FLOOD_allIJ <- ddply(dataDMG[grep("FLOOD", dataDMG$EVTYPE),], .(year), summarise, sum = sum(INJURIES))
year_Fat_heat_allIJ <- ddply(dataDMG[grep("HEAT", dataDMG$EVTYPE),], .(year), summarise, sum = sum(INJURIES))
year_Fat_WIND_allIJ <- ddply(dataDMG[grep("WIND", dataDMG$EVTYPE),], .(year), summarise, sum = sum(INJURIES))
par(mfrow=c(2,1),mar = c(4, 4, 2, 1))
plot(year_Fat_allevent$year,year_Fat_allevent$count, xlab= "Years", ylab= "Fatalities",main= "Figure 2")
points(year_Fat_tornado$year,year_Fat_tornado$sum, col="red")
points(year_Fat_heat_all$year,year_Fat_heat_all$sum, col="green")
points(year_Fat_FLOOD_all$year,year_Fat_FLOOD_all$sum, col="blue")
legend("topleft",c("Tornado","Heat","Flood"), pch =1,col=c("red","green","blue"))
legend("top",c("All events"),lty=c(1), lwd=c(2.5),col=c("black"))
plot(year_Fat_alleventIJ$year,year_Fat_alleventIJ$mean.count,xlab= "Years", ylab= "Injuries")
points(year_Fat_tornadoIJ$year,year_Fat_tornadoIJ$sum, col="red")
points(year_Fat_heat_allIJ$year,year_Fat_heat_allIJ$sum, col="green")
points(year_Fat_FLOOD_allIJ$year,year_Fat_FLOOD_allIJ$sum, col="blue")
points(year_Fat_WIND_allIJ$year,year_Fat_WIND_allIJ$sum, col="orange")
legend("topleft",c("Tornado","Heat","Flood", "WIND"), pch =1,col=c("red","green","blue","orange"))
legend("top",c("All events"),lty=c(1), lwd=c(2.5),col=c("black"))
In the bottom plot of Figure 2 we show INJURIES for every year, to study whether these totals are driven by a particular year. We plot the total injuries for all events together with those of particular events. In this plot, the small black rectangles show the total injuries for each year, the red circles are the injuries due to tornado, the green circles are injuries from heat-related events, the blue circles are injuries from flood-related events and the orange circles are the injuries due to wind. The main difference compared with fatalities is that wind ranks higher for injuries than for fatalities: TSTM WIND is the second cause of injuries but only the sixth cause of fatalities in the lists above. Here again, we recommend developing a preventive plan to help people stay safe in case of tornado, flood, excessive heat and wind. These events are the main causes of injuries in the United States during the last 60 years.
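The "small black rectangle" appearance of the yearly totals is most likely a side effect of year being stored as a factor: when the x variable is a factor, `plot()` draws one box per level instead of points. A minimal sketch of converting the factor back to numeric years is shown below; `year_factor` and `year_num` are illustrative names, and the conversion goes through `as.character()` because `as.integer()` on a factor returns the internal level codes.

```r
# A factor of years, as created with as.factor(strftime(..., "%Y"))
year_factor <- as.factor(c("1950", "1951", "1995"))

# Direct conversion returns the level codes, not the years
as.integer(year_factor)                              # 1 2 3

# Converting via character recovers the actual years
year_num <- as.integer(as.character(year_factor))
year_num                                             # 1950 1951 1995
```

Plotting against `year_num` instead of the factor would give an ordinary scatter or line plot on a true time axis.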
Now we address the question: Across the United States, which types of events have the greatest economic consequences? We have already determined the variables related to the economy, PROPDMG and CROPDMG. First, we study PROPDMG (property damage), for which we need to identify the most important events. Below, I list the 10 most relevant events according to the variable PROPDMG.
FtypePROP <- ddply(dataDMG, .(EVTYPE), summarise, total = sum(PROPDMG))
ordertotalPROP <- FtypePROP[order(-FtypePROP$total),]
PropPerc <- sum(ordertotalPROP[1:10,2])/sum(ordertotalPROP[,2])*100.
PropFlood <- sum(ordertotalPROP[1:1,2])/sum(ordertotalPROP[,2])*100.
PropHurricane <- sum(ordertotalPROP[1:2,2])/sum(ordertotalPROP[,2])*100.
PropTornado <- sum(ordertotalPROP[1:3,2])/sum(ordertotalPROP[,2])*100.
head(ordertotalPROP,10)
## EVTYPE total
## 84 FLOOD 1.4466e+11
## 219 HURRICANE/TYPHOON 6.9306e+10
## 402 TORNADO 5.6937e+10
## 345 STORM SURGE 4.3324e+10
## 72 FLASH FLOOD 1.6141e+10
## 131 HAIL 1.5732e+10
## 210 HURRICANE 1.1868e+10
## 412 TROPICAL STORM 7.7039e+09
## 476 WINTER STORM 6.6885e+09
## 196 HIGH WIND 5.2700e+09
The top 10 events are responsible for 88.37176% of the property damage in the United States from 1950 to 2011. In fact, the first one, flood, is by itself responsible for 33.85252% of all property damage. From these results, we can say that the most relevant events with respect to property damage are flood, hurricane and tornado, which together account for 63.39563% of all property damage.
Now, we plot PROPDMG for every year. In the top plot of Figure 3 we plot the total property damage for all events together with that of particular events. Here we take the leading events in the list above, namely flood, hurricane and tornado. Note that we group all event types containing the word FLOOD into a single event for a better understanding of flood effects.
year_Fat_alleventPROP <- ddply(dataDMG, .(year), summarise, mean.count = sum(PROPDMG))
year_Fat_tornadoPROP <- ddply(dataDMG[dataDMG$EVTYPE == "TORNADO",], .(year), summarise, sum = sum(PROPDMG))
year_Fat_FLOOD_allPROP <- ddply(dataDMG[grep("FLOOD", dataDMG$EVTYPE),], .(year), summarise, sum = sum(PROPDMG))
year_Fat_hur_allPROP <- ddply(dataDMG[grep("HURRICANE", dataDMG$EVTYPE),], .(year), summarise, sum = sum(PROPDMG))
In this plot, the small black rectangles show the total property damage for each year, the red circles are the property damage due to tornado, the blue circles are property damage from flood-related events and the green circles are property damage from hurricanes. From this plot we conclude that for the years 1950-1992 tornado was the most relevant event affecting the economy; after 1992 other weather events, such as flood and hurricane, became more important. Here we have the same problem as for injuries: fewer events were recorded for the earlier years. However, we conclude that from the economic point of view it is important to pay more attention to protecting properties against flood, hurricane and tornado. Flood and tornado tend to occur every year. Hurricane is the only one that does not cause damage every year, but when it does the economic consequences are catastrophic.
To complete the answer to the question about events with a strong economic impact, let us study the variable CROPDMG (crop damage). Below, I list the 10 most relevant events according to the variable CROPDMG.
FtypeCROP <- ddply(dataDMG, .(EVTYPE), summarise, total = sum(CROPDMG))
ordertotalCROP <- FtypeCROP[order(-FtypeCROP$total),]
CROPPerc <- sum(ordertotalCROP[1:10,2])/sum(ordertotalCROP[,2])*100.
CROPDrought <- sum(ordertotalCROP[1:1,2])/sum(ordertotalCROP[,2])*100.
CROPFlood <- sum(ordertotalCROP[1:3,2])/sum(ordertotalCROP[,2])*100.
head(ordertotalCROP,10)
## EVTYPE total
## 48 DROUGHT 13972566000
## 84 FLOOD 5661968450
## 305 RIVER FLOOD 5029459000
## 233 ICE STORM 5022110000
## 131 HAIL 3000954453
## 210 HURRICANE 2741910000
## 219 HURRICANE/TYPHOON 2607872800
## 72 FLASH FLOOD 1420727100
## 66 EXTREME COLD 1292973000
## 111 FROST/FREEZE 1094086000
The top 10 events are responsible for 85.34739% of the crop damage in the United States from 1950 to 2011. In fact, the first one, drought, is alone responsible for 28.49881% of all crop damage. The most relevant events with respect to crop damage are drought and flood (FLOOD and RIVER FLOOD combined), which together account for 50.30533% of all crop damage.
In the bottom plot of Figure 3 we plot the total crop damage for all events together with that of two particular events, drought and flood.
year_Fat_alleventCROP <- ddply(dataDMG, .(year), summarise, mean.count = sum(CROPDMG))
year_Fat_droughtCROP <- ddply(dataDMG[dataDMG$EVTYPE == "DROUGHT",], .(year), summarise, sum = sum(CROPDMG))
year_Fat_FLOOD_allCROP <- ddply(dataDMG[grep("FLOOD", dataDMG$EVTYPE),], .(year), summarise, sum = sum(CROPDMG))
par(mfrow=c(2,1),mar = c(4, 4, 2, 1))
plot(year_Fat_alleventPROP$year,year_Fat_alleventPROP$mean.count,main= "Figure 3",xlab= "Years", ylab= "Property Damage")
points(year_Fat_tornadoPROP$year,year_Fat_tornadoPROP$sum, col="red")
points(year_Fat_hur_allPROP$year,year_Fat_hur_allPROP$sum, col="green")
points(year_Fat_FLOOD_allPROP$year,year_Fat_FLOOD_allPROP$sum, col="blue")
legend("topleft",c("Tornado","Hurricane","Flood"), pch =1,col=c("red","green","blue"))
legend("top",c("All events"),lty=c(1), lwd=c(2.5),col=c("black"))
plot(year_Fat_alleventCROP$year,year_Fat_alleventCROP$mean.count,xlab= "Years", ylab= "Crop Damage")
points(year_Fat_droughtCROP$year,year_Fat_droughtCROP$sum, col="red")
points(year_Fat_FLOOD_allCROP$year,year_Fat_FLOOD_allCROP$sum, col="blue")
legend("topleft",c("Drought","Flood"), pch =1,col=c("red","blue"))
legend("top",c("All events"),lty=c(1), lwd=c(2.5),col=c("black"))
In this plot, the small black rectangles show the total crop damage for each year, the red circles are the crop damage due to drought and the blue circles are crop damage from flood-related events. Note that for crop damage we do not have data for the earlier years. We conclude that for the years after 1992 drought and flood are the main concerns for crop production. We therefore recommend developing programs to mitigate the effect of drought on crop production. Flood is hard to predict; the impact of drought, however, can be reduced with long-term programs that provide water to keep crop production at a good rate.
We may also want to know the distribution of events per state in the United States. This can be done using the variable STATE and will be part of a future report.