The data were collected under National Weather Service Instruction (NWSI). They cover storms, floods, heat and other kinds of events that affected public health or the economy from 1950 to 2011, with 902297 observations in the data set. For the analysis I build one index for the impact on public health and another for the economic consequences. I show the 40 most harmful events for public health and for the economy, and the conclusions are stated at the end.
The code is given below; just copy and paste it into the Console window and run it. This step may take some time because the data contain over 900 thousand observations, so please be patient.
setwd("C:/Users/Administrator/Desktop/repdata_data_StormData.csv")
### setwd() needs a directory; change this path to the folder that holds the CSV file
storm <- read.csv("repdata_data_StormData.csv", sep = ",")
We can take a brief look at the head of the storm data.
head(storm)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
A storm may lead to fatalities and injuries, which show its influence on public health. The total fatalities according to the storm data:
sum(storm$FATALITIES)
## [1] 15145
The total injuries according to Storm Data:
sum(storm$INJURIES)
## [1] 140528
This shows that tornadoes, floods and other events caused about 15 thousand deaths and 140 thousand injuries from 1950 to 2011.
It is suitable to build an index as follows: \[ index_{health}=fatalities \cdot k+injuries \] where k is the ratio of total injuries to total fatalities:
\[ k=\frac{total\ injuries}{total\ fatalities} \] This weights each fatality by the average number of injuries per fatality (about 9.28 here), so the index combines fatalities and injuries into one number that is easy to compare across events.
The code to compute this index is as follows:
k = sum(storm$INJURIES)/sum(storm$FATALITIES)
### compute the ratio k
index_health <- storm$FATALITIES * k + storm$INJURIES
### compute the index_health
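As a small illustration of how the index behaves, here is a minimal sketch on a made-up table. The data frame `toy` and the name `k_toy` are assumptions introduced only for this example; the numbers are invented and are not rows from the storm data.

```r
# Toy data, invented for illustration only (not rows from the storm data set)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "FLOOD", "HAIL"),
  FATALITIES = c(2, 10, 0, 0),
  INJURIES   = c(30, 0, 6, 0)
)

# k: average number of injuries per fatality over the whole table
k_toy <- sum(toy$INJURIES) / sum(toy$FATALITIES)   # 36 / 12

# index: each fatality counts as k injuries
toy$index_health <- toy$FATALITIES * k_toy + toy$INJURIES

# rank the events from most to least harmful
print(toy[order(toy$index_health, decreasing = TRUE), ])
```

With this choice of k, the fatality terms and the injury terms add up to the same total over the whole table, so neither component dominates the index by construction.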
The index for the economy is easier to compute than the index for public health. Property damage is recorded in the variable PROPDMG of the data set. PROPDMGEXP gives the magnitude of PROPDMG: alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.
However, PROPDMGEXP is not clean enough for the analysis:
levels(storm$PROPDMGEXP)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
As we can see, there are other levels such as "-", "?", "+" etc. I treat them as missing values and ignore them. Before ignoring them, I check how many values are ambiguously defined. The count for every level of PROPDMGEXP is:
tapply(storm$PROPDMGEXP, storm$PROPDMGEXP, length)
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
They do not seem significant in total. The proportion of these levels is:
a <- tapply(storm$PROPDMGEXP, storm$PROPDMGEXP, length)
### drop the empty code and the documented codes B, K and M, then sum the rest
sum(a[!(names(a) %in% c("", "B", "K", "M"))])/sum(a)
## [1] 0.0003635
This shows that about 0.04% of the data are ambiguously defined, so ignoring them is reasonable.
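The idea of treating the undocumented codes as missing values can be sketched on a short made-up vector. The names `codes`, `documented`, `ambiguous`, `share` and `cleaned` are assumptions for this example only, and the values are invented rather than taken from the storm data.

```r
# Toy magnitude codes, invented for illustration (not from the storm data set)
codes <- c("K", "M", "", "?", "B", "+", "K", "5")

# Codes whose meaning is documented: thousands, millions, billions
documented <- c("K", "M", "B")

# A code is ambiguous if it is neither empty nor documented
ambiguous <- !(codes %in% c("", documented))

# Share of ambiguous entries in the toy vector
share <- mean(ambiguous)
print(share)

# Replace ambiguous codes by NA so later steps can skip them
cleaned <- ifelse(ambiguous, NA, codes)
print(cleaned)
```

Using `%in%` with an explicit list of documented codes avoids relying on the position of each level, which can change if the data are read differently.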
The original storm data set is too large to analyze conveniently, so I reduce it before the computations in the next section.
The index for public health shows the influence on public health, but many events have no effect on public health.
This can be seen in the summary of the index:
summary(index_health)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 5410
The median and even the third quartile are 0, which shows that at least three quarters of the events did not cause any fatalities or injuries.
The number of events that caused fatalities or injuries is:
sum(index_health > 0)
## [1] 21929
21929 events affected public health. I put them into a data frame called data_health for further analysis.
data_health1 <- storm[index_health > 0, ]
### keep the data which index_health>0
data_health <- data.frame(data_health1$EVTYPE, data_health1$BGN_DATE, data_health1$STATE,
index_health[index_health > 0])
names(data_health) <- c("EVTYPE", "BGN_DATE", "STATE", "index_health")
This data frame includes the event type, date, state, and the index of public health.
For the economy data I use only the events whose damage magnitude is "billion" or "million":
data_eco1 <- storm[storm$PROPDMGEXP == "M" | storm$PROPDMGEXP == "B", ]
Convert the variable PROPDMG into dollars:
index_eco_consequance <- 0
for (i in 1:dim(data_eco1)[1]) {
    if (data_eco1$PROPDMGEXP[i] == "M") {
        index_eco_consequance[i] <- data_eco1$PROPDMG[i] * 10^6
        #### if magnitude is 'M',multiply 10^6
    } else {
        index_eco_consequance[i] <- data_eco1$PROPDMG[i] * 10^9
        #### if magnitude is 'B',multiply 10^9
    }
}
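The same conversion can also be written without a loop. The following is a minimal sketch of one alternative, not the code used for the results in this report; the data frame `toy_eco` and the lookup vector `multiplier` are assumptions made for the example, and the rows are invented.

```r
# Toy rows, invented for illustration (not taken from the storm data set)
toy_eco <- data.frame(
  PROPDMG    = c(2.5, 1.2, 10, 3),
  PROPDMGEXP = c("M", "B", "M", "B")
)

# Named lookup vector: magnitude code -> multiplier in dollars
multiplier <- c(M = 1e6, B = 1e9)

# Look up the multiplier for every row at once.
# as.character() matters when PROPDMGEXP is a factor: indexing by a factor
# would use its integer codes instead of its labels.
toy_eco$index_eco <- toy_eco$PROPDMG *
  unname(multiplier[as.character(toy_eco$PROPDMGEXP)])

print(toy_eco)
```

A lookup vector keeps the code-to-multiplier mapping in one place, so adding another code such as "K" would only require one more entry.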
Finally, write the data into a data frame called data_eco:
data_eco <- data.frame(data_eco1$EVTYPE, data_eco1$BGN_DATE, data_eco1$STATE,
index_eco_consequance)
names(data_eco) <- c("EVTYPE", "BGN_DATE", "STATE", "index_eco")
This data frame includes the event type, date, state, and the index of the economy.
head(data_health[order(data_health$index_health, decreasing = T), ], 40)
## EVTYPE BGN_DATE STATE index_health
## 7667 HEAT 7/12/1995 0:00:00 IL 5409.6
## 21439 TORNADO 5/22/2011 0:00:00 MO 2616.1
## 6360 TORNADO 4/10/1979 0:00:00 TX 2089.7
## 3112 TORNADO 6/9/1953 0:00:00 MA 2063.1
## 3153 TORNADO 6/8/1953 0:00:00 MI 1861.3
## 5988 TORNADO 5/11/1953 0:00:00 TX 1654.8
## 8476 ICE STORM 2/8/1994 0:00:00 OH 1577.3
## 4812 TORNADO 4/3/1974 0:00:00 OH 1484.0
## 21396 TORNADO 4/27/2011 0:00:00 AL 1208.3
## 3590 TORNADO 3/3/1966 0:00:00 MS 1032.9
## 2274 TORNADO 5/25/1955 0:00:00 KS 965.9
## 12435 EXCESSIVE HEAT 7/28/1999 0:00:00 IL 918.6
## 21306 TORNADO 4/27/2011 0:00:00 AL 885.6
## 16458 HURRICANE/TYPHOON 8/13/2004 0:00:00 FL 845.0
## 12840 EXCESSIVE HEAT 7/4/1999 0:00:00 PA 821.6
## 11965 FLOOD 10/17/1998 0:00:00 TX 818.6
## 1616 TORNADO 4/21/1967 0:00:00 IL 806.2
## 417 TORNADO 3/21/1952 0:00:00 AR 788.9
## 12612 EXCESSIVE HEAT 7/18/1999 0:00:00 MO 786.7
## 11978 FLOOD 10/17/1998 0:00:00 TX 750.0
## 6212 TORNADO 5/11/1970 0:00:00 TX 741.2
## 1847 TORNADO 4/11/1965 0:00:00 IN 717.7
## 11964 FLOOD 10/17/1998 0:00:00 TX 702.1
## 3641 TORNADO 2/21/1971 0:00:00 MS 689.4
## 520 TORNADO 5/15/1968 0:00:00 AR 665.5
## 311 TORNADO 11/15/1989 0:00:00 AL 657.9
## 1611 TORNADO 4/21/1967 0:00:00 IL 632.7
## 3520 TORNADO 12/5/1953 0:00:00 MS 622.6
## 8623 EXCESSIVE HEAT 7/1/1995 0:00:00 PA 621.7
## 1765 TORNADO 8/28/1990 0:00:00 IL 619.1
## 2356 TORNADO 6/8/1966 0:00:00 KS 598.5
## 3640 TORNADO 2/21/1971 0:00:00 MS 574.0
## 2121 TORNADO 5/15/1968 0:00:00 IA 570.6
## 11972 FLOOD 10/17/1998 0:00:00 TX 555.7
## 11009 TORNADO 4/8/1998 0:00:00 AL 554.9
## 11968 FLOOD 10/17/1998 0:00:00 TX 550.0
## 2573 TORNADO 4/3/1974 0:00:00 KY 544.6
## 1841 TORNADO 4/11/1965 0:00:00 IN 539.6
## 1846 TORNADO 4/11/1965 0:00:00 IN 539.6
## 18665 EXCESSIVE HEAT 8/4/2007 0:00:00 MO 537.6
Counting the event types among the 40 most harmful events and drawing a bar plot, we can see the types of events most harmful to population health:
plot_data2 <- head(data_health[order(data_health$index_health, decreasing = T),
], 40)
### write the top 40 rows into plot_data2
aaa <- tapply(plot_data2$index_health, plot_data2$EVTYPE, length)
aaa <- sort(aaa, decreasing = T)
names_health <- names(aaa)
### get the names of events in the plot_data2
plot_data2 <- plot_data2[plot_data2$EVTYPE %in% names_health, ]
library(ggplot2)
### use package ggplot2
g <- ggplot(plot_data2, aes(EVTYPE))
g + geom_bar(aes(fill = EVTYPE)) + labs(x = "types of events", y = "count of types among the 40 most harmful events") +
ggtitle("The Most Harmful Types of Events for Public Health")
### draw the picture; ggtitle() replaces the deprecated opts(title = ...)
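The counting step behind the bar plot can be checked without any plotting package. Here is a minimal sketch that uses `table()` instead of `tapply()`; the data frame `toy_top`, the cutoff `n` and the other names are assumptions for this example, and the rows are invented.

```r
# Toy ranking, invented for illustration (not rows from the storm data set)
toy_top <- data.frame(
  EVTYPE = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "TORNADO", "FLOOD"),
  index_health = c(90, 80, 70, 60, 50, 40)
)

# Keep the n rows with the largest index
n <- 5
top_n <- head(toy_top[order(toy_top$index_health, decreasing = TRUE), ], n)

# Count how often each event type appears among those rows
counts <- sort(table(top_n$EVTYPE), decreasing = TRUE)
print(counts)
```

The heights of the bars in the plot correspond to these counts, so printing them is a quick way to read exact numbers off the figure.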
head(data_eco[order(data_eco$index_eco, decreasing = T), ], 40)
## EVTYPE BGN_DATE STATE index_eco
## 8081 FLOOD 1/1/2006 0:00:00 CA 1.150e+11
## 7839 STORM SURGE 8/29/2005 0:00:00 LA 3.130e+10
## 7838 HURRICANE/TYPHOON 8/28/2005 0:00:00 LA 1.693e+10
## 7885 STORM SURGE 8/29/2005 0:00:00 MS 1.126e+10
## 7768 HURRICANE/TYPHOON 10/24/2005 0:00:00 FL 1.000e+10
## 7884 HURRICANE/TYPHOON 8/28/2005 0:00:00 MS 7.350e+09
## 7886 HURRICANE/TYPHOON 8/29/2005 0:00:00 MS 5.880e+09
## 7269 HURRICANE/TYPHOON 8/13/2004 0:00:00 FL 5.420e+09
## 6361 TROPICAL STORM 6/5/2001 0:00:00 TX 5.150e+09
## 2774 WINTER STORM 3/12/1993 0:00:00 AL 5.000e+09
## 3011 RIVER FLOOD 8/31/1993 0:00:00 IL 5.000e+09
## 7275 HURRICANE/TYPHOON 9/4/2004 0:00:00 FL 4.830e+09
## 7282 HURRICANE/TYPHOON 9/13/2004 0:00:00 FL 4.000e+09
## 7843 HURRICANE/TYPHOON 9/23/2005 0:00:00 LA 4.000e+09
## 9635 STORM SURGE/TIDE 9/12/2008 0:00:00 TX 4.000e+09
## 4725 FLOOD 4/18/1997 0:00:00 ND 3.000e+09
## 5631 HURRICANE 9/15/1999 0:00:00 NC 3.000e+09
## 10934 TORNADO 5/22/2011 0:00:00 MO 2.800e+09
## 3161 HEAVY RAIN/SEVERE WEATHER 5/8/1995 0:00:00 LA 2.500e+09
## 7224 HURRICANE/TYPHOON 9/13/2004 0:00:00 AL 2.500e+09
## 2918 HURRICANE OPAL 10/3/1995 0:00:00 FL 2.100e+09
## 8031 HURRICANE/TYPHOON 9/23/2005 0:00:00 TX 2.090e+09
## 11012 FLOOD 5/1/2011 0:00:00 TN 2.000e+09
## 10504 HAIL 10/5/2010 0:00:00 AZ 1.800e+09
## 5419 HURRICANE 9/21/1998 0:00:00 PR 1.700e+09
## 2919 TORNADOES, TSTM WIND, HAIL 3/12/1993 0:00:00 FL 1.600e+09
## 5931 WILD/FOREST FIRE 5/4/2000 0:00:00 NM 1.500e+09
## 7752 HURRICANE/TYPHOON 7/9/2005 0:00:00 FL 1.500e+09
## 10284 FLOOD 5/1/2010 0:00:00 TN 1.500e+09
## 10906 TORNADO 4/27/2011 0:00:00 AL 1.500e+09
## 7272 HIGH WIND 8/13/2004 0:00:00 FL 1.300e+09
## 3902 SEVERE THUNDERSTORM 5/5/1995 0:00:00 TX 1.200e+09
## 6750 WILDFIRE 10/25/2003 0:00:00 CA 1.040e+09
## 2923 HURRICANE OPAL 10/4/1995 0:00:00 FL 1.000e+09
## 6726 FLASH FLOOD 5/7/2003 0:00:00 AL 1.000e+09
## 7677 HURRICANE/TYPHOON 8/27/2005 0:00:00 AL 1.000e+09
## 9634 HURRICANE 9/12/2008 0:00:00 TX 1.000e+09
## 10743 TORNADO 4/27/2011 0:00:00 AL 1.000e+09
## 11009 FLOOD 5/1/2011 0:00:00 MS 1.000e+09
## 7270 HIGH WIND 8/13/2004 0:00:00 FL 9.290e+08
Counting the event types among the 40 most harmful events and drawing a bar plot, we can see the types of events that are most harmful to the economy:
plot_data <- head(data_eco[order(data_eco$index_eco, decreasing = T), ], 40)
aaa <- tapply(plot_data$EVTYPE, plot_data$EVTYPE, length)
aaa <- sort(aaa, decreasing = T)
names_eco <- names(aaa[1:7])
plot_data <- plot_data[plot_data$EVTYPE %in% names_eco, ]
g <- ggplot(plot_data, aes(EVTYPE))
g + geom_bar(aes(fill = EVTYPE)) + labs(x = "types of events", y = "count of types among the 40 most harmful events") +
ggtitle("The Most Harmful Types of Events for Economy")
Tornadoes, floods and excessive heat had the greatest impact on public health.
Hurricanes/typhoons, floods and tornadoes had the greatest impact on the economy.