Reproducible Research Based on the Storm Data

Which types of events are most harmful to society?

Synopsis

The data are collected under National Weather Service Instruction (NWSI). They record storms, floods, heat and other kinds of events that affected public health or the economy from 1950 to 2011, with 902297 observations in the data set. For the analysis I build one index for the impact on public health and another for the economic consequences, and then list the 40 most harmful events for public health and for the economy. The conclusions are summarized at the end of this report.

Data Processing

The code is shown below; copy and paste it into the Console window and run it. This step may take some time because the data contain over 900 thousand observations, so please be patient.

setwd("C:/Users/Administrator/Desktop/repdata_data_StormData.csv")
### read the data; keep text columns as factors so that levels() works below
storm <- read.csv("repdata_data_StormData.csv", stringsAsFactors = TRUE)

A brief look at the first rows of the storm data:

head(storm)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

Computing the index

A storm may lead to fatalities and injuries, which show its influence on public health. The total fatalities according to Storm Data:

sum(storm$FATALITIES)
## [1] 15145

The total injuries according to Storm Data:

sum(storm$INJURIES)
## [1] 140528

This shows that tornadoes, floods and other events caused about 15 thousand deaths and 140 thousand injuries from 1950 to 2011.

A suitable index is: \[ index_{health}=fatalities \cdot k+injuries \] where k is the ratio of total injuries to total fatalities:

\[ k=\frac{total\ injuries}{total\ fatalities} \] Because k is the average number of injuries per fatality, the index expresses fatalities in injury-equivalent units. It combines fatalities and injuries into one number, which makes different events easy to compare.

The code to compute this index is as follows:

### compute the ratio k
k <- sum(storm$INJURIES)/sum(storm$FATALITIES)
### compute the index_health
index_health <- storm$FATALITIES * k + storm$INJURIES
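
As a small worked illustration of the index (a sketch, not part of the original analysis), the block below plugs the totals reported above into k and applies the formula to three made-up events. The data frame `toy` and its values are hypothetical; only the two totals come from the output above.

```r
# Totals reported above for the full storm data set
total_fatalities <- 15145
total_injuries <- 140528

# k is the average number of injuries per fatality
k <- total_injuries / total_fatalities

# Three hypothetical events, made up for illustration only
toy <- data.frame(FATALITIES = c(0, 1, 10),
                  INJURIES = c(5, 0, 20))

# Same formula as index_health: fatalities are weighted by k
toy$index_health <- toy$FATALITIES * k + toy$INJURIES

print(round(k, 3))
print(toy)
```

An event with one fatality and no injuries scores exactly k, about 9.28, so a single fatality weighs as much as roughly nine injuries in this index.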

The index for the economy is easier to compute than the index for public health. It is based on property damage, stored in the variable PROPDMG of the data set. The variable PROPDMGEXP gives the magnitude of PROPDMG: alphabetical characters signify the magnitude, with “K” for thousands, “M” for millions, and “B” for billions.

However, PROPDMGEXP is not clean enough for this research:

levels(storm$PROPDMGEXP)
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"

As we can see, there are other levels such as “-”, “?” and “+”. I want to treat them as missing values and ignore them. Before ignoring them, I have to see how many values are ambiguously defined. The count of each level of PROPDMGEXP is:

tapply(storm$PROPDMGEXP, storm$PROPDMGEXP, length)
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330

In total they do not seem significant. The proportion of these levels is:

a <- tapply(storm$PROPDMGEXP, storm$PROPDMGEXP, length)
### share of all levels other than "", "B", "K" and "M"
sum(a[!(names(a) %in% c("", "B", "K", "M"))])/sum(a)
## [1] 0.0003635

This shows that about 0.04% of the data are ambiguously defined, so ignoring them is acceptable.
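
One way to treat the ambiguous codes as missing values, sketched here on made-up records rather than on the storm data, is to look each code up in a named vector of multipliers, so that every code without an entry becomes NA. The names `multiplier`, `toy`, `codes` and `damage_usd` are assumptions for illustration, and including “K” goes beyond the analysis in this report, which later uses only “M” and “B”.

```r
# Multipliers for the three magnitude codes described above
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical records, including two codes with no defined magnitude
toy <- data.frame(PROPDMG = c(25, 2.5, 1.2, 3, 7),
                  PROPDMGEXP = c("K", "M", "B", "?", ""),
                  stringsAsFactors = FALSE)

# Look up by name; as.character() guards against factor columns,
# which would otherwise be used as integer positions
codes <- as.character(toy$PROPDMGEXP)
toy$damage_usd <- toy$PROPDMG * unname(multiplier[codes])

print(toy)

# Total damage over the records with a defined magnitude
print(sum(toy$damage_usd, na.rm = TRUE))
```

Codes that are not names of `multiplier`, including the empty string, come back as NA from the lookup, so they drop out of any sum computed with `na.rm = TRUE`.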

Build the data frame for result

The original storm data set is too large to analyze conveniently, so I reduce it to smaller data frames for the computations in the next section.

The index for public health shows the influence on public health, but many events have no effect on public health at all.

This can be seen in the summary of the index:

summary(index_health)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0       0       0    5410

The median and the third quartile are both 0, which shows that at least three quarters of the events did not cause any fatalities or injuries.

The number of events that caused fatalities or injuries is:

sum(index_health > 0)
## [1] 21929

21929 events, about 2.4% of all observations, had an influence on public health. I write them into a data frame called data_health for further analysis.

data_health1 <- storm[index_health > 0, ]
### keep the data which index_health>0
data_health <- data.frame(data_health1$EVTYPE, data_health1$BGN_DATE, data_health1$STATE, 
    index_health[index_health > 0])
names(data_health) <- c("EVTYPE", "BGN_DATE", "STATE", "index_health")

This data frame includes the event type, date, state and index of public health.

For the economic data I use only the events whose damage has a magnitude of “million” or “billion”:

data_eco1 <- storm[storm$PROPDMGEXP == "M" | storm$PROPDMGEXP == "B", ]

Convert the variable PROPDMG into dollars:

### if magnitude is 'M', multiply by 10^6; otherwise it is 'B', multiply by 10^9
index_eco_consequance <- data_eco1$PROPDMG * ifelse(data_eco1$PROPDMGEXP == "M", 
    10^6, 10^9)

Finally, write the data into a data frame called data_eco:

data_eco <- data.frame(data_eco1$EVTYPE, data_eco1$BGN_DATE, data_eco1$STATE, 
    index_eco_consequance)
names(data_eco) <- c("EVTYPE", "BGN_DATE", "STATE", "index_eco")

This data frame includes the event type, date, state and index of economy.

Result

head(data_health[order(data_health$index_health, decreasing = T), ], 40)
##                  EVTYPE           BGN_DATE STATE index_health
## 7667               HEAT  7/12/1995 0:00:00    IL       5409.6
## 21439           TORNADO  5/22/2011 0:00:00    MO       2616.1
## 6360            TORNADO  4/10/1979 0:00:00    TX       2089.7
## 3112            TORNADO   6/9/1953 0:00:00    MA       2063.1
## 3153            TORNADO   6/8/1953 0:00:00    MI       1861.3
## 5988            TORNADO  5/11/1953 0:00:00    TX       1654.8
## 8476          ICE STORM   2/8/1994 0:00:00    OH       1577.3
## 4812            TORNADO   4/3/1974 0:00:00    OH       1484.0
## 21396           TORNADO  4/27/2011 0:00:00    AL       1208.3
## 3590            TORNADO   3/3/1966 0:00:00    MS       1032.9
## 2274            TORNADO  5/25/1955 0:00:00    KS        965.9
## 12435    EXCESSIVE HEAT  7/28/1999 0:00:00    IL        918.6
## 21306           TORNADO  4/27/2011 0:00:00    AL        885.6
## 16458 HURRICANE/TYPHOON  8/13/2004 0:00:00    FL        845.0
## 12840    EXCESSIVE HEAT   7/4/1999 0:00:00    PA        821.6
## 11965             FLOOD 10/17/1998 0:00:00    TX        818.6
## 1616            TORNADO  4/21/1967 0:00:00    IL        806.2
## 417             TORNADO  3/21/1952 0:00:00    AR        788.9
## 12612    EXCESSIVE HEAT  7/18/1999 0:00:00    MO        786.7
## 11978             FLOOD 10/17/1998 0:00:00    TX        750.0
## 6212            TORNADO  5/11/1970 0:00:00    TX        741.2
## 1847            TORNADO  4/11/1965 0:00:00    IN        717.7
## 11964             FLOOD 10/17/1998 0:00:00    TX        702.1
## 3641            TORNADO  2/21/1971 0:00:00    MS        689.4
## 520             TORNADO  5/15/1968 0:00:00    AR        665.5
## 311             TORNADO 11/15/1989 0:00:00    AL        657.9
## 1611            TORNADO  4/21/1967 0:00:00    IL        632.7
## 3520            TORNADO  12/5/1953 0:00:00    MS        622.6
## 8623     EXCESSIVE HEAT   7/1/1995 0:00:00    PA        621.7
## 1765            TORNADO  8/28/1990 0:00:00    IL        619.1
## 2356            TORNADO   6/8/1966 0:00:00    KS        598.5
## 3640            TORNADO  2/21/1971 0:00:00    MS        574.0
## 2121            TORNADO  5/15/1968 0:00:00    IA        570.6
## 11972             FLOOD 10/17/1998 0:00:00    TX        555.7
## 11009           TORNADO   4/8/1998 0:00:00    AL        554.9
## 11968             FLOOD 10/17/1998 0:00:00    TX        550.0
## 2573            TORNADO   4/3/1974 0:00:00    KY        544.6
## 1841            TORNADO  4/11/1965 0:00:00    IN        539.6
## 1846            TORNADO  4/11/1965 0:00:00    IN        539.6
## 18665    EXCESSIVE HEAT   8/4/2007 0:00:00    MO        537.6

Counting the event types among the first 40 rows of this data and drawing a bar plot shows the types of events most harmful to population health:

plot_data2 <- head(data_health[order(data_health$index_health, decreasing = T), 
    ], 40)
### write the first 40 rows into plot_data2
aaa <- tapply(plot_data2$index_health, plot_data2$EVTYPE, length)
aaa <- sort(aaa, decreasing = T)
names_health <- names(aaa)
### get the names of events in the plot_data2

plot_data2 <- plot_data2[plot_data2$EVTYPE %in% names_health, ]
library(ggplot2)
### use package ggplot2
g <- ggplot(plot_data2, aes(EVTYPE))
### draw the picture; labs(title = ...) replaces the deprecated opts(title = ...)
g + geom_bar(aes(fill = EVTYPE)) + labs(x = "types of events", y = "count of types in first 40 harmful types", 
    title = "the most Harmful Types of Events for Public Health")

Plot: count of each event type among the 40 events most harmful to public health

This shows that tornadoes, floods and excessive heat are the most harmful to population health.
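
The counting step behind the bar plot can be sketched in base R on a small made-up data frame: sort by the index, keep the top rows, and tabulate the event types. The data frame `toy` and the cut-off of 3 rows are hypothetical, chosen only to keep the example short.

```r
# Hypothetical events with an already computed health index
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "HAIL"),
                  index_health = c(50, 40, 30, 20, 10),
                  stringsAsFactors = FALSE)

# Keep the 3 rows with the largest index (the report uses 40)
top <- head(toy[order(toy$index_health, decreasing = TRUE), ], 3)

# Count how often each event type appears among the top rows
counts <- sort(table(top$EVTYPE), decreasing = TRUE)
print(counts)
```

Note that this ranks event types by how often they appear among the top rows, not by their total index, so a type with one extreme event counts the same as a type with one moderate event.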

head(data_eco[order(data_eco$index_eco, decreasing = T), ], 40)
##                           EVTYPE           BGN_DATE STATE index_eco
## 8081                       FLOOD   1/1/2006 0:00:00    CA 1.150e+11
## 7839                 STORM SURGE  8/29/2005 0:00:00    LA 3.130e+10
## 7838           HURRICANE/TYPHOON  8/28/2005 0:00:00    LA 1.693e+10
## 7885                 STORM SURGE  8/29/2005 0:00:00    MS 1.126e+10
## 7768           HURRICANE/TYPHOON 10/24/2005 0:00:00    FL 1.000e+10
## 7884           HURRICANE/TYPHOON  8/28/2005 0:00:00    MS 7.350e+09
## 7886           HURRICANE/TYPHOON  8/29/2005 0:00:00    MS 5.880e+09
## 7269           HURRICANE/TYPHOON  8/13/2004 0:00:00    FL 5.420e+09
## 6361              TROPICAL STORM   6/5/2001 0:00:00    TX 5.150e+09
## 2774                WINTER STORM  3/12/1993 0:00:00    AL 5.000e+09
## 3011                 RIVER FLOOD  8/31/1993 0:00:00    IL 5.000e+09
## 7275           HURRICANE/TYPHOON   9/4/2004 0:00:00    FL 4.830e+09
## 7282           HURRICANE/TYPHOON  9/13/2004 0:00:00    FL 4.000e+09
## 7843           HURRICANE/TYPHOON  9/23/2005 0:00:00    LA 4.000e+09
## 9635            STORM SURGE/TIDE  9/12/2008 0:00:00    TX 4.000e+09
## 4725                       FLOOD  4/18/1997 0:00:00    ND 3.000e+09
## 5631                   HURRICANE  9/15/1999 0:00:00    NC 3.000e+09
## 10934                    TORNADO  5/22/2011 0:00:00    MO 2.800e+09
## 3161   HEAVY RAIN/SEVERE WEATHER   5/8/1995 0:00:00    LA 2.500e+09
## 7224           HURRICANE/TYPHOON  9/13/2004 0:00:00    AL 2.500e+09
## 2918              HURRICANE OPAL  10/3/1995 0:00:00    FL 2.100e+09
## 8031           HURRICANE/TYPHOON  9/23/2005 0:00:00    TX 2.090e+09
## 11012                      FLOOD   5/1/2011 0:00:00    TN 2.000e+09
## 10504                       HAIL  10/5/2010 0:00:00    AZ 1.800e+09
## 5419                   HURRICANE  9/21/1998 0:00:00    PR 1.700e+09
## 2919  TORNADOES, TSTM WIND, HAIL  3/12/1993 0:00:00    FL 1.600e+09
## 5931            WILD/FOREST FIRE   5/4/2000 0:00:00    NM 1.500e+09
## 7752           HURRICANE/TYPHOON   7/9/2005 0:00:00    FL 1.500e+09
## 10284                      FLOOD   5/1/2010 0:00:00    TN 1.500e+09
## 10906                    TORNADO  4/27/2011 0:00:00    AL 1.500e+09
## 7272                   HIGH WIND  8/13/2004 0:00:00    FL 1.300e+09
## 3902         SEVERE THUNDERSTORM   5/5/1995 0:00:00    TX 1.200e+09
## 6750                    WILDFIRE 10/25/2003 0:00:00    CA 1.040e+09
## 2923              HURRICANE OPAL  10/4/1995 0:00:00    FL 1.000e+09
## 6726                 FLASH FLOOD   5/7/2003 0:00:00    AL 1.000e+09
## 7677           HURRICANE/TYPHOON  8/27/2005 0:00:00    AL 1.000e+09
## 9634                   HURRICANE  9/12/2008 0:00:00    TX 1.000e+09
## 10743                    TORNADO  4/27/2011 0:00:00    AL 1.000e+09
## 11009                      FLOOD   5/1/2011 0:00:00    MS 1.000e+09
## 7270                   HIGH WIND  8/13/2004 0:00:00    FL 9.290e+08

Counting the event types among the first 40 rows of this data and drawing a bar plot shows the types of events most harmful to the economy:

plot_data <- head(data_eco[order(data_eco$index_eco, decreasing = T), ], 40)
aaa <- tapply(plot_data$EVTYPE, plot_data$EVTYPE, length)
aaa <- sort(aaa, decreasing = T)
### keep the 7 most frequent event types
names_eco <- names(aaa[1:7])
plot_data <- plot_data[plot_data$EVTYPE %in% names_eco, ]

g <- ggplot(plot_data, aes(EVTYPE))
### labs(title = ...) replaces the deprecated opts(title = ...)
g + geom_bar(aes(fill = EVTYPE)) + labs(x = "types of events", y = "count of types in first 40 harmful types", 
    title = "the most Harmful Types of Events for Economy")

Plot: count of each of the 7 most frequent event types among the 40 events most harmful to the economy

This shows that hurricanes/typhoons, floods and tornadoes are the most harmful to the economy.

Tornadoes, floods and excessive heat had the biggest influence on public health.

Hurricanes/typhoons, floods and tornadoes had the biggest influence on the economy.