Homework #6

Huiting Ma

This homework will focus on the following parts:

Introduction

This homework is to analyze the Natural Disasters around the world from the year 2000 to 2013. The natural disasters include earthquake, volcano and mass movement. My main focus is to identify which continents or countries have more natural disasters and whether the number of death, injuries, homeless and the population affected are correlated with the damaged amount. The dataset I am going to use is from EM-DAT, which is the International Disaster Database here.

The definition of all variables can be found in the above website, which are:

Data Cleaning

library(plyr)
library(xtable)
library(ggplot2)
library(reshape2)

Supercheck whether data has imported correctly

NaturalDisaster <- read.csv("Disaster.csv") 
str(NaturalDisaster)
## 'data.frame':    1648 obs. of  9 variables:
##  $ Country    : Factor w/ 188 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year       : int  2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
##  $ NumDisaster: int  6 6 16 9 3 14 13 7 4 5 ...
##  $ NumKilled  : int  594 485 4083 137 18 582 382 296 1334 101 ...
##  $ NumInjured : int  0 20 1391 4 40 44 185 20 182 86 ...
##  $ NumAffected: int  2582228 204425 302279 500 2800 37901 2225515 26755 452602 62521 ...
##  $ NumHomeless: int  0 250 10000 4250 2700 6775 8210 3480 180 3250 ...
##  $ TotalDamUSD: int  50 10 0 0 0 5050 0 0 0 20000 ...
##  $ Continent  : Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...

Learn from Jennie, to define a function for converting and printing to HTML table.

htmlPrint <- function(x, ...,
                      digits = 0, include.rownames = FALSE) {
  print(xtable(x, digits = digits, ...), type = 'html',
        include.rownames = include.rownames, ...)
}

Try to count the number of natural disaster over time on different continents

numCountByYear <- daply(NaturalDisaster,~Year + 
                          Continent, summarize, 
                        TotalCount = sum(NumDisaster))
numCountByYear <- as.data.frame(numCountByYear)
htmlPrint(numCountByYear)
Africa Americas Asia Europe Oceania
122 101 195 95 11
114 97 164 53 19
111 116 187 90 20
83 89 146 51 17
85 89 169 39 19
100 95 186 99 14
118 73 183 51 18
108 103 159 66 9
104 96 147 33 12
93 84 142 42 19
100 109 142 70 15
76 97 150 18 11
68 85 147 64 11
19 16 48 11 5

Seems 2013 does not have a lot of observations, Let's drop 2013.

NDisaster <- droplevels(subset(NaturalDisaster,Year != "2013"))
table(NDisaster$Year) #Check whether 2013 has dropped
## 
## 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 
##  128  120  128  118  122  139  106  134  121  114  135  105  120

Try to count the number of countries in each continents

numCountries <- ddply(NDisaster, ~Continent, summarize, numCoutries = length(unique(Country)))
htmlPrint(numCountries)
Continent numCoutries
Africa 50
Americas 37
Asia 45
Europe 40
Oceania 16

Now, let us try to plot a graph and visualize the results

ggplot(NaturalDisaster, aes(x = Year, y = NumDisaster, color = Year)) +
  geom_jitter() + facet_wrap(~ Continent) +
  ggtitle("How is Number Disasters Changing over Time on Diffferent Continents")

plot of chunk unnamed-chunk-8

Based on the graph, it is found that “Oceania” does not have a lot of data, drop it!

NDisaster <- droplevels(subset(NaturalDisaster,Continent != "Oceania"))
table(NDisaster$Continent) #Check whether "Oceania" has dropped
## 
##   Africa Americas     Asia   Europe 
##      501      339      399      316

Also, I found there are some missing data for Number of affected. Let's drop these data!

NDisaster <- droplevels(subset(NDisaster,NumAffected != "0"))
str(NDisaster)
## 'data.frame':    1280 obs. of  9 variables:
##  $ Country    : Factor w/ 165 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year       : int  2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
##  $ NumDisaster: int  6 6 16 9 3 14 13 7 4 5 ...
##  $ NumKilled  : int  594 485 4083 137 18 582 382 296 1334 101 ...
##  $ NumInjured : int  0 20 1391 4 40 44 185 20 182 86 ...
##  $ NumAffected: int  2582228 204425 302279 500 2800 37901 2225515 26755 452602 62521 ...
##  $ NumHomeless: int  0 250 10000 4250 2700 6775 8210 3480 180 3250 ...
##  $ TotalDamUSD: int  50 10 0 0 0 5050 0 0 0 20000 ...
##  $ Continent  : Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...

Since there also a lot of missing data for Total damage, let's drop these values.

Disaster <- droplevels(subset(NDisaster,TotalDamUSD != "0"))
str(Disaster)
## 'data.frame':    561 obs. of  9 variables:
##  $ Country    : Factor w/ 121 levels "Afghanistan",..: 1 1 1 1 1 2 2 3 3 3 ...
##  $ Year       : int  2000 2001 2005 2009 2011 2002 2004 2001 2002 2005 ...
##  $ NumDisaster: int  6 6 14 5 4 3 2 1 3 4 ...
##  $ NumKilled  : int  594 485 582 101 83 7 3 921 48 34 ...
##  $ NumInjured : int  0 20 44 86 115 0 0 423 20 5 ...
##  $ NumAffected: int  2582228 204425 37901 62521 1753000 192110 2500 45000 2285 70 ...
##  $ NumHomeless: int  0 250 6775 3250 9700 0 0 0 0 1750 ...
##  $ TotalDamUSD: int  50 10 5050 20000 142000 17500 173 300000 1500 7256 ...
##  $ Continent  : Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 4 4 1 1 1 ...

Since I want to fit linear regression model in the second script, it is necessary to make sure all countries should have multiple observations. I will delete the countries with only 2 or less observations.

NumObsCountry <- ddply(Disaster, ~Country, summarize, numobs = length(Year))
CountryDrop <- droplevels(subset(NumObsCountry,!(numobs > 2)))
Disaster <- droplevels(subset(Disaster,
                              !(Country %in% CountryDrop$Country)))
DropNumObsCountry <- ddply(Disaster, ~Country, summarize, numobs = length(Year))
htmlPrint(DropNumObsCountry)
Country numobs
Afghanistan 5
Algeria 7
Argentina 7
Bangladesh 4
Belize 4
Bolivia 7
Bosnia-Hercegovenia 3
Brazil 11
Bulgaria 3
Cambodia 5
Canada 7
Chile 9
China P Rep 14
Colombia 5
Costa Rica 6
Cuba 6
Czech Rep 3
Dominican Rep 6
Ecuador 5
El Salvador 5
France 8
Georgia 3
Germany 3
Greece 4
Guatemala 6
Haiti 6
Honduras 8
Hungary 5
India 13
Indonesia 14
Iran Islam Rep 9
Italy 6
Jamaica 6
Japan 12
Kazakhstan 5
Kenya 5
Korea Dem P Rep 6
Korea Rep 9
Madagascar 8
Malaysia 4
Mexico 13
Moldova Rep 4
Mongolia 3
Morocco 4
Mozambique 6
Myanmar 5
Nepal 5
Nicaragua 3
Nigeria 7
Pakistan 8
Panama 3
Paraguay 3
Peru 3
Philippines 14
Poland 5
Puerto Rico 3
Romania 4
Russia 11
South Africa 7
Spain 9
Sri Lanka 8
Sudan 3
Switzerland 4
Taiwan (China) 9
Tajikistan 11
Thailand 11
Turkey 7
Ukraine 7
United Kingdom 8
United States 14
Venezuela 5
Viet Nam 13
Zimbabwe 4

Yes, countries with only few observations have been dropped. The last thing I want to do is to reorder continents based on the total nunmber of disasters happened.

Disaster <- within(Disaster, Continent <- reorder(Continent, NumDisaster, sum))
Disaster <- arrange(Disaster,Continent)

Data Aggregation and Visulation

In this stage, dataset has been sucessfully cleaned to the format that I want. Let us get into Exploratory Data Analysis stage.

First, Learn for JB, for each country, write stripplots to file to compare the number of disasters vs. year plot of chunk unnamed-chunk-15plot of chunk unnamed-chunk-15plot of chunk unnamed-chunk-15plot of chunk unnamed-chunk-15plot of chunk unnamed-chunk-15plot of chunk unnamed-chunk-15plot of chunk unnamed-chunk-15plot of chunk unnamed-chunk-15plot of chunk unnamed-chunk-15plot of chunk unnamed-chunk-15plot of chunk unnamed-chunk-15plot of chunk unnamed-chunk-15

Based on the graphs above, I did not find there is an obvious relationship between the number of disasters and year in different countries.

Second, try to get the spread of death within the continents.

spreaddeath <- ddply(Disaster, ~ Continent, summarize,
                     sdNumKilled = sd(NumKilled), 
                     iqrNumKilled = IQR(NumKilled))
htmlPrint(arrange(spreaddeath, sdNumKilled))
Continent sdNumKilled iqrNumKilled
Africa 264 147
Europe 6802 42
Asia 17676 634
Americas 18674 126

It can be seen that there are a lot of variation for the number of killed people in natural disasters.

newspread <- melt(spreaddeath, id="Continent")
ggplot(newspread, aes(x = Continent, y = value, colour = variable)) + 
     geom_point() + geom_line(aes(x=as.numeric(Continent)))+ ylab("spread") +
     ggtitle("Measure of Spread")

plot of chunk unnamed-chunk-17

Then,try to produce the maximum and minimum statistics for different variables in all continents for all continents

maxmean<- ddply(Disaster, ~ Continent, summarize, 
                meanDamage = mean(TotalDamUSD),
                maxDamage= max(TotalDamUSD),
                meanAffPop = mean(NumAffected),
                maxAffPop = max(NumAffected),
                meanInjured = mean(NumInjured),
                maxInjured = max(NumInjured),
                meanDeath = mean(NumKilled),
                maxDeath = max(NumKilled),
                meanHomeless = mean(NumHomeless),
                maxHomeless = max(NumHomeless)
)
htmlPrint(arrange(maxmean,maxDamage))
Continent meanDamage maxDamage meanAffPop maxAffPop meanInjured maxInjured meanDeath maxDeath meanHomeless maxHomeless
Africa 118076 779000 824687 7015029 128 1200 203 921 17465 258450
Europe 1316447 17137601 87556 2600000 183 5850 1328 55844 1364 30000
Americas 4122132 159060330 537394 13390150 4103 577520 1682 229549 14865 800000
Asia 3576366 212520000 12551989 342023353 8179 368719 3563 166691 137750 5003500

We have summarized some detailed information about different variables in the dataset. Now, let's discouver more details. First,let's identify which varibales play an important role to predict the damaged amount from natural disasters

To start with, Let us to fit the full model.

FullModel <- lm(TotalDamUSD ~ NumDisaster + NumKilled + NumInjured +
                NumAffected +NumHomeless, data = Disaster)
summary(FullModel)
## 
## Call:
## lm(formula = TotalDamUSD ~ NumDisaster + NumKilled + NumInjured + 
##     NumAffected + NumHomeless, data = Disaster)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -38666229  -2518125   -747136    166490 206946995 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.14e+06   9.13e+05   -1.25     0.21    
## NumDisaster  6.34e+05   1.14e+05    5.58  4.1e-08 ***
## NumKilled    1.62e+01   5.72e+01    0.28     0.78    
## NumInjured   6.87e+01   2.67e+01    2.57     0.01 *  
## NumAffected -1.12e-02   2.68e-02   -0.42     0.68    
## NumHomeless -2.98e+00   1.94e+00   -1.54     0.12    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14100000 on 480 degrees of freedom
## Multiple R-squared:  0.111,  Adjusted R-squared:  0.102 
## F-statistic:   12 on 5 and 480 DF,  p-value: 5.73e-11

Based on the summmary, it can be shown that number of disasters played a significant role to determine the damaged amount. Therefore, I will pay more attention on whether number of disasters also play an essential role in country level.

mFun <- function(x) {
  model <- lm(TotalDamUSD ~ NumDisaster, x)
  estCoefs <- c(coef(model))
  # estSE <- c(se.coef(model))
  ## 2 means 2nd row (we do not want to test intercept), 
  ## 4 means the 4th column, which corresponding to p-value
  p_value <- summary(model)$coefficients[2, 4]  
  names(estCoefs) <- c("intercept", "NumDisaster")
  #names(estSE) <- c("SE(intercept)", "SE(NumDisaster)")
  names(p_value) <- "p-value (NumDisaster)"
  return (c(estCoefs,p_value))
}

mCoefs <- ddply(Disaster, ~Country, mFun)
print(mCoefs)
##                Country  intercept NumDisaster p-value (NumDisaster)
## 1          Afghanistan  8.053e+04  -6.730e+03             0.4589563
## 2              Algeria  4.192e+05  -7.655e+04             0.1692596
## 3            Argentina  3.517e+05   9.989e+03             0.9027709
## 4           Bangladesh  2.029e+06  -5.878e+04             0.8119140
## 5               Belize  1.625e+05  -1.631e+04             0.9353574
## 6              Bolivia -3.106e+04   5.384e+04             0.4841940
## 7  Bosnia-Hercegovenia  1.750e+05  -1.750e+04             0.8234795
## 8               Brazil  1.189e+06  -8.193e+04             0.4775022
## 9             Bulgaria -5.774e+05   1.292e+05             0.4567365
## 10            Cambodia  6.630e+05  -3.225e+05             0.0972609
## 11              Canada  2.467e+06  -3.347e+05             0.1638766
## 12               Chile  2.343e+07  -5.622e+06             0.1073906
## 13         China P Rep -3.802e+06   8.040e+05             0.5965933
## 14            Colombia  2.341e+06  -2.951e+05             0.2403580
## 15          Costa Rica  2.170e+05  -7.750e+04             0.1518295
## 16                Cuba -7.023e+05   6.903e+05             0.1475787
## 17           Czech Rep  6.762e+05   9.878e+04             0.9499379
## 18       Dominican Rep -5.039e+04   4.236e+04             0.2893147
## 19             Ecuador  6.549e+05  -1.396e+05             0.2795469
## 20         El Salvador  1.035e+06  -5.659e+04             0.8160431
## 21              France  8.089e+05   2.077e+05             0.7096798
## 22             Georgia  4.560e+05  -1.810e+05             0.3963003
## 23             Germany -6.612e+06   2.648e+06             0.4851640
## 24              Greece  1.117e+06  -1.241e+05             0.8102967
## 25           Guatemala  2.901e+05   2.358e+04             0.9005996
## 26               Haiti -8.410e+05   4.622e+05             0.4348501
## 27            Honduras -2.172e+04   2.292e+04             0.0363515
## 28             Hungary  4.467e+05  -2.117e+05             0.2633967
## 29               India -7.494e+05   1.670e+05             0.0618948
## 30           Indonesia -3.486e+02   7.951e+04             0.5044540
## 31      Iran Islam Rep  5.403e+04   2.293e+04             0.4979471
## 32               Italy  5.122e+06   1.184e+05             0.9373062
## 33             Jamaica  5.036e+05  -1.196e+05             0.6064703
## 34               Japan -1.466e+07   5.848e+06             0.4642779
## 35          Kazakhstan  1.184e+05  -5.018e+04             0.3660053
## 36               Kenya  7.958e+04  -8.492e+03             0.5442917
## 37     Korea Dem P Rep  7.045e+06  -2.994e+06             0.0666164
## 38           Korea Rep  2.211e+06  -2.341e+05             0.5290848
## 39          Madagascar  2.744e+05  -4.604e+04             0.2454068
## 40            Malaysia  6.645e+05  -6.805e+04             0.8137809
## 41              Mexico  5.382e+04   2.409e+05             0.5019379
## 42         Moldova Rep  3.872e+05  -1.837e+05             0.4608119
## 43            Mongolia -3.228e+04   3.987e+04             0.1960214
## 44             Morocco  1.492e+05  -5.214e+04             0.4775352
## 45          Mozambique -1.892e+05   6.052e+04             0.1149816
## 46             Myanmar  7.859e+06  -3.859e+06             0.0007387
## 47               Nepal -3.327e+03   4.499e+03             0.4487414
## 48           Nicaragua  1.029e+03  -3.571e+00             0.8789623
## 49             Nigeria  1.686e+05  -1.621e+04             0.4834093
## 50            Pakistan  1.565e+06   1.748e+05             0.6953090
## 51              Panama  1.233e+04  -3.250e+03             0.5332411
## 52            Paraguay  3.908e+04  -1.005e+04             0.1058932
## 53                Peru  2.143e+05   2.142e+04             0.8789929
## 54         Philippines -2.169e+05   3.705e+04             0.0136360
## 55              Poland  7.441e+05   1.539e+04             0.9875127
## 56         Puerto Rico  2.947e+05  -2.300e+04             0.9329778
## 57             Romania -6.334e+05   2.083e+05             0.1153793
## 58              Russia  8.353e+05  -6.734e+03             0.9125249
## 59        South Africa  2.898e+05  -2.456e+04             0.2677210
## 60               Spain  4.345e+05  -6.019e+04             0.5198923
## 61           Sri Lanka -2.065e+04   9.909e+04             0.4732060
## 62               Sudan -5.600e+04   9.343e+04             0.2020026
## 63         Switzerland  1.290e+06  -2.100e+05             0.6977024
## 64      Taiwan (China) -1.030e+05   1.062e+05             0.1380400
## 65          Tajikistan  2.199e+05  -3.533e+04             0.5734894
## 66            Thailand  2.047e+04   6.859e+05             0.7949296
## 67              Turkey  1.282e+06  -1.839e+05             0.3572760
## 68             Ukraine -8.176e+04   2.405e+05             0.3854931
## 69      United Kingdom -1.747e+06   1.117e+06             0.2348985
## 70       United States  5.639e+07  -8.762e+05             0.6313770
## 71           Venezuela  4.700e+04   8.167e+03             0.7943047
## 72            Viet Nam  2.013e+05   3.811e+04             0.4116813
## 73            Zimbabwe  1.164e+05  -1.343e+04             0.8139364

Question: For this part, I did not use htmlPrint() since I do not know how to control decimals in r markdown. R markdown automatically round numbers.

I tried my best, but I am not sure whether this is the correct way. Based on above table, we found that even though the number of disasters play an critical role in the world level, it does not significant in country level for most countries. It might because the individual country has limited sample size. Thus, the standard error is huge.

The next topic that I would like to focus one is how is the number of disasters changing over time on different continents.To begin with, let us plot the number of disasters in different continents.

To begin with, let us plot the number of disasters in different continents.

ggplot(Disaster,aes(x=NumDisaster, fill= Continent)) + facet_wrap(~Continent)+
  geom_bar(binwidth = 2, color = "black") +
  ggtitle("Number of Disasters in Diffferent Continents")

plot of chunk unnamed-chunk-21

After having some basic ideas about the frequency of the number of disasters in different continents,I will look at the number of disasters changing over time on diffferent continents

p <- ggplot(Disaster, aes(x = Year, y = NumDisaster, color = Year))+ 
  geom_jitter() + facet_wrap(~ Continent) + 
  geom_line(stat = "summary", fun.y = "mean", col = "red", lwd = 1) +
  ggtitle("How is Number of Disasters Changing over Time on Diffferent Continents") + 
  scale_x_continuous(name = "Year", breaks = seq(min(Disaster$Year), 
                                                 max(Disaster$Year), by = 2))  + 
  xlab("Year") + ylab("Number of Disasters")
print(p)

plot of chunk unnamed-chunk-22

Based on the plot, there is no significan relationship beween the number of disasters and time. However, it can be found that there is some difference for the number of disasters across continents.

Try boxplot for the year 2000, 2005 and 2010

ggplot(subset(Disaster, Year %in% c(2000,2005,2010)), aes(x = factor(Year), 
                                                          y = NumDisaster, 
                                                          fill = Continent), 
       groups = Continent) + geom_boxplot(alpha = 0.2, outlier.colour= "red") 

plot of chunk unnamed-chunk-23

Obviously, Asia has more variation for the number of disasters.

Finally, let us try another plot about the number of disasters changing over time on diffferent continents

low_number= 10
ggplot(Disaster, aes(x = Year, y = NumDisaster, 
                        colour = NumDisaster <= low_number)) + 
  geom_jitter(position = position_jitter(width = .2)) + 
  facet_wrap(~ Continent) + 
  ggtitle(paste("NumDisaster <= ", low_number)) + 
  theme(plot.title = element_text(face="bold")) + 
  scale_colour_discrete(name="",breaks=c("FALSE", "TRUE"),
                        labels=c("Number > 5", "Number <= 5"))+
  ggtitle("Show Number of Disasters by Year across Continents")

plot of chunk unnamed-chunk-24

Now,Let us look at a special plot: dots scatterplot of number of natural disaster over year for China. Why?

ggplot(subset(Disaster, Country == "China P Rep"), aes(x = Year, y = NumKilled)) + 
  geom_line() + xlab("Year") + ylab("Number killed") +
  ggtitle("How is Number of People Killed over time in People's Repubic of China") +
  scale_x_continuous(name = "Year", breaks = seq(min(NDisaster$Year), 
                                                 max(NDisaster$Year), by = 2)) 

plot of chunk unnamed-chunk-25

The graph shows that more than 80,000 were killed by natural disaster in the year of 2008 in China. That was Sichuan Earthquake. At that time, I was also in Sichuan and experienced this catastrophe. I feel sorry for those people who lost their lives during this disaster.Since I experienced this natural disaster, I would think this is an important plot for me!!

Next, I would like to evaluate whether there is a relationship between affected population and death across continents.

The question is that whether more people got affected by natural disasters will lead more people be killed? Since some natural disasters caused a lot of people lost lives, I used log transformation (Question Log transformation may not make sense?)

ggplot(Disaster, aes(x = NumAffected, y = NumKilled, color = Continent)) + 
  geom_point() + scale_x_log10() + scale_y_log10()+
  ggtitle("How NumKilled related to NumAffected across Continents") 

plot of chunk unnamed-chunk-26

Based on the graph, it can be seen that with the increase of number of people got affected, there is a slightly increasing trend for the number of people got killed.

Conclusion

In conclusion, the number of disasters played a significant role to determine the total damaged amount in population level but not in country level. In addition, with the increase of number of people got affected, there is a slightly increasing trend for the number of people got killed.