stat545a-2013-hw03_khosravi-mah

Data Aggregation

In this homework, we carry out a series of data aggregation tasks on the Gapminder data set.

Maximum and minimum GDP per capita for different continents

Let's import the Gapminder data set again:

SourceDat <- read.delim(file = "gapminderDataFiveYear.txt")

We can check the structure and the tail of the imported data with str() and tail() to confirm that it was read in properly:

## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

##       country year      pop continent lifeExp gdpPercap
## 1701 Zimbabwe 1992 10704340    Africa   60.38     693.4
## 1702 Zimbabwe 1997 11404948    Africa   46.81     792.4
## 1703 Zimbabwe 2002 11926563    Africa   39.99     672.0
## 1704 Zimbabwe 2007 12311143    Africa   43.49     469.7
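
The chunk that produced the output above is not shown. A minimal sketch of such an import check follows, assuming str() and tail() were used. It writes a tiny made-up tab-delimited file to a temporary location, so the file name and every value in it are placeholders and not Gapminder data.

```r
# Write a tiny tab-delimited file with the same six columns (made-up values).
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "country\tyear\tpop\tcontinent\tlifeExp\tgdpPercap",
  "Aland\t1952\t1000\tX\t40.5\t800.1",
  "Aland\t1957\t1100\tX\t42.0\t850.3",
  "Bland\t1952\t2000\tY\t70.2\t9000.7"
), tmp)

# Import it the same way as the homework does.
dat <- read.delim(file = tmp)

# Structure: column names, types, and the number of rows and columns.
str(dat)

# Last rows: a quick check that the file was read to the end.
tail(dat, 2)
```

On the real file, the same two calls on SourceDat should report 1704 observations of 6 variables, as in the output above.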

Now we want to look at the minimum and maximum GDP per capita for each continent. To do that, we can use ddply() from the plyr package.

library(plyr)
GDPExterWide <- ddply(SourceDat, ~continent, summarize, MinGDP = round(min(gdpPercap), 
    2), MaxGDP = round(max(gdpPercap), 2))

To show the result in a more pleasant way, we can use the xtable package to draw a table, sorted by MinGDP.

library(xtable)
print(xtable(arrange(GDPExterWide, MinGDP)), include.rownames = FALSE, type = "html")

continent	MinGDP	MaxGDP
Africa	241.17	21951.21
Asia	331.00	113523.13
Europe	973.53	49357.19
Americas	1201.64	42951.65
Oceania	10039.60	34435.37

As the results show, the continents would not be in the same order if we sorted them by MaxGDP instead. In other words, the ranking by minimum GDP per capita does not carry over to the ranking by maximum: Asia, for example, has the second-lowest minimum but by far the highest maximum. We should also be careful not to misinterpret the gap between the minimum and maximum GDP per capita of a continent. An important factor in that regard is the number of countries in each continent (the chunk I wrote for this table is hideous, so I have left it out here):

Continent	Africa	Americas	Asia	Europe	Oceania
Countries	52	25	33	30	2

Now we can see that Oceania, for example, which is the only continent whose minimum and maximum GDP per capita are of the same order of magnitude, comprises only two countries!
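
The chunk behind the country-count table is not shown. One compact way to get such counts is sketched below in base R on a made-up toy data frame (the country and continent labels are placeholders). Since every country appears once per year, the sketch counts distinct countries and not rows.

```r
# Toy stand-in for the Gapminder data (labels are made up for illustration).
toy <- data.frame(
  country   = c("A", "A", "B", "B", "C", "C"),
  continent = c("X", "X", "X", "X", "Y", "Y"),
  year      = c(1952, 1957, 1952, 1957, 1952, 1957)
)

# For each continent, count the distinct countries (not the rows).
n_countries <- tapply(toy$country, toy$continent,
                      function(x) length(unique(x)))
print(n_countries)
```

With plyr, the same idea would presumably read ddply(SourceDat, ~continent, summarize, Countries = length(unique(country))).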

Spread of GDP within the continents

With a very similar method we can review the spread of GDP per capita within the different continents. Here, the table is sorted by median absolute deviation.

continent	SD	MAD	IQR
Africa	2827.93	775.32	1616.17
Asia	14045.37	2820.83	7492.26
Americas	6396.76	3269.33	4402.43
Oceania	6358.98	6459.10	8072.26
Europe	9355.21	8846.05	13248.30

Again, the continents do not follow the same order for standard deviation and interquartile range. At this stage, however, it is not quite clear to me what these statistics tell us.
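
The chunk for the spread table is not shown. As a minimal sketch on made-up numbers (not the Gapminder values), the following computes the three statistics with base R's sd(), mad() and IQR(), and shows how a single extreme value inflates the standard deviation far more than the other two. This may be one reason why Asia has a large SD but a modest MAD in the table above, though that reading is an assumption and not something the table alone confirms.

```r
# Made-up GDP-per-capita-like values; the last one is a deliberate outlier.
gdp <- c(1000, 1200, 1500, 1800, 2000, 2500, 3000, 100000)

# Three spread measures; mad() is scaled by 1.4826 by default so that it is
# comparable to the standard deviation for normally distributed data.
spread <- function(x) c(SD = sd(x), MAD = mad(x), IQR = IQR(x))

with_outlier    <- spread(gdp)
without_outlier <- spread(gdp[-length(gdp)])

# Compare the two rows: SD changes drastically, MAD and IQR only mildly.
print(round(rbind(with_outlier, without_outlier), 2))
```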

Trimmed mean of life expectancy for different years

In this part, we look at the average life expectancy around the world in each year of our study. The following table also shows the 12.5% and 25% (interquartile) trimmed means:

# Two choices of trim fraction. This chunk might be nicer, and more
# practical, if we defined a function.
TP1 <- 0.125
TP2 <- 0.25
AveLExPYear <- ddply(SourceDat, ~year, summarize, Average = mean(lifeExp), Dumm1 = mean(lifeExp, 
    trim = TP1), Dumm2 = mean(lifeExp, trim = TP2))
names(AveLExPYear)[3:4] <- paste0("Tmean", 100 * c(TP1, TP2))
AveLExPYear <- arrange(AveLExPYear, Average)
# It was already sorted, but we can do it just to make sure
print(xtable(AveLExPYear), type = "html", include.rownames = FALSE)

year	Average	Tmean12.5	Tmean25
1952	49.06	48.42	47.34
1957	51.51	51.16	50.28
1962	53.61	53.51	52.79
1967	55.68	55.85	55.43
1972	57.65	58.05	58.08
1977	59.57	60.17	60.47
1982	61.53	62.22	62.70
1987	63.21	64.07	64.77
1992	64.16	65.34	66.19
1997	65.01	66.21	67.25
2002	65.69	66.95	68.31
2007	67.01	68.34	69.69

Interestingly, both the average and the trimmed mean life expectancy have increased monotonically over time around the world. However, this does not remove the need to investigate how positive factors (e.g. health science and general comfort) and negative factors (e.g. war and disease) correlate with life expectancy.

Also, the trimmed means show a more pronounced increase in life expectancy over time: the more data is trimmed from the two ends, the larger the increase. From 1952 to 2007, the plain average rises by about 17.95 years, while the 25% trimmed mean rises by about 22.35 years.
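
The comment in the chunk above suggests that a function might be nicer. A minimal sketch of such a helper follows; the name trimmed_means and its trims argument are hypothetical, and the input values are made up.

```r
# Hypothetical helper: the plain mean plus one trimmed mean per trim fraction.
# Note that mean(x, trim = p) drops a fraction p from EACH end of the data.
trimmed_means <- function(x, trims = c(0.125, 0.25)) {
  out <- c(mean(x), sapply(trims, function(tr) mean(x, trim = tr)))
  names(out) <- c("Average", paste0("Tmean", 100 * trims))
  out
}

life <- c(30, 40, 45, 50, 55, 60, 65, 80)  # made-up life expectancies
res <- trimmed_means(life)
print(res)
```

Because the helper returns a named vector, it could probably be passed to ddply() directly, for example ddply(SourceDat, ~year, function(d) trimmed_means(d$lifeExp)), which would avoid the temporary Dumm1 and Dumm2 column names.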

Variation of life expectancy over time for different continents (tall and wide)

An easy way to get this data is to call ddply() with two grouping variables at once, namely continent and year.

AvelExPYearCont <- ddply(SourceDat, ~continent + year, summarize, Average = mean(lifeExp))

But let's first take a look at the structure of the output data.frame using str():

## 'data.frame':    60 obs. of  3 variables:
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ Average  : num  39.1 41.3 43.3 45.3 47.5 ...

As we can see, the data.frame has 60 rows. Since ddply() returns its result in tall format by default, the result can be awkward to present as a regular table. Here is a table showing 10 random rows of this data.frame:

Dumm1 <- AvelExPYearCont[sample(nrow(AvelExPYearCont), 10), ]
print(xtable(Dumm1), type = "html", include.rownames = FALSE)

continent	year	Average
Asia	1997	68.02
Asia	2002	69.23
Asia	1962	51.56
Americas	1977	64.39
Americas	1962	58.40
Europe	1957	66.70
Oceania	1987	75.32
Africa	1972	47.45
Europe	1972	70.78
Asia	1977	59.61

This is apparently a common challenge, and people have found a solution. Following the instructions from a discussion on Stack Overflow, we can transform our data.frame to wide format with reshape(). The resulting table shows the average life expectancy for each continent in different years.

AvelExPYearContW <- reshape(AvelExPYearCont, idvar = "continent", timevar = "year", 
    direction = "wide")
names(AvelExPYearContW)[-1] <- unique(SourceDat$year)
print(xtable(AvelExPYearContW), type = "html", include.rownames = FALSE)

continent	1952	1957	1962	1967	1972	1977	1982	1987	1992	1997	2002	2007
Africa	39.14	41.27	43.32	45.33	47.45	49.58	51.59	53.34	53.63	53.60	53.33	54.81
Americas	53.28	55.96	58.40	60.41	62.39	64.39	66.23	68.09	69.57	71.15	72.42	73.61
Asia	46.31	49.32	51.56	54.66	57.32	59.61	62.62	64.85	66.54	68.02	69.23	70.73
Europe	64.41	66.70	68.54	69.74	70.78	71.94	72.81	73.64	74.44	75.51	76.70	77.65
Oceania	69.25	70.30	71.09	71.31	71.91	72.85	74.29	75.32	76.94	78.19	79.74	80.72

The results are much easier to interpret this way. The trend is upward for all five continents. Except for Africa in 1997 and 2002, where we see a slight decrease, average life expectancy has increased steadily over time in every continent.
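
As an alternative to aggregating in tall format and then reshaping, base R's tapply() can produce the wide layout in one step when it is given two grouping vectors. The following is a minimal sketch on a made-up toy data frame, not the method used above.

```r
# Toy stand-in for the Gapminder data (values are made up for illustration).
toy <- data.frame(
  continent = c("X", "X", "X", "X", "Y", "Y"),
  year      = c(1952, 1952, 1957, 1957, 1952, 1957),
  lifeExp   = c(40, 50, 42, 54, 70, 72)
)

# Rows are continents, columns are years, cells are mean life expectancy.
wide <- with(toy, tapply(lifeExp, list(continent, year), mean))
print(wide)
```

The result is a matrix and not a data.frame, with the years already in place as column names, so the renaming step after reshape() is not needed.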

Number of countries with low life expectancy over time by continent

Now we want to get a sense of how many countries in each continent have a life expectancy below a certain “global” life expectancy. To do this, we first need to define the global life expectancy.
A constant global value would not be a good idea given the increasing trend over time that we saw earlier, so we need one global life expectancy per year. One way to get it is to average the continent means for each year in the AvelExPYearContW data.frame. However, that approach ignores the number of countries in each continent and gives a higher global life expectancy, because a continent with few countries and a high life expectancy, such as Oceania, counts as much as Africa. It is better to average life expectancy over all countries together, as in the AveLExPYear data.frame. For 2007, for example, averaging the continent means gives 71.5021, whereas the average over all countries, taken from AveLExPYear, is 67.0074.
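
The gap between the two 2007 values can be checked by hand from the tables in this report. The sketch below takes the rounded 2007 continent means and the country counts shown earlier, so the results agree with 71.5021 and 67.0074 only up to rounding.

```r
# 2007 continent means and country counts, copied from the tables above
# (order: Africa, Americas, Asia, Europe, Oceania).
cont_mean <- c(Africa = 54.81, Americas = 73.61, Asia = 70.73,
               Europe = 77.65, Oceania = 80.72)
n_country <- c(Africa = 52, Americas = 25, Asia = 33,
               Europe = 30, Oceania = 2)

# Mean of the continent means: every continent counts equally.
unweighted <- mean(cont_mean)

# Weighting each continent by its number of countries recovers the
# average over all countries.
weighted <- weighted.mean(cont_mean, w = n_country)

print(round(c(unweighted = unweighted, weighted = weighted), 2))
```

The unweighted value comes out near 71.50 and the weighted one near 67.01, which shows that the difference is purely a matter of how much weight each continent receives.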

AveLExGlob <- round(AveLExPYear$Average, 2)
names(AveLExGlob) <- AveLExPYear$year
AveLExGlob

##  1952  1957  1962  1967  1972  1977  1982  1987  1992  1997  2002  2007 
## 49.06 51.51 53.61 55.68 57.65 59.57 61.53 63.21 64.16 65.01 65.69 67.01

Based on this vector of global life expectancies, we now count, for each continent and year, the countries whose life expectancy is lower than the corresponding element of the vector. Since the initial tall result is not pleasant to show, we use the reshape() function once again.

# sum() over a logical vector counts the countries below that year's global mean
LowLExCount <- ddply(SourceDat, ~continent + year, summarize,
    countries = sum(lifeExp < AveLExGlob[as.character(year)]))
LowLExCount <- reshape(LowLExCount, direction = "wide", idvar = "continent", 
    timevar = "year")
names(LowLExCount)[-1] <- unique(SourceDat$year)
print(xtable(LowLExCount), type = "html", include.rownames = FALSE)

continent	1952	1957	1962	1967	1972	1977	1982	1987	1992	1997	2002	2007
Africa	50	50	50	50	50	49	48	46	46	45	45	45
Americas	9	9	8	6	6	7	7	5	3	2	2	2
Asia	22	20	19	18	18	14	12	12	11	10	10	10
Europe	1	1	1	1	1	1	1	1	0	0	0	0
Oceania	0	0	0	0	0	0	0	0	0	0	0	0

Looking through this table, we see that the Americas and Asia have improved over the years: the number of countries there with a low life expectancy, by our definition, has decreased overall. Africa persistently holds a large number of countries with low life expectancy, although the count has been falling slowly since 1977. Oceania never has a country below the global value, and Europe has at most one, with none from 1992 onward.
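
The same count can also be sketched in base R without a separate lookup vector: ave() attaches each year's global mean to every row, and tapply() then counts the rows below it. The toy data frame is made up, and unlike the chunk above this sketch compares against the unrounded mean.

```r
# Toy stand-in for the Gapminder data (values are made up for illustration).
toy <- data.frame(
  country   = c("A", "B", "C", "A", "B", "C"),
  continent = c("X", "X", "Y", "X", "X", "Y"),
  year      = c(1952, 1952, 1952, 1957, 1957, 1957),
  lifeExp   = c(40, 50, 75, 45, 62, 76)
)

# Global mean of the row's year, repeated on every row (ave() defaults to mean).
toy$global <- ave(toy$lifeExp, toy$year)

# TRUE where a country is below the global mean of its year.
toy$below <- toy$lifeExp < toy$global

# Count the TRUE values per continent and year; the result is already wide.
counts <- with(toy, tapply(below, list(continent, year), sum))
print(counts)
```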