stat545a-2013-hw02_khosravi-mah

Markdown Exercise

This report is a basic illustration of how Markdown in RStudio can be used to build a web page. We import a sample dataset from Gapminder and present some basic descriptive statistics on it.

Data Import

It is always wise to check which directory we are working in before continuing:

getwd()

## [1] "C:/Users/Public/Documents/R Project/HW_02"

gDat <- read.delim("gapminderDataFiveYear.txt")
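As a minimal sketch of a more defensive import, the helper below stops with a clear message when the file is not in the working directory. The name read_gapminder is hypothetical and not part of the original analysis. It also sets stringsAsFactors = TRUE explicitly: since R 4.0.0 the default is FALSE, so without it country and continent would be read as character columns rather than the factors shown by str() in this report.

```r
# Hypothetical helper: read the Gapminder file, failing early with a clear
# message if it is not where we expect it.
read_gapminder <- function(path = "gapminderDataFiveYear.txt") {
  if (!file.exists(path)) {
    stop("Cannot find '", path, "' (working directory: ", getwd(), ")")
  }
  # stringsAsFactors = TRUE reproduces the Factor columns shown by str();
  # since R 4.0.0 the default is FALSE, which would give character columns.
  read.delim(path, stringsAsFactors = TRUE)
}

# Usage, assuming the file sits in the working directory:
# gDat <- read_gapminder()
```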

Let's make sure the data set was imported properly:

str(gDat)

## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

tail(gDat, n = 4)

##       country year      pop continent lifeExp gdpPercap
## 1701 Zimbabwe 1992 10704340    Africa   60.38     693.4
## 1702 Zimbabwe 1997 11404948    Africa   46.81     792.4
## 1703 Zimbabwe 2002 11926563    Africa   39.99     672.0
## 1704 Zimbabwe 2007 12311143    Africa   43.49     469.7

The output of str() also tells us that the imported dataset is a data.frame with 1704 observations (rows) of 6 variables (columns).
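The same facts can be checked programmatically. The sketch below uses a small stand-in built from the four Zimbabwe rows printed by tail(gDat, n = 4), so that it runs without the data file. The name gMini is hypothetical. On the full gDat, dim() would be expected to report 1704 rows and 6 columns, matching the str() output.

```r
# Stand-in for gDat: the four Zimbabwe rows printed by tail(gDat, n = 4).
# gMini is a hypothetical name used only for this illustration.
gMini <- data.frame(
  country   = factor(rep("Zimbabwe", 4)),
  year      = c(1992L, 1997L, 2002L, 2007L),
  pop       = c(10704340, 11404948, 11926563, 12311143),
  continent = factor(rep("Africa", 4)),
  lifeExp   = c(60.38, 46.81, 39.99, 43.49),
  gdpPercap = c(693.4, 792.4, 672.0, 469.7)
)

dim(gMini)               # rows, then columns: 4 6
nrow(gMini)              # number of observations
ncol(gMini)              # number of variables
is.list(gMini)           # TRUE: a data.frame is a list of equal-length columns
is.list(gMini$lifeExp)   # FALSE: each column here is an atomic vector
```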

Dataset Basic Identification

The variables, as seen in the output of str(), are:

names(gDat)

## [1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"

The same call, str(gDat), shows that the variables in the data.frame come in two flavors:

factors
numerics (integers are numeric too)
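The split into factors and numerics can also be made in code. This is a small sketch on three columns rebuilt from the tail() output above (the name gTypes is hypothetical); the same sapply() calls should work unchanged on the full gDat.

```r
# Three columns rebuilt from the tail(gDat, n = 4) output shown earlier;
# gTypes is a hypothetical name used only here.
gTypes <- data.frame(
  continent = factor(rep("Africa", 4)),
  year      = c(1992L, 1997L, 2002L, 2007L),   # integer
  lifeExp   = c(60.38, 46.81, 39.99, 43.49)    # double
)

sapply(gTypes, class)      # "factor", "integer", "numeric"

# is.numeric() is TRUE for both integer and double columns, FALSE for factors
is_num <- sapply(gTypes, is.numeric)
is_num

names(gTypes)[is_num]      # columns suited to numeric summaries
names(gTypes)[!is_num]     # columns suited to tables and counts
```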

Given these two types, we can describe the data in different ways. For example, instead of calling the general summary() function on the whole data.frame:

summary(gDat)

##         country          year           pop              continent  
##  Afghanistan:  12   Min.   :1952   Min.   :6.00e+04   Africa  :624  
##  Albania    :  12   1st Qu.:1966   1st Qu.:2.79e+06   Americas:300  
##  Algeria    :  12   Median :1980   Median :7.02e+06   Asia    :396  
##  Angola     :  12   Mean   :1980   Mean   :2.96e+07   Europe  :360  
##  Argentina  :  12   3rd Qu.:1993   3rd Qu.:1.96e+07   Oceania : 24  
##  Australia  :  12   Max.   :2007   Max.   :1.32e+09                 
##  (Other)    :1632                                                   
##     lifeExp       gdpPercap     
##  Min.   :23.6   Min.   :   241  
##  1st Qu.:48.2   1st Qu.:  1202  
##  Median :60.7   Median :  3532  
##  Mean   :59.5   Mean   :  7215  
##  3rd Qu.:70.8   3rd Qu.:  9325  
##  Max.   :82.6   Max.   :113523  
##

It might be better to summarize the numeric variables individually, e.g.:

summary(gDat$lifeExp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    23.6    48.2    60.7    59.5    70.8    82.6

summary(gDat$gdpPercap)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     241    1200    3530    7220    9330  114000

We can then use other methods to inspect the factor variables. The number of observations associated with each continent, for instance, can be presented as follows:

library(lattice)
barchart(table(gDat$continent))

[Figure: bar chart of the number of observations per continent]
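The numbers behind the bar chart can be read directly from table(). To keep this sketch runnable without the data file, the continent factor is rebuilt from the per-continent counts printed by summary(gDat); with the real data, table(gDat$continent) would take its place. The last line assumes every country contributes exactly 12 rows, which is consistent with 1704 observations over 142 countries.

```r
# Rebuild the continent factor from the counts printed by summary(gDat).
continent_counts <- c(Africa = 624, Americas = 300, Asia = 396,
                      Europe = 360, Oceania = 24)
continent <- factor(rep(names(continent_counts), times = continent_counts))

tab <- table(continent)
tab                          # observations per continent
round(prop.table(tab), 3)    # share of all observations
tab / 12                     # countries per continent, assuming 12 rows each
```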

Correlation Investigations

Now we are ready to look for relationships among the variables. As an example, the relation between life expectancy and gross domestic product (GDP) per capita in 1992 is presented here in two ways:

xyplot(lifeExp ~ gdpPercap, gDat, groups = continent, subset = year == 1992, 
    auto.key = TRUE)

[Figure: life expectancy vs. GDP per capita in 1992, points grouped by continent]

xyplot(lifeExp ~ gdpPercap | continent, gDat, subset = year == 1992, auto.key = TRUE)

[Figure: life expectancy vs. GDP per capita in 1992, one panel per continent]
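The plots could be complemented with a numeric measure. The sketch below is not part of the original analysis: cor_lifeexp_gdp is a hypothetical helper that returns the Pearson correlation between life expectancy and GDP per capita for one year, once on the raw scale and once after a log10 transform of GDP. The log transform is our own addition, on the assumption that the relation is closer to linear on that scale.

```r
# Hypothetical helper: Pearson correlation between lifeExp and gdpPercap
# for a single year, on the raw and on the log10 GDP scale.
# `dat` is assumed to have columns year, lifeExp and gdpPercap, like gDat.
cor_lifeexp_gdp <- function(dat, yr) {
  one_year <- dat[dat$year == yr, ]
  c(
    raw   = cor(one_year$lifeExp, one_year$gdpPercap),
    log10 = cor(one_year$lifeExp, log10(one_year$gdpPercap))
  )
}

# Usage, assuming gDat was imported as above:
# cor_lifeexp_gdp(gDat, 1992)
```

A per-continent version could apply the same helper to each piece returned by split(gDat, gDat$continent).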