This report is a basic illustration of how Markdown in RStudio can be used to build a web page. We import a sample data set from Gapminder and present some basic descriptive statistics on it.
It is wise to verify the working directory before continuing:
getwd()
## [1] "C:/Users/Public/Documents/R Project/HW_02"
gDat <- read.delim("gapminderDataFiveYear.txt")
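The import step can be made a little more defensive. The sketch below is only an illustration: it writes a tiny tab-delimited file with made-up rows to a temporary location (the file contents and the name `toy` are hypothetical, not the Gapminder data), checks that the file exists, and reads it back. Note that since R 4.0.0 `read.delim()` no longer converts strings to factors by default, so `stringsAsFactors = TRUE` is needed to get factor columns like those shown in this report.

```r
# Write a tiny tab-delimited file with made-up rows to a temporary location.
tmp <- tempfile(fileext = ".txt")
writeLines(c("country\tyear\tlifeExp",
             "Aland\t1992\t60.5",
             "Borduria\t1992\t71.2"), tmp)

# Guard the import: fail early if the file is not where we expect it.
stopifnot(file.exists(tmp))

# stringsAsFactors = TRUE yields factor columns under R >= 4.0.0 as well.
toy <- read.delim(tmp, stringsAsFactors = TRUE)
str(toy)
```

The same `file.exists()` check could be placed before the `read.delim("gapminderDataFiveYear.txt")` call, assuming the file sits in the working directory reported by `getwd()`.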
Let's make sure the data set has imported properly:
str(gDat)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
tail(gDat, n = 4)
## country year pop continent lifeExp gdpPercap
## 1701 Zimbabwe 1992 10704340 Africa 60.38 693.4
## 1702 Zimbabwe 1997 11404948 Africa 46.81 792.4
## 1703 Zimbabwe 2002 11926563 Africa 39.99 672.0
## 1704 Zimbabwe 2007 12311143 Africa 43.49 469.7
The output confirms that the imported data set is a data.frame with 6 variables (columns) and 1704 observations (rows).
The variables, as the output of str() shows, are:
names(gDat)
## [1] "country" "year" "pop" "continent" "lifeExp" "gdpPercap"
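The same call, str(gDat), shows that the data.frame contains variables of two flavors: categorical variables stored as factors (country, continent) and quantitative variables stored as integer or numeric vectors (year, pop, lifeExp, gdpPercap).
Considering this fact, we can choose between different ways to describe the data. As an example, instead of using the general summary function: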
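The types that str() reports (Factor, int, num) can also be checked programmatically. The following is a minimal sketch on a made-up data frame; `toy` and its values are hypothetical stand-ins for gDat.

```r
# Hypothetical toy data frame with the same kinds of columns as gDat.
toy <- data.frame(
  country   = factor(c("A", "A", "B", "B")),
  continent = factor(c("X", "X", "Y", "Y")),
  year      = c(1992L, 1997L, 1992L, 1997L),
  lifeExp   = c(60.1, 61.3, 70.2, 71.0)
)

# A data.frame is a list of columns, so sapply() visits each column in turn.
is_fac <- sapply(toy, is.factor)
is_num <- sapply(toy, is.numeric)   # TRUE for integer and double columns

factor_vars  <- names(toy)[is_fac]
numeric_vars <- names(toy)[is_num]
print(factor_vars)
print(numeric_vars)
```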
summary(gDat)
## country year pop continent
## Afghanistan: 12 Min. :1952 Min. :6.00e+04 Africa :624
## Albania : 12 1st Qu.:1966 1st Qu.:2.79e+06 Americas:300
## Algeria : 12 Median :1980 Median :7.02e+06 Asia :396
## Angola : 12 Mean :1980 Mean :2.96e+07 Europe :360
## Argentina : 12 3rd Qu.:1993 3rd Qu.:1.96e+07 Oceania : 24
## Australia : 12 Max. :2007 Max. :1.32e+09
## (Other) :1632
## lifeExp gdpPercap
## Min. :23.6 Min. : 241
## 1st Qu.:48.2 1st Qu.: 1202
## Median :60.7 Median : 3532
## Mean :59.5 Mean : 7215
## 3rd Qu.:70.8 3rd Qu.: 9325
## Max. :82.6 Max. :113523
##
It might be better to summarize only the numeric variables, e.g.:
summary(gDat$lifeExp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 23.6 48.2 60.7 59.5 70.8 82.6
summary(gDat$gdpPercap)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 241 1200 3530 7220 9330 114000
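Rather than calling summary() once per column, the numeric columns can be selected in one step. This is a sketch on a hypothetical toy data frame with made-up values; with the real data, `toy` would be replaced by gDat.

```r
# Hypothetical toy data frame: one factor column and two numeric columns.
toy <- data.frame(
  continent = factor(c("X", "X", "Y", "Y")),
  lifeExp   = c(45, 55, 65, 75),
  gdpPercap = c(500, 2000, 8000, 32000)
)

# Logical vector marking the numeric columns.
num_cols <- sapply(toy, is.numeric)

# Indexing a data.frame with a single logical vector selects columns.
num_only <- toy[num_cols]
summary(num_only)
```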
Other methods are better suited to checking factor variables. The number of observations associated with each continent, for instance, can be presented as a bar chart:
library(lattice)
barchart(table(gDat$continent))
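The counts behind such a bar chart can also be inspected directly. The sketch below uses a hypothetical toy data frame (made-up labels, not Gapminder values) to show observations per continent with table() and distinct countries per continent with tapply().

```r
# Hypothetical toy data: three countries spread over two continents.
toy <- data.frame(
  country   = factor(c("A", "A", "B", "B", "C", "C")),
  continent = factor(c("X", "X", "X", "X", "Y", "Y"))
)

# Number of rows (observations) per continent.
obs_per_continent <- table(toy$continent)

# Number of distinct countries per continent.
countries_per_continent <- tapply(
  toy$country, toy$continent,
  function(x) length(unique(x))
)

print(obs_per_continent)
print(countries_per_continent)
```

Because the summary above shows 12 observations per country, dividing the observation counts of gDat by 12 should give the same country counts, assuming every country has a complete set of years.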
Now we are ready to start looking for relationships among the variables. The relationship between life expectancy and gross domestic product per capita in 1992 is presented here in two ways as an example:
xyplot(lifeExp ~ gdpPercap, gDat, groups = continent, subset = year == 1992,
       auto.key = TRUE)
xyplot(lifeExp ~ gdpPercap | continent, gDat, subset = year == 1992)
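A scatter plot can be complemented with a numeric measure of association. The sketch below is an illustration only: the data frame `toy` and its values are made up, and taking the logarithm of gdpPercap is an assumption of this example, not something the report does. It computes the Pearson correlation for one year, on the raw and on the log scale.

```r
# Hypothetical toy data; values are made up for illustration.
toy <- data.frame(
  year      = c(1992L, 1992L, 1992L, 1992L, 1997L),
  gdpPercap = c(500, 2000, 8000, 32000, 1000),
  lifeExp   = c(45, 55, 65, 75, 50)
)

# Keep only the rows for 1992, mirroring subset = year == 1992 in xyplot().
y92 <- subset(toy, year == 1992)

# Pearson correlation on the raw scale and on a log10 scale for GDP.
r_raw <- cor(y92$lifeExp, y92$gdpPercap)
r_log <- cor(y92$lifeExp, log10(y92$gdpPercap))

print(c(raw = r_raw, log10 = r_log))
```

With the real data, the same two cor() calls on `subset(gDat, year == 1992)` would show whether a log scale describes the relationship better than the raw scale.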