stat545a-2013-hw02_inskip-jes.rmd

This tutorial covers basic data.frame use and manipulation, and doubles as an example of an R Markdown document.

Importing data as dataframe

Set the working directory to the current folder (this will be different on your computer) and import the .txt file (note: tab completion for the filename only works once the quotation marks are open).

setwd("~/Rwork/20130909")
gDat <- read.delim(file = "gapminderDataFiveYear.txt")

Things to watch for when checking whether the data imported properly (some taken from http://www.youtube.com/watch?v=3rDNpcluseM):

number of rows and columns
missing values
values outside expected range
values in wrong units
mislabeled variables
variables that are in the wrong class
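Several of the checks above map onto one-line calls. The following is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical stand-ins, not part of the Gapminder data), so it runs without the original file:

```r
# Toy data frame standing in for an imported file (values are made up).
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(1952L, 1957L, 1952L, 1957L),
  lifeExp = c(40.1, NA, 65.3, 66.0),
  stringsAsFactors = TRUE
)

dim(toy)                          # number of rows and columns
colSums(is.na(toy))               # count of missing values in each column
range(toy$lifeExp, na.rm = TRUE)  # compare against the range you expect
sapply(toy, class)                # spot variables stored in the wrong class
```

Checks for wrong units or mislabeled variables still need a human eye; these calls only make the obvious problems visible quickly.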

A quick look at the first few lines lets you know that the read has been successful.

head(gDat)  # can specify number of rows in head using argument 'n = '

##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia   28.80     779.4
## 2 Afghanistan 1957  9240934      Asia   30.33     820.9
## 3 Afghanistan 1962 10267083      Asia   32.00     853.1
## 4 Afghanistan 1967 11537966      Asia   34.02     836.2
## 5 Afghanistan 1972 13079460      Asia   36.09     740.0
## 6 Afghanistan 1977 14880372      Asia   38.44     786.1

And a look at the dimensions lets you quickly check that things are in order.

dim(gDat)

## [1] 1704    6

The str() function shows the column names and the type of each variable.

str(gDat)

## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
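The Factor columns in the output above depend on how strings are converted at import. Since R 4.0.0, read.delim() keeps strings as character unless stringsAsFactors = TRUE is passed, so a current R session may show chr instead. A small sketch of the difference, using two Canada rows typed inline through the text = argument rather than the original file:

```r
# Two tab-separated rows supplied inline instead of reading from disk.
txt <- "country\tyear\tlifeExp\nCanada\t1952\t68.75\nCanada\t1957\t69.96"

as_factor <- read.delim(text = txt, stringsAsFactors = TRUE)
as_char   <- read.delim(text = txt, stringsAsFactors = FALSE)

class(as_factor$country)  # "factor"
class(as_char$country)    # "character"
```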

The summary() function gives you a first peek at some basic statistics.

summary(gDat)

##         country          year           pop              continent  
##  Afghanistan:  12   Min.   :1952   Min.   :6.00e+04   Africa  :624  
##  Albania    :  12   1st Qu.:1966   1st Qu.:2.79e+06   Americas:300  
##  Algeria    :  12   Median :1980   Median :7.02e+06   Asia    :396  
##  Angola     :  12   Mean   :1980   Mean   :2.96e+07   Europe  :360  
##  Argentina  :  12   3rd Qu.:1993   3rd Qu.:1.96e+07   Oceania : 24  
##  Australia  :  12   Max.   :2007   Max.   :1.32e+09                 
##  (Other)    :1632                                                   
##     lifeExp       gdpPercap     
##  Min.   :23.6   Min.   :   241  
##  1st Qu.:48.2   1st Qu.:  1202  
##  Median :60.7   Median :  3532  
##  Mean   :59.5   Mean   :  7215  
##  3rd Qu.:70.8   3rd Qu.:  9325  
##  Max.   :82.6   Max.   :113523  
##

Other handy things to know are the names of the variables, the number of rows, the number of unique entries (such as the number of countries), and the range of years.

names(gDat)  # names() returns the column names; for row names use head(rownames(gDat))

## [1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"

nrow(gDat)

## [1] 1704

length(unique(gDat$country))

## [1] 142

range(gDat$year)

## [1] 1952 2007
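When a column is a factor, a couple of other base R calls give the same kind of overview. This is a minimal sketch on a made-up data frame (`toy` and the country "Otherland" are hypothetical):

```r
# Small stand-in data frame with a factor column.
toy <- data.frame(
  country = factor(c("Canada", "Canada", "Otherland")),
  year    = c(1952L, 1957L, 1952L)
)

nlevels(toy$country)  # number of levels; equals length(unique()) when no level is unused
table(toy$country)    # number of rows for each country
ncol(toy)             # number of columns, the companion to nrow()
```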

Data subsets

There are several ways to choose a data subset, none of which depend on row numbers (row numbers can change and tell people reading your code very little).

Many plotting and modelling functions accept a subset = argument that restricts the data they use. For example, subset = year == 1997 plots data from that year only (year is an integer, so compare it against the number 1997 rather than the string "1997").

library(lattice)  # needed for xyplot
xyplot(lifeExp ~ gdpPercap, gDat, subset = year == 1997)

[Figure: lattice scatterplot of lifeExp against gdpPercap for 1997]

subset() is also a function in its own right and, confusingly, takes subset = as an argument.

subset(gDat, subset = country == "Canada")

##     country year      pop continent lifeExp gdpPercap
## 241  Canada 1952 14785584  Americas   68.75     11367
## 242  Canada 1957 17010154  Americas   69.96     12490
## 243  Canada 1962 18985849  Americas   71.30     13462
## 244  Canada 1967 20819767  Americas   72.13     16077
## 245  Canada 1972 22284500  Americas   72.88     18971
## 246  Canada 1977 23796400  Americas   74.21     22091
## 247  Canada 1982 25201900  Americas   75.76     22899
## 248  Canada 1987 26549700  Americas   76.86     26627
## 249  Canada 1992 28523502  Americas   77.95     26343
## 250  Canada 1997 30305843  Americas   78.61     28955
## 251  Canada 2002 31902268  Americas   79.77     33329
## 252  Canada 2007 33390141  Americas   80.65     36319
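subset() can also pick columns through its select = argument, and the same result can be written with logical indexing inside square brackets. A minimal sketch on a made-up data frame (`toy`, "Otherland" and its value are hypothetical; the Canada values come from the output above):

```r
# Small stand-in data frame; the "Otherland" row is invented.
toy <- data.frame(
  country = c("Canada", "Canada", "Otherland"),
  year    = c(1952L, 1957L, 1952L),
  lifeExp = c(68.75, 69.96, 50.0)
)

# Rows where country is Canada, keeping only two columns.
by_subset  <- subset(toy, subset = country == "Canada", select = c(year, lifeExp))

# The same selection with bracket indexing: a logical row test, named columns.
by_bracket <- toy[toy$country == "Canada", c("year", "lifeExp")]

all.equal(by_subset, by_bracket)
```

Both forms select by a condition on the data rather than by row number, so they keep working if rows are added or reordered.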

Inline code

You can also write inline code in R Markdown; all of the numbers in the following sentence are calculated during the Knit to HTML process:

The Gapminder dataset includes information on 142 different countries from 1952 to 2007.
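In the R Markdown source, an inline chunk is an expression wrapped in backticks and prefixed with the letter r; knitr replaces it with the value when the document is knit. The source of the sentence above is not shown here, so the expressions below are an assumption about what could sit behind it, run on a made-up data frame (`toy` is hypothetical):

```r
# Small stand-in data frame; "Otherland" is invented.
toy <- data.frame(
  country = c("Canada", "Canada", "Otherland"),
  year    = c(1952L, 2007L, 1952L)
)

# Each of these could be placed in its own inline chunk.
n_countries <- length(unique(toy$country))
yr_range    <- range(toy$year)

# Assembling the sentence by hand shows what knitting would produce.
sentence <- sprintf(
  "The dataset includes information on %d different countries from %d to %d.",
  n_countries, yr_range[1], yr_range[2]
)
sentence
```

Because the numbers are computed rather than typed, the sentence stays correct if the data file is updated.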