This tutorial covers basic data.frame use and manipulation in R, and doubles as an example of an R Markdown document.
Importing data as a data frame
Set the working directory to the current folder (the path will be different on your computer) and import the .txt file. Note: tab completion for the filename only works once the quotation marks are open.
setwd("~/Rwork/20130909")
gDat <- read.delim(file = "gapminderDataFiveYear.txt")
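A note on reproducing the output in this tutorial: country and continent show up as factors, which was the default behaviour of read.delim() before R 4.0.0. On newer versions, string columns are read as character unless stringsAsFactors = TRUE is passed. The sketch below illustrates the difference on a tiny temporary file; the file and its three rows (copied from the head() output in this tutorial) are a stand-in for gapminderDataFiveYear.txt, and the names tmp, toyFct and toyChr are illustrative only.

```r
# Write a small tab-delimited file to a temporary location
# (three rows copied from the head() output shown in this tutorial).
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "country\tyear\tpop\tcontinent\tlifeExp\tgdpPercap",
  "Afghanistan\t1952\t8425333\tAsia\t28.80\t779.4",
  "Afghanistan\t1957\t9240934\tAsia\t30.33\t820.9",
  "Afghanistan\t1962\t10267083\tAsia\t32.00\t853.1"
), tmp)

# Setting stringsAsFactors explicitly makes the result independent of the R version.
toyFct <- read.delim(file = tmp, stringsAsFactors = TRUE)
toyChr <- read.delim(file = tmp, stringsAsFactors = FALSE)

class(toyFct$country)  # "factor"
class(toyChr$country)  # "character"

unlink(tmp)  # remove the temporary file
```

Spelling out stringsAsFactors in the real import call is a cheap way to make sure the str() and summary() output matches what is shown here.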
Things to keep an eye out for when checking whether the data imported properly (some are covered here: http://www.youtube.com/watch?v=3rDNpcluseM):
A quick look at the first few lines lets you know that the read has been successful.
head(gDat) # can specify number of rows in head using argument 'n = '
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia   28.80     779.4
## 2 Afghanistan 1957  9240934      Asia   30.33     820.9
## 3 Afghanistan 1962 10267083      Asia   32.00     853.1
## 4 Afghanistan 1967 11537966      Asia   34.02     836.2
## 5 Afghanistan 1972 13079460      Asia   36.09     740.0
## 6 Afghanistan 1977 14880372      Asia   38.44     786.1
And a look at the dimensions lets you quickly check that things are in order.
dim(gDat)
## [1] 1704 6
The str() function gives you information about the type of variables and the column names.
str(gDat)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
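str() is designed for reading by eye. If the same check needs to live in a script, one option is to compare the column classes against what you expect. A minimal sketch, using a toy data frame (a few values copied from the output in this tutorial) in place of gDat; the names toy, colTypes and expected are illustrative only:

```r
# Toy stand-in for gDat, built from values shown in this tutorial
toy <- data.frame(
  country = factor(c("Afghanistan", "Afghanistan", "Canada")),
  year    = c(1952L, 1957L, 1952L),
  lifeExp = c(28.80, 30.33, 68.75)
)

# One class per column, returned as a named character vector
colTypes <- sapply(toy, class)
print(colTypes)

# Stop with an error if any column has an unexpected type
expected <- c(country = "factor", year = "integer", lifeExp = "numeric")
stopifnot(identical(colTypes, expected))
```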
The summary() function gives you a first peek at some basic statistics.
summary(gDat)
##         country          year           pop              continent
##  Afghanistan:  12   Min.   :1952   Min.   :6.00e+04   Africa  :624
##  Albania    :  12   1st Qu.:1966   1st Qu.:2.79e+06   Americas:300
##  Algeria    :  12   Median :1980   Median :7.02e+06   Asia    :396
##  Angola     :  12   Mean   :1980   Mean   :2.96e+07   Europe  :360
##  Argentina  :  12   3rd Qu.:1993   3rd Qu.:1.96e+07   Oceania : 24
##  Australia  :  12   Max.   :2007   Max.   :1.32e+09
##  (Other)    :1632
##     lifeExp       gdpPercap
##  Min.   :23.6   Min.   :   241
##  1st Qu.:48.2   1st Qu.:  1202
##  Median :60.7   Median :  3532
##  Mean   :59.5   Mean   :  7215
##  3rd Qu.:70.8   3rd Qu.:  9325
##  Max.   :82.6   Max.   :113523
##
Other handy things to know include the names of the variables, the number of rows, the number of unique entries (like the number of countries), and the range of years.
names(gDat) # names() returns the column names; for row names use head(rownames(gDat))
## [1] "country" "year" "pop" "continent" "lifeExp" "gdpPercap"
nrow(gDat)
## [1] 1704
length(unique(gDat$country))
## [1] 142
range(gDat$year)
## [1] 1952 2007
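One caveat when counting entries in a factor: length(unique(x)) counts the values actually present, whereas nlevels(x) counts the declared levels, and the two can disagree once the data have been subsetted. A small sketch on a toy data frame (values taken from this tutorial's output; toy and canadaOnly are illustrative names):

```r
# Toy stand-in for gDat with two countries and two years each
toy <- data.frame(
  country = factor(c("Afghanistan", "Afghanistan", "Canada", "Canada")),
  year    = c(1952L, 1957L, 1952L, 1957L)
)

length(unique(toy$country))  # 2 countries present
nlevels(toy$country)         # 2 levels declared
table(toy$country)           # number of observations per country

# After subsetting, the unused level "Afghanistan" is still declared
canadaOnly <- toy[toy$country == "Canada", ]
length(unique(canadaOnly$country))  # 1
nlevels(canadaOnly$country)         # 2

# droplevels() discards levels that no longer occur
nlevels(droplevels(canadaOnly)$country)  # 1
```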
Data subsets
There are several ways to choose a data subset, none of which depend on row numbers, which can change and aren't very informative for people reading over your code.
Many functions take a subset = argument that restricts the data they use; e.g. subset = year == 1997 could be used to plot data from only that year (year is stored as an integer, so no quotation marks are needed).
library(lattice) # needed for xyplot
xyplot(lifeExp ~ gdpPercap, gDat, subset = year == 1997)
subset() is also a function in its own right, and (confusingly) it has subset = as one of its arguments.
subset(gDat, subset = country == "Canada")
##     country year      pop continent lifeExp gdpPercap
## 241  Canada 1952 14785584  Americas   68.75     11367
## 242  Canada 1957 17010154  Americas   69.96     12490
## 243  Canada 1962 18985849  Americas   71.30     13462
## 244  Canada 1967 20819767  Americas   72.13     16077
## 245  Canada 1972 22284500  Americas   72.88     18971
## 246  Canada 1977 23796400  Americas   74.21     22091
## 247  Canada 1982 25201900  Americas   75.76     22899
## 248  Canada 1987 26549700  Americas   76.86     26627
## 249  Canada 1992 28523502  Americas   77.95     26343
## 250  Canada 1997 30305843  Americas   78.61     28955
## 251  Canada 2002 31902268  Americas   79.77     33329
## 252  Canada 2007 33390141  Americas   80.65     36319
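subset() can also combine several conditions and keep only some columns through its select = argument. The sketch below shows this next to the equivalent bracket indexing, on a toy data frame built from values in this tutorial's output (toy, viaSubset and viaBracket are illustrative names, not part of the original analysis):

```r
# Toy stand-in for gDat
toy <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Canada", "Canada"),
  year    = c(1952L, 1957L, 1952L, 1957L),
  lifeExp = c(28.80, 30.33, 68.75, 69.96)
)

# Rows: Canada after 1952; columns: year and lifeExp only
viaSubset <- subset(toy,
                    subset = country == "Canada" & year > 1952,
                    select = c(year, lifeExp))

# The same selection with bracket indexing
viaBracket <- toy[toy$country == "Canada" & toy$year > 1952,
                  c("year", "lifeExp")]

print(viaSubset)
all.equal(viaSubset, viaBracket)  # compare the two results
```

subset() reads more naturally in interactive work because column names can be used unquoted; the bracket form is more verbose but behaves more predictably inside your own functions.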
Inline code
You can also write inline code in R Markdown; all of the numbers in the following sentence are calculated during the Knit to HTML process:
The Gapminder dataset includes information on 142 different countries from 1952 to 2007.
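In the .Rmd source, inline code is written as a backtick, the letter r, an R expression, and a closing backtick. The original source is not shown here, so the following is only a guess at how the sentence above might have been written: The Gapminder dataset includes information on `r length(unique(gDat$country))` different countries from `r min(gDat$year)` to `r max(gDat$year)`. The expressions themselves are ordinary R and can be tried out in the console first. A minimal sketch on a toy data frame (values from this tutorial's output; toy, nCountries, yearSpan and sentence are illustrative names):

```r
# Toy stand-in for gDat
toy <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Canada", "Canada"),
  year    = c(1952L, 1957L, 1952L, 1957L)
)

# The values an inline chunk would compute at knit time
nCountries <- length(unique(toy$country))
yearSpan   <- range(toy$year)

# Assemble the sentence the way the knitted document would read
sentence <- paste("The toy dataset includes information on", nCountries,
                  "different countries from", yearSpan[1], "to", yearSpan[2])
print(sentence)
```

Because the numbers are computed rather than typed, the sentence stays correct if the underlying data file is updated.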