R is an interpreted language, not a compiled one. This means that you type something into R and it runs it right away. There is no data step, and there are no procs. The SAS and R book is very useful for going between the two programs.
R uses libraries to do different types of analysis, so we will need to install lots of different libraries to do different things. These need to be downloaded from the internet, using the install.packages() command. You only need to install a package once. For example,
install.packages("car") will install the car library. To use the functions within it, type
library(car)
Now you have access to those functions.
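Since a package only needs to be installed once, it can be handy to check whether it is already there before downloading it again. Here is a minimal sketch of that idea. The helper name is_installed() is my own invention for illustration, not something built into R; it leans on the base function requireNamespace().

```r
# Hypothetical helper (not part of R): TRUE if a package is installed and loadable
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with every R installation, so this should be TRUE
is_installed("stats")

# Install only when missing (needs an internet connection), then load it:
# if (!is_installed("car")) install.packages("car")
# library(car)
```

The last two lines are commented out so nothing is downloaded by accident; remove the # signs to use them.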
I strongly recommend you install several packages before we begin. I’ve written a short script on GitHub that you can use by running:
source("https://raw.githubusercontent.com/coreysparks/Rcode/master/install_first_7273.R")
Below we will go through a simple R session where we introduce some concepts that are important for R.
#addition and subtraction
3+7
## [1] 10
3-7
## [1] -4
#multiplication and division
3*7
## [1] 21
3/7
## [1] 0.4285714
#powers
3^2
## [1] 9
3^3
## [1] 27
#functions
log(3/7)
## [1] -0.8472979
exp(3/7)
## [1] 1.535063
sin(3/7)
## [1] 0.4155719
In R we assign values to objects (object-oriented programming). These can generally have any name, but some names are reserved or already in use by R. For instance, you probably wouldn’t want to call something ‘mean’ because there’s a ‘mean()’ function already in R. Here we assign two values and work with them:
x<-3
y<-7
x+y
## [1] 10
x*y
## [1] 21
log(x*y)
## [1] 3.044522
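To see why reusing a name like ‘mean’ is confusing rather than fatal, here is a small sketch. It assumes a fresh session where nothing else is called mean. R lets you make the assignment, and it still finds the function when you call it, but the same name now means two different things.

```r
mean <- 5          # legal, but 'mean' now also names a number
mean(c(1, 2, 3))   # R still finds the function when the name is called
## [1] 2
mean               # typing the name alone shows our number
## [1] 5
rm(mean)           # delete our object; the function is unaffected
```

Picking a more specific name (for example mean_age, a made-up name) avoids the confusion entirely.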
R thinks everything is a matrix, or a vector, meaning a row or column of numbers or characters. One of R’s big selling points is that much of it is completely vectorized, meaning I can apply an operation to all elements of a vector without having to write a loop. For example, if I want to multiply a vector of numbers by a constant, in SAS I could write a loop along these lines:
#for (i in 1 to 5)
#  x[i]<-y[i]*5
#end
but in R, I can just do:
x<-c(3, 4, 5, 6, 7)
#c() makes a vector
y<-7
x*y
## [1] 21 28 35 42 49
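R can loop too, so here is a quick sketch comparing the two approaches on the same x and y. The names looped and vectorized are just labels I made up for the two results.

```r
x <- c(3, 4, 5, 6, 7)
y <- 7

# Loop version: fill a result vector one element at a time
looped <- numeric(length(x))
for (i in seq_along(x)) {
  looped[i] <- x[i] * y
}

# Vectorized version: one expression does the whole vector
vectorized <- x * y

identical(looped, vectorized)
## [1] TRUE
```

Both give the same answer; the vectorized version is shorter and is usually the one you want in R.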
R is also very good at working with vectors. Let’s say I wanted to find the third element of x:
x[3]
## [1] 5
#or if I want to test if this element is 10
x[3]==10
## [1] FALSE
x[3]!=10
## [1] TRUE
#or is it larger than another number:
x[3]>3
## [1] TRUE
#or test whether each element of the whole vector is greater than 3
x>3
## [1] FALSE TRUE TRUE TRUE TRUE
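A logical vector like the one above can be put to work in a few ways. This sketch uses the same x and only base R functions:

```r
x <- c(3, 4, 5, 6, 7)

x[x > 3]       # keep only the elements greater than 3
## [1] 4 5 6 7
any(x > 3)     # is at least one element greater than 3?
## [1] TRUE
all(x > 3)     # are all elements greater than 3?
## [1] FALSE
which(x > 3)   # positions of the elements greater than 3
## [1] 2 3 4 5
sum(x > 3)     # TRUE counts as 1, so this counts them
## [1] 4
```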
If you want to see what’s in an object, use str(), for structure
str(x)
## num [1:5] 3 4 5 6 7
and we see that x is numeric, and has those values.
We can also see different characteristics of x
#how long is x?
length(x)
## [1] 5
#is x numeric?
is.numeric(x)
## [1] TRUE
is.character(x)
## [1] FALSE
#is any element of x missing?
is.na(x)
## [1] FALSE FALSE FALSE FALSE FALSE
xc<-c("1","2")
#now I'll modify x
x<-c(x, NA) #combine x and a missing value (NA)
x
## [1] 3 4 5 6 7 NA
is.na(x)
## [1] FALSE FALSE FALSE FALSE FALSE TRUE
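Missing values matter because they spread through most calculations. A short sketch with the same x shows how to count them and how to tell a function to skip them:

```r
x <- c(3, 4, 5, 6, 7, NA)

sum(is.na(x))           # how many values are missing?
## [1] 1
mean(x)                 # any NA makes the result NA
## [1] NA
mean(x, na.rm = TRUE)   # drop missing values before averaging
## [1] 5
```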
Above, we had a missing value in x. Let’s say we want to replace it with another value:
x<-ifelse(test = is.na(x), yes = sqrt(7.2), no = x)
x
## [1] 3.000000 4.000000 5.000000 6.000000 7.000000 2.683282
Done!
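Another common way to do the same replacement is to index the vector with the logical test and assign into just those positions. In this sketch x2 is a made-up copy so the x above is left alone.

```r
x2 <- c(3, 4, 5, 6, 7, NA)    # hypothetical copy, for illustration only
x2[is.na(x2)] <- sqrt(7.2)    # assign only where the value is missing
x2
## [1] 3.000000 4.000000 5.000000 6.000000 7.000000 2.683282
```

Either approach works; ifelse() builds a whole new vector, while the indexing version changes only the missing positions.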
Traditionally, R organizes variables into data frames, which are like a spreadsheet. The columns can have names, and the data frame itself can hold data of different types. Here we make a short data frame with three columns, two numeric and one character:
mydat<-data.frame(
x=c(1,2,3,4,5),
y=c(10, 20, 35, 57, 37),
group=c("A", "A" ,"A", "B", "B")
)
#See the size of the dataframe
dim(mydat)
## [1] 5 3
length(mydat$x)
## [1] 5
#Open the data frame in a viewer, then print it
View(mydat)
print(mydat)
##   x  y group
## 1 1 10     A
## 2 2 20     A
## 3 3 35     A
## 4 4 57     B
## 5 5 37     B
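Once you have a data frame, you will mostly be pulling out columns, picking rows, and summarizing. Here is a brief sketch using the same mydat. The names group_b, by_group and the ratio column are made up for illustration.

```r
mydat <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(10, 20, 35, 57, 37),
  group = c("A", "A", "A", "B", "B")
)

mydat$y                                  # pull out one column as a vector
group_b <- mydat[mydat$group == "B", ]   # keep only the rows in group B
group_b

mydat$ratio <- mydat$y / mydat$x         # add a new (hypothetical) column

# mean of y within each group
by_group <- tapply(mydat$y, mydat$group, mean)
by_group
```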
Now let’s open a ‘real’ data file. This is the 2008 World population data sheet from the Population Reference Bureau. It contains summary information on many demographic and population level characteristics of nations around the world in 2008.
I’ve had this entered into a Comma Separated Values file by some poor previous GRA of mine and it lives happily on GitHub now for all the world to see. CSV files are a good way to store data coming out of a spreadsheet. R can read Excel files, but it digests text files more easily, so save your Excel data as CSV.
I can read it from GitHub directly by using a function in the readr library:
library(readr)
prb<-read_csv(file = "https://raw.githubusercontent.com/coreysparks/data/master/PRB2008_All.csv")
## Parsed with column specification:
## cols(
## .default = col_integer(),
## Country = col_character(),
## Continent = col_character(),
## Region = col_character(),
## Population. = col_double(),
## Rate.of.natural.increase = col_double(),
## ProjectedPopMid2025 = col_double(),
## ProjectedPopMid2050 = col_double(),
## IMR = col_double(),
## TFR = col_double(),
## PercPop1549HIVAIDS2001 = col_double(),
## PercPop1549HIVAIDS2007 = col_double(),
## PercPpUnderNourished0204 = col_double(),
## PopDensPerSqMile = col_double()
## )
## See spec(...) for full column specifications.
names(prb) #print the column names
## [1] "Y"
## [2] "X"
## [3] "ID"
## [4] "Country"
## [5] "Continent"
## [6] "Region"
## [7] "Year"
## [8] "Population."
## [9] "CBR"
## [10] "CDR"
## [11] "Rate.of.natural.increase"
## [12] "Net.Migration.Rate"
## [13] "ProjectedPopMid2025"
## [14] "ProjectedPopMid2050"
## [15] "ProjectedPopChange_08_50Perc"
## [16] "IMR"
## [17] "WomandLifeTimeRiskMaternalDeath"
## [18] "TFR"
## [19] "PercPopLT15"
## [20] "PercPopGT65"
## [21] "e0Total"
## [22] "e0Male"
## [23] "e0Female"
## [24] "PercUrban"
## [25] "PercPopinUrbanGT750k"
## [26] "PercPop1549HIVAIDS2001"
## [27] "PercPop1549HIVAIDS2007"
## [28] "PercMarWomContraALL"
## [29] "PercMarWomContraModern"
## [30] "PercPpUnderNourished0204"
## [31] "MotorVehper1000Pop0005"
## [32] "PercPopwAccessImprovedWaterSource"
## [33] "GNIPPPperCapitaUSDollars"
## [34] "PopDensPerSqKM"
## [35] "PopDensPerSqMile"
View(prb) #open it in a viewer
That’s handy. If the file lived on our computer, I could read it in like so. Note: please make a folder on your computer so you can store things for this class in a single location! Organization is key to success in graduate school.
#prb<-read_csv("C:/Users/ozd504/Google Drive/classes/dem7273/class_17/data/PRB2008_All.csv")
Same result.
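If you want to try reading a local file without downloading anything, here is a small sketch that writes a toy CSV to a temporary file and reads it back. The data and the names toy, toy2 and tmp are made up, and I use base R’s read.csv() here so it runs without extra packages; read_csv() from readr could be pointed at the same file.

```r
# Hypothetical toy data, written to a temporary file
toy <- data.frame(country = c("A", "B", "C"), tfr = c(1.8, 2.5, NA))
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)

# Read it back from the local path
toy2 <- read.csv(tmp)
str(toy2)

# Check that a path points at a real file before reading it
file.exists(tmp)
```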
The haven library can read files from other statistical packages easily, so if you have data in Stata, SAS or SPSS (barf!), you can read it into R using its functions. For example, the read_dta() function reads Stata files.
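As a rough sketch of how that looks, the code below writes a tiny made-up data set to a temporary Stata file with write_dta() and reads it back with read_dta(). It assumes the haven package is installed and quietly does nothing if it is not; the names toy_dta and back are hypothetical.

```r
if (requireNamespace("haven", quietly = TRUE)) {
  # Hypothetical toy data saved in Stata format
  toy_dta <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.8))
  tmp_dta <- tempfile(fileext = ".dta")
  haven::write_dta(toy_dta, tmp_dta)

  # Read the Stata file back into R
  back <- haven::read_dta(tmp_dta)
  print(back)
}
```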
Don’t know what a function is called? Use ??
??stata
??csv
and RStudio will show you a list of functions that have these strings in them.
What if you know the function name, like read_csv() but you want to see all the function arguments?
?read_csv
will open up the help file for that specific function
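There are a few other base R tools for looking things up from the console. This sketch uses read.csv() as the example because it comes with R; the name found is made up.

```r
# Print just the argument list of a function
args(read.csv)

# List attached functions whose names start with "read."
found <- apropos("^read\\.")
found

# These open help pages, so they are commented out here:
# help("read.csv")     # same as ?read.csv
# help.search("csv")   # same as ??csv
# example(mean)        # run the examples from the help page for mean()
```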
Want to save something as an R data file? Use save():
#save(prb, file="C:/Users/ozd504/Google Drive/classes/dem7273/class_17/data/prb_2008.Rdata")
If you have an R data file, use load() to open it:
#load("C:/Users/ozd504/Google Drive/classes/dem7273/class_17/data/prb_2008.Rdata")
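Here is a small round trip you can run as-is, using a temporary file instead of a real folder. The toy data and object names are made up. I also show saveRDS() and readRDS(), which are base R functions that store a single object and let you pick its name when you read it back.

```r
toy <- data.frame(id = 1:3, tfr = c(1.8, 2.5, 4.1))  # hypothetical toy data

# save() and load(): the object comes back under its original name
tmp <- tempfile(fileext = ".Rdata")
save(toy, file = tmp)
rm(toy)
load(tmp)
exists("toy")
## [1] TRUE

# saveRDS() and readRDS(): one object, and you choose the name on reading
rds <- tempfile(fileext = ".rds")
saveRDS(toy, rds)
toy_again <- readRDS(rds)
identical(toy, toy_again)
## [1] TRUE
```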
Let’s have a look at some descriptive information about the data:
#Frequency table of the number of countries by continent
table(prb$Continent)
##
##        Africa          Asia        Europe North America       Oceania
##            56            51            45            27            17
## South America
##            13
#basic summary statistics for the variable TFR or the total fertility rate
summary(prb$TFR)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   1.000   1.775   2.500   3.032   4.000   7.100       1
There is one country missing the Total fertility rate variable. The minimum is 1 and the maximum is 7.1 children per woman.
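Because missing values show up in real data like this, here is a small sketch of the same kind of summaries on made-up vectors (continent and tfr are hypothetical stand-ins, not the PRB data), so you can see how the NA is handled at each step.

```r
# Hypothetical toy data with one missing value in each vector
continent <- c("Africa", "Asia", "Asia", "Europe", NA)
tfr <- c(5.1, 2.3, 1.9, 1.5, NA)

counts <- table(continent, useNA = "ifany")   # include NA as its own category
counts
prop.table(table(continent))                  # proportions among non-missing

overall <- mean(tfr, na.rm = TRUE)            # mean, skipping the NA
overall
tapply(tfr, continent, mean, na.rm = TRUE)    # mean within each continent

# With the data above, the same idea would be:
# tapply(prb$TFR, prb$Continent, mean, na.rm = TRUE)
```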
More next week!!