Karim Naguib (Boston University)
9/11/2013
1 + 2
[1] 3
4 * (1 - 2)
[1] -4
2^10
[1] 1024
A variable is storage for a value or the result of a calculation.
x <- 1
x
[1] 1
x <- x + 1
x
[1] 2
y <- 3
z <- x + y
z
[1] 5
A variable can also store a vector of values.
x <- c(4, 4, 1, 5, 100)
x
[1] 4 4 1 5 100
y <- c("abc", "hello")
y
[1] "abc" "hello"
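As a small sketch of what can be done with a vector (the names v and w are chosen here for illustration and are not part of the original examples), arithmetic is applied to every element and single elements are selected with square brackets:

```r
# Illustrative vector; the names v and w are hypothetical
v <- c(4, 4, 1, 5, 100)

w <- v * 2        # arithmetic is applied element by element
first <- v[1]     # indexing starts at 1, not 0
n <- length(v)    # number of elements
total <- sum(v)   # sum of all elements

w
first
n
total
```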
f <- function(x, y) { x^2 + y }
f(2, 3)
[1] 7
x <- f(1, 10)
x
[1] 11
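A function argument can also be given a default value, and arguments can be passed by name. The function g below is a hypothetical variation on f, shown only as a sketch:

```r
# g is a hypothetical variant of f in which y defaults to 1
g <- function(x, y = 1) {
  x^2 + y
}

a <- g(2)         # y is not supplied, so the default is used
b <- g(2, y = 3)  # y is passed by name
a
b
```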
To load the foreign package we use the command
library(foreign)
summary is a useful function for getting a quick description of data
x <- c(1, 2, 100, 33, -34)
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -34.0     1.0     2.0    20.4    33.0   100.0
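Each number reported by summary can also be computed on its own. A minimal sketch using the same vector (the name s is introduced here only for illustration):

```r
x <- c(1, 2, 100, 33, -34)

s <- summary(x)
s["Median"]          # pick one statistic out of the summary by name

median(x)            # the median computed directly
quantile(x, 0.25)    # the first quartile
range(x)             # the minimum and the maximum
mean(x)              # the sample mean
```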
For example, suppose we want to load the CPS data used in chapter 3 of the textbook
load("/usr/data/ec414/cps_ch3.RData")
If we execute the ls function to display the variables in the workspace, we will see
ls()
[1] "cps.ch3" "f" "x" "y" "z"
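To see how load restores objects under their original names, here is a sketch of a save and load round trip. The object wages and the temporary file are hypothetical and stand in for the course data file:

```r
# Hypothetical object and temporary file, used only for illustration
wages <- c(17.2, 15.3, 22.9)
path <- tempfile(fileext = ".RData")

save(wages, file = path)   # write the object to disk
rm(wages)                  # remove it from the workspace

loaded <- load(path)       # restore it; load() returns the restored names
loaded
wages
```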
There are several functions we can use to learn more about the data we just loaded
str(cps.ch3)
'data.frame': 15393 obs. of 3 variables:
$ a.sex: int 1 1 1 2 1 2 2 1 1 1 ...
$ year : int 1992 1992 1992 1992 1992 1992 1992 1992 1992 1992 ...
$ ahe08: num 17.2 15.3 22.9 13.3 22.1 ...
summary(cps.ch3)
     a.sex           year          ahe08
 Min.   :1.00   Min.   :1992   Min.   : 2.0
 1st Qu.:1.00   1st Qu.:1996   1st Qu.:15.3
 Median :1.00   Median :2000   Median :20.5
 Mean   :1.48   Mean   :2001   Mean   :22.4
 3rd Qu.:2.00   3rd Qu.:2004   3rd Qu.:27.4
 Max.   :2.00   Max.   :2008   Max.   :82.4
Some variables in cps.ch3 are stored as integers (a.sex and year). We can tell R to treat these columns of the data.frame as categorical variables by converting them with factor
cps.ch3$a.sex <- factor(cps.ch3$a.sex, levels=c(1,2), labels=c("male", "female"))
cps.ch3$year <- factor(cps.ch3$year)
str(cps.ch3)
'data.frame': 15393 obs. of 3 variables:
$ a.sex: Factor w/ 2 levels "male","female": 1 1 1 2 1 2 2 1 1 1 ...
$ year : Factor w/ 5 levels "1992","1996",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ahe08: num 17.2 15.3 22.9 13.3 22.1 ...
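The same conversion can be tried on a small vector. This is a sketch on made-up codes (sex.code and sex are hypothetical names) that follow the 1 = male, 2 = female coding used above:

```r
# Hypothetical integer codes: 1 = male, 2 = female
sex.code <- c(1, 1, 2, 1, 2)

sex <- factor(sex.code, levels = c(1, 2), labels = c("male", "female"))

levels(sex)       # the category labels
table(sex)        # number of observations in each category
as.integer(sex)   # the underlying integer codes
```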
Using the data we just loaded, we want to calculate the sample mean and variance of average hourly earnings (ahe08)
mean(cps.ch3$ahe08)
[1] 22.4
var(cps.ch3$ahe08)
[1] 108.4
sd(cps.ch3$ahe08) # standard deviation
[1] 10.41
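As a check on what these functions compute, the sketch below works out the sample mean, the sample variance (with the n - 1 denominator that var uses), and the standard deviation by hand. The vector earn holds made-up values and is not the CPS data:

```r
# Hypothetical earnings values, for illustration only
earn <- c(17.2, 15.3, 22.9, 13.3, 22.1)

n <- length(earn)
m <- sum(earn) / n                     # sample mean
s2 <- sum((earn - m)^2) / (n - 1)      # sample variance, n - 1 denominator

all.equal(m, mean(earn))               # agrees with mean()
all.equal(s2, var(earn))               # agrees with var()
all.equal(sqrt(s2), sd(earn))          # sd() is the square root of var()
```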
But since we want to compare average wages between men and women, we want to calculate sample statistics for each of these groups
male.cps <- subset(cps.ch3, a.sex == "male")
str(male.cps)
'data.frame': 8008 obs. of 3 variables:
$ a.sex: Factor w/ 2 levels "male","female": 1 1 1 1 1 1 1 1 1 1 ...
$ year : Factor w/ 5 levels "1992","1996",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ahe08: num 17.2 15.3 22.9 22.1 36.1 ...
female.cps <- subset(cps.ch3, a.sex == "female")
str(female.cps)
'data.frame': 7385 obs. of 3 variables:
$ a.sex: Factor w/ 2 levels "male","female": 2 2 2 2 2 2 2 2 2 2 ...
$ year : Factor w/ 5 levels "1992","1996",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ahe08: num 13.28 12.17 21.07 7.82 18.61 ...
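An alternative to building one subset per group is tapply, which applies a function within each level of a factor. The data frame toy below is a hypothetical stand-in that only borrows the column names of cps.ch3:

```r
# Hypothetical toy data; column names follow cps.ch3
toy <- data.frame(
  a.sex = factor(c("male", "male", "female", "female", "male"),
                 levels = c("male", "female")),
  ahe08 = c(20, 24, 15, 19, 22)
)

# Mean of ahe08 within each level of a.sex
group.means <- tapply(toy$ahe08, toy$a.sex, mean)
group.means

# The same number for men, obtained through subset
male.mean <- mean(subset(toy, a.sex == "male")$ahe08)
male.mean
```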
To get a more specific subset, for example, women surveyed in 1992
female.92.cps <- subset(cps.ch3, a.sex == "female" & year == "1992")
str(female.92.cps)
'data.frame': 1368 obs. of 3 variables:
$ a.sex: Factor w/ 2 levels "male","female": 2 2 2 2 2 2 2 2 2 2 ...
$ year : Factor w/ 5 levels "1992","1996",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ahe08: num 13.28 12.17 21.07 7.82 18.61 ...
summary(female.92.cps)
    a.sex        year          ahe08
 male  :   0   1992:1368   Min.   : 2.61
 female:1368   1996:   0   1st Qu.:14.75
               2000:   0   Median :18.65
               2004:   0   Mean   :20.05
               2008:   0   3rd Qu.:24.40
                           Max.   :66.37
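The zero counts appear because a factor keeps all of its levels after subsetting, even levels with no remaining observations. If that is not wanted, droplevels removes the unused levels. A sketch on hypothetical toy data:

```r
# Hypothetical toy data with two survey years
toy <- data.frame(
  year  = factor(c("1992", "1992", "1996")),
  ahe08 = c(13.3, 12.2, 21.1)
)

sub92 <- subset(toy, year == "1992")
table(sub92$year)            # the unused level 1996 is still listed

sub92.clean <- droplevels(sub92)
levels(sub92.clean$year)     # only the levels that actually occur
```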
For plotting we use the ggplot2 package.
library(ggplot2)
The plots in this section are drawn with its qplot function.
A boxplot is useful for comparing the distributions of average hourly earnings across groups.
qplot(x=a.sex, y=ahe08, data=cps.ch3, geom="boxplot")
If we want to see the same boxplots separated by year, we use the facets argument.
qplot(x=a.sex, y=ahe08, data=cps.ch3, geom="boxplot", facets= ~ year)
We might want to add a few things to the plot to make it easier to read, such as axis labels
qplot(x=a.sex, y=ahe08, data=cps.ch3, geom="boxplot", facets= ~ year, xlab="Sex", ylab="Average Hourly Earnings")
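Newer ggplot2 releases mark qplot as deprecated, so it may be useful to know how the same plot looks when written with ggplot. The sketch below assumes ggplot2 is installed and uses a hypothetical data frame toy in place of cps.ch3:

```r
# Hypothetical toy data standing in for cps.ch3
toy <- data.frame(
  a.sex = factor(rep(c("male", "female"), times = 4)),
  year  = factor(rep(c("1992", "2008"), each = 4)),
  ahe08 = c(17, 13, 23, 12, 25, 21, 30, 24)
)

# Only run the plotting code if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = a.sex, y = ahe08)) +
    geom_boxplot() +                                  # geom="boxplot"
    facet_wrap(~ year) +                              # facets= ~ year
    labs(x = "Sex", y = "Average Hourly Earnings")    # xlab, ylab
  print(p)
}
```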
Another way to view the same data is to use density plots
qplot(x=ahe08, color=a.sex, data=cps.ch3, geom="density", facets= ~ year, xlab="Average Hourly Earnings")
For this part let's load another dataset, the March CPS survey for 2008 with more information on full-time workers
load("/usr/data/ec414/cps08_full.RData")
ls()
[1] "cps.08.full" "cps.ch3" "f" "female.92.cps"
[5] "female.cps" "male.cps" "x" "y"
[9] "z"
str(cps.08.full)
'data.frame': 7711 obs. of 5 variables:
$ ahe : num 38.46 12.5 9.86 8.24 17.79 ...
$ year : int 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
$ bachelor: int 1 1 0 0 0 0 1 1 1 1 ...
$ female : int 0 0 0 0 0 1 0 0 0 1 ...
$ age : int 33 31 30 30 31 29 26 28 30 25 ...
cps.08.full$year <- factor(cps.08.full$year)
cps.08.full$bachelor <- cps.08.full$bachelor == 1
cps.08.full$female <- cps.08.full$female == 1
cps.08.bac <- subset(cps.08.full, bachelor == TRUE)
qplot(x=age, y=ahe, data=cps.08.bac, color=female, geom="point")
qplot(x=factor(age), y=ahe, data=cps.08.bac, color=female, geom="boxplot")
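The comparisons with 1 above turn the 0/1 indicator columns into logical (TRUE/FALSE) columns, which can then be used directly as a subset condition. A sketch on hypothetical toy data that borrows the column names of cps.08.full:

```r
# Hypothetical toy data with 0/1 indicator columns
toy <- data.frame(
  ahe      = c(38.5, 12.5, 9.9, 17.8),
  bachelor = c(1, 1, 0, 0),
  female   = c(0, 1, 0, 1)
)

toy$bachelor <- toy$bachelor == 1   # 1 becomes TRUE, 0 becomes FALSE
toy$female   <- toy$female == 1

toy.bac <- subset(toy, bachelor)    # a logical column works as a condition
nrow(toy.bac)

mean(toy$female)                    # share of TRUE values
```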
To get help on a function, use the ? command (make sure that the function's package is loaded). For example
?qplot
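Help can also be requested through the help function, and apropos lists objects whose names match a pattern. A short sketch using the base function summary (the names h and matches are hypothetical):

```r
# help("summary") looks up the same page as typing ?summary at the console
h <- help("summary")
class(h)

# apropos() returns the names of objects matching a regular expression
matches <- apropos("^summary$")
matches
```

Typing h at the console displays the help page.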