You will usually save the files for an empirical project in a single folder. For R to read from and write to that folder, set the ‘working directory’ to it using the setwd() function.
setwd("C:/Users/dvorakt/Google Drive/teaching/243")
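To confirm that the working directory is what you expect, you can query it with getwd() and list its contents with list.files(). The sketch below is only an illustration: it uses a temporary folder (tempdir()) as a stand-in for your project folder, and the name project_dir is made up for this example.

```r
# Stand-in for your project folder (assumption: replace with your own path)
project_dir <- tempdir()

old_dir <- setwd(project_dir)  # setwd() invisibly returns the previous directory
getwd()                        # the folder R now reads from and writes to
list.files()                   # files R can open here without a full path

setwd(old_dir)                 # return to the directory we started in
```

Note that R paths use forward slashes (/) even on Windows, as in the setwd() call above.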
R can read many data formats: tab-delimited files, HTML tables, Stata files (using the package foreign), and Excel files (using a package such as xlsx). For now, save your data in a comma-delimited file (.csv). You can read that file using the read.csv() function. The output of that function is a data frame. We use the <- operator to assign this data frame the name data. You can check the first few rows of the data frame using the function head().
data <- read.csv("cps08.csv")
head(data)
##   salary edu age male white black asian other married private
## 1  27000  12  43    0     0     1     0     0       1       1
## 2  36002  12  46    1     0     0     1     0       1       1
## 3  70000  12  36    1     0     0     0     1       1       1
## 4  60000  18  37    1     0     0     1     0       1       1
## 5  16000  11  52    0     0     0     0     1       0       0
## 6  17500  12  31    1     0     1     0     0       0       1
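Besides head(), a few other base R functions help you check that a file was read correctly. The sketch below is self-contained: it writes a tiny made-up data frame (toy, not the CPS data used here) to a temporary .csv file and reads it back, so the values and the object names toy, csv_path and toy_data are assumptions for illustration only.

```r
# Toy data, made up for illustration (not the CPS data used in this tutorial)
toy <- data.frame(salary = c(30000, 45000, 52000),
                  edu    = c(12, 14, 16))

# Write it to a temporary .csv file, then read it back as you would your own file
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)
toy_data <- read.csv(csv_path)

class(toy_data)    # "data.frame"
dim(toy_data)      # number of rows, then number of columns
str(toy_data)      # name and type of each column
head(toy_data, 2)  # first two rows only
```

With your own data you would skip the first two steps and call read.csv() on your file name, as above.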
A basic function for computing descriptive statistics is summary().
summary(data)
##      salary            edu             age            male
##  Min.   :    20   Min.   : 0.00   Min.   :15.0   Min.   :0.0000
##  1st Qu.: 25000   1st Qu.:12.00   1st Qu.:32.0   1st Qu.:0.0000
##  Median : 40000   Median :13.00   Median :42.0   Median :1.0000
##  Mean   : 50213   Mean   :13.75   Mean   :41.8   Mean   :0.5549
##  3rd Qu.: 60000   3rd Qu.:16.00   3rd Qu.:51.0   3rd Qu.:1.0000
##  Max.   :706117   Max.   :21.00   Max.   :85.0   Max.   :1.0000
##      white            black           asian             other
##  Min.   :0.0000   Min.   :0.000   Min.   :0.00000   Min.   :0.00000
##  1st Qu.:1.0000   1st Qu.:0.000   1st Qu.:0.00000   1st Qu.:0.00000
##  Median :1.0000   Median :0.000   Median :0.00000   Median :0.00000
##  Mean   :0.7986   Mean   :0.115   Mean   :0.05266   Mean   :0.03374
##  3rd Qu.:1.0000   3rd Qu.:0.000   3rd Qu.:0.00000   3rd Qu.:0.00000
##  Max.   :1.0000   Max.   :1.000   Max.   :1.00000   Max.   :1.00000
##     married          private
##  Min.   :0.0000   Min.   :0.0000
##  1st Qu.:0.0000   1st Qu.:1.0000
##  Median :1.0000   Median :1.0000
##  Mean   :0.6145   Mean   :0.8201
##  3rd Qu.:1.0000   3rd Qu.:1.0000
##  Max.   :1.0000   Max.   :1.0000
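You can also compute statistics for one variable at a time by selecting a column with the $ operator. The sketch below uses a small made-up data frame (toy) rather than the CPS data, so the numbers are for illustration only. It also shows why the mean of a 0/1 variable such as male is useful: it equals the share of observations coded 1.

```r
# Toy data, made up for illustration
toy <- data.frame(salary = c(20000, 40000, 60000),
                  male   = c(0, 1, 1))

mean(toy$salary)     # average salary: 40000
median(toy$salary)   # middle value
sd(toy$salary)       # standard deviation: 20000
mean(toy$male)       # share of observations with male == 1 (2 out of 3)
summary(toy$salary)  # summary() also works on a single column
```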
The summary() output does the job but is not particularly pretty. Many additions to base R are available; these additions are called packages. You can download a package by clicking on the Packages tab in the lower right window, selecting a package and clicking install. A useful package for displaying pretty tables is stargazer. Install it into your version of R. Once you have it installed, you need to let R know that you would like to use the package by executing the function library(stargazer). Then you can apply the stargazer() function to your data frame and see what you get.
library(stargazer)
stargazer(data, type = "text")
##
## ==================================================
## Statistic   N       Mean     St. Dev.  Min   Max
## --------------------------------------------------
## salary    63,787 50,212.930 47,700.540 20  706,117
## edu       63,787   13.746     2.792     0    21
## age       63,787   41.799     11.827   15    85
## male      63,787   0.555      0.497     0     1
## white     63,787   0.799      0.401     0     1
## black     63,787   0.115      0.319     0     1
## asian     63,787   0.053      0.223     0     1
## other     63,787   0.034      0.181     0     1
## married   63,787   0.614      0.487     0     1
## private   63,787   0.820      0.384     0     1
## --------------------------------------------------
This table is much easier to read, isn’t it?
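If library(stargazer) fails with an error, the package is most likely not installed yet. The sketch below is one possible way to check for it before loading. It uses R's built-in cars data set as a stand-in for your own data frame, and the variable name has_stargazer is made up for this example. The digits argument controls how many decimal places stargazer() prints.

```r
# Check whether stargazer is installed without stopping on an error
has_stargazer <- requireNamespace("stargazer", quietly = TRUE)

if (has_stargazer) {
  library(stargazer)
  # cars is a data set that ships with R; used here only as a stand-in
  stargazer(cars, type = "text", digits = 1)
} else {
  message("Run install.packages(\"stargazer\") once, then try again.")
}
```

You only need to install a package once, but you need to call library() in every new R session.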
Let’s use our data to estimate a simple regression with the function lm(). The first argument of that function is the formula to be estimated, with the dependent variable first and the independent variables following the symbol ~. The second argument is the name of the data frame to be used (in our case, ‘data’). We assign the result to an object that we name ‘model’ and then display a summary of that object using the function summary().
model <- lm(salary ~ edu, data)
summary(model)
##
## Call:
## lm(formula = salary ~ edu, data = data)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -86563 -20990  -8963   9737 648854
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36008.26     882.71  -40.79   <2e-16 ***
## edu           6272.59      62.93   99.67   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44370 on 63785 degrees of freedom
## Multiple R-squared: 0.1348, Adjusted R-squared: 0.1347
## F-statistic: 9934 on 1 and 63785 DF, p-value: < 2.2e-16
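The estimated coefficients can be read as a prediction rule: predicted salary = -36008.26 + 6272.59 × edu, so each additional year of education is associated with a salary that is about 6,273 higher. The sketch below first applies that rule by hand using the coefficients reported above, and then shows how coef() and predict() do the same for a fitted model. The second part uses a small made-up data frame (toy) constructed so that salary = 1000 + 2000 × edu exactly; the names toy, toy_model and by_hand are assumptions for illustration only.

```r
# Part 1: predictions by hand from the coefficients reported above
intercept <- -36008.26
slope     <- 6272.59
by_hand   <- intercept + slope * c(12, 16)  # 12 and 16 years of education
by_hand                                     # 39262.82 64353.18

# Part 2: the same idea with a fitted model (toy data, made up for illustration)
toy <- data.frame(edu    = c(10, 12, 14, 16),
                  salary = c(21000, 25000, 29000, 33000))
toy_model <- lm(salary ~ edu, data = toy)

coef(toy_model)                                      # intercept 1000, slope 2000
predict(toy_model, newdata = data.frame(edu = 18))   # 1000 + 2000 * 18 = 37000
```

With the model estimated above you would call coef(model) and predict(model, newdata = ...) in the same way.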