Computer Exercise 2: Loading in data, computing descriptive statistics and estimating regressions

Learning objectives:

  • setting a working directory
  • reading in data
  • computing descriptive statistics
  • estimating regressions

1. Setting a working directory

Usually you save the files for your empirical project in a folder. For R to read from and write to that folder, set the ‘working directory’ to that folder using the setwd() function. Note that the path uses forward slashes (/), even on Windows.

setwd("C:/Users/dvorakt/Google Drive/teaching/243")
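
To confirm that the change took effect, you can ask R which folder it is currently using. The lines below are a minimal sketch: tempdir() is only a stand-in for your own project folder, and the names old_dir and project_dir are made up for this illustration.

```r
# Remember the current working directory so it can be restored afterwards.
old_dir <- getwd()

# tempdir() is used here only as a stand-in for your own project folder.
project_dir <- tempdir()
setwd(project_dir)

# getwd() reports the folder R now reads from and writes to.
getwd()

# list.files() shows the files R can see in that folder.
list.files()

# Restore the original working directory.
setwd(old_dir)
```

If getwd() does not show the folder you expect, read.csv() will probably not find your data file.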

2. Reading in data

R can read many data formats: tab-delimited files, HTML tables, Stata files (using the package foreign) and Excel files (using the package xlsx). For now you should save your data in a comma-delimited file (.csv). You can read that file using the read.csv() function. Because the working directory is already set, the file name alone is enough. The output of that function is a data frame. We use the <- operator to assign this data frame the name data. You can check the first few rows of that data frame using the function head().

data <- read.csv("cps08.csv")
head(data)
##   salary edu age male white black asian other married private
## 1  27000  12  43    0     0     1     0     0       1       1
## 2  36002  12  46    1     0     0     1     0       1       1
## 3  70000  12  36    1     0     0     0     1       1       1
## 4  60000  18  37    1     0     0     1     0       1       1
## 5  16000  11  52    0     0     0     0     1       0       0
## 6  17500  12  31    1     0     1     0     0       0       1
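
If you want to try read.csv() without the course file, the sketch below writes a tiny comma-delimited file to a temporary location and reads it back. The three rows are copied from the head(data) output above; the names csv_path and sample_data are made up for this illustration.

```r
# Write a tiny comma-delimited file to a temporary location.
# The first line holds the variable names, the rest are observations.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "salary,edu,age,male",
  "27000,12,43,0",
  "36002,12,46,1",
  "70000,12,36,1"
), csv_path)

# read.csv() turns the file into a data frame.
sample_data <- read.csv(csv_path)

str(sample_data)   # variable names and types
dim(sample_data)   # number of rows and columns
```

Functions str() and dim() are a quick way to check that the data loaded with the variables and the number of observations you expected.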

3. Computing descriptive statistics

A basic function for computing descriptive statistics is summary().

summary(data)
##      salary            edu             age            male       
##  Min.   :    20   Min.   : 0.00   Min.   :15.0   Min.   :0.0000  
##  1st Qu.: 25000   1st Qu.:12.00   1st Qu.:32.0   1st Qu.:0.0000  
##  Median : 40000   Median :13.00   Median :42.0   Median :1.0000  
##  Mean   : 50213   Mean   :13.75   Mean   :41.8   Mean   :0.5549  
##  3rd Qu.: 60000   3rd Qu.:16.00   3rd Qu.:51.0   3rd Qu.:1.0000  
##  Max.   :706117   Max.   :21.00   Max.   :85.0   Max.   :1.0000  
##      white            black           asian             other        
##  Min.   :0.0000   Min.   :0.000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:1.0000   1st Qu.:0.000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :1.0000   Median :0.000   Median :0.00000   Median :0.00000  
##  Mean   :0.7986   Mean   :0.115   Mean   :0.05266   Mean   :0.03374  
##  3rd Qu.:1.0000   3rd Qu.:0.000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.000   Max.   :1.00000   Max.   :1.00000  
##     married          private      
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.0000  
##  Median :1.0000   Median :1.0000  
##  Mean   :0.6145   Mean   :0.8201  
##  3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000
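
Function summary() reports statistics for every variable at once. For a single variable, you can select one column with $ and apply a base R function to it. The sketch below uses only the six observations shown by head(data) above, so its numbers will differ from the full-sample numbers; the name sample_data is made up for this illustration.

```r
# The first six observations of salary and edu, copied from head(data).
sample_data <- data.frame(
  salary = c(27000, 36002, 70000, 60000, 16000, 17500),
  edu    = c(12, 12, 12, 18, 11, 12)
)

# The $ operator selects one column of a data frame.
mean(sample_data$salary)     # average
median(sample_data$salary)   # middle value
sd(sample_data$salary)       # standard deviation

# summary() also works on a single column.
summary(sample_data$edu)
```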

This does the job but is not particularly pretty. There are many additions to base R. These additions are called packages. You can download a package by clicking on the Packages tab in the lower right window, selecting a package and clicking install. A useful package for displaying pretty tables is stargazer. Install it into your version of R. Once you have it installed, you need to let R know that you would like to use the package by executing the function library(stargazer). Then you can use the stargazer() function on your data frame and see what you get.
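
Packages can also be installed from the console with install.packages(). The sketch below wraps that call in a small helper so that a package is downloaded only when it is missing. The helper install_if_missing() is hypothetical: it is not part of base R or of stargazer.

```r
# Hypothetical helper (not part of base R or stargazer):
# install a package from CRAN only if it is not already available.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can now be loaded.
  invisible(requireNamespace(pkg, quietly = TRUE))
}
```

Running install_if_missing("stargazer") once per computer is enough, whereas library(stargazer) has to be run in every new R session.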

library(stargazer)
stargazer(data, type = "text")
## 
## ==================================================
## Statistic   N       Mean     St. Dev.  Min   Max  
## --------------------------------------------------
## salary    63,787 50,212.930 47,700.540 20  706,117
## edu       63,787   13.746     2.792     0    21   
## age       63,787   41.799     11.827   15    85   
## male      63,787   0.555      0.497     0     1   
## white     63,787   0.799      0.401     0     1   
## black     63,787   0.115      0.319     0     1   
## asian     63,787   0.053      0.223     0     1   
## other     63,787   0.034      0.181     0     1   
## married   63,787   0.614      0.487     0     1   
## private   63,787   0.820      0.384     0     1   
## --------------------------------------------------

This table is much easier to read, isn’t it?

4. Estimating regressions

Let’s use our data to estimate a simple regression. We use the function lm(). Its first argument is the formula to be estimated, with the dependent variable first and the independent variables (separated by + when there is more than one) following the symbol ~. The second argument is the name of the data frame to be used (in our case it is ‘data’). We assign the result to an object we name ‘model’ and then display a summary of that object using the function summary().

model <- lm(salary ~ edu, data)
summary(model)
## 
## Call:
## lm(formula = salary ~ edu, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -86563 -20990  -8963   9737 648854 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36008.26     882.71  -40.79   <2e-16 ***
## edu           6272.59      62.93   99.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44370 on 63785 degrees of freedom
## Multiple R-squared:  0.1348, Adjusted R-squared:  0.1347 
## F-statistic:  9934 on 1 and 63785 DF,  p-value: < 2.2e-16
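
To see what the estimates mean, it can help to compute a predicted value by hand. The first part of the sketch below uses the two coefficients reported in the output above to predict the salary of someone with 16 years of education. The second part shows how coef() and predict() would do the same on a fitted object; it uses only the six observations from head(data), so its estimates differ from the full-sample ones. The names intercept, slope_edu, sample_data and sample_model are made up for this illustration.

```r
# Part 1: coefficients copied from the summary(model) output above.
intercept <- -36008.26
slope_edu <- 6272.59

# Predicted salary = intercept + slope * years of education.
edu_years <- 16
predicted_salary <- intercept + slope_edu * edu_years
predicted_salary

# Part 2: the same steps on a fitted object, using six sample rows only.
sample_data <- data.frame(
  salary = c(27000, 36002, 70000, 60000, 16000, 17500),
  edu    = c(12, 12, 12, 18, 11, 12)
)
sample_model <- lm(salary ~ edu, data = sample_data)

coef(sample_model)                                      # intercept and slope
predict(sample_model, newdata = data.frame(edu = 16))   # fitted value at edu = 16
```

In the full-sample output, the slope says that each additional year of education is associated with a predicted salary that is about 6,273 higher.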


Exercises:

  1. Estimate the effect of age on salary.
  2. Download the data file ‘scores.csv’ from the Nexus folder named ‘data for classroom examples’. Load that data into R.
  3. Calculate descriptive statistics.
  4. Estimate the effect of missed classes on scores.