ScPoEconometrics 2

Florian Oswald
2018-09-22

Working With Data

Econometrics is about data.
In these slides we will start working with it.
Let's first load a dataset:

data("mpg",package="ggplot2")

How many observations and how many variables does it have?

dim(mpg)

[1] 234  11

The mpg dataset

And let's look at the first couple of rows:

head(mpg)

  manufacturer model displ year cyl      trans drv cty hwy fl   class
1         audi    a4   1.8 1999   4   auto(l5)   f  18  29  p compact
2         audi    a4   1.8 1999   4 manual(m5)   f  21  29  p compact
3         audi    a4   2.0 2008   4 manual(m6)   f  20  31  p compact
4         audi    a4   2.0 2008   4   auto(av)   f  21  30  p compact
5         audi    a4   2.8 1999   6   auto(l5)   f  16  26  p compact
6         audi    a4   2.8 1999   6 manual(m5)   f  18  26  p compact

tail gives you the last rows.
names gives the column names.
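A quick sketch of both helpers on the same dataset. The choice of 3 rows and the object names last_rows and col_names are just for illustration.

```r
data("mpg", package = "ggplot2")

# last 3 rows of the dataset (the default, as with head, is 6)
last_rows <- tail(mpg, 3)
last_rows

# column names as a character vector
col_names <- names(mpg)
col_names
```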

The mpg dataset: datatypes

It's important to know how the data is stored.
We use str:

str(mpg)

Classes 'tbl_df', 'tbl' and 'data.frame':   234 obs. of  11 variables:
 $ manufacturer: chr  "audi" "audi" "audi" "audi" ...
 $ model       : chr  "a4" "a4" "a4" "a4" ...
 $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
 $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
 $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
 $ trans       : chr  "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
 $ drv         : chr  "f" "f" "f" "f" ...
 $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
 $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
 $ fl          : chr  "p" "p" "p" "p" ...
 $ class       : chr  "compact" "compact" "compact" "compact" ...
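If you only need the storage type of each column, something like the following may be enough. It is a small sketch using base R's sapply and class; the name col_classes is made up for this example.

```r
data("mpg", package = "ggplot2")

# class of every column, returned as a named character vector
col_classes <- sapply(mpg, class)
col_classes

# how many columns of each type are there?
table(col_classes)
```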

Summarizing Data

One can learn only so much from looking at the data.frame.
Even if you could see all rows of the dataset, you would not know very much about it.
We need to summarize the data for us to learn from it.
In general, we can compute summary statistics, or visualize the data with plots.
Let's start with some statistics first!
Let's look at two features: central tendency and spread.

Central Tendency

mean(x): the average of all values in x.
median(x): the value \( x_j \) below and above which 50% of the values in x lie.

x <- c(1,2,2,2,2,100)
mean(x)

[1] 18.16667

mean(x) == sum(x) / length(x)

[1] TRUE

median(x)

[1] 2
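The gap between mean and median here comes from the single extreme value 100. A small sketch, dropping that value by hand (x_trimmed is a name introduced only for this illustration), shows that the mean reacts strongly while the median does not move:

```r
x <- c(1, 2, 2, 2, 2, 100)

# remove the extreme value
x_trimmed <- x[x < 100]

mean(x_trimmed)    # 1.8, compared to about 18.17 with the extreme value
median(x_trimmed)  # 2, the same as median(x)
```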

Spread

Another interesting feature is how much a variable is spread out about its center (mean or median).
The variance is such a measure.

var(x)

[1] 1607.367

all.equal(var(x), sum((x - mean(x))^2) / (length(x)-1))

[1] TRUE

Another simple measure of spread is the range, which returns the smallest and largest value:

range(x)

[1]   1 100
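A few related base R measures of spread, shown here as a sketch on the same x. They are not covered on this slide; sd, diff and IQR are standard base R functions, while the object names are chosen for this example.

```r
x <- c(1, 2, 2, 2, 2, 100)

# standard deviation: the square root of the variance
sd_x <- sd(x)
all.equal(sd_x, sqrt(var(x)))  # TRUE

# width of the range: max minus min
width <- diff(range(x))
width  # 99

# interquartile range: distance between the 25% and 75% quantiles
iqr_x <- IQR(x)
iqr_x  # 0, because both quartiles equal 2; the extreme value is ignored
```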

The table function

table(x) is a useful function that counts the occurrences of each unique value in x:

table(x)

x
  1   2 100 
  1   4   1

table(mpg$trans)


  auto(av)   auto(l3)   auto(l4)   auto(l5)   auto(l6)   auto(s4) 
         5          2         83         39          6          3 
  auto(s5)   auto(s6) manual(m5) manual(m6) 
         3         16         58         19

Crosstables

Given two vectors, table produces a contingency table:

table(mpg$trans,mpg$drv)


              4  f  r
  auto(av)    0  5  0
  auto(l3)    0  2  0
  auto(l4)   34 37 12
  auto(l5)   29  8  2
  auto(l6)    2  2  2
  auto(s4)    2  1  0
  auto(s5)    1  2  0
  auto(s6)    7  8  1
  manual(m5) 21 33  4
  manual(m6)  7  8  4

With prop.table, we can get proportions. Here margin=2 computes them within each column, i.e. within each drv type:

prop.table(table(mpg$trans,mpg$drv),margin=2)
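To see what the margin argument does, one can check which dimension sums to one. A minimal sketch, with tab, p_col and p_row as illustrative names:

```r
data("mpg", package = "ggplot2")

tab <- table(mpg$trans, mpg$drv)

# margin = 2: proportions within each column (each drv type)
p_col <- prop.table(tab, margin = 2)
round(p_col, 2)
colSums(p_col)  # every column sums to 1

# margin = 1: proportions within each row (each transmission type)
p_row <- prop.table(tab, margin = 1)
rowSums(p_row)  # every row sums to 1
```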

Plotting

R base plotting is fairly good.
There is an extremely powerful alternative in package ggplot2. We'll see both.
A histogram counts how many observations fall within a certain bin.

Histogram

hist(mpg$cty)
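The counting behind a histogram can be inspected directly. The sketch below asks hist for its bins without drawing anything and reproduces the counts by hand with cut and table; it assumes hist's default settings (right-closed bins, lowest value included), and the names h and manual are illustrative.

```r
data("mpg", package = "ggplot2")

# compute the histogram without plotting it
h <- hist(mpg$cty, plot = FALSE)
h$breaks  # bin boundaries
h$counts  # number of observations per bin

# the same counts by hand: assign each value to a bin, then tabulate
manual <- table(cut(mpg$cty, breaks = h$breaks, include.lowest = TRUE))
manual
```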


Nicer Histogram

hist(mpg$cty, xlab   = "Miles Per Gallon (City)", main   = "Histogram of MPG (City)", breaks = 12, col = "red",border = "blue")


Looking for Outliers: Boxplots

It's good to know if a variable has outliers, i.e. values much more extreme than the mass of all values.
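A possible way to draw such a plot with base R, shown as a sketch on the hwy variable. By default boxplot marks as outliers the points lying more than 1.5 times the interquartile range beyond the box; the labels and the name b are chosen for this example.

```r
data("mpg", package = "ggplot2")

# draw the boxplot; the return value holds the numbers behind the picture
b <- boxplot(mpg$hwy,
             ylab = "Miles Per Gallon (Highway)",
             main = "Boxplot of MPG (Highway)")

b$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out    # values drawn as individual outlier points

# one box per drive type, to compare groups
boxplot(hwy ~ drv, data = mpg, xlab = "Drive type", ylab = "MPG (Highway)")
```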

Scatter Plots

Two variables \( x \) and \( y \)
Natural to ask: How often do certain pairs of \( (x_i,y_i) \) occur?

head(mpg[,c("hwy","displ")])

  hwy displ
1  29   1.8
2  29   1.8
3  31   2.0
4  30   2.0
5  26   2.8
6  26   2.8

That's what a scatter plot shows.

Scatter Plots

plot(hwy ~ displ, data = mpg)
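For comparison, the same scatter plot can be built with ggplot2. This is only a sketch of one way to do it; the axis labels and the object name p are assumptions of this example.

```r
library(ggplot2)

# map displ to the x axis and hwy to the y axis, then draw one point per car
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(x = "Engine displacement (litres)", y = "Miles Per Gallon (Highway)")

print(p)
```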


It's Tutorial Time!

Time for our first tutorial!! Type this into your RStudio console:

library(ScPoEconometrics)
runTutorial('chapter2')

How are x and y related? Covariance


The relevant section in the book is mandatory reading.
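As a companion to the reading, here is a sketch of how covariance and correlation can be computed in R for the displ and hwy variables from the scatter plot. The manual formulas are the usual sample definitions (dividing by n - 1); the book's notation may differ, and the object names are illustrative.

```r
data("mpg", package = "ggplot2")

x <- mpg$displ
y <- mpg$hwy
n <- length(x)

# sample covariance: average product of deviations from the means
manual_cov <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
all.equal(cov(x, y), manual_cov)  # TRUE

# correlation: covariance rescaled by both standard deviations,
# so it always lies between -1 and 1
manual_cor <- manual_cov / (sd(x) * sd(y))
all.equal(cor(x, y), manual_cor)  # TRUE

cor(x, y)  # negative: larger engines go with lower highway mileage
```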

Correlation App

library(ScPoEconometrics)
runTutorial('correlation')

The Tidyverse

Hadley Wickham
What is tidy data?
1. Each variable is a column
2. Each observation is a row
3. Each value is a cell.
That's not always how we get data.
Some tools first.
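To make the three rules concrete, here is a sketch with a small made-up table in which one variable (the year) is spread over the column names. It assumes tidyr version 1.0.0 or later for pivot_longer; the data and the names wide, long and cases are invented for this example.

```r
library(tidyr)

# not tidy: the years are column names rather than values of a variable
wide <- data.frame(country = c("A", "B"),
                   `1999`  = c(10, 20),
                   `2008`  = c(12, 25),
                   check.names = FALSE)

# tidy: one row per country-year observation, one column per variable
long <- pivot_longer(wide,
                     cols      = -country,
                     names_to  = "year",
                     values_to = "cases")
long
```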

tibbles are tidy data.frames

library(tidyr)  # also loads library(tibble)
data(mpg,package = "ggplot2")  # data from the ggplot2 package
mpg

# A tibble: 234 x 11
   manufacturer model displ  year   cyl trans drv     cty   hwy fl    cla…
   <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <ch>
 1 audi         a4      1.8  1999     4 auto… f        18    29 p     com…
 2 audi         a4      1.8  1999     4 manu… f        21    29 p     com…
 3 audi         a4      2    2008     4 manu… f        20    31 p     com…
 4 audi         a4      2    2008     4 auto… f        21    30 p     com…
 5 audi         a4      2.8  1999     6 auto… f        16    26 p     com…
 6 audi         a4      2.8  1999     6 manu… f        18    26 p     com…
 7 audi         a4      3.1  2008     6 auto… f        18    27 p     com…
 8 audi         a4 q…   1.8  1999     4 manu… 4        18    26 p     com…
 9 audi         a4 q…   1.8  1999     4 auto… 4        16    25 p     com…
10 audi         a4 q…   2    2008     4 manu… 4        20    28 p     com…
# ... with 224 more rows

Subsetting tibbles

Subsetting works the same way as for a data.frame.

# mpg[row condition, col condition]
mpg[mpg$hwy > 35, c("manufacturer", "model", "year")]

# A tibble: 6 x 3
  manufacturer model       year
  <chr>        <chr>      <int>
1 honda        civic       2008
2 honda        civic       2008
3 toyota       corolla     2008
4 volkswagen   jetta       1999
5 volkswagen   new beetle  1999
6 volkswagen   new beetle  1999

Enter: dplyr

Very powerful package. Check it out!

library(dplyr)
mpg %>%    # %>% is the "pipe" operator
  filter(hwy > 35) %>%  # takes output and puts into next function
  select(manufacturer, model, year)

# A tibble: 6 x 3
  manufacturer model       year
  <chr>        <chr>      <int>
1 honda        civic       2008
2 honda        civic       2008
3 toyota       corolla     2008
4 volkswagen   jetta       1999
5 volkswagen   new beetle  1999
6 volkswagen   new beetle  1999

# as such, equivalent to
select(filter(mpg, hwy > 35), manufacturer, model, year)
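The pipe becomes more useful as pipelines get longer. As a sketch of a typical next step, the following computes a summary per group with group_by and summarise; the column names mean_hwy and n_cars and the object by_drv are chosen for this example.

```r
library(dplyr)
data("mpg", package = "ggplot2")

by_drv <- mpg %>%
  group_by(drv) %>%                  # one group per drive type
  summarise(mean_hwy = mean(hwy),    # average highway mileage per group
            n_cars   = n())          # number of rows per group

by_drv
```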

Case Study: How to Read xls data

Excel (or other spreadsheet) data is ubiquitous.
Unfortunately it doesn't always come to us in an easy-to-use form.
We need to clean it.
Let's go through the worked example in the book!
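Before turning to the book, here is a minimal sketch of reading a spreadsheet with the readxl package. It is not the book's example: it uses the demo workbook that ships with readxl, and the names path and raw are illustrative. Real files often need extra arguments such as skip to drop title rows above the header.

```r
library(readxl)

# path to an example workbook bundled with the readxl package
path <- readxl_example("datasets.xlsx")

# which sheets does the workbook contain?
excel_sheets(path)

# read the first sheet into a tibble and inspect how it is stored
raw <- read_excel(path, sheet = 1)
str(raw)
```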