A Learner’s Diary: Descriptive Statistics

Prep Work

We begin by loading four packages: tidyverse, haven, forcats, and foreign.

library(tidyverse)
library(haven)
library(foreign)
library(forcats)

Let’s load some data.

affairs <- read_dta("http://fmwww.bc.edu/ec-p/data/wooldridge/affairs.dta")
affairs

## # A tibble: 601 x 19
##       id  male   age yrsmarr  kids relig  educ occup ratemarr naffairs affair
##    <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>  <dbl>
##  1     4     1    37   10        0     3    18     7        4        0      0
##  2     5     0    27    4        0     4    14     6        4        0      0
##  3     6     1    27    1.5      0     3    18     4        4        3      1
##  4    11     0    32   15        1     1    12     1        4        0      0
##  5    12     0    27    4        1     3    17     1        5        3      1
##  6    16     1    57   15        1     5    18     6        5        0      0
##  7    23     1    22    0.75     0     2    17     6        3        0      0
##  8    29     0    32    1.5      0     2    17     5        5        0      0
##  9    43     1    37   15        1     5    18     6        2        7      1
## 10    44     0    22    0.75     0     2    12     1        3        0      0
## # ... with 591 more rows, and 8 more variables: vryhap <dbl>, hapavg <dbl>,
## #   avgmarr <dbl>, unhap <dbl>, vryrel <dbl>, smerel <dbl>, slghtrel <dbl>,
## #   notrel <dbl>

Tabulating frequencies: total, relative, conditional

The binary/dummy variable affairs$kids is easier to understand when it is converted to a factor variable. Same for the categorical variable affairs$ratemarr:

haskids <- factor(x = affairs$kids,labels = c("no","yes"))
mlab <- c("very unhappy","somewhat unhappy","average", "somewhat happy", "very happy")
marriage <- factor(x = affairs$ratemarr, labels = mlab)
is.factor(marriage)        # TRUE; checks whether a vector is a factor.

## [1] TRUE

Frequency Distributions

Here are the frequencies for those who have kids and those who don’t, and the frequencies for the various ratings for marital happiness:

table(haskids)  # Tallies the numbers of observations for each factor level

## haskids
##  no yes 
## 171 430

table(affairs$kids)     # Compare with previous command to see why it may make sense to convert a categorical variable into a factor.

## 
##   0   1 
## 171 430

table(marriage)

## marriage
##     very unhappy somewhat unhappy          average   somewhat happy 
##               16               66               93              194 
##       very happy 
##              232

class(table(marriage))         # "table"

## [1] "table"

Relative Frequency Distributions

prop.table(table(haskids))

## haskids
##        no       yes 
## 0.2845258 0.7154742

prop.table(table(marriage))

## marriage
##     very unhappy somewhat unhappy          average   somewhat happy 
##        0.0266223        0.1098170        0.1547421        0.3227953 
##       very happy 
##        0.3860233

Contingency Tables

Two-way tables are also called contingency tables. The following contingency table shows the frequencies for all haskids-marriage combinations:

(countstab <- table(marriage,haskids))  # frequencies

##                   haskids
## marriage            no yes
##   very unhappy       3  13
##   somewhat unhappy   8  58
##   average           24  69
##   somewhat happy    40 154
##   very happy        96 136

is.table(countstab)          # TRUE; checks whether an object is a table

## [1] TRUE

prop.table(countstab)       # relative frequencies

##                   haskids
## marriage                    no         yes
##   very unhappy     0.004991681 0.021630616
##   somewhat unhappy 0.013311148 0.096505824
##   average          0.039933444 0.114808652
##   somewhat happy   0.066555740 0.256239601
##   very happy       0.159733777 0.226289517

By the way, the contingency table countstab can be converted into a data frame using the command countstabdf <- as.data.frame(countstab).

Yet another possibility is to convert a contingency table into a tibble: countstabtib <- as_tibble(countstab).

We may also be interested in the conditional relative frequencies. For example, what proportion of those who are “very happy” in their marriages have children?

prop.table(x = countstab, margin = 1)    # The proportions add to 1 for every value of marriage

##                   haskids
## marriage                  no       yes
##   very unhappy     0.1875000 0.8125000
##   somewhat unhappy 0.1212121 0.8787879
##   average          0.2580645 0.7419355
##   somewhat happy   0.2061856 0.7938144
##   very happy       0.4137931 0.5862069

prop.table(x = countstab, margin = 2)    # The proportions add to 1 for every value of haskids

##                   haskids
## marriage                   no        yes
##   very unhappy     0.01754386 0.03023256
##   somewhat unhappy 0.04678363 0.13488372
##   average          0.14035088 0.16046512
##   somewhat happy   0.23391813 0.35813953
##   very happy       0.56140351 0.31627907

Graphing relative frequencies

Pie charts

pie(x = table(marriage), col = gray(seq(.2, 1, .2)), main = "Pie 1")

pie(x = table(affairs$ratemarr), col = gray(seq(.2, 1, .2)), main = "Pie 2")

pie(x = table(affairs$ratemarr), labels = mlab, col = gray(seq(.2, 1, .2)), main = "Pie 3")

pie(x = c(16, 66, 93, 194, 232), labels = mlab, col = gray(seq(.2, 1, .2)), main = "Pie 4")

Compare the first and second pie charts above. Good luck if you are trying to make sense of the second pie chart, which charts an unfactored categorical variable. Note that the second pie chart pie(affairs$ratemarr, col = gray(seq(.2, 1, .2))) was a huge failure. You need to convert a variable to a table form before you put it in pie().

The third pie shows that a pie chart of unfactored data can be made meaningful with the addition of appropriate labels.

The fourth pie shows that a pie chart can be made out of a numerical vector representing frequencies.

Barplots

barplot(table(marriage), horiz = TRUE, las = 1, main = "Happiness", sub = "Totals", xlim = c(0, 250))

barplot(table(marriage), horiz = FALSE, las = 1, main = "Happiness", sub = "Totals", ylim = c(0, 250))

Replace marriage with affairs$ratemarr and try again, for comparison. This would provide yet another demonstration of the need to use the factor versions of any categorical variable in graphing. Compare the last two commands.

By the way, some labels may disappear if R does not find space. So, when exporting, adjust the size till all labels become visible.

barplot(table(haskids, marriage), horiz = TRUE, las = 1, main = "Happiness by Kids", legend = TRUE, args.legend = c(x="bottomright"))

barplot(table(marriage, haskids), horiz = TRUE, las = 1, main = "Kids by Happiness", legend = TRUE, args.legend = c(x="bottomright"))

barplot(table(haskids, marriage), beside = TRUE, las = 2, legend = TRUE, args.legend = c(x="top"))

Compare the last two commands.

Histograms

ceosal1<-read_dta("http://fmwww.bc.edu/ec-p/data/wooldridge/ceosal1.dta")
ROE <- ceosal1$roe  # Extract ROE to vector
hist(ROE)

A few more:

hist(ceosal1$roe, main = "Histogram of ROE", xlab = "ROE") # With labels

hist(ROE, breaks=c(0,5,10,20,30,60) )     # Plots density on the vertical, probably because that's what's appropriate when the breaks are uneven.

hist(ROE, breaks=5)

Kernel density graphs

# First, a stand-alone density plot
plot(density(ROE), main = "Density Plot")

# Second, a density plot superimposed on a histogram
hist(ROE, freq=FALSE, ylim=c(0,.07), main ="Histogram and Density Plot")
lines( density(ROE), lwd=3 )
curve(0.0001*x^2, add = TRUE) # Add a function of x, for no good reason!

Empirical Cumulative Distribution Function

plot(ecdf(ROE))

Descriptive Statistics

mean(ceosal1$salary)        # Sample mean/average

## [1] 1281.12

median(ceosal1$salary)      # Sample median

## [1] 1039

sd(ceosal1$salary)          # Sample standard deviation

## [1] 1372.345

summary(ceosal1$salary)      # Sample summary information

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     223     736    1039    1281    1407   14822

cor(ceosal1$salary, ceosal1$roe)    # Sample correlation

## [1] 0.1148417

with(ceosal1, cor(salary, roe))   # Same as previous

## [1] 0.1148417