GitHub - https://github.com/rickysoo/freeny
Contact - ricky [at] rickysoo.com

A codebook shows the structure, contents and layout of data. We should always study the codebook to understand the variables before conducting any research on the data.

The creation of a codebook is demonstrated using Freeny’s Revenue Data in two ways: an automated codebook and a custom-made codebook.

1. About The Data

Freeny’s Revenue Data (freeny) is used in this demonstration. The data consists of 39 observations of quarterly revenue, from the 2nd quarter of 1962 to the 4th quarter of 1971.

The data originates from A. E. Freeny (1977), A Portable Linear Regression Package with Test Programs, a Bell Laboratories memorandum.

However, it remains unknown how the author created or collected the data, whether it is based on the quarterly performance of a real or fictitious company, and what exact units are used for the measurements (thousands, millions, etc.).

The data is included in the datasets package developed by the R Core Team and contributors. The package also contains dozens of other datasets and ships with base R.

For more on the data, check out https://rdrr.io/r/datasets/freeny.html

2. Exploring Dataset

Loading Data

Load the memisc package, which is required to create the automated codebook. The package name stands for “Management of Survey Data and Presentation of Analysis Results”. The documentation can be found at https://cran.r-project.org/web/packages/memisc/memisc.pdf

library(memisc)
## Loading required package: lattice
## Loading required package: MASS
## 
## Attaching package: 'memisc'
## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
## 
##     as.array

Load the dataset.

data("freeny")

Showing Data

Show the first few rows of the dataset.

head(freeny)
##               y lag.quarterly.revenue price.index income.level market.potential
## 1962.25 8.79236               8.79636     4.70997      5.82110          12.9699
## 1962.5  8.79137               8.79236     4.70217      5.82558          12.9733
## 1962.75 8.81486               8.79137     4.68944      5.83112          12.9774
## 1963    8.81301               8.81486     4.68558      5.84046          12.9806
## 1963.25 8.90751               8.81301     4.64019      5.85036          12.9831
## 1963.5  8.93673               8.90751     4.62553      5.86464          12.9854

View the dataset in the RStudio data viewer.

View(freeny)

Data Attributes

Show the dimensions of the dataset. It has 39 observations and 5 variables.

dim(freeny)
## [1] 39  5

Show the column names. It appears that y is the revenue, and the other 4 variables are the explanatory variables.

names(freeny)
## [1] "y"                     "lag.quarterly.revenue" "price.index"          
## [4] "income.level"          "market.potential"
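
The name lag.quarterly.revenue suggests that this column is y shifted by one quarter. A minimal sketch for checking that reading against the data follows; the names revenue, lagged and comparison are introduced here for illustration only.

```r
data("freeny")

# Strip the time-series attributes so both columns are plain numeric vectors.
revenue <- as.numeric(freeny$y)
lagged  <- freeny$lag.quarterly.revenue

# Pair each quarter's lagged revenue with the previous quarter's revenue.
n <- nrow(freeny)
comparison <- data.frame(
  quarter          = row.names(freeny)[-1],
  previous.revenue = revenue[-n],
  lagged.revenue   = lagged[-1]
)
head(comparison)

# TRUE if the lag column matches the previous quarter's revenue throughout.
isTRUE(all.equal(comparison$previous.revenue, comparison$lagged.revenue))
```

In the rows printed by head(freeny), the lag value in each row equals y in the row above it, which is the pattern this check compares across the whole dataset.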

Show the row names of the dataset. It appears that each row represents a quarter of the period.

row.names(freeny)
##  [1] "1962.25" "1962.5"  "1962.75" "1963"    "1963.25" "1963.5"  "1963.75"
##  [8] "1964"    "1964.25" "1964.5"  "1964.75" "1965"    "1965.25" "1965.5" 
## [15] "1965.75" "1966"    "1966.25" "1966.5"  "1966.75" "1967"    "1967.25"
## [22] "1967.5"  "1967.75" "1968"    "1968.25" "1968.5"  "1968.75" "1969"   
## [29] "1969.25" "1969.5"  "1969.75" "1970"    "1970.25" "1970.5"  "1970.75"
## [36] "1971"    "1971.25" "1971.5"  "1971.75"
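
Since the data starts in the 2nd quarter of 1962 and the first row name is "1962.25", the fractional part of each row name appears to encode the quarter. A small sketch, under that assumption, that splits the row names into a year and a quarter number (the names time_index and periods are illustrative):

```r
data("freeny")

# Row names are decimal years stored as text, e.g. "1962.25".
time_index <- as.numeric(row.names(freeny))

# The whole part is the year; the fraction (0, .25, .5, .75) maps to quarters 1 to 4.
year    <- floor(time_index)
quarter <- round((time_index - year) * 4) + 1

periods <- data.frame(row.name = row.names(freeny), year = year, quarter = quarter)
head(periods)
```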

Show the attributes of the dataset.

attributes(freeny)
## $names
## [1] "y"                     "lag.quarterly.revenue" "price.index"          
## [4] "income.level"          "market.potential"     
## 
## $class
## [1] "data.frame"
## 
## $row.names
##  [1] "1962.25" "1962.5"  "1962.75" "1963"    "1963.25" "1963.5"  "1963.75"
##  [8] "1964"    "1964.25" "1964.5"  "1964.75" "1965"    "1965.25" "1965.5" 
## [15] "1965.75" "1966"    "1966.25" "1966.5"  "1966.75" "1967"    "1967.25"
## [22] "1967.5"  "1967.75" "1968"    "1968.25" "1968.5"  "1968.75" "1969"   
## [29] "1969.25" "1969.5"  "1969.75" "1970"    "1970.25" "1970.5"  "1970.75"
## [36] "1971"    "1971.25" "1971.5"  "1971.75"

Variables in Data

Based on the findings above and general knowledge of economics, a summary of the variables is drawn up below.

Variable               Description                                  Units
row.names              Quarter                                      Year & Quarter
y                      Quarterly revenue                            $, possibly in million
lag.quarterly.revenue  Quarterly revenue in last quarter            $, possibly in million
price.index            Price index (possibly consumer price index)  Possibly %
income.level           Income level                                 Unknown
market.potential       Market potential                             $, possibly in million
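
The same summary can be kept next to the data as a small lookup table. The sketch below stores the descriptions and units from the table above in a data frame; variable_notes is a name made up for this example, and the contents remain tentative readings rather than documented facts.

```r
data("freeny")

# Descriptions and units are the tentative readings from the table above,
# not documented facts about the data.
variable_notes <- data.frame(
  variable = names(freeny),
  description = c(
    "Quarterly revenue",
    "Quarterly revenue in last quarter",
    "Price index (possibly consumer price index)",
    "Income level",
    "Market potential"
  ),
  units = c(
    "$, possibly in million",
    "$, possibly in million",
    "Possibly %",
    "Unknown",
    "$, possibly in million"
  ),
  stringsAsFactors = FALSE
)
variable_notes
```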

Data Types

Check the type of the dataset. It’s found to be a list.

typeof(freeny)
## [1] "list"

Check the class of the dataset. It’s found to be a data frame.

class(freeny)
## [1] "data.frame"
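
The two results are consistent: a data frame is stored as a list of equal-length columns, and its class attribute is what marks it as a data frame. A brief sketch that makes this visible (column_lengths is an illustrative name):

```r
data("freeny")

is.list(freeny)        # a data frame is a list underneath
is.data.frame(freeny)  # the class attribute marks it as a data frame
length(freeny)         # number of list elements, i.e. columns

# Every list element (column) has the same length: the number of rows.
column_lengths <- sapply(freeny, length)
column_lengths
```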

3. Automated Codebook

Codebook Function

The codebook function of the memisc package automatically generates a codebook for a dataset. For each numeric variable it reports the storage mode, minimum, maximum, mean, standard deviation, skewness and kurtosis.

codebook(freeny)
## ================================================================================
## 
##    y
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: double
## 
##         Min:  8.79137000
##         Max:  9.79424000
##        Mean:  9.30630436
##    Std.Dev.:  0.31154410
##    Skewness: -0.07950976
##    Kurtosis: -1.25889614
## 
## ================================================================================
## 
##    lag.quarterly.revenue
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: double
## 
##         Min:  8.791
##         Max:  9.775
##        Mean:  9.281
##    Std.Dev.:  0.311
##    Skewness: -0.044
##    Kurtosis: -1.281
## 
## ================================================================================
## 
##    price.index
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: double
## 
##         Min:  4.278
##         Max:  4.710
##        Mean:  4.496
##    Std.Dev.:  0.132
##    Skewness: -0.180
##    Kurtosis: -1.226
## 
## ================================================================================
## 
##    income.level
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: double
## 
##         Min:  5.821
##         Max:  6.200
##        Mean:  6.039
##    Std.Dev.:  0.119
##    Skewness: -0.474
##    Kurtosis: -1.072
## 
## ================================================================================
## 
##    market.potential
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: double
## 
##         Min: 12.970
##         Max: 13.166
##        Mean: 13.067
##    Std.Dev.:  0.064
##    Skewness:  0.024
##    Kurtosis: -1.375

Summary Function

Show the summary statistics of all variables in the dataset.

summary(freeny)
##        y         lag.quarterly.revenue  price.index     income.level  
##  Min.   :8.791   Min.   :8.791         Min.   :4.278   Min.   :5.821  
##  1st Qu.:9.045   1st Qu.:9.020         1st Qu.:4.392   1st Qu.:5.948  
##  Median :9.314   Median :9.284         Median :4.510   Median :6.061  
##  Mean   :9.306   Mean   :9.281         Mean   :4.496   Mean   :6.039  
##  3rd Qu.:9.591   3rd Qu.:9.561         3rd Qu.:4.605   3rd Qu.:6.139  
##  Max.   :9.794   Max.   :9.775         Max.   :4.710   Max.   :6.200  
##  market.potential
##  Min.   :12.97   
##  1st Qu.:13.01   
##  Median :13.07   
##  Mean   :13.07   
##  3rd Qu.:13.12   
##  Max.   :13.17

4. Custom-Made Codebook

Instead of generating the codebook automatically, the same kinds of information are produced here by calling individual R functions.

Data Types

Show the type of each variable.

sapply(freeny, typeof)
##                     y lag.quarterly.revenue           price.index 
##              "double"              "double"              "double" 
##          income.level      market.potential 
##              "double"              "double"

Show the class of each variable.

sapply(freeny, class)
##                     y lag.quarterly.revenue           price.index 
##                  "ts"             "numeric"             "numeric" 
##          income.level      market.potential 
##             "numeric"             "numeric"
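
The class output shows that y, unlike the other columns, is stored as a time series ("ts"). A short sketch for inspecting its time attributes with base R functions (revenue is an illustrative name):

```r
data("freeny")

revenue <- freeny$y

is.ts(revenue)      # whether the column is a time series
start(revenue)      # first period
end(revenue)        # last period
frequency(revenue)  # observations per year
```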

Descriptive Statistics

Show the mean of each variable.

sapply(freeny, mean)
##                     y lag.quarterly.revenue           price.index 
##              9.306304              9.280718              4.496182 
##          income.level      market.potential 
##              6.038596             13.066831

Show the minimum value of each variable.

sapply(freeny, min)
##                     y lag.quarterly.revenue           price.index 
##               8.79137               8.79137               4.27789 
##          income.level      market.potential 
##               5.82110              12.96990

Show the maximum value of each variable.

sapply(freeny, max)
##                     y lag.quarterly.revenue           price.index 
##               9.79424               9.77536               4.70997 
##          income.level      market.potential 
##               6.20030              13.16640

Show the range of each variable.

sapply(freeny, range)
##            y lag.quarterly.revenue price.index income.level market.potential
## [1,] 8.79137               8.79137     4.27789       5.8211          12.9699
## [2,] 9.79424               9.77536     4.70997       6.2003          13.1664

Show the first quartile of each variable.

sapply(freeny, quantile, 0.25)
##                     y.25% lag.quarterly.revenue.25%           price.index.25% 
##                  9.044600                  9.019585                  4.391615 
##          income.level.25%      market.potential.25% 
##                  5.947985                 13.006600

Show the median of each variable.

sapply(freeny, median)
##                     y lag.quarterly.revenue           price.index 
##               9.31378               9.28436               4.51018 
##          income.level      market.potential 
##               6.06093              13.06930

Show the third quartile of each variable.

sapply(freeny, quantile, 0.75)
##                     y.75% lag.quarterly.revenue.75%           price.index.75% 
##                  9.590855                  9.560515                  4.604965 
##          income.level.75%      market.potential.75% 
##                  6.139120                 13.124400

Show the interquartile range of each variable.

sapply(freeny, IQR)
##                     y lag.quarterly.revenue           price.index 
##              0.546255              0.540930              0.213350 
##          income.level      market.potential 
##              0.191135              0.117800
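
The statistics computed one at a time so far can also be gathered into a single table, which is closer to what a codebook looks like. This is only a sketch; describe_variable and custom_codebook are hypothetical names introduced for this example.

```r
data("freeny")

# Hypothetical helper: the statistics computed one by one above, for one variable.
describe_variable <- function(x) {
  x <- as.numeric(x)  # drop time-series attributes, if any
  c(
    min    = min(x),
    q1     = unname(quantile(x, 0.25)),
    median = median(x),
    mean   = mean(x),
    q3     = unname(quantile(x, 0.75)),
    max    = max(x),
    iqr    = IQR(x)
  )
}

# sapply() returns one column per variable; t() puts variables in rows.
custom_codebook <- t(sapply(freeny, describe_variable))
round(custom_codebook, 3)
```

Keeping one row per variable makes it easy to add further columns, such as the variance or standard deviation, by extending the helper.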

Show the variance of each variable.

sapply(freeny, var)
##                     y lag.quarterly.revenue           price.index 
##           0.099613933           0.099519976           0.017784056 
##          income.level      market.potential 
##           0.014506558           0.004160757

Show the standard deviation of each variable.

sapply(freeny, sd)
##                     y lag.quarterly.revenue           price.index 
##            0.31561675            0.31546787            0.13335687 
##          income.level      market.potential 
##            0.12044317            0.06450393
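
These values are slightly larger than the Std.Dev. figures in the automated codebook (0.3156 versus 0.3115 for y). The gap is consistent with sd() dividing by n - 1 while the codebook figures divide by n; this is inferred from the numbers, not confirmed from the memisc documentation. A minimal sketch of the comparison, with population_sd as a hypothetical helper:

```r
data("freeny")

# Hypothetical helper: standard deviation with an n denominator.
population_sd <- function(x) {
  x <- as.numeric(x)
  sqrt(mean((x - mean(x))^2))
}

sd_comparison <- cbind(
  sample.sd     = sapply(freeny, sd),
  population.sd = sapply(freeny, population_sd)
)
round(sd_comparison, 5)
```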

The moments package is loaded to compute skewness and kurtosis.

# install.packages('moments') # Install the package if necessary
library(moments)

Show the skewness of each variable.

sapply(freeny, skewness)
##                     y lag.quarterly.revenue           price.index 
##           -0.07950976           -0.04381719           -0.18010672 
##          income.level      market.potential 
##           -0.47365195            0.02446065

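Show the kurtosis of each variable. These are Pearson’s kurtosis values, for which a normal distribution scores 3, not the excess kurtosis reported in the automated codebook.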

sapply(freeny, kurtosis)
##                     y lag.quarterly.revenue           price.index 
##              1.741104              1.719270              1.774144 
##          income.level      market.potential 
##              1.928212              1.624997
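
Subtracting 3 from these values gives the figures shown in the automated codebook (for y, 1.741104 - 3 = -1.258896). The sketch below computes excess kurtosis directly in base R so the two scales can be compared without an extra package; excess_kurtosis is a hypothetical helper, and the formula is the usual moment-based one with an n denominator.

```r
data("freeny")

# Hypothetical helper: fourth central moment over squared second central moment
# (both with an n denominator), minus 3.
excess_kurtosis <- function(x) {
  x <- as.numeric(x)
  deviations <- x - mean(x)
  mean(deviations^4) / mean(deviations^2)^2 - 3
}

excess <- sapply(freeny, excess_kurtosis)
round(excess, 4)
```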

Visualizing Data

Generate a histogram for each variable.

# invisible() suppresses the list of histogram objects that would otherwise be printed
invisible(lapply(names(freeny), function (var_name) {
  hist(freeny[[var_name]], main = paste0('Histogram - ', var_name), xlab = var_name, ylab = 'Frequency')
}))

Generate a line plot of each variable against time.

# Row names are decimal years stored as text, so convert them to numbers for the x axis
quarters <- as.numeric(row.names(freeny))
invisible(lapply(names(freeny), function (var_name) {
  plot(quarters, freeny[[var_name]], type = 'l', main = paste0('Line Plot - ', var_name), xlab = 'Quarter', ylab = var_name)
}))
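
The line plots can also be drawn in a single figure for side-by-side comparison. A sketch using base graphics, assuming a 2 x 3 grid is an acceptable layout for five plots (quarters and old_par are illustrative names):

```r
data("freeny")

# Decimal years from the row names serve as the x axis.
quarters <- as.numeric(row.names(freeny))

# Arrange plots in a 2 x 3 grid and restore the previous settings afterwards.
old_par <- par(mfrow = c(2, 3))
for (var_name in names(freeny)) {
  plot(quarters, as.numeric(freeny[[var_name]]), type = "l",
       main = var_name, xlab = "Quarter", ylab = var_name)
}
par(old_par)
```

A for loop is used here because the plots are drawn purely for their side effect, so there is no return value to collect or suppress.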