GitHub - https://github.com/rickysoo/freeny
Contact - ricky [at] rickysoo.com
A codebook shows the structure, contents and layout of data. We should always study the codebook to understand the variables before conducting any research on the data.
The creation of a codebook is demonstrated using Freeny’s Revenue Data in two ways: an automated codebook and a custom-made codebook.
Freeny’s Revenue Data (freeny) is used in this demonstration. The data consists of 39 observations showing quarterly revenue from the 2nd quarter of 1962 to the 4th quarter of 1971.
The data originates from A. E. Freeny (1977), A Portable Linear Regression Package with Test Programs, a Bell Laboratories memorandum.
However, it remains unknown how the author created or collected the data, whether it reflects the quarterly performance of a real or fictitious company, and what exact units are used for the measurements (thousands, millions etc.).
The data is included in the datasets package developed by the R Core Team and contributors. The package ships with every standard R installation and contains dozens of other datasets.
For more on the data, check out https://rdrr.io/r/datasets/freeny.html
Load the memisc package, which is required to create the automated codebook. The package name stands for “Management of Survey Data and Presentation of Analysis Results”. The documentation can be found at https://cran.r-project.org/web/packages/memisc/memisc.pdf
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
##
## Attaching package: 'memisc'
## The following objects are masked from 'package:stats':
##
## contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
##
## as.array
Load the dataset.
data("freeny")
Show the first few lines of the dataset.
head(freeny)
##               y lag.quarterly.revenue price.index income.level market.potential
## 1962.25 8.79236               8.79636     4.70997      5.82110          12.9699
## 1962.5  8.79137               8.79236     4.70217      5.82558          12.9733
## 1962.75 8.81486               8.79137     4.68944      5.83112          12.9774
## 1963    8.81301               8.81486     4.68558      5.84046          12.9806
## 1963.25 8.90751               8.81301     4.64019      5.85036          12.9831
## 1963.5  8.93673               8.90751     4.62553      5.86464          12.9854
View the dataset in the RStudio data viewer.
View(freeny)
Show the dimensions of the dataset. It has 39 observations and 5 variables.
dim(freeny)
## [1] 39 5
Show the column names. It appears that y is the revenue, and the other 4 variables are the explanatory variables.
names(freeny)
## [1] "y" "lag.quarterly.revenue" "price.index"
## [4] "income.level" "market.potential"
Show the row names of the dataset. It appears that each row represents one quarter of the period.
row.names(freeny)
## [1] "1962.25" "1962.5" "1962.75" "1963" "1963.25" "1963.5" "1963.75"
## [8] "1964" "1964.25" "1964.5" "1964.75" "1965" "1965.25" "1965.5"
## [15] "1965.75" "1966" "1966.25" "1966.5" "1966.75" "1967" "1967.25"
## [22] "1967.5" "1967.75" "1968" "1968.25" "1968.5" "1968.75" "1969"
## [29] "1969.25" "1969.5" "1969.75" "1970" "1970.25" "1970.5" "1970.75"
## [36] "1971" "1971.25" "1971.5" "1971.75"
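The row names encode each quarter as a decimal year in steps of 0.25. As a minimal sketch, assuming the usual time-series convention in which .00 is the first quarter (so 1962.25 is the second quarter, in line with the period stated earlier), they can be turned into readable labels. The names `quarter_index` and `quarter_label` are illustrative and not part of the original analysis.

```r
# Convert decimal-year row names (e.g. "1962.25") into "1962 Q2" style labels.
# Assumption: .00 = Q1, .25 = Q2, .50 = Q3, .75 = Q4.
data("freeny")

quarter_index <- as.numeric(row.names(freeny))
year <- floor(quarter_index)
quarter <- round((quarter_index - year) * 4) + 1
quarter_label <- paste0(year, " Q", quarter)

head(quarter_label)
```

Such labels are easier to read in a codebook than the raw decimal values, while the numeric form stays useful as a plotting axis.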
Show the attributes of the dataset.
attributes(freeny)
## $names
## [1] "y" "lag.quarterly.revenue" "price.index"
## [4] "income.level" "market.potential"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] "1962.25" "1962.5" "1962.75" "1963" "1963.25" "1963.5" "1963.75"
## [8] "1964" "1964.25" "1964.5" "1964.75" "1965" "1965.25" "1965.5"
## [15] "1965.75" "1966" "1966.25" "1966.5" "1966.75" "1967" "1967.25"
## [22] "1967.5" "1967.75" "1968" "1968.25" "1968.5" "1968.75" "1969"
## [29] "1969.25" "1969.5" "1969.75" "1970" "1970.25" "1970.5" "1970.75"
## [36] "1971" "1971.25" "1971.5" "1971.75"
Based on the findings above and general knowledge of economics, the variables can be summarized as follows.
| Variable | Description | Units |
|---|---|---|
| row.names | Quarter | Year and quarter |
| y | Quarterly revenue | $, possibly in millions |
| lag.quarterly.revenue | Quarterly revenue in the previous quarter | $, possibly in millions |
| price.index | Price index (possibly consumer price index) | Possibly % |
| income.level | Income level | Unknown |
| market.potential | Market potential | $, possibly in millions |
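The description of lag.quarterly.revenue as the revenue of the previous quarter can be checked against the data itself. In the first rows shown earlier, the value of y in one quarter reappears as lag.quarterly.revenue in the next. The sketch below extends that comparison to all rows; the variable names are illustrative, and the result is simply printed rather than assumed.

```r
# Compare y in quarters 1..(n-1) with lag.quarterly.revenue in quarters 2..n.
data("freeny")

n <- nrow(freeny)
previous_revenue <- as.numeric(freeny$y)[-n]        # drop the last quarter
lagged_revenue <- freeny$lag.quarterly.revenue[-1]  # drop the first quarter

# TRUE if the two series agree up to numerical tolerance
lag_matches <- isTRUE(all.equal(previous_revenue, lagged_revenue))
lag_matches
```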
Check the type of the dataset. It’s found to be a list.
typeof(freeny)
## [1] "list"
Check the class of the dataset. It’s found to be a data frame.
class(freeny)
## [1] "data.frame"
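The two results are not in conflict: a data frame is stored internally as a list of columns, and the "data.frame" class is attached on top of that list. A short sketch of how this can be confirmed, using only base R (the name `column_lengths` is illustrative):

```r
# A data frame is a list whose elements are equal-length columns.
data("freeny")

is.list(freeny)        # the underlying storage is a list
is.data.frame(freeny)  # the class attribute marks it as a data frame

# Each list element is one column with one value per observation
column_lengths <- lengths(freeny)
column_lengths
```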
The codebook function of the memisc package is used to automatically generate a codebook of a dataset. For each variable it reports the storage mode and a set of statistics: minimum, maximum, mean, standard deviation, skewness and kurtosis.
codebook(freeny)
## ================================================================================
##
## y
##
## --------------------------------------------------------------------------------
##
## Storage mode: double
##
## Min: 8.79137000
## Max: 9.79424000
## Mean: 9.30630436
## Std.Dev.: 0.31154410
## Skewness: -0.07950976
## Kurtosis: -1.25889614
##
## ================================================================================
##
## lag.quarterly.revenue
##
## --------------------------------------------------------------------------------
##
## Storage mode: double
##
## Min: 8.791
## Max: 9.775
## Mean: 9.281
## Std.Dev.: 0.311
## Skewness: -0.044
## Kurtosis: -1.281
##
## ================================================================================
##
## price.index
##
## --------------------------------------------------------------------------------
##
## Storage mode: double
##
## Min: 4.278
## Max: 4.710
## Mean: 4.496
## Std.Dev.: 0.132
## Skewness: -0.180
## Kurtosis: -1.226
##
## ================================================================================
##
## income.level
##
## --------------------------------------------------------------------------------
##
## Storage mode: double
##
## Min: 5.821
## Max: 6.200
## Mean: 6.039
## Std.Dev.: 0.119
## Skewness: -0.474
## Kurtosis: -1.072
##
## ================================================================================
##
## market.potential
##
## --------------------------------------------------------------------------------
##
## Storage mode: double
##
## Min: 12.970
## Max: 13.166
## Mean: 13.067
## Std.Dev.: 0.064
## Skewness: 0.024
## Kurtosis: -1.375
Show the summary statistics of all variables in the dataset.
summary(freeny)
## y lag.quarterly.revenue price.index income.level
## Min. :8.791 Min. :8.791 Min. :4.278 Min. :5.821
## 1st Qu.:9.045 1st Qu.:9.020 1st Qu.:4.392 1st Qu.:5.948
## Median :9.314 Median :9.284 Median :4.510 Median :6.061
## Mean :9.306 Mean :9.281 Mean :4.496 Mean :6.039
## 3rd Qu.:9.591 3rd Qu.:9.561 3rd Qu.:4.605 3rd Qu.:6.139
## Max. :9.794 Max. :9.775 Max. :4.710 Max. :6.200
## market.potential
## Min. :12.97
## 1st Qu.:13.01
## Median :13.07
## Mean :13.07
## 3rd Qu.:13.12
## Max. :13.17
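A codebook usually also records how many values are missing. The summary above lists no NA counts, which suggests the data is complete. A minimal sketch of an explicit check (the name `missing_per_variable` is illustrative):

```r
# Count missing values in each variable.
data("freeny")

missing_per_variable <- colSums(is.na(freeny))
missing_per_variable
```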
Instead of generating the codebook automatically, the same information is now obtained manually with individual R functions.
Show the type of each variable.
sapply(freeny, typeof)
## y lag.quarterly.revenue price.index
## "double" "double" "double"
## income.level market.potential
## "double" "double"
Show the class of each variable.
sapply(freeny, class)
## y lag.quarterly.revenue price.index
## "ts" "numeric" "numeric"
## income.level market.potential
## "numeric" "numeric"
Show the mean of each variable.
sapply(freeny, mean)
## y lag.quarterly.revenue price.index
## 9.306304 9.280718 4.496182
## income.level market.potential
## 6.038596 13.066831
Show the minimum value of each variable.
sapply(freeny, min)
## y lag.quarterly.revenue price.index
## 8.79137 8.79137 4.27789
## income.level market.potential
## 5.82110 12.96990
Show the maximum value of each variable.
sapply(freeny, max)
## y lag.quarterly.revenue price.index
## 9.79424 9.77536 4.70997
## income.level market.potential
## 6.20030 13.16640
Show the range of each variable.
sapply(freeny, range)
##            y lag.quarterly.revenue price.index income.level market.potential
## [1,] 8.79137               8.79137     4.27789       5.8211          12.9699
## [2,] 9.79424               9.77536     4.70997       6.2003          13.1664
Show the first quartile of each variable.
sapply(freeny, quantile, 0.25)
## y.25% lag.quarterly.revenue.25% price.index.25%
## 9.044600 9.019585 4.391615
## income.level.25% market.potential.25%
## 5.947985 13.006600
Show the median of each variable.
sapply(freeny, median)
## y lag.quarterly.revenue price.index
## 9.31378 9.28436 4.51018
## income.level market.potential
## 6.06093 13.06930
Show the third quartile of each variable.
sapply(freeny, quantile, 0.75)
## y.75% lag.quarterly.revenue.75% price.index.75%
## 9.590855 9.560515 4.604965
## income.level.75% market.potential.75%
## 6.139120 13.124400
Show the interquartile range of each variable.
sapply(freeny, IQR)
## y lag.quarterly.revenue price.index
## 0.546255 0.540930 0.213350
## income.level market.potential
## 0.191135 0.117800
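The quartiles and the interquartile range are related: the IQR is the third quartile minus the first quartile. As a sketch, the three quartiles can be requested in one call by passing a vector of probabilities, and the IQR derived from the result. The names `quartiles` and `iqr_from_quartiles` are illustrative.

```r
# Compute the 25%, 50% and 75% quantiles of every variable in one call.
data("freeny")

quartiles <- sapply(freeny, quantile, probs = c(0.25, 0.5, 0.75))
quartiles

# The interquartile range is the distance between the outer quartiles
iqr_from_quartiles <- quartiles["75%", ] - quartiles["25%", ]
iqr_from_quartiles
```

The result is a matrix with one row per quantile and one column per variable, which is more compact than three separate calls.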
Show the variance of each variable.
sapply(freeny, var)
## y lag.quarterly.revenue price.index
## 0.099613933 0.099519976 0.017784056
## income.level market.potential
## 0.014506558 0.004160757
Show the standard deviation of each variable.
sapply(freeny, sd)
## y lag.quarterly.revenue price.index
## 0.31561675 0.31546787 0.13335687
## income.level market.potential
## 0.12044317 0.06450393
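These values are slightly larger than the standard deviations in the automated codebook (for y, 0.3156 here against 0.3115 there). The sd function divides by n - 1, and the codebook figures are consistent with dividing by n instead. This is inferred from the numbers shown here, not taken from the memisc documentation. A sketch of the conversion:

```r
# Rescale the sample standard deviation (n - 1 denominator)
# to the population form (n denominator).
data("freeny")

n <- nrow(freeny)
sample_sd <- sapply(freeny, sd)
population_sd <- sample_sd * sqrt((n - 1) / n)

round(population_sd, 8)
```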
The moments package is loaded to compute skewness and kurtosis.
# install.packages('moments') # Install the package if necessary
library(moments)
Show the skewness of each variable.
sapply(freeny, skewness)
## y lag.quarterly.revenue price.index
## -0.07950976 -0.04381719 -0.18010672
## income.level market.potential
## -0.47365195 0.02446065
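Show the kurtosis of each variable. The kurtosis function of the moments package returns the raw kurtosis, which equals 3 for a normal distribution, while the values in the automated codebook match the excess kurtosis (raw kurtosis minus 3). For y, 1.741104 - 3 = -1.258896.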
sapply(freeny, kurtosis)
## y lag.quarterly.revenue price.index
## 1.741104 1.719270 1.774144
## income.level market.potential
## 1.928212 1.624997
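The kurtosis values can also be reproduced without the moments package. The sketch below uses base R only and computes the excess kurtosis as the fourth central moment divided by the squared second central moment, minus 3. The helper name `excess_kurtosis` is illustrative and not part of the original analysis.

```r
# Excess kurtosis from central moments, using base R only.
data("freeny")

excess_kurtosis <- function(v) {
  v <- as.numeric(v)          # drop any time-series attributes
  centered <- v - mean(v)
  m2 <- mean(centered^2)      # second central moment (n denominator)
  m4 <- mean(centered^4)      # fourth central moment
  m4 / m2^2 - 3
}

excess <- sapply(freeny, excess_kurtosis)
round(excess, 6)
```

All five values are negative, which indicates distributions that are flatter than the normal distribution.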
Histograms are generated for all variables for data visualization.
# Draw one histogram per variable.
# A for loop is used so that the objects returned by hist() are not printed.
for (var_name in names(freeny)) {
  hist(freeny[[var_name]], main = paste0('Histogram - ', var_name), xlab = var_name, ylab = 'Frequency')
}
Line plots are generated for all variables for data visualization.
# The row names hold the quarters as decimal years, so convert them to numbers for the x-axis.
quarters <- as.numeric(row.names(freeny))
# A for loop is used so that the NULL values returned by plot() are not printed.
for (var_name in names(freeny)) {
  plot(quarters, as.numeric(freeny[[var_name]]), type = 'l', main = paste0('Line Plot - ', var_name), xlab = 'Quarter', ylab = var_name)
}
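The individual results can be collected into a single custom-made codebook table with one row per variable. The sketch below is one possible layout, not a standard format. The descriptions repeat the tentative interpretations from the variable summary earlier, and the names `descriptions` and `custom_codebook` are illustrative.

```r
# Assemble a custom codebook: one row per variable, one column per property.
data("freeny")

# Tentative descriptions, based on the variable names (assumptions, not documented facts)
descriptions <- c(
  y = "Quarterly revenue",
  lag.quarterly.revenue = "Quarterly revenue in the previous quarter",
  price.index = "Price index",
  income.level = "Income level",
  market.potential = "Market potential"
)

custom_codebook <- data.frame(
  variable = names(freeny),
  description = unname(descriptions[names(freeny)]),
  type = sapply(freeny, typeof),
  missing = colSums(is.na(freeny)),
  min = sapply(freeny, min),
  median = sapply(freeny, median),
  mean = sapply(freeny, mean),
  max = sapply(freeny, max),
  sd = sapply(freeny, sd),
  stringsAsFactors = FALSE
)
rownames(custom_codebook) <- NULL  # the variable column already identifies each row

custom_codebook
```

A table like this can be written to a file with write.csv so that the codebook travels with the data.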