Introduction

Writing a codebook is an important step in the management of a data analysis project. The codebook serves as a reference for the data analysis team and helps those who are new to the project catch up quickly.

Here are the first few rows of a dataset on racial profiling that is available here.

##   Organization.Identification.ID Department.Name Organization.Activity.Text
## 1                      CT0001400        Branford             Racial Profile
## 2                      CTCSP1000    State Police             Racial Profile
## 3                      CTCSP0600    State Police             Racial Profile
## 4                      CTCSP0700    State Police             Racial Profile
## 5                      CT0001400        Branford             Racial Profile
## 6                      CT0011800      Ridgefield             Racial Profile
##   Reporting.Officer.Identification.ID Intervention.Identification.ID
## 1                                 378                     1300024756
## 2                          1000001940                     1300616313
## 3                          1000002349                     1300616887
## 4                           987312802                     1300615169
## 5                                1051                     1300024738
## 6                             1000142                     1300016337
##   Identification.Category.Description.Text      Intervention.Date Day.of.Week
## 1                              Incident ID 10/01/2013 12:00:00 AM     Tuesday
## 2                              Incident ID 10/01/2013 12:00:00 AM     Tuesday
## 3                              Incident ID 10/01/2013 12:00:00 AM     Tuesday
## 4                              Incident ID 10/01/2013 12:00:00 AM     Tuesday
## 5                              Incident ID 10/01/2013 12:00:00 AM     Tuesday
## 6                              Incident ID 10/01/2013 12:00:00 AM     Tuesday
##   Subject.Race.Code Subject.Ethnicity.Code Subject.Sex.Code Subject.Age
## 1                 W                      N                F          36
## 2                 W                      M                M          46
## 3                 W                      H                F          33
## 4                 B                      N                M          64
## 5                 W                      N                M          49
## 6                 B                      N                M          18
##   Resident.Indicator Town.Resident.Indicator Intervention.Location.Name
## 1               TRUE                   FALSE                   BRANFORD
## 2               TRUE                    TRUE                  NEW HAVEN
## 3               TRUE                   FALSE                    NORWICH
## 4               TRUE                   FALSE               OLD SAYBROOK
## 5               TRUE                    TRUE                   Branford
## 6               TRUE                   FALSE                 RIDGEFIELD
##   Intervention.Location.Description.Text Intervention.Reason.Code
## 1               west main/short beach rd                        V
## 2     i-91 NORTHBOUND BY EXIT 7 ENTRANCE                        V
## 3          00000 N I 395 (NORWICH, T104)                        V
## 4                                 X67 SB                        V
## 5                           cedar street                        I
## 6                 Danbury Rd/Laurel Lane                        V
##   Intervention.Technique.Code Intervention.Duration.Code Towed.Indicator
## 1                           G                          1           FALSE
## 2                           G                          1           FALSE
## 3                           G                          1            TRUE
## 4                           G                          1           FALSE
## 5                           G                          1           FALSE
## 6                           G                          1           FALSE
##   Statute.Code.Identification.ID Statute.Code.Description Statutatory.Citation
## 1                  14-100a(c)(1)                 Seatbelt        14-100a(c)(1)
## 2                         14-298                    Other               14-298
## 3                     14-100a(d)                 Seatbelt                 <NA>
## 4                         14-298                    Other               14-298
## 5                       14-18(A)        Display of Plates                 <NA>
## 6                        14-219b            Speed Related          14-36(b)(1)
##   Vehicle.Searched.Indicator Search.Authorization.Code Contraband.Indicator
## 1                      FALSE                         N                FALSE
## 2                      FALSE                         N                FALSE
## 3                      FALSE                         N                FALSE
## 4                      FALSE                         N                FALSE
## 5                      FALSE                         N                FALSE
## 6                      FALSE                         N                FALSE
##   Custodial.Arrest.Indicator Intervention.Disposition.Code Intervention.Time
## 1                      FALSE                             V             12:52
## 2                      FALSE                             V             14:38
## 3                      FALSE                             I             20:25
## 4                      FALSE                             I              2:11
## 5                      FALSE                             N              9:35
## 6                      FALSE                             I              9:49

What is a Codebook?

A codebook is a technical description of a dataset. It describes how the data are arranged, what the various numbers and letters mean, and any special instructions on how to use the data properly. The best codebooks have:

Even though a codebook should have all of this information, not all codebooks will arrange it identically.

Data Preparation

Data preparation can be conducted in a variety of ways. However, there are some good general suggestions that will help you.

Variable Names

Give every variable a name, and make sure that each name is unambiguous and unique. A name should be a single string made up of letters, numbers, and underscores, with numbers and underscores used only when appropriate. Spaces and periods cause data processing issues in some software. Variable names should appear at the top of each column and should be short enough to type easily but long enough to be meaningful.
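As a rough illustration of these naming rules, the sketch below uses a small helper, clean_names(), which is a hypothetical function written for this example and not part of any package. It only makes names safe and unique; shortening them to something meaningful still has to be done by hand.

```r
# Hypothetical helper: make raw column headings safe to use as variable names.
clean_names <- function(x) {
  x <- gsub("[^A-Za-z0-9]+", "_", x)  # replace runs of spaces, periods, etc. with "_"
  x <- gsub("^_+|_+$", "", x)         # drop leading and trailing underscores
  make.unique(x, sep = "_")           # append a suffix to any repeated name
}

# Two headings in the style of the dataset, plus one with spaces (made up here).
raw_names <- c("Subject.Race.Code", "Subject Sex Code", "Intervention.Date")
new_names <- clean_names(raw_names)
print(new_names)
```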

Variable Labels

A label is a description of a variable. These can be as complex as a definition written in text or as simple as a reference to a survey question. These labels should be included in your codebook.

Variable labels not only let the statistician understand the contents of the data, they also help the researcher or data scientist interpret the output. They simply give everyone some context.
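One simple way to keep labels next to the data in R is to store them as a "label" attribute on each column. This is only a sketch: the toy data frame and the label wording are assumptions made for illustration, not the official definitions for this dataset.

```r
# Toy data frame with two renamed variables.
dat <- data.frame(Race = c("W", "B"), Age = c(36L, 64L), stringsAsFactors = FALSE)

# Attach a label to each column (label text is illustrative only).
attr(dat$Race, "label") <- "Subject race code"
attr(dat$Age, "label") <- "Subject age in years"

# Collect the labels into a named character vector for the codebook.
labels <- vapply(dat, function(col) attr(col, "label"), character(1))
print(labels)
```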

Variable Codes

For categorical variables, we need to include an exhaustive list of mutually exclusive codes. Standard categories such as 0 = NO, 1 = YES should be used where possible. Standard terminology promotes quick comparisons across variables and studies.
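A short sketch of the 0 = NO, 1 = YES convention, using a made-up indicator variable. Declaring the full code list in factor() also gives a quick check that the list is exhaustive, because any value outside the list becomes NA.

```r
# Hypothetical yes/no variable stored with the codes 0 = NO, 1 = YES.
searched <- c(0, 1, 0, 0, 1)

# Map the numeric codes to their documented meanings.
searched_f <- factor(searched, levels = c(0, 1), labels = c("NO", "YES"))

# A value outside the declared codes would turn into NA here.
unexpected <- sum(is.na(searched_f))
print(table(searched_f))
```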

Variable Formats

Data should be in a numeric code format when possible. This minimizes typographical errors that arise when entering literal answers (e.g., NA vs. Na vs. na), so two equivalent answers are less likely to be treated as different. Most statistical packages also handle numeric variables better than character variables.

Character entries should be used for descriptive purposes only, for example for a long-answer question. Creating numeric codes for long-answer or free-form responses is cumbersome and yields little or no benefit.
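To see why literal answers are fragile, consider the made-up responses below: five different strings stand for only two real answers. The sketch normalizes them and maps them to numeric codes; the code values are an assumption chosen for this example.

```r
# Made-up literal answers with inconsistent case and stray whitespace.
answers <- c("Yes", "yes", "YES ", "No", "no")
n_distinct_raw <- length(unique(answers))  # counts every spelling separately

# Normalize the text, then look up the numeric code for each answer.
normalized <- toupper(trimws(answers))
codes <- c(NO = 0L, YES = 1L)              # assumed coding scheme
answer_code <- unname(codes[normalized])
print(answer_code)
```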

Missing Data

Missing data occur frequently in real datasets. They may arise from a variety of sources, such as refusal to answer, omission, or missingness by design.

It is sometimes important, either in data analysis or when writing reports, to be able to distinguish between different types of missing data, and that will require some coding. Furthermore, specifically coding missing values makes it clear that the data item is truly missing, rather than simply an omission by the data entry person, which does happen. If true missing values are coded, identifying data entry omissions will be trivial.
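The idea can be sketched as follows. The codes -97, -98, and -99 are assumptions chosen for this example and are not used in the racial profiling dataset. A cell that is left blank (NA) without one of these codes is then easy to spot as a data entry omission.

```r
# Assumed missing-value codes (hypothetical, for illustration only).
missing_codes <- c(refused = -97, omitted = -98, by_design = -99)

# Made-up age column: two coded missing values and one blank cell.
age_raw <- c(36, -97, 46, NA, -99, 33)

# Record why each value is missing.
miss_reason <- names(missing_codes)[match(age_raw, missing_codes)]
miss_reason[is.na(age_raw)] <- "data_entry_omission"

# Analysis copy: coded missing values become NA so they are not treated as ages.
age <- ifelse(age_raw %in% missing_codes, NA, age_raw)
print(table(miss_reason))
```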

Date Variables

Dates are tricky because there is not a standard format for dates. Thus, it is critical that you select a standard and maintain that standard throughout your project.
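As a small sketch, the date strings shown in the data preview can be parsed with an explicit format and then stored in one standard, here ISO 8601 (YYYY-MM-DD). The choice of ISO 8601 is my assumption; any standard works as long as it is applied consistently. The second date is made up for the example.

```r
# Date strings in the style of the Intervention.Date column.
raw_dates <- c("10/01/2013 12:00:00 AM", "10/02/2013 12:00:00 AM")

# State the format explicitly; trailing time text is ignored by as.Date().
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Store the dates in a single standard format.
iso <- format(parsed, "%Y-%m-%d")
print(iso)

# Sanity check against the Day.of.Week column: wday 2 means Tuesday (0 = Sunday).
first_wday <- as.POSIXlt(parsed)$wday[1]
```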

An Example

The Codebook

First, we take the original data names and replace them with variable names that are easier to work with. Remember, you don’t want to be typing long names in your code when you are working with the data. I simply replaced the column headings with new variable names in my .csv document, and I made sure to note the change in my codebook.

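# Read the file whose column headings were replaced with the shorter names
dat2 <- read.csv("Racial_ProfilingIII.csv")
# Show the first six rows
head(dat2)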
##          ID         Dept         OrgAct      OffID      IntID       IDCat
## 1 CT0001400     Branford Racial Profile        378 1300024756 Incident ID
## 2 CTCSP1000 State Police Racial Profile 1000001940 1300616313 Incident ID
## 3 CTCSP0600 State Police Racial Profile 1000002349 1300616887 Incident ID
## 4 CTCSP0700 State Police Racial Profile  987312802 1300615169 Incident ID
## 5 CT0001400     Branford Racial Profile       1051 1300024738 Incident ID
## 6 CT0011800   Ridgefield Racial Profile    1000142 1300016337 Incident ID
##              Date     Day Race Ethnic Sex Age ResInd TownInd     Location
## 1 10/01/2013 0:00 Tuesday    W      N   F  36   TRUE   FALSE     BRANFORD
## 2 10/01/2013 0:00 Tuesday    W      M   M  46   TRUE    TRUE    NEW HAVEN
## 3 10/01/2013 0:00 Tuesday    W      H   F  33   TRUE   FALSE      NORWICH
## 4 10/01/2013 0:00 Tuesday    B      N   M  64   TRUE   FALSE OLD SAYBROOK
## 5 10/01/2013 0:00 Tuesday    W      N   M  49   TRUE    TRUE     Branford
## 6 10/01/2013 0:00 Tuesday    B      N   M  18   TRUE   FALSE   RIDGEFIELD
##                         LocationText Reason Technique Duration   Tow
## 1           west main/short beach rd      V         G        1 FALSE
## 2 i-91 NORTHBOUND BY EXIT 7 ENTRANCE      V         G        1 FALSE
## 3      00000 N I 395 (NORWICH, T104)      V         G        1  TRUE
## 4                             X67 SB      V         G        1 FALSE
## 5                       cedar street      I         G        1 FALSE
## 6             Danbury Rd/Laurel Lane      V         G        1 FALSE
##       StatuteID       StatuteDesc      Citation VehSearch SearchAuth Contraband
## 1 14-100a(c)(1)          Seatbelt 14-100a(c)(1)     FALSE          N      FALSE
## 2        14-298             Other        14-298     FALSE          N      FALSE
## 3    14-100a(d)          Seatbelt          <NA>     FALSE          N      FALSE
## 4        14-298             Other        14-298     FALSE          N      FALSE
## 5      14-18(A) Display of Plates          <NA>     FALSE          N      FALSE
## 6       14-219b     Speed Related   14-36(b)(1)     FALSE          N      FALSE
##   Arrest DisposalCode IntTime
## 1  FALSE            V   12:52
## 2  FALSE            V   14:38
## 3  FALSE            I   20:25
## 4  FALSE            I    2:11
## 5  FALSE            N    9:35
## 6  FALSE            I    9:49

The codebook I have generated so far looks like this. I have included a column that identifies missing data and a range column that gives the range of numerical data. I have also included the original data type as a column. All of this information will help those working on the dataset handle the data more quickly.
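A codebook skeleton with these columns can be drafted directly in R. The sketch below runs on a three-row toy data frame so that it is self-contained; the column names variable, type, n_missing, and range are my own choices, and the same code could be pointed at dat2 instead.

```r
# Toy stand-in for dat2 with one numeric and two character columns.
dat <- data.frame(
  Race = c("W", "W", "B"),
  Age = c(36L, 46L, 64L),
  Citation = c("14-298", NA, NA),
  stringsAsFactors = FALSE
)

# One codebook row per variable: data type, missing count, numeric range.
codebook <- data.frame(
  variable = names(dat),
  type = unname(vapply(dat, function(col) class(col)[1], character(1))),
  n_missing = unname(vapply(dat, function(col) sum(is.na(col)), integer(1))),
  range = unname(vapply(dat, function(col) {
    if (is.numeric(col)) {
      paste(range(col, na.rm = TRUE), collapse = " to ")
    } else {
      NA_character_  # a range is only reported for numeric columns
    }
  }, character(1))),
  stringsAsFactors = FALSE
)
print(codebook)
```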

We should also include with the codebook a note that states:

"[t]his dataset is an archive the most current version will be maintained here: https://data.ct.gov/Public-Safety/Traffic-Stops-Racial-Profiling-Prohibition-Project/nahi-zqrt

This dataset was created in accordance with The Alvin W. Penn Racial Profiling Prohibition Act (Connecticut General Statutes Sections 54-1l and 54-1m) which prohibits any law enforcement agency from stopping, detaining, or searching any motorist when the stop is motivated solely by considerations of the race, color, ethnicity, age, gender or sexual orientation."

Finally, we also note that the last update was March 24, 2001, and that the data were provided by the Institute for Municipal and Regional Policy. The dataset contains roughly 842,000 rows and 29 columns.

Formatting the Data

In subsequent sections, we will show exactly how to use the dplyr package to transform these data into something that is easier to work with. While it is not essential to convert categorical variables into numerical variables in all cases, doing so helps with certain algorithms.

For example, I would likely change Day, Race, Ethnic, Sex, Age, ResInd, TownInd, Reason, Technique, VehSearch, SearchAuth, Contraband, Arrest and DisposalCode to numerical values. While we could do it in the dataset itself, it is best to let R do that for us. That way, we leave the original data intact.
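As a rough preview, here is a base R sketch of that kind of conversion on two columns. It is not the dplyr workflow covered later, and the code assignments (F = 1, M = 2; FALSE = 0, TRUE = 1) are assumptions based only on the values visible in the rows above. The recoding is done on a copy, so the original data frame is untouched.

```r
# Toy stand-in for two columns of dat2.
dat2_small <- data.frame(
  Sex = c("F", "M", "F", "M"),
  Tow = c(FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Work on a copy so the original values stay intact.
dat_num <- dat2_small

# Fix the level order explicitly so the codes are reproducible (assumed levels).
sex_levels <- c("F", "M")
dat_num$Sex <- as.integer(factor(dat_num$Sex, levels = sex_levels))  # F = 1, M = 2
dat_num$Tow <- as.integer(dat_num$Tow)                               # FALSE = 0, TRUE = 1
print(dat_num)
```

Whatever coding is chosen should be written into the codebook alongside the variable.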
