Writing a codebook is an important step in the management of a data analysis project. The codebook serves as a reference for the data analysis team and helps those who are new to the project catch up quickly.
Here are the first few rows of a dataset on racial profiling that is available here.
## Organization.Identification.ID Department.Name Organization.Activity.Text
## 1 CT0001400 Branford Racial Profile
## 2 CTCSP1000 State Police Racial Profile
## 3 CTCSP0600 State Police Racial Profile
## 4 CTCSP0700 State Police Racial Profile
## 5 CT0001400 Branford Racial Profile
## 6 CT0011800 Ridgefield Racial Profile
## Reporting.Officer.Identification.ID Intervention.Identification.ID
## 1 378 1300024756
## 2 1000001940 1300616313
## 3 1000002349 1300616887
## 4 987312802 1300615169
## 5 1051 1300024738
## 6 1000142 1300016337
## Identification.Category.Description.Text Intervention.Date Day.of.Week
## 1 Incident ID 10/01/2013 12:00:00 AM Tuesday
## 2 Incident ID 10/01/2013 12:00:00 AM Tuesday
## 3 Incident ID 10/01/2013 12:00:00 AM Tuesday
## 4 Incident ID 10/01/2013 12:00:00 AM Tuesday
## 5 Incident ID 10/01/2013 12:00:00 AM Tuesday
## 6 Incident ID 10/01/2013 12:00:00 AM Tuesday
## Subject.Race.Code Subject.Ethnicity.Code Subject.Sex.Code Subject.Age
## 1 W N F 36
## 2 W M M 46
## 3 W H F 33
## 4 B N M 64
## 5 W N M 49
## 6 B N M 18
## Resident.Indicator Town.Resident.Indicator Intervention.Location.Name
## 1 TRUE FALSE BRANFORD
## 2 TRUE TRUE NEW HAVEN
## 3 TRUE FALSE NORWICH
## 4 TRUE FALSE OLD SAYBROOK
## 5 TRUE TRUE Branford
## 6 TRUE FALSE RIDGEFIELD
## Intervention.Location.Description.Text Intervention.Reason.Code
## 1 west main/short beach rd V
## 2 i-91 NORTHBOUND BY EXIT 7 ENTRANCE V
## 3 00000 N I 395 (NORWICH, T104) V
## 4 X67 SB V
## 5 cedar street I
## 6 Danbury Rd/Laurel Lane V
## Intervention.Technique.Code Intervention.Duration.Code Towed.Indicator
## 1 G 1 FALSE
## 2 G 1 FALSE
## 3 G 1 TRUE
## 4 G 1 FALSE
## 5 G 1 FALSE
## 6 G 1 FALSE
## Statute.Code.Identification.ID Statute.Code.Description Statutatory.Citation
## 1 14-100a(c)(1) Seatbelt 14-100a(c)(1)
## 2 14-298 Other 14-298
## 3 14-100a(d) Seatbelt <NA>
## 4 14-298 Other 14-298
## 5 14-18(A) Display of Plates <NA>
## 6 14-219b Speed Related 14-36(b)(1)
## Vehicle.Searched.Indicator Search.Authorization.Code Contraband.Indicator
## 1 FALSE N FALSE
## 2 FALSE N FALSE
## 3 FALSE N FALSE
## 4 FALSE N FALSE
## 5 FALSE N FALSE
## 6 FALSE N FALSE
## Custodial.Arrest.Indicator Intervention.Disposition.Code Intervention.Time
## 1 FALSE V 12:52
## 2 FALSE V 14:38
## 3 FALSE I 20:25
## 4 FALSE I 2:11
## 5 FALSE N 9:35
## 6 FALSE I 9:49
A codebook is a technical description of a dataset. It describes how the data are arranged, what the various numbers and letters mean, and any special instructions on how to use the data properly. The best codebooks have:
Even though a codebook should have all of this information, not all codebooks will arrange it identically.
Data preparation can be conducted in a variety of ways. However, there are some good general suggestions that will help you.
Make sure that variable names are unambiguous and unique. Give every variable a name. Each name should be a single string consisting of letters, numbers, and underscores, with numbers and underscores used only when appropriate. Spaces and periods can cause data processing problems in some software. Variable names should appear at the top of each column and be short enough to type comfortably but long enough to be meaningful.
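The naming rules above can be checked mechanically. The following is a minimal sketch in base R; `check_names` is a hypothetical helper written for this illustration, and the rule that a name must start with a letter is an added assumption rather than something stated above.

```r
# Hypothetical helper: flag names that are not a single string of letters,
# digits, and underscores (starting with a letter), and names that repeat.
check_names <- function(nms) {
  data.frame(
    name      = nms,
    valid     = grepl("^[A-Za-z][A-Za-z0-9_]*$", nms),
    duplicate = duplicated(nms) | duplicated(nms, fromLast = TRUE),
    stringsAsFactors = FALSE
  )
}

# One original heading with periods, a repeated name, and a name with a space.
res <- check_names(c("Subject.Race.Code", "Race", "Age", "Age", "Town Ind"))
print(res)
```

Any row with `valid == FALSE` or `duplicate == TRUE` is a candidate for renaming before the analysis starts.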
A label is a description of a variable. These can be as complex as a definition written in text or as simple as a reference to a survey question. These labels should be included in your codebook.
Variable labels not only let the statistician understand the contents of the data; they also help the researcher or data scientist interpret the output. They give everyone some context.
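One way to keep labels next to the data is to store them as attributes and then collect them into a table for the codebook. This is only a sketch: the label wording below is illustrative and not taken from the data provider, and the `"label"` attribute name is a convention assumed here.

```r
# Hypothetical label text for three of the variables.
labels <- c(
  Dept = "Name of the reporting police department",
  Age  = "Age of the subject in years",
  Tow  = "Whether the vehicle was towed"
)

dat <- data.frame(Dept = c("Branford", "State Police"),
                  Age  = c(36, 46),
                  Tow  = c(FALSE, FALSE))

# Store each label as an attribute on its column.
for (v in names(labels)) {
  attr(dat[[v]], "label") <- labels[[v]]
}

# Collect the labels back into a small table for the codebook.
codebook <- data.frame(
  variable = names(dat),
  label    = unname(vapply(dat, function(x) attr(x, "label"), character(1))),
  stringsAsFactors = FALSE
)
print(codebook)
```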
With categorical variables, we need to include an exhaustive list of mutually exclusive codes. Standard categories such as 0 = NO, 1 = YES should be used if possible. Standard terminology promotes quick comparisons across variables and studies.
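A code list can be written down once and reused both for coding and for checking that it really is exhaustive. The sketch below assumes a made-up yes/no variable; the response values are invented for illustration.

```r
# Code list following the 0 = NO, 1 = YES convention.
yes_no_codes <- c(NO = 0L, YES = 1L)

# Hypothetical raw responses, including one value the code list does not cover.
raw <- c("YES", "NO", "NO", "YES", "MAYBE")

# Values outside the code list show that the list is not exhaustive
# (or that there was a data entry error).
unlisted <- setdiff(raw, names(yes_no_codes))

# Look up the numeric code for each response; unlisted values become NA.
coded <- unname(yes_no_codes[raw])

print(unlisted)
print(coded)
```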
Data should be in a numeric code format when possible. This minimizes typographical errors when entering literal answers (e.g., NA vs. Na vs. na), so two equivalent answers are less likely to be misinterpreted as different. Most statistical packages also handle numeric variables better than character variables.
Character entries should be used for descriptive purposes only, for example for a long-answer question. Creating numeric codes for long-answer or free-form responses is cumbersome and yields little or no benefit.
Missing data occur frequently in real datasets. They may arise from a variety of sources, such as refusal to answer, omission, or being missing by design.
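To see why literal answers are fragile, consider a small invented example in which the same two answers were typed five different ways. This is a sketch of the problem and one possible clean-up, not a general recipe.

```r
# Hypothetical free-typed answers that were all meant to be one of two values.
answers <- c("Yes", "yes", "YES ", "No", "no")

# Each spelling is treated as its own category.
n_raw <- length(unique(answers))

# Normalising case and whitespace collapses the variants,
# after which numeric codes can be assigned.
clean   <- toupper(trimws(answers))
n_clean <- length(unique(clean))
codes   <- ifelse(clean == "YES", 1L, 0L)

print(c(raw_categories = n_raw, clean_categories = n_clean))
print(codes)
```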
It is sometimes important, either in data analysis or when writing reports, to be able to distinguish between different types of missing data, and that will require some coding. Furthermore, specifically coding missing values makes it clear that the data item is truly missing, rather than simply an omission by the data entry person, which does happen. If true missing values are coded, identifying data entry omissions will be trivial.
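The idea of coding missing values explicitly can be sketched as follows. The specific codes (-97, -98, -99) and the reasons attached to them are assumptions chosen for illustration; any values outside the valid range of the variable would serve.

```r
# Hypothetical missing-value codes, to be documented in the codebook.
missing_codes <- c(refused = -97, not_applicable = -98, not_recorded = -99)

# Hypothetical ages: two coded missing values and one uncoded blank (NA).
age <- c(36, 46, -97, 64, NA, -99)

# A plain NA was never coded, so it points to a data entry omission.
entry_omissions <- which(is.na(age))

# Recover the reason behind each coded missing value.
reason <- names(missing_codes)[match(age, missing_codes)]
print(table(reason))

# For analysis, convert every coded missing value to NA.
age_clean <- ifelse(age %in% missing_codes, NA, age)
print(age_clean)
```

Keeping the coded column and the analysis-ready column separate preserves the distinction between types of missingness for later reporting.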
Dates are tricky because there is not a standard format for dates. Thus, it is critical that you select a standard and maintain that standard throughout your project.
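As an illustration, the raw dates in this dataset look like `10/01/2013 12:00:00 AM`. The sketch below assumes these are month/day/year and converts them to ISO 8601 (`YYYY-MM-DD`), which is one reasonable choice of standard. The second date is made up for the example. The weekday check is a way to test the month/day assumption against the Day.of.Week column, which reports Tuesday for the first row.

```r
raw_date <- c("10/01/2013 12:00:00 AM", "10/02/2013 12:00:00 AM")

# Parse as month/day/year; as.Date() ignores the trailing time-of-day text.
stop_date <- as.Date(raw_date, format = "%m/%d/%Y")

# Store and report dates in ISO 8601.
iso <- format(stop_date, "%Y-%m-%d")
print(iso)

# Weekday as a number (0 = Sunday, ..., 2 = Tuesday), independent of locale.
wday <- as.POSIXlt(stop_date)$wday
print(wday)
```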
First, we take the original column names and replace them with variable names that are easier to work with. Remember, you don't want to be typing long names in your code when you are working with the data. I simply replaced the column headings with new variable names in my .csv document, and I made sure to note the change in my codebook.
dat2 <- read.csv("Racial_ProfilingIII.csv")
head(dat2)
## ID Dept OrgAct OffID IntID IDCat
## 1 CT0001400 Branford Racial Profile 378 1300024756 Incident ID
## 2 CTCSP1000 State Police Racial Profile 1000001940 1300616313 Incident ID
## 3 CTCSP0600 State Police Racial Profile 1000002349 1300616887 Incident ID
## 4 CTCSP0700 State Police Racial Profile 987312802 1300615169 Incident ID
## 5 CT0001400 Branford Racial Profile 1051 1300024738 Incident ID
## 6 CT0011800 Ridgefield Racial Profile 1000142 1300016337 Incident ID
## Date Day Race Ethnic Sex Age ResInd TownInd Location
## 1 10/01/2013 0:00 Tuesday W N F 36 TRUE FALSE BRANFORD
## 2 10/01/2013 0:00 Tuesday W M M 46 TRUE TRUE NEW HAVEN
## 3 10/01/2013 0:00 Tuesday W H F 33 TRUE FALSE NORWICH
## 4 10/01/2013 0:00 Tuesday B N M 64 TRUE FALSE OLD SAYBROOK
## 5 10/01/2013 0:00 Tuesday W N M 49 TRUE TRUE Branford
## 6 10/01/2013 0:00 Tuesday B N M 18 TRUE FALSE RIDGEFIELD
## LocationText Reason Technique Duration Tow
## 1 west main/short beach rd V G 1 FALSE
## 2 i-91 NORTHBOUND BY EXIT 7 ENTRANCE V G 1 FALSE
## 3 00000 N I 395 (NORWICH, T104) V G 1 TRUE
## 4 X67 SB V G 1 FALSE
## 5 cedar street I G 1 FALSE
## 6 Danbury Rd/Laurel Lane V G 1 FALSE
## StatuteID StatuteDesc Citation VehSearch SearchAuth Contraband
## 1 14-100a(c)(1) Seatbelt 14-100a(c)(1) FALSE N FALSE
## 2 14-298 Other 14-298 FALSE N FALSE
## 3 14-100a(d) Seatbelt <NA> FALSE N FALSE
## 4 14-298 Other 14-298 FALSE N FALSE
## 5 14-18(A) Display of Plates <NA> FALSE N FALSE
## 6 14-219b Speed Related 14-36(b)(1) FALSE N FALSE
## Arrest DisposalCode IntTime
## 1 FALSE V 12:52
## 2 FALSE V 14:38
## 3 FALSE I 20:25
## 4 FALSE I 2:11
## 5 FALSE N 9:35
## 6 FALSE I 9:49
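The renaming could also be done in R rather than by editing the .csv, which keeps a record of the mapping that can be copied straight into the codebook. This is a sketch of that alternative, not the approach used above; it covers only four of the columns and uses a small stand-in data frame instead of the full file.

```r
# Mapping from new short names to the original headings (a subset).
name_map <- c(
  ID   = "Organization.Identification.ID",
  Dept = "Department.Name",
  Race = "Subject.Race.Code",
  Age  = "Subject.Age"
)

# Small stand-in for the raw file, using values from the first two rows.
raw <- data.frame(
  Organization.Identification.ID = c("CT0001400", "CTCSP1000"),
  Department.Name   = c("Branford", "State Police"),
  Subject.Race.Code = c("W", "W"),
  Subject.Age       = c(36, 46),
  stringsAsFactors  = FALSE
)

# Rename a copy so the object holding the original headings is untouched.
renamed <- raw
idx <- match(name_map, names(renamed))
names(renamed)[idx] <- names(name_map)

print(names(renamed))
```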
The codebook I have generated so far looks like this. I have included a column that identifies missing data and a range column that gives the range of each numerical variable. I have also included the original data type as a column. All of this information will help those working on the dataset work with the data more quickly.
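A skeleton with these columns (type, missing count, range) can be produced directly from a data frame. The following is a rough sketch; `make_codebook` is a hypothetical helper, not a function from any package, and the sample rows are a small stand-in built from values shown earlier.

```r
# Hypothetical helper: one row per variable with its type,
# number of missing values, and range (numeric variables only).
make_codebook <- function(df) {
  data.frame(
    variable  = names(df),
    type      = unname(vapply(df, function(x) class(x)[1], character(1))),
    n_missing = unname(vapply(df, function(x) sum(is.na(x)), integer(1))),
    range     = unname(vapply(df, function(x) {
      if (is.numeric(x) && any(!is.na(x))) {
        paste(range(x, na.rm = TRUE), collapse = " to ")
      } else {
        NA_character_
      }
    }, character(1))),
    stringsAsFactors = FALSE
  )
}

sample_rows <- data.frame(
  Dept     = c("Branford", "State Police", "State Police"),
  Age      = c(36, 46, 33),
  Tow      = c(FALSE, FALSE, TRUE),
  Citation = c("14-100a(c)(1)", "14-298", NA),
  stringsAsFactors = FALSE
)

cb <- make_codebook(sample_rows)
print(cb)
```

Labels and code lists still have to be written by hand, but generating the mechanical columns this way keeps them in sync with the data.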
We should also include with the codebook a note that states:
"[t]his dataset is an archive the most current version will be maintained here: https://data.ct.gov/Public-Safety/Traffic-Stops-Racial-Profiling-Prohibition-Project/nahi-zqrt
This dataset was created in accordance with The Alvin W. Penn Racial Profiling Prohibition Act (Connecticut General Statutes Sections 54-1l and 54-1m) which prohibits any law enforcement agency from stopping, detaining, or searching any motorist when the stop is motivated solely by considerations of the race, color, ethnicity, age, gender or sexual orientation."
Finally, we also note that the last update was March 24, 2001, and that the data were provided by the Institute for Municipal and Regional Policy. The dataset contains roughly 842,000 rows and 29 columns.
In subsequent sections, we will show exactly how to use the dplyr package to transform these data into something that is easier to work with. While it is not essential that we change categorical variables into numerical variables in all cases, doing so will help with certain algorithms.
For example, I would likely change Day, Race, Ethnic, Sex, Age, ResInd, TownInd, Reason, Technique, VehSearch, SearchAuth, Contraband, Arrest and DisposalCode to numerical values. While we could do it in the data file itself, it is best to use R to do that for us. That way, we leave the original data intact.
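Before the dplyr version, here is a minimal base R sketch of the idea of recoding in R while leaving the original intact. The data frame is a stand-in for `dat2` built from three of the rows shown earlier, and the level orders are an assumption that covers only the values visible in those rows; the full data may contain more categories.

```r
# Stand-in for dat2, using values from the rows shown earlier.
dat2 <- data.frame(
  Race   = c("W", "W", "B"),
  Sex    = c("F", "M", "M"),
  Arrest = c(FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Work on a copy so dat2 itself is not modified.
dat_num <- dat2

# Explicit level orders (an assumption) keep the codes stable across runs;
# they should be recorded in the codebook.
race_levels <- c("B", "W")
sex_levels  <- c("F", "M")

dat_num$Race   <- as.integer(factor(dat2$Race, levels = race_levels))
dat_num$Sex    <- as.integer(factor(dat2$Sex,  levels = sex_levels))
dat_num$Arrest <- as.integer(dat2$Arrest)  # FALSE = 0, TRUE = 1

print(dat_num)
```

Any value not listed in the levels becomes NA after conversion, which makes gaps in the code list easy to spot.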