Introduction

Once you have finished collecting your survey data, it is time to import the raw data into R for cleaning and preparation. Most survey data that you collect yourself will not immediately be ready for analysis. In this tutorial, we will cover the following:

- Importing data
- Reviewing and cleaning data
- Removing incomplete and potential bot respondents
- Cleaning survey data for analysis (handling NAs, removing don't know/no opinion responses, labeling unlabeled values, and collapsing categories)

Importing Data

Once data collection is finished, Qualtrics allows you to export your raw survey data as a variety of file types. We want to export as an SPSS file type, .sav. Doing this rather than exporting to Excel or CSV keeps the structure of the survey data intact: survey data typically has both a number (e.g. 1) and a label (e.g. strongly agree) associated with most variables. Exporting as a file type other than SPSS loses that pairing, as you can only get one or the other. By using SPSS as the export file type, we keep all the information about each response that we need.

Once you have exported your raw survey data from Qualtrics, or any other survey software platform, you need to import the data into R. As a best practice, you should first move your downloaded data into a new folder appropriate for the project. Either use the Import Dataset dropdown in the Environment panel to browse for the file or update the following code as necessary to point to your specific folder and dataset.
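Note that the functions used throughout this tutorial come from several add-on packages. The exact set below is my assumption based on the functions called later (for example, I am assuming the freq() calls come from the descr package and val_labels()/labelled() come from the labelled package); load whichever equivalents you actually use.

library(tidyverse) #mutate, filter, select, if_else, and the pipe
library(haven)     #read_sav for SPSS .sav files
library(janitor)   #clean_names
library(skimr)     #skim
library(rstatix)   #get_summary_stats
library(labelled)  #val_labels and labelled
library(descr)     #freq with the plot=F argument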

In this code chunk, I import the raw Qualtrics survey export into R directly from my Downloads folder and name the dataset omni. You should always give your R data simple names that are intuitive to you. Because you will be typing the name of the dataset over and over, making it short saves time and makes you a more efficient coder.

omni <- read_sav("C:/Users/Carey/Downloads/srm695_omnibus_November+27,+2024_11.07.sav")

If the code is successful, you will see the data under the given name, here omni, in the Environment panel in RStudio. We see that omni has 297 observations, meaning there are 297 responses that a respondent started but did not necessarily complete, along with 178 individual columns of data. The 178 variables are the individual survey questions, plus metadata about the survey and the respondent, as well as the demographic responses provided by Lucid for the sample.

FYI: Lucid does not provide survey weights for its samples, so you must create them yourself when analyzing your questions for the final project.

Reviewing and Cleaning Data

The first step once the data has been imported is to review it, both visually and with summary statistics, to get to know the data better. You can open the data for visual inspection using the view(data) command, which opens the entire dataset for review. Below, we use the head(omni) function to look at just the first few rows of the data.

Always download the codebook for the survey at the same time as you download the raw survey data. You will need to refer to the codebook frequently, so save it someplace that is easily accessible.

Reviewing Full Data

#view(data)
head(omni)
## # A tibble: 6 × 178
##   StartDate           EndDate             Status         IPAddress      Progress
##   <dttm>              <dttm>              <dbl+lbl>      <chr>             <dbl>
## 1 2024-11-26 13:39:00 2024-11-26 13:39:21 0 [IP Address] 173.49.139.121      100
## 2 2024-11-26 13:39:12 2024-11-26 13:39:25 0 [IP Address] 172.56.222.211      100
## 3 2024-11-26 13:38:59 2024-11-26 13:39:32 0 [IP Address] 174.97.227.99       100
## 4 2024-11-26 13:39:42 2024-11-26 13:39:50 0 [IP Address] 73.99.70.82         100
## 5 2024-11-26 13:39:22 2024-11-26 13:40:29 0 [IP Address] 173.218.31.167      100
## 6 2024-11-26 13:39:00 2024-11-26 13:40:43 0 [IP Address] 73.186.44.253       100
## # ℹ 173 more variables: Duration__in_seconds_ <dbl>, Finished <dbl+lbl>,
## #   RecordedDate <dttm>, ResponseId <chr>, RecipientLastName <chr>,
## #   RecipientFirstName <chr>, RecipientEmail <chr>, ExternalReference <chr>,
## #   LocationLatitude <chr>, LocationLongitude <chr>, DistributionChannel <chr>,
## #   UserLanguage <chr>, Q_RecaptchaScore <dbl>, Q_RelevantIDDuplicate <chr>,
## #   Q_RelevantIDDuplicateScore <dbl>, Q_RelevantIDFraudScore <dbl>,
## #   Q_RelevantIDLastStartDate <dttm>, Q1 <dbl+lbl>, main_branch_1 <dbl+lbl>, …
names(omni) #Reveals the name of every column in data
##   [1] "StartDate"                  "EndDate"                   
##   [3] "Status"                     "IPAddress"                 
##   [5] "Progress"                   "Duration__in_seconds_"     
##   [7] "Finished"                   "RecordedDate"              
##   [9] "ResponseId"                 "RecipientLastName"         
##  [11] "RecipientFirstName"         "RecipientEmail"            
##  [13] "ExternalReference"          "LocationLatitude"          
##  [15] "LocationLongitude"          "DistributionChannel"       
##  [17] "UserLanguage"               "Q_RecaptchaScore"          
##  [19] "Q_RelevantIDDuplicate"      "Q_RelevantIDDuplicateScore"
##  [21] "Q_RelevantIDFraudScore"     "Q_RelevantIDLastStartDate" 
##  [23] "Q1"                         "main_branch_1"             
##  [25] "main_branch_2"              "main_branch_3"             
##  [27] "main_branch_4"              "main_branch_5"             
##  [29] "main_branch_6"              "main_branch_7"             
##  [31] "main_branch_8"              "main_branch_9"             
##  [33] "main_branch_10"             "mohanty_news_1"            
##  [35] "mohanty_news_2"             "mohanty_news_3"            
##  [37] "mohanty_news_4"             "mohanty_news_5"            
##  [39] "mohanty_news_6"             "mohanty_news_6_TEXT"       
##  [41] "dinsmore_smusage"           "dinsmore_comms"            
##  [43] "dinsmore_comms_7_TEXT"      "yifan_livecomm_1"          
##  [45] "yifan_livecomm_2"           "yifan_livecomm_3"          
##  [47] "yifan_livecomm_4"           "shawn_worktype"            
##  [49] "shawn_worktype_5_TEXT"      "bot_check_4"               
##  [51] "bot_check_5"                "bot_check_6"               
##  [53] "bot_check_7"                "bot_check_8"               
##  [55] "gutierrez_suarez_ai"        "krishna_cata_1"            
##  [57] "krishna_cata_2"             "krishna_cata_3"            
##  [59] "krishna_cata_4"             "krishna_cata_5"            
##  [61] "krishna_cata_6"             "krishna_cata_7"            
##  [63] "krishna_cata_8"             "krishna_cata_9"            
##  [65] "krishna_cata_10"            "krishna_cata_11"           
##  [67] "krishna_cata_10_TEXT"       "ai_ad_1"                   
##  [69] "ai_ad_2"                    "ai_ad_3"                   
##  [71] "krishna_dc"                 "gutierrez_clear"           
##  [73] "suarez_ad2_1"               "suarez_ad2_2"              
##  [75] "dinsmore_mh"                "balint_stress"             
##  [77] "gomez_dinsmore_ad_1"        "gomez_dinsmore_ad_2"       
##  [79] "gomez_dinsmore_ad_3"        "gomez_dinsmore_ad_4"       
##  [81] "dinsmore_friendship"        "mohanty_emotneeds"         
##  [83] "SC_approval"                "brake_influence"           
##  [85] "brake_confidence"           "brake_ideology"            
##  [87] "brake_number"               "mohanty_disagree"          
##  [89] "mohanty_famideo"            "mohanty_famideo_13_TEXT"   
##  [91] "mohanty_commideo"           "mohanty_commideo_6_TEXT"   
##  [93] "amalea_source_1"            "amalea_source_2"           
##  [95] "amalea_source_3"            "amalea_source_4"           
##  [97] "amalea_source_5"            "amalea_source_6"           
##  [99] "amalea_source_7"            "amalea_source_6_TEXT"      
## [101] "amalea_fam"                 "amalea_stigma"             
## [103] "amalea_ad"                  "God_importance_xu"         
## [105] "God_help_xu"                "God_answer_xu"             
## [107] "gomez_recruit"              "gomez_success"             
## [109] "balint_struggle"            "balint_cost"               
## [111] "balint_inc_1"               "balint_inc_2"              
## [113] "balint_inc_3"               "balint_inc_4"              
## [115] "balint_inc_5"               "balint_inc_6"              
## [117] "balint_inc_5_TEXT"          "balint_impact"             
## [119] "mitchell_threat"            "mitchell_ad_1"             
## [121] "mitchell_ad_2"              "Q1ZO"                      
## [123] "Q2ZO"                       "Q5ZO"                      
## [125] "omar_ad_1"                  "omar_ad_2"                 
## [127] "kopecky_sat"                "yeumin_rec"                
## [129] "yuemin_ad_1"                "yuemin_ad_2"               
## [131] "yuemin_ad_3"                "yuemin_ad_4"               
## [133] "kopecky_goals_1"            "kopecky_goals_2"           
## [135] "kopecky_goals_3"            "kopecky_goals_4"           
## [137] "kopecky_goals_5"            "kopecky_goals_6"           
## [139] "kopecky_goals_7"            "kopecky_goals_7_TEXT"      
## [141] "kopecky_focus"              "kopecky_focus_6_TEXT"      
## [143] "kopecky_type_1"             "kopecky_type_2"            
## [145] "kopecky_type_3"             "shawn_overall"             
## [147] "shawn_ad"                   "stapleton_so_1"            
## [149] "stapleton_so_2"             "stapleton_so_3"            
## [151] "stapleton_so_4"             "Q107"                      
## [153] "yifan_why_12"               "yifan_why_13"              
## [155] "yifan_why_14"               "yifan_why_15"              
## [157] "yifan_why_16"               "yifan_why_18"              
## [159] "yifan_why_19"               "yifan_why_20"              
## [161] "yifan_why_21"               "yifan_why_22"              
## [163] "yifan_why_22_TEXT"          "yifan_length"              
## [165] "mitchell_cc"                "pol_ideo"                  
## [167] "pol_ideo_8_TEXT"            "God_pray_xu"               
## [169] "rid"                        "age"                       
## [171] "gender"                     "hhi"                       
## [173] "ethnicity"                  "hispanic"                  
## [175] "education"                  "region"                    
## [177] "zip"                        "political_party"
skim(omni) #Gives high level information about each variable in data
Data summary
Name omni
Number of rows 297
Number of columns 178
_______________________
Column type frequency:
character 33
numeric 141
POSIXct 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
IPAddress 0 1 9 15 0 297 0
ResponseId 0 1 17 17 0 297 0
RecipientLastName 0 1 0 0 297 1 0
RecipientFirstName 0 1 0 0 297 1 0
RecipientEmail 0 1 0 0 297 1 0
ExternalReference 0 1 0 0 297 1 0
LocationLatitude 0 1 5 7 0 278 0
LocationLongitude 0 1 6 9 0 278 0
DistributionChannel 0 1 9 9 0 1 0
UserLanguage 0 1 2 2 0 1 0
Q_RelevantIDDuplicate 0 1 0 0 297 1 0
mohanty_news_6_TEXT 0 1 0 36 281 17 0
dinsmore_comms_7_TEXT 0 1 0 21 295 3 0
shawn_worktype_5_TEXT 0 1 0 17 286 10 0
krishna_cata_10_TEXT 0 1 0 17 294 4 0
mohanty_famideo_13_TEXT 0 1 0 33 295 3 0
mohanty_commideo_6_TEXT 0 1 0 12 296 2 0
amalea_source_6_TEXT 0 1 0 88 290 8 0
balint_inc_5_TEXT 0 1 0 47 289 9 0
kopecky_goals_7_TEXT 0 1 0 62 291 7 0
kopecky_focus_6_TEXT 0 1 0 28 282 14 0
yifan_why_22_TEXT 0 1 0 178 276 22 0
pol_ideo_8_TEXT 0 1 0 11 296 2 0
rid 0 1 36 36 0 297 0
age 0 1 2 2 0 67 0
gender 0 1 1 1 0 2 0
hhi 0 1 1 2 0 24 0
ethnicity 0 1 1 2 0 11 0
hispanic 0 1 1 2 0 10 0
education 0 1 1 5 0 9 0
region 0 1 1 1 0 4 0
zip 0 1 5 5 0 294 0
political_party 0 1 1 1 0 8 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Status 0 1.00 0.00 0.00 0 0.00 0 0.00 0 ▁▁▇▁▁
Progress 0 1.00 100.00 0.00 100 100.00 100 100.00 100 ▁▁▇▁▁
Duration_in_seconds 0 1.00 992.61 3582.40 7 395.00 592 827.00 59343 ▇▁▁▁▁
Finished 0 1.00 1.00 0.00 1 1.00 1 1.00 1 ▁▁▇▁▁
Q_RecaptchaScore 0 1.00 0.90 0.18 0 0.90 1 1.00 1 ▁▁▁▂▇
Q_RelevantIDDuplicateScore 0 1.00 0.00 0.00 0 0.00 0 0.00 0 ▁▁▇▁▁
Q_RelevantIDFraudScore 0 1.00 1.31 7.44 0 0.00 0 0.00 105 ▇▁▁▁▁
Q1 0 1.00 1.02 0.13 1 1.00 1 1.00 2 ▇▁▁▁▁
main_branch_1 5 0.98 0.71 0.45 0 0.00 1 1.00 1 ▃▁▁▁▇
main_branch_2 5 0.98 0.46 0.50 0 0.00 0 1.00 1 ▇▁▁▁▇
main_branch_3 5 0.98 0.25 0.43 0 0.00 0 0.00 1 ▇▁▁▁▂
main_branch_4 5 0.98 0.17 0.38 0 0.00 0 0.00 1 ▇▁▁▁▂
main_branch_5 5 0.98 0.47 0.50 0 0.00 0 1.00 1 ▇▁▁▁▇
main_branch_6 5 0.98 0.35 0.48 0 0.00 0 1.00 1 ▇▁▁▁▅
main_branch_7 5 0.98 0.37 0.48 0 0.00 0 1.00 1 ▇▁▁▁▅
main_branch_8 5 0.98 0.82 0.39 0 1.00 1 1.00 1 ▂▁▁▁▇
main_branch_9 5 0.98 0.05 0.22 0 0.00 0 0.00 1 ▇▁▁▁▁
main_branch_10 5 0.98 0.03 0.17 0 0.00 0 0.00 1 ▇▁▁▁▁
mohanty_news_1 5 0.98 0.77 0.42 0 1.00 1 1.00 1 ▂▁▁▁▇
mohanty_news_2 5 0.98 0.61 0.49 0 0.00 1 1.00 1 ▅▁▁▁▇
mohanty_news_3 5 0.98 0.25 0.44 0 0.00 0 1.00 1 ▇▁▁▁▃
mohanty_news_4 5 0.98 0.30 0.46 0 0.00 0 1.00 1 ▇▁▁▁▃
mohanty_news_5 5 0.98 0.28 0.45 0 0.00 0 1.00 1 ▇▁▁▁▃
mohanty_news_6 5 0.98 0.05 0.23 0 0.00 0 0.00 1 ▇▁▁▁▁
dinsmore_smusage 5 0.98 2.25 1.30 1 1.00 2 3.00 6 ▇▂▁▁▁
dinsmore_comms 6 0.98 2.67 1.76 1 1.00 2 3.00 8 ▇▂▂▂▁
yifan_livecomm_1 5 0.98 0.38 0.49 0 0.00 0 1.00 1 ▇▁▁▁▅
yifan_livecomm_2 5 0.98 0.03 0.17 0 0.00 0 0.00 1 ▇▁▁▁▁
yifan_livecomm_3 5 0.98 0.08 0.27 0 0.00 0 0.00 1 ▇▁▁▁▁
yifan_livecomm_4 5 0.98 0.52 0.50 0 0.00 1 1.00 1 ▇▁▁▁▇
shawn_worktype 162 0.45 2.54 1.34 1 1.00 2 4.00 5 ▇▆▅▅▂
bot_check_4 5 0.98 0.00 0.06 0 0.00 0 0.00 1 ▇▁▁▁▁
bot_check_5 5 0.98 0.02 0.15 0 0.00 0 0.00 1 ▇▁▁▁▁
bot_check_6 5 0.98 0.96 0.20 0 1.00 1 1.00 1 ▁▁▁▁▇
bot_check_7 5 0.98 0.93 0.26 0 1.00 1 1.00 1 ▁▁▁▁▇
bot_check_8 5 0.98 0.00 0.00 0 0.00 0 0.00 0 ▁▁▇▁▁
gutierrez_suarez_ai 31 0.90 3.20 1.09 1 3.00 3 4.00 5 ▂▃▇▇▂
krishna_cata_1 31 0.90 0.41 0.49 0 0.00 0 1.00 1 ▇▁▁▁▆
krishna_cata_2 31 0.90 0.38 0.49 0 0.00 0 1.00 1 ▇▁▁▁▅
krishna_cata_3 31 0.90 0.09 0.29 0 0.00 0 0.00 1 ▇▁▁▁▁
krishna_cata_4 31 0.90 0.06 0.23 0 0.00 0 0.00 1 ▇▁▁▁▁
krishna_cata_5 31 0.90 0.06 0.24 0 0.00 0 0.00 1 ▇▁▁▁▁
krishna_cata_6 31 0.90 0.18 0.39 0 0.00 0 0.00 1 ▇▁▁▁▂
krishna_cata_7 31 0.90 0.11 0.32 0 0.00 0 0.00 1 ▇▁▁▁▁
krishna_cata_8 31 0.90 0.04 0.19 0 0.00 0 0.00 1 ▇▁▁▁▁
krishna_cata_9 31 0.90 0.06 0.25 0 0.00 0 0.00 1 ▇▁▁▁▁
krishna_cata_10 31 0.90 0.01 0.11 0 0.00 0 0.00 1 ▇▁▁▁▁
krishna_cata_11 31 0.90 0.38 0.49 0 0.00 0 1.00 1 ▇▁▁▁▅
ai_ad_1 32 0.89 2.06 1.07 1 1.00 2 3.00 5 ▇▇▃▁▁
ai_ad_2 31 0.90 2.80 1.25 1 2.00 3 4.00 5 ▅▇▇▃▃
ai_ad_3 31 0.90 2.24 1.08 1 1.00 2 3.00 5 ▇▇▇▂▁
krishna_dc 133 0.55 2.71 1.23 1 2.00 3 4.00 5 ▅▇▆▆▂
gutierrez_clear 133 0.55 1.82 0.87 1 1.00 2 2.00 5 ▇▇▃▁▁
suarez_ad2_1 169 0.43 2.91 1.42 1 2.00 3 4.00 5 ▅▇▅▅▆
suarez_ad2_2 169 0.43 2.41 1.26 1 1.00 2 3.00 5 ▇▇▅▅▂
dinsmore_mh 31 0.90 2.15 0.97 1 1.00 2 3.00 5 ▅▇▅▁▁
balint_stress 31 0.90 2.83 1.12 1 2.00 3 4.00 5 ▃▇▇▅▂
gomez_dinsmore_ad_1 31 0.90 2.45 1.19 1 2.00 2 3.00 5 ▅▇▃▃▂
gomez_dinsmore_ad_2 31 0.90 1.98 1.10 1 1.00 2 2.00 5 ▇▇▂▂▁
gomez_dinsmore_ad_3 31 0.90 2.06 1.17 1 1.00 2 3.00 5 ▇▆▃▁▁
gomez_dinsmore_ad_4 31 0.90 3.29 1.34 1 2.00 3 5.00 5 ▃▆▇▆▇
dinsmore_friendship 31 0.90 2.02 0.98 1 1.00 2 3.00 5 ▇▇▅▁▁
mohanty_emotneeds 31 0.90 2.50 1.21 1 2.00 2 3.00 5 ▅▇▅▃▂
SC_approval 31 0.90 2.49 0.93 1 2.00 2 3.00 4 ▂▇▁▆▃
brake_influence 31 0.90 2.83 1.10 1 2.00 3 3.00 5 ▂▅▇▃▂
brake_confidence 31 0.90 3.17 1.20 1 2.00 3 4.00 5 ▃▆▇▇▅
brake_ideology 31 0.90 2.24 0.72 1 2.00 2 3.00 3 ▃▁▇▁▇
brake_number 32 0.89 8.90 3.53 1 8.00 9 9.00 21 ▂▇▂▁▁
mohanty_disagree 31 0.90 3.31 1.19 1 2.00 4 4.00 5 ▂▃▅▇▃
mohanty_famideo 31 0.90 10.09 1.86 8 9.00 10 10.00 15 ▇▆▂▁▂
mohanty_commideo 31 0.90 2.84 1.41 1 2.00 3 3.00 6 ▇▆▁▃▁
amalea_source_1 31 0.90 0.28 0.45 0 0.00 0 1.00 1 ▇▁▁▁▃
amalea_source_2 31 0.90 0.18 0.39 0 0.00 0 0.00 1 ▇▁▁▁▂
amalea_source_3 31 0.90 0.32 0.47 0 0.00 0 1.00 1 ▇▁▁▁▃
amalea_source_4 31 0.90 0.03 0.16 0 0.00 0 0.00 1 ▇▁▁▁▁
amalea_source_5 31 0.90 0.28 0.45 0 0.00 0 1.00 1 ▇▁▁▁▃
amalea_source_6 31 0.90 0.03 0.17 0 0.00 0 0.00 1 ▇▁▁▁▁
amalea_source_7 31 0.90 0.33 0.47 0 0.00 0 1.00 1 ▇▁▁▁▃
amalea_fam 31 0.90 3.52 1.35 1 2.00 4 5.00 5 ▂▅▅▅▇
amalea_stigma 31 0.90 3.27 1.34 1 2.00 3 4.75 5 ▃▆▇▆▇
amalea_ad 31 0.90 1.82 0.95 1 1.00 2 2.00 5 ▇▆▂▁▁
God_importance_xu 102 0.66 1.51 0.78 1 1.00 1 2.00 5 ▇▃▂▁▁
God_help_xu 102 0.66 1.90 1.18 1 1.00 1 3.00 6 ▇▂▁▁▁
God_answer_xu 102 0.66 2.23 1.27 1 1.00 2 3.00 6 ▇▂▁▁▁
gomez_recruit 31 0.90 2.24 1.13 1 1.00 2 3.00 5 ▇▇▆▁▂
gomez_success 31 0.90 1.48 0.81 1 1.00 1 2.00 5 ▇▂▁▁▁
balint_struggle 31 0.90 3.19 1.32 1 2.00 3 4.00 5 ▃▇▇▅▇
balint_cost 31 0.90 2.82 1.34 1 2.00 3 4.00 5 ▆▇▇▃▅
balint_inc_1 31 0.90 0.11 0.31 0 0.00 0 0.00 1 ▇▁▁▁▁
balint_inc_2 31 0.90 0.21 0.41 0 0.00 0 0.00 1 ▇▁▁▁▂
balint_inc_3 31 0.90 0.19 0.39 0 0.00 0 0.00 1 ▇▁▁▁▂
balint_inc_4 31 0.90 0.15 0.36 0 0.00 0 0.00 1 ▇▁▁▁▂
balint_inc_5 31 0.90 0.03 0.17 0 0.00 0 0.00 1 ▇▁▁▁▁
balint_inc_6 31 0.90 0.55 0.50 0 0.00 1 1.00 1 ▇▁▁▁▇
balint_impact 31 0.90 2.85 1.23 1 2.00 3 4.00 5 ▃▆▇▃▃
mitchell_threat 32 0.89 1.45 0.50 1 1.00 1 2.00 2 ▇▁▁▁▆
mitchell_ad_1 31 0.90 1.68 0.78 1 1.00 2 2.00 5 ▇▆▂▁▁
mitchell_ad_2 31 0.90 3.11 1.21 1 2.00 3 4.00 5 ▃▆▇▆▅
Q1ZO 31 0.90 3.74 1.27 1 3.00 4 5.00 5 ▁▂▅▅▇
Q2ZO 32 0.89 2.35 0.96 1 2.00 2 3.00 4 ▅▇▁▅▃
Q5ZO 33 0.89 2.97 1.27 1 2.00 3 4.00 5 ▃▆▇▅▅
omar_ad_1 31 0.90 2.62 1.18 1 2.00 3 3.00 5 ▅▆▇▂▂
omar_ad_2 32 0.89 2.52 1.13 1 2.00 3 3.00 5 ▅▅▇▂▁
kopecky_sat 169 0.43 1.81 0.96 1 1.00 2 2.00 5 ▇▆▂▁▁
yeumin_rec 31 0.90 2.21 1.13 1 1.00 2 3.00 5 ▇▇▆▂▁
yuemin_ad_1 31 0.90 2.19 1.01 1 1.00 2 3.00 5 ▅▇▃▂▁
yuemin_ad_2 31 0.90 2.05 1.02 1 1.00 2 3.00 5 ▇▇▃▁▁
yuemin_ad_3 31 0.90 1.71 0.77 1 1.00 2 2.00 5 ▇▇▂▁▁
yuemin_ad_4 31 0.90 2.41 1.14 1 2.00 2 3.00 5 ▅▇▃▃▁
kopecky_goals_1 169 0.43 0.84 0.36 0 1.00 1 1.00 1 ▂▁▁▁▇
kopecky_goals_2 169 0.43 0.29 0.46 0 0.00 0 1.00 1 ▇▁▁▁▃
kopecky_goals_3 169 0.43 0.32 0.47 0 0.00 0 1.00 1 ▇▁▁▁▃
kopecky_goals_4 169 0.43 0.42 0.50 0 0.00 0 1.00 1 ▇▁▁▁▆
kopecky_goals_5 169 0.43 0.24 0.43 0 0.00 0 0.00 1 ▇▁▁▁▂
kopecky_goals_6 169 0.43 0.45 0.50 0 0.00 0 1.00 1 ▇▁▁▁▆
kopecky_goals_7 169 0.43 0.05 0.21 0 0.00 0 0.00 1 ▇▁▁▁▁
kopecky_focus 169 0.43 3.21 1.58 1 2.00 3 4.25 6 ▇▃▂▂▂
kopecky_type_1 169 0.43 0.38 0.49 0 0.00 0 1.00 1 ▇▁▁▁▅
kopecky_type_2 169 0.43 0.22 0.42 0 0.00 0 0.00 1 ▇▁▁▁▂
kopecky_type_3 169 0.43 0.52 0.50 0 0.00 1 1.00 1 ▇▁▁▁▇
shawn_overall 169 0.43 1.88 0.86 1 1.00 2 2.00 5 ▇▇▃▁▁
shawn_ad 169 0.43 2.46 1.74 1 1.00 2 3.00 6 ▇▁▁▁▂
stapleton_so_1 202 0.32 2.55 1.63 1 1.00 2 3.00 6 ▇▂▁▁▂
stapleton_so_2 202 0.32 2.56 1.54 1 1.00 2 3.00 6 ▇▂▁▁▂
stapleton_so_3 202 0.32 3.17 1.66 1 2.00 3 4.00 6 ▇▅▃▁▃
stapleton_so_4 202 0.32 2.99 1.60 1 2.00 3 4.00 6 ▇▃▂▁▃
Q107 279 0.06 1.94 0.80 1 1.25 2 2.00 4 ▃▇▁▂▁
yifan_why_12 157 0.47 0.22 0.42 0 0.00 0 0.00 1 ▇▁▁▁▂
yifan_why_13 157 0.47 0.34 0.48 0 0.00 0 1.00 1 ▇▁▁▁▅
yifan_why_14 157 0.47 0.40 0.49 0 0.00 0 1.00 1 ▇▁▁▁▅
yifan_why_15 157 0.47 0.06 0.23 0 0.00 0 0.00 1 ▇▁▁▁▁
yifan_why_16 157 0.47 0.14 0.35 0 0.00 0 0.00 1 ▇▁▁▁▁
yifan_why_18 157 0.47 0.10 0.30 0 0.00 0 0.00 1 ▇▁▁▁▁
yifan_why_19 157 0.47 0.07 0.26 0 0.00 0 0.00 1 ▇▁▁▁▁
yifan_why_20 157 0.47 0.06 0.23 0 0.00 0 0.00 1 ▇▁▁▁▁
yifan_why_21 157 0.47 0.11 0.32 0 0.00 0 0.00 1 ▇▁▁▁▁
yifan_why_22 157 0.47 0.16 0.37 0 0.00 0 0.00 1 ▇▁▁▁▂
yifan_length 32 0.89 2.87 1.20 1 2.00 3 4.00 4 ▃▃▁▃▇
mitchell_cc 31 0.90 2.90 1.26 1 2.00 3 4.00 5 ▃▇▇▃▅
pol_ideo 31 0.90 4.33 1.98 1 3.00 4 6.00 9 ▅▇▂▅▁
God_pray_xu 31 0.90 2.28 1.59 1 1.00 1 4.00 5 ▇▂▂▁▃

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
StartDate 0 1 2024-11-26 13:38:32 2024-11-26 14:02:00 2024-11-26 13:43:11 211
EndDate 0 1 2024-11-26 13:39:21 2024-11-27 06:08:32 2024-11-26 13:54:23 277
RecordedDate 0 1 2024-11-26 13:39:22 2024-11-27 06:08:34 2024-11-26 13:54:25 279
Q_RelevantIDLastStartDate 0 1 1582-10-14 00:00:00 1582-10-14 00:00:00 1582-10-14 00:00:00 1

When we look at the data, we see a wide variety of background data (called metadata), such as the duration of the survey, the Qualtrics ID information, the IP address, etc., as well as the survey questions and responses themselves. On closer inspection, we see that the initial branching question main_branch_1 has 0's, 1's, and NAs. The NAs mean the respondent never saw the question, which tells us, as researchers, that it is an incomplete response. We need to investigate the 297 responses to determine which ones finished the survey and which failed the bot-detection question so that we only analyze legitimate responses.

For surveys with as many questions as this one, we can also review just a few specific variables at a time.
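One quick way to do this is to select a handful of columns and look at just those; the specific variables chosen here are purely for illustration.

omni %>% 
  select(Q1, main_branch_1, mohanty_news_1) %>% #Pick a few columns of interest
  head()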

The first thing to do with the raw data is to run our column names through a data cleaning package called janitor. Oftentimes data exports contain odd characters or inconsistent capitalization. Using a package like janitor ensures consistency in our variable names, making our lives as coders easier.

omni <- clean_names(omni)
names(omni)
##   [1] "start_date"                    "end_date"                     
##   [3] "status"                        "ip_address"                   
##   [5] "progress"                      "duration_in_seconds"          
##   [7] "finished"                      "recorded_date"                
##   [9] "response_id"                   "recipient_last_name"          
##  [11] "recipient_first_name"          "recipient_email"              
##  [13] "external_reference"            "location_latitude"            
##  [15] "location_longitude"            "distribution_channel"         
##  [17] "user_language"                 "q_recaptcha_score"            
##  [19] "q_relevant_id_duplicate"       "q_relevant_id_duplicate_score"
##  [21] "q_relevant_id_fraud_score"     "q_relevant_id_last_start_date"
##  [23] "q1"                            "main_branch_1"                
##  [25] "main_branch_2"                 "main_branch_3"                
##  [27] "main_branch_4"                 "main_branch_5"                
##  [29] "main_branch_6"                 "main_branch_7"                
##  [31] "main_branch_8"                 "main_branch_9"                
##  [33] "main_branch_10"                "mohanty_news_1"               
##  [35] "mohanty_news_2"                "mohanty_news_3"               
##  [37] "mohanty_news_4"                "mohanty_news_5"               
##  [39] "mohanty_news_6"                "mohanty_news_6_text"          
##  [41] "dinsmore_smusage"              "dinsmore_comms"               
##  [43] "dinsmore_comms_7_text"         "yifan_livecomm_1"             
##  [45] "yifan_livecomm_2"              "yifan_livecomm_3"             
##  [47] "yifan_livecomm_4"              "shawn_worktype"               
##  [49] "shawn_worktype_5_text"         "bot_check_4"                  
##  [51] "bot_check_5"                   "bot_check_6"                  
##  [53] "bot_check_7"                   "bot_check_8"                  
##  [55] "gutierrez_suarez_ai"           "krishna_cata_1"               
##  [57] "krishna_cata_2"                "krishna_cata_3"               
##  [59] "krishna_cata_4"                "krishna_cata_5"               
##  [61] "krishna_cata_6"                "krishna_cata_7"               
##  [63] "krishna_cata_8"                "krishna_cata_9"               
##  [65] "krishna_cata_10"               "krishna_cata_11"              
##  [67] "krishna_cata_10_text"          "ai_ad_1"                      
##  [69] "ai_ad_2"                       "ai_ad_3"                      
##  [71] "krishna_dc"                    "gutierrez_clear"              
##  [73] "suarez_ad2_1"                  "suarez_ad2_2"                 
##  [75] "dinsmore_mh"                   "balint_stress"                
##  [77] "gomez_dinsmore_ad_1"           "gomez_dinsmore_ad_2"          
##  [79] "gomez_dinsmore_ad_3"           "gomez_dinsmore_ad_4"          
##  [81] "dinsmore_friendship"           "mohanty_emotneeds"            
##  [83] "sc_approval"                   "brake_influence"              
##  [85] "brake_confidence"              "brake_ideology"               
##  [87] "brake_number"                  "mohanty_disagree"             
##  [89] "mohanty_famideo"               "mohanty_famideo_13_text"      
##  [91] "mohanty_commideo"              "mohanty_commideo_6_text"      
##  [93] "amalea_source_1"               "amalea_source_2"              
##  [95] "amalea_source_3"               "amalea_source_4"              
##  [97] "amalea_source_5"               "amalea_source_6"              
##  [99] "amalea_source_7"               "amalea_source_6_text"         
## [101] "amalea_fam"                    "amalea_stigma"                
## [103] "amalea_ad"                     "god_importance_xu"            
## [105] "god_help_xu"                   "god_answer_xu"                
## [107] "gomez_recruit"                 "gomez_success"                
## [109] "balint_struggle"               "balint_cost"                  
## [111] "balint_inc_1"                  "balint_inc_2"                 
## [113] "balint_inc_3"                  "balint_inc_4"                 
## [115] "balint_inc_5"                  "balint_inc_6"                 
## [117] "balint_inc_5_text"             "balint_impact"                
## [119] "mitchell_threat"               "mitchell_ad_1"                
## [121] "mitchell_ad_2"                 "q1zo"                         
## [123] "q2zo"                          "q5zo"                         
## [125] "omar_ad_1"                     "omar_ad_2"                    
## [127] "kopecky_sat"                   "yeumin_rec"                   
## [129] "yuemin_ad_1"                   "yuemin_ad_2"                  
## [131] "yuemin_ad_3"                   "yuemin_ad_4"                  
## [133] "kopecky_goals_1"               "kopecky_goals_2"              
## [135] "kopecky_goals_3"               "kopecky_goals_4"              
## [137] "kopecky_goals_5"               "kopecky_goals_6"              
## [139] "kopecky_goals_7"               "kopecky_goals_7_text"         
## [141] "kopecky_focus"                 "kopecky_focus_6_text"         
## [143] "kopecky_type_1"                "kopecky_type_2"               
## [145] "kopecky_type_3"                "shawn_overall"                
## [147] "shawn_ad"                      "stapleton_so_1"               
## [149] "stapleton_so_2"                "stapleton_so_3"               
## [151] "stapleton_so_4"                "q107"                         
## [153] "yifan_why_12"                  "yifan_why_13"                 
## [155] "yifan_why_14"                  "yifan_why_15"                 
## [157] "yifan_why_16"                  "yifan_why_18"                 
## [159] "yifan_why_19"                  "yifan_why_20"                 
## [161] "yifan_why_21"                  "yifan_why_22"                 
## [163] "yifan_why_22_text"             "yifan_length"                 
## [165] "mitchell_cc"                   "pol_ideo"                     
## [167] "pol_ideo_8_text"               "god_pray_xu"                  
## [169] "rid"                           "age"                          
## [171] "gender"                        "hhi"                          
## [173] "ethnicity"                     "hispanic"                     
## [175] "education"                     "region"                       
## [177] "zip"                           "political_party"

Since our variable names are generally consistent already, you will not see a huge difference from this command. The one thing to note is that it converted the handful of columns that started with an uppercase letter to lowercase, which is best coding practice.

Removing Incomplete and Potential Bot Respondents

We start by looking at the first question in the survey, q1, which is the informed consent question. If a respondent did not consent, they immediately left the survey. This group does not need to be in our final data, so let's remove them first.

We use the freq(omni$q1) function along with val_labels(omni$q1) to see how many of each response we have in our data and what the values mean.

freq(omni$q1, plot=F)
## Thank you for agreeing to participate in this survey conducted by students at the University of Massachusetts. Your participation in this study is completely voluntary, and you can withdraw at any time with no penalty. The survey should take around 10 minutes to complete. You will be asked questions about a wide variety of topics. 
## 
## By selecting Yes below, you consent to taking part in this research study. By selecting No below, you do not consent to take part in this study. 
##       Frequency Percent
## 1           292  98.316
## 2             5   1.684
## Total       297 100.000
val_labels(omni$q1)
## Yes  No 
##   1   2

Five respondents did not consent to take the survey but are still included in the raw data. We delete them using the filter function from the tidyverse.

We also save a new dataset to keep the original intact and untransformed. We’ll call the new data simply data.

data<-omni %>% 
  filter(q1==1)
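As a quick sanity check (not part of the original workflow, but cheap to run), you can compare the row counts of the two datasets directly.

nrow(omni) #297 responses in the raw import
nrow(data) #292 responses after dropping the non-consenters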

Now we have a new dataframe in R with 292 observations, 5 fewer than the original 297. The bot-detection question asked respondents to select options 3 and 4. Bots are not good at following instructions like this, so they tend to fail this question routinely. We need to remove the responses that failed this check.

From the codebook, we see that the bot-detection questions all start with bot_check_ followed by a number between 4 and 8. The odd numbering is a quirk of Qualtrics that I have never been able to explain. The codebook tells us that option 3 is bot_check_6 while option 4 is bot_check_7. To continue in the survey, respondents had to select both of those options and nothing else. Let's run a frequency on each option to see how often each was selected.

freq(data$bot_check_6, plot=F)
## For this survey question, we are interested in how many people read the question closely so please select options 3 and 4 below. We thank you for being engaged with the survey. 3 
##       Frequency Percent
## 0            12    4.11
## 1           280   95.89
## Total       292  100.00
freq(data$bot_check_7, plot=F)
## For this survey question, we are interested in how many people read the question closely so please select options 3 and 4 below. We thank you for being engaged with the survey. 4 
##       Frequency Percent
## 0            21   7.192
## 1           271  92.808
## Total       292 100.000

The vast majority of respondents selected the correct options: 96% selected option 3 and 93% selected option 4. But we need to remove the responses that did not select both.

We do this by first creating a new variable that simply adds the two bot_check variables together and then removing any response where the sum does not equal 2. This works because the sum of the two bot_check questions will only equal 2 if the respondent correctly selected both options.

To do this, we use the mutate function from the tidyverse to sum the correct responses, then we use the filter command to remove any cases that do not equal 2. We save the result back into data since this is our transformed dataset. Remember, we keep the original untransformed just in case.

data <- data %>% 
  mutate(correct = bot_check_6 + bot_check_7) %>% 
  filter(correct == 2)

freq(data$finished, plot=F)
## Finished 
##       Frequency Percent
## 1           266     100
## Total       266     100

Now we see that we have 266 observations, so we dropped another 26 respondents who failed the bot-detection question. Raw survey data often contains responses that only partially completed the survey. These cases should be removed and should not be included in any analysis or reporting. Since our data was exported without the incompletes included, we do not have any additional cases to filter out. To confirm this, we run the freq command on the finished variable, a Qualtrics-provided variable that equals 1 if the respondent finished the survey and 0 if they did not. If you have any 0's here, you should filter them out using the filter approach from the code chunk above. Since we only have finished responses, we will not take any additional steps with our class data.
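If your own export does include partial responses, a filter like the following sketch would drop them; in our data it changes nothing since every response is marked as finished.

data <- data %>% 
  filter(finished == 1) #Keep only responses Qualtrics marked as finished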

This means that our final survey data has 266 completed responses for us to analyze. Not every variable will have 266 responses, though. Sometimes a few respondents will simply not answer a question, while other questions are asked only of a subset of respondents who answered a previous question in a specific way. It is important for you to understand why you have missing data: is it because the respondent skipped the question, or because the design meant that respondent never saw it?

Cleaning Survey Data for Analysis

Raw survey data is rarely ready for analysis without transforming the data in some way. In this section, we will review a variety of common data cleaning and recoding techniques used in survey research, including:

- Finding and dealing with NAs (system missing data)
- Finding and removing don't knows/PNA/no opinions
- Labeling unlabeled question values
- Collapsing many groups into fewer

Finding and Dealing with NAs (System Missing Data)

Any time a survey respondent does not see a question (i.e., they did not qualify to see it because of branching logic) or skips a question when asked (i.e., they saw the question and simply refused to answer it), the data is imported into R as NA. Previously, we talked about how you can decide whether or not to include these when reporting results to the public. For analysis purposes, however, it is important that we deal with them in order to get certain analyses to work.

Luckily, there are numerous ways to identify NAs in your results. Above, we used the skim function to look at the entire dataset. While that is useful, it can be cumbersome with a survey as large as the one we have here.
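For a single variable, a quick base R check also works; stapleton_so_1 is used here purely as an illustration.

sum(is.na(data$stapleton_so_1))  #Count of system missing responses
mean(is.na(data$stapleton_so_1)) #Proportion of the sample that is missing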

We use these variables (stapleton_so_1 through stapleton_so_4 and main_branch_6) as the example; think about your own variables and replace these names with the appropriate names for your work.

First, let's review the summary information for these specific variables. There are a number of ways to do this in R. Here we use the get_summary_stats function from the rstatix package, which gives you a broad statistical overview of your variables.

There are two approaches to using this function in the following code chunk. In the first approach, you manually input each variable name of interest. In the second approach, you take advantage of consistent variable naming conventions and summarize all variables whose names start with a specific prefix; here we use stapleton_so_, since all of these variables start with that except for the branching question, which we simply add to the code manually. This approach is efficient but only works if your naming conventions are consistent, so ensure they are before employing it.

We also use the skimr package for a slightly different look at the data. This package is very helpful for understanding missing data in each variable. To use it on specific variables, we first use the select function to specify the exact variable names we want and then run skim(). Both give broadly the same information, although I like skim for its clearer readout of missing data.

#1
data %>%  #Manually enter the variable names
  get_summary_stats(     
    main_branch_6, stapleton_so_1,stapleton_so_2,
    stapleton_so_3, stapleton_so_4,  # columns to calculate for
    type = "common")  
## Warning: attributes are not identical across measure variables; they will be
## dropped
## # A tibble: 5 × 10
##   variable           n   min   max median   iqr  mean    sd    se    ci
##   <fct>          <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 main_branch_6    266     0     1      0     1 0.357  0.48 0.029 0.058
## 2 stapleton_so_1    95     1     6      2     2 2.55   1.63 0.167 0.332
## 3 stapleton_so_2    95     1     6      2     2 2.56   1.54 0.158 0.314
## 4 stapleton_so_3    95     1     6      3     2 3.17   1.66 0.17  0.338
## 5 stapleton_so_4    95     1     6      3     2 2.99   1.60 0.164 0.326
#2
data %>%  #Review many at once depending on how it starts 
  select(starts_with("stapleton_so_"), main_branch_6) %>% 
  get_summary_stats(type = "common")
## Warning: attributes are not identical across measure variables; they will be
## dropped
## # A tibble: 5 × 10
##   variable           n   min   max median   iqr  mean    sd    se    ci
##   <fct>          <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 stapleton_so_1    95     1     6      2     2 2.55   1.63 0.167 0.332
## 2 stapleton_so_2    95     1     6      2     2 2.56   1.54 0.158 0.314
## 3 stapleton_so_3    95     1     6      3     2 3.17   1.66 0.17  0.338
## 4 stapleton_so_4    95     1     6      3     2 2.99   1.60 0.164 0.326
## 5 main_branch_6    266     0     1      0     1 0.357  0.48 0.029 0.058
#Skimr Package Approach
#data %>% 
 # select(main_branch_6, stapleton_so_1, stapleton_so_2, stapleton_so_3, stapleton_so_4) %>%   skim()

data %>% 
  select(starts_with("stapleton_so_"), main_branch_6) %>% 
  skim()
Data summary
Name Piped data
Number of rows 266
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
stapleton_so_1 171 0.36 2.55 1.63 1 1 2 3 6 ▇▂▁▁▂
stapleton_so_2 171 0.36 2.56 1.54 1 1 2 3 6 ▇▂▁▁▂
stapleton_so_3 171 0.36 3.17 1.66 1 2 3 4 6 ▇▅▃▁▃
stapleton_so_4 171 0.36 2.99 1.60 1 2 3 4 6 ▇▃▂▁▃
main_branch_6 0 1.00 0.36 0.48 0 0 0 1 1 ▇▁▁▁▅

With this code, we see the min, max, median, mean and variation metrics of our specified variables. The two functions return broadly the same information but with some slight differences. The get_summary_stats code tells us the number of responses for each variable as well as min and max. From here, we see the branching question, main_branch_6, has 266 responses, which is the full sample.

However, we see only 95 responses for the other 4 questions, meaning a majority of the data for those questions is missing. The skim results show that we in fact have 171 system missing responses and only a 35.7% completion rate. If I were expecting the entire sample to have answered these questions, the 35.7% completion rate would be a horrifying result, as it would mean something went wrong either in the branching logic or in how the questions displayed to respondents. Because I created the survey, I know these questions were asked only of a subset of the full sample, those who said they liked MLB, so that volume of missing data does not concern me. To verify, we look at the mean of the branching question, i.e. the percentage of respondents who said yes, they like baseball, which is 35.7%. The exact same percentage, so there are no issues with these data.
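To make that check explicit in code, you can tabulate the branching answer against whether the follow-up item was answered. This is a sketch of my own, assuming main_branch_6 = 1 means the respondent selected MLB; every row that answered stapleton_so_1 should sit in the main_branch_6 = 1 group.

data %>% 
  count(main_branch_6, answered = !is.na(stapleton_so_1))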

Now, I know these variables have system missing responses, so I will need to deal with that once we start analyzing the data, but no additional action is necessary right now.

Finding and Removing Don’t Knows/PNA/No Opinions

There is one concerning issue in the previous results, which requires more investigation. The max value for the first four questions is 6. However, there are only 4 main responses to these questions (strongly support, somewhat support, somewhat oppose, strongly oppose), so a value of 6 indicates there is another response option present in the data.

To investigate more closely, we will use the freq and val_labels commands to understand what the max option of 6 actually is.

val_labels(data$stapleton_so_1)
##    Strongly support    Somewhat support     Somewhat oppose     Strongly oppose 
##                   1                   2                   3                   4 
## Not sure/No opinion 
##                   6
freq(data$stapleton_so_1, plot=F)
## Recently, Major League Baseball implemented various rule changes about how the game is played. This includes...
## 
##  The "universal DH" rule, which allowed National League teams to employ a Designated Hitter in each game to match the rules in the American League,
##  A pitch clock which forces pitchers to throw a pitch within a specified amount of time,
##  Banning the use of the "shift"  strategy which saw infielders being moved all around the infield, and
##  Making the bases larger to make it easier to steal bases.
## 
## 
## 
## For each of the following new MLB rules, how much do you support or oppose it? - Universal Designated Hitter (DH) 
##       Frequency Percent Valid Percent
## 1            28  10.526        29.474
## 2            33  12.406        34.737
## 3            14   5.263        14.737
## 4             7   2.632         7.368
## 6            13   4.887        13.684
## NA's        171  64.286              
## Total       266 100.000       100.000
freq(data$stapleton_so_2, plot=F)
## Recently, Major League Baseball implemented various rule changes about how the game is played. This includes...
## 
##  The "universal DH" rule, which allowed National League teams to employ a Designated Hitter in each game to match the rules in the American League,
##  A pitch clock which forces pitchers to throw a pitch within a specified amount of time,
##  Banning the use of the "shift"  strategy which saw infielders being moved all around the infield, and
##  Making the bases larger to make it easier to steal bases.
## 
## 
## 
## For each of the following new MLB rules, how much do you support or oppose it? - Pitch Clock 
##       Frequency Percent Valid Percent
## 1            25   9.398        26.316
## 2            34  12.782        35.789
## 3            16   6.015        16.842
## 4             9   3.383         9.474
## 6            11   4.135        11.579
## NA's        171  64.286              
## Total       266 100.000       100.000
freq(data$stapleton_so_3, plot=F)
## Recently, Major League Baseball implemented various rule changes about how the game is played. This includes...
## 
##  The "universal DH" rule, which allowed National League teams to employ a Designated Hitter in each game to match the rules in the American League,
##  A pitch clock which forces pitchers to throw a pitch within a specified amount of time,
##  Banning the use of the "shift"  strategy which saw infielders being moved all around the infield, and
##  Making the bases larger to make it easier to steal bases.
## 
## 
## 
## For each of the following new MLB rules, how much do you support or oppose it? - Infielder Shift Ban 
##       Frequency Percent Valid Percent
## 1            15   5.639         15.79
## 2            24   9.023         25.26
## 3            22   8.271         23.16
## 4            16   6.015         16.84
## 6            18   6.767         18.95
## NA's        171  64.286              
## Total       266 100.000        100.00
freq(data$stapleton_so_4, plot=F)
## Recently, Major League Baseball implemented various rule changes about how the game is played. This includes...
## 
##  The "universal DH" rule, which allowed National League teams to employ a Designated Hitter in each game to match the rules in the American League,
##  A pitch clock which forces pitchers to throw a pitch within a specified amount of time,
##  Banning the use of the "shift"  strategy which saw infielders being moved all around the infield, and
##  Making the bases larger to make it easier to steal bases.
## 
## 
## 
## For each of the following new MLB rules, how much do you support or oppose it? - Larger bases 
##       Frequency Percent Valid Percent
## 1            16   6.015         16.84
## 2            28  10.526         29.47
## 3            22   8.271         23.16
## 4            14   5.263         14.74
## 6            15   5.639         15.79
## NA's        171  64.286              
## Total       266 100.000        100.00

With the val_labels command, we see that strongly support = 1 and strongly oppose = 4, while not sure/no opinion = 6. This means that, right now, if we ran analytic techniques on this variable, we would be telling R that the don't knows are legitimate substantive responses to be analyzed. That would introduce error and bias into your results, so these responses should be explicitly removed from the variable prior to data exploration and analysis. This does not mean they must be removed for general reporting to the public; that is a separate part of being a survey researcher.

Now that we know option 6 is don't know/no opinion, we first look at its frequency for each question; it ranges from roughly 11 to 19%. Next, we need to remove these non-substantive responses from the data. We never remove them from the original variable, so anytime you transform a variable, create a new one.

We use the mutate function to recode the value 6 into NA in the 4 original stapleton_so_ variables and create new variables with names that signify what each one measures (e.g. stapleton_so_dh). This keeps the original variable's integrity intact while giving us a new variable that we can use in analysis and data exploration.

data <- data %>%
  mutate(
    stapleton_so_dh = replace(stapleton_so_1, stapleton_so_1 == 6, NA) , #Create new variable called 'stapleton_so_dh' which equals the original variable 'stapleton_so_1' but replace all 6 values (don't know/no opinion) as NA
    stapleton_so_pc = replace(stapleton_so_2, stapleton_so_2 == 6, NA),
    stapleton_so_shift = replace(stapleton_so_3, stapleton_so_3 == 6, NA),
    stapleton_so_bases = replace(stapleton_so_4, stapleton_so_4 == 6, NA)
  )

Always review your newly created variables to ensure the recoding/transformation worked as expected. Here, we use the get_summary_stats command to summarize all the stapleton_so_ variables at once. This is why I kept the naming convention the same.

data %>%  #Review many at once depending on how it starts 
  select(starts_with("stapleton_so_")) %>% 
  get_summary_stats(type = "common")
## Warning: attributes are not identical across measure variables; they will be
## dropped
## # A tibble: 8 × 10
##   variable               n   min   max median   iqr  mean    sd    se    ci
##   <fct>              <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 stapleton_so_1        95     1     6      2  2     2.55 1.63  0.167 0.332
## 2 stapleton_so_2        95     1     6      2  2     2.56 1.54  0.158 0.314
## 3 stapleton_so_3        95     1     6      3  2     3.17 1.66  0.17  0.338
## 4 stapleton_so_4        95     1     6      3  2     2.99 1.60  0.164 0.326
## 5 stapleton_so_dh       82     1     4      2  1.75  2    0.93  0.103 0.204
## 6 stapleton_so_pc       84     1     4      2  2     2.11 0.957 0.104 0.208
## 7 stapleton_so_shift    77     1     4      2  1     2.51 1.03  0.118 0.235
## 8 stapleton_so_bases    80     1     4      2  1     2.42 1.00  0.112 0.223

Here we see all 8 variables that start with stapleton_so_, including the four new variables we created. Notice how the n counts are slightly lower for the four new variables compared to their originals. This means we successfully removed the don't know responses. We can also tell this from the max of the newly created variables, which is now 4.

We can also use the freq function to quickly look at the individual values for the newly created variables.

freq(data$stapleton_so_dh, plot=F)
## Recently, Major League Baseball implemented various rule changes about how the game is played. This includes...
## 
##  The "universal DH" rule, which allowed National League teams to employ a Designated Hitter in each game to match the rules in the American League,
##  A pitch clock which forces pitchers to throw a pitch within a specified amount of time,
##  Banning the use of the "shift"  strategy which saw infielders being moved all around the infield, and
##  Making the bases larger to make it easier to steal bases.
## 
## 
## 
## For each of the following new MLB rules, how much do you support or oppose it? - Universal Designated Hitter (DH) 
##       Frequency Percent Valid Percent
## 1            28  10.526        34.146
## 2            33  12.406        40.244
## 3            14   5.263        17.073
## 4             7   2.632         8.537
## NA's        184  69.173              
## Total       266 100.000       100.000
val_labels(data$stapleton_so_dh)
##    Strongly support    Somewhat support     Somewhat oppose     Strongly oppose 
##                   1                   2                   3                   4 
## Not sure/No opinion 
##                   6

Once again, we see that the value 6 no longer appears in the frequencies, showing that we successfully removed those responses. We also use the val_labels function to confirm that the new variable keeps the original labels, so we do not need to re-label it here.

Labeling Unlabeled Question Values

Not all survey data will import with labels already applied to the response options. For us, the only variables like this are the Lucid-provided demographic results: they give us only the number for each demographic question, along with a codebook that tells us what the numbers mean. For reporting purposes, we would always need to provide the actual labels rather than the numbers. This is also frequently done when analyzing the data so the researcher does not have to go back and forth to the codebook constantly.

The demographic variables we have available are:

- hhi (household income, measured with 24 categories)
- age (measured in individual years from 18 to 99)
- gender (actually biological sex, but that is how Lucid names it)
- ethnicity (actually race, broken into 10 groups)
- hispanic (whether the respondent is of Hispanic background, with 10 unique options)
- education (with 8 options)
- region (region of the United States the respondent lives in, with 4 options)

Start by running the freq and val_labels commands on the hhi variable.

freq(data$hhi, plot=F)
## hhi 
##       Frequency  Percent
## 1            45  16.9173
## 10            6   2.2556
## 11           12   4.5113
## 12            9   3.3835
## 13            8   3.0075
## 14           10   3.7594
## 15            2   0.7519
## 16            6   2.2556
## 17            2   0.7519
## 18            1   0.3759
## 19           10   3.7594
## 2            11   4.1353
## 20            7   2.6316
## 21            9   3.3835
## 22            5   1.8797
## 23            2   0.7519
## 24            3   1.1278
## 3            24   9.0226
## 4            20   7.5188
## 5            15   5.6391
## 6            13   4.8872
## 7            10   3.7594
## 8            14   5.2632
## 9            22   8.2707
## Total       266 100.0000
val_labels(data$hhi)
## NULL

We see that hhi has 24 options. Notice how the val_labels command returned NULL. This means the individual values are not labeled in our dataset, so we need to refer to the provided codebook to see what each one means. Also notice that in the freq output the first option is 1, followed by 10, 11, and so on. Since the results are not in numeric order, this suggests R is treating this variable as a character rather than as a number.

We run the summary function on the demographic variables to understand how they were imported. The results confirm the problem: the class is character, which means R currently treats the numbers in these variables as strings rather than actual numbers. The first thing we must do is force these variables to be numeric.

data %>% 
  select(hhi, age, gender, ethnicity, hispanic, education, region) %>%   
  summary()
##      hhi                age               gender           ethnicity        
##  Length:266         Length:266         Length:266         Length:266        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##    hispanic          education            region         
##  Length:266         Length:266         Length:266        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character

We use the following tidyverse code with as.numeric to convert all the demographic variables to numbers. This gives us much more flexibility in how we use the data and puts the values into proper numeric order rather than 1, 10, 11, 12, 2, 20, 21, etc. Here we do not need to create new variables since we are simply changing the storage type rather than transforming the values in some tangible way.

data <- data %>%
  mutate(across(c(hhi, age, gender, ethnicity, hispanic, education, region), as.numeric))

Run the same code as above to double-check that we now have numbers instead of characters for our demographic variables.

data %>% 
  select(hhi, age, gender, ethnicity, hispanic, education, region) %>%   
  summary()
##       hhi             age            gender       ethnicity        hispanic    
##  Min.   : 1.00   Min.   :18.00   Min.   :1.00   Min.   : 1.00   Min.   : 1.00  
##  1st Qu.: 3.00   1st Qu.:35.00   1st Qu.:1.00   1st Qu.: 1.00   1st Qu.: 1.00  
##  Median : 7.00   Median :51.00   Median :2.00   Median : 1.00   Median : 1.00  
##  Mean   : 8.35   Mean   :50.98   Mean   :1.62   Mean   : 1.94   Mean   : 1.82  
##  3rd Qu.:12.00   3rd Qu.:66.75   3rd Qu.:2.00   3rd Qu.: 2.00   3rd Qu.: 1.00  
##  Max.   :24.00   Max.   :99.00   Max.   :2.00   Max.   :16.00   Max.   :16.00  
##    education            region     
##  Min.   :-3105.00   Min.   :1.000  
##  1st Qu.:    2.00   1st Qu.:2.000  
##  Median :    4.00   Median :3.000  
##  Mean   :  -19.05   Mean   :2.594  
##  3rd Qu.:    6.00   3rd Qu.:3.000  
##  Max.   :    8.00   Max.   :4.000

Now we see numbers in our summary of the demographic variables. This is what we want, because it means the transformation from character to numeric was successful. It is also important to scan these numbers for any odd patterns. One immediately stands out: the min value for the education variable, -3105, and its mean, -19.05. If you ever see negative values like this in survey data, it likely means there are DK/PNA/No Opinion options coded as negative values. You should investigate and, if necessary, remove them from your data prior to analysis, as we did above.

Let's use the education variable as an example of how to label question values that are currently unlabeled. We start by running the freq command and comparing the results to the provided codebook. The codebook tells us that -3105 = none of the above. That is not a legitimate answer, so we first remove it using the approach from above, making sure to change the value being recoded to NA from 6 to -3105. If you left it at 6, you would accidentally remove all the respondents with a Bachelor's degree from your data. Notice we also give the result a new name, educ, which ensures that if we do make a mistake we still have the original variable to work from.

freq(data$education, plot=F)
## data$education 
##       Frequency  Percent
## -3105         2   0.7519
## 1            16   6.0150
## 2            67  25.1880
## 3            14   5.2632
## 4            45  16.9173
## 5            30  11.2782
## 6            48  18.0451
## 7            20   7.5188
## 8            24   9.0226
## Total       266 100.0000
data <- data %>%
  mutate(
    educ = replace(education, education == -3105, NA))

freq(data$educ, plot=F)
## data$educ 
##       Frequency  Percent Valid Percent
## 1            16   6.0150         6.061
## 2            67  25.1880        25.379
## 3            14   5.2632         5.303
## 4            45  16.9173        17.045
## 5            30  11.2782        11.364
## 6            48  18.0451        18.182
## 7            20   7.5188         7.576
## 8            24   9.0226         9.091
## NA's          2   0.7519              
## Total       266 100.0000       100.000

Now that that value is removed, we look at the codebook for the remaining options. We see that 1 = HS or less, 2 = HS graduate, 3 = post-HS training, etc. What we need to do now is apply those labels to the educ variable.

We can do this in one of two ways, depending on how many variables we are applying the labels to. If we were applying them to many variables, we would first create a vector of values with our desired label names in order from 1 to x and apply it to each variable, as sketched below. Here, since education is a single variable, we can also simply do it in the same line of code.
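A sketch of the first approach, using set_value_labels() from the labelled package (my assumption about which labelling helper you have loaded): define the labels once as a named vector, then apply that vector to every variable sharing the scale. The application line is commented out here because the next chunk labels educ directly.

educ_labels <- c("No HS Degree" = 1, "HS Degree" = 2, "Post-HS VoTec" = 3,
                 "Some College, No Degree" = 4, "Associate" = 5, "BA/BS" = 6,
                 "Masters" = 7, "Doctorate" = 8) #Labels as names, codes as values

#data <- data %>% 
#  set_value_labels(educ = educ_labels) #List each variable that shares this scale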

Here we use the mutate function along with labelled() to assign a label to each individual value. We keep the variable name the same since we still have the original education variable at our disposal in case something goes wrong.

# Assign value labels to the Response variable
data <- data %>%
  mutate(educ = labelled(
    educ,
    labels = c(
      "No HS Degree" = 1,
      "HS Degree" = 2,
      "Post-HS VoTec" = 3,
      "Some College, No Degree" = 4,
      "Associate" = 5,
      "BA/BA" = 6,
      "Masters" = 7,
      "Doctorate" = 8
    )
  ))

freq(data$educ, plot=F)
## data$educ 
##       Frequency  Percent Valid Percent
## 1            16   6.0150         6.061
## 2            67  25.1880        25.379
## 3            14   5.2632         5.303
## 4            45  16.9173        17.045
## 5            30  11.2782        11.364
## 6            48  18.0451        18.182
## 7            20   7.5188         7.576
## 8            24   9.0226         9.091
## NA's          2   0.7519              
## Total       266 100.0000       100.000
val_labels(data$educ)
##            No HS Degree               HS Degree           Post-HS VoTec 
##                       1                       2                       3 
## Some College, No Degree               Associate                   BA/BS 
##                       4                       5                       6 
##                 Masters               Doctorate 
##                       7                       8

Now we have added labels to the individual response options for the education variable. This will be useful for reporting and for data exploration.
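For instance, if you want those text labels to show up directly in a frequency table, you can convert the labelled variable to a factor on the fly. A small sketch using haven's as_factor():

# Convert the labelled values to a factor so the table shows the text labels
freq(as_factor(data$educ), plot=F)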

Collapsing Many Groups into Fewer

One thing you might notice is that there are 8 options for educational level and some of those options have many fewer cases than others. When this happens, especially with demographic variables, it can be appropriate to collapse the groups into logical sub-groupings. For the education variable, it is common to combine the options into either college degree or not (i.e., >= 6, a BA/BS) or into 4 or 5 groups (HS or less, Some college/Associates, BA/BS, Advanced Degree). We will review both of these approaches.

  1. Collapsing into Binary Variable

When you create a binary variable from a variable with many options, provided the order is appropriate, you can simply use the if_else command to create a new variable that = 1 when some condition is met and 0 when it is not. Here, since respondents with a college degree or more begin at 6 and every value below that represents less than a BA/BS, we use college = if_else(educ > 5, 1, 0) to create a new variable called college that = 1 when the respondent has a BA/BS or higher and 0 if not. We also apply labels to the new variable as illustrated previously.

Any time you create a new variable, check it against the original to ensure the recode worked as desired. Here we group_by(educ), the original variable, and count(college), the new variable, to see if the recode worked.

#Collapsing into Binary Variable
data <- data %>% 
  mutate(college = if_else(educ>5, 1, 0))

freq(data$college, plot=F)
## data$college 
##       Frequency  Percent Valid Percent
## 0           172  64.6617         65.15
## 1            92  34.5865         34.85
## NA's          2   0.7519              
## Total       266 100.0000        100.00
data <- data %>%
  mutate(college = labelled(
    college,
    labels = c(
      "No College Degree" = 0,
      "College Degree" = 1
    )
  ))

val_labels(data$college)
## No College Degree    College Degree 
##                 0                 1
data %>%     #Data we are using    
  group_by(educ) %>% #Original Variable 
  count(college)         # New Variable 
## # A tibble: 9 × 3
## # Groups:   educ [9]
##   educ                         college                    n
##   <dbl+lbl>                    <dbl+lbl>              <int>
## 1  1 [No HS Degree]             0 [No College Degree]    16
## 2  2 [HS Degree]                0 [No College Degree]    67
## 3  3 [Post-HS VoTec]            0 [No College Degree]    14
## 4  4 [Some College, No Degree]  0 [No College Degree]    45
## 5  5 [Associate]                0 [No College Degree]    30
## 6  6 [BA/BS]                    1 [College Degree]       48
## 7  7 [Masters]                  1 [College Degree]       20
## 8  8 [Doctorate]                1 [College Degree]       24
## 9 NA                           NA                         2

To check whether the recode worked, we look at the final table. Here we see three columns - educ, college, n - which give us the information we need. We see that for educ values 1 through 5 the college variable = 0. This means that the educational attainment levels under 6, i.e. less than a BA/BS, are coded as 0 in the college variable, and the 1 next to 6, 7, and 8 shows those values recoded correctly as well. We could then proceed to use this variable in analysis.

  2. Collapsing into More than Two Groups

Sometimes, you want to keep more than just 2 response options. This code is slightly more complex than above but still easily doable. Here, we collapse the 8 education options into 3 options - HS or less, Some College, College Degree. We then do the same thing as above to label and double-check that our recode worked as desired.

data <- data %>%
  mutate(educ3 = case_when(
    educ <= 1 ~ 1,
    educ > 2 & educ < 6 ~ 2,
    educ >= 6 ~ 3
  ))

freq(data$educ3, plot=F)
## data$educ3 
##       Frequency Percent Valid Percent
## 1            16   6.015         8.122
## 2            89  33.459        45.178
## 3            92  34.586        46.701
## NA's         69  25.940              
## Total       266 100.000       100.000
data <- data %>%
  mutate(educ3 = labelled(
    educ3,
    labels = c("HS or Less" = 1,
      "Some College" = 2,
      "College Degree or More"= 3)
  ))

val_labels(data$educ3)
##             HS or Less           Some College College Degree or More 
##                      1                      2                      3
data %>%     #Data we are using    
  group_by(educ) %>% #Original Variable 
  count(educ3)
## # A tibble: 9 × 3
## # Groups:   educ [9]
##   educ                         educ3                           n
##   <dbl+lbl>                    <dbl+lbl>                   <int>
## 1  1 [No HS Degree]             1 [HS or Less]                16
## 2  2 [HS Degree]               NA                             67
## 3  3 [Post-HS VoTec]            2 [Some College]              14
## 4  4 [Some College, No Degree]  2 [Some College]              45
## 5  5 [Associate]                2 [Some College]              30
## 6  6 [BA/BS]                    3 [College Degree or More]    48
## 7  7 [Masters]                  3 [College Degree or More]    20
## 8  8 [Doctorate]                3 [College Degree or More]    24
## 9 NA                           NA                              2

When we review our final table to check whether the recode worked, we see an issue. Notice that there is an NA in the educ3 column when educ = 2. This indicates we did not include the educ = 2 option in our recode, so we must fix this before moving forward with this variable. Since option 2 is HS degree, we know that educ3 should = 1 for this option. We review our code and see the issue - educ <= 1 ~ 1 - the <= 1 should be <= 2. We update that and rerun the rest of the code to confirm it is fixed. It is. Yay.

data <- data %>%
  mutate(educ3 = case_when(
    educ <= 2 ~ 1,
    educ > 2 & educ < 6 ~ 2,
    educ >= 6 ~ 3
  ))

freq(data$educ3, plot=F)
## data$educ3 
##       Frequency  Percent Valid Percent
## 1            83  31.2030         31.44
## 2            89  33.4586         33.71
## 3            92  34.5865         34.85
## NA's          2   0.7519              
## Total       266 100.0000        100.00
data <- data %>%
  mutate(educ3 = labelled(
    educ3,
    labels = c("HS or Less" = 1,
      "Some College" = 2,
      "College Degree or More"= 3)
  ))

val_labels(data$educ3)
##             HS or Less           Some College College Degree or More 
##                      1                      2                      3
data %>%     #Data we are using    
  group_by(educ) %>% #Original Variable 
  count(educ3)
## # A tibble: 9 × 3
## # Groups:   educ [9]
##   educ                         educ3                           n
##   <dbl+lbl>                    <dbl+lbl>                   <int>
## 1  1 [No HS Degree]             1 [HS or Less]                16
## 2  2 [HS Degree]                1 [HS or Less]                67
## 3  3 [Post-HS VoTec]            2 [Some College]              14
## 4  4 [Some College, No Degree]  2 [Some College]              45
## 5  5 [Associate]                2 [Some College]              30
## 6  6 [BA/BS]                    3 [College Degree or More]    48
## 7  7 [Masters]                  3 [College Degree or More]    20
## 8  8 [Doctorate]                3 [College Degree or More]    24
## 9 NA                           NA                              2

It can be appropriate to collapse other variables as well, including bipolar ordinal scale questions like approve/disapprove, satisfied/dissatisfied, agree/disagree, etc. Sometimes you are more interested in overall approval than in splitting by intensity of feeling. In these cases, you can collapse the full scale by combining the positive (approve/satisfied/agree) options into one group and the negative options into another group, and, if the scale has a midpoint, keeping it as its own category between the positive and negative groups.
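If the scale does have a midpoint, the same case_when pattern works. Here is a minimal sketch assuming a hypothetical 5-point item called policy_agree, coded 1 (strongly agree) through 5 (strongly disagree) with 3 as the neutral midpoint; this variable is illustrative and not part of the omnibus data:

# Collapse a hypothetical 5-point agree/disagree item into agree / neutral / disagree
data <- data %>%
  mutate(policy_agree_3 = case_when(
    policy_agree <= 2 ~ 1,   # strongly/somewhat agree
    policy_agree == 3 ~ 2,   # neither agree nor disagree (midpoint)
    policy_agree >= 4 ~ 3    # somewhat/strongly disagree
  ))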

We use the stapleton_so_dh variable to illustrate this. It has 4 scale points ranging from Strongly support to Strongly oppose. If I wanted to report the percentage that support the rule, I would need to collapse the strongly and somewhat support options into a new variable that = 1 for support and = 0 if the respondent answered somewhat or strongly oppose.

First run the val_labels command to ensure you know how your original variable is coded. We see that the support options are 1 & 2 while the oppose options are 3 & 4. Note that Not sure/No opinion is coded 6; if that value had not already been recoded to NA, a condition like >= 3 would incorrectly lump it in with the oppose group, so it would need to be handled explicitly. We use this knowledge directly in our code.

We give our variable a new name following the established naming conventions, stapleton_so_dh_2, this time adding _2 to the end to signify that it is a collapsed binary variable. Then we provide value labels as previously and check that the recode worked as desired.

val_labels(data$stapleton_so_dh)
##    Strongly support    Somewhat support     Somewhat oppose     Strongly oppose 
##                   1                   2                   3                   4 
## Not sure/No opinion 
##                   6
data <- data %>%
  mutate(stapleton_so_dh_2 = case_when(
    stapleton_so_dh <= 2 ~ 1,
    stapleton_so_dh >= 3 ~ 0
  ))

freq(data$stapleton_so_dh_2, plot=F)
## data$stapleton_so_dh_2 
##       Frequency Percent Valid Percent
## 0            21   7.895         25.61
## 1            61  22.932         74.39
## NA's        184  69.173              
## Total       266 100.000        100.00
data <- data %>%
  mutate(stapleton_so_dh_2 = labelled(
    stapleton_so_dh_2,
    labels = c("Oppose Rule" = 0,
      "Support Rule" = 1)
  ))

val_labels(data$stapleton_so_dh_2)
##  Oppose Rule Support Rule 
##            0            1
data %>%     #Data we are using    
  group_by(stapleton_so_dh) %>% #Original Variable 
  count(stapleton_so_dh_2)
## # A tibble: 5 × 3
## # Groups:   stapleton_so_dh [5]
##   stapleton_so_dh       stapleton_so_dh_2     n
##   <dbl+lbl>             <dbl+lbl>         <int>
## 1  1 [Strongly support]  1 [Support Rule]    28
## 2  2 [Somewhat support]  1 [Support Rule]    33
## 3  3 [Somewhat oppose]   0 [Oppose Rule]     14
## 4  4 [Strongly oppose]   0 [Oppose Rule]      7
## 5 NA                    NA                  184
data <- data %>%
  mutate(
    stapleton_so_pc_2 = case_when(
      stapleton_so_pc <= 2 ~ 1,
      stapleton_so_pc >= 3 ~ 0),
    stapleton_so_shift_2 = case_when(
      stapleton_so_shift <= 2 ~ 1,
      stapleton_so_shift >= 3 ~ 0),
    stapleton_so_bases_2 = case_when(
      stapleton_so_bases <= 2 ~ 1,
      stapleton_so_bases >= 3 ~ 0) )

Reviewing the new variable, we see that the recode worked as desired: values 1 & 2 from the original variable map to 1 in the new variable and values 3 & 4 map to 0. Because the remaining stapleton_so_ items use the same scale, the code chunk above repeats the recode for stapleton_so_pc, stapleton_so_shift, and stapleton_so_bases in a single mutate call. We can now move forward with these variables in any analysis or reporting required.
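When many items share the same scale, an alternative to writing the recode out for each variable is dplyr's across(). This is a sketch of that approach, not the original code:

# Apply the same collapse to several items at once, creating *_2 versions
data <- data %>%
  mutate(across(
    c(stapleton_so_pc, stapleton_so_shift, stapleton_so_bases),
    ~ case_when(.x <= 2 ~ 1, .x >= 3 ~ 0),
    .names = "{.col}_2"
  ))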

Combining Two Variables into One

The final common data transformation technique we will cover in this tutorial is creating one variable from two existing ones. This is not used as frequently as the previous transformations, but it comes up often enough to cover here. Most commonly, this approach is used with branching questions, where you ask something like approve or disapprove in Question 1 and then follow up with do you strongly or somewhat approve/disapprove in Question 2.
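As a quick sketch of the branching case, assume a hypothetical q1_approve (1 = approve, 2 = disapprove) and a follow-up q2_intensity (1 = strongly, 2 = somewhat); neither variable is in the omnibus data, they are only illustrative:

# Combine a branched approve/disapprove item and its intensity follow-up
# into a single 4-point scale
data <- data %>%
  mutate(approve_4 = case_when(
    q1_approve == 1 & q2_intensity == 1 ~ 1,   # strongly approve
    q1_approve == 1 & q2_intensity == 2 ~ 2,   # somewhat approve
    q1_approve == 2 & q2_intensity == 2 ~ 3,   # somewhat disapprove
    q1_approve == 2 & q2_intensity == 1 ~ 4    # strongly disapprove
  ))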

Here, we will create a new variable that combines college degree or not (the new college variable created above) and biological sex (gender in the Lucid codebook). This gives us a single variable that makes it easy to report the impact of a college degree on some outcome separately for men and women. We will call this new variable college_sex to illustrate what it includes. We need to look at the Lucid codebook to learn that gender = 1 is male and gender = 2 is female. We created the college variable above and know that college degree = 1 and no college degree = 0.

data <- data %>%
  mutate(college_sex = case_when(
    gender == 1 & college==0 ~ 1,
    gender == 1 & college==1 ~ 2,
    gender == 2 & college==0 ~ 3,
    gender == 2 & college==1 ~ 4
  ))

data <- data %>%
  mutate(college_sex = labelled(
    college_sex,
    labels = c("Male, No College Degree" = 1,
      "Male, No College Degree" = 2,
      "Female, No College Degree" = 3,
      "Female, College Degree" = 4)
  ))

freq(data$college_sex, plot=F)
## data$college_sex 
##       Frequency  Percent Valid Percent
## 1            66  24.8120         25.00
## 2            35  13.1579         13.26
## 3           106  39.8496         40.15
## 4            57  21.4286         21.59
## NA's          2   0.7519              
## Total       266 100.0000        100.00
val_labels(data$college_sex)
##   Male, No College Degree      Male, College Degree Female, No College Degree 
##                         1                         2                         3 
##    Female, College Degree 
##                         4
data %>%     #Data we are using    
  group_by(gender, college) %>% #Original Variable 
  count(college_sex)
## # A tibble: 5 × 4
## # Groups:   gender, college [5]
##   gender college                college_sex                        n
##    <dbl> <dbl+lbl>              <dbl+lbl>                      <int>
## 1      1  0 [No College Degree]  1 [Male, No College Degree]      66
## 2      1  1 [College Degree]     2 [Male, College Degree]         35
## 3      2  0 [No College Degree]  3 [Female, No College Degree]   106
## 4      2  1 [College Degree]     4 [Female, College Degree]       57
## 5      2 NA                     NA                                 2

Now, we have combined two variables into one that we can then use in reporting or analysis.

Finally, once the cleaning and recoding are complete, save the cleaned dataset so you can load it directly for analysis without repeating these steps.

# Save the cleaned data to a Stata file
write_dta(data, "data_clean.dta")
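If you would rather keep the cleaned file in SPSS format, haven's write_sav() works the same way; a one-line sketch:

# Alternatively, save the cleaned data as an SPSS .sav file
write_sav(data, "data_clean.sav")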