Once you have finished collecting your survey data, it is time to import the raw data into R for cleaning and preparation. Most survey data that you collect yourself will not immediately be ready for analysis. In this tutorial, we cover importing the data with the haven package and then cleaning and preparing it for analysis.
Once data collection is finished, Qualtrics allows you to export your raw survey data as a variety of file types. We want to export as the SPSS file type, .sav. By doing this instead of Excel or CSV, we keep the structure of the survey data intact. Survey data typically has a number (e.g., 1) and a label (e.g., strongly agree) associated with most variables. Exporting as a file type other than SPSS loses that information, as you can only keep one or the other. By using SPSS as the export file type, we keep all the information about the response that we need.
Once you have exported your raw survey data from Qualtrics, or any other survey software platform, you need to import the data into R. As best practice, you should first move your downloaded data into a new folder appropriate for the project. Either use the Import Dataset dropdown in the Environment panel to browse for the file or update the following code as necessary to point to your specific folder and dataset.
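Before importing, make sure the packages behind the functions used throughout this tutorial are loaded. The original setup chunk is not shown here, so treat the following as a sketch of one plausible set of library() calls; in particular, the freq() output shown below matches the descr package, and the value label functions come from haven and labelled, but confirm against your own setup.
library(haven)      #read_sav() imports SPSS .sav files with their labels intact
library(tidyverse)  #mutate(), filter(), select(), and the %>% pipe
library(janitor)    #clean_names() for consistent variable names
library(skimr)      #skim() overviews of a dataset
library(rstatix)    #get_summary_stats()
library(labelled)   #val_labels() and labelled() for working with value labels
library(descr)      #freq() frequency tables (assumed source of freq() here)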
In this code chunk, I import the raw Qualtrics survey export into R directly from my download folder and name the dataset omni. You should always give your R data simple names that are intuitive to you. Because you will be typing the name of the dataset over and over, making it short saves time and makes you a more efficient coder.
omni <- read_sav("C:/Users/Carey/Downloads/srm695_omnibus_November+27,+2024_11.07.sav")
If the code is successful, you will see the data with the given name, here omni, in the Environment panel in RStudio. We see that omni has 297 observations, meaning there are 297 responses in our dataset that a respondent started but did not necessarily complete, along with 178 individual columns of data. The 178 variables are the individual survey questions, plus metadata about the survey and the respondent, as well as the demographic responses provided by Lucid for the sample.
FYI: Lucid does not provide survey weights for their sample so you must create them yourselves when analyzing your questions on the final project.
The first step to take once the data has been imported is to review it both visually and with summary statistics to get to know the data better. You can open the data for visual inspection using the view(omni) command. This opens the entire dataset for you to review. Below we use the head(omni) function to simply look at the first few rows of the data.
Always download the codebook for the survey at the same time as you download the raw survey data. You will need to refer to the codebook frequently, so save it someplace that is accessible to you.
#view(omni)
head(omni)
## # A tibble: 6 × 178
## StartDate EndDate Status IPAddress Progress
## <dttm> <dttm> <dbl+lbl> <chr> <dbl>
## 1 2024-11-26 13:39:00 2024-11-26 13:39:21 0 [IP Address] 173.49.139.121 100
## 2 2024-11-26 13:39:12 2024-11-26 13:39:25 0 [IP Address] 172.56.222.211 100
## 3 2024-11-26 13:38:59 2024-11-26 13:39:32 0 [IP Address] 174.97.227.99 100
## 4 2024-11-26 13:39:42 2024-11-26 13:39:50 0 [IP Address] 73.99.70.82 100
## 5 2024-11-26 13:39:22 2024-11-26 13:40:29 0 [IP Address] 173.218.31.167 100
## 6 2024-11-26 13:39:00 2024-11-26 13:40:43 0 [IP Address] 73.186.44.253 100
## # ℹ 173 more variables: Duration__in_seconds_ <dbl>, Finished <dbl+lbl>,
## # RecordedDate <dttm>, ResponseId <chr>, RecipientLastName <chr>,
## # RecipientFirstName <chr>, RecipientEmail <chr>, ExternalReference <chr>,
## # LocationLatitude <chr>, LocationLongitude <chr>, DistributionChannel <chr>,
## # UserLanguage <chr>, Q_RecaptchaScore <dbl>, Q_RelevantIDDuplicate <chr>,
## # Q_RelevantIDDuplicateScore <dbl>, Q_RelevantIDFraudScore <dbl>,
## # Q_RelevantIDLastStartDate <dttm>, Q1 <dbl+lbl>, main_branch_1 <dbl+lbl>, …
names(omni) #Reveals the name of every column in data
## [1] "StartDate" "EndDate"
## [3] "Status" "IPAddress"
## [5] "Progress" "Duration__in_seconds_"
## [7] "Finished" "RecordedDate"
## [9] "ResponseId" "RecipientLastName"
## [11] "RecipientFirstName" "RecipientEmail"
## [13] "ExternalReference" "LocationLatitude"
## [15] "LocationLongitude" "DistributionChannel"
## [17] "UserLanguage" "Q_RecaptchaScore"
## [19] "Q_RelevantIDDuplicate" "Q_RelevantIDDuplicateScore"
## [21] "Q_RelevantIDFraudScore" "Q_RelevantIDLastStartDate"
## [23] "Q1" "main_branch_1"
## [25] "main_branch_2" "main_branch_3"
## [27] "main_branch_4" "main_branch_5"
## [29] "main_branch_6" "main_branch_7"
## [31] "main_branch_8" "main_branch_9"
## [33] "main_branch_10" "mohanty_news_1"
## [35] "mohanty_news_2" "mohanty_news_3"
## [37] "mohanty_news_4" "mohanty_news_5"
## [39] "mohanty_news_6" "mohanty_news_6_TEXT"
## [41] "dinsmore_smusage" "dinsmore_comms"
## [43] "dinsmore_comms_7_TEXT" "yifan_livecomm_1"
## [45] "yifan_livecomm_2" "yifan_livecomm_3"
## [47] "yifan_livecomm_4" "shawn_worktype"
## [49] "shawn_worktype_5_TEXT" "bot_check_4"
## [51] "bot_check_5" "bot_check_6"
## [53] "bot_check_7" "bot_check_8"
## [55] "gutierrez_suarez_ai" "krishna_cata_1"
## [57] "krishna_cata_2" "krishna_cata_3"
## [59] "krishna_cata_4" "krishna_cata_5"
## [61] "krishna_cata_6" "krishna_cata_7"
## [63] "krishna_cata_8" "krishna_cata_9"
## [65] "krishna_cata_10" "krishna_cata_11"
## [67] "krishna_cata_10_TEXT" "ai_ad_1"
## [69] "ai_ad_2" "ai_ad_3"
## [71] "krishna_dc" "gutierrez_clear"
## [73] "suarez_ad2_1" "suarez_ad2_2"
## [75] "dinsmore_mh" "balint_stress"
## [77] "gomez_dinsmore_ad_1" "gomez_dinsmore_ad_2"
## [79] "gomez_dinsmore_ad_3" "gomez_dinsmore_ad_4"
## [81] "dinsmore_friendship" "mohanty_emotneeds"
## [83] "SC_approval" "brake_influence"
## [85] "brake_confidence" "brake_ideology"
## [87] "brake_number" "mohanty_disagree"
## [89] "mohanty_famideo" "mohanty_famideo_13_TEXT"
## [91] "mohanty_commideo" "mohanty_commideo_6_TEXT"
## [93] "amalea_source_1" "amalea_source_2"
## [95] "amalea_source_3" "amalea_source_4"
## [97] "amalea_source_5" "amalea_source_6"
## [99] "amalea_source_7" "amalea_source_6_TEXT"
## [101] "amalea_fam" "amalea_stigma"
## [103] "amalea_ad" "God_importance_xu"
## [105] "God_help_xu" "God_answer_xu"
## [107] "gomez_recruit" "gomez_success"
## [109] "balint_struggle" "balint_cost"
## [111] "balint_inc_1" "balint_inc_2"
## [113] "balint_inc_3" "balint_inc_4"
## [115] "balint_inc_5" "balint_inc_6"
## [117] "balint_inc_5_TEXT" "balint_impact"
## [119] "mitchell_threat" "mitchell_ad_1"
## [121] "mitchell_ad_2" "Q1ZO"
## [123] "Q2ZO" "Q5ZO"
## [125] "omar_ad_1" "omar_ad_2"
## [127] "kopecky_sat" "yeumin_rec"
## [129] "yuemin_ad_1" "yuemin_ad_2"
## [131] "yuemin_ad_3" "yuemin_ad_4"
## [133] "kopecky_goals_1" "kopecky_goals_2"
## [135] "kopecky_goals_3" "kopecky_goals_4"
## [137] "kopecky_goals_5" "kopecky_goals_6"
## [139] "kopecky_goals_7" "kopecky_goals_7_TEXT"
## [141] "kopecky_focus" "kopecky_focus_6_TEXT"
## [143] "kopecky_type_1" "kopecky_type_2"
## [145] "kopecky_type_3" "shawn_overall"
## [147] "shawn_ad" "stapleton_so_1"
## [149] "stapleton_so_2" "stapleton_so_3"
## [151] "stapleton_so_4" "Q107"
## [153] "yifan_why_12" "yifan_why_13"
## [155] "yifan_why_14" "yifan_why_15"
## [157] "yifan_why_16" "yifan_why_18"
## [159] "yifan_why_19" "yifan_why_20"
## [161] "yifan_why_21" "yifan_why_22"
## [163] "yifan_why_22_TEXT" "yifan_length"
## [165] "mitchell_cc" "pol_ideo"
## [167] "pol_ideo_8_TEXT" "God_pray_xu"
## [169] "rid" "age"
## [171] "gender" "hhi"
## [173] "ethnicity" "hispanic"
## [175] "education" "region"
## [177] "zip" "political_party"
skim(omni) #Gives high level information about each variable in data
Name | omni |
Number of rows | 297 |
Number of columns | 178 |
_______________________ | |
Column type frequency: | |
character | 33 |
numeric | 141 |
POSIXct | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
IPAddress | 0 | 1 | 9 | 15 | 0 | 297 | 0 |
ResponseId | 0 | 1 | 17 | 17 | 0 | 297 | 0 |
RecipientLastName | 0 | 1 | 0 | 0 | 297 | 1 | 0 |
RecipientFirstName | 0 | 1 | 0 | 0 | 297 | 1 | 0 |
RecipientEmail | 0 | 1 | 0 | 0 | 297 | 1 | 0 |
ExternalReference | 0 | 1 | 0 | 0 | 297 | 1 | 0 |
LocationLatitude | 0 | 1 | 5 | 7 | 0 | 278 | 0 |
LocationLongitude | 0 | 1 | 6 | 9 | 0 | 278 | 0 |
DistributionChannel | 0 | 1 | 9 | 9 | 0 | 1 | 0 |
UserLanguage | 0 | 1 | 2 | 2 | 0 | 1 | 0 |
Q_RelevantIDDuplicate | 0 | 1 | 0 | 0 | 297 | 1 | 0 |
mohanty_news_6_TEXT | 0 | 1 | 0 | 36 | 281 | 17 | 0 |
dinsmore_comms_7_TEXT | 0 | 1 | 0 | 21 | 295 | 3 | 0 |
shawn_worktype_5_TEXT | 0 | 1 | 0 | 17 | 286 | 10 | 0 |
krishna_cata_10_TEXT | 0 | 1 | 0 | 17 | 294 | 4 | 0 |
mohanty_famideo_13_TEXT | 0 | 1 | 0 | 33 | 295 | 3 | 0 |
mohanty_commideo_6_TEXT | 0 | 1 | 0 | 12 | 296 | 2 | 0 |
amalea_source_6_TEXT | 0 | 1 | 0 | 88 | 290 | 8 | 0 |
balint_inc_5_TEXT | 0 | 1 | 0 | 47 | 289 | 9 | 0 |
kopecky_goals_7_TEXT | 0 | 1 | 0 | 62 | 291 | 7 | 0 |
kopecky_focus_6_TEXT | 0 | 1 | 0 | 28 | 282 | 14 | 0 |
yifan_why_22_TEXT | 0 | 1 | 0 | 178 | 276 | 22 | 0 |
pol_ideo_8_TEXT | 0 | 1 | 0 | 11 | 296 | 2 | 0 |
rid | 0 | 1 | 36 | 36 | 0 | 297 | 0 |
age | 0 | 1 | 2 | 2 | 0 | 67 | 0 |
gender | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
hhi | 0 | 1 | 1 | 2 | 0 | 24 | 0 |
ethnicity | 0 | 1 | 1 | 2 | 0 | 11 | 0 |
hispanic | 0 | 1 | 1 | 2 | 0 | 10 | 0 |
education | 0 | 1 | 1 | 5 | 0 | 9 | 0 |
region | 0 | 1 | 1 | 1 | 0 | 4 | 0 |
zip | 0 | 1 | 5 | 5 | 0 | 294 | 0 |
political_party | 0 | 1 | 1 | 1 | 0 | 8 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Status | 0 | 1.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0 | ▁▁▇▁▁ |
Progress | 0 | 1.00 | 100.00 | 0.00 | 100 | 100.00 | 100 | 100.00 | 100 | ▁▁▇▁▁ |
Duration_in_seconds | 0 | 1.00 | 992.61 | 3582.40 | 7 | 395.00 | 592 | 827.00 | 59343 | ▇▁▁▁▁ |
Finished | 0 | 1.00 | 1.00 | 0.00 | 1 | 1.00 | 1 | 1.00 | 1 | ▁▁▇▁▁ |
Q_RecaptchaScore | 0 | 1.00 | 0.90 | 0.18 | 0 | 0.90 | 1 | 1.00 | 1 | ▁▁▁▂▇ |
Q_RelevantIDDuplicateScore | 0 | 1.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0 | ▁▁▇▁▁ |
Q_RelevantIDFraudScore | 0 | 1.00 | 1.31 | 7.44 | 0 | 0.00 | 0 | 0.00 | 105 | ▇▁▁▁▁ |
Q1 | 0 | 1.00 | 1.02 | 0.13 | 1 | 1.00 | 1 | 1.00 | 2 | ▇▁▁▁▁ |
main_branch_1 | 5 | 0.98 | 0.71 | 0.45 | 0 | 0.00 | 1 | 1.00 | 1 | ▃▁▁▁▇ |
main_branch_2 | 5 | 0.98 | 0.46 | 0.50 | 0 | 0.00 | 0 | 1.00 | 1 | ▇▁▁▁▇ |
main_branch_3 | 5 | 0.98 | 0.25 | 0.43 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▂ |
main_branch_4 | 5 | 0.98 | 0.17 | 0.38 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▂ |
main_branch_5 | 5 | 0.98 | 0.47 | 0.50 | 0 | 0.00 | 0 | 1.00 | 1 | ▇▁▁▁▇ |
main_branch_6 | 5 | 0.98 | 0.35 | 0.48 | 0 | 0.00 | 0 | 1.00 | 1 | ▇▁▁▁▅ |
main_branch_7 | 5 | 0.98 | 0.37 | 0.48 | 0 | 0.00 | 0 | 1.00 | 1 | ▇▁▁▁▅ |
main_branch_8 | 5 | 0.98 | 0.82 | 0.39 | 0 | 1.00 | 1 | 1.00 | 1 | ▂▁▁▁▇ |
main_branch_9 | 5 | 0.98 | 0.05 | 0.22 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
main_branch_10 | 5 | 0.98 | 0.03 | 0.17 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
mohanty_news_1 | 5 | 0.98 | 0.77 | 0.42 | 0 | 1.00 | 1 | 1.00 | 1 | ▂▁▁▁▇ |
mohanty_news_2 | 5 | 0.98 | 0.61 | 0.49 | 0 | 0.00 | 1 | 1.00 | 1 | ▅▁▁▁▇ |
mohanty_news_3 | 5 | 0.98 | 0.25 | 0.44 | 0 | 0.00 | 0 | 1.00 | 1 | ▇▁▁▁▃ |
mohanty_news_4 | 5 | 0.98 | 0.30 | 0.46 | 0 | 0.00 | 0 | 1.00 | 1 | ▇▁▁▁▃ |
mohanty_news_5 | 5 | 0.98 | 0.28 | 0.45 | 0 | 0.00 | 0 | 1.00 | 1 | ▇▁▁▁▃ |
mohanty_news_6 | 5 | 0.98 | 0.05 | 0.23 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
dinsmore_smusage | 5 | 0.98 | 2.25 | 1.30 | 1 | 1.00 | 2 | 3.00 | 6 | ▇▂▁▁▁ |
dinsmore_comms | 6 | 0.98 | 2.67 | 1.76 | 1 | 1.00 | 2 | 3.00 | 8 | ▇▂▂▂▁ |
yifan_livecomm_1 | 5 | 0.98 | 0.38 | 0.49 | 0 | 0.00 | 0 | 1.00 | 1 | ▇▁▁▁▅ |
yifan_livecomm_2 | 5 | 0.98 | 0.03 | 0.17 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
yifan_livecomm_3 | 5 | 0.98 | 0.08 | 0.27 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
yifan_livecomm_4 | 5 | 0.98 | 0.52 | 0.50 | 0 | 0.00 | 1 | 1.00 | 1 | ▇▁▁▁▇ |
shawn_worktype | 162 | 0.45 | 2.54 | 1.34 | 1 | 1.00 | 2 | 4.00 | 5 | ▇▆▅▅▂ |
bot_check_4 | 5 | 0.98 | 0.00 | 0.06 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
bot_check_5 | 5 | 0.98 | 0.02 | 0.15 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
bot_check_6 | 5 | 0.98 | 0.96 | 0.20 | 0 | 1.00 | 1 | 1.00 | 1 | ▁▁▁▁▇ |
bot_check_7 | 5 | 0.98 | 0.93 | 0.26 | 0 | 1.00 | 1 | 1.00 | 1 | ▁▁▁▁▇ |
bot_check_8 | 5 | 0.98 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0 | ▁▁▇▁▁ |
gutierrez_suarez_ai | 31 | 0.90 | 3.20 | 1.09 | 1 | 3.00 | 3 | 4.00 | 5 | ▂▃▇▇▂ |
krishna_cata_1 | 31 | 0.90 | 0.41 | 0.49 | 0 | 0.00 | 0 | 1.00 | 1 | ▇▁▁▁▆ |
krishna_cata_2 | 31 | 0.90 | 0.38 | 0.49 | 0 | 0.00 | 0 | 1.00 | 1 | ▇▁▁▁▅ |
krishna_cata_3 | 31 | 0.90 | 0.09 | 0.29 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
krishna_cata_4 | 31 | 0.90 | 0.06 | 0.23 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
krishna_cata_5 | 31 | 0.90 | 0.06 | 0.24 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
krishna_cata_6 | 31 | 0.90 | 0.18 | 0.39 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▂ |
krishna_cata_7 | 31 | 0.90 | 0.11 | 0.32 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
krishna_cata_8 | 31 | 0.90 | 0.04 | 0.19 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
krishna_cata_9 | 31 | 0.90 | 0.06 | 0.25 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
krishna_cata_10 | 31 | 0.90 | 0.01 | 0.11 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
krishna_cata_11 | 31 | 0.90 | 0.38 | 0.49 | 0 | 0.00 | 0 | 1.00 | 1 | ▇▁▁▁▅ |
ai_ad_1 | 32 | 0.89 | 2.06 | 1.07 | 1 | 1.00 | 2 | 3.00 | 5 | ▇▇▃▁▁ |
ai_ad_2 | 31 | 0.90 | 2.80 | 1.25 | 1 | 2.00 | 3 | 4.00 | 5 | ▅▇▇▃▃ |
ai_ad_3 | 31 | 0.90 | 2.24 | 1.08 | 1 | 1.00 | 2 | 3.00 | 5 | ▇▇▇▂▁ |
krishna_dc | 133 | 0.55 | 2.71 | 1.23 | 1 | 2.00 | 3 | 4.00 | 5 | ▅▇▆▆▂ |
gutierrez_clear | 133 | 0.55 | 1.82 | 0.87 | 1 | 1.00 | 2 | 2.00 | 5 | ▇▇▃▁▁ |
suarez_ad2_1 | 169 | 0.43 | 2.91 | 1.42 | 1 | 2.00 | 3 | 4.00 | 5 | ▅▇▅▅▆ |
suarez_ad2_2 | 169 | 0.43 | 2.41 | 1.26 | 1 | 1.00 | 2 | 3.00 | 5 | ▇▇▅▅▂ |
dinsmore_mh | 31 | 0.90 | 2.15 | 0.97 | 1 | 1.00 | 2 | 3.00 | 5 | ▅▇▅▁▁ |
balint_stress | 31 | 0.90 | 2.83 | 1.12 | 1 | 2.00 | 3 | 4.00 | 5 | ▃▇▇▅▂ |
gomez_dinsmore_ad_1 | 31 | 0.90 | 2.45 | 1.19 | 1 | 2.00 | 2 | 3.00 | 5 | ▅▇▃▃▂ |
gomez_dinsmore_ad_2 | 31 | 0.90 | 1.98 | 1.10 | 1 | 1.00 | 2 | 2.00 | 5 | ▇▇▂▂▁ |
gomez_dinsmore_ad_3 | 31 | 0.90 | 2.06 | 1.17 | 1 | 1.00 | 2 | 3.00 | 5 | ▇▆▃▁▁ |
gomez_dinsmore_ad_4 | 31 | 0.90 | 3.29 | 1.34 | 1 | 2.00 | 3 | 5.00 | 5 | ▃▆▇▆▇ |
dinsmore_friendship | 31 | 0.90 | 2.02 | 0.98 | 1 | 1.00 | 2 | 3.00 | 5 | ▇▇▅▁▁ |
mohanty_emotneeds | 31 | 0.90 | 2.50 | 1.21 | 1 | 2.00 | 2 | 3.00 | 5 | ▅▇▅▃▂ |
SC_approval | 31 | 0.90 | 2.49 | 0.93 | 1 | 2.00 | 2 | 3.00 | 4 | ▂▇▁▆▃ |
brake_influence | 31 | 0.90 | 2.83 | 1.10 | 1 | 2.00 | 3 | 3.00 | 5 | ▂▅▇▃▂ |
brake_confidence | 31 | 0.90 | 3.17 | 1.20 | 1 | 2.00 | 3 | 4.00 | 5 | ▃▆▇▇▅ |
brake_ideology | 31 | 0.90 | 2.24 | 0.72 | 1 | 2.00 | 2 | 3.00 | 3 | ▃▁▇▁▇ |
brake_number | 32 | 0.89 | 8.90 | 3.53 | 1 | 8.00 | 9 | 9.00 | 21 | ▂▇▂▁▁ |
mohanty_disagree | 31 | 0.90 | 3.31 | 1.19 | 1 | 2.00 | 4 | 4.00 | 5 | ▂▃▅▇▃ |
mohanty_famideo | 31 | 0.90 | 10.09 | 1.86 | 8 | 9.00 | 10 | 10.00 | 15 | ▇▆▂▁▂ |
mohanty_commideo | 31 | 0.90 | 2.84 | 1.41 | 1 | 2.00 | 3 | 3.00 | 6 | ▇▆▁▃▁ |
amalea_source_1 | 31 | 0.90 | 0.28 | 0.45 | 0 | 0.00 | 0 | 1.00 | 1 | ▇▁▁▁▃ |
amalea_source_2 | 31 | 0.90 | 0.18 | 0.39 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▂ |
amalea_source_3 | 31 | 0.90 | 0.32 | 0.47 | 0 | 0.00 | 0 | 1.00 | 1 | ▇▁▁▁▃ |
amalea_source_4 | 31 | 0.90 | 0.03 | 0.16 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
amalea_source_5 | 31 | 0.90 | 0.28 | 0.45 | 0 | 0.00 | 0 | 1.00 | 1 | ▇▁▁▁▃ |
amalea_source_6 | 31 | 0.90 | 0.03 | 0.17 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
amalea_source_7 | 31 | 0.90 | 0.33 | 0.47 | 0 | 0.00 | 0 | 1.00 | 1 | ▇▁▁▁▃ |
amalea_fam | 31 | 0.90 | 3.52 | 1.35 | 1 | 2.00 | 4 | 5.00 | 5 | ▂▅▅▅▇ |
amalea_stigma | 31 | 0.90 | 3.27 | 1.34 | 1 | 2.00 | 3 | 4.75 | 5 | ▃▆▇▆▇ |
amalea_ad | 31 | 0.90 | 1.82 | 0.95 | 1 | 1.00 | 2 | 2.00 | 5 | ▇▆▂▁▁ |
God_importance_xu | 102 | 0.66 | 1.51 | 0.78 | 1 | 1.00 | 1 | 2.00 | 5 | ▇▃▂▁▁ |
God_help_xu | 102 | 0.66 | 1.90 | 1.18 | 1 | 1.00 | 1 | 3.00 | 6 | ▇▂▁▁▁ |
God_answer_xu | 102 | 0.66 | 2.23 | 1.27 | 1 | 1.00 | 2 | 3.00 | 6 | ▇▂▁▁▁ |
gomez_recruit | 31 | 0.90 | 2.24 | 1.13 | 1 | 1.00 | 2 | 3.00 | 5 | ▇▇▆▁▂ |
gomez_success | 31 | 0.90 | 1.48 | 0.81 | 1 | 1.00 | 1 | 2.00 | 5 | ▇▂▁▁▁ |
balint_struggle | 31 | 0.90 | 3.19 | 1.32 | 1 | 2.00 | 3 | 4.00 | 5 | ▃▇▇▅▇ |
balint_cost | 31 | 0.90 | 2.82 | 1.34 | 1 | 2.00 | 3 | 4.00 | 5 | ▆▇▇▃▅ |
balint_inc_1 | 31 | 0.90 | 0.11 | 0.31 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
balint_inc_2 | 31 | 0.90 | 0.21 | 0.41 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▂ |
balint_inc_3 | 31 | 0.90 | 0.19 | 0.39 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▂ |
balint_inc_4 | 31 | 0.90 | 0.15 | 0.36 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▂ |
balint_inc_5 | 31 | 0.90 | 0.03 | 0.17 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
balint_inc_6 | 31 | 0.90 | 0.55 | 0.50 | 0 | 0.00 | 1 | 1.00 | 1 | ▇▁▁▁▇ |
balint_impact | 31 | 0.90 | 2.85 | 1.23 | 1 | 2.00 | 3 | 4.00 | 5 | ▃▆▇▃▃ |
mitchell_threat | 32 | 0.89 | 1.45 | 0.50 | 1 | 1.00 | 1 | 2.00 | 2 | ▇▁▁▁▆ |
mitchell_ad_1 | 31 | 0.90 | 1.68 | 0.78 | 1 | 1.00 | 2 | 2.00 | 5 | ▇▆▂▁▁ |
mitchell_ad_2 | 31 | 0.90 | 3.11 | 1.21 | 1 | 2.00 | 3 | 4.00 | 5 | ▃▆▇▆▅ |
Q1ZO | 31 | 0.90 | 3.74 | 1.27 | 1 | 3.00 | 4 | 5.00 | 5 | ▁▂▅▅▇ |
Q2ZO | 32 | 0.89 | 2.35 | 0.96 | 1 | 2.00 | 2 | 3.00 | 4 | ▅▇▁▅▃ |
Q5ZO | 33 | 0.89 | 2.97 | 1.27 | 1 | 2.00 | 3 | 4.00 | 5 | ▃▆▇▅▅ |
omar_ad_1 | 31 | 0.90 | 2.62 | 1.18 | 1 | 2.00 | 3 | 3.00 | 5 | ▅▆▇▂▂ |
omar_ad_2 | 32 | 0.89 | 2.52 | 1.13 | 1 | 2.00 | 3 | 3.00 | 5 | ▅▅▇▂▁ |
kopecky_sat | 169 | 0.43 | 1.81 | 0.96 | 1 | 1.00 | 2 | 2.00 | 5 | ▇▆▂▁▁ |
yeumin_rec | 31 | 0.90 | 2.21 | 1.13 | 1 | 1.00 | 2 | 3.00 | 5 | ▇▇▆▂▁ |
yuemin_ad_1 | 31 | 0.90 | 2.19 | 1.01 | 1 | 1.00 | 2 | 3.00 | 5 | ▅▇▃▂▁ |
yuemin_ad_2 | 31 | 0.90 | 2.05 | 1.02 | 1 | 1.00 | 2 | 3.00 | 5 | ▇▇▃▁▁ |
yuemin_ad_3 | 31 | 0.90 | 1.71 | 0.77 | 1 | 1.00 | 2 | 2.00 | 5 | ▇▇▂▁▁ |
yuemin_ad_4 | 31 | 0.90 | 2.41 | 1.14 | 1 | 2.00 | 2 | 3.00 | 5 | ▅▇▃▃▁ |
kopecky_goals_1 | 169 | 0.43 | 0.84 | 0.36 | 0 | 1.00 | 1 | 1.00 | 1 | ▂▁▁▁▇ |
kopecky_goals_2 | 169 | 0.43 | 0.29 | 0.46 | 0 | 0.00 | 0 | 1.00 | 1 | ▇▁▁▁▃ |
kopecky_goals_3 | 169 | 0.43 | 0.32 | 0.47 | 0 | 0.00 | 0 | 1.00 | 1 | ▇▁▁▁▃ |
kopecky_goals_4 | 169 | 0.43 | 0.42 | 0.50 | 0 | 0.00 | 0 | 1.00 | 1 | ▇▁▁▁▆ |
kopecky_goals_5 | 169 | 0.43 | 0.24 | 0.43 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▂ |
kopecky_goals_6 | 169 | 0.43 | 0.45 | 0.50 | 0 | 0.00 | 0 | 1.00 | 1 | ▇▁▁▁▆ |
kopecky_goals_7 | 169 | 0.43 | 0.05 | 0.21 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
kopecky_focus | 169 | 0.43 | 3.21 | 1.58 | 1 | 2.00 | 3 | 4.25 | 6 | ▇▃▂▂▂ |
kopecky_type_1 | 169 | 0.43 | 0.38 | 0.49 | 0 | 0.00 | 0 | 1.00 | 1 | ▇▁▁▁▅ |
kopecky_type_2 | 169 | 0.43 | 0.22 | 0.42 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▂ |
kopecky_type_3 | 169 | 0.43 | 0.52 | 0.50 | 0 | 0.00 | 1 | 1.00 | 1 | ▇▁▁▁▇ |
shawn_overall | 169 | 0.43 | 1.88 | 0.86 | 1 | 1.00 | 2 | 2.00 | 5 | ▇▇▃▁▁ |
shawn_ad | 169 | 0.43 | 2.46 | 1.74 | 1 | 1.00 | 2 | 3.00 | 6 | ▇▁▁▁▂ |
stapleton_so_1 | 202 | 0.32 | 2.55 | 1.63 | 1 | 1.00 | 2 | 3.00 | 6 | ▇▂▁▁▂ |
stapleton_so_2 | 202 | 0.32 | 2.56 | 1.54 | 1 | 1.00 | 2 | 3.00 | 6 | ▇▂▁▁▂ |
stapleton_so_3 | 202 | 0.32 | 3.17 | 1.66 | 1 | 2.00 | 3 | 4.00 | 6 | ▇▅▃▁▃ |
stapleton_so_4 | 202 | 0.32 | 2.99 | 1.60 | 1 | 2.00 | 3 | 4.00 | 6 | ▇▃▂▁▃ |
Q107 | 279 | 0.06 | 1.94 | 0.80 | 1 | 1.25 | 2 | 2.00 | 4 | ▃▇▁▂▁ |
yifan_why_12 | 157 | 0.47 | 0.22 | 0.42 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▂ |
yifan_why_13 | 157 | 0.47 | 0.34 | 0.48 | 0 | 0.00 | 0 | 1.00 | 1 | ▇▁▁▁▅ |
yifan_why_14 | 157 | 0.47 | 0.40 | 0.49 | 0 | 0.00 | 0 | 1.00 | 1 | ▇▁▁▁▅ |
yifan_why_15 | 157 | 0.47 | 0.06 | 0.23 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
yifan_why_16 | 157 | 0.47 | 0.14 | 0.35 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
yifan_why_18 | 157 | 0.47 | 0.10 | 0.30 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
yifan_why_19 | 157 | 0.47 | 0.07 | 0.26 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
yifan_why_20 | 157 | 0.47 | 0.06 | 0.23 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
yifan_why_21 | 157 | 0.47 | 0.11 | 0.32 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▁ |
yifan_why_22 | 157 | 0.47 | 0.16 | 0.37 | 0 | 0.00 | 0 | 0.00 | 1 | ▇▁▁▁▂ |
yifan_length | 32 | 0.89 | 2.87 | 1.20 | 1 | 2.00 | 3 | 4.00 | 4 | ▃▃▁▃▇ |
mitchell_cc | 31 | 0.90 | 2.90 | 1.26 | 1 | 2.00 | 3 | 4.00 | 5 | ▃▇▇▃▅ |
pol_ideo | 31 | 0.90 | 4.33 | 1.98 | 1 | 3.00 | 4 | 6.00 | 9 | ▅▇▂▅▁ |
God_pray_xu | 31 | 0.90 | 2.28 | 1.59 | 1 | 1.00 | 1 | 4.00 | 5 | ▇▂▂▁▃ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
StartDate | 0 | 1 | 2024-11-26 13:38:32 | 2024-11-26 14:02:00 | 2024-11-26 13:43:11 | 211 |
EndDate | 0 | 1 | 2024-11-26 13:39:21 | 2024-11-27 06:08:32 | 2024-11-26 13:54:23 | 277 |
RecordedDate | 0 | 1 | 2024-11-26 13:39:22 | 2024-11-27 06:08:34 | 2024-11-26 13:54:25 | 279 |
Q_RelevantIDLastStartDate | 0 | 1 | 1582-10-14 00:00:00 | 1582-10-14 00:00:00 | 1582-10-14 00:00:00 | 1 |
When we look at the data, we see a wide variety of background data (metadata) such as the duration of the survey, the Qualtrics ID information, the IP address, etc., as well as the survey questions and responses themselves. On close inspection, we see that for the initial branching question main_branch_1 we have 0's, 1's, and NAs. The NAs mean the respondent never saw the question, which tells us as the researcher that it is an incomplete response. We need to investigate the 297 responses to decide which ones finished the survey, plus who failed the bot detector question, so that we only analyze legitimate responses.
For surveys with a lot of questions like these, we can also review a few specific variables at a time if desired.
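For example, you might peek at just the branching questions; a quick sketch (we cover selecting specific variables in more detail later):
omni %>%
  select(starts_with("main_branch_")) %>%
  head() #First few rows of only the branching questions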
The first thing to do with the raw data is to run our column names through a data cleaning package called janitor. Oftentimes data exports with odd characters or inconsistent capitalization. By using a package like janitor we can ensure consistency in our variable names, making our life as coders easier.
omni <- clean_names(omni)
names(omni)
## [1] "start_date" "end_date"
## [3] "status" "ip_address"
## [5] "progress" "duration_in_seconds"
## [7] "finished" "recorded_date"
## [9] "response_id" "recipient_last_name"
## [11] "recipient_first_name" "recipient_email"
## [13] "external_reference" "location_latitude"
## [15] "location_longitude" "distribution_channel"
## [17] "user_language" "q_recaptcha_score"
## [19] "q_relevant_id_duplicate" "q_relevant_id_duplicate_score"
## [21] "q_relevant_id_fraud_score" "q_relevant_id_last_start_date"
## [23] "q1" "main_branch_1"
## [25] "main_branch_2" "main_branch_3"
## [27] "main_branch_4" "main_branch_5"
## [29] "main_branch_6" "main_branch_7"
## [31] "main_branch_8" "main_branch_9"
## [33] "main_branch_10" "mohanty_news_1"
## [35] "mohanty_news_2" "mohanty_news_3"
## [37] "mohanty_news_4" "mohanty_news_5"
## [39] "mohanty_news_6" "mohanty_news_6_text"
## [41] "dinsmore_smusage" "dinsmore_comms"
## [43] "dinsmore_comms_7_text" "yifan_livecomm_1"
## [45] "yifan_livecomm_2" "yifan_livecomm_3"
## [47] "yifan_livecomm_4" "shawn_worktype"
## [49] "shawn_worktype_5_text" "bot_check_4"
## [51] "bot_check_5" "bot_check_6"
## [53] "bot_check_7" "bot_check_8"
## [55] "gutierrez_suarez_ai" "krishna_cata_1"
## [57] "krishna_cata_2" "krishna_cata_3"
## [59] "krishna_cata_4" "krishna_cata_5"
## [61] "krishna_cata_6" "krishna_cata_7"
## [63] "krishna_cata_8" "krishna_cata_9"
## [65] "krishna_cata_10" "krishna_cata_11"
## [67] "krishna_cata_10_text" "ai_ad_1"
## [69] "ai_ad_2" "ai_ad_3"
## [71] "krishna_dc" "gutierrez_clear"
## [73] "suarez_ad2_1" "suarez_ad2_2"
## [75] "dinsmore_mh" "balint_stress"
## [77] "gomez_dinsmore_ad_1" "gomez_dinsmore_ad_2"
## [79] "gomez_dinsmore_ad_3" "gomez_dinsmore_ad_4"
## [81] "dinsmore_friendship" "mohanty_emotneeds"
## [83] "sc_approval" "brake_influence"
## [85] "brake_confidence" "brake_ideology"
## [87] "brake_number" "mohanty_disagree"
## [89] "mohanty_famideo" "mohanty_famideo_13_text"
## [91] "mohanty_commideo" "mohanty_commideo_6_text"
## [93] "amalea_source_1" "amalea_source_2"
## [95] "amalea_source_3" "amalea_source_4"
## [97] "amalea_source_5" "amalea_source_6"
## [99] "amalea_source_7" "amalea_source_6_text"
## [101] "amalea_fam" "amalea_stigma"
## [103] "amalea_ad" "god_importance_xu"
## [105] "god_help_xu" "god_answer_xu"
## [107] "gomez_recruit" "gomez_success"
## [109] "balint_struggle" "balint_cost"
## [111] "balint_inc_1" "balint_inc_2"
## [113] "balint_inc_3" "balint_inc_4"
## [115] "balint_inc_5" "balint_inc_6"
## [117] "balint_inc_5_text" "balint_impact"
## [119] "mitchell_threat" "mitchell_ad_1"
## [121] "mitchell_ad_2" "q1zo"
## [123] "q2zo" "q5zo"
## [125] "omar_ad_1" "omar_ad_2"
## [127] "kopecky_sat" "yeumin_rec"
## [129] "yuemin_ad_1" "yuemin_ad_2"
## [131] "yuemin_ad_3" "yuemin_ad_4"
## [133] "kopecky_goals_1" "kopecky_goals_2"
## [135] "kopecky_goals_3" "kopecky_goals_4"
## [137] "kopecky_goals_5" "kopecky_goals_6"
## [139] "kopecky_goals_7" "kopecky_goals_7_text"
## [141] "kopecky_focus" "kopecky_focus_6_text"
## [143] "kopecky_type_1" "kopecky_type_2"
## [145] "kopecky_type_3" "shawn_overall"
## [147] "shawn_ad" "stapleton_so_1"
## [149] "stapleton_so_2" "stapleton_so_3"
## [151] "stapleton_so_4" "q107"
## [153] "yifan_why_12" "yifan_why_13"
## [155] "yifan_why_14" "yifan_why_15"
## [157] "yifan_why_16" "yifan_why_18"
## [159] "yifan_why_19" "yifan_why_20"
## [161] "yifan_why_21" "yifan_why_22"
## [163] "yifan_why_22_text" "yifan_length"
## [165] "mitchell_cc" "pol_ideo"
## [167] "pol_ideo_8_text" "god_pray_xu"
## [169] "rid" "age"
## [171] "gender" "hhi"
## [173] "ethnicity" "hispanic"
## [175] "education" "region"
## [177] "zip" "political_party"
Since our variable names are generally consistent already, you will not see a huge difference from this command. The one thing to note is that the columns that started with an uppercase letter are now lowercase, which is best coding practice.
We start by looking at the first question in the survey
q1
which is the informed consent question. If a respondent
did not consent, they immediately left the survey. This group does not
need to be in our final data so let’s remove them first.
We start with the freq(omni$q1)
function along with
val_labels(omni$q1)
to see how many of each response we
have in our data.
freq(omni$q1, plot=F)
## Thank you for agreeing to participate in this survey conducted by students at the University of Massachusetts. Your participation in this study is completely voluntary, and you can withdraw at any time with no penalty. The survey should take around 10 minutes to complete. You will be asked questions about a wide variety of topics.
##
## By selecting Yes below, you consent to taking part in this research study. By selecting No below, you do not consent to take part in this study.
## Frequency Percent
## 1 292 98.316
## 2 5 1.684
## Total 297 100.000
val_labels(omni$q1)
## Yes No
## 1 2
Five respondents did not consent to take the survey but are still included in the raw data. We delete them using the filter function from tidyverse.
We also save a new dataset to keep the original intact and untransformed. We'll call the new data simply data.
data <- omni %>%
  filter(q1 == 1)
Now we have a new dataframe in R with 292 observations, five fewer than the original 297. The bot detection question asked respondents to select options 3 and 4. Bots are not good at following instructions like this, so they tend to fail this question routinely. We need to remove the responses that failed this check.
From the codebook, we see that the bot detector questions all start with bot_check_ followed by a number between 4 and 8. The odd numbering is a quirk of Qualtrics that I've never been able to figure out. The codebook tells us that option 3 is bot_check_6 while option 4 is bot_check_7. To continue in the survey, respondents had to select both of those options and nothing else. Let's run a frequency on each option to see how many were selected independently.
freq(data$bot_check_6, plot=F)
## For this survey question, we are interested in how many people read the question closely so please select options 3 and 4 below. We thank you for being engaged with the survey. 3
## Frequency Percent
## 0 12 4.11
## 1 280 95.89
## Total 292 100.00
freq(data$bot_check_7, plot=F)
## For this survey question, we are interested in how many people read the question closely so please select options 3 and 4 below. We thank you for being engaged with the survey. 4
## Frequency Percent
## 0 21 7.192
## 1 271 92.808
## Total 292 100.000
A vast majority of respondents selected the correct options: 96% selected option 3 and 93% selected option 4. But we need to remove the responses that did not select both.
We do this by first creating a new variable that simply adds the two bot_check variables together and then removing any response whose sum does not equal 2. This works because the sum of the two bot_check questions will only equal 2 if the respondent correctly selected both options.
To do this, we use the mutate function from tidyverse to sum the correct responses, then we use the filter command to remove any cases that do not equal 2. We save the result back into data since this is our transformed dataset. Remember, we keep the original untransformed just in case.
data <- data %>%
  mutate(correct = bot_check_6 + bot_check_7) %>%
  filter(correct == 2)
freq(data$finished, plot=F)
## Finished
## Frequency Percent
## 1 266 100
## Total 266 100
Now we see that we have 266 observations, so we dropped another 26 respondents who failed the bot detector question. Raw survey data oftentimes contains responses that only partially finished the survey. These cases should be removed and should not be included in any analysis or reporting. Since our data was exported before the incompletes were included, we do not have any additional cases to filter out. To confirm this, we run the freq command on the finished variable, which is a Qualtrics-provided variable that equals 1 if the respondent finished the survey and 0 if they did not. If you have any 0's here, you should filter them out using the filter approach from the above code chunk. Since we only have finished responses, we will not take any additional steps with our class data.
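If your own export does include unfinished responses, the removal follows the same filter pattern as above; a minimal sketch:
data <- data %>%
  filter(finished == 1) #Keep only respondents Qualtrics marks as having finished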
This means that our final survey data has 266 completed responses for us to analyze. Not every variable will have 266 responses though. Sometimes a few respondents will simply not answer the question while other questions are only asked of a subset of respondents who answered a previous question in a specific way. It is important for you to understand why you have missing data. Is it because the respondent skipped answering it or because it was designed for that respondent not to answer?
Raw survey data is rarely ready for analysis without transforming it in some way. In this section, we will review a variety of common data cleaning and recoding techniques used in survey research, including handling missing values (NAs), converting variable types, applying value labels, and collapsing response options into smaller groups.
Any time a survey respondent does not see a question (i.e., they did not qualify to see it because of branching logic) or skips a question when asked (i.e., they saw the question and simply refused to answer it), the data is imported into R as NA. Previously, we talked about how you can decide whether or not to include these when reporting results to the public. For analysis purposes, however, it is important that we deal with them in order to get certain analyses to work.
Luckily, there are numerous ways to identify NAs in your results.
Above, we used the skim
function to look at the entire
dataset. While that is useful, it can be cumbersome with a large survey
like we have here.
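As a quick supplement to skim, you can also count missing values directly with base R; a small sketch using a few of our variables:
colSums(is.na(data[, c("main_branch_6", "stapleton_so_1", "stapleton_so_2")])) #NA count per column
sum(is.na(data$stapleton_so_1)) #NA count for a single variable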
Using these variables (stapleton_so_1 through stapleton_so_4 and main_branch_6) as an example, you can think about your own variables and replace these names with the appropriate names for your variables.
First, let's review the summary information for these specific variables. There are a number of ways to do this in R. Here we use the get_summary_stats function from the rstatix package. This function gives you a broad statistical overview of your variables.
There are two approaches to using this function in the following code chunk. In the first approach, you manually input each variable name of interest. In the second approach, you take advantage of consistent variable naming conventions and summarize all variables that start with a specific string; here we use stapleton_so_, since all of these variables start with that except for the branching question, which we simply add to the code manually. This approach is efficient but only works if the naming conventions are consistent for your variables, so ensure they are before employing this approach.
We also use the skimr package for a slightly different look at the data. This package is very helpful for understanding missing data in each variable. To use it on specific variables, we first use the select function to specify the exact variable names we want and then use skim() to run the function. Both give broadly the same information, although I like skim for its clearer readout of missing data in a variable.
#1
data %>% #Manually enter the variable names
  get_summary_stats(
    main_branch_6, stapleton_so_1, stapleton_so_2, # columns to calculate for
    stapleton_so_3, stapleton_so_4, type = "common")
## Warning: attributes are not identical across measure variables; they will be
## dropped
## # A tibble: 5 × 10
## variable n min max median iqr mean sd se ci
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 main_branch_6 266 0 1 0 1 0.357 0.48 0.029 0.058
## 2 stapleton_so_1 95 1 6 2 2 2.55 1.63 0.167 0.332
## 3 stapleton_so_2 95 1 6 2 2 2.56 1.54 0.158 0.314
## 4 stapleton_so_3 95 1 6 3 2 3.17 1.66 0.17 0.338
## 5 stapleton_so_4 95 1 6 3 2 2.99 1.60 0.164 0.326
#2
data %>% #Review many at once depending on how the name starts
  select(starts_with("stapleton_so_"), main_branch_6) %>%
  get_summary_stats(type = "common")
## Warning: attributes are not identical across measure variables; they will be
## dropped
## # A tibble: 5 × 10
## variable n min max median iqr mean sd se ci
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 stapleton_so_1 95 1 6 2 2 2.55 1.63 0.167 0.332
## 2 stapleton_so_2 95 1 6 2 2 2.56 1.54 0.158 0.314
## 3 stapleton_so_3 95 1 6 3 2 3.17 1.66 0.17 0.338
## 4 stapleton_so_4 95 1 6 3 2 2.99 1.60 0.164 0.326
## 5 main_branch_6 266 0 1 0 1 0.357 0.48 0.029 0.058
#Skimr Package Approach
#data %>%
# select(main_branch_6, stapleton_so_1, stapleton_so_2, stapleton_so_3, stapleton_so_4) %>% skim()
data %>%
  select(starts_with("stapleton_so_"), main_branch_6) %>%
  skim()
Name | Piped data |
Number of rows | 266 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
stapleton_so_1 | 171 | 0.36 | 2.55 | 1.63 | 1 | 1 | 2 | 3 | 6 | ▇▂▁▁▂ |
stapleton_so_2 | 171 | 0.36 | 2.56 | 1.54 | 1 | 1 | 2 | 3 | 6 | ▇▂▁▁▂ |
stapleton_so_3 | 171 | 0.36 | 3.17 | 1.66 | 1 | 2 | 3 | 4 | 6 | ▇▅▃▁▃ |
stapleton_so_4 | 171 | 0.36 | 2.99 | 1.60 | 1 | 2 | 3 | 4 | 6 | ▇▃▂▁▃ |
main_branch_6 | 0 | 1.00 | 0.36 | 0.48 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▅ |
With this code, we see the min, max, median, mean and variation
metrics of our specified variables. The two functions return broadly the
same information but with some slight differences. The
get_summary_stats
code tells us the number of responses for
each variable as well as min and max. From here, we see the branching
question, main_branch_6
, has 266 responses, which is the
full sample.
However, we see only 95 responses for the other four questions, meaning a majority of the data for those questions is missing. We can look at the skim results to see that we in fact have 171 system-missing responses and only a 35.7% completion rate. If I were expecting the entire sample to have answered these questions, a 35.7% completion rate would be a horrifying result, as it would mean something went wrong either in the branching logic or in how the questions displayed to respondents. Because I created the survey, I know that these questions were only asked of a subset of the full sample, those who said they liked MLB, so that volume of missing data does not concern me. To verify, we look at the mean of the branching question, i.e., the percentage of respondents who answered yes, they like baseball, which is 35.7%. The exact same percentage, so there are no issues with these data.
Now, I know these variables have system-missing responses, so I will need to deal with that once we start analyzing the data, but no additional action is necessary right now.
There is one concerning issue in the previous results which requires more investigation. The max value for the first four questions is 6. However, there are only 4 main responses to these questions (strongly support, somewhat support, somewhat oppose, strongly oppose), so the fact that there is a value of 6 indicates another response option is present in the data.
To investigate more closely, we will use the freq
and
val_labels
commands to understand what the max option of 6
actually is.
val_labels(data$stapleton_so_1)
## Strongly support Somewhat support Somewhat oppose Strongly oppose
## 1 2 3 4
## Not sure/No opinion
## 6
freq(data$stapleton_so_1, plot=F)
## Recently, Major League Baseball implemented various rule changes about how the game is played. This includes...
##
## The "universal DH" rule, which allowed National League teams to employ a Designated Hitter in each game to match the rules in the American League,
## A pitch clock which forces pitchers to throw a pitch within a specified amount of time,
## Banning the use of the "shift" strategy which saw infielders being moved all around the infield, and
## Making the bases larger to make it easier to steal bases.
##
##
##
## For each of the following new MLB rules, how much do you support or oppose it? - Universal Designated Hitter (DH)
## Frequency Percent Valid Percent
## 1 28 10.526 29.474
## 2 33 12.406 34.737
## 3 14 5.263 14.737
## 4 7 2.632 7.368
## 6 13 4.887 13.684
## NA's 171 64.286
## Total 266 100.000 100.000
freq(data$stapleton_so_2, plot=F)
## Recently, Major League Baseball implemented various rule changes about how the game is played. This includes...
##
## The "universal DH" rule, which allowed National League teams to employ a Designated Hitter in each game to match the rules in the American League,
## A pitch clock which forces pitchers to throw a pitch within a specified amount of time,
## Banning the use of the "shift" strategy which saw infielders being moved all around the infield, and
## Making the bases larger to make it easier to steal bases.
##
##
##
## For each of the following new MLB rules, how much do you support or oppose it? - Pitch Clock
## Frequency Percent Valid Percent
## 1 25 9.398 26.316
## 2 34 12.782 35.789
## 3 16 6.015 16.842
## 4 9 3.383 9.474
## 6 11 4.135 11.579
## NA's 171 64.286
## Total 266 100.000 100.000
freq(data$stapleton_so_3, plot=F)
## Recently, Major League Baseball implemented various rule changes about how the game is played. This includes...
##
## The "universal DH" rule, which allowed National League teams to employ a Designated Hitter in each game to match the rules in the American League,
## A pitch clock which forces pitchers to throw a pitch within a specified amount of time,
## Banning the use of the "shift" strategy which saw infielders being moved all around the infield, and
## Making the bases larger to make it easier to steal bases.
##
##
##
## For each of the following new MLB rules, how much do you support or oppose it? - Infielder Shift Ban
## Frequency Percent Valid Percent
## 1 15 5.639 15.79
## 2 24 9.023 25.26
## 3 22 8.271 23.16
## 4 16 6.015 16.84
## 6 18 6.767 18.95
## NA's 171 64.286
## Total 266 100.000 100.00
freq(data$stapleton_so_4, plot=F)
## Recently, Major League Baseball implemented various rule changes about how the game is played. This includes...
##
## The "universal DH" rule, which allowed National League teams to employ a Designated Hitter in each game to match the rules in the American League,
## A pitch clock which forces pitchers to throw a pitch within a specified amount of time,
## Banning the use of the "shift" strategy which saw infielders being moved all around the infield, and
## Making the bases larger to make it easier to steal bases.
##
##
##
## For each of the following new MLB rules, how much do you support or oppose it? - Larger bases
## Frequency Percent Valid Percent
## 1 16 6.015 16.84
## 2 28 10.526 29.47
## 3 22 8.271 23.16
## 4 14 5.263 14.74
## 6 15 5.639 15.79
## NA's 171 64.286
## Total 266 100.000 100.00
With the val_labels command, we see that strongly support = 1 and strongly oppose = 4, while not sure/no opinion = 6. This means that, right now, if we were to use various analytic techniques on this variable, we would be telling R that the don't knows are legitimate substantive responses that should be analyzed. This would lead to error and bias in your results, so they should be explicitly removed from the variable prior to data exploration and analysis. This does not mean they should be removed for general reporting to the public; that is a separate part of being a survey researcher.
Now that we know that option 6 is don't know/no opinion, we first look at its frequency for each question. We have a range from 11 to 19%. Next, we need to remove these non-substantive responses from the data. We never remove them from the original variable, so anytime you make transformations to a variable, create a new one.
We use the mutate function to recode the value 6 into NA for the 4 original stapleton_so_ variables, then create new variables with names that signify what each variable is (e.g., stapleton_so_dh). This keeps the original variable's integrity intact while giving us a new variable that we can use in analysis and data exploration.
data <- data %>%
  mutate(
    stapleton_so_dh = replace(stapleton_so_1, stapleton_so_1 == 6, NA), #Create new variable called 'stapleton_so_dh' which equals the original variable 'stapleton_so_1' but replace all 6 values (don't know/no opinion) as NA
    stapleton_so_pc = replace(stapleton_so_2, stapleton_so_2 == 6, NA),
    stapleton_so_shift = replace(stapleton_so_3, stapleton_so_3 == 6, NA),
    stapleton_so_bases = replace(stapleton_so_4, stapleton_so_4 == 6, NA)
  )
Always review your newly created variables to ensure the
recoding/transformation worked as expected. Here, we use the
get_summary_stats
command to summarize all the
stapleton_so_
variables easily. This is the reason why I
kept the naming convention the same.
data %>% #Review many at once depending on how the name starts
  select(starts_with("stapleton_so_")) %>%
  get_summary_stats(type = "common")
## Warning: attributes are not identical across measure variables; they will be
## dropped
## # A tibble: 8 × 10
## variable n min max median iqr mean sd se ci
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 stapleton_so_1 95 1 6 2 2 2.55 1.63 0.167 0.332
## 2 stapleton_so_2 95 1 6 2 2 2.56 1.54 0.158 0.314
## 3 stapleton_so_3 95 1 6 3 2 3.17 1.66 0.17 0.338
## 4 stapleton_so_4 95 1 6 3 2 2.99 1.60 0.164 0.326
## 5 stapleton_so_dh 82 1 4 2 1.75 2 0.93 0.103 0.204
## 6 stapleton_so_pc 84 1 4 2 2 2.11 0.957 0.104 0.208
## 7 stapleton_so_shift 77 1 4 2 1 2.51 1.03 0.118 0.235
## 8 stapleton_so_bases 80 1 4 2 1 2.42 1.00 0.112 0.223
Here we see all 8 variables that start with stapleton_so_, including the four new variables we created. Notice how the n counts are slightly lower in the four new variables compared to their originals. This means we successfully removed the Don't Know options. We can also tell this from the max for the newly created variables, which is now 4.
We can also use the freq
function to quickly look at the
individual values for the newly created variables.
freq(data$stapleton_so_dh, plot=F)
## Recently, Major League Baseball implemented various rule changes about how the game is played. This includes...
##
## The "universal DH" rule, which allowed National League teams to employ a Designated Hitter in each game to match the rules in the American League,
## A pitch clock which forces pitchers to throw a pitch within a specified amount of time,
## Banning the use of the "shift" strategy which saw infielders being moved all around the infield, and
## Making the bases larger to make it easier to steal bases.
##
##
##
## For each of the following new MLB rules, how much do you support or oppose it? - Universal Designated Hitter (DH)
## Frequency Percent Valid Percent
## 1 28 10.526 34.146
## 2 33 12.406 40.244
## 3 14 5.263 17.073
## 4 7 2.632 8.537
## NA's 184 69.173
## Total 266 100.000 100.000
val_labels(data$stapleton_so_dh)
## Strongly support Somewhat support Somewhat oppose Strongly oppose
## 1 2 3 4
## Not sure/No opinion
## 6
Once again, we see that the value 6 is no longer in these data, showing us we successfully removed those responses. We also use the val_labels function to reveal that the new variable keeps the original labels, so we do not need to worry about re-labeling it here.
Not all survey data will import with labels already applied to the response options. For us, the only variables like this are the Lucid-provided demographic results. That is, Lucid just gives us the number for the demographic questions while providing a codebook that tells us what the numbers mean. For reporting purposes, we always need to provide the actual labels rather than the numbers for these options. This is also frequently done when analyzing the data so the researcher does not have to constantly go back and forth to the codebook.
The demographic variables we have available are:
- hhi (household income measured with 24 categories)
- age (measured in individual years from 18 to 99)
- gender (actually biological sex, but that is how Lucid names it)
- ethnicity (actually race, broken into 10 groups)
- hispanic (whether the respondent is of Hispanic background, with 10 unique options)
- education (with 8 options)
- region (region of the United States the respondent lives in, with 4 options)
Start by running the freq & val_labels commands on the hhi variable.
freq(data$hhi, plot=F)
## hhi
## Frequency Percent
## 1 45 16.9173
## 10 6 2.2556
## 11 12 4.5113
## 12 9 3.3835
## 13 8 3.0075
## 14 10 3.7594
## 15 2 0.7519
## 16 6 2.2556
## 17 2 0.7519
## 18 1 0.3759
## 19 10 3.7594
## 2 11 4.1353
## 20 7 2.6316
## 21 9 3.3835
## 22 5 1.8797
## 23 2 0.7519
## 24 3 1.1278
## 3 24 9.0226
## 4 20 7.5188
## 5 15 5.6391
## 6 13 4.8872
## 7 10 3.7594
## 8 14 5.2632
## 9 22 8.2707
## Total 266 100.0000
val_labels(data$hhi)
## NULL
We see that hhi has 24 options. Notice how the val_labels command returned NULL. This means that the individual values are not labeled in our dataset, so we need to refer to the provided codebook to see what each one is. However, by running the freq command, we see the first option is 1 followed by 10, 11, and so on. Since the results are not in numeric order, this suggests that R is not treating this variable as a number but rather as a character.
We run the summary function on the demographic variables to understand how they imported. The results here are annoying because the class is character, which means that right now R thinks the numbers in each of these variables are strings rather than actual numbers. The first thing we must do here is force these variables into being numbers.
data %>%
  select(hhi, age, gender, ethnicity, hispanic, education, region) %>%
  summary()
## hhi age gender ethnicity
## Length:266 Length:266 Length:266 Length:266
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## hispanic education region
## Length:266 Length:266 Length:266
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
We use the following tidyverse code with as.numeric to change all the demographic variables to numbers. This change gives us a lot more flexibility in how we use the data and puts the values into the appropriate order rather than going 1, 10, 11, 12, 2, 20, 21, etc. Here we do not need to create new variables since we are simply making them numbers rather than transforming them in some tangible way.
data <- data %>%
  mutate(across(c(hhi, age, gender, ethnicity, hispanic, education, region), as.numeric))
Run the same code as above to double-check that we now have numbers instead of characters for our demographic variables.
data %>%
  select(hhi, age, gender, ethnicity, hispanic, education, region) %>%
  summary()
## hhi age gender ethnicity hispanic
## Min. : 1.00 Min. :18.00 Min. :1.00 Min. : 1.00 Min. : 1.00
## 1st Qu.: 3.00 1st Qu.:35.00 1st Qu.:1.00 1st Qu.: 1.00 1st Qu.: 1.00
## Median : 7.00 Median :51.00 Median :2.00 Median : 1.00 Median : 1.00
## Mean : 8.35 Mean :50.98 Mean :1.62 Mean : 1.94 Mean : 1.82
## 3rd Qu.:12.00 3rd Qu.:66.75 3rd Qu.:2.00 3rd Qu.: 2.00 3rd Qu.: 1.00
## Max. :24.00 Max. :99.00 Max. :2.00 Max. :16.00 Max. :16.00
## education region
## Min. :-3105.00 Min. :1.000
## 1st Qu.: 2.00 1st Qu.:2.000
## Median : 4.00 Median :3.000
## Mean : -19.05 Mean :2.594
## 3rd Qu.: 6.00 3rd Qu.:3.000
## Max. : 8.00 Max. :4.000
Now we see numbers in our summary of the demographic variables. This is what we want to see because it means the transformation from character to number was successful. It is important to look at these numbers as well, to check for any odd patterns. One immediately stands out to me: the min value for the education variable, -3105, and its mean value, -19.05. If you ever see negative values like this in survey data, it likely means that there are DK/PNA/NO options coded as negative values. You should investigate and, if necessary, remove these from your data prior to analysis like we did above.
Let's use the education variable as an example of how to label questions that were previously unlabeled. We start by running the freq command and compare the results to our provided codebook. We see from the codebook that -3105 = none of the above. That is not a legitimate answer, so we will first remove it using the approach from above, making sure to change the value we recode to NA from 6 to -3105. If you keep it at 6, you would accidentally be removing all the respondents with a Bachelor's degree from your data. Notice we also give it a new name, educ, which ensures that if we do make a mistake, we still have the original variable to work from.
freq(data$education, plot=F)
## data$education
## Frequency Percent
## -3105 2 0.7519
## 1 16 6.0150
## 2 67 25.1880
## 3 14 5.2632
## 4 45 16.9173
## 5 30 11.2782
## 6 48 18.0451
## 7 20 7.5188
## 8 24 9.0226
## Total 266 100.0000
data <- data %>%
  mutate(educ = replace(education, education == -3105, NA))
freq(data$educ, plot=F)
## data$educ
## Frequency Percent Valid Percent
## 1 16 6.0150 6.061
## 2 67 25.1880 25.379
## 3 14 5.2632 5.303
## 4 45 16.9173 17.045
## 5 30 11.2782 11.364
## 6 48 18.0451 18.182
## 7 20 7.5188 7.576
## 8 24 9.0226 9.091
## NA's 2 0.7519
## Total 266 100.0000 100.000
Now that that is removed, we look at the codebook for the rest of the options. We see that 1 = HS or less, 2 = HS graduate, 3 = Post-HS training, etc. What we need to do now is apply those labels to the educ variable.
We can do this in one of two ways depending on how many variables we are applying the labels to. If we were applying them to many variables, we would first create a vector of values with our desired label names in order from 1:x, as sketched below. Here, since education is a unique variable, we can also simply do it in the same line of code.
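For reference, a sketch of that vector-first approach using the educ labels (it produces the same result as the single-chunk version below):
#Build the label vector once so it can be reused across similarly coded variables
educ_labels <- c("No HS Degree" = 1, "HS Degree" = 2, "Post-HS VoTec" = 3,
                 "Some College, No Degree" = 4, "Associate" = 5, "BA/BA" = 6,
                 "Masters" = 7, "Doctorate" = 8)

data <- data %>%
  mutate(educ = labelled(educ, labels = educ_labels))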
Here we use the mutate
function along with
labelled
to assign label names to each individual value. We
keep the variable name the same here since we still have the original
variable at our disposal in case something goes wrong.
# Assign value labels to the educ variable
data <- data %>%
  mutate(educ = labelled(
    educ,
    labels = c(
      "No HS Degree" = 1,
      "HS Degree" = 2,
      "Post-HS VoTec" = 3,
      "Some College, No Degree" = 4,
      "Associate" = 5,
      "BA/BA" = 6,
      "Masters" = 7,
      "Doctorate" = 8
    )
  ))
freq(data$educ, plot=F)
## data$educ
## Frequency Percent Valid Percent
## 1 16 6.0150 6.061
## 2 67 25.1880 25.379
## 3 14 5.2632 5.303
## 4 45 16.9173 17.045
## 5 30 11.2782 11.364
## 6 48 18.0451 18.182
## 7 20 7.5188 7.576
## 8 24 9.0226 9.091
## NA's 2 0.7519
## Total 266 100.0000 100.000
val_labels(data$educ)
## No HS Degree HS Degree Post-HS VoTec
## 1 2 3
## Some College, No Degree Associate BA/BA
## 4 5 6
## Masters Doctorate
## 7 8
Now we have added labels to the individual response options for the education variable. This will be useful for reporting and for data exploration.
One thing you might notice is that there are 8 options for educational level and some of those options have many fewer cases than others. When this happens, especially with demographic variables, it can be appropriate to collapse the groups into logical sub-groupings. For the education variable, it is common either to collapse the options into college degree or not (so >= 6 (BA/BS)) or to combine them into 4 or 5 groups (HS or less, Some College/Associates, BA/BS, Advanced Degree). We will review how to do both of these approaches.
When you create a binary variable from a variable with many options, provided the order is appropriate, you can simply use the if_else command to create a new variable that equals 1 when some condition is met and 0 when it is not. Here, since respondents with a college degree or more begin at 6 and every value below that represents less than a BA/BS, we simply use college = if_else(educ > 5, 1, 0) to create a new variable called college that equals 1 when the respondent has a BA/BS or higher and 0 if not. We also apply labels to the new variable as illustrated previously.
Any time you create a new variable, check it against the original to
ensure the recode worked as desired. Here we
group_by(educ)
, which is the original variable, and
count(college)
the new variable to see if the recode
worked.
#Collapsing into a Binary Variable
data <- data %>%
  mutate(college = if_else(educ > 5, 1, 0))
freq(data$college, plot=F)
## data$college
## Frequency Percent Valid Percent
## 0 172 64.6617 65.15
## 1 92 34.5865 34.85
## NA's 2 0.7519
## Total 266 100.0000 100.00
data <- data %>%
  mutate(college = labelled(
    college,
    labels = c(
      "No College Degree" = 0,
      "College Degree" = 1
    )
  ))
val_labels(data$college)
## No College Degree College Degree
## 0 1
data %>% #Data we are using
  group_by(educ) %>% #Original Variable
  count(college) # New Variable
## # A tibble: 9 × 3
## # Groups: educ [9]
## educ college n
## <dbl+lbl> <dbl+lbl> <int>
## 1 1 [No HS Degree] 0 [No College Degree] 16
## 2 2 [HS Degree] 0 [No College Degree] 67
## 3 3 [Post-HS VoTec] 0 [No College Degree] 14
## 4 4 [Some College, No Degree] 0 [No College Degree] 45
## 5 5 [Associate] 0 [No College Degree] 30
## 6 6 [BA/BA] 1 [College Degree] 48
## 7 7 [Masters] 1 [College Degree] 20
## 8 8 [Doctorate] 1 [College Degree] 24
## 9 NA NA 2
To check if the recode worked, we look at the final table. Here we see three columns - educ, college, n - which tell us the information we need. We see that for educ between 1 and 5 the college variable = 0. This means that the educational attainment levels under 6, i.e., no BA/BS or more, are coded as 0 in the college variable, and the 1 next to 6, 7, and 8 shows those levels recoded correctly as well. We can then proceed to use this variable in analysis.
Sometimes, you want to keep more than just 2 response options. This code is slightly more complex than above but still easily doable. Here, we collapse the 8 education options into 3 options - HS or less, Some College, College Degree. We then do the same thing as above to label and double-check that our recode worked as desired.
data <- data %>%
  mutate(educ3 = case_when(
    educ <= 1 ~ 1,
    educ > 2 & educ < 6 ~ 2,
    educ >= 6 ~ 3
  ))
freq(data$educ3, plot=F)
## data$educ3
## Frequency Percent Valid Percent
## 1 16 6.015 8.122
## 2 89 33.459 45.178
## 3 92 34.586 46.701
## NA's 69 25.940
## Total 266 100.000 100.000
data <- data %>%
  mutate(educ3 = labelled(
    educ3,
    labels = c("HS or Less" = 1,
               "Some College" = 2,
               "College Degree or More" = 3)
  ))
val_labels(data$educ3)
## HS or Less Some College College Degree or More
## 1 2 3
data %>% #Data we are using
  group_by(educ) %>% #Original Variable
  count(educ3)
## # A tibble: 9 × 3
## # Groups: educ [9]
## educ educ3 n
## <dbl+lbl> <dbl+lbl> <int>
## 1 1 [No HS Degree] 1 [HS or Less] 16
## 2 2 [HS Degree] NA 67
## 3 3 [Post-HS VoTec] 2 [Some College] 14
## 4 4 [Some College, No Degree] 2 [Some College] 45
## 5 5 [Associate] 2 [Some College] 30
## 6 6 [BA/BA] 3 [College Degree or More] 48
## 7 7 [Masters] 3 [College Degree or More] 20
## 8 8 [Doctorate] 3 [College Degree or More] 24
## 9 NA NA 2
When we review our final table to check whether the recode worked, we see an issue. Notice the NA in the educ3 column when educ = 2. This indicates we did not include the educ = 2 option in our recode, so we must fix this before moving forward with this variable. Since option 2 is HS Degree, we know that educ3 should equal 1 for this option. We review our code and find the problem: in the line educ <= 1 ~ 1, the <= 1 should be <= 2. We update that, rerun the rest of the code, and confirm the fix worked. It did.
data <- data %>%
  mutate(educ3 = case_when(
    educ <= 2 ~ 1,
    educ > 2 & educ < 6 ~ 2,
    educ >= 6 ~ 3
  ))
freq(data$educ3, plot=F)
## data$educ3
## Frequency Percent Valid Percent
## 1 83 31.2030 31.44
## 2 89 33.4586 33.71
## 3 92 34.5865 34.85
## NA's 2 0.7519
## Total 266 100.0000 100.00
data <- data %>%
  mutate(educ3 = labelled(
    educ3, labels = c("HS or Less" = 1,
                      "Some College" = 2,
                      "College Degree or More" = 3)
  ))
val_labels(data$educ3)
## HS or Less Some College College Degree or More
## 1 2 3
data %>%              # Data we are using
  group_by(educ) %>%  # Original variable
  count(educ3)        # New variable
## # A tibble: 9 × 3
## # Groups: educ [9]
## educ educ3 n
## <dbl+lbl> <dbl+lbl> <int>
## 1 1 [No HS Degree] 1 [HS or Less] 16
## 2 2 [HS Degree] 1 [HS or Less] 67
## 3 3 [Post-HS VoTec] 2 [Some College] 14
## 4 4 [Some College, No Degree] 2 [Some College] 45
## 5 5 [Associate] 2 [Some College] 30
## 6 6 [BA/BA] 3 [College Degree or More] 48
## 7 7 [Masters] 3 [College Degree or More] 20
## 8 8 [Doctorate] 3 [College Degree or More] 24
## 9 NA NA 2
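For the four- or five-group option mentioned at the start of this section, the pattern is identical. Below is a minimal sketch of a four-group version, assuming the same educ coding shown in the check tables above; the name educ4 is just an example, not part of the original survey.

data <- data %>%
  mutate(educ4 = case_when(
    educ <= 2 ~ 1,             # No HS or HS degree
    educ > 2 & educ < 6 ~ 2,   # VoTec, Some College, Associate
    educ == 6 ~ 3,             # BA/BS
    educ >= 7 ~ 4              # Masters or Doctorate
  )) %>%
  mutate(educ4 = labelled(
    educ4, labels = c("HS or Less" = 1,
                      "Some College" = 2,
                      "BA/BS" = 3,
                      "Advanced Degree" = 4)
  ))

As always, you would then group_by(educ) and count(educ4) to confirm the recode worked before using it.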
It can be appropriate to collapse other variables as well, including bipolar ordinal scale questions such as approve/disapprove, satisfied/dissatisfied, and agree/disagree. Sometimes you are more interested in the overall approval rating than in splitting by intensity of feeling. In these cases, you can collapse the full scale by combining the positive options (approve/satisfied/agree) into one group and the negative options into another; if the scale has a midpoint, keep it as its own middle category.
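The item we use below has no midpoint, but for a scale that does, a minimal sketch might look like the following. Here q_agree is a hypothetical 5-point item, not in this dataset, coded 1 = Strongly agree through 5 = Strongly disagree with 3 = Neither.

data <- data %>%
  mutate(q_agree_3 = case_when(
    q_agree <= 2 ~ 1,   # Agree (strongly or somewhat)
    q_agree == 3 ~ 2,   # Neither, kept as its own middle category
    q_agree >= 4 ~ 3    # Disagree (somewhat or strongly)
  ))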
To illustrate with real data, we use the stapleton_so_dh variable. It has 4 scale points ranging from Strongly support to Strongly oppose. If I want to report the percentage that support the rule, I need to collapse the strongly and somewhat support options into a new variable that equals 1 for support and 0 if the respondent answered somewhat or strongly oppose.
First, run the val_labels command so you know how your original variable is coded. We see that the support options are 1 & 2, the oppose options are 3 & 4, and there is also a Not sure/No opinion option coded 6. We use this knowledge directly in our code.
We give the new variable a name following the established naming conventions, stapleton_so_dh_2, this time adding _2 to the end to signal that it is a binary, two-category version. We then provide value labels as before and check that the recode worked as desired.
val_labels(data$stapleton_so_dh)
## Strongly support Somewhat support Somewhat oppose Strongly oppose
## 1 2 3 4
## Not sure/No opinion
## 6
data <- data %>%
  mutate(stapleton_so_dh_2 = case_when(
    stapleton_so_dh <= 2 ~ 1,
    stapleton_so_dh >= 3 ~ 0
  ))
freq(data$stapleton_so_dh_2, plot=F)
## data$stapleton_so_dh_2
## Frequency Percent Valid Percent
## 0 21 7.895 25.61
## 1 61 22.932 74.39
## NA's 184 69.173
## Total 266 100.000 100.00
data <- data %>%
  mutate(stapleton_so_dh_2 = labelled(
    stapleton_so_dh_2, labels = c("Oppose Rule" = 0,
                                  "Support Rule" = 1)
  ))
val_labels(data$stapleton_so_dh_2)
## Oppose Rule Support Rule
## 0 1
data %>%                         # Data we are using
  group_by(stapleton_so_dh) %>%  # Original variable
  count(stapleton_so_dh_2)       # New variable
## # A tibble: 5 × 3
## # Groups: stapleton_so_dh [5]
## stapleton_so_dh stapleton_so_dh_2 n
## <dbl+lbl> <dbl+lbl> <int>
## 1 1 [Strongly support] 1 [Support Rule] 28
## 2 2 [Somewhat support] 1 [Support Rule] 33
## 3 3 [Somewhat oppose] 0 [Oppose Rule] 14
## 4 4 [Strongly oppose] 0 [Oppose Rule] 7
## 5 NA NA 184
The same recode can then be applied to the remaining items in this battery (stapleton_so_pc, stapleton_so_shift, and stapleton_so_bases) within a single mutate() call.

data <- data %>%
  mutate(
    stapleton_so_pc_2 = case_when(
      stapleton_so_pc <= 2 ~ 1,
      stapleton_so_pc >= 3 ~ 0),
    stapleton_so_shift_2 = case_when(
      stapleton_so_shift <= 2 ~ 1,
      stapleton_so_shift >= 3 ~ 0),
    stapleton_so_bases_2 = case_when(
      stapleton_so_bases <= 2 ~ 1,
      stapleton_so_bases >= 3 ~ 0)
  )
Looking at the check of stapleton_so_dh_2 above, we see the recode worked as desired: values 1 & 2 of the original variable equal 1 in the new variable, and values 3 & 4 equal 0. One caution: the original scale also includes a Not sure/No opinion option coded 6. No respondents selected it here, but if they had, the >= 3 condition would have lumped them in with Oppose Rule, so handle options like that explicitly when they occur. We can now move forward with these variables in any analysis or reporting required.
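If you have many items on the same response scale, dplyr's across() can apply the recode in one step rather than repeating the case_when() for each variable. This sketch is equivalent to the block above, under the assumption that all three items share the same 1-4 coding.

data <- data %>%
  mutate(across(
    c(stapleton_so_pc, stapleton_so_shift, stapleton_so_bases),
    ~ case_when(.x <= 2 ~ 1,
                .x >= 3 ~ 0),
    .names = "{.col}_2"   # creates stapleton_so_pc_2, stapleton_so_shift_2, etc.
  ))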
The final common data transformation technique we will cover in this tutorial is creating one variable from two existing ones. This is used less often than the previous transformations but comes up enough to cover here. Most commonly, this approach is used with branching questions, where Question 1 asks something like approve or disapprove and Question 2 follows up with whether the respondent strongly or somewhat approves/disapproves.
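A minimal sketch of folding such a branched pair into a single 4-point variable is shown below. The items q1_approve (1 = Approve, 2 = Disapprove) and q2_strength (1 = Strongly, 2 = Somewhat) are hypothetical and not in this dataset.

data <- data %>%
  mutate(approve4 = case_when(
    q1_approve == 1 & q2_strength == 1 ~ 1,  # Strongly approve
    q1_approve == 1 & q2_strength == 2 ~ 2,  # Somewhat approve
    q1_approve == 2 & q2_strength == 2 ~ 3,  # Somewhat disapprove
    q1_approve == 2 & q2_strength == 1 ~ 4   # Strongly disapprove
  ))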
For our data, we will create a new variable that combines college degree or not (the college variable created above) with biological sex (gender in the Lucid codebook). This gives us a single variable that makes it easy to report the impact of a college degree on some outcome variable interacted with biological sex. We will call this new variable college_sex to signal what it includes. From the Lucid codebook, we learn that gender = 1 is male and gender = 2 is female; from the college variable created above, we know that a college degree = 1 and no college degree = 0.
data <- data %>%
  mutate(college_sex = case_when(
    gender == 1 & college == 0 ~ 1,
    gender == 1 & college == 1 ~ 2,
    gender == 2 & college == 0 ~ 3,
    gender == 2 & college == 1 ~ 4
  ))
data <- data %>%
  mutate(college_sex = labelled(
    college_sex, labels = c("Male, No College Degree" = 1,
                            "Male, College Degree" = 2,
                            "Female, No College Degree" = 3,
                            "Female, College Degree" = 4)
  ))
freq(data$college_sex, plot=F)
## data$college_sex
## Frequency Percent Valid Percent
## 1 66 24.8120 25.00
## 2 35 13.1579 13.26
## 3 106 39.8496 40.15
## 4 57 21.4286 21.59
## NA's 2 0.7519
## Total 266 100.0000 100.00
val_labels(data$college_sex)
## Male, No College Degree Male, College Degree Female, No College Degree
## 1 2 3
## Female, College Degree
## 4
data %>%                         # Data we are using
  group_by(gender, college) %>%  # Original variables
  count(college_sex)             # New variable
## # A tibble: 5 × 4
## # Groups: gender, college [5]
## gender college college_sex n
## <dbl> <dbl+lbl> <dbl+lbl> <int>
## 1 1 0 [No College Degree] 1 [Male, No College Degree] 66
## 2 1 1 [College Degree] 2 [Male, College Degree] 35
## 3 2 0 [No College Degree] 3 [Female, No College Degree] 106
## 4 2 1 [College Degree] 4 [Female, College Degree] 57
## 5 2 NA NA 2
Now, we have combined two variables into one that we can then use in reporting or analysis.
Once your cleaning and recoding are complete, save the cleaned dataset so you do not have to repeat these steps every session. Here we write the data out as a Stata file with haven's write_dta.

write_dta(data, "data_clean.dta")
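If you prefer to stay in R, or want an SPSS copy instead, haven and base R offer alternatives; the filenames below are only illustrative.

write_sav(data, "data_clean.sav")   # SPSS copy, keeps value labels (haven)
saveRDS(data, "data_clean.rds")     # native R object, keeps haven labels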