The purpose of this guide is to help with basic survey analysis in R.

Get started

You can get your written survey by going into Qualtrics > Edit Survey > Advanced Options > Export Survey (as Word if you like). This will give you a document of the survey questions with skip logic flows.

You can get your survey data by going into Qualtrics > View Results > Download.

Set your working directory and load in the data (here’s mine)

# this is my working directory; be sure to change it to match your system
setwd("C:/Users/Nathan/Desktop/Working Directory/After/Emperitas/Emperitas-SG-1/LLL/Analysis")

# name your data something easy to remember / type out
Dat <- read.csv("LLL Data With CLV.csv", header = TRUE, stringsAsFactors = FALSE)

Get an idea of what your data is like

dim(Dat)   #see the dimensions of your data set
## [1] 1625  557

You now know that you have 1625 rows and 557 columns.

class(Dat) #see what kind of data you have
## [1] "data.frame"

To see the overall data structure, try this:

str(Dat)
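
str() on 557 columns is a fire hose, so a few other quick looks can be gentler. These are plain base R, shown here as a sketch:

head(Dat[, 1:5])   # first six rows of the first five columns
summary(Dat$Q4)    # quick summary of a single column
names(Dat)[1:10]   # the first ten column names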

Some Data Cleaning/Manipulation

# change the third column's name to "unique id"
colnames(Dat)[3] <- "unique id"

# if necessary, put NA in for blanks
Dat$Q5[Dat$Q5 == ""] <- NA
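
If many columns have blanks, here's a minimal sketch that applies the same recode to every character column (it assumes you read the data with stringsAsFactors = FALSE, as above):

for(i in which(sapply(Dat, is.character))){
  Dat[[i]][Dat[[i]] == ""] <- NA
}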

Sometimes your data contains characters you want to remove, and regular expressions are the tool for that. For more on how to do this in R, see here. When doing this, we recommend creating a new variable to hold your transformations; if you like the changes, then have your original data variable “get” your test variable.

Cleaning out bad characters using regular expressions (REGEX)

# say you wanted to remove punctuation marks from Q10_TEXT
tom <- gsub("\\(", "", Dat$Q10_TEXT)  # replace "(" with "" (parentheses are regex metacharacters, so escape them)
tom <- gsub("[[:punct:]]", "", tom)   # remove all remaining punctuation
tom <- gsub("[ \t\n\r\f\v]", "", tom) # remove all whitespace characters
tom <- gsub("[a-zA-Z]", "", tom)      # remove all upper- and lower-case letters

# if you like the changes, give tom to your original variable
Dat$Q10_TEXT <- tom
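
If a column cleaned this way should end up numeric, a natural follow-up sketch is as.numeric(); anything that still isn't a number becomes NA (with a warning):

Dat$Q10_TEXT <- as.numeric(Dat$Q10_TEXT)  # digits-only strings become numbers; leftovers become NA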

Binary questions

The basic table() function works great here. It gives you a count of the number of records for each type of answer. Pretty straightforward.

table(Dat$Q7)
## 
##    0    1 
## 1004  579

This command only counts the 0’s and 1’s. To also see how many NA’s there are, summarize the variable as a factor. Think of ‘factors’ in R as categorical variables, where every possible answer is analyzed discretely. Since there are several NA answers, they’ll be treated as one of the answer categories.

summary(as.factor(Dat$Q7))
##    0    1 NA's 
## 1004  579   42
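
Since the answers are coded 0/1, the mean of the column doubles as the proportion answering 1 (a quick sketch, not part of the original output):

mean(Dat$Q7, na.rm = TRUE)  # share of respondents answering 1, ignoring NAs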

Crosstabs

Use the table() function for a quick and dirty two-way crosstab. For example, crosstab Q4 with segments. table() gives you the counts.

table(Dat$Q4, Dat$Segment)
##    
##     Curious Bystander Goal Setters Relaxing Learner Selective Learner
##   1                34           51               53                32
##   2               112          162              200               147
##   3                64           81               88                87
##   4                15           26               35                31
##   5                 9           17               14                25
##   6                12           19               10                22
##   7                42            0                0                 0
##    
##     Topic Oriented
##   1             36
##   2            115
##   3             48
##   4             25
##   5              8
##   6              5
##   7              0

prop.table() will give proportions instead of counts for the table. You can get proportions by rows (margin = 1) or by columns (margin = 2).

# you can also leave out "margin =" to get proportions of the whole table
prop.table(table(Dat$Q4, Dat$Segment), margin = 2)
##    
##     Curious Bystander Goal Setters Relaxing Learner Selective Learner
##   1        0.11805556   0.14325843       0.13250000        0.09302326
##   2        0.38888889   0.45505618       0.50000000        0.42732558
##   3        0.22222222   0.22752809       0.22000000        0.25290698
##   4        0.05208333   0.07303371       0.08750000        0.09011628
##   5        0.03125000   0.04775281       0.03500000        0.07267442
##   6        0.04166667   0.05337079       0.02500000        0.06395349
##   7        0.14583333   0.00000000       0.00000000        0.00000000
##    
##     Topic Oriented
##   1     0.15189873
##   2     0.48523207
##   3     0.20253165
##   4     0.10548523
##   5     0.03375527
##   6     0.02109705
##   7     0.00000000

Or, for just a specific segment:

Qs <- prop.table(table(Dat$Q4, Dat$Segment == "Curious Bystander"), margin = 2)
Qs
##    
##          FALSE       TRUE
##   1 0.12864622 0.11805556
##   2 0.46671653 0.38888889
##   3 0.22737472 0.22222222
##   4 0.08750935 0.05208333
##   5 0.04786836 0.03125000
##   6 0.04188482 0.04166667
##   7 0.00000000 0.14583333

We are interested in the stats for a specific segment, so we need to remove the “FALSE” column. To do this, convert to a data frame and subset to “TRUE”.

Qs <- data.frame(Qs) # this will stack the true and false into one variable called 'Var2'
Qs <- subset(Qs, Var2==TRUE)
Qs
##    Var1 Var2       Freq
## 8     1 TRUE 0.11805556
## 9     2 TRUE 0.38888889
## 10    3 TRUE 0.22222222
## 11    4 TRUE 0.05208333
## 12    5 TRUE 0.03125000
## 13    6 TRUE 0.04166667
## 14    7 TRUE 0.14583333

Now that Var2 is all TRUE, we really don’t need to see it. We can drop that column by calling…

Qs[, 2] <- NULL  # deletes the second column of 'Qs'
Qs
##    Var1       Freq
## 8     1 0.11805556
## 9     2 0.38888889
## 10    3 0.22222222
## 11    4 0.05208333
## 12    5 0.03125000
## 13    6 0.04166667
## 14    7 0.14583333

The “8 - 14” to the left of Var1 are the row names assigned by R, not necessarily the question numbers in the survey. In this case, since we didn’t recode the variables, knowing what each option represents in the survey is key for interpretation.
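
If you'd rather see readable labels than the raw codes, one sketch is to recode with factor(). The labels below are placeholders, so substitute the real answer options from your exported survey document:

Qs$Var1 <- factor(Qs$Var1, levels = 1:7,
                  labels = c("Option 1", "Option 2", "Option 3", "Option 4",
                             "Option 5", "Option 6", "Option 7"))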

Another really useful function for multiple-variable analysis is aggregate(); try ?aggregate in your R console for more information.
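
For instance, here's a one-line sketch of the mean CLV for each segment (both columns are used elsewhere in this guide):

aggregate(CLV ~ Segment, data = Dat, FUN = mean)  # average customer lifetime value per segment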

Multiple Select Crosstabs

This is for questions where respondents can select “all that apply”, i.e. more than one answer. Tyler developed this function to handle multiple-select comparison questions in crosstabs. (You can copy and paste it into your R script editor, run it, and it will be ready for use.)

multipleSelect.Xtab <- function(Questions, comparison, rnm = NULL, byCase = T, both = F){
  
  if(both == T){
    # 'comparison' is a data frame of grouping variables: build one block per column
    List <- list()
    for(i in 1:length(comparison)){
      Row <- sapply(Questions, table, comparison[, i])
      List[[i]] <- Row
    }
    
    # then bind the blocks together side by side
    for(i in 2:length(List)){
      List[[1]] <- cbind(List[[1]], List[[i]])
    }
    colnames(List[[1]]) <- colnames(comparison)
  }else{
    # 'comparison' is a single grouping variable: one table per question, stacked as rows
    List <- lapply(Questions, table, comparison)
    for(i in 2:length(List)){
      List[[1]] <- rbind(List[[1]], List[[i]])
    }
  }
  
  Table <- List[[1]]
  
  # label the rows with the supplied names, or default to the question names
  if(is.null(rnm) == F){
    rownames(Table) <- rnm
  }else rownames(Table) <- names(Questions)
  
  # byCase = TRUE divides each column by its group's respondent count, giving percentages
  if(byCase == T){
    Table <- sweep(x = Table, MARGIN = 2, STATS = summary(as.factor(comparison)), FUN = "/")*100
  }
  
  return(Table)
}

Once you run this function, it will be saved in your environment. Here’s an example. Use this function to compare Q14 (a multiple select question) to segments.

# find the column indices for questions with "Q14" in them
grep("Q14", names(Dat))
##  [1] 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
# create these as a vector of "IDs"
IDs <- grep("Q14", names(Dat))
# the last one has text in it, and we don't want that in our analysis, so let's remove it
IDs <- IDs[-16]
# Use the IDs in the function, and compare to segments
# use byCase = TRUE so you get the correct percentages for viz
multipleSelect.Xtab(Dat[,IDs], Dat$Segment, byCase = TRUE)
##        Curious Bystander Goal Setters Relaxing Learner Selective Learner
## Q14_1          11.805556   14.6067416            11.25        14.2441860
## Q14_2          13.194444   16.5730337            13.25        14.5348837
## Q14_3           7.986111   13.7640449            13.75        11.6279070
## Q14_4          10.763889    9.8314607            12.50        11.0465116
## Q14_5          14.583333   14.0449438            16.25        11.9186047
## Q14_6          13.888889   14.6067416            16.75        14.8255814
## Q14_7          14.583333   16.2921348            16.25        13.6627907
## Q14_8           6.597222    9.8314607            10.75         8.7209302
## Q14_9          11.458333   15.4494382            17.75        14.2441860
## Q14_10          5.902778    8.4269663             6.50         5.8139535
## Q14_11          2.777778    4.7752809             6.00         4.3604651
## Q14_12          7.986111    8.9887640            11.25         8.4302326
## Q14_13          7.638889   14.6067416            12.50        12.2093023
## Q14_14          1.736111    0.2808989             0.25         2.3255814
## Q14_15          1.041667    1.9662921             1.50         0.5813953
##        Topic Oriented
## Q14_1      14.7679325
## Q14_2      15.1898734
## Q14_3      15.1898734
## Q14_4      11.8143460
## Q14_5      17.7215190
## Q14_6      17.2995781
## Q14_7      19.8312236
## Q14_8       7.1729958
## Q14_9      15.1898734
## Q14_10      5.0632911
## Q14_11      7.5949367
## Q14_12      9.7046414
## Q14_13     11.8143460
## Q14_14      0.8438819
## Q14_15      2.5316456

Other Analysis & Statistics (more here)

Chi-square tests for crosstabs

kyle <- table(Dat$Q4,Dat$Segment)
chisq.test(kyle)
## 
##  Pearson's Chi-squared test
## 
## data:  kyle
## X-squared = 232.21, df = 24, p-value < 2.2e-16
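
If the test comes back significant and you want to see which cells drive it, the test object stores Pearson residuals; cells with large absolute residuals deviate most from what independence would predict. A quick sketch:

round(chisq.test(kyle)$residuals, 2)  # Pearson residuals for each cell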

T-tests and ANOVAs (continuous data)

This is just one kind of ANOVA that you can do. Defer to TukeyHSD() for the specific pairwise values to interpret.

# One Way Anova (Completely Randomized Design)
# Have your dependent variable on the LHS of the '~'
brett <- aov(Dat$Q4 ~ Dat$Segment, data=Dat)
summary(brett)  # general summary - display Type I ANOVA table
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## Dat$Segment    4  117.1  29.275   15.82 1.05e-12 ***
## Residuals   1620 2998.3   1.851                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(brett) # differences and p-values by Segment
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Dat$Q4 ~ Dat$Segment, data = Dat)
## 
## $`Dat$Segment`
##                                            diff         lwr         upr
## Goal Setters-Curious Bystander      -0.61083801 -0.90527902 -0.31639701
## Relaxing Learner-Curious Bystander  -0.73041667 -1.01752420 -0.44330914
## Selective Learner-Curious Bystander -0.38396318 -0.68069196 -0.08723440
## Topic Oriented-Curious Bystander    -0.75065928 -1.07648536 -0.42483321
## Relaxing Learner-Goal Setters       -0.11957865 -0.39027507  0.15111776
## Selective Learner-Goal Setters       0.22687484 -0.05400554  0.50775522
## Topic Oriented-Goal Setters         -0.13982127 -0.45128306  0.17164052
## Selective Learner-Relaxing Learner   0.34645349  0.07327037  0.61963660
## Topic Oriented-Relaxing Learner     -0.02024262 -0.32478108  0.28429585
## Topic Oriented-Selective Learner    -0.36669610 -0.68032154 -0.05307067
##                                         p adj
## Goal Setters-Curious Bystander      0.0000002
## Relaxing Learner-Curious Bystander  0.0000000
## Selective Learner-Curious Bystander 0.0038531
## Topic Oriented-Curious Bystander    0.0000000
## Relaxing Learner-Goal Setters       0.7477391
## Selective Learner-Goal Setters      0.1779758
## Topic Oriented-Goal Setters         0.7362372
## Selective Learner-Relaxing Learner  0.0049629
## Topic Oriented-Relaxing Learner     0.9997575
## Topic Oriented-Selective Learner    0.0124828

T-tests are great for comparing two groups on a continuous variable. Check here for more info.

# independent 2-group t-test.
# Have your dependent variable on the LHS of the '~'
t.test(Dat$CLV ~ Dat$Q7)
## 
##  Welch Two Sample t-test
## 
## data:  Dat$CLV by Dat$Q7
## t = -11.714, df = 723.73, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -150.7282 -107.4558
## sample estimates:
## mean in group 0 mean in group 1 
##         156.086         285.178

Finding those Darn N-sizes

To find the number of responses to a question you can use…

# this counts the non-NA values in a column
NROW(na.omit(Dat$Q8))  # omitting the "NA" values
## [1] 579

For a typical crosstab like…

table(Dat$Q38, Dat$Q7)
##    
##       0   1
##   0 747 455
##   1 257 124

taking sum() of the table adds up all the cells, giving you the N-size.

sum(table(Dat$Q38, Dat$Q7)) # Gender by "do you typically enroll..."
## [1] 1583

Just don’t do this:

sum(table(Dat$Q39, Dat$Segment=="Curious Bystander")) # Income by Curious Bystander
## [1] 1625

Why is it so large? Remember that when you table() with one side “==” a specific value, you automatically get back TRUE and FALSE counts that, summed up, equal all the data in the dataset, not just your filter. One way to get around this is…

# subset to the segment first, then table()
sum(table(Dat$Q39[Dat$Segment=="Curious Bystander"]))
## [1] 288

Using table() can help you find N-sizes; just know what you are summing up when you call sum().
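
An equivalent sketch that counts the non-missing answers directly (assuming Segment itself has no NAs):

sum(!is.na(Dat$Q39[Dat$Segment == "Curious Bystander"]))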

For multiple-select N-sizes, you want to know how many people in each group were asked the question, not how many responses were given. Therefore, your N-size is the count of respondents at each level of your grouping variable (in this example, segments).

# find the indices of the multiple select question
grep("Q14", names(Dat))
##  [1] 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
# create these as a vector of "IDs"
IDs <- grep("Q14", names(Dat))
# the last one has text in it, and we don't want that in our analysis, so let's remove it
IDs <- IDs[-16]
# Use the IDs in the function, and compare to segments
# use byCase = FALSE if you want raw counts, leave the default for correct %s
multipleSelect.Xtab(Dat[,IDs], Dat$Segment)
##        Curious Bystander Goal Setters Relaxing Learner Selective Learner
## Q14_1          11.805556   14.6067416            11.25        14.2441860
## Q14_2          13.194444   16.5730337            13.25        14.5348837
## Q14_3           7.986111   13.7640449            13.75        11.6279070
## Q14_4          10.763889    9.8314607            12.50        11.0465116
## Q14_5          14.583333   14.0449438            16.25        11.9186047
## Q14_6          13.888889   14.6067416            16.75        14.8255814
## Q14_7          14.583333   16.2921348            16.25        13.6627907
## Q14_8           6.597222    9.8314607            10.75         8.7209302
## Q14_9          11.458333   15.4494382            17.75        14.2441860
## Q14_10          5.902778    8.4269663             6.50         5.8139535
## Q14_11          2.777778    4.7752809             6.00         4.3604651
## Q14_12          7.986111    8.9887640            11.25         8.4302326
## Q14_13          7.638889   14.6067416            12.50        12.2093023
## Q14_14          1.736111    0.2808989             0.25         2.3255814
## Q14_15          1.041667    1.9662921             1.50         0.5813953
##        Topic Oriented
## Q14_1      14.7679325
## Q14_2      15.1898734
## Q14_3      15.1898734
## Q14_4      11.8143460
## Q14_5      17.7215190
## Q14_6      17.2995781
## Q14_7      19.8312236
## Q14_8       7.1729958
## Q14_9      15.1898734
## Q14_10      5.0632911
## Q14_11      7.5949367
## Q14_12      9.7046414
## Q14_13     11.8143460
## Q14_14      0.8438819
## Q14_15      2.5316456
# N sizes for segments
summary(as.factor(Dat$Segment))
## Curious Bystander      Goal Setters  Relaxing Learner Selective Learner 
##               288               356               400               344 
##    Topic Oriented 
##               237

Open-ended Text Variables

There are a LOT of options for how to deal with text responses. This section provides a few examples of how we’ve been handling text variables in survey analysis.

Prep the text for analysis:

#load in some packages (install if necessary)
library(tm)
library(SnowballC)
library(RColorBrewer)
library(dplyr)

# Clean text
tricia <- Dat$Q5  # copy the open-response variable to a new variable for text transformations
tricia <- gsub("[^[:graph:]]", " ", tricia)  # replace non-graphical characters with spaces
tricia <- gsub("rt", "", tricia)  # remove "rt" (a retweet marker, really only needed for Twitter text; note it also strips "rt" inside words, e.g. "opportunity" becomes "oppounity")
tricia <- gsub("[[:punct:]]", "", tricia)  # remove punctuation
tricia <- gsub("[ \t]{2,}", " ", tricia)  # collapse runs of spaces and tabs
tricia <- gsub("^ ", "", tricia)  # remove blank space at the beginning
tricia <- gsub(" $", "", tricia)  # remove blank space at the end
tricia <- tolower(tricia)  # convert all text to lower case

myCorpus <- Corpus(VectorSource(tricia))
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english")) #removes common english stopwords
myCorpus <- tm_map(myCorpus, removeWords, c("muffin"))  #You can specify words to remove
myCorpus <- tm_map(myCorpus, PlainTextDocument)

# build a term-document matrix
myTDM <- TermDocumentMatrix(myCorpus, control = list(minWordLength = 1))
m <- as.matrix(myTDM)
v <- sort(rowSums(m), decreasing = TRUE)  # total frequency of each term
d <- data.frame(word = names(v), freq = v)  # terms and frequencies as a data frame

Phew! That’s a lot of stuff. We’ve queued up this text data for some analysis and visualizations. Here’s the basic stuff:

Create basic word cloud

library(wordcloud)
wordcloud(myCorpus, scale=c(5,0.5), max.words=100, random.order=FALSE, rot.per=0.25, 
          use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))
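
The d data frame we built with the term-document matrix hasn't been used yet; it's handy for a quick top-terms table or bar chart. A base-R sketch:

head(d, 10)  # the ten most frequent terms
barplot(d$freq[1:10], names.arg = as.character(d$word[1:10]), las = 2)  # quick frequency bar chart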

Frequent word analysis

We can find the words that appear at least 100 times by calling the findFreqTerms() function on myTDM:

HiFreqWrds <- findFreqTerms(myTDM, 100)
HiFreqWrds
##  [1] "can"         "chance"      "class"       "classes"     "continue"   
##  [6] "education"   "explore"     "fun"         "great"       "interest"   
## [11] "interesting" "interests"   "knowledge"   "learn"       "learning"   
## [16] "life"        "lifelong"    "love"        "means"       "new"        
## [21] "oppounity"   "people"      "skills"      "something"   "take"       
## [26] "things"      "time"        "way"

You can also see how strongly a word is associated with another word or a list of words. Choose the minimum correlation you want findAssocs() to report.

# Compute associations for all high freq terms 
findAssocs(myTDM, HiFreqWrds, 0.4)
## $can
## numeric(0)
## 
## $chance
## numeric(0)
## 
## $class
## numeric(0)
## 
## $classes
## numeric(0)
## 
## $continue
## numeric(0)
## 
## $education
## numeric(0)
## 
## $explore
## numeric(0)
## 
## $fun
## numeric(0)
## 
## $great
## numeric(0)
## 
## $interest
## numeric(0)
## 
## $interesting
## numeric(0)
## 
## $interests
## numeric(0)
## 
## $knowledge
## numeric(0)
## 
## $learn
## numeric(0)
## 
## $learning
## numeric(0)
## 
## $life
## numeric(0)
## 
## $lifelong
## numeric(0)
## 
## $love
## numeric(0)
## 
## $means
## numeric(0)
## 
## $new
## numeric(0)
## 
## $oppounity
## numeric(0)
## 
## $people
## meet 
## 0.53 
## 
## $skills
## numeric(0)
## 
## $something
## numeric(0)
## 
## $take
## numeric(0)
## 
## $things
## numeric(0)
## 
## $time
## numeric(0)
## 
## $way
## numeric(0)

Looks like only one term had an association of at least 40%: “people” with “meet” (0.53).

# or, just compute word strength associations
findAssocs(myTDM, "learn", 0.5)
## $learn
## numeric(0)

Looks like the word “learn” has no associations at the 50% level. See what happens if you lower it.
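
For instance (your output will vary with the data):

findAssocs(myTDM, "learn", 0.2)  # the same call with a lower correlation cutoff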

Polarity / Sentiment Analysis

This is a quick and dirty version of sentiment analysis; there are lots more ways of doing it (see the qdap package vignette). Here we take the cleaned character vector created earlier (i.e. tricia) and compare its sentiment across a grouping variable. I use Dat$Q38, which is gender.

library(qdap)
library(ggplot2)
poldat <- polarity(tricia, Dat$Q38)  # polarity of each response, grouped by gender
plot(poldat)
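
If you want the numbers behind the plot, qdap's accessor functions pull them out (a sketch; see ?polarity for details):

scores(poldat)  # average polarity by group (here, gender)
counts(poldat)  # sentence-level polarity detail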

If you get stuck or have questions, you are not alone! We get stuck all the time, and we’ve learned that Googling and Stack Overflow almost always have the answer. It just takes some diligence and hard work. We hope this helps!