The purpose of this guide is to help with basic survey analysis in R.
You can get your written survey by going into Qualtrics > Edit Survey > Advanced Options > Export Survey (as Word if you like). This will give you a document of the survey questions with skip logic flows.
You can get your survey data by going into Qualtrics > View Results > Download.
Set your working directory and load in the data (here’s mine)
# this is my wd; be sure to change it to match your system
setwd("C:/Users/Nathan/Desktop/Working Directory/After/Emperitas/Emperitas-SG-1/LLL/Analysis")
# name your data something easy to remember / type out
Dat <- read.csv("LLL Data With CLV.csv", header = TRUE, stringsAsFactors = FALSE)
dim(Dat) #see the dimensions of your data set
## [1] 1625 557
You now know that you have 1625 rows and 557 columns.
class(Dat) #see what kind of data you have
## [1] "data.frame"
To see the overall data structure, try this:
str(Dat)
# changing the third column's name to "unique id"
colnames(Dat)[3] <- "unique id"
#if necessary, put "NA" in for blanks
Dat$Q5[Dat$Q5==""] <- NA
Sometimes your data contains characters you want to remove, and regular expressions are the tool for the job. For more on how to do this in R, see here. When doing this, we recommend creating a new variable to hold your transformations. If you like the changes, then have your original data variable "get" your test variable.
Cleaning out bad characters using regular expressions (REGEX)
#say you wanted to remove punctuation marks from Q10_TEXT
tom <- gsub("\\(","",Dat$Q10_TEXT) #replace (parenthesis with ""
tom <- gsub("[]$*+.?[^{|(\\#%&~_/<=>'!,:;???\")}@-]","",tom) #remove all punctuation
tom <- gsub("[ \t\n\r\f\v]","",tom) #get rid of all blank spaces
tom <- gsub("[a-zA-Z]","",tom) #get rid of all upper and lower case letters
# if you like the changes, give tom to your original variable
Dat$Q10_TEXT <- tom
The basic table() function works great here. It gives you a count of the number of records for each type of answer. Pretty straightforward.
table(Dat$Q7)
##
## 0 1
## 1004 579
This command just shows the counts of 0's and 1's. To also see how many NA's there are, summarize this variable as a factor. Think of factors in R as categorical variables, where every possible answer is treated as a discrete category. Since there are several "NA" answers, they'll be treated as one of the categories of answers.
summary(as.factor(Dat$Q7))
## 0 1 NA's
## 1004 579 42
Use the table() function for a quick and dirty two-way crosstab. For example, crosstab Q4 with segments. table() gives you the counts.
table(Dat$Q4, Dat$Segment)
##
## Curious Bystander Goal Setters Relaxing Learner Selective Learner
## 1 34 51 53 32
## 2 112 162 200 147
## 3 64 81 88 87
## 4 15 26 35 31
## 5 9 17 14 25
## 6 12 19 10 22
## 7 42 0 0 0
##
## Topic Oriented
## 1 36
## 2 115
## 3 48
## 4 25
## 5 8
## 6 5
## 7 0
prop.table() will give proportions instead of counts for the table. You can get proportions by rows (margin = 1) or by columns (margin = 2).
## you can also drop the "margin =" label and just pass 1 or 2
prop.table(table(Dat$Q4, Dat$Segment), margin = 2)
##
## Curious Bystander Goal Setters Relaxing Learner Selective Learner
## 1 0.11805556 0.14325843 0.13250000 0.09302326
## 2 0.38888889 0.45505618 0.50000000 0.42732558
## 3 0.22222222 0.22752809 0.22000000 0.25290698
## 4 0.05208333 0.07303371 0.08750000 0.09011628
## 5 0.03125000 0.04775281 0.03500000 0.07267442
## 6 0.04166667 0.05337079 0.02500000 0.06395349
## 7 0.14583333 0.00000000 0.00000000 0.00000000
##
## Topic Oriented
## 1 0.15189873
## 2 0.48523207
## 3 0.20253165
## 4 0.10548523
## 5 0.03375527
## 6 0.02109705
## 7 0.00000000
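For row proportions instead, set margin = 1 so that each row sums to 1:
prop.table(table(Dat$Q4, Dat$Segment), margin = 1)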
Or, for just a specific segment:
Qs <- prop.table(table(Dat$Q4, Dat$Segment=="Curious Bystander"),2)
Qs
##
## FALSE TRUE
## 1 0.12864622 0.11805556
## 2 0.46671653 0.38888889
## 3 0.22737472 0.22222222
## 4 0.08750935 0.05208333
## 5 0.04786836 0.03125000
## 6 0.04188482 0.04166667
## 7 0.00000000 0.14583333
We are interested in the stats for a specific segment, so we need to remove the "FALSE" column. To do this, convert to a data frame and subset to "TRUE".
Qs <- data.frame(Qs) # this will stack the true and false into one variable called 'Var2'
Qs <- subset(Qs, Var2==TRUE)
Qs
## Var1 Var2 Freq
## 8 1 TRUE 0.11805556
## 9 2 TRUE 0.38888889
## 10 3 TRUE 0.22222222
## 11 4 TRUE 0.05208333
## 12 5 TRUE 0.03125000
## 13 6 TRUE 0.04166667
## 14 7 TRUE 0.14583333
Now that Var2 is only TRUE, we really don't need to see it. We can drop that column by calling…
Qs[,2]<-NULL #deletes the second column of 'Qs'
Qs
## Var1 Freq
## 8 1 0.11805556
## 9 2 0.38888889
## 10 3 0.22222222
## 11 4 0.05208333
## 12 5 0.03125000
## 13 6 0.04166667
## 14 7 0.14583333
The "8 - 14" to the left of Var1 are the row names assigned by R, not necessarily the question numbers in the survey. In this case, since we didn't recode the variables, knowing what each option represents in the survey is key for interpretation.
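If those leftover row names bother you, you can reset them:
rownames(Qs) <- NULL # renumbers the rows 1 through 7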
Another really useful function for multiple-variable analysis is aggregate(). Try ?aggregate in your R console for more information.
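For example, here's a minimal sketch using this dataset's Segment and CLV columns (CLV is assumed numeric; it's used as such in the t-test later on):
aggregate(CLV ~ Segment, data = Dat, FUN = mean) # mean customer lifetime value per segment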
This is for questions where respondents can select "all that apply" (i.e., more than one answer). Tyler developed this function to handle multiple-select questions in crosstabs. (You can copy and paste it into your R script editor, run it, and it will be ready for use.)
multipleSelect.Xtab <- function(Questions, comparison, rnm = NULL, byCase = T, both = F){
  # Questions:  data frame of the multiple-select columns
  # comparison: grouping variable (or a data frame of grouping variables when both = T)
  # rnm:        optional vector of row names for the output table
  # byCase:     if TRUE, divide counts by group sizes to get percentages
  if(both == T){
    List <- list()
    for(i in 1:length(comparison)){
      Row <- sapply(Questions, table, comparison[,i])
      List[[i]] <- Row
    }
    for(i in 2:length(List)){
      List[[1]] <- cbind(List[[1]], List[[i]])
    }
    colnames(List[[1]]) <- colnames(comparison)
  }else{
    List <- lapply(Questions, table, comparison)
    for(i in 2:length(List)){
      List[[1]] <- rbind(List[[1]], List[[i]])
    }
  }
  Table <- List[[1]]
  if(is.null(rnm) == F){
    rownames(Table) <- rnm
  }else rownames(Table) <- names(Questions)
  if(byCase == T){
    # convert counts to percentages of each group's N size
    Table <- sweep(x = Table, MARGIN = 2, STATS = summary(as.factor(comparison)), FUN = "/")*100
  }
  return(Table)
}
Once you run this function, it will be saved in your environment. Here's an example: use the function to compare Q14 (a multiple-select question) to segments.
# find the column indices for the questions with "Q14" in their names
grep("Q14", names(Dat))
## [1] 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
# create these as a vector of "IDs"
IDs <- grep("Q14", names(Dat))
# the last one has text in it, and we don't want that in our analysis, so let's remove it
IDs <- IDs[-16]
# Use the IDs in the function, and compare to segments
# use byCase = TRUE so you get the correct percentages for viz
multipleSelect.Xtab(Dat[,IDs], Dat$Segment, byCase = TRUE)
## Curious Bystander Goal Setters Relaxing Learner Selective Learner
## Q14_1 11.805556 14.6067416 11.25 14.2441860
## Q14_2 13.194444 16.5730337 13.25 14.5348837
## Q14_3 7.986111 13.7640449 13.75 11.6279070
## Q14_4 10.763889 9.8314607 12.50 11.0465116
## Q14_5 14.583333 14.0449438 16.25 11.9186047
## Q14_6 13.888889 14.6067416 16.75 14.8255814
## Q14_7 14.583333 16.2921348 16.25 13.6627907
## Q14_8 6.597222 9.8314607 10.75 8.7209302
## Q14_9 11.458333 15.4494382 17.75 14.2441860
## Q14_10 5.902778 8.4269663 6.50 5.8139535
## Q14_11 2.777778 4.7752809 6.00 4.3604651
## Q14_12 7.986111 8.9887640 11.25 8.4302326
## Q14_13 7.638889 14.6067416 12.50 12.2093023
## Q14_14 1.736111 0.2808989 0.25 2.3255814
## Q14_15 1.041667 1.9662921 1.50 0.5813953
## Topic Oriented
## Q14_1 14.7679325
## Q14_2 15.1898734
## Q14_3 15.1898734
## Q14_4 11.8143460
## Q14_5 17.7215190
## Q14_6 17.2995781
## Q14_7 19.8312236
## Q14_8 7.1729958
## Q14_9 15.1898734
## Q14_10 5.0632911
## Q14_11 7.5949367
## Q14_12 9.7046414
## Q14_13 11.8143460
## Q14_14 0.8438819
## Q14_15 2.5316456
Chi-square tests for crosstabs
kyle <- table(Dat$Q4,Dat$Segment)
chisq.test(kyle)
##
## Pearson's Chi-squared test
##
## data: kyle
## X-squared = 232.21, df = 24, p-value < 2.2e-16
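The object chisq.test() returns also stores the pieces behind that statistic; pulling out the expected counts and standardized residuals shows which cells drive the result:
josh <- chisq.test(kyle)
josh$expected # cell counts expected under independence
josh$stdres # standardized residuals; cells with |values| > 2 stand out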
T-tests and ANOVAs (continuous data)
This is just one kind of ANOVA that you can do. Defer to the TukeyHSD() output for the specific pairwise differences and p-values to interpret.
# One Way Anova (Completely Randomized Design)
# Have your dependent variable on the LHS of the '~'
brett <- aov(Dat$Q4 ~ Dat$Segment, data=Dat)
summary(brett) # general summary - display Type I ANOVA table
## Df Sum Sq Mean Sq F value Pr(>F)
## Dat$Segment 4 117.1 29.275 15.82 1.05e-12 ***
## Residuals 1620 2998.3 1.851
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(brett) # differences and p-values by Segment
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Dat$Q4 ~ Dat$Segment, data = Dat)
##
## $`Dat$Segment`
## diff lwr upr
## Goal Setters-Curious Bystander -0.61083801 -0.90527902 -0.31639701
## Relaxing Learner-Curious Bystander -0.73041667 -1.01752420 -0.44330914
## Selective Learner-Curious Bystander -0.38396318 -0.68069196 -0.08723440
## Topic Oriented-Curious Bystander -0.75065928 -1.07648536 -0.42483321
## Relaxing Learner-Goal Setters -0.11957865 -0.39027507 0.15111776
## Selective Learner-Goal Setters 0.22687484 -0.05400554 0.50775522
## Topic Oriented-Goal Setters -0.13982127 -0.45128306 0.17164052
## Selective Learner-Relaxing Learner 0.34645349 0.07327037 0.61963660
## Topic Oriented-Relaxing Learner -0.02024262 -0.32478108 0.28429585
## Topic Oriented-Selective Learner -0.36669610 -0.68032154 -0.05307067
## p adj
## Goal Setters-Curious Bystander 0.0000002
## Relaxing Learner-Curious Bystander 0.0000000
## Selective Learner-Curious Bystander 0.0038531
## Topic Oriented-Curious Bystander 0.0000000
## Relaxing Learner-Goal Setters 0.7477391
## Selective Learner-Goal Setters 0.1779758
## Topic Oriented-Goal Setters 0.7362372
## Selective Learner-Relaxing Learner 0.0049629
## Topic Oriented-Relaxing Learner 0.9997575
## Topic Oriented-Selective Learner 0.0124828
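If you'd rather look at these comparisons than read them, base R will plot the Tukey confidence intervals directly:
plot(TukeyHSD(brett)) # one interval per pairwise comparison; intervals crossing 0 are not significant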
T-tests are great for comparing two groups by a continuous variable. Check here for more info.
# independent 2-group t-test.
# Have your dependent variable on the LHS of the '~'
t.test(Dat$CLV ~ Dat$Q7)
##
## Welch Two Sample t-test
##
## data: Dat$CLV by Dat$Q7
## t = -11.714, df = 723.73, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -150.7282 -107.4558
## sample estimates:
## mean in group 0 mean in group 1
## 156.086 285.178
To find the number of responses to a question, you can use…
# this gives a count of non-missing responses for a column
NROW(na.omit(Dat$Q8)) # omitting the "NA" values
## [1] 579
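To get response counts for many questions at once, count the non-NA cells per column:
colSums(!is.na(Dat)) # non-missing response count for every column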
For a typical crosstab like…
table(Dat$Q38, Dat$Q7)
##
## 0 1
## 0 747 455
## 1 257 124
taking the sum() will sum up all the answers, giving you the N size.
sum(table(Dat$Q38, Dat$Q7)) # Gender by "do you typically enroll..."
## [1] 1583
Just don’t do this:
sum(table(Dat$Q39, Dat$Segment=="Curious Bystander")) # Income by Curious Bystander
## [1] 1625
Why is it so large? Remember that when you table() with one argument "==" to a specific value, you automatically get back both TRUE and FALSE columns; summed up, they equal all the records in the dataset, not just your filter. One way to get around this is…
# subset the rows first, then table
sum(table(Dat$Q39[Dat$Segment=="Curious Bystander"]))
## [1] 288
Using table() can help you find N-sizes; just know what you are summing up when you call sum().
For multiple-select N-sizes, you want to know how many people in each group were asked the question, not how many responses were given. Therefore, your N-size will be the aggregate count of your grouping variable (in this example, segments).
# find the indices of the multiple select question
grep("Q14", names(Dat))
## [1] 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
# create these as a vector of "IDs"
IDs <- grep("Q14", names(Dat))
# the last one has text in it, and we don't want that in our analysis, so let's remove it
IDs <- IDs[-16]
# Use the IDs in the function, and compare to segments
# use byCase = FALSE if you want raw counts, leave the default for correct %s
multipleSelect.Xtab(Dat[,IDs], Dat$Segment)
## Curious Bystander Goal Setters Relaxing Learner Selective Learner
## Q14_1 11.805556 14.6067416 11.25 14.2441860
## Q14_2 13.194444 16.5730337 13.25 14.5348837
## Q14_3 7.986111 13.7640449 13.75 11.6279070
## Q14_4 10.763889 9.8314607 12.50 11.0465116
## Q14_5 14.583333 14.0449438 16.25 11.9186047
## Q14_6 13.888889 14.6067416 16.75 14.8255814
## Q14_7 14.583333 16.2921348 16.25 13.6627907
## Q14_8 6.597222 9.8314607 10.75 8.7209302
## Q14_9 11.458333 15.4494382 17.75 14.2441860
## Q14_10 5.902778 8.4269663 6.50 5.8139535
## Q14_11 2.777778 4.7752809 6.00 4.3604651
## Q14_12 7.986111 8.9887640 11.25 8.4302326
## Q14_13 7.638889 14.6067416 12.50 12.2093023
## Q14_14 1.736111 0.2808989 0.25 2.3255814
## Q14_15 1.041667 1.9662921 1.50 0.5813953
## Topic Oriented
## Q14_1 14.7679325
## Q14_2 15.1898734
## Q14_3 15.1898734
## Q14_4 11.8143460
## Q14_5 17.7215190
## Q14_6 17.2995781
## Q14_7 19.8312236
## Q14_8 7.1729958
## Q14_9 15.1898734
## Q14_10 5.0632911
## Q14_11 7.5949367
## Q14_12 9.7046414
## Q14_13 11.8143460
## Q14_14 0.8438819
## Q14_15 2.5316456
# N sizes for segments
summary(as.factor(Dat$Segment))
## Curious Bystander Goal Setters Relaxing Learner Selective Learner
## 288 356 400 344
## Topic Oriented
## 237
There are a LOT of options for how to deal with text responses. This section provides a few examples of how we've been handling text variables for survey analysis.
Prep the text for analysis:
#load in some packages (install if necessary)
library(tm)
library(SnowballC)
library(RColorBrewer)
library(dplyr)
#Clean text
tricia <- Dat$Q5 # copy the open-response variable to a new variable for text transformations
tricia <- gsub("[^[:graph:]]", " ", tricia) # replace non-graphical characters with spaces
tricia <- gsub("rt", "", tricia) # remove "rt" (a Twitter-cleaning holdover; it also strips "rt" from inside words, which is why "opportunity" shows up as "oppounity" in the output below)
tricia <- gsub("[[:punct:]]", "", tricia) # remove punctuation
tricia <- gsub("[ |\t]{2,}", "", tricia) # remove runs of two or more spaces or tabs
tricia <- gsub("^ ", "", tricia) # remove a blank space at the beginning
tricia <- gsub(" $", "", tricia) # remove a blank space at the end
tricia <- tolower(tricia)#convert all text to lower case
myCorpus <- Corpus(VectorSource(tricia))
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english")) #removes common english stopwords
myCorpus <- tm_map(myCorpus, removeWords, c("muffin")) #You can specify words to remove
myCorpus <- tm_map(myCorpus, PlainTextDocument)
#build a term-document matrix
myTDM = TermDocumentMatrix(myCorpus, control = list(minWordLength = 1))
m = as.matrix(myTDM)
v = sort(rowSums(m), decreasing = TRUE)
d = data.frame(word = names(v),freq=v)
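The d data frame now holds every word with its frequency, sorted from most to least common. Peek at the top of it:
head(d, 10) # the ten most frequent words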
Phew! That’s a lot of stuff. We’ve queued up this text data for some analysis and visualizations. Here’s the basic stuff:
library(wordcloud)
wordcloud(myCorpus, scale=c(5,0.5), max.words=100, random.order=FALSE, rot.per=0.25,
use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))
We can find the words that appear at least 100 times by calling the findFreqTerms() function on myTDM:
HiFreqWrds <- findFreqTerms(myTDM, 100)
HiFreqWrds
## [1] "can" "chance" "class" "classes" "continue"
## [6] "education" "explore" "fun" "great" "interest"
## [11] "interesting" "interests" "knowledge" "learn" "learning"
## [16] "life" "lifelong" "love" "means" "new"
## [21] "oppounity" "people" "skills" "something" "take"
## [26] "things" "time" "way"
Now you can also see how strongly a word is associated with another word or a list of words. Choose the minimum association (correlation) coefficient.
# Compute associations for all high freq terms
findAssocs(myTDM, HiFreqWrds, 0.4)
## $can
## numeric(0)
##
## $chance
## numeric(0)
##
## $class
## numeric(0)
##
## $classes
## numeric(0)
##
## $continue
## numeric(0)
##
## $education
## numeric(0)
##
## $explore
## numeric(0)
##
## $fun
## numeric(0)
##
## $great
## numeric(0)
##
## $interest
## numeric(0)
##
## $interesting
## numeric(0)
##
## $interests
## numeric(0)
##
## $knowledge
## numeric(0)
##
## $learn
## numeric(0)
##
## $learning
## numeric(0)
##
## $life
## numeric(0)
##
## $lifelong
## numeric(0)
##
## $love
## numeric(0)
##
## $means
## numeric(0)
##
## $new
## numeric(0)
##
## $oppounity
## numeric(0)
##
## $people
## meet
## 0.53
##
## $skills
## numeric(0)
##
## $something
## numeric(0)
##
## $take
## numeric(0)
##
## $things
## numeric(0)
##
## $time
## numeric(0)
##
## $way
## numeric(0)
Looks like only one term ("people", associated with "meet") had an association of at least 0.4.
# or, just compute word strength associations
findAssocs(myTDM, "learn", 0.5)
## $learn
## numeric(0)
Looks like the word "learn" has no associations at the 0.5 level. See what happens if you lower it.
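For example (a sketch; your results will depend on your own text):
findAssocs(myTDM, "learn", 0.25) # a looser cutoff will surface more, weaker associations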
This is a quick and dirty version of sentiment analysis. There are many more ways of doing this (see the qdap package vignette). Here we take the cleaned character vector from earlier (i.e. tricia) and compare its sentiment across a grouping variable. Here I use Dat$Q38, which is gender.
library(qdap)
library(ggplot2)
poldat <- polarity(tricia, Dat$Q38) # polarity of each response, grouped by gender
plot(poldat)
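Beyond the plot, you can pull out the underlying numbers; here's a sketch assuming qdap's scores() and counts() generics (see ?polarity if your version differs):
scores(poldat) # average polarity by group (here, gender)
counts(poldat) # polarity scored response by response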
If you get stuck or have questions, you are not alone! We get stuck all the time, and we've learned that Google and Stack Overflow almost always have the answer. It just takes some diligence and hard work. We hope this helps!