Processing Survey Data

Statistical Modeling: A Fresh Approach

Daniel Kaplan

Congratulations! Your survey spreadsheet has been filling in. Now it's time to analyze your data.

You already have the modeling skills needed to draw conclusions from your survey data. But before you can build models, you have to put the survey information into a suitable form. This is not difficult, but it can be a bit tedious. The good news is that you can set up the computer to do this for you.

To get started, let's work with a bit of data that we will share, coming from a Google Form. You'll read in your own data in the same way, but for demonstration, use this example data.

First, look at the questions that were asked and the responses. These are meant to be similar to typical questions and responses in the student forms.

Read in the “Raw” Data from Google

You'll need the “publish as CSV” link from Google. Give this to fetchGoogle() and you can read in your data.

myCSVLink = ""
d = fetchGoogle(myCSVLink)
### Put in NA for empty or blank answers
for (k in seq_along(d)) {
    temp = d[[k]]
    temp[temp %in% c("", " ", "  ", "   ")] = NA
    d[[k]] = temp
}
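
To see what the blank-to-NA step does, here is a minimal sketch on a made-up data frame. The names toy and blank_to_na are hypothetical, and the sketch assumes the columns are character strings; a column stored as a factor would need as.character() first. It trims whitespace rather than listing each run of spaces, which is a small variation on the loop above.

```r
# Toy data standing in for a freshly read survey; the contents are made up.
toy = data.frame(
  Division = c("Science/Math", "", "Fine Arts", "  "),
  Web      = c("I don't need", " ", "", "Very important in my field"),
  stringsAsFactors = FALSE
)

# Treat any entry that is empty after trimming whitespace as missing.
blank_to_na = function(column) {
  if (is.character(column)) {
    column[which(trimws(column) == "")] = NA
  }
  column
}

# Apply to every column while keeping the data frame structure
toy[] = lapply(toy, blank_to_na)

colSums(is.na(toy))   # number of missing answers per question
```

Counting the NAs per column is a quick way to spot a question that many respondents skipped.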

Fix the Names

The names of variables generated by Google Form are too verbose.

origNames = names(d)
origNames
## [1] "Timestamp"                                                                                      
## [2] ""                                               
## [3] ""                                               
## [4] ""                                
## [5] ""                             
## [6] ""                
## [7] ""
## [8] ""

As you can see, there are 8 different variables. You're going to rename each of them. Use the following statements to do so:

names(d)[2] = "SkillLevel"
names(d)[3] = "Division"
names(d)[4] = "Web"

In each line, change the number to match the position of the variable, and choose a short, mnemonic name.


Use this style to complete the change of names, e.g.

names(d)[5] = "Spreadsheet"
# Put the rest of your commands here.

When you are done, check your work with this statement:

paste(origNames, names(d), sep = "-->>")
## [1] "Timestamp-->>Timestamp"                                                                                                                                                                            
## [2] ">>SkillLevel"                                                                                                                                    
## [3] ">>Division"                                                                                                                                      
## [4] ">>Web"                                                                                                                            
## [5] ">>Spreadsheet"                                                                                                                 
## [6] ">>"                                
## [7] ">>"
## [8] ">>"

Make sure that you're satisfied with the names. If not, go back and alter your name statements.
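
If the pasted strings are hard to scan, a two-column table can be easier to read. The following is only a sketch on a made-up data frame: the question wordings and the name toy are hypothetical, and your own form will produce different ones.

```r
# Hypothetical question wordings; check.names = FALSE keeps them as written.
toy = data.frame(
  "Timestamp" = c("1/2/2014 10:00", "1/2/2014 10:05"),
  "How would you rate your computing skill?" = c(2, 3),
  "Which division is your major in?" = c("Fine Arts", "Science/Math"),
  check.names = FALSE
)

origNames = names(toy)

# Several names can be changed in one statement by giving a range of positions
names(toy)[2:3] = c("SkillLevel", "Division")

# Side-by-side lookup table: one row per variable
data.frame(original = origNames, short = names(toy))
```

Keeping origNames around also gives you the full question text later, for example when labeling a graph.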

Quantitative Variables

You may have had some variables that are numeric, where the only options available to the user were numbers. Check these to make sure that they have been read in properly. If not, talk to your instructor to find out an appropriate way of correcting the situation. (We'll keep the examples of things that go wrong to create templates for future students.)

In our example data, only the SkillLevel variable is quantitative. Here are two simple checks:

with(d, class(SkillLevel))
## [1] "integer"
mean(~ SkillLevel, data = d)  # formula interface from the mosaic package
## [1] 2.326
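
If a variable that should be numeric comes back as "character" or "factor", it helps to find out which answers are causing the trouble. Here is a minimal base-R sketch on made-up answers; the names raw, converted, and offenders are hypothetical.

```r
# Hypothetical raw answers, as they might arrive from a form
raw = c("2", "3", "4+", NA, "none", "1")

# Try the conversion quietly; entries that cannot be read as numbers become NA
converted = suppressWarnings(as.numeric(raw))

# Answers that were present but could not be read as numbers
offenders = unique(raw[!is.na(raw) & is.na(converted)])
offenders

# A mean that skips the missing answers
mean(converted, na.rm = TRUE)
```

The entries listed in offenders are the ones you will need to translate by hand or discuss with your instructor.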

When things go wrong, it's often because some of the choices offered for the variable look like numbers but are not. For instance, suppose that you offer a choice of these options for a variable called HowMany: none, 0, 1, 2, 3, 4+

When such a column is read into R, the “none” and “4+” are treated as character strings, so the whole column is no longer numeric. Translating these into numbers requires somewhat specialized operations that aren't difficult but are certainly not among the learning objectives for this course. To illustrate, the following will do the job on the above:

temp = as.character(d$HowMany)
temp = sub("\\+", "", temp)
temp = sub("none", "0", temp)
d$HowMany = as.numeric(temp)
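
The as.character() step matters when the column has been read in as a factor. A small sketch on made-up answers shows why; the vector HowMany and the name clean are hypothetical, and fixed = TRUE is used here simply to avoid escaping the plus sign.

```r
# Hypothetical answers stored as a factor, as can happen when text is read in
HowMany = factor(c("none", "0", "3", "4+", "1"))

as.numeric(HowMany)   # internal level codes, NOT the answers themselves

# Convert to text first, then strip the non-numeric parts
temp = as.character(HowMany)
temp = sub("+", "", temp, fixed = TRUE)
temp = sub("none", "0", temp, fixed = TRUE)
clean = as.numeric(temp)
clean
```

Note that this treats “4+” as exactly 4, which understates any respondent who meant more than four. Whether that is acceptable depends on your question.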

Categorical Variables

Often, the levels produced by Google Forms are too verbose for convenience. After all, they were designed for another purpose: to be informative to a human completing your survey. It's helpful to change the names to be more convenient for display. To do this, construct a vector that tells what should be the new level for each existing level. You need to be careful to get the spelling exactly right. Also, make sure to list every possible level from your form, even if there are some that nobody selected in your survey.

require(plyr) # just need to do once, like require(mosaic)
levels(d$Division)
## [1] ""                        "Fine Arts"              
## [3] "Science/Math"            "Social Science Division"
# check that the spelling matches in the next line
newLevels = c("Fine Arts"="Art","Science/Math"="Sci",
              "Humanities"="Hum","Social Science Division"="SS")

Now you will assign these new levels to your variable:

d$Division = revalue(d$Division, newLevels)
## The following `from` values were not present in `x`: Humanities
d$Division = factor(d$Division, levels = newLevels)

This involves two commands for each variable. The first changes the names of the levels. The second does something a little more obscure. It makes sure that the full set of possible levels is available for graphics, models, etc.
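
To see why the full set of levels matters, here is a sketch on made-up responses. It uses base R's factor() with levels and labels as an alternative to revalue(), so it runs without plyr; the vector answers is hypothetical.

```r
newLevels = c("Fine Arts" = "Art", "Science/Math" = "Sci",
              "Humanities" = "Hum", "Social Science Division" = "SS")

# Hypothetical responses; nobody chose Humanities
answers = c("Science/Math", "Fine Arts", "Social Science Division",
            NA, "Science/Math")

# Old names go in `levels`, new names in `labels`, in matching order
Division = factor(answers, levels = names(newLevels),
                  labels = unname(newLevels))

levels(Division)
table(Division)   # "Hum" is listed, with a count of 0
```

Because “Hum” stays in the list of levels, a table or bar chart will show it as an empty category rather than silently leaving it out.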

You may also want to set the reference level explicitly. You can do this with a statement of this sort:

d$Division = relevel(d$Division, ref = "Hum")  # assign the result to keep the change
d$Division
##  [1] Sci  Art  SS   <NA> SS   SS   <NA> Sci  SS   Sci  Sci  Sci  Sci  Sci 
## [15] Sci  SS   Sci  SS   Sci  SS   Sci  SS   Sci  Sci  SS   Sci  SS   Sci 
## [29] Sci  Sci  SS   SS   SS   Sci  Sci  Sci  Sci  Sci  SS   SS   Sci  SS  
## [43] Sci  Sci  SS   Sci  <NA> Sci 
## Levels: Hum Art Sci SS

Make sure to get the spelling right!
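
A short sketch on a made-up factor shows both points: relevel() returns a new factor rather than changing the old one, and a misspelled level stops with an error. The names Division and result here are hypothetical stand-ins for your own variables.

```r
Division = factor(c("Sci", "Art", "SS", NA, "Sci"),
                  levels = c("Art", "Sci", "Hum", "SS"))

# relevel() returns a new factor; assign it to keep the change
Division = relevel(Division, ref = "Hum")
levels(Division)   # "Hum" now comes first

# A misspelled reference level produces an error rather than a guess
result = tryCatch(relevel(Division, ref = "hum"),
                  error = function(e) conditionMessage(e))
result
```

relevel() is meant for unordered factors only, so it does not apply to the ordinal variables described in the next section.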


Rename the levels of the “What keeps you from studying more about computing?” question to “Boring”, “Klutz”, “Useless”, “Busy”, “WillDo”. The original names are so complicated that you will do well to cut and paste the output of levels() to seed a new command, then add the remaining part of the command for each level, such as ="Boring".

Ordinal Variables

Many of the survey questions are on a Likert scale. You will want to simplify the names and also to tell R that there is a natural order. For example, the Web variable in our survey has a natural ordering.

Here's the renaming step:

likertLevels = c("I don't need"="None",
                 "Helps to know about it"="Little", 
                 "Somewhat important in my field"="Some",
                 "Very important in my field"="Very")
d$Web = revalue(d$Web, likertLevels)

When you construct the translation (here called likertLevels), make sure to order it in the natural way, from one end to the other. Here, the order goes “None”, “Little”, “Some”, “Very”. You often will be able to use the same translation across multiple variables.

Now, tell R that the variable is ordered:

d$Web = factor(d$Web, ordered = TRUE, levels = likertLevels)

When you look at the variable, you will see that the level ordering is as specified:

## [1] Some   None   Some   <NA>   Little Very  
## Levels: None < Little < Some < Very
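
Declaring the order is what lets R compare answers. The following is a minimal sketch on made-up, already shortened answers; the vectors answers and Web are hypothetical stand-ins for d$Web.

```r
likertLevels = c("I don't need" = "None",
                 "Helps to know about it" = "Little",
                 "Somewhat important in my field" = "Some",
                 "Very important in my field" = "Very")

# Hypothetical answers, already shortened
answers = c("Some", "None", "Some", NA, "Little", "Very")
Web = factor(answers, levels = unname(likertLevels), ordered = TRUE)

Web >= "Some"                # comparisons respect the declared order
as.integer(Web)              # position on the scale: None = 1 up to Very = 4
table(Web, useNA = "ifany")  # counts listed from None to Very
```

Without ordered = TRUE, a comparison such as Web >= "Some" would give NA with a warning, since an unordered factor has no notion of one level being greater than another.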