Introduction

In this exercise you will learn how to import data from an online survey (google forms) into R and to clean the data. For this exercise we will use the Global Physical Activity Questionnaire (GPAQ). The goal of the GPAQ is to estimate the amount of physical activity for a person in a normal week.

You can download the questionnaire here in German or English. The document is also available on Moodle.

Note:

We don’t expect you to come up with the solutions in this exercise on your own. It is quite normal in such situations to ask Google. But of course you can also just look at the solutions directly and try to understand them.
There are many different and correct ways to solve the problems below. I show the solution which is the easiest one for me. However, every method which leads to the same result is correct.
Of course we would not use google forms for a trial collecting sensitive data. However, the procedure described in this exercise is quite similar for other online survey tools.

You will need the following packages for this exercise:

library(rio)
library(tidyverse)
library(kableExtra)

Step 1

Tasks

Please read all tasks for step 1 before starting:

Have a look at the original GPAQ (German).
I have created a online survey based on the GPAQ using Google forms. Please fill in this questionnaire.
What are the main differences between the original GPAQ on paper and the online survey I created? What are the advantages and the disadventages of these differences?

Solutions

What are the main differences between the original GPAQ and the online survey I created? Was it a good idea to introduce these differences?

There are no specific examples regarding the activities in the online version. If this online survey would be used in research, examples should be provided to ensure good quality of the data.
The time for physical activity is recorded in minutes in the online version instead of hours and minutes (as in the paper version). For some people, it might be difficult to think in minutes (risk for errors). I did it this way because I was too lazy to configure two separate fields…
In a online survey, it is possible to define field-validation. For example, it is not possible to fill in 1.5 minutes. This may remind someone that minutes are asked and not hours. In general, field validation reduces unrealistic values and thus, time for data cleaning.
In the online survey, the respondent is forced to give an answer. This reduces missing values, but can lead to wrong responses.
On a paper, it is always possible to leave a comment. However, there is no such option in the online survey. I would always provide the possibility to leave comments (as there is no possibility to ask back).

Step 2

Tasks

Import the data from the online survey into R using the import() function from the rio package and the link to the Google spreadsheet: https://docs.google.com/spreadsheets/d/1IlzXH5urKrDUjM2JDo-2XcoJnl44f2TR7kk5GYiTnw0/edit?resourcekey#gid=1387235461
Look at the dataset. Is the dataset in a long or a wide format?
Obviously the names of the variables are not appropriate. In such a case, there is probably no r-function which is smart enough to replace the current names with appropriate names. Therefore, the names have to be changed manually. Which names would you suggest?
Create a code book.
Change the names of the variables. You can use the names() function to do this (use ?names or Google for help). Also create factors for catergorical variables.
There is no identification variable. Create an ID variable (just a row number from 1 to nrow).

Solutions

The import of the data works like this:

pa <- import("https://docs.google.com/spreadsheets/d/1IlzXH5urKrDUjM2JDo-2XcoJnl44f2TR7kk5GYiTnw0/edit?resourcekey#gid=1387235461")

The dataset is in a wide format.

There is not a single solution for the names of the variables. However, they should be short and consistent. Consistent means that the type of information should always be at the same place. Here is my suggestion:

names_new <- c("time", "IPA_work", "IPA_work_days", "IPA_work_minutes", "MPA_work", "MPA_work_days", "MPA_work_minutes",
           "MPA_transport", "MPA_transport_days", "MPA_transport_minutes", "IPA_hobby", "IPA_hobby_days", "IPA_hobby_minutes",
           "MPA_hobby", "MPA_hobby_days", "MPA_hobby_minutes", "Sitting_hours")   

names_old <- names(pa)

As you can see, I store all new names in a vector. I also stored all old names in a vector. In this case (which is a quite simple example), this makes it easy to create a code book:

codebook <- data.frame(names_new, names_old)

library(tidyverse)
library(kableExtra)
codebook %>% 
  mutate(unit = stringr::str_match(names_new, '(.*)_(.*)')[, -1][,2]) %>%
  mutate(unit = ifelse(unit == "work" | unit == "hobby" | unit == "transport", "Ja/Nein", unit)) %>% 
  kable() %>% 
  kable_styling(full_width = TRUE)

names_new	names_old	unit
time	Zeitstempel	NA
IPA_work	Beinhaltet Ihre Arbeit instesive körperliche Aktivität, bei der Atmung und Puls mit einer Dauer von mindestens zehn Minuten stark zunehmen?	Ja/Nein
IPA_work_days	An wie vielen Tagen in einer gewöhnlichen Woche betätigen Sie sich bei der Arbeit körperlich intensiv?	days
IPA_work_minutes	Wie viel Zeit verbringen Sie an einem gewöhnlichen Tag beider Arbeit mit intensiver körperlicher Aktivität? Nennen Sie die Zeit in Minuten.	minutes
MPA_work	Beinhaltet Ihre Arbeit moderate körperliche Aktivität, bei der Atmung und Puls leicht zunehmen, wie flottes Gehen oder Tragen leichter Lasten mit einer Dauer von mindestens zehn Minuten?	Ja/Nein
MPA_work_days	An wie vielen Tagen in einer gewöhnlichen Woche fuhren Sie bei der Arbeit moderate körperliche Aktivität aus?	days
MPA_work_minutes	Wie viel Zeit verbringen Sie an einem gewöhnlichen Tag beider Arbeit mit moderater körperlicher Aktivität? Nennen Sie die Zeit in Minuten	minutes
MPA_transport	Gehen Sie zu Fuss oder fahren Sie mit dem Fahrrad, um voneinem Ort zum anderen zu kommen, mit einer Dauer vonmindestens zehn Minuten?	Ja/Nein
MPA_transport_days	An wie vielen Tagen in einer gewöhnlichen Woche gehen Sie zu Fuss oder fahren Sie mit dem Fahrrad, um von einem Ortzum anderen zu kommen, mit einer Dauer von mindestens zehn Minuten?	days
MPA_transport_minutes	Wie viel Zeit investieren Sie an einem gewöhnlichen Tag, um zu Fuss oder mit dem Fahrrad von einem Ort zum anderen zukommen? Nennen Sie die Zeit in Minuten	minutes
IPA_hobby	Betreiben Sie in der Freizeit intensive körperliche Aktivitätoder Sport, bei dem Atmung und Puls stark zunehmen, wie laufen oder Fussball mit einer Dauer von mindestens zehn Minuten?	Ja/Nein
IPA_hobby_days	An wie vielen Tagen in einer gewöhnlichen Woche betreiben Sie intensive körperliche Aktivität oder Sport in der Freizeit?	days
IPA_hobby_minutes	Wie viel Zeit investieren Sie an einem gewöhnlichen Tag in der Freizeit in intensive körperliche Aktivität oder Sport? Nennen Sie die Zeit in Minuten	minutes
MPA_hobby	Betreiben Sie in der Freizeit moderate körperliche Aktivität oder Sport, bei dem Atmung und Puls leicht zunehmen, wie flottes Gehen, Fahrrad fahren, Schwimmen oder Volleyball mit einer Dauer von mindestens zehn Minuten?	Ja/Nein
MPA_hobby_days	An wie vielen Tagen in einer gewöhnlichen Woche betreiben Sie moderate körperliche Aktivität oder Sport in der Freizeit?	days
MPA_hobby_minutes	Wie viel Zeit investieren Sie an einem gewöhnlichen Tag invder Freizeit in moderate körperliche Aktivität oder Sport? Nennen Sie die Zeit in Minuten	minutes
Sitting_hours	Wie viel Zeit verbringen Sie an einem gewöhnlichen Tag mit Sitzen oder Ruhen? Nennen Sie die Zeit in Stunden	hours

Finally, we define the new names and we define factors for the categorical variables:

names(pa) <- names_new
pa <- pa %>% 
  mutate(IPA_work = factor(IPA_work),
         MPA_work = factor(MPA_work),
         MPA_transport = factor(MPA_transport),
         IPA_hobby = factor(IPA_hobby),
         MPA_hobby = factor(MPA_hobby),
         ID = 1:nrow(pa)) %>% 
  select(ID, time:Sitting_hours)

str(pa)

## 'data.frame':    48 obs. of  18 variables:
##  $ ID                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ time                 : chr  "31.05.2021 10:14:12" "02.06.2021 14:13:41" "02.06.2021 14:14:24" "02.06.2021 14:14:40" ...
##  $ IPA_work             : Factor w/ 2 levels "Ja","Nein": 2 2 2 2 2 1 1 2 2 1 ...
##  $ IPA_work_days        : int  NA NA NA NA NA 3 5 NA NA 4 ...
##  $ IPA_work_minutes     : int  NA NA NA NA NA 60 30 NA NA 15 ...
##  $ MPA_work             : Factor w/ 2 levels "Ja","Nein": 2 2 1 1 1 1 1 1 2 1 ...
##  $ MPA_work_days        : int  NA NA 4 5 5 5 5 5 NA 5 ...
##  $ MPA_work_minutes     : int  NA NA 200 240 60 240 240 60 NA 60 ...
##  $ MPA_transport        : Factor w/ 2 levels "Ja","Nein": 1 1 2 2 1 2 1 2 2 1 ...
##  $ MPA_transport_days   : int  7 7 NA NA 6 NA 5 NA NA 7 ...
##  $ MPA_transport_minutes: int  25 30 NA NA 45 NA 150 NA NA 35 ...
##  $ IPA_hobby            : Factor w/ 2 levels "Ja","Nein": 1 1 1 1 1 2 1 1 2 1 ...
##  $ IPA_hobby_days       : int  4 4 2 2 2 NA 2 1 NA 4 ...
##  $ IPA_hobby_minutes    : int  120 120 20 30 30 NA 40 30 NA 50 ...
##  $ MPA_hobby            : Factor w/ 2 levels "Ja","Nein": 1 1 1 1 1 1 1 1 1 1 ...
##  $ MPA_hobby_days       : int  7 7 2 5 5 3 1 1 3 1 ...
##  $ MPA_hobby_minutes    : int  120 120 30 40 30 120 30 30 60 40 ...
##  $ Sitting_hours        : int  10 15 5 19 4 10 5 4 4 3 ...

Step 3

Now it is time to clean the data. The WHO provides a guide how data should be cleaned. You can access the whole guide here.

For this exercise, we focus on this table (page 9):

Important Note: It is always a good idea to look at the data before starting with data cleaning. With a few plots, you can get a quick overview and you easily see, for example, extreme or implausible values. However, we skip this step here. You will see at the end of this exercise, how to create plots.

Tasks

The first row of the table above is not relevant for us as we only have answers in minutes, but the other criteria should be checked. Check, if there are observations which meet the criteria on the left side of the table (If…). If yes, take action according to the recommendation on the right side of the table (Then…).

Solultion

Is there at least one “sub-domain” (vigorous work, moderate work, transport, vigorous recreation, or moderate recreation activity) in which the value of hours+minutes >16 hours? As we have only minutes, we look for values > 16*60 = 960 minutes. There are probably > 100 different ways to find this out. Here is one possible way:

pa %>% 
  filter_at(vars(contains("minutes")), any_vars(. > 960)) %>%  # We filter all variables which contain "minutes" in the name and of all these variables the lines in which is a value > 960
  select(ID, contains("minutes"))

##   ID IPA_work_minutes MPA_work_minutes MPA_transport_minutes IPA_hobby_minutes
## 1 45               NA               NA                    50                90
##   MPA_hobby_minutes
## 1              1000

We see that observation 45 has a value > 960. We see, that this person reported 1000 minutes for MPA_hobby_minutes. We store the ID of the observation(s) with values > 960 in a new data frame (max_val).

max_val <- pa %>% 
  filter_at(vars(contains("minutes")), any_vars(. > 960)) %>% 
  select(ID) # We store the ID of the observation(s) with values > 960 in a new data frame

Did a respondent report implausible values (eg., >7 days in any days column)? We can check this in the same way as above:

pa %>% 
  filter_at(vars(contains("days")), any_vars(. > 7)) %>% # We filter all variables which contain "days" in the name and of all these variables the lines in which is a value > 7
  select(ID, contains("days"))

##   ID IPA_work_days MPA_work_days MPA_transport_days IPA_hobby_days
## 1 27            NA             4                  8              3
## 2 46             1             5                  8             NA
##   MPA_hobby_days
## 1              3
## 2             NA

We see that there are observations with a value > 7. We store the ID(s) of the observation(s) with values > 7 in a new data frame.

implausible_val <- pa %>% 
  filter_at(vars(contains("days")), any_vars(. > 7)) %>% 
  select(ID) # We store the ID of the observation(s) with values > 7 in a new data frame

Did a respondent have inconsistent answers (eg., 0 days, but values >0 in the corresponding time variables)? We can check this in the same way as above:

pa %>% 
  filter_at(vars(contains("days")), any_vars(. == 0)) %>% # We filter all variables which contain "days" in the name and of all these variables the lines in which is a value = 0
  filter_at(vars(contains("minutes")), any_vars(. > 0)) %>% 
  select(ID, contains("days") | contains("minutes"))

##   ID IPA_work_days MPA_work_days MPA_transport_days IPA_hobby_days
## 1 47            NA            NA                  0             NA
##   MPA_hobby_days IPA_work_minutes MPA_work_minutes MPA_transport_minutes
## 1              0               NA               NA                    40
##   IPA_hobby_minutes MPA_hobby_minutes
## 1                NA                45

We see that observation with ID 47 reported two times 0 days but > 0 minutes. We store the ID of the observation(s) with values = 0 in a new data frame.

inconsistent_val <- pa %>% 
  filter_at(vars(contains("days")), any_vars(. == 0)) %>% 
  filter_at(vars(contains("minutes")), any_vars(. > 0)) %>% 
  select(ID) # We store the ID of the observation(s) with values = 0 in a new data frame

We now have to remove the observations with the ID numbers stored in max_val, implausible_val and inconsistent_val. First, we combine all ID’s in a single vector:

to_remove <- c(max_val$ID, implausible_val$ID, inconsistent_val$ID)
to_remove

## [1] 45 27 46 47

It is important to check if an ID occurs two or more times. It could be that a participant is excluded for several reasons; nevertheless, we exclude this person only once, of course. We can use the unique() function to make sure that no ID is represented multiple times:

to_remove <- unique(to_remove)
to_remove

## [1] 45 27 46 47

Now we can remove the observations with the IDs above. As you can see below, they disappeared. Keep in mind that from now on, we have to use the data frame pa2 and not pa anymore.

pa2 <- subset(pa, !(ID %in% to_remove))
tail(pa2[, 1:5])

##    ID                time IPA_work IPA_work_days IPA_work_minutes
## 40 40 02.06.2021 14:17:31     Nein            NA               NA
## 41 41 02.06.2021 14:17:41       Ja             2               60
## 42 42 02.06.2021 14:17:49       Ja             2               15
## 43 43 02.06.2021 14:18:26     Nein            NA               NA
## 44 44 02.06.2021 14:23:24     Nein            NA               NA
## 48 48 04.01.2023 09:09:40     Nein            NA               NA

Finally we have to check if there are missing values for days or minutes (see last criteria in the table above). If this is the case, we need to replace the NAs with the value 0. We do this for all three sub-domains using the ifelse() function:

Note: NAs are actually not possible in this online survey as the respondent is forced to give a response once the first question has been answered with “Ja”. Nevertheless, it is shown how this could be done.

pa2 <- pa2 %>% 
  mutate(IPA_work_minutes = ifelse(IPA_work == "Ja" & is.na(IPA_work_minutes), 0, IPA_work_minutes), 
         MPA_work_minutes = ifelse(MPA_work == "Ja" & is.na(MPA_work_minutes), 0, MPA_work_minutes),
         MPA_transport_minutes = ifelse(MPA_transport == "Ja" & is.na(MPA_transport_minutes), 0, MPA_transport_minutes),
         IPA_hobby_minutes = ifelse(IPA_hobby == "Ja" & is.na(IPA_hobby_minutes), 0, IPA_hobby_minutes),
         MPA_hobby_minutes = ifelse(MPA_hobby == "Ja" & is.na(MPA_hobby_minutes), 0, IPA_hobby_minutes))

Step 4

In this step, we will calculate the actual amount of physical activity per person per week (this is what we actually want to find out with the GPAQ). To do so, we again follow the analysis guidelines (page 14). See page 3 for information on MET if you don’t what this is.

Tasks

Calculate the MET-minutes for each sub-domain.
Calculate the overall MET-minutes per week.
Create a dichotomous variable indicating if the person meets the WHO recommendation.

Solutions

First we create new variables containing the MET-minutes for each sub-domain according to the weight given in the table above. Then we sum all these values up to get the total MET-minutes per week. Finally, we create a dichotomous variable indicating if the person meets the WHO recommendation.

pa2 <- pa2 %>% 
  mutate(MET_IPA_work = IPA_work_minutes * IPA_work_days * 8,
         MET_MPA_work = MPA_hobby_minutes * MPA_work_days * 4,
         MET_MPA_transport = MPA_transport_minutes * MPA_transport_days * 4,
         MET_IPA_hobby = IPA_hobby_minutes * IPA_hobby_days * 8,
         MET_MPA_hobby = MPA_hobby_minutes * MPA_hobby_days * 4)

pa2 <- pa2 %>% 
  mutate(MET_total = rowSums(.[19:23], na.rm = TRUE),
         MET_bin = ifelse(MET_total >= 600, "Yes", "No"))

pa2 %>% 
  select(contains("MET")) %>% 
  slice(1:6) %>% 
  kable() %>% 
  kable_styling()

MET_IPA_work	MET_MPA_work	MET_MPA_transport	MET_IPA_hobby	MET_MPA_hobby	MET_total	MET_bin
NA	NA	700	3840	3360	7900	Yes
NA	NA	840	3840	3360	8040	Yes
NA	320	NA	320	160	800	Yes
NA	600	NA	480	600	1680	Yes
NA	600	1080	480	600	2760	Yes
1440	NA	NA	NA	NA	1440	Yes

Step 5

Now we finally can analyse the data! For analysis, data is often required in a long format (e.g. for regression or for plots). In this last exercise, you will learn to use the pivot_longer() function to transform datasets from wide to long.

Tasks

Create a data frame in the long format with two variables. One variables with all values (“Ja”, “Nein”) of IPA_work, MPA_work, MPA_transport, IPA_hobby, MPA_hobby and one variable with the name of the corresponding sub-domains.
Create a graph with barplots which shows the distribution of “Ja” & “Nein” per sub-domain.
Create another data frame in the long format with two variables. One variables with all values of MET_IPA_work, MET_MPA_work, MET_MPA_transport, MET_IPA_hobby, MET_MPA_hobby and MET_total and one variable with the name of the corresponding sub-domains.
Create a graph with boxplots which shows the distribution of MET-minutes per sub-domain.

Solutions

In the first graph we see the frequency for each sub-domain.

pa2_domains_long <- pa2 %>%
  select(IPA_work,  MPA_work, MPA_transport, IPA_hobby, MPA_hobby) %>% 
  pivot_longer(everything(), names_to = "Domain",  values_to = "Response")

ggplot(pa2_domains_long, aes(x = Response, fill = Domain)) + geom_bar(position = "dodge")

In the second graph, we see the distribution of MET-minutes per sub-domain and in total:

pa2_MET_long <- pa2 %>% 
  select(contains("MET")) %>% 
  pivot_longer(MET_IPA_work:MET_total, names_to = "Domain", values_to = "MET_minutes")

ggplot(pa2_MET_long, aes(x = Domain, y = MET_minutes)) + geom_boxplot() + 
  theme(axis.text.x = element_text(angle = 40,  hjust=1)) + 
  scale_fill_discrete(name = "Meet WHO recommendation")

Get and clean data from an online survey

Nathanael Lutz & Roger Hilfiker

Introduction

Step 1

Tasks

Solutions

Step 2

Tasks

Solutions

Step 3

Tasks

Solultion

Step 4

Tasks

Solutions

Step 5

Tasks

Solutions