In this exercise you will learn how to import data from an online survey (google forms) into R and to clean the data. For this exercise we will use the Global Physical Activity Questionnaire (GPAQ). The goal of the GPAQ is to estimate the amount of physical activity for a person in a normal week.
You can download the questionnaire here in German or English. The document is also available on Moodle.
Note:
We don’t expect you to come up with the solutions in this
exercise on your own. It is quite normal in such situations to ask
Google. But of course you can also just look at the solutions directly
and try to understand them.
There are many different and correct ways to solve the problems below. I show the solution which is the easiest one for me. However, every method which leads to the same result is correct.
Of course we would not use google forms for a trial collecting sensitive data. However, the procedure described in this exercise is quite similar for other online survey tools.
You will need the following packages for this exercise:
library(rio)
library(tidyverse)
library(kableExtra)
Please read all tasks for step 1 before starting:
What are the main differences between the original GPAQ and the online survey I created? Was it a good idea to introduce these differences?
import()
function from the rio
package and the
link to the Google spreadsheet: https://docs.google.com/spreadsheets/d/1IlzXH5urKrDUjM2JDo-2XcoJnl44f2TR7kk5GYiTnw0/edit?resourcekey#gid=1387235461?names
or Google for help). Also create
factors for catergorical variables.The import of the data works like this:
<- import("https://docs.google.com/spreadsheets/d/1IlzXH5urKrDUjM2JDo-2XcoJnl44f2TR7kk5GYiTnw0/edit?resourcekey#gid=1387235461") pa
The dataset is in a wide format.
There is not a single solution for the names of the variables. However, they should be short and consistent. Consistent means that the type of information should always be at the same place. Here is my suggestion:
<- c("time", "IPA_work", "IPA_work_days", "IPA_work_minutes", "MPA_work", "MPA_work_days", "MPA_work_minutes",
names_new "MPA_transport", "MPA_transport_days", "MPA_transport_minutes", "IPA_hobby", "IPA_hobby_days", "IPA_hobby_minutes",
"MPA_hobby", "MPA_hobby_days", "MPA_hobby_minutes", "Sitting_hours")
<- names(pa) names_old
As you can see, I store all new names in a vector. I also stored all old names in a vector. In this case (which is a quite simple example), this makes it easy to create a code book:
<- data.frame(names_new, names_old)
codebook
library(tidyverse)
library(kableExtra)
%>%
codebook mutate(unit = stringr::str_match(names_new, '(.*)_(.*)')[, -1][,2]) %>%
mutate(unit = ifelse(unit == "work" | unit == "hobby" | unit == "transport", "Ja/Nein", unit)) %>%
kable() %>%
kable_styling(full_width = TRUE)
names_new | names_old | unit |
---|---|---|
time | Zeitstempel | NA |
IPA_work | Beinhaltet Ihre Arbeit instesive körperliche Aktivität, bei der Atmung und Puls mit einer Dauer von mindestens zehn Minuten stark zunehmen? | Ja/Nein |
IPA_work_days | An wie vielen Tagen in einer gewöhnlichen Woche betätigen Sie sich bei der Arbeit körperlich intensiv? | days |
IPA_work_minutes | Wie viel Zeit verbringen Sie an einem gewöhnlichen Tag beider Arbeit mit intensiver körperlicher Aktivität? Nennen Sie die Zeit in Minuten. | minutes |
MPA_work | Beinhaltet Ihre Arbeit moderate körperliche Aktivität, bei der Atmung und Puls leicht zunehmen, wie flottes Gehen oder Tragen leichter Lasten mit einer Dauer von mindestens zehn Minuten? | Ja/Nein |
MPA_work_days | An wie vielen Tagen in einer gewöhnlichen Woche fuhren Sie bei der Arbeit moderate körperliche Aktivität aus? | days |
MPA_work_minutes | Wie viel Zeit verbringen Sie an einem gewöhnlichen Tag beider Arbeit mit moderater körperlicher Aktivität? Nennen Sie die Zeit in Minuten | minutes |
MPA_transport | Gehen Sie zu Fuss oder fahren Sie mit dem Fahrrad, um voneinem Ort zum anderen zu kommen, mit einer Dauer vonmindestens zehn Minuten? | Ja/Nein |
MPA_transport_days | An wie vielen Tagen in einer gewöhnlichen Woche gehen Sie zu Fuss oder fahren Sie mit dem Fahrrad, um von einem Ortzum anderen zu kommen, mit einer Dauer von mindestens zehn Minuten? | days |
MPA_transport_minutes | Wie viel Zeit investieren Sie an einem gewöhnlichen Tag, um zu Fuss oder mit dem Fahrrad von einem Ort zum anderen zukommen? Nennen Sie die Zeit in Minuten | minutes |
IPA_hobby | Betreiben Sie in der Freizeit intensive körperliche Aktivitätoder Sport, bei dem Atmung und Puls stark zunehmen, wie laufen oder Fussball mit einer Dauer von mindestens zehn Minuten? | Ja/Nein |
IPA_hobby_days | An wie vielen Tagen in einer gewöhnlichen Woche betreiben Sie intensive körperliche Aktivität oder Sport in der Freizeit? | days |
IPA_hobby_minutes | Wie viel Zeit investieren Sie an einem gewöhnlichen Tag in der Freizeit in intensive körperliche Aktivität oder Sport? Nennen Sie die Zeit in Minuten | minutes |
MPA_hobby | Betreiben Sie in der Freizeit moderate körperliche Aktivität oder Sport, bei dem Atmung und Puls leicht zunehmen, wie flottes Gehen, Fahrrad fahren, Schwimmen oder Volleyball mit einer Dauer von mindestens zehn Minuten? | Ja/Nein |
MPA_hobby_days | An wie vielen Tagen in einer gewöhnlichen Woche betreiben Sie moderate körperliche Aktivität oder Sport in der Freizeit? | days |
MPA_hobby_minutes | Wie viel Zeit investieren Sie an einem gewöhnlichen Tag invder Freizeit in moderate körperliche Aktivität oder Sport? Nennen Sie die Zeit in Minuten | minutes |
Sitting_hours | Wie viel Zeit verbringen Sie an einem gewöhnlichen Tag mit Sitzen oder Ruhen? Nennen Sie die Zeit in Stunden | hours |
Finally, we define the new names and we define factors for the categorical variables:
names(pa) <- names_new
<- pa %>%
pa mutate(IPA_work = factor(IPA_work),
MPA_work = factor(MPA_work),
MPA_transport = factor(MPA_transport),
IPA_hobby = factor(IPA_hobby),
MPA_hobby = factor(MPA_hobby),
ID = 1:nrow(pa)) %>%
select(ID, time:Sitting_hours)
str(pa)
## 'data.frame': 48 obs. of 18 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ time : chr "31.05.2021 10:14:12" "02.06.2021 14:13:41" "02.06.2021 14:14:24" "02.06.2021 14:14:40" ...
## $ IPA_work : Factor w/ 2 levels "Ja","Nein": 2 2 2 2 2 1 1 2 2 1 ...
## $ IPA_work_days : int NA NA NA NA NA 3 5 NA NA 4 ...
## $ IPA_work_minutes : int NA NA NA NA NA 60 30 NA NA 15 ...
## $ MPA_work : Factor w/ 2 levels "Ja","Nein": 2 2 1 1 1 1 1 1 2 1 ...
## $ MPA_work_days : int NA NA 4 5 5 5 5 5 NA 5 ...
## $ MPA_work_minutes : int NA NA 200 240 60 240 240 60 NA 60 ...
## $ MPA_transport : Factor w/ 2 levels "Ja","Nein": 1 1 2 2 1 2 1 2 2 1 ...
## $ MPA_transport_days : int 7 7 NA NA 6 NA 5 NA NA 7 ...
## $ MPA_transport_minutes: int 25 30 NA NA 45 NA 150 NA NA 35 ...
## $ IPA_hobby : Factor w/ 2 levels "Ja","Nein": 1 1 1 1 1 2 1 1 2 1 ...
## $ IPA_hobby_days : int 4 4 2 2 2 NA 2 1 NA 4 ...
## $ IPA_hobby_minutes : int 120 120 20 30 30 NA 40 30 NA 50 ...
## $ MPA_hobby : Factor w/ 2 levels "Ja","Nein": 1 1 1 1 1 1 1 1 1 1 ...
## $ MPA_hobby_days : int 7 7 2 5 5 3 1 1 3 1 ...
## $ MPA_hobby_minutes : int 120 120 30 40 30 120 30 30 60 40 ...
## $ Sitting_hours : int 10 15 5 19 4 10 5 4 4 3 ...
Now it is time to clean the data. The WHO provides a guide how data should be cleaned. You can access the whole guide here.
For this exercise, we focus on this table (page 9):
Important Note: It is always a good idea to look at the data before starting with data cleaning. With a few plots, you can get a quick overview and you easily see, for example, extreme or implausible values. However, we skip this step here. You will see at the end of this exercise, how to create plots.
The first row of the table above is not relevant for us as we only have answers in minutes, but the other criteria should be checked. Check, if there are observations which meet the criteria on the left side of the table (If…). If yes, take action according to the recommendation on the right side of the table (Then…).
Is there at least one “sub-domain” (vigorous work, moderate work, transport, vigorous recreation, or moderate recreation activity) in which the value of hours+minutes >16 hours? As we have only minutes, we look for values > 16*60 = 960 minutes. There are probably > 100 different ways to find this out. Here is one possible way:
%>%
pa filter_at(vars(contains("minutes")), any_vars(. > 960)) %>% # We filter all variables which contain "minutes" in the name and of all these variables the lines in which is a value > 960
select(ID, contains("minutes"))
## ID IPA_work_minutes MPA_work_minutes MPA_transport_minutes IPA_hobby_minutes
## 1 45 NA NA 50 90
## MPA_hobby_minutes
## 1 1000
We see that observation 45 has a value > 960. We see, that this
person reported 1000 minutes for MPA_hobby_minutes
. We
store the ID of the observation(s) with values > 960 in a new data
frame (max_val
).
<- pa %>%
max_val filter_at(vars(contains("minutes")), any_vars(. > 960)) %>%
select(ID) # We store the ID of the observation(s) with values > 960 in a new data frame
Did a respondent report implausible values (eg., >7 days in any days column)? We can check this in the same way as above:
%>%
pa filter_at(vars(contains("days")), any_vars(. > 7)) %>% # We filter all variables which contain "days" in the name and of all these variables the lines in which is a value > 7
select(ID, contains("days"))
## ID IPA_work_days MPA_work_days MPA_transport_days IPA_hobby_days
## 1 27 NA 4 8 3
## 2 46 1 5 8 NA
## MPA_hobby_days
## 1 3
## 2 NA
We see that there are observations with a value > 7. We store the ID(s) of the observation(s) with values > 7 in a new data frame.
<- pa %>%
implausible_val filter_at(vars(contains("days")), any_vars(. > 7)) %>%
select(ID) # We store the ID of the observation(s) with values > 7 in a new data frame
Did a respondent have inconsistent answers (eg., 0 days, but values >0 in the corresponding time variables)? We can check this in the same way as above:
%>%
pa filter_at(vars(contains("days")), any_vars(. == 0)) %>% # We filter all variables which contain "days" in the name and of all these variables the lines in which is a value = 0
filter_at(vars(contains("minutes")), any_vars(. > 0)) %>%
select(ID, contains("days") | contains("minutes"))
## ID IPA_work_days MPA_work_days MPA_transport_days IPA_hobby_days
## 1 47 NA NA 0 NA
## MPA_hobby_days IPA_work_minutes MPA_work_minutes MPA_transport_minutes
## 1 0 NA NA 40
## IPA_hobby_minutes MPA_hobby_minutes
## 1 NA 45
We see that observation with ID 47 reported two times 0 days but > 0 minutes. We store the ID of the observation(s) with values = 0 in a new data frame.
<- pa %>%
inconsistent_val filter_at(vars(contains("days")), any_vars(. == 0)) %>%
filter_at(vars(contains("minutes")), any_vars(. > 0)) %>%
select(ID) # We store the ID of the observation(s) with values = 0 in a new data frame
We now have to remove the observations with the ID numbers stored in
max_val
, implausible_val
and
inconsistent_val.
First, we combine all ID’s in a single
vector:
<- c(max_val$ID, implausible_val$ID, inconsistent_val$ID)
to_remove to_remove
## [1] 45 27 46 47
It is important to check if an ID occurs two or more times. It could
be that a participant is excluded for several reasons; nevertheless, we
exclude this person only once, of course. We can use the
unique()
function to make sure that no ID is represented
multiple times:
<- unique(to_remove)
to_remove to_remove
## [1] 45 27 46 47
Now we can remove the observations with the IDs above. As you can see
below, they disappeared. Keep in mind that from now on, we have to use
the data frame pa2
and not pa
anymore.
<- subset(pa, !(ID %in% to_remove))
pa2 tail(pa2[, 1:5])
## ID time IPA_work IPA_work_days IPA_work_minutes
## 40 40 02.06.2021 14:17:31 Nein NA NA
## 41 41 02.06.2021 14:17:41 Ja 2 60
## 42 42 02.06.2021 14:17:49 Ja 2 15
## 43 43 02.06.2021 14:18:26 Nein NA NA
## 44 44 02.06.2021 14:23:24 Nein NA NA
## 48 48 04.01.2023 09:09:40 Nein NA NA
Finally we have to check if there are missing values for days or
minutes (see last criteria in the table above). If this is the case, we
need to replace the NAs with the value 0. We do this for all three
sub-domains using the ifelse()
function:
Note: NAs are actually not possible in this online survey as the respondent is forced to give a response once the first question has been answered with “Ja”. Nevertheless, it is shown how this could be done.
<- pa2 %>%
pa2 mutate(IPA_work_minutes = ifelse(IPA_work == "Ja" & is.na(IPA_work_minutes), 0, IPA_work_minutes),
MPA_work_minutes = ifelse(MPA_work == "Ja" & is.na(MPA_work_minutes), 0, MPA_work_minutes),
MPA_transport_minutes = ifelse(MPA_transport == "Ja" & is.na(MPA_transport_minutes), 0, MPA_transport_minutes),
IPA_hobby_minutes = ifelse(IPA_hobby == "Ja" & is.na(IPA_hobby_minutes), 0, IPA_hobby_minutes),
MPA_hobby_minutes = ifelse(MPA_hobby == "Ja" & is.na(MPA_hobby_minutes), 0, IPA_hobby_minutes))
In this step, we will calculate the actual amount of physical activity per person per week (this is what we actually want to find out with the GPAQ). To do so, we again follow the analysis guidelines (page 14). See page 3 for information on MET if you don’t what this is.
First we create new variables containing the MET-minutes for each sub-domain according to the weight given in the table above. Then we sum all these values up to get the total MET-minutes per week. Finally, we create a dichotomous variable indicating if the person meets the WHO recommendation.
<- pa2 %>%
pa2 mutate(MET_IPA_work = IPA_work_minutes * IPA_work_days * 8,
MET_MPA_work = MPA_hobby_minutes * MPA_work_days * 4,
MET_MPA_transport = MPA_transport_minutes * MPA_transport_days * 4,
MET_IPA_hobby = IPA_hobby_minutes * IPA_hobby_days * 8,
MET_MPA_hobby = MPA_hobby_minutes * MPA_hobby_days * 4)
<- pa2 %>%
pa2 mutate(MET_total = rowSums(.[19:23], na.rm = TRUE),
MET_bin = ifelse(MET_total >= 600, "Yes", "No"))
%>%
pa2 select(contains("MET")) %>%
slice(1:6) %>%
kable() %>%
kable_styling()
MET_IPA_work | MET_MPA_work | MET_MPA_transport | MET_IPA_hobby | MET_MPA_hobby | MET_total | MET_bin |
---|---|---|---|---|---|---|
NA | NA | 700 | 3840 | 3360 | 7900 | Yes |
NA | NA | 840 | 3840 | 3360 | 8040 | Yes |
NA | 320 | NA | 320 | 160 | 800 | Yes |
NA | 600 | NA | 480 | 600 | 1680 | Yes |
NA | 600 | 1080 | 480 | 600 | 2760 | Yes |
1440 | NA | NA | NA | NA | 1440 | Yes |
Now we finally can analyse the data! For analysis, data is often
required in a long format (e.g. for regression or for plots). In this
last exercise, you will learn to use the pivot_longer()
function to transform datasets from wide to long.
IPA_work
,
MPA_work
, MPA_transport
,
IPA_hobby
, MPA_hobby
and one variable with the
name of the corresponding sub-domains.MET_IPA_work
,
MET_MPA_work
, MET_MPA_transport
,
MET_IPA_hobby
, MET_MPA_hobby
and
MET_total
and one variable with the name of the
corresponding sub-domains.In the first graph we see the frequency for each sub-domain.
<- pa2 %>%
pa2_domains_long select(IPA_work, MPA_work, MPA_transport, IPA_hobby, MPA_hobby) %>%
pivot_longer(everything(), names_to = "Domain", values_to = "Response")
ggplot(pa2_domains_long, aes(x = Response, fill = Domain)) + geom_bar(position = "dodge")
In the second graph, we see the distribution of MET-minutes per sub-domain and in total:
<- pa2 %>%
pa2_MET_long select(contains("MET")) %>%
pivot_longer(MET_IPA_work:MET_total, names_to = "Domain", values_to = "MET_minutes")
ggplot(pa2_MET_long, aes(x = Domain, y = MET_minutes)) + geom_boxplot() +
theme(axis.text.x = element_text(angle = 40, hjust=1)) +
scale_fill_discrete(name = "Meet WHO recommendation")