Load packages and data
To get started, load in the needed packages: {tidyverse}, {here}, and {esquisse}.
Now, let’s read the dataset into your RStudio environment. The data frame you import should have 142 rows and 9 columns. Remember to use the here() function to build project-relative paths.
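A minimal sketch of the import step, assuming the CSV sits in a data folder at the project root. The file name physical_activity.csv is a placeholder (the actual file name is not given here); pa_raw is the object name used by the cleaning code later in this assignment.

```r
library(tidyverse)
library(here)

# Placeholder file name: replace it with the CSV in your project's data folder
data_path <- here("data", "physical_activity.csv")

# Read the file only if it exists, so this sketch runs in any project
if (file.exists(data_path)) {
  pa_raw <- read_csv(data_path)
  dim(pa_raw) # number of rows, then number of columns
}
```

Because here() builds the path from the project root, the same code works regardless of the current working directory.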
## Rows: 142 Columns: 9
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): sex, status, bmi, sedentary_ap_s_day, light_ap_s_day, mvpa_s_day, o...
## dbl (2): personal_id, household_id
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Next, view your dataset and compare it to the variable definitions (open the “00_info_about_the_dataset” file in the “data” folder).
Clean data with {dplyr}
Step 1: Examine and clean dataset (Instructor demo)
Before jumping into wrangling or plotting, let’s think about the types of variables in our dataset. Take note of which variables should be numeric and which should be factors.
Now let’s check that R has classified these variables correctly. You can check the data classes assigned to each variable with summary() or glimpse().
## personal_id household_id sex status
## Min. : 1.00 Min. : 1.0 Length:142 Length:142
## 1st Qu.: 36.25 1st Qu.: 9.0 Class :character Class :character
## Median : 71.50 Median :19.0 Mode :character Mode :character
## Mean : 71.50 Mean :19.8
## 3rd Qu.:106.75 3rd Qu.:30.0
## Max. :142.00 Max. :40.0
## bmi sedentary_ap_s_day light_ap_s_day mvpa_s_day
## Length:142 Length:142 Length:142 Length:142
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## oms_recommendation
## Length:142
## Class :character
## Mode :character
##
##
##
Notice that some of your variables are of class “character”, but they should be numeric (e.g. bmi). This is because those variables contain some words in addition to numbers. Can you spot those words when you view the dataset?
Other variables of class “character” should be factors. Lastly, some variables that are “numeric” should be integers.
We need to change these variables to the correct type, because this will be essential for further manipulations and for plotting!
We can use mutate() to convert the variables into the right type.
pa_clean <- pa_raw %>%
  mutate(
    personal_id = as.integer(personal_id),
    household_id = as.integer(household_id),
    sex = as.factor(sex),
    status = as.factor(status),
    bmi = as.numeric(bmi),
    sedentary_ap_s_day = as.numeric(sedentary_ap_s_day),
    light_ap_s_day = as.numeric(light_ap_s_day),
    mvpa_s_day = as.numeric(mvpa_s_day),
    oms_recommendation = as.factor(oms_recommendation)
  )
## Warning: There were 4 warnings in `mutate()`.
## The first warning was:
## i In argument: `bmi = as.numeric(bmi)`.
## Caused by warning:
## ! NAs introduced by coercion
## i Run `dplyr::last_dplyr_warnings()` to see the 3 remaining warnings.
Note: You may get a warning message about your numeric columns. It tells you that values that could not be converted to numbers were replaced with NA. This is expected here, so you can disregard the message.
Now that you have converted them, use summary() or glimpse() again to check that the data classes are correct.
Notice that in your new numeric columns, the text that used to be there has been replaced with NA (this is what the warning message was about).
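To see where the warning comes from, here is a small illustration on a made-up vector. The text "not recorded" is a hypothetical placeholder and not necessarily what appears in the dataset. The sketch also shows one way to list the entries that will become NA before you convert.

```r
# Made-up values: two numbers stored as text and one text entry
bmi_raw <- c("21.5", "not recorded", "18.2")

# Entries that are present but cannot be parsed as numbers
# (the coercion warning is suppressed here to keep the output clean)
not_numeric <- is.na(suppressWarnings(as.numeric(bmi_raw))) & !is.na(bmi_raw)
bmi_raw[not_numeric]

# Converting turns those entries into NA; without suppressWarnings()
# this is the call that raises "NAs introduced by coercion"
bmi_num <- suppressWarnings(as.numeric(bmi_raw))
bmi_num
```

Running the same check on a real column before converting it is a quick way to confirm that only non-numeric text is being turned into NA.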
Step 2: Convert the physical activity variables
There are three physical activity variables: sedentary behavior (sedentary_ap_s_day), light physical activity (light_ap_s_day), and moderate to vigorous physical activity (mvpa_s_day). All three are currently measured in seconds per day.
However, the WHO recommendations for physical activity are in minutes per week, so we want to align with these measures. To do this, complete the following manipulations:
Convert these numerical variables from seconds/day to minutes/week. (Hint: use mutate() to create new variables that are in minutes per week. As a reminder, 60 seconds = 1 minute and 7 days = 1 week.)
Remove the previous seconds per day variables. (Hint: you can use ends_with() within the select() function.)
# Convert seconds/day to minutes/week
pa_clean1 <- pa_clean %>%
  mutate(
    sedentary_min_week = sedentary_ap_s_day * 7 / 60,
    light_min_week = light_ap_s_day * 7 / 60,
    mvpa_min_week = mvpa_s_day * 7 / 60
  ) %>%
  # Remove previous seconds/day variables
  select(-ends_with("_s_day"))
Lastly, rename oms_recommendation to who_recommendation (OMS is French for WHO).
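One possible way to do the renaming is with rename() from {dplyr}. The sketch below runs on a toy stand-in for pa_clean1 with made-up values, and the output name pa_clean2 is an assumption.

```r
library(dplyr)

# Toy stand-in for pa_clean1 (made-up values)
pa_clean1 <- tibble(
  personal_id = 1:2,
  oms_recommendation = factor(c("Yes", "No"))
)

# rename() takes new_name = old_name
pa_clean2 <- pa_clean1 %>%
  rename(who_recommendation = oms_recommendation)

names(pa_clean2)
```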
Step 3: Combine physical activity variables
Create a new column that adds light physical activity to moderate to vigorous physical activity. This gives the total amount of physical activity in minutes per week.
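A minimal sketch of this step on toy data with made-up values. The column name total_pa_min_week and the object names pa_clean2 and pa_clean3 are assumptions; the input columns follow the names created in Step 2.

```r
library(dplyr)

# Toy stand-in for the data after Step 2 (made-up values)
pa_clean2 <- tibble(
  light_min_week = c(1200, 950, NA),
  mvpa_min_week = c(300, 150, 200)
)

# Add the two activity columns row by row
pa_clean3 <- pa_clean2 %>%
  mutate(total_pa_min_week = light_min_week + mvpa_min_week)

pa_clean3$total_pa_min_week
```

Note that the total is NA for any row where either input is NA, which is usually the safer behavior when one of the measurements is missing.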
Step 4: Replace values in selected columns
The sex variable uses the letter “M” for male and “F” for female. Replace values in that column to say “Male” and “Female” instead of “M” and “F”.
Now, transform the status variable to change “Aldulte” to “Adult” and “Enfant” to “Child”.
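Since sex and status were converted to factors in Step 1, one option is fct_recode() from {forcats} (loaded with the tidyverse). This is a sketch on toy data: the level spellings are copied from the text above and may differ in the actual dataset, so check levels() first. The intermediate names pa_clean3 and pa_clean4 are assumptions; pa_clean5 is the name used by the plotting code later on.

```r
library(dplyr)
library(forcats)

# Toy stand-in for the data after Step 3 (made-up rows)
pa_clean3 <- tibble(
  sex = factor(c("M", "F", "F")),
  status = factor(c("Aldulte", "Enfant", "Aldulte"))
)

# Check the exact spellings first: fct_recode() matches levels literally
levels(pa_clean3$sex)
levels(pa_clean3$status)

# fct_recode() takes "new level" = "old level"
pa_clean4 <- pa_clean3 %>%
  mutate(sex = fct_recode(sex, "Male" = "M", "Female" = "F"))

pa_clean5 <- pa_clean4 %>%
  mutate(status = fct_recode(status, "Adult" = "Aldulte", "Child" = "Enfant"))

pa_clean5
```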
Visualize data with {esquisse}
Plot 1: Histogram
Histograms are used to visualize the distribution of a single numeric (continuous) variable. Choose a variable from the dataset that is suitable for a histogram.
Using esquisse, create your histogram.
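As a reminder of how to open the esquisse interface, here is a small sketch. It uses a toy data frame with made-up values (pa_toy is a hypothetical name); in the assignment you would pass your cleaned data frame instead.

```r
# Toy data frame with made-up values, standing in for the cleaned dataset
pa_toy <- data.frame(
  bmi = c(17.8, 21.4, 24.9, 19.3),
  status = c("Child", "Adult", "Adult", "Child")
)

# esquisser() opens an interactive window, so only launch it in an
# interactive session where the package is installed
if (interactive() && requireNamespace("esquisse", quietly = TRUE)) {
  esquisse::esquisser(pa_toy)
}
```

Once the window is open, drag the numeric variable onto the x axis, pick the histogram geometry, and copy the generated code into your script.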
ggplot(pa_clean5) +
  aes(x = bmi) +
  geom_histogram(bins = 30L, fill = "#112446") +
  labs(title = "Histogram ", subtitle = "BMI") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  )
## Warning: Removed 3 rows containing non-finite values (`stat_bin()`).
Note: Sometimes code generated by esquisse can have unwanted filtering. This is due to a bug in the package. Double-check your code and remove any extra filter() operations.
Plot 2: Boxplot
Boxplots are used to visualize the distribution of a numeric variable, split by the values of a discrete variable.
Use esquisse to create a boxplot to show the distribution of a numeric variable (the same one used for your histogram) and plot it against a categorical variable from the dataset.
Set fill color to match the values of your categorical variable.
ggplot(pa_clean5) +
  aes(x = status, y = bmi, fill = status) +
  geom_boxplot() +
  scale_fill_hue(direction = 1) +
  labs(
    title = "Boxplot",
    subtitle = "the distribution of the \"bmi\" variable (numeric) against the \"status\" variable (categorical)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  )
## Warning: Removed 3 rows containing non-finite values (`stat_boxplot()`).
Wrap up
That’s it for this assignment! We will choose 2-3 people to present their work during the workshop. If you would like to share your results with the class, please let an instructor know.
Finished early? Try some new manipulations below.