Workshop 5: Creating and transforming variables

JOHN OYAN NAIVEST

2023-08-10

Load packages and data

To get started, load in the needed packages: {tidyverse}, {here}, and {esquisse}.

pacman::p_load(tidyverse, here, esquisse)

Now, let’s read the dataset into your RStudio environment. The data frame you import should have 142 rows and 9 columns. Remember to wrap the file path in here() so that it is resolved relative to the project root.

pa_raw <- read_csv(here("data/obesity.csv"))
## Rows: 142 Columns: 9
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): sex, status, bmi, sedentary_ap_s_day, light_ap_s_day, mvpa_s_day, o...
## dbl (2): personal_id, household_id
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
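As a small illustration of how here() builds a project-relative path, the sketch below constructs the same path from separate folder and file components and checks that the file exists before reading it. The name csv_path is only illustrative.

```r
library(here)

# here() starts from the project root, so the same call works no matter
# which subfolder the script or R Markdown file is stored in.
csv_path <- here("data", "obesity.csv")  # same path as here("data/obesity.csv")

# Confirm the file is where we expect before trying to read it
file.exists(csv_path)
```

If file.exists() returns FALSE, check that you opened the RStudio project (.Rproj) and that the file sits in the data folder.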

Next, view your dataset and compare it to the variable definitions (open the “00_info_about_the_dataset” file in the “data” folder).

Clean data with {dplyr}

Step 1: Examine and clean dataset (Instructor demo)

Before jumping into wrangling or plotting, let’s think about the types of variables in our dataset. Take note of which variables should be numeric and which should be factors.

Now let’s check that R has classified these variables correctly. You can check the data classes assigned to each variable with summary() or glimpse().

summary(pa_raw)
##   personal_id      household_id      sex               status         
##  Min.   :  1.00   Min.   : 1.0   Length:142         Length:142        
##  1st Qu.: 36.25   1st Qu.: 9.0   Class :character   Class :character  
##  Median : 71.50   Median :19.0   Mode  :character   Mode  :character  
##  Mean   : 71.50   Mean   :19.8                                        
##  3rd Qu.:106.75   3rd Qu.:30.0                                        
##  Max.   :142.00   Max.   :40.0                                        
##      bmi            sedentary_ap_s_day light_ap_s_day      mvpa_s_day       
##  Length:142         Length:142         Length:142         Length:142        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  oms_recommendation
##  Length:142        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Notice that some of your variables are of class “character”, but they should be numeric (e.g. bmi). This is because those variables contain some words in addition to numbers. Can you spot those words when you view the dataset?
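If scrolling through the data is tedious, one possible way to list the non-numeric entries is to keep only the values that fail to convert to a number. The sketch below uses a toy tibble with made-up text values (they are assumptions, not the words in the workshop dataset); the same pipeline could be applied to pa_raw.

```r
library(dplyr)

# Toy data: the text entries here are hypothetical placeholders
toy <- tibble(bmi = c("21.5", "missing", "30.1", "unknown", NA))

# Keep values that are present but cannot be converted to a number,
# then count how often each one occurs
non_numeric <- toy %>%
  filter(!is.na(bmi), is.na(suppressWarnings(as.numeric(bmi)))) %>%
  count(bmi)

non_numeric
```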

Other variables of class “character” should be factors. Lastly, some variables that are “numeric” should be integers.

We need to change these variables to the correct type, because this will be essential for further manipulations and for plotting!

We can use mutate() to convert your variables into the right type.

pa_clean <- pa_raw %>%
  mutate(personal_id = as.integer(personal_id),
         household_id = as.integer(household_id),
         sex = as.factor(sex),
         status = as.factor(status),
         bmi = as.numeric(bmi),
         sedentary_ap_s_day = as.numeric(sedentary_ap_s_day),
         light_ap_s_day = as.numeric(light_ap_s_day),
         mvpa_s_day = as.numeric(mvpa_s_day),
         oms_recommendation = as.factor(oms_recommendation)
  )
## Warning: There were 4 warnings in `mutate()`.
## The first warning was:
## i In argument: `bmi = as.numeric(bmi)`.
## Caused by warning:
## ! NAs introduced by coercion
## i Run `dplyr::last_dplyr_warnings()` to see the 3 remaining warnings.

Note: You may get a warning message about your numeric columns. This is telling you that some NAs were introduced to the data. This is not a problem in this case, so you can disregard the message.

Now that you have converted them, use summary() or glimpse() again to check that the data classes are correct. Notice that in your new numeric columns, the text that used to be there has been replaced with NA (this is what the warning message was about).
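To see how many values became NA in each column, one option is to count the missing values with summarise() and across(). This is a minimal sketch on a toy tibble with invented numbers; with the workshop data you would pipe pa_clean into the same summarise() call.

```r
library(dplyr)

# Toy data standing in for the converted dataset (values are made up)
toy_clean <- tibble(
  bmi = c(21.5, NA, 30.1),
  mvpa_s_day = c(1200, 900, NA)
)

# Count the NAs in every column
na_counts <- toy_clean %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```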

Step 2: Convert the physical activity variables

There are 3 physical activity variables: sedentary behavior (sedentary_ap_s_day), light physical activity (light_ap_s_day), and moderate to vigorous physical activity (mvpa_s_day). All three are currently measured in seconds per day.

However, the WHO recommendations for physical activity are in minutes per week, so we want to align with these measures. To do this, complete the following manipulations:

  1. Convert these numerical variables from seconds/day to minutes/week. (Hint: use mutate() to create new variables that are in minutes per week. As a reminder, 60 seconds = 1 minute and 7 days = 1 week, so multiply by 7 and divide by 60.)

  2. Remove the previous seconds per day variables. (Hint: you can use ends_with() within the select() function.)

# Convert seconds/day to minutes/week
pa_clean1 <- pa_clean %>%
  mutate(
    sedentary_min_week = sedentary_ap_s_day * 7 / 60,
    light_min_week = light_ap_s_day * 7 / 60,
    mvpa_min_week = mvpa_s_day * 7 / 60
  ) %>%
  select(-ends_with("_s_day"))  # Remove previous seconds/day variables
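A quick way to convince yourself that the conversion is right is to try it on values you can check by hand. In the sketch below (toy numbers, not workshop data), 3600 seconds per day is 60 minutes per day, which should come out as 420 minutes per week.

```r
library(dplyr)

# Toy data: 3600 s/day = 60 min/day = 420 min/week
toy <- tibble(mvpa_s_day = c(3600, 1800, NA))

toy_week <- toy %>%
  mutate(mvpa_min_week = mvpa_s_day * 7 / 60) %>%  # seconds/day -> minutes/week
  select(-ends_with("_s_day"))                     # drop the seconds/day column

toy_week
```

Missing values stay missing after the conversion, which is the behavior we want here.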

Lastly, rename oms_recommendation to who_recommendation (OMS is French for WHO).

pa_clean2 <- pa_clean1 %>%
  rename(who_recommendation = oms_recommendation)

Step 3: Combine physical activity variables

Create a new column that adds light physical activity to moderate to vigorous physical activity. This should give us the total amount of activity in minutes per week.

pa_clean3 <- pa_clean2 %>%
  mutate(total_activity_min_week = light_min_week + mvpa_min_week)
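One thing to keep in mind when adding columns with +: if either value is NA, the sum is NA as well. The sketch below shows this on a toy tibble with invented numbers.

```r
library(dplyr)

# Toy data (made-up values) with a missing entry in each column
toy <- tibble(
  light_min_week = c(300, 250, NA),
  mvpa_min_week = c(150, NA, 90)
)

toy_total <- toy %>%
  mutate(total_activity_min_week = light_min_week + mvpa_min_week)

toy_total
```

Only the first row has a total, because the other two rows are each missing one of the two components.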

Step 4: Replace values in selected columns

The sex variable uses the letter “M” for male and “F” for female. Replace values in that column to say “Male” and “Female” instead of “M” and “F”.

pa_clean4 <- pa_clean3 %>%
  mutate(sex = ifelse(sex == "M", "Male", "Female"))

Now, transform the status variable to change “Adulte” to “Adult” and “Enfant” to “Child”.

pa_clean5 <- pa_clean4 %>%
  mutate(status = case_when(
    status == "Adulte" ~ "Adult",
    status == "Enfant" ~ "Child",
    TRUE ~ as.character(status)  # keep any other value; all outputs are character
  ))
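Note that ifelse() and case_when() as used here return character columns, so sex and status are no longer factors afterwards. If you would rather keep them as factors, one alternative (not the approach used above) is fct_recode() from {forcats}, which is loaded with the tidyverse. A minimal sketch on toy data:

```r
library(dplyr)
library(forcats)

# Toy data using the same codes as the workshop dataset
toy <- tibble(
  sex = factor(c("M", "F", "F")),
  status = factor(c("Adulte", "Enfant", "Adulte"))
)

# fct_recode() takes new_level = "old_level" pairs and keeps the factor class
toy_recoded <- toy %>%
  mutate(
    sex = fct_recode(sex, Male = "M", Female = "F"),
    status = fct_recode(status, Adult = "Adulte", Child = "Enfant")
  )

# Check the result
count(toy_recoded, sex, status)
```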

Visualize data with {esquisse}

Plot 1: Histogram

Histograms are used to visualize the distribution of a single numeric (continuous) variable. Choose a numeric variable from the dataset to plot as a histogram.

Using esquisse, create your histogram.

ggplot(pa_clean5) +
  aes(x = bmi) +
  geom_histogram(bins = 30L, fill = "#112446") +
  labs(title = "Histogram ", subtitle = "BMI") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold",
    hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  )
## Warning: Removed 3 rows containing non-finite values (`stat_bin()`).
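The warning above appears because rows where the plotted variable is NA cannot be binned, so they are dropped from the plot. If you prefer to make that step explicit, one option is to remove those rows yourself before plotting, for example with drop_na() from {tidyr}. A minimal sketch on toy data (invented values):

```r
library(dplyr)
library(tidyr)

# Toy data with two missing BMI values
toy <- tibble(bmi = c(21.5, NA, 30.1, NA, 25.0))

# How many rows would be dropped from the plot?
n_missing <- sum(is.na(toy$bmi))

# Keep only the rows with a BMI value
toy_complete <- toy %>% drop_na(bmi)

n_missing
nrow(toy_complete)
```

Passing toy_complete (or the equivalent for pa_clean5) to ggplot() would then produce the same histogram without the warning.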

Note: Sometimes code generated by esquisse can have unwanted filtering. This is due to a bug in the package. Double check your code and remove any extra filter() operations.

Plot 2: Boxplot

Boxplots are used to visualize the distribution of a numeric variable, split by the values of a discrete variable.

  1. Use esquisse to create a boxplot to show the distribution of a numeric variable (the same one used for your histogram) and plot it against a categorical variable from the dataset.

  2. Set fill color to match the values of your categorical variable.

ggplot(pa_clean5) +
  aes(x = status, y = bmi, fill = status) +
  geom_boxplot() +
  scale_fill_hue(direction = 1) +
  labs(
    title = "Boxplot",
    subtitle = "the distribution of the \"bmi\" variable (numeric) against the \"status\" variable (categorical)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold",
    hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  )
## Warning: Removed 3 rows containing non-finite values (`stat_boxplot()`).

Wrap up

That’s it for this assignment! We will choose 2-3 people to present their work during the workshop. If you would like to share your results with the class, please let an instructor know.

Finished early? Try some new manipulations below.

Bonus challenge (optional)