Exploring the BRFSS dataset

Introduction

The following is a limited exploratory data analysis (EDA) of the BRFSS data set, prepared as an assignment for the Introduction to Probability and Data course offered by Duke University on Coursera link. The assignment requires each participant to pose three questions based on their interests, the underlying data, and the learning objectives of the module. Since the data-set is large (330 variables), the EDA focuses on the variables the author assumed to be relevant to the questions posed.

Setup

This document was produced with R version 3.4.4, and the markdown file was created in RStudio’s desktop client. For the complete code used to generate this document, please visit the git-hub link. For readability, the EDA focuses on summary statistics and the graphics generated from them, and omits some code output that is not required for grading the assignment.

Load packages

library(tidyverse)
library(gridExtra)
library(purrr)

Load data

load("brfss2013.RData")
br_data <- brfss2013; rm("brfss2013")

Part 1: Data

Per the CDC’s brief on the BRFSS 2013 data-set link, the Behavioral Risk Factor Surveillance System (BRFSS) project, and the resultant data-set, is a collaboration between the Centers for Disease Control and Prevention (CDC) and US states and participating territories. Data on health-related risk behaviors, chronic health conditions, and use of preventive services are collected in order to build better health promotion tools for US residents. The following aspects of the data-set are relevant to this project:

It is an observational study conducted over land-line and cellular telephones; hence associations between variables can be assessed, but causal inferences cannot be drawn unless advanced techniques link (beyond the scope of this project) are deployed.
- The study methodology recognizes that some individuals have both land-lines and cellular phones, and only selects people who either have a land-line or use their cellular phone as their primary phone more than 90 % of the time. There are randomization algorithms in place to ensure randomized sampling.
The sampling method and subsequent ‘raking’ methodology, per the study designers, render the data representative of the population; hence, the results of the observational study are generalizable.

There is a large proportion of NAs (about 41.87 %). In fact, there are 0 complete cases. While this may present a problem depending upon the variables being analysed, many survey questions do not pertain to every individual, and hence missing data is to be expected. The missing data may require a mitigation strategy; any such strategy is applied on a case-by-case basis below.
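The two figures quoted above (overall share of NAs and number of complete cases) can be obtained along the following lines. This is a minimal sketch on a toy data frame with made-up values; `toy` is a hypothetical stand-in for `br_data`, and the author's actual code may differ.

```r
# Toy data frame standing in for br_data (hypothetical values)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, "x", "y", NA),
  c = c(TRUE, NA, FALSE, TRUE)
)

# Share of all cells that are NA
prop_na <- mean(is.na(toy))

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(toy))

# Share of NA per column, useful when choosing variables to analyse
na_by_col <- colMeans(is.na(toy))

print(round(100 * prop_na, 2))
print(n_complete)
print(na_by_col)
```

Looking at the per-column shares first makes it easier to decide, variable by variable, whether NAs should be dropped or retained.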

The data-set contains 491775 rows (observations) and 330 columns (variables). A detailed description of the variables can be found in the BRFSS 2013 code-book here.

Part 2: Research questions

The research questions are motivated by a combination of currently topical subjects and the author’s personal curiosity, given his background in healthcare and his interest in nutrition.

What is the relationship, if any, between poor mental or physical health and health coverage? Do adults aged 65 or older (covered by Medicare) influence the relationship?

To answer this question the following variables will be used:

$POORHLTH is an integer variable (discrete) representing the number of days (in the last 30 days) on which participants reported that poor health (physical or mental) prevented them from doing their usual activities, such as self-care, work, or recreation.
$HLTHPLN1 is a factor variable with the levels Yes, No, i.e. does the respondent have any type of health care coverage?
$X_age65yr is a factor variable with the levels Age 18 to 64, Age 65 or older.

What is the relationship, if any, between mental health and alcohol consumption? Are there any gender differences?

To answer this question the following variables will be used:

$menthlth is an integer variable (discrete) representing the number of days (within the 30-day period prior to the interview) on which survey participants reported that they were not able to perform their usual activities due to issues related to their mental health or their emotional state.
- This variable will be used to create a custom categorical (Yes, No) variable $MHEpisode indicating the presence or absence of self-reported days.
$AVEDRNK2 is an integer variable (discrete) representing the average number of alcoholic beverages consumed by the respondent whenever they consumed an alcoholic beverage in the 30 days prior to the interview.
- This variable will also be used to create a custom categorical (Yes, No) variable $Drank_Alc, indicating whether the respondent did or did not consume alcohol in the 30 days prior to the interview.
$ALCDAY5 is an integer variable (discrete) representing the number of days per week or per month on which the survey respondent consumed an alcoholic beverage in the 30 days prior to the interview.
- alcday5 and avedrnk2 will be used to compute the total number of drinks (Total_Drinks) consumed by the respondent in the 30 days prior to the interview.

What is the relationship between BMI and Balanced Food consumption, if any? Balanced Food consumption will be measured by aggregating consumption of Fruits, Vegetables and Beans.

To answer this question the following variables will be used:

$X_BMI5CAT is a factor variable with the levels Underweight, Normal weight, Overweight, Obese. The cut-offs that determine membership in the weight classes, per the CDC code-book, are:
- Underweight: BMI < 18.5
- Normal weight: 18.5 <= BMI < 25
- Overweight: 25 <= BMI < 30
- Obese: BMI >= 30
$X_bmi5 is an integer variable (treated as continuous) holding the BMI calculated from participants’ reported height and weight.
$beanday_, $grenday_, $orngday_, $frutda1_ are numeric variables indicating the number of times per day that beans, dark green vegetables, orange colored vegetables, and fruits, respectively, were consumed.
- These variables will be aggregated (without weights) to create the variable $Balance_F, which is simply the sum of the above food intakes.
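To make the BMI cut-offs concrete, the sketch below assigns weight classes to a handful of made-up BMI values with `cut()`. The input values are hypothetical, and the treatment of the boundaries as left-closed intervals (for example, a BMI of exactly 25 counts as Overweight) is an assumption about how the categories are derived, not something read from the data-set. The values are stored as BMI times 100, mirroring the two implied decimal places of $X_bmi5.

```r
# Hypothetical X_bmi5-style values (BMI x 100, two implied decimals)
bmi5 <- c(1750, 1850, 2499, 2500, 2999, 3000, NA)

# Rescale to ordinary BMI units
bmi <- bmi5 / 100

# right = FALSE makes each interval closed on the left: [18.5, 25), [25, 30), ...
bmi_cat <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal weight", "Overweight", "Obese"),
  right = FALSE
)

print(data.frame(bmi, bmi_cat))
```

Missing BMI values stay missing in the derived factor, so no respondent is silently placed in a weight class.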

Part 3: Exploratory data analysis

Q.1 What is the relationship, if any, between poor mental or physical health and health coverage? Do adults aged 65 or older (covered by Medicare) influence the relationship?

The following is a summary of the three chosen variables, along with a density plot of the number of self-reported poor health days. Since $poorhlth has to be a number between 0 and 30, values over 30 and NA values have been discarded.

## Table 1.1.1

##            X_age65yr        PoorHealth     Coverage    
##  Age 18 to 64   :172245   Min.   : 0.000   Yes:215291  
##  Age 65 or older: 73711   1st Qu.: 0.000   No : 30665  
##                           Median : 0.000               
##                           Mean   : 5.276               
##                           3rd Qu.: 5.000               
##                           Max.   :30.000
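The filtering step described before Table 1.1.1 can be sketched as follows. The data frame is a toy example with made-up values, and dropping rows with missing coverage is an assumption based on the absence of NAs in the Coverage column of the table.

```r
# Hypothetical raw responses; the value above 30 stands in for a miscoded entry
toy <- data.frame(
  poorhlth = c(0, 0, 5, 30, 45, NA, 12),
  hlthpln1 = factor(c("Yes", "No", "Yes", "Yes", "No", "Yes", NA),
                    levels = c("Yes", "No"))
)

# Keep rows with a valid day count (0 to 30) and a known coverage status
valid <- !is.na(toy$poorhlth) &
  toy$poorhlth >= 0 & toy$poorhlth <= 30 &
  !is.na(toy$hlthpln1)
q1 <- toy[valid, ]

print(summary(q1$poorhlth))
print(table(q1$hlthpln1))
```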

Some Observations:

Data peaks at zero, with subsequent minor peaks on the round numbers 10, 15, and 20, and a final peak on day 30. This may be an artifact of participants ‘rounding’ the number of days they felt in poor health.
- There are minor (daily) peaks between day 0 - 5 in the Health Covered group that do not appear in the Not Covered group. Further investigation of the difference could provide interesting explanations; however, it is not explored in this report.
- The median of the $PoorHealth distribution is 0. In other words, at least half of the participants didn’t report being unwell in the 30 days prior to the interview.
- On average, respondents experienced ~ 5 days of ill health due to either physical or mental reasons in the 30 days before being interviewed.

For those who did report feeling unwell in the last 30 days, the mean and median are shown below.

## Table 1.1.2

## # A tibble: 2 x 3
##   Coverage  mean median
##   <fct>    <dbl>  <int>
## 1 Yes       12.2      7
## 2 No        12.3      7

The data show that those who reported feeling unwell averaged about 12 days of total illness in the 30 days prior to their interview call. The median total duration of illness was 7 days. There doesn’t seem to be any difference in mean or median based on the presence or absence of any healthcare coverage.
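A grouped summary like Table 1.1.2 can be sketched in base R as below. The data are made up for illustration, and the author's own pipeline (likely built on the tidyverse) may look different.

```r
# Hypothetical poor-health day counts by coverage status
toy <- data.frame(
  PoorHealth = c(0, 0, 2, 7, 30, 0, 5, 7, 15),
  Coverage = factor(c("Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"),
                    levels = c("Yes", "No"))
)

# Restrict to respondents who reported at least one day of poor health
unwell <- toy[toy$PoorHealth > 0, ]

# Mean and median number of days within each coverage group
means <- tapply(unwell$PoorHealth, unwell$Coverage, mean)
medians <- tapply(unwell$PoorHealth, unwell$Coverage, median)

print(cbind(mean = means, median = medians))
```

Note that the median is the middle value of the sorted day counts, which is not necessarily the most frequently reported value.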

The proportion table for the data is below.

## Table 1.1.3

##           Coverage
## PoorHealth        Yes         No
##     Unwell 0.37414416 0.05719316
##     Well   0.50117907 0.06748361

Interpretation of the proportion table:
- 37.41% of all individuals have healthcare coverage and reported an episode of ill health.
- 50.12% of all individuals have coverage and did not report any episode.
- 5.72% of all individuals do not have coverage and reported an episode of ill health.
- 6.75% of all individuals did not report an episode of ill health and do not have coverage.
- Of the total, 43.13% and 56.87% of people are labelled Unwell and Well respectively.
- Of the total, 87.53% and 12.47% of people do and do not have healthcare coverage respectively.
- 86.74% of those people labelled Unwell have some form of healthcare coverage. See Table 1.1.4.
- 88.13% of those who are labelled Well have some form of healthcare coverage. See Table 1.1.4.
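The marginal and conditional percentages in the list above all follow from the four joint proportions in Table 1.1.3. A short sketch of that arithmetic, using the table's values directly (the object names are illustrative only):

```r
# Joint proportions copied from Table 1.1.3
joint <- matrix(
  c(0.37414416, 0.05719316,
    0.50117907, 0.06748361),
  nrow = 2, byrow = TRUE,
  dimnames = list(PoorHealth = c("Unwell", "Well"),
                  Coverage = c("Yes", "No"))
)

# Marginal shares: row sums give Unwell / Well, column sums give coverage
health_margin <- rowSums(joint)
coverage_margin <- colSums(joint)

# Conditional shares: coverage within each health group (rows sum to 1)
cond_on_health <- prop.table(joint, margin = 1)

print(round(100 * health_margin, 2))
print(round(100 * coverage_margin, 2))
print(round(100 * cond_on_health, 2))
```

The first column of `cond_on_health` corresponds to the two sample estimates reported by the test of proportions in Table 1.1.4.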

## Table 1.1.4

## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  testTable1
## X-squared = 107.05, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.01658528 -0.01126320
## sample estimates:
##    prop 1    prop 2 
## 0.8674050 0.8813293

The difference in proportions could be due to:

Randomness
A genuine effect of health coverage
Some other factors not considered in the above analysis, such as
- age or the presence of chronic disease, to name a few.

Individuals in the US aged 65 or older are generally eligible for Medicare, and hence separating the data by those above 65 years of age may be useful in understanding the relationship between episodes of ill health in the previous 30 day period and some form of healthcare coverage.

Are Health Coverage and Poor Health independent? Does Medicare (Age > 65 Yrs.) influence the relationship?

Interpretation of Mosaic Plots:

Boxes are shaded according to the disproportionate influence of any residual. “Cells representing negative residuals are drawn in shaded of red and with broken borders; positive ones are drawn in blue with solid borders.” R Documentation for mosaicplot()
The size of each box corresponds to the proportion of the data in that cell.
The influence of Medicare coverage on the data is clear if one compares Plot B to Plot C (the boxes representing ‘No’).
When considered in isolation, co-variation between Coverage and ill health episodes seems to be weaker in adults over the age of 65 and stronger in adults below 65 years.
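The shading logic of a mosaic plot can be illustrated with the Pearson residuals of a chi-squared test: cells with a positive residual hold more observations than expected under independence, cells with a negative residual hold fewer. The counts below are an assumption: they are back-calculated from the proportions in Table 1.1.3 and the sample size implied by Table 1.1.1 (245956), and cover all adults only.

```r
# Counts reconstructed from Table 1.1.3 proportions x 245956 (assumption)
counts <- matrix(
  c(92023, 14067,
    123268, 16598),
  nrow = 2, byrow = TRUE,
  dimnames = list(PoorHealth = c("Unwell", "Well"),
                  Coverage = c("Yes", "No"))
)

fit <- chisq.test(counts, correct = FALSE)

# Expected counts if PoorHealth and Coverage were independent
print(round(fit$expected))

# Pearson residuals: (observed - expected) / sqrt(expected)
print(round(fit$residuals, 2))

# Sign of each residual, i.e. the direction a shaded mosaic would show
print(sign(fit$residuals))
```

Calling `mosaicplot(counts, shade = TRUE)` on the same matrix would draw the corresponding shaded plot.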

Is the difference in proportions significantly greater?

## Table 1.1.5

##                 Independence_pValue Greater_pValue
## All adults                   0.0000           1.00
## Adults Over 65               0.0112           0.99
## Adults 18 to 64              0.0000           1.00

Interpretation of tests for independence and equality of proportions:

Table 1.1.5 suggests that the variables Coverage and Poor Health are not independent of each other (regardless of the influence of Medicare) because the respective p-values are below 0.05.
However, the one-sided tests (Greater_pValue) are far from significant: the first proportion is not statistically greater than the second in any of the populations (including when the Medicare population is separated) at the 95% confidence level.
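The two columns of Table 1.1.5 can be approximated with a small helper like the one below. This is only a sketch: `test_pvalues` is a hypothetical name, the choice of `chisq.test()` for the independence column is an assumption, and the counts are back-calculated from Table 1.1.3 for all adults (the age-specific counts are not shown in this document).

```r
# Counts reconstructed from Table 1.1.3 proportions x 245956 (assumption)
all_adults <- matrix(
  c(92023, 14067,
    123268, 16598),
  nrow = 2, byrow = TRUE,
  dimnames = list(PoorHealth = c("Unwell", "Well"),
                  Coverage = c("Yes", "No"))
)

# Hypothetical helper returning both p-values for a 2 x 2 table of counts
test_pvalues <- function(tab) {
  c(
    # H0: the row and column variables are independent
    Independence_pValue = chisq.test(tab)$p.value,
    # H0: prop 1 <= prop 2, against the alternative prop 1 > prop 2
    Greater_pValue = prop.test(tab, alternative = "greater")$p.value
  )
}

res <- test_pvalues(all_adults)
print(round(res, 4))
```

Because the coverage rate among the Unwell group is lower than among the Well group, the one-sided "greater" test cannot reject its null hypothesis, which is why its p-value sits close to 1 even though the two-sided test is highly significant.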

Conclusion: The variables Coverage and Poor Health (an episode of ill health in the previous 30 days) are not entirely independent of each other. However, the proportions are not statistically greater in one population than in the other. Hence, there may be other influencing factors not identified in the above analysis.

Q.2 What is the relationship, if any, between mental health and alcohol consumption?

Are there any gender differences?

A brief computation of the selected variables follows, along with a summary of the data.

# create smaller data with chosen variables
q2b <- br_data %>% select(sex, 
          Mental_Health = menthlth, 
          Ave_Drinks = avedrnk2, 
          Drink_Days = alcday5) %>% 
   # create categorical variables $MHEpisode, $Drank_Alc
   mutate(MHEpisode = ifelse(Mental_Health > 0, "Yes_M", "No_M"),
          Drank_Alc = ifelse(Drink_Days > 0, "Consumed Alcohol", "Did Not Consume Alc"))

# Function to convert the coded $alcday5 value into drinking days per month:
# codes starting with 1 are days per week (multiplied by 4 to approximate a month),
# codes starting with 2 are days in the past 30 days
convert.DrinksAlc <- function(x) {
   x_chr <- as.character(x)
   if (grepl("^1", x_chr)) { 
      y <- (as.numeric(x) - 100) * 4
   } else if (grepl("^2", x_chr)) { 
      y <- as.numeric(x) - 200
   } else {
      y <- as.numeric(x)   # 0 (no drinks) and NA pass through unchanged
   }
   y
}
# apply convert.DrinksAlc to $Drink_Days
q2b$Drink_Days <- map_dbl(q2b$Drink_Days, convert.DrinksAlc)
# remove negative values produced by miscoded answers
# (filter() also drops rows where Drink_Days is NA)
q2b <- q2b %>% filter(Drink_Days >= 0)

# create $Total_Drinks and select variables for plots
q2_data <- q2b %>% mutate(Total_Drinks = Drink_Days * Ave_Drinks) %>%
   select(sex, Mental_Health, Total_Drinks, MHEpisode, Drank_Alc)

rm(q2b)
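The recoding of $alcday5 can also be written in vectorized form, which avoids calling a function once per row. The sketch below is an alternative, not the author's code: `alcday_to_days` is a hypothetical name, and it uses explicit code ranges (101 to 199 for days per week, 201 to 299 for days per month) instead of matching on the first digit, so other values pass through unchanged rather than becoming negative.

```r
# Hypothetical vectorized recoding of alcday5-style codes to days per month
alcday_to_days <- function(code) {
  code <- as.numeric(code)
  ifelse(
    code >= 101 & code <= 199, (code - 100) * 4,   # days per week x 4 weeks
    ifelse(code >= 201 & code <= 299, code - 200,  # days in the past 30 days
           code)                                   # 0 and anything else unchanged
  )
}

# Made-up example codes: weekly, monthly, none, and missing
codes <- c(101, 107, 205, 230, 0, NA)
days <- alcday_to_days(codes)

print(data.frame(code = codes, days_per_month = days))
```

Multiplying weekly counts by 4 follows the conversion used in the document; it slightly understates a 30-day month, which has about 4.3 weeks.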

## Table 2.1.1 - Summary of the selected variables

##      sex         Mental_Health    Total_Drinks      MHEpisode        
##  Male  :192694   Min.   : 0.00   Min.   :   1.00   Length:472127     
##  Female:279433   1st Qu.: 0.00   1st Qu.:   4.00   Class :character  
##                  Median : 0.00   Median :  10.00   Mode  :character  
##                  Mean   : 3.38   Mean   :  22.09                     
##                  3rd Qu.: 2.00   3rd Qu.:  28.00                     
##                  Max.   :30.00   Max.   :2280.00                     
##                  NA's   :7911    NA's   :240948                      
##   Drank_Alc        
##  Length:472127     
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

Points to note in the summary of data:
- There are more female respondents than male respondents in the data-set.
- After reviewing the documentation associated with the $alcday5 variable used to compute the $Total_Drinks variable, NA values (indicating a person did not drink) were retained rather than being discarded. The proportion of NA matches the coding in the original data-set.
- $MHEpisode and $Drank_Alc are categorical (Yes / No) variables created to better visualize the data.

Chosen graphical representation of the data:

Interpretation of plots.

Plot 1 - Dot plot representing individual alcohol consumption positioned on a scale (x axis) representing the number of days respondents felt they were non-productive or unable to perform self care due to mental health concerns or emotional states; by gender.
- Note: a random selection of 20,000 points was used from the original data set to ease the computation load.
- There is a very large range of alcohol consumption, which was very surprising and calls into question the accuracy of the survey methodology and / or the accuracy of self-reporting. Hence, a log scale was used to better visualize the plots.
- Overall there seem to be more women at the lower end of the scale and overall more blue dots, i.e. females, in the plot.
  - This would indicate that there are more women who reported being non-productive due to mental health reasons; however, they seem to be drinking less compared to males.
- There is a cluster of males who drink proportionally more than the females but don’t report any non-productivity due to mental health concerns (clustered around the origin of the x-axis).
Plot 2 - Box-plot of the Total Alcohol consumption vs the mental health categorical $MHEpisode by gender.
- Overall, the data shows that Males drink more alcohol than Females.
- Those who were non-productive due to mental health reasons drank more than those who were not.
- Note the outliers of Females who consumed Alcohol and self-reported Mental health concerns.
- The NA category indicates those individuals who chose not to respond to the question.
  - It seems that the Males among that group follow a similar consumption pattern to those who reported Mental health issues. Are these Males who are hiding their mental health concerns?
  - It seems that Females in this group have a similar consumption distribution to those who did not report any mental health concerns.
Plot 3 - Density plot of Alcohol Consumption by Gender
- The data suggests that, regardless of mental health concerns, among individuals who drink more than 10 alcoholic beverages per month men drink more on average than women. More women than men drink fewer than 10 alcoholic beverages per month.
Plot 4 - Number of individuals who report mental health concerns displayed in proportion of alcohol consumption.
- Note the log scale.
- There is a large proportion of people who didn’t report any mental health concerns (0 at the x-axis), and it appears that the proportions are equally distributed between those that did and did not consume alcohol.
- Of the individuals who did report mental health concerns and had 10 or fewer non-productive days, there are more individuals who consumed alcohol than those that did not.
  - However, beyond 10 days of self-reported non-productivity, there are fewer alcohol consumers than non-consumers.
- It could be that the above indicates alcohol consumption patterns in acute vs chronic mental health concerns and would be an area of continued research (beyond the scope of this paper).

Select questions to quantify and confirm after EDA are:

Is there a significant difference in the proportion of individuals who consumed alcohol and had mental health concerns?
Are women disproportionately consuming more alcohol if they have mental health concerns?

## Table 2.1.2 - Proportion of Individuals who reported mental health concerns and reported consuming alcohol.

##                      MHEpisode
## Drank_Alc                  No_M     Yes_M
##   Consumed Alcohol    0.3436590 0.1577585
##   Did Not Consume Alc 0.3465736 0.1520090

## prop.test()

## 
##  2-sample test for equality of proportions without continuity
##  correction
## 
## data:  Alc_ME_table
## X-squared = 51.521, df = 1, p-value = 7.083e-13
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.012402840 -0.007082608
## sample estimates:
##    prop 1    prop 2 
## 0.6853750 0.6951177

Interpretation of Table 2.1.2:

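There is a minor but statistically significant difference between the two groups: 68.5% of those who consumed alcohol reported no mental health concerns, compared with 69.5% of those who did not consume alcohol. The 95% CI for this difference is 0.71 to 1.24 percentage points.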

## Table 2.1.3 - Test for independence of proportions: Individuals who reported mental health concerns & reported consuming alcohol by gender.

## Number of cases in table: 464216 
## Number of factors: 3 
## Test for independence of all factors:
##  Chisq = 13548, df = 4, p-value = 0

Interpretation of Table 2.1.3:

The three selected variables ($sex, $MHEpisode, $Drank_Alc) are not independent.
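The output format of Table 2.1.3 matches what `summary()` prints for a multi-way contingency table, so the test can be sketched as below. The data are simulated (and independent by construction), so the statistic will not resemble the value reported above; only the mechanics are illustrated.

```r
set.seed(1)
n <- 400

# Simulated, mutually independent factors with the same level names
toy <- data.frame(
  sex = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  MHEpisode = factor(sample(c("No_M", "Yes_M"), n, replace = TRUE,
                            prob = c(0.7, 0.3))),
  Drank_Alc = factor(sample(c("Consumed Alcohol", "Did Not Consume Alc"),
                            n, replace = TRUE))
)

# Three-way contingency table of counts
tab3 <- table(toy)

# Chi-squared test of mutual independence of all three factors
res <- summary(tab3)
print(res)
```

For three two-level factors the test has 8 - 1 - 3 = 4 degrees of freedom, which agrees with the df reported in Table 2.1.3.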

## Visualizing difference in proportions

Interpretation of the Mosaic plot:

Note - Red colored boxes are proportions that are under-represented in the group i.e. less than expected if random. Conversely, blue boxes are proportions over-represented in the group i.e. more than expected if random.
- Among Males
  - Of those that did not report mental health concerns, those that consumed alcohol are more than expected and those that did not consume alcohol are less than expected.
  - Of those that reported mental health concerns, the proportion of males that consumed alcohol is less than expected (red), whereas for females (blue) it is more than expected.
- Among Females
  - Among those that did not report mental health concerns, those that consumed alcohol are less than expected (red) and those that did not consume alcohol are more than expected (blue).
  - Among those that reported mental health concerns, both those that consumed alcohol and those that didn’t are more than expected.

Conclusions:

2.1 There is a difference in alcohol consumption between Males and Females. Table 2.2.1

gender_Alc_table <- q2_data %>% select(Drank_Alc, sex) %>% table(useNA = "no")
prop.test(gender_Alc_table, correct = FALSE)

## 
##  2-sample test for equality of proportions without continuity
##  correction
## 
## data:  gender_Alc_table
## X-squared = 9420.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.1360765 0.1416289
## sample estimates:
##    prop 1    prop 2 
## 0.4777581 0.3389054

2.2 There is a difference in mental health concerns between Males and Females. Table 2.2.2

gender_MH_table <- q2_data %>% select(MHEpisode, sex) %>% table(useNA = "no")
prop.test(gender_MH_table, correct = FALSE, alternative = "greater")

## 
##  2-sample test for equality of proportions without continuity
##  correction
## 
## data:  gender_MH_table
## X-squared = 4131.9, df = 1, p-value < 2.2e-16
## alternative hypothesis: greater
## 95 percent confidence interval:
##  0.09779562 1.00000000
## sample estimates:
##    prop 1    prop 2 
## 0.4398768 0.3395712

2.3 Among those that consume alcohol, males report fewer mental health concerns than females. Table 2.2.3

Alc_gend_MH <- q2_data %>% 
   filter(Drank_Alc == "Consumed Alcohol") %>% 
   select(sex, MHEpisode) %>% table(useNA = "no")
prop.test(Alc_gend_MH, correct = FALSE, alternative = "greater")

## 
##  2-sample test for equality of proportions without continuity
##  correction
## 
## data:  Alc_gend_MH
## X-squared = 3027.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: greater
## 95 percent confidence interval:
##  0.1028892 1.0000000
## sample estimates:
##    prop 1    prop 2 
## 0.7407524 0.6347275

Q.3 What is the relationship between BMI and balanced food consumption, if any?

Data preparation for the analysis will consist of:

Creating $Balance_F - arithmetic sum of selected foods consumed.

q3_balance_data <- br_data %>% select(Beans = beanday_, 
                                      Greens = grenday_, 
                                      OrangeV = orngday_,
                                      Fruits = frutda1_,
                                      BMI = X_bmi5,
                                      BMI_Levels = X_bmi5cat) %>% na.omit() %>%
   mutate(Balance_F = Beans + Greens + OrangeV + Fruits)
rm(br_data)

Creating a smaller data-set of randomly chosen rows for reducing the computation burden while creating plots.

set.seed(5678)
size = nrow(q3_balance_data)
mini_q3_data <- q3_balance_data[sample(1:size, 10000),]

Table 3.1.1 summarizes the data. The ranges of the food consumption and BMI variables seem large. Note that two decimal places are implied in the BMI variable (e.g. 2675 represents a BMI of 26.75).

## Table 3.1.1

##      Beans             Greens           OrangeV            Fruits      
##  Min.   :   0.00   Min.   :   0.00   Min.   :   0.00   Min.   :   0.0  
##  1st Qu.:   7.00   1st Qu.:  14.00   1st Qu.:   7.00   1st Qu.:  33.0  
##  Median :  14.00   Median :  43.00   Median :  17.00   Median : 100.0  
##  Mean   :  27.86   Mean   :  55.08   Mean   :  29.52   Mean   : 101.5  
##  3rd Qu.:  33.00   3rd Qu.:  83.00   3rd Qu.:  43.00   3rd Qu.: 100.0  
##  Max.   :9900.00   Max.   :9900.00   Max.   :9900.00   Max.   :9900.0  
##       BMI               BMI_Levels       Balance_F    
##  Min.   :   1   Underweight  :  7283   Min.   :    0  
##  1st Qu.:2367   Normal weight:140260   1st Qu.:  111  
##  Median :2675   Overweight   :152163   Median :  179  
##  Mean   :2787   Obese        :124009   Mean   :  214  
##  3rd Qu.:3086                          3rd Qu.:  276  
##  Max.   :9769                          Max.   :20800

## Visualizing Balanced_F and BMI

Interpretation of plots 3.1 and 3.2:

Plot 3.1 visualizes the BMI distribution. The data is slightly skewed, as can be seen from the difference between the peak of the data and the mean BMI of ~ 28 (Overweight, but not Obese).

Plot 3.2 visualizes the $Balance_F variable across the BMI categories. The median of each group lies below the mean of the entire cohort, indicating not only skewness in the data but also that there may be variation among the categories. A log10 scale has been used on the y axis to better visualize this plot.

Plot 3.3 (below) expands on plot 3.2 by showing the spread of the data. Points to note:

The center of mass appears in the BMI zone of 23 - 30 i.e. normal weight to overweight.
There are fewer points in the underweight category and there is a larger range of obese individuals.
There is more variability at both ends of the BMI spectrum (to be expected).
Overall there seems to be a negative correlation between BMI and Balance_F i.e. the higher the Balance_F, the lower the BMI.

However, is the difference in Balance_F statistically significant across categories?

## Table 3.2.1

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Balance_F by BMI_Levels
## Kruskal-Wallis chi-squared = 4175.2, df = 3, p-value < 2.2e-16

## Table 3.2.2

## 
##  Pairwise comparisons using Wilcoxon rank sum test 
## 
## data:  test_data$Balance_F and test_data$BMI_Levels 
## 
##               Underweight Normal weight Overweight
## Normal weight < 2e-16     -             -         
## Overweight    0.36        < 2e-16       -         
## Obese         1.3e-13     < 2e-16       < 2e-16   
## 
## P value adjustment method: bonferroni

The low p-value in Table 3.2.1 indicates that the distribution of Balance_F differs across at least some of the BMI categories, i.e. the BMI categories and the Balance_F variable are not independent.

Table 3.2.2 reports the Bonferroni-adjusted p-values for the pairwise comparisons of Balance_F between categories. There is no statistically significant difference in Balance_F between the Underweight and Overweight categories (p = 0.36), i.e. the distributions in those two groups are similar. All other pairs of categories differ significantly.

This is an interesting aspect of the data and could indicate that those who are underweight and those who are overweight have similar food consumption to each other, but different from those who are obese or of normal weight.
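The two tests behind Tables 3.2.1 and 3.2.2 can be sketched on simulated data as follows. The group means and sizes are made up, so the p-values will not match the tables; the calls themselves (`kruskal.test()` followed by `pairwise.wilcox.test()` with a Bonferroni adjustment) mirror the output shown above.

```r
set.seed(42)
groups <- c("Underweight", "Normal weight", "Overweight", "Obese")

# Simulated Balance_F scores; the group means are arbitrary illustrative values
toy <- data.frame(
  BMI_Levels = factor(rep(groups, each = 50), levels = groups),
  Balance_F = c(rnorm(50, 200, 40), rnorm(50, 240, 40),
                rnorm(50, 205, 40), rnorm(50, 180, 40))
)

# Omnibus test: does the distribution of Balance_F differ in any category?
kw <- kruskal.test(Balance_F ~ BMI_Levels, data = toy)
print(kw)

# Follow-up: which pairs of categories differ, adjusted for multiple testing
pw <- pairwise.wilcox.test(toy$Balance_F, toy$BMI_Levels,
                           p.adjust.method = "bonferroni")
print(pw)
```

A large adjusted p-value for a pair means the test found no evidence of a difference between those two groups; it does not by itself show that the groups are the same.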

Conclusion:

Additional exploration is required to ascertain the cause of these differences. However, it must be noted that the variation in Balance_F explains very little of the variation in BMI (r = -0.077; see Table 3.3.1).

## Table 3.3.1

## 
##  Pearson's product-moment correlation
## 
## data:  q3_balance_data$BMI and q3_balance_data$Balance_F
## t = -50.102, df = 423710, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.07973452 -0.07374797
## sample estimates:
##         cor 
## -0.07674194
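How little of the variation is explained can be read off the correlation in Table 3.3.1 by squaring it. The sketch below uses the reported correlation and degrees of freedom, and also recomputes the t statistic from them as a consistency check.

```r
# Values copied from Table 3.3.1
r <- -0.07674194
df <- 423710

# Share of the variance in BMI linearly associated with Balance_F
r_squared <- r^2

# t statistic implied by r and df: t = r * sqrt(df / (1 - r^2))
t_stat <- r * sqrt(df / (1 - r^2))

print(round(100 * r_squared, 2))  # percent of variance
print(round(t_stat, 2))
```

The correlation is highly significant only because the sample is very large; in terms of effect size, well under 1% of the variance in BMI is accounted for.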

A note on the cause of variability in Balance_F:

Balance_F is an artificial construct based on the assumption that different types of foods and their balanced intake could have a relationship with BMI. However, where does the variation in this data lie? Principal component analysis (PCA) is a good method to explore the underlying variation in the data. Table 3.4.1 and Table 3.4.2 show one way to look at the underlying components of Balance_F, further exploration of which could lead to a better understanding of individual food components and their relationship to BMI.

## Table 3.4.1

## Importance of components:
##                           PC1    PC2    PC3    PC4
## Standard deviation     1.3029 0.9475 0.8658 0.8094
## Proportion of Variance 0.4244 0.2244 0.1874 0.1638
## Cumulative Proportion  0.4244 0.6488 0.8362 1.0000

## Table 3.4.2

##               PC1         PC2         PC3         PC4
## Beans   0.3706564  0.90083349 -0.22512013  0.02082708
## Greens  0.5647861 -0.22399148 -0.03979517 -0.79325960
## OrangeV 0.5251537 -0.02954757  0.77817163  0.34320472
## Fruits  0.5175365 -0.37074760 -0.58496683  0.50250966

Principal component 1 (PC1) explains ~ 42% of the variation in the data, and principal components 2 to 4 together explain the remaining ~ 58%. Table 3.4.2 provides the respective weights (loadings) of the underlying food groups on each component. Further study into these techniques (beyond the scope of this overview) could provide interesting ways to understand food proportions.
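The proportions of variance in Table 3.4.1 follow directly from the standard deviations of the components, as the sketch below shows. The second half runs `prcomp()` on simulated data to illustrate the call; the simulated values are made up, and the use of centering and scaling is an assumption (suggested by the squared standard deviations in Table 3.4.1 summing to about 4, the number of variables).

```r
# Standard deviations copied from Table 3.4.1
sdev <- c(PC1 = 1.3029, PC2 = 0.9475, PC3 = 0.8658, PC4 = 0.8094)

# Each component's variance as a share of the total variance
prop_var <- sdev^2 / sum(sdev^2)
cum_var <- cumsum(prop_var)
print(round(rbind(prop_var, cum_var), 4))

# Illustrative PCA on simulated food-intake columns (hypothetical data)
set.seed(7)
toy <- data.frame(
  Beans = rexp(200), Greens = rexp(200),
  OrangeV = rexp(200), Fruits = rexp(200)
)
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Importance table (as in Table 3.4.1) and loadings (as in Table 3.4.2)
imp <- summary(pca)$importance
print(round(imp, 4))
print(round(pca$rotation, 4))
```

Scaling puts the four food variables on a common footing, so that a variable with a larger raw range does not dominate the first component.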

Git-hub link to original source code link

Vivek Narayan

October, 2018