PH 747 - Week 5

Basic Notes

You will need to download wk4_data.xlsx and wk4_data_2.xlsx from Laulima for these exercises.

I will reference command shortcuts by the Mac shortcuts such as Cmd+Enter to Run a line.

Add pipe %>%: Cmd+Shift+M
Add assignment <-: Opt+-

To convert to Windows shortcuts:
Cmd = Ctrl
Opt = Alt

Note: Datasets are adapted and modified from those found in the DatasauRus R package.

Review Exercise 1

Let’s start by reviewing some of the basics from the first exercise.

Initialize the data

Loading required libraries, input the dataset into a tibble, etc.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(table1)

## 
## Attaching package: 'table1'
## 
## The following objects are masked from 'package:base':
## 
##     units, units<-

library(writexl)

file_path <- "/Users/jonhuang/Desktop/"
wk4_dat <- readxl::read_xlsx(paste0(file_path, "wk4_data.xlsx"))
wk4_dat

Cleaning up the dataset

I will filter down to Set 0, rename x, label y, and create a new variable hi_y

wk4_sub0 <-  
  wk4_dat %>% 
  filter(set == 0) %>% 
  rename(Biomarker = x) %>% 
  mutate(y = structure(y, label = "Self-Rated Health")) %>% 
  mutate(hi_y = case_when(y > 30 ~ 1,
                          y <= 30 ~ 0,
                          T ~ NA_real_))
wk4_sub0

Univariate statistics

I will summarize Set 0 by taking its mean, SD, min, max, 25%ile, and 75%ile

wk4_sub0 %>% 
  summarize(across(c(Biomarker, y, hi_y), 
                   ~c(mean(.), sd(.), min(.), max(.), 
                      quantile(., probs = 0.25), quantile(., probs = 0.75)))) %>% 
  add_column(STAT = c("mean", "SD", "min", "max", "p25", "p75"))

Make a nice Table 1

pvalue <- function(x, ...) {
    # Construct vectors of data y, and groups (strata) g
    y <- unlist(x)
    g <- factor(rep(1:length(x), times=sapply(x, length)))
    if (is.numeric(y)) {
        # For numeric variables, perform a standard 2-sample t-test
        p <- t.test(y ~ g)$p.value
    } else {
        # For categorical variables, perform a chi-squared test of independence
        p <- chisq.test(table(y, g))$p.value
    }
    # Format the p-value, using an HTML entity for the less-than sign.
    # The initial empty string places the output on the line below the variable label.
    c("", sub("<", "&lt;", format.pval(p, digits=3, eps=0.001)))
}

tbl1 <- 
  wk4_sub0 %>% 
  mutate(hi_y = structure(case_when(hi_y == 1 ~ "Y > 30", 
                          hi_y == 0 ~ "Y <= 30"), 
                          label = "High Y Value (>30)")) %>% 
  table1(~ Biomarker + y  + hi_y | hi_y, data = .,
         topclass="Rtable1-zebra",
         overall = F, extra.col=list(`P-value`=pvalue),
         render.continuous=c(.="Mean (SD)", 
                             .="Median [MIN, MAX]",
                             "IQR [25%ile, 75%ile]" ="IQR [Q1, Q3]",
                           "N Missing" = "NMISS"),
         footnote = "Data arbitrarily split by Y value. Thus, differences in self-rated health can't be interpreted.")
tbl1

	Y <= 30 (N=45)	Y > 30 (N=97)	P-value
Data arbitrarily split by Y value. Thus, differences in self-rated health can't be interpreted.
Biomarker
Mean (SD)	57.1 (16.3)	53.0 (16.9)	0.172
Median [MIN, MAX]	62.4 [21.8, 82.1]	49.0 [15.6, 91.6]
IQR [25%ile, 75%ile]	28.7 [43.0, 71.7]	28.7 [39.1, 67.8]
N Missing	0	0
Self-Rated Health
Mean (SD)	15.7 (8.92)	62.7 (18.0)	<0.001
Median [MIN, MAX]	16.1 [0.0151, 29.7]	66.0 [30.2, 97.5]
IQR [25%ile, 75%ile]	14.9 [8.19, 23.1]	30.2 [45.7, 76.0]
N Missing	0	0
High Y Value (>30)
Y <= 30	45 (100%)	0 (0%)	<0.001
Y > 30	0 (0%)	97 (100%)

Save this Table to a .docx

library(flextable)

## 
## Attaching package: 'flextable'

## The following object is masked from 'package:purrr':
## 
##     compose

t1flex(tbl1) %>% save_as_docx(path = paste0(file_path, "Set_0_Table.docx"))

Run a linear regression and produce 95% Confidence Intervals

library(broom)

wk4_sub0 %>% 
  do(tidy(lm(formula = y ~ Biomarker,.))) %>% 
  mutate(LB = estimate - 1.965*std.error, UB = estimate + 1.965*std.error)

Interpretive statement:
>
> We observe that each 1-unit higher Biomarker was associated with 0.1 point lower Self-Rated Health.
> Though this result was generally consistent with either small decrease or increase in Self-Rated Health:
> (estimate = -0.10, [95% CI: -0.37, 0.16], p = 0.45)
>

Make some nice plots

wk4_dat %>% 
  ggplot() + geom_point(aes(x, y, color = factor(set))) + 
  facet_wrap(~set) + 
  theme(legend.position = "none")

## Set 0 is boring and I like a even grid, so I will filter it out first:
wk4_dat %>% 
  filter(set != 0) %>% 
  ggplot() + geom_point(aes(x, y, color = factor(set))) + 
  facet_wrap(~set) + 
  labs(title = "Scatter Plots of All 12 Sets",
       subtitle = "From the DatasauRus R Package") +
  theme_bw() +
  theme(legend.position = "none")

Exercise 2

This is a simulated dataset to look at the relationship between age and self-reported health. We are given a special dataset that include age, several important covariates, the true value of self-reported health (sr_health) and several alternate measures of this outcome we will use to illustrate different biases.

Of course this would never happened in a real dataset, so let’s first pretend we don’t know anything about the dataset and just explore it.

Let’s start just as we did with the first exercise.

Initialize the data

Load the required libraries, input the dataset into a tibble, etc.

library(tidyverse)
library(table1)
library(writexl)

file_path <- "/Users/jonhuang/Desktop/"
wk5_dat <- readxl::read_xlsx(paste0(file_path, "wk4_data_2.xlsx"))
wk5_dat

Describe variables

After looking at the overall structure of the data, we can produce some univariate statistics and plots to understand each variable.

If you tried to run the code below, you will find it does not work in its current form. What does the error message suggest the problem is?

Hint: The default summarize functions require complete data. We need to specify how the function treats missing data. Which variables have missingness?

var_names <- names(wk5_dat)
  
wk5_dat %>% 
  summarize(across(c(var_names), 
                   ~c(mean(.), sd(.), min(.), max(.), median(.),
                      quantile(., probs = 0.25), quantile(., probs = 0.75)))) %>% 
  add_column(STAT = c("mean", "SD", "min", "max", "median", "p25", "p75"))

Let’s first analyze for complete variables only, ignoring the missing values. Then we will come back to better understand the missingness.

wk5_dat %>% 
  summarize(across(c(names(wk5_dat)), 
                   ~c(mean(., na.rm = T), sd(., na.rm = T), min(., na.rm = T), max(., na.rm = T), median(., na.rm = T),
                      quantile(., probs = 0.25, na.rm = T), quantile(., probs = 0.75, na.rm = T),
                      mean(is.na(.))))) %>% 
  add_column(STAT = c("mean", "SD", "min", "max", "median", "p25", "p75", "pct_missing"))

We see that only sr_health_3 and sr_health_4 have missing data. We’ll come back to this.

In the meantime, let’s start keeping notes for the variables and their values. Even if there is another more detailed codebook, it is important to note for yourself important information. They can be in this notebook or in a separate spreadsheet or other file time.

age = Age in Years, range 17-91
sr_health = self-reported health, range 0-100 (based on separate documentation, you know that higher scores are worse)
ethnicity = self-identified ethnicity, binary variable in this case (0, 1)
phone = survey conducted by phone (0 = No, 1 = Yes), 52% by phone
sick = did the person report being sick at the time of the survey (0 = No, 1 = Yes), 16% were sick
young = seems to be a binary variable, but what does it mean? No documentation, let’s find out…

You are not given any information about young. It seems to be a constructed variable. How do we learn more about it?

Hint: We could split by Young = 0 vs. 1 and summarize another variable…which one and what statistics?

wk5_dat %>% group_by(young) %>% summarize(???)

Now let’s look at the variables with missing data.

Describe missingness

wk5_dat %>% summarize(across(everything(), ~sum(is.na(.))))

Because only the last two variables have any missing, let’s just focus in on any patterns for these two variables:

wk5_dat %>% count(across(c(9:10), ~is.na(.)))

wk5_dat %>% group_by(is.na(sr_health_3)) %>% summarize(across(c(age, sr_health, ethnicity, phone, sick, young), ~mean(.)))

wk5_dat %>% group_by(is.na(sr_health_4)) %>% summarize(across(c(age, sr_health, ethnicity, phone, sick, young), ~mean(.)))

Use the Table1 package to make the last table look nicer (summarize first six variables, stratified by whether sr_health_4) is missing.

The code below is incomplete as it does not summarize all six variables. Fix this and run it. Then produce a similar table stratified on the basis of sr_health_3 missingness.

tbl1_new <- 
  wk5_dat %>% 
  mutate(miss = structure(case_when(is.na(sr_health_4) ~ "Missing", 
                          !is.na(sr_health_4) ~ "Not Missing"), 
                          label = "sr_health_4 Missing?")) %>% 
  table1(~ age + sr_health + ethnicity | miss, data = .,
         topclass="Rtable1-zebra",
         overall = F, extra.col=list(`P-value`=pvalue),
         render.continuous=c(.="Mean (SD)", 
                             .="Median [MIN, MAX]",
                             "IQR [25%ile, 75%ile]" ="IQR [Q1, Q3]",
                           "N Missing" = "NMISS"))
tbl1_new

What can you say about any differences in those who have missingness for sr_health_3? for sr_health_4?

Plotting variables: univariate plots to look at distributions of variables

This is a plot for age distributions. Modify this to look at each of the first six variables (before sr_health_1).

wk5_dat %>% ggplot() + geom_histogram(aes(age)) + theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

By combining univariate plots, we can look at relative distributions and potentially learn something about related variables and missing data.

The following code will plot histograms of all of the sr_health together in one column to compare them if VAR_NAME is replaced with the correct variable name. Can you figure out what it is?

Hint: Try running just the pivot_longer() part of the command to see what the resulting tibble looks like.

wk5_dat %>% 
  pivot_longer(cols = everything()) %>% 
  filter(grepl("sr_health", VAR_NAME)) %>% 
  ggplot() + geom_histogram(aes(x = value, fill = VAR_NAME)) + 
  facet_wrap(~VAR_NAME, ncol = 1) + theme_bw() + theme(legend.position = "none")

This is another way to look at it using the geom_density() command. But it may not work well for too many overlapping shapes. Also, be sure to look at multiple adjust = values. What does adjust = do?

wk5_dat %>% 
  pivot_longer(cols = everything()) %>% 
  filter(grepl("sr_health", VAR_NAME)) %>% 
  ggplot() + geom_density(aes(x = value, fill = VAR_NAME), alpha = 0.6, adjust = 0.7) + 
  theme_bw() + theme(legend.position = "bottom")

Generate a working dataset

As we mentioned at the beginning, sr_health_1 through sr_health_4 are actually artificial versions of the true sr_health outcome we have created to illustrate different potential biases. Of course, in a real dataset, you would just have a measure of sr_health without necessarily knowing if it is the true value. However, in this case we can compare what would happen if

For practice, let’s generate the dataset that you would see if sr_health_1 are the outcome values you have actually observed for sr_health.

Modify/fix the follow code to:
- Create a new dataset called scenario_1 that pulls age, covariates, and sr_health_1 from the original dataset (drop sr_health and sr_health_2 through sr_health_4)
- Rename sr_health_1 to sr_health because these are the values that are actually observed in Scenario 1
- Mutate the values or labels for ethnicity to have 0 = Non-NH and 1 = NH
- Generate a Table 1 showing the variable values stratified by the ethnicity variable

DATSET_NAME <- wk5_dat %>% 
  select(-sr_health, -c(sr_health_2:sr_health_4)) %>% 
  rename(sr_health = sr_health_1) %>% 
  mutate(ethnicity = case_when(ethnicity == 0 ~ "Group 0",
                               ethnicity == 1 ~ "Group 1",
                               T ~ NA_character_))
  
DATSET_NAME %>% 
  table1(~ age + sr_health_1 + phone + sick + young + ethnicity, data = .,
         topclass="Rtable1-zebra",
         #overall = F, 
         #extra.col=list(`P-value`=pvalue),
         render.continuous=c(.="Mean (SD)", 
                             .="Median [MIN, MAX]",
                             "IQR [25%ile, 75%ile]" ="IQR [Q1, Q3]",
                           "N Missing" = "NMISS"))

Now let’s run a regression to look at the mean difference in Self-Reported Health per year older Age:

library(broom)

scenario_1 %>% 
  do(tidy(lm(formula = sr_health ~ age,.))) %>% 
  mutate(LB = estimate - 1.965*std.error, UB = estimate + 1.965*std.error)

This is the format (and analytic results) of the dataset you would see in Scenario 1 (see more details below). This is important if you want to come back take a look at datasets representing data from the other Scenarios.

For extra practice: Repeat the Table 1 and Regression exercise for sr_health_2 through sr_health_4

However, from here on out, we will be comparing findings from each scenario with the truth (sr_health) to illustrate potential biases. Therefore, we will keep the entire working dataset to make tabulating, plotting, and regressions easier to do.

Scenario 1 (`sr_health_1`): Ethnicity influences reporting of self-reported health (Information Bias)

Let’s first assume that there is a single correct “truth” for self-reported health (i.e. it may be different for each person, but for each person there is a true value).

Then let’s imagine that an individuals ethnic background can influence how they respond to a question about self-reported health in that they are equally likely to over- or under- report their self-rated health. e.g. maybe the question is interpreted differently or there are cultural reasons to respond in different ways.

Would this differential (by ethnicity) measurement error be a source of information bias? i.e. Would it lead to a biased estimate of association between Age and Self-Reported Health?

Why or why not?

Assume ethnicity is not associated with either age or any other predictor of self-reported health. Does that change your answer?

Make the plot below to look at the influence of differential reporting under this scenario:

wk5_dat %>% 
  ggplot() + 
  geom_point(aes(x = age, y = sr_health), color = "green") +
  geom_smooth(aes(x = age, y = sr_health), color = "green", method = lm, se = T) +
  geom_point(aes(x = age, y = sr_health_1), shape = "star", color = "blue") +
  geom_smooth(aes(x = age, y = sr_health_1), color = "blue", linetype = "dashed", method = lm, se = T) + 
  labs(title = "Scenario 1: Ethnicity influences self-reporting of health",
       subtitle = "Green = 'truth', Blue = 'differential reporting by ethnicity'") +
  theme_bw()

Let’s compare the results from regression models:

True association without measurement error:

wk5_dat %>% reframe(tidy(lm(formula = sr_health ~ age,.)))

Observed association with measurement error:

wk5_dat %>% reframe(tidy(lm(formula = sr_health_1 ~ age,.)))

These are very, very similar, but not identical because in small datasets, even random error can make a difference. If this was observed in a much larger dataset the estimates would be identical. In this case it really is random error, because the added noise does not systematically influence the relationship between age and self-reported health.

Take away: Just because a variable is mis-measured does not necessarily mean there is Information Bias.

Scenario 2 (`sr_health_2`): Contact by phone is associated with age and influences self-reported health (Information Bias)

Here is alternative situation where measurement error has occurred.

In this case, some survey interviews are conducted by phone. In phone interviews, because of distractions or just due to the pressures in responding to another person on the spot, there is a greater chance of misreporting self-rated health (both positively or negatively).

However, in this case, the chance of responding to survey by phone was higher in older respondents.

Is this a source of potential information bias?

To look at how this might influence the outcome, update the command below by putting the outcome variable for Scenario 2 sr_health_2 in the proper place in the ggplot() command below:

wk5_dat %>% 
  ggplot() + 
  geom_point(aes(x = age, y = sr_health), color = "green") +
  geom_smooth(aes(x = age, y = sr_health), color = "green", method = lm, se = T) +
  geom_point(aes(x = age, y = OBSERVED_OUTCOME), shape = "star", color = "blue") +
  geom_smooth(aes(x = age, y = OBSERVED_OUTCOME), color = "blue", linetype = "dashed", method = lm, se = T) + 
  labs(title = "Scenario 2: Phone survey influences self-reporting of health",
       subtitle = "Green = 'truth', Blue = 'differential reporting by phone respondents'") +
  theme_bw()

Here, even though measurement error was just random noise, the noise affected people differently based on their “exposure” (i.e. age). We also can have some intuition as to how this affects the association: Since there is more noise at the older ages (larger values of x), this is going to pull the slope upward making the association more positive (than the truth).

Scenario 3 (`sr_health_3`): Sick people are excluded from surveys (Selection Bias)

In addition to measurement error, other biases may occur when individuals are omitted from the study relative to the true target population. A common thought is that this can be “diagnosed” by seeing if demographics for the sample are representative of the target population. However, this is a vague and generally unhelpful criterion. Consider the demographics of the target population:

wk5_dat %>% 
  table1(~ age + phone + young + ethnicity, data = .,
         topclass="Rtable1-zebra",
         render.continuous=c(.="Mean (SD)"))

	Overall (N=182)
age
Mean (SD)	54.0 (14.5)
phone
Mean (SD)	0.516 (0.501)
young
Yes	31 (17.0%)
No	151 (83.0%)
ethnicity
Mean (SD)	0.291 (0.456)

And then the sample in Scenario 3:

wk5_dat %>% 
  filter(!is.na(sr_health_3)) %>% 
  table1(~ age + phone + young + ethnicity, data = .,
         topclass="Rtable1-zebra",
         render.continuous=c(.="Mean (SD)"))

	Overall (N=152)
age
Mean (SD)	52.5 (14.1)
phone
Mean (SD)	0.493 (0.502)
young
Yes	28 (18.4%)
No	124 (81.6%)
ethnicity
Mean (SD)	0.303 (0.461)

They are pretty similar! However, we know there is one important difference, which is that no one who was actively sick the day the surveys were conducted was asked to / able to participate.

Do you think selecting only people who were well to complete the survey will bias the association between age and self-reported health?

Why or why not?

Complete the plot command with the correct exposure and outcome variable names for scenario 3 to see:

wk5_dat %>% 
  ggplot() + 
  geom_point(aes(x = EXPOSURE, y = TRUE_OUTCOME), color = "green") +
  geom_smooth(aes(x = EXPOSURE, y = TRUE_OUTCOME), color = "green", method = lm, se = T) +
  geom_point(aes(x = EXPOSURE, y = OBSERVED_OUT), shape = "star", color = "blue") +
  geom_smooth(aes(x = EXPOSURE, y = OBSERVED_OUT), color = "blue", linetype = "dashed", method = lm, se = T) + 
  labs(title = "Scenario 3: Sick people excluded from survey",
       subtitle = "Green = 'truth', Blue = 'only people feeling well were surveyed'") +
  theme_bw()

The observed association was much weaker than the true association. Why was this? While data on who was not surveyed is usually not available to us, in this example we can.

We can see from the plot and from the table below which stratifies the demographics on the basis of whether they were sampled (TRUE, N = 152) or not (FALSE, N = 30) that, not only were there no sick people in the sample, this also means that the selected population is actually younger - something that was not clear when we just looked at the overall means.

wk5_dat %>% 
  #filter(!is.na(sr_health_3)) %>% 
  table1(~ age + phone + young + ethnicity + sick | !is.na(sr_health_3), data = .,
         topclass="Rtable1-zebra",
         overall = F, 
         extra.col=list(`P-value`=pvalue),
         render.continuous=c(.="Mean (SD)"))

## Warning in table1.formula(~age + phone + young + ethnicity + sick |
## !is.na(sr_health_3), : Terms to the right of '|' in formula 'x' define table
## columns and are expected to be factors with meaningful labels.

## Warning in chisq.test(table(y, g)): Chi-squared approximation may be incorrect

	FALSE (N=30)	TRUE (N=152)	P-value
age
Mean (SD)	61.5 (14.4)	52.5 (14.1)	0.00327
phone
Mean (SD)	0.633 (0.490)	0.493 (0.502)	0.162
young
Yes	3 (10.0%)	28 (18.4%)	0.392
No	27 (90.0%)	124 (81.6%)
ethnicity
Mean (SD)	0.233 (0.430)	0.303 (0.461)	0.43
sick
Yes	30 (100%)	0 (0%)	<0.001
No	0 (0%)	152 (100%)

If we think some more about this, it make sense that since there is a relationship between being sick and higher self-reported health scores and also between being old and higher scores. Removing sick people with higher scores would also differentially remove older people from the sample. As a result, the association is not as strong as it could have been.

This is what is specifically needed for selection bias to exist: > > Selection bias occurs when a factor determining inclusion is related to at least two of the following: exposure, confounder, outcome. >

Unfortunately, because of the very strong relationship between sickness (selection) and health score in this scenario, there is very little we can do to correct this bias, unless of course we could somehow go back and observe the scores in the entire target population (i.e. the points / line in green on the plot).

There is one potential way we try to fix this analysis to produce at least a slightly less biased association. Can you think of what it is?

Hint: There is a subgroup within the sample in which the selection bias is less strong / has less of an effect. If we can restrict to this subgroup, we could minimize the bias. If you look at the previous table which has a column for selected (TRUE) and not-selected (FALSE), there is a characteristics that is barely present in the non-selected population…

wk5_dat %>% 
  mutate(restricted_y = case_when(young == 1 ~ sr_health_3, 
                                  T ~ NA_real_)) %>% 
  ggplot() + 
  geom_point(aes(x = age, y = sr_health), color = "green") +
  geom_smooth(aes(x = age, y = sr_health), color = "green", method = lm, se = T) +
  geom_point(aes(x = age, y = restricted_y), shape = "star", color = "blue") +
  geom_smooth(aes(x = age, y = restricted_y), color = "blue", linetype = "dashed", method = lm, se = T) + 
  labs(title = "Scenario 3: Sick people excluded from survey - restricted to young",
       subtitle = "Green = 'truth', Blue = 'restricted to well people who are also young'") +
  theme_bw()

Let’s remind ourselves of the true association:

wk5_dat %>% reframe(tidy(lm(formula = sr_health ~ age,.)))

The observed association amongst well (non-sick) only:

wk5_dat %>% reframe(tidy(lm(formula = sr_health_3 ~ age,.)))

The observed association amongst well (non-sick) only, restricted to the young where selection biases may be less strong:

wk5_dat %>%  
  mutate(restricted_y = case_when(young == 1 ~ sr_health_3, 
                                  T ~ NA_real_)) %>% 
  reframe(tidy(lm(formula = restricted_y ~ age,.)))

Note that while the estimated association with age is closer to the truth, there is a cost in terms of power because the sample size is far smaller. In general with statistics, there is almost always a bias-variance trade off between different analytic strategies, all else being equal (same dataset, estimation target, etc.)
That being said, restriction is often times a sensible strategy if other way to correct for biases, such as adjustment are not available to you.

Scenario 4 (`sr_health_4`): Males are excluded (Selection Bias)

Our final scenario considers a seemingly clearer imbalance in the selected sample. What do you think would happen if all males were were excluded from the survey?

Would this lead to selection bias, i.e. selection that leads to a bias in the estimated association between age and self-reported health score?

What other information would you need to know to determine this?

For the sake of this exercise, let’s assume that biological sex is not related to self-reported health (This is not actually true). Would such a selected sample give a biased association between age and health score?

Let’s modify the dataset to look at this scenario:

set.seed(1)
wk5_dat %>% 
  add_column(MALE = rbinom(182,1, 0.5)) %>% 
  mutate(sr_health_4 = case_when(MALE == 1 ~ NA_real_, 
                                 T ~ sr_health)) %>% 
  ggplot() + 
  geom_point(aes(x = age, y = sr_health), color = "green") +
  geom_smooth(aes(x = age, y = sr_health), color = "green", method = lm, se = F) +
  geom_point(aes(x = age, y = sr_health_4), shape = "star", color = "blue") +
  geom_smooth(aes(x = age, y = sr_health_4), color = "blue", linetype = "dashed", method = lm, se = F) + 
  labs(title = "Scenario 4: Males excluded from survey",
       subtitle = "Green = 'truth', Blue = 'Males excluded'") +
  theme_bw()

In this case, because male sex was not related to anything else there was no bias in the estimate even though the sample was very non-representative of the source population!

PH 747 - Week 5

Visualizing Data - Part 2

Jon Huang

September 18, 2023

Basic Notes

Review Exercise 1

Initialize the data

Cleaning up the dataset

Univariate statistics

Make a nice Table 1

Save this Table to a .docx

Run a linear regression and produce 95% Confidence Intervals

Make some nice plots

Exercise 2

Initialize the data

Describe variables

Describe missingness

Plotting variables: univariate plots to look at distributions of variables

Generate a working dataset

Scenario 1 (`sr_health_1`): Ethnicity influences reporting of self-reported health (Information Bias)

Scenario 2 (`sr_health_2`): Contact by phone is associated with age and influences self-reported health (Information Bias)

Scenario 3 (`sr_health_3`): Sick people are excluded from surveys (Selection Bias)

Scenario 4 (`sr_health_4`): Males are excluded (Selection Bias)

PH 747 - Week 5

Visualizing Data - Part 2

Jon Huang

September 18, 2023

Basic Notes

Review Exercise 1

Initialize the data

Cleaning up the dataset

Univariate statistics

Make a nice Table 1

Save this Table to a .docx

Run a linear regression and produce 95% Confidence Intervals

Make some nice plots

Exercise 2

Initialize the data

Describe variables

Describe missingness

Plotting variables: univariate plots to look at distributions of variables

Generate a working dataset

Scenario 1 (sr_health_1): Ethnicity influences reporting of self-reported health (Information Bias)

Scenario 2 (sr_health_2): Contact by phone is associated with age and influences self-reported health (Information Bias)

Scenario 3 (sr_health_3): Sick people are excluded from surveys (Selection Bias)

Scenario 4 (sr_health_4): Males are excluded (Selection Bias)

Scenario 1 (`sr_health_1`): Ethnicity influences reporting of self-reported health (Information Bias)

Scenario 2 (`sr_health_2`): Contact by phone is associated with age and influences self-reported health (Information Bias)

Scenario 3 (`sr_health_3`): Sick people are excluded from surveys (Selection Bias)

Scenario 4 (`sr_health_4`): Males are excluded (Selection Bias)