Coursera Analysis Assignment (Week 4)

The objective of this script is to analyse a subset of data from the National Health Interview Survey (NHIS).

Prepare Data

1. Determine which fields in the file that you need for analysis. Determine which values of each field should be treated as missing values.

In the first step, I load packages needed for the analysis. All of the used packages are available on CRAN.

# load packages
library(sampling)
library(dplyr)
library(tidyr)
library(forcats)
library(purrr)
library(survey)
library(ggplot2)
library(readr)
library(mice)
library(tibble)
library(effects)
library(glue)

The subset of data is provided by Coursera as a CSV file. We load this into memory using the readr package.

# read-in data
raw <- read_csv("~/Downloads/NHISsamdat.csv",
  col_types = cols(
    .default = col_character(),
    AGE_P = col_integer(),
    STRATUM = col_integer(),
    BG = col_integer(),
    wt = col_double(),
    pi1 = col_double(),
    pi2 = col_double()
  )
)

The dataset consists of nrow(raw) = 972 observations. Using the codebook provided, the variables of interest for this analysis include:

Dependent variable:

CANEV: Have you EVER been told by a doctor or other health professional that you had…Cancer or a malignancy of any kind?

Covariates:

REGION: Region of residence
SMKEV: Having ever smoked 100 or more cigarettes in lifetime
SEX: Sex
HYPEV: Having hypertension
RACERPI2: Race
CHLEV: Having high cholesterol
AGE_P: Age
CHDEV: Having heart disease
R_MARITL: Marital status

Design variables:

STRATUM: Sample design stratum
BG: Block group
pi1: Selection probability of a block group
pi2: Selection probability of a person within a block group
wt: Survey weight

To facilitate later analysis, I save the names of these variables and our design variables.

# define study variables
study_vars <- c(
  "REGION", "SMKEV", "SEX", "HYPEV", "RACERPI2",
  "CHLEV", "AGE_P", "CHDEV", "R_MARITL", "CANEV"
)

# define design variables
design_vars <- c("STRATUM", "BG", "pi1", "pi2", "wt", "ID")

The next step is to clean the data. To make these more usable, as a first step I recode the levels in the data to factors. The levels are defined using the provided codebook.

# recode variables
df <- raw %>%
  mutate(
    REGION = fct_recode(REGION,
      Northeast = "1",
      Midwest = "2",
      South = "3",
      West = "4"
    ),
    SEX = fct_recode(SEX, 
      Male = "1",
      Female = "2"
    ),
    RACERPI2 = fct_recode(RACERPI2, 
      `White only` = "1",
      `Black/African American only` = "2",
      `AIAN only` = "3",
      `Asian only` = "4",
      `Race group not releasable` = "5",
      `Multiple race` = "6"
    ),
    R_MARITL = fct_recode(R_MARITL, 
      `Under 14 years` = "0",
      `Married - spouse in household` = "1",
      `Married - spouse not in household` = "2",
      `Married - spouse in household unknown` = "3",
      Widowed = "4",
      Divorced = "5",
      Separated = "6",
      `Never married` = "7",
      `Living with partner` = "8",
      `Unknown marital status` = "9"
    )
  ) %>%
  mutate_at(vars(HYPEV, SMKEV, CHLEV, CHDEV, CANEV),
            funs(fct_recode(., 
      No = "2",
      Yes = "1",
      Refused = "7",
      `Not Ascertained` = "8",
      `Don't know` = "9"
    ))) %>%
  mutate_at(vars(HYPEV, SMKEV, CHLEV, CHDEV, CANEV),
            funs(fct_relevel(., "No")
    ))

## Warning: Unknown levels in `f`: 0, 3

## Warning: Unknown levels in `f`: 7, 8

## Warning: Unknown levels in `f`: 7, 9

## Warning: Unknown levels in `f`: 7, 8

## Warning: Unknown levels in `f`: 7, 8

## Warning: Unknown levels in `f`: 8, 9

The warning messages tell us that there are certain levels in the codebook that never appear in our data, and that these are not incorporated into our factor levels. Depending on what our purposes are and what statistical functions we are using, this can either be a desirable or undesirable property. For the present analysis, I do not concern myself with this too much and restrict the factor levels only to those actually observed in the data.

To start the analysis, I check whether there are any NA values in our data:

df %>%
  select(one_of(study_vars)) %>%
  map_lgl(~any(is.na(.)))

##   REGION    SMKEV      SEX    HYPEV RACERPI2    CHLEV    AGE_P    CHDEV 
##    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE 
## R_MARITL    CANEV 
##    FALSE    FALSE

There do not appear to be any NA values in the data.

The next step is to get an initial sense of the data. Since almost all variables of interest are categorical, the simplest approach is create contingency tables for each of the variables:

# tabulate margins of study data
df %>%
  select(one_of(study_vars)) %>%
  map(table)

## $REGION
## 
## Northeast   Midwest     South      West 
##       154       222       326       270 
## 
## $SMKEV
## 
##              No             Yes Not Ascertained 
##             602             369               1 
## 
## $SEX
## 
##   Male Female 
##    415    557 
## 
## $HYPEV
## 
##         No        Yes Don't know 
##        629        342          1 
## 
## $RACERPI2
## 
##                  White only Black/African American only 
##                         733                         165 
##                   AIAN only                  Asian only 
##                           9                          48 
##   Race group not releasable               Multiple race 
##                           1                          16 
## 
## $CHLEV
## 
##         No        Yes Don't know 
##        702        267          3 
## 
## $AGE_P
## 
## 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 
##  8 12  8 15 16 10 13 15 15 14 16  9 16 13 16 20 29 12 14 15 12 13 18 21 20 
## 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 
## 26 15 17 15 16 24 13 23 16 22 11 17 11 16 13 14 15 14 19  8 23 17 16 15 17 
## 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 
## 10 21 10  8 13  7 14 11 13  8  4  6  4  9  7  5  2 37 
## 
## $CHDEV
## 
##         No        Yes Don't know 
##        919         52          1 
## 
## $R_MARITL
## 
##     Married - spouse in household Married - spouse not in household 
##                               429                                20 
##                           Widowed                          Divorced 
##                                88                               137 
##                         Separated                     Never married 
##                                34                               205 
##               Living with partner            Unknown marital status 
##                                57                                 2 
## 
## $CANEV
## 
##      No     Yes Refused 
##     868     103       1

From these sample counts, we can see that certain level categories are very infrequent. There are several ‘missing’-type levels, such as “Don’t know”/“Unknown”, “Refused”, “Not Ascertained”, “Not releasable”, etc. These can be problematic for the analysis for a number of reasons:

They represent missing data, where for certain analyses, it may not be fully appropriate to consider it to be a level onto its own. For example, in cases of a refusal or “don’t know”, we believe that there is a ‘true’ value (e.g. Yes or No) but that this is not captured in the data due to measurement error.
They are infrequent, and can render certain analyses and estimates unstable.
They can restrict available analytic methods. In particular, they make what might otherwise be a binary variable (e.g. Yes/No) into a multivariate one. In the present case, this limits our response variable if we want to rely on methods that require a binary response (such as logistic regression)

To address this, I will first recode these ‘problem’ levels as NA values in R, and then impute values for them. Since there are few cases, the negative impact of “inventing data” should be negligible, and worth the trade-offs.

# substitute 'NA' for quasi-missing levels
sub <- df %>%
  mutate_at(vars(CANEV), funs(na_if(., "Refused"))) %>%
  mutate_at(vars(CHDEV, CHLEV, HYPEV), funs(na_if(., "Don't know"))) %>%
  mutate_at(vars(SMKEV), funs(na_if(., "Not Ascertained"))) %>%
  mutate_at(vars(R_MARITL), funs(na_if(., "Unknown marital status"))) %>%
  mutate_at(vars(RACERPI2), funs(na_if(., "Race group not releasable"))) %>%
  select(one_of(study_vars, design_vars))

# Impute missing values using the MICE package. This method is stochastic, so I
# set a random seed to ensure reproduciblity. The imputed dataset is saved into
# a new data.frame names df_imp
set.seed(65)
imp <- mice(droplevels(sub), m = 1, printFlag = FALSE)
df_imp <- complete(imp)

Because the amount of imputation is minimal (10 cases or less per variable), I will not concern myself with diagnosing the quality of the imputations or using the full power of running multiple imputations. However, this is something that is generally advised if larger amounts of data are missing.

Sampling Design

2. Were block groups selected with equal probability? How do you know? In order to estimate the standard errors of estimates, will you assume that the first‐stage units were selected with or without replacement? How do you make this choice?

Not much information is provided in the accompanying documentation about the sampling mechanism. We can glean from the codebook that we are dealing with a two-stage sample, as there are pi1 and pi2 variables provided for each stage of sampling. The codebook specifies that the pi1 variable represents the probability of selecting a given block group, BG.

By looking at the distinct combinations of pi1 for each BG, we can see that BGs do not all have the same probability:

# print distinct Stage-1 probabilities by BG
(bg_counts <- distinct(df, STRATUM, BG, pi1) )

## # A tibble: 30 x 3
##    STRATUM    BG   pi1
##      <int> <int> <dbl>
##  1       1     2 0.259
##  2       1     6 0.291
##  3       2    11 0.314
##  4       2    14 0.282
##  5       3    17 0.303
##  6       3    21 0.285
##  7       4    24 0.278
##  8       4    27 0.264
##  9       5    31 0.245
## 10       5    34 0.306
## # ... with 20 more rows

For example, within stratum 1, BG 2 was selected with probability 0.259, whereas BG 6 was selected with probability 0.291.

Our next line of inquiry is to determine whether first‐stage units were selected with or without replacement. If any BGs appear in the data twice (within a stratum), we can be sure that the design was with replacement. This is not the case. This leaves two possibilities: either the design was without replacement, or the design was with-replacement but happened to produce a sample without any duplicates by chance. We cannot be 100% sure which of these two actually happened, but can attempt an answer based on probability. To do so, I begin by printing a list of our selected BGs:

# print IDs of selected BGs in each stratum
df %>%
  group_by(STRATUM) %>%
  summarise(sampled_psus = glue::collapse(unique(BG), sep = ", ")) %>%
  mutate(id_floor = (STRATUM - 1) * 7 + 1,
         id_ceiling = STRATUM * 7)

## # A tibble: 15 x 4
##    STRATUM sampled_psus id_floor id_ceiling
##      <int> <chr>           <dbl>      <dbl>
##  1       1 2, 6               1.         7.
##  2       2 11, 14             8.        14.
##  3       3 17, 21            15.        21.
##  4       4 24, 27            22.        28.
##  5       5 31, 34            29.        35.
##  6       6 36, 39            36.        42.
##  7       7 45, 49            43.        49.
##  8       8 50, 53            50.        56.
##  9       9 59, 62            57.        63.
## 10      10 67, 70            64.        70.
## 11      11 72, 76            71.        77.
## 12      12 80, 84            78.        84.
## 13      13 85, 88            85.        91.
## 14      14 92, 95            92.        98.
## 15      15 102, 105          99.       105.

From this table, we observe a few critical things: First, only two BGs are selected per stratum. Each stratum seems to conform to a pattern where serials are within a range of no more than 7 integers apart, and the sequence of IDs suggests that the frame had exactly 7 BGs per stratum (that is, \(N_h = 7\)). This hypothesis is strongly supported by the fact that the sum of the stage-1 weights per stratum are all close to 7:

bg_counts %>%
  group_by(STRATUM) %>%
  summarise(stratum_weight = sum(1 / pi1))

## # A tibble: 15 x 2
##    STRATUM stratum_weight
##      <int>          <dbl>
##  1       1           7.29
##  2       2           6.72
##  3       3           6.81
##  4       4           7.39
##  5       5           7.35
##  6       6           7.05
##  7       7           7.13
##  8       8           6.93
##  9       9           6.94
## 10      10           7.39
## 11      11           6.74
## 12      12           6.91
## 13      13           7.02
## 14      14           6.91
## 15      15           6.91

If the design was indeed with replacement, how likely would it be to observe a sample with no duplicate BGs? We do not have the inclusion probabilities for the full frame, so this cannot be calculated exactly. However, since pi1 probabilities are not too different from one another, we can approximate an answer by looking at how likely such a result would be if a simple random sample with replacement (srswr) had been used.

Under a stratified srswr design, within a given stratum with \(N_h = 7\), the probability of drawing the same element twice is \(\frac{1}{7} \times \frac{1}{7} \times 7 = \frac{1}{7}\). By the principle of inclusion-exclusion, the probability of drawing two different numbers is \(\frac{6}{7}\). Since each of samples drawn in the 15 strata are independent from one another, the probability of observing no duplicates is simply \((\frac{6}{7})^{15} = 0.099\). If a srswr design was used, we would expect this type of result 1% of the time. This is a very unlikely outcome, so I can be confident in my belief that the design was actually without replacement.

So if a srs design was not applied, what design was used?

My best answer is that a without-replacement probability-proportionate to size design was used (πps). A few arguments can be raised in favour of this view.

First, PPS designs are very common in real surveys. This is especially the case when area sampling, as seems to be the case with the current data that have regions and block group. This is because PPS designs are typically very efficient.

Second, the stage-1 probabilities are in direct proportion to the BG population totals. The BG population totals are not given directly, but can be calculated by summing the inverse of pi2 values. If we plot the stage-1 probabilities against the population totals, we obtain the following plot:

# plot distinct stage-1 probabilities against BG population total
df_imp %>%
  group_by(STRATUM, BG, pi1, pi2) %>%
  summarise(n = n(),
            bg_pop = sum(1 / pi2)) %>%
  ggplot(aes(bg_pop, pi1)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Relationship between BG inclusion probability and BG sample size")

The near-perfect linear relationship gives me strong cause to believe that a PPS design was used in stage 1. Furthermore, as all pi2 are roughly 0.10, and that no duplicate person IDs exist in the data, I conculde that a srswor was used in stage two with a roughly constant sampling proportion of 10%.

Finally, since only two PSUs were sampled per stratum, it is possible that the design used a Brewer variant of the PPS family of designs. This type of design selects exactly two primary sampling units per stratum. This is something which I will return to later when we define our sampling design using the survey package.

As a final step, since the stratum-wise and cluster-wise sampling fractions (or population counts) are not given in the data, I add them myself. This will be useful in later stages if we want to use ad hoc finite population corrections.

df_imp <- df_imp %>%
  group_by(STRATUM) %>%
  mutate(Nh = 7,
         f1 = 2 / 7) %>%
  group_by(BG) %>%
  mutate(Ni = sum(1 / pi2),
         f2 = n() / Ni) %>%
  ungroup

Population Estimates

3. What is the estimate of the total number of persons in the population?

Using the π-estimator, the estimate of the population size is simply the sum of they survey weights:

\[ \hat{N} = \sum_{k \in s}\frac{1}{\pi_{k}} = \sum_{k \in s}w_{k} \]

This is very straightforward to apply in R:

# population total
sum(df_imp$wt)

## [1] 33672

# alternatively, we get the same result using pi1 and pi2
sum(1 / (df_imp$pi1 * df_imp$pi2))

## [1] 33672

We estimate that there are 33,672 individuals in the population. Due to the clustering, the estimate is a random quantity, and comes with a degree of uncertainty.

4. What proportion of persons have ever been told that they had some type of cancer? What is the estimated standard error of the proportion?

The variable CANEV captures whether or not a responded was ever told that they had some type of cancer. Using the π-estimator, the proportion of person that have been told they have had some type of cancer (\(y\)) is given by:

\[ \tilde{y}_s = \frac{\hat{t}_\pi}{\hat{N}} = \frac{\sum_{k \in s}y_{k}w_{k}}{\sum_{k \in s}w_{k}} = \frac{\sum_{k \in s}\check{y}_{k}}{\sum_{k \in s}w_{k}} \] We can calculate this ‘by hand’ using the survey weights:

df_imp %>%
  group_by(CANEV) %>%
  summarise(t_hat = sum(wt)) %>%
  mutate(wt_mean = t_hat / sum(t_hat))

## # A tibble: 2 x 3
##   CANEV  t_hat wt_mean
##   <fct>  <dbl>   <dbl>
## 1 No    30057.   0.893
## 2 Yes    3615.   0.107

In other words, eleven percent of the population has been told by a physician that they have had some type of cancer.

We can arrive at the same results using the survey package. The benefit of relying on survey is that we can also calculate standard errors, confidence intervals, and design effects. A few different approaches are taken here. A simple ‘without-replacement’ design is our starting point. We also define Brewer PPS designs that implements ad-hoc finite population corrections. Finally, I also define a stratified twostage design and a Jackknife replication design. Each of these designs will return the same point estimates, but different standard errors.

# two-stage, with-replacement design
dsg1 <- svydesign(ids = ~BG + ID, 
                  probs = ~pi1 + pi2, 
                  data = df_imp,
                  strata = ~STRATUM+NULL)

# Brewer PPS design with finite population correction
dsg2 <- svydesign(ids = ~BG+ID, 
                  pps = "brewer",
                  data = df_imp,
                  fpc = ~pi1+f2,
                  strata = ~STRATUM+NULL)

# stratified 2-stage cluster design, with fpc
dsg3 <- svydesign(ids = ~BG+ID, 
                  probs = ~pi1 + pi2, 
                  data = df_imp,
                  fpc = ~Nh + Ni,
                  strata = ~STRATUM+NULL)

# Jackknife replication design
dsg4 <- as.svrepdesign(dsg3, "JKn")

# estimate the proporation of patients that have been told 
# they have had cancer
dsg <- list(`Without Replacement Design` = dsg1,
            `PPS Design (with fpc)` = dsg2,
            `Stratified 2-Stage Design (with fpc)` = dsg3,
            `Jackknife Replication Design` = dsg4)
map_df(dsg, ~as.data.frame(svymean(~CANEV, .x, deff = TRUE))[2, ], .id = "design")

##                                 design     mean         SE     deff
## 1           Without Replacement Design 0.107357 0.01586920 2.627499
## 2                PPS Design (with fpc) 0.107357 0.01430078 2.133791
## 3 Stratified 2-Stage Design (with fpc) 0.107357 0.01430859 2.136122
## 4         Jackknife Replication Design 0.107357 0.01341192 1.876785

As expected, all design objects return an estimate of 10.7%. The simplest ‘with replacement’ design suggests a standard error of 0.015869. This can be used to produce a 95% confidence interval of [0.07625391, 0.1384600]. The other designs give slightly smaller standard errors. For dsg2 and dsg3, this is because of the finite population corrections.

The weighted estimate can be compared to the ‘naive’ sample mean, which would be used if the data were generated through a simple random sample:

# via the survey package
naive_dsg <- svydesign(ids = ~1, data = df_imp)
svymean(~CANEV, naive_dsg)

##             mean     SE
## CANEVNo  0.89403 0.0099
## CANEVYes 0.10597 0.0099

As we can see, the result is almost identical to the weighted estimate. This may not be the case for other variables in the data. It should also be noted that although the point estimate is nearly the same, the design-based estimate provides a design effect of ranging between 1.9 and 2.6 (depending on which design object we use). This suggests that the sampling design is 1.9 - 2.6 times less efficient than a simple random sample. Accordingly, the error rates using methods that account for the design variables are much wider.

Analytic Associations

5. Determine whether having been told by a doctor or other health professional that a person had cancer or a malignancy of any kind is related to any of the following covariates: REGION, SMKEV, SEX, HYPEV, RACERPI2, CHLEV, CHDEV, R_MARITL, AGE_P

Since all variables in question are categorical (with the exception of AGE_P), I will rely on contingency table analysis. In particular, I will use Chi-square tests with Rao & Scott adjustments, which account for the sampling design. For the AGE_P variable, I will rely on regression. As is common, I set my alpha rate at 5%.

# run multiple chisquare tests
res <- list()
res$REGION   <- svychisq(~CANEV+REGION, dsg1)
res$SMKEV    <- svychisq(~CANEV+SMKEV, dsg1)
res$SEX      <- svychisq(~CANEV+SEX, dsg1)
res$HYPEV    <- svychisq(~CANEV+HYPEV, dsg1)
res$RACERPI2 <- svychisq(~CANEV+RACERPI2, dsg1)
res$CHLEV    <- svychisq(~CANEV+CHLEV, dsg1)
res$CHDEV    <- svychisq(~CANEV+CHDEV, dsg1)
res$R_MARITL <- svychisq(~CANEV+R_MARITL, dsg1)

# run regression if independent variable is continuous
fit_age <- svyglm(CANEV~AGE_P, dsg1, family = quasibinomial())

# Assemble all p-values from contingency tables
pvals <- map_dbl(res, "p.value")

# Add p-value from regression
pvals <- c(pvals, summary(fit_age)$coefficients[, "Pr(>|t|)"]["AGE_P"])

# print p-values (unadjusted)
(round(pvals, 2))

##   REGION    SMKEV      SEX    HYPEV RACERPI2    CHLEV    CHDEV R_MARITL 
##     0.22     0.14     0.18     0.01     0.07     0.02     0.00     0.00 
##    AGE_P 
##     0.00

These p-values overstate the true probability of observing such results under the null hypothesis due to repeated testing. To control the false discovery rate, we can adjust these values with the p.adjust() function in R:

# adjusted p-values
round(p.adjust(pvals), 2)

##   REGION    SMKEV      SEX    HYPEV RACERPI2    CHLEV    CHDEV R_MARITL 
##     0.43     0.43     0.43     0.04     0.29     0.12     0.00     0.00 
##    AGE_P 
##     0.00

From this, we reject the null hypothesis and conclude that HYPEV, CHDEV, R_MARITL, and AGE_P are marginally associated with CANEV. Note that after adjusting for repeated testing, the CHLEV variable is no longer significant.

Some of these results are surprising. My own intuition would have told me that SMKEV would have been associated with incidence of cancer, whereas a factor such as marital status would not be. It is possible that a variable such as marital status is a confounder (e.g. widowed individuals are likely to be older and have been told they have cancer). This will be something to explore in the next section.

6. Are there combinations of covariates (interactions) that seem related to having cancer?

It would be preferable not only to know about marginal associations, but also about effect sizes and joint associations. For this, we can rely on regression techniques. Specifically, we can use a design-based logistic regression. For present purposes, I will limit my analytically scope on the covariates that were previously found to be marginally associated. Since CHLEV was on the edge, I include it here. I continue to rely on the survey package so as to conduct regression rooted in a design-based paradigm.

# run logistic regression. Most reference levels are set to 'No' to make the
# coefficient interpretation intuitive. The R_MARITL variable's reference is set
# to 'Married - spouse in household', which is the most common level.
fit1 <- svyglm(CANEV ~ AGE_P + HYPEV + CHLEV + CHDEV + R_MARITL,
              design = dsg1, 
              family = quasibinomial())

# print a summary of the model fit
summary(fit1)

## 
## Call:
## svyglm(formula = CANEV ~ AGE_P + HYPEV + CHLEV + CHDEV + R_MARITL, 
##     design = dsg1, family = quasibinomial())
## 
## Survey design:
## svydesign(ids = ~BG + ID, probs = ~pi1 + pi2, data = df_imp, 
##     strata = ~STRATUM + NULL)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.522207   0.372156 -14.838 2.51e-05 ***
## AGE_P        0.062838   0.006362   9.877 0.000181 ***
## HYPEV2      -0.405963   0.176359  -2.302 0.069606 .  
## CHLEV2       0.087345   0.283818   0.308 0.770680    
## CHDEV2       0.385608   0.270969   1.423 0.214000    
## R_MARITL2   -0.140923   0.695085  -0.203 0.847329    
## R_MARITL3   -0.394624   0.270762  -1.457 0.204781    
## R_MARITL4    0.320670   0.300347   1.068 0.334487    
## R_MARITL5   -0.550742   0.562934  -0.978 0.372837    
## R_MARITL6   -1.293644   0.629792  -2.054 0.095147 .  
## R_MARITL7    0.465381   0.542844   0.857 0.430443    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasibinomial family taken to be 0.9724908)
## 
## Number of Fisher Scoring iterations: 7

The model reveals a few important findings. For starters, the fit finds that that the only variable that is statistically significant is AGE_P (since our alpha of 0.05). This suggests that although the other covariates are marginally associated, that this association largely disappears once age is accounted for. It is possible that these other covariates may mediate the relationship between CANEV and AGE_P.

To explore this possibility, I prune down fit1 and compare the new and old fits.

# remove all covariates except AGE_P
fit2 <- update(fit1, . ~ AGE_P)

# remove all covariates except AGE_P
fit3 <- update(fit1, . ~ AGE_P * HYPEV)

# compare models
anova(fit2, fit1)

## Working (Rao-Scott+F) LRT for HYPEV CHLEV CHDEV R_MARITL
##  in svyglm(formula = CANEV ~ AGE_P + HYPEV + CHLEV + CHDEV + R_MARITL, 
##     design = dsg1, family = quasibinomial())
## Working 2logLR =  19.07649 p= 0.20628 
## (scale factors:  2.9 2.1 1.2 1 0.7 0.43 0.38 0.13 0.046 );  denominator df= 5

An analysis of variance reveals that the addition of covariates (over and above only keeping AGE_P) is not statistically significant (\(p = 0.21\)), and therefore not warranted. Our best model is fit2, which only retains age as a risk factor.

We can see if there are any statistically significant interactions. A heuristic method to go about this is to take the two most significant covariates and attempt to interact them. Since we only have one covariate left (age), I will try to interact it with the variable that had the lowest p-value (HYPEV).

# add an interaction
fit3 <- update(fit1, . ~ AGE_P * HYPEV)

# compare models
anova(fit2, fit3)

## Working (Rao-Scott+F) LRT for HYPEV HYPEV:AGE_P
##  in svyglm(formula = CANEV ~ AGE_P + HYPEV + AGE_P:HYPEV, design = dsg1, 
##     family = quasibinomial())
## Working 2logLR =  2.758938 p= 0.27625 
## (scale factors:  1.4 0.64 );  denominator df= 12

The analysis of variance suggests that the interaction between AGE_P and HYPEV is not justified. There is insufficient evidence to suggest that there are combinations of covariates that are related to having cancer. I therefore revert back to our fit2 as our ‘final’ model. Analyzing the summary of this model, we get:

# print summary of final model
summary(fit2)

## 
## Call:
## svyglm(formula = CANEV ~ AGE_P, design = dsg1, family = quasibinomial())
## 
## Survey design:
## svydesign(ids = ~BG + ID, probs = ~pi1 + pi2, data = df_imp, 
##     strata = ~STRATUM + NULL)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.501487   0.356463  -15.43 3.49e-10 ***
## AGE_P        0.059155   0.005131   11.53 1.56e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasibinomial family taken to be 0.9201444)
## 
## Number of Fisher Scoring iterations: 6

The summary of the fit gives a coefficient of 0.059155 on AGE_P. For a logistic model, the interpretation of this is that for every additional 10 years in age, we would expect the odds of ever being told that have cancer to increase exp(0.059155 * 10) = 1.8 times. This validity of this pattern is only supported by the range of the original data: that is to say for ages between 18 and 85.

We can use the effects package to better visualize this relationship on the response scale (i.e. the probability scale):

# plot marginal effects
plot(allEffects(fit2))

From the plot, we see that conditional on the model, the estimated probability of ever having been told that you have cancer ranges from 1% for 20-year olds, all the way to 32% for octogenarians.

7. Assuming that you used some sort of model to explore point 5, run the analyses two ways—accounting for the complex design and ignoring it. Are there any differences in your conclusions about which variables are predictive of having cancer?

We can run the exact same analysis using unweighted methods:

# run multiple unweighted chisquare tests
unwt <- list()
unwt$REGION   <- with(df_imp, chisq.test(CANEV, REGION))
unwt$SMKEV    <- with(df_imp, chisq.test(CANEV, SMKEV))
unwt$SEX      <- with(df_imp, chisq.test(CANEV, SEX))
unwt$HYPEV    <- with(df_imp, chisq.test(CANEV, HYPEV))
unwt$RACERPI2 <- with(df_imp, chisq.test(CANEV, RACERPI2))

## Warning in chisq.test(CANEV, RACERPI2): Chi-squared approximation may be
## incorrect

unwt$CHLEV    <- with(df_imp, chisq.test(CANEV, CHLEV))
unwt$CHDEV    <- with(df_imp, chisq.test(CANEV, CHDEV))
unwt$R_MARITL <- with(df_imp, chisq.test(CANEV, R_MARITL))

## Warning in chisq.test(CANEV, R_MARITL): Chi-squared approximation may be
## incorrect

# run unweighted regression if independent variable is continuous
fit_age_unwt <- glm(CANEV~AGE_P, df_imp, family = binomial())

# Assemble all p-values from contingency tables
pvals_unwt <- map_dbl(unwt, "p.value")

# Add p-value from regression
pvals_unwt <- c(pvals_unwt, summary(fit_age_unwt)$coefficients[, "Pr(>|z|)"]["AGE_P"])

# print p-values (adjusted to control FDR)
round(p.adjust(pvals_unwt), 2)

##   REGION    SMKEV      SEX    HYPEV RACERPI2    CHLEV    CHDEV R_MARITL 
##     0.09     0.09     0.17     0.08     0.09     0.00     0.00     0.00 
##    AGE_P 
##     0.00

The unweighted results have much lower p-values than those that account for the sampling design. Had we analysed these data as if they had been produced via a srs mechanism, we would have concluded (likely erroneously) that R_MARITL and CHLEV were marginally associated with CANEV at a statistically significant level. Furthermore, had we set our alpha rate at 0.10, the difference between the two analyses would have run even deeper, resulting in finding marginal associations with all variables except SEX.

We can similarly run an unweighted logistic regression against the same set of covariates as in our weighted fit1 analysis:

# unweighted logistic regression
fit_unwt <- glm(fit1$formula, data = df_imp, family = binomial())
summary(fit_unwt)

## 
## Call:
## glm(formula = fit1$formula, family = binomial(), data = df_imp)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3300  -0.5062  -0.3188  -0.1506   3.0841  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.526912   0.559506  -9.878  < 2e-16 ***
## AGE_P        0.062752   0.009061   6.926 4.34e-12 ***
## HYPEV2      -0.409638   0.250525  -1.635   0.1020    
## CHLEV2       0.090104   0.245868   0.366   0.7140    
## CHDEV2       0.389749   0.363116   1.073   0.2831    
## R_MARITL2   -0.105517   0.786804  -0.134   0.8933    
## R_MARITL3   -0.394614   0.334410  -1.180   0.2380    
## R_MARITL4    0.316057   0.294037   1.075   0.2824    
## R_MARITL5   -0.514253   0.766345  -0.671   0.5022    
## R_MARITL6   -1.290969   0.618312  -2.088   0.0368 *  
## R_MARITL7    0.460314   0.524234   0.878   0.3799    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 657.07  on 971  degrees of freedom
## Residual deviance: 552.19  on 961  degrees of freedom
## AIC: 574.19
## 
## Number of Fisher Scoring iterations: 7

The unweighted regression have produced similar but slightly difference conclusions. We still would have determined that the covariate AGE_P is significant. However, we would also have seen that one of the levels in the R_MARITL variable has a p-value below 0.05. An analysis of variance reveals that removing the R_MARITL variable would indeed reduce the predictive accuracy of the model.

# does removing the R_MARITL variable worsen the fit?
anova(fit_unwt, update(fit_unwt, .~.-R_MARITL))

## Analysis of Deviance Table
## 
## Model 1: CANEV ~ AGE_P + HYPEV + CHLEV + CHDEV + R_MARITL
## Model 2: CANEV ~ AGE_P + HYPEV + CHLEV + CHDEV
##   Resid. Df Resid. Dev Df Deviance
## 1       961     552.19            
## 2       967     563.93 -6  -11.738

From this, we see that that there are small differences in the conclusions drawn between the weighted and unweighted analyses. However, the analyses are largely the same.

Coursera Analysis Assignment (Week 4)

2018-04-13

Prepare Data

Sampling Design

Population Estimates

Analytic Associations