Installing Packages

You will need some packages to help you do certain things in R. It is similar to how you download apps from the App Store to your iPhone: you use iMessage to send texts and Safari to browse the internet. In R, you can use the “haven” package for reading and writing data, the “dplyr” package for data manipulation, and the “survey” package to work with complex survey designs that have survey weights.

You will likely need other packages down the road. It all depends on what you want to do for your project. You only need to install each package once. To install these packages, run the code below after removing the “#” character at the beginning of each line.

#install.packages("haven")
#install.packages("dplyr")
#install.packages("survey")
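If you are not sure which of these packages you already have, the sketch below lists only the ones that are still missing. The helper name missing_packages is made up for this illustration; requireNamespace() and install.packages() are standard R functions.

```r
# Hypothetical helper: return the names of packages that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("haven", "dplyr", "survey")
to_install <- missing_packages(needed)
print(to_install)

# Uncomment the next line to install whatever is missing
# if (length(to_install) > 0) install.packages(to_install)
```

An empty result means everything is already installed and you can skip the install step.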

Loading Packages

Once your packages are installed, you need to load them into your current R session so you can use them. Unlike installing, you have to do this every time you start a new session. Here is how to do it:

library(haven)
library(dplyr)
library(survey)

Loading BRFSS Data

The code below stores the location of your file (the file path) so that R knows where to pull your file from. If you move your file somewhere else later, you will need to update this code too. My path will be different from yours. Pay attention to the direction of the slashes. If your path uses backslashes “\” (the Windows default), you will need to change them to forward slashes “/”.

file_path <- "C:/Users/Admin/Downloads/LLCP2022XPT/LLCP2022.XPT"
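If you copied your path straight from Windows File Explorer, you can let R flip the slashes for you. This is only a sketch: windows_path is a made-up variable name, and the path is the same example path used above, so swap in your own.

```r
# A Windows-style path (each backslash must be doubled inside an R string)
windows_path <- "C:\\Users\\Admin\\Downloads\\LLCP2022XPT\\LLCP2022.XPT"

# Replace every backslash with a forward slash
file_path <- gsub("\\", "/", windows_path, fixed = TRUE)
print(file_path)

# Check that the file is really there before trying to read it
print(file.exists(file_path))
```

If file.exists() prints FALSE, the path is wrong or the file has moved, and reading the file will fail.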

The code below reads your BRFSS file. It may take some time to run because the file is large.

data <- read_xpt(file_path)

Subsetting the Dataset

Okay. When you loaded the data using the code above, it probably took some time. Let’s subset your data so that it runs faster in the future.

Here is how to do it:

NHOPI_NVHI_data <- data %>% 
                 filter(`_RACE1` == 5 & (`_STATE` == 15 | `_STATE` == 32))

Doing this dropped the sample size to 679. While your new data set will run faster, you may be limited in what you can do with it. By the way, if I were your advisor, I would ask why you are only looking at Nevada and Hawaii and leaving out all the other states. Make sure you have a good justification.
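To avoid re-reading the large file every session, one option is to save the subset once and reload it later. Here is a minimal sketch using a tiny made-up data frame in place of the real BRFSS data. The values in toy_data and the temporary file are assumptions for illustration; in practice you would save NHOPI_NVHI_data to a location of your choosing. It uses base R instead of dplyr so that it runs on its own.

```r
# Made-up stand-in for the full BRFSS data (values are for illustration only)
toy_data <- data.frame(race = c(5, 1, 5, 5, 2),
                       state = c(15, 15, 32, 6, 32))
names(toy_data) <- c("_RACE1", "_STATE")

# Same filter as above, written in base R
toy_subset <- toy_data[toy_data[["_RACE1"]] == 5 &
                         toy_data[["_STATE"]] %in% c(15, 32), ]

# Save the subset once ...
subset_file <- tempfile(fileext = ".rds")  # replace with your own path
saveRDS(toy_subset, subset_file)

# ... and reload it in a later session instead of re-reading the XPT file
reloaded <- readRDS(subset_file)
print(nrow(reloaded))
```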

In the code below, I am assigning names to the values of the variable _STATE: 15 = Hawaii and 32 = Nevada. Named values are easier to read later on. We will also run a summary of this variable, which gives us the count of respondents from Hawaii and Nevada.

NHOPI_NVHI_data$`_STATE` <- factor(NHOPI_NVHI_data$`_STATE`,
                                 levels = c(15, 32),
                                 labels = c("Hawaii", "Nevada"))

summary(NHOPI_NVHI_data$`_STATE`)
## Hawaii Nevada 
##    657     22
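One thing to watch for when you label values this way: any code that is not listed in levels quietly becomes NA. Here is a quick check on a made-up vector (the codes are illustrative, not BRFSS data):

```r
# Made-up state codes; 6 is not listed in `levels` below
codes <- c(15, 32, 15, 6)

labeled <- factor(codes,
                  levels = c(15, 32),
                  labels = c("Hawaii", "Nevada"))

# useNA = "ifany" makes the unlabeled code visible as <NA>
print(table(labeled, useNA = "ifany"))
```

Running table() with useNA = "ifany" after every relabeling step is a cheap way to catch codes you forgot to list.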

Selecting Cancer Variables

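Next, you need variables. Look at the BRFSS codebook to find them. I will help you with a few, but you can add more if you want. Each labeled copy below gets a “w” added to its name so that the original variable stays untouched.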

  1. Variable Name: CHCSCNC1
NHOPI_NVHI_data$CHCSCNC1w <- factor(NHOPI_NVHI_data$CHCSCNC1,
                                   levels = c(1, 2, 7, 9),
                                   labels = c("Yes", "No", "Don’t know / Not sure", "Refused"))
  2. Variable Name: CHCOCNC1
NHOPI_NVHI_data$CHCOCNC1w <- factor(NHOPI_NVHI_data$CHCOCNC1,
                                   levels = c(1, 2, 7, 9),
                                   labels = c("Yes", "No", "Don’t know / Not sure", "Refused"))
  3. Variable Name: HADMAM
NHOPI_NVHI_data$HADMAMw <- factor(NHOPI_NVHI_data$HADMAM,
                                 levels = c(1, 2, 7, 9),
                                 labels = c("Yes", "No", "Don’t know/Not sure", "Refused"))
  4. Variable Name: HOWLONG
NHOPI_NVHI_data$HOWLONGw <- factor(NHOPI_NVHI_data$HOWLONG,
                                   levels = c(1, 2, 3, 4, 5, 7),
                                   labels = c("Within the past year (anytime less than 12 months ago)",
                                              "Within the past 2 years (1 year but less than 2 years ago)",
                                              "Within the past 3 years (2 years but less than 3 years ago)",
                                              "Within the past 5 years (3 years but less than 5 years ago)",
                                              "5 or more years ago",
                                              "Don’t know/Not sure"),
                                   ordered = TRUE)

# Treat codes 5 ("5 or more years ago") and 7 ("Don’t know/Not sure") as NA.
# Any code not listed in `levels` above is already turned into NA by factor().
# Note: code 5 is a real answer, so keep it if your analysis needs it.
NHOPI_NVHI_data$HOWLONGw[NHOPI_NVHI_data$HOWLONGw == "5 or more years ago"] <- NA
NHOPI_NVHI_data$HOWLONGw[NHOPI_NVHI_data$HOWLONGw == "Don’t know/Not sure"] <- NA
  5. Variable Name: CERVSCRN
NHOPI_NVHI_data$CERVSCRNw <- factor(NHOPI_NVHI_data$CERVSCRN,
                                 levels = c(1, 2, 7, 9),
                                 labels = c("Yes", "No", "Don’t know/Not sure", "Refused"))
  6. Variable Name: CRVCLCNC
NHOPI_NVHI_data$CRVCLCNCw <- factor(NHOPI_NVHI_data$CRVCLCNC,
                                 levels = c(1, 2, 7, 9),
                                 labels = c("Yes", "No", "Don’t know/Not sure", "Refused"))
  7. Variable Name: CRVCLPAP
NHOPI_NVHI_data$CRVCLPAPw <- factor(NHOPI_NVHI_data$CRVCLPAP,
                                 levels = c(1, 2, 7, 9),
                                 labels = c("Yes", "No", "Don’t know/Not sure", "Refused"))
  8. Variable Name: CRVCLHPV
NHOPI_NVHI_data$CRVCLHPVw <- factor(NHOPI_NVHI_data$CRVCLHPV,
                                 levels = c(1, 2, 7, 9),
                                 labels = c("Yes", "No", "Don’t know/Not sure", "Refused"))
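Several of the Yes/No variables above share exactly the same coding, so the repeated code could be folded into a small helper. This is only a sketch: label_yes_no and the toy data frame are made-up names and values, not part of the original code, and the helper uses the “Don’t know/Not sure” spelling without spaces, so it does not match the CHCSCNC1w and CHCOCNC1w labels.

```r
# Hypothetical helper: label a BRFSS-style Yes/No item coded 1, 2, 7, 9
label_yes_no <- function(x) {
  factor(x,
         levels = c(1, 2, 7, 9),
         labels = c("Yes", "No", "Don’t know/Not sure", "Refused"))
}

# Made-up responses standing in for NHOPI_NVHI_data
toy <- data.frame(HADMAM = c(1, 2, 7, NA), CERVSCRN = c(2, 2, 9, 1))

# Add a labeled copy of each variable with a "w" suffix, as in the code above
for (v in c("HADMAM", "CERVSCRN")) {
  toy[[paste0(v, "w")]] <- label_yes_no(toy[[v]])
}

print(toy)
```

With the real data, you would loop over the variable names in NHOPI_NVHI_data instead of the toy data frame.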

Creating Survey Design Object

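This setup is essential for performing weighted survey analysis correctly because it accounts for the complex survey design, including stratification (_STSTR) and weighting (_LLCPWT). The survey.lonely.psu = "adjust" option tells the survey package how to handle any stratum that is left with only one sampling unit, which can happen after subsetting.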

options(survey.lonely.psu = "adjust")

brfssdsgn <- svydesign(
  id = ~1,
  strata = ~`_STSTR`,
  weights = ~`_LLCPWT`,
  data = NHOPI_NVHI_data)

Let’s Play Around a Little Bit

Remember how we got 657 NHOPIs in Hawaii and 22 NHOPIs in Nevada above? Those are only the numbers of survey respondents, not the actual numbers of NHOPIs in these 2 states. If we use the survey design object that we created above, we get a much better estimate of the number of NHOPIs in Nevada and Hawaii.

# Get the summary of the variable _STATE using the survey design
state_summary <- svytable(~`_STATE`, design = brfssdsgn)

# Print the summary
print(state_summary)
## _STATE
##    Hawaii    Nevada 
## 102611.65  32641.53

Before using our survey design object, Hawaii only had 657 NHOPIs (the respondents in our sample). After applying the survey design object, Hawaii has an estimated 102,612 NHOPIs. This is the power of survey design. Ideally, we would get information from every single person in America, but that is too time-consuming and requires too much money. We use survey weights to “guess-timate” the real number of people within the population.
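Under the hood, a weighted count is just the sum of the survey weights of the respondents in each group: each respondent “stands in” for that many people in the population. Here is a toy illustration in base R. The four respondents and their weights are made up and have nothing to do with the real BRFSS weights.

```r
# Made-up respondents: each row is one person with a survey weight
toy <- data.frame(
  state  = c("Hawaii", "Hawaii", "Hawaii", "Nevada"),
  weight = c(150.5, 200.0, 99.5, 1500.0)
)

# Unweighted count: how many people answered the survey
print(table(toy$state))

# Weighted count: how many people in the population those respondents represent
weighted_counts <- tapply(toy$weight, toy$state, sum)
print(weighted_counts)
```

Notice that a state with very few respondents can still represent many people if each respondent carries a large weight.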

Let’s break down those who have ever been told they have skin cancer that is not melanoma by the 2 states. We will use the variables CHCSCNC1 and _STATE.

brfssdsgn <- update(brfssdsgn, CHCSCNC1w = ifelse(CHCSCNC1w %in% c("Don’t know / Not sure", "Refused"), NA,
                                                   ifelse(CHCSCNC1w == "Yes", "a.Skin Cancer",
                                                          ifelse(CHCSCNC1w == "No", "b.No Skin Cancer", CHCSCNC1w))))


state_cancer_table <- svytable(~`_STATE` + CHCSCNC1w, design = brfssdsgn)

print(state_cancer_table)
##         CHCSCNC1w
## _STATE   a.Skin Cancer b.No Skin Cancer
##   Hawaii      830.3348      101304.6165
##   Nevada      645.5875       31995.9413

Prevalence of Skin Cancer (non-melanoma): The weighted prevalence of skin cancer (non-melanoma) is about 0.81% in Hawaii and 1.98% in Nevada.

# Calculate the row percentages
row_totals <- apply(state_cancer_table, 1, sum)
row_percentages <- sweep(state_cancer_table, 1, row_totals, FUN = "/") * 100

# Print the row percentages
print(row_percentages)
##         CHCSCNC1w
## _STATE   a.Skin Cancer b.No Skin Cancer
##   Hawaii     0.8129781       99.1870219
##   Nevada     1.9778101       98.0221899

At first glance, this table shows a difference in skin cancer (non-melanoma) prevalence between the two states. In fact, the prevalence of skin cancer (non-melanoma) among NHOPIs in Nevada is approximately 2.43 times that of Hawaii. However, it’s essential to determine whether this observed difference is statistically significant.
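The 2.43 figure can be reproduced from the weighted counts printed earlier. The sketch below types those counts into a small matrix by hand (the object names are made up for this illustration) and uses prop.table(), a shorter base R alternative to the apply() and sweep() approach.

```r
# Weighted counts copied from the svytable() output above
weighted <- matrix(c(830.3348, 101304.6165,
                     645.5875,  31995.9413),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(state = c("Hawaii", "Nevada"),
                                   status = c("Skin Cancer", "No Skin Cancer")))

# Row percentages: each row sums to 100
row_pct <- prop.table(weighted, margin = 1) * 100
print(round(row_pct, 2))

# Prevalence ratio: Nevada relative to Hawaii
prevalence_ratio <- row_pct["Nevada", "Skin Cancer"] / row_pct["Hawaii", "Skin Cancer"]
print(round(prevalence_ratio, 2))
```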

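Looking at the contingency table of these 2 variables without the survey design object, we see very low numbers. There was only 1 person with skin cancer in Nevada. With cell counts this small, the usual chi-square approximation is unreliable, so for the unweighted data we would need an exact test such as Fisher’s exact test.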

# Create a cross-tabulation using xtabs
xtab <- xtabs(~ `_STATE` + CHCSCNC1w, data = brfssdsgn$variables)

# Print the cross-tabulation
print(xtab)
##         CHCSCNC1w
## _STATE   a.Skin Cancer b.No Skin Cancer
##   Hawaii            12              643
##   Nevada             1               21
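To see just how sparse this table is, you can compute the counts you would expect if state and skin cancer were unrelated. The sketch below also runs Fisher’s exact test, one exact test often used for small 2-by-2 tables. It is an unweighted illustration only: it ignores the survey weights and strata, so it is not a replacement for a design-based test. The counts are typed in by hand from the output above, and the object names are made up.

```r
# Unweighted counts copied from the xtabs() output above
counts <- matrix(c(12, 643,
                    1,  21),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(state = c("Hawaii", "Nevada"),
                                 status = c("Skin Cancer", "No Skin Cancer")))

# Expected counts under independence; values below 5 signal a sparse table
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
print(round(expected, 2))

# Fisher's exact test on the raw counts (ignores the survey weights and strata)
fisher_result <- fisher.test(counts)
print(fisher_result$p.value)
```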

Survey-Weighted, Rao-Scott adjusted chi-square test: We are using this statistical test to determine whether the prevalence of skin cancer (non-melanoma) among NHOPIs in Hawaii is significantly different from that among NHOPIs in Nevada.

# Perform the survey-weighted chi-square test
chisq_result <- svychisq(~`_STATE` + CHCSCNC1w, design = brfssdsgn)

# Print the chi-square test results
print(chisq_result)
## 
##  Pearson's X^2: Rao & Scott adjustment
## 
## data:  svychisq(~`_STATE` + CHCSCNC1w, design = brfssdsgn)
## F = 0.66233, ndf = 1, ddf = 658, p-value = 0.416

Not statistically significant (F = 0.66, p = 0.416). We fail to reject the null hypothesis. There is no statistically significant evidence of an association between state (Hawaii vs. Nevada) and skin cancer (non-melanoma) status (Yes/No) at the 0.05 alpha level.

Conclusion

The analysis found no statistically significant association between state (Hawaii vs. Nevada) and non-melanoma skin cancer status (Yes/No) (p > 0.05). This means we don’t have sufficient evidence to conclude that the prevalence of non-melanoma skin cancer differs between Hawaii and Nevada based on this data.

Implications

Welp. No statistical significance. What do we do?

  1. The results suggest that instead of states (Nevada vs Hawaii), we should look into other factors.

  2. Because we found no evidence of a difference, education programs to prevent skin cancer (non-melanoma) could reasonably use the same educational materials and preventive measures in both states rather than being tailored to each one. Keep in mind that failing to find a difference is not proof that none exists.

  3. Future research is needed. Perhaps with a larger sample size of NHOPIs, we would be better able to detect any statistically significant differences between NHOPIs in the 2 states. Furthermore, other factors may influence our results, such as cost of living, access to healthcare services, etc.

Keep the Limitations in Mind

  1. Selection Bias:
    • BRFSS relies heavily on telephone calls. The sample may not be fully generalizable because people who don’t have a phone and those who don’t answer unknown numbers are left out. People who are homeless or who do not speak English may be underrepresented. The time of day at which the surveys are administered may also affect generalizability, since many people are busy working or going to school during the day. There may also be non-response bias, in which those who respond differ from those who don’t, which further limits the generalizability of this study.
  2. Information Bias:
    • BRFSS relies on self-reported data, so this study is subject to recall bias, social desirability bias, and misunderstanding of questions.
    • Healthy volunteer bias. Participants may be more health-conscious than the general population. I think women are also more likely to complete these surveys than men.
    • If the survey is long, participants may become fatigued, so their responses may be less accurate towards the end.
  3. Study Design:
    • No temporality. BRFSS is a cross-sectional survey, so we cannot establish causation.