PAF 573
Elaine MacPherson
READ THIS BEFORE BEGINNING EXAMINATION
The examination is open book – you may use the textbooks and syllabus materials, any slides or handouts from class, and any notes you have taken in class or in preparation for class. But you need to answer each question on your own.
The examination consists of two questions, one of which requires you to analyze a related dataset. It is not enough to just provide output from a statistical program when answering the questions. If it is not clear how you arrived at your answer, you will not receive most of the points available for the question even if your answer is correct. You must show your work and, when appropriate, provide the commands you used to produce your output.
For the purposes of this exam, assume you are working for the Arizona Department of Health Services. The director has become increasingly concerned about opioid use in Arizona. She obtained data from the CDC on opioid use across the US, as well as demographic information (see Table 1 for definitions). In collaboration with a colleague, your task is to analyze this dataset and interpret the results.
SPECIAL INSTRUCTIONS
Opioid use is known to occur more frequently in certain demographic groups. One of your colleagues used the provided dataset to test this, and provides you with the following output.
#In this output, the p-values are all very small. Because a p-value is the probability of observing an estimate at least as extreme as ours if the null hypothesis (a coefficient of zero) were true, the extremely small p-values indicate that the coefficients are statistically distinguishable from zero; they do not by themselves measure model fit. R² is .38, which is not very high, but it does indicate that about 38% of the variance in the dependent variable is explained by the independent variables. Finally, the coefficients appear to be very strong for some independent variables (such as region and gender) and less significant for others (such as demographics). The presence of several significant coefficients suggests the model has some explanatory power.
#According to this output, the demographic groups most likely to use opioids are those with at least a high school diploma (108.66) and women (.76). The strongest relationships appear to be for regions, where those in the South are most likely to use opioids (49.69). According to this output, the demographic groups less likely to use opioids are those with negative coefficients: African Americans (-.44), Native Americans (-.94), Hispanics (-1.09), and those who are between 25 and 44 years old (-2.14).
#Moving from African Americans and Native Americans to everyone else who is NOT one of those two identities (“other”) means that there is a slew of demographic groups lumped together, which could cause bias. Hispanic ethnicity is NOT the same thing as race, so both Native Americans AND African Americans could also identify as Hispanic. If ethnicity is left out of the model, its effect is absorbed by the race variables it is correlated with (omitted variable bias); including “Hispanic” as a regressor controls for that bias. The addition of this variable results in some changes to the other coefficients:
#The coefficient on “other_percent” moves from slightly positive (.24) to negative (-1.09). That is a reversal in direction, in addition to a sharp increase in magnitude (about four and a half times the previous effect). In addition, the p-value on “other_percent” falls from .04 to far less (.00), showing a far more statistically significant estimate once ethnicity is accounted for.
#One additional component to add to this analysis would be socioeconomic status. Race, ethnicity, and socioeconomic status are likely correlated with one another (multicollinearity among the regressors), and socioeconomic status is also likely to be related to health outcomes and opioid use. Leaving it out therefore risks omitted variable bias; controlling for it could support a better analysis.
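#As a rough illustration of omitted variable bias, the sketch below uses made-up numbers (not the exam dataset) in which the outcome depends on two correlated regressors. The names x1, x2, b1, and b2 are hypothetical. Regressing on x1 alone gives a slope equal to b1 plus b2 times the slope of x2 on x1, which is how a coefficient can change size, or even sign, once the omitted variable is added.

```python
from statistics import mean


def slope(x, y):
    """Least-squares slope from regressing y on x (with an intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical toy data, NOT the exam dataset.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]   # included regressor
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]   # omitted regressor, correlated with x1
b1, b2 = 1.0, 2.0                # true effects (assumed for illustration)
y = [b1 * a + b2 * b for a, b in zip(x1, x2)]

short_slope = slope(x1, y)       # slope when x2 is left out
delta = slope(x1, x2)            # how strongly x2 moves with x1
bias = b2 * delta                # omitted variable bias in the short slope

print(f"true b1 = {b1}, short-regression slope = {short_slope:.2f}, bias = {bias:.2f}")
```

#In this toy case the short regression attributes part of the effect of x2 to x1. With a negative b2 or a negative correlation the bias would push the estimate in the other direction.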
For the remainder of the exam, you will be using the provided dataset.
#I ran a linear regression using the lm(dependent variable ~ independent variable, data) command and got the output below. The coefficient on economic conditions (employment_to_population) is negative: a one-unit increase in the employment-to-population ratio is associated with a 2.05-unit decrease in the prescription rate, so better economic conditions are associated with lower opioid prescribing. The estimate is highly statistically significant, given that the p-value is so small (0.00).
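#MODEL FIT:
#F(1,7630) = 873.00, p = 0.00
#R² = 0.10
#Adj. R² = 0.10
#----------------------------------------------------------------
#                                  Est.    S.E.    t val.       p
#----------------------------------------------------------------
#(Intercept)                     188.65    3.15     59.85    0.00
#employment_to_population         -2.05    0.07    -29.55    0.00
#----------------------------------------------------------------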
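#To make explicit what a one-regressor lm() fit computes, here is a minimal sketch written in Python rather than the R used for the exam. The data are hypothetical toy values, not the exam dataset, and simple_ols is an illustrative helper, not part of any library.

```python
from statistics import mean


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical toy data, NOT the exam dataset.
employment_to_population = [40.0, 45.0, 50.0, 55.0, 60.0]
prescription_rate = [110.0, 98.0, 91.0, 79.0, 72.0]

intercept, slope = simple_ols(employment_to_population, prescription_rate)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

#The slope is read the same way as the coefficient in the output above: the expected change in the prescription rate for a one-unit increase in the employment-to-population ratio.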
#At first I ran this regression with only “midwest” added as a regressor, but then I realized that in order to actually compare effects across regions, I had to add all regions. Before adding the other regions, the midwest has a negative coefficient of -1.61, and including it slightly lowered the magnitude of the economic conditions coefficient by .04 (from -2.05 to -2.01). Also, the p-value on midwest when it is the only region included is fairly high (0.09), so it’s worth adding more independent variables to the analysis.
#MODEL FIT:
#F(2,7629) = 438.05, p = 0.00
#R² = 0.10
#Adj. R² = 0.10
#Standard errors: OLS
#----------------------------------------------------------------
#                                  Est.    S.E.    t val.       p
#----------------------------------------------------------------
#(Intercept)                     187.14    3.27     57.15    0.00
#employment_to_population         -2.01    0.07    -27.06    0.00
#midwest                          -1.61    0.94     -1.70    0.09
#----------------------------------------------------------------
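#Running with all regions as regressors revealed a DRASTIC shift for the midwest. Where previously there was a -1.61 effect, there is now a positive 19.66 effect (see output below). The shift reflects a change in the comparison group: with midwest as the only region dummy, its coefficient compares the midwest with all other regions pooled, whereas with all regions entered it compares the midwest with the omitted region. There is no estimate for the northeast because the four region dummies sum to one and are perfectly collinear with the intercept, so one of them has to be dropped; the northeast therefore serves as the reference category. The south and west also have highly positive coefficients (34.71 and 13.21 respectively). These regional additions also shrank the coefficient on economic conditions from -2.01 to -1.46. Holding economic conditions constant, prescription rates are thus higher in the midwest than in the northeast, but not as high as in the south.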
#MODEL INFO:
#Observations: 7632
#Dependent Variable: prescription_rate
#Type: OLS linear regression
#MODEL FIT:
#F(4,7627) = 443.94, p = 0.00
#R² = 0.19
#Adj. R² = 0.19
#Standard errors: OLS
#----------------------------------------------------------------
#                                  Est.    S.E.    t val.       p
#----------------------------------------------------------------
#(Intercept)                     139.42    3.64     38.26    0.00
#employment_to_population         -1.46    0.07    -19.91    0.00
#midwest                          19.66    1.38     14.29    0.00
#south                            34.71    1.37     25.31    0.00
#west                             13.21    1.50      8.82    0.00
#northeast
#----------------------------------------------------------------
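#The sign change on midwest described above can be reproduced with a small sketch. It assumes made-up regional rates (not the exam dataset) and leaves out the employment covariate for simplicity. With only region dummies in the model, each dummy coefficient equals that region's mean minus the mean of whatever group is left as the baseline.

```python
from statistics import mean

# Hypothetical toy prescription rates by region, NOT the exam dataset.
rates = {
    "northeast": [50.0, 54.0],
    "midwest": [70.0, 74.0],
    "south": [90.0, 94.0],
    "west": [80.0, 84.0],
}

midwest_mean = mean(rates["midwest"])

# Model with a midwest dummy only: the baseline is every other region pooled.
others = [r for region, vals in rates.items() if region != "midwest" for r in vals]
coef_midwest_only = midwest_mean - mean(others)

# Model with midwest, south, and west dummies: the baseline is the omitted northeast.
reference_mean = mean(rates["northeast"])
coef_vs_northeast = {
    region: mean(vals) - reference_mean
    for region, vals in rates.items()
    if region != "northeast"
}

print(f"midwest vs. all other regions: {coef_midwest_only:.2f}")
print(f"midwest vs. northeast:         {coef_vs_northeast['midwest']:.2f}")
```

#In this toy data the midwest sits below the pooled average of the other regions but above the northeast, so its coefficient is negative in the first model and positive in the second, the same pattern as in the two outputs above.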
Question Two (35 points)
The US government recently implemented a program with the aim of reducing opioid use. To pilot the program, a Randomized Control Trial (RCT) was conducted [Due to the obvious ethical concerns associated with this program, the individual responsible for creating this program was later terminated]. Certain CBSAs were assigned treatment (treat = 1) and all CBSAs assigned treatment received treatment. The treatment began in 2010 (post = 1). You have been assigned the task of evaluating the effectiveness of this program.
#A difference-in-differences model is the appropriate way to estimate the causal effect, because of the way the treatment was introduced. Because there is a discrete point in time (2010) at which the treatment begins, we can compare the change in the treated group against the change in the untreated group and use the latter as the counterfactual (had the treatment never been introduced, what would have happened?). The key assumption is parallel trends: had the treatment group NOT received treatment, its outcomes would have changed by the same amount as the control group’s. We would regress the outcome on treat, post, and their interaction to obtain the difference in differences and determine the causal effect.
#The output indicates that the difference-in-differences estimate is -7.75; that is, the treatment is associated with a 7.75 decrease in opioid use relative to the control group. The counterfactual trend (the coefficient on “post” in the output below) shows that without the treatment there would have been an 8.05 increase in opioid use. The treatment estimate has a high p-value (.57), so it is not statistically significant and we cannot rule out that the program had no effect.
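#The arithmetic behind a difference-in-differences estimate can be sketched from four group means. The observations below are hypothetical toy values (not the exam dataset), and did_estimate is an illustrative helper rather than a library function.

```python
from statistics import mean


def did_estimate(obs):
    """obs: list of (treat, post, outcome) tuples.

    Returns (did, control_change), where did is the treated group's
    pre-to-post change minus the control group's pre-to-post change.
    """
    def cell(t, p):
        return mean(y for tr, po, y in obs if tr == t and po == p)

    treated_change = cell(1, 1) - cell(1, 0)
    control_change = cell(0, 1) - cell(0, 0)
    return treated_change - control_change, control_change


# Hypothetical toy observations, NOT the exam dataset.
observations = [
    (0, 0, 80.0), (0, 0, 84.0),   # control, before 2010
    (0, 1, 88.0), (0, 1, 92.0),   # control, 2010 onward
    (1, 0, 81.0), (1, 0, 85.0),   # treated, before 2010
    (1, 1, 82.0), (1, 1, 84.0),   # treated, 2010 onward
]

did, control_change = did_estimate(observations)
print(f"control change (post) = {control_change:.2f}, DiD = {did:.2f}")
```

#The control group's change plays the role of the “post” coefficient, and the difference in differences plays the role of the interaction coefficient. A regression is still needed to obtain standard errors and p-values.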
#I would not recommend that this program be implemented in North Korea due to the threats to both internal and external validity. Internal validity refers to whether the research design identifies real cause and effect and rules out confounding variables. For us to be sure that the regression is not biased, we have to know that no relevant variables have been omitted. For example, if the period beginning in 2010 (when the treatment started) also had other contextual factors that altered nationwide prescription of opioids, such as pharmaceutical laws that affected the ease with which opioids were produced and distributed, they may create an effect that we’re not isolating. In addition, other events could challenge internal validity, like the passage of the Affordable Care Act in 2010, which may have increased prescription access as its provisions took effect. External validity refers to the ability of the results of a study to be generalized to a broader context. North Korea’s incredibly isolated economy and lack of transparency mean there are a host of circumstances that prevent these results from being generalized there: is the economy capitalistic, like ours, or tightly controlled? Are drug laws similar, and is production of pharmaceutical drugs comparable? Demographically, are those who use opioids there similar to those who use them in the United States? Many circumstances make it difficult to apply this study to the North Korean context.