Submission Instructions

Due 11:59pm 8th September 2025. This assignment is worth 30% of the overall mark.

Please submit your work in one document, with your name, the course title and the assignment number in the file name (e.g. RM2 Assignment1 JimmyBarnes.docx) and in the document itself (either on the title page, or in the header/footer). Please submit either a Word or PDF file. You should include relevant statistical code in an appendix, showing how you answered the question. Please use a different font (Courier 10pt is recommended) and format it neatly, to make your answers easy to read and understand. You should submit your assignment online according to the instructions in the BCA Turnitin Guide (available on Canvas).

This data is from a study to identify predictors of doctors and other visits.

The data file assn1data.csv contains the following variables:

Variable Description
docvisits Number of doctor visits during study
chronic Number of chronic conditions (0=0, 1=1, 2=2, 3=3, 4=4 or more)
age Age in years
income Income in $1K
black Black (0=no, 1=yes)
gender Gender (0=male, 1=female)
educ Education (1=high school or less, 2=completed high school, 3=post high school)

Tables and graphs must be suitable for publication only if specified.

Question 1. [24 Marks]

This section is to develop a model for the number of doctors visits using the predictors: chronic, age, black, gender, income and educ.

i.

Produce graphical displays showing univariate analysis of both the outcome and continuous predictors. Comment.

🌸 Plots showing the distribution of continuous variables (docvisits, age, income) 🌸

Comments

Number of doctors visits (Outcome variable)

  • Heavily right-skewed distribution
  • Several outliers with over 20 visits
  • Mean (6.12 visits) substantially higher than median (4 visits) indicating right-skew

Age (Predictor variable)

  • Approximately normal distribution with some high age outliers
  • Mean (74.16) close to median (73), indicating a relatively symmetric distribution

Income (Predictor variable)

  • Heavily right-skewed distribution with most individuals having low income ($0-$20k)
  • Median income is low ($15.69k), with several outliers extending beyond $200k (maximum $229.66k)

🌸 Summary statistics for continuous variables 🌸

Statistic   Doctor Visits   Age (years)   Income ($1k)
Min.        0.00            66.00         0.00
1st Qu.     1.00            69.00         8.58
Median      4.00            73.00         15.69
Mean        6.12            74.16         24.22
3rd Qu.     8.00            78.00         29.50
Max.        61.00           97.00         229.66

Comments

  • Number of doctors visits (outcome variable) has a heavily right-skewed distribution and is count-based.
  • Income (predictor variable) also has a heavily right skewed distribution.
  • Age (predictor variable) displays an approximately normal distribution and should be fine to include in our model untransformed.
  • Consider log-transforming income (predictor variable) to reduce skew.
  • The heavily right-skewed count distribution of docvisits (outcome variable) makes ordinary least squares linear regression inappropriate, because the residuals are unlikely to be normally distributed with constant variance.
  • Consider Poisson or negative binomial regression instead of linear regression.
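
The skew check used in the comments above (mean well above the median) can be sketched in a few lines. This is a minimal illustration in Python using only the standard library; the `visits` list is made-up toy data rather than the assignment data, and `summarise` is a hypothetical helper name.

```python
import statistics

def summarise(values):
    """Return min, quartiles, mean and max for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
    }

# Toy right-skewed counts (hypothetical, for illustration only)
visits = [0, 1, 2, 4, 8, 20]
summary = summarise(visits)

# A mean well above the median is a simple indicator of right skew
right_skewed = summary["mean"] > summary["median"]
print(summary)
print("Right-skewed:", right_skewed)
```

Applied to the summary table above, the same comparison gives a mean of 6.12 against a median of 4.00 for doctor visits, and 24.22 against 15.69 for income.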

ii.

Check the relationship between continuous predictors and comment.

🌸 Scatterplot showing relationship between continuous predictors 🌸

Comments

The relationship between the two continuous predictors (age and income) shows a very weak negative correlation (r = -0.037; R² = 0.001), indicating that these predictors are essentially uncorrelated.

The scatterplot shows that most individuals cluster in the lower income ranges regardless of age, with high variability and no clear linear pattern. This very weak correlation indicates no multicollinearity concerns between age and income, meaning both predictors can be safely included together in the model as they provide independent information.
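
The correlation summary above can be sketched as follows. This is a minimal Python illustration, not the code used for the assignment: `pearson_r` is a hypothetical helper, and the `age` and `income` lists are made-up toy values. The last lines check that the reported R² is consistent with the reported r.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Toy values (hypothetical, for illustration only)
age = [66, 70, 74, 78, 82]
income = [30.0, 12.0, 25.0, 8.0, 20.0]
r = pearson_r(age, income)
r_squared = r ** 2
print(round(r, 3), round(r_squared, 3))

# Consistency check on the reported figures: R² should equal r squared
reported_r = -0.037
print(round(reported_r ** 2, 3))  # 0.001
```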

iii.

Produce graphical displays showing the relationship between the outcome and predictors. Comment.

🌸 Scatterplots showing relationship between docvisits and continuous predictors 🌸

🌸 Boxplots showing relationship between docvisits and categorical predictors 🌸

Comments

Continuous Predictors

Age vs Doctor visits

  • Shows a very weak negative correlation (r = -0.025, R² = 0.001).

Income vs Doctor visits

  • Shows a very weak positive correlation (r = 0.011, R² ≈ 0).
  • There is no meaningful linear relationship between income and doctor visits; essentially all of the variation is unexplained.

Categorical Predictors

Chronic conditions

  • Clear increasing trend.
  • Individuals with more chronic conditions have more doctor visits.

Race

  • Black individuals appear to have substantially lower mean doctor visits compared to non-Black individuals.

Gender

  • Women appear to have slightly higher mean doctor visits compared to men.

Education

  • Some variation across education levels, with those having completed high school showing slightly higher mean visits.
  • However, no clearly linear relationship.

Summary

Continuous variables (age and income) show very weak linear relationships with number of doctor visits.

Number of chronic conditions appears to be the strongest predictor of doctor visits overall, whereas the other categorical variables (black, gender, education) appear to have a notable (albeit modest) effect.

iv.

Produce a table of demographics for both outcome and all predictors, suitable for publication. Hence, comment on any difficulties that may occur in analysis.

Table 1. Baseline Characteristics of Study Participants (N = 700)

Characteristic                    N (%) or mean (SD)
Doctor visits                     6.12 (8.01)
Age (years)                       74.16 (6.40)
Annual income ($1,000)            24.22 (27.41)
Gender
  Female                          433 (61.9)
  Male                            267 (38.1)
Race/Ethnicity
  Non-Black                       634 (90.6)
  Black                           66 (9.4)
Education level
  High school or less             382 (54.6)
  Completed high school           193 (27.6)
  Post high school                125 (17.9)
Number of chronic conditions
  0                               166 (23.7)
  1                               238 (34.0)
  2                               149 (21.3)
  3                               84 (12.0)
  ≥4                              63 (9.0)

Note: Data presented as mean (standard deviation) for continuous variables and n (%) for categorical variables.

Comments

  • Doctor visits and income are both right-skewed; income may benefit from a log transformation, and doctor visits, being a count, is better handled with a count model than by transforming the outcome
  • Small samples in some subgroups may cause convergence problems or wide confidence intervals
    • Only 66 Black participants (9.4%)
    • Only 63 people with 4+ chronic conditions (9%)
  • The standard deviation of doctor visits (8.01) exceeds its mean (6.12), so the variance (about 64.2) is roughly ten times the mean, suggesting overdispersion
  • Overdispersion violates the Poisson assumption that the variance equals the mean, so negative binomial regression is likely the better choice
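
The overdispersion argument above can be checked with simple arithmetic on the Table 1 figures. The sketch below is illustrative only; it uses the marginal mean and SD of doctor visits, so it ignores the predictors and is a rough screen rather than a formal test.

```python
# Figures taken from Table 1 (doctor visits: mean 6.12, SD 8.01)
mean_visits = 6.12
sd_visits = 8.01

variance_visits = sd_visits ** 2
variance_to_mean = variance_visits / mean_visits

# Under a Poisson distribution the variance equals the mean (ratio of 1)
print(round(variance_visits, 2), round(variance_to_mean, 2))
```

A ratio far above 1, as here, is the usual motivation for fitting a negative binomial model alongside the Poisson model.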

v.

Choose an appropriate model using all predictors without any transformations. Explain your choice.

Model fit statistics comparing Poisson and Negative Binomial models

Model               AIC       BIC       Log-Likelihood   Deviance   Residual DF
Poisson             6676.14   6708.00   -3331.07         4674.11    693
Negative Binomial   3948.25   3984.66   -1966.13         793.10     693

Overdispersion statistics for both models

Model               Dispersion Parameter   Residual Deviance   Pearson χ²   Dispersion Ratio
Poisson             Fixed at 1             4674.11             6573.07      6.745
Negative Binomial   1.063                  793.10              956.56       1.380

Likelihood ratio test comparing Poisson vs Negative Binomial

Test                    χ² Statistic   DF   P-value   Decision
Likelihood Ratio Test   2729.887       1    < 0.001   Negative Binomial preferred

Comments

Based on the above metrics of model fit, negative binomial regression is clearly the more appropriate model for predicting the number of doctor visits using all predictors without transformations.

  • The negative binomial model had a substantially lower AIC / BIC (3948.25 / 3984.66 vs 6676.14 / 6708.00)
  • The negative binomial model showed far less overdispersion, with a dispersion ratio of 1.380 compared with 6.745 for the Poisson model. The two ratios in the table are not computed on the same basis: 6.745 is deviance/DF (4674.11/693), whereas 1.380 is Pearson χ²/DF (956.56/693). Compared like for like, the Pearson ratios are 9.485 vs 1.380 and the deviance ratios are 6.745 vs 1.144, so the conclusion is unchanged.
  • The residual deviance also suggested a better fit for the negative binomial model (deviance = 793.10 with 693 DF) than for the Poisson model (deviance = 4674.11 with 693 DF).
  • The likelihood ratio test also strongly favored the negative binomial model (χ² = 2729.887, p < 0.001).
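
The fit statistics quoted above can be reproduced from the reported log-likelihoods. The sketch below is a check on the arithmetic, not the original analysis code. The parameter counts are assumptions (7 for Poisson: intercept plus six slopes; 8 for negative binomial: the same plus one dispersion parameter), and the helper names are hypothetical.

```python
import math

def aic(loglik, k):
    """Akaike information criterion for k estimated parameters."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * loglik

def lrt_df1(loglik_null, loglik_alt):
    """Likelihood ratio statistic and chi-square (1 DF) p-value.

    Only meaningful for nested models, where the statistic is non-negative.
    """
    stat = 2 * (loglik_alt - loglik_null)
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square survival function, 1 DF
    return stat, p_value

N = 700
LL_POISSON, LL_NB = -3331.07, -1966.13   # reported log-likelihoods
K_POISSON, K_NB = 7, 8                   # assumed parameter counts

print("Poisson AIC/BIC:", round(aic(LL_POISSON, K_POISSON), 2),
      round(bic(LL_POISSON, K_POISSON, N), 2))
print("NB AIC/BIC:", round(aic(LL_NB, K_NB), 2), round(bic(LL_NB, K_NB, N), 2))

stat, p_value = lrt_df1(LL_POISSON, LL_NB)
print("LRT statistic:", round(stat, 2), "p < 0.001:", p_value < 0.001)

# Dispersion ratios on both bases, using the reported deviance and Pearson values
RESID_DF = 693
ratios = {
    "Poisson deviance/DF": 4674.11 / RESID_DF,
    "Poisson Pearson/DF": 6573.07 / RESID_DF,
    "NB deviance/DF": 793.10 / RESID_DF,
    "NB Pearson/DF": 956.56 / RESID_DF,
}
for label, value in ratios.items():
    print(label, round(value, 3))
```

Small differences in the last decimal place are expected because the inputs are the rounded values from the tables.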

vi.

Using the best model, first determine if any transformations are required. Then using backwards regression with p=0.20 determine a reduced model. Before attempting this, watch the video “RM2 model building video” by Dr Michael Waller on page https://canvas.sydney.edu.au/courses/62815/pages/module-2-lectures

🌸 Residual plots for untransformed nb model 🌸

🌸 Residual plots for log_income nb model 🌸

🌸 Comparison table: Untransformed nb model vs Log income nb model 🌸

Model fit statistics comparing untransformed income nb model and log income nb model

Model                  AIC       BIC       Log-Likelihood   Deviance   Residual DF
Untransformed Income   3948.25   3984.66   -1966.13         793.10     693
Log Income             3948.35   3984.76   -1966.18         793.11     693

Likelihood ratio test comparing untransformed income nb model vs log income nb model

Test                    χ² Statistic   DF   P-value   Decision
Likelihood Ratio Test   -0.1           1    1         Models not significantly different

Comments

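Based on the above metrics, log-transforming income did not meaningfully improve the model fit, so the original negative binomial model with untransformed income is retained.

  • Both models have similar residual patterns
  • The two models have the same number of parameters and are not nested, so the likelihood ratio test is not strictly applicable here, which is why the statistic is slightly negative (χ² = -0.1); the log-likelihoods are practically identical (-1966.13 vs -1966.18)
  • AIC values are nearly identical (3948.25 vs 3948.35)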

🌸 Stepwise regression results 🌸

Backwards elimination process (α = 0.20)

Step   Variable Removed   P-value
1      income             0.7078
2      age                0.2541

Model comparison: Full vs Reduced model

Model           Number of Variables   AIC
Full Model      6                     3948.25
Reduced Model   4                     3945.59

Final reduced model coefficients

Variable      Estimate   P-value
(Intercept)   0.816      < 0.001
chronic       0.299      < 0.001
black         -0.686     < 0.001
gender        0.225      0.011
educ          0.235      < 0.001
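
The elimination procedure summarised in the tables above follows a simple loop: fit the model, find the predictor with the largest p-value, drop it if that p-value exceeds 0.20, and refit. The sketch below shows only that control flow. `backward_eliminate` and `lookup_pvalues` are hypothetical names, and `lookup_pvalues` is a stand-in for real model fitting: it returns fixed p-values copied from the tables above, whereas in a real run every p-value changes at each refit.

```python
def backward_eliminate(predictors, fit_pvalues, alpha=0.20):
    """Drop the least significant predictor until all p-values are <= alpha.

    fit_pvalues(current_predictors) must return a dict of p-values keyed by
    predictor name for a model refitted on the current predictor set.
    """
    remaining = list(predictors)
    removed = []
    while remaining:
        pvals = fit_pvalues(remaining)                 # refit on the current set
        worst = max(remaining, key=lambda name: pvals[name])
        if pvals[worst] <= alpha:
            break                                      # everything left is retained
        remaining.remove(worst)
        removed.append((worst, pvals[worst]))
    return remaining, removed

# Stand-in for model fitting (assumption): fixed p-values copied from the tables,
# with 0.000 used where the table reports 0.000. Not a substitute for refitting.
TABLE_PVALUES = {"chronic": 0.000, "age": 0.2541, "black": 0.000,
                 "gender": 0.011, "income": 0.7078, "educ": 0.000}

def lookup_pvalues(current):
    return {name: TABLE_PVALUES[name] for name in current}

kept, dropped = backward_eliminate(
    ["chronic", "age", "black", "gender", "income", "educ"], lookup_pvalues)
print("Kept:", kept)
print("Dropped:", dropped)
```

With a real fitting function in place of the lookup, the same loop gives the removal order and the reduced model.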

vii.

Investigate the diagnostics for this model and comment.

🌸 Residual plots for final reduced variable model 🌸

🌸 Diagnostic statistics 🌸

Model diagnostic statistics

Diagnostic                           Value
Residual Deviance                    793.300
Degrees of Freedom                   695
Dispersion Ratio                     1.395
AIC                                  3945.590
Number of Influential Observations   33
Theta (NB parameter)                 0.939

Comments

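The residual deviance of 793.30 on 695 degrees of freedom gives a deviance-to-DF ratio of about 1.14. The reported dispersion ratio of 1.395 does not equal 793.30/695, so it was presumably computed from the Pearson χ² statistic instead. Both values are reasonably close to 1, indicating that the negative binomial model appropriately accounts for overdispersion in the data. The theta parameter of 0.939 is small, which corresponds to substantial extra-Poisson variation and supports the choice of negative binomial over Poisson regression.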

The AIC of 3945.6 represents a marginal improvement over the full model (3948.25), suggesting that variable removal slightly enhanced model parsimony without substantial loss of explanatory power.
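
To make the role of theta concrete, the sketch below evaluates the negative binomial variance function. It assumes the common NB2 parameterisation, Var(Y) = μ + μ²/θ, which the document does not state explicitly. The mean of 6.12 is the overall mean from Table 1, used purely for illustration, and `nb_variance` is a hypothetical helper.

```python
def nb_variance(mu, theta):
    """Variance of a negative binomial count under the NB2 form: mu + mu^2 / theta."""
    return mu + mu ** 2 / theta

mu = 6.12  # overall mean of doctor visits from Table 1, for illustration

# A small theta inflates the variance well beyond the Poisson value (mu);
# a very large theta makes the variance approach mu.
for theta in (0.939, 123.465):
    print("theta =", theta, "variance =", round(nb_variance(mu, theta), 2))
```

With θ = 0.939 the implied variance is many times the mean, whereas a θ as large as 123.465 (the value reported later for the chronic ~ age model) gives a variance only slightly above the mean, which is close to Poisson behaviour.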

viii.

Present a table of results suitable for publication with confidence intervals and appropriate p-values. Discuss the results.

Table 2. Negative Binomial Regression Results for Doctor Visits

Variable               β (95% CI)                 SE      IRR (95% CI)           P-value
Intercept              0.816 (0.540, 1.095)       0.136   2.262 (1.717, 2.989)   < 0.001
Chronic conditions     0.299 (0.230, 0.369)       0.035   1.349 (1.259, 1.447)   < 0.001
Black (vs Non-Black)   -0.686 (-0.986, -0.374)    0.155   0.503 (0.373, 0.688)   < 0.001
Female (vs Male)       0.225 (0.051, 0.397)       0.088   1.252 (1.053, 1.487)   0.011
Education level        0.235 (0.126, 0.345)       0.056   1.265 (1.134, 1.412)   < 0.001

Model Information
Model: docvisits ~ chronic + black + gender + educ
AIC = 3945.6
Dispersion parameter (θ) = 0.94
N = 700

Comments

The chronic conditions variable had the strongest association with doctor visits, with an incidence rate ratio (IRR) of 1.349 (95% CI: 1.259-1.447, p < 0.001). This indicates that holding all other variables constant, each additional chronic condition increases the expected number of doctor visits by approximately 34.9% (95% CI 25.9%-44.7%).

Gender had an IRR of 1.252 (95% CI: 1.053-1.487, p = 0.011) for women compared to men. This indicates that, holding all other variables constant, women have approximately 25.2% (95% CI: 5.3% to 48.7%) more doctor visits than men.

Education level had an IRR of 1.265 (95% CI: 1.134-1.412, p < 0.001), indicating that each one-category increase in education level is associated with approximately 26.5% more doctor visits (95% CI: 13.4%-41.2%), holding all other variables constant. The confidence interval, which excludes 1, and the very small p-value indicate a robust positive association between education and healthcare utilization.

Race had an IRR of 0.503 (95% CI: 0.373-0.688, p < 0.001) for Black individuals compared to non-Black individuals. This indicates that Black individuals have approximately 49.7% fewer doctor visits (95% CI: 31.2%-62.7% reduction) compared to non-Black individuals, holding all other variables constant.

The backward elimination procedure removed age (p = 0.254) and income (p = 0.708) from the model, as their p-values exceeded the retention criterion of 0.20. This suggests that after accounting for chronic conditions, gender, race, and education, neither age nor income provides additional predictive value for doctor visit frequency.
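
The percentage interpretations above come from exponentiating the coefficients: IRR = exp(β), and the percentage change is (IRR − 1) × 100. The sketch below applies this to the rounded β and SE values from Table 2. `irr_summary` is a hypothetical helper, and the interval it computes is a Wald approximation (β ± 1.96 × SE), which is an assumption and need not match the interval method used for Table 2.

```python
import math

def irr_summary(beta, se, z=1.96):
    """Incidence rate ratio, approximate Wald 95% CI, and percent change."""
    irr = math.exp(beta)
    ci = (math.exp(beta - z * se), math.exp(beta + z * se))
    return {"irr": irr, "ci": ci, "pct_change": (irr - 1) * 100}

# Rounded β and SE values copied from Table 2
coefficients = {
    "chronic": (0.299, 0.035),
    "black": (-0.686, 0.155),
    "gender": (0.225, 0.088),
    "educ": (0.235, 0.056),
}

results = {name: irr_summary(beta, se) for name, (beta, se) in coefficients.items()}
for name, res in results.items():
    low, high = res["ci"]
    print(f"{name}: IRR {res['irr']:.3f} ({low:.3f}, {high:.3f}), "
          f"change {res['pct_change']:.1f}%")
```

Because the inputs are rounded, the results can differ from Table 2 in the third decimal place, and the Wald interval for `black` differs slightly more because the tabulated interval is not symmetric around β.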

Question 2. [6 Marks]

A clinician believes that there may be a relationship between number of chronic diseases and age, with age as the predictor.

i.

Produce a graph showing the effect of age. It may be useful to categorise ranges of age. Comment.

🌸 Scatterplot showing relationship between age and chronic conditions 🌸

🌸 Boxplot showing chronic conditions by age categories 🌸

🌸 Summary statistics by age categories 🌸

Summary of chronic conditions by age categories

Age Category   N     Mean   SD     Median
66-70          246   1.39   1.23   1
71-75          193   1.58   1.29   1
76-80          144   1.45   1.11   1
81-85          75    1.69   1.23   2
86+            42    1.38   1.27   1

Comments

The scatterplot reveals a very weak positive correlation (r = 0.046, R² = 0.002) between age and number of chronic conditions. The summary by age category shows no consistent upward trend: mean chronic conditions range only from 1.38 to 1.69, the 81-85 group has the highest mean (1.69) and is the only group with a median of 2, while the 86+ group has the lowest mean (1.38). There is considerable variability within each age category (SD 1.11 to 1.29), and any linear relationship appears very weak.
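
The age categories used in the summary table can be formed with a simple lookup. The sketch below is illustrative only: the cut points are inferred from the category labels, `age_category` is a hypothetical helper, and `records` is made-up toy data rather than the assignment data.

```python
from bisect import bisect_left
from collections import defaultdict
from statistics import mean

UPPER_BOUNDS = [70, 75, 80, 85]                      # inclusive upper limits
LABELS = ["66-70", "71-75", "76-80", "81-85", "86+"]

def age_category(age):
    """Map an age in years to its category label."""
    return LABELS[bisect_left(UPPER_BOUNDS, age)]

# Toy (age, chronic conditions) pairs, hypothetical and for illustration only
records = [(66, 1), (70, 0), (71, 2), (79, 1), (83, 3), (90, 1)]

by_category = defaultdict(list)
for age, chronic in records:
    by_category[age_category(age)].append(chronic)

# Mean number of chronic conditions within each non-empty category
category_means = {label: mean(by_category[label])
                  for label in LABELS if by_category[label]}
print(category_means)
```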

ii.

Fit the model assuming a linear effect of age, and check that the model is appropriate.

Model comparison for chronic conditions vs age

Model               AIC       BIC       Log-Likelihood   Deviance
Poisson             2167.08   2176.19   -1081.54         840.59
Negative Binomial   2169.04   2182.69   -1081.52         832.19

Dispersion statistics

Model               Dispersion Ratio
Poisson             1.204
Negative Binomial   1.003

Likelihood ratio test comparing models

Test                    χ² Statistic   DF   P-value   Decision
Likelihood Ratio Test   0.043          1    0.835     Models not significantly different

🌸 Diagnostic plots for the selected model 🌸

Comments

  • The Poisson dispersion ratio of 1.2 suggests only mild overdispersion.
  • The likelihood ratio test indicated no statistically significant difference between the model fits (p = 0.835), and the Poisson model has the slightly lower AIC (2167.08 vs 2169.04) and BIC (2176.19 vs 2182.69), so the extra dispersion parameter adds little.
  • The negative binomial model is nevertheless used below as a cautious choice that allows for any overdispersion; given how similar the two fits are, the conclusions about age should be essentially the same under either model.

iii.

Briefly summarise the results in a table including a p-value and confidence interval. Discuss the results and diagnostics.

Negative Binomial Regression Results: Chronic Conditions vs Age

Variable      β (95% CI)                  SE       IRR (95% CI)              P-value
Intercept     -0.0348 (-0.7336, 0.6721)   0.3584   0.9658 (0.4802, 1.9583)   0.9226
Age (years)   0.0058 (-0.0037, 0.0151)    0.0048   1.0058 (0.9963, 1.0152)   0.2270

Model Information
Model: chronic ~ age
AIC = 2169
Dispersion parameter (θ) = 123.46
N = 700

Model diagnostic statistics

Diagnostic             Value
Residual Deviance      832.190
Degrees of Freedom     698
Dispersion Ratio       1.003
AIC                    2169.040
Theta (NB parameter)   123.465

Comments

  • The above metrics show that age is not statistically significantly associated with number of chronic conditions.
  • The incidence rate ratio (IRR) for age is 1.0058 (95% CI: 0.9963-1.0152, p = 0.227): each additional year of age is associated with an estimated 0.58% increase in the expected number of chronic conditions, but the confidence interval includes 1 and the p-value exceeds the commonly used 0.05 threshold.
  • The very weak correlation (R² = 0.002) suggests that age explains virtually none of the variation in chronic conditions in this sample.
  • Despite the clinician's expectation that older individuals would have more chronic conditions, the data suggest that within this cohort (ages 66-97), age is not a significant predictor of chronic disease burden.
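
The p-value for age can be checked from the reported coefficient and standard error with a Wald z-test. The sketch below is a minimal illustration using the rounded values from the results table; `wald_test` is a hypothetical helper, and the two-sided p-value uses the standard normal distribution.

```python
import math

def wald_test(beta, se):
    """Wald z statistic and two-sided p-value from the standard normal."""
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Rounded coefficient and standard error for age from the results table
z, p_value = wald_test(0.0058, 0.0048)
print(round(z, 2), round(p_value, 3))
```

The result agrees with the reported p-value of 0.227 to within rounding of the inputs.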