# Convert character variables to factors
for(i in names(Dataset)){
  if(is.character(Dataset[[i]])){
    Dataset[[i]] <- factor(Dataset[[i]])
  }
}

# Make activity ordered for nicer displays
if("Physical_Activity" %in% names(Dataset)){
  Dataset$Physical_Activity <- factor(Dataset$Physical_Activity,
                                      levels = c("High", "Moderate", "Low"))
}

# pull variables from names in the preamble
anova_y  <- Dataset[[anova_y_name]]
anova_x1 <- Dataset[[anova_x1_name]]
anova_x2 <- Dataset[[anova_x2_name]]

reg_y <- Dataset[[reg_y_name]]

# formulas built dynamically
anova_formula <- as.formula(
  paste(anova_y_name, "~", paste0(anova_x1_name, " * ", anova_x2_name))
)

reg_formula <- as.formula(
  paste(reg_y_name, "~", paste(reg_x_names, collapse = " + "))
)

This report analyzes the dataset colorectal_Midterm.csv using fully dynamic variable references. The main response variable is Tumor_Size_mm, and all analysis text, figures, and model output are generated from the variable names defined in the preamble. That means the report can be adapted to a different but similar set of variables by changing only the preamble section.

1 Question 1: ANOVA for main effects and interaction

anova_fit <- aov(anova_formula, data = Dataset)
anova_tab <- anova(anova_fit)

ss_total <- sum(anova_tab$"Sum Sq", na.rm = TRUE)
eta_sq_x1  <- anova_tab$"Sum Sq"[1] / ss_total
eta_sq_x2  <- anova_tab$"Sum Sq"[2] / ss_total
eta_sq_int <- anova_tab$"Sum Sq"[3] / ss_total

group_means <- aggregate(Dataset[[anova_y_name]],
                         by = list(Dataset[[anova_x1_name]], Dataset[[anova_x2_name]]),
                         FUN = mean)
names(group_means) <- c(anova_x1_name, anova_x2_name, "Mean")
group_means

##   Smoking_History Gender     Mean
## 1              No      F 48.28058
## 2             Yes      F 56.07047
## 3              No      M 49.11826
## 4             Yes      M 55.57984

I used a two-way ANOVA to test whether Tumor_Size_mm differs by Smoking_History, Gender, and their interaction. This is a good choice because the outcome is quantitative and both predictors are categorical. The ANOVA found a strong main effect of Smoking_History, F(1, 9996) = 789.28, p < .001, with an estimated eta-squared of 0.073. The main effect of Gender was not statistically significant, F(1, 9996) = 1.41, p = 0.236.

There was also a statistically significant interaction between Smoking_History and Gender, F(1, 9996) = 6.83, p = 0.009. In practical terms, that means the difference in Tumor_Size_mm between smoking groups was not exactly the same for males and females. Looking at the cell means, non-smokers had smaller average tumors than smokers in both gender groups, but the smoking gap was slightly larger for females than for males. So the clearest pattern in this analysis is that smoking history is associated with larger tumors, while gender by itself is much less important.

par(mfrow = c(1, 2), mar = c(5, 4.5, 3, 1))

boxplot(anova_y ~ interaction(anova_x1, anova_x2),
        col = c("grey90", "grey70", "grey90", "grey70"),
        las = 2,
        ylab = anova_y_name,
        xlab = paste(anova_x1_name, "and", anova_x2_name),
        main = "Distribution by group")

interaction.plot(x.factor = Dataset[[anova_x1_name]],
                 trace.factor = Dataset[[anova_x2_name]],
                 response = Dataset[[anova_y_name]],
                 type = "b",
                 pch = c(1, 19),
                 lwd = 2,
                 ylab = paste("Mean", anova_y_name),
                 xlab = anova_x1_name,
                 trace.label = anova_x2_name,
                 main = "Interaction plot")

Figure 1.1: Two-panel ANOVA figure for Tumor_Size_mm by Smoking_History and Gender

Figure 1.1 supports the ANOVA results. The boxplot shows a clear upward shift in tumor size for people with a smoking history, and the interaction plot shows that the smoking effect is present for both genders but is not perfectly parallel across the two lines, which is consistent with the statistically significant interaction.

2 Question 2: Predictive regression model

reg_fit <- lm(reg_formula, data = Dataset)
reg_sum <- summary(reg_fit)

coef_tab <- reg_sum$coefficients
r2 <- reg_sum$r.squared
adj_r2 <- reg_sum$adj.r.squared
rmse <- sqrt(mean(reg_fit$residuals^2))

coef_tab

##                             Estimate  Std. Error     t value
## (Intercept)               29.6703900 0.708461851  41.8800108
## Age                        0.3204151 0.009085289  35.2674631
## GenderM                    0.0885424 0.221354752   0.4000022
## Smoking_HistoryYes         6.4365835 0.220036635  29.2523265
## Alcohol_ConsumptionYes     6.2555971 0.215857447  28.9802235
## Urban_or_RuralUrban       -6.2863870 0.235016632 -26.7486900
## Physical_ActivityModerate -2.9841156 0.256670833 -11.6262356
## Physical_ActivityLow      -1.4169005 0.277346649  -5.1087709
##                                                                                                                                                                                                                                                                                            Pr(>|t|)
## (Intercept)               0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## Age                       0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000006710932
## GenderM                   0.68916341927734980910003059761947952210903167724609375000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## Smoking_HistoryYes        0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000147884341032107091957723675184117468008308380744071661609368214855300267158885833587
## Alcohol_ConsumptionYes    0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000221465587236585565082399664377278544176123094819070731105872132497950084868909628039068
## Urban_or_RuralUrban       0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002706639892531628503214432097795806894327180626038456185934369790110168362851621598124806564083315257306326918692
## Physical_ActivityModerate 0.00000000000000000000000000000048013045783821893480565570557974459732750872836218880993834890167744244247228114098186013691815787751693278551101684570312500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## Physical_ActivityLow      0.00000033024955649898458132107149920475563931177021004259586334228515625000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

To build a predictive model, I used multiple linear regression with Tumor_Size_mm as the response and the demographic or behavioral predictors Age, Gender, Smoking_History, Alcohol_Consumption, Urban_or_Rural, Physical_Activity. This model is appropriate because the outcome is continuous and the goal is to estimate how several variables jointly predict tumor size. Overall, the model fit the data well, F(7, 9992) = 554.96, p < .001, with R² = 0.28 and adjusted R² = 0.279. The residual standard prediction error was about 10.75 mm.

Several predictors were especially important. Holding the other variables constant, each additional year of age was associated with an average increase of 0.32 mm in Tumor_Size_mm. Compared with non-smokers, participants with a smoking history had tumors that were about 6.437 mm larger on average, p < .001. Compared with non-drinkers, alcohol consumption was associated with an average increase of 6.256 mm, p < .001. Living in an urban area was associated with a decrease of 6.286 mm relative to living in a rural area, p < .001. Gender was not a meaningful predictor in this model, p = 0.689.

par(mfrow = c(1, 2), mar = c(5, 4.5, 3, 1))

plot(fitted(reg_fit), reg_y,
     pch = 16, cex = 0.5,
     xlab = "Fitted values",
     ylab = paste("Observed", reg_y_name),
     main = "Observed vs fitted")
abline(0, 1, lwd = 2)

plot(fitted(reg_fit), resid(reg_fit),
     pch = 16, cex = 0.5,
     xlab = "Fitted values",
     ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lwd = 2)

Figure 2.1: Two-panel regression diagnostics for predicting Tumor_Size_mm

Figure 2.1 shows the model’s goodness-of-fit. In the observed-versus-fitted plot, the points follow the 45-degree line reasonably well, which matches the moderate-to-strong R². In the residual plot, the residuals are centered around zero without a strong systematic curve, so the linear model provides a sensible summary of these predictors and the response.

3 Question 3: What should Joe change?

joe_model_formula <- as.formula(
  paste(reg_y_name, "~ Age + Gender + Smoking_History + Alcohol_Consumption + Urban_or_Rural + Physical_Activity")
)
joe_fit <- lm(joe_model_formula, data = Dataset)
summary(joe_fit)

## 
## Call:
## lm(formula = joe_model_formula, data = Dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.3974  -9.4092   0.0048   9.3987  19.4783 
## 
## Coefficients:
##                            Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)               29.670390   0.708462  41.880 < 0.0000000000000002 ***
## Age                        0.320415   0.009085  35.267 < 0.0000000000000002 ***
## GenderM                    0.088542   0.221355   0.400                0.689    
## Smoking_HistoryYes         6.436583   0.220037  29.252 < 0.0000000000000002 ***
## Alcohol_ConsumptionYes     6.255597   0.215857  28.980 < 0.0000000000000002 ***
## Urban_or_RuralUrban       -6.286387   0.235017 -26.749 < 0.0000000000000002 ***
## Physical_ActivityModerate -2.984116   0.256671 -11.626 < 0.0000000000000002 ***
## Physical_ActivityLow      -1.416900   0.277347  -5.109           0.00000033 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.75 on 9992 degrees of freedom
## Multiple R-squared:  0.2799, Adjusted R-squared:  0.2794 
## F-statistic:   555 on 7 and 9992 DF,  p-value: < 0.00000000000000022

joe_base <- data.frame(
  Age = joe_age,
  Gender = joe_gender,
  Smoking_History = joe_smoking,
  Alcohol_Consumption = joe_alcohol,
  Urban_or_Rural = joe_urban,
  Physical_Activity = joe_activity
)

joe_quit_smoking <- joe_base
joe_quit_smoking$Smoking_History <- "No"

joe_stop_alcohol <- joe_base
joe_stop_alcohol$Alcohol_Consumption <- "No"

joe_move_rural <- joe_base
joe_move_rural$Urban_or_Rural <- "Rural"

joe_raise_activity <- joe_base
joe_raise_activity$Physical_Activity <- "High"

joe_preds <- c(
  Baseline = predict(joe_fit, newdata = joe_base),
  Quit_Smoking = predict(joe_fit, newdata = joe_quit_smoking),
  Stop_Alcohol = predict(joe_fit, newdata = joe_stop_alcohol),
  Move_to_Rural = predict(joe_fit, newdata = joe_move_rural),
  High_Activity = predict(joe_fit, newdata = joe_raise_activity)
)

joe_preds

##      Baseline.1  Quit_Smoking.1  Stop_Alcohol.1 Move_to_Rural.1 High_Activity.1 
##        47.56443        41.12785        41.30883        53.85082        48.98133

best_change <- names(which.min(joe_preds[-1]))
best_change

## [1] "Quit_Smoking.1"

For Joe’s decision, I fit a focused regression model using only the variables named in the prompt: age, gender, smoking history, alcohol consumption, urban or rural residence, and physical activity. This lets me compare Joe’s predicted tumor size under four one-change-at-a-time scenarios. Based on that model, Joe’s predicted Tumor_Size_mm is NA mm under his current profile. If he quit smoking, the predicted value drops to NA mm. If he stopped drinking alcohol, it drops to NA mm. In this dataset, changing to rural residence or moving from low to high physical activity does not improve the prediction, so those are not the best choices here.

The best single change is for Joe to quit smoking. Among the harmful behaviors in this model, smoking has the largest estimated increase in tumor size: its regression coefficient is 6.437 mm, which is slightly larger than the alcohol coefficient of 6.256 mm. So if Joe can only change one thing, smoking is the strongest choice for lowering his predicted bad outcome in this report.

barplot(joe_preds,
        las = 2,
        ylab = paste("Predicted", reg_y_name),
        main = "Joe's predicted outcome under different changes")

Figure 3.1: Predicted outcome for Joe under one-change-at-a-time scenarios

Overall, the report tells a consistent story. Smoking is important in the ANOVA, important again in the multivariable regression, and it also produces the largest beneficial one-step change in Joe’s counterfactual comparison. That consistency makes the final recommendation easy to justify statistically.

STAT 385 Midterm Report

Yidan Wang

2026-03-29

1 Question 1: ANOVA for main effects and interaction

2 Question 2: Predictive regression model

3 Question 3: What should Joe change?