Create a new variable, agegen, to indicate the four age-gender groups. And inspect the data using side-by-side box plots: Describe the differences across the four groups in median costs, spread and shape.
The median cost values of male aged less than 60, 4860, is similar to that of female aged less than 60, 4488. In the population aged over 60, the median costs are also similar (5904 versus 5866, respectively). We can see the group of female aged less than 60 has the largest variability across the four groups, followed by the group of female aged over 60, the group of male aged over 60, and the group of male aged less than 60. The female population has more higher outliers regardless of the age. The distributions of all groups are skewed to the right, however, it is more positively skewed in females.
# import dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ce621 = read_csv("ce621.csv")
## Rows: 200 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): provnum, sex, race, volhosp, volsurg, dthstrk, smoker
## dbl (5): patid, ccscore, totchg, age, year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# create new variable
ce621 = ce621 %>%
mutate(agegen=case_when(sex=="Male" & age<=60 ~ "m <=60",
sex=="Female" & age<=60 ~ "f <=60",
sex=="Male" & age > 60 ~ "m >60",
sex=="Female" & age > 60 ~ "f >60"))
# Inspect the data using side-by-side box plots:
boxplot(totchg~agegen, data=ce621, ylab="Total charges in dollars")
ce621 %>%
group_by(agegen) %>%
summarize(obs=n(), mean=mean(totchg), median=median(totchg), sd=sd(totchg), min=min(totchg), max=max(totchg))
## # A tibble: 4 × 7
## agegen obs mean median sd min max
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 f <=60 17 8321 4860 9662. 2629 40083
## 2 f >60 83 8054. 5904 5970. 200 33197
## 3 m <=60 10 4829. 4488. 1326. 2984 6731
## 4 m >60 90 6668. 5866 3381. 2528 19427
Now use the data to perform a linear regression of total charges on the age-gender groups to partition the total variability as displayed in the analysis of variance (ANOVA) table for regression. Interpret each of the regression coefficients. Using regression, how do you test the overall hypothesis of no group differences? What is the difference between the results of the lm and glm commands?
Our regression model is the following equation, and let X1 =1 if the group is f > 60, X2 =1 if the group is m <= 60, and X3 =1 if the group is m > 60.
E[totchg] = β0 + β1X1 + β2X2 + β3X3
Based on the output of the computing, we can interpret: (1) β0 = 8321.0, the expected (average) total charges for female aged less than 60. (2) β0 + β1 = -267.1, the expected (average) total charges for female aged over 60. (3) β0 + β2 = -3492.3, the expected (average) total charges for male aged less than 60. (4) β0 + β3 = -1653.0, the expected (average) total charges for male aged over 60.
We can also calculate the each value of slope: (1) β1 = - 267.1 - 8321.0 = - 8588.1, the difference in the expected (average) total charges for female aged over 60. (2) β2 = - 3492.3 - 8321.0 = - 11813.3, the difference in the expected (average) total charges for male aged less than 60. (3) β3 = - 1653.0 - 8321.0 = - 9974.0, the difference in the expected (average) total charges for male aged over 60.
We can set up Ho: β1 = β2 = β3 = 0, which is the same as Ho that there are no differences across the mean total charges of four groups. The value of the F-statistic is 1.94 on the df = 3, 196 and the p-value = 0.124 would suggest fail to rejecting Ho, which means the mean of total charges are the same across the four groups.
The lm command uses the t statistics in constructing CIs for the regression coefficients, however, the glm commands uses the z statistics in constructing CIs for the regression coefficients.
The median value of the carotid endarterectomy (CE) expenditure tend to increase as the age increases in both genders. However, the mean and the variance of CE cost in female are larger than that of male in both age stratum. The largely positive skewness and higher means in females show that the higher CE expenditure are needed for female population regardless of age.
# perform a linear regression of total charges on the age-gender groups
# to partition the total variability as displayed in the analysis of variance (ANOVA) table for regression.
## using the lm command
model1 = lm(totchg ~ as.factor(agegen), data=ce621)
anova(model1)
## Analysis of Variance Table
##
## Response: totchg
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(agegen) 3 1.619e+08 53968180 1.9412 0.1242
## Residuals 196 5.449e+09 27800817
summary(model1)
##
## Call:
## lm(formula = totchg ~ as.factor(agegen), data = ce621)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7854 -3061 -1438 1232 31762
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8321.0 1278.8 6.507 6.26e-10 ***
## as.factor(agegen)f >60 -267.1 1403.7 -0.190 0.8493
## as.factor(agegen)m <=60 -3492.3 2101.3 -1.662 0.0981 .
## as.factor(agegen)m >60 -1653.0 1394.4 -1.186 0.2372
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5273 on 196 degrees of freedom
## Multiple R-squared: 0.02886, Adjusted R-squared: 0.01399
## F-statistic: 1.941 on 3 and 196 DF, p-value: 0.1242
##using the glm command
model2 = glm(totchg ~ as.factor(agegen), data=ce621, family=gaussian(link="identity"))
anova(model2)
## Analysis of Deviance Table
##
## Model: gaussian, link: identity
##
## Response: totchg
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 199 5610864745
## as.factor(agegen) 3 161904541 196 5448960204
summary(model2)
##
## Call:
## glm(formula = totchg ~ as.factor(agegen), family = gaussian(link = "identity"),
## data = ce621)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8321.0 1278.8 6.507 6.26e-10 ***
## as.factor(agegen)f >60 -267.1 1403.7 -0.190 0.8493
## as.factor(agegen)m <=60 -3492.3 2101.3 -1.662 0.0981 .
## as.factor(agegen)m >60 -1653.0 1394.4 -1.186 0.2372
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 27800817)
##
## Null deviance: 5610864745 on 199 degrees of freedom
## Residual deviance: 5448960204 on 196 degrees of freedom
## AIC: 4001.7
##
## Number of Fisher Scoring iterations: 2
Inspect your box plots from Step A and notice whether the observations are approximately normal and have equal variance. To remove the differences in average values among the 4 samples, we plot residuals rather than raw data by group or more often against predicted value. A residual is the difference between the observed CE cost and the predicted value from the regression. In a linear regression of continuous response on a group variable, the predicted value is just the group sample mean and the residual is just the deviation of each observation from its group mean. Obtain the residuals from the regression model above. Make a boxplot of the residuals by group. Plot the residuals against group.
Base on the box plot from Step A, the observations are positively skewed and have different variance. The boxplots of the residuals by group are also positively skewed. The plot the residuals against the predicted values with jitter option shows that the the residuals are not Gaussian, not even approximately, and do not have equal variances.
Given the following two assumptions of the linear regression, 1. Observations within a group are approximately normally distributed. 2. The within-group variance is the same across all groups.
I need to do transformation of the value of outcome variable.
#Inspect your box plots from Step A
model1 = lm(totchg ~ as.factor(agegen), data=ce621)
boxplot(model1$residuals ~ ce621$agegen, ylab="Residuals")
# Plot the residuals against the predicted values.
qplot(x=model1$fitted.values, y=model1$residuals, xlab="Predicted values", ylab="Residuals")
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# [+] This graph can be improved by using the jitter option.
qplot(x=jitter(model1$fitted.values), y=model1$residuals,xlab="Fitted values", ylab="Residuals")
One way to address this problem is by analyzing a transformation of the CE expenditure data, rather than the data on its original scale. This works if you want to ask questions about whether there are differences between groups rather than estimating the size of the differences. To accomplish this, generate a new variable which is the logarithm(log10)of CE expenditures. Make the graphical display using box plots, as was done above.
When I see the box plot with a new variable with log transformation,the distribution of the transformed CE data in Step B is more normally distributed compared to the previous distribution of the untransformed CE data in Step A. The within-group variances in Step B are more nearly equal across groups than the variances in Step A.
However, we need more statistical evidence with CIs or p-value in order to make the equal variance assumption to make the assumption of equal (constant) variance.
# to adress this problem, -> analyzing a transformation of the CE expenditure data, rather than the data on its original scale.
# transformation
ce621 = ce621 %>%
mutate(logtotchg=log10(totchg))
model3 = lm(logtotchg ~ as.factor(agegen), data=ce621)
summary(model3)
##
## Call:
## lm(formula = logtotchg ~ as.factor(agegen), data = ce621)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.51002 -0.15155 -0.03341 0.14645 0.84548
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.75748 0.06201 60.599 <2e-16 ***
## as.factor(agegen)f >60 0.05358 0.06806 0.787 0.432
## as.factor(agegen)m <=60 -0.08885 0.10189 -0.872 0.384
## as.factor(agegen)m >60 0.02158 0.06761 0.319 0.750
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2557 on 196 degrees of freedom
## Multiple R-squared: 0.01596, Adjusted R-squared: 0.0009018
## F-statistic: 1.06 on 3 and 196 DF, p-value: 0.3673
# boxplot with residuals
boxplot(model3$residuals ~ ce621$agegen, ylab="Residuals")
## Part 3.
Another way to proceed when the focus is the difference in the means themselves, not the means of a transformed value, is to use regression to estimate the means but to use bootstrapping to get more appropriate standard errors that do not depend on the normal and equal variance assumptions. Compare the bootstrap standard errors and confidence intervals with the ones from the original regression analysis. These are more valid when the assumptions are so strongly violated.
The value of standard error in bootstrap, which are falled between 2327.765 to 2457.763, is smaller than that from the original regression analysis, 5272.648 (calculated by the root of MSE). The confidence intervals are also narrow in the bootstrap. We can see the bootstrap method can provide more valid and precise results under the violation of the normality assumption.
The mean of CE cost in females (8321 for those aged less than 60, 8054 for those aged over 60) are larger than that of males (4829 for those aged less than 60, 6668 for those aged over 60). We can see the groups of female have larger variability and more upper outliers than the groups of males regardless of age. The distributions of all groups are skewed to the right, however, it is more positively skewed in females.
Based on the regression analysis, no statistically significant difference in the means of CE cost is observed across the four groups, as evidenced by p value of F test, 0.124. However, according to the bootstrap, there is significant difference in the mean of CE cost of female aged less than 60 compared with the others. In this case, where the normality assumption does not hold, the bootstrap can provide more precise estimation. As a result, females aged less than 60 has statistically significant higher mean CE cost compared with the other gender and age strata.
# setting for bootstrapping
library(boot)
# function to obtain regression coefficients
bs = function(formula, data, indices) {
d=data[indices, ] # allows boot to select sample
fit = lm(formula, data=d)
return(coef(fit))
}
# bootstrapping with 250 replications
results = boot(data=ce621, statistic=bs, R=250, formula=totchg~agegen)
results
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = ce621, statistic = bs, R = 250, formula = totchg ~
## agegen)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 8321.0000 236.7810 2196.284
## t2* -267.1205 -226.8572 2311.620
## t3* -3492.3000 -230.4359 2238.675
## t4* -1653.0333 -231.5245 2214.814
# get 95% confidence intervals from the bootstrap
boot.ci(results, type="norm", index=1) # intercept (f <=60)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 250 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "norm", index = 1)
##
## Intervals :
## Level Normal
## 95% ( 3780, 12389 )
## Calculations and Intervals on Original Scale
boot.ci(results, type="norm", index=2) # f >60
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 250 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "norm", index = 2)
##
## Intervals :
## Level Normal
## 95% (-4571.0, 4490.4 )
## Calculations and Intervals on Original Scale
boot.ci(results, type="norm", index=3) # m <=60
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 250 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "norm", index = 3)
##
## Intervals :
## Level Normal
## 95% (-7650, 1126 )
## Calculations and Intervals on Original Scale
boot.ci(results, type="norm", index=4) # m >60
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 250 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "norm", index = 4)
##
## Intervals :
## Level Normal
## 95% (-5762, 2919 )
## Calculations and Intervals on Original Scale
# get 95% confidence intervals from the regression model
confint(model1)
## 2.5 % 97.5 %
## (Intercept) 5799.016 10842.9840
## as.factor(agegen)f >60 -3035.358 2501.1166
## as.factor(agegen)m <=60 -7636.343 651.7426
## as.factor(agegen)m >60 -4402.908 1096.8418