Part 1

1. Inspect differences across the four groups using side-by-side box plots

Create a new variable, agegen, to indicate the four age-gender groups, and inspect the data using side-by-side box plots. Describe the differences across the four groups in median costs, spread, and shape.

The median cost for females aged 60 or less, 4860, is similar to that for males aged 60 or less, 4488. In the population aged over 60, the medians are also similar (5904 for females versus 5866 for males). The group of females aged 60 or less has the largest variability of the four groups, followed by females aged over 60, males aged over 60, and males aged 60 or less. The female groups have more high outliers regardless of age. The distributions of all groups are skewed to the right; however, the skew is more pronounced in females.

# import dataset
library(tidyverse)
ce621 = read_csv("ce621.csv")
# create new variable
ce621 = ce621 %>% 
  mutate(agegen=case_when(sex=="Male" & age<=60 ~ "m <=60",
                          sex=="Female" & age<=60 ~ "f <=60",
                          sex=="Male" & age > 60 ~ "m >60",
                          sex=="Female" & age > 60 ~ "f >60"))

# Inspect the data using side-by-side box plots:
boxplot(totchg~agegen, data=ce621, ylab="Total charges in dollars")

ce621 %>%
  group_by(agegen) %>%
  summarize(obs=n(), mean=mean(totchg), median=median(totchg), sd=sd(totchg), min=min(totchg), max=max(totchg))
## # A tibble: 4 × 7
##   agegen   obs  mean median    sd   min   max
##   <chr>  <int> <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 f <=60    17 8321   4860  9662.  2629 40083
## 2 f >60     83 8054.  5904  5970.   200 33197
## 3 m <=60    10 4829.  4488. 1326.  2984  6731
## 4 m >60     90 6668.  5866  3381.  2528 19427

2. Perform a linear regression of total charges on the age-gender groups

Now use the data to perform a linear regression of total charges on the age-gender groups to partition the total variability as displayed in the analysis of variance (ANOVA) table for regression. Interpret each of the regression coefficients. Using regression, how do you test the overall hypothesis of no group differences? What is the difference between the results of the lm and glm commands?

Our regression model is the following equation. Let X1 = 1 if the group is f >60 (0 otherwise), X2 = 1 if the group is m <=60, and X3 = 1 if the group is m >60; f <=60 is the reference group.

E[totchg] = β0 + β1X1 + β2X2 + β3X3
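
The indicator coding can be inspected directly from the design matrix R constructs (a minimal sketch using model.matrix; "f <=60" is the reference level because it sorts first alphabetically):

# Each row shows the intercept and three dummy variables for one patient;
# the intercept column corresponds to the reference group "f <=60"
head(model.matrix(~ as.factor(agegen), data = ce621))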

Based on the output, we can interpret the coefficients as follows: (1) β0 = 8321.0 is the expected (average) total charges for females aged 60 or less, the reference group. (2) β1 = -267.1 is the difference in expected total charges between females aged over 60 and the reference group. (3) β2 = -3492.3 is the difference between males aged 60 or less and the reference group. (4) β3 = -1653.0 is the difference between males aged over 60 and the reference group.

We can also recover each group mean from the coefficients: (1) β0 + β1 = 8321.0 - 267.1 = 8053.9, the expected (average) total charges for females aged over 60. (2) β0 + β2 = 8321.0 - 3492.3 = 4828.7, the expected total charges for males aged 60 or less. (3) β0 + β3 = 8321.0 - 1653.0 = 6668.0, the expected total charges for males aged over 60. These agree with the group sample means in the summary table above.
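
As a check, the four fitted group means can be computed directly from the coefficients (a minimal sketch, assuming model1 from the lm fit below):

# Reference-group mean plus each difference gives the remaining group means
b = coef(model1)
c("f <=60" = b[[1]], b[[1]] + b[-1])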

We can set up H0: β1 = β2 = β3 = 0, which is the same as the null hypothesis that there are no differences in mean total charges across the four groups. The F statistic is 1.94 on (3, 196) degrees of freedom with p-value = 0.124, so we fail to reject H0; there is no evidence that mean total charges differ across the four groups.
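
Equivalently, the overall null hypothesis can be tested by comparing the intercept-only model with the group model (a minimal sketch, again assuming model1 from the lm fit below):

# Nested-model comparison reproduces F = 1.94 on (3, 196) df, p = 0.124
model0 = lm(totchg ~ 1, data = ce621)
anova(model0, model1)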

The lm and glm commands give identical coefficient estimates and standard errors here. They differ in the output they report: lm gives the ANOVA partition with an F statistic, R-squared, and the residual standard error, while glm gives an analysis of deviance table, the dispersion parameter, and AIC. Confidence intervals also differ: confint on an lm fit uses the t distribution, whereas confint on a glm fit is based on profile likelihood, which for this Gaussian model is close to a z-based interval.
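
A quick side-by-side comparison (a sketch, assuming model1 and model2 from the fits below; R may print a profiling message for the glm):

# t-based intervals from the lm fit vs. profile-based intervals from the glm fit
confint(model1)
confint(model2)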

3. Summarize initial findings with respect to CE costs as a function of age and gender

The median carotid endarterectomy (CE) expenditure tends to increase with age in both genders. However, the mean and variance of CE cost in females are larger than in males in both age strata. The pronounced positive skewness and higher means in females suggest that CE expenditures tend to be higher in the female population regardless of age.

# perform a linear regression of total charges on the age-gender groups 
# to partition the total variability as displayed in the analysis of variance (ANOVA) table for regression.
## using the lm command
model1 = lm(totchg ~ as.factor(agegen), data=ce621)
anova(model1)
## Analysis of Variance Table
## 
## Response: totchg
##                    Df    Sum Sq  Mean Sq F value Pr(>F)
## as.factor(agegen)   3 1.619e+08 53968180  1.9412 0.1242
## Residuals         196 5.449e+09 27800817
summary(model1)
## 
## Call:
## lm(formula = totchg ~ as.factor(agegen), data = ce621)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7854  -3061  -1438   1232  31762 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               8321.0     1278.8   6.507 6.26e-10 ***
## as.factor(agegen)f >60    -267.1     1403.7  -0.190   0.8493    
## as.factor(agegen)m <=60  -3492.3     2101.3  -1.662   0.0981 .  
## as.factor(agegen)m >60   -1653.0     1394.4  -1.186   0.2372    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5273 on 196 degrees of freedom
## Multiple R-squared:  0.02886,    Adjusted R-squared:  0.01399 
## F-statistic: 1.941 on 3 and 196 DF,  p-value: 0.1242
## using the glm command
model2 = glm(totchg ~ as.factor(agegen), data=ce621, family=gaussian(link="identity"))
anova(model2)
## Analysis of Deviance Table
## 
## Model: gaussian, link: identity
## 
## Response: totchg
## 
## Terms added sequentially (first to last)
## 
## 
##                   Df  Deviance Resid. Df Resid. Dev
## NULL                                 199 5610864745
## as.factor(agegen)  3 161904541       196 5448960204
summary(model2)
## 
## Call:
## glm(formula = totchg ~ as.factor(agegen), family = gaussian(link = "identity"), 
##     data = ce621)
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               8321.0     1278.8   6.507 6.26e-10 ***
## as.factor(agegen)f >60    -267.1     1403.7  -0.190   0.8493    
## as.factor(agegen)m <=60  -3492.3     2101.3  -1.662   0.0981 .  
## as.factor(agegen)m >60   -1653.0     1394.4  -1.186   0.2372    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 27800817)
## 
##     Null deviance: 5610864745  on 199  degrees of freedom
## Residual deviance: 5448960204  on 196  degrees of freedom
## AIC: 4001.7
## 
## Number of Fisher Scoring iterations: 2

Part 2

4. Obtain the residuals from the regression model above and make a boxplot of the residuals by group

Inspect your box plots from Step A and notice whether the observations are approximately normal and have equal variance. To remove the differences in average values among the 4 samples, we plot residuals rather than raw data, either by group or, more often, against the predicted values. A residual is the difference between the observed CE cost and the predicted value from the regression. In a linear regression of a continuous response on a group variable, the predicted value is just the group sample mean, and the residual is just the deviation of each observation from its group mean. Obtain the residuals from the regression model above. Make a boxplot of the residuals by group. Plot the residuals against the predicted values.
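
That equivalence is easy to verify directly (a minimal sketch using ave() to attach each group's sample mean to its observations, assuming model1 from the fit above):

# Fitted values equal the group means; residuals equal deviations from them
group_means = ave(ce621$totchg, ce621$agegen)
all.equal(unname(fitted(model1)), group_means)
all.equal(unname(resid(model1)), ce621$totchg - group_means)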

Based on the box plots from Step A, the observations are positively skewed and have different variances. The boxplots of the residuals by group are also positively skewed. The plot of the residuals against the predicted values (with the jitter option) shows that the residuals are not Gaussian, not even approximately, and do not have equal variances.

Linear regression makes two relevant assumptions here: (1) observations within a group are approximately normally distributed, and (2) the within-group variance is the same across all groups. Since both assumptions are violated, I need to transform the outcome variable.
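
A normal Q-Q plot of the residuals makes the normality violation explicit (a minimal sketch using base graphics):

# Points bending away from the reference line in the upper tail indicate right skew
qqnorm(resid(model1))
qqline(resid(model1))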

# Inspect your box plots from Step A 
model1 = lm(totchg ~ as.factor(agegen), data=ce621)
boxplot(model1$residuals ~ ce621$agegen, ylab="Residuals")

# Plot the residuals against the predicted values
# (ggplot() is used instead of the deprecated qplot()).
ggplot(data.frame(fitted = fitted(model1), resid = resid(model1)),
       aes(x = fitted, y = resid)) +
  geom_point() +
  labs(x = "Predicted values", y = "Residuals")

# [+] This graph can be improved by using the jitter option,
# which separates the overplotted points within each group.
ggplot(data.frame(fitted = jitter(fitted(model1)), resid = resid(model1)),
       aes(x = fitted, y = resid)) +
  geom_point() +
  labs(x = "Fitted values", y = "Residuals")

5. Transform the CE expenditure data to address this problem

One way to address this problem is by analyzing a transformation of the CE expenditure data, rather than the data on its original scale. This works if you want to ask questions about whether there are differences between groups rather than estimating the size of the differences. To accomplish this, generate a new variable which is the logarithm (log10) of CE expenditures. Make the graphical display using box plots, as was done above.

Looking at the box plot of the log-transformed variable, the distribution of the transformed CE data in Step B is closer to normal than the distribution of the untransformed CE data in Step A. The within-group variances in Step B are also more nearly equal across groups than the variances in Step A.

However, before relying on the equal (constant) variance assumption, we would want formal statistical evidence, such as a confidence interval or p-value from a test of homogeneity of variances.
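
One such check is Bartlett's test of homogeneity of variances; a minimal sketch using base R's bartlett.test and the log-transformed variable logtotchg created in the code below (Bartlett's test is itself sensitive to non-normality, so Levene's test from the car package is a robust alternative):

# Test H0: equal within-group variances of log10 charges across the four groups
bartlett.test(logtotchg ~ as.factor(agegen), data = ce621)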

# To address this problem, analyze a transformation of the CE expenditure data,
# rather than the data on its original scale.
# transformation 
ce621 = ce621 %>%
  mutate(logtotchg=log10(totchg))
model3 = lm(logtotchg ~ as.factor(agegen), data=ce621)
summary(model3)
## 
## Call:
## lm(formula = logtotchg ~ as.factor(agegen), data = ce621)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.51002 -0.15155 -0.03341  0.14645  0.84548 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              3.75748    0.06201  60.599   <2e-16 ***
## as.factor(agegen)f >60   0.05358    0.06806   0.787    0.432    
## as.factor(agegen)m <=60 -0.08885    0.10189  -0.872    0.384    
## as.factor(agegen)m >60   0.02158    0.06761   0.319    0.750    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2557 on 196 degrees of freedom
## Multiple R-squared:  0.01596,    Adjusted R-squared:  0.0009018 
## F-statistic:  1.06 on 3 and 196 DF,  p-value: 0.3673
# boxplot with residuals
boxplot(model3$residuals ~ ce621$agegen, ylab="Residuals")

Part 3

6. Use bootstrapping to get more appropriate standard errors

Another way to proceed when the focus is the difference in the means themselves, not the means of a transformed value, is to use regression to estimate the means but to use bootstrapping to get more appropriate standard errors that do not depend on the normal and equal variance assumptions. Compare the bootstrap standard errors and confidence intervals with the ones from the original regression analysis. These are more valid when the assumptions are so strongly violated.

The bootstrap standard errors, which range from about 2196 to 2312, are larger than the model-based standard errors from the original regression analysis (1278.8 to 2101.3), and the bootstrap confidence intervals are correspondingly wider. This is expected: the model-based standard errors are computed from the pooled root MSE (5272.6) and so assume a common residual variance across groups, while the bootstrap reflects the actual unequal variances and heavy right skew. The bootstrap therefore provides more valid, though less optimistic, uncertainty estimates when the normality and equal-variance assumptions are violated.
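
Because the sampling distributions here are right-skewed, percentile intervals, which do not impose a normal shape on the interval, are a useful complement (a minimal sketch, assuming the boot object results created in the code below):

# Percentile bootstrap CI for the intercept (mean cost for f <=60)
boot.ci(results, type = "perc", index = 1)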

The mean CE costs in females (8321 for those aged 60 or less, 8054 for those aged over 60) are larger than those in males (4829 for those aged 60 or less, 6668 for those aged over 60). The female groups have larger variability and more upper outliers than the male groups, regardless of age. The distributions of all groups are skewed to the right; however, the skew is more pronounced in females.

Based on the regression analysis, no statistically significant difference in the means of CE cost is observed across the four groups, as evidenced by the p-value of the F test, 0.124. The bootstrap does not change this conclusion: the 95% bootstrap intervals for all three group differences include zero. What the bootstrap provides, in this setting where the normality and equal-variance assumptions do not hold, is standard errors and confidence intervals that remain valid despite those violations. The point estimates still suggest higher and more variable costs among females aged 60 or less, but the differences are not statistically significant in this sample.

# setting for bootstrapping
library(boot)

# function to obtain regression coefficients
bs = function(formula, data, indices) {
  d=data[indices, ] # allows boot to select sample 
  fit = lm(formula, data=d)
  return(coef(fit))
}
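
# Optional: fix the random seed for reproducibility (arbitrary value;
# the output below was generated without a fixed seed)
set.seed(621)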

# bootstrapping with 250 replications
results = boot(data=ce621, statistic=bs, R=250, formula=totchg~agegen)
results
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = ce621, statistic = bs, R = 250, formula = totchg ~ 
##     agegen)
## 
## 
## Bootstrap Statistics :
##       original    bias    std. error
## t1*  8321.0000  236.7810    2196.284
## t2*  -267.1205 -226.8572    2311.620
## t3* -3492.3000 -230.4359    2238.675
## t4* -1653.0333 -231.5245    2214.814
# get 95% confidence intervals from the bootstrap
boot.ci(results, type="norm", index=1) # intercept (f <=60)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 250 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, type = "norm", index = 1)
## 
## Intervals : 
## Level      Normal        
## 95%   ( 3780, 12389 )  
## Calculations and Intervals on Original Scale
boot.ci(results, type="norm", index=2) # f >60
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 250 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, type = "norm", index = 2)
## 
## Intervals : 
## Level      Normal        
## 95%   (-4571.0,  4490.4 )  
## Calculations and Intervals on Original Scale
boot.ci(results, type="norm", index=3) # m <=60
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 250 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, type = "norm", index = 3)
## 
## Intervals : 
## Level      Normal        
## 95%   (-7650,  1126 )  
## Calculations and Intervals on Original Scale
boot.ci(results, type="norm", index=4) # m >60
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 250 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, type = "norm", index = 4)
## 
## Intervals : 
## Level      Normal        
## 95%   (-5762,  2919 )  
## Calculations and Intervals on Original Scale
# get 95% confidence intervals from the regression model
confint(model1)
##                             2.5 %     97.5 %
## (Intercept)              5799.016 10842.9840
## as.factor(agegen)f >60  -3035.358  2501.1166
## as.factor(agegen)m <=60 -7636.343   651.7426
## as.factor(agegen)m >60  -4402.908  1096.8418