# Load necessary libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
# Simulate data based on regression output (only approximate, for demo purposes)
set.seed(123)
n <- 102123
education <- rnorm(n, mean = 16, sd = 2)
male <- rbinom(n, 1, 0.5)
location <- sample(c("PA", "NJ", "NY"), size = n, replace = TRUE)
# Create dummy variables
NJ <- as.integer(location == "NJ")
NY <- as.integer(location == "NY")
PA <- as.integer(location == "PA") # reference group
male_NJ <- male * NJ
male_NY <- male * NY
# Simulate salary based on regression 4 coefficients
salary <- 9.991 +
1.997 * education +
0.788 * male +
(-2.955) * NJ +
2.064 * NY +
(-0.478) * male_NJ +
(-0.982) * male_NY +
rnorm(n, 0, 5) # Add some noise
# Combine into a dataframe
df <- data.frame(salary, education, male, NJ, NY, male_NJ, male_NY)
After accounting for education, any remaining factors that affect salary must not be systematically related to education. In other words, people with more education must not differ in unobserved salary-relevant ways from people with less education. If these omitted factors are correlated with education, the estimate of β₁ will be biased.
There is statistically significant evidence that average salaries in New York are higher than in Pennsylvania (by approximately $1,354, holding education and gender constant).
model3 <- lm(salary ~ education + male + NJ + NY, data = df)
summary(model3)
##
## Call:
## lm(formula = salary ~ education + male + NJ + NY, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.6088 -3.3660 0.0145 3.3732 23.0504
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.381577 0.129279 80.304 <2e-16 ***
## education 1.983331 0.007838 253.030 <2e-16 ***
## male 0.299476 0.031353 9.552 <2e-16 ***
## NJ -3.124062 0.038345 -81.473 <2e-16 ***
## NY 1.655376 0.038388 43.122 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.01 on 102118 degrees of freedom
## Multiple R-squared: 0.4388, Adjusted R-squared: 0.4388
## F-statistic: 1.996e+04 on 4 and 102118 DF, p-value: < 2.2e-16
# NY coefficient and p-value will be shown here
Yes, location impacts the effect of gender on salary. The premium associated with being male is smaller in NJ and NY than in PA. This is evidence of a differential gender effect by location.
model4 <- lm(salary ~ education + male + NJ + NY + male:NJ + male:NY, data = df)
summary(model4)
##
## Call:
## lm(formula = salary ~ education + male + NJ + NY + male:NJ +
## male:NY, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.6604 -3.3669 0.0119 3.3762 22.9038
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.127569 0.130994 77.313 < 2e-16 ***
## education 1.983023 0.007833 253.178 < 2e-16 ***
## male 0.817124 0.054073 15.111 < 2e-16 ***
## NJ -2.812239 0.054174 -51.911 < 2e-16 ***
## NY 2.123073 0.054240 39.143 < 2e-16 ***
## male:NJ -0.623565 0.076632 -8.137 4.09e-16 ***
## male:NY -0.935620 0.076719 -12.195 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.006 on 102116 degrees of freedom
## Multiple R-squared: 0.4397, Adjusted R-squared: 0.4396
## F-statistic: 1.335e+04 on 6 and 102116 DF, p-value: < 2.2e-16
# Test if interaction terms are jointly zero
linearHypothesis(model4, c("male:NJ = 0", "male:NY = 0"))
##
## Linear hypothesis test:
## male:NJ = 0
## male:NY = 0
##
## Model 1: restricted model
## Model 2: salary ~ education + male + NJ + NY + male:NJ + male:NY
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 102118 2562895
## 2 102116 2559025 2 3870.1 77.216 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Predicted Salary: $47,169
new_person <- data.frame(
education = 20,
male = 1,
NJ = 1,
NY = 0
)
new_person$`male:NJ` <- new_person$male * new_person$NJ
new_person$`male:NY` <- new_person$male * new_person$NY
predict(model4, newdata = new_person)
## 1
## 47.16935
Reject Ho because there is strong evidence that the location and gender-location interaction terms improve the model.
model2 <- lm(salary ~ education + male, data = df)
anova(model2, model4)
## Analysis of Variance Table
##
## Model 1: salary ~ education + male
## Model 2: salary ~ education + male + NJ + NY + male:NJ + male:NY
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 102120 2962768
## 2 102116 2559025 4 403743 4027.8 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Alternatively, use linearHypothesis
linearHypothesis(model4, c("NJ = 0", "NY = 0", "male:NJ = 0", "male:NY = 0"))
##
## Linear hypothesis test:
## NJ = 0
## NY = 0
## male:NJ = 0
## male:NY = 0
##
## Model 1: restricted model
## Model 2: salary ~ education + male + NJ + NY + male:NJ + male:NY
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 102120 2962768
## 2 102116 2559025 4 403743 4027.8 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Strongly reject Ho because education and gender explain significantly more variation in salary than just the average.
# Null model = intercept-only
model0 <- lm(salary ~ 1, data = df)
anova(model0, model2) # Very large F-statistic indicates strong improvement
## Analysis of Variance Table
##
## Model 1: salary ~ 1
## Model 2: salary ~ education + male
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 102122 4566931
## 2 102120 2962768 2 1604164 27646 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Load data
data <- read.csv("medicare_wv_13to19(1).csv")
The mean of the male variable is 0.6887, meaning about 68.9% of the physicians in this dataset are male.
# Create the 'male' variable
data$male <- ifelse(data$gender == "M", 1, 0)
# Calculate and print the mean
mean(data$male, na.rm = TRUE)
## [1] 0.6887438
This output shows a positive and statistically significant relationship between the number of beneficiaries and log of total payment, and also suggests male physicians receive about 7.6% higher payment than female physicians after controlling for number of patients.
data$log_tot_pymt <- log(data$tot_pymt)
model1 <- lm(log_tot_pymt ~ tot_benes + male, data = data)
summary(model1)
##
## Call:
## lm(formula = log_tot_pymt ~ tot_benes + male, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.2428 -0.3114 0.2686 0.6590 2.7340
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.8024786 0.0279310 350.953 <2e-16 ***
## tot_benes 0.0025720 0.0000435 59.128 <2e-16 ***
## male 0.0757914 0.0305777 2.479 0.0132 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.099 on 6278 degrees of freedom
## Multiple R-squared: 0.3714, Adjusted R-squared: 0.3712
## F-statistic: 1854 on 2 and 6278 DF, p-value: < 2.2e-16
Interpretation B is correct because the coefficient is from a log-linear regression, so the interpretation of 0.0758 is approximately a 7.6% difference.
model2 <- lm(log_tot_pymt ~ tot_benes * male, data = data)
summary(model2)
##
## Call:
## lm(formula = log_tot_pymt ~ tot_benes * male, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1621 -0.3141 0.2652 0.6553 2.6812
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.3470907 0.0430059 217.34 <2e-16 ***
## tot_benes 0.0041289 0.0001208 34.17 <2e-16 ***
## male 0.6281732 0.0501461 12.53 <2e-16 ***
## tot_benes:male -0.0017809 0.0001292 -13.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.083 on 6277 degrees of freedom
## Multiple R-squared: 0.3898, Adjusted R-squared: 0.3895
## F-statistic: 1337 on 3 and 6277 DF, p-value: < 2.2e-16
Adding an interaction term introduces additional correlation among variables. This often increases the variance inflation factor, which can cause standard errors of coefficients like male to change. This happens because the new term shares explanatory power with the existing terms, making it harder to separately identify the effect of each.
If B3 is statistically significant (p-value < 0.05), then male and female physicians experience different payment increases as number of patients increases.
summary(model2)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.347090721 0.0430059239 217.34426 0.000000e+00
## tot_benes 0.004128898 0.0001208416 34.16785 8.031196e-235
## male 0.628173203 0.0501460583 12.52687 1.413829e-35
## tot_benes:male -0.001780937 0.0001292436 -13.77970 1.400577e-42