Part 1

# Load necessary libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
# Simulate data based on regression output (only approximate, for demo purposes)
set.seed(123)
n <- 102123
education <- rnorm(n, mean = 16, sd = 2)
male <- rbinom(n, 1, 0.5)
location <- sample(c("PA", "NJ", "NY"), size = n, replace = TRUE)

# Create dummy variables
NJ <- as.integer(location == "NJ")
NY <- as.integer(location == "NY")
PA <- as.integer(location == "PA")  # reference group
male_NJ <- male * NJ
male_NY <- male * NY

# Simulate salary based on regression 4 coefficients
salary <- 9.991 +
  1.997 * education +
  0.788 * male +
  (-2.955) * NJ +
  2.064 * NY +
  (-0.478) * male_NJ +
  (-0.982) * male_NY +
  rnorm(n, 0, 5)  # Add some noise

# Combine into a dataframe
df <- data.frame(salary, education, male, NJ, NY, male_NJ, male_NY)

1. Conditional Independence

After accounting for education, any remaining factors that affect salary must not be systematically related to education. In other words, people with more education must not differ in unobserved salary-relevant ways from people with less education. If these omitted factors are correlated with education, the estimate of β₁ will be biased.

2. Test NY vs PA salary difference in Regression 3

There is statistically significant evidence that average salaries in New York are higher than in Pennsylvania (by approximately $1,354, holding education and gender constant).

model3 <- lm(salary ~ education + male + NJ + NY, data = df)
summary(model3)
## 
## Call:
## lm(formula = salary ~ education + male + NJ + NY, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.6088  -3.3660   0.0145   3.3732  23.0504 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.381577   0.129279  80.304   <2e-16 ***
## education    1.983331   0.007838 253.030   <2e-16 ***
## male         0.299476   0.031353   9.552   <2e-16 ***
## NJ          -3.124062   0.038345 -81.473   <2e-16 ***
## NY           1.655376   0.038388  43.122   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.01 on 102118 degrees of freedom
## Multiple R-squared:  0.4388, Adjusted R-squared:  0.4388 
## F-statistic: 1.996e+04 on 4 and 102118 DF,  p-value: < 2.2e-16
# NY coefficient and p-value will be shown here

3a. Test if location impacts gender’s effect (interactions)

Yes, location impacts the effect of gender on salary. The premium associated with being male is smaller in NJ and NY than in PA. This is evidence of a differential gender effect by location.

model4 <- lm(salary ~ education + male + NJ + NY + male:NJ + male:NY, data = df)
summary(model4)
## 
## Call:
## lm(formula = salary ~ education + male + NJ + NY + male:NJ + 
##     male:NY, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.6604  -3.3669   0.0119   3.3762  22.9038 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.127569   0.130994  77.313  < 2e-16 ***
## education    1.983023   0.007833 253.178  < 2e-16 ***
## male         0.817124   0.054073  15.111  < 2e-16 ***
## NJ          -2.812239   0.054174 -51.911  < 2e-16 ***
## NY           2.123073   0.054240  39.143  < 2e-16 ***
## male:NJ     -0.623565   0.076632  -8.137 4.09e-16 ***
## male:NY     -0.935620   0.076719 -12.195  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.006 on 102116 degrees of freedom
## Multiple R-squared:  0.4397, Adjusted R-squared:  0.4396 
## F-statistic: 1.335e+04 on 6 and 102116 DF,  p-value: < 2.2e-16
# Test if interaction terms are jointly zero
linearHypothesis(model4, c("male:NJ = 0", "male:NY = 0"))
## 
## Linear hypothesis test:
## male:NJ = 0
## male:NY = 0
## 
## Model 1: restricted model
## Model 2: salary ~ education + male + NJ + NY + male:NJ + male:NY
## 
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1 102118 2562895                                  
## 2 102116 2559025  2    3870.1 77.216 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3b. Predict salary for male in NJ with 20 years of education

Predicted Salary: $47,169

new_person <- data.frame(
  education = 20,
  male = 1,
  NJ = 1,
  NY = 0
)
new_person$`male:NJ` <- new_person$male * new_person$NJ
new_person$`male:NY` <- new_person$male * new_person$NY
predict(model4, newdata = new_person)
##        1 
## 47.16935

4a. Test β3 = β4 = β5 = β6 = 0 comparing model2 vs model4

Reject Ho because there is strong evidence that the location and gender-location interaction terms improve the model.

model2 <- lm(salary ~ education + male, data = df)
anova(model2, model4)
## Analysis of Variance Table
## 
## Model 1: salary ~ education + male
## Model 2: salary ~ education + male + NJ + NY + male:NJ + male:NY
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1 102120 2962768                                  
## 2 102116 2559025  4    403743 4027.8 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Alternatively, use linearHypothesis
linearHypothesis(model4, c("NJ = 0", "NY = 0", "male:NJ = 0", "male:NY = 0"))
## 
## Linear hypothesis test:
## NJ = 0
## NY = 0
## male:NJ = 0
## male:NY = 0
## 
## Model 1: restricted model
## Model 2: salary ~ education + male + NJ + NY + male:NJ + male:NY
## 
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1 102120 2962768                                  
## 2 102116 2559025  4    403743 4027.8 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4b. Does education + gender explain more than the mean?

Strongly reject Ho because education and gender explain significantly more variation in salary than just the average.

# Null model = intercept-only
model0 <- lm(salary ~ 1, data = df)
anova(model0, model2)  # Very large F-statistic indicates strong improvement
## Analysis of Variance Table
## 
## Model 1: salary ~ 1
## Model 2: salary ~ education + male
##   Res.Df     RSS Df Sum of Sq     F    Pr(>F)    
## 1 102122 4566931                                 
## 2 102120 2962768  2   1604164 27646 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Part 2

# Load data
data <- read.csv("medicare_wv_13to19(1).csv")

Question 1

The mean of the male variable is 0.6887, meaning about 68.9% of the physicians in this dataset are male.

# Create the 'male' variable
data$male <- ifelse(data$gender == "M", 1, 0)

# Calculate and print the mean
mean(data$male, na.rm = TRUE)
## [1] 0.6887438

Question 2

This output shows a positive and statistically significant relationship between the number of beneficiaries and log of total payment, and also suggests male physicians receive about 7.6% higher payment than female physicians after controlling for number of patients.

data$log_tot_pymt <- log(data$tot_pymt)

model1 <- lm(log_tot_pymt ~ tot_benes + male, data = data)
summary(model1)
## 
## Call:
## lm(formula = log_tot_pymt ~ tot_benes + male, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.2428  -0.3114   0.2686   0.6590   2.7340 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9.8024786  0.0279310 350.953   <2e-16 ***
## tot_benes   0.0025720  0.0000435  59.128   <2e-16 ***
## male        0.0757914  0.0305777   2.479   0.0132 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.099 on 6278 degrees of freedom
## Multiple R-squared:  0.3714, Adjusted R-squared:  0.3712 
## F-statistic:  1854 on 2 and 6278 DF,  p-value: < 2.2e-16

Question 3

Interpretation B is correct because the coefficient is from a log-linear regression, so the interpretation of 0.0758 is approximately a 7.6% difference.

Question 4

model2 <- lm(log_tot_pymt ~ tot_benes * male, data = data)
summary(model2)
## 
## Call:
## lm(formula = log_tot_pymt ~ tot_benes * male, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1621 -0.3141  0.2652  0.6553  2.6812 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     9.3470907  0.0430059  217.34   <2e-16 ***
## tot_benes       0.0041289  0.0001208   34.17   <2e-16 ***
## male            0.6281732  0.0501461   12.53   <2e-16 ***
## tot_benes:male -0.0017809  0.0001292  -13.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.083 on 6277 degrees of freedom
## Multiple R-squared:  0.3898, Adjusted R-squared:  0.3895 
## F-statistic:  1337 on 3 and 6277 DF,  p-value: < 2.2e-16

Question 5

Adding an interaction term introduces additional correlation among variables. This often increases the variance inflation factor, which can cause standard errors of coefficients like male to change. This happens because the new term shares explanatory power with the existing terms, making it harder to separately identify the effect of each.

Question 6: Hypothesis Test

If B3 is statistically significant (p-value < 0.05), then male and female physicians experience different payment increases as number of patients increases.

summary(model2)$coefficients
##                    Estimate   Std. Error   t value      Pr(>|t|)
## (Intercept)     9.347090721 0.0430059239 217.34426  0.000000e+00
## tot_benes       0.004128898 0.0001208416  34.16785 8.031196e-235
## male            0.628173203 0.0501460583  12.52687  1.413829e-35
## tot_benes:male -0.001780937 0.0001292436 -13.77970  1.400577e-42