setwd("C:/Users/leyla/Documents/DATA 101")
dds.dis <- read.csv("dds.discr.csv")
head(dds.dis)
## id age.cohort age gender expenditures ethnicity
## 1 10210 13-17 17 Female 2113 White not Hispanic
## 2 10409 22-50 37 Male 41924 White not Hispanic
## 3 10486 0-5 3 Male 1454 Hispanic
## 4 10538 18-21 19 Female 6400 Hispanic
## 5 10568 13-17 13 Male 4412 White not Hispanic
## 6 10690 13-17 15 Female 4566 Hispanic
My research question is: Does the amount of money the state spends on developmental disability services differ between ethnic groups, after accounting for age and gender? This project examines patterns in state spending using the Discrimination in Developmental Disability Support dataset, which contains information on 1,000 individuals receiving developmental disability services through the California Department of Developmental Services (DDS). The dataset includes six variables, but this analysis focuses on the four most relevant to my research question: expenditures (a continuous measure of annual state spending on each individual), ethnicity (a categorical variable representing the individual’s ethnic group), gender (a categorical variable, female or male), and age (a quantitative variable measured in years). Together, these variables allow us to examine whether demographic differences are associated with variations in state-funded support for disability services within California.
Before running the regression model, I will clean and explore the dataset to understand the variables and prepare the data for analysis. I will select only the relevant variables, check for missing values, and remove any rows with incomplete data. Then I will summarize the data to understand the distribution of expenditures and how they vary across demographic groups. These steps ensure that the dataset is clean, organized, and ready for multiple regression analysis.
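As a minimal sketch of the missing-value check described above, the base R snippet below counts NA values per column and keeps only complete rows. The `toy` data frame and its values are made up for illustration and are not taken from the DDS dataset.

```r
# Toy data frame standing in for the real dataset (values are made up)
toy <- data.frame(
  expenditures = c(2113, NA, 1454, 6400),
  ethnicity    = c("Hispanic", "Asian", NA, "Hispanic"),
  gender       = c("Female", "Male", "Male", "Female"),
  age          = c(17, 37, 3, 19)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the rows that have no missing value in any column
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```

Running the count first shows whether any rows will actually be dropped, which is worth reporting alongside the final sample size.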
#Check the structure of the dataset
str(dds.dis)
## 'data.frame': 1000 obs. of 6 variables:
## $ id : int 10210 10409 10486 10538 10568 10690 10711 10778 10820 10823 ...
## $ age.cohort : chr "13-17" "22-50" "0-5" "18-21" ...
## $ age : int 17 37 3 19 13 15 13 17 14 13 ...
## $ gender : chr "Female" "Male" "Male" "Female" ...
## $ expenditures: int 2113 41924 1454 6400 4412 4566 3915 3873 5021 2887 ...
## $ ethnicity : chr "White not Hispanic" "White not Hispanic" "Hispanic" "Hispanic" ...
#Summary of all variables
summary(dds.dis)
## id age.cohort age gender
## Min. :10210 Length:1000 Min. : 0.0 Length:1000
## 1st Qu.:31809 Class :character 1st Qu.:12.0 Class :character
## Median :55385 Mode :character Median :18.0 Mode :character
## Mean :54663 Mean :22.8
## 3rd Qu.:76135 3rd Qu.:26.0
## Max. :99898 Max. :95.0
## expenditures ethnicity
## Min. : 222 Length:1000
## 1st Qu.: 2899 Class :character
## Median : 7026 Mode :character
## Mean :18066
## 3rd Qu.:37713
## Max. :75098
The dataset has 1,000 individuals, with ages ranging from 0 to 95 years and expenditures from $222 to $75,098. Most individuals are in their teens or early adulthood (the middle half of ages falls between 12 and 26), and both genders and multiple ethnic groups are represented.
#Keep only the variables needed (select() and filter() come from dplyr)
library(dplyr)
dds.data <- dds.dis |>
  select(expenditures, ethnicity, gender, age)
#Remove rows with NA
dds.data <- dds.data |>
filter(!is.na(expenditures) & !is.na(ethnicity) & !is.na(gender) & !is.na(age))
head(dds.data)
## expenditures ethnicity gender age
## 1 2113 White not Hispanic Female 17
## 2 41924 White not Hispanic Male 37
## 3 1454 Hispanic Male 3
## 4 6400 Hispanic Female 19
## 5 4412 White not Hispanic Male 13
## 6 4566 Hispanic Female 15
Summarize data by group
dds_summary <- dds.data |>
group_by(ethnicity) |>
summarize(mean_expenditures = mean(expenditures),
median_expenditures = median(expenditures),
count = n())
dds_summary
## # A tibble: 8 × 4
## ethnicity mean_expenditures median_expenditures count
## <chr> <dbl> <dbl> <int>
## 1 American Indian 36438. 41818. 4
## 2 Asian 18392. 9369 129
## 3 Black 20885. 8687 59
## 4 Hispanic 11066. 3952 376
## 5 Multi Race 4457. 2622 26
## 6 Native Hawaiian 42782. 40727 3
## 7 Other 3316. 3316. 2
## 8 White not Hispanic 24698. 15718 401
Expenditures vary widely across ethnic groups. For example, Hispanic individuals have mean expenditures of about $11,066, while White not Hispanic individuals average $24,698. Smaller groups such as Native Hawaiian and American Indian show very high mean expenditures, but these are based on only 3 and 4 individuals, respectively.
lm_model <- lm(expenditures ~ ethnicity + gender + age, data = dds.data)
summary(lm_model)
##
## Call:
## lm(formula = expenditures ~ ethnicity + gender + age, data = dds.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27095 -6605 -3147 3256 34735
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9478.02 5254.10 -1.804 0.0715 .
## ethnicityAsian 8083.29 5270.65 1.534 0.1254
## ethnicityBlack 9224.27 5360.50 1.721 0.0856 .
## ethnicityHispanic 5664.18 5230.54 1.083 0.2791
## ethnicityMulti Race 5193.56 5600.35 0.927 0.3540
## ethnicityNative Hawaiian 21547.55 7886.34 2.732 0.0064 **
## ethnicityOther -894.96 8962.60 -0.100 0.9205
## ethnicityWhite not Hispanic 10143.29 5207.49 1.948 0.0517 .
## genderMale -251.75 653.45 -0.385 0.7001
## age 863.46 18.53 46.607 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10320 on 990 degrees of freedom
## Multiple R-squared: 0.7238, Adjusted R-squared: 0.7213
## F-statistic: 288.3 on 9 and 990 DF, p-value: < 2.2e-16
To address the research question, I used a multiple linear regression model with annual state expenditures as the outcome variable and ethnicity, gender, and age as the predictors. This model allows us to examine whether expenditures differ across ethnic groups while controlling for age and gender. Ethnicity and gender are treated as categorical variables, and age is included as a numerical predictor. The regression provides coefficient estimates, p-values, and confidence intervals, which help determine whether demographic factors are significantly associated with differences in state spending.
# Calculating confidence interval
confint(lm_model)
## 2.5 % 97.5 %
## (Intercept) -19788.47228 832.4413
## ethnicityAsian -2259.64102 18426.2188
## ethnicityBlack -1294.98630 19743.5257
## ethnicityHispanic -4600.04865 15928.4021
## ethnicityMulti Race -5796.36606 16183.4827
## ethnicityNative Hawaiian 6071.68370 37023.4228
## ethnicityOther -18482.82986 16692.9142
## ethnicityWhite not Hispanic -75.69204 20362.2679
## genderMale -1534.04585 1030.5504
## age 827.10398 899.8144
The regression results show that age is a significant predictor of state expenditures (p < 2e-16), with each additional year associated with an increase of about $863 (95% CI: $827 to $900), holding ethnicity and gender constant. We highlight age because it is measured for every individual in the dataset and its estimate is very precise. Most ethnicity categories do not show statistically significant differences compared to the reference group (American Indian, the category omitted from the coefficient table), except for Native Hawaiians, who have substantially higher expenditures (about $21,548 more, p = 0.0064), although this estimate is based on only 3 individuals. Gender is not significant (p = 0.70), indicating similar expenditures for males and females.
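Every ethnicity coefficient is a comparison against the reference level, which R picks as the alphabetically first category unless told otherwise. The sketch below shows how `relevel()` could be used to choose a different baseline. The `toy` data frame is made up for illustration, and picking "White not Hispanic" as the baseline is an assumption, not something the analysis above did.

```r
# Made-up toy data for illustration only; not the DDS dataset
toy <- data.frame(
  expenditures = c(4000, 5200, 3100, 2900, 6100, 5800, 3500, 6400, 4400),
  ethnicity    = rep(c("Asian", "Hispanic", "White not Hispanic"), times = 3),
  age          = c(10, 14, 8, 9, 20, 18, 12, 22, 15)
)

# By default the alphabetically first level ("Asian" here) is the reference
toy$ethnicity <- factor(toy$ethnicity)
fit_default <- lm(expenditures ~ ethnicity + age, data = toy)

# Make "White not Hispanic" the reference group instead
toy$ethnicity <- relevel(toy$ethnicity, ref = "White not Hispanic")
fit_releveled <- lm(expenditures ~ ethnicity + age, data = toy)

# The coefficient names show which level each model leaves out
print(names(coef(fit_default)))
print(names(coef(fit_releveled)))
```

Changing the reference level does not change the fitted values, only which comparisons the coefficient table reports. A large, well-populated group as baseline usually gives more precise comparisons than a very small one.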
Before trusting the regression results, we need to check several assumptions of the model. These include linearity, independence of observations, homoscedasticity, normality of residuals, and low multicollinearity between predictors. Checking these assumptions helps confirm that the model is appropriate and that the results are reliable.
Diagnostic Plots
To check linearity, homoscedasticity, normality, and influential points, I will generate the standard regression diagnostic plots.
par(mfrow = c(2, 2))
plot(lm_model)
par(mfrow = c(1, 1))
The diagnostic plots suggest that the model meets the assumptions reasonably well. The Residuals vs Fitted plot does not show a marked curved pattern, which suggests that linearity is acceptable. The Q-Q plot shows that most of the points follow the diagonal line, indicating that the residuals are approximately normal. The Scale-Location plot shows some uneven spread of residuals; however, the pattern is not severe enough to invalidate the model. The Residuals vs Leverage plot shows a few potentially influential points, but nothing extreme.
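Influential points can also be listed numerically rather than read off the plot. The sketch below uses Cook's distance from base R on simulated toy data with one deliberately planted unusual observation. The data are made up, and the 4/n cutoff is a common rule of thumb rather than a formal test.

```r
# Simulated toy data (made up for illustration; not the DDS dataset)
set.seed(1)
toy <- data.frame(age = 1:30)
toy$expenditures <- 500 * toy$age + rnorm(30, sd = 1000)
toy$expenditures[30] <- 60000  # plant one unusual observation

fit <- lm(expenditures ~ age, data = toy)

# Cook's distance measures how much each observation shifts the fitted model
cd <- cooks.distance(fit)

# Flag observations above the 4/n rule-of-thumb cutoff
flagged <- which(cd > 4 / nrow(toy))
print(flagged)
```

Applied to the regression model in this report, the same two calls would return the row numbers worth inspecting before deciding whether any point deserves special treatment.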
Multicollinearity
We also need to make sure our predictors (age, gender, ethnicity) are not highly correlated, which could make coefficient estimates unstable. In this case, age is the only numeric predictor, and gender and ethnicity are categorical variables with no built-in dependence on each other, so multicollinearity is unlikely to be an issue in this model.
Independence of observations
Because the dataset consists of separate individuals without a time or sequence component, we assume that independence of observations is reasonably met.
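One way to check multicollinearity numerically is with variance inflation factors (VIFs). The sketch below computes classical VIFs in base R as the diagonal of the inverse correlation matrix of the design-matrix columns. The toy data are simulated and are not the DDS dataset. For dummy-coded factors these per-column values depend on the reference level, so a generalized VIF per term (as reported by `car::vif()`, if the car package is installed) is often preferred.

```r
# Simulated toy data (made up for illustration; not the DDS dataset)
set.seed(42)
n <- 200
toy <- data.frame(
  age       = sample(0:90, n, replace = TRUE),
  gender    = sample(c("Female", "Male"), n, replace = TRUE),
  ethnicity = sample(c("Asian", "Hispanic", "White not Hispanic"), n, replace = TRUE)
)
toy$expenditures <- 800 * toy$age + rnorm(n, sd = 5000)

fit <- lm(expenditures ~ ethnicity + gender + age, data = toy)

# Design matrix without the intercept column
X <- model.matrix(fit)[, -1]

# Classical VIFs: diagonal of the inverse correlation matrix of the predictors
vif_values <- diag(solve(cor(X)))
print(round(vif_values, 2))
```

Values close to 1 indicate little overlap among predictors. Values above roughly 5 to 10 are commonly treated as a warning sign.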
Normality of Residuals
We check whether the residuals follow an approximately normal distribution.
hist(resid(lm_model), main = "Histogram of Residuals", xlab = "Residuals")
The histogram of residuals shows that most of the residuals are
clustered around zero, which is expected in a regression model. The
distribution is slightly skewed to the right, meaning there are a few
larger positive residuals, but the skew does not severely affect the
normality assumption.
Conclusion
The goal of this analysis was to examine whether state expenditures differ by ethnicity after accounting for age and gender. The multiple linear regression model showed that age is the strongest and most consistent predictor, with expenditures increasing about $863 for each additional year of age. Among ethnic groups, only individuals identified as Native Hawaiian had significantly higher expenditures than the baseline group, and that estimate rests on 3 individuals. Gender was not a statistically significant predictor. These results suggest that age-related needs are the primary driver of expenditure differences, while demographic categories like gender and ethnicity contribute less.
The model demonstrated a relatively strong fit, with an R-squared of about 0.72, meaning it explains roughly 72% of the variation in expenditures. This indicates that the model captures much of the underlying pattern in the data. There are some limitations, however. The residual diagnostics showed mild deviations from homoscedasticity and normality, and the presence of influential points may affect the stability of certain coefficients. Additionally, because ethnicity is a categorical predictor with several groups, some of which contain only a handful of individuals, some comparisons were imprecise, as the wide confidence intervals show.
Future analysis could improve the model in several ways. One option is to test whether the effect of age differs across ethnic groups or genders by adding interaction terms such as age * ethnicity or age * gender. This could reveal patterns that a main-effects model cannot detect. Collecting more detailed data or using a larger sample could also strengthen the accuracy of the results.
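As a rough sketch of the interaction idea, the code below fits a main-effects model and a model with an age-by-ethnicity interaction, then compares them with an F-test. The data are simulated, and the differing age slopes are built into the toy data on purpose, so this says nothing about what the real DDS data would show.

```r
# Simulated toy data (made up for illustration; not the DDS dataset)
set.seed(7)
n <- 120
toy <- data.frame(
  age       = sample(0:60, n, replace = TRUE),
  ethnicity = factor(rep(c("Hispanic", "White not Hispanic"), each = n / 2))
)

# Assumption built into this toy example: the age slope differs by group
slope <- ifelse(toy$ethnicity == "Hispanic", 500, 900)
toy$expenditures <- slope * toy$age + rnorm(n, sd = 2000)

# Main-effects model versus a model that adds the ethnicity:age interaction
fit_main <- lm(expenditures ~ ethnicity + age, data = toy)
fit_int  <- lm(expenditures ~ ethnicity * age, data = toy)

# F-test comparing the two nested models
comparison <- anova(fit_main, fit_int)
print(comparison)
```

In an R formula, `ethnicity * age` expands to both main effects plus the `ethnicity:age` interaction term. A small p-value in the comparison suggests the age slope is not the same in every group.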