Project 3

# Load dplyr for select(), filter(), group_by(), and summarize() used below
library(dplyr)

setwd("C:/Users/leyla/Documents/DATA 101")
dds.dis <- read.csv("dds.discr.csv")
head(dds.dis)

##      id age.cohort age gender expenditures          ethnicity
## 1 10210      13-17  17 Female         2113 White not Hispanic
## 2 10409      22-50  37   Male        41924 White not Hispanic
## 3 10486        0-5   3   Male         1454           Hispanic
## 4 10538      18-21  19 Female         6400           Hispanic
## 5 10568      13-17  13   Male         4412 White not Hispanic
## 6 10690      13-17  15 Female         4566           Hispanic

Introduction

The research question I stated is: Does the amount of money the state spends on developmental disability services differ between ethnic groups, after accounting for age and gender? This project examines patterns in state spending using the Discrimination in Developmental Disability Support dataset, which contains information on 1,000 individuals receiving developmental disability services through the California Department of Developmental Services (DDS). The dataset includes six variables, but this analysis focuses on the four most relevant to my research question: expenditures (a continuous measure of annual state spending on each individual), ethnicity (a categorical variable representing the individual’s ethnic group), gender (a categorical variable, female or male), and age (a quantitative variable measured in years). Together, these variables allow us to examine whether demographic differences are associated with variations in state-funded support for disability services within California.

Data Analysis

Before running the regression model, I first clean and explore the dataset to understand the variables and prepare the data for analysis. I check for missing values, remove any rows with incomplete data, and select only the relevant variables. Then I summarize the data to understand the distribution of expenditures and how they vary across demographic groups. These steps ensure that the dataset is clean, organized, and ready for multiple regression analysis.
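As a minimal sketch of the missing-value check described above, the snippet below counts NA values per column and then keeps only complete rows. The data frame `toy` is hypothetical stand-in data, not the DDS file; the same calls should apply to dds.dis.

```r
# Hypothetical stand-in data (not the DDS file), with two missing values
toy <- data.frame(
  expenditures = c(2113, NA, 1454, 6400),
  ethnicity    = c("Hispanic", "Asian", NA, "Black"),
  age          = c(17, 37, 3, 19)
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only rows with no missing values in any column
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```

Reporting the per-column counts before filtering makes it clear how many rows, if any, the cleaning step actually removes.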

#Check the structure of the dataset
str(dds.dis)

## 'data.frame':    1000 obs. of  6 variables:
##  $ id          : int  10210 10409 10486 10538 10568 10690 10711 10778 10820 10823 ...
##  $ age.cohort  : chr  "13-17" "22-50" "0-5" "18-21" ...
##  $ age         : int  17 37 3 19 13 15 13 17 14 13 ...
##  $ gender      : chr  "Female" "Male" "Male" "Female" ...
##  $ expenditures: int  2113 41924 1454 6400 4412 4566 3915 3873 5021 2887 ...
##  $ ethnicity   : chr  "White not Hispanic" "White not Hispanic" "Hispanic" "Hispanic" ...

#Summary of all variables
summary(dds.dis)

##        id         age.cohort             age          gender         
##  Min.   :10210   Length:1000        Min.   : 0.0   Length:1000       
##  1st Qu.:31809   Class :character   1st Qu.:12.0   Class :character  
##  Median :55385   Mode  :character   Median :18.0   Mode  :character  
##  Mean   :54663                      Mean   :22.8                     
##  3rd Qu.:76135                      3rd Qu.:26.0                     
##  Max.   :99898                      Max.   :95.0                     
##   expenditures    ethnicity        
##  Min.   :  222   Length:1000       
##  1st Qu.: 2899   Class :character  
##  Median : 7026   Mode  :character  
##  Mean   :18066                     
##  3rd Qu.:37713                     
##  Max.   :75098

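The dataset has 1,000 individuals, with ages ranging from 0 to 95 years and expenditures from $222 to $75,098. Most participants are in their teens or early adulthood: half are between 12 and 26 years old, with a median age of 18. Expenditures are right-skewed, with a mean ($18,066) well above the median ($7,026). Both genders and multiple ethnic groups are represented.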

#Keep only variables needed
dds.data <- dds.dis |>
  select(expenditures, ethnicity, gender, age)

#Remove rows with NA
dds.data <- dds.data |>
  filter(!is.na(expenditures) & !is.na(ethnicity) & !is.na(gender) & !is.na(age))

head(dds.data)

##   expenditures          ethnicity gender age
## 1         2113 White not Hispanic Female  17
## 2        41924 White not Hispanic   Male  37
## 3         1454           Hispanic   Male   3
## 4         6400           Hispanic Female  19
## 5         4412 White not Hispanic   Male  13
## 6         4566           Hispanic Female  15

Summarize data by group

dds_summary <- dds.data |>
  group_by(ethnicity) |>
  summarize(mean_expenditures = mean(expenditures),
            median_expenditures = median(expenditures),
            count = n())
dds_summary

## # A tibble: 8 × 4
##   ethnicity          mean_expenditures median_expenditures count
##   <chr>                          <dbl>               <dbl> <int>
## 1 American Indian               36438.              41818.     4
## 2 Asian                         18392.               9369    129
## 3 Black                         20885.               8687     59
## 4 Hispanic                      11066.               3952    376
## 5 Multi Race                     4457.               2622     26
## 6 Native Hawaiian               42782.              40727      3
## 7 Other                          3316.               3316.     2
## 8 White not Hispanic            24698.              15718    401

Expenditures vary widely across ethnic groups. For example, Hispanic individuals have a mean expenditure of about $11,066, while White not Hispanic individuals average $24,698. Smaller groups such as Native Hawaiian and American Indian show very high mean expenditures, but these are based on very few individuals (3 and 4, respectively).

Regression Analysis

lm_model <- lm(expenditures ~ ethnicity + gender + age, data = dds.data)
summary(lm_model)

## 
## Call:
## lm(formula = expenditures ~ ethnicity + gender + age, data = dds.data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27095  -6605  -3147   3256  34735 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -9478.02    5254.10  -1.804   0.0715 .  
## ethnicityAsian               8083.29    5270.65   1.534   0.1254    
## ethnicityBlack               9224.27    5360.50   1.721   0.0856 .  
## ethnicityHispanic            5664.18    5230.54   1.083   0.2791    
## ethnicityMulti Race          5193.56    5600.35   0.927   0.3540    
## ethnicityNative Hawaiian    21547.55    7886.34   2.732   0.0064 ** 
## ethnicityOther               -894.96    8962.60  -0.100   0.9205    
## ethnicityWhite not Hispanic 10143.29    5207.49   1.948   0.0517 .  
## genderMale                   -251.75     653.45  -0.385   0.7001    
## age                           863.46      18.53  46.607   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10320 on 990 degrees of freedom
## Multiple R-squared:  0.7238, Adjusted R-squared:  0.7213 
## F-statistic: 288.3 on 9 and 990 DF,  p-value: < 2.2e-16

To address the research question, I used a multiple linear regression model with annual state expenditures as the outcome variable and ethnicity, gender, and age as the predictors. This model allows us to examine whether expenditures differ across ethnic groups while controlling for age and gender. Ethnicity and gender are treated as categorical variables, and age is included as a numerical predictor. The regression provides coefficient estimates, p-values, and confidence intervals, which help determine whether demographic factors are significantly associated with differences in state spending.
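When a categorical predictor is passed to lm(), R converts it to a factor and uses the alphabetically first level as the reference group, which is why no American Indian coefficient appears in the output above. The sketch below illustrates this and shows how relevel() can set a different baseline. The data frame `toy` and its values are hypothetical, chosen only to make the example runnable.

```r
# Hypothetical stand-in data (not the DDS file)
toy <- data.frame(
  expenditures = c(40000, 9000, 8500, 4000, 15000, 16000, 3900, 9500),
  ethnicity    = c("American Indian", "Asian", "Black", "Hispanic",
                   "White not Hispanic", "White not Hispanic",
                   "Hispanic", "Asian"),
  age          = c(45, 20, 18, 10, 30, 33, 9, 22)
)

# Convert to a factor; levels are sorted alphabetically by default,
# so the first level becomes the reference group in lm()
toy$ethnicity <- factor(toy$ethnicity)
default_ref <- levels(toy$ethnicity)[1]
print(default_ref)

# Choose a different reference group explicitly
toy$ethnicity <- relevel(toy$ethnicity, ref = "White not Hispanic")
new_ref <- levels(toy$ethnicity)[1]
print(new_ref)

# The coefficient names now omit the new reference level
fit <- lm(expenditures ~ ethnicity + age, data = toy)
print(names(coef(fit)))
```

Using a large group as the baseline is one possible design choice: it tends to give more precise comparisons than a baseline with only a handful of individuals.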

# Calculating confidence interval
confint(lm_model)

##                                    2.5 %     97.5 %
## (Intercept)                 -19788.47228   832.4413
## ethnicityAsian               -2259.64102 18426.2188
## ethnicityBlack               -1294.98630 19743.5257
## ethnicityHispanic            -4600.04865 15928.4021
## ethnicityMulti Race          -5796.36606 16183.4827
## ethnicityNative Hawaiian      6071.68370 37023.4228
## ethnicityOther              -18482.82986 16692.9142
## ethnicityWhite not Hispanic    -75.69204 20362.2679
## genderMale                   -1534.04585  1030.5504
## age                            827.10398   899.8144
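To show where these intervals come from, the following sketch rebuilds the 95% interval for age by hand as estimate ± t × standard error, using the rounded values printed in the regression output. The variable names are illustrative only.

```r
# Values copied from the regression output for the age coefficient
estimate  <- 863.46
std_error <- 18.53
df_resid  <- 990   # residual degrees of freedom

# Critical value of the t distribution for a 95% interval
t_crit <- qt(0.975, df = df_resid)

lower <- estimate - t_crit * std_error
upper <- estimate + t_crit * std_error
print(round(c(lower = lower, upper = upper), 1))
```

The result should land very close to the confint() row for age; any small difference comes from using rounded inputs.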

The regression results show that age is a significant predictor of state expenditures, with each additional year associated with an increase of about $863 (95% CI: $827 to $900), holding ethnicity and gender constant. We highlight age because it applies to all individuals in the dataset, making it a consistent predictor across the population. The ethnicity coefficients are differences from the reference group, which is American Indian because R uses the alphabetically first level as the baseline. That group contains only 4 individuals, so these comparisons have large standard errors. Most ethnicity categories do not show statistically significant differences from the reference group. The exception is Native Hawaiian, with substantially higher estimated expenditures (about $21,548 more, p = 0.0064), although this estimate is based on only 3 individuals. Gender is not significant (p = 0.70), indicating similar expenditures for males and females.
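The individual ethnicity coefficients each compare one group with the reference group, so they do not directly answer whether ethnicity matters as a whole. One common approach, sketched here on simulated data, is to compare the model with and without ethnicity using a nested-model F-test via anova(). The data frame `toy`, its group labels, and its coefficients are assumptions made for illustration, not results from the DDS data.

```r
# Simulated stand-in data (not the DDS file); expenditures depend on age only
set.seed(1)
n <- 200
toy <- data.frame(
  ethnicity = sample(c("A", "B", "C"), n, replace = TRUE),
  gender    = sample(c("Female", "Male"), n, replace = TRUE),
  age       = round(runif(n, 0, 90))
)
toy$expenditures <- 1000 + 800 * toy$age + rnorm(n, sd = 5000)

# Reduced model omits ethnicity; full model includes it
reduced <- lm(expenditures ~ gender + age, data = toy)
full    <- lm(expenditures ~ ethnicity + gender + age, data = toy)

# F-test of whether adding ethnicity improves the fit
f_test <- anova(reduced, full)
print(f_test)

# p-value for the joint test of all ethnicity coefficients
p_value <- f_test[["Pr(>F)"]][2]
```

Applied to the real data, the two models would be lm(expenditures ~ gender + age, data = dds.data) and lm_model.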

Model Assumptions and Diagnostics

Before trusting our regression results, we need to check several assumptions of this model. These include linearity, independence of observations, homoscedasticity, normality of residuals, and low multicollinearity between predictors. Checking these assumptions helps confirm that the model is appropriate and that the results are reliable.

Diagnostic Plots

To check linearity, homoscedasticity, normality, and influential points, I generate the standard regression diagnostic plots.

par(mfrow = c(2, 2))
plot(lm_model)

par(mfrow =  c(1, 1))

The diagnostic plots suggest that the model meets the assumptions reasonably well. The Residuals vs Fitted plot does not show a marked curved pattern, which suggests that linearity is acceptable. The Q-Q plot shows that most of the points follow the diagonal line, indicating that the residuals are approximately normal. The Scale-Location plot shows some uneven spread of residuals; however, the pattern is not severe enough to invalidate the model. The Residuals vs Leverage plot shows a few potentially influential points, but nothing extreme.
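A formal test can complement the visual reading of the Scale-Location plot. The sketch below computes a studentized Breusch-Pagan statistic by hand in base R: regress the squared residuals on the predictors and compare n × R² with a chi-square distribution. It uses simulated data in which the error spread grows with x, so the variable names and numbers are assumptions for illustration. The lmtest package's bptest() offers a packaged version of this test.

```r
# Simulated data whose error spread increases with x (heteroscedastic)
set.seed(42)
n <- 500
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 + x)
fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the predictor
aux <- lm(resid(fit)^2 ~ x)

# Test statistic: n * R-squared, chi-square with df = number of predictors
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(statistic = bp_stat, p_value = bp_p))
```

A small p-value points to non-constant variance. For a model with several predictors, the auxiliary regression would include all of them and the degrees of freedom would equal the number of estimated slopes.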

Multicollinearity

We also need to make sure our predictors (age, gender, ethnicity) are not highly correlated, which could make coefficient estimates unstable. In this case, age is the only numeric predictor, and gender and ethnicity are categorical variables that are not expected to be strongly related to each other, so multicollinearity is unlikely to be a serious issue in this model.
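This expectation can be checked rather than assumed. As a rough sketch in base R, the variance inflation factor (VIF) for age is 1 / (1 − R²), where R² comes from regressing age on the other predictors, and a chi-square test on the gender-by-ethnicity table indicates whether the two categorical predictors are associated. The data here are simulated with independent predictors and are not the DDS data. The car package's vif() reports generalized VIFs for factor terms directly.

```r
# Simulated stand-in data with independent predictors (not the DDS file)
set.seed(7)
n <- 300
toy <- data.frame(
  ethnicity = sample(c("Asian", "Black", "Hispanic"), n, replace = TRUE),
  gender    = sample(c("Female", "Male"), n, replace = TRUE),
  age       = round(runif(n, 0, 90))
)

# VIF for age: how well do the other predictors explain age?
aux     <- lm(age ~ ethnicity + gender, data = toy)
r2_age  <- summary(aux)$r.squared
vif_age <- 1 / (1 - r2_age)
print(vif_age)

# Association between the two categorical predictors
chi <- chisq.test(table(toy$gender, toy$ethnicity))
print(chi$p.value)
```

A VIF close to 1 means the other predictors explain almost none of the variation in age; values above roughly 5 to 10 are commonly treated as a warning sign.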

Independence of observations

Because the dataset consists of independent individuals without a time or sequence component, we assume that independence of observations is reasonably met.

Normality of Residuals

We check whether the residuals follow an approximately normal distribution.

hist(resid(lm_model), main = "Histogram of Residuals", xlab = "Residuals")

The histogram of residuals shows that most of the residuals are clustered around zero, which is expected in a regression model. The distribution is slightly skewed to the right, meaning there are a few larger positive residuals, but this does not severely violate the normality assumption.

Conclusion

The goal of this analysis was to examine whether state expenditures differ by ethnicity after accounting for age and gender. The multiple linear regression model showed that age is the strongest and most consistent predictor, with expenditures increasing by about $863 for each additional year of age. One ethnic group also showed a statistically significant difference: individuals identified as Native Hawaiian had significantly higher expenditures compared to the baseline group, although this group includes only 3 individuals. Gender was not a statistically significant predictor. These results suggest that age-related needs are the primary driver of expenditure differences, while demographic categories like gender and ethnicity contribute less.

The model demonstrated a relatively strong fit, with an R-squared of about 0.72, meaning it explains roughly 72% of the variation in expenditures. This indicates that the model captures much of the underlying pattern in the data. There are some limitations, however. The residual diagnostics showed mild deviations from homoscedasticity and normality, and the presence of influential points may affect the stability of certain coefficients. Additionally, several ethnic groups contain very few individuals (American Indian, Native Hawaiian, and Other have 4, 3, and 2, respectively), so comparisons involving these groups, including the reference group, are imprecise.
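As a quick check on the fit statistics, the adjusted R-squared can be recomputed from the multiple R-squared, the sample size, and the number of estimated slopes, all read from the regression output. The variable names below are illustrative.

```r
# Values read from the regression summary
r_squared <- 0.7238
n_obs     <- 1000
n_slopes  <- 9     # coefficients excluding the intercept

# Adjusted R-squared penalizes the number of estimated slopes
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_slopes - 1)
print(round(adj_r_squared, 4))
```

The result agrees with the adjusted R-squared of 0.7213 reported in the output. The small gap between the two statistics reflects how few parameters the model has relative to 1,000 observations.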

Future analysis could improve the model in several ways. One option is to test whether the effect of age differs across ethnic groups or between genders by adding interaction terms such as age * ethnicity or age * gender. This could reveal patterns that a main-effects model cannot detect. Collecting more detailed data or using a larger sample, especially for the smallest ethnic groups, could also strengthen the accuracy of the results.
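A minimal sketch of the interaction idea, on simulated data, is shown below. In an R formula, `ethnicity * age` expands to both main effects plus the `ethnicity:age` interaction, and anova() compares the main-effects and interaction models. The data frame `toy`, the two group labels, and the slopes of 700 and 900 are assumptions for illustration, not estimates from the DDS data.

```r
# Simulated stand-in data in which the age slope differs by group
set.seed(3)
n <- 300
toy <- data.frame(
  ethnicity = sample(c("Hispanic", "White not Hispanic"), n, replace = TRUE),
  age       = round(runif(n, 0, 90))
)
slope <- ifelse(toy$ethnicity == "Hispanic", 700, 900)
toy$expenditures <- 1000 + slope * toy$age + rnorm(n, sd = 5000)

# Main-effects model versus model with an age-by-ethnicity interaction
main_fit <- lm(expenditures ~ ethnicity + age, data = toy)
int_fit  <- lm(expenditures ~ ethnicity * age, data = toy)

# The interaction coefficient estimates the difference in age slopes
print(coef(int_fit))

# F-test of whether the interaction improves the fit
cmp <- anova(main_fit, int_fit)
print(cmp)
```

With the real data, very small groups would have poorly estimated slopes, so it may be sensible to combine or exclude them before fitting an interaction model.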

Leyla C

2025-11-25
