Import dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(tibble)

setwd("C:/Users/ogutu/Desktop/DATA 101 Project 2")

devdisability <- read.csv("dds.discr.csv")

Is there a difference in average expenditure for older and younger individuals?

Introduction

The dataset discrimination in developmental disability support is a research that was done in the state of California in the United States which had a total of 1000 observations with a total of 6 variables. They were attempting to find out if expenditures were being equally distributed.

The dataset has a total of 1000 observations and 6 variables. The variables include: id, age.cohort, age, gender, expenditures, and ethnicity. The variables I chose to use for my data analysis are age and expenditure.

Data Analysis

For my data analysis, I will be using mutate, summary, and group_by functions for most of my project. In addition, I have also used the if…else function to do my data analysis.

Statistical Analysis

H0: Older individuals with developmental disabilities have the same average expenditure as younger individuals.

H1: Older individuals with developmental disabilities have a higher expenditure compared to younger individuals.

str(devdisability)
## 'data.frame':    1000 obs. of  6 variables:
##  $ id          : int  10210 10409 10486 10538 10568 10690 10711 10778 10820 10823 ...
##  $ age.cohort  : chr  "13-17" "22-50" "0-5" "18-21" ...
##  $ age         : int  17 37 3 19 13 15 13 17 14 13 ...
##  $ gender      : chr  "Female" "Male" "Male" "Female" ...
##  $ expenditures: int  2113 41924 1454 6400 4412 4566 3915 3873 5021 2887 ...
##  $ ethnicity   : chr  "White not Hispanic" "White not Hispanic" "Hispanic" "Hispanic" ...
head(devdisability)
##      id age.cohort age gender expenditures          ethnicity
## 1 10210      13-17  17 Female         2113 White not Hispanic
## 2 10409      22-50  37   Male        41924 White not Hispanic
## 3 10486        0-5   3   Male         1454           Hispanic
## 4 10538      18-21  19 Female         6400           Hispanic
## 5 10568      13-17  13   Male         4412 White not Hispanic
## 6 10690      13-17  15 Female         4566           Hispanic
tail(devdisability)
##         id age.cohort age gender expenditures          ethnicity
## 995  99529        0-5   2   Male         2258         Multi Race
## 996  99622        51+  86 Female        57055 White not Hispanic
## 997  99715      18-21  20   Male         7494           Hispanic
## 998  99718      13-17  17 Female         3673         Multi Race
## 999  99791       6-12  10   Male         3638           Hispanic
## 1000 99898      22-50  23   Male        26702 White not Hispanic
colSums(is.na(devdisability))
##           id   age.cohort          age       gender expenditures    ethnicity 
##            0            0            0            0            0            0
devdisability <- devdisability |>
  mutate(Individual_type = ifelse(age >= 60, "Older individual", "Younger individual"))

average_expenditures <- devdisability |>
  group_by(Individual_type) |>
  summarize(Averages = mean(expenditures))

average_expenditures
## # A tibble: 2 × 2
##   Individual_type    Averages
##   <chr>                 <dbl>
## 1 Older individual     54810.
## 2 Younger individual   14608.
t_test_results <- t.test(expenditures ~ Individual_type, data = devdisability, conf.level = 0.99, alternative = "greater")

t_test_results
## 
##  Welch Two Sample t-test
## 
## data:  expenditures by Individual_type
## t = 48.255, df = 251.91, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Older individual and group Younger individual is greater than 0
## 99 percent confidence interval:
##  38251.01      Inf
## sample estimates:
##   mean in group Older individual mean in group Younger individual 
##                         54809.99                         14608.45

Upon using the t-test to do hypothesis testing, it shows that I can reject the null hypothesis. This is because using a 99% confidence interval, the p-value is less than 2.2e-16 and the difference in the mean is greater than 0. This shows there is enough evidence to show that older individuals have a higher average expenditure in developmental disability support as compared to younger individuals.

summary_stats <- devdisability |>
  mutate(Individual_type = ifelse(age >= 60, "Older individual", "Younger individual")) |>
    group_by(Individual_type) |>
      summarise(
        Mean_Expenditures = mean(expenditures, na.rm = TRUE)
  )

summary_stats
## # A tibble: 2 × 2
##   Individual_type    Mean_Expenditures
##   <chr>                          <dbl>
## 1 Older individual              54810.
## 2 Younger individual            14608.
ggplot(summary_stats, aes(x = Individual_type, y = Mean_Expenditures )) +
  geom_bar(stat = "identity", fill = "darkmagenta") +
  labs(title = "Mean Expenditures by Individual Type", x = "Individual Type", y = "Expenditures") +
  theme_minimal()

The bar graph above shows a huge difference in the amount of money being spent on older individuals with disability support being over 50,000 dollars as compared to younger individuals who have an expenditure of approximately 15,000 dollars.

Conclusion

In summary, it is safe to conclude that there may be some form of discrimination in the state of California with relation to age which can raise awareness to ensure that there is an equal distribution of expenditures across all age groups.

For future research, it would be worth doing a correlation to see if there is a correlation between age and race/ethnicity with regards to expenditures and also narrowing my research by creating smaller age margins to know the age group that is really suffering from expenditure inequality in the state of California.