library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(tibble)
setwd("C:/Users/ogutu/Desktop/DATA 101 Project 2")
devdisability <- read.csv("dds.discr.csv")
Is there a difference in average expenditure for older and younger individuals?
The dataset discrimination in developmental disability support is a research that was done in the state of California in the United States which had a total of 1000 observations with a total of 6 variables. They were attempting to find out if expenditures were being equally distributed.
The dataset has a total of 1000 observations and 6 variables. The variables include: id, age.cohort, age, gender, expenditures, and ethnicity. The variables I chose to use for my data analysis are age and expenditure.
For my data analysis, I will be using mutate, summary, and group_by functions for most of my project. In addition, I have also used the if…else function to do my data analysis.
H0: Older individuals with developmental disabilities have the same average expenditure as younger individuals.
H1: Older individuals with developmental disabilities have a higher expenditure compared to younger individuals.
str(devdisability)
## 'data.frame': 1000 obs. of 6 variables:
## $ id : int 10210 10409 10486 10538 10568 10690 10711 10778 10820 10823 ...
## $ age.cohort : chr "13-17" "22-50" "0-5" "18-21" ...
## $ age : int 17 37 3 19 13 15 13 17 14 13 ...
## $ gender : chr "Female" "Male" "Male" "Female" ...
## $ expenditures: int 2113 41924 1454 6400 4412 4566 3915 3873 5021 2887 ...
## $ ethnicity : chr "White not Hispanic" "White not Hispanic" "Hispanic" "Hispanic" ...
head(devdisability)
## id age.cohort age gender expenditures ethnicity
## 1 10210 13-17 17 Female 2113 White not Hispanic
## 2 10409 22-50 37 Male 41924 White not Hispanic
## 3 10486 0-5 3 Male 1454 Hispanic
## 4 10538 18-21 19 Female 6400 Hispanic
## 5 10568 13-17 13 Male 4412 White not Hispanic
## 6 10690 13-17 15 Female 4566 Hispanic
tail(devdisability)
## id age.cohort age gender expenditures ethnicity
## 995 99529 0-5 2 Male 2258 Multi Race
## 996 99622 51+ 86 Female 57055 White not Hispanic
## 997 99715 18-21 20 Male 7494 Hispanic
## 998 99718 13-17 17 Female 3673 Multi Race
## 999 99791 6-12 10 Male 3638 Hispanic
## 1000 99898 22-50 23 Male 26702 White not Hispanic
colSums(is.na(devdisability))
## id age.cohort age gender expenditures ethnicity
## 0 0 0 0 0 0
devdisability <- devdisability |>
mutate(Individual_type = ifelse(age >= 60, "Older individual", "Younger individual"))
average_expenditures <- devdisability |>
group_by(Individual_type) |>
summarize(Averages = mean(expenditures))
average_expenditures
## # A tibble: 2 × 2
## Individual_type Averages
## <chr> <dbl>
## 1 Older individual 54810.
## 2 Younger individual 14608.
t_test_results <- t.test(expenditures ~ Individual_type, data = devdisability, conf.level = 0.99, alternative = "greater")
t_test_results
##
## Welch Two Sample t-test
##
## data: expenditures by Individual_type
## t = 48.255, df = 251.91, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Older individual and group Younger individual is greater than 0
## 99 percent confidence interval:
## 38251.01 Inf
## sample estimates:
## mean in group Older individual mean in group Younger individual
## 54809.99 14608.45
Upon using the t-test to do hypothesis testing, it shows that I can reject the null hypothesis. This is because using a 99% confidence interval, the p-value is less than 2.2e-16 and the difference in the mean is greater than 0. This shows there is enough evidence to show that older individuals have a higher average expenditure in developmental disability support as compared to younger individuals.
summary_stats <- devdisability |>
mutate(Individual_type = ifelse(age >= 60, "Older individual", "Younger individual")) |>
group_by(Individual_type) |>
summarise(
Mean_Expenditures = mean(expenditures, na.rm = TRUE)
)
summary_stats
## # A tibble: 2 × 2
## Individual_type Mean_Expenditures
## <chr> <dbl>
## 1 Older individual 54810.
## 2 Younger individual 14608.
ggplot(summary_stats, aes(x = Individual_type, y = Mean_Expenditures )) +
geom_bar(stat = "identity", fill = "darkmagenta") +
labs(title = "Mean Expenditures by Individual Type", x = "Individual Type", y = "Expenditures") +
theme_minimal()
The bar graph above shows a huge difference in the amount of money being spent on older individuals with disability support being over 50,000 dollars as compared to younger individuals who have an expenditure of approximately 15,000 dollars.
In summary, it is safe to conclude that there may be some form of discrimination in the state of California with relation to age which can raise awareness to ensure that there is an equal distribution of expenditures across all age groups.
For future research, it would be worth doing a correlation to see if there is a correlation between age and race/ethnicity with regards to expenditures and also narrowing my research by creating smaller age margins to know the age group that is really suffering from expenditure inequality in the state of California.