I created a linear regression model that analyzes multiple variables on the cost of childcare per state. The regression is a potential step in creating an estimate of how much individuals in each state could expect to spend on childcare given known or estimated shifts in each variable. It could also be used to help determine where change would have the most impact on childcare costs.
The independent variables included:
The dependent variable was calculated by averaging the estimated annual cost of childcare in each state in 2021 for three categories: infants, toddlers, and four-year-olds. The data was provided by Child Aware.
The regression was built by first removing all lines of data with omitted information.
#set directory
setwd("C:/Users/hrall/OneDrive/Documents/R")
#read data
df <- read.csv("Childcare.csv",header=TRUE, stringsAsFactors = T)
summary(df)
## State IncomePerCapita IncomeHousehold Infant
## Alabama : 1 Min. :45887 Min. : 50290 Min. : 7280
## Alaska : 1 1st Qu.:55962 1st Qu.: 67765 1st Qu.:10593
## Arizona : 1 Median :60007 Median : 76665 Median :12024
## Arkansas : 1 Mean :62063 Mean : 77125 Mean :13085
## California: 1 3rd Qu.:66201 3rd Qu.: 86880 3rd Qu.:15444
## Colorado : 1 Max. :96659 Max. :105000 Max. :25523
## (Other) :46 NA's :1
## Toddler FourYearOld AverageCost NetWorth
## Min. : 6830 Min. : 6014 Min. : 6758 Min. : 72550
## 1st Qu.: 9302 1st Qu.: 8552 1st Qu.: 9554 1st Qu.:119000
## Median :11324 Median : 9880 Median :10780 Median :162700
## Mean :11868 Mean :10245 Mean :11756 Mean :200092
## 3rd Qu.:14188 3rd Qu.:11464 3rd Qu.:13840 3rd Qu.:255100
## Max. :24396 Max. :15768 Max. :19946 Max. :755100
## NA's :2 NA's :1 NA's :1 NA's :7
## TotalPopulation PopUnderFive Supply
## Min. : 579483 Min. : 27990 Min. : 19170
## 1st Qu.: 1874617 1st Qu.: 106844 1st Qu.: 52900
## Median : 4566844 Median : 272759 Median : 156030
## Mean : 12770444 Mean : 718316 Mean : 223058
## 3rd Qu.: 7969900 3rd Qu.: 446133 3rd Qu.: 244430
## Max. :332031554 Max. :18676229 Max. :1384580
## NA's :17
#remove n/a values (shown as NR)
df_no_NA <- na.omit(df)
Correlation between variables was then reviewed (Insert photo of correlation results). For the AverageCost, the independent variables that showed the highest correlation were IncomePerCapita and IncomeHousehold. The lowest correlation was surprisingly with Supply.
cor(df_no_NA[sapply(df_no_NA, is.numeric)])
## IncomePerCapita IncomeHousehold Infant Toddler
## IncomePerCapita 1.0000000 0.71967026 0.86542257 0.88325370
## IncomeHousehold 0.7196703 1.00000000 0.67782608 0.62051911
## Infant 0.8654226 0.67782608 1.00000000 0.93647901
## Toddler 0.8832537 0.62051911 0.93647901 1.00000000
## FourYearOld 0.8471078 0.55863940 0.92719873 0.95278466
## AverageCost 0.8846581 0.63847027 0.97845761 0.98284815
## NetWorth 0.4191680 0.63812880 0.34275547 0.29500464
## TotalPopulation 0.2001311 -0.02812134 0.13410072 0.11117554
## PopUnderFive 0.1575744 -0.03360532 0.09421573 0.06953070
## Supply 0.1097613 -0.03680261 0.03752511 0.02252874
## FourYearOld AverageCost NetWorth TotalPopulation PopUnderFive
## IncomePerCapita 0.84710779 0.88465808 0.4191680 0.20013108 0.15757436
## IncomeHousehold 0.55863940 0.63847027 0.6381288 -0.02812134 -0.03360532
## Infant 0.92719873 0.97845761 0.3427555 0.13410072 0.09421573
## Toddler 0.95278466 0.98284815 0.2950046 0.11117554 0.06953070
## FourYearOld 1.00000000 0.97627074 0.2217201 0.16935191 0.12762976
## AverageCost 0.97627074 1.00000000 0.2991336 0.13900558 0.09714685
## NetWorth 0.22172010 0.29913360 1.0000000 -0.26395991 -0.26528278
## TotalPopulation 0.16935191 0.13900558 -0.2639599 1.00000000 0.99422609
## PopUnderFive 0.12762976 0.09714685 -0.2652828 0.99422609 1.00000000
## Supply 0.06400977 0.04064380 -0.2369909 0.93732935 0.95934397
## Supply
## IncomePerCapita 0.10976135
## IncomeHousehold -0.03680261
## Infant 0.03752511
## Toddler 0.02252874
## FourYearOld 0.06400977
## AverageCost 0.04064380
## NetWorth -0.23699093
## TotalPopulation 0.93732935
## PopUnderFive 0.95934397
## Supply 1.00000000
I then formulated the model using backward elimination. I started with a linear regression model that used all variables and then removed variables only if the adjusted coefficient of determination (adjusted R2) increased with the removal of the variable.
lmAll <-lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+PopUnderFive+Supply,
data=df_no_NA)
adjr.all<-summary(lmAll)$adj.r.squared
summary(lmAll)
##
## Call:
## lm(formula = AverageCost ~ IncomePerCapita + IncomeHousehold +
## NetWorth + TotalPopulation + PopUnderFive + Supply, data = df_no_NA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2373.76 -927.05 69.87 742.32 2995.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.362e+03 1.953e+03 -2.233 0.036 *
## IncomePerCapita 2.563e-01 5.041e-02 5.084 4.3e-05 ***
## IncomeHousehold 2.186e-02 3.862e-02 0.566 0.577
## NetWorth -4.574e-03 4.016e-03 -1.139 0.267
## TotalPopulation 5.509e-05 5.597e-04 0.098 0.922
## PopUnderFive -4.915e-04 1.090e-02 -0.045 0.964
## Supply -1.488e-03 4.362e-03 -0.341 0.736
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1405 on 22 degrees of freedom
## Multiple R-squared: 0.7995, Adjusted R-squared: 0.7448
## F-statistic: 14.62 on 6 and 22 DF, p-value: 1.093e-06
#Round 1 of elimination
t1 <- lm(AverageCost~IncomePerCapita+NetWorth+TotalPopulation+PopUnderFive+Supply, data=df_no_NA)
adjr.t1<-summary(t1)$adj.r.squared
t2 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+TotalPopulation+PopUnderFive+Supply, data=df_no_NA)
adjr.t2<-summary(t2)$adj.r.squared
t3 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+PopUnderFive+Supply, data=df_no_NA)
adjr.t3<-summary(t3)$adj.r.squared
t4 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+Supply, data=df_no_NA)
adjr.t4<-summary(t4)$adj.r.squared
t5 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+PopUnderFive, data=df_no_NA)
adjr.t5<-summary(t5)$adj.r.squared
t6 <- lm(AverageCost~NetWorth+TotalPopulation+PopUnderFive+Supply, data=df_no_NA)
adjr.t6<-summary(t6)$adj.r.squared
adjr.all
## [1] 0.7448395
c(adjr.t1, adjr.t2, adjr.t3, adjr.t4, adjr.t5, adjr.t6)
## [1] 0.7523780 0.7415407 0.7558260 0.7559109 0.7546421 0.1598674
#Round 2 of elimination for t1
t1.1 <- lm(AverageCost~NetWorth+TotalPopulation+PopUnderFive+Supply, data=df_no_NA)
adjr.t1.1<-summary(t1)$adj.r.squared
t1.2 <- lm(AverageCost~IncomePerCapita+TotalPopulation+PopUnderFive+Supply, data=df_no_NA)
adjr.t1.2<-summary(t1)$adj.r.squared
t1.3 <- lm(AverageCost~IncomePerCapita+NetWorth+PopUnderFive+Supply, data=df_no_NA)
adjr.t1.3<-summary(t1)$adj.r.squared
t1.4 <- lm(AverageCost~IncomePerCapita+NetWorth+TotalPopulation+Supply, data=df_no_NA)
adjr.t1.4<-summary(t1)$adj.r.squared
t1.5 <- lm(AverageCost~IncomePerCapita+NetWorth+PopUnderFive+TotalPopulation, data=df_no_NA)
adjr.t1.5<-summary(t1)$adj.r.squared
#Round 2 of elimination for t3
t3.1 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+PopUnderFive+Supply, data=df_no_NA)
adjr.t3.1<-summary(t3)$adj.r.squared
t3.2 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+PopUnderFive+Supply, data=df_no_NA)
adjr.t3.2<-summary(t3)$adj.r.squared
t3.3 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+PopUnderFive+Supply, data=df_no_NA)
adjr.t3.3<-summary(t3)$adj.r.squared
t3.4 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+PopUnderFive+Supply, data=df_no_NA)
adjr.t3.4<-summary(t3)$adj.r.squared
t3.5 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+PopUnderFive+Supply, data=df_no_NA)
adjr.t3.5<-summary(t3)$adj.r.squared
#Round 2 of elimination for t4
t4.1 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+Supply, data=df_no_NA)
adjr.t4.1<-summary(t4)$adj.r.squared
t4.2 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+Supply, data=df_no_NA)
adjr.t4.2<-summary(t4)$adj.r.squared
t4.3 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+Supply, data=df_no_NA)
adjr.t4.3<-summary(t4)$adj.r.squared
t4.4 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+Supply, data=df_no_NA)
adjr.t4.4<-summary(t4)$adj.r.squared
t4.5 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+Supply, data=df_no_NA)
adjr.t4.5<-summary(t4)$adj.r.squared
#Round 2 of elimination for t5
t5.1 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+PopUnderFive, data=df_no_NA)
adjr.t5.1<-summary(t5)$adj.r.squared
t5.2 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+PopUnderFive, data=df_no_NA)
adjr.t5.2<-summary(t5)$adj.r.squared
t5.3 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+PopUnderFive, data=df_no_NA)
adjr.t5.3<-summary(t5)$adj.r.squared
t5.4 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+PopUnderFive, data=df_no_NA)
adjr.t5.4<-summary(t5)$adj.r.squared
t5.5 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+PopUnderFive, data=df_no_NA)
adjr.t5.5<-summary(t5)$adj.r.squared
#Summary or R^2s
adjr.t1
## [1] 0.752378
c(adjr.t1.1, adjr.t1.2, adjr.t1.3, adjr.t1.4, adjr.t1.5)
## [1] 0.752378 0.752378 0.752378 0.752378 0.752378
adjr.t3
## [1] 0.755826
c(adjr.t3.1, adjr.t3.2, adjr.t3.3, adjr.t3.4, adjr.t3.5)
## [1] 0.755826 0.755826 0.755826 0.755826 0.755826
adjr.t4
## [1] 0.7559109
c(adjr.t4.1, adjr.t4.2, adjr.t4.3, adjr.t4.4, adjr.t4.5)
## [1] 0.7559109 0.7559109 0.7559109 0.7559109 0.7559109
adjr.t5
## [1] 0.7546421
c(adjr.t5.1, adjr.t5.2, adjr.t5.3, adjr.t5.4, adjr.t5.5)
## [1] 0.7546421 0.7546421 0.7546421 0.7546421 0.7546421
After the first round of eliminations, the highest adjusted coefficient of determination was for model #4 which included the variables IncomePerCapita, IncomeHousehold, NetWorth, TotalPopulation, Supply