Introduction

I created a linear regression model that analyzes multiple variables on the cost of childcare per state. The regression is a potential step in creating an estimate of how much individuals in each state could expect to spend on childcare given known or estimated shifts in each variable. It could also be used to help determine where change would have the most impact on childcare costs.

Variables

The independent variables included:

  1. Income Per Capita (IncomePerCapita): The average income per capita in each state in 2021 as provided by the United States Census Bureau16
  2. Income Per Household (IncomeHousehold): The average household income in 2021 per state as provided by the St. Louis FED17
  3. Average Net Worth per household (NetWorth): The average household net worth in 2021 as determined by total assets including home equity. Information provided by the United States Census Bureau18
  4. Total Population (TotalPopulation): The estimated population of each state in 2021 as provided by the United States Census Bureau19
  5. Population under five-years-old (PopUnderFive): The estimated population of children under the age of five for each state in 2021 as presented by the Annie E. Casey Foundation from the United States Census Bureau20
  6. Supply of licensed childcare facilities (Supply): Estimated number of licensed childcare facilities for each state in 2021 including in-home daycares and center based facilities as provided by the Bipartisan Policy Center21

The dependent variable was calculated by averaging the estimated annual cost of childcare in each state in 2021 for three categories: infants, toddlers, and four-year-olds. The data was provided by Child Aware.

The regression was built by first removing all lines of data with omitted information.

#set directory
setwd("C:/Users/hrall/OneDrive/Documents/R")

#read data
df <- read.csv("Childcare.csv",header=TRUE, stringsAsFactors = T)
summary(df)
##         State    IncomePerCapita IncomeHousehold      Infant     
##  Alabama   : 1   Min.   :45887   Min.   : 50290   Min.   : 7280  
##  Alaska    : 1   1st Qu.:55962   1st Qu.: 67765   1st Qu.:10593  
##  Arizona   : 1   Median :60007   Median : 76665   Median :12024  
##  Arkansas  : 1   Mean   :62063   Mean   : 77125   Mean   :13085  
##  California: 1   3rd Qu.:66201   3rd Qu.: 86880   3rd Qu.:15444  
##  Colorado  : 1   Max.   :96659   Max.   :105000   Max.   :25523  
##  (Other)   :46                                    NA's   :1      
##     Toddler       FourYearOld     AverageCost       NetWorth     
##  Min.   : 6830   Min.   : 6014   Min.   : 6758   Min.   : 72550  
##  1st Qu.: 9302   1st Qu.: 8552   1st Qu.: 9554   1st Qu.:119000  
##  Median :11324   Median : 9880   Median :10780   Median :162700  
##  Mean   :11868   Mean   :10245   Mean   :11756   Mean   :200092  
##  3rd Qu.:14188   3rd Qu.:11464   3rd Qu.:13840   3rd Qu.:255100  
##  Max.   :24396   Max.   :15768   Max.   :19946   Max.   :755100  
##  NA's   :2       NA's   :1       NA's   :1       NA's   :7       
##  TotalPopulation      PopUnderFive          Supply       
##  Min.   :   579483   Min.   :   27990   Min.   :  19170  
##  1st Qu.:  1874617   1st Qu.:  106844   1st Qu.:  52900  
##  Median :  4566844   Median :  272759   Median : 156030  
##  Mean   : 12770444   Mean   :  718316   Mean   : 223058  
##  3rd Qu.:  7969900   3rd Qu.:  446133   3rd Qu.: 244430  
##  Max.   :332031554   Max.   :18676229   Max.   :1384580  
##                                         NA's   :17
#remove n/a values (shown as NR)
df_no_NA <- na.omit(df)

Correlation

Correlation between variables was then reviewed (Insert photo of correlation results). For the AverageCost, the independent variables that showed the highest correlation were IncomePerCapita and IncomeHousehold. The lowest correlation was surprisingly with Supply.

cor(df_no_NA[sapply(df_no_NA, is.numeric)])
##                 IncomePerCapita IncomeHousehold     Infant    Toddler
## IncomePerCapita       1.0000000      0.71967026 0.86542257 0.88325370
## IncomeHousehold       0.7196703      1.00000000 0.67782608 0.62051911
## Infant                0.8654226      0.67782608 1.00000000 0.93647901
## Toddler               0.8832537      0.62051911 0.93647901 1.00000000
## FourYearOld           0.8471078      0.55863940 0.92719873 0.95278466
## AverageCost           0.8846581      0.63847027 0.97845761 0.98284815
## NetWorth              0.4191680      0.63812880 0.34275547 0.29500464
## TotalPopulation       0.2001311     -0.02812134 0.13410072 0.11117554
## PopUnderFive          0.1575744     -0.03360532 0.09421573 0.06953070
## Supply                0.1097613     -0.03680261 0.03752511 0.02252874
##                 FourYearOld AverageCost   NetWorth TotalPopulation PopUnderFive
## IncomePerCapita  0.84710779  0.88465808  0.4191680      0.20013108   0.15757436
## IncomeHousehold  0.55863940  0.63847027  0.6381288     -0.02812134  -0.03360532
## Infant           0.92719873  0.97845761  0.3427555      0.13410072   0.09421573
## Toddler          0.95278466  0.98284815  0.2950046      0.11117554   0.06953070
## FourYearOld      1.00000000  0.97627074  0.2217201      0.16935191   0.12762976
## AverageCost      0.97627074  1.00000000  0.2991336      0.13900558   0.09714685
## NetWorth         0.22172010  0.29913360  1.0000000     -0.26395991  -0.26528278
## TotalPopulation  0.16935191  0.13900558 -0.2639599      1.00000000   0.99422609
## PopUnderFive     0.12762976  0.09714685 -0.2652828      0.99422609   1.00000000
## Supply           0.06400977  0.04064380 -0.2369909      0.93732935   0.95934397
##                      Supply
## IncomePerCapita  0.10976135
## IncomeHousehold -0.03680261
## Infant           0.03752511
## Toddler          0.02252874
## FourYearOld      0.06400977
## AverageCost      0.04064380
## NetWorth        -0.23699093
## TotalPopulation  0.93732935
## PopUnderFive     0.95934397
## Supply           1.00000000

Linear Model

I then formulated the model using backward elimination. I started with a linear regression model that used all variables and then removed variables only if the adjusted coefficient of determination (adjusted R2) increased with the removal of the variable.

lmAll <-lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+PopUnderFive+Supply, 
           data=df_no_NA)
adjr.all<-summary(lmAll)$adj.r.squared
summary(lmAll)
## 
## Call:
## lm(formula = AverageCost ~ IncomePerCapita + IncomeHousehold + 
##     NetWorth + TotalPopulation + PopUnderFive + Supply, data = df_no_NA)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2373.76  -927.05    69.87   742.32  2995.27 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -4.362e+03  1.953e+03  -2.233    0.036 *  
## IncomePerCapita  2.563e-01  5.041e-02   5.084  4.3e-05 ***
## IncomeHousehold  2.186e-02  3.862e-02   0.566    0.577    
## NetWorth        -4.574e-03  4.016e-03  -1.139    0.267    
## TotalPopulation  5.509e-05  5.597e-04   0.098    0.922    
## PopUnderFive    -4.915e-04  1.090e-02  -0.045    0.964    
## Supply          -1.488e-03  4.362e-03  -0.341    0.736    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1405 on 22 degrees of freedom
## Multiple R-squared:  0.7995, Adjusted R-squared:  0.7448 
## F-statistic: 14.62 on 6 and 22 DF,  p-value: 1.093e-06
#Round 1 of elimination
t1 <- lm(AverageCost~IncomePerCapita+NetWorth+TotalPopulation+PopUnderFive+Supply, data=df_no_NA)
adjr.t1<-summary(t1)$adj.r.squared

t2 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+TotalPopulation+PopUnderFive+Supply, data=df_no_NA)
adjr.t2<-summary(t2)$adj.r.squared

t3 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+PopUnderFive+Supply, data=df_no_NA)
adjr.t3<-summary(t3)$adj.r.squared

t4 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+Supply, data=df_no_NA)
adjr.t4<-summary(t4)$adj.r.squared

t5 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+PopUnderFive, data=df_no_NA)
adjr.t5<-summary(t5)$adj.r.squared

t6 <- lm(AverageCost~NetWorth+TotalPopulation+PopUnderFive+Supply, data=df_no_NA)
adjr.t6<-summary(t6)$adj.r.squared

adjr.all
## [1] 0.7448395
c(adjr.t1, adjr.t2, adjr.t3, adjr.t4, adjr.t5, adjr.t6)
## [1] 0.7523780 0.7415407 0.7558260 0.7559109 0.7546421 0.1598674
#Round 2 of elimination for t1
t1.1 <- lm(AverageCost~NetWorth+TotalPopulation+PopUnderFive+Supply, data=df_no_NA)
adjr.t1.1<-summary(t1)$adj.r.squared

t1.2 <- lm(AverageCost~IncomePerCapita+TotalPopulation+PopUnderFive+Supply, data=df_no_NA)
adjr.t1.2<-summary(t1)$adj.r.squared

t1.3 <- lm(AverageCost~IncomePerCapita+NetWorth+PopUnderFive+Supply, data=df_no_NA)
adjr.t1.3<-summary(t1)$adj.r.squared

t1.4 <- lm(AverageCost~IncomePerCapita+NetWorth+TotalPopulation+Supply, data=df_no_NA)
adjr.t1.4<-summary(t1)$adj.r.squared

t1.5 <- lm(AverageCost~IncomePerCapita+NetWorth+PopUnderFive+TotalPopulation, data=df_no_NA)
adjr.t1.5<-summary(t1)$adj.r.squared


#Round 2 of elimination for t3
t3.1 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+PopUnderFive+Supply, data=df_no_NA)
adjr.t3.1<-summary(t3)$adj.r.squared

t3.2 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+PopUnderFive+Supply, data=df_no_NA)
adjr.t3.2<-summary(t3)$adj.r.squared

t3.3 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+PopUnderFive+Supply, data=df_no_NA)
adjr.t3.3<-summary(t3)$adj.r.squared

t3.4 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+PopUnderFive+Supply, data=df_no_NA)
adjr.t3.4<-summary(t3)$adj.r.squared

t3.5 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+PopUnderFive+Supply, data=df_no_NA)
adjr.t3.5<-summary(t3)$adj.r.squared

#Round 2 of elimination for t4
t4.1 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+Supply, data=df_no_NA)
adjr.t4.1<-summary(t4)$adj.r.squared

t4.2 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+Supply, data=df_no_NA)
adjr.t4.2<-summary(t4)$adj.r.squared

t4.3 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+Supply, data=df_no_NA)
adjr.t4.3<-summary(t4)$adj.r.squared

t4.4 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+Supply, data=df_no_NA)
adjr.t4.4<-summary(t4)$adj.r.squared

t4.5 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+Supply, data=df_no_NA)
adjr.t4.5<-summary(t4)$adj.r.squared



#Round 2 of elimination for t5
t5.1 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+PopUnderFive, data=df_no_NA)
adjr.t5.1<-summary(t5)$adj.r.squared

t5.2 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+PopUnderFive, data=df_no_NA)
adjr.t5.2<-summary(t5)$adj.r.squared

t5.3 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+PopUnderFive, data=df_no_NA)
adjr.t5.3<-summary(t5)$adj.r.squared

t5.4 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+PopUnderFive, data=df_no_NA)
adjr.t5.4<-summary(t5)$adj.r.squared

t5.5 <- lm(AverageCost~IncomePerCapita+IncomeHousehold+NetWorth+TotalPopulation+PopUnderFive, data=df_no_NA)
adjr.t5.5<-summary(t5)$adj.r.squared


#Summary or R^2s
adjr.t1
## [1] 0.752378
c(adjr.t1.1, adjr.t1.2, adjr.t1.3, adjr.t1.4, adjr.t1.5)
## [1] 0.752378 0.752378 0.752378 0.752378 0.752378
adjr.t3
## [1] 0.755826
c(adjr.t3.1, adjr.t3.2, adjr.t3.3, adjr.t3.4, adjr.t3.5)
## [1] 0.755826 0.755826 0.755826 0.755826 0.755826
adjr.t4
## [1] 0.7559109
c(adjr.t4.1, adjr.t4.2, adjr.t4.3, adjr.t4.4, adjr.t4.5)
## [1] 0.7559109 0.7559109 0.7559109 0.7559109 0.7559109
adjr.t5
## [1] 0.7546421
c(adjr.t5.1, adjr.t5.2, adjr.t5.3, adjr.t5.4, adjr.t5.5)
## [1] 0.7546421 0.7546421 0.7546421 0.7546421 0.7546421

After the first round of eliminations, the highest adjusted coefficient of determination was for model #4 which included the variables IncomePerCapita, IncomeHousehold, NetWorth, TotalPopulation, Supply