The code below loads the dataset “CPSWaitingForAdoptionFY2014-2023.csv”, copies it to an object called “waitadopt”, and removes the original object, which was named after the file.
library(readr)
CPSWaitingForAdoptionFY2014_2023 <- read_csv("CPSWaitingForAdoptionFY2014-2023.csv")
## Rows: 12158 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Region, Gender, Race/Ethnicity, Age Group
## dbl (3): Fiscal Year, Chidlren Waiting on Adoption 31 August, Average Months...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
waitadopt <- CPSWaitingForAdoptionFY2014_2023
rm(CPSWaitingForAdoptionFY2014_2023)
Below, the function “pastecs::stat.desc” is used to describe the variable “average months since termination of parental rights at the end of the fiscal year”:
pastecs::stat.desc(waitadopt$`Average Months since Termination of Parental Rights`)
## nbr.val nbr.null nbr.na min max range
## 1.215800e+04 1.000000e+01 0.000000e+00 0.000000e+00 1.810300e+02 1.810300e+02
## sum median mean SE.mean CI.mean.0.95 var
## 2.405740e+05 1.281000e+01 1.978730e+01 1.765712e-01 3.461077e-01 3.790547e+02
## std.dev coef.var
## 1.946933e+01 9.839305e-01
The numbers above represent the following:
nbr.val is the total number of observations: 1.215800e+04, or 12,158.
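nbr.null is the number of values equal to zero, not the number of R NULL objects: 1.000000e+01, or 10. That is why the is.null() check below returns zero results; is.null() tests whether an object is NULL, and no element of a numeric column is: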
nullcount <- sum(sapply(waitadopt$`Average Months since Termination of Parental Rights`, is.null))
print(nullcount)
## [1] 0
null_indices <- which(sapply(waitadopt$`Average Months since Termination of Parental Rights`, is.null))
print(null_indices)
## integer(0)
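The difference between a zero value and an R NULL can be checked on a small vector. The sketch below uses a made-up vector x (not the real column) and assumes, based on the pastecs documentation, that nbr.null counts values equal to zero:

```r
# Toy vector standing in for the real column; the values are illustrative only.
x <- c(0, 0, 3.5, 12.81, NA, 181.03)

n_zero <- sum(x == 0, na.rm = TRUE)  # values equal to zero (what nbr.null is assumed to report)
n_na   <- sum(is.na(x))              # missing values (what nbr.na reports)
n_null <- sum(sapply(x, is.null))    # NULL elements; a numeric vector never has any

print(c(zeros = n_zero, nas = n_na, nulls = n_null))
```

If that reading of nbr.null is right, running sum(waitadopt$`Average Months since Termination of Parental Rights` == 0) on the real data should return the same 10 that stat.desc reports.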
nbr.na shows the number of NA observations in the variable: 0.000000e+00, or 0.
min is the smallest observation value in the variable: 0.000000e+00, or 0. In other words, the lowest average time since termination of parental rights for children waiting for adoption at the end of the fiscal year is zero months.
max is the largest observation value in the variable: 1.810300e+02, or 181.03. That translates into 181 months, or roughly fifteen years since termination of parental rights of a child at the end of the fiscal year.
range is the difference between the max and the min: 1.810300e+02, or 181.03. It equals the max value here only because the min is zero.
sum is the total value of all observations, or the combined number of average months in the variable: 2.405740e+05, or 240,574.
median is the middle value when all observations are sorted from lowest to highest, so half of the observations fall at or below it. In this case, it is 1.281000e+01, or 12.81 months.
mean represents the average of the averages of months: 1.978730e+01, or 19.79. That is about 54 percent larger than the median (19.79 / 12.81 ≈ 1.54), which means that the high values, such as the youth waiting five years or longer for adoption, pull the average upward. The distribution is skewed to the right.
SE.mean, per ChatGPT, is the Standard Error of the Mean (SEM), which measures how much the sample mean of the data is expected to vary from the true population mean. It is calculated as the standard deviation (SD) divided by the square root of the sample size. In this case: 1.765712e-01, or .1766 months, which is small relative to the mean of 19.79, mainly because the sample is large (12,158 observations).
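CI.mean.0.95, per ChatGPT, relates to the range in which the true mean of the population is likely to fall, based on the sample data. pastecs reports the half-width of the 95 percent confidence interval rather than its endpoints. The result, 3.461077e-01, or .346, means the interval is 19.79 ± 0.35, or roughly 19.44 to 20.13 months. The interval is narrow, so the mean is estimated precisely, provided the observations can be treated as a random sample.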
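As a rough check, SE.mean and CI.mean.0.95 can be reproduced from the other figures in the output. The sketch below hard-codes the reported n, mean, and standard deviation, and assumes that pastecs computes the interval half-width as a t quantile multiplied by the standard error:

```r
# Values copied from the stat.desc output above.
n      <- 12158
mean_x <- 19.78730
sd_x   <- 19.46933

se_mean    <- sd_x / sqrt(n)                    # standard error of the mean
half_width <- qt(0.975, df = n - 1) * se_mean   # assumed formula behind CI.mean.0.95
ci         <- c(lower = mean_x - half_width,    # endpoints of the 95 percent interval
                upper = mean_x + half_width)

print(se_mean)
print(half_width)
print(ci)
```

Both computed values should agree with the reported 1.765712e-01 and 3.461077e-01 to several decimal places.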
var is the variance, which, per ChatGPT, measures the spread of a set of numbers. A higher variance indicates that the numbers are more spread out from the mean, which seems to be the case: 3.790547e+02, or 379.05. The variance is expressed in squared months, so it is not directly comparable to the mean of 19.79; its square root, the standard deviation, is.
std.dev is the standard deviation: 1.946933e+01, or 19.47 months. If the variable followed a normal bell curve, roughly 68 percent of all observations would fall within 19.47 months of the mean (19.79), that is, between about 0.32 and 39.26 months. Because the standard deviation is almost as large as the mean and no value can fall below zero, the distribution cannot be a symmetric bell curve. This can be observed plainly in a histogram:
hist(waitadopt$`Average Months since Termination of Parental Rights`)
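A quick way to see why a normal curve is a poor fit is to ask how much of a normal distribution with the reported mean and standard deviation would fall below zero months, which real data cannot do. The sketch below hard-codes the two reported values and is only an illustration, not a formal normality test:

```r
# Values copied from the stat.desc output above.
mean_x <- 19.78730
sd_x   <- 19.46933

# Share of a normal distribution with this mean and SD that lies below zero.
p_negative <- pnorm(0, mean = mean_x, sd = sd_x)
print(round(p_negative, 3))
```

A normal distribution with these parameters would put roughly 15 percent of its values below zero, while the real variable has none.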
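The histogram below shows the same variable on a density scale, with a smoothed density estimate drawn as a red line. The line traces the shape of the data itself; it is not a fitted normal curve: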
hist(waitadopt$`Average Months since Termination of Parental Rights`, breaks = 30, probability = TRUE)
lines(density(waitadopt$`Average Months since Termination of Parental Rights`), col = "red", lwd = 2)
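To compare the data against an actual normal curve, a normal density with the sample mean and standard deviation can be drawn on the same plot. The sketch below uses simulated right-skewed data (the vector months is a stand-in, not the real column) so that it runs on its own:

```r
set.seed(42)
months <- rexp(1000, rate = 1 / 20)  # simulated, right-skewed stand-in (mean near 20)

# Compute the parameters first; inside curve(), x is the plotting grid, not the data.
m <- mean(months)
s <- sd(months)

hist(months, breaks = 30, probability = TRUE)
lines(density(months), col = "red", lwd = 2)   # smoothed shape of the data
curve(dnorm(x, mean = m, sd = s), add = TRUE,
      col = "blue", lwd = 2)                   # normal reference curve
```

The wider the gap between the red and blue lines, the further the data are from a normal distribution.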
The code below creates a new dataset called “avgmolog” with a new variable, “log_avgmo”, which holds the natural log of each observation. It then creates a summary table, “totavgmolog”, with one row per distinct log value and a count of the observations that share it. Finally, a pastecs analysis of that table is included below:
library(dplyr)  # provides %>%, mutate(), group_by(), summarize(), and if_else()
avgmolog <- waitadopt %>%
  mutate(log_avgmo = log(`Average Months since Termination of Parental Rights`))
totavgmolog <- avgmolog %>%
  group_by(avgmolog$log_avgmo) %>%
  summarize(count = n())
pastecs::stat.desc(totavgmolog)
## avgmolog$log_avgmo count
## nbr.val 4391.000000 4.391000e+03
## nbr.null 1.000000 0.000000e+00
## nbr.na 0.000000 0.000000e+00
## min -Inf 1.000000e+00
## max 5.198663 2.300000e+01
## range Inf 2.200000e+01
## sum -Inf 1.215800e+04
## median 3.233567 2.000000e+00
## mean -Inf 2.768845e+00
## SE.mean NaN 3.959460e-02
## CI.mean.0.95 NaN 7.762539e-02
## var NaN 6.883913e+00
## std.dev NaN 2.623721e+00
## coef.var NaN 9.475868e-01
The number of rows described changed from 12,158 to 4,391, and the minimum shows as -Inf. This is not an error as such: log(0) evaluates to -Inf in R, and that one value makes the sum and mean infinite and the variance-based statistics NaN. For that reason, the code below changes the values of zero (0) months to 0.01:
avgmolog <- waitadopt %>%
  mutate(`Average Months since Termination of Parental Rights` =
           if_else(`Average Months since Termination of Parental Rights` == 0, 0.01,
                   `Average Months since Termination of Parental Rights`))
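The effect of the substitution can be seen on a few values. The sketch below uses a short made-up vector vals that includes the reported minimum (0), median (12.81), and maximum (181.03):

```r
# Illustrative values only; 0, 12.81, and 181.03 are the reported min, median, and max.
vals <- c(0, 1, 12.81, 181.03)

print(log(vals))                         # the first element is -Inf because log(0) is -Inf

adj     <- ifelse(vals == 0, 0.01, vals) # replace zeros with 0.01, as in the code above
log_adj <- log(adj)
print(log_adj)                           # all values are now finite
```

After the substitution, the smallest log value is log(0.01), about -4.6, and the largest is log(181.03), about 5.2.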
The code below reruns the log analysis attempted above, this time after substituting 0.01 for the values of 0:
avgmolog <- avgmolog %>%
  mutate(log_avgmo = log(`Average Months since Termination of Parental Rights`))
totavgmolog <- avgmolog %>%
  group_by(avgmolog$log_avgmo) %>%
  summarize(count = n())
pastecs::stat.desc(totavgmolog)
## avgmolog$log_avgmo count
## nbr.val 4.391000e+03 4.391000e+03
## nbr.null 1.000000e+00 0.000000e+00
## nbr.na 0.000000e+00 0.000000e+00
## min -4.605170e+00 1.000000e+00
## max 5.198663e+00 2.300000e+01
## range 9.803833e+00 2.200000e+01
## sum 1.334197e+04 1.215800e+04
## median 3.233567e+00 2.000000e+00
## mean 3.038481e+00 2.768845e+00
## SE.mean 1.532816e-02 3.959460e-02
## CI.mean.0.95 3.005093e-02 7.762539e-02
## var 1.031677e+00 6.883913e+00
## std.dev 1.015715e+00 2.623721e+00
## coef.var 3.342837e-01 9.475868e-01
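The number of values shrank to 4,391 because “totavgmolog” has one row per distinct log value. The “count” column is the number of rows in the original data that share that value, not the number of children waiting for adoption, and the counts sum to 12,158, the original number of observations. As a result, the statistics above describe the distinct values, each weighted equally, rather than all 12,158 observations. Furthermore, the range changed to 9.8 units, with a minimum value of -4.6 (the log of the substituted 0.01) and a maximum value of 5.2. The histogram below, drawn from all of the log values, shows a better visual representation: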
hist(avgmolog$log_avgmo, breaks = 30, probability = TRUE)
lines(density(avgmolog$log_avgmo), col = "red", lwd = 2)
As can be seen above, the distribution looks much more like a normal bell curve than it did before the log transformation.
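The grouping step used earlier reduces the data to one row per distinct log value, so statistics computed on the grouped table weight each distinct value equally. The sketch below shows the difference on a tiny made-up vector, using base R's table() as a stand-in for the group_by() and summarize() steps:

```r
# Five illustrative observations with three distinct values.
log_vals <- log(c(1, 1, 1, 2, 5))

counts     <- table(log_vals)     # one entry per distinct value, like the grouped table
n_distinct <- length(counts)      # number of distinct values
n_obs      <- sum(counts)         # the counts add back up to the number of observations

mean_all      <- mean(log_vals)          # mean over every observation
mean_distinct <- mean(unique(log_vals))  # mean over distinct values only

print(c(n_distinct = n_distinct, n_obs = n_obs))
print(c(mean_all = mean_all, mean_distinct = mean_distinct))
```

The two means differ because the repeated value counts once in the distinct-value mean. To describe every observation in the real data, one option is to call pastecs::stat.desc(avgmolog$log_avgmo) on the full column instead of the grouped table.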