The outcome \(y_i\) is the observed cognitive score of subject \(i\) and \(y_i^*\) is the unobserved true score. We generate the true score as \[y_i^* = \beta_0 + f(age_i) + \beta_1 educ_i + \beta_2 sex_i + \epsilon_i,\] where \(\epsilon_i \overset{iid}{\sim} N(0,1)\) is the random error. The observed cognitive score is subject to both floor and ceiling effects due to the lower and upper limits of the score. That is, we only partially observe \(y_i^*\): \[\begin{equation} y_i = \begin{cases} a & \text{if $y_i^* < a$}\\ y_i^* & \text{if $a \le y_i^* \le b$}\\ b & \text{if $y_i^* > b$.} \end{cases} \end{equation}\]
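For concreteness, a minimal R sketch of this data-generating process follows. The coefficient values and the smooth \(f(\cdot)\) are illustrative placeholders, not the values used in the actual simulation; the cut-points \(a = 11\) and \(b = 21\) are the ones chosen below.

```r
## Sketch of the data-generating process. The coefficients and the smooth
## f() are placeholders, not the values used in the actual simulation.
set.seed(2023)
n    <- 1000
age  <- runif(n, 40, 90)
educ <- sample(10:21, n, replace = TRUE)
sex  <- rbinom(n, 1, 0.46)                    # 1 = male (~46%, per the table)
f    <- function(x) -3 * ((x - 40) / 50)^2    # hypothetical smooth age effect
ystar <- 12 + f(age) + 0.4 * educ + 0.5 * sex + rnorm(n)  # true score y*
a <- 11; b <- 21                              # cut-points chosen below
y   <- pmin(pmax(ystar, a), b)                # floor at a, ceiling at b
dat <- data.frame(age, educ, sex, y, ystar)
```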
The summary statistics for the simulated covariates and unobserved true cognitive score \(y_i^*\) are
| Summary statistics | n = 1,000 |
|---|---|
| Age | |
| &nbsp;&nbsp;min | 40.12 |
| &nbsp;&nbsp;max | 89.98 |
| &nbsp;&nbsp;mean (sd) | 65.78 ± 14.17 |
| Sex (n) | |
| &nbsp;&nbsp;Female | 539 (54%) |
| &nbsp;&nbsp;Male | 461 (46%) |
| Education years | |
| &nbsp;&nbsp;min | 10 |
| &nbsp;&nbsp;max | 21 |
| &nbsp;&nbsp;mean (sd) | 16.31 ± 2.75 |
| True cognitive score, \(y^*\) | |
| &nbsp;&nbsp;min | 2.3 |
| &nbsp;&nbsp;1st decile | 11.08 |
| &nbsp;&nbsp;9th decile | 21.31 |
| &nbsp;&nbsp;max | 23.95 |
| &nbsp;&nbsp;mean (sd) | 17.13 ± 4.06 |
Here we set \(a = 11\) and \(b = 21\) so that approximately 10% of the data are censored from below and approximately 10% from above (see the 1st and 9th deciles of \(y^*\) above). The summary statistics for the observed outcome are
| Summary statistics | n = 1,000 |
|---|---|
| Observed cognitive score, \(y\) | |
| &nbsp;&nbsp;min | 11 |
| &nbsp;&nbsp;max | 21 |
| &nbsp;&nbsp;mean (sd) | 17.32 ± 3.13 |
We use B-splines to approximate the nonlinear function \(f(\cdot)\) such that \[ f(\cdot) \approx \sum_{k = 1}^K \alpha_{k} B_k(\cdot) = {\bf{B}}(\cdot)^T {\bf{\alpha}}, \] where \({\bf{B}}(\cdot) = [B_1(\cdot),\ldots,B_K(\cdot)]^T\) is an \(r\)th-order B-spline basis. Let \(I_i^a = I(y_i=a)\) and \(I_i^b = I(y_i=b)\) be the indicators for left- and right-censoring, respectively. Then the log-likelihood function is \[\begin{eqnarray*} l(\beta_0,\beta_1,\beta_2,{\bf{\alpha}},\sigma) &=& \sum_{i=1}^n (1-I_i^a-I_i^b)\left[\log \phi\left(\frac{y_i-\beta_0 - \beta_1 educ_i - \beta_2 sex_i-{\bf{B}}(age_i)^T{\bf{\alpha}}}{\sigma}\right) - \log \sigma\right] \nonumber \\ && + I_i^a \log\Phi\left(\frac{a-\beta_0 - \beta_1 educ_i - \beta_2 sex_i-{\bf{B}}(age_i)^T{\bf{\alpha}}}{\sigma}\right) + I_i^b \log\Phi\left(\frac{\beta_0 + \beta_1 educ_i + \beta_2 sex_i+{\bf{B}}(age_i)^T{\bf{\alpha}}-b}{\sigma}\right), \end{eqnarray*}\] where \(\phi\) and \(\Phi\) are the probability density function and the cumulative distribution function of the standard normal distribution, respectively.
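A direct R implementation of this log-likelihood is sketched below, using `splines::bs()` for the basis and `optim()` for maximization. The number of basis functions (`df = 6`) and the names `negll` and `fit` are assumptions for illustration.

```r
## Sketch: maximum likelihood for the censored (Tobit) model above.
## log(sigma) is optimized so that sigma stays positive.
library(splines)
B <- bs(dat$age, df = 6)              # B-spline basis B(age); df is an assumption
X <- cbind(1, dat$educ, dat$sex, B)   # columns: beta0, beta1, beta2, alpha
negll <- function(par) {
  sigma <- exp(par[length(par)])
  mu    <- drop(X %*% par[-length(par)])
  la <- dat$y <= a                    # left-censored (floor), I_i^a
  rb <- dat$y >= b                    # right-censored (ceiling), I_i^b
  ob <- !la & !rb                     # fully observed
  -(sum(dnorm(dat$y[ob], mu[ob], sigma, log = TRUE)) +  # log phi(.) - log sigma
    sum(pnorm((a - mu[la]) / sigma, log.p = TRUE)) +    # log Phi((a - mu)/sigma)
    sum(pnorm((mu[rb] - b) / sigma, log.p = TRUE)))     # log Phi((mu - b)/sigma)
}
fit <- optim(c(rep(0, ncol(X)), 0), negll, method = "BFGS")
```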
We randomly split the data into two halves: a training set and a testing set. On each training set, we fit three models: a linear regression model, a shape-constrained additive model (SCAM), and the proposed model (Tobit). We then evaluate the models by their out-of-sample root-mean-square error (RMSE): we predict the true scores (\(\widehat{y^*}\)) on the testing data using the fitted models and compute the RMSE against both the observed scores \(y\) and the unobserved true scores \(y^*\), as sketched below.
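A sketch of one split and the three fits is below. The shape constraint in the SCAM (`bs = "mpd"`, monotone decreasing in age) is a guess, as the exact constraint is not stated; for the Tobit model, `survival::survreg()` with interval2-coded censoring is an off-the-shelf equivalent of the likelihood above.

```r
## Sketch of one train/test split and the three model fits.
library(survival)
library(scam)
idx   <- sample(n, n / 2)
train <- dat[idx, ]
test  <- dat[-idx, ]

m_lm   <- lm(y ~ age + educ + sex, data = train)
m_scam <- scam(y ~ s(age, bs = "mpd") + educ + sex, data = train)  # constraint assumed
## interval2 coding: (NA, a) = left-censored at a, (b, NA) = right-censored at b
m_tob  <- survreg(Surv(ifelse(y <= a, NA, y), ifelse(y >= b, NA, y),
                       type = "interval2") ~ bs(age, df = 6) + educ + sex,
                  data = train, dist = "gaussian")
yhat <- predict(m_tob, newdata = test)   # predicted true scores on the test set
```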
In addition, when evaluating the prediction error against the observed scores \(y\), which are censored, we also consider truncating the predicted values so that they always lie within the interval \([a,b]\). In summary, we have three measures of RMSE (a computational sketch follows the definition of \(\psi\) below).
| Measure | Definition |
|---|---|
| \(\text{RMSE}_{y^*,\widehat{y^*}}\) | \(\sqrt{\frac{\sum_{i=1}^n (y_i^*-\widehat{y_i^*})^2}{n}}\) |
| \(\text{RMSE}_{y,\widehat{y^*}}\) | \(\sqrt{\frac{\sum_{i=1}^n (y_i-\widehat{y_i^*})^2}{n}}\) |
| \(\text{RMSE}_{y,\psi(\widehat{y^*})}\) | \(\sqrt{\frac{\sum_{i=1}^n \{y_i-\psi(\widehat{y_i^*})\}^2}{n}}\) |
where \[\begin{equation} \psi(x) = \begin{cases} a & \text{if $x < a$}\\ x & \text{if $a \le x \le b$}\\ b & \text{if $x > b$.} \end{cases} \end{equation}\]
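In R, the three measures can be computed as below, where `yhat` holds the predicted true scores \(\widehat{y^*}\) from any of the fitted models (function names are illustrative).

```r
## The three RMSE measures; psi() truncates predictions to [a, b].
rmse <- function(u, v) sqrt(mean((u - v)^2))
psi  <- function(x) pmin(pmax(x, a), b)
c(rmse_ystar = rmse(test$ystar, yhat),        # RMSE_{y*, yhat}
  rmse_y     = rmse(test$y,     yhat),        # RMSE_{y,  yhat}
  rmse_y_psi = rmse(test$y,     psi(yhat)))   # RMSE_{y,  psi(yhat)}
```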
We repeat the above procedure 100 times.
## [1] "Iterations 1-10 are done. Time = 2.00 s."
## [1] "Iterations 11-20 are done. Time = 2.00 s."
## [1] "Iterations 21-30 are done. Time = 2.00 s."
## [1] "Iterations 31-40 are done. Time = 2.00 s."
## [1] "Iterations 41-50 are done. Time = 2.00 s."
## [1] "Iterations 51-60 are done. Time = 2.00 s."
## [1] "Iterations 61-70 are done. Time = 2.00 s."
## [1] "Iterations 71-80 are done. Time = 2.00 s."
## [1] "Iterations 81-90 are done. Time = 2.00 s."
## [1] "Iterations 91-100 are done. Time = 2.00 s."
We repeat the above procedure 100 times on two real outcomes, MINTTOTS and DIGFORSL. Note that for these outcomes we do not observe the true scores (\(y^*\)), so we cannot evaluate \(\text{RMSE}_{y^*,\widehat{y^*}}\).