Data Scientist - Take home assignment
1 Background/question
CNO is making significant changes to the registration requirements nurses must meet in order to register in Ontario. The changes are primarily designed to improve the registration process for internationally educated nurses (IENs) and go into effect on April 1st, 2025. The changes will be immediately applicable to all IENs with an open application and all new IEN applicants going forward. Measuring changes in time to registration, the time between when an application is received by CNO and when an applicant is registered, is critical to understanding the impact of the regulation change.
2 Aim
Measure the impact of the regulation change on time to registration
3 Analytical Approach
3.1 Assumptions/conditions
Assume that data on time to registration is available for all applicants. The average time to registration is currently 150 days for IEN applicants but has already fallen from 180 days over the last year due to other registration process changes.
3.2 Study outcome
Primary: Time from application received to registration
Potential Secondary Outcome: Average time from application to registration over a specific time period (season, year, etc.)
3.3 Measurement of essential variables
The outcome will be calculated as the difference in days between the date the application was received and the date of registration. As a potential secondary outcome, it will be averaged over a specific time period.
Applicants will be assigned an index, coded as 1 for those applying after the regulation change is implemented and 0 for those applying before the change.
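The outcome and index definitions above can be sketched as follows. This is a minimal illustration, not the actual data pipeline: the function names and the two applicants are hypothetical, and treating an application received on April 1, 2025 itself as post-change is an assumption.

```python
from datetime import date

# Regulation change date stated in the background section.
CHANGE_DATE = date(2025, 4, 1)

def time_to_registration_days(received: date, registered: date) -> int:
    """Days between application received and registration."""
    return (registered - received).days

def post_change_index(received: date) -> int:
    """1 if the application was received on or after the change date, else 0.

    Counting the change date itself as post-change is an assumption.
    """
    return 1 if received >= CHANGE_DATE else 0

# Hypothetical applicants: (application received, registered).
applicants = [
    (date(2024, 11, 15), date(2025, 5, 10)),
    (date(2025, 4, 1), date(2025, 7, 20)),
]
rows = [
    (time_to_registration_days(received, registered), post_change_index(received))
    for received, registered in applicants
]
print(rows)  # [(176, 0), (110, 1)]
```

The first applicant applied before the change but registered after it. Under the index as defined, that applicant is still coded 0, even though the change applies to open applications.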
Additionally, relevant variables associated with the regulation change details and other potential covariates/risk factors (e.g. seasonal variations, applicant demographics, registration category, process efficiency changes) will be collected, assessed and quantified as appropriate.
3.4 Analyses plan
Descriptive statistics will be reported as mean (SD) or median (IQR) for continuous variables and frequency (percentage) for categorical variables. Hypothesis tests, including t-tests or Mann-Whitney tests for continuous variables and Chi-squared tests or Fisher’s exact tests for categorical variables, will be performed as appropriate to assess differences before and after the regulation change. The outcome distribution will be checked with a histogram and a Q-Q plot.
A linear regression model will be fitted with the outcome as the dependent variable, provided that model assumptions are met. It allows a straightforward interpretation of the relationship between the regulation change and time to registration, and potential confounders and covariates can be adjusted for. If the normality assumption is violated, the outcome will be log-transformed before fitting the model.
An accelerated failure time (AFT) model will be used to estimate the time acceleration factor and assess whether the registration process is faster after the regulation change.
An Interrupted Time Series (ITS) model will be constructed with the secondary outcome as the dependent variable to assess whether and how the outcome changed after the regulation change, including any immediate or long-term effects.
Data visualizations such as bar plots, box plots, line graphs, scatter plots, and autocorrelation plots will be provided as appropriate. Subgroup analyses may be conducted as necessary.
4 Simulation study
4.1 Data simulation
In this section, I simulate data based on CNO historical data to demonstrate the analytical approach outlined above, providing a basic example of the expected output under this statistical analysis plan.
Based on the historical data reported in the College of Nurses of Ontario Nursing Statistics Report 2024: Appendix B, I simulated the data and ran the analysis on the simulated data …
For simplicity, I simulate applicant-level data (RN and RPN) only for internationally educated nurses (IENs) from 2019–2027, including:
- Historical period: 2019–2023
- Future period: 2024–2027
- A regulation change on April 1, 2025 (split 2025 into Q1 = pre-change, Q2–Q4 = post-change)
- Time to registration (days) dropping over time using provided information
- Ages (using reported means)
- Ignore Ontario and Other Canadian data entirely (only deal with IEN volumes for this analysis).
- Focus only on RN and RPN data from this source.
Important: All data values are simulated based on reasonable assumptions, for the purpose of this simulation study
RN: Registered Nurse in the General Class
RPN: Registered Practical Nurse in the General Class
Other assumptions/conditions:
Assume that data on time to registration is available for all applicants.
The average time to registration is currently 150 days for IEN applicants but has already fallen from 180 days over the last year due to other registration process changes.
1) Historical Data (2019–2023)
I use:
- Number of IEN RNs and RPNs for each year (2019–2023) from reported historical data.
- Average time to registration for each year (declining from ~180 to ~150 days) as provided.
- Average age for each year by nurse type (RN vs RPN) from reported historical data.
2) Future Data (2024–2027)
Now define:
- Volumes of IEN RNs and RPNs for 2024–2027 (extrapolated numbers based on historical trend).
- Time-to-registration assumptions, including a special split for 2025:
- Q1 2025 = “pre-change”
- Q2–Q4 2025 = “post-change”
- We assume further reductions in time to registration after April 1, 2025.
- Assume an RN:RPN ratio of approximately 60:40 (observed in 2023 data)
- Assume average ages for future years using observed averages in 2023
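The simulation steps above could be sketched roughly as below. This is a hedged illustration, not the code used to produce the data set. The helper names, the Normal distribution, the means (150 and 120 days), the SD of 20 days, and the even spread of 2025 volume across quarters are all assumptions. Only the 2025 RN volume (4,678) comes from the counts table. Age is left out for brevity.

```python
import random

def split_2025(n_total, pre_share=0.25):
    """Split a 2025 volume into Q1 (pre-change) and Q2-Q4 (post-change) counts.

    Assumes applications are spread evenly across quarters.
    """
    n_pre = round(n_total * pre_share)
    return n_pre, n_total - n_pre

def simulate_group(year, nurse_type, n, mean_days, sd_days, cohort, rng):
    """Simulate n applicant-level rows with Normal time to registration."""
    rows = []
    for _ in range(n):
        # Round to whole days and keep the value strictly positive.
        days = max(1, round(rng.gauss(mean_days, sd_days)))
        rows.append({
            "year": year,
            "nurse_type": nurse_type,
            "time_to_registration_days": days,
            "cohort": cohort,
        })
    return rows

rng = random.Random(2025)  # fixed seed so the sketch is reproducible
n_pre, n_post = split_2025(4678)  # 2025 RN volume from the counts table
data = (
    simulate_group(2025, "RN", n_pre, 150, 20, "pre-change", rng)   # assumed mean/SD
    + simulate_group(2025, "RN", n_post, 120, 20, "post-change", rng)  # assumed mean/SD
)
print(len(data), data[0])
```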
Quick data structure check
## Rows: 49,707
## Columns: 5
## $ year <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 20…
## $ nurse_type <chr> "RN", "RN", "RN", "RN", "RN", "RN", "RN", "R…
## $ time_to_registration_days <dbl> 169, 175, 211, 181, 183, 214, 189, 155, 166,…
## $ age_years <dbl> 29, 37, 34, 33, 34, 39, 31, 31, 31, 27, 37, …
## $ cohort <chr> "pre-change", "pre-change", "pre-change", "p…
First and last few rows of the data
| year | nurse_type | time_to_registration_days | age_years | cohort |
|---|---|---|---|---|
| 2019 | RN | 169 | 29 | pre-change |
| 2019 | RN | 175 | 37 | pre-change |
| 2019 | RN | 211 | 34 | pre-change |
| 2019 | RN | 181 | 33 | pre-change |
| 2019 | RN | 183 | 34 | pre-change |
| 2019 | RN | 214 | 39 | pre-change |
| year | nurse_type | time_to_registration_days | age_years | cohort |
|---|---|---|---|---|
| 2027 | RPN | 106 | 32 | post-change |
| 2027 | RPN | 103 | 29 | post-change |
| 2027 | RPN | 104 | 28 | post-change |
| 2027 | RPN | 102 | 30 | post-change |
| 2027 | RPN | 93 | 32 | post-change |
| 2027 | RPN | 129 | 30 | post-change |
4.1.1 New registrants by nurse type
4.1.1.1 Graph
4.1.1.2 Counts
| Nurse Type | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 | 2025 | 2026 | 2027 | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| RN | 792 | 866 | 1,168 | 2,507 | 3,323 | 3,996 | 4,678 | 5,360 | 6,042 | 28,732 |
| RPN | 773 | 743 | 1,289 | 2,618 | 2,168 | 2,664 | 3,119 | 3,573 | 4,028 | 20,975 |
| Total | 1,565 | 1,609 | 2,457 | 5,125 | 5,491 | 6,660 | 7,797 | 8,933 | 10,070 | 49,707 |
4.1.1.3 Proportions
| Nurse Type | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 | 2025 | 2026 | 2027 |
|---|---|---|---|---|---|---|---|---|---|
| RN | 50.6% | 53.8% | 47.5% | 48.9% | 60.5% | 60.0% | 60.0% | 60.0% | 60.0% |
| RPN | 49.4% | 46.2% | 52.5% | 51.1% | 39.5% | 40.0% | 40.0% | 40.0% | 40.0% |
| Total | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
4.1.2 Time to registration
4.1.2.1 Graph
4.1.2.2 Summary statistics
| Cohort | Mean | Median | SD | Min | Max | IQR |
|---|---|---|---|---|---|---|
| post-change | 110.5 | 110 | 20.8 | 27 | 194 | 28 |
| pre-change | 177.7 | 178 | 21.4 | 88 | 266 | 28 |
| Total | 144.1 | 143 | 39.7 | 27 | 266 | 68 |
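
As a minimal sketch of how the summary columns can be computed, the snippet below uses only the six first and six last rows shown earlier, so its numbers will not match the full-data table. The `summarize` helper is hypothetical, and the `inclusive` quantile method is an assumption about how the IQR was computed.

```python
import statistics

def summarize(values):
    """Mean, median, SD, min, max and IQR for one cohort."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": round(statistics.mean(values), 1),
        "median": statistics.median(values),
        "sd": round(statistics.stdev(values), 1),  # sample SD (n - 1 denominator)
        "min": min(values),
        "max": max(values),
        "iqr": q3 - q1,
    }

# First six (pre-change) and last six (post-change) rows of the simulated data.
pre = [169, 175, 211, 181, 183, 214]
post = [106, 103, 104, 102, 93, 129]

for name, values in (("pre-change", pre), ("post-change", post)):
    print(name, summarize(values))
```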
4.2 Statistical Analysis
4.2.1 Exploratory data visualization
4.2.2 Two-sample t-test
First, check whether time to registration (in days) is approximately Normally distributed using a Normal Q-Q plot.
Based on the plot below, the distribution looks Normal, so I proceed with a two-sample t-test assuming equal variances.
T-Test: Time to Registration by Cohort

| Mean (Post-Change) | Mean (Pre-Change) | Mean Difference (Post − Pre) | T-Statistic | P-Value | Lower 95% CI | Upper 95% CI |
|---|---|---|---|---|---|---|
| 110.5 | 177.7 | −67.2 | −354.5 | <0.001 | −67.6 | −66.8 |
Example of interpretation: Mean time to registration after the regulation change is significantly lower than before the regulation change.
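For reference, the equal-variance t statistic can be sketched with the standard library as below. It uses only the twelve rows shown earlier, so the result differs from the full-data table. The `pooled_t_test` helper is hypothetical. The difference is taken as post-change minus pre-change, which matches the sign of the reported mean difference. A p-value would need the t distribution, which the standard library does not provide.

```python
import math
import statistics

def pooled_t_test(group_a, group_b):
    """Equal-variance two-sample t statistic for mean(group_a) - mean(group_b)."""
    n_a, n_b = len(group_a), len(group_b)
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / (n_a + n_b - 2)
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return diff, diff / se, n_a + n_b - 2

# First six (pre-change) and last six (post-change) rows of the simulated data.
pre = [169, 175, 211, 181, 183, 214]
post = [106, 103, 104, 102, 93, 129]

diff, t_stat, df = pooled_t_test(post, pre)
print(round(diff, 1), round(t_stat, 2), df)
```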
4.2.3 Linear regression model (LM)
| | Univariable | Age-adjusted |
|---|---|---|
| (Intercept) | 177.65*** | 173.88*** |
| | [177.39, 177.92] | [172.46, 175.31] |
| cohortpost-change | -67.18*** | -66.90*** |
| | [-67.55, -66.81] | [-67.29, -66.52] |
| age | | 0.12*** |
| | | [0.07, 0.16] |

Note: * p < 0.05, ** p < 0.01, *** p < 0.001. Brackets show 95% confidence intervals.
Example of interpretation: The coefficient for post-change is about -67, indicating that mean time to registration is about 67 days shorter after the regulation change than before it, with or without adjustment for age. This effect is statistically significant (p < 0.001).
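With a single 0/1 cohort indicator, the least-squares intercept equals the pre-change mean and the slope equals the post-minus-pre difference in means. This is why the univariable estimates line up with the cohort means in the summary table. A minimal sketch on the twelve rows shown earlier (the `simple_ols` helper is hypothetical, and the numbers differ from the full-data fit):

```python
import statistics

def simple_ols(x, y):
    """Closed-form least squares for y = intercept + slope * x."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# First six (pre-change) and last six (post-change) rows of the simulated data.
pre = [169, 175, 211, 181, 183, 214]
post = [106, 103, 104, 102, 93, 129]

x = [0] * len(pre) + [1] * len(post)  # cohort index: 0 = pre, 1 = post
y = pre + post
intercept, slope = simple_ols(x, y)
print(round(intercept, 2), round(slope, 2))
```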
4.2.4 Accelerated failure time model (AFT)
I’ve chosen the log-normal model as an example because:
- Non-negativity and skewness: Time-to-event data (like time to registration) are always positive and often right-skewed. A log-normal distribution takes only positive values and is right-skewed, so it accommodates both properties.
- Normalization by log-transformation: The data may be closer to normally distributed after log-transformation, which improves the fit to model assumptions and the reliability of the estimates.
- Proportional effects: The log-normal model assumes that covariates affect time multiplicatively, which is often a more plausible description of process changes than a fixed additive shift.
- Clear interpretation: Model estimates can be interpreted as acceleration factors, showing how factors such as policy changes or other covariates speed up or slow down registration.
Overall, if the real time-to-registration data are positively skewed and the intervention is expected to affect the process in a proportional/multiplicative manner, then a log-normal AFT model is a strong candidate.
Limitation: the simulated distribution doesn’t look very log-normal. This is a limitation of the AFT model, as it assumes a parametric distribution for the time-to-event outcome, and this assumption might not always be satisfied.
Accelerated Failure Time Model: Univariable

| Variable | Acceleration Factor | P-Value |
|---|---|---|
| (Intercept) | 176.320 | <0.001 |
| cohortpost-change | 0.615 | <0.001 |
| Log(scale) | 0.165 | <0.001 |
Accelerated Failure Time Model: adjusting Age

| Variable | Acceleration Factor | P-Value |
|---|---|---|
| (Intercept) | 172.324 | <0.001 |
| cohortpost-change | 0.616 | <0.001 |
| age | 1.001 | <0.001 |
| Log(scale) | 0.165 | <0.001 |
Example of interpretation: Both models give a time ratio (TR) of about 0.62 for the post-change cohort, indicating that time to registration after the regulation change was approximately 62% of the pre-change time to registration (about 38% shorter).
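When no observations are censored (as assumed here, since registration dates are available for all applicants), the log-normal AFT coefficient for a binary covariate is the difference in mean log time, and the time ratio is its exponential. A minimal sketch on the twelve rows shown earlier (the helper names are hypothetical, and the result differs from the full-data estimate):

```python
import math
import statistics

def time_ratio(group_post, group_pre):
    """exp(difference in mean log time): the log-normal AFT time ratio
    for a binary covariate when there is no censoring."""
    mean_log_post = statistics.mean(math.log(t) for t in group_post)
    mean_log_pre = statistics.mean(math.log(t) for t in group_pre)
    return math.exp(mean_log_post - mean_log_pre)

def percent_change(tr):
    """Convert a time ratio into a percent change in time to registration."""
    return (tr - 1) * 100

# First six (pre-change) and last six (post-change) rows of the simulated data.
pre = [169, 175, 211, 181, 183, 214]
post = [106, 103, 104, 102, 93, 129]

tr = time_ratio(post, pre)
print(round(tr, 3), round(percent_change(tr), 1))

# Applied to the reported univariable estimate of 0.615:
print(round(percent_change(0.615), 1))  # -38.5
```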
4.2.5 Interrupted time series (ITS)
This model is fitted to the secondary outcome, the average time to registration aggregated over a specific time period. Since information for shorter periods (e.g. seasonal) is not publicly available, I illustrate the analysis with yearly aggregated data based on the simulated historical and predicted data. With only 9 data points available, the sample size is insufficient for an ARIMA model, so none is fitted.
The core of this ITS model is a linear model. time captures the general yearly trend; post is the policy change indicator, measuring the immediate effect of the regulation change; time_post is the interaction term between time and the post indicator, assessing the change in trend after the regulation change. avg_age is included only in the second model, to control for the average age of applicants.
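The regressors described above could be built as in the sketch below. The coding is an assumption, not necessarily the one used for the fitted models: time runs from 1 (2019) to 9 (2027), 2025 is treated as the first post-change year in the yearly data, and time_post is the product of time and post. Both helper names are hypothetical.

```python
def its_design(years, change_year):
    """Build ITS regressors: time, post, and their product time_post."""
    rows = []
    for t, year in enumerate(years, start=1):
        post = 1 if year >= change_year else 0
        rows.append({"year": year, "time": t, "post": post, "time_post": t * post})
    return rows

def level_change_at(t, b_post, b_time_post):
    """Post-minus-pre gap in fitted values at time t under the product coding."""
    return b_post + b_time_post * t

design = its_design(range(2019, 2028), change_year=2025)
for row in design:
    print(row)

# With the reported univariable coefficients (post = 24.87, time_post = -11.25)
# and the ASSUMED coding above, the fitted gap at 2025 (time = 7) would be:
print(round(level_change_at(7, 24.87, -11.25), 2))  # -53.88
```

Under this product coding, the post coefficient is the level difference extrapolated to time = 0. An alternative coding, with time_post counting years since the change, would make post the level change at the change point itself.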
| | Univariable ITS | Age-adjusted ITS |
|---|---|---|
| (Intercept) | 180.36*** | 183.97** |
| | [174.76, 185.96] | [114.64, 253.30] |
| time | -0.10 | -0.12 |
| | [-1.54, 1.33] | [-1.89, 1.64] |
| post | 24.87 | 24.54 |
| | [-9.80, 59.54] | [-17.67, 66.76] |
| time_post | -11.25** | -11.24** |
| | [-15.74, -6.76] | [-16.65, -5.82] |
| avg_age | | -0.11 |
| | | [-2.19, 1.97] |

Note: * p < 0.05, ** p < 0.01, *** p < 0.001. Brackets show 95% confidence intervals.
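Example of interpretation: After the regulation change, there was an estimated increase of about 25 days in registration time, but this effect is not statistically significant since the 95% CI includes 0. The trend in registration time, however, decreased significantly, by about 11.25 days per year relative to the pre-change trend. Because time_post is defined as the product of time and post, the post coefficient is the level difference extrapolated to time = 0 rather than at the change point, so the apparent increase should be interpreted with caution.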