Introduction
Linear spline (also known as piecewise) models are useful when
looking for changes at specific points in time. We can use the
Stata
command mkspline
to create knots in the
data, which can be used for piecewise linear models. These knots are
placed where “shocks” or major events occur that would change the
trajectory of the data.
In the example below, the figure plots the scatter of average scores
per student across time. The red dashed lines indicate major events that
occurred. It’s clear that these events at knot1
and
knot2
changed the trajectory of the average student
scores.
Linear spline example
It’s up to the investigator to determine where to insert these
knots
in their model. However, there should be some
rationale or justification to accompany this decision.
In this article, we’ll explore how we can use these
mkspline
command to generate knots
using
linear mixed effects models. I am using Stata
version 18
for this tutorial.
Motivating example - Student scores across time
In this motivating example, we will use simulated data of 10 students
with scores in Fall, Winter, and Spring at Grades 1, 2, 3, and 4. Hence,
each student will have a total of 12 scores across time. We will assume
that there were major events that could potentially change the student’s
scores at time = 3 and time = 7. These are the periods where we will
insert knots
using the mkspline
command.
Load data from GitHub repository
We will load our data from the GitHub repository Stata
tutorials. From Stata
, we can load data using the
import delimited
command. Once the data is loaded, we can
explore the data using the describe
command.
// SELECT DIRECTORY / LOAD DATA FROM GITHUB
clear all
import delimited "https://raw.githubusercontent.com/mbounthavong/Stata-tutorials/refs/heads/main/Data/linear_spline.csv"
// DESCRIBE DATA
describe
. // SELECT DIRECTORY / LOAD DATA FROM GITHUB
. clear all
. import delimited "https://raw.githubusercontent.com/mbounthavong/Stata-tutori
> als/refs/heads/main/Data/linear_spline.csv"
(encoding automatically selected: ISO-8859-1)
(7 vars, 120 obs)
.
. // DESCRIBE DATA
. describe
Contains data
Observations: 120
Variables: 7
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
subjectid byte %8.0g
grade byte %8.0g
quarter byte %8.0g
season str6 %9s
time byte %8.0g
score byte %8.0g
shocks byte %8.0g
-------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
.
The following variables are listed:
subjectid: This is the unique identifier of the student
grade: Grade of the student
quarter: Quarter of the Grade year (Fall, Winter, and Spring). There is no Summer.
season: String data with the “Fall,” “Winter,” and “Spring” text
time: Numerical data that has the sequence from grade and quarter (range 1 to 12)
score: Dependent variable or student score at each grade/quarter
shocks: Events that occurred during the time period (0, 1, and 2)
Visualize data
Let’s visualize the data. Notice that the scores change at
time = 3
and time = 7
.
// VISUALIZE PATTERNS
egen avg_score = mean(score), by(time)
sort time
graph twoway (scatter avg_score time, msize(large) col("navy"))
Scatter plot of student scores
Models
We will construct two models:
Linear spline model with dummy variables
Linear spline model with missing data
Model 1a - Linear spline model with dummy variables
First, we will need to create dummy variables for when a subject is
in the period between time >=3
and
time >= 7
.
Next, we will input the knots
at time = 3
and time = 7
using the mkspline
command.
Then, we will run a linear mixed effects model with random effects.
The standard errors will need to be clustered at the student level
(subjectid
). Afterwards, we will plot our results using the
predicted scores of each student at each time point.
Afterwards, we will plot our linear spline model.
// GENERATE DUMMY VARIABLES
gen int1 = 0
replace int1 = 1 if time >= 3
gen int2 = 0
replace int2 = 1 if time >= 7
// MKSPLINE SETUP
mkspline knot1 3 knot2 7 knot3 = time
// MKSPLINE MODEL - LINEAR REGRESSION MODEL
xtset subjectid time
xtreg score knot1 knot2 knot3 int1 int2, re vce(cluster subjectid)
predict linear1 /* Generate the predicted scores */
// PLOT
sort time
graph twoway (scatter avg_score time) ///
(lfit linear1 time if shocks == 0, col("navy") range(1 3)) ///
(lfit linear1 time if shocks == 1, col("purple") range(3 7)) ///
(lfit linear1 time if shocks == 2, col("brown") range(7 12) ///
graphregion(color(white)) ///
ylab(, nogrid labsize(small)) ///
ytitle("Average score per student") ///
xtitle("Time (Quarters)") ///
xline(3 7, lpattern(dash) lcol("cranberry")) ///
xlab(1 3 7 12) ///
xscale(range(1 12) noextend) ///
legend(off))
(100 real changes made)
(60 real changes made)
Panel variable: subjectid (strongly balanced)
Time variable: time, 1 to 12
Delta: 1 unit
Random-effects GLS regression Number of obs = 120
Group variable: subjectid Number of groups = 10
R-squared: Obs per group:
Within = 0.0000 min = 12
Between = 0.0000 avg = 12.0
Overall = 0.3356 max = 12
Wald chi2(5) = 129.94
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000
(Std. err. adjusted for 10 clusters in subjectid)
------------------------------------------------------------------------------
| Robust
score | Coefficient std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
knot1 | .7 .8208955 0.85 0.394 -.9089256 2.308926
knot2 | .56 1.149799 0.49 0.626 -1.693564 2.813564
knot3 | .3714286 .2574796 1.44 0.149 -.1332222 .8760793
int1 | 7.86 2.120065 3.71 0.000 3.70475 12.01525
int2 | -7.428571 2.88804 -2.57 0.010 -13.08903 -1.768116
_cons | 1.6 1.220344 1.31 0.190 -.7918305 3.991831
-------------+----------------------------------------------------------------
sigma_u | 2.5772655
sigma_e | 4.2705735
rho | .26697235 (fraction of variance due to u_i)
------------------------------------------------------------------------------
(option xb assumed; fitted values)
The Stata
output for Model 1 provides the slopes of each
fitted line in the segments. The following are relevant coefficients
from the model.
knot1
denotes the slope whentime < 3
(0.70 points; 95% CI: -0.91, 2.31)knot2
denotes the slope whentime >= 3
andtime < 7
(0.56 points; 95% CI: -1.69, 2.81)knot3
denotes the slope whentime >= 7
(0.37 points; 95% CI: -0.13, 0.88)int1
denotes the level change attime = 3
(7.86; 95% CI: 3.70, 12.02)int2
denotes the level change attime = 7
(-7.43; 95% CI: -13.09, -1.77)
We can see these much more clearly with the figure that was generated. I have annotated these so that you can see where the regression output values are on the figures.
In Figure 1, we can see the slopes within the segments and the level changes at each knots.
Figure 1: Linear spline with dummy variables
We can identify the values that were used to estimate the level
changes at each knot using the margins
command. The
showcoding
command allows us to see when the relationships
between the knots
and int
terms. This will
help up code the margins
command more efficiently.
// MARGINS TO ESTIMATE CHANGES AT THE LEVELS
showcoding time knot1 knot2 knot3 int1 int2
margins, at(knot1 = 1 knot2 = 0 knot3 = 0 int1 = 0 int2 = 0) ///
at(knot1 = 3 knot2 = 0 knot3 = 0 int1 = 0 int2 = 0) ///
at(knot1 = 3 knot2 = 0 knot3 = 0 int1 = 1 int2 = 0) ///
at(knot1 = 3 knot2 = 4 knot3 = 0 int1 = 1 int2 = 0) ///
at(knot1 = 3 knot2 = 4 knot3 = 0 int1 = 1 int2 = 1) ///
at(knot1 = 3 knot2 = 4 knot3 = 5 int1 = 1 int2 = 1)
| time knot1 knot2 knot3 int1 int2 |
|--------------------------------------------|
| 1 1 0 0 0 0 |
| 2 2 0 0 0 0 |
| 3 3 0 0 1 0 |
| 4 3 1 0 1 0 |
| 5 3 2 0 1 0 |
| 6 3 3 0 1 0 |
| 7 3 4 0 1 1 |
| 8 3 4 1 1 1 |
| 9 3 4 2 1 1 |
| 10 3 4 3 1 1 |
| 11 3 4 4 1 1 |
| 12 3 4 5 1 1 |
+--------------------------------------------+
Adjusted predictions Number of obs = 120
Model VCE: Robust
Expression: Linear prediction, predict()
1._at: knot1 = 1
knot2 = 0
knot3 = 0
int1 = 0
int2 = 0
2._at: knot1 = 3
knot2 = 0
knot3 = 0
int1 = 0
int2 = 0
3._at: knot1 = 3
knot2 = 0
knot3 = 0
int1 = 1
int2 = 0
4._at: knot1 = 3
knot2 = 4
knot3 = 0
int1 = 1
int2 = 0
5._at: knot1 = 3
knot2 = 4
knot3 = 0
int1 = 1
int2 = 1
6._at: knot1 = 3
knot2 = 4
knot3 = 5
int1 = 1
int2 = 1
------------------------------------------------------------------------------
| Delta-method
| Margin std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
_at |
1 | 2.3 .4321283 5.32 0.000 1.453044 3.146956
2 | 3.7 1.274732 2.90 0.004 1.201571 6.198429
3 | 11.56 1.698678 6.81 0.000 8.230653 14.88935
4 | 13.8 3.329185 4.15 0.000 7.274918 20.32508
5 | 6.371429 .986299 6.46 0.000 4.438318 8.304539
6 | 8.228571 1.897541 4.34 0.000 4.50946 11.94768
------------------------------------------------------------------------------
In Figure 2, we can how the level changes were calculated. The
results from the margins
command yields the scores before
and immediately after the knots
.
Figure 2: Linear spline model - results from the margins
output
Next, we can use the lincom
command to get the
difference in slopes.
We can estimate the differences between the slopes for the following:
time >=3
andtime < 3
time >=7
andtime < 7
The difference in the slopes between knot2
and
knot1
is 0.14. The difference in the slopes between
knot3
and knot2
is -0.19.
// RUN REGRESSION
qui xtreg score knot1 knot2 knot3 int1 int2, re vce(cluster subjectid)
// LINCOM COMMANDS
lincom knot2 - knot1 /* the difference in the slopes between `time >= 3` and `time < 3` */
lincom knot3 - knot2 /* the difference in the slopes between `time >= 7` and `time < 7` */
( 1) - knot1 + knot2 = 0
------------------------------------------------------------------------------
score | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
(1) | -.14 1.591294 -0.09 0.930 -3.25888 2.97888
------------------------------------------------------------------------------
( 1) - knot2 + knot3 = 0
------------------------------------------------------------------------------
score | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
(1) | -.1885714 1.169694 -0.16 0.872 -2.48113 2.103987
------------------------------------------------------------------------------
Model 1b - Linear spline model with dummy variables using the
marginal
option
We can also do this using the marginal
option in the
mkspline
command. The coefficients are the same as the
results from the lincom
command.
// SELECT DIRECTORY / LOAD DATA FROM GITHUB
clear all
import delimited "https://raw.githubusercontent.com/mbounthavong/Stata-tutorials/refs/heads/main/Data/linear_spline.csv"
// GENERATE DUMMY VARIABLES
gen int1 = 0
replace int1 = 1 if time >= 3
gen int2 = 0
replace int2 = 1 if time >= 7
// MKSPLINE SETUP
mkspline knot1 3 knot2 7 knot3 = time, marginal
// MKSPLINE MODEL - LINEAR REGRESSION MODEL
xtset subjectid time
xtreg score knot1 knot2 knot3 int1 int2, re vce(cluster subjectid)
(encoding automatically selected: ISO-8859-1)
(7 vars, 120 obs)
(100 real changes made)
(60 real changes made)
Panel variable: subjectid (strongly balanced)
Time variable: time, 1 to 12
Delta: 1 unit
Random-effects GLS regression Number of obs = 120
Group variable: subjectid Number of groups = 10
R-squared: Obs per group:
Within = 0.0000 min = 12
Between = 0.0000 avg = 12.0
Overall = 0.3356 max = 12
Wald chi2(5) = 129.94
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000
(Std. err. adjusted for 10 clusters in subjectid)
------------------------------------------------------------------------------
| Robust
score | Coefficient std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
knot1 | .7 .8208955 0.85 0.394 -.9089256 2.308926
knot2 | -.14 1.591294 -0.09 0.930 -3.25888 2.97888
knot3 | -.1885714 1.169694 -0.16 0.872 -2.48113 2.103987
int1 | 7.86 2.120065 3.71 0.000 3.70475 12.01525
int2 | -7.428571 2.88804 -2.57 0.010 -13.08903 -1.768116
_cons | 1.6 1.220344 1.31 0.190 -.7918305 3.991831
-------------+----------------------------------------------------------------
sigma_u | 2.5772655
sigma_e | 4.2705735
rho | .26697235 (fraction of variance due to u_i)
------------------------------------------------------------------------------
Notice that the coefficients for knot2
and
knot3
are the same as the results from the
lincom
command above. The marginal
option
provides the difference in the slopes between the segments. Whereas the
spline model without using marginal
option yields the
slopes within each segment.
Model 1c - Linear spline model with interaction terms
We can also perform this analysis using interaction terms instead of
the mkspline
command.
// MODEL 1a: LINEAR MIXED EFFECT MODEL USING INTERACTION TERMS
** Using interaction terms instead of linear splines
xtreg score c.time c.int1 c.int2 c.time#c.int1 c.time#c.int2, re vce(cluster subjectid)
**** Slope for the period between knot1 and knot2:
lincom time + c.time#c.int1
**** Slope for the period after knot2?
lincom time + c.time#c.int1 + c.time#c.int2
**** Level change at knot1:
margins, dydx(int1) at(time = 3)
**** Level change at knot2:
margins, dydx(int2) at(time = 7)
Random-effects GLS regression Number of obs = 120
Group variable: subjectid Number of groups = 10
R-squared: Obs per group:
Within = 0.0000 min = 12
Between = 0.0000 avg = 12.0
Overall = 0.3356 max = 12
Wald chi2(5) = 129.94
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000
(Std. err. adjusted for 10 clusters in subjectid)
------------------------------------------------------------------------------
| Robust
score | Coefficient std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
time | .7 .8208955 0.85 0.394 -.9089256 2.308926
int1 | 8.28 5.394375 1.53 0.125 -2.292781 18.85278
int2 | -6.108571 5.571987 -1.10 0.273 -17.02947 4.812323
|
c.time#|
c.int1 | -.14 1.591294 -0.09 0.930 -3.25888 2.97888
|
c.time#|
c.int2 | -.1885714 1.169694 -0.16 0.872 -2.48113 2.103987
|
_cons | 1.6 1.220344 1.31 0.190 -.7918305 3.991831
-------------+----------------------------------------------------------------
sigma_u | 2.5772655
sigma_e | 4.2705735
rho | .26697235 (fraction of variance due to u_i)
------------------------------------------------------------------------------
( 1) time + c.time#c.int1 = 0
------------------------------------------------------------------------------
score | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
(1) | .56 1.149799 0.49 0.626 -1.693564 2.813564
------------------------------------------------------------------------------
( 1) time + c.time#c.int1 + c.time#c.int2 = 0
------------------------------------------------------------------------------
score | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
(1) | .3714286 .2574796 1.44 0.149 -.1332222 .8760793
------------------------------------------------------------------------------
Average marginal effects Number of obs = 120
Model VCE: Robust
Expression: Linear prediction, predict()
dy/dx wrt: int1
At: time = 3
------------------------------------------------------------------------------
| Delta-method
| dy/dx std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
int1 | 7.86 2.120065 3.71 0.000 3.70475 12.01525
------------------------------------------------------------------------------
Average marginal effects Number of obs = 120
Model VCE: Robust
Expression: Linear prediction, predict()
dy/dx wrt: int2
At: time = 7
------------------------------------------------------------------------------
| Delta-method
| dy/dx std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
int2 | -7.428571 2.88804 -2.57 0.010 -13.08903 -1.768116
------------------------------------------------------------------------------
Model 2 - Linear spline model with missing data
We will assume that there is missing data at the Fall quarter during
Grade 1. We will replace the original score value with a missing value
.
. Once that has been done, we will repeat the procedure
from Model 1.
// MISSING DATA EXAMPLE
// SELECT DIRECTORY / LOAD DATA FROM GITHUB
clear all
import delimited "https://raw.githubusercontent.com/mbounthavong/Stata-tutorials/refs/heads/main/Data/linear_spline.csv"
// CREATE MISSING DATA
replace score = . if grade == 1 & time == 3
// AVERAGE SCORE
egen avg_score = mean(score), by(time)
// GENERATE DUMMY VARIABLES
gen int1 = 0
replace int1 = 1 if time >= 3
gen int2 = 0
replace int2 = 1 if time >= 7
// MKSPLINE SETUP
mkspline knot1 3 knot2 7 knot3 = time
// MKSPLINE MODEL - LINEAR MIXED EFFECTS MODEL
xtset subjectid time
xtreg score knot1 knot2 knot3 int1 int2, re vce(cluster subjectid)
predict linear2 /* Generate the predicted scores */
// PLOT
sort time
graph twoway (scatter avg_score time) ///
(lfit linear2 time if shocks == 0, col("navy") range(1 3)) ///
(lfit linear2 time if shocks == 1, col("purple") range(3 7)) ///
(lfit linear2 time if shocks == 2, col("brown") range(7 12) ///
graphregion(color(white)) ///
ylab(, nogrid labsize(small)) ///
ytitle("Average score per student") ///
xtitle("Time (Quarters)") ///
xline(3 7, lpattern(dash) lcol("cranberry")) ///
xlab(1 3 7 12) ///
xscale(range(1 12) noextend) ///
legend(off))
(encoding automatically selected: ISO-8859-1)
(7 vars, 120 obs)
(10 real changes made, 10 to missing)
(10 missing values generated)
(100 real changes made)
(60 real changes made)
Panel variable: subjectid (strongly balanced)
Time variable: time, 1 to 12
Delta: 1 unit
Random-effects GLS regression Number of obs = 110
Group variable: subjectid Number of groups = 10
R-squared: Obs per group:
Within = 0.0000 min = 11
Between = 0.0000 avg = 11.0
Overall = 0.3593 max = 11
Wald chi2(5) = 91.01
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000
(Std. err. adjusted for 10 clusters in subjectid)
------------------------------------------------------------------------------
| Robust
score | Coefficient std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
knot1 | .7 .8225521 0.85 0.395 -.9121724 2.312172
knot2 | -.2 1.413624 -0.14 0.887 -2.970652 2.570652
knot3 | .3714286 .2579992 1.44 0.150 -.1342405 .8770977
int1 | 9.633333 3.080292 3.13 0.002 3.596071 15.6706
int2 | -6.161905 3.089641 -1.99 0.046 -12.21749 -.1063197
_cons | 1.6 1.222807 1.31 0.191 -.7966572 3.996657
-------------+----------------------------------------------------------------
sigma_u | 2.8015938
sigma_e | 3.9790073
rho | .33143813 (fraction of variance due to u_i)
------------------------------------------------------------------------------
(option xb assumed; fitted values)
Similar to Model 1, the Stata
output for Model 2
provides the slopes of each fitted line in the segments. The following
are relevant coefficients from the model.
knot1
denotes the slope whentime < 3
(0.70 points; 95% CI: -0.91, 2.31)knot2
denotes the slope whentime >= 3
andtime < 7
(-0.2 points; 95% CI: -2.97, 2.57)knot3
denotes the slope whentime >= 7
(0.37 points; 95% CI: -0.13, 0.88)int1
denotes the level change attime = 3
(0.96; 95% CI: 3.60, 15.67)int2
denotes the level change attime = 7
(-6.16; 95% CI: -12.22, -0.11)
Similar with Model 1, I have annotated these so that you can see where the regression output values are on the figures. Notice that the slopes are different with the missing data.
In Figure 3, we can see the slope within the segments between
time > 3
and time <= 7
is no longer
positive. Removing data from time = 3
changed the slope
from positive to negative. This provides some idea of how missing data
can influence the study results.
Figure 3: Linear spline model with missing data at time = 3
Conclusions
We can use linear spline models with more than one knot to measure
the slope and level changes by using dummy variables for the segments.
The mkspline
command makes it easy to generate the
necessary variables to capture the slopes and the dummy variables allow
us to measure the level changes. The margins
command can
provide us with further details about the intercepts at each of the
knots, which we can use to verify the regression coefficients from the
models.
Acknowledgements
Stata
has a basic document on using mkpline
here
I also found the tutorial on the UCLA Advanced Research Computing site very helpful.
Michael N. Mitchell has an excellent book that provides numerous examples to perform linear spline models. I used many of the codes in this book to develop this tutorial. You can find his book Interpreting and Visualizing Regression Models Using Stata, Second Edition in the Stata Press site.
Disclaimers
This is a work in progress, and I may update this in the future.
This is for educational purposes only.