Linear Spline (Piecewise) Models in Stata

Mark Bounthavong

28 October 2024; updated on 01 November 2024

Introduction

Linear spline (also known as piecewise) models are useful when looking for changes at specific points in time. We can use the Stata command mkspline to create knots in the data, which can be used for piecewise linear models. These knots are placed where “shocks” or major events occur that would change the trajectory of the data.

In the example below, the figure plots the scatter of average scores per student across time. The red dashed lines indicate major events that occurred. It’s clear that these events at knot1 and knot2 changed the trajectory of the average student scores.

Linear spline example

Linear spline example

It’s up to the investigator to determine where to insert these knots in their model. However, there should be some rationale or justification to accompany this decision.

In this article, we’ll explore how we can use these mkspline command to generate knots using linear mixed effects models. I am using Stata version 18 for this tutorial.

Motivating example - Student scores across time

In this motivating example, we will use simulated data of 10 students with scores in Fall, Winter, and Spring at Grades 1, 2, 3, and 4. Hence, each student will have a total of 12 scores across time. We will assume that there were major events that could potentially change the student’s scores at time = 3 and time = 7. These are the periods where we will insert knots using the mkspline command.

Load data from GitHub repository

We will load our data from the GitHub repository Stata tutorials. From Stata, we can load data using the import delimited command. Once the data is loaded, we can explore the data using the describe command.

// SELECT DIRECTORY / LOAD DATA FROM GITHUB
clear all
import delimited "https://raw.githubusercontent.com/mbounthavong/Stata-tutorials/refs/heads/main/Data/linear_spline.csv"

// DESCRIBE DATA
describe
. // SELECT DIRECTORY / LOAD DATA FROM GITHUB

. clear all

. import delimited "https://raw.githubusercontent.com/mbounthavong/Stata-tutori
> als/refs/heads/main/Data/linear_spline.csv"
(encoding automatically selected: ISO-8859-1)
(7 vars, 120 obs)

. 
. // DESCRIBE DATA
. describe

Contains data
 Observations:           120                  
    Variables:             7                  
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
subjectid       byte    %8.0g                 
grade           byte    %8.0g                 
quarter         byte    %8.0g                 
season          str6    %9s                   
time            byte    %8.0g                 
score           byte    %8.0g                 
shocks          byte    %8.0g                 
-------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

. 

The following variables are listed:

  • subjectid: This is the unique identifier of the student

  • grade: Grade of the student

  • quarter: Quarter of the Grade year (Fall, Winter, and Spring). There is no Summer.

  • season: String data with the “Fall,” “Winter,” and “Spring” text

  • time: Numerical data that has the sequence from grade and quarter (range 1 to 12)

  • score: Dependent variable or student score at each grade/quarter

  • shocks: Events that occurred during the time period (0, 1, and 2)

Visualize data

Let’s visualize the data. Notice that the scores change at time = 3 and time = 7.

// VISUALIZE PATTERNS
egen avg_score = mean(score), by(time)

sort time
graph twoway (scatter avg_score time, msize(large) col("navy"))
Scatter plot of student scores

Scatter plot of student scores

Models

We will construct two models:

  1. Linear spline model with dummy variables

  2. Linear spline model with missing data

Model 1a - Linear spline model with dummy variables

First, we will need to create dummy variables for when a subject is in the period between time >=3 and time >= 7.

Next, we will input the knots at time = 3 and time = 7 using the mkspline command.

Then, we will run a linear mixed effects model with random effects. The standard errors will need to be clustered at the student level (subjectid). Afterwards, we will plot our results using the predicted scores of each student at each time point.

Afterwards, we will plot our linear spline model.

// GENERATE DUMMY VARIABLES
gen int1 = 0
    replace int1 = 1 if time >= 3

gen int2 = 0
    replace int2 = 1 if time >= 7
    
// MKSPLINE SETUP
mkspline knot1 3 knot2 7 knot3 = time 
             
// MKSPLINE MODEL - LINEAR REGRESSION MODEL
xtset subjectid time

xtreg score knot1 knot2 knot3 int1 int2, re vce(cluster subjectid)
predict linear1 /* Generate the predicted scores */

// PLOT
sort time
graph twoway (scatter avg_score time) ///
             (lfit linear1 time if shocks == 0, col("navy") range(1 3)) ///
             (lfit linear1 time if shocks == 1, col("purple") range(3 7)) ///
             (lfit linear1 time if shocks == 2, col("brown") range(7 12) ///
              graphregion(color(white)) ///
              ylab(, nogrid labsize(small)) ///
              ytitle("Average score per student") ///
              xtitle("Time (Quarters)") ///
              xline(3 7, lpattern(dash) lcol("cranberry")) ///
              xlab(1 3 7 12) ///
              xscale(range(1 12) noextend) ///
              legend(off))            
(100 real changes made)


(60 real changes made)



Panel variable: subjectid (strongly balanced)
 Time variable: time, 1 to 12
         Delta: 1 unit


Random-effects GLS regression                   Number of obs     =        120
Group variable: subjectid                       Number of groups  =         10

R-squared:                                      Obs per group:
     Within  = 0.0000                                         min =         12
     Between = 0.0000                                         avg =       12.0
     Overall = 0.3356                                         max =         12

                                                Wald chi2(5)      =     129.94
corr(u_i, X) = 0 (assumed)                      Prob > chi2       =     0.0000

                             (Std. err. adjusted for 10 clusters in subjectid)
------------------------------------------------------------------------------
             |               Robust
       score | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       knot1 |         .7   .8208955     0.85   0.394    -.9089256    2.308926
       knot2 |        .56   1.149799     0.49   0.626    -1.693564    2.813564
       knot3 |   .3714286   .2574796     1.44   0.149    -.1332222    .8760793
        int1 |       7.86   2.120065     3.71   0.000      3.70475    12.01525
        int2 |  -7.428571    2.88804    -2.57   0.010    -13.08903   -1.768116
       _cons |        1.6   1.220344     1.31   0.190    -.7918305    3.991831
-------------+----------------------------------------------------------------
     sigma_u |  2.5772655
     sigma_e |  4.2705735
         rho |  .26697235   (fraction of variance due to u_i)
------------------------------------------------------------------------------

(option xb assumed; fitted values)

The Stata output for Model 1 provides the slopes of each fitted line in the segments. The following are relevant coefficients from the model.

  • knot1 denotes the slope when time < 3 (0.70 points; 95% CI: -0.91, 2.31)

  • knot2 denotes the slope when time >= 3 and time < 7 (0.56 points; 95% CI: -1.69, 2.81)

  • knot3 denotes the slope when time >= 7 (0.37 points; 95% CI: -0.13, 0.88)

  • int1 denotes the level change at time = 3 (7.86; 95% CI: 3.70, 12.02)

  • int2 denotes the level change at time = 7 (-7.43; 95% CI: -13.09, -1.77)

We can see these much more clearly with the figure that was generated. I have annotated these so that you can see where the regression output values are on the figures.

In Figure 1, we can see the slopes within the segments and the level changes at each knots.

Figure 1: Linear spline with dummy variables

Figure 1: Linear spline with dummy variables

We can identify the values that were used to estimate the level changes at each knot using the margins command. The showcoding command allows us to see when the relationships between the knots and int terms. This will help up code the margins command more efficiently.

// MARGINS TO ESTIMATE CHANGES AT THE LEVELS
showcoding time knot1 knot2 knot3 int1 int2

margins, at(knot1 = 1 knot2 = 0 knot3 = 0 int1 = 0 int2 = 0) ///
         at(knot1 = 3 knot2 = 0 knot3 = 0 int1 = 0 int2 = 0) ///
         at(knot1 = 3 knot2 = 0 knot3 = 0 int1 = 1 int2 = 0) ///
         at(knot1 = 3 knot2 = 4 knot3 = 0 int1 = 1 int2 = 0) ///
         at(knot1 = 3 knot2 = 4 knot3 = 0 int1 = 1 int2 = 1) ///
         at(knot1 = 3 knot2 = 4 knot3 = 5 int1 = 1 int2 = 1)
  | time   knot1   knot2   knot3   int1   int2 |
  |--------------------------------------------|
  |    1       1       0       0      0      0 |
  |    2       2       0       0      0      0 |
  |    3       3       0       0      1      0 |
  |    4       3       1       0      1      0 |
  |    5       3       2       0      1      0 |
  |    6       3       3       0      1      0 |
  |    7       3       4       0      1      1 |
  |    8       3       4       1      1      1 |
  |    9       3       4       2      1      1 |
  |   10       3       4       3      1      1 |
  |   11       3       4       4      1      1 |
  |   12       3       4       5      1      1 |
  +--------------------------------------------+


Adjusted predictions                                       Number of obs = 120
Model VCE: Robust

Expression: Linear prediction, predict()
1._at: knot1 = 1
       knot2 = 0
       knot3 = 0
       int1  = 0
       int2  = 0
2._at: knot1 = 3
       knot2 = 0
       knot3 = 0
       int1  = 0
       int2  = 0
3._at: knot1 = 3
       knot2 = 0
       knot3 = 0
       int1  = 1
       int2  = 0
4._at: knot1 = 3
       knot2 = 4
       knot3 = 0
       int1  = 1
       int2  = 0
5._at: knot1 = 3
       knot2 = 4
       knot3 = 0
       int1  = 1
       int2  = 1
6._at: knot1 = 3
       knot2 = 4
       knot3 = 5
       int1  = 1
       int2  = 1

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         _at |
          1  |        2.3   .4321283     5.32   0.000     1.453044    3.146956
          2  |        3.7   1.274732     2.90   0.004     1.201571    6.198429
          3  |      11.56   1.698678     6.81   0.000     8.230653    14.88935
          4  |       13.8   3.329185     4.15   0.000     7.274918    20.32508
          5  |   6.371429    .986299     6.46   0.000     4.438318    8.304539
          6  |   8.228571   1.897541     4.34   0.000      4.50946    11.94768
------------------------------------------------------------------------------

In Figure 2, we can how the level changes were calculated. The results from the margins command yields the scores before and immediately after the knots.

Figure 2: Linear spline model - results from the `margins` output

Figure 2: Linear spline model - results from the margins output

Next, we can use the lincom command to get the difference in slopes.

We can estimate the differences between the slopes for the following:

  • time >=3 and time < 3

  • time >=7 and time < 7

The difference in the slopes between knot2 and knot1 is 0.14. The difference in the slopes between knot3 and knot2 is -0.19.

// RUN REGRESSION
qui xtreg score knot1 knot2 knot3 int1 int2, re vce(cluster subjectid)

// LINCOM COMMANDS
lincom knot2 - knot1 /* the difference in the slopes between `time >= 3` and `time < 3` */

lincom knot3 - knot2 /* the difference in the slopes between `time >= 7` and `time < 7` */
 ( 1)  - knot1 + knot2 = 0

------------------------------------------------------------------------------
       score | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         (1) |       -.14   1.591294    -0.09   0.930     -3.25888     2.97888
------------------------------------------------------------------------------


 ( 1)  - knot2 + knot3 = 0

------------------------------------------------------------------------------
       score | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         (1) |  -.1885714   1.169694    -0.16   0.872     -2.48113    2.103987
------------------------------------------------------------------------------

Model 1b - Linear spline model with dummy variables using the marginal option

We can also do this using the marginal option in the mkspline command. The coefficients are the same as the results from the lincom command.

// SELECT DIRECTORY / LOAD DATA FROM GITHUB
clear all
import delimited "https://raw.githubusercontent.com/mbounthavong/Stata-tutorials/refs/heads/main/Data/linear_spline.csv"

// GENERATE DUMMY VARIABLES
gen int1 = 0
    replace int1 = 1 if time >= 3

gen int2 = 0
    replace int2 = 1 if time >= 7
    
// MKSPLINE SETUP
mkspline knot1 3 knot2 7 knot3 = time, marginal
             
// MKSPLINE MODEL - LINEAR REGRESSION MODEL
xtset subjectid time

xtreg score knot1 knot2 knot3 int1 int2, re vce(cluster subjectid)
(encoding automatically selected: ISO-8859-1)
(7 vars, 120 obs)


(100 real changes made)


(60 real changes made)



Panel variable: subjectid (strongly balanced)
 Time variable: time, 1 to 12
         Delta: 1 unit


Random-effects GLS regression                   Number of obs     =        120
Group variable: subjectid                       Number of groups  =         10

R-squared:                                      Obs per group:
     Within  = 0.0000                                         min =         12
     Between = 0.0000                                         avg =       12.0
     Overall = 0.3356                                         max =         12

                                                Wald chi2(5)      =     129.94
corr(u_i, X) = 0 (assumed)                      Prob > chi2       =     0.0000

                             (Std. err. adjusted for 10 clusters in subjectid)
------------------------------------------------------------------------------
             |               Robust
       score | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       knot1 |         .7   .8208955     0.85   0.394    -.9089256    2.308926
       knot2 |       -.14   1.591294    -0.09   0.930     -3.25888     2.97888
       knot3 |  -.1885714   1.169694    -0.16   0.872     -2.48113    2.103987
        int1 |       7.86   2.120065     3.71   0.000      3.70475    12.01525
        int2 |  -7.428571    2.88804    -2.57   0.010    -13.08903   -1.768116
       _cons |        1.6   1.220344     1.31   0.190    -.7918305    3.991831
-------------+----------------------------------------------------------------
     sigma_u |  2.5772655
     sigma_e |  4.2705735
         rho |  .26697235   (fraction of variance due to u_i)
------------------------------------------------------------------------------

Notice that the coefficients for knot2 and knot3 are the same as the results from the lincom command above. The marginal option provides the difference in the slopes between the segments. Whereas the spline model without using marginal option yields the slopes within each segment.

Model 1c - Linear spline model with interaction terms

We can also perform this analysis using interaction terms instead of the mkspline command.

// MODEL 1a: LINEAR MIXED EFFECT MODEL USING INTERACTION TERMS      

** Using interaction terms instead of linear splines
xtreg score c.time c.int1 c.int2 c.time#c.int1 c.time#c.int2, re vce(cluster subjectid)
              
**** Slope for the period between knot1 and knot2:
lincom time + c.time#c.int1

**** Slope for the period after knot2?
lincom time + c.time#c.int1 + c.time#c.int2

**** Level change at knot1:
margins, dydx(int1) at(time = 3)

**** Level change at knot2:
margins, dydx(int2) at(time = 7)    
Random-effects GLS regression                   Number of obs     =        120
Group variable: subjectid                       Number of groups  =         10

R-squared:                                      Obs per group:
     Within  = 0.0000                                         min =         12
     Between = 0.0000                                         avg =       12.0
     Overall = 0.3356                                         max =         12

                                                Wald chi2(5)      =     129.94
corr(u_i, X) = 0 (assumed)                      Prob > chi2       =     0.0000

                             (Std. err. adjusted for 10 clusters in subjectid)
------------------------------------------------------------------------------
             |               Robust
       score | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        time |         .7   .8208955     0.85   0.394    -.9089256    2.308926
        int1 |       8.28   5.394375     1.53   0.125    -2.292781    18.85278
        int2 |  -6.108571   5.571987    -1.10   0.273    -17.02947    4.812323
             |
      c.time#|
      c.int1 |       -.14   1.591294    -0.09   0.930     -3.25888     2.97888
             |
      c.time#|
      c.int2 |  -.1885714   1.169694    -0.16   0.872     -2.48113    2.103987
             |
       _cons |        1.6   1.220344     1.31   0.190    -.7918305    3.991831
-------------+----------------------------------------------------------------
     sigma_u |  2.5772655
     sigma_e |  4.2705735
         rho |  .26697235   (fraction of variance due to u_i)
------------------------------------------------------------------------------


 ( 1)  time + c.time#c.int1 = 0

------------------------------------------------------------------------------
       score | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         (1) |        .56   1.149799     0.49   0.626    -1.693564    2.813564
------------------------------------------------------------------------------


 ( 1)  time + c.time#c.int1 + c.time#c.int2 = 0

------------------------------------------------------------------------------
       score | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         (1) |   .3714286   .2574796     1.44   0.149    -.1332222    .8760793
------------------------------------------------------------------------------


Average marginal effects                                   Number of obs = 120
Model VCE: Robust

Expression: Linear prediction, predict()
dy/dx wrt:  int1
At: time = 3

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        int1 |       7.86   2.120065     3.71   0.000      3.70475    12.01525
------------------------------------------------------------------------------


Average marginal effects                                   Number of obs = 120
Model VCE: Robust

Expression: Linear prediction, predict()
dy/dx wrt:  int2
At: time = 7

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        int2 |  -7.428571    2.88804    -2.57   0.010    -13.08903   -1.768116
------------------------------------------------------------------------------

Model 2 - Linear spline model with missing data

We will assume that there is missing data at the Fall quarter during Grade 1. We will replace the original score value with a missing value .. Once that has been done, we will repeat the procedure from Model 1.

// MISSING DATA EXAMPLE

// SELECT DIRECTORY / LOAD DATA FROM GITHUB
clear all
import delimited "https://raw.githubusercontent.com/mbounthavong/Stata-tutorials/refs/heads/main/Data/linear_spline.csv"

// CREATE MISSING DATA
replace score = . if grade == 1 & time == 3 

// AVERAGE SCORE
egen avg_score = mean(score), by(time)

// GENERATE DUMMY VARIABLES
gen int1 = 0
    replace int1 = 1 if time >= 3

gen int2 = 0
    replace int2 = 1 if time >= 7
    
// MKSPLINE SETUP
mkspline knot1 3 knot2 7 knot3 = time 
     
// MKSPLINE MODEL - LINEAR MIXED EFFECTS MODEL
xtset subjectid time

xtreg score knot1 knot2 knot3 int1 int2, re vce(cluster subjectid)
predict linear2 /* Generate the predicted scores */

// PLOT
sort time
graph twoway (scatter avg_score time) ///
             (lfit linear2 time if shocks == 0, col("navy") range(1 3)) ///
             (lfit linear2 time if shocks == 1, col("purple") range(3 7)) ///
             (lfit linear2 time if shocks == 2, col("brown") range(7 12) ///
              graphregion(color(white)) ///
              ylab(, nogrid labsize(small)) ///
              ytitle("Average score per student") ///
              xtitle("Time (Quarters)") ///
              xline(3 7, lpattern(dash) lcol("cranberry")) ///
              xlab(1 3 7 12) ///
              xscale(range(1 12) noextend) ///
              legend(off))
(encoding automatically selected: ISO-8859-1)
(7 vars, 120 obs)

(10 real changes made, 10 to missing)

(10 missing values generated)


(100 real changes made)


(60 real changes made)



Panel variable: subjectid (strongly balanced)
 Time variable: time, 1 to 12
         Delta: 1 unit


Random-effects GLS regression                   Number of obs     =        110
Group variable: subjectid                       Number of groups  =         10

R-squared:                                      Obs per group:
     Within  = 0.0000                                         min =         11
     Between = 0.0000                                         avg =       11.0
     Overall = 0.3593                                         max =         11

                                                Wald chi2(5)      =      91.01
corr(u_i, X) = 0 (assumed)                      Prob > chi2       =     0.0000

                             (Std. err. adjusted for 10 clusters in subjectid)
------------------------------------------------------------------------------
             |               Robust
       score | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       knot1 |         .7   .8225521     0.85   0.395    -.9121724    2.312172
       knot2 |        -.2   1.413624    -0.14   0.887    -2.970652    2.570652
       knot3 |   .3714286   .2579992     1.44   0.150    -.1342405    .8770977
        int1 |   9.633333   3.080292     3.13   0.002     3.596071     15.6706
        int2 |  -6.161905   3.089641    -1.99   0.046    -12.21749   -.1063197
       _cons |        1.6   1.222807     1.31   0.191    -.7966572    3.996657
-------------+----------------------------------------------------------------
     sigma_u |  2.8015938
     sigma_e |  3.9790073
         rho |  .33143813   (fraction of variance due to u_i)
------------------------------------------------------------------------------

(option xb assumed; fitted values)

Similar to Model 1, the Stata output for Model 2 provides the slopes of each fitted line in the segments. The following are relevant coefficients from the model.

  • knot1 denotes the slope when time < 3 (0.70 points; 95% CI: -0.91, 2.31)

  • knot2 denotes the slope when time >= 3 and time < 7 (-0.2 points; 95% CI: -2.97, 2.57)

  • knot3 denotes the slope when time >= 7 (0.37 points; 95% CI: -0.13, 0.88)

  • int1 denotes the level change at time = 3 (0.96; 95% CI: 3.60, 15.67)

  • int2 denotes the level change at time = 7 (-6.16; 95% CI: -12.22, -0.11)

Similar with Model 1, I have annotated these so that you can see where the regression output values are on the figures. Notice that the slopes are different with the missing data.

In Figure 3, we can see the slope within the segments between time > 3 and time <= 7 is no longer positive. Removing data from time = 3 changed the slope from positive to negative. This provides some idea of how missing data can influence the study results.

Figure 3: Linear spline model with missing data at `time = 3`

Figure 3: Linear spline model with missing data at time = 3

Conclusions

We can use linear spline models with more than one knot to measure the slope and level changes by using dummy variables for the segments. The mkspline command makes it easy to generate the necessary variables to capture the slopes and the dummy variables allow us to measure the level changes. The margins command can provide us with further details about the intercepts at each of the knots, which we can use to verify the regression coefficients from the models.

Acknowledgements

Stata has a basic document on using mkpline here

I also found the tutorial on the UCLA Advanced Research Computing site very helpful.

Michael N. Mitchell has an excellent book that provides numerous examples to perform linear spline models. I used many of the codes in this book to develop this tutorial. You can find his book Interpreting and Visualizing Regression Models Using Stata, Second Edition in the Stata Press site.

Disclaimers

This is a work in progress, and I may update this in the future.

This is for educational purposes only.