Nonlinearity

##       Year      Apprehensions      President             Party       
##  Min.   :1971   Min.   : 420126   Length:49          Min.   :0.0000  
##  1st Qu.:1983   1st Qu.: 796587   Class :character   1st Qu.:0.0000  
##  Median :1995   Median :1057977   Mode  :character   Median :0.0000  
##  Mean   :1995   Mean   :1075302                      Mean   :0.4082  
##  3rd Qu.:2007   3rd Qu.:1258481                      3rd Qu.:1.0000  
##  Max.   :2019   Max.   :1814729                      Max.   :1.0000  
##                                                                      
##      PCGdp           Decade      Deportations          VR         
##  Min.   : 5609   Min.   :1970   Min.   : 15216   Min.   : 100452  
##  1st Qu.:15544   1st Qu.:1980   1st Qu.: 24592   1st Qu.: 568005  
##  Median :28691   Median :1990   Median : 50924   Median : 911790  
##  Mean   :31116   Mean   :1990   Mean   :152115   Mean   : 851657  
##  3rd Qu.:47195   3rd Qu.:2000   3rd Qu.:284365   3rd Qu.:1091203  
##  Max.   :65548   Max.   :2010   Max.   :432334   Max.   :1675876  
##                                                                   
##  Administrative  EnforcementReturns    Criminal       Noncriminal    
##  Min.   :15072   Min.   : 76137     Min.   :108519   Min.   :175846  
##  1st Qu.:43972   1st Qu.: 85890     1st Qu.:122815   1st Qu.:179405  
##  Median :47361   Median :118170     Median :169898   Median :201613  
##  Mean   :51914   Mean   :186082     Mean   :158533   Mean   :203616  
##  3rd Qu.:61396   3rd Qu.:222446     3rd Qu.:189702   3rd Qu.:215597  
##  Max.   :89719   Max.   :523153     Max.   :200039   Max.   :233846  
##  NA's   :38      NA's   :38         NA's   :40       NA's   :40      
##     Title 42    Foreign Born       Naturalized         Noncitizen      
##  Min.   : NA   Min.   : 9619300   Min.   :14967828   Min.   :20722014  
##  1st Qu.: NA   1st Qu.:14079900   1st Qu.:16588153   1st Qu.:21765021  
##  Median : NA   Median :31107900   Median :18686237   Median :22098984  
##  Mean   :NaN   Mean   :26569778   Mean   :18891505   Mean   :22034557  
##  3rd Qu.: NA   3rd Qu.:37960935   3rd Qu.:20967738   3rd Qu.:22443414  
##  Max.   : NA   Max.   :44932901   Max.   :23182917   Max.   :22593269  
##  NA's   :49                       NA's   :34         NA's   :34        
##  Unauthorized population US Population         App_lagged     
##  Min.   : 3500000        Min.   :207660677   Min.   : 345353  
##  1st Qu.:10175000        1st Qu.:233791994   1st Qu.: 795735  
##  Median :11050000        Median :262803276   Median :1046422  
##  Mean   :10142500        Mean   :266639846   Mean   :1058353  
##  3rd Qu.:11425000        3rd Qu.:301231207   3rd Qu.:1258481  
##  Max.   :12200000        Max.   :328329953   Max.   :1814729  
##  NA's   :29                                                   
##   ForeignBorn        YearCentered
##  Min.   : 9619300   Min.   :-24  
##  1st Qu.:14079900   1st Qu.:-12  
##  Median :31107900   Median :  0  
##  Mean   :26569778   Mean   :  0  
##  3rd Qu.:37960935   3rd Qu.: 12  
##  Max.   :44932901   Max.   : 24  
##

Linearity

Linearity is a function of the model

It’s a linear model!

In the last slide set we talked about nonlinearity in the context of least squares regression

Discontinuity

Idea from Thinking Clearly

What they call a run variable is our year-centered variable

\(\textrm{YearCentered}\)

Create a dummy variable: 1 if 1995 or greater; 0 if 1971-1994.

Hard vs. fuzzy discontinuity

Postulate a model: \(\hat{D}=\beta_0 + \beta_1 * PrePost1995 + \beta_2 * YearCentered + \beta_3 * (PrePost1995*YearCentered)\)

Discontinuity

A model: \(\hat{D}=\beta_0 + \beta_1 * PrePost1995 + \beta_2 * YearCentered + \beta_3 * (PrePost1995*YearCentered)\)

“Model 1: 1971-1994”: \(\hat{D}=\beta_0 + \beta_1 * 0 + \beta_2 * YearCentered + \beta_3 * (0*YearCentered)\)

\(\hat{D}=\beta_0 + \beta_2 * YearCentered\)

“Model 2: 1995-2019”: \(\hat{D}=\beta_0 + \beta_1 * 1 + \beta_2 * YearCentered + \beta_3 * (1*YearCentered)\)

\(\hat{D}=(\beta_0 + \beta_1) + (\beta_2 + \beta_3) * YearCentered\)
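
The algebra above says that the single interaction model contains two straight lines, one per period. A minimal sketch of that equivalence, using simulated data: the data frame `sim`, its "true" coefficients, and the noise level are all made up for illustration and are not the course data.

```r
# Simulated illustration only: `sim` is NOT the deportation data.
set.seed(51)
sim <- data.frame(YearCentered = -24:24)
sim$PrePost1995 <- as.numeric(sim$YearCentered >= 0)   # 1 if 1995 or later
sim$D <- 40 + 80 * sim$PrePost1995 + 1 * sim$YearCentered +
  12 * sim$PrePost1995 * sim$YearCentered + rnorm(nrow(sim), sd = 10)

# One model: dummy, run variable, and their interaction
both <- lm(D ~ PrePost1995 * YearCentered, data = sim)
b <- unname(coef(both))   # order: beta0, beta1, beta2, beta3

# Two separate regressions, one per period
pre  <- lm(D ~ YearCentered, data = subset(sim, PrePost1995 == 0))
post <- lm(D ~ YearCentered, data = subset(sim, PrePost1995 == 1))

# "Model 1" line: intercept beta0, slope beta2
pre_line  <- c(intercept = b[1], slope = b[3])
# "Model 2" line: intercept beta0 + beta1, slope beta2 + beta3
post_line <- c(intercept = b[1] + b[2], slope = b[3] + b[4])

print(rbind(from_interaction = pre_line,  separate_fit = unname(coef(pre))))
print(rbind(from_interaction = post_line, separate_fit = unname(coef(post))))
```

The two rows of each printed table should agree: the interaction model and the two separate regressions give the same intercepts and slopes. What the single model adds is a direct test of whether the shift (\(\beta_1\)) and the change in slope (\(\beta_3\)) differ from zero.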

#Create a "treatment" indicator

# Compare against the number 0, not the string "0" (a string forces a character comparison)
remove.subset$pre_post95 <- ifelse(remove.subset$YearCentered >= 0, 1, 0)

remove.subset$pre_post95 <- factor(remove.subset$pre_post95,
                              levels=c(0,1),
                              labels=c("1971-1994", "1995-2019"))

table(remove.subset$pre_post95)

## 
## 1971-1994 1995-2019 
##        24        25

#Local regression 

#"Regress" deportations "on" time

reg6<-lm(Deportations~ pre_post95 + YearCentered +   
           pre_post95*YearCentered, data=remove.subset)

summary(reg6)

## 
## Call:
## lm(formula = Deportations ~ pre_post95 + YearCentered + pre_post95 * 
##     YearCentered, data = remove.subset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -114068   -8106    -346    9845   84226 
## 
## Coefficients:
##                                  Estimate Std. Error t value     Pr(>|t|)    
## (Intercept)                       36632.0    18333.7   1.998      0.05177 .  
## pre_post951995-2019               85011.9    24931.4   3.410      0.00138 ** 
## YearCentered                        803.5     1283.1   0.626      0.53434    
## pre_post951995-2019:YearCentered  11777.9     1761.4   6.686 0.0000000298 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43510 on 45 degrees of freedom
## Multiple R-squared:  0.9175, Adjusted R-squared:  0.912 
## F-statistic: 166.9 on 3 and 45 DF,  p-value: < 0.00000000000000022

library(sjPlot)   # plot_model()
library(ggplot2)  # geom_line(), theme_classic()

p1<-plot_model(reg6, type = "pred", terms = c("YearCentered", "pre_post95"), ci.lvl = .95, show.data=TRUE,  
    title="Use of deportations, 1971-2019", axis.title=c("Year", "Number of removals"), colors=c("skyblue4", "coral2")) +  
  geom_line() +
  theme_classic()

p1

Model

\(\hat{D}=36,632 + 85,011*pre\_post95 + 803.5*YearCentered + 11,778*(pre\_post95*YearCentered)\)

Pre-1995: \(\hat{D}=36,632 + 803.5*YearCentered\)

1971: In terms of the centered variable, this is \(-24\) (i.e., 24 years before 1995)

\(\hat{D}=36,632 + 803.5*(-24)\)

\(\hat{D}=17,348\)

Actual value: \(D=18,294\)

Model

\(\hat{D}=36,632 + 85,011*pre\_post95 + 803.5*YearCentered + 11,778*(pre\_post95*YearCentered)\)

Post-1995: \(\hat{D}=(36,632 + 85,011)*1+ (803.5 + 11,778)*(YearCentered)\)

2019: In terms of the centered variable, this is \(24\) (i.e., 24 years after 1995)

\(\hat{D}=(36632 + 85011)*1 + (803.5 + 11778)*24\)

\(\hat{D}=423,599\)

Actual value: \(D=347,090\)

Residual: \(347,090-423,599=-76,509\)
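
The hand calculations on the two Model slides can be reproduced in a few lines. This sketch uses the rounded coefficients exactly as written on the slides; `predict_removals()` is a hypothetical helper introduced here for illustration, not a function from the course code.

```r
# Rounded coefficients as written on the Model slides
b0 <- 36632    # intercept (1971-1994 line at YearCentered = 0)
b1 <- 85011    # shift in the intercept for 1995-2019
b2 <- 803.5    # slope on YearCentered for 1971-1994
b3 <- 11778    # additional slope for 1995-2019

# Hypothetical helper: predicted removals for a given centered year
predict_removals <- function(year_centered) {
  post <- as.numeric(year_centered >= 0)   # 1 if 1995 or later
  b0 + b1 * post + b2 * year_centered + b3 * post * year_centered
}

pred_1971 <- predict_removals(-24)   # 1971 is 24 years before 1995
pred_2019 <- predict_removals(24)    # 2019 is 24 years after 1995

# Residual = actual value minus predicted value
resid_1971 <- 18294 - pred_1971
resid_2019 <- 347090 - pred_2019

print(c(pred_1971 = pred_1971, resid_1971 = resid_1971))
print(c(pred_2019 = pred_2019, resid_2019 = resid_2019))
```

With these rounded inputs the predictions are 17,348 for 1971 and 423,599 for 2019, so the residuals are 946 and -76,509. Using the unrounded estimates from `summary(reg6)` would change the results by a unit or two.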

Other ways: splines, knots, nonparametric regression

Another way

Use of splines and knots

Consideration of the project 3 data

library(readr)  # read_csv()
reasons_url <- "https://raw.githubusercontent.com/mightyjoemoon/POL51/main/ICE_reasonforremoval.csv"
reasons <- read_csv(reasons_url)  # read_csv() accepts a URL directly
summary(reasons)

##       Year       President              All              None       
##  Min.   :2003   Length:22          Min.   : 56882   Min.   : 19495  
##  1st Qu.:2008   Class :character   1st Qu.:178148   1st Qu.: 85446  
##  Median :2014   Mode  :character   Median :238765   Median :106426  
##  Mean   :2014                      Mean   :248987   Mean   :122287  
##  3rd Qu.:2019                      3rd Qu.:356423   3rd Qu.:165287  
##  Max.   :2024                      Max.   :407821   Max.   :253342  
##      Level1          Level2          Level3        Undocumented     
##  Min.   : 9819   Min.   : 3846   Min.   : 11045   Min.   :10100000  
##  1st Qu.:38484   1st Qu.: 9056   1st Qu.: 34978   1st Qu.:10500000  
##  Median :46743   Median :17480   Median : 63186   Median :11050000  
##  Mean   :46534   Mean   :15601   Mean   : 64541   Mean   :11015455  
##  3rd Qu.:57148   3rd Qu.:20342   3rd Qu.: 90950   3rd Qu.:11375000  
##  Max.   :75590   Max.   :29436   Max.   :130251   Max.   :12200000  
##      ER_Non     
##  Min.   : 4018  
##  1st Qu.:28563  
##  Median :41647  
##  Mean   :38980  
##  3rd Qu.:50230  
##  Max.   :71686

Piecewise regression

Piecewise regression (we just did it above)

An alternative: polynomial regression

\(y_i=\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \epsilon_i\)

Still a linear model (linear in the \(\beta\)s), but a nonlinear, smooth function of \(x\)

Standard linear models use only \(x^1\), or just \(x\)

Example with project 3 data

spline_model2<- lm(reasons$All~poly(reasons$Year,3))
summary(spline_model2)

## 
## Call:
## lm(formula = reasons$All ~ poly(reasons$Year, 3))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -82614 -33031  11470  35348  78485 
## 
## Coefficients:
##                        Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)              248987      10532  23.640 0.00000000000000528 ***
## poly(reasons$Year, 3)1  -211000      49402  -4.271             0.00046 ***
## poly(reasons$Year, 3)2  -370285      49402  -7.495 0.00000061195622757 ***
## poly(reasons$Year, 3)3   163678      49402   3.313             0.00387 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49400 on 18 degrees of freedom
## Multiple R-squared:  0.8259, Adjusted R-squared:  0.7969 
## F-statistic: 28.47 on 3 and 18 DF,  p-value: 0.0000004751

plot(reasons$Year, reasons$All, main = "Polynomial Regression: cubic spline on deportation data", xlab = "Year", ylab = "Number")
lines(reasons$Year, predict(spline_model2), col = "red", lwd = 2)
legend("topright", legend = "Fitted Spline", col = "red", lwd = 2)

spline_modellin<- lm(reasons$All~poly(reasons$Year,1))
summary(spline_modellin)

## 
## Call:
## lm(formula = reasons$All ~ poly(reasons$Year, 1))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -164108  -85764    2699   91750  148198 
## 
## Coefficients:
##                       Estimate Std. Error t value       Pr(>|t|)    
## (Intercept)             248987      21733   11.46 0.000000000307 ***
## poly(reasons$Year, 1)  -211000     101939   -2.07         0.0516 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 101900 on 20 degrees of freedom
## Multiple R-squared:  0.1764, Adjusted R-squared:  0.1352 
## F-statistic: 4.284 on 1 and 20 DF,  p-value: 0.05163
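
The coefficients reported by `poly()` are not the \(\beta\)s of the polynomial equation written in raw powers of \(x\). By default `poly()` builds orthogonal polynomial terms, which is also why the first-order term is -211000 in both outputs above. A minimal sketch on simulated data (the values of `x` and `y` are made up; `x` is centered here only to keep the raw powers numerically well behaved):

```r
# Simulated illustration only: not the course data
set.seed(3)
x <- -10:11
y <- 250 - 5 * x - 3 * x^2 + 0.2 * x^3 + rnorm(length(x), sd = 20)

fit_raw  <- lm(y ~ x + I(x^2) + I(x^3))  # the equation as written, on raw powers
fit_orth <- lm(y ~ poly(x, 3))           # poly() default: orthogonal polynomials
fit_lin  <- lm(y ~ poly(x, 1))           # straight-line fit

# Same fitted curve, even though the coefficient values differ
same_curve <- isTRUE(all.equal(unname(fitted(fit_raw)),
                               unname(fitted(fit_orth))))

# With orthogonal terms, the first-order coefficient does not change
# when higher-order terms are added
same_linear_coef <- isTRUE(all.equal(unname(coef(fit_orth)[2]),
                                     unname(coef(fit_lin)[2])))

print(c(same_curve = same_curve, same_linear_coef = same_linear_coef))
```

The choice between the two parameterizations does not affect the fit, only how the individual coefficients read. Raw powers match the textbook equation; orthogonal terms are more stable numerically and let each term be judged separately.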

Other ways of thinking

“Non-parametric” modeling

Functions that use interpolation or smoothing

Nonparametric “solution” to a regression problem:

Find some function to characterize the data

One choice: Splines

Splines

Think of your data as pieces

You’re doing this visually in your last task

ggplot(reasons, aes(x = Year, y = All)) +
  geom_point()

Splines…some intuition

\(\hat{f}\) is a piecewise polynomial over some range of \(x\), \((x_i, x_{i+1})\)

If we estimate the coefficients for the pieces \(f\), \(f^{'}\), and \(f^{''}\), then we can characterize \(Y\)

\(f^{'}\) and \(f^{''}\) are the first and second derivatives

\(f^{'}\) is the “rate of change” and \(f^{''}\) is the “rate of change of the rate of change” (concavity or inflection)

The \(x_i\) serve as the knot points

Splines

Concept: data are divided into bins at intervals

The bins are bounded by “knots”

Fit some function in the interval

Standard regression: straight lines, flat planes

Splines: many ways to do things

Knots, splines, and nonlinearity

What is a knot point?

Polynomials with “smoothing restrictions”

Natural splines, clamped splines, B-splines (“B” stands for “basis”)

A spline is a piecewise polynomial of degree \(d\) that has continuous derivatives up to order \(d-1\) at each of the knots

Natural splines constrain the second derivatives of the spline polynomials to be zero at the endpoints of the interval of interpolation
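
One practical consequence of the natural-spline constraint is that the fitted curve becomes a straight line beyond the boundary knots. A minimal sketch, assuming simulated data (the data frame `d` and its values are made up for illustration):

```r
library(splines)  # ns()

# Simulated illustration only: not the course data
set.seed(21)
d <- data.frame(x = 1:22)
d$y <- 200 + 30 * sin(d$x / 4) + rnorm(nrow(d), sd = 5)

fit_ns <- lm(y ~ ns(x, df = 3), data = d)

# Predict at equally spaced points beyond the largest observed x,
# which is the upper boundary knot by default
beyond <- data.frame(x = 23:27)
p <- predict(fit_ns, newdata = beyond)

# On a straight line successive changes are constant,
# so second differences are zero up to rounding error
second_diff <- diff(p, differences = 2)
print(round(second_diff, 6))
```

This behavior is the reason natural splines are often preferred when predictions near or just past the edges of the data matter: an unconstrained cubic can swing sharply there.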

Example: natural splines

library(splines)  # provides ns() and bs()
spline_model2 <- lm(reasons$All ~ ns(reasons$Year, df = 3))

summary(spline_model2)

## 
## Call:
## lm(formula = reasons$All ~ ns(reasons$Year, df = 3))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -85816 -23194   5389  30907  85082 
## 
## Coefficients:
##                           Estimate Std. Error t value    Pr(>|t|)    
## (Intercept)                 116402      30109   3.866    0.001132 ** 
## ns(reasons$Year, df = 3)1   -16237      40233  -0.404    0.691282    
## ns(reasons$Year, df = 3)2   336389      75643   4.447    0.000311 ***
## ns(reasons$Year, df = 3)3  -242438      31508  -7.695 0.000000425 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45300 on 18 degrees of freedom
## Multiple R-squared:  0.8536, Adjusted R-squared:  0.8292 
## F-statistic: 34.99 on 3 and 18 DF,  p-value: 0.0000001013

plot(reasons$Year, reasons$All, main = "Spline Regression: natural splines on deportation data", xlab = "Year", ylab = "Number")
lines(reasons$Year, predict(spline_model2), col = "red", lwd = 2)
legend("topright", legend = "Fitted Spline", col = "red", lwd = 2)

Splines

Basis splines or B-splines

Rubber band analogy

B-splines are piecewise polynomial basis functions; you choose the polynomial degree/order

Idea: find a polynomial function that best characterizes the data

Splines

Smoothness is characterized by specifying the degrees-of-freedom and the order of the polynomial

Order \(=\{1, 2, 3, \ldots, k\}\): linear, quadratic, cubic, and so on

We’ve lived in the world of straight lines

Spline with 3 “sticks” and a polynomial of order 1 (linear function)

“Broken-stick regression”

Nonsmooth function because the polynomial is of order 1

\(x^1=x\)

ggplot(reasons, aes(x = Year, y = All)) +
  geom_point() +
  geom_smooth(method = lm, 
              formula = y ~ splines::bs(x, df = 3, degree = 1)) + labs(title="Deportations by year (FY 2003-2024)",
       y="Number of deportations", x="Fiscal year") +
  theme_bw()

m<-lm(reasons$All ~ splines::bs(reasons$Year, df = 3, degree = 1))

summary(m)

## 
## Call:
## lm(formula = reasons$All ~ splines::bs(reasons$Year, df = 3, 
##     degree = 1))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -91163 -29376   3194  27263  77218 
## 
## Coefficients:
##                                                Estimate Std. Error t value
## (Intercept)                                      131285      27413   4.789
## splines::bs(reasons$Year, df = 3, degree = 1)1   292658      40353   7.252
## splines::bs(reasons$Year, df = 3, degree = 1)2   105880      33675   3.144
## splines::bs(reasons$Year, df = 3, degree = 1)3   -50080      39304  -1.274
##                                                   Pr(>|t|)    
## (Intercept)                                       0.000147 ***
## splines::bs(reasons$Year, df = 3, degree = 1)1 0.000000962 ***
## splines::bs(reasons$Year, df = 3, degree = 1)2    0.005610 ** 
## splines::bs(reasons$Year, df = 3, degree = 1)3    0.218810    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43870 on 18 degrees of freedom
## Multiple R-squared:  0.8627, Adjusted R-squared:  0.8399 
## F-statistic: 37.71 on 3 and 18 DF,  p-value: 0.00000005708
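
It can help to see where the “sticks” come from. With `degree = 1` and `df = 3`, `bs()` places two interior knots at quantiles of \(x\), which gives three line segments. The sketch below uses simulated data (the data frame `d` is made up; only the 2003-2024 range mirrors the summary shown earlier) and shows that the same fit can be written by hand with “hinge” terms of the form `pmax(x - knot, 0)`.

```r
library(splines)  # bs()

# Simulated illustration only: not the course data
set.seed(7)
d <- data.frame(x = 2003:2024)
d$y <- 100 + 20 * pmin(d$x - 2003, 8) - 15 * pmax(d$x - 2011, 0) +
  rnorm(nrow(d), sd = 10)

# Interior knots that bs() chooses for df = 3, degree = 1
basis <- bs(d$x, df = 3, degree = 1)
knots <- unname(attr(basis, "knots"))
n_sticks <- length(knots) + 1

fit_bs <- lm(y ~ bs(x, df = 3, degree = 1), data = d)

# The same broken-stick model written by hand:
# each hinge term lets the slope change at one knot
d$hinge1 <- pmax(d$x - knots[1], 0)
d$hinge2 <- pmax(d$x - knots[2], 0)
fit_hinge <- lm(y ~ x + hinge1 + hinge2, data = d)

same_fit <- isTRUE(all.equal(unname(fitted(fit_bs)),
                             unname(fitted(fit_hinge))))
print(knots)
print(n_sticks)
print(same_fit)
```

For 22 consecutive years the knots fall at about 2010 and 2017. In the hinge version the coefficients read directly as changes in slope at each knot, which is harder to see from the `bs()` coefficients.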

Cubic spline

The default in many packages is a cubic spline

Example

ggplot(reasons, aes(x = Year, y = All)) +
  geom_point() +
  geom_smooth(method = lm, 
              formula = y ~ splines::bs(x, df = 3, degree = 3)) + labs(title="Deportations by year (FY 2003-2024)",
       y="Number of deportations", x="Fiscal year") +
  theme_bw()

m2<-lm(reasons$All ~ splines::bs(reasons$Year, df = 3, degree = 3))

summary(m2)

## 
## Call:
## lm(formula = reasons$All ~ splines::bs(reasons$Year, df = 3, 
##     degree = 3))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -82614 -33031  11470  35348  78485 
## 
## Coefficients:
##                                                Estimate Std. Error t value
## (Intercept)                                       99250      35859   2.768
## splines::bs(reasons$Year, df = 3, degree = 3)1   592125     105979   5.587
## splines::bs(reasons$Year, df = 3, degree = 3)2    46104      72713   0.634
## splines::bs(reasons$Year, df = 3, degree = 3)3    -8487      54878  -0.155
##                                                 Pr(>|t|)    
## (Intercept)                                       0.0127 *  
## splines::bs(reasons$Year, df = 3, degree = 3)1 0.0000265 ***
## splines::bs(reasons$Year, df = 3, degree = 3)2    0.5340    
## splines::bs(reasons$Year, df = 3, degree = 3)3    0.8788    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49400 on 18 degrees of freedom
## Multiple R-squared:  0.8259, Adjusted R-squared:  0.7969 
## F-statistic: 28.47 on 3 and 18 DF,  p-value: 0.0000004751
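
The residuals, \(R^2\), and F-statistic here are identical to those of the `poly(reasons$Year, 3)` fit shown earlier. A likely explanation is that `bs()` with `df = 3` and `degree = 3` has no interior knots, so it is a single cubic polynomial written in a different basis. A minimal sketch on simulated data (the data frame `d` is made up for illustration):

```r
library(splines)  # bs()

# Simulated illustration only: not the course data
set.seed(11)
d <- data.frame(x = 2003:2024)
d$y <- 250 - 0.8 * (d$x - 2013)^2 + 0.05 * (d$x - 2013)^3 +
  rnorm(nrow(d), sd = 10)

# Number of interior knots implied by df for a cubic B-spline
knots_df3 <- attr(bs(d$x, df = 3, degree = 3), "knots")
knots_df4 <- attr(bs(d$x, df = 4, degree = 3), "knots")

fit_bs3   <- lm(y ~ bs(x, df = 3, degree = 3), data = d)
fit_poly3 <- lm(y ~ poly(x, 3), data = d)

# No interior knots means one cubic piece: same fit as a cubic polynomial
same_fit <- isTRUE(all.equal(unname(fitted(fit_bs3)),
                             unname(fitted(fit_poly3))))

print(c(knots_df3 = length(knots_df3), knots_df4 = length(knots_df4)))
print(same_fit)
```

To get a cubic spline with actual pieces, raise `df` above the degree (for example `df = 4` adds one interior knot) or pass the `knots` argument explicitly.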

Some thoughts

Project 3, Task 6 asks this:

For this task, first create a diagnostic plot of all deportations by year. Based on inspection of the plot, how many piecewise functions do you think would best fit these data? Following this, estimate a regression function using a spline function with a polynomial of order 1 and the number of splines equal to what your diagnostic plot suggests. Comparing a model with 2 or 3 degrees of freedom, which model best describes the data? This question is worth 50 points.

Why order 1?

Parametric v. non-parametric models

Nonlinearity

B. Jones

June 5, 2025
