## Year Apprehensions President Party
## Min. :1971 Min. : 420126 Length:49 Min. :0.0000
## 1st Qu.:1983 1st Qu.: 796587 Class :character 1st Qu.:0.0000
## Median :1995 Median :1057977 Mode :character Median :0.0000
## Mean :1995 Mean :1075302 Mean :0.4082
## 3rd Qu.:2007 3rd Qu.:1258481 3rd Qu.:1.0000
## Max. :2019 Max. :1814729 Max. :1.0000
##
## PCGdp Decade Deportations VR
## Min. : 5609 Min. :1970 Min. : 15216 Min. : 100452
## 1st Qu.:15544 1st Qu.:1980 1st Qu.: 24592 1st Qu.: 568005
## Median :28691 Median :1990 Median : 50924 Median : 911790
## Mean :31116 Mean :1990 Mean :152115 Mean : 851657
## 3rd Qu.:47195 3rd Qu.:2000 3rd Qu.:284365 3rd Qu.:1091203
## Max. :65548 Max. :2010 Max. :432334 Max. :1675876
##
## Administrative EnforcementReturns Criminal Noncriminal
## Min. :15072 Min. : 76137 Min. :108519 Min. :175846
## 1st Qu.:43972 1st Qu.: 85890 1st Qu.:122815 1st Qu.:179405
## Median :47361 Median :118170 Median :169898 Median :201613
## Mean :51914 Mean :186082 Mean :158533 Mean :203616
## 3rd Qu.:61396 3rd Qu.:222446 3rd Qu.:189702 3rd Qu.:215597
## Max. :89719 Max. :523153 Max. :200039 Max. :233846
## NA's :38 NA's :38 NA's :40 NA's :40
## Title 42 Foreign Born Naturalized Noncitizen
## Min. : NA Min. : 9619300 Min. :14967828 Min. :20722014
## 1st Qu.: NA 1st Qu.:14079900 1st Qu.:16588153 1st Qu.:21765021
## Median : NA Median :31107900 Median :18686237 Median :22098984
## Mean :NaN Mean :26569778 Mean :18891505 Mean :22034557
## 3rd Qu.: NA 3rd Qu.:37960935 3rd Qu.:20967738 3rd Qu.:22443414
## Max. : NA Max. :44932901 Max. :23182917 Max. :22593269
## NA's :49 NA's :34 NA's :34
## Unauthorized population US Population App_lagged
## Min. : 3500000 Min. :207660677 Min. : 345353
## 1st Qu.:10175000 1st Qu.:233791994 1st Qu.: 795735
## Median :11050000 Median :262803276 Median :1046422
## Mean :10142500 Mean :266639846 Mean :1058353
## 3rd Qu.:11425000 3rd Qu.:301231207 3rd Qu.:1258481
## Max. :12200000 Max. :328329953 Max. :1814729
## NA's :29
## ForeignBorn YearCentered
## Min. : 9619300 Min. :-24
## 1st Qu.:14079900 1st Qu.:-12
## Median :31107900 Median : 0
## Mean :26569778 Mean : 0
## 3rd Qu.:37960935 3rd Qu.: 12
## Max. :44932901 Max. : 24
##
Linearity
- Linearity is a function of the model
- In the last slide set we talked about nonlinearity in the context of
least squares regression
Discontinuity
- Idea from Thinking Clearly
- What they call a run variable is our year-centered
variable
- \(\textrm{YearCentered}\)
- Create a dummy variable: 1 if 1995 or greater; 0 if 1971-1994.
- Hard vs. fuzzy discontinuity
- Postulate a model: \(\hat{D}=\beta_0 +
\beta_1 * PrePost1995 + \beta_2 * Year + \beta_3 *
PrePost1995*Year\)
Discontinuity
- A model: \(\hat{D}=\beta_0 + \beta_1 *
PrePost1995 + \beta_2 * YearCentered + \beta_3 *
(PrePost1995*YearCentered)\)
- “Model 1: 1971-1994”: \(\hat{D}=\beta_0 +
\beta_1 * 0 + \beta_2 * YearCentered + \beta_3 *
(0*YearCentered)\)
- \(\hat{D}=\beta_0 + \beta_2 *
YearCentered\)
- “Model 2: 1995-2019”: \(\hat{D}=\beta_0 +
\beta_1 * 1 + \beta_2 * YearCentered + \beta_3 *
(1*YearCentered)\)
- \(\hat{D}=(\beta_0 + \beta_1) + (\beta_2 +
\beta_3) * YearCentered\)
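The algebra above can be checked numerically. The sketch below uses simulated data (made up for illustration, not the deportation series) and hypothetical variable names `year_c` and `post`. It suggests that one interaction model reproduces the intercepts and slopes from two separate regressions, one on each side of the cutoff.

```r
# Simulated data: values are invented purely to illustrate the algebra
set.seed(1)
year_c <- -24:24                      # centered running variable, 0 = cutoff
post   <- as.integer(year_c >= 0)     # 1 at or after the cutoff, 0 before
y      <- 10 + 2 * year_c + 50 * post + 5 * post * year_c +
          rnorm(length(year_c), sd = 3)
sim    <- data.frame(y, year_c, post)

# One interaction model covering both regimes
full <- lm(y ~ post * year_c, data = sim)
b    <- coef(full)

# Separate regressions on each side of the cutoff
pre_fit  <- lm(y ~ year_c, data = subset(sim, post == 0))
post_fit <- lm(y ~ year_c, data = subset(sim, post == 1))

# Pre regime: intercept beta_0, slope beta_2
pre_from_full  <- unname(c(b["(Intercept)"], b["year_c"]))
# Post regime: intercept beta_0 + beta_1, slope beta_2 + beta_3
post_from_full <- unname(c(b["(Intercept)"] + b["post"],
                           b["year_c"] + b["post:year_c"]))

all.equal(pre_from_full,  unname(coef(pre_fit)))
all.equal(post_from_full, unname(coef(post_fit)))
```

The point estimates agree because the interaction model lets both the intercept and the slope differ by regime; the standard errors will generally differ, since the interaction model pools the residual variance.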
#Create a "treatment" indicator
#Compare against the number 0, not the string "0", so the comparison is numeric
remove.subset$pre_post95 <- ifelse(remove.subset$YearCentered >= 0, 1, 0)
remove.subset$pre_post95 <- factor(remove.subset$pre_post95,
levels=c(0,1),
labels=c("1971-1994", "1995-2019"))
table(remove.subset$pre_post95)
##
## 1971-1994 1995-2019
## 24 25
#Local regression
#"Regress" deportations "on" time
reg6<-lm(Deportations~ pre_post95 + YearCentered +
pre_post95*YearCentered, data=remove.subset)
summary(reg6)
##
## Call:
## lm(formula = Deportations ~ pre_post95 + YearCentered + pre_post95 *
## YearCentered, data = remove.subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -114068 -8106 -346 9845 84226
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36632.0 18333.7 1.998 0.05177 .
## pre_post951995-2019 85011.9 24931.4 3.410 0.00138 **
## YearCentered 803.5 1283.1 0.626 0.53434
## pre_post951995-2019:YearCentered 11777.9 1761.4 6.686 0.0000000298 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43510 on 45 degrees of freedom
## Multiple R-squared: 0.9175, Adjusted R-squared: 0.912
## F-statistic: 166.9 on 3 and 45 DF, p-value: < 0.00000000000000022
library(sjPlot)   # provides plot_model()
library(ggplot2)  # provides geom_line() and theme_classic()
p1 <- plot_model(reg6, type = "pred", terms = c("YearCentered", "pre_post95"),
                 ci.lvl = .95, show.data = TRUE,
                 title = "Use of deportations, 1971-2019",
                 axis.title = c("Year", "Number of removals"),
                 colors = c("skyblue4", "coral2")) +
  geom_line() +
  theme_classic()
p1

Model
- \(\hat{D}=36,632 + 85,011*pre\_post95 +
803.5*YearCentered + 11,778*(Y*pp)\)
- Pre-1995: \(\hat{D}=36,632 +
803.5*YearCentered\)
- 1971: In terms of centered variable is \(-24\) (i.e. 24 years before 1995)
- \(\hat{D}=36,632 + 803.5*(-24)=17,348\)
- Actual value: \(D=18,294\)
- Residual: \(18,294-17,348=946\)
Model
- \(\hat{D}=36,632 + 85,011*pre\_post95 +
803.5*YearCentered + 11,778*(Y*pp)\)
- Post-1995: \(\hat{D}=(36,632 + 85,011)*1+
(803.5 + 11,778)*(YearCentered)\)
- 2019: In terms of centered variable is \(24\) (i.e. 24 years after 1995)
- \(\hat{D}=(36,632 + 85,011)*1 + (803.5 +
11,778)*24=423,599\)
- Actual value: \(D=347,090\)
- Residual: \(347,090-423,599=-76,509\)
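The hand calculations on these two slides can be reproduced in a few lines. This is a minimal sketch using the rounded coefficients as printed on the slides; `predict_removals()` is a hypothetical helper written for illustration, not part of the original code.

```r
# Rounded coefficients as printed on the slides
b0 <- 36632    # intercept (pre-1995 regime)
b1 <- 85011    # shift in the intercept for 1995-2019
b2 <- 803.5    # slope on YearCentered before 1995
b3 <- 11778    # additional slope for 1995-2019

# Hypothetical helper: predicted removals for a given centered year
predict_removals <- function(year_centered) {
  post <- as.numeric(year_centered >= 0)   # 1 if 1995 or later
  b0 + b1 * post + (b2 + b3 * post) * year_centered
}

pred_1971 <- predict_removals(-24)   # 36632 + 803.5 * (-24)
pred_2019 <- predict_removals(24)    # (36632 + 85011) + (803.5 + 11778) * 24

# Residual = actual - predicted, using the actual values from the slides
resid_1971 <- 18294  - pred_1971
resid_2019 <- 347090 - pred_2019

c(pred_1971 = pred_1971, resid_1971 = resid_1971,
  pred_2019 = pred_2019, resid_2019 = resid_2019)
```

Because the slide coefficients are rounded, these values can differ slightly from what `predict(reg6)` would return with the full-precision estimates.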
Other ways: splines, knots, nonparametric regression
- Consideration of the project 3 data
library(readr)  # provides read_csv()
#Keep the URL and the data in separate objects
reasons_url <- "https://raw.githubusercontent.com/mightyjoemoon/POL51/main/ICE_reasonforremoval.csv"
reasons <- read_csv(reasons_url)
summary(reasons)
## Year President All None
## Min. :2003 Length:22 Min. : 56882 Min. : 19495
## 1st Qu.:2008 Class :character 1st Qu.:178148 1st Qu.: 85446
## Median :2014 Mode :character Median :238765 Median :106426
## Mean :2014 Mean :248987 Mean :122287
## 3rd Qu.:2019 3rd Qu.:356423 3rd Qu.:165287
## Max. :2024 Max. :407821 Max. :253342
## Level1 Level2 Level3 Undocumented
## Min. : 9819 Min. : 3846 Min. : 11045 Min. :10100000
## 1st Qu.:38484 1st Qu.: 9056 1st Qu.: 34978 1st Qu.:10500000
## Median :46743 Median :17480 Median : 63186 Median :11050000
## Mean :46534 Mean :15601 Mean : 64541 Mean :11015455
## 3rd Qu.:57148 3rd Qu.:20342 3rd Qu.: 90950 3rd Qu.:11375000
## Max. :75590 Max. :29436 Max. :130251 Max. :12200000
## ER_Non
## Min. : 4018
## 1st Qu.:28563
## Median :41647
## Mean :38980
## 3rd Qu.:50230
## Max. :71686
Piecewise regression
- Piecewise regression (we just did it above)
- An alternative: polynomial regression
- \(y_i=\beta_0 + \beta_1 x_i + \beta_2 x_i^2
+ \beta_3 x_i^3 + \epsilon_i\)
- Still a linear model (linear in the \(\beta\)s), but a nonlinear, smooth function of \(x\)
- Standard linear models use only \(x^1\), or just \(x\)
- Example with project 3 data
spline_model2<- lm(reasons$All~poly(reasons$Year,3))
summary(spline_model2)
##
## Call:
## lm(formula = reasons$All ~ poly(reasons$Year, 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -82614 -33031 11470 35348 78485
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 248987 10532 23.640 0.00000000000000528 ***
## poly(reasons$Year, 3)1 -211000 49402 -4.271 0.00046 ***
## poly(reasons$Year, 3)2 -370285 49402 -7.495 0.00000061195622757 ***
## poly(reasons$Year, 3)3 163678 49402 3.313 0.00387 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49400 on 18 degrees of freedom
## Multiple R-squared: 0.8259, Adjusted R-squared: 0.7969
## F-statistic: 28.47 on 3 and 18 DF, p-value: 0.0000004751
plot(reasons$Year, reasons$All, main = "Polynomial Regression: cubic spline on deportation data", xlab = "Year", ylab = "Number")
lines(reasons$Year, predict(spline_model2), col = "red", lwd = 2)
legend("topright", legend = "Fitted Spline", col = "red", lwd = 2)

spline_modellin<- lm(reasons$All~poly(reasons$Year,1))
summary(spline_modellin)
##
## Call:
## lm(formula = reasons$All ~ poly(reasons$Year, 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -164108 -85764 2699 91750 148198
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 248987 21733 11.46 0.000000000307 ***
## poly(reasons$Year, 1) -211000 101939 -2.07 0.0516 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 101900 on 20 degrees of freedom
## Multiple R-squared: 0.1764, Adjusted R-squared: 0.1352
## F-statistic: 4.284 on 1 and 20 DF, p-value: 0.05163
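In the two outputs above, the first-order coefficient (-211000) is the same in the linear and cubic fits. That is a property of the orthogonal polynomials that `poly()` builds by default. The sketch below, on simulated data (invented for illustration, with a hypothetical centered year `xc`), shows this behavior, shows that raw and orthogonal polynomials give the same fitted values, and compares the nested linear and cubic models with an F-test.

```r
# Simulated data: invented values, not the project 3 series
set.seed(2)
xc <- (2003:2024) - 2013.5            # centered year keeps raw powers well scaled
y  <- 200 - 0.5 * xc^2 + 0.05 * xc^3 + rnorm(length(xc), sd = 5)

fit_orth <- lm(y ~ poly(xc, 3))               # orthogonal polynomials (default)
fit_raw  <- lm(y ~ poly(xc, 3, raw = TRUE))   # raw powers xc, xc^2, xc^3
fit_lin  <- lm(y ~ poly(xc, 1))               # linear fit

# Same fitted curve either way; only the coefficients are parameterized differently
same_fit <- all.equal(unname(fitted(fit_orth)), unname(fitted(fit_raw)))

# With orthogonal polynomials, adding higher-order terms leaves the
# lower-order estimates unchanged
same_first <- all.equal(unname(coef(fit_lin)[2]), unname(coef(fit_orth)[2]))

# F-test: do the quadratic and cubic terms improve on the straight line?
anova(fit_lin, fit_orth)
```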
Other ways of thinking
- “Non-parametric” modeling
- Functions that use interpolation or smoothing
- Nonparametric “solution” to a regression problem:
- Find some function to characterize the data
Splines
- Think of your data as pieces
- You’re doing this in your last task visually
ggplot(reasons, aes(x = Year, y = All)) +
geom_point()

Splines…some intuition
- \(\hat{f}\) is a piecewise polynomial
over each interval of \(x\), \((x_i, x_{i+1})\)
- If we estimate the coefficients for the pieces so that \(f\), \(f^{'}\), and \(f^{''}\) match where the pieces meet, then we can characterize
\(Y\)
- \(f^{'}\), and \(f^{''}\) are first- and
second-derivatives
- \(f^{'}\) is the “Rate of
change” and \(f^{''}\) is the
“rate of change of the rate of change” (concavity or inflection)
- \(x_i\) serves as the knot
points
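One common way to write this down, offered here as a sketch rather than the form used in the slides, is the truncated power basis for a cubic spline. The knot symbols \(\xi_j\) and the coefficients \(\theta_j\) are notation introduced for this illustration.

```latex
% Cubic spline with K interior knots \xi_1 < \dots < \xi_K
f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3
       + \sum_{j=1}^{K} \theta_j \,(x - \xi_j)_+^3,
\qquad (u)_+ = \max(u, 0)

% Derivatives of a single knot term
\frac{d}{dx}(x - \xi_j)_+^3 = 3\,(x - \xi_j)_+^2,
\qquad
\frac{d^2}{dx^2}(x - \xi_j)_+^3 = 6\,(x - \xi_j)_+
```

Each knot term and its first two derivatives equal zero at \(x=\xi_j\), so \(f\), \(f^{'}\), and \(f^{''}\) are continuous at every knot; only the third derivative jumps, by \(6\theta_j\).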
Splines
- Concept: data are divided into bins at intervals
- The bins are bounded by “knots”
- Fit some function in the interval
- Standard regression: straight lines, flat planes
Splines: many ways to do things
- Knots, splines, and nonlinearity
- Polynomials with “smoothing restrictions”
- Natural splines, clamped splines, B-splines (“B” stands for
“basis”)
- A spline is a polynomial of degree \(d\) that has continuous derivatives up to
order \(d-1\) at each of the knots
- Natural splines constrain the second derivatives of the spline
polynomials to be zero at the endpoints of the interval of
interpolation
library(splines)  # provides ns() and bs()
spline_model2 <- lm(reasons$All ~ ns(reasons$Year, df = 3))
summary(spline_model2)
##
## Call:
## lm(formula = reasons$All ~ ns(reasons$Year, df = 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -85816 -23194 5389 30907 85082
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 116402 30109 3.866 0.001132 **
## ns(reasons$Year, df = 3)1 -16237 40233 -0.404 0.691282
## ns(reasons$Year, df = 3)2 336389 75643 4.447 0.000311 ***
## ns(reasons$Year, df = 3)3 -242438 31508 -7.695 0.000000425 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45300 on 18 degrees of freedom
## Multiple R-squared: 0.8536, Adjusted R-squared: 0.8292
## F-statistic: 34.99 on 3 and 18 DF, p-value: 0.0000001013
plot(reasons$Year, reasons$All, main = "Spline Regression: natural splines on deportation data", xlab = "Year", ylab = "Number")
lines(reasons$Year, predict(spline_model2), col = "red", lwd = 2)
legend("topright", legend = "Fitted Spline", col = "red", lwd = 2)
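To see what `df = 3` implies for a natural spline, the sketch below uses simulated data (invented for illustration; the data frame `d` and its columns are hypothetical). It inspects where `ns()` places its knots and checks that predictions beyond the boundary knots fall on a straight line, which follows from the zero-second-derivative constraint at the endpoints.

```r
library(splines)

# Simulated data: invented values, not the project 3 series
set.seed(3)
d   <- data.frame(x = 2003:2024)
d$y <- 100 + 30 * sin((d$x - 2003) / 4) + rnorm(nrow(d), sd = 3)

# With df = 3 and no intercept column, ns() uses 2 interior knots,
# placed at quantiles of x, plus boundary knots at range(x)
basis          <- ns(d$x, df = 3)
interior_knots <- attr(basis, "knots")
boundary_knots <- attr(basis, "Boundary.knots")

fit <- lm(y ~ ns(x, df = 3), data = d)

# Predictions past the right boundary knot: second differences near zero
# indicate a straight line
beyond      <- data.frame(x = 2025:2030)
p           <- predict(fit, newdata = beyond)
second_diff <- diff(p, differences = 2)

interior_knots
round(second_diff, 8)
```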

Splines
- Basis splines or B-splines
- B-splines are piecewise polynomial basis functions; we choose the
polynomial degree/order
- Idea: find a polynomial function that best characterizes the
data
Splines
- Smoothness is characterized by specifying the degrees-of-freedom and
the order of the polynomial
- Order \(=\{1,2,3,\ldots,k\}\):
Linear, quadratic, cubic, and so on
- We’ve lived in the world of straight lines
- Spline with 3 “sticks” and polynomial 1 (linear function)
- “Broken-stick regression”
- Nonsmooth function as the polynomial is of order 1
ggplot(reasons, aes(x = Year, y = All)) +
geom_point() +
geom_smooth(method = lm,
formula = y ~ splines::bs(x, df = 3, degree = 1)) + labs(title="Deportations by year (FY 2003-2024)",
y="Number of deportations", x="Fiscal year") +
theme_bw()

m<-lm(reasons$All ~ splines::bs(reasons$Year, df = 3, degree = 1))
summary(m)
##
## Call:
## lm(formula = reasons$All ~ splines::bs(reasons$Year, df = 3,
## degree = 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -91163 -29376 3194 27263 77218
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 131285 27413 4.789
## splines::bs(reasons$Year, df = 3, degree = 1)1 292658 40353 7.252
## splines::bs(reasons$Year, df = 3, degree = 1)2 105880 33675 3.144
## splines::bs(reasons$Year, df = 3, degree = 1)3 -50080 39304 -1.274
## Pr(>|t|)
## (Intercept) 0.000147 ***
## splines::bs(reasons$Year, df = 3, degree = 1)1 0.000000962 ***
## splines::bs(reasons$Year, df = 3, degree = 1)2 0.005610 **
## splines::bs(reasons$Year, df = 3, degree = 1)3 0.218810
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43870 on 18 degrees of freedom
## Multiple R-squared: 0.8627, Adjusted R-squared: 0.8399
## F-statistic: 37.71 on 3 and 18 DF, p-value: 0.00000005708
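A degree-1 B-spline fit is the same thing as a “broken-stick” regression built by hand from hinge terms. The sketch below uses simulated data (invented for illustration) with a single knot `k` chosen by hand; the names `t`, `k`, and `hinge` are hypothetical. It also checks how many interior knots `bs()` picks when given `df = 3, degree = 1`.

```r
library(splines)

# Simulated data: invented values with one change in slope at t = 9
set.seed(4)
t <- 0:21                                   # years since the first year
y <- ifelse(t < 9, 100 + 20 * t, 280 - 15 * (t - 9)) +
     rnorm(length(t), sd = 5)
k <- 9                                      # knot chosen by hand for this sketch

# Hinge term (t - k)_+ is zero before the knot and linear after it
hinge     <- pmax(t - k, 0)
fit_hinge <- lm(y ~ t + hinge)
fit_bs    <- lm(y ~ bs(t, knots = k, degree = 1))

# Both bases span the same continuous piecewise-linear functions
same_fit <- all.equal(unname(fitted(fit_hinge)), unname(fitted(fit_bs)))

# The hinge coefficient is the change in slope at the knot
coef(fit_hinge)["hinge"]

# df = 3 with degree = 1 implies 2 interior knots, i.e. 3 "sticks"
knots_df3 <- attr(bs(t, df = 3, degree = 1), "knots")
knots_df3
```

The hinge version is often easier to interpret, since its coefficients are slopes and slope changes, while the B-spline coefficients are not directly readable as slopes.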
Cubic spline
- The default in many packages is a cubic spline
ggplot(reasons, aes(x = Year, y = All)) +
geom_point() +
geom_smooth(method = lm,
formula = y ~ splines::bs(x, df = 3, degree = 3)) + labs(title="Deportations by year (FY 2003-2024)",
y="Number of deportations", x="Fiscal year") +
theme_bw()

m2<-lm(reasons$All ~ splines::bs(reasons$Year, df = 3, degree = 3))
summary(m2)
##
## Call:
## lm(formula = reasons$All ~ splines::bs(reasons$Year, df = 3,
## degree = 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -82614 -33031 11470 35348 78485
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 99250 35859 2.768
## splines::bs(reasons$Year, df = 3, degree = 3)1 592125 105979 5.587
## splines::bs(reasons$Year, df = 3, degree = 3)2 46104 72713 0.634
## splines::bs(reasons$Year, df = 3, degree = 3)3 -8487 54878 -0.155
## Pr(>|t|)
## (Intercept) 0.0127 *
## splines::bs(reasons$Year, df = 3, degree = 3)1 0.0000265 ***
## splines::bs(reasons$Year, df = 3, degree = 3)2 0.5340
## splines::bs(reasons$Year, df = 3, degree = 3)3 0.8788
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49400 on 18 degrees of freedom
## Multiple R-squared: 0.8259, Adjusted R-squared: 0.7969
## F-statistic: 28.47 on 3 and 18 DF, p-value: 0.0000004751
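The residuals, R-squared, and F-statistic above match the earlier `poly(reasons$Year, 3)` fit. A likely explanation is that `bs()` with `df = 3` and `degree = 3` has no interior knots, so the "spline" is a single cubic polynomial. The sketch below checks this on simulated data (invented for illustration; `t` and `y` are hypothetical).

```r
library(splines)

# Simulated data: invented values, not the project 3 series
set.seed(5)
t <- 0:21
y <- 100 + 8 * t - 0.9 * t^2 + 0.03 * t^3 + rnorm(length(t), sd = 4)

# Number of interior knots is df - degree when no intercept column is used
B          <- bs(t, df = 3, degree = 3)
n_interior <- length(attr(B, "knots"))

fit_bs   <- lm(y ~ bs(t, df = 3, degree = 3))
fit_poly <- lm(y ~ poly(t, 3))

# Same fitted curve, different coefficient parameterization
same_fit <- all.equal(unname(fitted(fit_bs)), unname(fitted(fit_poly)))

n_interior
same_fit
```

To get a cubic spline that actually bends at knots, `df` has to exceed the degree, for example `bs(t, df = 5, degree = 3)` for two interior knots.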
Some thoughts
- Project 3, Task 6 asks this:
- For this task, first create a diagnostic plot of all deportations by
year. Based on inspection of the plot, how many piecewise functions do
you think would best fit these data? Following this, estimate a
regression function using a spline function with a polynomial of order 1
and the number of splines equal to what your diagnostic plot suggests.
Comparing a model with 2 or 3 degrees of freedom, which model best
describes the data? This question is worth 50 points.
- Parametric v. non-parametric models
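One possible way to compare linear-spline models with different degrees of freedom is sketched below on simulated data (invented for illustration, not the project 3 data; the data frame `d` is hypothetical). Because `bs()` places the knots at different quantiles for each `df`, the two models are not nested, so this sketch compares adjusted R-squared and AIC rather than using an F-test.

```r
library(splines)

# Simulated data: invented values with slope changes at t = 7 and t = 15
set.seed(6)
d   <- data.frame(t = 0:21)
d$y <- 100 + 25 * d$t - 40 * pmax(d$t - 7, 0) + 30 * pmax(d$t - 15, 0) +
       rnorm(nrow(d), sd = 10)

fit2 <- lm(y ~ bs(t, df = 2, degree = 1), data = d)   # 1 interior knot, 2 pieces
fit3 <- lm(y ~ bs(t, df = 3, degree = 1), data = d)   # 2 interior knots, 3 pieces

# Higher adjusted R-squared and lower AIC both point to the better-fitting model
comparison <- data.frame(
  df     = c(2, 3),
  adj_r2 = c(summary(fit2)$adj.r.squared, summary(fit3)$adj.r.squared),
  aic    = c(AIC(fit2), AIC(fit3))
)
comparison
```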