library(tidyverse)
library(splines)
library(broom)
library(lme4)
library(moderndive)
library(ggformula)
Splajn reda \(M\) s čvorovima \(\xi_1,\xi_2,\dotsc,\xi_K\) je po dijelovima polinomijalna funkcija stupnja \(M-1\) koja je klase \(C^{M-2}\), tj. ima neprekidne derivacije sve do reda \(M-2\). Uz zadane čvorove i standardno zbrajanje funkcija i množenje funkcija realnim brojem, splajnovi reda \(M\) čine realni vektorski prostor dimenzije \(K+M\). Jedna baza za taj vektorski prostor sastoji se od sljedećih funkcija \[\begin{align*} h_k(x)&=x^{k-1},\quad k=1,\dotsc,M\\ h_{k+M}(x)&=(x-\xi_k)_{+}^{M-1},\quad k=1,\dotsc,K \end{align*}\] pri čemu je \[(x-\xi_k)_{+}^{M-1} = \begin{cases} (x-\xi_k)^{M-1},& \text{ako je } x>\xi_k\\ \hfill0,\hfill& \text{inače} \end{cases}\]
Splajnove reda \(M=1\) zovemo linearnim splajnovima. Pogledajmo bazu za linearne splajnove sa zadanim čvorovima \(0.2, 0.5, 0.9, 1.7\). Najprije definiramo čvorove i potrebne funkcije.
ksi <- c(0.2, 0.5, 0.9, 1.7)
bfun <- function(x, ksi = 0, p = 1) {
(x > ksi) * (x - ksi)^p
}
bfun <- Vectorize(bfun)
Nakon toga definiramo potrebne podatke za crtanje grafova bazičnih funkcija.
xval <- seq(-1, 3, by = 0.01)
graf <- tibble(x = xval, h1 = 1, h2 = x, h3 = bfun(x, ksi[1]),
h4 = bfun(x, ksi[2]), h5 = bfun(x, ksi[3]),
h6 = bfun(x, ksi[4])) %>%
pivot_longer(h1:h6, names_to = "fun", values_to = "value")
Nakon toga možemo nacrtati grafove tih funkcija.
ggplot(graf, aes(x = x, y = value)) +
geom_vline(xintercept = ksi, linetype="dotted", linewidth = 0.4, color = "blue") +
geom_line() + facet_wrap(vars(fun))
Grafovi bazičnih funkcija za linearne splajnove s čvorovima \(0.2, 0.5, 0.9, 1.7\)
Splajnove reda \(M=4\) zovemo kubičnim splajnovima. Pogledajmo bazu za kubične splajnove sa zadanim čvorovima \(0.2, 0.5, 0.9, 1.7\).
graf3 <- tibble(x = xval, h1 = 1, h2 = x, h3 = x^2, h4 = x^3,
h5 = bfun(x, ksi[1], 3), h6 = bfun(x, ksi[2], 3),
h7 = bfun(x, ksi[3], 3), h8 = bfun(x, ksi[4], 3)) %>%
pivot_longer(h1:h8, names_to = "fun", values_to = "value")
ggplot(graf3, aes(x = x, y = value)) +
geom_vline(xintercept = ksi, linetype="dotted", linewidth = 0.4, color = "blue") +
geom_line() + facet_wrap(vars(fun), scales = "free_y")
Grafovi bazičnih funkcija za kubične splajnove s čvorovima \(0.2, 0.5, 0.9, 1.7\)
Zadanim unutarnjim čvorovima \(\xi_1,\xi_2,\dotsc,\xi_K\) dodajemo još dva rubna čvora \(\xi_0<\xi_1\) i \(\xi_K<\xi_{K+1}\) koji definiraju segment \([\xi_0,\xi_{K+1}]\) koji predstavlja domenu na kojoj ćemo računati vrijednosti splajna u nekoj konkretnoj primjeni.
Definiramo novi veći niz čvorova \(\tau\) takve da vrijedi
Vrijednosti dodatnih čvorova izvan segmenta \([\xi_0,\xi_{K+1}]\) mogu biti proizvoljne i najčešće se stavlja da sve budu redom jednake \(\xi_0\) i \(\xi_{K+1}\).
Neka je \(B_{i,\,M}\) oznaka za \(i\)-tu B-splajn bazičnu funkciju reda \(M\) s nizom čvorova \(\tau\). Te funkcije se definiraju rekurzivno na sljedeći način: \[B_{i,1}(x)=\begin{cases} 1,& \text{ako je } \tau_i\leqslant x<\tau_{i+1}\\ 0,& \text{inače} \end{cases},\quad i=1,2,\dotsc,K+1 \] Po dogovoru je \(B_{i,1}=0\) ako je \(\tau_i=\tau_{i+1}\). \[B_{i,\,M}(x)=\frac{x-\tau_i}{\tau_{i+M-1}-\tau_i}B_{i,\,M-1}(x)+\frac{\tau_{i+M}-x}{\tau_{i+M}-\tau_{i+1}}B_{i+1,\,M-1}(x) ,\quad i=1,2,\dotsc,K+M \] Neka je \(V_M(\tau)\) realni vektorski prostor generiran s B-splajn bazičnim funkcijama reda \(M\). Vektorski prostor \(V_M(\tau)\) zovemo B-splajn reda \(M\), a njegova dimenzija je jednaka \(K+M\). Elemente vektorskog prostora \(V_M(\tau)\) zovemo B-splajn funkcijama reda \(M\) (stupnja \(M-1\)). Prvi član \(B_{0,\,M}\) iz B-splajn baze često zovemo odsječak (engl. intercept).
Funkcija bs iz biblioteke splines daje
B-splajn bazu za vektorski prostor \(V_M(\tau)\). Također, ta funkcija omogućuje
da umjesto zadavanja čvorova, zadajemo broj stupnjeva slobode
df pri čemu je zapravo \(df=K+M\) uz pretpostavku da je
intercept = TRUE. Ako je intercept = FALSE
(defaultna vrijednost), tada je \(df=K+M-1\).
Ako je
intercept = FALSE, tada se ispušta prvi vektor iz B-splajn
baze.
Napomena. Splajn stupnja \(d\) je reda \(M=d+1\). Neka je \(K\) broj unutarnjih čvorova. Tada je uz dva rubna čvora ukupni broj čvorova jednak \(k=K+2\). Za splajn stupnja \(d\) s ukupno \(k\) čvorova potrebno je
Stoga je dimenzija pripadnog vektorskog prostora jednaka \((d+1)(k-1)-d(k-2)=k+d-1\). S druge strane je \(k+d-1=(K+2)+(M-1)-1=K+M\).
Pogledajmo bazu za linearne B-splajn funkcije sa zadanim čvorovima \(0.2, 0.5, 0.9, 1.7\) na segmentu \([0,2]\).
Vrijednosti bazičnih funkcija
x1 <- seq(0, 2, by = 0.2)
bs(x1, knots = ksi, degree = 1, intercept = TRUE)
## 1 2 3 4 5 6
## [1,] 1 0.0000000 0.0000000 0.000 0.0000000 0.0000000
## [2,] 0 1.0000000 0.0000000 0.000 0.0000000 0.0000000
## [3,] 0 0.3333333 0.6666667 0.000 0.0000000 0.0000000
## [4,] 0 0.0000000 0.7500000 0.250 0.0000000 0.0000000
## [5,] 0 0.0000000 0.2500000 0.750 0.0000000 0.0000000
## [6,] 0 0.0000000 0.0000000 0.875 0.1250000 0.0000000
## [7,] 0 0.0000000 0.0000000 0.625 0.3750000 0.0000000
## [8,] 0 0.0000000 0.0000000 0.375 0.6250000 0.0000000
## [9,] 0 0.0000000 0.0000000 0.125 0.8750000 0.0000000
## [10,] 0 0.0000000 0.0000000 0.000 0.6666667 0.3333333
## [11,] 0 0.0000000 0.0000000 0.000 0.0000000 1.0000000
## attr(,"degree")
## [1] 1
## attr(,"knots")
## [1] 0.2 0.5 0.9 1.7
## attr(,"Boundary.knots")
## [1] 0 2
## attr(,"intercept")
## [1] TRUE
## attr(,"class")
## [1] "bs" "basis" "matrix"
Za crtanje grafova koristit ćemo veći broj točaka.
x2 <- seq(0, 2, by = 0.01)
bsMat1 <- bs(x2, knots = ksi, degree = 1, intercept = TRUE)
tablica1 <- as_tibble(bsMat1) %>% pivot_longer(everything()) %>%
mutate(value = as.numeric(value),
x = rep(x2, each = ncol(bsMat1)))
tablica1
## # A tibble: 1,206 × 3
## name value x
## <chr> <dbl> <dbl>
## 1 1 1 0
## 2 2 0 0
## 3 3 0 0
## 4 4 0 0
## 5 5 0 0
## 6 6 0 0
## 7 1 0.95 0.01
## 8 2 0.05 0.01
## 9 3 0 0.01
## 10 4 0 0.01
## # … with 1,196 more rows
Grafovi B-splajn bazičnih funkcija
ggplot(tablica1, aes(x = x, y = value)) +
geom_vline(xintercept = ksi, linetype="dotted", linewidth = 0.4, color = "blue") +
geom_line() + facet_wrap(vars(name))
Grafovi B-splajn bazičnih funkcija za linearne B-splajnove s čvorovima \(0.2, 0.5, 0.9, 1.7\)
Zadavanje bazičnih funkcija preko parametra
df
Ako stavimo intercept = TRUE, broj unutarnjih čvorova je
jednak df - degree - 1.
bs(x1, df = 6, degree = 1, intercept = TRUE)
## 1 2 3 4 5 6
## [1,] 1.0 0.0 0.000000e+00 0.0 0.0 0.0
## [2,] 0.5 0.5 0.000000e+00 0.0 0.0 0.0
## [3,] 0.0 1.0 0.000000e+00 0.0 0.0 0.0
## [4,] 0.0 0.5 5.000000e-01 0.0 0.0 0.0
## [5,] 0.0 0.0 1.000000e+00 0.0 0.0 0.0
## [6,] 0.0 0.0 5.000000e-01 0.5 0.0 0.0
## [7,] 0.0 0.0 5.551115e-16 1.0 0.0 0.0
## [8,] 0.0 0.0 0.000000e+00 0.5 0.5 0.0
## [9,] 0.0 0.0 0.000000e+00 0.0 1.0 0.0
## [10,] 0.0 0.0 0.000000e+00 0.0 0.5 0.5
## [11,] 0.0 0.0 0.000000e+00 0.0 0.0 1.0
## attr(,"degree")
## [1] 1
## attr(,"knots")
## 20% 40% 60% 80%
## 0.4 0.8 1.2 1.6
## attr(,"Boundary.knots")
## [1] 0 2
## attr(,"intercept")
## [1] TRUE
## attr(,"class")
## [1] "bs" "basis" "matrix"
bsMat2 <- bs(x2, df = 6, degree = 1, intercept = TRUE)
tablica2 <- as_tibble(bsMat2) %>% pivot_longer(everything()) %>%
mutate(value = as.numeric(value),
x = rep(x2, each = ncol(bsMat2)))
ggplot(tablica2, aes(x = x, y = value)) +
geom_vline(xintercept = attr(bsMat2, "knots"),
linetype="dotted", linewidth = 0.4, color = "blue") +
geom_line() + facet_wrap(vars(name))
Grafovi B-splajn bazičnih funkcija za linearne B-splajnove s
df = 6 i intercept = TRUE
Ako stavimo intercept = FALSE (defaultna vrijednost),
broj unutarnjih čvorova je jednak df - degree.
bs(x1, df = 6, degree = 1)
## 1 2 3 4 5 6
## [1,] 0.0 0.0 0.0 0.0 0.0 0.0
## [2,] 0.6 0.0 0.0 0.0 0.0 0.0
## [3,] 0.8 0.2 0.0 0.0 0.0 0.0
## [4,] 0.2 0.8 0.0 0.0 0.0 0.0
## [5,] 0.0 0.6 0.4 0.0 0.0 0.0
## [6,] 0.0 0.0 1.0 0.0 0.0 0.0
## [7,] 0.0 0.0 0.4 0.6 0.0 0.0
## [8,] 0.0 0.0 0.0 0.8 0.2 0.0
## [9,] 0.0 0.0 0.0 0.2 0.8 0.0
## [10,] 0.0 0.0 0.0 0.0 0.6 0.4
## [11,] 0.0 0.0 0.0 0.0 0.0 1.0
## attr(,"degree")
## [1] 1
## attr(,"knots")
## 16.66667% 33.33333% 50% 66.66667% 83.33333%
## 0.3333333 0.6666667 1.0000000 1.3333333 1.6666667
## attr(,"Boundary.knots")
## [1] 0 2
## attr(,"intercept")
## [1] FALSE
## attr(,"class")
## [1] "bs" "basis" "matrix"
bsMat3 <- bs(x2, df = 6, degree = 1)
tablica3 <- as_tibble(bsMat3) %>% pivot_longer(everything()) %>%
mutate(value = as.numeric(value),
x = rep(x2, each = ncol(bsMat3)))
ggplot(tablica3, aes(x = x, y = value)) +
geom_vline(xintercept = attr(bsMat3, "knots"),
linetype="dotted", linewidth = 0.4, color = "blue") +
geom_line() + facet_wrap(vars(name))
Grafovi B-splajn bazičnih funkcija za linearne B-splajnove s
df = 6 i intercept = FALSE
Pogledajmo bazu za kubične B-splajn funkcije sa zadanim čvorovima
\(0.2, 0.5, 0.9, 1.7\) na segmentu
\([0,2]\). Po defaultu je
degree = 3 pa ne treba u tom slučaju eksplicitno pisati
vrijednost tog parametra.
Vrijednosti bazičnih funkcija
bs(x1, knots = ksi, intercept = TRUE)
## 1 2 3 4 5 6 7
## [1,] 1 0.00000000 0.000000000 0.0000000000 0.00000000 0.000000000 0.000000000
## [2,] 0 0.36000000 0.551111111 0.0888888889 0.00000000 0.000000000 0.000000000
## [3,] 0 0.01333333 0.466031746 0.4952380952 0.02539683 0.000000000 0.000000000
## [4,] 0 0.00000000 0.107142857 0.6966269841 0.19484127 0.001388889 0.000000000
## [5,] 0 0.00000000 0.003968254 0.4978174603 0.46071429 0.037500000 0.000000000
## [6,] 0 0.00000000 0.000000000 0.2381944444 0.59103535 0.169737144 0.001033058
## [7,] 0 0.00000000 0.000000000 0.0868055556 0.51351010 0.371791781 0.027892562
## [8,] 0 0.00000000 0.000000000 0.0187500000 0.32386364 0.528254132 0.129132231
## [9,] 0 0.00000000 0.000000000 0.0006944444 0.12512626 0.519840450 0.354338843
## [10,] 0 0.00000000 0.000000000 0.0000000000 0.01616162 0.244628099 0.702173248
## [11,] 0 0.00000000 0.000000000 0.0000000000 0.00000000 0.000000000 0.000000000
## 8
## [1,] 0.00000000
## [2,] 0.00000000
## [3,] 0.00000000
## [4,] 0.00000000
## [5,] 0.00000000
## [6,] 0.00000000
## [7,] 0.00000000
## [8,] 0.00000000
## [9,] 0.00000000
## [10,] 0.03703704
## [11,] 1.00000000
## attr(,"degree")
## [1] 3
## attr(,"knots")
## [1] 0.2 0.5 0.9 1.7
## attr(,"Boundary.knots")
## [1] 0 2
## attr(,"intercept")
## [1] TRUE
## attr(,"class")
## [1] "bs" "basis" "matrix"
Grafovi B-splajn bazičnih funkcija
bsMat4 <- bs(x2, knots = ksi, intercept = TRUE)
tablica4 <- as_tibble(bsMat4) %>% pivot_longer(everything()) %>%
mutate(value = as.numeric(value),
x = rep(x2, each = ncol(bsMat4)))
ggplot(tablica4, aes(x = x, y = value)) +
geom_vline(xintercept = ksi, linetype="dotted", linewidth = 0.4, color = "blue") +
geom_line() + facet_wrap(vars(name))
Grafovi B-splajn bazičnih funkcija za kubične B-splajnove s čvorovima \(0.2, 0.5, 0.9, 1.7\)
Zadavanje bazičnih funkcija preko parametra
df
Ako stavimo intercept = TRUE, broj unutarnjih čvorova je
jednak df - degree - 1.
bs(x1, df = 8, intercept = TRUE)
## 1 2 3 4 5 6 7 8
## [1,] 1.000 0.00000 0.000000e+00 0.00000000 0.00000000 0.00000000 0.00000 0.000
## [2,] 0.125 0.59375 2.604167e-01 0.02083333 0.00000000 0.00000000 0.00000 0.000
## [3,] 0.000 0.25000 5.833333e-01 0.16666667 0.00000000 0.00000000 0.00000 0.000
## [4,] 0.000 0.03125 4.687500e-01 0.47916667 0.02083333 0.00000000 0.00000 0.000
## [5,] 0.000 0.00000 1.666667e-01 0.66666667 0.16666667 0.00000000 0.00000 0.000
## [6,] 0.000 0.00000 2.083333e-02 0.47916667 0.47916667 0.02083333 0.00000 0.000
## [7,] 0.000 0.00000 2.850949e-47 0.16666667 0.66666667 0.16666667 0.00000 0.000
## [8,] 0.000 0.00000 0.000000e+00 0.02083333 0.47916667 0.46875000 0.03125 0.000
## [9,] 0.000 0.00000 0.000000e+00 0.00000000 0.16666667 0.58333333 0.25000 0.000
## [10,] 0.000 0.00000 0.000000e+00 0.00000000 0.02083333 0.26041667 0.59375 0.125
## [11,] 0.000 0.00000 0.000000e+00 0.00000000 0.00000000 0.00000000 0.00000 1.000
## attr(,"degree")
## [1] 3
## attr(,"knots")
## 20% 40% 60% 80%
## 0.4 0.8 1.2 1.6
## attr(,"Boundary.knots")
## [1] 0 2
## attr(,"intercept")
## [1] TRUE
## attr(,"class")
## [1] "bs" "basis" "matrix"
bsMat5 <- bs(x2, df = 8, intercept = TRUE)
tablica5 <- as_tibble(bsMat5) %>% pivot_longer(everything()) %>%
mutate(value = as.numeric(value),
x = rep(x2, each = ncol(bsMat5)))
ggplot(tablica5, aes(x = x, y = value)) +
geom_vline(xintercept = attr(bsMat5, "knots"),
linetype="dotted", linewidth = 0.4, color = "blue") +
geom_line() + facet_wrap(vars(name))
Grafovi B-splajn bazičnih funkcija za kubične B-splajnove s
df = 8 i intercept = TRUE
Ako stavimo intercept = FALSE (defaultna vrijednost),
broj unutarnjih čvorova je jednak df - degree.
bs(x1, df = 8)
## 1 2 3 4 5 6 7 8
## [1,] 0.000 0.000 0.00000000 0.000000000 0.00000000 0.000 0.000 0.000
## [2,] 0.558 0.342 0.03600000 0.000000000 0.00000000 0.000 0.000 0.000
## [3,] 0.128 0.588 0.28266667 0.001333333 0.00000000 0.000 0.000 0.000
## [4,] 0.002 0.282 0.63066667 0.085333333 0.00000000 0.000 0.000 0.000
## [5,] 0.000 0.036 0.53866667 0.414666667 0.01066667 0.000 0.000 0.000
## [6,] 0.000 0.000 0.16666667 0.666666667 0.16666667 0.000 0.000 0.000
## [7,] 0.000 0.000 0.01066667 0.414666667 0.53866667 0.036 0.000 0.000
## [8,] 0.000 0.000 0.00000000 0.085333333 0.63066667 0.282 0.002 0.000
## [9,] 0.000 0.000 0.00000000 0.001333333 0.28266667 0.588 0.128 0.000
## [10,] 0.000 0.000 0.00000000 0.000000000 0.03600000 0.342 0.558 0.064
## [11,] 0.000 0.000 0.00000000 0.000000000 0.00000000 0.000 0.000 1.000
## attr(,"degree")
## [1] 3
## attr(,"knots")
## 16.66667% 33.33333% 50% 66.66667% 83.33333%
## 0.3333333 0.6666667 1.0000000 1.3333333 1.6666667
## attr(,"Boundary.knots")
## [1] 0 2
## attr(,"intercept")
## [1] FALSE
## attr(,"class")
## [1] "bs" "basis" "matrix"
bsMat6 <- bs(x2, df = 8)
tablica6 <- as_tibble(bsMat6) %>% pivot_longer(everything()) %>%
mutate(value = as.numeric(value),
x = rep(x2, each = ncol(bsMat6)))
ggplot(tablica6, aes(x = x, y = value)) +
geom_vline(xintercept = attr(bsMat6, "knots"),
linetype="dotted", linewidth = 0.4, color = "blue") +
geom_line() + facet_wrap(vars(name))
Grafovi B-splajn bazičnih funkcija za kubične B-splajnove s
df = 8 i intercept = FALSE
Prirodni kubični splajn dodaje dodatna ograničenja, tj. pretpostavlja da je funkcija linearna izvan rubnih čvorova. Prirodni kubični splajn s \(K\) čvorova (uključujući i rubne čvorove) može se prikazati kao linearna kombinacija \(K\) bazičnih funkcija. Možemo krenuti od baze za kubične splajnove i dodavanjem rubnih uvjeta doći do baze za prirodne kubične splajnove. Ako na \[f(x)=\sum_{j=0}^3{\beta_jx^j}+\sum_{k=1}^K{\theta_k(x-\xi_k)_{+}^3}\] stavimo ograničenja \(f''(x)=0\) za \(x<\xi_1\) i \(x>\xi_K\), dolazimo do uvjeta \[\beta_2=\beta_3=0,\quad \sum_{k=1}^K{\theta_k}=0,\quad \sum_{k=1}^K{\xi_k\theta_k}=0.\] Navedeni uvjeti su zadovoljeni ako uzmemo sljedeću bazu \[N_1(x)=1,\quad N_2(x)=x,\quad N_{k+2}(x)=d_k(x)-d_{K-1}(x),\ \ k=1,2,\dotsc,K-2\] pri čemu je \[d_k(x)=\frac{(x-\xi_k)_{+}^3-(x-\xi_K)_{+}^3}{\xi_K-\xi_k}.\]
Nfun <- function(x, ksi1, ksi2, ksiK) {
v1 <- ((x > ksi1) * (x - ksi1)^3 - (x > ksiK) * (x - ksiK)^3) / (ksiK - ksi1)
v2 <- ((x > ksi2) * (x - ksi2)^3 - (x > ksiK) * (x - ksiK)^3) / (ksiK - ksi2)
v1 - v2
}
Nfun <- Vectorize(Nfun)
xnval <- seq(0, 2, by = 0.01)
nksi <- c(0, 0.2, 0.5, 0.9, 1.7, 2)
ngraf3 <- tibble(x = xnval, N1 = 1, N2 = x,
N3 = Nfun(x, nksi[1], nksi[length(nksi)-1], nksi[length(nksi)]),
N4 = Nfun(x, nksi[2], nksi[length(nksi)-1], nksi[length(nksi)]),
N5 = Nfun(x, nksi[3], nksi[length(nksi)-1], nksi[length(nksi)]),
N6 = Nfun(x, nksi[4], nksi[length(nksi)-1], nksi[length(nksi)])) %>%
pivot_longer(N1:N6, names_to = "fun", values_to = "value")
ggplot(ngraf3, aes(x = x, y = value)) +
geom_vline(xintercept = nksi[2:(length(nksi)-1)], linetype="dotted",
linewidth = 0.4, color = "blue") +
geom_line() + facet_wrap(vars(fun), scales = "free_y")
Grafovi bazičnih funkcija \(N_k\) za prirodne kubične splajnove s čvorovima \(0, 0.2, 0.5, 0.9, 1.7, 2\)
Funkcija ns iz biblioteke splines daje bazu
za prirodne kubične splajnove koja je bazirana na B-splajn bazi.
Funkcija ns ima iste parametre kao i bs
funkcija, jedino nema parametra degree jer je on u ovom
slučaju zapravo uvijek jednak 3.
Vrijednosti bazičnih funkcija
ns(x1, knots = ksi, intercept = TRUE)
## 1 2 3 4 5
## [1,] -0.226455407 0.0000000000 0.00000000 -0.1316931523 0.614568044
## [2,] 0.574323155 0.0888888889 0.00000000 -0.0463169074 0.216145568
## [3,] 0.452299946 0.4952380952 0.02539683 -0.0102010030 0.047604681
## [4,] 0.103634543 0.6966269841 0.19484127 -0.0006896838 0.009700006
## [5,] 0.003838316 0.4978174603 0.46071429 0.0363891599 0.005183921
## [6,] 0.000000000 0.2381944444 0.59103535 0.1651842427 0.022279931
## [7,] 0.000000000 0.0868055556 0.51351010 0.3651211114 0.059022355
## [8,] 0.000000000 0.0187500000 0.32386364 0.5303070970 0.119551729
## [9,] 0.000000000 0.0006944444 0.12512626 0.5511400431 0.208274075
## [10,] 0.000000000 0.0000000000 0.01616162 0.3245894528 0.329020264
## [11,] 0.000000000 0.0000000000 0.00000000 -0.1012269939 0.472392638
## 6
## [1,] -0.482874892
## [2,] -0.169828661
## [3,] -0.037403678
## [4,] -0.007621433
## [5,] -0.004073080
## [6,] -0.016693972
## [7,] -0.024459123
## [8,] 0.007527538
## [9,] 0.114765175
## [10,] 0.330228667
## [11,] 0.628834356
## attr(,"degree")
## [1] 3
## attr(,"knots")
## [1] 0.2 0.5 0.9 1.7
## attr(,"Boundary.knots")
## [1] 0 2
## attr(,"intercept")
## [1] TRUE
## attr(,"class")
## [1] "ns" "basis" "matrix"
Grafovi B-splajn bazičnih funkcija
nsMat1 <- ns(x2, knots = ksi, intercept = TRUE)
nsTablica1 <- as_tibble(nsMat1) %>% pivot_longer(everything()) %>%
mutate(value = as.numeric(value),
x = rep(x2, each = ncol(nsMat1)))
ggplot(nsTablica1, aes(x = x, y = value)) +
geom_vline(xintercept = ksi, linetype="dotted", linewidth = 0.4, color = "blue") +
geom_line() + facet_wrap(vars(name))
Grafovi B-splajn bazičnih funkcija za prirodne B-splajnove s čvorovima \(0, 0.2, 0.5, 0.9, 1.7, 2\)
Zadavanje bazičnih funkcija preko parametra
df
Ako stavimo intercept = TRUE, broj unutarnjih čvorova je
jednak df - 2.
ns(x1, df = 6, intercept = TRUE)
## 1 2 3 4 5 6
## [1,] -2.672612e-01 0.00000000 0.00000000 -0.214285714 0.642857143 -0.428571429
## [2,] 2.978004e-01 0.02083333 0.00000000 -0.128712706 0.386138119 -0.257425413
## [3,] 5.910913e-01 0.16666667 0.00000000 -0.060595106 0.181785317 -0.121190211
## [4,] 4.512946e-01 0.47916667 0.02083333 -0.022347375 0.067042125 -0.044694750
## [5,] 1.589087e-01 0.66666667 0.16666667 -0.006220205 0.018660615 -0.012440410
## [6,] 1.986359e-02 0.47916667 0.47916667 0.018567712 0.006796863 -0.004531242
## [7,] 2.718244e-47 0.16666667 0.66666667 0.154761905 0.035714286 -0.023809524
## [8,] 0.000000e+00 0.02083333 0.47916667 0.441964286 0.111607143 -0.053571429
## [9,] 0.000000e+00 0.00000000 0.16666667 0.595238095 0.214285714 0.023809524
## [10,] 0.000000e+00 0.00000000 0.02083333 0.351190476 0.321428571 0.306547619
## [11,] 0.000000e+00 0.00000000 0.00000000 -0.142857143 0.428571429 0.714285714
## attr(,"degree")
## [1] 3
## attr(,"knots")
## 20% 40% 60% 80%
## 0.4 0.8 1.2 1.6
## attr(,"Boundary.knots")
## [1] 0 2
## attr(,"intercept")
## [1] TRUE
## attr(,"class")
## [1] "ns" "basis" "matrix"
nsMat2 <- ns(x2, df = 6, intercept = TRUE)
nsTablica2 <- as_tibble(nsMat2) %>% pivot_longer(everything()) %>%
mutate(value = as.numeric(value),
x = rep(x2, each = ncol(nsMat2)))
ggplot(nsTablica2, aes(x = x, y = value)) +
geom_vline(xintercept = attr(nsMat2, "knots"),
linetype="dotted", linewidth = 0.4, color = "blue") +
geom_line() + facet_wrap(vars(name))
Grafovi B-splajn bazičnih funkcija za prirodne B-splajnove s
df = 6 i intercept = TRUE
Ako stavimo intercept = FALSE (defaultna vrijednost),
broj unutarnjih čvorova je jednak df - 1.
ns(x1, df = 6)
## 1 2 3 4 5 6
## [1,] 0.00000000 0.000000000 0.00000000 0.000000000 0.000000000 0.000000000
## [2,] 0.03600000 0.000000000 0.00000000 -0.133872434 0.401617302 -0.267744868
## [3,] 0.28266667 0.001333333 0.00000000 -0.159903185 0.479709555 -0.319806370
## [4,] 0.63066667 0.085333333 0.00000000 -0.071669081 0.215007242 -0.143338162
## [5,] 0.53866667 0.414666667 0.01066667 -0.009127666 0.027382998 -0.018255332
## [6,] 0.16666667 0.666666667 0.16666667 0.000000000 0.000000000 0.000000000
## [7,] 0.01066667 0.414666667 0.53866667 0.033428571 0.007714286 -0.005142857
## [8,] 0.00000000 0.085333333 0.63066667 0.262285714 0.061142857 -0.039428571
## [9,] 0.00000000 0.001333333 0.28266667 0.573428571 0.171714286 -0.029142857
## [10,] 0.00000000 0.000000000 0.03600000 0.428000000 0.300000000 0.236000000
## [11,] 0.00000000 0.000000000 0.00000000 -0.142857143 0.428571429 0.714285714
## attr(,"degree")
## [1] 3
## attr(,"knots")
## 16.66667% 33.33333% 50% 66.66667% 83.33333%
## 0.3333333 0.6666667 1.0000000 1.3333333 1.6666667
## attr(,"Boundary.knots")
## [1] 0 2
## attr(,"intercept")
## [1] FALSE
## attr(,"class")
## [1] "ns" "basis" "matrix"
nsMat3 <- ns(x2, df = 6)
nsTablica3 <- as_tibble(nsMat3) %>% pivot_longer(everything()) %>%
mutate(value = as.numeric(value),
x = rep(x2, each = ncol(nsMat3)))
ggplot(nsTablica3, aes(x = x, y = value)) +
geom_vline(xintercept = attr(nsMat3, "knots"),
linetype="dotted", linewidth = 0.4, color = "blue") +
geom_line() + facet_wrap(vars(name))
Grafovi B-splajn bazičnih funkcija za prirodne B-splajnove s
df = 6 i intercept = FALSE
Prava korist od bs i ns funkcija je njihovo
korištenje u regresijskim modelima. Uz klasičnu linearnu regresiju
možemo koristiti i splajnove za kompleksnije modele. Koristit ćemo
iris skup podataka koji dolazi s R programskim
jezikom.
iris %>% glimpse()
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
Ovdje je ideja da samo pokažemo kako možemo koristiti splajnove na konkretnim podacima, nećemo ulaziti u analize koliko su ti modeli dobri ili loši na novim nepoznatim podacima. Ukratko, radi jednostavnosti, treniranje modela ćemo napraviti na čitavom skupu podataka, nećemo zadane podatke dijeliti na trening podatke i podatke za testiranje.
Kod linearne regresije s jednom nezavisnom varijablom \(x\), zavisna varijabla \(y\) je povezana s nezavisnom varijablom preko jednadžbe \[\begin{equation} y=\alpha_0+\alpha_1x+\varepsilon\tag{$\clubsuit$} \end{equation}\] pri čemu parametre \(\alpha_0,\alpha_1\in\mathbb{R}\) određujemo tako da minimiziramo sumu kvadrata reziduala. Pritom je pretpostavka da reziduali imaju normalnu razdiobu s aritmetičkom sredinom \(0\) i standardnom devijacijom \(\sigma\), tj. \(\varepsilon\sim N(0,\sigma^2)\).
U kontekstu vektorskih prostora možemo razmišljati na sljedeći način:
Umjesto potprostora \(\mathcal{P}_2\) možemo zadati neki drugi konačnodimenzionalni potprostor \(L\) unutar vektorskog prostora realnih funkcija realne varijable i tražiti funkciju \(f\) iz tog potprostora tako da umjesto \((\clubsuit)\) vrijedi \[\begin{equation} y=f(x)+\varepsilon.\tag{$\spadesuit$} \end{equation}\] Funkcija \(f\) bit će prikazana kao linearna kombinacija vektora iz zadane baze za \(L\). Pripadni koeficijenti se određuju tako da se minimizira suma kvadrata reziduala u \((\spadesuit)\).
Za potprostor \(L\) možemo uzeti
vektorski prostor B-splajnova ili prirodnih splajnova čije se baze mogu
izgenerirati pomoću funkcija bs i ns.
model1 <- lm(Sepal.Length ~ Petal.Length, data = iris)
summary(model1)
##
## Call:
## lm(formula = Sepal.Length ~ Petal.Length, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.24675 -0.29657 -0.01515 0.27676 1.00269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.30660 0.07839 54.94 <2e-16 ***
## Petal.Length 0.40892 0.01889 21.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4071 on 148 degrees of freedom
## Multiple R-squared: 0.76, Adjusted R-squared: 0.7583
## F-statistic: 468.6 on 1 and 148 DF, p-value: < 2.2e-16
tidy(model1)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.31 0.0784 54.9 2.43e-100
## 2 Petal.Length 0.409 0.0189 21.6 1.04e- 47
augment(model1)
## # A tibble: 150 × 8
## Sepal.Length Petal.Length .fitted .resid .hat .sigma .cooksd .std.resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5.1 1.4 4.88 0.221 0.0186 0.408 0.00285 0.548
## 2 4.9 1.4 4.88 0.0209 0.0186 0.408 0.0000255 0.0518
## 3 4.7 1.3 4.84 -0.138 0.0197 0.408 0.00118 -0.343
## 4 4.6 1.5 4.92 -0.320 0.0176 0.408 0.00565 -0.793
## 5 5 1.4 4.88 0.121 0.0186 0.408 0.000854 0.300
## 6 5.4 1.7 5.00 0.398 0.0158 0.407 0.00780 0.986
## 7 4.6 1.4 4.88 -0.279 0.0186 0.408 0.00455 -0.692
## 8 5 1.5 4.92 0.0800 0.0176 0.408 0.000353 0.198
## 9 4.4 1.4 4.88 -0.479 0.0186 0.407 0.0134 -1.19
## 10 4.9 1.5 4.92 -0.0200 0.0176 0.408 0.0000220 -0.0495
## # … with 140 more rows
glance(model1) %>% print(width = Inf)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.760 0.758 0.407 469. 1.04e-47 1 -77.0 160. 169.
## deviance df.residual nobs
## <dbl> <int> <int>
## 1 24.5 148 150
ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length)) +
geom_point(alpha = 0.5) + geom_smooth(method = "lm", formula = y ~ x)
Linearna regresija
Možemo napraviti linearnu regresiju po svakom faktoru u varijabli
Species. Ovaj posao olakšava naredba lmList iz
lme4 biblioteke.
model1_kat <- lmList(Sepal.Length ~ Petal.Length | Species, data = iris)
summary(model1_kat)
## Call:
## Model: Sepal.Length ~ Petal.Length | NULL
## Data: iris
##
## Coefficients:
## (Intercept)
## Estimate Std. Error t value Pr(>|t|)
## setosa 4.213168 0.4074209 10.341071 4.331619e-19
## versicolor 2.407523 0.4383204 5.492611 1.744801e-07
## virginica 1.059659 0.4858582 2.181005 3.080623e-02
## Petal.Length
## Estimate Std. Error t value Pr(>|t|)
## setosa 0.5422926 0.27676668 1.959385 5.199902e-02
## versicolor 0.8282810 0.10228407 8.097849 2.155919e-13
## virginica 0.9957386 0.08708982 11.433468 6.047554e-22
##
## Residual standard error: 0.3364509 on 144 degrees of freedom
map_dfr(model1_kat, tidy, .id = 'Species')
## # A tibble: 6 × 6
## Species term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 setosa (Intercept) 4.21 0.416 10.1 1.61e-13
## 2 setosa Petal.Length 0.542 0.282 1.92 6.07e- 2
## 3 versicolor (Intercept) 2.41 0.446 5.39 2.08e- 6
## 4 versicolor Petal.Length 0.828 0.104 7.95 2.59e-10
## 5 virginica (Intercept) 1.06 0.467 2.27 2.77e- 2
## 6 virginica Petal.Length 0.996 0.0837 11.9 6.30e-16
map_dfr(model1_kat, glance, .id = 'Species') %>% print(width = Inf)
## # A tibble: 3 × 13
## Species r.squared adj.r.squared sigma statistic p.value df logLik AIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 0.0714 0.0520 0.343 3.69 6.07e- 2 1 -16.5 38.9
## 2 versicolor 0.569 0.560 0.343 63.3 2.59e-10 1 -16.4 38.7
## 3 virginica 0.747 0.742 0.323 142. 6.30e-16 1 -13.5 32.9
## BIC deviance df.residual nobs
## <dbl> <dbl> <int> <int>
## 1 44.6 5.65 48 50
## 2 44.5 5.63 48 50
## 3 38.6 5.01 48 50
map_dfr(model1_kat, augment, .id = 'Species') %>% print(width = Inf)
## # A tibble: 150 × 10
## Species Sepal.Length Petal.Length .fitted .resid .hat .sigma .cooksd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5.1 1.4 4.97 0.128 0.0226 0.346 0.00164
## 2 setosa 4.9 1.4 4.97 -0.0724 0.0226 0.347 0.000526
## 3 setosa 4.7 1.3 4.92 -0.218 0.0378 0.345 0.00824
## 4 setosa 4.6 1.5 5.03 -0.427 0.0210 0.341 0.0169
## 5 setosa 5 1.4 4.97 0.0276 0.0226 0.347 0.0000766
## 6 setosa 5.4 1.7 5.14 0.265 0.0583 0.345 0.0196
## 7 setosa 4.6 1.4 4.97 -0.372 0.0226 0.342 0.0139
## 8 setosa 5 1.5 5.03 -0.0266 0.0210 0.347 0.0000658
## 9 setosa 4.4 1.4 4.97 -0.572 0.0226 0.336 0.0329
## 10 setosa 4.9 1.5 5.03 -0.127 0.0210 0.346 0.00149
## .std.resid .rownames
## <dbl> <chr>
## 1 0.376 <NA>
## 2 -0.213 <NA>
## 3 -0.648 <NA>
## 4 -1.26 <NA>
## 5 0.0814 <NA>
## 6 0.796 <NA>
## 7 -1.10 <NA>
## 8 -0.0784 <NA>
## 9 -1.69 <NA>
## 10 -0.373 <NA>
## # … with 140 more rows
Zapravo isto možemo dobiti i preko lm naredbe, samo
treba znati interpretirati izlazne podatke.
lm(Sepal.Length ~ Petal.Length * Species, data = iris)
##
## Call:
## lm(formula = Sepal.Length ~ Petal.Length * Species, data = iris)
##
## Coefficients:
## (Intercept) Petal.Length
## 4.2132 0.5423
## Speciesversicolor Speciesvirginica
## -1.8056 -3.1535
## Petal.Length:Speciesversicolor Petal.Length:Speciesvirginica
## 0.2860 0.4534
lm(Sepal.Length ~ Petal.Length * Species + 0, data = iris)
##
## Call:
## lm(formula = Sepal.Length ~ Petal.Length * Species + 0, data = iris)
##
## Coefficients:
## Petal.Length Speciessetosa
## 0.5423 4.2132
## Speciesversicolor Speciesvirginica
## 2.4075 1.0597
## Petal.Length:Speciesversicolor Petal.Length:Speciesvirginica
## 0.2860 0.4534
ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length, color = Species)) +
geom_point(alpha = 0.5) + geom_smooth(method = "lm", formula = y ~ x)
Linearna regresija po faktorima u Species varijabli
Preko modela bez interakcije dobivamo paralelne pravce.
model1_kat2 <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)
summary(model1_kat2)
##
## Call:
## lm(formula = Sepal.Length ~ Petal.Length + Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.75310 -0.23142 -0.00081 0.23085 1.03100
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.68353 0.10610 34.719 < 2e-16 ***
## Petal.Length 0.90456 0.06479 13.962 < 2e-16 ***
## Speciesversicolor -1.60097 0.19347 -8.275 7.37e-14 ***
## Speciesvirginica -2.11767 0.27346 -7.744 1.48e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.338 on 146 degrees of freedom
## Multiple R-squared: 0.8367, Adjusted R-squared: 0.8334
## F-statistic: 249.4 on 3 and 146 DF, p-value: < 2.2e-16
tidy(model1_kat2)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.68 0.106 34.7 1.97e-72
## 2 Petal.Length 0.905 0.0648 14.0 1.12e-28
## 3 Speciesversicolor -1.60 0.193 -8.28 7.37e-14
## 4 Speciesvirginica -2.12 0.273 -7.74 1.48e-12
augment(model1_kat2)
## # A tibble: 150 × 9
## Sepal.Length Petal.Length Species .fitted .resid .hat .sigma .cooksd
## <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5.1 1.4 setosa 4.95 0.150 0.0201 0.339 0.00103
## 2 4.9 1.4 setosa 4.95 -0.0499 0.0201 0.339 0.000114
## 3 4.7 1.3 setosa 4.86 -0.159 0.0210 0.339 0.00122
## 4 4.6 1.5 setosa 5.04 -0.440 0.0201 0.337 0.00886
## 5 5 1.4 setosa 4.95 0.0501 0.0201 0.339 0.000115
## 6 5.4 1.7 setosa 5.22 0.179 0.0221 0.339 0.00161
## 7 4.6 1.4 setosa 4.95 -0.350 0.0201 0.338 0.00562
## 8 5 1.5 setosa 5.04 -0.0404 0.0201 0.339 0.0000745
## 9 4.4 1.4 setosa 4.95 -0.550 0.0201 0.336 0.0139
## 10 4.9 1.5 setosa 5.04 -0.140 0.0201 0.339 0.000900
## # … with 140 more rows, and 1 more variable: .std.resid <dbl>
glance(model1_kat2) %>% print(width = Inf)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.837 0.833 0.338 249. 3.10e-57 3 -48.1 106. 121.
## deviance df.residual nobs
## <dbl> <int> <int>
## 1 16.7 146 150
ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length, color = Species)) +
geom_point(alpha = 0.5) + geom_parallel_slopes()
Linearna regresija sa Species varijablom u modelu
B-splajn regresija
Pogledajmo regresiju s kubičnom B-splajn funkcijom s \(5\) i \(8\) stupnjeva slobode.
model2 <- lm(Sepal.Length ~ bs(Petal.Length, df = 5), data = iris)
summary(model2)
##
## Call:
## lm(formula = Sepal.Length ~ bs(Petal.Length, df = 5), data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10349 -0.23865 -0.00704 0.20595 0.96287
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.60448 0.20683 22.262 < 2e-16 ***
## bs(Petal.Length, df = 5)1 0.75358 0.37486 2.010 0.0463 *
## bs(Petal.Length, df = 5)2 0.07583 0.33296 0.228 0.8202
## bs(Petal.Length, df = 5)3 1.79417 0.34290 5.232 5.8e-07 ***
## bs(Petal.Length, df = 5)4 2.23305 0.27398 8.150 1.6e-13 ***
## bs(Petal.Length, df = 5)5 3.48962 0.32318 10.798 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3605 on 144 degrees of freedom
## Multiple R-squared: 0.8168, Adjusted R-squared: 0.8104
## F-statistic: 128.4 on 5 and 144 DF, p-value: < 2.2e-16
ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length)) +
geom_point(alpha = 0.5) + geom_smooth(method = "lm", formula = y ~ bs(x, df = 5))
B-splajn regresija s 5 stupnjeva slobode
model3 <- lm(Sepal.Length ~ bs(Petal.Length, df = 8), data = iris)
summary(model3)
##
## Call:
## lm(formula = Sepal.Length ~ bs(Petal.Length, df = 8), data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.18260 -0.21045 -0.00888 0.20538 0.98525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6039 0.3128 14.718 < 2e-16 ***
## bs(Petal.Length, df = 8)1 0.1760 0.4057 0.434 0.66511
## bs(Petal.Length, df = 8)2 0.8011 0.3514 2.280 0.02411 *
## bs(Petal.Length, df = 8)3 -0.2299 0.5536 -0.415 0.67853
## bs(Petal.Length, df = 8)4 1.2185 0.3670 3.320 0.00115 **
## bs(Petal.Length, df = 8)5 1.7079 0.3394 5.033 1.45e-06 ***
## bs(Petal.Length, df = 8)6 1.6223 0.3938 4.120 6.44e-05 ***
## bs(Petal.Length, df = 8)7 3.5758 0.4888 7.316 1.77e-11 ***
## bs(Petal.Length, df = 8)8 3.0177 0.4219 7.153 4.24e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3521 on 141 degrees of freedom
## Multiple R-squared: 0.8289, Adjusted R-squared: 0.8192
## F-statistic: 85.37 on 8 and 141 DF, p-value: < 2.2e-16
ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length)) +
geom_point(alpha = 0.5) + geom_smooth(method = "lm", formula = y ~ bs(x, df = 8))
B-splajn regresija s 8 stupnjeva slobode
Povećavanjem stupnja slobode model postaje kompleksniji, krivulja bolje prati podatke. Međutim, to može dovesti do prenaučenosti (overfitting) pa se model može lošije ponašati na nepoznatim novim podacima. S druge strane, manji broj stupnja slobode dovodi do jednostavnijeg modela koji se može loše ponašati na podacima na kojima je treniran (underfitting). Stoga treba naći neki balans između tih dvaju krajnjih slučajeva. Ovdje nećemo ulaziti u te detalje.
ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", formula = y ~ bs(x, df = 3), se = FALSE, aes(color = "df = 3")) +
geom_smooth(method = "lm", formula = y ~ bs(x, df = 5), se = FALSE, aes(color = "df = 5")) +
geom_smooth(method = "lm", formula = y ~ bs(x, df = 8), se = FALSE, aes(color = "df = 8")) +
geom_smooth(method = "lm", formula = y ~ bs(x, df = 12), se = FALSE, aes(color = "df = 12")) +
scale_color_hue("Stupnjevi slobode", breaks = c('df = 3', 'df = 5', 'df = 8', 'df = 12'))
B-splajn regresija i stupnjevi slobode
Možemo napraviti B-splajn regresiju po svakom faktoru u varijabli
Species.
model2_kat <- lmList(Sepal.Length ~ bs(Petal.Length, df = 5) | Species, data = iris)
summary(model2_kat)
## Call:
## Model: Sepal.Length ~ bs(Petal.Length, df = 5) | NULL
## Data: iris
##
## Coefficients:
## (Intercept)
## Estimate Std. Error t value Pr(>|t|)
## setosa 4.476538 0.3232012 13.85062 1.692040e-27
## versicolor 5.031455 0.3221284 15.61941 8.532766e-32
## virginica 4.970176 0.3314348 14.99594 2.696428e-30
## bs(Petal.Length, df = 5)1
## Estimate Std. Error t value Pr(>|t|)
## setosa 0.70614873 0.6657565 1.06067120 0.29077623
## versicolor -0.05282411 0.6163630 -0.08570292 0.93183244
## virginica 0.99188353 0.5296902 1.87257304 0.06333953
## bs(Petal.Length, df = 5)2
## Estimate Std. Error t value Pr(>|t|)
## setosa 0.3154402 0.3800365 0.8300262 0.408022610
## versicolor 0.7273118 0.3814169 1.9068682 0.058710731
## virginica 1.3225922 0.3998330 3.3078617 0.001211605
## bs(Petal.Length, df = 5)3
## Estimate Std. Error t value Pr(>|t|)
## setosa 0.6825392 0.4382068 1.557573 0.1217288705
## versicolor 0.9849692 0.4145526 2.375981 0.0189396510
## virginica 1.6537002 0.4708541 3.512128 0.0006090798
## bs(Petal.Length, df = 5)4
## Estimate Std. Error t value Pr(>|t|)
## setosa 0.8832191 0.5338884 1.654314 1.004397e-01
## versicolor 1.8901328 0.4316514 4.378840 2.407342e-05
## virginica 3.2197578 0.5222661 6.164976 7.990972e-09
## bs(Petal.Length, df = 5)5
## Estimate Std. Error t value Pr(>|t|)
## setosa 0.5075331 0.4012038 1.265026 2.080908e-01
## versicolor 1.1200614 0.4432348 2.527016 1.268393e-02
## virginica 2.6299744 0.4446038 5.915321 2.684442e-08
##
## Residual standard error: 0.3350539 on 132 degrees of freedom
map_dfr(model2_kat, glance, .id = 'Species') %>% print(width = Inf)
## # A tibble: 3 × 13
## Species r.squared adj.r.squared sigma statistic p.value df logLik AIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 0.121 0.0213 0.349 1.21 3.19e- 1 5 -15.1 44.2
## 2 versicolor 0.600 0.554 0.345 13.2 7.27e- 8 5 -14.5 43.0
## 3 virginica 0.786 0.761 0.311 32.3 1.13e-13 5 -9.28 32.6
## BIC deviance df.residual nobs
## <dbl> <dbl> <int> <int>
## 1 57.5 5.35 44 50
## 2 56.3 5.22 44 50
## 3 45.9 4.24 44 50
ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length, color = Species)) +
geom_point(alpha = 0.5) + geom_smooth(method = "lm", formula = y ~ bs(x, df = 5))
B-splajn regresija s 5 stupnjeva slobode po faktorima u
Species varijabli
Možemo napraviti B-splajn regresiju bez interakcije.
model2_kat2 <- lm(Sepal.Length ~ bs(Petal.Length, df = 5) + Species, data = iris)
summary(model2_kat2)
##
## Call:
## lm(formula = Sepal.Length ~ bs(Petal.Length, df = 5) + Species,
## data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.73531 -0.22757 -0.01729 0.25160 0.96720
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.5999 0.2292 20.068 < 2e-16 ***
## bs(Petal.Length, df = 5)1 0.7495 0.5537 1.354 0.177970
## bs(Petal.Length, df = 5)2 0.1575 1.1011 0.143 0.886469
## bs(Petal.Length, df = 5)3 2.1009 0.8042 2.612 0.009960 **
## bs(Petal.Length, df = 5)4 3.4648 0.9236 3.751 0.000255 ***
## bs(Petal.Length, df = 5)5 4.1441 0.8692 4.768 4.56e-06 ***
## Speciesversicolor -0.2550 0.9590 -0.266 0.790713
## Speciesvirginica -0.8241 0.9523 -0.865 0.388295
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3384 on 142 degrees of freedom
## Multiple R-squared: 0.8408, Adjusted R-squared: 0.8329
## F-statistic: 107.1 on 7 and 142 DF, p-value: < 2.2e-16
ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length, color = Species)) +
geom_point(alpha = 0.5) + geom_parallel_slopes(formula = y ~ bs(x, df = 5))
B-splajn regresija s 5 stupnjeva slobode sa Species
varijablom u modelu
Regresija s prirodnim splajnom
Na analogni način možemo koristiti i ns funkciju ukoliko
želimo raditi regresiju s prirodnim kubičnim splajnom.
model4 <- lm(Sepal.Length ~ ns(Petal.Length, df = 5), data = iris)
summary(model4)
##
## Call:
## lm(formula = Sepal.Length ~ ns(Petal.Length, df = 5), data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.17312 -0.23348 -0.01548 0.24715 0.93717
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7318 0.1669 28.356 < 2e-16 ***
## ns(Petal.Length, df = 5)1 0.1193 0.2316 0.515 0.607
## ns(Petal.Length, df = 5)2 1.5501 0.2107 7.358 1.31e-11 ***
## ns(Petal.Length, df = 5)3 1.3303 0.1490 8.929 1.83e-15 ***
## ns(Petal.Length, df = 5)4 3.2545 0.5328 6.109 8.89e-09 ***
## ns(Petal.Length, df = 5)5 3.0034 0.2004 14.987 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3559 on 144 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8153
## F-statistic: 132.5 on 5 and 144 DF, p-value: < 2.2e-16
ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length)) +
geom_point(alpha = 0.5) + geom_smooth(method = "lm", formula = y ~ ns(x, df = 5))
regresija s kubičnim prirodnim splajnom s 5 stupnjeva slobode
Usporedimo kubičnu B-splajn regresiju i regresiju s prirodnim kubičnim splajnom.
ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", formula = y ~ bs(x, df = 5), se = FALSE, aes(color = "B-splajn")) +
geom_smooth(method = "lm", formula = y ~ ns(x, df = 5), se = FALSE, aes(color = "prirodni splajn")) +
scale_color_hue("Vrsta splajna")
kubična B-splajn regresija i regresija s prirodnim splajnom s 5 stupnjeva slobode
Možemo napraviti regresiju s prirodnim kubičnim splajnom po svakom
faktoru u varijabli Species.
model4_kat <- lmList(Sepal.Length ~ ns(Petal.Length, df = 5) | Species, data = iris)
summary(model4_kat)
## Call:
## Model: Sepal.Length ~ ns(Petal.Length, df = 5) | NULL
## Data: iris
##
## Coefficients:
## (Intercept)
## Estimate Std. Error t value Pr(>|t|)
## setosa 4.522067 0.2977273 15.18862 9.235235e-31
## versicolor 4.979430 0.2787831 17.86130 4.957453e-37
## virginica 5.150869 0.3057127 16.84873 1.067269e-34
## ns(Petal.Length, df = 5)1
## Estimate Std. Error t value Pr(>|t|)
## setosa 0.2932865 0.3119170 0.940271 3.487961e-01
## versicolor 0.9645424 0.2834930 3.402350 8.846783e-04
## virginica 1.3841802 0.3348311 4.133965 6.299776e-05
## ns(Petal.Length, df = 5)2
## Estimate Std. Error t value Pr(>|t|)
## setosa 0.6293576 0.3403219 1.849301 0.066651263
## versicolor 0.9949228 0.3488177 2.852272 0.005040361
## virginica 1.2205007 0.3727495 3.274319 0.001352636
## ns(Petal.Length, df = 5)3
## Estimate Std. Error t value Pr(>|t|)
## setosa 0.3774199 0.2869594 1.315238 1.907101e-01
## versicolor 1.5463572 0.2687898 5.753035 5.817437e-08
## virginica 2.1102055 0.3114274 6.775914 3.717627e-10
## ns(Petal.Length, df = 5)4
## Estimate Std. Error t value Pr(>|t|)
## setosa 1.090321 0.6855206 1.590500 1.141143e-01
## versicolor 1.791715 0.6845422 2.617392 9.894781e-03
## virginica 3.431570 0.6905213 4.969535 2.042526e-06
## ns(Petal.Length, df = 5)5
## Estimate Std. Error t value Pr(>|t|)
## setosa 0.2372311 0.2815043 0.8427262 4.009060e-01
## versicolor 1.0936160 0.2940531 3.7191100 2.945519e-04
## virginica 2.0538101 0.2659351 7.7229749 2.514982e-12
##
## Residual standard error: 0.3367084 on 132 degrees of freedom
map_dfr(model4_kat, glance, .id = 'Species') %>% print(width = Inf)
## # A tibble: 3 × 13
## Species r.squared adj.r.squared sigma statistic p.value df logLik AIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 0.119 0.0190 0.349 1.19 3.30e- 1 5 -15.1 44.3
## 2 versicolor 0.592 0.546 0.348 12.8 1.09e- 7 5 -15.0 43.9
## 3 virginica 0.784 0.760 0.312 32.0 1.34e-13 5 -9.47 32.9
## BIC deviance df.residual nobs
## <dbl> <dbl> <int> <int>
## 1 57.7 5.36 44 50
## 2 57.3 5.33 44 50
## 3 46.3 4.28 44 50
ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length, color = Species)) +
geom_point(alpha = 0.5) + geom_smooth(method = "lm", formula = y ~ ns(x, df = 5))
regresija s prirodnim splajnom s 5 stupnjeva slobode po faktorima u
Species varijabli
Možemo napraviti regresiju s prirodnim kubičnim splajnom bez interakcije.
model4_kat2 <- lm(Sepal.Length ~ ns(Petal.Length, df = 5) + Species, data = iris)
summary(model4_kat2)
##
## Call:
## lm(formula = Sepal.Length ~ ns(Petal.Length, df = 5) + Species,
## data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.74569 -0.22154 -0.01368 0.24593 0.94935
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7214 0.1591 29.676 < 2e-16 ***
## ns(Petal.Length, df = 5)1 0.8924 0.9994 0.893 0.37339
## ns(Petal.Length, df = 5)2 2.2226 0.7440 2.987 0.00331 **
## ns(Petal.Length, df = 5)3 2.8126 0.8435 3.335 0.00109 **
## ns(Petal.Length, df = 5)4 4.4791 0.8706 5.145 8.73e-07 ***
## ns(Petal.Length, df = 5)5 4.0835 0.8047 5.075 1.20e-06 ***
## Speciesversicolor -0.6592 0.7692 -0.857 0.39286
## Speciesvirginica -1.1977 0.7715 -1.552 0.12279
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.339 on 142 degrees of freedom
## Multiple R-squared: 0.8403, Adjusted R-squared: 0.8324
## F-statistic: 106.7 on 7 and 142 DF, p-value: < 2.2e-16
ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length, color = Species)) +
geom_point(alpha = 0.5) + geom_parallel_slopes(formula = y ~ ns(x, df = 5))
regresija s prirodnim splajnom s 5 stupnjeva slobode sa
Species varijablom u modelu
Problem. Među svim realnim funkcijama realne varijable klase \(C^2\) treba pronaći funkciju \(f\) koja minimizira penaliziranu sumu kvadrata reziduala \[\begin{equation} RSS(f,\lambda)=\sum_{i=1}^N\big(y_i-f(x_i)\big)^2+\lambda\int{f''(t)^2\,\mathrm{d}t}\tag{$\heartsuit$} \end{equation}\] gdje je \(\lambda\) fiksni parametar zaglađivanja. U slučaju \(\lambda=0\) nema nikakvih uvjeta na \(f\), dok za \(\lambda=\infty\) funkcija \(f\) mora biti linearna.
Može se pokazati da je rješenje ovog problema prirodni kubični splajn čiji čvorovi su sve različite vrijednosti od \(x_i,\ i=1,2,\dotsc,N\). Stoga je rješenje oblika \[f(x)=\sum_{j=1}^M{\theta_jN_j(x)}\] pri čemu su \(N_j\) bazične funkcije za pripadnu familiju prirodnih kubičnih splajnova, a \(M\) je ukupni broj različitih vrijednosti od \(x_i,\ i=1,2,\dotsc,N\).
Relacija \((\heartsuit)\) može se zapisati u matričnom obliku \[RSS(f,\lambda)=(Y-A\Theta)^T(Y-A\Theta)+\lambda\Theta^T\Omega\Theta\] pri čemu je \[ Y=\begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_N \end{bmatrix},\quad \Theta=\begin{bmatrix} \theta_1\\ \theta_2\\ \vdots\\ \theta_M \end{bmatrix},\quad A=\begin{bmatrix}a_{ij}\end{bmatrix},\quad \Omega=\begin{bmatrix}\omega_{ij}\end{bmatrix},\quad a_{ij}=N_j(x_i),\quad \omega_{ij}=\int{N''_{i}(t)N''_{j}(t)\,\mathrm{d}t}. \] Rješenje je jednako \[\hat{\Theta}=(A^TA+\lambda\Omega)^{-1}A^TY,\] a pripadni splajn je dan s \[\hat{f}(x)=\sum_{j=1}^M{\hat{\theta}_jN_j(x)}.\] Neka je \(\hat{Y}=A\hat{\Theta}\). Tada je \[\hat{Y}=A\hat{\Theta}=A(A^TA+\lambda\Omega)^{-1}A^TY=S_{\lambda}Y.\] Efektivni stupanj slobode \(\mathrm{d}f_{\lambda}\) definira se kao trag matrice \(S_{\lambda}\).
Kako se parametar \(\lambda\) povećava od \(0\) do \(\infty\), efektivni stupanj slobode \(\mathrm{d}f_{\lambda}\) smanjuje se od \(M\) do \(2\). Prirodni kubični splajn s \(M\) čvorova ima ukupno \(M\) nominalnih stupnjeva slobode. Međutim, faktor zaglađivanja \(\lambda\) daje ograničenja na parametre u modelu, što utječe na zakrivljenost pripadne krivulje. Taj fenomen se mjeri s efektivnim stupnjem slobode.
Optimalnu vrijednost parametra \(\lambda\) možemo odrediti pomoću LOOCV (leave-one-out cross-validation). Pritom se greška računa prema formuli
\[\begin{equation} RSS_{\mathrm{cv}}(\lambda)=\sum _{i=1}^{N}{\big(y_i-\hat{f_{\lambda}}^{(-i)}(x_i)\big)^2} = \sum _{i=1}^{N}{\bigg[\frac{y_i-\hat{f_{\lambda}}(x_i)}{1-(S _{\lambda})_{ii}}\bigg]^2}. \tag{$\diamondsuit$} \end{equation}\]
Ovdje \(\hat{f_{\lambda}}^{(-i)}\) označava vrijednost dobivenog splajna u točki \(x_i\) pri čemu se pri treniranju modela uzimaju sve vrijednosti osim \(x_i\), a \(\hat{f_{\lambda}}(x_i)\) označava vrijednost dobivenog splajna u točki \(x_i\) pri čemu se kod treniranja modela koriste sve vrijednosti.
U generaliziranom slučaju se svaki član \((S _{\lambda})_{ii}\) u formuli (\(\diamondsuit\)) zamijeni s \(\frac{1}{N}\mathop{\mathrm{tr}}{S_{\lambda}}\).
Generalizirana verzija je zapravo težinska verzija od (\(\diamondsuit\)) s težinama \[\frac{\big(1-(S
_{\lambda})_{ii}\big)^2}{\big(1-\frac{1}{N}\mathop{\mathrm{tr}}{S_{\lambda}}\big)^2}.\]
U programskom jeziku R navedena teorija je implementirana u
naredbi smooth.spline. Detalje implementacije možete
pogledati u dokumentaciji te naredbe. Pripadni graf možemo nacrtati
pomoću geom_spline naredbe iz biblioteke
ggformula.
sm1 <- smooth.spline(iris$Petal.Length, iris$Sepal.Length, df = 3)
sm2 <- smooth.spline(iris$Petal.Length, iris$Sepal.Length, df = 5)
sm3 <- smooth.spline(iris$Petal.Length, iris$Sepal.Length)
list(sm1, sm2, sm3)
## [[1]]
## Call:
## smooth.spline(x = iris$Petal.Length, y = iris$Sepal.Length, df = 3)
##
## Smoothing Parameter spar= 0.9368252 lambda= 0.0935363 (12 iterations)
## Equivalent Degrees of Freedom (Df): 2.999686
## Penalized Criterion (RSS): 7.365555
## GCV: 0.1375983
##
## [[2]]
## Call:
## smooth.spline(x = iris$Petal.Length, y = iris$Sepal.Length, df = 5)
##
## Smoothing Parameter spar= 0.7648557 lambda= 0.005346461 (13 iterations)
## Equivalent Degrees of Freedom (Df): 5.00085
## Penalized Criterion (RSS): 6.154997
## GCV: 0.1327859
##
## [[3]]
## Call:
## smooth.spline(x = iris$Petal.Length, y = iris$Sepal.Length)
##
## Smoothing Parameter spar= 0.6614865 lambda= 0.0009577521 (11 iterations)
## Equivalent Degrees of Freedom (Df): 7.152497
## Penalized Criterion (RSS): 5.357991
## GCV: 0.1309575
ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length)) +
geom_point(alpha = 0.5) + geom_spline(df = 3, aes(color = "3")) +
geom_spline(df = 5, aes(color = "5")) +
geom_spline(aes(color = paste(round(sm3$df, 2), '(optimalni)'))) +
scale_color_hue("df")
smoothing splajnovi i stupnjevi slobode
S obzirom da nemamo podataka za Petal.Length između
\(2\) i \(3\), možemo generirati gušće podatke tako
da dobijemo ljepše nacrtane krivulje.
xmin <- min(iris$Petal.Length)
xmax <- max(iris$Petal.Length)
xvalues <- seq(xmin, xmax, by = 0.05)
graf1 <- predict(sm1, xvalues) %>% as_tibble()
graf2 <- predict(sm2, xvalues) %>% as_tibble()
graf3 <- predict(sm3, xvalues) %>% as_tibble()
ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length)) +
geom_point(alpha = 0.5) +
geom_line(data = graf1, aes(x = x, y = y, color = "3")) +
geom_line(data = graf2, aes(x = x, y = y, color = "5")) +
geom_line(data = graf3, aes(x = x, y = y,
color = paste(round(sm3$df, 2), '(optimalni)'))) +
scale_color_hue("df")
smoothing splajnovi i stupnjevi slobode (ljepše nacrtane krivulje)