Jednoduchá regresia
Zadanie: Urobte lineárny regresný model pre premennú dist z datasetu cars, pričom prediktorom je premenná speed. Overte predpoklady pre tento model a otestujte jeho štatistickú významnosť. Vizualizujte vzťah reálnych a predikovaných hodnôt pre target. Je medzi týmito premennými skutočne lineárny vzťah alebo sa dá nájsť iný, ktorý bude dávať lepšie výsledky?
data1 <- datasets::cars
# Lineárny regresný model
model1 <- lm(dist ~ speed, data1)
summary(model1) # F-test, R^2, p-hodnoty##
## Call:
## lm(formula = dist ~ speed, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Tak že R^2 = 0.6511 (65.1 % variability v dist je vysvetlené premennou speed). Adjustované R^2 = 0.6438. Hodnota F-testu: F = 89.57, p < 0.001 znamená, že model je celkovo významný.
Vizualizacia:
plot(data1$speed, data1$dist, main="Scatterplot speed vs dist",
xlab="Speed", ylab="Dist", pch=19, col="blue")
abline(model1, col="red") # Regresná priamkaSú normálne rozdelené? Shapiro-Wilk test
##
## Shapiro-Wilk normality test
##
## data: resid(model1)
## W = 0.94509, p-value = 0.02152
Je stredná hodnota nula? t-test
##
## One Sample t-test
##
## data: model1$residuals
## t = 1.0315e-16, df = 49, p-value = 1
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -4.326 4.326
## sample estimates:
## mean of x
## 2.220446e-16
Stredna hodnota je nula.
Sú náhodné? Runs test a turning point test
##
## Runs Test
##
## data: model1$residuals
## statistic = -0.28577, runs = 25, n1 = 25, n2 = 25, n = 50, p-value =
## 0.7751
## alternative hypothesis: nonrandomness
##
## Turning Point Test
##
## data: model1$residuals
## statistic = -1.4961, n = 49, p-value = 0.1346
## alternative hypothesis: non randomness
Rezíduá sú náhodné.
Platí homoskedasticita? Golfeld-Quandtov test:
## Warning: пакет 'lmtest' был собран под R версии 4.4.2
## Загрузка требуемого пакета: zoo
##
## Присоединяю пакет: 'zoo'
## Следующие объекты скрыты от 'package:base':
##
## as.Date, as.Date.numeric
##
## Goldfeld-Quandt test
##
## data: model1
## GQ = 1.5512, df1 = 23, df2 = 23, p-value = 0.1498
## alternative hypothesis: variance increases from segment 1 to 2
Rezíduá maju konštantne rozptyly.
Sú nekorelované? Resp. neexistuje autokorelácia? Durbin-Watsonov test:
## Warning: пакет 'car' был собран под R версии 4.4.2
## Загрузка требуемого пакета: carData
## Warning: пакет 'carData' был собран под R версии 4.4.2
## lag Autocorrelation D-W Statistic p-value
## 1 0.1604322 1.676225 0.178
## Alternative hypothesis: rho != 0
Rezíduá sú nekorelované.
Porovnanie modelov:
# Alternatívny model (kvadratický vzťah)
model_quad <- lm(dist ~ speed + I(speed^2), data = cars)
summary(model_quad)##
## Call:
## lm(formula = dist ~ speed + I(speed^2), data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.720 -9.184 -3.188 4.628 45.152
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.47014 14.81716 0.167 0.868
## speed 0.91329 2.03422 0.449 0.656
## I(speed^2) 0.09996 0.06597 1.515 0.136
##
## Residual standard error: 15.18 on 47 degrees of freedom
## Multiple R-squared: 0.6673, Adjusted R-squared: 0.6532
## F-statistic: 47.14 on 2 and 47 DF, p-value: 5.852e-12
## Analysis of Variance Table
##
## Model 1: dist ~ speed
## Model 2: dist ~ speed + I(speed^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 48 11354
## 2 47 10825 1 528.81 2.296 0.1364
Násobná regresia – generované dáta
Zadanie: Nájdite najvhodnejší lineárny regresný model pre nižšie uvedené náhodne generované dáta. Overte predpoklady. Zdôvodnite prečo je vybraný model najvhodnejší. Testujte štatistickú významnosť modelu a parametrov.
n <- 100
size <- rnorm(n, mean = 2000, sd = 500) # House size
bedrooms <- sample(2:5, n, replace = TRUE) # Number of bedrooms
bathrooms <- sample(1:4, n, replace = TRUE) # Number of bathrooms
age <- rpois(n, lambda = 10) # Age of the house
price <- 50000 + size * 150 + bedrooms * 10000 - age * 2000 + rnorm(n, mean = 0, sd = 30000) # Create target variable
house_data <- data.frame( Size = size, Bedrooms = bedrooms, Bathrooms = bathrooms, Age = age, Price = price ) # Create data frame
house_data## Size Bedrooms Bathrooms Age Price
## 1 1273.7147 4 2 9 327675.2
## 2 2336.6568 3 1 18 416102.0
## 3 1778.4684 2 2 11 287858.3
## 4 2114.5043 2 4 19 285917.6
## 5 2027.8086 5 1 9 381211.2
## 6 1649.0849 3 4 14 252444.3
## 7 2036.9870 5 4 10 394121.3
## 8 2570.1199 4 1 9 455091.3
## 9 2067.3272 5 2 22 366556.7
## 10 2108.8902 4 3 8 398374.1
## 11 2707.6004 2 2 15 395148.0
## 12 1838.9219 4 2 10 360426.0
## 13 2488.6596 2 2 9 411899.4
## 14 1168.6646 5 3 10 283609.4
## 15 2438.1485 5 2 11 433519.6
## 16 1969.0318 4 3 9 416981.5
## 17 1068.2977 3 1 10 221784.6
## 18 1988.9479 5 3 6 359774.7
## 19 2445.6826 3 4 10 420865.3
## 20 2664.0058 3 1 10 446145.2
## 21 2209.4099 2 4 7 427344.2
## 22 2155.8874 2 2 4 370287.4
## 23 1002.5307 4 1 14 219887.5
## 24 2982.2998 2 3 8 551227.4
## 25 1740.2298 5 3 16 384499.4
## 26 2924.1014 3 2 9 513481.7
## 27 1889.6166 4 4 7 302329.9
## 28 2156.0895 5 4 5 393066.6
## 29 1530.3124 2 4 7 273687.0
## 30 2231.7717 2 1 6 356713.5
## 31 2887.9781 2 2 10 462223.1
## 32 1923.8451 5 4 13 395586.8
## 33 2779.5816 5 2 10 488770.3
## 34 2475.6780 3 3 4 430991.8
## 35 1923.4543 4 4 11 332395.9
## 36 2180.3688 5 2 9 430912.4
## 37 2110.6396 3 1 17 393615.1
## 38 1834.6946 5 1 9 316541.6
## 39 1777.2431 5 1 7 382514.7
## 40 995.8425 2 1 12 257880.5
## 41 2124.8690 5 2 11 411101.5
## 42 2586.6258 4 3 10 465444.5
## 43 603.7253 5 1 5 197276.9
## 44 2345.9292 5 2 12 469316.9
## 45 2170.3577 5 3 11 394001.0
## 46 1927.0995 4 4 10 390290.0
## 47 1788.6476 4 2 15 296977.6
## 48 2321.2343 4 1 14 493364.1
## 49 2762.9614 3 3 11 462967.9
## 50 2298.9738 5 2 12 417699.6
## 51 1836.6629 4 1 9 302502.2
## 52 2119.4744 5 1 9 410216.3
## 53 1697.1201 4 1 9 343345.3
## 54 2220.5075 2 3 5 379008.0
## 55 2394.8484 4 3 10 437076.6
## 56 2078.9610 5 3 3 362589.8
## 57 2798.3107 4 2 15 487507.7
## 58 1613.9529 2 2 8 320266.2
## 59 1632.9467 3 2 4 297858.1
## 60 1778.8792 3 4 8 318892.8
## 61 2448.1034 3 4 10 416516.5
## 62 1773.3328 4 4 12 313810.8
## 63 1946.3702 3 1 10 340606.8
## 64 2439.9142 2 4 9 437747.9
## 65 907.8200 2 1 9 141544.7
## 66 2112.5208 2 2 13 350527.4
## 67 3002.1371 5 1 7 566086.9
## 68 2004.6748 4 2 7 314789.6
## 69 2072.1219 3 4 9 357082.4
## 70 1943.5003 4 2 10 348741.3
## 71 2029.7569 3 3 9 340324.4
## 72 1696.8290 3 1 7 320883.5
## 73 1468.9888 5 1 10 316865.2
## 74 2485.4445 5 2 9 482376.7
## 75 1475.1658 2 2 12 271746.0
## 76 3074.9061 3 3 13 533252.2
## 77 2463.8855 4 4 15 443318.4
## 78 2787.9295 4 1 12 499302.7
## 79 2109.4463 3 1 11 375634.1
## 80 1979.2124 5 1 11 428779.1
## 81 3066.9384 2 4 14 514938.9
## 82 2180.1629 3 2 12 385172.7
## 83 1706.0471 4 3 21 336780.8
## 84 1935.9288 3 2 11 327160.0
## 85 1642.2857 5 4 13 363570.5
## 86 2227.1912 2 2 15 399554.7
## 87 2071.3910 2 4 8 395315.7
## 88 1474.7602 2 1 11 223182.5
## 89 1288.0843 2 3 10 223701.4
## 90 1933.3378 3 3 7 321412.7
## 91 933.3453 5 3 19 153916.8
## 92 2219.1687 3 4 10 393919.9
## 93 1817.0478 3 2 15 324853.7
## 94 3663.7051 3 4 9 592780.3
## 95 2303.0893 3 4 11 400012.6
## 96 1794.0573 2 1 14 351584.5
## 97 1737.0622 4 4 13 310984.9
## 98 2071.0548 5 4 8 395858.2
## 99 2126.3672 3 2 14 367972.6
## 100 2145.9714 5 3 9 377231.4
Model:
model2 <- lm(house_data$Price ~ house_data$Size + house_data$Bedrooms + house_data$Bathrooms + house_data$Age)
print(model2)##
## Call:
## lm(formula = house_data$Price ~ house_data$Size + house_data$Bedrooms +
## house_data$Bathrooms + house_data$Age)
##
## Coefficients:
## (Intercept) house_data$Size house_data$Bedrooms
## 23825.1 153.6 14006.6
## house_data$Bathrooms house_data$Age
## -3162.3 -801.3
##
## Call:
## lm(formula = house_data$Price ~ house_data$Size + house_data$Bedrooms +
## house_data$Bathrooms + house_data$Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62887 -16594 -249 15676 71293
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23825.119 18782.041 1.269 0.208
## house_data$Size 153.625 5.778 26.586 < 2e-16 ***
## house_data$Bedrooms 14006.635 2612.414 5.362 5.78e-07 ***
## house_data$Bathrooms -3162.302 2642.483 -1.197 0.234
## house_data$Age -801.333 827.140 -0.969 0.335
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29290 on 95 degrees of freedom
## Multiple R-squared: 0.8848, Adjusted R-squared: 0.88
## F-statistic: 182.5 on 4 and 95 DF, p-value: < 2.2e-16
Analýza rezíduí:
## 1 2 3 4 5 6
## 65685.50015 8873.92721 -22058.33188 -62887.43124 -13795.30961 -42873.87412
## 7 8 9 10 11 12
## 7993.04080 -9221.19959 -20941.18480 10441.33469 -54301.37314 12407.59655
## 13 14 15 16 17 18
## -8723.15543 27715.06384 -19760.50590 51335.84540 2997.75989 -25341.17469
## 19 20 21 22 23 24
## -35.66196 -17782.59803 54343.33835 -3219.67789 402.78239 57130.19489
## 25 26 27 28 29 30
## 45606.21433 11957.62401 -49555.88866 -15365.49383 5012.64317 -30010.95217
## 31 32 33 34 35 36
## -18943.57672 29244.01718 -17763.79958 -2487.49049 -21482.87278 15631.14684
## 37 38 39 40 41 42
## 20307.59818 -48797.68581 24398.73016 65833.94052 5949.08118 5722.18425
## 43 44 45 46 47 48
## 17840.12805 31005.34271 -14977.30475 35049.80979 -39310.78586 71293.36569
## 49 50 51 52 53 54
## -9036.05146 -13398.43332 -49132.80662 1127.68871 13147.52858 -462.70989
## 55 56 57 58 59 60
## 6816.14311 -38758.41830 -3890.40337 33219.27081 -9318.76907 -1173.00935
## 61 62 63 64 65 66
## -4756.37461 -16204.28035 -13074.08294 30938.44857 -39383.39352 -9105.47403
## 67 68 69 70 71 72
## 19796.39487 -61096.67712 -7231.57514 -15342.99116 -20643.49179 3134.41977
## 73 74 75 76 77 78
## 8508.87322 20228.09903 9225.61695 14928.40216 9621.10016 3933.16951
## 79 80 81 82 83 84
## -2298.02216 42840.91541 15809.40681 340.34109 21152.24201 -20953.21084
## 85 86 87 88 89 90
## 40482.25606 23908.25668 44319.38688 -43239.28665 -8518.97667 -26345.49628
## 91 92 93 94 95 96
## -58614.59651 7817.17764 -1791.05981 -16040.96967 1818.91778 38514.68940
## 97 98 99 100
## -12656.76263 2893.58831 -6992.67780 -29603.21375
Sú normálne rozdelené? Shapiro-Wilk test
##
## Shapiro-Wilk normality test
##
## data: model2$residuals
## W = 0.9866, p-value = 0.4115
Rezíduá je normalne rozdelené.
Je stredná hodnota nula? t-test
##
## One Sample t-test
##
## data: model2$residuals
## t = 2.9635e-16, df = 99, p-value = 1
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -5692.61 5692.61
## sample estimates:
## mean of x
## 8.501999e-13
Stredna hodnota je nula.
Sú náhodné? Runs test a turning point test:
##
## Runs Test
##
## data: model2$residuals
## statistic = 0.20102, runs = 52, n1 = 50, n2 = 50, n = 100, p-value =
## 0.8407
## alternative hypothesis: nonrandomness
##
## Turning Point Test
##
## data: model2$residuals
## statistic = 0.63827, n = 100, p-value = 0.5233
## alternative hypothesis: non randomness
Rezíduá sú náhodné.
Platí homoskedasticita? Golfeld-Quandtov test
##
## Goldfeld-Quandt test
##
## data: model2
## GQ = 0.5548, df1 = 45, df2 = 45, p-value = 0.9745
## alternative hypothesis: variance increases from segment 1 to 2
Rezíduá maju konštantne rozptyly.
Sú nekorelované? Resp. neexistuje autokorelácia? Durbin-Watsonov test
## lag Autocorrelation D-W Statistic p-value
## 1 -0.04793618 2.032168 0.902
## Alternative hypothesis: rho != 0
Rezíduá sú nekorelované.
Porovnanie modelov:
model_quad2 <- lm(house_data$Price ~ house_data$Size + I(house_data$Size^2) + house_data$Bedrooms + I(house_data$Bedrooms^2) + house_data$Bathrooms + I(house_data$Bathrooms^2) + house_data$Age + + I(house_data$Age^2))
print(model_quad2)##
## Call:
## lm(formula = house_data$Price ~ house_data$Size + I(house_data$Size^2) +
## house_data$Bedrooms + I(house_data$Bedrooms^2) + house_data$Bathrooms +
## I(house_data$Bathrooms^2) + house_data$Age + +I(house_data$Age^2))
##
## Coefficients:
## (Intercept) house_data$Size
## 3.866e+04 1.428e+02
## I(house_data$Size^2) house_data$Bedrooms
## 2.591e-03 -5.279e+03
## I(house_data$Bedrooms^2) house_data$Bathrooms
## 2.780e+03 -1.076e+04
## I(house_data$Bathrooms^2) house_data$Age
## 1.607e+03 5.262e+03
## I(house_data$Age^2)
## -2.537e+02
## Analysis of Variance Table
##
## Model 1: house_data$Price ~ house_data$Size + house_data$Bedrooms + house_data$Bathrooms +
## house_data$Age
## Model 2: house_data$Price ~ house_data$Size + I(house_data$Size^2) + house_data$Bedrooms +
## I(house_data$Bedrooms^2) + house_data$Bathrooms + I(house_data$Bathrooms^2) +
## house_data$Age + +I(house_data$Age^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 95 8.1485e+10
## 2 91 7.7923e+10 4 3562088096 1.04 0.3911
Keď porovnáme hodnoty týchto dvoch modelov, zdá sa, že rozdiely v informačných kritériách nie sú príliš veľké. Teda zvolime linearný model.
Násobná regresia - dataset
Zadanie: Urobte lineárny regresný model pre premennú mpg z datasetu mtcars, pričom prediktormi budú premenné horsepower, weight a cylinders. Overte predpoklady pre tento model a otestujte jeho štatistickú významnosť. Vizualizujte vzťah reálnych a predikovaných hodnôt pre target. Dávajú iné premenné z datasetu ako prediktory lepší model a ak áno, tak ktoré?
##
## Call:
## lm(formula = mpg ~ hp + wt + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9290 -1.5598 -0.5311 1.1850 5.8986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.75179 1.78686 21.687 < 2e-16 ***
## hp -0.01804 0.01188 -1.519 0.140015
## wt -3.16697 0.74058 -4.276 0.000199 ***
## cyl -0.94162 0.55092 -1.709 0.098480 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263
## F-statistic: 50.17 on 3 and 28 DF, p-value: 2.184e-11
Vizualizacia:
predicted <- predict(model4)
plot(mtcars$mpg, predicted, main="Reálne vs Predikované hodnoty",
xlab="Reálne hodnoty (mpg)", ylab="Predikované hodnoty", pch=19, col="blue")
abline(0, 1, col="red", lty=2) # Ideálna zhodaAnalýza rezíduí:
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
## -1.8204257 -1.0128476 -3.1603990 0.4639233
## Hornet Sportabout Valiant Duster 360 Merc 240D
## 1.5322025 -2.1503588 -1.1934238 0.6356864
## Merc 230 Merc 280 Merc 280C Merc 450SE
## -0.4957351 -0.7890124 -2.1890124 1.3175861
## Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
## 1.1408152 -0.8008361 -0.4944331 0.2370012
## Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
## 4.5573819 5.5725355 1.4673228 5.8985522
## Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
## -3.9290355 -1.8653922 -2.4345849 -1.3383411
## Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
## 3.3148266 -0.3667124 -0.5665304 2.2446157
## Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
## -0.6174892 -1.4729031 1.1300053 -2.8149817
Sú normálne rozdelené? Shapiro-Wilk test
##
## Shapiro-Wilk normality test
##
## data: model4$residuals
## W = 0.93455, p-value = 0.05252
Rezíduá je normalne rozdelené.
Je stredná hodnota nula? t-test
##
## One Sample t-test
##
## data: model4$residuals
## t = 2.5078e-16, df = 31, p-value = 1
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.8605801 0.8605801
## sample estimates:
## mean of x
## 1.058181e-16
Stredna hodnota je nula.
Sú náhodné? Runs test a turning point test:
##
## Runs Test
##
## data: model4$residuals
## statistic = -0.7188, runs = 15, n1 = 16, n2 = 16, n = 32, p-value =
## 0.4723
## alternative hypothesis: nonrandomness
##
## Turning Point Test
##
## data: model4$residuals
## statistic = -0.43167, n = 32, p-value = 0.666
## alternative hypothesis: non randomness
Rezíduá sú náhodné.
Platí homoskedasticita? Golfeld-Quandtov test
##
## Goldfeld-Quandt test
##
## data: model4
## GQ = 6.6633, df1 = 12, df2 = 12, p-value = 0.001265
## alternative hypothesis: variance increases from segment 1 to 2
Rezíduá maju konštantne rozptyly.
Sú nekorelované? Resp. neexistuje autokorelácia? Durbin-Watsonov test
## lag Autocorrelation D-W Statistic p-value
## 1 0.146151 1.64407 0.158
## Alternative hypothesis: rho != 0
Rezíduá sú nekorelované.
Porovnanie modelov:
model5 <- lm(mtcars$mpg ~ mtcars$disp + mtcars$drat + mtcars$qsec)
model6 <- lm(mtcars$mpg~ mtcars$am + mtcars$gear)
model7 <- lm(mtcars$mpg ~ mtcars$hp + mtcars$qsec + mtcars$vs)
model8 <- lm(mtcars$mpg ~ mtcars$disp + mtcars$qsec + mtcars$wt)
c(AIC(model4),AIC(model5),AIC(model6),AIC(model7),AIC(model8))## [1] 155.4766 171.4928 198.4822 175.1748 158.7203
## [1] 162.8053 178.8215 204.3452 182.5034 166.0490
Najlepši je 8 model.