An additional assumption in regression models is the independence assumption: the residuals must be independent of one another, with no correlation among them. This can be tested with the Durbin-Watson test, whose null hypothesis is precisely that the residuals are independent.
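As a minimal sketch (assuming the car package is installed), the test can be run on any fitted lm object, for example a model on R's built-in cars data:
library(car)                        # provides durbinWatsonTest()
m <- lm(dist ~ speed, data = cars)  # illustrative model on built-in data
durbinWatsonTest(m)                 # H0: the residuals are independent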
Finding an interval for "Y" refers to a prediction interval (directly for Y).
Finding an interval for the average "Y" refers to a confidence interval (for the mean response).
# Loading libraries
library(alr4)
data(oldfaith)
head(oldfaith,3)
Duration Interval
1 216 79
2 108 54
3 200 74
#help(oldfaith)
dim(oldfaith)
[1] 270 2
cor(oldfaith$Interval, oldfaith$Duration, method = "pearson")
[1] 0.8960697
cor(oldfaith$Interval, oldfaith$Duration, method = "spearman")
[1] 0.7750216
oldfaith.m1 <- lm(Duration ~ Interval, data = oldfaith)
summary(oldfaith.m1)
Call:
lm(formula = Duration ~ Interval, data = oldfaith)
Residuals:
Min 1Q Median 3Q Max
-98.310 -22.086 1.861 20.725 71.910
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -112.9431 9.9428 -11.36 <2e-16 ***
Interval 4.5399 0.1374 33.05 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 30.42 on 268 degrees of freedom
Multiple R-squared: 0.8029, Adjusted R-squared: 0.8022
F-statistic: 1092 on 1 and 268 DF, p-value: < 2.2e-16
plot(Duration ~ Interval, data = oldfaith)
abline(oldfaith.m1, col ='red')
Interpretation (keep the units of measurement in mind):
To avoid implying a cause-and-effect relationship, we can interpret the regression coefficient as follows:
a one-minute increase in the variable Interval is associated with an increase of 4.5399 seconds in the variable Duration.
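A quick check of this reading (a minimal sketch): the difference between the fitted values at two consecutive Interval values equals the slope:
diff(predict(oldfaith.m1, data.frame(Interval = c(60, 61))))  # 4.5399, the slope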
# Fitted values
fitted.m1 <- fitted(oldfaith.m1)
head(fitted.m1)
1 2 3 4 5 6
245.7094 132.2118 223.0099 168.5310 272.9488 136.7517
# Residuals
residuales.m1 <- residuals(oldfaith.m1)
head(residuales.m1)
1 2 3 4 5 6
-29.709399 -24.211773 -23.009873 -31.531013 -0.948829 36.248322
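As a sanity check (sketch), the residuals are simply the observed values minus the fitted values:
head(oldfaith$Duration - fitted.m1)  # identical to head(residuales.m1)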
confint(oldfaith.m1, level = 0.92)
4 % 96 %
(Intercept) -130.416157 -95.470043
Interval 4.298474 4.781337
As a general rule, the smaller (more parsimonious) model is preferred.
Hypotheses:
#anova(oldfaith.m1)
summary(oldfaith.m1) # the t statistic assesses the significance of each variable
Call:
lm(formula = Duration ~ Interval, data = oldfaith)
Residuals:
Min 1Q Median 3Q Max
-98.310 -22.086 1.861 20.725 71.910
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -112.9431 9.9428 -11.36 <2e-16 ***
Interval 4.5399 0.1374 33.05 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 30.42 on 268 degrees of freedom
Multiple R-squared: 0.8029, Adjusted R-squared: 0.8022
F-statistic: 1092 on 1 and 268 DF, p-value: < 2.2e-16
# the F statistic assesses the significance of the model as a whole
Comment
Our proposed model usually has a clear advantage over the null model, but that alone does not mean the proposed model is a good model.
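That advantage over the null model can be made explicit (a sketch) with a direct comparison:
oldfaith.m0 <- lm(Duration ~ 1, data = oldfaith)  # null (intercept-only) model
anova(oldfaith.m0, oldfaith.m1)                   # the same test as the F statistic above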
summary(oldfaith.m1)$r.square
[1] 0.8029409
cor(oldfaith$Duration, oldfaith$Interval, method = "pearson")^2
[1] 0.8029409
Interpretation: the model explains about 80.3% of the variability in Duration.
# the value of Duration (Y) when Interval (X) = 95
predict(oldfaith.m1, data.frame(Interval = 95))
1
318.3479
# for X = 95: 98% confidence interval for the mean response
predict(oldfaith.m1, data.frame(Interval = 95), level = 0.98, interval = 'confidence')
fit lwr upr
1 318.3479 309.5284 327.1673
# for X = 95: 98% prediction interval
predict(oldfaith.m1, data.frame(Interval = 95), level = 0.98, interval = 'prediction')
fit lwr upr
1 318.3479 246.618 390.0778
# Residual analysis
par(mfrow = c(2, 2))
plot(oldfaith.m1)
Comments
- The plot shows the cloud of points growing thicker in one direction, which does not allow us to accept the assumption of constant residual variance.
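A formal complement to the visual check (a sketch, assuming the car package is installed):
library(car)
ncvTest(oldfaith.m1)  # H0: constant error variance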
# Training and test sets
library(caret)
RNGkind(sample.kind = "Rejection") # method used to generate the pseudo-random numbers
set.seed(4926)
# Specify the response variable; p = the proportion of the data used for training
ind.train <- createDataPartition(y = oldfaith$Duration, p = 0.75, list = FALSE)
data.train <- oldfaith[ind.train, ]
data.test <- oldfaith[-ind.train, ]
ctrl <- trainControl(method = 'none') # no resampling method (e.g., bootstrap) used yet
m1 <- train(Duration ~ Interval, data = data.train, method = 'lm', # training the model
trControl = ctrl)
m1
Linear Regression
203 samples
1 predictor
No pre-processing
Resampling: None
# Predictions
m1.pred <- predict(m1, newdata = data.test)
# Model evaluation
postResample(m1.pred, data.test$Duration) # compare the predicted test values with the actual test values
RMSE Rsquared MAE
29.5392942 0.7990675 24.0067461
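For reference (a sketch), RMSE is just the root mean squared prediction error, computable by hand:
sqrt(mean((m1.pred - data.test$Duration)^2))  # matches the RMSE reported by postResample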
summary(m1)
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-97.854 -21.854 1.552 22.081 72.595
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -111.1460 11.3559 -9.788 <2e-16 ***
Interval 4.5072 0.1567 28.760 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 30.71 on 201 degrees of freedom
Multiple R-squared: 0.8045, Adjusted R-squared: 0.8035
F-statistic: 827.1 on 1 and 201 DF, p-value: < 2.2e-16
#m1$finalModel
#summary(m1$finalModel)
library(alr4)
data(fuel2001)
head(fuel2001,3)
Drivers FuelC Income Miles MPC Pop Tax
AL 3559897 2382507 23471 94440 12737.00 3451586 18
AK 472211 235400 30064 13628 7639.16 457728 8
AZ 3550367 2428430 25578 55245 9411.55 3907526 18
#help(fuel2001)
dim(fuel2001)
[1] 51 7
# Preprocessing
fuel2001$TasaDrivers <- 1000*fuel2001$Drivers/fuel2001$Pop
fuel2001$TasaFuelC <- 1000*fuel2001$FuelC/fuel2001$Pop # response variable
fuel2001$log2Miles <- log2(fuel2001$Miles)
fuel2001$log2MPC <- log2(fuel2001$MPC)
head(fuel2001,3)
Drivers FuelC Income Miles MPC Pop Tax TasaDrivers TasaFuelC
AL 3559897 2382507 23471 94440 12737.00 3451586 18 1031.3801 690.2644
AK 472211 235400 30064 13628 7639.16 457728 8 1031.6411 514.2792
AZ 3550367 2428430 25578 55245 9411.55 3907526 18 908.5972 621.4751
log2Miles log2MPC
AL 16.52711 13.63674
AK 13.73429 12.89920
AZ 15.75356 13.20022
library(dplyr)
fuel2001.data1 <- fuel2001 %>% select(TasaFuelC, TasaDrivers, Income, log2Miles,
log2MPC, Tax)
head(fuel2001.data1)
TasaFuelC TasaDrivers Income log2Miles log2MPC Tax
AL 690.2644 1031.3801 23471 16.52711 13.63674 18.0
AK 514.2792 1031.6411 30064 13.73429 12.89920 8.0
AZ 621.4751 908.5972 25578 15.75356 13.20022 18.0
AR 655.2927 946.5706 22257 16.58244 13.46000 21.7
CA 573.9129 844.7033 32275 17.36471 13.12346 18.0
CO 616.6115 989.6062 32949 16.38960 13.24715 22.0
cor(fuel2001.data1, method = "pearson")
TasaFuelC TasaDrivers Income log2Miles log2MPC
TasaFuelC 1.0000000 0.46850627 -0.46440498 0.42203233 0.8703852
TasaDrivers 0.4685063 1.00000000 -0.17596063 0.03059068 0.4734608
Income -0.4644050 -0.17596063 1.00000000 -0.29585136 -0.5964404
log2Miles 0.4220323 0.03059068 -0.29585136 1.00000000 0.2967255
log2MPC 0.8703852 0.47346076 -0.59644037 0.29672545 1.0000000
Tax -0.2594471 -0.08584424 -0.01068494 -0.04373696 -0.1386326
Tax
TasaFuelC -0.25944711
TasaDrivers -0.08584424
Income -0.01068494
log2Miles -0.04373696
log2MPC -0.13863257
Tax 1.00000000
cor(fuel2001.data1, method = "spearman")
TasaFuelC TasaDrivers Income log2Miles log2MPC
TasaFuelC 1.0000000 0.25547511 -0.48063348 0.18171946 0.8037104
TasaDrivers 0.2554751 1.00000000 -0.13837104 -0.21574661 0.3259729
Income -0.4806335 -0.13837104 1.00000000 -0.07438914 -0.6913122
log2Miles 0.1817195 -0.21574661 -0.07438914 1.00000000 0.1201810
log2MPC 0.8037104 0.32597285 -0.69131222 0.12018100 1.0000000
Tax -0.3696222 0.06982558 0.05323067 -0.11462277 -0.2039451
Tax
TasaFuelC -0.36962217
TasaDrivers 0.06982558
Income 0.05323067
log2Miles -0.11462277
log2MPC -0.20394511
Tax 1.00000000
library(corrplot)
mx = cor(fuel2001.data1, method = "pearson") # correlation matrix
corrplot(mx,method="number",type="lower",order="hclust")
fuel2001.m1 <- lm(TasaFuelC ~ TasaDrivers + Income + log2Miles + log2MPC + Tax,
data = fuel2001.data1)
summary(fuel2001.m1)
Call:
lm(formula = TasaFuelC ~ TasaDrivers + Income + log2Miles + log2MPC +
Tax, data = fuel2001.data1)
Residuals:
Min 1Q Median 3Q Max
-92.056 -27.704 -1.289 28.720 92.443
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.108e+03 3.924e+02 -7.921 4.49e-10 ***
TasaDrivers 1.045e-01 8.964e-02 1.166 0.2498
Income 1.778e-03 1.627e-03 1.093 0.2803
log2Miles 1.203e+01 4.055e+00 2.967 0.0048 **
log2MPC 2.581e+02 2.955e+01 8.733 3.03e-11 ***
Tax -2.559e+00 1.265e+00 -2.023 0.0490 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 39.97 on 45 degrees of freedom
Multiple R-squared: 0.8183, Adjusted R-squared: 0.7982
F-statistic: 40.55 on 5 and 45 DF, p-value: 1.407e-15
Interpretation, for the Tax variable only (keep the units of measurement in mind):
To avoid implying a cause-and-effect relationship, we can interpret the regression coefficient as follows:
a one-cent-per-gallon increase in the gasoline tax (Tax) is associated with a decrease of 2.559 in the fuel consumption rate (the transformed variable TasaFuelC).
Using a logarithm lets us interpret the variable on its original scale; with log2, a one-unit increase in the transformed variable corresponds to a doubling of the original variable.
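A one-line check of the doubling interpretation (sketch):
log2(c(100, 200, 400))  # each doubling of the original value adds exactly 1 on the log2 scale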
As a general rule, the smaller (more parsimonious) model is preferred.
Hypotheses:
#anova(fuel2001.m1)
summary(fuel2001.m1) # the t statistic assesses the significance of each variable
Call:
lm(formula = TasaFuelC ~ TasaDrivers + Income + log2Miles + log2MPC +
Tax, data = fuel2001.data1)
Residuals:
Min 1Q Median 3Q Max
-92.056 -27.704 -1.289 28.720 92.443
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.108e+03 3.924e+02 -7.921 4.49e-10 ***
TasaDrivers 1.045e-01 8.964e-02 1.166 0.2498
Income 1.778e-03 1.627e-03 1.093 0.2803
log2Miles 1.203e+01 4.055e+00 2.967 0.0048 **
log2MPC 2.581e+02 2.955e+01 8.733 3.03e-11 ***
Tax -2.559e+00 1.265e+00 -2.023 0.0490 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 39.97 on 45 degrees of freedom
Multiple R-squared: 0.8183, Adjusted R-squared: 0.7982
F-statistic: 40.55 on 5 and 45 DF, p-value: 1.407e-15
# the F statistic assesses the significance of the model as a whole
fuel2001.m2 <- lm(TasaFuelC ~ log2Miles + log2MPC + Tax, data = fuel2001.data1)
summary(fuel2001.m2)
Call:
lm(formula = TasaFuelC ~ log2Miles + log2MPC + Tax, data = fuel2001.data1)
Residuals:
Min 1Q Median 3Q Max
-97.424 -26.837 -2.038 29.229 106.588
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2906.674 282.284 -10.297 1.23e-13 ***
log2Miles 10.721 4.021 2.667 0.0105 *
log2MPC 255.661 21.725 11.768 1.30e-15 ***
Tax -2.761 1.268 -2.177 0.0346 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 40.36 on 47 degrees of freedom
Multiple R-squared: 0.8065, Adjusted R-squared: 0.7941
F-statistic: 65.29 on 3 and 47 DF, p-value: < 2.2e-16
# Obtain the coefficient of determination directly
summary(fuel2001.m2)$r.squared
[1] 0.8064858
anova(fuel2001.m2, fuel2001.m1) # anova(reduced model with only the significant variables, full model)
Analysis of Variance Table
Model 1: TasaFuelC ~ log2Miles + log2MPC + Tax
Model 2: TasaFuelC ~ TasaDrivers + Income + log2Miles + log2MPC + Tax
Res.Df RSS Df Sum of Sq F Pr(>F)
1 47 76572
2 45 71879 2 4693.9 1.4693 0.2409
predict(fuel2001.m2, data.frame(log2Miles = 15, log2MPC = 13, Tax = 20),
level = 0.94, interval = "prediction")
fit lwr upr
1 522.5326 442.8841 602.1811
# The fit value is the point estimate obtained by plugging the values directly into the model equation
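The same point estimate can be reproduced by hand (a sketch) from the coefficients:
b <- coef(fuel2001.m2)
b["(Intercept)"] + b["log2Miles"]*15 + b["log2MPC"]*13 + b["Tax"]*20  # matches the fit value above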
The variance inflation factor (VIF) is an indicator used to detect multicollinearity.
Range | Description
---|---
VIF >= 10 | Severe multicollinearity
5 <= VIF < 10 | Requires attention
VIF < 5 | No attention required

If more than one variable shows multicollinearity, we can drop the variable with the largest VIF and refit the model, as sketched after the vif() call below.
library(car)
vif(fuel2001.m2)
log2Miles log2MPC Tax
1.096555 1.115904 1.019603
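A hypothetical sketch of the drop-and-refit step (here all VIFs are near 1, so nothing needs to be dropped):
vif(fuel2001.m1)  # check the full five-predictor model
# e.g., if log2MPC had the largest VIF >= 10: update(fuel2001.m1, . ~ . - log2MPC)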
# Question 1h: checking the assumptions
#windows()
par(mfrow = c(2, 2))
plot(fuel2001.m2)
library(caret)
RNGkind(sample.kind = "Rejection")
set.seed(5701)
ind.train <- createDataPartition(y = fuel2001.data1$TasaFuelC, p = 0.72,
list = FALSE)
data.train <- fuel2001.data1[ind.train, ]
data.test <- fuel2001.data1[-ind.train, ]
ctrl <- trainControl(method = 'none')
m1 <- train(TasaFuelC ~ log2Miles + log2MPC + Tax, data = data.train,
method = 'lm', trControl = ctrl)
m1
Linear Regression
39 samples
3 predictor
No pre-processing
Resampling: None
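Following the same pattern as with oldfaith (a sketch), the held-out evaluation would be:
m1.pred <- predict(m1, newdata = data.test)
postResample(m1.pred, data.test$TasaFuelC)  # RMSE, R-squared and MAE on the test set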
# Libraries
# hubble data
library(gamair)
Warning: package 'gamair' was built under R version 4.2.3
data(hubble)
head(hubble,3)
Galaxy y x
1 NGC0300 133 2.00
2 NGC0925 664 9.16
3 NGC1326A 1794 16.14
dim(hubble)
[1] 24 3
#help(hubble)
plot(y ~ x, data = hubble, xlab = "Distancia (Mpc)", ylab = "Velocidad (km/s)")
hubble.m1 <- lm(y ~ x - 1, data = hubble) # append -1 to the formula to indicate that no intercept is required
#For a multiple model without an intercept: m1 = lm(y ~ x1 + x2 + x3 - 1, data = hubble)
#Another form for a multiple model without an intercept: m1 = lm(y ~ 0 + x1 + x2 + x3, data = hubble)
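A quick equivalence check of the two no-intercept forms (sketch):
all.equal(coef(lm(y ~ x - 1, data = hubble)), coef(lm(y ~ 0 + x, data = hubble)))  # TRUE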
summary(hubble.m1)
Call:
lm(formula = y ~ x - 1, data = hubble)
Residuals:
Min 1Q Median 3Q Max
-736.5 -132.5 -19.0 172.2 558.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x 76.581 3.965 19.32 1.03e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 258.9 on 23 degrees of freedom
Multiple R-squared: 0.9419, Adjusted R-squared: 0.9394
F-statistic: 373.1 on 1 and 23 DF, p-value: 1.032e-15
#abline(hubble.m1, col = "red")
# Libraries, Hitters data
library(ISLR)
data(Hitters)
head(Hitters,3)
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
-Andy Allanson 293 66 1 30 29 14 1 293 66 1 30
-Alan Ashby 315 81 7 24 38 39 14 3449 835 69 321
-Alvin Davis 479 130 18 66 72 76 3 1624 457 63 224
CRBI CWalks League Division PutOuts Assists Errors Salary
-Andy Allanson 29 14 A E 446 33 20 NA
-Alan Ashby 414 375 N W 632 43 10 475
-Alvin Davis 266 263 A W 880 82 14 480
NewLeague
-Andy Allanson A
-Alan Ashby N
-Alvin Davis A
#help(Hitters)
dim(Hitters)
[1] 322 20
library(VIM)
summary(aggr(Hitters))
Missings per variable:
Variable Count
AtBat 0
Hits 0
HmRun 0
Runs 0
RBI 0
Walks 0
Years 0
CAtBat 0
CHits 0
CHmRun 0
CRuns 0
CRBI 0
CWalks 0
League 0
Division 0
PutOuts 0
Assists 0
Errors 0
Salary 59
NewLeague 0
Missings in combinations of variables:
Combinations Count Percent
0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 263 81.67702
0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:1:0 59 18.32298
#Another way: for a single variable
sum(is.na(Hitters$Salary))
[1] 59
#Another way: for the whole data frame
colSums(is.na(Hitters))
AtBat Hits HmRun Runs RBI Walks Years CAtBat
0 0 0 0 0 0 0 0
CHits CHmRun CRuns CRBI CWalks League Division PutOuts
0 0 0 0 0 0 0 0
Assists Errors Salary NewLeague
0 0 59 0
#Another way: percentage of missing values for the whole data frame
round(100 * colMeans(is.na(Hitters)), 2)
AtBat Hits HmRun Runs RBI Walks Years CAtBat
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
CHits CHmRun CRuns CRBI CWalks League Division PutOuts
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Assists Errors Salary NewLeague
0.00 0.00 18.32 0.00
#Returns the complete data frame, but with the missing values imputed.
#Adds a column indicating which values were imputed
Hitters.Imp <- regressionImp(formula = Salary ~ AtBat + Hits, data = Hitters)
head(Hitters.Imp,3)
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
-Andy Allanson 293 66 1 30 29 14 1 293 66 1 30
-Alan Ashby 315 81 7 24 38 39 14 3449 835 69 321
-Alvin Davis 479 130 18 66 72 76 3 1624 457 63 224
CRBI CWalks League Division PutOuts Assists Errors Salary
-Andy Allanson 29 14 A E 446 33 20 326.9715
-Alan Ashby 414 375 N W 632 43 10 475.0000
-Alvin Davis 266 263 A W 880 82 14 480.0000
NewLeague Salary_imp
-Andy Allanson A TRUE
-Alan Ashby N FALSE
-Alvin Davis A FALSE
nrow(Hitters.Imp )
[1] 322
str(Hitters.Imp)
'data.frame': 322 obs. of 21 variables:
$ AtBat : int 293 315 479 496 321 594 185 298 323 401 ...
$ Hits : int 66 81 130 141 87 169 37 73 81 92 ...
$ HmRun : int 1 7 18 20 10 4 1 0 6 17 ...
$ Runs : int 30 24 66 65 39 74 23 24 26 49 ...
$ RBI : int 29 38 72 78 42 51 8 24 32 66 ...
$ Walks : int 14 39 76 37 30 35 21 7 8 65 ...
$ Years : int 1 14 3 11 2 11 2 3 2 13 ...
$ CAtBat : int 293 3449 1624 5628 396 4408 214 509 341 5206 ...
$ CHits : int 66 835 457 1575 101 1133 42 108 86 1332 ...
$ CHmRun : int 1 69 63 225 12 19 1 0 6 253 ...
$ CRuns : int 30 321 224 828 48 501 30 41 32 784 ...
$ CRBI : int 29 414 266 838 46 336 9 37 34 890 ...
$ CWalks : int 14 375 263 354 33 194 24 12 8 866 ...
$ League : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
$ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
$ PutOuts : int 446 632 880 200 805 282 76 121 143 0 ...
$ Assists : int 33 43 82 11 40 421 127 283 290 0 ...
$ Errors : int 20 10 14 3 4 25 7 9 19 0 ...
$ Salary : num 327 475 480 500 91.5 ...
$ NewLeague : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...
$ Salary_imp: logi TRUE FALSE FALSE FALSE FALSE FALSE ...
#Filtering only the imputed values of the Salary variable
head(Hitters.Imp[Hitters.Imp$Salary_imp==TRUE,],3)
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
-Andy Allanson 293 66 1 30 29 14 1 293 66 1 30
-Billy Beane 183 39 3 20 15 11 3 201 42 3 20
-Bruce Bochte 407 104 6 57 43 65 12 5233 1478 100 643
CRBI CWalks League Division PutOuts Assists Errors Salary
-Andy Allanson 29 14 A E 446 33 20 326.9715
-Billy Beane 16 11 A W 118 0 0 239.0094
-Bruce Bochte 658 653 A W 912 88 9 500.4007
NewLeague Salary_imp
-Andy Allanson A TRUE
-Billy Beane A TRUE
-Bruce Bochte A TRUE
#Loading libraries
library(ISLR)
data(Auto)
#head(Auto)
dim(Auto)
[1] 392 9
library(dplyr)
# Auto.data <- Auto %>% select(mpg, horsepower)
Auto.data <- Auto |> select(mpg, horsepower)
head(Auto.data,3)
mpg horsepower
1 18 130
2 15 165
3 18 150
library(ggplot2)
ggplot(Auto.data, aes(x = horsepower, y = mpg)) + geom_point()
Auto.m1 <- lm(mpg ~ horsepower, data = Auto.data)
summary(Auto.m1)
Call:
lm(formula = mpg ~ horsepower, data = Auto.data)
Residuals:
Min 1Q Median 3Q Max
-13.5710 -3.2592 -0.3435 2.7630 16.9240
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.935861 0.717499 55.66 <2e-16 ***
horsepower -0.157845 0.006446 -24.49 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.906 on 390 degrees of freedom
Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
ggplot(Auto.data, aes(x = horsepower, y = mpg)) + geom_point() +
stat_smooth(method = "lm", formula = y ~ x, color = "red")
Auto.m2 <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto.data)
summary(Auto.m2)
Call:
lm(formula = mpg ~ horsepower + I(horsepower^2), data = Auto.data)
Residuals:
Min 1Q Median 3Q Max
-14.7135 -2.5943 -0.0859 2.2868 15.8961
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.9000997 1.8004268 31.60 <2e-16 ***
horsepower -0.4661896 0.0311246 -14.98 <2e-16 ***
I(horsepower^2) 0.0012305 0.0001221 10.08 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.374 on 389 degrees of freedom
Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
ggplot(Auto.data, aes(x = horsepower, y = mpg)) + geom_point() +
stat_smooth(method = "lm", formula = y ~ x + I(x^2), color = "red")
Auto.m3 <- lm(mpg ~ horsepower + I(horsepower^2) + I(horsepower^3), data = Auto.data)
summary(Auto.m3)
Call:
lm(formula = mpg ~ horsepower + I(horsepower^2) + I(horsepower^3),
data = Auto.data)
Residuals:
Min 1Q Median 3Q Max
-14.7039 -2.4491 -0.1519 2.2035 15.8159
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.068e+01 4.563e+00 13.298 < 2e-16 ***
horsepower -5.689e-01 1.179e-01 -4.824 2.03e-06 ***
I(horsepower^2) 2.079e-03 9.479e-04 2.193 0.0289 *
I(horsepower^3) -2.147e-06 2.378e-06 -0.903 0.3673
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.375 on 388 degrees of freedom
Multiple R-squared: 0.6882, Adjusted R-squared: 0.6858
F-statistic: 285.5 on 3 and 388 DF, p-value: < 2.2e-16
ggplot(Auto.data, aes(x = horsepower, y = mpg)) + geom_point() +
stat_smooth(method = "lm", formula = y ~ x + I(x^2) + I(x^3), color = "red")
summary(Auto.m1)$r.squared
[1] 0.6059483
summary(Auto.m2)$r.squared
[1] 0.687559
summary(Auto.m3)$r.squared
[1] 0.6882137
RNGkind(sample.kind = "Rejection")
library(caret)
set.seed(100)
ind.train <- createDataPartition(y = Auto.data$mpg, p = 0.70, list = FALSE)
# Training set
Auto.train <- Auto.data[ind.train, ]
head(Auto.train, 3)
mpg horsepower
1 18 130
2 15 165
3 18 150
# Test set
Auto.test <- Auto.data[-ind.train, ]
head(Auto.test, 3)
mpg horsepower
8 14 215
10 15 190
13 15 150
# Resampling method
train.cont <- trainControl(method = "none")
# Fitting the model on the training data
Auto.m1 <- train(mpg ~ poly(horsepower, 1), data = Auto.train, method = "lm",
trControl = train.cont)
summary(Auto.m1)
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-13.6362 -3.1082 -0.2193 2.7175 16.8370
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.4279 0.2951 79.38 <2e-16 ***
`poly(horsepower, 1)` -100.7969 4.9033 -20.56 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.903 on 274 degrees of freedom
Multiple R-squared: 0.6067, Adjusted R-squared: 0.6052
F-statistic: 422.6 on 1 and 274 DF, p-value: < 2.2e-16
Auto.m2 <- train(mpg ~ poly(horsepower, 2), data = Auto.train, method = "lm",
trControl = train.cont)
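The two trained models can then be compared on the held-out test set (a sketch, mirroring the earlier evaluation):
postResample(predict(Auto.m1, newdata = Auto.test), Auto.test$mpg)  # linear fit
postResample(predict(Auto.m2, newdata = Auto.test), Auto.test$mpg)  # quadratic fit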
#Libraries
library(alr4)
data(cakes)
head(cakes,3)
block X1 X2 Y
1 0 33 340 3.89
2 0 37 340 6.36
3 0 33 360 7.65
cakes.m1 <- lm(Y ~ X1 + X2, data = cakes) # multiple linear regression
summary(cakes.m1)
Call:
lm(formula = Y ~ X1 + X2, data = cakes)
Residuals:
Min 1Q Median 3Q Max
-1.9376 -1.0188 0.3622 0.9689 1.4114
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -39.57329 17.16141 -2.306 0.0416 *
X1 0.36756 0.21924 1.677 0.1218
X2 0.09639 0.04385 2.198 0.0502 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.24 on 11 degrees of freedom
Multiple R-squared: 0.41, Adjusted R-squared: 0.3027
F-statistic: 3.822 on 2 and 11 DF, p-value: 0.05493
cakes.m2 <- lm(Y ~ X1 + X2 + I(X1^2) + I(X2^2), data = cakes) # polynomial without an interaction term
summary(cakes.m2)
Call:
lm(formula = Y ~ X1 + X2 + I(X1^2) + I(X2^2), data = cakes)
Residuals:
Min 1Q Median 3Q Max
-1.1565 -0.3615 0.0200 0.3285 1.1737
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.695e+03 3.247e+02 -5.219 0.000550 ***
X1 1.135e+01 4.423e+00 2.566 0.030404 *
X2 8.461e+00 1.769e+00 4.784 0.000996 ***
I(X1^2) -1.569e-01 6.317e-02 -2.483 0.034791 *
I(X2^2) -1.195e-02 2.527e-03 -4.730 0.001075 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6866 on 9 degrees of freedom
Multiple R-squared: 0.852, Adjusted R-squared: 0.7863
F-statistic: 12.96 on 4 and 9 DF, p-value: 0.0008913
cakes.m3 <- lm(Y ~ X1 + X2 + I(X1^2) + I(X2^2) + X1*X2, data = cakes) # polynomial with an interaction term (X1 * X2)
summary(cakes.m3)
Call:
lm(formula = Y ~ X1 + X2 + I(X1^2) + I(X2^2) + X1 * X2, data = cakes)
Residuals:
Min 1Q Median 3Q Max
-0.4912 -0.3080 0.0200 0.2658 0.5454
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.204e+03 2.416e+02 -9.125 1.67e-05 ***
X1 2.592e+01 4.659e+00 5.563 0.000533 ***
X2 9.918e+00 1.167e+00 8.502 2.81e-05 ***
I(X1^2) -1.569e-01 3.945e-02 -3.977 0.004079 **
I(X2^2) -1.195e-02 1.578e-03 -7.574 6.46e-05 ***
X1:X2 -4.163e-02 1.072e-02 -3.883 0.004654 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4288 on 8 degrees of freedom
Multiple R-squared: 0.9487, Adjusted R-squared: 0.9167
F-statistic: 29.6 on 5 and 8 DF, p-value: 5.864e-05
summary(cakes.m1)$r.squared
[1] 0.4099781
summary(cakes.m2)$r.squared
[1] 0.8520354
summary(cakes.m3)$r.squared
[1] 0.9487109
library(rsm)
cakes.rs.m3 <- rsm(Y ~ SO(X1, X2), data = cakes) # SO() defines the full second-order polynomial model, equivalent to the polynomial with interaction (cakes.m3)
summary(cakes.rs.m3)
Call:
rsm(formula = Y ~ SO(X1, X2), data = cakes)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.2045e+03 2.4158e+02 -9.1253 1.674e-05 ***
X1 2.5918e+01 4.6589e+00 5.5630 0.0005328 ***
X2 9.9183e+00 1.1666e+00 8.5022 2.810e-05 ***
X1:X2 -4.1625e-02 1.0719e-02 -3.8832 0.0046537 **
X1^2 -1.5687e-01 3.9446e-02 -3.9770 0.0040789 **
X2^2 -1.1950e-02 1.5778e-03 -7.5737 6.462e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Multiple R-squared: 0.9487, Adjusted R-squared: 0.9167
F-statistic: 29.6 on 5 and 8 DF, p-value: 5.864e-05
Analysis of Variance Table
Response: Y
Df Sum Sq Mean Sq F value Pr(>F)
FO(X1, X2) 2 11.7564 5.8782 31.9739 0.0001529
TWI(X1, X2) 1 2.7722 2.7722 15.0793 0.0046537
PQ(X1, X2) 2 12.6762 6.3381 34.4757 0.0001168
Residuals 8 1.4707 0.1838
Lack of fit 3 0.9859 0.3286 3.3895 0.1110072
Pure error 5 0.4848 0.0970
Stationary point of response surface:
X1 X2
35.82766 352.59167
Eigenanalysis:
eigen() decomposition
$values
[1] -0.009020365 -0.159804635
$vectors
[,1] [,2]
X1 0.1393891 -0.9902377
X2 -0.9902377 -0.1393891
persp(cakes.rs.m3, X1 ~ X2)
contour(cakes.rs.m3, X1 ~ X2, image = TRUE)
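As a closing check (sketch), the SO() fit coincides with the hand-built polynomial with interaction:
all.equal(unname(fitted(cakes.rs.m3)), unname(fitted(cakes.m3)))  # TRUE: same model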