Regresion Lineal Simple y Multiple

prob-22(R.L.simple):La siguiente tabla muestra los valores del consumo de metilmercurio y la concentracion total de mercurio en la sangre en 12 individuos expuestos al metilmercurio por consumir peces contaminados.

consumomercurio <- c(180,200,230,410,600,550,275,580,105,250,460,650)
mercuriosangre <- c(90,120,125,290,310,290,170,375,70,105,205,480)
tabla<- data.frame(consumomercurio,mercuriosangre)
tabla

##    consumomercurio mercuriosangre
## 1              180             90
## 2              200            120
## 3              230            125
## 4              410            290
## 5              600            310
## 6              550            290
## 7              275            170
## 8              580            375
## 9              105             70
## 10             250            105
## 11             460            205
## 12             650            480

Grafico de dispersion

Evaluamos la posible relacion lineal entre ambas variables.

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Estimando la ecuacion de regresion lineal simple

Podemos observar que el valor del intercepto y de la pendiente por lo tanto la ecuacion tendria la siguiente forma:

modelo <-lm(mercuriosangre~consumomercurio)
modelo

## 
## Call:
## lm(formula = mercuriosangre ~ consumomercurio)
## 
## Coefficients:
##     (Intercept)  consumomercurio  
##        -20.5791           0.6407

 Y = -20.5791 + 0.6407x

Pasamos a interpretar la pendiente: Por cada aumento de 1 ug Hg/dia de consumo de pescado contaminado con mercurio de metil, corresponde un aumento de 0.6407 ng/g de mercurio en toda la sangre.

Visulaizando la rectaestimada en el grafico de dispersion

plot(consumomercurio,mercuriosangre, pch= 21, col= "red", main = "Relacion entre el consumo de mercurio metil y el mercurio en la sangre",xlab = "Consumo de mercurio", ylab = "mercurio en la sangre",abline(modelo,col= "blue"))

Inferencia de coeficientes

summary(modelo)

## 
## Call:
## lm(formula = mercuriosangre ~ consumomercurio)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.164 -36.413   5.319  23.462  84.094 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -20.57906   30.66189  -0.671    0.517    
## consumomercurio   0.64075    0.07373   8.691 5.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.37 on 10 degrees of freedom
## Multiple R-squared:  0.8831, Adjusted R-squared:  0.8714 
## F-statistic: 75.53 on 1 and 10 DF,  p-value: 5.66e-06

El modelo con la variable de prediccion tiene un R2( coef. determinacion) alta (0.8831), lo que quiere decir, que la variabilidad del mercurio en toda la sangre de los individuos se ve explicada en 88.31% por el consumo de peces contaminados por mercurio metil.

Analisis de varianza(ANOVA)

anova(modelo)

## Analysis of Variance Table
## 
## Response: mercuriosangre
##                 Df Sum Sq Mean Sq F value   Pr(>F)    
## consumomercurio  1 162392  162392  75.531 5.66e-06 ***
## Residuals       10  21500    2150                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

El test F muestra un p-value(5.66e-06) es significativo por lo que se puede aceptar el modelo

Conclusion:

Se satisfacen todas las condiciones para este tipo de regresion lineal

Prob-10.3.6(R.L.Multiple):

se registraron los siguientes datos para una muestra aleatoria simple de 20 pacientes con hipertension. las variables son: Y = Presion arterial sanguinea media(mmhg) x1= Edad(años) x2= Peso(Kg) x3= Area de la superficie corporal(m2) x4= Duracion de la hipertension(años) x5= Pulso basico(latidos/min) x6= Medicion del estres

instalando los paquetes necesarios para cagar la data

library(readxl)

## Warning: package 'readxl' was built under R version 3.4.4

Cargando la data

prob_10_3_6 <- read_excel("C:/Users/Franco/Desktop/ciclo 7/regresion 1/prob-10.3.6.xlsx")
View(prob_10_3_6)
attach(prob_10_3_6)
data<- data.frame(y,x1,x2,x3,x4,x5,x6)
data

##      y x1    x2   x3   x4 x5 x6
## 1  105 47  85.4 1.75  5.1 63 33
## 2  115 49  94.2 2.10  3.8 70 14
## 3  116 49  95.3 1.98  8.2 72 10
## 4  117 50  94.7 2.01  5.8 73 99
## 5  112 51  89.4 1.89  7.0 72 95
## 6  121 48  99.5 2.25  9.3 71 10
## 7  121 49  99.8 2.25  2.5 69 42
## 8  110 47  90.9 1.90  6.2 66  8
## 9  110 49  89.2 1.83  7.1 69 62
## 10 114 48  92.7 2.07  5.6 64 35
## 11 114 47  94.4 2.07  5.3 74 90
## 12 115 49  94.1 1.98  5.6 71 21
## 13 114 50  91.6 2.05 10.2 68 47
## 14 106 45  87.1 1.92  5.6 67 80
## 15 125 52 101.3 2.19 10.0 76 98
## 16 114 46  94.5 1.98  7.4 69 95
## 17 106 46  87.0 1.87  3.6 62 18
## 18 113 46  94.5 1.90  4.3 70 12
## 19 110 48  90.5 1.88  9.0 71 99
## 20 122 56  95.7 2.09  7.0 75 99

Analizando la correlacion para saber que variables presentan un relacion del tipo no lineal.

round(cor(x= data,method = "pearson"),3)

##        y    x1    x2    x3    x4    x5    x6
## y  1.000 0.659 0.950 0.866 0.293 0.721 0.164
## x1 0.659 1.000 0.407 0.378 0.344 0.619 0.368
## x2 0.950 0.407 1.000 0.875 0.201 0.659 0.034
## x3 0.866 0.378 0.875 1.000 0.131 0.465 0.018
## x4 0.293 0.344 0.201 0.131 1.000 0.402 0.312
## x5 0.721 0.619 0.659 0.465 0.402 1.000 0.506
## x6 0.164 0.368 0.034 0.018 0.312 0.506 1.000

require(GGally)

## Loading required package: GGally

## Warning: package 'GGally' was built under R version 3.4.4

ggpairs(data, lower = list(continuous = "smooth"),
        diag = list(continuous = "bar"), axisLabels = "none")

## Warning in check_and_set_ggpairs_defaults("diag", diag, continuous =
## "densityDiag", : Changing diag$continuous from 'bar' to 'barDiag'

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Podemos observar:

La variable medicion de estres presenta una relacion inversa con el peso y el area de la superficie corporal.
La presion arterial sanguinea media presenta una relacion positiva y alta con la edad el peso y el area de la superficie corporal.

Generando el modelo:

modelo1 <- lm(y~x1+x2+x3+x4+x5+x6)
modelo1

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6)
## 
## Coefficients:
## (Intercept)           x1           x2           x3           x4  
##  -12.870476     0.703259     0.969920     3.776491     0.068383  
##          x5           x6  
##   -0.084485     0.005572

Inferencia de coeficientes:

summary(modelo1)

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.93213 -0.11314  0.03064  0.21834  0.48454 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12.870476   2.556650  -5.034 0.000229 ***
## x1            0.703259   0.049606  14.177 2.76e-09 ***
## x2            0.969920   0.063108  15.369 1.02e-09 ***
## x3            3.776491   1.580151   2.390 0.032694 *  
## x4            0.068383   0.048441   1.412 0.181534    
## x5           -0.084485   0.051609  -1.637 0.125594    
## x6            0.005572   0.003412   1.633 0.126491    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4072 on 13 degrees of freedom
## Multiple R-squared:  0.9962, Adjusted R-squared:  0.9944 
## F-statistic: 560.6 on 6 and 13 DF,  p-value: 6.395e-15

Podemos observar que el R2(coef. de determinacion) es alto, esto quiere decir que la variabilidad de la presion arterial sanguinea media de los pacientes con hipertension se ve explicada en un 99.57% por las demas variables.

Analisis de varianza(ANOVA):

anova(modelo1)

## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq   F value    Pr(>F)    
## x1         1 243.266 243.266 1466.9140 9.384e-15 ***
## x2         1 311.910 311.910 1880.8435 1.887e-15 ***
## x3         1   1.768   1.768   10.6599  0.006149 ** 
## x4         1   0.335   0.335    2.0207  0.178711    
## x5         1   0.123   0.123    0.7421  0.404584    
## x6         1   0.442   0.442    2.6659  0.126491    
## Residuals 13   2.156   0.166                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

El test F muestra un p-value(1.873e-13) es significativo por lo que se puede aceptar el modelo. Se puede observar tambien que las variables x4 x5 x6 tienen un p-value mayor que 0.05 por lo tanto se puede obviar de aquellas variables en el modelo.

Estimaciones:

y.est <- predict(modelo1)
y.est

##        1        2        3        4        5        6        7        8 
## 104.8328 115.3102 116.0336 116.5155 111.7692 120.5834 121.4599 110.4163 
##        9       10       11       12       13       14       15       16 
## 110.0186 113.7858 114.1725 114.8377 114.0934 105.6753 125.0314 113.8202 
##       17       18       19       20 
## 106.0329 112.7592 110.9321 121.9201

#calculamos los errores(E)
E<- y-y.est
E

##           1           2           3           4           5           6 
##  0.16719843 -0.31023699 -0.03359983  0.48453838  0.23077440  0.41663789 
##           7           8           9          10          11          12 
## -0.45985014 -0.41631378 -0.01856626  0.21419770 -0.17247725  0.16232862 
##          13          14          15          16          17          18 
## -0.09336075  0.32473685 -0.03140253  0.17978903 -0.03293036  0.24081494 
##          19          20 
## -0.93213104  0.07985268

#Juntamos todo en una tabla
data <- data.frame(data,y.est,E)
data

##      y x1    x2   x3   x4 x5 x6    y.est           E
## 1  105 47  85.4 1.75  5.1 63 33 104.8328  0.16719843
## 2  115 49  94.2 2.10  3.8 70 14 115.3102 -0.31023699
## 3  116 49  95.3 1.98  8.2 72 10 116.0336 -0.03359983
## 4  117 50  94.7 2.01  5.8 73 99 116.5155  0.48453838
## 5  112 51  89.4 1.89  7.0 72 95 111.7692  0.23077440
## 6  121 48  99.5 2.25  9.3 71 10 120.5834  0.41663789
## 7  121 49  99.8 2.25  2.5 69 42 121.4599 -0.45985014
## 8  110 47  90.9 1.90  6.2 66  8 110.4163 -0.41631378
## 9  110 49  89.2 1.83  7.1 69 62 110.0186 -0.01856626
## 10 114 48  92.7 2.07  5.6 64 35 113.7858  0.21419770
## 11 114 47  94.4 2.07  5.3 74 90 114.1725 -0.17247725
## 12 115 49  94.1 1.98  5.6 71 21 114.8377  0.16232862
## 13 114 50  91.6 2.05 10.2 68 47 114.0934 -0.09336075
## 14 106 45  87.1 1.92  5.6 67 80 105.6753  0.32473685
## 15 125 52 101.3 2.19 10.0 76 98 125.0314 -0.03140253
## 16 114 46  94.5 1.98  7.4 69 95 113.8202  0.17978903
## 17 106 46  87.0 1.87  3.6 62 18 106.0329 -0.03293036
## 18 113 46  94.5 1.90  4.3 70 12 112.7592  0.24081494
## 19 110 48  90.5 1.88  9.0 71 99 110.9321 -0.93213104
## 20 122 56  95.7 2.09  7.0 75 99 121.9201  0.07985268

Predicciones :

Ynew <- data.frame(x1= c(60,61,62),x2= c(102,90,100),x3= c(2.5,2.1,8),x4= c(10.5,9,8),x5= c(77,60,65),x6= c(99.5,45,80) )
Ynew

##   x1  x2  x3   x4 x5   x6
## 1 60 102 2.5 10.5 77 99.5
## 2 61  90 2.1  9.0 60 45.0
## 3 62 100 8.0  8.0 65 80.0

ypredic <-predict(modelo1,Ynew)
ypredic

##        1        2        3 
## 132.4652 121.0488 153.4368