Regression Analysis

Se você me disser sua altura, posso ter uma ideia do seu peso.
Se você me disser qual sua idade, posso lhe indicar uma música que provavelmente irá gostar.
Se me disser sua escolaridade, posso ter uma estimativa de quanto você ganha por mês.
E se me disser qual foi o último filme que assistiu, posso lhe indicar um que com certeza irá gostar.

Isso é apenas uma amostra do que é regressão linear. Regressão linear nada mais é do que estudar em números a relação entre variáveis. Isso são algoritmos que conseguem estimar o que você pode vir a gostar baseado no que você já gostou ou visitou. Essa é a ideia central de uma regressão linear: estudar o que temos para poder entender como reagir no futuro.

O que é Regressão Linear?
Regressão é o mesmo que correlação?
O que preciso para fazer uma regressão?
Entendendo o output de uma regressão?
Como encontrar o número mínimo ideal de variáveis da regressão?
Conceituação: Intervalos de confiança
Conceituação: Betas padronizados
Conceituação: Linearidade
Conceituação: Multicolirearidade (Tolerance)
Conceituação: Correlação (Pearson Correlation Coefficient)
Conceituação: Gráficos de Diagnóstico
Regressão linear simples
Passo-a-passo: Análise da Regressão Linear Simples
Regressão linear múltipla
Passo-a-passo: Análise da Regressão Linear Múltipla
Erro residual
Making predictions
Regressão logística
Entenda o efeito dos fatores (levels) numa Regressão linear

20171228091600

O que é Regressão Linear?

Regressão linear é estudar uma variável de interesse (variável dependente) em função de outras variáveis que irão me ajudar a entendê-la, as chamadas covariáveis ou variáveis independentes.

Essa relação pode ser estudada matematicamente, ou seja, podemos quantificar em números quanto uma variável está relacionada a outra e como ela impacta em seu comportamento.

No caso de duas variáveis, quando tenho dados de um par do tipo (x,y) e coloco em um gráfico é possível observar aproximadamente como elas se comportam. A regressão nos diz qual a melhor curva a ser traçada que se ajusta melhor nesses dados, podendo ser uma reta ou uma curva qualquer.

Só de olhar para o gráfico você já pode ter um palpite se é uma reta, ou um polinômio de segundo grau, por exemplo. O importante é que o método lhe diz qual a melhor curva de forma que a distância entre os pontos e a curva ajustada seja a menor possível. Ou seja, minimizo o erro e consigo estimar com maior precisão.

A lição aqui é verificar que apesar do nome ser Regressão Linear, temos mais opções além de traçar a melhor reta, podendo por exemplo ser uma curva.

20171228091601

Regressão é o mesmo que correlação?

Há uma relação entre elas, mas não se confunda, não são a mesma coisa! A correlação é um número de -1 a 1 que nos diz o quanto duas variáveis estão relacionadas, sendo que quanto mais próximo de 1 indica que quanto mais eu aumento uma, mais a outra também irá aumentar e quanto mais próximo de -1 indica que quanto mais eu aumento uma, mais eu diminuo a outra. Mas cuidado! Alta correlação não implica causalidade!

A relação entre correlação e regressão linear se pelo fato de que eu consigo construir uma boa regressão se minha variável de interesse e as variáveis independentes tiverem uma relação.

Como você conseguiria estimar o preço do café baseado na nota média do ENEM?
Ou como estimar o salário de alguém baseado em quanto tempo ele(a) consegue comer um cachorro quente?

Não consegue pois são variáveis sem relação alguma!

20171228091602

O que preciso para fazer uma regressão?

Dados! Precisamos de dados de todas as variáveis envolvidas no processo. Apesar de termos mostrado apenas os exemplos de regressão linear simples, ou seja, o estudo da relação de apenas duas variáveis, podemos estudar a variável de interesse em função de várias variáveis independentes.

A ideia é que quanto mais informação eu tenho acerca de um assunto, melhor eu posso falar sobre ele, certo? O mesmo funciona para a regressão.

20170102084500

Entendendo o output de uma regressão

Grades = c(82,98,76,68,84,99,67,58,50,78) # Dependent variable
Absences = c(4,2,2,3,1,0,4,8,7,3) # Independent variable
SAT_Score = c(620,750,500,520,540,690,590,490,450,560) # Independent variable
Regression001 = lm(Grades ~ Absences+SAT_Score)
summary(Regression001)

## 
## Call:
## lm(formula = Grades ~ Absences + SAT_Score)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.791 -1.809  1.060  2.691  5.016 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 33.42231   13.58416   2.460  0.04344 * 
## Absences    -3.34018    0.77323  -4.320  0.00348 **
## SAT_Score    0.09446    0.02067   4.569  0.00258 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.729 on 7 degrees of freedom
## Multiple R-squared:  0.9308, Adjusted R-squared:  0.911 
## F-statistic: 47.07 on 2 and 7 DF,  p-value: 8.724e-05

F-statistic:

Teste conjunto de significância dos parâmetros do modelo (também conhecido como Teste de significancia do modelo com p-value).
Se o p-value de F não for significante a análise já deve ser descartada.
Uma opção para não descartar é remover observações.
Geralmente não funciona mas fica o lembrete.

Ajustamento (R2):

Se a regressão é simples usa-se o Multiple R-squared.
Se a regressão é múltipla usa-se o Adjusted R-squared.

O R2 explica a variação da variávbel resposta.

De forma mais simples, dá o percentual de quanto a variável explicativa (independente) consegue explicar a resposta (dependente).

Coefficients:

A seção dos Coefficients mostra os valores das variáveis independentes e deve-se observar se os mesmos são significativos.

Na regressão linear simples não faz muito sentido a variável independente não ser significativa pois isso inviabiliza a regressão e se o p-value for não significativo isso significa que provavelmente o p-value de F já foi não significativo também.

Na regressão linear múltipla se o p-value do F for significativo neste caso é possível que alguma das variáveis independentes não tenham sido significativas e neste caso procura-se a variável com maior p-value e remove-se a mesma da regressão. Realiza-se nova regressão e observa-se se o modelo agora melhorou seu grau de explicação. Se melhorou então procura-se outra variável não significativa e remove-se da regressão e observa-se se o modelo melhorou seu grau de explicação (R2) e assim sucessivamente até encontra-se o modelo ideal.

Observação: Esse processo de tentativa e erro pode ser trabalhoso dependento do número de variáveis envolvidas. Nestes casos é possível usar a técnica de stepwise que automatiza esse processo de tentativa e erro e retorna o modelo otimizado para a regressão.

20170102085800

Como encontrar o número mínimo ideal de variáveis da regressão?

This dataset below shows the prices of electricity used for different purposes in different states in the united states of america.

We will use the first four variables to model I2005.

methodselection = read.csv("stepwise.csv", sep=";")

AIC is a parameter used in all three model selection procedures as a criteria to select the model.

The rule is, the lower the AIC the better the model.

1.Método Backward
2.Método Forward
3.Método Stepwise

20170102090400

Método Backward

Starts with the most general model and drops variables one at a time until the best model is reached.

linearmodel = lm(I2005 ~ R2005+R2004+C2005+C2004,data=methodselection)
step(linearmodel,direction="backward")

## Start:  AIC=17.54
## I2005 ~ R2005 + R2004 + C2005 + C2004
## 
##         Df Sum of Sq    RSS    AIC
## - R2004  1    2.6862 28.324 17.138
## <none>               25.638 17.544
## - R2005  1    4.1257 29.764 17.931
## - C2004  1    5.0474 30.685 18.419
## - C2005  1   16.8259 42.464 23.617
## 
## Step:  AIC=17.14
## I2005 ~ R2005 + C2005 + C2004
## 
##         Df Sum of Sq    RSS    AIC
## - R2005  1    1.6536 29.978 16.046
## - C2004  1    2.5963 30.920 16.541
## <none>               28.324 17.138
## - C2005  1   17.0895 45.414 22.692
## 
## Step:  AIC=16.05
## I2005 ~ C2005 + C2004
## 
##         Df Sum of Sq    RSS    AIC
## <none>               29.978 16.046
## - C2004  1    5.4914 35.469 16.737
## - C2005  1   15.4612 45.439 20.701

## 
## Call:
## lm(formula = I2005 ~ C2005 + C2004, data = methodselection)
## 
## Coefficients:
## (Intercept)        C2005        C2004  
##     -0.2861       1.9943      -1.3396

Understanding the output:

Output 1: The AIC=17.54. Observe that the model suggests that if you remove I2004 the new AIC is better thean the above. Observe that if you remove other variables from the model the AIC values are greater than the value 17.54, so you should stick to removing only R2004 and rerunning the step command.

Output 2: The AIC=17.14 which is better then the above result. In this case if you remove R2005 as suggested the AIC goest to 16.046 which is better than AIC=17.14. Observe that all the other AICs are greater than 16.046 so you should only remove R2005 as suggested and rerun the step command.

Output 3: The AIC=16.05 which is bette then the previous one. Observe that removing other variables as suggested below does note decreacse the value of AIC. In other words, This is the best model that can be reached .

20170102090401

Método Forward

Starts with the simplest model of all and adds suitable variables at a time until the best model is reached.

linearmodel = lm(I2005 ~ 1,data=methodselection)
step(linearmodel,direction="forward", scope=~R2005+R2004+C2005+C2004)

## Start:  AIC=29.62
## I2005 ~ 1
## 
##         Df Sum of Sq    RSS    AIC
## + C2005  1    54.429 35.469 16.737
## + C2004  1    44.459 45.439 20.701
## + R2005  1    39.728 50.171 22.285
## + R2004  1    38.882 51.017 22.553
## <none>               89.898 29.617
## 
## Step:  AIC=16.74
## I2005 ~ C2005
## 
##         Df Sum of Sq    RSS    AIC
## + C2004  1    5.4914 29.978 16.046
## + R2005  1    4.5486 30.920 16.541
## + R2004  1    4.2830 31.186 16.678
## <none>               35.469 16.737
## 
## Step:  AIC=16.05
## I2005 ~ C2005 + C2004
## 
##         Df Sum of Sq    RSS    AIC
## <none>               29.978 16.046
## + R2005  1   1.65356 28.324 17.138
## + R2004  1   0.21398 29.764 17.931

## 
## Call:
## lm(formula = I2005 ~ C2005 + C2004, data = methodselection)
## 
## Coefficients:
## (Intercept)        C2005        C2004  
##     -0.2861       1.9943      -1.3396

Understanding the output:

Output 1: The AIC=29.62. Observe that the model suggests that if you add C2005 the new AIC is 16.737 taht is better then the above. Observe that if you add any other variables from the model the AIC values are greater than the value 16.737, so you should stick to adding only C2005 and rerunning the step command.

Output 2: The AIC=16,74 which is better then the above result. In this case if you add C2004 as suggested the AIC goest to 16.046 which is better than AIC=16.74. Observe that all the other AICs are greater than 16.046 so you should only add C2004 as suggested and rerun the step command.

Output 3: The AIC=16.05 which is bette then the previous one. Observe that adding other variables as suggested below does note decreacse the value of AIC. In other words, This is the best model that can be reached .

20170102090402

Método Stepwise

Is a combination of backward and forward methods are dropped and added.

linearmodel = lm(I2005 ~ R2005+R2004+C2005+C2004,data=methodselection)
step(linearmodel,direction="both")

## Start:  AIC=17.54
## I2005 ~ R2005 + R2004 + C2005 + C2004
## 
##         Df Sum of Sq    RSS    AIC
## - R2004  1    2.6862 28.324 17.138
## <none>               25.638 17.544
## - R2005  1    4.1257 29.764 17.931
## - C2004  1    5.0474 30.685 18.419
## - C2005  1   16.8259 42.464 23.617
## 
## Step:  AIC=17.14
## I2005 ~ R2005 + C2005 + C2004
## 
##         Df Sum of Sq    RSS    AIC
## - R2005  1    1.6536 29.978 16.046
## - C2004  1    2.5963 30.920 16.541
## <none>               28.324 17.138
## + R2004  1    2.6862 25.638 17.544
## - C2005  1   17.0895 45.414 22.692
## 
## Step:  AIC=16.05
## I2005 ~ C2005 + C2004
## 
##         Df Sum of Sq    RSS    AIC
## <none>               29.978 16.046
## - C2004  1    5.4914 35.469 16.737
## + R2005  1    1.6536 28.324 17.138
## + R2004  1    0.2140 29.764 17.931
## - C2005  1   15.4612 45.439 20.701

## 
## Call:
## lm(formula = I2005 ~ C2005 + C2004, data = methodselection)
## 
## Coefficients:
## (Intercept)        C2005        C2004  
##     -0.2861       1.9943      -1.3396

Understanding the output:

Output 1: The AIC=17.54. Observe that the model suggests that if you remove R2004 the new AIC is 17.138 that is better then the above. Observe that if you add or remove any other variables from the model the AIC values are greater than the value 17.54, so you should stick to removing only R2004 and rerunning the step command.

Output 2: The AIC=17.14 which is better then the above result. In this case if you remove R2005 as suggested the AIC goest to 16.046 which is better than AIC=16.74. Observe that all the other AICs are greater than 16.046 so you should only add R2005 as suggested and rerun the step command.

Output 3: The AIC=16.05 which is bette then the previous one. Observe that adding or removing other variables as suggested below does note decreacse the value of AIC. In other words, This is the best model that can be reached .

Note on the output:

If you are interested only on the output of the best model you can see it by using the code below:

linearmodel = lm(I2005 ~ R2005+R2004+C2005+C2004,data=methodselection)
stepmodel = step(linearmodel,direction="both")

## Start:  AIC=17.54
## I2005 ~ R2005 + R2004 + C2005 + C2004
## 
##         Df Sum of Sq    RSS    AIC
## - R2004  1    2.6862 28.324 17.138
## <none>               25.638 17.544
## - R2005  1    4.1257 29.764 17.931
## - C2004  1    5.0474 30.685 18.419
## - C2005  1   16.8259 42.464 23.617
## 
## Step:  AIC=17.14
## I2005 ~ R2005 + C2005 + C2004
## 
##         Df Sum of Sq    RSS    AIC
## - R2005  1    1.6536 29.978 16.046
## - C2004  1    2.5963 30.920 16.541
## <none>               28.324 17.138
## + R2004  1    2.6862 25.638 17.544
## - C2005  1   17.0895 45.414 22.692
## 
## Step:  AIC=16.05
## I2005 ~ C2005 + C2004
## 
##         Df Sum of Sq    RSS    AIC
## <none>               29.978 16.046
## - C2004  1    5.4914 35.469 16.737
## + R2005  1    1.6536 28.324 17.138
## + R2004  1    0.2140 29.764 17.931
## - C2005  1   15.4612 45.439 20.701

summary(stepmodel)

## 
## Call:
## lm(formula = I2005 ~ C2005 + C2004, data = methodselection)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5168 -0.3434  0.0938  0.9787  2.6232 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -0.2861     1.8248  -0.157   0.8778  
## C2005         1.9943     0.7702   2.589   0.0225 *
## C2004        -1.3396     0.8681  -1.543   0.1468  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.519 on 13 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6152 
## F-statistic: 12.99 on 2 and 13 DF,  p-value: 0.000794

20170102092200

Regressão linear simples

A Simple regression (along with The Pearson Correlation Coefficient) answer roughly the same question:

How do two quantitative variable relate to one another?

With simple linear regression we can use one variable to predict one single quantitative variable and in doing so we expand the Pearson Correlation Coefficient by examing the slope or the impact of the independent variable (x-axis) on the outcome variable (y-axis).

Grades = c(82,98,76,68,84,99,67,58,50,78)
Absences = c(4,2,2,3,1,0,4,8,7,3)
Regression001 = lm(Grades ~ Absences)
summary(Regression001)

## 
## Call:
## lm(formula = Grades ~ Absences)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.156  -6.388  -2.546   6.264  14.454 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   94.326      4.875  19.349 5.28e-08 ***
## Absences      -5.390      1.175  -4.586  0.00179 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.828 on 8 degrees of freedom
## Multiple R-squared:  0.7244, Adjusted R-squared:  0.6899 
## F-statistic: 21.03 on 1 and 8 DF,  p-value: 0.001789

Observação importante:

Observe a coluna ‘t value’, ela é a a informação sobre o t-statistic para cada variável. Logo se houver a seguinte pergunta: Quais são os t-statistic para os preditores (variáveis independentes do modelo) a respostas são esses valores.

No nosso exemplo, os t-statistics dos preditores do modelo são:
Absences = -4.586

20180121100400

Intervalos de confiança

Using the confint() function we can get the corresponding confidence intervals for slopes for every coefficiente in our regression model.

Therefore, for the results below we can say that we are 2.5% confident that the true underlying population slope for Absences predicting Grades are between these two values. Just like a t-test when those values do not capture zero (-8.10067 | -2.679471) we have a significante difference for the test.

confint(Regression001)

##                2.5 %     97.5 %
## (Intercept) 83.08460 105.567879
## Absences    -8.10067  -2.679471

20180121100500

Betas padronizados

What are the standardized betas for our regression?

What would it be if I had a Standardized X and a Standardize Y, in other words, if I had made X a z-score first, and Y a z-score also, and then run it thru a simples linear regression.

We can also use the lmBeta() function to call standardized betas.
Standardized betas free us from any scale.
It is like using a z-score for EVERY variable INCLUDING the outcome variable.

library(SDSFoundations)
lmBeta(Regression001) # Belongs to the SDSFoundation package

##  Absences 
## -0.851114

Observe that the lmBeta() value matches the pearson correlation value (r^2=0.7244) it means the pearson correlation value is (r = 0,8511169132381285)..

Therefore, the standardized beta matches the pearson correlation coefficiente value.

So, we have only two variables in this simples regression.

So the t-values (for significance) match, the r and r^2 values match. And the Standardized beta matches the pearson correlation coefficient value.

This happens because like the pearson correlation value this standardized beta is completely free of scale.

Adicionally we have the F-Test that evaluates overall model fit.

20180112155101

Linearidade

If we want to see if some two variables are linearly related to each other, before you run any correlation analysis you first want to check if the two variables you are going to analyse don´t have some non-linear relationship.

Because correlation measures the strength and direction of the linear relationship you need to confirm that the two variables are indeed linearly related.

And the best way to check for the assumption of linearity is to just make a scatterplot of the two variables.

Unclear relationship

bull  = read.csv("BullRiders.csv")

plot(bull$YearsPro, bull$BuckOuts12)
abline(lm(bull$BuckOuts12 ~ bull$YearsPro))

Clear relationship (somewhat)

bull  = read.csv("BullRiders.csv")
plot(bull$Events12, bull$BuckOuts12)
abline(lm(bull$BuckOuts12~ bull$Events12))

20180121101600

Multicolinearity

When it comes to Multiple Regression we have to do a little background checking to make sure our model is ok, and to do so we need to check for Multicolinearity.

Só funciona em um regressão múltipla.

Multicolinearidade consiste em um problema comum em regressões, no qual as variáveis independentes possuem relações lineares exatas ou aproximadamente exatas. O índício mais claro da existência da multicolinearidade é quando o R² é bastante alto, mas nenhum dos coeficientes da regressão é estatisticamente significativo segundo a estatística t convencional. As consequências da multicolinearidade em uma regressão são a de erros-padrão elevados no caso de multicolinearidade moderada ou severa e até mesmo a impossibilidade de qualquer estimação se a multicolinearidade for perfeita.

Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors. That is, a multiple regression model with colinear predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others.

The car package helps us to see multicolinearity in a regression.

res = read.csv("TempskiResilience.csv")
clin <- res[res$Group == "Clinical Sciences",] # Subset of students in the clinical sciences programm.

RegressionMultipla <- lm(MS.QoL ~ DREEM.S.SP + DREEM.A.SP + Resilience + BDI + Age, data=clin)
summary(RegressionMultipla)

## 
## Call:
## lm(formula = MS.QoL ~ DREEM.S.SP + DREEM.A.SP + Resilience + 
##     BDI + Age, data = clin)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5697 -0.7056  0.0725  0.8409  3.9567 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.674e+00  7.650e-01   6.110 2.05e-09 ***
## DREEM.S.SP   1.394e-01  1.801e-02   7.739 5.87e-14 ***
## DREEM.A.SP   3.670e-02  1.428e-02   2.571 0.010439 *  
## Resilience   4.345e-05  6.135e-03   0.007 0.994353    
## BDI         -3.636e-02  1.067e-02  -3.409 0.000707 ***
## Age         -2.904e-02  2.352e-02  -1.235 0.217592    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.229 on 485 degrees of freedom
## Multiple R-squared:  0.3337, Adjusted R-squared:  0.3269 
## F-statistic: 48.59 on 5 and 485 DF,  p-value: < 2.2e-16

#install.packages("car")
library(car)
# function vif() stands for variance inflation factor

vif(RegressionMultipla)

## DREEM.S.SP DREEM.A.SP Resilience        BDI        Age 
##   1.823375   1.698128   1.731791   1.737641   1.018183

We can also take the reciprocal of the vif() function which is called TOLERANCE. In this case a TOLERANCE value of 0.663285.

1/vif(RegressionMultipla)

## DREEM.S.SP DREEM.A.SP Resilience        BDI        Age 
##  0.5484335  0.5888836  0.5774369  0.5754930  0.9821422

If we revesit the multiple regression predictions we have:

summary(RegressionMultipla)

## 
## Call:
## lm(formula = MS.QoL ~ DREEM.S.SP + DREEM.A.SP + Resilience + 
##     BDI + Age, data = clin)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5697 -0.7056  0.0725  0.8409  3.9567 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.674e+00  7.650e-01   6.110 2.05e-09 ***
## DREEM.S.SP   1.394e-01  1.801e-02   7.739 5.87e-14 ***
## DREEM.A.SP   3.670e-02  1.428e-02   2.571 0.010439 *  
## Resilience   4.345e-05  6.135e-03   0.007 0.994353    
## BDI         -3.636e-02  1.067e-02  -3.409 0.000707 ***
## Age         -2.904e-02  2.352e-02  -1.235 0.217592    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.229 on 485 degrees of freedom
## Multiple R-squared:  0.3337, Adjusted R-squared:  0.3269 
## F-statistic: 48.59 on 5 and 485 DF,  p-value: < 2.2e-16

The R^2 value (0.3367) is for the overall model predicting Absences.

This R^2 value tells us how much Absences is accountted for, in terms of its variance, by the other independent variables (in our case SAT_Score).

In other words, If we look at the amount of remaining variance that is still left over or unexplained in the prediction of Absences (1-0.3367=0.6633), what do we get? We get TOLERANCE (0.6633).

Just reinforcing: Observe that (1-R^2=1-0.3367=0.6633), this value is called TOLERANCE.

So TOLERANCE: is defined as that concept that says this is how much variances left over in this particular independent variable once I know all the other independence in the model. A really low amount of variance left over means that hist particular independent is highly redundant with everything else in the model.

So what do we use for a typical rules of thumb here?

We say that any variable that has as VIF > 5 or larger is kind of redundant and this corresponds to a tolerance value that would be < 0.2 or lower.

This means that I have only got about 20% of variance of this independent variable left over once I account for all of the other variables in the model.

20180112155600

Correlação (Pearson Correlation Coefficient)

Pearson Correlation Coefficient

For any Pearson Correlation Coefficient we can calculate the t-statistic which allow us to examine the significance of that relationship.

The Pearson Correlation Coefficiente and Simple Regression answer roughly the same question: How do two quantitative variable relate to one another?

The t and f distributions allow us to investigagte and tell the story of a linear model. When using a Pearson Correlation we should make sure that the correlation is in fact linear by examing the scatterplot.

The Pearson correlation Coefficcient by itself is free of context and dificult to interpret, however byu squaring it´s value we make it more meaningful, the R^2 is also know as proportion of variance (explained), or as the Coefficient of determination.

The t distribution is the test taht tells us if the pearson correlation is significant.

Keep in mind that:

A correlation coeficcient is only part of the full relational story. To fully explain the relationship between two quantitative variable we can use simple regression. Simple regressions also use an underling linear model to describe the relationship.

But unlike the Pearson Correlation Coeficcient, simple regression allows us to predict a particular outcome.

Correlação simples (Duas variáveis)
Matriz de correlação (Coeficientes de correlação)
Preditores da matriz de correlação
Proportion of Correlation Accounted for

20180112153715

Correlação simples (Duas variáveis)

bull  = read.csv("BullRiders.csv")

Fracamente Correlacionada
Este é o Pearson Correlation Coefficient desta relação

cor(bull$YearsPro, bull$BuckOuts12)

## [1] 0.4095777

Veja o gráfico de linearidade.

Altamente Correlacionada
Este é o Pearson Correlation Coefficient desta relação

cor(bull$Events12, bull$BuckOuts12)

## [1] 0.9936924

Veja o gráfico de linearidade.

20180112153716

Matriz de correlação (Coeficientes da Correlação)

Calculando a matriz de correlação para um dado dataset.

No caso da matriz de correlação teremos pares de Pearson Correlation Coefficients.

Gross x IMDB = 0.281908 Gross x Days = 0.3228279
Gross x Budget = 0.3980789

E assim seria para tantas quantas fossem as variáveis independentes que estivessem se relacionando com a variável dependente YearsPro.

library(SDSFoundations)
#help(package = SDSFoundations)
Filmes = FilmData
Filmes = Filmes[complete.cases(Filmes),] # remove NAs
vars <- c("Gross", "IMDB", "Days", "Budget")
correlacao = cor(Filmes[, vars])
correlacao

##            Gross        IMDB       Days      Budget
## Gross  1.0000000  0.28190849  0.3228279  0.39807892
## IMDB   0.2819085  1.00000000  0.2646178 -0.08484071
## Days   0.3228279  0.26461776  1.0000000 -0.14290797
## Budget 0.3980789 -0.08484071 -0.1429080  1.00000000

20180113102800

Preditores da correlação

Examine the correlation matrixes for any significant predictors.

When we run corr.test() function we get the full correlation matrix, along with the probability values for each one of those.

Below you can se how you can test each and every one of those correlation coefficients, those Pearson Correlation Coeficients, for significance and get their p-values and t-values, simply by looking at a package called ‘psych’ and it´s new function called corr.test().

#install.packages("psych")
library(psych)

## 
## Attaching package: 'psych'

## The following object is masked from 'package:car':
## 
##     logit

corr.test(correlacao)

## Call:corr.test(x = correlacao)
## Correlation matrix 
##        Gross  IMDB  Days Budget
## Gross   1.00 -0.25 -0.13   0.26
## IMDB   -0.25  1.00  0.16  -0.72
## Days   -0.13  0.16  1.00  -0.80
## Budget  0.26 -0.72 -0.80   1.00
## Sample Size 
## [1] 4
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##        Gross IMDB Days Budget
## Gross   0.00 1.00  1.0      1
## IMDB    0.75 0.00  1.0      1
## Days    0.87 0.84  0.0      1
## Budget  0.74 0.28  0.2      0
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

corr.test(correlacao)$r # The matrix of correlations

##             Gross       IMDB       Days     Budget
## Gross   1.0000000 -0.2532511 -0.1253871  0.2618710
## IMDB   -0.2532511  1.0000000  0.1614480 -0.7215625
## Days   -0.1253871  0.1614480  1.0000000 -0.7994579
## Budget  0.2618710 -0.7215625 -0.7994579  1.0000000

corr.test(correlacao)$n # Number of cases per correlation

## [1] 4

corr.test(correlacao)$p # two tailed probability of t for each correlation. For symmetric matrices, p values adjusted for multiple tests are reported above the diagonal.

##            Gross      IMDB      Days Budget
## Gross  0.0000000 1.0000000 1.0000000      1
## IMDB   0.7467489 0.0000000 1.0000000      1
## Days   0.8746129 0.8385520 0.0000000      1
## Budget 0.7381290 0.2784375 0.2005421      0

corr.test(correlacao)$t # value of t-test for each correlation

##             Gross       IMDB       Days     Budget
## Gross         Inf -0.3702201 -0.1787347  0.3837327
## IMDB   -0.3702201        Inf  0.2313570 -1.4738852
## Days   -0.1787347  0.2313570        Inf -1.8820755
## Budget  0.3837327 -1.4738852 -1.8820755        Inf

20180124091504

Proportion of Variance Accounted for

In addition we can even talk about the idea of proportion of variance accounted for uniquely by each variable.

How can we do this?

Well, Let´s remember our Venn Diagram.

When we use another function called pCorr() we can get the partial and semi-partial or park correlation coefficient squared value for every variable in the model.

And use the park correlation coefficient squared to evaluate the overall proportion of variance accounted for uniquely by each coefficent examining which one has the biggestg impact.

Em resumo:

A variável ‘Part_Corr_sq’ (semi-partial ou part correlation coefficient squared) armazena a quantidade de variação do modelo que pode ser representada por uma determinada variável.

Por exemplo, a variável DREEM.S.SP responde por 0,08227752 (8,228%) da variação do modelo de regressão linear múltipla.

res = read.csv("TempskiResilience.csv")
clin <- res[res$Group == "Clinical Sciences",] # Subset of students in the clinical sciences programm.

RegressionMultipla <- lm(MS.QoL ~ DREEM.S.SP + DREEM.A.SP + Resilience + BDI + Age, data=clin)
#summary(RegressionMultipla)

library(SDSFoundations)
pCorr(RegressionMultipla)

##             Partial_Corr Partial_Corr_sq    Part_Corr Part_Corr_sq
## DREEM.S.SP  0.3315348929    1.099154e-01  0.286840575 8.227752e-02
## DREEM.A.SP  0.1159533555    1.344518e-02  0.095290520 9.080283e-03
## Resilience  0.0003215396    1.033877e-07  0.000262459 6.888471e-08
## BDI        -0.1529677527    2.339913e-02 -0.126347985 1.596381e-02
## Age        -0.0559705907    3.132707e-03 -0.045758123 2.093806e-03

20170105183100

Passo-a-passo para a análise de uma regressão linear simples

In a 2015 study, Tempski and associates examined a measurement they called Quality of Life among medical school students in Brazilian medical schools. They borrowed measurement scales from the World Health Organization, the Dundee Ready Education Environment Scale, and the Beck Depression Inventory to assess the dependent variable in potential relation to a number of predictor variables.

Variables
* DREEM: Social Self Perception (DREEM.S.SP)
* DREEM: Academic Self Perception (DREEM.A.SP)
* Resilience (Resilience)
* BDI (BDI)
* Age (Age)
* Med School Quality of Life (MS.QoL)

Neste exemplo, vamos analisar a influência de BDI sobre QoL.

res = read.csv("TempskiResilience.csv")
clin <- res[res$Group == "Clinical Sciences",] # Subset of students in the clinical sciences programm.

Matriz de correlação básica
Modelo de Regressão
Intervalos de confiança
Betas padronizados
Linearidade
Residual vs. Fitted
Cook´s Distance

20180122104100

Modelo de Regressão

Use para analisar a influência da variável independente sobre a variável de interesse (dependente).

regressionSimple <- lm(QoL ~ BDI, data=clin)
summary(regressionSimple)

## 
## Call:
## lm(formula = QoL ~ BDI, data = clin)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6587 -0.6587  0.0687  0.7962  2.6138 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.544521   0.088521  96.526   <2e-16 ***
## BDI         -0.068137   0.007626  -8.935   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.159 on 489 degrees of freedom
## Multiple R-squared:  0.1404, Adjusted R-squared:  0.1386 
## F-statistic: 79.84 on 1 and 489 DF,  p-value: < 2.2e-16

20180122101101

Intervalos de confiança

Use para saber os intervalos de confiança para o slope e o intercept.

confint(regressionSimple)

##                   2.5 %      97.5 %
## (Intercept)  8.37059347  8.71844843
## BDI         -0.08311937 -0.05315364

20180122101102

Betas padronizados

Use para saber os valores da regressão se antes fosse feito um z-score dos valores de X e Y. Ou seja, fornece os betas padronizados do modelo.

lmBeta(regressionSimple)

##        BDI 
## -0.3746403

20180122101104

Linearidade

Use para ver a relação de linearidade entre essas duas variáveis.

plot(clin$BDI, clin$QoL)
abline(regressionSimple)

20180122101106

Matriz de correlação básica

Use para verificar qual o grau da correlação entre essas duas variáveis.

vars <- c("QoL", "BDI")
cor(clin[,vars])

##            QoL        BDI
## QoL  1.0000000 -0.3746403
## BDI -0.3746403  1.0000000

20180122101107

Residual vs. Fitted

Usado para ver se os pontos se distribuem uniformemente em relação à linha vermelha.

plot(regressionSimple, which = 1)

20180122101108

Cook´s Distance

Use para visualizar os resgistros (outliers) que podem estar influenciando o modelo.

cutoff <- 4/regressionSimple$df

plot(regressionSimple, which = 4, cook.levels = cutoff)

plot(regressionSimple, which = 4, cook.levels = cutoff, id.n = 7)

Esse bloco já existia e vai ser removido, relocado ou reaproveitado.

Regression line graph
Regression line coeficients
Predictions
Summaries

20180105183800

Regression Line graph

# parent is the predictor
# child is the outcome

#install.packages("UsingR")
library(UsingR)

## Loading required package: MASS

## Loading required package: HistData

## Loading required package: Hmisc

## Loading required package: lattice

## 
## Attaching package: 'lattice'

## The following object is masked from 'package:SDSFoundations':
## 
##     histogram

## Loading required package: survival

## Loading required package: splines

## Loading required package: Formula

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

## 
## Attaching package: 'Hmisc'

## The following object is masked from 'package:psych':
## 
##     describe

## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units

## 
## Attaching package: 'UsingR'

## The following object is masked from 'package:survival':
## 
##     cancer

## The following objects are masked from 'package:psych':
## 
##     galton, headtail

data(galton)
galton = Galton
model = lm(child ~ parent, data = galton)
plot(galton$parent, galton$child)
abline(model, lwd=2)

20180105183801

Regression Line Coeficients

#install.packages("UsingR")
library(UsingR)
data(galton)
galton = Galton
model = lm(child ~ parent, data = galton)
coef(model)

## (Intercept)      parent 
##  23.9415302   0.6462906

# We estimate an expected 0.6462906 height increase in child height for every parent increase in height. Also called SLOPE.

# The intercept 23.9415302 is the expected child height of a 0 parent height.

# You can also calculate the coeficients manually
y <- galton$child
x <- galton$parent
beta1 <- cor(x, y) * sd(x) / sd(y)
beta0 <- mean(x) - beta1 * mean(y)
beta1

## [1] 0.3256475

beta0

## [1] 46.13535

20180105183802

Predictions

Creating a linearmodel and predicting new values based on it

diamond = read.csv("diamonds.csv", header=T)
newx <- c(0.16, 0.27, 0.34)
fit <- lm(price ~ carat, data = diamond)
coef(fit)

## (Intercept)       carat 
##   -2256.361    7756.426

predict(fit , newdata = data.frame(carat = newx))

##          1          2          3 
## -1015.3325  -162.1257   380.8241

20180105183803

Summaries

#install.packages("UsingR")
library(UsingR)
data(galton)
galton = Galton
model = lm(child ~ parent, data = galton)

summary(model)

## 
## Call:
## lm(formula = child ~ parent, data = galton)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8050 -1.3661  0.0487  1.6339  5.9264 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 23.94153    2.81088   8.517   <2e-16 ***
## parent       0.64629    0.04114  15.711   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.239 on 926 degrees of freedom
## Multiple R-squared:  0.2105, Adjusted R-squared:  0.2096 
## F-statistic: 246.8 on 1 and 926 DF,  p-value: < 2.2e-16

summary(model)$sigma

## [1] 2.238547

summary(model)$coefficients

##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 23.9415302 2.81087834  8.517455 6.536845e-17
## parent       0.6462906 0.04113588 15.711115 1.732509e-49

20170102092201

Regressão linear múltipla

However regarding regression we are not limited to looking at just two quantitative variables.

With multiple regression we can examine the effects of multiple independent variables on a single quantitative outcome at the same time.

So instead of two simple regression models that considers the impact of the variables separatedly (in this Venn Diagram example two variables):

We can examine the effects of the two variable on each other and on the dependent variable at the same time:

Grades = c(82,98,76,68,84,99,67,58,50,78)
Absences = c(4,2,2,3,1,0,4,8,7,3)
SAT_Score = c(620,750,500,520,540,690,590,490,450,560)
Regression001 = lm(Grades ~ Absences+SAT_Score)
summary(Regression001)

## 
## Call:
## lm(formula = Grades ~ Absences + SAT_Score)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.791 -1.809  1.060  2.691  5.016 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 33.42231   13.58416   2.460  0.04344 * 
## Absences    -3.34018    0.77323  -4.320  0.00348 **
## SAT_Score    0.09446    0.02067   4.569  0.00258 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.729 on 7 degrees of freedom
## Multiple R-squared:  0.9308, Adjusted R-squared:  0.911 
## F-statistic: 47.07 on 2 and 7 DF,  p-value: 8.724e-05

Observação importante:

No nosso exemplo, os t-statistics dos preditores do modelo são:
Absences = -4.320
SAT_Score = 4.569

20180110090500

Gráficos de Diagnóstico

Gráficos de diagnóstico nos ajudam a acessar as ‘assumptions’ do modelo e procurar por potenciais outliers. Através do Gráfico de Residuals vs. fitted poderemos chegar algumas de nossas ‘assuptions’ e o gráfico de Cook´s distance vai nos ajudar a ver potenciais outliers.

Basicamente queremos checar nossas assumptions e checar por outliers potenciais.

We can use diagnostic plots to validate the linear model, either in a simple or multiple linear regression. We can use the Residual versus fitted and Cook´s distance methos. Both methods helps us to determine is a good model for our underlying data.

For more details see: UTAustinX: UT.7.21x Foundations of Data Analysis - Part 2 Course Week 6: Correlation and Regression Lecture Videos Regression Diagnostics

Residuals versus fitted
Cook´s distance

20180115115501

Residuals versus fitted

library(ggplot2)
diamantes = diamonds[diamonds$cut=='Good',]
diamond_lm = lm(price ~ carat, data = diamantes)
summary(diamond_lm)

## 
## Call:
## lm(formula = price ~ carat, data = diamantes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9153.2  -738.1   -69.6   575.4 11394.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2422.73      43.04  -56.28   <2e-16 ***
## carat        7479.64      44.70  167.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1421 on 4904 degrees of freedom
## Multiple R-squared:  0.851,  Adjusted R-squared:  0.8509 
## F-statistic: 2.8e+04 on 1 and 4904 DF,  p-value: < 2.2e-16

plot(diamond_lm, which = 1)

Observe that at zero (0) there is a dashed line running all the way thru the graph and we can start examining our data visually for something that is called homoscedastic or heteroscedastic. We can certainly look for a linear concept to out data. If everything is nice and random around the red line this graph shows good linearity. The data is split around the curve what is good. If the data curved along with the red line we could say we had a curvelinear data which is not the case. It is not perfect but for the most part we can call it homoscedastic.

Now observe that R picks out some weird outlier when it comes to residuals, they are the numbered dots you see in the graph, they are values really far away from the predicted line. Those numbers are actually row numbers that you can look into to understand the outlier better (those are potentially outliers that we want to get rid of).

For example:

    diamantes [row.names(diamantes ) == 27197, ]

##       carat  cut color clarity depth table price    x    y    z
## 27197  1.14 Good     D      IF  61.7    63 17499 6.64 6.74 4.13

20180115115502

Cook´s distance

A Cooks is a great measure of influence on the overall model, and there is a classic rule of thumb that says that our cutoff for a cook´s distance (cook.level) is 4 divided by the number of degrees of freedom within our model, so…

cutoff <- 4/diamond_lm$df

Now that we have this cutoff we can ask for another plot.

    plot(diamond_lm, which = 4, cook.levels = cutoff)

Observe that theses particular observations, 22832, 23540, 25851 (these numbers are the row numbers of these observations) are really these observations that really show themselves as being high outliers, in terms of high influence, in terms of a cook´s ‘d’ value.

You might also observe that there are some other points in the graph that look like outliers but that do not have their positions you can force the plot() function to number them with the following command (id.n default is 3).

plot(diamond_lm, which = 4, cook.levels = cutoff, id.n = 7)

Those high observations are really the observations that have highest influence in terms of the cook´s d value and would be great contenders for removal (because I can justify once I have a chi cooks t cut off and the id is gona give me the row numbers and we can easily remove them from the data frame).

20170103082400

Erro residual

The residuals can help us understand if the final model selected has increased too much the residuals of the regression.

If it did not, it helps to justify the final model as strong.

Let´s see if using the data above.

The complete regression reads as follows:

linearmodel = lm(I2005 ~ R2005+R2004+C2005+C2004,data=methodselection)
summary(linearmodel)

## 
## Call:
## lm(formula = I2005 ~ R2005 + R2004 + C2005 + C2004, data = methodselection)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9193 -0.2467  0.1455  0.4088  2.4758 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   -1.383      2.167  -0.638   0.5363  
## R2005         -1.645      1.236  -1.330   0.2103  
## R2004          2.020      1.881   1.074   0.3060  
## C2005          3.025      1.126   2.687   0.0211 *
## C2004         -2.632      1.789  -1.472   0.1692  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.527 on 11 degrees of freedom
## Multiple R-squared:  0.7148, Adjusted R-squared:  0.6111 
## F-statistic: 6.893 on 4 and 11 DF,  p-value: 0.004968

Let´s calculate the Root mean squared error which tells us the average error we make in our predictions. In other words the number of errors we making using the model.

SSE = sum(linearmodel$residuals^2) # Sum of Squared Errors
RMSE = sqrt(SSE/nrow(methodselection)) # Root mean squared error

# Root mean squer error
RMSE

## [1] 1.26585

Now let´s use the model suggested by the stepwise method and see the effect it has on the residual error.

linearmodel = lm(I2005 ~ C2005+C2004,data=methodselection)
summary(linearmodel)

## 
## Call:
## lm(formula = I2005 ~ C2005 + C2004, data = methodselection)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5168 -0.3434  0.0938  0.9787  2.6232 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -0.2861     1.8248  -0.157   0.8778  
## C2005         1.9943     0.7702   2.589   0.0225 *
## C2004        -1.3396     0.8681  -1.543   0.1468  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.519 on 13 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6152 
## F-statistic: 12.99 on 2 and 13 DF,  p-value: 0.000794

SSE = sum(linearmodel$residuals^2) # Sum of Squared Errors
RMSE = sqrt(SSE/nrow(methodselection)) # Root mean squared error

# Root mean squer error
RMSE

## [1] 1.368798

Let´s understand what this means.

Using the model with all variables we have a root mean square error of 1.26585.

Using the adjusted model with stepwise method we have a root mean square error of 1.368798.

It is a great deal.

1.26585-1.368798 = 0.102948

No.
Because the difference between them both is 0.102948.
But also because the mean is much higher then that, see below.

mean(methodselection$I2005)

## [1] 6.77375

6.77375/0.102948

## [1] 65.79778

If fact 65 times higher. Which is highly acceptable.

20170103083500

Making predictions

In order to better understand how to make predictions, We will be using a dataset containing data for all teams and seasons in NBA since 1980 except those who lost 82 games.

NBA = read.csv("NBA_train.csv")

First we need to take a look at the variables in the dataset to understand how to create a model of points scored.

Variables

The variables where one variable name is repeated with an A afterwards means that with an A it was attemped without it it was successfull, so…

    Season: Year of the season end  
    Team: name of the team  
    Playoffs: Whether or not a team made it to the playoffs  
    W: Number of wins during the regular season  
    PTS: Points scored during regular season  
    oopPTS: Opponent points scored during regular season  
    FG: Successful field goals including two and three pointers  
    FGA: Attempted field goals (number of unsuccessful field goas)  
    X2P: Successful two pointers  
    XSPA: Attempted two pointers  
    X3P: Successful three pointers  
    X3PA: Atempted three pointers  
    FT: Sucessful free tolls  
    FTA: Attempted free tools  
    ORB: Offensive debounds  
    DRB: Defensive debounds  
    AST: Assists  
    STL: Steals  
    BLK: Blocks   
    TOV: Turn overs

Our dependent variable is points scored (PTS) and our independent variables are:

    X2PA
    X3PA
    FTA
    AST
    ORB
    DRB
    BLK
    TOV
    STL

In order to find out the best model we will use stepwise method.

So:

PointsRegPTS = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + BLK + TOV + STL, data=NBA)

step(PointsRegPTS, direction = "both")

## Start:  AIC=8732.61
## PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + BLK + TOV + STL
## 
##        Df Sum of Sq      RSS    AIC
## - TOV   1      5634 28399948 8730.8
## - DRB   1     13687 28408001 8731.0
## - BLK   1     13878 28408192 8731.0
## <none>              28394314 8732.6
## - STL   1    161970 28556284 8735.4
## - ORB   1   5174007 33568321 8870.4
## - AST   1  13973859 42368173 9064.8
## - X3PA  1  36907474 65301787 9426.0
## - FTA   1  38486912 66881226 9446.0
## - X2PA  1  42823394 71217708 9498.4
## 
## Step:  AIC=8730.78
## PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + BLK + STL
## 
##        Df Sum of Sq      RSS    AIC
## - DRB   1     13751 28413699 8729.2
## - BLK   1     15444 28415392 8729.2
## <none>              28399948 8730.8
## + TOV   1      5634 28394314 8732.6
## - STL   1    182111 28582059 8734.1
## - ORB   1   5244408 33644356 8870.3
## - AST   1  13992244 42392192 9063.3
## - FTA   1  39799378 68199326 9460.3
## - X3PA  1  39978483 68378431 9462.5
## - X2PA  1  43003978 71403926 9498.6
## 
## Step:  AIC=8729.18
## PTS ~ X2PA + X3PA + FTA + AST + ORB + BLK + STL
## 
##        Df Sum of Sq      RSS    AIC
## - BLK   1      7765 28421465 8727.4
## <none>              28413699 8729.2
## + DRB   1     13751 28399948 8730.8
## + TOV   1      5699 28408001 8731.0
## - STL   1    254864 28668563 8734.6
## - ORB   1   5720440 34134139 8880.3
## - AST   1  14575147 42988847 9072.9
## - FTA   1  40907684 69321383 9471.9
## - X3PA  1  45944414 74358114 9530.5
## - X2PA  1  46410153 74823853 9535.7
## 
## Step:  AIC=8727.41
## PTS ~ X2PA + X3PA + FTA + AST + ORB + STL
## 
##        Df Sum of Sq      RSS    AIC
## <none>              28421465 8727.4
## + BLK   1      7765 28413699 8729.2
## + TOV   1      6842 28414623 8729.2
## + DRB   1      6073 28415392 8729.2
## - STL   1    253304 28674769 8732.8
## - ORB   1   5846710 34268175 8881.6
## - AST   1  14709040 43130504 9073.7
## - FTA   1  41048721 69470186 9471.7
## - X3PA  1  46992818 75414282 9540.2
## - X2PA  1  47288836 75710301 9543.5

## 
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data = NBA)
## 
## Coefficients:
## (Intercept)         X2PA         X3PA          FTA          AST  
##  -2032.7164       1.0500       1.2731       1.1273       0.8884  
##         ORB          STL  
##     -0.9743      -0.2268

By the stepwise method above we find that the best model is:

lm(formula = PTS = -2032.7164 + 1.05 * X2PA + 1.2731 * X3PA + 1.1273 * FTA + 0.8884 * AST - 0.9743 * ORB - 0.2268 * STL, data = NBA)

We can use this model make predictions.

But before that, let´s understand what predictions are.

Observe that we used the NBA_train.csv that contains data from years previous to 2014.

We above used this data to create a model that can help us to predict points scored in other years.

In our case we are going to predict Points Scored in 2014.

To do so we need a Test dataset.

A test dataset is a dataset equal in structure to that we used to create the model with one difference:

In that dataset the independent variables are either variables collected to help the prediction or like in this example they are the players statistics predicted using the 2014 performance of the players who will be playing in 2014.

So we need to load the test dataset:

        NBA_test = read.csv("NBA_test.csv")

The NBA_test dataset has all the same variables for the 2012-2013 Season.

However, we want to predict results using the model we just created.

The model we are using as suggested by the stepwise method is:

PointsRegPTS = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data=NBA)

Let´s make our predictions:

        PointsRegPTS = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data=NBA)
        PointsPredictions = predict(PointsRegPTS, newdata=NBA_test)

The question here is:

How can they know the values for X2PA + X3PA + FTA + AST + ORB + STL in 2012.

They can not know because the season would only be starting but once those results are achieved by players, what they CAN do is to estimate the values of those variables for the current season (2012-2013) by looking at the players that will play in 2012-2013 and measure their performances in the year before let´s say 2011.

By doing so they prepared the NBA_test dataset to feed the model with the needed data for the prediction.

What does the predict() function do?

It creates a vector with all the values predicted to the dependent variable regarding all observations.

In our case we have the real results stored in the test dataset and therefore we can compare them and also view the difference between them:

        head((data.frame(PointsPredictions ,NBA_test$PTS, abs(PointsPredictions -NBA_test$PTS))))

##   PointsPredictions NBA_test.PTS abs.PointsPredictions...NBA_test.PTS.
## 1          8086.446         8032                              54.44594
## 2          7764.143         7944                             179.85706
## 3          7965.348         7661                             304.34799
## 4          7784.034         7641                             143.03405
## 5          8004.349         7913                              91.34868
## 6          8247.427         8293                              45.57268

Observe that the difference between the actual value and the prediction is very low, thus confirming the power of the prediction.

In it´s highest point we have a 471 difference in a 8669 points scored. It is 18 times smaller.

We could therefore say it is a quite exquisite approximation.

How good is the model?

Another way to answer this question is to compute the Out of sample R squared. This is a measurment of how well the model predicts the testset data.

If we recall the model created we will see that the R squared value is as seen below (.8991) which is the measure of an in-sample r-squared, or in other words, how well the model fits the trainingset.

        summary(PointsRegPTS)

## 
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data = NBA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -523.33 -122.02    6.93  120.68  568.26 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.033e+03  1.629e+02 -12.475  < 2e-16 ***
## X2PA         1.050e+00  2.829e-02  37.117  < 2e-16 ***
## X3PA         1.273e+00  3.441e-02  37.001  < 2e-16 ***
## FTA          1.127e+00  3.260e-02  34.581  < 2e-16 ***
## AST          8.884e-01  4.292e-02  20.701  < 2e-16 ***
## ORB         -9.743e-01  7.465e-02 -13.051  < 2e-16 ***
## STL         -2.268e-01  8.350e-02  -2.717  0.00673 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.3 on 828 degrees of freedom
## Multiple R-squared:  0.8991, Adjusted R-squared:  0.8983 
## F-statistic:  1229 on 6 and 828 DF,  p-value: < 2.2e-16

        head(sort(abs(PointsRegPTS$residuals)))

##       255       279       567       151       187       700 
## 0.5412785 0.6095379 0.6743536 0.9799886 1.0435412 1.0630677

However to get a measure of the predictions goodness of fit we need to calculate the out-of-sample r-squared.

        # Sum of squared erros (SEE)
        SSE =  sum((PointsPredictions - NBA_test$PTS)^2)

        # Total sum of squares
        SST = sum((mean(NBA$PTS) - NBA_test$PTS)^2)

        # R-squared
        R2 = 1 - SSE/SST
        R2

## [1] 0.8127142

        # Root mean square error
        RMSE = sqrt(SSE/nrow(NBA_test))
        RMSE

## [1] 196.3723

Although different it is still a quite acceptable r-square value.

If you calculate the RMSE for the trainningset you will also see that the values don´t differ greatly.

20170103091300

Regressão logística

É uma técnica recomendada para situações em que a variável dependente é de natureza dicotômica ou binária. Quanto às independentes, tanto podem ser categóricas ou não.

A regressão logística é um recurso que nos permite estimar a probabilidade associada à ocorrência de determinado evento em face de um conjunto de variáveis explanatórias.

Busca estimar a probabilidade da variável dependente assumir um determinado valor em função dos conhecidos de outras variáveis; Os resultados da análise ficam contidos no intervalo de zero a um.

Em comparação com as técnicas conhecidas em regressão, em especial a regressão linear, a regressão logística distingue-se essencialmente pelo facto de a variável resposta ser categórica.

Trata-se de um modelo de regressão para variáveis dependentes ou de resposta binomialmente distribuídas.

Understanding the Logistic Regression
Model of Health Care Quality
Building the Trainning and Testing Sets
Building the Logistic Regression Model
Making predictions on the Trainning Set
Accuracy of Predictions

20170104083700

Understanding the Logistic Regression

In this example we will use the quality.csv dataset that store information about health care for various pacients.

This example is from: edX.org.-.The.Analytics.Edge Week.3.-.Logistic.Regression 1.Modelating the Expert - An Introduction to Logistic Regression.VIDEO 4 LOGISTIC REGRESSION IN R

The variables in the dataset quality.csv are as follows:

    This are the variables in the dataset   
    MemberID numbers the patients from 1 to 131, and is just an identifying number.  
    InpatientDays is the number of inpatient visits, or number of days the person spent in the hospital.  
    ERVisits is the number of times the patient visited the emergency room.  
    OfficeVisits is the number of times the patient visited any doctor's office.  
    Narcotics is the number of prescriptions the patient had for narcotics.  
    DaysSinceLastERVisit is the number of days between the patient's last emergency room visit and the end of the study period (set to the length of the study period if they never visited the ER).   
    Pain is the number of visits for which the patient complained about pain.
    TotalVisits is the total number of times the patient visited any healthcare provider.  
    ProviderCount is the number of providers that served the patient.  
    MedicalClaims is the number of days on which the patient had a medical claim.  
    ClaimLines is the total number of medical claims.  
    StartedOnCombination is whether or not the patient was started on a combination of drugs to treat their diabetes (TRUE or FALSE).  
    AcuteDrugGapSmall is the fraction of acute drugs that were refilled quickly after the prescription ran out.
    PoorCare is the outcome or dependent variable, and is equal to 1 if the patient had poor care, and equal to 0 if the patient had good care.

However we will use only three variables in our example:

Dependent variable (PoorCare):

1: Patient had poor care
0: Patiente had good care

Independent variable 1 (OfficeVisits): Number of times the patient visited any doctor’s office.
Independent variable 1 (Narcotics): Number of prescriptions the patient had for narcotics.

Logistic regression predicts the probablilty of the outcome variable beeing true. In our example, the logistic regression model will predict the probability of the patient is receiving poor care. For this example we will denote:

poor care = 1
good care = 0

Just like a linear regression we have a set of independent variables (x1 to xz). In our case PoorCare is the dependent variable. MemberID should be removed from the dataset. All the other variables are independent variables.

To predict the probability that Y=1 we use the logistic response function, that is the nonlinear transformation of linear regression equation to produce numbers between 0 e 1.

logistic funciotn = 1 / (1+ e ^ - lm())

Look at the logistic curve below:

bodysize=rnorm(20,30,2) 
bodysize=sort(bodysize) 
survive=c(0,0,0,0,0,1,0,1,0,0,1,1,0,1,1,1,0,1,1,1) 
dat=as.data.frame(cbind(bodysize,survive))
plot(bodysize,survive, type="n")
g=glm(survive~bodysize,family=binomial,dat) 
curve(predict(g,data.frame(bodysize=x),type="resp"),add=TRUE) 
points(bodysize,fitted(g),pch=20)

Let´s understand this function a little better. This graph shows the logistic reponse function for different values of the linear regression peice, the logistic response function always takes values between zero and one which makes sense since it equals a probability.

A positive coeficient value for a variable increases the linear regression piece which increases that y equals one, or in our case, increases the probability of poor care.

On the other hand, a negative coeficient value for a variable decreases the linear regression piece which decreases that y equals one, or in our case, increases the probability of good care.

Another way to think of it is in terms of probabilty.

odds= prob(y=1)/prob(y=0)

odds= e ^ lm()
log(odds) = lm()
logit = log(oods) = beta 0 + beta1 * x1 + beta2 * x2 + … + betan * Xn

It means that:

So a positive beta values increases the logit which in turn increases the odds of 1.
A negative beta values decreases the logit which in turn decreases the odds of 1 or increases the odds of 0.

20170104083701

Model of Health Care Quality

From the plot below it seems that as office visits and narcotics increase so does the quality of care increases once we have more red dots than black ones above and to the right of the plot.

But it is hard to see a trend in the data by just inspecting it. But it looks like maybe more office visits and more narcotics are more likely to have poor care (red dots)

quality = read.csv("quality.csv") # reads the dataset
qualityFilter = quality[quality$PoorCare==1,] # Stores which observatins had good care
plot(quality$OfficeVisits,quality$Narcotics, pch=20, cex=2, col="green") # plots all observations in green
points(qualityFilter$OfficeVisits,qualityFilter$Narcotics, col="red", pch=20, cex=2) # plots all observations that had good care in red 
lines(c(-2,30),c(30,-2), col="blue", lwd=3) # This line helps us to see that above a certain point there seems to be more and more patients with good care.

Let´s build a logistic regression to better predict poor care.

20170104083702

Building the Trainning and Testing Sets

First let´s see how many patients received good care (0) and how many received poor care (1).

table(quality$PoorCare)

## 
##  0  1 
## 98 33

98/ (98+33)

## [1] 0.7480916

So 98 out of 131 (98+33) patients received good care (0). In other words 75% receive good care. In a classification problem, a standard baseline method is to just predict the most frequent outcome for all observations. Since GOOD CARE (0) is more commum than POOR CARE (1) we would predict that all patients are receiving good care.

If we would do this, we would get 98 our of 131 observations correct or have an acurracy of 75%.

Therefore we can say that our baseline model has an accuracy of 75%.

That´s what we will try to beat with a logistic regression.

Splitting the Trainning set & Testing set

To make predictions we need a trainning set to build the model on and a testing set to make predictions about. To do so we will use the sample.split() function.
Randomly splitting data

Lets understand how the split takes place. Using the sample.split function we use the dataset quality and the column PoorCare (dependent variable) to create a hole new dataset with the same distribution between 1s and 0s. It is important to point that the parameter SplitRatio should be set to equal the more frequent value of the variable, in our case the more frequent value is 1 (poor care) and that is the value that will have. It represents 75% of the splitted variable.

        #install.packages("caTools")
        library(caTools)
        set.seed(88)
        split = sample.split(quality$PoorCare, SplitRatio = 0.75)
        split

##   [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE
##  [12] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [23]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
##  [34]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
##  [45] FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
##  [56]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
##  [67]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [78]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
##  [89]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## [100]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE
## [111] FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
## [122]  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE

The result of the funciton is a vector with a ratio of 75% between TRUES and FALSES. In the split vector TRUE means that we should put the result in the trainning set and FALSE means that we should put the result in the testing set

        table(split) # ARound the same as quality@PoorCare

## split
## FALSE  TRUE 
##    32    99

    99/(99+32)

## [1] 0.7557252

Create training and testing sets

    qualityTrain = subset(quality, split == TRUE)
    qualityTest = subset(quality, split == FALSE)

Analysing the trainning and testing sets we observe that although the number of rows differs from Train and Test they keep the ratio providade for the sample.split() funciton of 75%

    nrow(qualityTrain)

## [1] 99

    nrow(qualityTest)

## [1] 32

    99/(99+32)

## [1] 0.7557252

The ratio within each one of datasets creates also keep the same ratio of 75%.

    table(qualityTrain$PoorCare)

## 
##  0  1 
## 74 25

    74/(74+25)

## [1] 0.7474747

    table(qualityTest$PoorCare)

## 
##  0  1 
## 24  8

    24/(24+8)

## [1] 0.75

Therefore that is what the sample.split() function does.

    What do we want to do with this?
    By creating a trainning and a test set we can create a model with the trainning set.  
    validate making predictions using the testing set.

20170104083703

Building the Logistic Regression Model

Now we are ready to build a logistic regression model using office visits and narcotics as seen in the plot above as independent variables. To do so we will use the GLM() funciton that stands for Generalized Linear Model.

        QualityLog = glm(PoorCare ~ OfficeVisits + Narcotics, data=qualityTrain, family=binomial)
    summary(QualityLog)

## 
## Call:
## glm(formula = PoorCare ~ OfficeVisits + Narcotics, family = binomial, 
##     data = qualityTrain)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.06303  -0.63155  -0.50503  -0.09689   2.16686  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.64613    0.52357  -5.054 4.33e-07 ***
## OfficeVisits  0.08212    0.03055   2.688  0.00718 ** 
## Narcotics     0.07630    0.03205   2.381  0.01728 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 111.888  on 98  degrees of freedom
## Residual deviance:  89.127  on 96  degrees of freedom
## AIC: 95.127
## 
## Number of Fisher Scoring iterations: 4

We use family=binomial to set the logistic regression.

Observe that both betas are positive what means that higher values in this two variables are indicative of poor care (0) as we could suspect by the plot above.

Each variable has at least one * which means they are all significant in our model.

AIC is a mesure of the quality of the model. It accounts for the number of variables used compared to the number of observations, however it can only be compared between models on the same dataset. But it provides a means for model selection. The preferred model is that with the minimum AIC

20170104083704

Making predictions on the Trainning Set

Let´s now make predictions on the trainning set using the QualityLog model.

It is not clear to me why we don´t make predictions on the testing using the QualityLog model set but it may came later.

    predictTrain = predict(QualityLog, type="response")
    # The type="response" tells the predict funciton to give probabilities
    
    summary(predictTrain)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06623 0.11910 0.15970 0.25250 0.26760 0.98460

Observe that once we are expecting probabilites all those numbers should be between 1 and 0.

Let´s see if we are predicting higher probabilities for the actual poor care (1) cases as we expect. To do so we use the tapply() function.

        tapply(predictTrain, qualityTrain$PoorCare, mean)

##         0         1 
## 0.1894512 0.4392246

    0.4392246 / (0.1894512 + 0.4392246 )

## [1] 0.6986504

As we expected we are producing higher probabilities to 1, Around 69.86%.

20170104083705

Accuracy of Predictions

Often we want to make a binary prediction.

Did this patient receive good care or poor care?

We can do this using a threshold value t

If P(PoorCare = 1) > t, predict good quality
If P(PoorCare = 1) < t, predict poor quality

What value should we pick for t?

There is two types of erros that a model can make.

#1 - When you predict poor care (1) but the outcome is good care (0)  
#2 - When you precict good care (0) but the outcome is poor care (1)

Picking a large threshold t We will predict poor care rarely once the probability of poor care would have to be really large to be bigger then the threshold.

Picking a small threshold t We will predict good care rarely because the probability of good care would have to be really small to be smaller then the treshold.

If you have no preference in error type select t =.5

Confusion matrix
(classification matrix)

This compares the actual outcomes with the predicted outcomes
The rows are labeled with the actual outcomes
The columns are labled with the predicted outcomesd
There are four values in the matrix

True Negatives (TN): Actual=0,Predicted = 0
True Positives (TP): Actual=1, Predicted = 1
False Positives (FP): Actual=0, Predicted=1
Falte Negatives (FN): Actual=1, Predicted=0

We can compute two outcome measures that help us to determine what types of erros we are making:

Sensitivity: TP/(TP+FN) [Actual=1 line]
Measures the percentace of actual poor care cases that we classifeid correctly. Has to do with the Actual 1 row.

Also known as the true positive rate

Specificity: TN/(TN+FP) [Actual=0 line]
Measures the percentace of actual good care cases that we classified correctly.

Has to do with the Actual 1 row.

Also known as the true negative rate.

A model with a higher threshold will have lower sensitivity and higher specifity.

A model with a lower threshold will have higher sensitivity and lower specificity.

    # Confusion matrix (classification matrix) for threshold of 0.5
    table(qualityTrain$PoorCare, predictTrain > 0.5)

##    
##     FALSE TRUE
##   0    70    4
##   1    15   10

    # Sensitivity
    10/(15+10)

## [1] 0.4

    # Specificity
    70/(70+4)

## [1] 0.9459459

But, Which threshold should we pick?

Receiver Operator Characteristic (ROC) curve

Recall that we made predictions on our training set and called them predictTrain and we will use these predictions to create our ROC curve.

First we will call the prediction function of ROCR, and we will call the output of this function ROCRpred, and then use the prediction function which takes two arguments:

Argument 1: Predictions made with the model (predicTrain)
Argument 2: True outcomes of our data points (qualityTraing$PoorCare)

Then we need to use the performance function, this defines what we would like to plot on the x and y axis of our ROC curve. We will call the output of this ROCperf.

The performance funciton takes three arguments:

The output of the prediction function (ROCRpred)
x-axis: True positive rate (tpr)
y-axis: False positive rate (fpr)

And then we can plot the output of this performance function.

    # Install and load ROCR package
    # install.packages("ROCR")
    library(ROCR)

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

    # Prediction function
    ROCRpred = prediction(predictTrain, qualityTrain$PoorCare)

    # Performance function
    ROCRperf = performance(ROCRpred, "tpr", "fpr")

    # Plot ROC curve
    plot(ROCRperf)

It can help you decide whic h value of the threshold is best!

True positive rate (SENSITIVITY): Y-axis, proportion of poor care caught.
False positive rate (1-SPECIFICITY): x-axis, proportion of good care labeled as poor care.

ROC always start at (0,0)[upper right corner] which corresponds to a threshold = 1 which will not catch any poor care case, in other words, a sensitivity = 0. Therefore you will correctly label all good care cases, meaning that you have a false positive rate = 0.

ROC always ends at (1,1)[lower left corner] which corresponds to a thresholod = 0. In that case you will catch all of the poor care cases, with a sensitiviy = 1 but you you label all of the good care cases as poor care cases too, meaning that you have a false positive rate = 1.

Observe the ROC curve below at some points and their intepretations:

Point (0,.4): Correctly labeling 40% of the poor care cases with about 0,5% false positive rate.

Point (.6,.9): Correctly labeling 90% of the poor care cases with about 60% of false positive rate.

Point (.3,.8)? Correctly labeling 80% of poor care cases with about 30% of false poistive rate.

Selecting a Threshold using ROC

ROC curves capture all thresholds simultaneously.

High threshold - At point = (0,0), threshold = 1  
  High Specificity
  Low Sensitivity
Low threshold  -  At point = (1,1), threshold = 0  
  Low Specificity
  High Sensitivity

So, what treshold value should you pick?

You should select the best threshold for the trade off you want to make.

Let’s say you want a low false positive rate (high specificity) take the threshold that maximizes the true positive rate while keeping the false positive rate really low. In our example, a threshold around (.1,.5) seems a good choice.

On the other hand, if you are more concerned with a high sensitivity (high true positive rate) take a treshold that that minimizes the false positive rate but has a very high true poisitive rate. In our example a threshold around (.3,.8) looks like a good choice in this case.

Color coding the curve

You can label the thresholds in R by color coding the curve.

The legend is shown on the right. By picking a point (x,y) you can see which color is closest on the line. After that just see at the y-axis at the right which value if the best value for the threshold.

    # Add colors
    plot(ROCRperf, colorize=TRUE)

Adding threshold labels

    # Add threshold labels 
    plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))

20180105155000

Entenda o efeito dos fatores (levels) numa Regressão linear

Relative to type of variable:

If the dependent variable is continuous (e.g. Blood count: 3.2, 3.4,5.5) or discreet (e.g: Age: 23years, 43 years), where there is not necessarily a hierarquical value between values we have a less strong result from the regression analysis.
When the dependent variable is ordianl (e.g. 1,2,3,4 where 1<2<3<4) we have more robuts results from the regression.
When the dependent varible is dicotomic (0 or 1) we should use a Logistic Regression.
It is important to notice that if one or more of the variables is an unordered factor, we must define one level as the “reference level” and add a binary variable for each of the remaining leves. In this way, a factor with n levels is replaced by n-1 binary variables. The reference level is typically selected to be the most frequently occurring level in the dataset. However R when loads texts selects the reference level using alphabetical order. It chooses the first level alphabetically as the reference level instead of the most common. Pay attention to that.

Let´s see an example using the quine dataset in the MASS library. The quine data frame has 146 rows and 5 columns. Children from Walgett, New South Wales, Australia, were classified by Culture, Age, Sex and Learner status and the number of days absent from school in a particular school year was recorded. Thisis an unordered factor. The is no logical sequence.

For this example we will use the variable:

Lrn (learner status): factor with levels Average or Slow learner, (“AL” or “SL”).

# This is an incomplete example. You should do the reading of the pdf below as suggested.

y<-c(72,25,24,2,18,38,62,30,78,34,67,21,97,79,64,53,27,81) 
Day<-c(rep("Day 1",6),rep("Day 2",8),rep("Day 3",4)) 

dataf<-data.frame(y,Day) 

str(dataf) #Day is not ordered

## 'data.frame':    18 obs. of  2 variables:
##  $ y  : num  72 25 24 2 18 38 62 30 78 34 ...
##  $ Day: Factor w/ 3 levels "Day 1","Day 2",..: 1 1 1 1 1 1 2 2 2 2 ...

summary(lm(y~Day,data=dataf))  #Day 2 is not significantly different from Day 1, but Day 3 is.

## 
## Call:
## lm(formula = y ~ Day, data = dataf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.500 -21.333   0.125  16.750  42.167 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    29.83      10.30   2.897   0.0111 *
## DayDay 2       28.67      13.62   2.105   0.0526 .
## DayDay 3       26.42      16.28   1.623   0.1255  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.22 on 15 degrees of freedom
## Multiple R-squared:  0.2471, Adjusted R-squared:  0.1467 
## F-statistic: 2.461 on 2 and 15 DF,  p-value: 0.119

dataf$Day

##  [1] Day 1 Day 1 Day 1 Day 1 Day 1 Day 1 Day 2 Day 2 Day 2 Day 2 Day 2
## [12] Day 2 Day 2 Day 2 Day 3 Day 3 Day 3 Day 3
## Levels: Day 1 Day 2 Day 3

dataf$Day<-ordered(dataf$Day) 
dataf$Day

##  [1] Day 1 Day 1 Day 1 Day 1 Day 1 Day 1 Day 2 Day 2 Day 2 Day 2 Day 2
## [12] Day 2 Day 2 Day 2 Day 3 Day 3 Day 3 Day 3
## Levels: Day 1 < Day 2 < Day 3

# Observe that after the ordered function Levels: Day 1 < Day 2 < Day 3
# You can also see this with the command below

str(dataf) # "Day 1"<"Day 2"<"Day 3"

## 'data.frame':    18 obs. of  2 variables:
##  $ y  : num  72 25 24 2 18 38 62 30 78 34 ...
##  $ Day: Ord.factor w/ 3 levels "Day 1"<"Day 2"<..: 1 1 1 1 1 1 2 2 2 2 ...

summary(lm(y~Day,data=dataf))

## 
## Call:
## lm(formula = y ~ Day, data = dataf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.500 -21.333   0.125  16.750  42.167 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   48.194      6.187   7.789 1.19e-06 ***
## Day.L         18.679     11.512   1.623    0.125    
## Day.Q        -12.622      9.858  -1.280    0.220    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.22 on 15 degrees of freedom
## Multiple R-squared:  0.2471, Adjusted R-squared:  0.1467 
## F-statistic: 2.461 on 2 and 15 DF,  p-value: 0.119

#Significances reversed (or "Day.L" and  "Day.Q" are not sinonymous "Day 2" and "Day 3"?): Day 2 (".L") is significantly different from Day 1, but Day 3 (.Q) isn't. 


# This helps to at least see the effect of ordering a factor in the linear regression results.

f <- paste("day", 1:3) 
contrasts(ordered(f))

##                 .L         .Q
## [1,] -7.071068e-01  0.4082483
## [2,] -7.850462e-17 -0.8164966
## [3,]  7.071068e-01  0.4082483

contrasts(factor(f))

##       day 2 day 3
## day 1     0     0
## day 2     1     0
## day 3     0     1

This is a more complex aproach to unordered factors. You can however solve this problem with a more simple approach. As an example, consider the unordered factor variable “color”, with levels “red”, “green”, and “blue”. If “green” were the reference level, then we would add binary variables “colorred” and “colorblue” to a linear regression problem. All red examples would have colorred=1 and colorblue=0. All blue examples would have colorred=0 and colorblue=1. All green examples would have colorred=0 and colorblue=0.

y<-c(72,25,24,2,18,38,62,30,78,34,67,21,97,79,64,53,27,81) 
Day<-c(rep("Day 1",6),rep("Day 2",8),rep("Day 3",4)) 

dataf<-data.frame(y,Day) 

str(dataf) #Day is not ordered

## 'data.frame':    18 obs. of  2 variables:
##  $ y  : num  72 25 24 2 18 38 62 30 78 34 ...
##  $ Day: Factor w/ 3 levels "Day 1","Day 2",..: 1 1 1 1 1 1 2 2 2 2 ...

dataf$Day1 = 0 
dataf$Day2 = 0
dataf$Day3 = 0


dataf$Day1= ifelse(dataf$Day=="Day 1",yes=1,no=0)
dataf$Day2= ifelse(dataf$Day=="Day 2",yes=1,no=0)
dataf$Day3= ifelse(dataf$Day=="Day 3",yes=1,no=0)

str(dataf)

## 'data.frame':    18 obs. of  5 variables:
##  $ y   : num  72 25 24 2 18 38 62 30 78 34 ...
##  $ Day : Factor w/ 3 levels "Day 1","Day 2",..: 1 1 1 1 1 1 2 2 2 2 ...
##  $ Day1: num  1 1 1 1 1 1 0 0 0 0 ...
##  $ Day2: num  0 0 0 0 0 0 1 1 1 1 ...
##  $ Day3: num  0 0 0 0 0 0 0 0 0 0 ...

dataf

##     y   Day Day1 Day2 Day3
## 1  72 Day 1    1    0    0
## 2  25 Day 1    1    0    0
## 3  24 Day 1    1    0    0
## 4   2 Day 1    1    0    0
## 5  18 Day 1    1    0    0
## 6  38 Day 1    1    0    0
## 7  62 Day 2    0    1    0
## 8  30 Day 2    0    1    0
## 9  78 Day 2    0    1    0
## 10 34 Day 2    0    1    0
## 11 67 Day 2    0    1    0
## 12 21 Day 2    0    1    0
## 13 97 Day 2    0    1    0
## 14 79 Day 2    0    1    0
## 15 64 Day 3    0    0    1
## 16 53 Day 3    0    0    1
## 17 27 Day 3    0    0    1
## 18 81 Day 3    0    0    1

# You can now make a linear regression using those variables

summary(lm(y  ~ Day1 + Day2 + Day3, data=dataf))

## 
## Call:
## lm(formula = y ~ Day1 + Day2 + Day3, data = dataf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.500 -21.333   0.125  16.750  42.167 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    56.25      12.61   4.461 0.000458 ***
## Day1          -26.42      16.28  -1.623 0.125494    
## Day2            2.25      15.45   0.146 0.886113    
## Day3              NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.22 on 15 degrees of freedom
## Multiple R-squared:  0.2471, Adjusted R-squared:  0.1467 
## F-statistic: 2.461 on 2 and 15 DF,  p-value: 0.119

One last observation is to point out that if you do nothing, when you do a linear regression using a dataset where one or movre variables are factors, the lm() fucniton will remove the reference level from the linear regression. Observe that below Day 1 is a missing coefficiente.

y<-c(72,25,24,2,18,38,62,30,78,34,67,21,97,79,64,53,27,81) 
Day<-c(rep("Day 1",6),rep("Day 2",8),rep("Day 3",4)) 
dataf<-data.frame(y,Day) 
summary(lm(y~Day,data=dataf))

## 
## Call:
## lm(formula = y ~ Day, data = dataf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.500 -21.333   0.125  16.750  42.167 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    29.83      10.30   2.897   0.0111 *
## DayDay 2       28.67      13.62   2.105   0.0526 .
## DayDay 3       26.42      16.28   1.623   0.1255  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.22 on 15 degrees of freedom
## Multiple R-squared:  0.2471, Adjusted R-squared:  0.1467 
## F-statistic: 2.461 on 2 and 15 DF,  p-value: 0.119

However you can relevel the variable and the new reference level will be omited from the linear regression. You must find out what is best to create manually of variable to each possible value of the Day variable or let lm() remove the reference level from the computation.

Important: If any of the variable unfolded by the Day variable (Day1,day2,day3) is significante in the model it means we can not remove the variable.

y<-c(72,25,24,2,18,38,62,30,78,34,67,21,97,79,64,53,27,81) 
Day<-c(rep("Day 1",6),rep("Day 2",8),rep("Day 3",4)) 
dataf<-data.frame(y,Day) 
dataf$Day = relevel(dataf$Day, "Day 2")

summary(lm(y~Day,data=dataf))

## 
## Call:
## lm(formula = y ~ Day, data = dataf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.500 -21.333   0.125  16.750  42.167 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   58.500      8.917   6.560 9.03e-06 ***
## DayDay 1     -28.667     13.621  -2.105   0.0526 .  
## DayDay 3      -2.250     15.445  -0.146   0.8861    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.22 on 15 degrees of freedom
## Multiple R-squared:  0.2471, Adjusted R-squared:  0.1467 
## F-statistic: 2.461 on 2 and 15 DF,  p-value: 0.119

Source:http://r.789695.n4.nabble.com/Models-with-ordered-and-unordered-factors-td4072225.html

You should also read: Working with Ordinal Predictors.pdf

THE FOLLOWING IS STUFF I DO NOT KNOW

WHERE TO PLACE YET

Unused material

1. Correlation

See ?cor for more details

O comportamento conjunto de duas variáveis quantitativas pode ser observado pelo diagrama de dispersão e medido por meio do Coeficiente de Correlação de Pearson (r). Pode ser utilizado também se uma das variáveis é contínua e outro dicotômica (homem/mulher)

Se os dados são em nível ordinal ou não apresentam as pressuposições para realizar a correlação de Pearson, utiliza-se o Coeficiente de Correlação de Spearman (rho). A correlação não tem unidade (adimencional). Varia de -1 a +1, sendo representado pelo ‘r’

Correlação negativa - Rxy < 0 Correlação positiva - Rxy > 0 Correlação Rxy = 0 - Não existe correlação Correlação não linear - o ‘r’ não mede esse tipo de correlação cor(x,y) measures the strength of the linear relationship between the X and Y dasta with stronger relationships as cor(x,y) heads towares -1 or 1.

Some facts about correlation Cor(X, Y) = Cor(Y, X)
???1 ??? Cor(X, Y) ??? 1
Cor(X, Y) =1 and Cor(X,Y) = -1 only when the X or Y observations fall perfectly on a positive or negative sloped line, respectively
Cor(X,Y) measures the strength of the linear relationship between the X and Y data, with stronger relationsships as Cor(X,Y) heads towards -1 or 1.
Cor(X,Y) = 0 implies no linear relationship.

Teste de Hipótese para Correlação Linear

H0: p = 0 - Significa que a correlação é igual a 0
H1: P <> 0 - Significa que a correlação é diferente de 0

Para ver isso, faz-se o teste T.
Analisa-se o p-valor do teste e, se o valor for maior que p > 0,05, aceita H0.
Se menor, rejeita-se H0. Ou seja, aceita H1.

Observação:

Correlação não é o mesmo que causa e efeito. Duas variáveis podem estar altamente correlacionadas e, no entanto, não haver entre elas, relação de causa e efeito.

1.1 Pearson Correlation

Só faz-se a correção de Pearson se as variáveis forem lineares.

For all columns

    ```r
    cor(longley, method = "pearson")
    ```
    
    ```
    ##              GNP.deflator       GNP Unemployed Armed.Forces Population
    ## GNP.deflator    1.0000000 0.9915892  0.6206334    0.4647442  0.9791634
    ## GNP             0.9915892 1.0000000  0.6042609    0.4464368  0.9910901
    ## Unemployed      0.6206334 0.6042609  1.0000000   -0.1774206  0.6865515
    ## Armed.Forces    0.4647442 0.4464368 -0.1774206    1.0000000  0.3644163
    ## Population      0.9791634 0.9910901  0.6865515    0.3644163  1.0000000
    ## Year            0.9911492 0.9952735  0.6682566    0.4172451  0.9939528
    ## Employed        0.9708985 0.9835516  0.5024981    0.4573074  0.9603906
    ##                   Year  Employed
    ## GNP.deflator 0.9911492 0.9708985
    ## GNP          0.9952735 0.9835516
    ## Unemployed   0.6682566 0.5024981
    ## Armed.Forces 0.4172451 0.4573074
    ## Population   0.9939528 0.9603906
    ## Year         1.0000000 0.9713295
    ## Employed     0.9713295 1.0000000
    ```

For two specific columns

    ```r
    with(longley, cor(Employed,GNP,method="spearman"))
    ```
    
    ```
    ## [1] 0.9852941
    ```

1.2 Spearman Correlation

For all columns

    ```r
    cor(longley, method = "spearman")
    ```
    
    ```
    ##              GNP.deflator       GNP Unemployed Armed.Forces Population
    ## GNP.deflator    1.0000000 0.9970588  0.6647059    0.2205882  0.9970588
    ## GNP             0.9970588 1.0000000  0.6382353    0.2235294  0.9941176
    ## Unemployed      0.6647059 0.6382353  1.0000000   -0.3411765  0.6852941
    ## Armed.Forces    0.2205882 0.2235294 -0.3411765    1.0000000  0.2264706
    ## Population      0.9970588 0.9941176  0.6852941    0.2264706  1.0000000
    ## Year            0.9970588 0.9941176  0.6852941    0.2264706  1.0000000
    ## Employed        0.9823529 0.9852941  0.5647059    0.2264706  0.9764706
    ##                   Year  Employed
    ## GNP.deflator 0.9970588 0.9823529
    ## GNP          0.9941176 0.9852941
    ## Unemployed   0.6852941 0.5647059
    ## Armed.Forces 0.2264706 0.2264706
    ## Population   1.0000000 0.9764706
    ## Year         1.0000000 0.9764706
    ## Employed     0.9764706 1.0000000
    ```

For two specific columns

    ```r
    with(longley, cor(Employed,GNP,method="pearson"))
    ```
    
    ```
    ## [1] 0.9835516
    ```

20180121101900

Análise passo-a-passo: Regressão Linear Múltipla

Variables
* DREEM: Social Self Perception (DREEM.S.SP)
* DREEM: Academic Self Perception (DREEM.A.SP)
* Resilience (Resilience)
* BDI (BDI)
* Age (Age)
* Med School Quality of Life (MS.QoL)

Neste exemplo, vamos analisar a influência de DREEM.S.SP, DREEM.A.SP, Resilience, BDI e Age sobre QoL.

res = read.csv("TempskiResilience.csv")

# or
# library(SDSFoundations)
# res = TempskiResilience

clin <- res[res$Group == "Clinical Sciences",] # Subset of students in the clinical sciences programm.

Regressão Linear Múltipla
Correlation Matrix
Correlation Matrix (Predictors)
Confidence Intervals
Standardized Betas
Proportion of Variance Accounted for
Variance Inflation Factor(Tolerância)
Diagnóstico: Residual vs. Fitted
Diagnóstico: Cook´s Distance

20180124115800

Regressão Linear Múltipla

Regression <- lm(MS.QoL ~ DREEM.S.SP + DREEM.A.SP + Resilience + BDI + Age, data=clin)
summary(Regression)

## 
## Call:
## lm(formula = MS.QoL ~ DREEM.S.SP + DREEM.A.SP + Resilience + 
##     BDI + Age, data = clin)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5697 -0.7056  0.0725  0.8409  3.9567 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.674e+00  7.650e-01   6.110 2.05e-09 ***
## DREEM.S.SP   1.394e-01  1.801e-02   7.739 5.87e-14 ***
## DREEM.A.SP   3.670e-02  1.428e-02   2.571 0.010439 *  
## Resilience   4.345e-05  6.135e-03   0.007 0.994353    
## BDI         -3.636e-02  1.067e-02  -3.409 0.000707 ***
## Age         -2.904e-02  2.352e-02  -1.235 0.217592    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.229 on 485 degrees of freedom
## Multiple R-squared:  0.3337, Adjusted R-squared:  0.3269 
## F-statistic: 48.59 on 5 and 485 DF,  p-value: < 2.2e-16

20180124091501

Correlation Matrix

In fact, we can confirm the simultaneous impact of the independent variables by examining the correlation matrix.

vars <- c("MS.QoL", "DREEM.S.SP", "DREEM.A.SP", "Resilience", "BDI", "Age")
cor(clin[,vars], use="pairwise.complete.obs")

##                 MS.QoL  DREEM.S.SP  DREEM.A.SP Resilience         BDI
## MS.QoL      1.00000000  0.54744658  0.40488523  0.3175923 -0.40809301
## DREEM.S.SP  0.54744658  1.00000000  0.57218064  0.4257209 -0.52053588
## DREEM.A.SP  0.40488523  0.57218064  1.00000000  0.4819358 -0.32875516
## Resilience  0.31759228  0.42572085  0.48193583  1.0000000 -0.56656655
## BDI        -0.40809301 -0.52053588 -0.32875516 -0.5665665  1.00000000
## Age        -0.07208173 -0.04819259 -0.09025951  0.0401443 -0.02372227
##                    Age
## MS.QoL     -0.07208173
## DREEM.S.SP -0.04819259
## DREEM.A.SP -0.09025951
## Resilience  0.04014430
## BDI        -0.02372227
## Age         1.00000000

Notice that there is a relationship between Absences and SAT_Score. That relatioship is what makes the slope values found in the multiple regression to be different from those found in the simple regression.

So, how do we interprete the coefficients of a multiple regression model? Well, each coefficient is the impact of that particular variable on the outcome variable while holding constant the other variables in the model. Effectively, these coefficients represent the unique impact of the variable that they represent.

Even better, everything you can draw from a simple regression model are still available in a multiple regression model, namely:

t-test: For the impact of the slope compared to zero.
t-test significance: The significance of each of those slopes in the model.
Multiple R^2: Here it represents the overall model variance accounted for.
F-value: An F statistic that represents the overall model prediction.

Which is perfectly fine once we have multiple things in the model.

20180124093100

Predictors in the Correlation Matrix

Examine the correlation matrixes for any significant predictors.

library(psych)
corr.test(clin[,vars], use="pairwise.complete.obs")

## Call:corr.test(x = clin[, vars], use = "pairwise.complete.obs")
## Correlation matrix 
##            MS.QoL DREEM.S.SP DREEM.A.SP Resilience   BDI   Age
## MS.QoL       1.00       0.55       0.40       0.32 -0.41 -0.07
## DREEM.S.SP   0.55       1.00       0.57       0.43 -0.52 -0.05
## DREEM.A.SP   0.40       0.57       1.00       0.48 -0.33 -0.09
## Resilience   0.32       0.43       0.48       1.00 -0.57  0.04
## BDI         -0.41      -0.52      -0.33      -0.57  1.00 -0.02
## Age         -0.07      -0.05      -0.09       0.04 -0.02  1.00
## Sample Size 
## [1] 491
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##            MS.QoL DREEM.S.SP DREEM.A.SP Resilience BDI  Age
## MS.QoL       0.00       0.00       0.00       0.00 0.0 0.44
## DREEM.S.SP   0.00       0.00       0.00       0.00 0.0 0.86
## DREEM.A.SP   0.00       0.00       0.00       0.00 0.0 0.23
## Resilience   0.00       0.00       0.00       0.00 0.0 0.86
## BDI          0.00       0.00       0.00       0.00 0.0 0.86
## Age          0.11       0.29       0.05       0.37 0.6 0.00
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

corr.test(clin[,vars], use="pairwise.complete.obs")$r # The matrix of correlations

##                 MS.QoL  DREEM.S.SP  DREEM.A.SP Resilience         BDI
## MS.QoL      1.00000000  0.54744658  0.40488523  0.3175923 -0.40809301
## DREEM.S.SP  0.54744658  1.00000000  0.57218064  0.4257209 -0.52053588
## DREEM.A.SP  0.40488523  0.57218064  1.00000000  0.4819358 -0.32875516
## Resilience  0.31759228  0.42572085  0.48193583  1.0000000 -0.56656655
## BDI        -0.40809301 -0.52053588 -0.32875516 -0.5665665  1.00000000
## Age        -0.07208173 -0.04819259 -0.09025951  0.0401443 -0.02372227
##                    Age
## MS.QoL     -0.07208173
## DREEM.S.SP -0.04819259
## DREEM.A.SP -0.09025951
## Resilience  0.04014430
## BDI        -0.02372227
## Age         1.00000000

corr.test(clin[,vars], use="pairwise.complete.obs")$n # Number of cases per correlation

## [1] 491

corr.test(clin[,vars], use="pairwise.complete.obs")$p # two tailed probability of t for each correlation. For symmetric matrices, p values adjusted for multiple tests are reported above the diagonal.

##                 MS.QoL DREEM.S.SP   DREEM.A.SP   Resilience          BDI
## MS.QoL     0.00000e+00  0.0000000 0.000000e+00 3.434586e-12 0.000000e+00
## DREEM.S.SP 0.00000e+00  0.0000000 0.000000e+00 0.000000e+00 0.000000e+00
## DREEM.A.SP 0.00000e+00  0.0000000 0.000000e+00 0.000000e+00 5.409007e-13
## Resilience 5.72431e-13  0.0000000 0.000000e+00 0.000000e+00 0.000000e+00
## BDI        0.00000e+00  0.0000000 7.727152e-14 0.000000e+00 0.000000e+00
## Age        1.10661e-01  0.2865258 4.560823e-02 3.747404e-01 6.000112e-01
##                  Age
## MS.QoL     0.4426441
## DREEM.S.SP 0.8595773
## DREEM.A.SP 0.2280411
## Resilience 0.8595773
## BDI        0.8595773
## Age        0.0000000

corr.test(clin[,vars], use="pairwise.complete.obs")$t # value of t-test for each correlation

##               MS.QoL DREEM.S.SP DREEM.A.SP Resilience         BDI
## MS.QoL           Inf  14.466165   9.791869   7.406480  -9.8848750
## DREEM.S.SP 14.466165        Inf  15.427876  10.404006 -13.4812048
## DREEM.A.SP  9.791869  15.427876        Inf  12.162901  -7.6977543
## Resilience  7.406480  10.404006  12.162901        Inf -15.2044151
## BDI        -9.884875 -13.481205  -7.697754 -15.204415         Inf
## Age        -1.598125  -1.066939  -2.004120   0.888441  -0.5247263
##                   Age
## MS.QoL     -1.5981253
## DREEM.S.SP -1.0669392
## DREEM.A.SP -2.0041198
## Resilience  0.8884410
## BDI        -0.5247263
## Age               Inf

20180124091502

Confidence Intervals

And we can use the confint() funciton in R to again find the confidence intervals of each slope in the regression model.

confint(Regression)

##                    2.5 %      97.5 %
## (Intercept)  3.170622049  6.17671584
## DREEM.S.SP   0.103999969  0.17477991
## DREEM.A.SP   0.008652434  0.06475334
## Resilience  -0.012011893  0.01209879
## BDI         -0.057318060 -0.01540232
## Age         -0.075260330  0.01717907

20180124091503

Standardized Betas

We can even use the lmbeta() function to get at the concept of the standardized beta for every coefficient in the model, effectively freeing up everyone from their scale.

lmBeta(Regression)

##    DREEM.S.SP    DREEM.A.SP    Resilience           BDI           Age 
##  0.3873277406  0.1241752275  0.0003453895 -0.1665514046 -0.0461722487

20180124115500

Variance Inflation Factor(Tolerância)

Tolerance defines how much variance left over in this particular independent variable once I know all the other independence in the model.

library(car)
var1 = vif(Regression)
tolerancia = 1/var1
tolerancia

## DREEM.S.SP DREEM.A.SP Resilience        BDI        Age 
##  0.5484335  0.5888836  0.5774369  0.5754930  0.9821422

20180124120501

Diagnóstico: Residual vs. Fitted

plot(Regression, which = 1)

20180124120502

Diagnóstico: Cook´s Distance

cutoff <- 4/(Regression$df) 
plot(Regression, which=4, cook.levels=cutoff)

plot(Regression, which = 4, cook.levels = cutoff, id.n = 7)

20180125094900

Proportion of Variance Accounted for

#install.packages('SDSFoundations')
library(SDSFoundations)
pCorr(Regression)

##             Partial_Corr Partial_Corr_sq    Part_Corr Part_Corr_sq
## DREEM.S.SP  0.3315348929    1.099154e-01  0.286840575 8.227752e-02
## DREEM.A.SP  0.1159533555    1.344518e-02  0.095290520 9.080283e-03
## Resilience  0.0003215396    1.033877e-07  0.000262459 6.888471e-08
## BDI        -0.1529677527    2.339913e-02 -0.126347985 1.596381e-02
## Age        -0.0559705907    3.132707e-03 -0.045758123 2.093806e-03