Se você me disser sua altura, posso ter uma ideia do seu peso.
Se você me disser qual sua idade, posso lhe indicar uma música que provavelmente irá gostar.
Se me disser sua escolaridade, posso ter uma estimativa de quanto você ganha por mês.
E se me disser qual foi o último filme que assistiu, posso lhe indicar um que com certeza irá gostar.
Isso é apenas uma amostra do que é regressão linear. Regressão linear nada mais é do que estudar em números a relação entre variáveis. Isso são algoritmos que conseguem estimar o que você pode vir a gostar baseado no que você já gostou ou visitou. Essa é a ideia central de uma regressão linear: estudar o que temos para poder entender como reagir no futuro.
O que é Regressão Linear?
Regressão linear é estudar uma variável de interesse (variável dependente) em função de outras variáveis que irão me ajudar a entendê-la, as chamadas covariáveis ou variáveis independentes.
Essa relação pode ser estudada matematicamente, ou seja, podemos quantificar em números quanto uma variável está relacionada a outra e como ela impacta em seu comportamento.
No caso de duas variáveis, quando tenho dados de um par do tipo (x,y) e coloco em um gráfico é possível observar aproximadamente como elas se comportam. A regressão nos diz qual a melhor curva a ser traçada que se ajusta melhor nesses dados, podendo ser uma reta ou uma curva qualquer.
Só de olhar para o gráfico você já pode ter um palpite se é uma reta, ou um polinômio de segundo grau, por exemplo. O importante é que o método lhe diz qual a melhor curva de forma que a distância entre os pontos e a curva ajustada seja a menor possível. Ou seja, minimizo o erro e consigo estimar com maior precisão.
A lição aqui é verificar que apesar do nome ser Regressão Linear, temos mais opções além de traçar a melhor reta, podendo por exemplo ser uma curva.
Regressão é o mesmo que correlação?
Há uma relação entre elas, mas não se confunda, não são a mesma coisa! A correlação é um número de -1 a 1 que nos diz o quanto duas variáveis estão relacionadas, sendo que quanto mais próximo de 1 indica que quanto mais eu aumento uma, mais a outra também irá aumentar e quanto mais próximo de -1 indica que quanto mais eu aumento uma, mais eu diminuo a outra. Mas cuidado! Alta correlação não implica causalidade!
A relação entre correlação e regressão linear se pelo fato de que eu consigo construir uma boa regressão se minha variável de interesse e as variáveis independentes tiverem uma relação.
Como você conseguiria estimar o preço do café baseado na nota média do ENEM?
Ou como estimar o salário de alguém baseado em quanto tempo ele(a) consegue comer um cachorro quente?
Não consegue pois são variáveis sem relação alguma!
O que preciso para fazer uma regressão?
Dados! Precisamos de dados de todas as variáveis envolvidas no processo. Apesar de termos mostrado apenas os exemplos de regressão linear simples, ou seja, o estudo da relação de apenas duas variáveis, podemos estudar a variável de interesse em função de várias variáveis independentes.
A ideia é que quanto mais informação eu tenho acerca de um assunto, melhor eu posso falar sobre ele, certo? O mesmo funciona para a regressão.
Entendendo o output de uma regressão
Grades = c(82,98,76,68,84,99,67,58,50,78) # Dependent variable
Absences = c(4,2,2,3,1,0,4,8,7,3) # Independent variable
SAT_Score = c(620,750,500,520,540,690,590,490,450,560) # Independent variable
Regression001 = lm(Grades ~ Absences+SAT_Score)
summary(Regression001)
##
## Call:
## lm(formula = Grades ~ Absences + SAT_Score)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.791 -1.809 1.060 2.691 5.016
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.42231 13.58416 2.460 0.04344 *
## Absences -3.34018 0.77323 -4.320 0.00348 **
## SAT_Score 0.09446 0.02067 4.569 0.00258 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.729 on 7 degrees of freedom
## Multiple R-squared: 0.9308, Adjusted R-squared: 0.911
## F-statistic: 47.07 on 2 and 7 DF, p-value: 8.724e-05
F-statistic:
Teste conjunto de significância dos parâmetros do modelo (também conhecido como Teste de significancia do modelo com p-value).
Se o p-value de F não for significante a análise já deve ser descartada.
Uma opção para não descartar é remover observações.
Geralmente não funciona mas fica o lembrete.
Ajustamento (R2):
Se a regressão é simples usa-se o Multiple R-squared.
Se a regressão é múltipla usa-se o Adjusted R-squared.
O R2 explica a variação da variávbel resposta.
De forma mais simples, dá o percentual de quanto a variável explicativa (independente) consegue explicar a resposta (dependente).
Coefficients:
A seção dos Coefficients mostra os valores das variáveis independentes e deve-se observar se os mesmos são significativos.
Na regressão linear simples não faz muito sentido a variável independente não ser significativa pois isso inviabiliza a regressão e se o p-value for não significativo isso significa que provavelmente o p-value de F já foi não significativo também.
Na regressão linear múltipla se o p-value do F for significativo neste caso é possível que alguma das variáveis independentes não tenham sido significativas e neste caso procura-se a variável com maior p-value e remove-se a mesma da regressão. Realiza-se nova regressão e observa-se se o modelo agora melhorou seu grau de explicação. Se melhorou então procura-se outra variável não significativa e remove-se da regressão e observa-se se o modelo melhorou seu grau de explicação (R2) e assim sucessivamente até encontra-se o modelo ideal.
Observação: Esse processo de tentativa e erro pode ser trabalhoso dependento do número de variáveis envolvidas. Nestes casos é possível usar a técnica de stepwise que automatiza esse processo de tentativa e erro e retorna o modelo otimizado para a regressão.
Como encontrar o número mínimo ideal de variáveis da regressão?
This dataset below shows the prices of electricity used for different purposes in different states in the united states of america.
We will use the first four variables to model I2005.
methodselection = read.csv("stepwise.csv", sep=";")
AIC is a parameter used in all three model selection procedures as a criteria to select the model.
The rule is, the lower the AIC the better the model.
Método Backward
Starts with the most general model and drops variables one at a time until the best model is reached.
linearmodel = lm(I2005 ~ R2005+R2004+C2005+C2004,data=methodselection)
step(linearmodel,direction="backward")
## Start: AIC=17.54
## I2005 ~ R2005 + R2004 + C2005 + C2004
##
## Df Sum of Sq RSS AIC
## - R2004 1 2.6862 28.324 17.138
## <none> 25.638 17.544
## - R2005 1 4.1257 29.764 17.931
## - C2004 1 5.0474 30.685 18.419
## - C2005 1 16.8259 42.464 23.617
##
## Step: AIC=17.14
## I2005 ~ R2005 + C2005 + C2004
##
## Df Sum of Sq RSS AIC
## - R2005 1 1.6536 29.978 16.046
## - C2004 1 2.5963 30.920 16.541
## <none> 28.324 17.138
## - C2005 1 17.0895 45.414 22.692
##
## Step: AIC=16.05
## I2005 ~ C2005 + C2004
##
## Df Sum of Sq RSS AIC
## <none> 29.978 16.046
## - C2004 1 5.4914 35.469 16.737
## - C2005 1 15.4612 45.439 20.701
##
## Call:
## lm(formula = I2005 ~ C2005 + C2004, data = methodselection)
##
## Coefficients:
## (Intercept) C2005 C2004
## -0.2861 1.9943 -1.3396
Understanding the output:
Output 1: The AIC=17.54. Observe that the model suggests that if you remove I2004 the new AIC is better thean the above. Observe that if you remove other variables from the model the AIC values are greater than the value 17.54, so you should stick to removing only R2004 and rerunning the step command.
Output 2: The AIC=17.14 which is better then the above result. In this case if you remove R2005 as suggested the AIC goest to 16.046 which is better than AIC=17.14. Observe that all the other AICs are greater than 16.046 so you should only remove R2005 as suggested and rerun the step command.
Output 3: The AIC=16.05 which is bette then the previous one. Observe that removing other variables as suggested below does note decreacse the value of AIC. In other words, This is the best model that can be reached .
Método Forward
Starts with the simplest model of all and adds suitable variables at a time until the best model is reached.
linearmodel = lm(I2005 ~ 1,data=methodselection)
step(linearmodel,direction="forward", scope=~R2005+R2004+C2005+C2004)
## Start: AIC=29.62
## I2005 ~ 1
##
## Df Sum of Sq RSS AIC
## + C2005 1 54.429 35.469 16.737
## + C2004 1 44.459 45.439 20.701
## + R2005 1 39.728 50.171 22.285
## + R2004 1 38.882 51.017 22.553
## <none> 89.898 29.617
##
## Step: AIC=16.74
## I2005 ~ C2005
##
## Df Sum of Sq RSS AIC
## + C2004 1 5.4914 29.978 16.046
## + R2005 1 4.5486 30.920 16.541
## + R2004 1 4.2830 31.186 16.678
## <none> 35.469 16.737
##
## Step: AIC=16.05
## I2005 ~ C2005 + C2004
##
## Df Sum of Sq RSS AIC
## <none> 29.978 16.046
## + R2005 1 1.65356 28.324 17.138
## + R2004 1 0.21398 29.764 17.931
##
## Call:
## lm(formula = I2005 ~ C2005 + C2004, data = methodselection)
##
## Coefficients:
## (Intercept) C2005 C2004
## -0.2861 1.9943 -1.3396
Understanding the output:
Output 1: The AIC=29.62. Observe that the model suggests that if you add C2005 the new AIC is 16.737 taht is better then the above. Observe that if you add any other variables from the model the AIC values are greater than the value 16.737, so you should stick to adding only C2005 and rerunning the step command.
Output 2: The AIC=16,74 which is better then the above result. In this case if you add C2004 as suggested the AIC goest to 16.046 which is better than AIC=16.74. Observe that all the other AICs are greater than 16.046 so you should only add C2004 as suggested and rerun the step command.
Output 3: The AIC=16.05 which is bette then the previous one. Observe that adding other variables as suggested below does note decreacse the value of AIC. In other words, This is the best model that can be reached .
Método Stepwise
Is a combination of backward and forward methods are dropped and added.
linearmodel = lm(I2005 ~ R2005+R2004+C2005+C2004,data=methodselection)
step(linearmodel,direction="both")
## Start: AIC=17.54
## I2005 ~ R2005 + R2004 + C2005 + C2004
##
## Df Sum of Sq RSS AIC
## - R2004 1 2.6862 28.324 17.138
## <none> 25.638 17.544
## - R2005 1 4.1257 29.764 17.931
## - C2004 1 5.0474 30.685 18.419
## - C2005 1 16.8259 42.464 23.617
##
## Step: AIC=17.14
## I2005 ~ R2005 + C2005 + C2004
##
## Df Sum of Sq RSS AIC
## - R2005 1 1.6536 29.978 16.046
## - C2004 1 2.5963 30.920 16.541
## <none> 28.324 17.138
## + R2004 1 2.6862 25.638 17.544
## - C2005 1 17.0895 45.414 22.692
##
## Step: AIC=16.05
## I2005 ~ C2005 + C2004
##
## Df Sum of Sq RSS AIC
## <none> 29.978 16.046
## - C2004 1 5.4914 35.469 16.737
## + R2005 1 1.6536 28.324 17.138
## + R2004 1 0.2140 29.764 17.931
## - C2005 1 15.4612 45.439 20.701
##
## Call:
## lm(formula = I2005 ~ C2005 + C2004, data = methodselection)
##
## Coefficients:
## (Intercept) C2005 C2004
## -0.2861 1.9943 -1.3396
Understanding the output:
Output 1: The AIC=17.54. Observe that the model suggests that if you remove R2004 the new AIC is 17.138 that is better then the above. Observe that if you add or remove any other variables from the model the AIC values are greater than the value 17.54, so you should stick to removing only R2004 and rerunning the step command.
Output 2: The AIC=17.14 which is better then the above result. In this case if you remove R2005 as suggested the AIC goest to 16.046 which is better than AIC=16.74. Observe that all the other AICs are greater than 16.046 so you should only add R2005 as suggested and rerun the step command.
Output 3: The AIC=16.05 which is bette then the previous one. Observe that adding or removing other variables as suggested below does note decreacse the value of AIC. In other words, This is the best model that can be reached .
Note on the output:
If you are interested only on the output of the best model you can see it by using the code below:
linearmodel = lm(I2005 ~ R2005+R2004+C2005+C2004,data=methodselection)
stepmodel = step(linearmodel,direction="both")
## Start: AIC=17.54
## I2005 ~ R2005 + R2004 + C2005 + C2004
##
## Df Sum of Sq RSS AIC
## - R2004 1 2.6862 28.324 17.138
## <none> 25.638 17.544
## - R2005 1 4.1257 29.764 17.931
## - C2004 1 5.0474 30.685 18.419
## - C2005 1 16.8259 42.464 23.617
##
## Step: AIC=17.14
## I2005 ~ R2005 + C2005 + C2004
##
## Df Sum of Sq RSS AIC
## - R2005 1 1.6536 29.978 16.046
## - C2004 1 2.5963 30.920 16.541
## <none> 28.324 17.138
## + R2004 1 2.6862 25.638 17.544
## - C2005 1 17.0895 45.414 22.692
##
## Step: AIC=16.05
## I2005 ~ C2005 + C2004
##
## Df Sum of Sq RSS AIC
## <none> 29.978 16.046
## - C2004 1 5.4914 35.469 16.737
## + R2005 1 1.6536 28.324 17.138
## + R2004 1 0.2140 29.764 17.931
## - C2005 1 15.4612 45.439 20.701
summary(stepmodel)
##
## Call:
## lm(formula = I2005 ~ C2005 + C2004, data = methodselection)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5168 -0.3434 0.0938 0.9787 2.6232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2861 1.8248 -0.157 0.8778
## C2005 1.9943 0.7702 2.589 0.0225 *
## C2004 -1.3396 0.8681 -1.543 0.1468
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.519 on 13 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6152
## F-statistic: 12.99 on 2 and 13 DF, p-value: 0.000794
Regressão linear simples
A Simple regression (along with The Pearson Correlation Coefficient) answer roughly the same question:
How do two quantitative variable relate to one another?
With simple linear regression we can use one variable to predict one single quantitative variable and in doing so we expand the Pearson Correlation Coefficient by examing the slope or the impact of the independent variable (x-axis) on the outcome variable (y-axis).
Grades = c(82,98,76,68,84,99,67,58,50,78)
Absences = c(4,2,2,3,1,0,4,8,7,3)
Regression001 = lm(Grades ~ Absences)
summary(Regression001)
##
## Call:
## lm(formula = Grades ~ Absences)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.156 -6.388 -2.546 6.264 14.454
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.326 4.875 19.349 5.28e-08 ***
## Absences -5.390 1.175 -4.586 0.00179 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.828 on 8 degrees of freedom
## Multiple R-squared: 0.7244, Adjusted R-squared: 0.6899
## F-statistic: 21.03 on 1 and 8 DF, p-value: 0.001789
Observação importante:
Observe a coluna ‘t value’, ela é a a informação sobre o t-statistic para cada variável. Logo se houver a seguinte pergunta: Quais são os t-statistic para os preditores (variáveis independentes do modelo) a respostas são esses valores.
No nosso exemplo, os t-statistics dos preditores do modelo são:
Absences = -4.586
Intervalos de confiança
Using the confint() function we can get the corresponding confidence intervals for slopes for every coefficiente in our regression model.
Therefore, for the results below we can say that we are 2.5% confident that the true underlying population slope for Absences predicting Grades are between these two values. Just like a t-test when those values do not capture zero (-8.10067 | -2.679471) we have a significante difference for the test.
confint(Regression001)
## 2.5 % 97.5 %
## (Intercept) 83.08460 105.567879
## Absences -8.10067 -2.679471
Betas padronizados
What are the standardized betas for our regression?
What would it be if I had a Standardized X and a Standardize Y, in other words, if I had made X a z-score first, and Y a z-score also, and then run it thru a simples linear regression.
We can also use the lmBeta() function to call standardized betas.
Standardized betas free us from any scale.
It is like using a z-score for EVERY variable INCLUDING the outcome variable.
library(SDSFoundations)
lmBeta(Regression001) # Belongs to the SDSFoundation package
## Absences
## -0.851114
Observe that the lmBeta() value matches the pearson correlation value (r^2=0.7244) it means the pearson correlation value is (r = 0,8511169132381285)..
Therefore, the standardized beta matches the pearson correlation coefficiente value.
So, we have only two variables in this simples regression.
So the t-values (for significance) match, the r and r^2 values match. And the Standardized beta matches the pearson correlation coefficient value.
This happens because like the pearson correlation value this standardized beta is completely free of scale.
Adicionally we have the F-Test that evaluates overall model fit.
Linearidade
If we want to see if some two variables are linearly related to each other, before you run any correlation analysis you first want to check if the two variables you are going to analyse don´t have some non-linear relationship.
Because correlation measures the strength and direction of the linear relationship you need to confirm that the two variables are indeed linearly related.
And the best way to check for the assumption of linearity is to just make a scatterplot of the two variables.
Unclear relationship
bull = read.csv("BullRiders.csv")
plot(bull$YearsPro, bull$BuckOuts12)
abline(lm(bull$BuckOuts12 ~ bull$YearsPro))
Clear relationship (somewhat)
bull = read.csv("BullRiders.csv")
plot(bull$Events12, bull$BuckOuts12)
abline(lm(bull$BuckOuts12~ bull$Events12))
Multicolinearity
When it comes to Multiple Regression we have to do a little background checking to make sure our model is ok, and to do so we need to check for Multicolinearity.
Só funciona em um regressão múltipla.
Multicolinearidade consiste em um problema comum em regressões, no qual as variáveis independentes possuem relações lineares exatas ou aproximadamente exatas. O índício mais claro da existência da multicolinearidade é quando o R² é bastante alto, mas nenhum dos coeficientes da regressão é estatisticamente significativo segundo a estatística t convencional. As consequências da multicolinearidade em uma regressão são a de erros-padrão elevados no caso de multicolinearidade moderada ou severa e até mesmo a impossibilidade de qualquer estimação se a multicolinearidade for perfeita.
Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors. That is, a multiple regression model with colinear predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others.
The car package helps us to see multicolinearity in a regression.
res = read.csv("TempskiResilience.csv")
clin <- res[res$Group == "Clinical Sciences",] # Subset of students in the clinical sciences programm.
RegressionMultipla <- lm(MS.QoL ~ DREEM.S.SP + DREEM.A.SP + Resilience + BDI + Age, data=clin)
summary(RegressionMultipla)
##
## Call:
## lm(formula = MS.QoL ~ DREEM.S.SP + DREEM.A.SP + Resilience +
## BDI + Age, data = clin)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5697 -0.7056 0.0725 0.8409 3.9567
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.674e+00 7.650e-01 6.110 2.05e-09 ***
## DREEM.S.SP 1.394e-01 1.801e-02 7.739 5.87e-14 ***
## DREEM.A.SP 3.670e-02 1.428e-02 2.571 0.010439 *
## Resilience 4.345e-05 6.135e-03 0.007 0.994353
## BDI -3.636e-02 1.067e-02 -3.409 0.000707 ***
## Age -2.904e-02 2.352e-02 -1.235 0.217592
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.229 on 485 degrees of freedom
## Multiple R-squared: 0.3337, Adjusted R-squared: 0.3269
## F-statistic: 48.59 on 5 and 485 DF, p-value: < 2.2e-16
#install.packages("car")
library(car)
# function vif() stands for variance inflation factor
vif(RegressionMultipla)
## DREEM.S.SP DREEM.A.SP Resilience BDI Age
## 1.823375 1.698128 1.731791 1.737641 1.018183
We can also take the reciprocal of the vif() function which is called TOLERANCE. In this case a TOLERANCE value of 0.663285.
1/vif(RegressionMultipla)
## DREEM.S.SP DREEM.A.SP Resilience BDI Age
## 0.5484335 0.5888836 0.5774369 0.5754930 0.9821422
If we revesit the multiple regression predictions we have:
summary(RegressionMultipla)
##
## Call:
## lm(formula = MS.QoL ~ DREEM.S.SP + DREEM.A.SP + Resilience +
## BDI + Age, data = clin)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5697 -0.7056 0.0725 0.8409 3.9567
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.674e+00 7.650e-01 6.110 2.05e-09 ***
## DREEM.S.SP 1.394e-01 1.801e-02 7.739 5.87e-14 ***
## DREEM.A.SP 3.670e-02 1.428e-02 2.571 0.010439 *
## Resilience 4.345e-05 6.135e-03 0.007 0.994353
## BDI -3.636e-02 1.067e-02 -3.409 0.000707 ***
## Age -2.904e-02 2.352e-02 -1.235 0.217592
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.229 on 485 degrees of freedom
## Multiple R-squared: 0.3337, Adjusted R-squared: 0.3269
## F-statistic: 48.59 on 5 and 485 DF, p-value: < 2.2e-16
The R^2 value (0.3367) is for the overall model predicting Absences.
This R^2 value tells us how much Absences is accountted for, in terms of its variance, by the other independent variables (in our case SAT_Score).
In other words, If we look at the amount of remaining variance that is still left over or unexplained in the prediction of Absences (1-0.3367=0.6633), what do we get? We get TOLERANCE (0.6633).
Just reinforcing: Observe that (1-R^2=1-0.3367=0.6633), this value is called TOLERANCE.
So TOLERANCE: is defined as that concept that says this is how much variances left over in this particular independent variable once I know all the other independence in the model. A really low amount of variance left over means that hist particular independent is highly redundant with everything else in the model.
So what do we use for a typical rules of thumb here?
We say that any variable that has as VIF > 5 or larger is kind of redundant and this corresponds to a tolerance value that would be < 0.2 or lower.
This means that I have only got about 20% of variance of this independent variable left over once I account for all of the other variables in the model.
Correlação (Pearson Correlation Coefficient)
Pearson Correlation Coefficient
For any Pearson Correlation Coefficient we can calculate the t-statistic which allow us to examine the significance of that relationship.
The Pearson Correlation Coefficiente and Simple Regression answer roughly the same question: How do two quantitative variable relate to one another?
The t and f distributions allow us to investigagte and tell the story of a linear model. When using a Pearson Correlation we should make sure that the correlation is in fact linear by examing the scatterplot.
The Pearson correlation Coefficcient by itself is free of context and dificult to interpret, however byu squaring it´s value we make it more meaningful, the R^2 is also know as proportion of variance (explained), or as the Coefficient of determination.
The t distribution is the test taht tells us if the pearson correlation is significant.
Keep in mind that:
A correlation coeficcient is only part of the full relational story. To fully explain the relationship between two quantitative variable we can use simple regression. Simple regressions also use an underling linear model to describe the relationship.
But unlike the Pearson Correlation Coeficcient, simple regression allows us to predict a particular outcome.
Correlação simples (Duas variáveis)
Matriz de correlação (Coeficientes de correlação)
Preditores da matriz de correlação
Proportion of Correlation Accounted for
Correlação simples (Duas variáveis)
bull = read.csv("BullRiders.csv")
Fracamente Correlacionada
Este é o Pearson Correlation Coefficient desta relação
cor(bull$YearsPro, bull$BuckOuts12)
## [1] 0.4095777
Veja o gráfico de linearidade.
Altamente Correlacionada
Este é o Pearson Correlation Coefficient desta relação
cor(bull$Events12, bull$BuckOuts12)
## [1] 0.9936924
Veja o gráfico de linearidade.
Matriz de correlação (Coeficientes da Correlação)
Calculando a matriz de correlação para um dado dataset.
No caso da matriz de correlação teremos pares de Pearson Correlation Coefficients.
Gross x IMDB = 0.281908 Gross x Days = 0.3228279
Gross x Budget = 0.3980789
E assim seria para tantas quantas fossem as variáveis independentes que estivessem se relacionando com a variável dependente YearsPro.
library(SDSFoundations)
#help(package = SDSFoundations)
Filmes = FilmData
Filmes = Filmes[complete.cases(Filmes),] # remove NAs
vars <- c("Gross", "IMDB", "Days", "Budget")
correlacao = cor(Filmes[, vars])
correlacao
## Gross IMDB Days Budget
## Gross 1.0000000 0.28190849 0.3228279 0.39807892
## IMDB 0.2819085 1.00000000 0.2646178 -0.08484071
## Days 0.3228279 0.26461776 1.0000000 -0.14290797
## Budget 0.3980789 -0.08484071 -0.1429080 1.00000000
Preditores da correlação
Examine the correlation matrixes for any significant predictors.
When we run corr.test() function we get the full correlation matrix, along with the probability values for each one of those.
Below you can se how you can test each and every one of those correlation coefficients, those Pearson Correlation Coeficients, for significance and get their p-values and t-values, simply by looking at a package called ‘psych’ and it´s new function called corr.test().
#install.packages("psych")
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:car':
##
## logit
corr.test(correlacao)
## Call:corr.test(x = correlacao)
## Correlation matrix
## Gross IMDB Days Budget
## Gross 1.00 -0.25 -0.13 0.26
## IMDB -0.25 1.00 0.16 -0.72
## Days -0.13 0.16 1.00 -0.80
## Budget 0.26 -0.72 -0.80 1.00
## Sample Size
## [1] 4
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## Gross IMDB Days Budget
## Gross 0.00 1.00 1.0 1
## IMDB 0.75 0.00 1.0 1
## Days 0.87 0.84 0.0 1
## Budget 0.74 0.28 0.2 0
##
## To see confidence intervals of the correlations, print with the short=FALSE option
corr.test(correlacao)$r # The matrix of correlations
## Gross IMDB Days Budget
## Gross 1.0000000 -0.2532511 -0.1253871 0.2618710
## IMDB -0.2532511 1.0000000 0.1614480 -0.7215625
## Days -0.1253871 0.1614480 1.0000000 -0.7994579
## Budget 0.2618710 -0.7215625 -0.7994579 1.0000000
corr.test(correlacao)$n # Number of cases per correlation
## [1] 4
corr.test(correlacao)$p # two tailed probability of t for each correlation. For symmetric matrices, p values adjusted for multiple tests are reported above the diagonal.
## Gross IMDB Days Budget
## Gross 0.0000000 1.0000000 1.0000000 1
## IMDB 0.7467489 0.0000000 1.0000000 1
## Days 0.8746129 0.8385520 0.0000000 1
## Budget 0.7381290 0.2784375 0.2005421 0
corr.test(correlacao)$t # value of t-test for each correlation
## Gross IMDB Days Budget
## Gross Inf -0.3702201 -0.1787347 0.3837327
## IMDB -0.3702201 Inf 0.2313570 -1.4738852
## Days -0.1787347 0.2313570 Inf -1.8820755
## Budget 0.3837327 -1.4738852 -1.8820755 Inf
Proportion of Variance Accounted for
In addition we can even talk about the idea of proportion of variance accounted for uniquely by each variable.
How can we do this?
Well, Let´s remember our Venn Diagram.
When we use another function called pCorr() we can get the partial and semi-partial or park correlation coefficient squared value for every variable in the model.
And use the park correlation coefficient squared to evaluate the overall proportion of variance accounted for uniquely by each coefficent examining which one has the biggestg impact.
Em resumo:
A variável ‘Part_Corr_sq’ (semi-partial ou part correlation coefficient squared) armazena a quantidade de variação do modelo que pode ser representada por uma determinada variável.
Por exemplo, a variável DREEM.S.SP responde por 0,08227752 (8,228%) da variação do modelo de regressão linear múltipla.
res = read.csv("TempskiResilience.csv")
clin <- res[res$Group == "Clinical Sciences",] # Subset of students in the clinical sciences programm.
RegressionMultipla <- lm(MS.QoL ~ DREEM.S.SP + DREEM.A.SP + Resilience + BDI + Age, data=clin)
#summary(RegressionMultipla)
library(SDSFoundations)
pCorr(RegressionMultipla)
## Partial_Corr Partial_Corr_sq Part_Corr Part_Corr_sq
## DREEM.S.SP 0.3315348929 1.099154e-01 0.286840575 8.227752e-02
## DREEM.A.SP 0.1159533555 1.344518e-02 0.095290520 9.080283e-03
## Resilience 0.0003215396 1.033877e-07 0.000262459 6.888471e-08
## BDI -0.1529677527 2.339913e-02 -0.126347985 1.596381e-02
## Age -0.0559705907 3.132707e-03 -0.045758123 2.093806e-03
Passo-a-passo para a análise de uma regressão linear simples
In a 2015 study, Tempski and associates examined a measurement they called Quality of Life among medical school students in Brazilian medical schools. They borrowed measurement scales from the World Health Organization, the Dundee Ready Education Environment Scale, and the Beck Depression Inventory to assess the dependent variable in potential relation to a number of predictor variables.
Variables
* DREEM: Social Self Perception (DREEM.S.SP)
* DREEM: Academic Self Perception (DREEM.A.SP)
* Resilience (Resilience)
* BDI (BDI)
* Age (Age)
* Med School Quality of Life (MS.QoL)
Neste exemplo, vamos analisar a influência de BDI sobre QoL.
res = read.csv("TempskiResilience.csv")
clin <- res[res$Group == "Clinical Sciences",] # Subset of students in the clinical sciences programm.
Matriz de correlação básica
Modelo de Regressão
Intervalos de confiança
Betas padronizados
Linearidade
Residual vs. Fitted
Cook´s Distance
Modelo de Regressão
Use para analisar a influência da variável independente sobre a variável de interesse (dependente).
regressionSimple <- lm(QoL ~ BDI, data=clin)
summary(regressionSimple)
##
## Call:
## lm(formula = QoL ~ BDI, data = clin)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6587 -0.6587 0.0687 0.7962 2.6138
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.544521 0.088521 96.526 <2e-16 ***
## BDI -0.068137 0.007626 -8.935 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.159 on 489 degrees of freedom
## Multiple R-squared: 0.1404, Adjusted R-squared: 0.1386
## F-statistic: 79.84 on 1 and 489 DF, p-value: < 2.2e-16
Intervalos de confiança
Use para saber os intervalos de confiança para o slope e o intercept.
confint(regressionSimple)
## 2.5 % 97.5 %
## (Intercept) 8.37059347 8.71844843
## BDI -0.08311937 -0.05315364
Betas padronizados
Use para saber os valores da regressão se antes fosse feito um z-score dos valores de X e Y. Ou seja, fornece os betas padronizados do modelo.
lmBeta(regressionSimple)
## BDI
## -0.3746403
Linearidade
Use para ver a relação de linearidade entre essas duas variáveis.
plot(clin$BDI, clin$QoL)
abline(regressionSimple)
Matriz de correlação básica
Use para verificar qual o grau da correlação entre essas duas variáveis.
vars <- c("QoL", "BDI")
cor(clin[,vars])
## QoL BDI
## QoL 1.0000000 -0.3746403
## BDI -0.3746403 1.0000000
Residual vs. Fitted
Usado para ver se os pontos se distribuem uniformemente em relação à linha vermelha.
plot(regressionSimple, which = 1)
Cook´s Distance
Use para visualizar os resgistros (outliers) que podem estar influenciando o modelo.
cutoff <- 4/regressionSimple$df
plot(regressionSimple, which = 4, cook.levels = cutoff)
plot(regressionSimple, which = 4, cook.levels = cutoff, id.n = 7)
Esse bloco já existia e vai ser removido, relocado ou reaproveitado.
Regression line graph
Regression line coeficients
Predictions
Summaries
Regression Line graph
# parent is the predictor
# child is the outcome
#install.packages("UsingR")
library(UsingR)
## Loading required package: MASS
## Loading required package: HistData
## Loading required package: Hmisc
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:SDSFoundations':
##
## histogram
## Loading required package: survival
## Loading required package: splines
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
##
## Attaching package: 'UsingR'
## The following object is masked from 'package:survival':
##
## cancer
## The following objects are masked from 'package:psych':
##
## galton, headtail
data(galton)
galton = Galton
model = lm(child ~ parent, data = galton)
plot(galton$parent, galton$child)
abline(model, lwd=2)
Regression Line Coeficients
#install.packages("UsingR")
library(UsingR)
data(galton)
galton = Galton
model = lm(child ~ parent, data = galton)
coef(model)
## (Intercept) parent
## 23.9415302 0.6462906
# We estimate an expected 0.6462906 height increase in child height for every parent increase in height. Also called SLOPE.
# The intercept 23.9415302 is the expected child height of a 0 parent height.
# You can also calculate the coeficients manually
y <- galton$child
x <- galton$parent
beta1 <- cor(x, y) * sd(x) / sd(y)
beta0 <- mean(x) - beta1 * mean(y)
beta1
## [1] 0.3256475
beta0
## [1] 46.13535
Predictions
Creating a linearmodel and predicting new values based on it
diamond = read.csv("diamonds.csv", header=T)
newx <- c(0.16, 0.27, 0.34)
fit <- lm(price ~ carat, data = diamond)
coef(fit)
## (Intercept) carat
## -2256.361 7756.426
predict(fit , newdata = data.frame(carat = newx))
## 1 2 3
## -1015.3325 -162.1257 380.8241
Summaries
#install.packages("UsingR")
library(UsingR)
data(galton)
galton = Galton
model = lm(child ~ parent, data = galton)
summary(model)
##
## Call:
## lm(formula = child ~ parent, data = galton)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.8050 -1.3661 0.0487 1.6339 5.9264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.94153 2.81088 8.517 <2e-16 ***
## parent 0.64629 0.04114 15.711 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.239 on 926 degrees of freedom
## Multiple R-squared: 0.2105, Adjusted R-squared: 0.2096
## F-statistic: 246.8 on 1 and 926 DF, p-value: < 2.2e-16
summary(model)$sigma
## [1] 2.238547
summary(model)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.9415302 2.81087834 8.517455 6.536845e-17
## parent 0.6462906 0.04113588 15.711115 1.732509e-49
Regressão linear múltipla
However regarding regression we are not limited to looking at just two quantitative variables.
With multiple regression we can examine the effects of multiple independent variables on a single quantitative outcome at the same time.
So instead of two simple regression models that considers the impact of the variables separatedly (in this Venn Diagram example two variables):
We can examine the effects of the two variable on each other and on the dependent variable at the same time:
Grades = c(82,98,76,68,84,99,67,58,50,78)
Absences = c(4,2,2,3,1,0,4,8,7,3)
SAT_Score = c(620,750,500,520,540,690,590,490,450,560)
Regression001 = lm(Grades ~ Absences+SAT_Score)
summary(Regression001)
##
## Call:
## lm(formula = Grades ~ Absences + SAT_Score)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.791 -1.809 1.060 2.691 5.016
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.42231 13.58416 2.460 0.04344 *
## Absences -3.34018 0.77323 -4.320 0.00348 **
## SAT_Score 0.09446 0.02067 4.569 0.00258 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.729 on 7 degrees of freedom
## Multiple R-squared: 0.9308, Adjusted R-squared: 0.911
## F-statistic: 47.07 on 2 and 7 DF, p-value: 8.724e-05
Observação importante:
Observe a coluna ‘t value’, ela é a a informação sobre o t-statistic para cada variável. Logo se houver a seguinte pergunta: Quais são os t-statistic para os preditores (variáveis independentes do modelo) a respostas são esses valores.
No nosso exemplo, os t-statistics dos preditores do modelo são:
Absences = -4.320
SAT_Score = 4.569
Gráficos de Diagnóstico
Gráficos de diagnóstico nos ajudam a acessar as ‘assumptions’ do modelo e procurar por potenciais outliers. Através do Gráfico de Residuals vs. fitted poderemos chegar algumas de nossas ‘assuptions’ e o gráfico de Cook´s distance vai nos ajudar a ver potenciais outliers.
Basicamente queremos checar nossas assumptions e checar por outliers potenciais.
We can use diagnostic plots to validate the linear model, either in a simple or multiple linear regression. We can use the Residual versus fitted and Cook´s distance methos. Both methods helps us to determine is a good model for our underlying data.
For more details see: UTAustinX: UT.7.21x Foundations of Data Analysis - Part 2 Course Week 6: Correlation and Regression Lecture Videos Regression Diagnostics
Residuals versus fitted
library(ggplot2)
diamantes = diamonds[diamonds$cut=='Good',]
diamond_lm = lm(price ~ carat, data = diamantes)
summary(diamond_lm)
##
## Call:
## lm(formula = price ~ carat, data = diamantes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9153.2 -738.1 -69.6 575.4 11394.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2422.73 43.04 -56.28 <2e-16 ***
## carat 7479.64 44.70 167.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1421 on 4904 degrees of freedom
## Multiple R-squared: 0.851, Adjusted R-squared: 0.8509
## F-statistic: 2.8e+04 on 1 and 4904 DF, p-value: < 2.2e-16
plot(diamond_lm, which = 1)
Observe that at zero (0) there is a dashed line running all the way thru the graph and we can start examining our data visually for something that is called homoscedastic or heteroscedastic. We can certainly look for a linear concept to out data. If everything is nice and random around the red line this graph shows good linearity. The data is split around the curve what is good. If the data curved along with the red line we could say we had a curvelinear data which is not the case. It is not perfect but for the most part we can call it homoscedastic.
Now observe that R picks out some weird outlier when it comes to residuals, they are the numbered dots you see in the graph, they are values really far away from the predicted line. Those numbers are actually row numbers that you can look into to understand the outlier better (those are potentially outliers that we want to get rid of).
For example:
diamantes [row.names(diamantes ) == 27197, ]
## carat cut color clarity depth table price x y z
## 27197 1.14 Good D IF 61.7 63 17499 6.64 6.74 4.13
Cook´s distance
A Cooks is a great measure of influence on the overall model, and there is a classic rule of thumb that says that our cutoff for a cook´s distance (cook.level) is 4 divided by the number of degrees of freedom within our model, so…
cutoff <- 4/diamond_lm$df
Now that we have this cutoff we can ask for another plot.
plot(diamond_lm, which = 4, cook.levels = cutoff)
Observe that theses particular observations, 22832, 23540, 25851 (these numbers are the row numbers of these observations) are really these observations that really show themselves as being high outliers, in terms of high influence, in terms of a cook´s ‘d’ value.
You might also observe that there are some other points in the graph that look like outliers but that do not have their positions you can force the plot() function to number them with the following command (id.n default is 3).
plot(diamond_lm, which = 4, cook.levels = cutoff, id.n = 7)
Those high observations are really the observations that have highest influence in terms of the cook´s d value and would be great contenders for removal (because I can justify once I have a chi cooks t cut off and the id is gona give me the row numbers and we can easily remove them from the data frame).
Erro residual
The residuals can help us understand if the final model selected has increased too much the residuals of the regression.
If it did not, it helps to justify the final model as strong.
Let´s see if using the data above.
The complete regression reads as follows:
linearmodel = lm(I2005 ~ R2005+R2004+C2005+C2004,data=methodselection)
summary(linearmodel)
##
## Call:
## lm(formula = I2005 ~ R2005 + R2004 + C2005 + C2004, data = methodselection)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9193 -0.2467 0.1455 0.4088 2.4758
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.383 2.167 -0.638 0.5363
## R2005 -1.645 1.236 -1.330 0.2103
## R2004 2.020 1.881 1.074 0.3060
## C2005 3.025 1.126 2.687 0.0211 *
## C2004 -2.632 1.789 -1.472 0.1692
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.527 on 11 degrees of freedom
## Multiple R-squared: 0.7148, Adjusted R-squared: 0.6111
## F-statistic: 6.893 on 4 and 11 DF, p-value: 0.004968
Let´s calculate the Root mean squared error which tells us the average error we make in our predictions. In other words the number of errors we making using the model.
SSE = sum(linearmodel$residuals^2) # Sum of Squared Errors
RMSE = sqrt(SSE/nrow(methodselection)) # Root mean squared error
# Root mean squer error
RMSE
## [1] 1.26585
Now let´s use the model suggested by the stepwise method and see the effect it has on the residual error.
linearmodel = lm(I2005 ~ C2005+C2004,data=methodselection)
summary(linearmodel)
##
## Call:
## lm(formula = I2005 ~ C2005 + C2004, data = methodselection)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5168 -0.3434 0.0938 0.9787 2.6232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2861 1.8248 -0.157 0.8778
## C2005 1.9943 0.7702 2.589 0.0225 *
## C2004 -1.3396 0.8681 -1.543 0.1468
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.519 on 13 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6152
## F-statistic: 12.99 on 2 and 13 DF, p-value: 0.000794
SSE = sum(linearmodel$residuals^2) # Sum of Squared Errors
RMSE = sqrt(SSE/nrow(methodselection)) # Root mean squared error
# Root mean squer error
RMSE
## [1] 1.368798
Let´s understand what this means.
Using the model with all variables we have a root mean square error of 1.26585.
Using the adjusted model with stepwise method we have a root mean square error of 1.368798.
It is a great deal.
1.26585-1.368798 = 0.102948
No.
Because the difference between them both is 0.102948.
But also because the mean is much higher then that, see below.
mean(methodselection$I2005)
## [1] 6.77375
6.77375/0.102948
## [1] 65.79778
If fact 65 times higher. Which is highly acceptable.
Making predictions
In order to better understand how to make predictions, We will be using a dataset containing data for all teams and seasons in NBA since 1980 except those who lost 82 games.
NBA = read.csv("NBA_train.csv")
First we need to take a look at the variables in the dataset to understand how to create a model of points scored.
Variables
The variables where one variable name is repeated with an A afterwards means that with an A it was attemped without it it was successfull, so…
Season: Year of the season end
Team: name of the team
Playoffs: Whether or not a team made it to the playoffs
W: Number of wins during the regular season
PTS: Points scored during regular season
oopPTS: Opponent points scored during regular season
FG: Successful field goals including two and three pointers
FGA: Attempted field goals (number of unsuccessful field goas)
X2P: Successful two pointers
XSPA: Attempted two pointers
X3P: Successful three pointers
X3PA: Atempted three pointers
FT: Sucessful free tolls
FTA: Attempted free tools
ORB: Offensive debounds
DRB: Defensive debounds
AST: Assists
STL: Steals
BLK: Blocks
TOV: Turn overs
Our dependent variable is points scored (PTS) and our independent variables are:
X2PA
X3PA
FTA
AST
ORB
DRB
BLK
TOV
STL
In order to find out the best model we will use stepwise method.
So:
PointsRegPTS = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + BLK + TOV + STL, data=NBA)
step(PointsRegPTS, direction = "both")
## Start: AIC=8732.61
## PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + BLK + TOV + STL
##
## Df Sum of Sq RSS AIC
## - TOV 1 5634 28399948 8730.8
## - DRB 1 13687 28408001 8731.0
## - BLK 1 13878 28408192 8731.0
## <none> 28394314 8732.6
## - STL 1 161970 28556284 8735.4
## - ORB 1 5174007 33568321 8870.4
## - AST 1 13973859 42368173 9064.8
## - X3PA 1 36907474 65301787 9426.0
## - FTA 1 38486912 66881226 9446.0
## - X2PA 1 42823394 71217708 9498.4
##
## Step: AIC=8730.78
## PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + BLK + STL
##
## Df Sum of Sq RSS AIC
## - DRB 1 13751 28413699 8729.2
## - BLK 1 15444 28415392 8729.2
## <none> 28399948 8730.8
## + TOV 1 5634 28394314 8732.6
## - STL 1 182111 28582059 8734.1
## - ORB 1 5244408 33644356 8870.3
## - AST 1 13992244 42392192 9063.3
## - FTA 1 39799378 68199326 9460.3
## - X3PA 1 39978483 68378431 9462.5
## - X2PA 1 43003978 71403926 9498.6
##
## Step: AIC=8729.18
## PTS ~ X2PA + X3PA + FTA + AST + ORB + BLK + STL
##
## Df Sum of Sq RSS AIC
## - BLK 1 7765 28421465 8727.4
## <none> 28413699 8729.2
## + DRB 1 13751 28399948 8730.8
## + TOV 1 5699 28408001 8731.0
## - STL 1 254864 28668563 8734.6
## - ORB 1 5720440 34134139 8880.3
## - AST 1 14575147 42988847 9072.9
## - FTA 1 40907684 69321383 9471.9
## - X3PA 1 45944414 74358114 9530.5
## - X2PA 1 46410153 74823853 9535.7
##
## Step: AIC=8727.41
## PTS ~ X2PA + X3PA + FTA + AST + ORB + STL
##
## Df Sum of Sq RSS AIC
## <none> 28421465 8727.4
## + BLK 1 7765 28413699 8729.2
## + TOV 1 6842 28414623 8729.2
## + DRB 1 6073 28415392 8729.2
## - STL 1 253304 28674769 8732.8
## - ORB 1 5846710 34268175 8881.6
## - AST 1 14709040 43130504 9073.7
## - FTA 1 41048721 69470186 9471.7
## - X3PA 1 46992818 75414282 9540.2
## - X2PA 1 47288836 75710301 9543.5
##
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data = NBA)
##
## Coefficients:
## (Intercept) X2PA X3PA FTA AST
## -2032.7164 1.0500 1.2731 1.1273 0.8884
## ORB STL
## -0.9743 -0.2268
By the stepwise method above we find that the best model is:
lm(formula = PTS = -2032.7164 + 1.05 * X2PA + 1.2731 * X3PA + 1.1273 * FTA + 0.8884 * AST - 0.9743 * ORB - 0.2268 * STL, data = NBA)
We can use this model make predictions.
But before that, let´s understand what predictions are.
Observe that we used the NBA_train.csv that contains data from years previous to 2014.
We above used this data to create a model that can help us to predict points scored in other years.
In our case we are going to predict Points Scored in 2014.
To do so we need a Test dataset.
A test dataset is a dataset equal in structure to that we used to create the model with one difference:
In that dataset the independent variables are either variables collected to help the prediction or like in this example they are the players statistics predicted using the 2014 performance of the players who will be playing in 2014.
So we need to load the test dataset:
NBA_test = read.csv("NBA_test.csv")
The NBA_test dataset has all the same variables for the 2012-2013 Season.
However, we want to predict results using the model we just created.
The model we are using as suggested by the stepwise method is:
PointsRegPTS = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data=NBA)
Let´s make our predictions:
PointsRegPTS = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data=NBA)
PointsPredictions = predict(PointsRegPTS, newdata=NBA_test)
The question here is:
How can they know the values for X2PA + X3PA + FTA + AST + ORB + STL in 2012.
They can not know because the season would only be starting but once those results are achieved by players, what they CAN do is to estimate the values of those variables for the current season (2012-2013) by looking at the players that will play in 2012-2013 and measure their performances in the year before let´s say 2011.
By doing so they prepared the NBA_test dataset to feed the model with the needed data for the prediction.
What does the predict() function do?
It creates a vector with all the values predicted to the dependent variable regarding all observations.
In our case we have the real results stored in the test dataset and therefore we can compare them and also view the difference between them:
head((data.frame(PointsPredictions ,NBA_test$PTS, abs(PointsPredictions -NBA_test$PTS))))
## PointsPredictions NBA_test.PTS abs.PointsPredictions...NBA_test.PTS.
## 1 8086.446 8032 54.44594
## 2 7764.143 7944 179.85706
## 3 7965.348 7661 304.34799
## 4 7784.034 7641 143.03405
## 5 8004.349 7913 91.34868
## 6 8247.427 8293 45.57268
Observe that the difference between the actual value and the prediction is very low, thus confirming the power of the prediction.
In it´s highest point we have a 471 difference in a 8669 points scored. It is 18 times smaller.
We could therefore say it is a quite exquisite approximation.
How good is the model?
Another way to answer this question is to compute the Out of sample R squared. This is a measurment of how well the model predicts the testset data.
If we recall the model created we will see that the R squared value is as seen below (.8991) which is the measure of an in-sample r-squared, or in other words, how well the model fits the trainingset.
summary(PointsRegPTS)
##
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data = NBA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -523.33 -122.02 6.93 120.68 568.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.033e+03 1.629e+02 -12.475 < 2e-16 ***
## X2PA 1.050e+00 2.829e-02 37.117 < 2e-16 ***
## X3PA 1.273e+00 3.441e-02 37.001 < 2e-16 ***
## FTA 1.127e+00 3.260e-02 34.581 < 2e-16 ***
## AST 8.884e-01 4.292e-02 20.701 < 2e-16 ***
## ORB -9.743e-01 7.465e-02 -13.051 < 2e-16 ***
## STL -2.268e-01 8.350e-02 -2.717 0.00673 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.3 on 828 degrees of freedom
## Multiple R-squared: 0.8991, Adjusted R-squared: 0.8983
## F-statistic: 1229 on 6 and 828 DF, p-value: < 2.2e-16
head(sort(abs(PointsRegPTS$residuals)))
## 255 279 567 151 187 700
## 0.5412785 0.6095379 0.6743536 0.9799886 1.0435412 1.0630677
However to get a measure of the predictions goodness of fit we need to calculate the out-of-sample r-squared.
# Sum of squared erros (SEE)
SSE = sum((PointsPredictions - NBA_test$PTS)^2)
# Total sum of squares
SST = sum((mean(NBA$PTS) - NBA_test$PTS)^2)
# R-squared
R2 = 1 - SSE/SST
R2
## [1] 0.8127142
# Root mean square error
RMSE = sqrt(SSE/nrow(NBA_test))
RMSE
## [1] 196.3723
Although different it is still a quite acceptable r-square value.
If you calculate the RMSE for the trainningset you will also see that the values don´t differ greatly.
Regressão logística
É uma técnica recomendada para situações em que a variável dependente é de natureza dicotômica ou binária. Quanto às independentes, tanto podem ser categóricas ou não.
A regressão logística é um recurso que nos permite estimar a probabilidade associada à ocorrência de determinado evento em face de um conjunto de variáveis explanatórias.
Busca estimar a probabilidade da variável dependente assumir um determinado valor em função dos conhecidos de outras variáveis; Os resultados da análise ficam contidos no intervalo de zero a um.
Em comparação com as técnicas conhecidas em regressão, em especial a regressão linear, a regressão logística distingue-se essencialmente pelo facto de a variável resposta ser categórica.
Trata-se de um modelo de regressão para variáveis dependentes ou de resposta binomialmente distribuídas.
Understanding the Logistic Regression
In this example we will use the quality.csv dataset that store information about health care for various pacients.
This example is from: edX.org.-.The.Analytics.Edge Week.3.-.Logistic.Regression 1.Modelating the Expert - An Introduction to Logistic Regression.VIDEO 4 LOGISTIC REGRESSION IN R
The variables in the dataset quality.csv are as follows:
This are the variables in the dataset
MemberID numbers the patients from 1 to 131, and is just an identifying number.
InpatientDays is the number of inpatient visits, or number of days the person spent in the hospital.
ERVisits is the number of times the patient visited the emergency room.
OfficeVisits is the number of times the patient visited any doctor's office.
Narcotics is the number of prescriptions the patient had for narcotics.
DaysSinceLastERVisit is the number of days between the patient's last emergency room visit and the end of the study period (set to the length of the study period if they never visited the ER).
Pain is the number of visits for which the patient complained about pain.
TotalVisits is the total number of times the patient visited any healthcare provider.
ProviderCount is the number of providers that served the patient.
MedicalClaims is the number of days on which the patient had a medical claim.
ClaimLines is the total number of medical claims.
StartedOnCombination is whether or not the patient was started on a combination of drugs to treat their diabetes (TRUE or FALSE).
AcuteDrugGapSmall is the fraction of acute drugs that were refilled quickly after the prescription ran out.
PoorCare is the outcome or dependent variable, and is equal to 1 if the patient had poor care, and equal to 0 if the patient had good care.
However we will use only three variables in our example:
Dependent variable (PoorCare):
1: Patient had poor care
0: Patiente had good care
Independent variable 1 (OfficeVisits): Number of times the patient visited any doctor’s office.
Independent variable 1 (Narcotics): Number of prescriptions the patient had for narcotics.
Logistic regression predicts the probablilty of the outcome variable beeing true. In our example, the logistic regression model will predict the probability of the patient is receiving poor care. For this example we will denote:
poor care = 1
good care = 0
Just like a linear regression we have a set of independent variables (x1 to xz). In our case PoorCare is the dependent variable. MemberID should be removed from the dataset. All the other variables are independent variables.
To predict the probability that Y=1 we use the logistic response function, that is the nonlinear transformation of linear regression equation to produce numbers between 0 e 1.
logistic funciotn = 1 / (1+ e ^ - lm())
Look at the logistic curve below:
bodysize=rnorm(20,30,2)
bodysize=sort(bodysize)
survive=c(0,0,0,0,0,1,0,1,0,0,1,1,0,1,1,1,0,1,1,1)
dat=as.data.frame(cbind(bodysize,survive))
plot(bodysize,survive, type="n")
g=glm(survive~bodysize,family=binomial,dat)
curve(predict(g,data.frame(bodysize=x),type="resp"),add=TRUE)
points(bodysize,fitted(g),pch=20)
Let´s understand this function a little better. This graph shows the logistic reponse function for different values of the linear regression peice, the logistic response function always takes values between zero and one which makes sense since it equals a probability.
A positive coeficient value for a variable increases the linear regression piece which increases that y equals one, or in our case, increases the probability of poor care.
On the other hand, a negative coeficient value for a variable decreases the linear regression piece which decreases that y equals one, or in our case, increases the probability of good care.
Another way to think of it is in terms of probabilty.
odds= prob(y=1)/prob(y=0)
odds= e ^ lm()
log(odds) = lm()
logit = log(oods) = beta 0 + beta1 * x1 + beta2 * x2 + … + betan * Xn
It means that:
Model of Health Care Quality
From the plot below it seems that as office visits and narcotics increase so does the quality of care increases once we have more red dots than black ones above and to the right of the plot.
But it is hard to see a trend in the data by just inspecting it. But it looks like maybe more office visits and more narcotics are more likely to have poor care (red dots)
quality = read.csv("quality.csv") # reads the dataset
qualityFilter = quality[quality$PoorCare==1,] # Stores which observatins had good care
plot(quality$OfficeVisits,quality$Narcotics, pch=20, cex=2, col="green") # plots all observations in green
points(qualityFilter$OfficeVisits,qualityFilter$Narcotics, col="red", pch=20, cex=2) # plots all observations that had good care in red
lines(c(-2,30),c(30,-2), col="blue", lwd=3) # This line helps us to see that above a certain point there seems to be more and more patients with good care.
Let´s build a logistic regression to better predict poor care.
Building the Trainning and Testing Sets
First let´s see how many patients received good care (0) and how many received poor care (1).
table(quality$PoorCare)
##
## 0 1
## 98 33
98/ (98+33)
## [1] 0.7480916
So 98 out of 131 (98+33) patients received good care (0). In other words 75% receive good care. In a classification problem, a standard baseline method is to just predict the most frequent outcome for all observations. Since GOOD CARE (0) is more commum than POOR CARE (1) we would predict that all patients are receiving good care.
If we would do this, we would get 98 our of 131 observations correct or have an acurracy of 75%.
Therefore we can say that our baseline model has an accuracy of 75%.
That´s what we will try to beat with a logistic regression.
Splitting the Trainning set & Testing set
To make predictions we need a trainning set to build the model on and a testing set to make predictions about. To do so we will use the sample.split() function.
Randomly splitting data
Lets understand how the split takes place. Using the sample.split function we use the dataset quality and the column PoorCare (dependent variable) to create a hole new dataset with the same distribution between 1s and 0s. It is important to point that the parameter SplitRatio should be set to equal the more frequent value of the variable, in our case the more frequent value is 1 (poor care) and that is the value that will have. It represents 75% of the splitted variable.
#install.packages("caTools")
library(caTools)
set.seed(88)
split = sample.split(quality$PoorCare, SplitRatio = 0.75)
split
## [1] TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE
## [12] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [23] TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
## [34] TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## [45] FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
## [56] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [67] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [78] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## [89] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## [100] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE
## [111] FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE
## [122] TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE
The result of the funciton is a vector with a ratio of 75% between TRUES and FALSES. In the split vector TRUE means that we should put the result in the trainning set and FALSE means that we should put the result in the testing set
table(split) # ARound the same as quality@PoorCare
## split
## FALSE TRUE
## 32 99
99/(99+32)
## [1] 0.7557252
Create training and testing sets
qualityTrain = subset(quality, split == TRUE)
qualityTest = subset(quality, split == FALSE)
Analysing the trainning and testing sets we observe that although the number of rows differs from Train and Test they keep the ratio providade for the sample.split() funciton of 75%
nrow(qualityTrain)
## [1] 99
nrow(qualityTest)
## [1] 32
99/(99+32)
## [1] 0.7557252
The ratio within each one of datasets creates also keep the same ratio of 75%.
table(qualityTrain$PoorCare)
##
## 0 1
## 74 25
74/(74+25)
## [1] 0.7474747
table(qualityTest$PoorCare)
##
## 0 1
## 24 8
24/(24+8)
## [1] 0.75
Therefore that is what the sample.split() function does.
What do we want to do with this?
By creating a trainning and a test set we can create a model with the trainning set.
validate making predictions using the testing set.
Building the Logistic Regression Model
Now we are ready to build a logistic regression model using office visits and narcotics as seen in the plot above as independent variables. To do so we will use the GLM() funciton that stands for Generalized Linear Model.
QualityLog = glm(PoorCare ~ OfficeVisits + Narcotics, data=qualityTrain, family=binomial)
summary(QualityLog)
##
## Call:
## glm(formula = PoorCare ~ OfficeVisits + Narcotics, family = binomial,
## data = qualityTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.06303 -0.63155 -0.50503 -0.09689 2.16686
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.64613 0.52357 -5.054 4.33e-07 ***
## OfficeVisits 0.08212 0.03055 2.688 0.00718 **
## Narcotics 0.07630 0.03205 2.381 0.01728 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 111.888 on 98 degrees of freedom
## Residual deviance: 89.127 on 96 degrees of freedom
## AIC: 95.127
##
## Number of Fisher Scoring iterations: 4
We use family=binomial to set the logistic regression.
Observe that both betas are positive what means that higher values in this two variables are indicative of poor care (0) as we could suspect by the plot above.
Each variable has at least one * which means they are all significant in our model.
AIC is a mesure of the quality of the model. It accounts for the number of variables used compared to the number of observations, however it can only be compared between models on the same dataset. But it provides a means for model selection. The preferred model is that with the minimum AIC
Making predictions on the Trainning Set
Let´s now make predictions on the trainning set using the QualityLog model.
It is not clear to me why we don´t make predictions on the testing using the QualityLog model set but it may came later.
predictTrain = predict(QualityLog, type="response")
# The type="response" tells the predict funciton to give probabilities
summary(predictTrain)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06623 0.11910 0.15970 0.25250 0.26760 0.98460
Observe that once we are expecting probabilites all those numbers should be between 1 and 0.
Let´s see if we are predicting higher probabilities for the actual poor care (1) cases as we expect. To do so we use the tapply() function.
tapply(predictTrain, qualityTrain$PoorCare, mean)
## 0 1
## 0.1894512 0.4392246
0.4392246 / (0.1894512 + 0.4392246 )
## [1] 0.6986504
As we expected we are producing higher probabilities to 1, Around 69.86%.
Accuracy of Predictions
Often we want to make a binary prediction.
Did this patient receive good care or poor care?
We can do this using a threshold value t
If P(PoorCare = 1) > t, predict good quality
If P(PoorCare = 1) < t, predict poor quality
What value should we pick for t?
There is two types of erros that a model can make.
#1 - When you predict poor care (1) but the outcome is good care (0)
#2 - When you precict good care (0) but the outcome is poor care (1)
Picking a large threshold t We will predict poor care rarely once the probability of poor care would have to be really large to be bigger then the threshold.
Picking a small threshold t We will predict good care rarely because the probability of good care would have to be really small to be smaller then the treshold.
If you have no preference in error type select t =.5
Confusion matrix
(classification matrix)
This compares the actual outcomes with the predicted outcomes
The rows are labeled with the actual outcomes
The columns are labled with the predicted outcomesd
There are four values in the matrix
True Negatives (TN): Actual=0,Predicted = 0
True Positives (TP): Actual=1, Predicted = 1
False Positives (FP): Actual=0, Predicted=1
Falte Negatives (FN): Actual=1, Predicted=0
We can compute two outcome measures that help us to determine what types of erros we are making:
Sensitivity: TP/(TP+FN) [Actual=1 line]
Measures the percentace of actual poor care cases that we classifeid correctly. Has to do with the Actual 1 row.
Also known as the true positive rate
Specificity: TN/(TN+FP) [Actual=0 line]
Measures the percentace of actual good care cases that we classified correctly.
Has to do with the Actual 1 row.
Also known as the true negative rate.
A model with a higher threshold will have lower sensitivity and higher specifity.
A model with a lower threshold will have higher sensitivity and lower specificity.
# Confusion matrix (classification matrix) for threshold of 0.5
table(qualityTrain$PoorCare, predictTrain > 0.5)
##
## FALSE TRUE
## 0 70 4
## 1 15 10
# Sensitivity
10/(15+10)
## [1] 0.4
# Specificity
70/(70+4)
## [1] 0.9459459
But, Which threshold should we pick?
Receiver Operator Characteristic (ROC) curve
Recall that we made predictions on our training set and called them predictTrain and we will use these predictions to create our ROC curve.
First we will call the prediction function of ROCR, and we will call the output of this function ROCRpred, and then use the prediction function which takes two arguments:
Then we need to use the performance function, this defines what we would like to plot on the x and y axis of our ROC curve. We will call the output of this ROCperf.
The performance funciton takes three arguments:
And then we can plot the output of this performance function.
# Install and load ROCR package
# install.packages("ROCR")
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
# Prediction function
ROCRpred = prediction(predictTrain, qualityTrain$PoorCare)
# Performance function
ROCRperf = performance(ROCRpred, "tpr", "fpr")
# Plot ROC curve
plot(ROCRperf)
It can help you decide whic h value of the threshold is best!
ROC always start at (0,0)[upper right corner] which corresponds to a threshold = 1 which will not catch any poor care case, in other words, a sensitivity = 0. Therefore you will correctly label all good care cases, meaning that you have a false positive rate = 0.
ROC always ends at (1,1)[lower left corner] which corresponds to a thresholod = 0. In that case you will catch all of the poor care cases, with a sensitiviy = 1 but you you label all of the good care cases as poor care cases too, meaning that you have a false positive rate = 1.
Observe the ROC curve below at some points and their intepretations:
Point (0,.4): Correctly labeling 40% of the poor care cases with about 0,5% false positive rate.
Point (.6,.9): Correctly labeling 90% of the poor care cases with about 60% of false positive rate.
Point (.3,.8)? Correctly labeling 80% of poor care cases with about 30% of false poistive rate.
Selecting a Threshold using ROC
ROC curves capture all thresholds simultaneously.
High threshold - At point = (0,0), threshold = 1
High Specificity
Low Sensitivity
Low threshold - At point = (1,1), threshold = 0
Low Specificity
High Sensitivity
So, what treshold value should you pick?
You should select the best threshold for the trade off you want to make.
Let’s say you want a low false positive rate (high specificity) take the threshold that maximizes the true positive rate while keeping the false positive rate really low. In our example, a threshold around (.1,.5) seems a good choice.
On the other hand, if you are more concerned with a high sensitivity (high true positive rate) take a treshold that that minimizes the false positive rate but has a very high true poisitive rate. In our example a threshold around (.3,.8) looks like a good choice in this case.
Color coding the curve
You can label the thresholds in R by color coding the curve.
The legend is shown on the right. By picking a point (x,y) you can see which color is closest on the line. After that just see at the y-axis at the right which value if the best value for the threshold.
# Add colors
plot(ROCRperf, colorize=TRUE)
Adding threshold labels
# Add threshold labels
plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
Entenda o efeito dos fatores (levels) numa Regressão linear
Relative to type of variable:
If the dependent variable is continuous (e.g. Blood count: 3.2, 3.4,5.5) or discreet (e.g: Age: 23years, 43 years), where there is not necessarily a hierarquical value between values we have a less strong result from the regression analysis.
When the dependent variable is ordianl (e.g. 1,2,3,4 where 1<2<3<4) we have more robuts results from the regression.
When the dependent varible is dicotomic (0 or 1) we should use a Logistic Regression.
It is important to notice that if one or more of the variables is an unordered factor, we must define one level as the “reference level” and add a binary variable for each of the remaining leves. In this way, a factor with n levels is replaced by n-1 binary variables. The reference level is typically selected to be the most frequently occurring level in the dataset. However R when loads texts selects the reference level using alphabetical order. It chooses the first level alphabetically as the reference level instead of the most common. Pay attention to that.
Let´s see an example using the quine dataset in the MASS library. The quine data frame has 146 rows and 5 columns. Children from Walgett, New South Wales, Australia, were classified by Culture, Age, Sex and Learner status and the number of days absent from school in a particular school year was recorded. Thisis an unordered factor. The is no logical sequence.
For this example we will use the variable:
Lrn (learner status): factor with levels Average or Slow learner, (“AL” or “SL”).
# This is an incomplete example. You should do the reading of the pdf below as suggested.
y<-c(72,25,24,2,18,38,62,30,78,34,67,21,97,79,64,53,27,81)
Day<-c(rep("Day 1",6),rep("Day 2",8),rep("Day 3",4))
dataf<-data.frame(y,Day)
str(dataf) #Day is not ordered
## 'data.frame': 18 obs. of 2 variables:
## $ y : num 72 25 24 2 18 38 62 30 78 34 ...
## $ Day: Factor w/ 3 levels "Day 1","Day 2",..: 1 1 1 1 1 1 2 2 2 2 ...
summary(lm(y~Day,data=dataf)) #Day 2 is not significantly different from Day 1, but Day 3 is.
##
## Call:
## lm(formula = y ~ Day, data = dataf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.500 -21.333 0.125 16.750 42.167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.83 10.30 2.897 0.0111 *
## DayDay 2 28.67 13.62 2.105 0.0526 .
## DayDay 3 26.42 16.28 1.623 0.1255
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.22 on 15 degrees of freedom
## Multiple R-squared: 0.2471, Adjusted R-squared: 0.1467
## F-statistic: 2.461 on 2 and 15 DF, p-value: 0.119
dataf$Day
## [1] Day 1 Day 1 Day 1 Day 1 Day 1 Day 1 Day 2 Day 2 Day 2 Day 2 Day 2
## [12] Day 2 Day 2 Day 2 Day 3 Day 3 Day 3 Day 3
## Levels: Day 1 Day 2 Day 3
dataf$Day<-ordered(dataf$Day)
dataf$Day
## [1] Day 1 Day 1 Day 1 Day 1 Day 1 Day 1 Day 2 Day 2 Day 2 Day 2 Day 2
## [12] Day 2 Day 2 Day 2 Day 3 Day 3 Day 3 Day 3
## Levels: Day 1 < Day 2 < Day 3
# Observe that after the ordered function Levels: Day 1 < Day 2 < Day 3
# You can also see this with the command below
str(dataf) # "Day 1"<"Day 2"<"Day 3"
## 'data.frame': 18 obs. of 2 variables:
## $ y : num 72 25 24 2 18 38 62 30 78 34 ...
## $ Day: Ord.factor w/ 3 levels "Day 1"<"Day 2"<..: 1 1 1 1 1 1 2 2 2 2 ...
summary(lm(y~Day,data=dataf))
##
## Call:
## lm(formula = y ~ Day, data = dataf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.500 -21.333 0.125 16.750 42.167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.194 6.187 7.789 1.19e-06 ***
## Day.L 18.679 11.512 1.623 0.125
## Day.Q -12.622 9.858 -1.280 0.220
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.22 on 15 degrees of freedom
## Multiple R-squared: 0.2471, Adjusted R-squared: 0.1467
## F-statistic: 2.461 on 2 and 15 DF, p-value: 0.119
#Significances reversed (or "Day.L" and "Day.Q" are not sinonymous "Day 2" and "Day 3"?): Day 2 (".L") is significantly different from Day 1, but Day 3 (.Q) isn't.
# This helps to at least see the effect of ordering a factor in the linear regression results.
f <- paste("day", 1:3)
contrasts(ordered(f))
## .L .Q
## [1,] -7.071068e-01 0.4082483
## [2,] -7.850462e-17 -0.8164966
## [3,] 7.071068e-01 0.4082483
contrasts(factor(f))
## day 2 day 3
## day 1 0 0
## day 2 1 0
## day 3 0 1
This is a more complex aproach to unordered factors. You can however solve this problem with a more simple approach. As an example, consider the unordered factor variable “color”, with levels “red”, “green”, and “blue”. If “green” were the reference level, then we would add binary variables “colorred” and “colorblue” to a linear regression problem. All red examples would have colorred=1 and colorblue=0. All blue examples would have colorred=0 and colorblue=1. All green examples would have colorred=0 and colorblue=0.
y<-c(72,25,24,2,18,38,62,30,78,34,67,21,97,79,64,53,27,81)
Day<-c(rep("Day 1",6),rep("Day 2",8),rep("Day 3",4))
dataf<-data.frame(y,Day)
str(dataf) #Day is not ordered
## 'data.frame': 18 obs. of 2 variables:
## $ y : num 72 25 24 2 18 38 62 30 78 34 ...
## $ Day: Factor w/ 3 levels "Day 1","Day 2",..: 1 1 1 1 1 1 2 2 2 2 ...
dataf$Day1 = 0
dataf$Day2 = 0
dataf$Day3 = 0
dataf$Day1= ifelse(dataf$Day=="Day 1",yes=1,no=0)
dataf$Day2= ifelse(dataf$Day=="Day 2",yes=1,no=0)
dataf$Day3= ifelse(dataf$Day=="Day 3",yes=1,no=0)
str(dataf)
## 'data.frame': 18 obs. of 5 variables:
## $ y : num 72 25 24 2 18 38 62 30 78 34 ...
## $ Day : Factor w/ 3 levels "Day 1","Day 2",..: 1 1 1 1 1 1 2 2 2 2 ...
## $ Day1: num 1 1 1 1 1 1 0 0 0 0 ...
## $ Day2: num 0 0 0 0 0 0 1 1 1 1 ...
## $ Day3: num 0 0 0 0 0 0 0 0 0 0 ...
dataf
## y Day Day1 Day2 Day3
## 1 72 Day 1 1 0 0
## 2 25 Day 1 1 0 0
## 3 24 Day 1 1 0 0
## 4 2 Day 1 1 0 0
## 5 18 Day 1 1 0 0
## 6 38 Day 1 1 0 0
## 7 62 Day 2 0 1 0
## 8 30 Day 2 0 1 0
## 9 78 Day 2 0 1 0
## 10 34 Day 2 0 1 0
## 11 67 Day 2 0 1 0
## 12 21 Day 2 0 1 0
## 13 97 Day 2 0 1 0
## 14 79 Day 2 0 1 0
## 15 64 Day 3 0 0 1
## 16 53 Day 3 0 0 1
## 17 27 Day 3 0 0 1
## 18 81 Day 3 0 0 1
# You can now make a linear regression using those variables
summary(lm(y ~ Day1 + Day2 + Day3, data=dataf))
##
## Call:
## lm(formula = y ~ Day1 + Day2 + Day3, data = dataf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.500 -21.333 0.125 16.750 42.167
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.25 12.61 4.461 0.000458 ***
## Day1 -26.42 16.28 -1.623 0.125494
## Day2 2.25 15.45 0.146 0.886113
## Day3 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.22 on 15 degrees of freedom
## Multiple R-squared: 0.2471, Adjusted R-squared: 0.1467
## F-statistic: 2.461 on 2 and 15 DF, p-value: 0.119
One last observation is to point out that if you do nothing, when you do a linear regression using a dataset where one or movre variables are factors, the lm() fucniton will remove the reference level from the linear regression. Observe that below Day 1 is a missing coefficiente.
y<-c(72,25,24,2,18,38,62,30,78,34,67,21,97,79,64,53,27,81)
Day<-c(rep("Day 1",6),rep("Day 2",8),rep("Day 3",4))
dataf<-data.frame(y,Day)
summary(lm(y~Day,data=dataf))
##
## Call:
## lm(formula = y ~ Day, data = dataf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.500 -21.333 0.125 16.750 42.167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.83 10.30 2.897 0.0111 *
## DayDay 2 28.67 13.62 2.105 0.0526 .
## DayDay 3 26.42 16.28 1.623 0.1255
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.22 on 15 degrees of freedom
## Multiple R-squared: 0.2471, Adjusted R-squared: 0.1467
## F-statistic: 2.461 on 2 and 15 DF, p-value: 0.119
However you can relevel the variable and the new reference level will be omited from the linear regression. You must find out what is best to create manually of variable to each possible value of the Day variable or let lm() remove the reference level from the computation.
Important: If any of the variable unfolded by the Day variable (Day1,day2,day3) is significante in the model it means we can not remove the variable.
y<-c(72,25,24,2,18,38,62,30,78,34,67,21,97,79,64,53,27,81)
Day<-c(rep("Day 1",6),rep("Day 2",8),rep("Day 3",4))
dataf<-data.frame(y,Day)
dataf$Day = relevel(dataf$Day, "Day 2")
summary(lm(y~Day,data=dataf))
##
## Call:
## lm(formula = y ~ Day, data = dataf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.500 -21.333 0.125 16.750 42.167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 58.500 8.917 6.560 9.03e-06 ***
## DayDay 1 -28.667 13.621 -2.105 0.0526 .
## DayDay 3 -2.250 15.445 -0.146 0.8861
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.22 on 15 degrees of freedom
## Multiple R-squared: 0.2471, Adjusted R-squared: 0.1467
## F-statistic: 2.461 on 2 and 15 DF, p-value: 0.119
Source:http://r.789695.n4.nabble.com/Models-with-ordered-and-unordered-factors-td4072225.html
You should also read: Working with Ordinal Predictors.pdf
See ?cor for more details
O comportamento conjunto de duas variáveis quantitativas pode ser observado pelo diagrama de dispersão e medido por meio do Coeficiente de Correlação de Pearson (r). Pode ser utilizado também se uma das variáveis é contínua e outro dicotômica (homem/mulher)
Se os dados são em nível ordinal ou não apresentam as pressuposições para realizar a correlação de Pearson, utiliza-se o Coeficiente de Correlação de Spearman (rho). A correlação não tem unidade (adimencional). Varia de -1 a +1, sendo representado pelo ‘r’
Correlação negativa - Rxy < 0 Correlação positiva - Rxy > 0 Correlação Rxy = 0 - Não existe correlação Correlação não linear - o ‘r’ não mede esse tipo de correlação cor(x,y) measures the strength of the linear relationship between the X and Y dasta with stronger relationships as cor(x,y) heads towares -1 or 1.
Some facts about correlation Cor(X, Y) = Cor(Y, X)
???1 ??? Cor(X, Y) ??? 1
Cor(X, Y) =1 and Cor(X,Y) = -1 only when the X or Y observations fall perfectly on a positive or negative sloped line, respectively
Cor(X,Y) measures the strength of the linear relationship between the X and Y data, with stronger relationsships as Cor(X,Y) heads towards -1 or 1.
Cor(X,Y) = 0 implies no linear relationship.
Teste de Hipótese para Correlação Linear
H0: p = 0 - Significa que a correlação é igual a 0
H1: P <> 0 - Significa que a correlação é diferente de 0
Para ver isso, faz-se o teste T.
Analisa-se o p-valor do teste e, se o valor for maior que p > 0,05, aceita H0.
Se menor, rejeita-se H0. Ou seja, aceita H1.
Observação:
Correlação não é o mesmo que causa e efeito. Duas variáveis podem estar altamente correlacionadas e, no entanto, não haver entre elas, relação de causa e efeito.
Só faz-se a correção de Pearson se as variáveis forem lineares.
For all columns
```r
cor(longley, method = "pearson")
```
```
## GNP.deflator GNP Unemployed Armed.Forces Population
## GNP.deflator 1.0000000 0.9915892 0.6206334 0.4647442 0.9791634
## GNP 0.9915892 1.0000000 0.6042609 0.4464368 0.9910901
## Unemployed 0.6206334 0.6042609 1.0000000 -0.1774206 0.6865515
## Armed.Forces 0.4647442 0.4464368 -0.1774206 1.0000000 0.3644163
## Population 0.9791634 0.9910901 0.6865515 0.3644163 1.0000000
## Year 0.9911492 0.9952735 0.6682566 0.4172451 0.9939528
## Employed 0.9708985 0.9835516 0.5024981 0.4573074 0.9603906
## Year Employed
## GNP.deflator 0.9911492 0.9708985
## GNP 0.9952735 0.9835516
## Unemployed 0.6682566 0.5024981
## Armed.Forces 0.4172451 0.4573074
## Population 0.9939528 0.9603906
## Year 1.0000000 0.9713295
## Employed 0.9713295 1.0000000
```
For two specific columns
```r
with(longley, cor(Employed,GNP,method="spearman"))
```
```
## [1] 0.9852941
```
For all columns
```r
cor(longley, method = "spearman")
```
```
## GNP.deflator GNP Unemployed Armed.Forces Population
## GNP.deflator 1.0000000 0.9970588 0.6647059 0.2205882 0.9970588
## GNP 0.9970588 1.0000000 0.6382353 0.2235294 0.9941176
## Unemployed 0.6647059 0.6382353 1.0000000 -0.3411765 0.6852941
## Armed.Forces 0.2205882 0.2235294 -0.3411765 1.0000000 0.2264706
## Population 0.9970588 0.9941176 0.6852941 0.2264706 1.0000000
## Year 0.9970588 0.9941176 0.6852941 0.2264706 1.0000000
## Employed 0.9823529 0.9852941 0.5647059 0.2264706 0.9764706
## Year Employed
## GNP.deflator 0.9970588 0.9823529
## GNP 0.9941176 0.9852941
## Unemployed 0.6852941 0.5647059
## Armed.Forces 0.2264706 0.2264706
## Population 1.0000000 0.9764706
## Year 1.0000000 0.9764706
## Employed 0.9764706 1.0000000
```
For two specific columns
```r
with(longley, cor(Employed,GNP,method="pearson"))
```
```
## [1] 0.9835516
```
Análise passo-a-passo: Regressão Linear Múltipla
In a 2015 study, Tempski and associates examined a measurement they called Quality of Life among medical school students in Brazilian medical schools. They borrowed measurement scales from the World Health Organization, the Dundee Ready Education Environment Scale, and the Beck Depression Inventory to assess the dependent variable in potential relation to a number of predictor variables.
Variables
* DREEM: Social Self Perception (DREEM.S.SP)
* DREEM: Academic Self Perception (DREEM.A.SP)
* Resilience (Resilience)
* BDI (BDI)
* Age (Age)
* Med School Quality of Life (MS.QoL)
Neste exemplo, vamos analisar a influência de DREEM.S.SP, DREEM.A.SP, Resilience, BDI e Age sobre QoL.
res = read.csv("TempskiResilience.csv")
# or
# library(SDSFoundations)
# res = TempskiResilience
clin <- res[res$Group == "Clinical Sciences",] # Subset of students in the clinical sciences programm.
Regressão Linear Múltipla
Correlation Matrix
Correlation Matrix (Predictors)
Confidence Intervals
Standardized Betas
Proportion of Variance Accounted for
Variance Inflation Factor(Tolerância)
Diagnóstico: Residual vs. Fitted
Diagnóstico: Cook´s Distance
Regressão Linear Múltipla
Regression <- lm(MS.QoL ~ DREEM.S.SP + DREEM.A.SP + Resilience + BDI + Age, data=clin)
summary(Regression)
##
## Call:
## lm(formula = MS.QoL ~ DREEM.S.SP + DREEM.A.SP + Resilience +
## BDI + Age, data = clin)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5697 -0.7056 0.0725 0.8409 3.9567
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.674e+00 7.650e-01 6.110 2.05e-09 ***
## DREEM.S.SP 1.394e-01 1.801e-02 7.739 5.87e-14 ***
## DREEM.A.SP 3.670e-02 1.428e-02 2.571 0.010439 *
## Resilience 4.345e-05 6.135e-03 0.007 0.994353
## BDI -3.636e-02 1.067e-02 -3.409 0.000707 ***
## Age -2.904e-02 2.352e-02 -1.235 0.217592
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.229 on 485 degrees of freedom
## Multiple R-squared: 0.3337, Adjusted R-squared: 0.3269
## F-statistic: 48.59 on 5 and 485 DF, p-value: < 2.2e-16
Correlation Matrix
In fact, we can confirm the simultaneous impact of the independent variables by examining the correlation matrix.
vars <- c("MS.QoL", "DREEM.S.SP", "DREEM.A.SP", "Resilience", "BDI", "Age")
cor(clin[,vars], use="pairwise.complete.obs")
## MS.QoL DREEM.S.SP DREEM.A.SP Resilience BDI
## MS.QoL 1.00000000 0.54744658 0.40488523 0.3175923 -0.40809301
## DREEM.S.SP 0.54744658 1.00000000 0.57218064 0.4257209 -0.52053588
## DREEM.A.SP 0.40488523 0.57218064 1.00000000 0.4819358 -0.32875516
## Resilience 0.31759228 0.42572085 0.48193583 1.0000000 -0.56656655
## BDI -0.40809301 -0.52053588 -0.32875516 -0.5665665 1.00000000
## Age -0.07208173 -0.04819259 -0.09025951 0.0401443 -0.02372227
## Age
## MS.QoL -0.07208173
## DREEM.S.SP -0.04819259
## DREEM.A.SP -0.09025951
## Resilience 0.04014430
## BDI -0.02372227
## Age 1.00000000
Notice that there is a relationship between Absences and SAT_Score. That relatioship is what makes the slope values found in the multiple regression to be different from those found in the simple regression.
So, how do we interprete the coefficients of a multiple regression model? Well, each coefficient is the impact of that particular variable on the outcome variable while holding constant the other variables in the model. Effectively, these coefficients represent the unique impact of the variable that they represent.
Even better, everything you can draw from a simple regression model are still available in a multiple regression model, namely:
Which is perfectly fine once we have multiple things in the model.
Predictors in the Correlation Matrix
Examine the correlation matrixes for any significant predictors.
library(psych)
corr.test(clin[,vars], use="pairwise.complete.obs")
## Call:corr.test(x = clin[, vars], use = "pairwise.complete.obs")
## Correlation matrix
## MS.QoL DREEM.S.SP DREEM.A.SP Resilience BDI Age
## MS.QoL 1.00 0.55 0.40 0.32 -0.41 -0.07
## DREEM.S.SP 0.55 1.00 0.57 0.43 -0.52 -0.05
## DREEM.A.SP 0.40 0.57 1.00 0.48 -0.33 -0.09
## Resilience 0.32 0.43 0.48 1.00 -0.57 0.04
## BDI -0.41 -0.52 -0.33 -0.57 1.00 -0.02
## Age -0.07 -0.05 -0.09 0.04 -0.02 1.00
## Sample Size
## [1] 491
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## MS.QoL DREEM.S.SP DREEM.A.SP Resilience BDI Age
## MS.QoL 0.00 0.00 0.00 0.00 0.0 0.44
## DREEM.S.SP 0.00 0.00 0.00 0.00 0.0 0.86
## DREEM.A.SP 0.00 0.00 0.00 0.00 0.0 0.23
## Resilience 0.00 0.00 0.00 0.00 0.0 0.86
## BDI 0.00 0.00 0.00 0.00 0.0 0.86
## Age 0.11 0.29 0.05 0.37 0.6 0.00
##
## To see confidence intervals of the correlations, print with the short=FALSE option
corr.test(clin[,vars], use="pairwise.complete.obs")$r # The matrix of correlations
## MS.QoL DREEM.S.SP DREEM.A.SP Resilience BDI
## MS.QoL 1.00000000 0.54744658 0.40488523 0.3175923 -0.40809301
## DREEM.S.SP 0.54744658 1.00000000 0.57218064 0.4257209 -0.52053588
## DREEM.A.SP 0.40488523 0.57218064 1.00000000 0.4819358 -0.32875516
## Resilience 0.31759228 0.42572085 0.48193583 1.0000000 -0.56656655
## BDI -0.40809301 -0.52053588 -0.32875516 -0.5665665 1.00000000
## Age -0.07208173 -0.04819259 -0.09025951 0.0401443 -0.02372227
## Age
## MS.QoL -0.07208173
## DREEM.S.SP -0.04819259
## DREEM.A.SP -0.09025951
## Resilience 0.04014430
## BDI -0.02372227
## Age 1.00000000
corr.test(clin[,vars], use="pairwise.complete.obs")$n # Number of cases per correlation
## [1] 491
corr.test(clin[,vars], use="pairwise.complete.obs")$p # two tailed probability of t for each correlation. For symmetric matrices, p values adjusted for multiple tests are reported above the diagonal.
## MS.QoL DREEM.S.SP DREEM.A.SP Resilience BDI
## MS.QoL 0.00000e+00 0.0000000 0.000000e+00 3.434586e-12 0.000000e+00
## DREEM.S.SP 0.00000e+00 0.0000000 0.000000e+00 0.000000e+00 0.000000e+00
## DREEM.A.SP 0.00000e+00 0.0000000 0.000000e+00 0.000000e+00 5.409007e-13
## Resilience 5.72431e-13 0.0000000 0.000000e+00 0.000000e+00 0.000000e+00
## BDI 0.00000e+00 0.0000000 7.727152e-14 0.000000e+00 0.000000e+00
## Age 1.10661e-01 0.2865258 4.560823e-02 3.747404e-01 6.000112e-01
## Age
## MS.QoL 0.4426441
## DREEM.S.SP 0.8595773
## DREEM.A.SP 0.2280411
## Resilience 0.8595773
## BDI 0.8595773
## Age 0.0000000
corr.test(clin[,vars], use="pairwise.complete.obs")$t # value of t-test for each correlation
## MS.QoL DREEM.S.SP DREEM.A.SP Resilience BDI
## MS.QoL Inf 14.466165 9.791869 7.406480 -9.8848750
## DREEM.S.SP 14.466165 Inf 15.427876 10.404006 -13.4812048
## DREEM.A.SP 9.791869 15.427876 Inf 12.162901 -7.6977543
## Resilience 7.406480 10.404006 12.162901 Inf -15.2044151
## BDI -9.884875 -13.481205 -7.697754 -15.204415 Inf
## Age -1.598125 -1.066939 -2.004120 0.888441 -0.5247263
## Age
## MS.QoL -1.5981253
## DREEM.S.SP -1.0669392
## DREEM.A.SP -2.0041198
## Resilience 0.8884410
## BDI -0.5247263
## Age Inf
Confidence Intervals
And we can use the confint() funciton in R to again find the confidence intervals of each slope in the regression model.
confint(Regression)
## 2.5 % 97.5 %
## (Intercept) 3.170622049 6.17671584
## DREEM.S.SP 0.103999969 0.17477991
## DREEM.A.SP 0.008652434 0.06475334
## Resilience -0.012011893 0.01209879
## BDI -0.057318060 -0.01540232
## Age -0.075260330 0.01717907
Standardized Betas
We can even use the lmbeta() function to get at the concept of the standardized beta for every coefficient in the model, effectively freeing up everyone from their scale.
lmBeta(Regression)
## DREEM.S.SP DREEM.A.SP Resilience BDI Age
## 0.3873277406 0.1241752275 0.0003453895 -0.1665514046 -0.0461722487
Variance Inflation Factor(Tolerância)
Tolerance defines how much variance left over in this particular independent variable once I know all the other independence in the model.
library(car)
var1 = vif(Regression)
tolerancia = 1/var1
tolerancia
## DREEM.S.SP DREEM.A.SP Resilience BDI Age
## 0.5484335 0.5888836 0.5774369 0.5754930 0.9821422
Diagnóstico: Residual vs. Fitted
plot(Regression, which = 1)
Diagnóstico: Cook´s Distance
cutoff <- 4/(Regression$df)
plot(Regression, which=4, cook.levels=cutoff)
plot(Regression, which = 4, cook.levels = cutoff, id.n = 7)
Proportion of Variance Accounted for
#install.packages('SDSFoundations')
library(SDSFoundations)
pCorr(Regression)
## Partial_Corr Partial_Corr_sq Part_Corr Part_Corr_sq
## DREEM.S.SP 0.3315348929 1.099154e-01 0.286840575 8.227752e-02
## DREEM.A.SP 0.1159533555 1.344518e-02 0.095290520 9.080283e-03
## Resilience 0.0003215396 1.033877e-07 0.000262459 6.888471e-08
## BDI -0.1529677527 2.339913e-02 -0.126347985 1.596381e-02
## Age -0.0559705907 3.132707e-03 -0.045758123 2.093806e-03