How to

TABELAS

Tabelar uma variável categórica
Tabelar duas variáveis categóricas uma contra a outra
Tabelar o percentual de UMA variável categórica
Tabelar o percentual de DUAS variáveis categóricas
Tabelar o percentual da linha de DUAS variáveis categóricas
Tabelar o percentual da coluna de DUAS variáveis categóricas
Acrescentar margens a uma tabela com DUAS variavéis categóricas

GRÁFICOS

Gráfico de barras de distribuição - UMA VARIÁVEL CATEGÓRICA
Gráfico de barras de frequência - UMA VARIÁVEL CATEGÓRICA
Gráfico de barras de distribuição - DUAS VARIÁVEIS CATEGÓRICAS
Gráfico de barras de frequência - DUAS VARIÁVEIS CATEGÓRICAS
Histogram
Barplot
Plot
Scatter Plot
Boxplot
Legendas

TESTES DE HIPÓTESES

CHI-QUADRADO (Goodness of fit) - Avaliar se há diferença entre dados categóricos dada uma afirmativa de frequência
CHI-QUADRADO (Test of Independence) - Verificar se duas variáveis categóricas são independentes entre si
ONE-WAY ANOVA - Verificar se há diferença entre a média de 3 ou mais grupos de uma mesma variável

REGRESSÃO LINEAR

Regressão Linear Simples
Regressão Linear Múltipla

20171017102100

Tabelar o percentual da linha de DUAS variáveis categóricas

Basta usar o parâmetro 1 para o prop.table()

# For this tutorial we will be using the Austin City Limits dataset,
# data from artists perfoming on the show ACL live regarding 
# gender, name, age abd whether or not they won a grammy

acl = read.csv("AustinCityLimits.csv")

gtab2 = table(acl$Grammy, acl$Gender)
gtab2

##    
##      F  M
##   N 21 46
##   Y 14 35

prop.table(gtab2,1)

##    
##             F         M
##   N 0.3134328 0.6865672
##   Y 0.2857143 0.7142857

20171017102200

Tabelar o percentual da coluna de DUAS variáveis categóricas

Basta usar o parâmetro 2 para o prop.table()

# For this tutorial we will be using the Austin City Limits dataset,
# data from artists perfoming on the show ACL live regarding 
# gender, name, age abd whether or not they won a grammy

acl = read.csv("AustinCityLimits.csv")

gtab2 = table(acl$Grammy, acl$Gender)
gtab2

##    
##      F  M
##   N 21 46
##   Y 14 35

prop.table(gtab2,2)

##    
##             F         M
##   N 0.6000000 0.5679012
##   Y 0.4000000 0.4320988

20171017091000

Tabelar duas variáveis categóricas uma contra a outra

Contabiliza o total de ocorrências entre DUAS variáveis categóricas.

# For this tutorial we will be using the Austin City Limits dataset,
# data from artists perfoming on the show ACL live regarding 
# gender, name, age abd whether or not they won a grammy

acl = read.csv("AustinCityLimits.csv")

gtab2 = table(acl$Grammy, acl$Gender)
gtab2

##    
##      F  M
##   N 21 46
##   Y 14 35

20171017091400

Tabelar o percentual de UMA variável categórica

Tabela a distribuição de percentuais de UMA variável categórica.

# For this tutorial we will be using the Austin City Limits dataset,
# data from artists perfoming on the show ACL live regarding 
# gender, name, age abd whether or not they won a grammy

acl = read.csv("AustinCityLimits.csv")

gtab = table(acl$Grammy) 
gtab

## 
##  N  Y 
## 67 49

prop.table(gtab)

## 
##         N         Y 
## 0.5775862 0.4224138

20171017091800

Tabelar o percentual de DUAS variáveis categóricas

Tabela a distribuição de percentuais entre DUAS variáveIS categóricaS.

# For this tutorial we will be using the Austin City Limits dataset,
# data from artists perfoming on the show ACL live regarding 
# gender, name, age abd whether or not they won a grammy

acl = read.csv("AustinCityLimits.csv")

gtab2 = table(acl$Grammy, acl$Gender)
gtab2 # table of counts

##    
##      F  M
##   N 21 46
##   Y 14 35

# to see the proportions within a certain

prop.table(gtab2)

##    
##             F         M
##   N 0.1810345 0.3965517
##   Y 0.1206897 0.3017241

20171017090300

Tabelar uma variável categórica

Contabiliza o total de ocorrências de UMA variáveil categórica.

Counts the number of occurrences of categorical variables in a column.

# For this tutorial we will be using the Austin City Limits dataset,
# data from artists perfoming on the show ACL live regarding 
# gender, name, age abd whether or not they won a grammy

acl = read.csv("AustinCityLimits.csv")


gtab = table(acl$Grammy) 
gtab

## 
##  N  Y 
## 67 49

20171019100100

Acrescentar margens a uma tabela com DUAS variavéis categóricas

# For this tutorial we will be using the Austin City Limits dataset,
# data from artists perfoming on the show ACL live regarding 
# gender, name, age abd whether or not they won a grammy

acl = read.csv("AustinCityLimits.csv")

gtab2 = table(acl$Grammy, acl$Gender)
gtab2

##    
##      F  M
##   N 21 46
##   Y 14 35

# Margens horizontais e verticais (ambas)
addmargins(gtab2)

##      
##         F   M Sum
##   N    21  46  67
##   Y    14  35  49
##   Sum  35  81 116

# Apenas margem horizontal, totaliza as colunas (use omparametro 1).
addmargins(gtab2,1)

##      
##        F  M
##   N   21 46
##   Y   14 35
##   Sum 35 81

# Apenas margem vertical, totaliza as linhas (use omparametro 2)
addmargins(gtab2,2)

##    
##      F  M Sum
##   N 21 46  67
##   Y 14 35  49

20171019102200

Gráfico de barras da distribuição de UMA VARIÁVEL CATEGÓRICA.

# For this tutorial we will be using the Austin City Limits dataset,
# data from artists perfoming on the show ACL live regarding 
# gender, name, age abd whether or not they won a grammy

acl = read.csv("AustinCityLimits.csv")

gtab = table(acl$Grammy) 
gtab

## 
##  N  Y 
## 67 49

# Gráfico da tabela (barplot)
barplot(gtab) 
# Gráfico da variável categórica (plot)
plot(acl$Grammy)

20171026110500

Gráfico de barras da frequência de UMA VARIÁVEL CATEGÓRICA.

# For this tutorial we will be using the Austin City Limits dataset,
# data from artists perfoming on the show ACL live regarding 
# gender, name, age abd whether or not they won a grammy

acl = read.csv("AustinCityLimits.csv")
gtab = table(acl$Grammy) 
gtab.prop = prop.table(gtab)
gtab.prop

## 
##         N         Y 
## 0.5775862 0.4224138

# Frequency table
barplot(gtab.prop)

20171026111000

Gráfico de barras de distribuição - DUAS VARIÁVEIS CATEGÓRICAS

# For this tutorial we will be using the Austin City Limits dataset,
# data from artists perfoming on the show ACL live regarding 
# gender, name, age abd whether or not they won a grammy

acl = read.csv("AustinCityLimits.csv")
gtab2 = table(acl$Grammy, acl$Gender)
gtab2

##    
##      F  M
##   N 21 46
##   Y 14 35

# Contingency table

# Empilhado
barplot(gtab2, legend = T)

# Lado a lado
barplot(gtab2, beside=T, legend=T)

20171026111010

Gráfico de barras de frequência - DUAS VARIÁVEIS CATEGÓRICAS

# For this tutorial we will be using the Austin City Limits dataset,
# data from artists perfoming on the show ACL live regarding 
# gender, name, age abd whether or not they won a grammy

acl = read.csv("AustinCityLimits.csv")
gtab2 = table(acl$Grammy, acl$Gender)
gtab2

##    
##      F  M
##   N 21 46
##   Y 14 35

gtab = table(acl$Grammy) 
gtab2.frequencia = prop.table(gtab2, 2)
gtab2.frequencia

##    
##             F         M
##   N 0.6000000 0.5679012
##   Y 0.4000000 0.4320988

# aqui vale chamar a atenção para o seguinte, para fazer o gráfico de barras vertical, stacked ou side-by-side é importante que o eixo y some 100% para o gráfico fazer sentdio. é por isso que se usa o opção 2 no prop.table() de form aque a soma vertical dos valores dê 100%.



# Frequency table


# Empílhada
barplot(gtab2.frequencia, legend=T)

# Lado-a-lado
barplot(gtab2.frequencia, beside = T, legend=T)

20171026113600

TESTE CHI-QUADRADO - Avaliar se há diferença entre dados categóricos

Teste estatístico aplicado a dados categóricos para avaliar quão provável é que qualquer diferença observada aconteça ao acaso. É adequado para amostras não pareadas. Testa-se a hipótese nula afirmando-se que a distribuição de frequências de um certo evento observado em uma amostra é consistente com uma distribuição teórica particular. Os eventos considerados devem ser mutuamente excludentes e devem ter probabilidade total 1. Um caso comum para este teste é quando os eventos cobrem um valor de saída de uma variável categórica. Um simples exemplo é a hipótese de que um dado de 6 lados é “honesto” (isto é, todos os sei possíveis valores - 1, 2, 3, 4, 5 e 6 - são equiprováveis).

# For this tutorial we will be using the Austin City Limits dataset,
# data from artists perfoming on the show ACL live regarding 
# gender, name, age abd whether or not they won a grammy

acl = read.csv("AustinCityLimits.csv")

### Create a table to show the counts of each Grammy
gtab <- table(acl$Grammy)
gtab

## 
##  N  Y 
## 67 49

# esta tabela armazena a real distribuição de vencedores (Y) e perdedores (N) do Emmy. ( observe que ele tabela em ordem alfabética)

# Suponha que apareça a afirmação de que a distribuição entre vencedores seja:

# Vencedores (Y): 1/3 (0,33333333333333333333333333333333)
# Perdedores (N): 2/3 (0,66666666666666666666666666666667)

# A hipótes aqui é:
#H0 = Não há diferença entre a distribuição real e a afirmação)

# para testar


### Create a vector of expected proportions.
claimp <- c(2/3, 1/3)
claimp # here we have the claimed count

## [1] 0.6666667 0.3333333

### Using the chi-square test
# Run goodness of fit

chisq.test(gtab, p = claimp)

## 
##  Chi-squared test for given probabilities
## 
## data:  gtab
## X-squared = 4.1422, df = 1, p-value = 0.04183

# p < 0.05 rejeitamos a Hypótese 0. A afirmação não é verdeira. Há diferença.

# Para o teste do chi=quadrado ser válido 
# a assumption of suficiente sample size deve ser válida.

# essa assumption é a de que deve haver ao menos 5 registros na distribuião
# para testar isso usamos

chisq.test(gtab, p = claimp)$expected

##        N        Y 
## 77.33333 38.66667

# como há mais que 5 registos em cada categoria o chi.square test é aplicável e o resultado válido.

20171026115400

TESTE CHI-QUADRADO (Test of Independence) - Verrificar se duas variáveis categóricas são independentes entre si

Nesse caso, uma “observação” consiste dos valores de dois valores de saída e a hipótese nula de que a ocorrência desses valores de saída são estatisticamente independentes.

# For this tutorial we will be using the Austin City Limits dataset,
# data from artists perfoming on the show ACL live regarding 
# gender, name, age abd whether or not they won a grammy

acl = read.csv("AustinCityLimits.csv")

# para efeito de exemplo vamos verificar se as variáveis:

# Grammy - Se um artista ganhou ou não um grame
# e
# Age.Group - Faixa etária a que um artista pertence 

# São independentes entre sí´.

# Hipótese
# H0 - Variáveis são independentes


# Primeiro - Clicamos a contingency table (tabela de distribuição das variáveis)

grammyage <- table(acl$Grammy, acl$Age.Group)
grammyage

##    
##     Fifties or Older Forties Thirties Twenties
##   N               17      13       28        9
##   Y               15      17       12        5

# antes mesmo de realizarmos o chi.square test precisamos ver se a 
# Assumption of suficiente sample size foi atendida

chisq.test(grammyage)$expected

##    
##     Fifties or Older  Forties Thirties Twenties
##   N         18.48276 17.32759 23.10345 8.086207
##   Y         13.51724 12.67241 16.89655 5.913793

# percebemos que nenhum elemento da tabela é menor que cinco logo a regra foi atendida e podemos usar o chi.square test.

chisq.test(grammyage, correct = FALSE)

## 
##  Pearson's Chi-squared test
## 
## data:  grammyage
## X-squared = 5.5415, df = 3, p-value = 0.1362

##############
### IMPORTANTE
##############

# Como verificamos a Assumtion sabemos que o chi.square pode ser aplicado. 
# Entretanto a função chisa.test() tem embutida nela o default de que se a
# assunção não for atendida o test é realizado assim mesmo e corrigido..
# como não precisamos corrigior então devemos desabilitar essa função
# de correção que está por padrão ativada (true) donde o correct = FALSE

# como p > .05 não podemos rejeitar a H0
# em outras palavras não podemos dizer que vencer ou não um emmy
# é independente da faixa etária do artista.

20171125093800

ONE-WAY ANOVA

Here is the discription of how to performa a ONE-WAY ANOVA hypothesis test.

# Load the dataset
film <- read.csv("FilmData.csv", sep=",")

Here we have the distribuiton of days a film spent in theather for each genre

# Load the dataset
boxplot(film$Days ~ film$Genre, range=0)

If we wanted to test the following hypothesis: “Is the mean number of days spent in theaters equal across all four genres?”"

We could run a one-way ANOVA on this data.

1) Step One (Analysing the means)

Before we get to the anova funciton let´s see the aggregate() function that will help us look at the means of this variable across all four groups and also let us test one of the assumptions of the ANOVA. Aggregate() basically takes a numeric variable as a function of a categorical variable and will produce a certain descriptive statistic for every single group.

aggregate(Days ~ Genre, film, mean)

##              Genre     Days
## 1 Action/Adventure 140.0631
## 2        Animation 154.5714
## 3           Comedy 136.0588
## 4            Drama 138.1111

We can see that animations spent the longest number of days in the theathers while Comedies spent the least.

2) Step Two - Testing the assumption that the groups have equal variances

What a anova is going to tell us is if theses means are significantly different from each other or not. We can use aggregtate() to test the assumtpion of the ANOVA that is that the groups have equal variances we could ask instead for the mean for the standard deviation

aggregate(film$Days ~ film$Genre, film, sd) # not working, manually we have

##         film$Genre film$Days
## 1 Action/Adventure  38.35202
## 2        Animation  30.79639
## 3           Comedy  27.30492
## 4            Drama  37.45479

Assumption of equal variability:

Diz que a maior SD (Standard deviation) de uma categoria não deve ser maior que o dobro da SD (Standard Deviation) da menor categoria.

No caso a menor categoria é 54.60984. O dobro disso é 54.60984. A maior categoria é 38.35202, muito menor que o dobro. Assim, se a maior categoria é menor que o dobro da menor categoria então a asumption of equla variablilty is met.

3) Step three - Calculating the ANOVA

What a ANOVA is going to tell us is if these means are significant or not.

daysmodel <- aov(film$Days ~ film$Genre)

So if we submit that, we’re not going to see any output yet, because basically, R just assigned this ANOVA model to our object name, “daysmodel”. In order to retrieve the information from this model, we need tO pass it through one more function, which is called summary().

  summary(daysmodel)

##              Df Sum Sq Mean Sq F value Pr(>F)
## film$Genre    3   3162    1054   0.785  0.504
## Residuals   147 197278    1342

So passing our ANOVA model, “daysmodel”, through the summary() function, we’re going to get our ANOVA table displayed in the Console window.

So we can see here with a p-value greater than 0.05, we would fail to reject the null hypothesis that the mean of all four groups are equal to each other.

4) Step Four - What to do if the null hypothesis is rejected

So if it were the case that we had a significant p-value for this ANOVA, we would follow up this analysis with a post hoc analysis, where we find which means are different from each other.

So just to illustrate that process, I want to show you one more function, which will run the Tukey pairwise comparisons for every single group in your model.

All you have to do is give the object name, daysmodel, into this function, called TukeyHSD().

TukeyHSD(daysmodel)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = film$Days ~ film$Genre)
## 
## $`film$Genre`
##                                  diff       lwr      upr     p adj
## Animation-Action/Adventure  14.508366 -12.49138 41.50811 0.5036180
## Comedy-Action/Adventure     -4.004240 -28.79839 20.78991 0.9750409
## Drama-Action/Adventure      -1.951952 -34.94619 31.04229 0.9987000
## Comedy-Animation           -18.512605 -52.87019 15.84498 0.5012353
## Drama-Animation            -16.460317 -57.13357 24.21293 0.7193063
## Drama-Comedy                 2.052288 -37.19152 41.29610 0.9991003

And what it will output for you are every single pairwise comparison, the difference and the means, a lower and upper confidence interval for that difference, and then the p-value that is adjusted based on the Tukey adjustment. We can see that none of these pairwise comparisons are significant, which makes sense because our overall ANOVA wasn’t significant.

But if you did have a significant ANOVA, you would go into this table and see where those differences actually occur.

20171201063000

Você póde adicionar uma série de informações a um gráfico. Uma série de comandos de plotagem obedencem a mesma sintaxe, logo a mesma é ilustrativa e ampla. Você pode concatenar os parâmetros num único comando para gerar um gráfico cheio de informações.

# Carrega o dataset
animaldata = read.csv('AnimalData.csv', sep=',', header=T)

Adicionar um título

plot(animaldata$Sex, main="Title of the graph")

Adicionar o texto do eixo x

plot(animaldata$Sex, xlab="X axis of the graph")

Adicionar o texto do eixo y

plot(animaldata$Sex, ylab="y axis of the graph")

20171201064200

Histrograma

Histograma de uma variável numérica

animaldata = read.csv('AnimalData.csv', sep=',', header=T)
hist(animaldata$Age.Intake)

20171201064201

Barplot

Gráfico de barras de uma variável categórica precisa ser uma tabela.

animaldata = read.csv('AnimalData.csv', sep=',', header=T)
gtab = table(animaldata$Dog.Group)
barplot(gtab)

20171201064202

plot

Gráfico de uma variável categórica pode ser gerado direto da variável e vai se comportar como se fosse uma tabela.

animaldata = read.csv('AnimalData.csv', sep=',', header=T)
plot(animaldata$Dog.Group)

20180112083700

ScatterPlot

plot(mtcars$disp, mtcars$drat)

20171218140900

Boxplot

Boxplot simples.

film = read.csv("FilmData.csv", sep=",")

boxplot(film$Days)

Boxplot removendo outliers (o boxplot expande até o outlier na realidade).

film = read.csv("FilmData.csv", sep=",")

boxplot(film$Days, range=0)

Boxplot de uma variável agrupada por outra.

film = read.csv("FilmData.csv", sep=",")
boxplot(film$Days ~ film$Genre)

20180114110000

Regressão Linear Simples

Neste exemplo, vamos analisar a influência de BDI sobre QoL.

res = read.csv("TempskiResilience.csv")
clin <- res[res$Group == "Clinical Sciences",] # Subset of students in the clinical sciences programm.

Matriz de correlação básica
Modelo de Regressão
Intervalos de confiança
Betas padronizados
Linearidade
Residual vs. Fitted
Cook´s Distance

20180122104100

Modelo de Regressão

Use para analisar a influência da variável independente sobre a variável de interesse (dependente).

regressionSimple <- lm(QoL ~ BDI, data=clin)
summary(regressionSimple)

## 
## Call:
## lm(formula = QoL ~ BDI, data = clin)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6587 -0.6587  0.0687  0.7962  2.6138 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.544521   0.088521  96.526   <2e-16 ***
## BDI         -0.068137   0.007626  -8.935   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.159 on 489 degrees of freedom
## Multiple R-squared:  0.1404, Adjusted R-squared:  0.1386 
## F-statistic: 79.84 on 1 and 489 DF,  p-value: < 2.2e-16

20180122101101

Intervalos de confiança

Use para saber os intervalos de confiança para o slope e o intercept.

confint(regressionSimple)

##                   2.5 %      97.5 %
## (Intercept)  8.37059347  8.71844843
## BDI         -0.08311937 -0.05315364

20180122101102

Betas padronizados

Use para saber os valores da regressão se antes fosse feito um z-score dos valores de X e Y. Ou seja, fornece os betas padronizados do modelo.

library(SDSFoundations)
lmBeta(regressionSimple)

##        BDI 
## -0.3746403

20180122101104

Linearidade

Use para ver a relação de linearidade entre essas duas variáveis.

plot(clin$BDI, clin$QoL)
abline(regressionSimple)

20180122101106

Matriz de correlação básica

Use para verificar qual o grau da correlação entre essas duas variáveis.

vars <- c("QoL", "BDI")
cor(clin[,vars])

##            QoL        BDI
## QoL  1.0000000 -0.3746403
## BDI -0.3746403  1.0000000

20180122101107

Residual vs. Fitted

Usado para ver se os pontos se distribuem uniformemente em relação à linha vermelha.

plot(regressionSimple, which = 1)

20180122101108

Cook´s Distance

Use para visualizar os resgistros (outliers) que podem estar influenciando o modelo.

cutoff <- 4/regressionSimple$df

plot(regressionSimple, which = 4, cook.levels = cutoff)

plot(regressionSimple, which = 4, cook.levels = cutoff, id.n = 7)

20180117091400

Regressão Linear Múltipla

res = read.csv("TempskiResilience.csv")

# or
# library(SDSFoundations)
# res = TempskiResilience

clin <- res[res$Group == "Clinical Sciences",] # Subset of students in the clinical sciences programm.

Regressão Linear Múltipla
Correlation Matrix
Correlation Matrix (Predictors)
Confidence Intervals
Standardized Betas
Proportion of Variance Accounted for
Variance Inflation Factor(Tolerância)
Diagnóstico: Residual vs. Fitted
Diagnóstico: Cook´s Distance

20180124115800

Regressão Linear Múltipla

Regression <- lm(MS.QoL ~ DREEM.S.SP + DREEM.A.SP + Resilience + BDI + Age, data=clin)
summary(Regression)

## 
## Call:
## lm(formula = MS.QoL ~ DREEM.S.SP + DREEM.A.SP + Resilience + 
##     BDI + Age, data = clin)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5697 -0.7056  0.0725  0.8409  3.9567 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.674e+00  7.650e-01   6.110 2.05e-09 ***
## DREEM.S.SP   1.394e-01  1.801e-02   7.739 5.87e-14 ***
## DREEM.A.SP   3.670e-02  1.428e-02   2.571 0.010439 *  
## Resilience   4.345e-05  6.135e-03   0.007 0.994353    
## BDI         -3.636e-02  1.067e-02  -3.409 0.000707 ***
## Age         -2.904e-02  2.352e-02  -1.235 0.217592    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.229 on 485 degrees of freedom
## Multiple R-squared:  0.3337, Adjusted R-squared:  0.3269 
## F-statistic: 48.59 on 5 and 485 DF,  p-value: < 2.2e-16

20180124091501

Correlation Matrix

In fact, we can confirm the simultaneous impact of the independent variables by examining the correlation matrix.

vars <- c("MS.QoL", "DREEM.S.SP", "DREEM.A.SP", "Resilience", "BDI", "Age")
cor(clin[,vars], use="pairwise.complete.obs")

##                 MS.QoL  DREEM.S.SP  DREEM.A.SP Resilience         BDI
## MS.QoL      1.00000000  0.54744658  0.40488523  0.3175923 -0.40809301
## DREEM.S.SP  0.54744658  1.00000000  0.57218064  0.4257209 -0.52053588
## DREEM.A.SP  0.40488523  0.57218064  1.00000000  0.4819358 -0.32875516
## Resilience  0.31759228  0.42572085  0.48193583  1.0000000 -0.56656655
## BDI        -0.40809301 -0.52053588 -0.32875516 -0.5665665  1.00000000
## Age        -0.07208173 -0.04819259 -0.09025951  0.0401443 -0.02372227
##                    Age
## MS.QoL     -0.07208173
## DREEM.S.SP -0.04819259
## DREEM.A.SP -0.09025951
## Resilience  0.04014430
## BDI        -0.02372227
## Age         1.00000000

20180124093100

Predictors in the Correlation Matrix

Examine the correlation matrixes for any significant predictors.

library(psych)
corr.test(clin[,vars], use="pairwise.complete.obs")

## Call:corr.test(x = clin[, vars], use = "pairwise.complete.obs")
## Correlation matrix 
##            MS.QoL DREEM.S.SP DREEM.A.SP Resilience   BDI   Age
## MS.QoL       1.00       0.55       0.40       0.32 -0.41 -0.07
## DREEM.S.SP   0.55       1.00       0.57       0.43 -0.52 -0.05
## DREEM.A.SP   0.40       0.57       1.00       0.48 -0.33 -0.09
## Resilience   0.32       0.43       0.48       1.00 -0.57  0.04
## BDI         -0.41      -0.52      -0.33      -0.57  1.00 -0.02
## Age         -0.07      -0.05      -0.09       0.04 -0.02  1.00
## Sample Size 
## [1] 491
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##            MS.QoL DREEM.S.SP DREEM.A.SP Resilience BDI  Age
## MS.QoL       0.00       0.00       0.00       0.00 0.0 0.44
## DREEM.S.SP   0.00       0.00       0.00       0.00 0.0 0.86
## DREEM.A.SP   0.00       0.00       0.00       0.00 0.0 0.23
## Resilience   0.00       0.00       0.00       0.00 0.0 0.86
## BDI          0.00       0.00       0.00       0.00 0.0 0.86
## Age          0.11       0.29       0.05       0.37 0.6 0.00
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

corr.test(clin[,vars], use="pairwise.complete.obs")$r # The matrix of correlations

##                 MS.QoL  DREEM.S.SP  DREEM.A.SP Resilience         BDI
## MS.QoL      1.00000000  0.54744658  0.40488523  0.3175923 -0.40809301
## DREEM.S.SP  0.54744658  1.00000000  0.57218064  0.4257209 -0.52053588
## DREEM.A.SP  0.40488523  0.57218064  1.00000000  0.4819358 -0.32875516
## Resilience  0.31759228  0.42572085  0.48193583  1.0000000 -0.56656655
## BDI        -0.40809301 -0.52053588 -0.32875516 -0.5665665  1.00000000
## Age        -0.07208173 -0.04819259 -0.09025951  0.0401443 -0.02372227
##                    Age
## MS.QoL     -0.07208173
## DREEM.S.SP -0.04819259
## DREEM.A.SP -0.09025951
## Resilience  0.04014430
## BDI        -0.02372227
## Age         1.00000000

corr.test(clin[,vars], use="pairwise.complete.obs")$n # Number of cases per correlation

## [1] 491

corr.test(clin[,vars], use="pairwise.complete.obs")$p # two tailed probability of t for each correlation. For symmetric matrices, p values adjusted for multiple tests are reported above the diagonal.

##                 MS.QoL DREEM.S.SP   DREEM.A.SP   Resilience          BDI
## MS.QoL     0.00000e+00  0.0000000 0.000000e+00 3.434586e-12 0.000000e+00
## DREEM.S.SP 0.00000e+00  0.0000000 0.000000e+00 0.000000e+00 0.000000e+00
## DREEM.A.SP 0.00000e+00  0.0000000 0.000000e+00 0.000000e+00 5.409007e-13
## Resilience 5.72431e-13  0.0000000 0.000000e+00 0.000000e+00 0.000000e+00
## BDI        0.00000e+00  0.0000000 7.727152e-14 0.000000e+00 0.000000e+00
## Age        1.10661e-01  0.2865258 4.560823e-02 3.747404e-01 6.000112e-01
##                  Age
## MS.QoL     0.4426441
## DREEM.S.SP 0.8595773
## DREEM.A.SP 0.2280411
## Resilience 0.8595773
## BDI        0.8595773
## Age        0.0000000

corr.test(clin[,vars], use="pairwise.complete.obs")$t # value of t-test for each correlation

##               MS.QoL DREEM.S.SP DREEM.A.SP Resilience         BDI
## MS.QoL           Inf  14.466165   9.791869   7.406480  -9.8848750
## DREEM.S.SP 14.466165        Inf  15.427876  10.404006 -13.4812048
## DREEM.A.SP  9.791869  15.427876        Inf  12.162901  -7.6977543
## Resilience  7.406480  10.404006  12.162901        Inf -15.2044151
## BDI        -9.884875 -13.481205  -7.697754 -15.204415         Inf
## Age        -1.598125  -1.066939  -2.004120   0.888441  -0.5247263
##                   Age
## MS.QoL     -1.5981253
## DREEM.S.SP -1.0669392
## DREEM.A.SP -2.0041198
## Resilience  0.8884410
## BDI        -0.5247263
## Age               Inf

20180124091502

Confidence Intervals

And we can use the confint() funciton in R to again find the confidence intervals of each slope in the regression model.

confint(Regression)

##                    2.5 %      97.5 %
## (Intercept)  3.170622049  6.17671584
## DREEM.S.SP   0.103999969  0.17477991
## DREEM.A.SP   0.008652434  0.06475334
## Resilience  -0.012011893  0.01209879
## BDI         -0.057318060 -0.01540232
## Age         -0.075260330  0.01717907

20180124091503

Standardized Betas

We can even use the lmbeta() function to get at the concept of the standardized beta for every coefficient in the model, effectively freeing up everyone from their scale.

lmBeta(Regression)

##    DREEM.S.SP    DREEM.A.SP    Resilience           BDI           Age 
##  0.3873277406  0.1241752275  0.0003453895 -0.1665514046 -0.0461722487

20180124115500

Variance Inflation Factor(Tolerância)

Tolerance defines how much variance left over in this particular independent variable once I know all the other independence in the model.

library(car)

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

var1 = vif(Regression)
tolerancia = 1/var1
tolerancia

## DREEM.S.SP DREEM.A.SP Resilience        BDI        Age 
##  0.5484335  0.5888836  0.5774369  0.5754930  0.9821422

20180124120501

Diagnóstico: Residual vs. Fitted

plot(Regression, which = 1)

20180124120502

Diagnóstico: Cook´s Distance

cutoff <- 4/(Regression$df) 
plot(Regression, which=4, cook.levels=cutoff)

plot(Regression, which = 4, cook.levels = cutoff, id.n = 7)

20180125094900

Proportion of Variance Accounted for

#install.packages('SDSFoundations')
library(SDSFoundations)
pCorr(Regression)

##             Partial_Corr Partial_Corr_sq    Part_Corr Part_Corr_sq
## DREEM.S.SP  0.3315348929    1.099154e-01  0.286840575 8.227752e-02
## DREEM.A.SP  0.1159533555    1.344518e-02  0.095290520 9.080283e-03
## Resilience  0.0003215396    1.033877e-07  0.000262459 6.888471e-08
## BDI        -0.1529677527    2.339913e-02 -0.126347985 1.596381e-02
## Age        -0.0559705907    3.132707e-03 -0.045758123 2.093806e-03