1 Required Packages

These are the required packages for this session:

#install.packages("ggplot2")
library(ggplot2)
data(msleep)

In today’s session we will explore the dataset ‘msleep’ included in the ‘ggplot2’ package.

2 Inspecting data

msleep

## # A tibble: 83 x 11
##    name   genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##    <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
##  1 Cheet~ Acin~ carni Carn~ lc                  12.1      NA        NA      11.9
##  2 Owl m~ Aotus omni  Prim~ <NA>                17         1.8      NA       7  
##  3 Mount~ Aplo~ herbi Rode~ nt                  14.4       2.4      NA       9.6
##  4 Great~ Blar~ omni  Sori~ lc                  14.9       2.3       0.133   9.1
##  5 Cow    Bos   herbi Arti~ domesticated         4         0.7       0.667  20  
##  6 Three~ Brad~ herbi Pilo~ <NA>                14.4       2.2       0.767   9.6
##  7 North~ Call~ carni Carn~ vu                   8.7       1.4       0.383  15.3
##  8 Vespe~ Calo~ <NA>  Rode~ <NA>                 7        NA        NA      17  
##  9 Dog    Canis carni Carn~ domesticated        10.1       2.9       0.333  13.9
## 10 Roe d~ Capr~ herbi Arti~ lc                   3        NA        NA      21  
## # ... with 73 more rows, and 2 more variables: brainwt <dbl>, bodywt <dbl>

?msleep

## starting httpd help server ... done

That shows you the first 10 rows of the data, and some of its columns. It also gives another important piece of information: 83 x 11, meaning that the dataset has 83 rows (i.e. 83 observations) and 11 columns (with each column corresponding to a variable in the dataset).

There are however better methods for looking at the data. To view all 83 rows and all 11 variables, use:

You’ll notice that some cells have the value NA instead of a proper value. NA stands for Not Available, and is a placeholder used by R to point out missing data. In this case, it means that the value is unknown.

To find information about the data frame containing the data, some useful functions are:

head(msleep)

## # A tibble: 6 x 11
##   name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Cheetah Acin~ carni Carn~ lc                  12.1      NA        NA      11.9
## 2 Owl mo~ Aotus omni  Prim~ <NA>                17         1.8      NA       7  
## 3 Mounta~ Aplo~ herbi Rode~ nt                  14.4       2.4      NA       9.6
## 4 Greate~ Blar~ omni  Sori~ lc                  14.9       2.3       0.133   9.1
## 5 Cow     Bos   herbi Arti~ domesticated         4         0.7       0.667  20  
## 6 Three-~ Brad~ herbi Pilo~ <NA>                14.4       2.2       0.767   9.6
## # ... with 2 more variables: brainwt <dbl>, bodywt <dbl>

tail(msleep)

## # A tibble: 6 x 11
##   name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Tenrec  Tenr~ omni  Afro~ <NA>                15.6       2.3      NA       8.4
## 2 Tree s~ Tupa~ omni  Scan~ <NA>                 8.9       2.6       0.233  15.1
## 3 Bottle~ Turs~ carni Ceta~ <NA>                 5.2      NA        NA      18.8
## 4 Genet   Gene~ carni Carn~ <NA>                 6.3       1.3      NA      17.7
## 5 Arctic~ Vulp~ carni Carn~ <NA>                12.5      NA        NA      11.5
## 6 Red fox Vulp~ carni Carn~ <NA>                 9.8       2.4       0.35   14.2
## # ... with 2 more variables: brainwt <dbl>, bodywt <dbl>

dim(msleep)

## [1] 83 11

str(msleep)

## tibble [83 x 11] (S3: tbl_df/tbl/data.frame)
##  $ name        : chr [1:83] "Cheetah" "Owl monkey" "Mountain beaver" "Greater short-tailed shrew" ...
##  $ genus       : chr [1:83] "Acinonyx" "Aotus" "Aplodontia" "Blarina" ...
##  $ vore        : chr [1:83] "carni" "omni" "herbi" "omni" ...
##  $ order       : chr [1:83] "Carnivora" "Primates" "Rodentia" "Soricomorpha" ...
##  $ conservation: chr [1:83] "lc" NA "nt" "lc" ...
##  $ sleep_total : num [1:83] 12.1 17 14.4 14.9 4 14.4 8.7 7 10.1 3 ...
##  $ sleep_rem   : num [1:83] NA 1.8 2.4 2.3 0.7 2.2 1.4 NA 2.9 NA ...
##  $ sleep_cycle : num [1:83] NA NA NA 0.133 0.667 ...
##  $ awake       : num [1:83] 11.9 7 9.6 9.1 20 9.6 15.3 17 13.9 21 ...
##  $ brainwt     : num [1:83] NA 0.0155 NA 0.00029 0.423 NA NA NA 0.07 0.0982 ...
##  $ bodywt      : num [1:83] 50 0.48 1.35 0.019 600 ...

names(msleep)

##  [1] "name"         "genus"        "vore"         "order"        "conservation"
##  [6] "sleep_total"  "sleep_rem"    "sleep_cycle"  "awake"        "brainwt"     
## [11] "bodywt"

dim returns the numbers of rows and columns of the data frame.
str returns information about the 11 variables. Of particular importance are the data types of the variables, which tells us what kind of data we are dealing with (numerical, categorical, dates, or something else).
names returns a vector containing the names of the variables.

3 Descriptives

Generally, statistical variables can be:

Continuous variables that are numeric. They represent a measurable quantity. (i.e. Age, BMI, …).
Categorical variables that take on values that are names or labels (i.e. Sex, Colors, …).

3.1 Numerical data

A convenient way to get some descriptive statistics giving a summary of each variable is to use the summary function:

summary(msleep)

##      name              genus               vore              order          
##  Length:83          Length:83          Length:83          Length:83         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  conservation        sleep_total      sleep_rem      sleep_cycle    
##  Length:83          Min.   : 1.90   Min.   :0.100   Min.   :0.1167  
##  Class :character   1st Qu.: 7.85   1st Qu.:0.900   1st Qu.:0.1833  
##  Mode  :character   Median :10.10   Median :1.500   Median :0.3333  
##                     Mean   :10.43   Mean   :1.875   Mean   :0.4396  
##                     3rd Qu.:13.75   3rd Qu.:2.400   3rd Qu.:0.5792  
##                     Max.   :19.90   Max.   :6.600   Max.   :1.5000  
##                                     NA's   :22      NA's   :51      
##      awake          brainwt            bodywt        
##  Min.   : 4.10   Min.   :0.00014   Min.   :   0.005  
##  1st Qu.:10.25   1st Qu.:0.00290   1st Qu.:   0.174  
##  Median :13.90   Median :0.01240   Median :   1.670  
##  Mean   :13.57   Mean   :0.28158   Mean   : 166.136  
##  3rd Qu.:16.15   3rd Qu.:0.12550   3rd Qu.:  41.750  
##  Max.   :22.10   Max.   :5.71200   Max.   :6654.000  
##                  NA's   :27

For the text variables, this doesn’t provide any information at the moment. But for the numerical variables, it provides a lot of useful information.

Sometimes we want to compute just one of these, and other times we may want to compute summary statistics not included in summary. Let’s say that we want to compute some descriptive statistics for the sleep_total variable. To access a vector inside a data frame, we use a dollar sign: data_frame_name$vector_name. So to access the sleep_total vector in the msleep data frame, we write:

msleep$sleep_total

##  [1] 12.1 17.0 14.4 14.9  4.0 14.4  8.7  7.0 10.1  3.0  5.3  9.4 10.0 12.5 10.3
## [16]  8.3  9.1 17.4  5.3 18.0  3.9 19.7  2.9  3.1 10.1 10.9 14.9 12.5  9.8  1.9
## [31]  2.7  6.2  6.3  8.0  9.5  3.3 19.4 10.1 14.2 14.3 12.8 12.5 19.9 14.6 11.0
## [46]  7.7 14.5  8.4  3.8  9.7 15.8 10.4 13.5  9.4 10.3 11.0 11.5 13.7  3.5  5.6
## [61] 11.1 18.1  5.4 13.0  8.7  9.6  8.4 11.3 10.6 16.6 13.8 15.9 12.8  9.1  8.6
## [76] 15.8  4.4 15.6  8.9  5.2  6.3 12.5  9.8

Some examples of functions that can be used to compute descriptive statistics for this vector are:

mean(msleep$sleep_total)      # Mean

## [1] 10.43373

median(msleep$sleep_total)    # Median

## [1] 10.1

max(msleep$sleep_total)       # Max

## [1] 19.9

min(msleep$sleep_total)       # Min

## [1] 1.9

sd(msleep$sleep_total)        # Standard deviation

## [1] 4.450357

var(msleep$sleep_total)       # Variance

## [1] 19.80568

quantile(msleep$sleep_total)  # Various quantiles

##    0%   25%   50%   75%  100% 
##  1.90  7.85 10.10 13.75 19.90

To see how many animals sleep for more than 8 hours a day, we can use the following:

sum(msleep$sleep_total > 8)   # Frequency (count)

## [1] 61

mean(msleep$sleep_total > 8)  # Relative frequency (proportion)

## [1] 0.7349398

Now, let’s try to compute the mean value for the length of REM sleep for the animals:

mean(msleep$sleep_rem)

## [1] NA

The above call returns the answer NA. The reason is that there are NA values in the sleep_rem vector (22 of them, as we saw before). What we actually wanted was the mean value among the animals for which we know the REM sleep. In order to ignore NA’s in the computation, we set na.rm = TRUE in the function call:

mean(msleep$sleep_rem, na.rm = TRUE)

## [1] 1.87541

Note that the NA values have not been removed from msleep. Setting na.rm = TRUE simply tells R to ignore them in a particular computation, not to delete them.

We run into the same problem if we try to compute the correlation between sleep_total and sleep_rem:

cor(msleep$sleep_total, msleep$sleep_rem)

## [1] NA

A quick look at the documentation (?cor), tells us that the argument used to ignore NA values has a different name for cor - it’s not na.rm but use. The reason will become evident later on, when we study more than two variables at a time. For now, we set use = "complete.obs" to compute the correlation using only observations with complete data (i.e. no missing values):

cor(msleep$sleep_total, msleep$sleep_rem, use = “complete.obs”)

3.1.1 Types of plots for numerical data

Scatterplot

# Base R:
plot(msleep$sleep_total, msleep$sleep_rem)

# ggplot2:
ggplot(msleep, aes(x = sleep_total, y = sleep_rem)) + geom_point()

## Warning: Removed 22 rows containing missing values (geom_point).

by groups:

ggplot(msleep, aes(brainwt, sleep_total)) + 
      geom_point() +
      xlab("Brain weight (logarithmic scale)") +
      ylab("Total sleep time") +
      scale_x_log10() +
      facet_wrap(~ vore)

## Warning: Removed 27 rows containing missing values (geom_point).

Boxplots

Another option for comparing groups is boxplots.

# Base R:
boxplot(sleep_total ~ vore, data = msleep)

# ggplot2:
ggplot(msleep, aes(vore, sleep_total)) +
      geom_boxplot()

Histograms

Histograms are used to show the distribution of a continuous variable.

# Base R:
hist(msleep$sleep_total)

# ggplot2:
ggplot(msleep, aes(sleep_total)) +
      geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

3.2 Categorical data

Some of the variables in the database are categorical rather than numerical. It therefore makes no sense to compute means or largest values.

For categorical variables (often called factors in R), we can instead create a table showing the frequencies of different categories using table:

table(msleep$vore)

## 
##   carni   herbi insecti    omni 
##      19      32       5      20

To instead show the proportion of different categories, we can apply proportions to the table that we just created:

proportions(table(msleep$vore))

## 
##      carni      herbi    insecti       omni 
## 0.25000000 0.42105263 0.06578947 0.26315789

The table function can also be used to construct a cross table that shows the counts for different combinations of two categorical variables:

# Counts:
table(msleep$vore, msleep$conservation)

##          
##           cd domesticated en lc nt vu
##   carni    1            2  1  5  1  4
##   herbi    1            7  2 10  3  3
##   insecti  0            0  1  2  0  0
##   omni     0            1  0  8  0  0

# Proportions, per row:
proportions(table(msleep$vore, msleep$conservation),
            margin = 1)

##          
##                   cd domesticated         en         lc         nt         vu
##   carni   0.07142857   0.14285714 0.07142857 0.35714286 0.07142857 0.28571429
##   herbi   0.03846154   0.26923077 0.07692308 0.38461538 0.11538462 0.11538462
##   insecti 0.00000000   0.00000000 0.33333333 0.66666667 0.00000000 0.00000000
##   omni    0.00000000   0.11111111 0.00000000 0.88888889 0.00000000 0.00000000

# Proportions, per column:
proportions(table(msleep$vore, msleep$conservation),
            margin = 2)

##          
##                  cd domesticated        en        lc        nt        vu
##   carni   0.5000000    0.2000000 0.2500000 0.2000000 0.2500000 0.5714286
##   herbi   0.5000000    0.7000000 0.5000000 0.4000000 0.7500000 0.4285714
##   insecti 0.0000000    0.0000000 0.2500000 0.0800000 0.0000000 0.0000000
##   omni    0.0000000    0.1000000 0.0000000 0.3200000 0.0000000 0.0000000

We can also convert these variables into factor variables, and then use the summary() function.

msleep$vore <- factor(msleep$vore)
msleep$conservation <- factor(msleep$conservation)
summary(msleep)

##      name              genus                vore       order          
##  Length:83          Length:83          carni  :19   Length:83         
##  Class :character   Class :character   herbi  :32   Class :character  
##  Mode  :character   Mode  :character   insecti: 5   Mode  :character  
##                                        omni   :20                     
##                                        NA's   : 7                     
##                                                                       
##                                                                       
##        conservation  sleep_total      sleep_rem      sleep_cycle    
##  cd          : 2    Min.   : 1.90   Min.   :0.100   Min.   :0.1167  
##  domesticated:10    1st Qu.: 7.85   1st Qu.:0.900   1st Qu.:0.1833  
##  en          : 4    Median :10.10   Median :1.500   Median :0.3333  
##  lc          :27    Mean   :10.43   Mean   :1.875   Mean   :0.4396  
##  nt          : 4    3rd Qu.:13.75   3rd Qu.:2.400   3rd Qu.:0.5792  
##  vu          : 7    Max.   :19.90   Max.   :6.600   Max.   :1.5000  
##  NA's        :29                    NA's   :22      NA's   :51      
##      awake          brainwt            bodywt        
##  Min.   : 4.10   Min.   :0.00014   Min.   :   0.005  
##  1st Qu.:10.25   1st Qu.:0.00290   1st Qu.:   0.174  
##  Median :13.90   Median :0.01240   Median :   1.670  
##  Mean   :13.57   Mean   :0.28158   Mean   : 166.136  
##  3rd Qu.:16.15   3rd Qu.:0.12550   3rd Qu.:  41.750  
##  Max.   :22.10   Max.   :5.71200   Max.   :6654.000  
##                  NA's   :27

3.2.1 Types of plots for categorical data

Barcharts

Bar charts are discrete analogues to histograms, where the category counts are represented by bars. The code for creating them is:

# Base R
barplot(table(msleep$vore))

# ggplot2
ggplot(msleep, aes(vore)) +
      geom_bar()

4 Statistical Data Analysis

R has thousands of functions for running different statistical hypothesis tests.

4.1 Bivariate Analysis

Categorical variable - Categorical Variable (i.e. Sex, Race)
Continuous variable - Continuous variable (i.e. Age, BMI)

4.1.1 Bivariate analysis for categorical variables

4.1.1.1 Chi-squared Test of Independence

msleep.t <- table(msleep$vore, msleep$conservation)
chisq.test(msleep.t)

## Warning in chisq.test(msleep.t): Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  msleep.t
## X-squared = 15.751, df = 15, p-value = 0.3988

INTERPRETATION: Since the P-value (0.3988) is greater than the significance level (0.05), we cannot reject the null hypothesis. Thus, we conclude that there isn’t a relationship between the conservation status of the animals and the feeding type.

4.1.1.2 Relative Risk

# Example: Data from a report on the relationship between aspirin use and myocardial infarction.
library("epitools")
MI <- matrix(c(10933, 10845, 104, 189), nrow = 2)
dimnames(MI) <- list("Group" = c("Aspirin","Placebo"), "MI" = c("No","Yes"))
MI

##          MI
## Group        No Yes
##   Aspirin 10933 104
##   Placebo 10845 189

rr.out <- riskratio(MI)
rr.out$measure

##          risk ratio with 95% C.I.
## Group     estimate    lower    upper
##   Aspirin 1.000000       NA       NA
##   Placebo 1.817802 1.433031 2.305884

INTERPRETATION: The estimated risk of MI were 81.78% higher for the placebo group.

Notice: With a retrospective case-control data, direct calculations of the relative risk should not be performed, as the results are not meaningful. In these cases we use the Odds ratio measure.

4.1.2 Bivariate analysis for continuous variables

4.1.2.1 Correlation test

Correlation coefficients measure the strength of association between two variables.The sign and the absolute value of a correlation coefficient describe the direction and the magnitude of the relationship between two variables.

The value of a correlation coefficient, $\rho$, ranges between -1 and 1.
A positive correlation means that if one variable gets bigger, the other variable tends to get bigger ($\rho$ ~ 1).
A negative correlation means that if one variable gets bigger, the other variable tends to get smaller ($\rho$ ~ -1).
The weakest linear relationship is indicated by a correlation coefficient equal to 0.

How to calculate a correlation coefficient with R?

cor(msleep$sleep_rem, msleep$brainwt, use="complete.obs")

## [1] -0.2213348

Test for correlation:

cor.test(msleep$sleep_rem, msleep$brainwt, use="complete.obs")

## 
##  Pearson's product-moment correlation
## 
## data:  msleep$sleep_rem and msleep$brainwt
## t = -1.5393, df = 46, p-value = 0.1306
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.47556189  0.06701441
## sample estimates:
##        cor 
## -0.2213348

INTERPRETATION: Since the P-value (0.1306) is greater than the significance level (0.05), we cannot reject the null hypothesis. Hence, there are not statistically correlation between the total amount of hours of rem sleep and brain weight in kilograms.

Correlation is defined as the strength of the linear relationship of two variables. So it measures whether, if we increase or decrease one variable by a certain factor, the other variable will also increase by that same factor.

Association, on the other hand, has much less strict definitions, and can be used to explain a lot of different things, and to express any sort of connection between variables/factors.

4.1.2.2 Linear Regression

m1 <- lm(msleep$brainwt ~ msleep$bodywt, data=msleep)
summary(m1)

## 
## Call:
## lm(formula = msleep$brainwt ~ msleep$bodywt, data = msleep)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78804 -0.08422 -0.07634 -0.02839  2.06190 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.592e-02  4.821e-02   1.782   0.0804 .  
## msleep$bodywt 9.639e-04  5.027e-05  19.176   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3526 on 54 degrees of freedom
##   (27 observations deleted due to missingness)
## Multiple R-squared:  0.8719, Adjusted R-squared:  0.8696 
## F-statistic: 367.7 on 1 and 54 DF,  p-value: < 2.2e-16

INTERPRETATION: There is a significant effect. So, a change in a kg in the body weight decreases in 0.002 kg their brain weight. However, the proportion of variance in the dependent variable (brainwt) that is predictible from the independent variable (bodywt) is around less than 10%.

4.1.3 General Inference

4.1.3.1 Test to compare two variances

s.v.c <- subset(msleep, vore=="carni")
s.v.h <- subset(msleep, vore=="herbi")  
var.test(s.v.c$sleep_rem, s.v.h$sleep_rem)

## 
##  F test to compare two variances
## 
## data:  s.v.c$sleep_rem and s.v.h$sleep_rem
## F = 4.0789, num df = 9, denom df = 23, p-value = 0.006162
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##   1.493392 14.789090
## sample estimates:
## ratio of variances 
##           4.078912

INTERPRETATION: Since the P-value (0.006) is less thant the significance level (0.05), we can reject the null hypothesis. Hence, variances are not statistically equal.

4.1.3.2 Two sample t-test

t.test(s.v.c$sleep_rem, s.v.h$sleep_rem)

## 
##  Welch Two Sample t-test
## 
## data:  s.v.c$sleep_rem and s.v.h$sleep_rem
## t = 1.4935, df = 10.888, p-value = 0.1637
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4390548  2.2857215
## sample estimates:
## mean of x mean of y 
##  2.290000  1.366667

t.test(s.v.c$sleep_rem, s.v.h$sleep_rem, var.equal=ifelse(var.test(s.v.c$sleep_rem, s.v.h$sleep_rem)$p.value >0.05, TRUE, FALSE))

## 
##  Welch Two Sample t-test
## 
## data:  s.v.c$sleep_rem and s.v.h$sleep_rem
## t = 1.4935, df = 10.888, p-value = 0.1637
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4390548  2.2857215
## sample estimates:
## mean of x mean of y 
##  2.290000  1.366667

INTERPRETATION: Since the P-value (0.1637) is greater than the significance level (0.05), we cannot reject the null hypothesis. Hence, means are statistically equal.

4.1.3.3 Normality test

shapiro.test(msleep$sleep_rem)

## 
##  Shapiro-Wilk normality test
## 
## data:  msleep$sleep_rem
## W = 0.87866, p-value = 2.078e-05

INTERPRETATION: Since the P-value (2.078e-05) is less than the significance level (0.05), we can reject the null hypothesis. Hence, the total amount of rem sleep houris is not normally distributed.

4.1.4 General Linear Models

4.1.4.1 Simple Linear Regression

m1 <- glm(brainwt ~ bodywt, data=msleep)
summary(m1)

## 
## Call:
## glm(formula = brainwt ~ bodywt, data = msleep)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.78804  -0.08422  -0.07634  -0.02839   2.06190  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.592e-02  4.821e-02   1.782   0.0804 .  
## bodywt      9.639e-04  5.027e-05  19.176   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1243423)
## 
##     Null deviance: 52.4361  on 55  degrees of freedom
## Residual deviance:  6.7145  on 54  degrees of freedom
##   (27 observations deleted due to missingness)
## AIC: 46.14
## 
## Number of Fisher Scoring iterations: 2

4.1.4.2 Simple Logistic Regression

msleepsubset <- msleep[msleep$vore == "carni" | msleep$vore == "herbi", ]
m2 <- glm(vore ~ bodywt, data=msleepsubset, family=binomial)
summary(m2)

## 
## Call:
## glm(formula = vore ~ bodywt, family = binomial, data = msleepsubset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5869  -1.3585   0.9608   1.0072   1.0074  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.4140627  0.3060479   1.353    0.176
## bodywt      0.0006388  0.0008417   0.759    0.448
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67.350  on 50  degrees of freedom
## Residual deviance: 65.972  on 49  degrees of freedom
##   (7 observations deleted due to missingness)
## AIC: 69.972
## 
## Number of Fisher Scoring iterations: 5

4.1.4.3 Multiple Linear Regression

mli <- glm(brainwt ~ bodywt+sleep_rem+awake, data=msleep)
summary(mli)

## 
## Call:
## glm(formula = brainwt ~ bodywt + sleep_rem + awake, data = msleep)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.21778  -0.07603  -0.03778   0.01021   1.16733  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -0.1211979  0.1918975  -0.632  0.53093   
## bodywt       0.0008430  0.0002782   3.030  0.00409 **
## sleep_rem    0.0124268  0.0349919   0.355  0.72419   
## awake        0.0123744  0.0103746   1.193  0.23936   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.03929021)
## 
##     Null deviance: 2.4805  on 47  degrees of freedom
## Residual deviance: 1.7288  on 44  degrees of freedom
##   (35 observations deleted due to missingness)
## AIC: -13.324
## 
## Number of Fisher Scoring iterations: 2

4.1.4.4 Multiple Logistic Regression

mlo <- glm(vore ~ bodywt+sleep_rem+awake, data=msleepsubset, family=binomial)
summary(mlo)

## 
## Call:
## glm(formula = vore ~ bodywt + sleep_rem + awake, family = binomial, 
##     data = msleepsubset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0028  -0.4869   0.4305   0.7280   1.5219  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) 12.7337858  5.1813126   2.458   0.0140 *
## bodywt       0.0009856  0.0022703   0.434   0.6642  
## sleep_rem   -2.3696908  0.9740284  -2.433   0.0150 *
## awake       -0.5350938  0.2469066  -2.167   0.0302 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 41.194  on 33  degrees of freedom
## Residual deviance: 30.769  on 30  degrees of freedom
##   (24 observations deleted due to missingness)
## AIC: 38.769
## 
## Number of Fisher Scoring iterations: 5

Materials adapted from: http://modernstatisticswithr.com/thebasics.html#descstats

Basic Statistics with R

Natalia Vilor-Tejedor (nvilor@barcelonabeta.org)