# packages to load (if they are already installed)
library(ggplot2)
library(lattice)
library(car)
## Loading required package: carData
library(cowplot)
## 
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggplot2':
## 
##     ggsave
library(EnvStats)
## 
## Attaching package: 'EnvStats'
## The following object is masked from 'package:car':
## 
##     qqPlot
## The following objects are masked from 'package:stats':
## 
##     predict, predict.lm
## The following object is masked from 'package:base':
## 
##     print.default
library(data.table)

Descriptive Statistics

R have a large number of ways to calculate descriptive statistics on the datasets, some are included in the basic installation, and others in packages that need download-installation (using install.packages) and load (activation) to the environment (using library).

Using summary

Let start using the basic procedures (function summary), with the GaltonFamilies dataset.

# create a data frame with the CSV file and make it available for the procedure
famH <- read.csv("GaltonFamilies.csv")
head(famH)
# create a list of variables to analyze
myvars <- c("father", "mother", "childHeight")

# use summary with the selected variables
summary(famH[myvars])
##      father         mother       childHeight   
##  Min.   :62.0   Min.   :58.00   Min.   :56.00  
##  1st Qu.:68.0   1st Qu.:63.00   1st Qu.:64.00  
##  Median :69.0   Median :64.00   Median :66.50  
##  Mean   :69.2   Mean   :64.09   Mean   :66.75  
##  3rd Qu.:71.0   3rd Qu.:65.88   3rd Qu.:69.70  
##  Max.   :78.5   Max.   :70.50   Max.   :79.00
Using function and sapply

A more general way to calculate basic statistics is using the sapply procedure, with the syntax:

sapply(x, FUN, options))

where FUN is a simple system function (like mean(var), sd(var), et c.) or an user-defined function, with the following syntax:

myfunction <- function(arg1, arg2, … ){
statements
return(object)
}

# defining function
mystats <- function(x){
                m <- mean(x)
                md <- median(x)
                n <- length(x)
                s <- sd(x)
                return(c(n=n, mean=m, median=md, stdev=s))
}
# arguments (variables) to use
myvars <- c("father", "mother", "childHeight")
# sapply function on dataset
sapply(famH[myvars], mystats)
##            father     mother childHeight
## n      934.000000 934.000000  934.000000
## mean    69.197109  64.089293   66.745931
## median  69.000000  64.000000   66.500000
## stdev    2.476479   2.290886    3.579251
Using aggregate

When you have variables that can be considered as factor, you can use the aggregate function to obtain basic statistics aggregating by such factors. The function has the following syntax:

aggregate(x, by, FUN)

#read data and attach object for easier use of the variables
famH <- read.csv('GaltonFamilies.csv')
tail(famH)
#calculate the mean of numeric variables, aggregated by gender and number of children as factors; missing values are removed
aggdata <- aggregate(famH$childHeight, by = list(famH$gender, famH$children), mean)
aggdata
#set names for resulting data table - what happen if we skip this line of code?
aggdata <- setNames(aggdata, c("Gender","Number of children","media"))
aggdata
Using a data.table

The data.table package allows for an improved functionality of data.frames operations. We will calculate several statistics in the GaltonFamilies data, using aggregation. First, we have to convert a data.frame into a data.table. Thereafter you can select subgroups and apply functions to them.

#conversion of data.frame to data.table
famDT <- data.table(famH)
class(famDT)
## [1] "data.table" "data.frame"
#mean, median standar deviation by gender, number of children
famBS <- famDT[, list(Media=mean(childHeight), Mediana=median(childHeight), StDev=sd(childHeight)), by=list(Gender=gender, Num.children=children)]
famBS
#ordering results
famBS[order(Num.children, Gender), ]
where to look in the book?

chapters 11, 18


Introduction to graphs

R is a great platform for building graphs. Literally, in a typical interactive session, you build a graph one statement at a time, adding features, until you have what you want.
The base graphics system of R is described at the beginning of chapter 7 in the Lander’s book. Two other systems, that are widely used, and provide extensive options are lattice and ggplot2. We will be mostly using the base graphics system and ggplot2.

The primary graph for a variable: the histogram

Histograms display the distribution of a continuous variable by dividing the range of scores into a specified number of bins on the x-axis and displaying the frequency of scores in each bin on the y-axis.

Basic histograms

The following chunk produces four histograms, using the R base graphics system.

#reading data on Melocactus intortus
melodata <- read.csv("melocactus.csv", header = TRUE)
melodata
#this is to produce a 2x2 graphs arrangement
par(mfrow=c(2,2))

# the most basic form
hist(melodata$alturatotal)  

# controlling some aspects of the histogram
hist(melodata$alturatotal,
     breaks=12,
     col="green",
     xlab="Altura de la planta, cm",
     main="Colored histogram with 12 bins")

# with a density curve and rug plot 
hist(melodata$alturatotal,
     freq=FALSE,
     breaks=12,
     col="green",
     xlab="Altura de la planta, cm",
     main="Histogram, rug plot, density curve")
rug(jitter(melodata$alturatotal))
lines(density(melodata$alturatotal), col="red", lwd=2)

# including a normal curve based on the data
x <- melodata$alturatotal
h<-hist(x,
        breaks=12,
        col="green",
        xlab="Altura de la planta, cm",
        main="Histogram with normal curve and box")
xfit<-seq(min(x), max(x), length=80)
yfit<-dnorm(xfit, mean=mean(x), sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="red", lwd=2)
box()

Introducing ggplot2: building a histogram

Now we are going to build a histogram using ggplot2. ggplot2 provides a system for creating graphs based on the grammar of graphics. The intention of the ggplot2 package is to provide a comprehensive, grammar-based system for generating graphs in a unified and coherent manner, allowing users to create new and innovative data visualizations. The power of this approach has led ggplot2 to become an important tool for visualizing data using R.

First, let see a basic ggplot2 histogram:

melodata <- read.csv("melocactus.csv", header = TRUE)

ggplot(melodata, aes(alturatotal))+
  geom_histogram(color="white", bins = 14)

Now a more detailed histogram, including several layers:

hist.melodata <- ggplot(melodata, aes(alturatotal)) + 
  geom_histogram(aes(y=..density..), bins = 14, colour="white", fill="green") +
  geom_rug(sides = "t", color = "black") +
  labs(x="Altura total de la planta,cm", y = "Density") +
  stat_function(fun = dnorm, 
                args = list(mean = mean(melodata$alturatotal, na.rm = TRUE), 
                            sd = sd(melodata$alturatotal, na.rm = TRUE)), 
                colour = "red", size = 1)
hist.melodata

The anatomy of a box-and-whiskers plot

A box-and-whiskers plot describes the distribution of a continuous variable by plotting its five-number summary: the minimum, lower quartile (25th percentile), median (50th percentile), upper quartile (75th percentile), and maximum. It can also display observations that may be outliers (values outside the range of ± 1.5*IQR, where IQR is the interquartile range defined as the upper quartile minus the lower quartile). By default, each whisker extends to the most extreme data point, which is no more than 1.5 times the interquartile range for the box. Values outside this range are depicted as dots.


Now we are going to analyze the Melocactus data using a box-plot.

ggplot(melodata, aes(y = alturatotal, x = "0")) +
  geom_boxplot(default = FALSE, fill="cornflowerblue", color="black") +
  geom_point(position = "jitter", size = 0.5, color="blue", alpha=.5) +
  labs(x = "Melocactus intortus", y = "Altura total de la planta, cm")
## Warning: Ignoring unknown parameters: default

Related to box-plots are the violin plots; the violin plots provide more visual cues as to the distribution of scores over the range of heights.

ggplot(melodata, aes(x="0", y=alturatotal)) +
  geom_violin(fill="lightblue") +
  geom_point(position = "jitter", color = "blue", alpha = 0.3) +
  labs(x = "Melocactus intortus", y = "Altura total de la planta, cm")

Exercises

Build a graph, with the Melocactus data, that combines the box-plot and violin plot.

Bar graphs for categorical variables

Counts of cases for categorical variable are usually presented using bar graphs. Here we use data from a clinical trial of a treatment for arthritis, comparing the outcomes for treated individuals versus individuals receiving a placebo.

load("Arthr.Rdata")
head(Arthritis)
A <- ggplot(Arthritis, aes(x=Treatment, fill=Improved)) +
     geom_bar(position="stack")
B <- ggplot(Arthritis, aes(x=Treatment, fill=Improved)) +
     geom_bar(position="dodge")
C <- ggplot(Arthritis, aes(x=Treatment, fill=Improved)) +
     geom_bar(position="fill")
#using the package cowplot to group graphs built separatelly
plot_grid(A, B, C, ncol = 2, labels = "AUTO")

Exercise

Try to improve the graphs, with smaller fonts.

where to look in the book?

chapter 7


Statistical tests and graphs for the relationship between two variables

In this section we will consider the testing of statistical hypothesis for the relationship of two or more variables, and the graphs that will help in the interpretation of the results.

Regression analysis

Regression analysis is a broad term for a set of methodologies used to predict a response variable (also called a dependent, criterion, or outcome variable) from one or more predictor variables (also called independent or explanatory variables). In general, regression analysis can be used to identify the explanatory variables that are related to a response variable, to describe the form of the relationships involved, and to provide an equation for predicting the response variable from the explanatory variables.

Simple linear regression

When the regression model contains one dependent variable and one independent variable, the approach is called simple linear regression.

The data set women in the base installation provides the height and weight for a set of 15 women ages 30 to 39. Suppose you want to predict weight from height. Having an equation for predicting weight from height can help you to identify overweight or underweight individuals.

But first, we have to prove that our data are close to a normal distribution, an assumption of the parametric methods, like linear regression. First we will build a Q-Q plot that will show how good is the agreement between the quantiles of the data (using z transformation) and the theoretical quantile of the corresponding normal distribution (\(\mu = 0, \sigma = 1\))

#Q-Q plot
women
qqPlot(women$weight, add.line = TRUE, points.col = "blue", line.col = "red")

qqPlot(women$height, add.line = TRUE, points.col = "blue", line.col = "red")

Now we will test normality using the Shapiro-Wilk test, with the null hypothesis that data are normally distributed.

shapiro.test(women$weight)
## 
##  Shapiro-Wilk normality test
## 
## data:  women$weight
## W = 0.96036, p-value = 0.6986
shapiro.test(women$height)
## 
##  Shapiro-Wilk normality test
## 
## data:  women$height
## W = 0.96359, p-value = 0.7545

Now we can proceed with the parametric regression analysis:

fit <- lm(weight ~ height, data=women)
summary(fit)
## 
## Call:
## lm(formula = weight ~ height, data = women)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14
#graph with regression line, confidence interval (95%)
ggplot(data=women, aes(x=women$height, y=women$weight)) +
  geom_point(pch=19, color="blue", size=2) +
  geom_smooth(method="lm", color="red", linetype=2) +
  labs(x="Height, inches", y="Weight, lb")

#graph: residuals
plot(women$height,residuals(fit),
     xlab="Height (in inches)",
     ylab="residuals")

The resulting predictive equation is:


From the Pr(>|t|) column, you see that the regression coefficient (3.45) is significantly different from zero (p < 0.001) and indicates that there is an expected increase of 3.45 pounds of weight for every 1 inch increase in height. The multiple R-squared (0.991) indicates that the model accounts for 99.1% of the variance in weights. The residual standard error (1.53 pounds) can be thought of as the average error in predicting weight from height using this model. The F statistic tests whether the predictor variables, taken together, predict the response variable above chance levels.

Polynomial regression

The residuals plot from the linear regression, indicates a non-random distribution of the prediction error of the linear model. You might be able to improve your prediction using a regression with a quadratic term (that is, X2), so we have now a polynomial regression

fit2 <- lm(weight ~ height + I(height^2), data=women)
summary(fit2)
## 
## Call:
## lm(formula = weight ~ height + I(height^2), data = women)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50941 -0.29611 -0.00941  0.28615  0.59706 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 261.87818   25.19677  10.393 2.36e-07 ***
## height       -7.34832    0.77769  -9.449 6.58e-07 ***
## I(height^2)   0.08306    0.00598  13.891 9.32e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3841 on 12 degrees of freedom
## Multiple R-squared:  0.9995, Adjusted R-squared:  0.9994 
## F-statistic: 1.139e+04 on 2 and 12 DF,  p-value: < 2.2e-16
ggplot(women, aes(x = women$height, y = women$weight)) +
  geom_point(aes(x = women$height, y = women$weight), color ="blue") +
  stat_smooth(aes(), method = "lm", formula = y ~ poly(x,2), color ="red", linetype=2, se =TRUE, size = 1) +
  labs(x="Height, inches", y="Weight, lb.")

plot(women$height,residuals(fit2),
     xlab="Height (in inches)",
     ylab="residuals")

The new polynomial equation is:


The amount of variance accounted for has increased to 99.9%. The significance of the squared term (t = 13.89, p < .001) suggests that inclusion of the quadratic term improves the model fit. The prediction error (residuals) is smaller, and their distribution looks more random.

where to look in the book?

chapter 19

Extra

Bubble plots: the third dimension in two dimensions

A bubble plot is a scatterplot where a third dimension is added: the value of an additional variable is represented through the size of the dots. You need three numerical variables as input: one is represented by the X axis, one by the Y axis, and one by the size.

Using the GaltonFamilies.csv dataset we will look for the relationship between child height (childHeight), and father height (father), but also looking for the variable mother height (mother), in the same graph.

famH <- read.csv("GaltonFamilies.csv", header = TRUE)
head(famH)
ggplot(data=famH, aes(x=father, y=childHeight)) +
  geom_point(aes(size=mother),colour = "lightblue") +
  scale_size_continuous(range = c(-2,3), name = "Mother height, inches") +
  labs(x="Father height, inches", y="Child height, inches")

References

Kabacoff, R., 2015. R in action: data analysis and graphics with R, Second edition. ed. Manning, Shelter Island.

Verzani, J., 2012. Getting started with RStudio. O’Reilly, Sebastopol, Calif.