# packages to load (if they are already installed)
library(ggplot2)
library(lattice)
library(car)
## Loading required package: carData
library(carData)
library(cowplot)
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggplot2':
##
## ggsave
library(EnvStats)
##
## Attaching package: 'EnvStats'
## The following object is masked from 'package:car':
##
## qqPlot
## The following objects are masked from 'package:stats':
##
## predict, predict.lm
## The following object is masked from 'package:base':
##
## print.default
library(data.table)
R has a large number of ways to calculate descriptive statistics on datasets; some are included in the base installation, and others come in packages that must be downloaded and installed (using install.packages()) and then loaded (activated) into the environment (using library()).
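For example, a package such as data.table is installed once and then loaded in every session; a minimal sketch (the install line is commented out because it only needs to run once):
# install.packages("data.table")  # download and install from CRAN (run once)
library(data.table)               # load (activate) the package in this session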
Let's start with the basic procedures (the summary() function), using the mtcars dataset.
# create a data frame with the CSV file and make it available for the procedure
mtcars <- read.csv("mtcars.csv")
mtcars
# create a list of variables to analyze
myvars <- c("mpg", "hp", "wt")
# use summary with the selected variables
summary(mtcars[myvars])
## mpg hp wt
## Min. :10.40 Min. : 52.0 Min. :1.513
## 1st Qu.:15.43 1st Qu.: 96.5 1st Qu.:2.581
## Median :19.20 Median :123.0 Median :3.325
## Mean :20.09 Mean :146.7 Mean :3.217
## 3rd Qu.:22.80 3rd Qu.:180.0 3rd Qu.:3.610
## Max. :33.90 Max. :335.0 Max. :5.424
A more general way to calculate basic statistics is the sapply() function, with the syntax:
sapply(x, FUN, options)
where FUN is a built-in function (like mean(), sd(), etc.) or a user-defined function, with the following syntax:
myfunction <- function(arg1, arg2, … ){
statements
return(object)
}
# defining function
mystats <- function(x){
m <- mean(x)
md <- median(x)
n <- length(x)
s <- sd(x)
return(c(n=n, mean=m, median=md, stdev=s))
}
# arguments (variables) to use
myvars <- c("mpg", "hp", "wt")
# sapply function on dataset
sapply(mtcars[myvars], mystats)
## mpg hp wt
## n 32.000000 32.00000 32.0000000
## mean 20.090625 146.68750 3.2172500
## median 19.200000 123.00000 3.3250000
## stdev 6.026948 68.56287 0.9784574
When you have variables that can be treated as factors, you can use the aggregate() function to obtain basic statistics aggregated by those factors. The function has the following syntax:
aggregate(x, by, FUN)
#read the data and inspect the first rows
mtcars <- read.csv('mtcars.csv')
head(mtcars)
#calculate the mean of mpg, aggregated by cyl and am as factors
aggdata <- aggregate(mtcars$mpg, by = list(mtcars$cyl, mtcars$am), mean)
aggdata
#set names for the resulting data frame - what happens if we skip this line of code?
aggdata <- setNames(aggdata, c("CYL","A=0/M=1","media"))
aggdata
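(If we skip the setNames() line, aggregate() keeps its default column names: Group.1 and Group.2 for the grouping variables, and x for the computed mean.)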
The data.table package provides improved functionality for data.frame operations. We will calculate several statistics on the mtcars data, using aggregation. First, we have to convert the data.frame into a data.table; thereafter you can select subgroups and apply functions to them.
#conversion of data.frame to data.table
mtcarsDT <- data.table(mtcars)
class(mtcarsDT)
## [1] "data.table" "data.frame"
#mean, median, standard deviation by cyl, am
mtcarsBS <- mtcarsDT[, list(Media=mean(mpg), Mediana=median(mpg), StDev=sd(mpg)), by=list(CYL=cyl, A.M=am)]
mtcarsBS
#ordering results
mtcarsBS[order(A.M, CYL), ]
Reading: chapters 11 and 18.
R is a great platform for building graphs. In a typical interactive session, you build a graph one statement at a time, adding features until you have what you want.
The base graphics system of R is described at the beginning of chapter 7 in Lander's book. Two other widely used systems that provide extensive options are lattice and ggplot2. We will mostly be using the base graphics system and ggplot2.
Histograms display the distribution of a continuous variable by dividing the range of scores into a specified number of bins on the x-axis and displaying the frequency of scores in each bin on the y-axis.
The following chunk produces four histograms, using the R base graphics system.
#reading data on Melocactus intortus
melodata <- read.csv("melocactus.csv", header = TRUE)
melodata
#produce a 2x2 arrangement of graphs
par(mfrow=c(2,2))
# the most basic form
hist(melodata$alturatotal)
# controlling some aspects of the histogram
hist(melodata$alturatotal,
breaks=12,
col="green",
xlab="Altura de la planta, cm",
main="Colored histogram with 12 bins")
# with a density curve and rug plot
hist(melodata$alturatotal,
freq=FALSE,
breaks=12,
col="green",
xlab="Altura de la planta, cm",
main="Histogram, rug plot, density curve")
rug(jitter(melodata$alturatotal))
lines(density(melodata$alturatotal), col="red", lwd=2)
# including a normal curve based on the data
x <- melodata$alturatotal
h <- hist(x,
     breaks=12,
     col="green",
     xlab="Altura de la planta, cm",
     main="Histogram with normal curve and box")
xfit <- seq(min(x), max(x), length=80)
yfit <- dnorm(xfit, mean=mean(x), sd=sd(x))
#scale the normal density to counts: bin width times number of observations
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="red", lwd=2)
box()
Now we are going to build a histogram using ggplot2. ggplot2 provides a system for creating graphs based on the grammar of graphics. The intention of the ggplot2 package is to provide a comprehensive, grammar-based system for generating graphs in a unified and coherent manner, allowing users to create new and innovative data visualizations. The power of this approach has led ggplot2 to become an important tool for visualizing data using R.
First, let's see a basic ggplot2 histogram:
melodata <- read.csv("melocactus.csv", header = TRUE)
ggplot(melodata, aes(alturatotal))+
geom_histogram(color="white", bins = 14)
Now a more detailed histogram, including several layers:
hist.melodata <- ggplot(melodata, aes(alturatotal)) +
geom_histogram(aes(y=..density..), bins = 14, colour="white", fill="green") +
geom_rug(sides = "t", color = "black") +
labs(x="Altura total de la planta,cm", y = "Density") +
stat_function(fun = dnorm,
args = list(mean = mean(melodata$alturatotal, na.rm = TRUE),
sd = sd(melodata$alturatotal, na.rm = TRUE)),
colour = "red", size = 1)
hist.melodata
A box-and-whiskers plot describes the distribution of a continuous variable by plotting its five-number summary: the minimum, lower quartile (25th percentile), median (50th percentile), upper quartile (75th percentile), and maximum. It can also display observations that may be outliers. By default, each whisker extends to the most extreme data point that is no more than 1.5 times the interquartile range (IQR, the upper quartile minus the lower quartile) from the box; values outside this range are depicted as dots.
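The five-number summary and the points flagged as outliers can also be inspected directly; a short sketch with the Melocactus data:
#five-number summary: minimum, lower hinge, median, upper hinge, maximum
fivenum(melodata$alturatotal)
#observations beyond 1.5*IQR from the box, as used by the box plot
boxplot.stats(melodata$alturatotal)$out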
Now we are going to analyze the Melocactus data using a box-plot.
ggplot(melodata, aes(y = alturatotal, x = "0")) +
  geom_boxplot(fill="cornflowerblue", color="black") +
  geom_point(position = "jitter", size = 0.5, color="blue", alpha=.5) +
  labs(x = "Melocactus intortus", y = "Altura total de la planta, cm")
Related to box plots are violin plots, which provide more visual cues as to the distribution of scores over the range of heights.
ggplot(melodata, aes(x="0", y=alturatotal)) +
geom_violin(fill="lightblue") +
geom_point(position = "jitter", color = "blue", alpha = 0.3) +
labs(x = "Melocactus intortus", y = "Altura total de la planta, cm")
Build a graph with the Melocactus data that combines the box plot and the violin plot.
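One possible solution, as a sketch (layering a narrow box plot on top of the violin):
ggplot(melodata, aes(x = "0", y = alturatotal)) +
  geom_violin(fill = "lightblue") +
  geom_boxplot(width = 0.1, fill = "cornflowerblue", color = "black") +
  labs(x = "Melocactus intortus", y = "Altura total de la planta, cm")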
Counts of cases for a categorical variable are usually presented using bar graphs. Here we use data from a clinical trial of a treatment for arthritis, comparing the outcomes for treated individuals versus individuals receiving a placebo.
load("Arthr.Rdata")
head(Arthritis)
A <- ggplot(Arthritis, aes(x=Treatment, fill=Improved)) +
geom_bar(position="stack")
B <- ggplot(Arthritis, aes(x=Treatment, fill=Improved)) +
geom_bar(position="dodge")
C <- ggplot(Arthritis, aes(x=Treatment, fill=Improved)) +
geom_bar(position="fill")
#using the package cowplot to group graphs built separately
plot_grid(A, B, C, ncol = 2, labels = "AUTO")
Try to improve the graphs, using smaller fonts.
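One way is to shrink all text through theme(); a sketch, where the 8-point size is an arbitrary choice:
#define a smaller base text size and apply it to each graph before combining
small <- theme(text = element_text(size = 8))
plot_grid(A + small, B + small, C + small, ncol = 2, labels = "AUTO")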
Reading: chapter 7.
In this section we will consider the testing of statistical hypotheses about the relationship of two or more variables, and the graphs that help in the interpretation of the results.
Regression analysis is a broad term for a set of methodologies used to predict a response variable (also called a dependent, criterion, or outcome variable) from one or more predictor variables (also called independent or explanatory variables). In general, regression analysis can be used to identify the explanatory variables that are related to a response variable, to describe the form of the relationships involved, and to provide an equation for predicting the response variable from the explanatory variables.
When the regression model contains one dependent variable and one independent variable, the approach is called simple linear regression.
The data set women in the base installation provides the height and weight for a set of 15 women ages 30 to 39. Suppose you want to predict weight from height. Having an equation for predicting weight from height can help you to identify overweight or underweight individuals.
But first, we have to check that our data are close to a normal distribution, an assumption of parametric methods like linear regression. First we will build a Q-Q plot, which shows how well the quantiles of the data (after z transformation) agree with the theoretical quantiles of the corresponding normal distribution (\(\mu = 0, \sigma = 1\)).
#Q-Q plot
women
qqPlot(women$weight, add.line = TRUE, points.col = "blue", line.col = "red")
qqPlot(women$height, add.line = TRUE, points.col = "blue", line.col = "red")
Now we will test normality using the Shapiro-Wilk test, whose null hypothesis is that the data are normally distributed.
shapiro.test(women$weight)
##
## Shapiro-Wilk normality test
##
## data: women$weight
## W = 0.96036, p-value = 0.6986
shapiro.test(women$height)
##
## Shapiro-Wilk normality test
##
## data: women$height
## W = 0.96359, p-value = 0.7545
Now we can proceed with the parametric regression analysis:
fit <- lm(weight ~ height, data=women)
summary(fit)
##
## Call:
## lm(formula = weight ~ height, data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7333 -1.1333 -0.3833 0.7417 3.1167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
## height 3.45000 0.09114 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
#graph with regression line, confidence interval (95%)
ggplot(data=women, aes(x=height, y=weight)) +
geom_point(pch=19, color="blue", size=2) +
geom_smooth(method="lm", color="red", linetype=2) +
labs(x="Height, inches", y="Weight, lb")
#graph: residuals
plot(women$height,residuals(fit),
xlab="Height (in inches)",
ylab="residuals")
The resulting predictive equation is \(\widehat{weight} = -87.52 + 3.45 \times height\).
From the Pr(>|t|) column, you see that the regression coefficient (3.45) is significantly different from zero (p < 0.001) and indicates that there is an expected increase of 3.45 pounds of weight for every 1 inch increase in height. The multiple R-squared (0.991) indicates that the model accounts for 99.1% of the variance in weights. The residual standard error (1.53 pounds) can be thought of as the average error in predicting weight from height using this model. The F statistic tests whether the predictor variables, taken together, predict the response variable above chance levels.
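The fitted model can also be used directly for prediction; a sketch, where the three heights are arbitrary examples:
#predicted weights (with confidence intervals) for three heights, in inches
predict(fit, newdata = data.frame(height = c(60, 65, 70)), interval = "confidence")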
The residuals plot from the linear regression indicates a non-random distribution of the prediction error of the linear model. You might be able to improve the prediction using a regression with a quadratic term (that is, \(X^2\)), so we now fit a polynomial regression:
fit2 <- lm(weight ~ height + I(height^2), data=women)
summary(fit2)
##
## Call:
## lm(formula = weight ~ height + I(height^2), data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50941 -0.29611 -0.00941 0.28615 0.59706
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 261.87818 25.19677 10.393 2.36e-07 ***
## height -7.34832 0.77769 -9.449 6.58e-07 ***
## I(height^2) 0.08306 0.00598 13.891 9.32e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3841 on 12 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
## F-statistic: 1.139e+04 on 2 and 12 DF, p-value: < 2.2e-16
ggplot(women, aes(x = height, y = weight)) +
  geom_point(color = "blue") +
  stat_smooth(method = "lm", formula = y ~ poly(x, 2), color = "red", linetype = 2, se = TRUE, size = 1) +
  labs(x = "Height, inches", y = "Weight, lb.")
plot(women$height,residuals(fit2),
xlab="Height (in inches)",
ylab="residuals")
The new polynomial equation is \(\widehat{weight} = 261.88 - 7.35 \times height + 0.083 \times height^2\).
The amount of variance accounted for has increased to 99.9%. The significance of the squared term (t = 13.89, p < .001) suggests that inclusion of the quadratic term improves the model fit. The prediction error (residuals) is smaller, and their distribution looks more random.
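Since the linear model is nested in the polynomial one, the improvement can also be tested formally; a minimal sketch:
#F test comparing the nested models: does the quadratic term improve the fit?
anova(fit, fit2)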
Reading: chapter 19.
Extra
A bubble plot is a scatterplot where a third dimension is added: the value of an additional variable is represented through the size of the dots. You need three numerical variables as input: one is represented by the X axis, one by the Y axis, and one by the size.
Using the mtcars dataset, we will look again at the relationship between efficiency (mpg) and weight, this time also showing the variable disp (engine displacement) in the same graph.
mtcars <- read.csv("mtcars.csv", header = TRUE)
mtcars
ggplot(data=mtcars, aes(x=wt, y=mpg)) +
geom_point(aes(size=disp),colour = "lightblue") +
geom_text(aes(label=model), size=2, angle=45) +
scale_size_continuous(range=c(2,15), name = "disp., cu.in.") +
labs(x="Weight, x1000 lb", y="Miles per Gallon")
Kabacoff, R., 2015. R in Action: Data Analysis and Graphics with R, 2nd ed. Manning, Shelter Island.
Verzani, J., 2012. Getting Started with RStudio. O'Reilly, Sebastopol, Calif.