This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button, a document will be generated that includes both the content and the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
# Store the data in the variable my_data
my_data <- iris
my_data
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## ... (output truncated: 150 rows in total, 50 per species)
## 150 5.9 3.0 5.1 1.8 virginica
# Print the first 6 rows
head(my_data, 6)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Roughly speaking, the central tendency measures the “average” or the “middle” of your data. The most commonly used measures include:
the mean: the average value. It’s sensitive to outliers.
the median: the middle value. It’s a robust alternative to the mean.
the mode: the most frequent value.
In R:
The functions mean() and median() can be used to compute the mean and the median, respectively.
The function mfv() [in the modeest R package] can be used to compute the mode of a variable.
The R code below computes the mean, median and mode of the variable Sepal.Length [in the my_data data set]:
# Compute the mean value
mean(my_data$Sepal.Length)
## [1] 5.843333
# Compute the median value
median(my_data$Sepal.Length)
## [1] 5.8
# Compute the mode
#install.packages("modeest")
require(modeest)
## Loading required package: modeest
## Warning: package 'modeest' was built under R version 4.3.2
mfv(my_data$Sepal.Length)
## [1] 5
Measures of variability describe how “spread out” the data are.
The range corresponds to the biggest value minus the smallest value. It gives you the full spread of the data.
# Compute the minimum value
min(my_data$Sepal.Length)
## [1] 4.3
# Compute the maximum value
max(my_data$Sepal.Length)
## [1] 7.9
# Range
range(my_data$Sepal.Length)
## [1] 4.3 7.9
Recall that quartiles divide the data into 4 parts. Note that the interquartile range (IQR), corresponding to the difference between the first and third quartiles, is sometimes used as a robust alternative to the standard deviation.
R function:
quantile(x, probs = seq(0, 1, 0.25))
x: numeric vector whose sample quantiles are wanted. probs: numeric vector of probabilities with values in [0,1].
quantile(my_data$Sepal.Length)
## 0% 25% 50% 75% 100%
## 4.3 5.1 5.8 6.4 7.9
By default, the function returns the minimum, the maximum and the three quartiles (the 0.25, 0.50 and 0.75 quantiles).
quantile(my_data$Sepal.Length, seq(0, 1, 0.1))
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 4.30 4.80 5.00 5.27 5.60 5.80 6.10 6.30 6.52 6.90 7.90
IQR(my_data$Sepal.Length)
## [1] 1.3
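As a quick check, the same value can be recovered directly from the quartiles (a minimal sketch using base R):
# IQR = Q3 - Q1; should match IQR() above
q <- quantile(my_data$Sepal.Length, c(0.25, 0.75))
unname(q[2] - q[1]) # 6.4 - 5.1 = 1.3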
# Compute the variance
var(my_data$Sepal.Length)
## [1] 0.6856935
# Compute the standard deviation =
# square root of the variance
sd(my_data$Sepal.Length)
## [1] 0.8280661
Median absolute deviation
# Compute the median
median(my_data$Sepal.Length)
## [1] 5.8
# Compute the median absolute deviation
mad(my_data$Sepal.Length)
## [1] 1.03782
Range. It’s not often used because it’s very sensitive to outliers.
Interquartile range. It’s pretty robust to outliers and is used a lot in combination with the median.
Variance. It’s hard to interpret directly because it isn’t expressed in the same units as the data. It’s almost never used except as a mathematical tool.
Standard deviation. This is the square root of the variance. It’s expressed in the same units as the data and is often used when the mean is the measure of central tendency.
Median absolute deviation. It’s a robust way to estimate the standard deviation for data with outliers (see the sketch after this list). It’s not used very often.
In summary, the IQR and the standard deviation are the two most common measures used to report the variability of the data.
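To make the robustness claim concrete, here is a small sketch that appends one invented extreme value (50 is hypothetical, not part of iris) and compares the two estimates:
x <- c(my_data$Sepal.Length, 50) # hypothetical outlier
sd(x) # inflated well beyond the original 0.83 by the single outlier
mad(x) # stays close to the original 1.04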
Computing an overall summary of a variable and an entire data frame
The summary() function
Summary of a single variable. Six values are returned in a single call: the minimum, first quartile, median, mean, third quartile and maximum:
summary(my_data$Sepal.Length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.300 5.100 5.800 5.843 6.400 7.900
When summary() is applied to an entire data frame, it is automatically applied to each column, and the format of the result depends on the type of data contained in the column. For example:
If the column is a numeric variable, the mean, median, min, max and quartiles are returned.
If the column is a factor variable, the number of observations in each level is returned.
summary(my_data, digits = 1)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## Min. :4 Min. :2 Min. :1 Min. :0.1 setosa :50
## 1st Qu.:5 1st Qu.:3 1st Qu.:2 1st Qu.:0.3 versicolor:50
## Median :6 Median :3 Median :4 Median :1.3 virginica :50
## Mean :6 Mean :3 Mean :4 Mean :1.2
## 3rd Qu.:6 3rd Qu.:3 3rd Qu.:5 3rd Qu.:1.8
## Max. :8 Max. :4 Max. :7 Max. :2.5
The sapply() function
It’s also possible to use the function sapply() to apply a particular function over a list or vector. For instance, we can use it to compute, for each column of a data frame, the mean, sd, var, min, quantile, etc.
# Compute the mean of each column
sapply(my_data[, -5], mean)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
# Compute quartiles
sapply(my_data[, -5], quantile)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0% 4.3 2.0 1.00 0.1
## 25% 5.1 2.8 1.60 0.3
## 50% 5.8 3.0 4.35 1.3
## 75% 6.4 3.3 5.10 1.8
## 100% 7.9 4.4 6.90 2.5
The stat.desc() function
The function stat.desc() [in the pastecs package] provides other useful statistics, including:
the median
the mean
the standard error of the mean (SE.mean)
the confidence interval of the mean (CI.mean) at the given p level (default is 0.95)
the variance (var)
the standard deviation (std.dev)
the coefficient of variation (coef.var), defined as the standard deviation divided by the mean
# Install the pastecs package
#install.packages("pastecs")
Use the function stat.desc() to compute descriptive statistics:
# Compute descriptive statistics
library(pastecs)
## Warning: package 'pastecs' was built under R version 4.3.2
res <- stat.desc(my_data[, -5])
round(res, 2)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## nbr.val 150.00 150.00 150.00 150.00
## nbr.null 0.00 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00
## min 4.30 2.00 1.00 0.10
## max 7.90 4.40 6.90 2.50
## range 3.60 2.40 5.90 2.40
## sum 876.50 458.60 563.70 179.90
## median 5.80 3.00 4.35 1.30
## mean 5.84 3.06 3.76 1.20
## SE.mean 0.07 0.04 0.14 0.06
## CI.mean.0.95 0.13 0.07 0.28 0.12
## var 0.69 0.19 3.12 0.58
## std.dev 0.83 0.44 1.77 0.76
## coef.var 0.14 0.14 0.47 0.64
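As a sanity check on these definitions, the key rows can be reproduced by hand for Sepal.Length (a minimal sketch using base R):
n <- sum(!is.na(my_data$Sepal.Length))
s <- sd(my_data$Sepal.Length)
s / sqrt(n) # SE.mean: about 0.07
qt(0.975, n - 1) * s / sqrt(n) # CI.mean.0.95 half-width: about 0.13
s / mean(my_data$Sepal.Length) # coef.var: about 0.14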
Case of missing values
Note that when the data contain missing values, some R functions will return errors or NA even if just a single value is missing.
For example, the mean() function will return NA if even one value is missing in a vector. This can be avoided with the argument na.rm = TRUE, which tells the function to remove any NAs before calculating. An example using the mean() function follows:
mean(my_data$Sepal.Length, na.rm = TRUE)
## [1] 5.843333
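A toy vector with a genuine missing value makes the behavior concrete (a minimal sketch):
x <- c(1, 2, NA)
mean(x) # returns NA: one missing value poisons the result
mean(x, na.rm = TRUE) # returns 1.5: the NA is dropped first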
Graphical display of distributions
The R package ggpubr will be used to create graphs.
Installation and loading ggpubr
Install the latest version from GitHub as follows:
# Install
#if(!require(devtools)) install.packages("devtools")
#devtools::install_github("kassambara/ggpubr")
# Load ggpubr as follows:
library(ggpubr)
## Loading required package: ggplot2
Box plots
ggboxplot(my_data, y = "Sepal.Length", width = 0.5)
Histogram
Histograms show the number of observations that fall within specified divisions (i.e., bins).
Histogram of Sepal.Length with the mean shown as a dashed line.
gghistogram(my_data, x = "Sepal.Length", bins = 9,
add = "mean")
## Warning: `geom_vline()`: Ignoring `mapping` because `xintercept` was provided.
## Warning: `geom_vline()`: Ignoring `data` because `xintercept` was provided.
Empirical cumulative distribution function (ECDF)
The ECDF at a value x is the fraction of the data smaller than or equal to x.
ggecdf(my_data, x = "Sepal.Length")
Q-Q plots
Q-Q plots are used to check whether the data are normally distributed.
ggqqplot(my_data, x = "Sepal.Length")
To compute summary statistics by groups, the functions group_by() and summarise() [in the dplyr package] can be used.
We want to group the data by Species and then:
compute the number of elements in each group, with the R function n();
compute the mean, with the R function mean();
compute the standard deviation, with the R function sd().
The operator %>% is used to chain operations.
Install dplyr as follows:
#install.packages("dplyr")
Descriptive statistics by groups:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:pastecs':
##
## first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
group_by(my_data, Species) %>%
summarise(
count = n(),
mean = mean(Sepal.Length, na.rm = TRUE),
sd = sd(Sepal.Length, na.rm = TRUE)
)
## # A tibble: 3 × 4
## Species count mean sd
## <fct> <int> <dbl> <dbl>
## 1 setosa 50 5.01 0.352
## 2 versicolor 50 5.94 0.516
## 3 virginica 50 6.59 0.636
Graphics for grouped data:
library("ggpubr")
# Box plot colored by groups: Species
ggboxplot(my_data, x = "Species", y = "Sepal.Length",
color = "Species",
palette = c("#00AFBB", "#E7B800", "#FC4E07"))
# Stripchart colored by groups: Species
ggstripchart(my_data, x = "Species", y = "Sepal.Length",
color = "Species",
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
add = "mean_sd")
Note that when the number of observations per group is small, a strip chart is recommended over a box plot.
Frequency tables
A frequency table (or contingency table) is used to describe categorical variables. It contains the counts at each combination of factor levels.
R function to generate tables: table()
Create some data
Distribution of hair and eye color by sex for 592 students:
# Hair/eye color data
df <- as.data.frame(HairEyeColor)
hair_eye_col <- df[rep(row.names(df), df$Freq), 1:3]
rownames(hair_eye_col) <- 1:nrow(hair_eye_col)
head(hair_eye_col)
## Hair Eye Sex
## 1 Black Brown Male
## 2 Black Brown Male
## 3 Black Brown Male
## 4 Black Brown Male
## 5 Black Brown Male
## 6 Black Brown Male
#hair/eye variables
Hair <- hair_eye_col$Hair
Eye <- hair_eye_col$Eye
Simple frequency distribution: one categorical variable
Table of counts
# Frequency distribution of hair color
table(Hair)
## Hair
## Black Brown Red Blond
## 108 286 71 127
# Frequency distribution of eye color
table(Eye)
## Eye
## Brown Blue Hazel Green
## 220 215 93 64
Graphics: to create the graphics, we start by converting the table into a data frame.
# Compute table and convert as data frame
df <- as.data.frame(table(Hair))
df
## Hair Freq
## 1 Black 108
## 2 Brown 286
## 3 Red 71
## 4 Blond 127
# Visualize using bar plot
library(ggpubr)
ggbarplot(df, x = "Hair", y = "Freq")
Two-way contingency table: Two categorical variables
tbl2 <- table(Hair , Eye)
tbl2
## Eye
## Hair Brown Blue Hazel Green
## Black 68 20 15 5
## Brown 119 84 54 29
## Red 26 17 14 14
## Blond 7 94 10 16
It’s also possible to use the function xtabs(), which creates cross-tabulations of data frames with a formula interface.
xtabs(~ Hair + Eye, data = hair_eye_col)
## Eye
## Hair Brown Blue Hazel Green
## Black 68 20 15 5
## Brown 119 84 54 29
## Red 26 17 14 14
## Blond 7 94 10 16
Graphics: as before, we start by converting the table into a data frame.
df <- as.data.frame(tbl2)
head(df)
## Hair Eye Freq
## 1 Black Brown 68
## 2 Brown Brown 119
## 3 Red Brown 26
## 4 Blond Brown 7
## 5 Black Blue 20
## 6 Brown Blue 84
# Visualize using bar plot
library(ggpubr)
ggbarplot(df, x = "Hair", y = "Freq",
color = "Eye",
palette = c("brown", "blue", "gold", "green"))
# position dodge
ggbarplot(df, x = "Hair", y = "Freq",
color = "Eye", position = position_dodge(),
palette = c("brown", "blue", "gold", "green"))
Multiway tables: more than two categorical variables
Hair and eye color distributions by sex, using xtabs():
xtabs(~Hair + Eye + Sex, data = hair_eye_col)
## , , Sex = Male
##
## Eye
## Hair Brown Blue Hazel Green
## Black 32 11 10 3
## Brown 53 50 25 15
## Red 10 10 7 7
## Blond 3 30 5 8
##
## , , Sex = Female
##
## Eye
## Hair Brown Blue Hazel Green
## Black 36 9 5 2
## Brown 66 34 29 14
## Red 16 7 7 7
## Blond 4 64 5 8
You can also use the function ftable() [for flat contingency tables]. It returns a more compact output than xtabs() when you have more than two variables:
ftable(Sex + Hair ~ Eye, data = hair_eye_col)
## Sex Male Female
## Hair Black Brown Red Blond Black Brown Red Blond
## Eye
## Brown 32 53 10 3 36 66 16 4
## Blue 11 50 10 30 9 34 7 64
## Hazel 10 25 7 5 5 29 7 5
## Green 3 15 7 8 2 14 7 8
Compute table margins and relative frequencies
Table margins correspond to the sums of counts along the rows or columns of the table. Relative frequencies express table entries as proportions of the table margins (i.e., row or column totals).
The functions margin.table() and prop.table() can be used to compute table margins and relative frequencies, respectively.
Format of the functions:
margin.table(x, margin = NULL)
prop.table(x, margin = NULL)
x: a table. margin: index number (1 for rows, 2 for columns).
Compute table margins:
Hair <- hair_eye_col$Hair
Eye <- hair_eye_col$Eye
# Hair/Eye color table
he.tbl <- table(Hair, Eye)
he.tbl
## Eye
## Hair Brown Blue Hazel Green
## Black 68 20 15 5
## Brown 119 84 54 29
## Red 26 17 14 14
## Blond 7 94 10 16
# Margin of rows
margin.table(he.tbl, 1)
## Hair
## Black Brown Red Blond
## 108 286 71 127
#Margin of columns
margin.table(he.tbl, 2)
## Eye
## Brown Blue Hazel Green
## 220 215 93 64
Compute relative frequencies:
# Frequencies relative to row total
prop.table(he.tbl, 1)
## Eye
## Hair Brown Blue Hazel Green
## Black 0.62962963 0.18518519 0.13888889 0.04629630
## Brown 0.41608392 0.29370629 0.18881119 0.10139860
## Red 0.36619718 0.23943662 0.19718310 0.19718310
## Blond 0.05511811 0.74015748 0.07874016 0.12598425
#Table of percentages
round(prop.table(he.tbl, 1), 2)*100
## Eye
## Hair Brown Blue Hazel Green
## Black 63 19 14 5
## Brown 42 29 19 10
## Red 37 24 20 20
## Blond 6 74 8 13
#To express the frequencies relative to the grand total, use this:
he.tbl/sum(he.tbl)
## Eye
## Hair Brown Blue Hazel Green
## Black 0.114864865 0.033783784 0.025337838 0.008445946
## Brown 0.201013514 0.141891892 0.091216216 0.048986486
## Red 0.043918919 0.028716216 0.023648649 0.023648649
## Blond 0.011824324 0.158783784 0.016891892 0.027027027
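For completeness, frequencies relative to column totals follow the same pattern, using margin = 2 (a minimal sketch):
# Frequencies relative to column (eye color) totals
round(prop.table(he.tbl, 2), 2)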
library("dplyr")
library("ggpubr")
my_data <- ToothGrowth
Check your data
We start by displaying a random sample of 10 rows using the function sample_n() [in the dplyr package].
set.seed(1234)
dplyr::sample_n(my_data, 10)
## len supp dose
## 1 21.5 VC 2.0
## 2 17.3 VC 1.0
## 3 27.3 OJ 2.0
## 4 18.5 VC 2.0
## 5 8.2 OJ 0.5
## 6 26.4 OJ 1.0
## 7 25.8 OJ 1.0
## 8 5.2 VC 0.5
## 9 6.4 VC 0.5
## 10 9.4 OJ 0.5
Assess the normality of the data in R
We want to test whether the variable len (tooth length) is normally distributed.
Case of large sample sizes
If the sample size is large enough (n > 30), we can ignore the distribution of the data and use parametric tests.
The central limit theorem tells us that, no matter what distribution the data have, the sampling distribution of the mean tends to be normal if the sample is large enough (n > 30).
Nevertheless, normality can still be checked by visual inspection [histograms, Q-Q (quantile-quantile) plots] or by significance tests.
Density plot and Q-Q plot can be used to check normality visually.
library("ggpubr")
ggdensity(my_data$len,
main = "Density plot of tooth length",
xlab = "Tooth length")
library(ggpubr)
ggqqplot(my_data$len)
It’s also possible to use the function qqPlot() [in car package]:
library("car")
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
qqPlot(my_data$len)
## [1] 23 1
As all the points fall approximately along this reference line, we can assume normality.
Normality test
Visual inspection, described in the previous section, is usually unreliable on its own. It’s possible to use a significance test comparing the sample distribution to a normal one in order to ascertain whether the data show a serious deviation from normality.
There are several methods for normality test such as Kolmogorov-Smirnov (K-S) normality test and Shapiro-Wilk’s test.
The null hypothesis of these tests is that “sample distribution is normal”.
If the test is significant, the distribution is non-normal.
Shapiro-Wilk’s method is widely recommended for normality testing and provides better power than K-S. It is based on the correlation between the data and the corresponding normal scores.
Note that normality tests are sensitive to sample size: small samples most often pass them. Therefore, it’s important to combine visual inspection and significance tests in order to make the right decision.
The R function shapiro.test() can be used to perform the Shapiro-Wilk test of normality for one variable (univariate):
shapiro.test(my_data$len)
##
## Shapiro-Wilk normality test
##
## data: my_data$len
## W = 0.96743, p-value = 0.1091
From the output, the p-value > 0.05, implying that the distribution of the data is not significantly different from a normal distribution. In other words, we can assume normality.
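For comparison with Shapiro-Wilk, a K-S test against a normal distribution with moments estimated from the data can be run as below. Note this is a sketch: estimating the parameters from the same sample makes the plain K-S test only approximate (Lilliefors’ correction addresses this), and R may warn about ties in len.
ks.test(my_data$len, "pnorm", mean(my_data$len), sd(my_data$len))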
The chi-square test of independence is used to analyze the frequency table (i.e., contingency table) formed by two categorical variables. The chi-square test evaluates whether there is a significant association between the categories of the two variables.
A contingency table can be visualized using the function balloonplot() [in the gplots package]. This function draws a graphical matrix where each cell contains a dot whose size reflects the relative magnitude of the corresponding count.
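A minimal sketch, assuming the gplots package is installed, applied to the hair/eye table (he.tbl) built earlier; the title and label settings are illustrative:
library(gplots)
# Each cell's dot size reflects the corresponding count
balloonplot(he.tbl, main = "Hair vs Eye color",
xlab = "Eye", ylab = "Hair",
label = FALSE, show.margins = FALSE)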
# Create the observed data matrix
observed <- matrix(c(30, 50, 20, 40, 20, 10, 30, 30, 10), nrow = 3, byrow = TRUE)
rownames(observed) <- c("Car", "Bicycle", "Walk")
colnames(observed) <- c("Reading", "Gaming", "Socializing")
contingency_table <- observed
contingency_table
## Reading Gaming Socializing
## Car 30 50 20
## Bicycle 40 20 10
## Walk 30 30 10
Hypothesis formulation:
H0: hobbies and transport are independent.
H1: hobbies and transport are dependent.
Level of significance:
alpha <- 0.05
# Calculate row totals and column totals
row_totals <- rowSums(observed)
col_totals <- colSums(observed)
total_obs <- sum(observed)
# Calculate expected frequencies
expected <- outer(row_totals, col_totals) / total_obs
expected
## Reading Gaming Socializing
## Car 41.66667 41.66667 16.66667
## Bicycle 29.16667 29.16667 11.66667
## Walk 29.16667 29.16667 11.66667
# Compute the chi-square statistic
chi2_stat <- sum((observed - expected)^2 / expected)
chi2_stat
## [1] 13.02857
# Degrees of freedom
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)
dof
## [1] 4
# Print results
cat("Chi2 Statistic:", chi2_stat, "\n")
## Chi2 Statistic: 13.02857
cat("Degrees of Freedom:", dof, "\n")
## Degrees of Freedom: 4
# Critical value at the 1 - alpha quantile
qchisq(1 - alpha, dof)
## [1] 9.487729
Reject H0 if the chi-square statistic is greater than or equal to the critical value. Since 13.03 > 9.49, we reject H0: hobbies and transport are dependent.
Using the built-in chisq.test() function:
chi.sq <- chisq.test(contingency_table)
chi.sq
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 13.029, df = 4, p-value = 0.01114
If p < alpha, reject H0:
if(chi.sq$p.value < alpha)
{
cat("reject ho")
}else
{
cat("fail to reject ho")
}
## reject ho
my_data<-ToothGrowth
library("dplyr")
sample_n(my_data, 10)
## len supp dose
## 1 17.3 VC 1.0
## 2 5.8 VC 0.5
## 3 9.7 OJ 0.5
## 4 16.5 OJ 0.5
## 5 18.5 VC 2.0
## 6 32.5 VC 2.0
## 7 10.0 VC 0.5
## 8 22.5 VC 1.0
## 9 17.3 VC 1.0
## 10 26.4 OJ 2.0
We want to test the equality of variances between the two groups OJ and VC in the column supp.
Preliminary test to check the F-test assumptions
The F-test is very sensitive to departures from the normality assumption, so you need to check whether the data are normally distributed before using it.
The Shapiro-Wilk test can be used to check whether the normality assumption holds. It’s also possible to use a Q-Q plot (quantile-quantile plot) to evaluate normality graphically; a Q-Q plot draws the correlation between a given sample and the normal distribution.
If there is doubt about normality, the better choice is Levene’s test or the Fligner-Killeen test, which are less sensitive to departures from normality; a sketch of both follows.
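A minimal sketch of both alternatives on the same data (leveneTest() is in the car package, which was loaded earlier; fligner.test() ships with base R's stats package):
# Levene's test for homogeneity of variance across supp groups
leveneTest(len ~ supp, data = my_data)
# Fligner-Killeen test of homogeneity of variances
fligner.test(len ~ supp, data = my_data)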
Hypotheses:
H0: the two variances are equal.
H1: the two variances are not equal.
Level of significance:
alpha <- 0.05
res <- var.test(len ~ supp, data = my_data)
print(res)
##
## F test to compare two variances
##
## data: len by supp
## F = 0.6386, num df = 29, denom df = 29, p-value = 0.2331
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.3039488 1.3416857
## sample estimates:
## ratio of variances
## 0.6385951
if(res$p.value < alpha)
{
cat("reject ho")
}else
{
cat("fail to reject ho")
}
## fail to reject ho
The p-value (0.23) is greater than alpha, so we fail to reject H0: there is no significant difference between the variances of the two groups.