library(knitr)
# descriptive
library(MASS)
library(psych)
library(gmodels)
library(vcd)
library(plotrix)
# inferential
library(Lock5Data)
library(car)
library(forecast)
library(tseries)
library(Kendall)
library(seasonal)
library(ResourceSelection)
library(survival)
library(ggplot2)
library(agricolae)
R was initially written as a research project by Ross Ihaka and Robert Gentleman at Department of Statistics in the University of Auckland, New Zealand during 1990s. To provide a statistical environment in their teaching lab (a language for data analysis and graphics) based on the S language. The aim of the training is to introduce model applications commonly used in the analysis of data from studies.We seek to give sufficient background so that you better understand the types of analyses that can be implemented for designed experiments. We give practical examples and hands-on applications so that you can experience how these analyses are conducted. We introduce a framework that will easily extend to more complex structures in the models. We will assume some underlying knowledge of mathematical concepts and statistical notation. # Basic descriptive analysis and graphics in R
summary(mtcars$mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 15.43 19.20 20.09 22.80 33.90
#One Quantitative (y) and One Categorical (x)
by(InsectSprays$count,InsectSprays$spray,mean,na.rm=TRUE)
## InsectSprays$spray: A
## [1] 14.5
## ------------------------------------------------------------
## InsectSprays$spray: B
## [1] 15.33333
## ------------------------------------------------------------
## InsectSprays$spray: C
## [1] 2.083333
## ------------------------------------------------------------
## InsectSprays$spray: D
## [1] 4.916667
## ------------------------------------------------------------
## InsectSprays$spray: E
## [1] 3.5
## ------------------------------------------------------------
## InsectSprays$spray: F
## [1] 16.66667
Pie chart is popularly used in practice to show percentage break down of data. It is a circle representing a set of data by dividing the circle into sectors proportional to the number of items in the categories or a pie chart is a circle representing the total, cut into slices in proportional to the size of the parts that make up the total. It gives the proportional sizes of different data groups as slice of a pie or a circle.
slices <- c(10, 12,4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie(slices, labels = lbls,main="Simple Pie Chart")
# Simple bar plot
Simple Bar Chart: is a diagram in which categories of a variable are marked on the X axis and the frequencies of the categories are marked on the Y axis. It is applicable for discrete variables, that is, for data given according to some period, places and timings. These periods and timings are represented on the base line (X-axis) at regular interval and the corresponding frequencies are represented on the Y-axis.
barplot(GNP ~ Year, data = longley,col = "red", lwd = 2)
# Stacked BarPlots
Multiple Bar Charts: used to display data on more than one variable. In the multiple bars diagram two or more sets of inter-related data are interpreted.
barplot(VADeaths, beside = TRUE,
col = c("lightblue", "mistyrose", "lightcyan",
"lavender", "cornsilk"),
legend.text = rownames(VADeaths), ylim = c(0, 100))
title(main = "Death Rates in Virginia", font.main = 4)
Produce box-and-whisker plot(s) of the given (grouped) values.
boxplot(mtcars$hp ~ mtcars$am,
xlab="Transformation",
ylab="horse power",
main = "Boxplots of horse power by Transformation Status",
col="lightblue")
A scatter plot is a type of graph used to show the relationship between two numerical variables. Each point represents one observation.
plot(mtcars$mpg,mtcars$weight,
xlab="Weight" ,
ylab="miles per age",
main="Plot of Weight vs miles per age ")
A histogram is graph showing the frequency of occurrence of each value of the variable being analyzed. In histogram, data are plotted as a series of rectangles. Class intervals are shown on the “X-axis” and the frequencies on the “Y-axis”. The height of each rectangle represents the frequency of the class interval. Each rectangle is formed with the other so as to give a continuous picture.
hist(mtcars$mpg, breaks=12,col="red",
xlab="Miles Per Gallon",
main="Colored histogram with 12 bins")
A 2-D plot displays the relationship between two variables on a flat graph using the x- and y-axes. A 3-D plot shows the relationship among three variables by adding a z-axis, giving the graph depth and a more realistic view of data.
x <- 10*(1:nrow(volcano))
y <- 10*(1:ncol(volcano))
image(x, y, volcano, col = terrain.colors(100),
axes = FALSE)
# Adds contour lines to the current plot.
contour(x, y, volcano, levels = seq(90, 200, by=5),
add = TRUE, col = "peru")
# Adds x and y axes to the plot.
axis(1, at = seq(100, 800, by = 100))
axis(2, at = seq(100, 600, by = 100))
# Draws a box around the plot.
box()
# Adds a title.
title(main = "Maunga Whau Volcano", font.main = 4)
Inferential statistics is a branch of statistics that uses sample data to make conclusions or predictions about a larger population. It involves methods like estimation, hypothesis testing, and confidence intervals to draw reliable inferences.
The chi-square test is a statistical method used to determine whether there is a significant association between categorical variables or whether observed data fits an expected distribution.
chisq.test(mtcars$vs,mtcars$am)#H1:there is significant association
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mtcars$vs and mtcars$am
## X-squared = 0.34754, df = 1, p-value = 0.5555
SRS is a method of selecting a sample where every member of the population has an equal chance of being chosen.
sample(1:2000,100)
## [1] 1177 1984 1523 1401 1006 430 856 212 853 1959 815 986 626 589 1858
## [16] 494 929 1581 800 754 982 894 15 337 299 258 1525 1758 817 1960
## [31] 1335 920 1168 958 1797 167 778 1552 501 1976 799 1848 133 1765 1066
## [46] 1462 1033 1186 376 1491 241 1328 1664 558 245 1566 1610 1135 14 1399
## [61] 1456 1087 1132 11 1746 1494 1389 925 1378 1262 1449 1668 1990 1846 1913
## [76] 1764 1419 135 875 492 981 551 1778 1911 1196 327 305 1267 873 1160
## [91] 246 1561 1855 581 910 1094 542 914 268 426
sample(c("H","T"), 100, replace =TRUE)
## [1] "T" "H" "H" "H" "T" "H" "H" "H" "T" "T" "H" "H" "T" "T" "H" "T" "T" "T"
## [19] "H" "H" "T" "H" "T" "T" "H" "H" "H" "H" "H" "T" "H" "T" "H" "H" "H" "H"
## [37] "H" "T" "T" "H" "H" "H" "H" "H" "T" "T" "H" "H" "H" "H" "T" "T" "H" "H"
## [55] "H" "T" "T" "T" "H" "H" "T" "H" "H" "H" "T" "T" "H" "H" "H" "T" "H" "T"
## [73] "H" "T" "T" "T" "T" "H" "H" "T" "H" "T" "H" "H" "H" "H" "T" "T" "T" "T"
## [91] "H" "T" "T" "H" "T" "T" "T" "T" "T" "H"
They are pre-calculated values used to compare with your test statistic to decide whether to accept or reject a hypothesis. # Tabular value in the normal distribution
1-pnorm(2.15)
## [1] 0.01577761
qt(0.025,20)
## [1] -2.085963
qf(0.05,3,20, lower.tail=F)#To find upper 5% point for an F(3,20) distribution
## [1] 3.098391
Parametric tests are statistical tests that assume the data follows a specific distribution (usually normal distribution) and involve parameters like mean and standard deviation. Non-parametric tests do not assume any specific distribution of the data. # Two sample Parametric model A two-sample parametric model is used to compare the means of two independent groups when the data is numerical and assumes a normal distribution.
t.test(cars$dist,cars$speed) #H1: true difference in means is not equal to 0
##
## Welch Two Sample t-test
##
## data: cars$dist and cars$speed
## t = 7.4134, df = 53.119, p-value = 9.628e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 20.11848 35.04152
## sample estimates:
## mean of x mean of y
## 42.98 15.40
A paired sample t-test (also called dependent t-test) is used to compare the means of two related groups to see if there is a significant difference.
t.test(cars$dist,cars$speed, paired=TRUE, alternative = 'two')
##
## Paired t-test
##
## data: cars$dist and cars$speed
## t = 8.9753, df = 49, p-value = 6.417e-12
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 21.40484 33.75516
## sample estimates:
## mean difference
## 27.58
The F-test is a parametric test used to determine whether two populations have equal variances. It is often used before performing t-tests to check the assumption of equal variance.
var.test(cars$dist,cars$speed) #H1:true ratio of variances is not equal to 1
##
## F test to compare two variances
##
## data: cars$dist and cars$speed
## F = 23.751, num df = 49, denom df = 49, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 13.47817 41.85388
## sample estimates:
## ratio of variances
## 23.75108
The two-sample Wilcoxon test (also called the Mann–Whitney U test) is a non-parametric test used to compare the medians of two independent groups when the data does not assume a normal distribution.
wilcox.test(cars$dist,cars$speed,conf.int=TRUE)#H1: true location shift is not equal to 0
##
## Wilcoxon rank sum test with continuity correction
##
## data: cars$dist and cars$speed
## W = 2198, p-value = 6.319e-11
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## 15.99997 31.00003
## sample estimates:
## difference in location
## 22
The matched-pairs Wilcoxon test (also called the Wilcoxon Signed-Rank Test) is a non-parametric test used to compare two related or paired samples when the data does not assume normality.
wilcox.test(cars$dist,cars$speed,conf.int=TRUE, paired=TRUE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: cars$dist and cars$speed
## V = 1268.5, p-value = 1.148e-09
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## 19.49998 32.50006
## sample estimates:
## (pseudo)median
## 25.00005
The Ansari–Bradley test is a non-parametric test used to compare the variability (dispersion) of two independent groups when the data does not assume normality.
ansari.test(cars$dist,cars$speed)#H1:true ratio of scales is not equal to 1
##
## Ansari-Bradley test
##
## data: cars$dist and cars$speed
## AB = 1092, p-value = 0.01328
## alternative hypothesis: true ratio of scales is not equal to 1
Simple linear regression is a statistical method used to model the relationship between two variables: one independent variable (X) and one dependent variable (Y).
sl=lm(cars$speed~cars$dist)
summary(sl)
##
## Call:
## lm(formula = cars$speed ~ cars$dist)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5293 -2.1550 0.3615 2.4377 6.4179
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.28391 0.87438 9.474 1.44e-12 ***
## cars$dist 0.16557 0.01749 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.156 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
A scatter plot is often used to check if a linear relationship exists between two variables before performing linear regression.
plot(cars$dist,cars$speed, xlab="distance",ylab="speed",main="Scatter plot")
abline(lm(cars$speed~cars$dist),lty=2)
Both the QQ plot and the Shapiro-Wilk test are used to check if data is normally distributed, which is an important assumption for many parametric tests.
qqnorm(resid(sl))#normality
qqline(resid(sl),lty=2)
shapiro.test(cars$speed) #H1: The data are not normally distributed
##
## Shapiro-Wilk normality test
##
## data: cars$speed
## W = 0.97765, p-value = 0.4576
A Residuals vs Fitted plot is a diagnostic tool in regression to check whether the variance of residuals is constant across all levels of predicted values.
plot(resid(sl),fitted(sl))# constant variance
abline(h=0,lty=2)
Residual diagnostic plots are graphical tools used to check the assumptions of a regression model, such as linearity, constant variance, and normality of errors. They help ensure the model is reliable.
par(mfrow=c(3,2))
plot(sl, which=1:5)
plot(cars$dist,cars$speed, xlab="distance",ylab="speed",main="Scatter plot")
Model prediction refers to using a fitted statistical model to estimate or forecast the value of the dependent variable based on given independent variable(s).
predict(sl,int="p")
## Warning in predict.lm(sl, int = "p"): predictions on current data refer to _future_ responses
## fit lwr upr
## 1 8.615041 2.046715 15.18337
## 2 9.939581 3.427222 16.45194
## 3 8.946176 2.392929 15.49942
## 4 11.926392 5.475837 18.37695
## 5 10.932987 4.454892 17.41108
## 6 9.939581 3.427222 16.45194
## 7 11.264122 4.795959 17.73228
## 8 12.588663 6.152686 19.02464
## 9 13.913203 7.497220 20.32919
## 10 11.098554 4.625520 17.57159
## 11 12.919798 6.489968 19.34963
## 12 10.601852 4.113078 17.09063
## 13 11.595257 5.136275 18.05424
## 14 12.257527 5.814641 18.70041
## 15 12.919798 6.489968 19.34963
## 16 12.588663 6.152686 19.02464
## 17 13.913203 7.497220 20.32919
## 18 13.913203 7.497220 20.32919
## 19 15.900014 9.490931 22.30910
## 20 12.588663 6.152686 19.02464
## 21 14.244338 7.831434 20.65724
## 22 18.217960 11.781853 24.65407
## 23 21.529312 14.990142 28.06848
## 24 11.595257 5.136275 18.05424
## 25 12.588663 6.152686 19.02464
## 26 17.224555 10.804639 23.64447
## 27 13.582068 7.162237 20.00190
## 28 14.906609 8.497548 21.31567
## 29 13.582068 7.162237 20.00190
## 30 14.906609 8.497548 21.31567
## 31 16.562284 10.149326 22.97524
## 32 15.237744 8.829448 21.64604
## 33 17.555690 11.131142 23.98024
## 34 20.867041 14.354431 27.37965
## 35 22.191582 15.622947 28.76022
## 36 14.244338 7.831434 20.65724
## 37 15.900014 9.490931 22.30910
## 38 19.542501 13.074147 26.01085
## 39 13.582068 7.162237 20.00190
## 40 16.231149 9.820514 22.64178
## 41 16.893420 10.477367 23.30947
## 42 17.555690 11.131142 23.98024
## 43 18.880230 12.429514 25.33095
## 44 19.211366 12.752207 25.67052
## 45 17.224555 10.804639 23.64447
## 46 19.873636 13.395335 26.35194
## 47 23.516123 16.879994 30.15225
## 48 23.681690 17.036335 30.32705
## 49 28.152015 21.194665 35.10936
## 50 22.357149 15.780699 28.93360
A quadratic model is a type of regression model used to describe the relationship between a dependent variable 𝑌and an independent variable 𝑋when the relationship is curved, not linear.
Qm=lm(formula = cars$speed~ cars$dist + I(cars$dist^2))
summary(Qm)
##
## Call:
## lm(formula = cars$speed ~ cars$dist + I(cars$dist^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.559 -1.722 0.473 1.932 5.942
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.1439610 1.2954573 3.971 0.000244 ***
## cars$dist 0.3274544 0.0547392 5.982 2.86e-07 ***
## I(cars$dist^2) -0.0015284 0.0004939 -3.095 0.003316 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.907 on 47 degrees of freedom
## Multiple R-squared: 0.7101, Adjusted R-squared: 0.6978
## F-statistic: 57.57 on 2 and 47 DF, p-value: 2.299e-13
ANOVA (Analysis of Variance) can be used to compare nested regression models to see if a more complex model significantly improves the fit compared to a simpler model.
anova(sl,Qm) #H1:There is a significant improvement in the model
## Analysis of Variance Table
##
## Model 1: cars$speed ~ cars$dist
## Model 2: cars$speed ~ cars$dist + I(cars$dist^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 48 478.02
## 2 47 397.11 1 80.911 9.5762 0.003316 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation measures the strength and direction of a relationship between two variables. It tells us how one variable changes when the other changes. The Pearson correlation is rooted in the two-dimensional normal distribution where the theoretical correlation describes the contour ellipses for the density. Spearman’s ρ, you may be interested in nonparametric variants.These have the advantage of not depending on the normal distribution and, indeed, being invariant to monotone transformations of the coordinates. cor.test(cars\(dist,cars\)speed,method=“spearman”)
cor(cars$dist,cars$speed,use="complete.obs")
## [1] 0.8068949
cor(cars,use="complete.obs")
## speed dist
## speed 1.0000000 0.8068949
## dist 0.8068949 1.0000000
cor.test(cars$speed,cars$dist)
##
## Pearson's product-moment correlation
##
## data: cars$speed and cars$dist
## t = 9.464, df = 48, p-value = 1.49e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6816422 0.8862036
## sample estimates:
## cor
## 0.8068949
Multiple regression is an extension of simple linear regression that predicts a dependent variable (Y) using two or more independent variables (X₁, X₂, …, Xₖ).
ml=lm(UrbanPop~Murder+Assault+Rape, data=USArrests)
summary(ml)
##
## Call:
## lm(formula = UrbanPop ~ Murder + Assault + Rape, data = USArrests)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.456 -6.950 0.077 7.770 25.221
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.84187 4.82483 10.952 2.09e-14 ***
## Murder -1.41154 0.71954 -1.962 0.0559 .
## Assault 0.05190 0.04161 1.247 0.2186
## Rape 0.69841 0.26776 2.608 0.0122 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.08 on 46 degrees of freedom
## Multiple R-squared: 0.2337, Adjusted R-squared: 0.1837
## F-statistic: 4.676 on 3 and 46 DF, p-value: 0.006208
ANOVA is a statistical method used to compare the means of three or more groups to see if at least one group mean is significantly different from the others.
summary(InsectSprays)# data sammary for each variable
## count spray
## Min. : 0.00 A:12
## 1st Qu.: 3.00 B:12
## Median : 7.00 C:12
## Mean : 9.50 D:12
## 3rd Qu.:14.25 E:12
## Max. :26.00 F:12
Anova.Spray<-aov(sqrt(count)~spray,data= InsectSprays)
summary(Anova.Spray)
## Df Sum Sq Mean Sq F value Pr(>F)
## spray 5 88.44 17.688 44.8 <2e-16 ***
## Residuals 66 26.06 0.395
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Bartlett’s test is a statistical test used to check whether multiple groups have equal variances. It is important because many parametric tests assume homogeneity of variances.
bartlett.test(sqrt(count)~spray,data= InsectSprays)
##
## Bartlett test of homogeneity of variances
##
## data: sqrt(count) by spray
## Bartlett's K-squared = 3.7525, df = 5, p-value = 0.5856
#Ho: the distribution of a variable has the same variance in all groups
The Kruskal–Wallis test is a non-parametric alternative to one-way ANOVA. It is used to compare three or more independent groups when the data does not meet the normality assumption.
kruskal.test(sqrt(count)~spray,data= InsectSprays)
##
## Kruskal-Wallis rank sum test
##
## data: sqrt(count) by spray
## Kruskal-Wallis chi-squared = 54.691, df = 5, p-value = 1.511e-10
#H1: the location parameters of the distribution of x are differ in at least one group (sample).
Repeated Measures ANOVA is an extension of ANOVA used when the same subjects are measured multiple times under different conditions or over time. It accounts for within-subject correlation.
rmanova=aov(weight~Time+Error(Chick/Time),data=ChickWeight)
summary(rmanova)
##
## Error: Chick
## Df Sum Sq Mean Sq F value Pr(>F)
## Time 1 89700 89700 9.776 0.003 **
## Residuals 48 440405 9175
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Error: Chick:Time
## Df Sum Sq Mean Sq F value Pr(>F)
## Time 1 1962914 1962914 280.1 <2e-16 ***
## Residuals 49 343364 7007
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Error: Within
## Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 478 78173 163.5
Logistic regression is a statistical model used to predict a binary or categorical outcome based on one or more predictor variables.
fit.full <- glm( mtcars$am~ mtcars$vs+mtcars$mpg+mtcars$cyl, data=mtcars, family="binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(fit.full)
##
## Call:
## glm(formula = mtcars$am ~ mtcars$vs + mtcars$mpg + mtcars$cyl,
## family = "binomial", data = mtcars)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 69.0287 12251.6506 0.006 0.996
## mtcars$vs -37.9198 6125.8244 -0.006 0.995
## mtcars$mpg 0.2910 0.1933 1.505 0.132
## mtcars$cyl -9.4229 1531.4562 -0.006 0.995
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.230 on 31 degrees of freedom
## Residual deviance: 20.578 on 28 degrees of freedom
## AIC: 28.578
##
## Number of Fisher Scoring iterations: 18
A time series model is used to analyze data points collected or recorded over time to understand patterns, trends, and make forecasts.
summary(AirPassengers)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 104.0 180.0 265.5 280.3 360.5 622.0
plot(AirPassengers)
boxplot(AirPassengers~cycle(AirPassengers))
tsdata=ts(AirPassengers,freq=12)
ddata=decompose(tsdata, "multiplicative")
plot(ddata)
mymodel=arima(AirPassengers)# shows the best model
summary(mymodel)
## Length Class Mode
## coef 1 -none- numeric
## sigma2 1 -none- numeric
## var.coef 1 -none- numeric
## mask 1 -none- logical
## loglik 1 -none- numeric
## aic 1 -none- numeric
## arma 7 -none- numeric
## residuals 144 ts numeric
## call 2 -none- call
## series 1 -none- character
## code 1 -none- numeric
## n.cond 1 -none- numeric
## nobs 1 -none- numeric
## model 10 -none- list
plot.ts(arima(AirPassengers)$residuals)
acf(ts(arima(AirPassengers)$residuals),main="ACF Residuals")
pacf(ts(arima(AirPassengers)$residuals),main="PACF Residuals")
#H0: The residuals are independently distributed.
#HA: The residuals are not independently distributed; they exhibit serial correlation.
Box.test(arima(AirPassengers)$residuals,lag=5,type="Ljung-Box")#lag 5
##
## Box-Ljung test
##
## data: arima(AirPassengers)$residuals
## X-squared = 504.8, df = 5, p-value < 2.2e-16
Box.test(arima(AirPassengers)$residuals,lag=10,type="Ljung-Box")# lag 10
##
## Box-Ljung test
##
## data: arima(AirPassengers)$residuals
## X-squared = 857.07, df = 10, p-value < 2.2e-16
Box.test(arima(AirPassengers)$residuals,lag=15,type="Ljung-Box")#lag 15
##
## Box-Ljung test
##
## data: arima(AirPassengers)$residuals
## X-squared = 1241.5, df = 15, p-value < 2.2e-16