Data contains information about erruption of geyser located in USA.
We will be testing how much time is on average passing between the eruptions of geyser, the numbers will be expressed in minutes. Faihful data contains 272 observations of 2 explanatory variables (eruptions and waiting(time)) # Regression analysis First, let’s load all the packages available in R and select our data, faithuful.
data(package = .packages(all.available = TRUE))
data("faithful")
Now let’s rename faithful to mydata for more clear view. Also let’s load library(car)
mydata <- faithful
library(car)
## Loading required package: carData
Saving our data in head and checking the first 6 rows. 1st row represents eruptions and the second one represents waiting time between the eruptions.
head(mydata)
## eruptions waiting
## 1 3.600 79
## 2 1.800 54
## 3 3.333 74
## 4 2.283 62
## 5 4.533 85
## 6 2.883 55
We have to take a look at our data with description in library (psych). We can now see mean, standard deviations, skewness, kurtosis etc.
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:car':
##
## logit
describe(mydata)
## vars n mean sd median trimmed mad min max range skew
## eruptions 1 272 3.49 1.14 4 3.53 0.95 1.6 5.1 3.5 -0.41
## waiting 2 272 70.90 13.59 76 71.50 11.86 43.0 96.0 53.0 -0.41
## kurtosis se
## eruptions -1.51 0.07
## waiting -1.16 0.82
And let’s print the scatterplot Matrix.
scatterplotMatrix(mydata, smooth = FALSE)
Loading library
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
Let’s take a look at the correlation coefficient using spearmen method. Looks like it’s 77% correlated. Because of tied data our cor.test cannot compute exact p-value with ties.
cor(mydata$eruptions, mydata$waiting,
method = c("spearman"))
## [1] 0.7779721
cor.test(mydata$eruptions, mydata$waiting,
method = c("spearman"))
## Warning in cor.test.default(mydata$eruptions, mydata$waiting, method =
## c("spearman")): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: mydata$eruptions and mydata$waiting
## S = 744659, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.7779721
It looks like we cannot get rid of this correlation because waiting time and eruption time go with one and another. Let’s plot it.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
We will draw eruptions. First we need to setup our graphicon.
eruptions <- ggplot(mydata[mydata$eruptions == "eruptions", ], aes(x = eruptions)) +
theme_linedraw() +
geom_histogram() +
ylab("Frequency")
waiting <- ggplot(mydata[mydata$eruptions == "waiting", ], aes(x = eruptions)) +
theme_linedraw() +
geom_histogram() +
ylab("Frequency")
And now after we set all up, let’s plot the eruptions.
ggplot(mydata, aes(x=waiting, y=eruptions)) +
geom_point() +
ggtitle("Eruptions")
ggplot(mydata, aes(x=eruptions)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Parametric t test:
H0: There is no significant difference in the mean waiting time between eruptions. H1: There is a significant difference in the mean waiting time between eruptions.
t.test(mydata$eruptions, mydata$waiting,
paired = TRUE,
alternative = "less")
##
## Paired t-test
##
## data: mydata$eruptions and mydata$waiting
## t = -88.398, df = 271, p-value < 2.2e-16
## alternative hypothesis: true mean difference is less than 0
## 95 percent confidence interval:
## -Inf -66.15066
## sample estimates:
## mean difference
## -67.40928
We can reject the null at alpha=0.05 and conclude from H1: there is a significant difference in the mean waiting time between eruptions
Let’s check for normallity of mydata. We will check for this non parametric test with shapiro test.
shapiro.test(mydata$waiting)
##
## Shapiro-Wilk normality test
##
## data: mydata$waiting
## W = 0.92215, p-value = 1.015e-10
H0: We have normal distribution H1: We don’t have a normal distribution
We can reject the null at alpha=0.05 and conclude from H1: we don’t have a normal distribution.
shapiro.test(mydata$eruptions)
##
## Shapiro-Wilk normality test
##
## data: mydata$eruptions
## W = 0.84592, p-value = 9.036e-16
H:0 We have normal distribution H:1 We don’t have a normal distribution We can reject the null at alpha=0.05.
It is clear we do not follow a normal distribution in eruptions and in waiting time between the eruptions.
Now we will draw waiting time between the eruptions in minutes.
hist(mydata$waiting,
freq=FALSE,
xlab = "Waiting time, showcased in minutes",
main="Waiting time between the eruptions in minutes",
breaks = 20)
Intervals are spanning from a minimum of 43 minutes to a maximum of 96 minutes. On average, these intervals are around 70.9 minutes in mydata, ceteris paribus
And now let’s see histogram of eruptions lenghts, also in minutes.
hist(mydata$eruptions,
freq=T,
xlab = "Duration of eruptions, showcased in minutes",
main="Histogram of eruptions lenghts, showcased in minutes",
breaks=20)
Duration of eruptions vary from 1.6 minutes to a maximum of 5.1 minutes. On average, eruptions last around 3.5 minutes, ceteris paribus.
summary(mydata)
## eruptions waiting
## Min. :1.600 Min. :43.0
## 1st Qu.:2.163 1st Qu.:58.0
## Median :4.000 Median :76.0
## Mean :3.488 Mean :70.9
## 3rd Qu.:4.454 3rd Qu.:82.0
## Max. :5.100 Max. :96.0
To sum it all up, our research question has been answered: Duration of eruptions vary from 1.6 minutes to a maximum of 5.1 minutes. On average, eruptions last around 3.5 minutes, ceteris paribus. Intervals are spanning from a minimum of 43 minutes to a maximum of 96 minutes. On average, these intervals are around 70.9 minutes in mydata, ceteris paribus.