By Ina Prosen

Data comes directly from R studio in package called “faithful”

Data contains information about erruption of geyser located in USA.

Tests

We will be testing how much time is on average passing between the eruptions of geyser, the numbers will be expressed in minutes. Faihful data contains 272 observations of 2 explanatory variables (eruptions and waiting(time)) # Regression analysis First, let’s load all the packages available in R and select our data, faithuful.

data(package = .packages(all.available = TRUE))
data("faithful")

Now let’s rename faithful to mydata for more clear view. Also let’s load library(car)

mydata <- faithful
library(car)

## Loading required package: carData

Saving our data in head and checking the first 6 rows. 1st row represents eruptions and the second one represents waiting time between the eruptions.

head(mydata)

##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55

We have to take a look at our data with description in library (psych). We can now see mean, standard deviations, skewness, kurtosis etc.

library(psych)

## 
## Attaching package: 'psych'

## The following object is masked from 'package:car':
## 
##     logit

describe(mydata)

##           vars   n  mean    sd median trimmed   mad  min  max range  skew
## eruptions    1 272  3.49  1.14      4    3.53  0.95  1.6  5.1   3.5 -0.41
## waiting      2 272 70.90 13.59     76   71.50 11.86 43.0 96.0  53.0 -0.41
##           kurtosis   se
## eruptions    -1.51 0.07
## waiting      -1.16 0.82

And let’s print the scatterplot Matrix.

scatterplotMatrix(mydata, smooth = FALSE)

Loading library

library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following object is masked from 'package:psych':
## 
##     describe

## The following objects are masked from 'package:base':
## 
##     format.pval, units

Let’s take a look at the correlation coefficient using spearmen method. Looks like it’s 77% correlated. Because of tied data our cor.test cannot compute exact p-value with ties.

cor(mydata$eruptions, mydata$waiting,
 method = c("spearman"))

## [1] 0.7779721

cor.test(mydata$eruptions, mydata$waiting,
 method = c("spearman"))

## Warning in cor.test.default(mydata$eruptions, mydata$waiting, method =
## c("spearman")): Cannot compute exact p-value with ties

## 
##  Spearman's rank correlation rho
## 
## data:  mydata$eruptions and mydata$waiting
## S = 744659, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.7779721

It looks like we cannot get rid of this correlation because waiting time and eruption time go with one and another. Let’s plot it.

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

We will draw eruptions. First we need to setup our graphicon.

eruptions <- ggplot(mydata[mydata$eruptions == "eruptions", ], aes(x = eruptions)) + 
  theme_linedraw() +
  geom_histogram() +
  ylab("Frequency") 
  

 waiting <- ggplot(mydata[mydata$eruptions == "waiting", ], aes(x = eruptions)) + 
  theme_linedraw() +
  geom_histogram() +
  ylab("Frequency")

And now after we set all up, let’s plot the eruptions.

ggplot(mydata, aes(x=waiting, y=eruptions)) +
  geom_point() +
  ggtitle("Eruptions")

ggplot(mydata, aes(x=eruptions)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Parametric t test:

H0: There is no significant difference in the mean waiting time between eruptions. H1: There is a significant difference in the mean waiting time between eruptions.

t.test(mydata$eruptions, mydata$waiting,
       paired = TRUE,
       alternative = "less")

## 
##  Paired t-test
## 
## data:  mydata$eruptions and mydata$waiting
## t = -88.398, df = 271, p-value < 2.2e-16
## alternative hypothesis: true mean difference is less than 0
## 95 percent confidence interval:
##       -Inf -66.15066
## sample estimates:
## mean difference 
##       -67.40928

We can reject the null at alpha=0.05 and conclude from H1: there is a significant difference in the mean waiting time between eruptions

Let’s check for normallity of mydata. We will check for this non parametric test with shapiro test.

shapiro.test(mydata$waiting)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$waiting
## W = 0.92215, p-value = 1.015e-10

H0: We have normal distribution H1: We don’t have a normal distribution

We can reject the null at alpha=0.05 and conclude from H1: we don’t have a normal distribution.

shapiro.test(mydata$eruptions)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$eruptions
## W = 0.84592, p-value = 9.036e-16

H:0 We have normal distribution H:1 We don’t have a normal distribution We can reject the null at alpha=0.05.

It is clear we do not follow a normal distribution in eruptions and in waiting time between the eruptions.

Now we will draw waiting time between the eruptions in minutes.

hist(mydata$waiting,
     freq=FALSE, 
     xlab = "Waiting time, showcased in minutes", 
     main="Waiting time between the eruptions in minutes",
     breaks = 20)

Intervals are spanning from a minimum of 43 minutes to a maximum of 96 minutes. On average, these intervals are around 70.9 minutes in mydata, ceteris paribus

And now let’s see histogram of eruptions lenghts, also in minutes.

hist(mydata$eruptions,
     freq=T, 
     xlab = "Duration of eruptions, showcased in minutes", 
     main="Histogram of eruptions lenghts, showcased in minutes",
     breaks=20)

Duration of eruptions vary from 1.6 minutes to a maximum of 5.1 minutes. On average, eruptions last around 3.5 minutes, ceteris paribus.

summary(mydata)

##    eruptions        waiting    
##  Min.   :1.600   Min.   :43.0  
##  1st Qu.:2.163   1st Qu.:58.0  
##  Median :4.000   Median :76.0  
##  Mean   :3.488   Mean   :70.9  
##  3rd Qu.:4.454   3rd Qu.:82.0  
##  Max.   :5.100   Max.   :96.0

To sum it all up, our research question has been answered: Duration of eruptions vary from 1.6 minutes to a maximum of 5.1 minutes. On average, eruptions last around 3.5 minutes, ceteris paribus. Intervals are spanning from a minimum of 43 minutes to a maximum of 96 minutes. On average, these intervals are around 70.9 minutes in mydata, ceteris paribus.

Research question: What is the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

2024-01-07

By Ina Prosen

Data comes directly from R studio in package called “faithful”

Tests