Worksheet Week 6 - Part 1

Questions start here

Section 1: (12 points) Use the “trees” dataset for the following items. This dataset has three variables (Girth, Height, Volume) on 31 felled black cherry trees.

(1)(a) Use data(trees) to load the dataset. Check and output the structure with str(). Use apply() to return the mean values for the three variables. Output these values. Using R and logicals, determine the number of trees with Volume greater than the mean Volume; effectively, how many rows have Volume greater than the mean Volume.

data(trees)

str(trees)

## 'data.frame':    31 obs. of  3 variables:
##  $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
##  $ Height: num  70 65 63 72 81 83 66 75 80 75 ...
##  $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...

# Use apply() to return the mean values for the three variables
apply(trees, 2, mean)

##    Girth   Height   Volume 
## 13.24839 76.00000 30.17097

# Calculate the number of trees with Volume greater than the mean Volume
mean_volume <- mean(trees$Volume)
sum(trees$Volume > mean_volume)

## [1] 12

(4)(b) Girth is defined as the diameter of a tree taken at 4 feet 6 inches from the ground. Convert each diameter to a radius, r. Calculate the cross-sectional area of each tree using pi times the squared radius. What is the interquartile range (IQR) of areas?

# Convert Girth (diameter) to radius
trees$Radius <- trees$Girth / 2

# Calculate the cross-sectional area for each tree
trees$Area <- pi * trees$Radius^2

IQR(trees$Area)

## [1] 87.1949

(4)(c) Create a histogram of the areas calculated in (b). Title and label the axis.

hist(trees$Area,
     main = "Histogram of Tree Cross-Sectional Areas",
     xlab = "Cross-Sectional Area (sq. units)",
     ylab = "Frequency")

(4)(d) Identify the tree with the largest area and output on one line its row number and three measurements.

max_area_row <- which.max(trees$Area)

cat("Row:", max_area_row, 
    "- Girth:", trees$Girth[max_area_row], 
    "Height:", trees$Height[max_area_row], 
    "Volume:", trees$Volume[max_area_row], "\n")

## Row: 31 - Girth: 20.6 Height: 87 Volume: 77

Section 2: (12 points) The Student’s t distribution is an example of a symmetric, bell-shaped distribution but with ‘heavier’ tails than a normal distribution. This problem involves comparing the two.

2(a) Use set.seed(9999) and rt() with n = 100, df = 10 to generate a random sample designated as y. Generate a second random sample designated as x with set.seed(123) and rnorm() using n = 100, mean = 0 and sd = 1.25.

Generate a new object using cbind(x, y). Do not output this object; instead, assign it to a new name. Pass this object to apply() and compute the inter-quartile range (IQR) for each column: x and y. Use the function IQR() for this purpose. Round the results to four decimal places and present (this exercise shows the similarity of the IQR values.).

For information about rt(), use help(rt) or ?rt(). Do not output x or y.

set.seed(9999)
y <- rt(n = 100, df = 10)

set.seed(123)
x <- rnorm(n = 100, mean = 0, sd = 1.25)


data_combined <- cbind(x, y)


iqr_values <- round(apply(data_combined, 2, IQR), 4)

iqr_values

##      x      y 
## 1.4821 1.5242

(2)(b) This item will illustrate the difference between a normal and heavy-tailed distribution. For base R plots, use par(mfrow = c(2, 2)) to generate a display with four diagrams; grid.arrange() for ggplots. On the first row, for the normal results, present a histogram and a horizontal boxplot for x in color. For the t results, present a histogram and a horizontal boxplot for y in color.

par(mfrow = c(2, 2))

# Histogram for x (normal distribution)
hist(x, main = "Histogram of x (Normal)", xlab = "Values of x", col = "lightblue", border = "black")

# Horizontal boxplot for x
boxplot(x, horizontal = TRUE, main = "Boxplot of x (Normal)", col = "lightgreen")

# Histogram for y (t distribution)
hist(y, main = "Histogram of y (Student's t)", xlab = "Values of y", col = "lightcoral", border = "black")

# Horizontal boxplot for y
boxplot(y, horizontal = TRUE, main = "Boxplot of y (Student's t)", col = "lightpink")

(2)(c) QQ plots are useful for detecting the presence of heavy-tailed distributions. Present side-by-side QQ plots, one for each sample, using qqnorm() and qqline(). Add color and titles. In base R plots, “cex” can be used to control the size of the plotted data points and text; “size” for ggplot2 figures. Lastly, determine if there are any extreme outliers in either sample.Remember extreme outliers are based on 3 multiplied by the IQR in the box plot. R uses a default value of 1.5 times the IQR to define outliers (not extreme) in both boxplot and boxplot stats.

par(mfrow = c(1, 2))

# QQ plot for x (normal distribution)
qqnorm(x, main = "QQ Plot of x (Normal)", col = "blue", cex = 0.7)
qqline(x, col = "red")

# QQ plot for y (t distribution)
qqnorm(y, main = "QQ Plot of y (Student's t)", col = "darkgreen", cex = 0.7)
qqline(y, col = "red")

# Calculate the IQR for each sample
iqr_x <- IQR(x)
iqr_y <- IQR(y)

# Identify extreme outliers based on 3 * IQR rule
extreme_outliers_x <- x[x < (quantile(x, 0.25) - 3 * iqr_x) | x > (quantile(x, 0.75) + 3 * iqr_x)]
extreme_outliers_y <- y[y < (quantile(y, 0.25) - 3 * iqr_y) | y > (quantile(y, 0.75) + 3 * iqr_y)]

# Output extreme outliers, if any
cat("Extreme outliers in x (Normal):", extreme_outliers_x, "\n")

## Extreme outliers in x (Normal):

cat("Extreme outliers in y (Student's t):", extreme_outliers_y, "\n")

## Extreme outliers in y (Student's t):

Section 3 (3) Conditional probabilities appear in many contexts and, in particular, are used by Bayes’ Theorem. Correlations are another means for evaluating dependency between variables. The dataset “faithful”” is part of the “datasets” package and may be loaded with the statement data(faithful). It contains 272 observations of 2 variables; waiting time between eruptions (in minutes) and the duration of the eruption (in minutes) for the Old Faithful geyser in Yellowstone National Park.

(3)(a) (3 points) Load the “faithful” dataset and present summary statistics and a histogram of waiting times. Additionally, compute the empirical conditional probability of an eruption less than 3.0 minutes, if the waiting time exceeds 70 minutes.

data(faithful)

summary(faithful)

##    eruptions        waiting    
##  Min.   :1.600   Min.   :43.0  
##  1st Qu.:2.163   1st Qu.:58.0  
##  Median :4.000   Median :76.0  
##  Mean   :3.488   Mean   :70.9  
##  3rd Qu.:4.454   3rd Qu.:82.0  
##  Max.   :5.100   Max.   :96.0

# Histogram of waiting times
hist(faithful$waiting, main = "Histogram of Waiting Times", xlab = "Waiting Time (minutes)", col = "lightblue", border = "black")

# Compute the empirical conditional probability
# Define the conditions
condition <- faithful$waiting > 70
eruption_condition <- faithful$eruptions < 3.0

# Calculate conditional probability
conditional_probability <- sum(eruption_condition & condition) / sum(condition)
cat("Empirical conditional probability of eruption < 3.0 minutes given waiting time > 70 minutes:", conditional_probability, "\n")

## Empirical conditional probability of eruption < 3.0 minutes given waiting time > 70 minutes: 0.006060606

(3 points) Identify any observations in “faithful” for which the waiting time exceeds 70 minutes and the eruptions are less than 3.0 minutes. List and show any such observations in a distinct color on a scatterplot of all eruption (vertical axis) and waiting times (horizontal axis). Include a horizontal line at eruption = 3.0, and a vertical line at waiting time = 70. Add a title and appropriate text.

# Identify observations where waiting time > 70 minutes and eruption < 3.0 minutes
observations <- faithful[faithful$waiting > 70 & faithful$eruptions < 3.0, ]
observations

##     eruptions waiting
## 211     2.383      71

# Create scatterplot of all data with special color for the identified observations
plot(faithful$waiting, faithful$eruptions, 
     main = "Eruption Duration vs. Waiting Time for Old Faithful",
     xlab = "Waiting Time (minutes)", ylab = "Eruption Duration (minutes)",
     pch = 19, col = "blue")

# Add points for the identified observations in a distinct color (e.g., red)
points(observations$waiting, observations$eruptions, 
       col = "red", pch = 19, cex = 1.2)

# Add horizontal line at eruption = 3.0 and vertical line at waiting time = 70
abline(h = 3.0, col = "darkgreen", lty = 2, lwd = 2)
abline(v = 70, col = "darkgreen", lty = 2, lwd = 2)

# Add text labels for clarity
text(85, 4.5, "Waiting Time > 70 & Eruption < 3.0", col = "red", pos = 4)
text(50, 3.1, "Eruption = 3.0 minutes", col = "darkgreen", pos = 4)
text(72, 2, "Waiting = 70 minutes", col = "darkgreen", pos = 4)

(1.5 point) What does the plot suggest about the relationship between eruption time and waiting time?

Answer: (I feel like this the plot suggests a positive relationship between eruption duration and waiting time. As the waiting time increases, there is a tendency for eruption durations to be longer, particularly evident in the clustering of points at higher waiting times and longer eruptions. The observation highlighted in red, where the waiting time exceeds 70 minutes but the eruption duration is less than 3 minutes, appears to be an outlier. This indicates that, although longer waiting times generally lead to longer eruptions, there are exceptions to this pattern, hinting at variability in the relationship.)

(3)(b) (4.5 points) Past research indicates that the waiting times between consecutive eruptions are not independent. This problem will check to see if there is evidence of this. Form consecutive pairs of waiting times. In other words, pair the first and second waiting times, pair the third and fourth waiting times, and so forth. There are 136 resulting consecutive pairs of waiting times. Form a data frame with the first column containing the first waiting time in a pair and the second column with the second waiting time in a pair. Plot the pairs with the second member of a pair on the vertical axis and the first member on the horizontal axis.

One way to do this is to pass the vector of waiting times - faithful$waiting - to matrix(), specifying 2 columns for our matrix, with values organized by row; i.e. byrow = TRUE.

# Form consecutive pairs of waiting times
waiting_pairs <- matrix(faithful$waiting, ncol = 2, byrow = TRUE)
waiting_df <- data.frame(First_Waiting = waiting_pairs[, 1], Second_Waiting = waiting_pairs[, 2])

# Plot the pairs with the second member of each pair on the vertical axis
plot(waiting_df$First_Waiting, waiting_df$Second_Waiting, 
     main = "Scatterplot of Consecutive Waiting Times",
     xlab = "First Waiting Time (minutes)",
     ylab = "Second Waiting Time (minutes)",
     pch = 19, col = "blue")

# Display the resulting data frame
print(waiting_df)

##     First_Waiting Second_Waiting
## 1              79             54
## 2              74             62
## 3              85             55
## 4              88             85
## 5              51             85
## 6              54             84
## 7              78             47
## 8              83             52
## 9              62             84
## 10             52             79
## 11             51             47
## 12             78             69
## 13             74             83
## 14             55             76
## 15             78             79
## 16             73             77
## 17             66             80
## 18             74             52
## 19             48             80
## 20             59             90
## 21             80             58
## 22             84             58
## 23             73             83
## 24             64             53
## 25             82             59
## 26             75             90
## 27             54             80
## 28             54             83
## 29             71             64
## 30             77             81
## 31             59             84
## 32             48             82
## 33             60             92
## 34             78             78
## 35             65             73
## 36             82             56
## 37             79             71
## 38             62             76
## 39             60             78
## 40             76             83
## 41             75             82
## 42             70             65
## 43             73             88
## 44             76             80
## 45             48             86
## 46             60             90
## 47             50             78
## 48             63             72
## 49             84             75
## 50             51             82
## 51             62             88
## 52             49             83
## 53             81             47
## 54             84             52
## 55             86             81
## 56             75             59
## 57             89             79
## 58             59             81
## 59             50             85
## 60             59             87
## 61             53             69
## 62             77             56
## 63             88             81
## 64             45             82
## 65             55             90
## 66             45             83
## 67             56             89
## 68             46             82
## 69             51             86
## 70             53             79
## 71             81             60
## 72             82             77
## 73             76             59
## 74             80             49
## 75             96             53
## 76             77             77
## 77             65             81
## 78             71             70
## 79             81             93
## 80             53             89
## 81             45             86
## 82             58             78
## 83             66             76
## 84             63             88
## 85             52             93
## 86             49             57
## 87             77             68
## 88             81             81
## 89             73             50
## 90             85             74
## 91             55             77
## 92             83             83
## 93             51             78
## 94             84             46
## 95             83             55
## 96             81             57
## 97             76             84
## 98             77             81
## 99             87             77
## 100            51             78
## 101            60             82
## 102            91             53
## 103            78             46
## 104            77             84
## 105            49             83
## 106            71             80
## 107            49             75
## 108            64             76
## 109            53             94
## 110            55             76
## 111            50             82
## 112            54             75
## 113            78             79
## 114            78             78
## 115            70             79
## 116            70             54
## 117            86             50
## 118            90             54
## 119            54             77
## 120            79             64
## 121            75             47
## 122            86             63
## 123            85             82
## 124            57             82
## 125            67             74
## 126            54             83
## 127            73             73
## 128            88             80
## 129            71             83
## 130            56             79
## 131            78             84
## 132            58             83
## 133            43             60
## 134            75             81
## 135            46             90
## 136            46             74

(3)(c) (2) Test the hypothesis of independence with a two-sided test at the 5% level using the Kendall correlation coefficient.

# Form consecutive pairs of waiting times
waiting_pairs <- matrix(faithful$waiting, ncol = 2, byrow = TRUE)
waiting_df <- data.frame(First_Waiting = waiting_pairs[, 1], Second_Waiting = waiting_pairs[, 2])

# Perform Kendall's correlation test
cor.test(waiting_df$First_Waiting, waiting_df$Second_Waiting, method = "kendall", alternative = "two.sided")

## 
##  Kendall's rank correlation tau
## 
## data:  waiting_df$First_Waiting and waiting_df$Second_Waiting
## z = -4.9482, p-value = 7.489e-07
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##        tau 
## -0.2935579

Worksheet Week 6 - Part 1 _ Jiancong Zhu

Questions start here

Section 1: (12 points) Use the “trees” dataset for the following items. This dataset has three variables (Girth, Height, Volume) on 31 felled black cherry trees.

Section 2: (12 points) The Student’s t distribution is an example of a symmetric, bell-shaped distribution but with ‘heavier’ tails than a normal distribution. This problem involves comparing the two.

Section 3