(1)(a) Use data(trees) to load the dataset. Check and output the structure with str(). Use apply() to return the mean values for the three variables. Output these values. Using R and logicals, determine the number of trees with Volume greater than the mean Volume; effectively, how many rows have Volume greater than the mean Volume.
data(trees)
str(trees)
## 'data.frame': 31 obs. of 3 variables:
## $ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
## $ Height: num 70 65 63 72 81 83 66 75 80 75 ...
## $ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
# Use apply() to return the mean values for the three variables
apply(trees, 2, mean)
## Girth Height Volume
## 13.24839 76.00000 30.17097
# Calculate the number of trees with Volume greater than the mean Volume
mean_volume <- mean(trees$Volume)
sum(trees$Volume > mean_volume)
## [1] 12
(4)(b) Girth is defined as the diameter of a tree taken at 4 feet 6 inches from the ground. Convert each diameter to a radius, r. Calculate the cross-sectional area of each tree using pi times the squared radius. What is the interquartile range (IQR) of areas?
# Convert Girth (diameter) to radius
trees$Radius <- trees$Girth / 2
# Calculate the cross-sectional area for each tree
trees$Area <- pi * trees$Radius^2
IQR(trees$Area)
## [1] 87.1949
(4)(c) Create a histogram of the areas calculated in (b). Title and label the axis.
hist(trees$Area,
main = "Histogram of Tree Cross-Sectional Areas",
xlab = "Cross-Sectional Area (sq. units)",
ylab = "Frequency")
(4)(d) Identify the tree with the largest area and output on one line its row number and three measurements.
max_area_row <- which.max(trees$Area)
cat("Row:", max_area_row,
"- Girth:", trees$Girth[max_area_row],
"Height:", trees$Height[max_area_row],
"Volume:", trees$Volume[max_area_row], "\n")
## Row: 31 - Girth: 20.6 Height: 87 Volume: 77
2(a) Use set.seed(9999) and rt() with n = 100, df = 10 to generate a random sample designated as y. Generate a second random sample designated as x with set.seed(123) and rnorm() using n = 100, mean = 0 and sd = 1.25.
Generate a new object using cbind(x, y). Do not output this object; instead, assign it to a new name. Pass this object to apply() and compute the inter-quartile range (IQR) for each column: x and y. Use the function IQR() for this purpose. Round the results to four decimal places and present (this exercise shows the similarity of the IQR values.).
For information about rt(), use help(rt) or ?rt(). Do not output x or y.
set.seed(9999)
y <- rt(n = 100, df = 10)
set.seed(123)
x <- rnorm(n = 100, mean = 0, sd = 1.25)
data_combined <- cbind(x, y)
iqr_values <- round(apply(data_combined, 2, IQR), 4)
iqr_values
## x y
## 1.4821 1.5242
(2)(b) This item will illustrate the difference between a normal and heavy-tailed distribution. For base R plots, use par(mfrow = c(2, 2)) to generate a display with four diagrams; grid.arrange() for ggplots. On the first row, for the normal results, present a histogram and a horizontal boxplot for x in color. For the t results, present a histogram and a horizontal boxplot for y in color.
par(mfrow = c(2, 2))
# Histogram for x (normal distribution)
hist(x, main = "Histogram of x (Normal)", xlab = "Values of x", col = "lightblue", border = "black")
# Horizontal boxplot for x
boxplot(x, horizontal = TRUE, main = "Boxplot of x (Normal)", col = "lightgreen")
# Histogram for y (t distribution)
hist(y, main = "Histogram of y (Student's t)", xlab = "Values of y", col = "lightcoral", border = "black")
# Horizontal boxplot for y
boxplot(y, horizontal = TRUE, main = "Boxplot of y (Student's t)", col = "lightpink")
(2)(c) QQ plots are useful for detecting the presence of heavy-tailed distributions. Present side-by-side QQ plots, one for each sample, using qqnorm() and qqline(). Add color and titles. In base R plots, “cex” can be used to control the size of the plotted data points and text; “size” for ggplot2 figures. Lastly, determine if there are any extreme outliers in either sample.Remember extreme outliers are based on 3 multiplied by the IQR in the box plot. R uses a default value of 1.5 times the IQR to define outliers (not extreme) in both boxplot and boxplot stats.
par(mfrow = c(1, 2))
# QQ plot for x (normal distribution)
qqnorm(x, main = "QQ Plot of x (Normal)", col = "blue", cex = 0.7)
qqline(x, col = "red")
# QQ plot for y (t distribution)
qqnorm(y, main = "QQ Plot of y (Student's t)", col = "darkgreen", cex = 0.7)
qqline(y, col = "red")
# Calculate the IQR for each sample
iqr_x <- IQR(x)
iqr_y <- IQR(y)
# Identify extreme outliers based on 3 * IQR rule
extreme_outliers_x <- x[x < (quantile(x, 0.25) - 3 * iqr_x) | x > (quantile(x, 0.75) + 3 * iqr_x)]
extreme_outliers_y <- y[y < (quantile(y, 0.25) - 3 * iqr_y) | y > (quantile(y, 0.75) + 3 * iqr_y)]
# Output extreme outliers, if any
cat("Extreme outliers in x (Normal):", extreme_outliers_x, "\n")
## Extreme outliers in x (Normal):
cat("Extreme outliers in y (Student's t):", extreme_outliers_y, "\n")
## Extreme outliers in y (Student's t):
(3)(a) (3 points) Load the “faithful” dataset and present summary statistics and a histogram of waiting times. Additionally, compute the empirical conditional probability of an eruption less than 3.0 minutes, if the waiting time exceeds 70 minutes.
data(faithful)
summary(faithful)
## eruptions waiting
## Min. :1.600 Min. :43.0
## 1st Qu.:2.163 1st Qu.:58.0
## Median :4.000 Median :76.0
## Mean :3.488 Mean :70.9
## 3rd Qu.:4.454 3rd Qu.:82.0
## Max. :5.100 Max. :96.0
# Histogram of waiting times
hist(faithful$waiting, main = "Histogram of Waiting Times", xlab = "Waiting Time (minutes)", col = "lightblue", border = "black")
# Compute the empirical conditional probability
# Define the conditions
condition <- faithful$waiting > 70
eruption_condition <- faithful$eruptions < 3.0
# Calculate conditional probability
conditional_probability <- sum(eruption_condition & condition) / sum(condition)
cat("Empirical conditional probability of eruption < 3.0 minutes given waiting time > 70 minutes:", conditional_probability, "\n")
## Empirical conditional probability of eruption < 3.0 minutes given waiting time > 70 minutes: 0.006060606
# Identify observations where waiting time > 70 minutes and eruption < 3.0 minutes
observations <- faithful[faithful$waiting > 70 & faithful$eruptions < 3.0, ]
observations
## eruptions waiting
## 211 2.383 71
# Create scatterplot of all data with special color for the identified observations
plot(faithful$waiting, faithful$eruptions,
main = "Eruption Duration vs. Waiting Time for Old Faithful",
xlab = "Waiting Time (minutes)", ylab = "Eruption Duration (minutes)",
pch = 19, col = "blue")
# Add points for the identified observations in a distinct color (e.g., red)
points(observations$waiting, observations$eruptions,
col = "red", pch = 19, cex = 1.2)
# Add horizontal line at eruption = 3.0 and vertical line at waiting time = 70
abline(h = 3.0, col = "darkgreen", lty = 2, lwd = 2)
abline(v = 70, col = "darkgreen", lty = 2, lwd = 2)
# Add text labels for clarity
text(85, 4.5, "Waiting Time > 70 & Eruption < 3.0", col = "red", pos = 4)
text(50, 3.1, "Eruption = 3.0 minutes", col = "darkgreen", pos = 4)
text(72, 2, "Waiting = 70 minutes", col = "darkgreen", pos = 4)
Answer: (I feel like this the plot suggests a positive relationship between eruption duration and waiting time. As the waiting time increases, there is a tendency for eruption durations to be longer, particularly evident in the clustering of points at higher waiting times and longer eruptions. The observation highlighted in red, where the waiting time exceeds 70 minutes but the eruption duration is less than 3 minutes, appears to be an outlier. This indicates that, although longer waiting times generally lead to longer eruptions, there are exceptions to this pattern, hinting at variability in the relationship.)
(3)(b) (4.5 points) Past research indicates that the waiting times between consecutive eruptions are not independent. This problem will check to see if there is evidence of this. Form consecutive pairs of waiting times. In other words, pair the first and second waiting times, pair the third and fourth waiting times, and so forth. There are 136 resulting consecutive pairs of waiting times. Form a data frame with the first column containing the first waiting time in a pair and the second column with the second waiting time in a pair. Plot the pairs with the second member of a pair on the vertical axis and the first member on the horizontal axis.
One way to do this is to pass the vector of waiting times - faithful$waiting - to matrix(), specifying 2 columns for our matrix, with values organized by row; i.e. byrow = TRUE.
# Form consecutive pairs of waiting times
waiting_pairs <- matrix(faithful$waiting, ncol = 2, byrow = TRUE)
waiting_df <- data.frame(First_Waiting = waiting_pairs[, 1], Second_Waiting = waiting_pairs[, 2])
# Plot the pairs with the second member of each pair on the vertical axis
plot(waiting_df$First_Waiting, waiting_df$Second_Waiting,
main = "Scatterplot of Consecutive Waiting Times",
xlab = "First Waiting Time (minutes)",
ylab = "Second Waiting Time (minutes)",
pch = 19, col = "blue")
# Display the resulting data frame
print(waiting_df)
## First_Waiting Second_Waiting
## 1 79 54
## 2 74 62
## 3 85 55
## 4 88 85
## 5 51 85
## 6 54 84
## 7 78 47
## 8 83 52
## 9 62 84
## 10 52 79
## 11 51 47
## 12 78 69
## 13 74 83
## 14 55 76
## 15 78 79
## 16 73 77
## 17 66 80
## 18 74 52
## 19 48 80
## 20 59 90
## 21 80 58
## 22 84 58
## 23 73 83
## 24 64 53
## 25 82 59
## 26 75 90
## 27 54 80
## 28 54 83
## 29 71 64
## 30 77 81
## 31 59 84
## 32 48 82
## 33 60 92
## 34 78 78
## 35 65 73
## 36 82 56
## 37 79 71
## 38 62 76
## 39 60 78
## 40 76 83
## 41 75 82
## 42 70 65
## 43 73 88
## 44 76 80
## 45 48 86
## 46 60 90
## 47 50 78
## 48 63 72
## 49 84 75
## 50 51 82
## 51 62 88
## 52 49 83
## 53 81 47
## 54 84 52
## 55 86 81
## 56 75 59
## 57 89 79
## 58 59 81
## 59 50 85
## 60 59 87
## 61 53 69
## 62 77 56
## 63 88 81
## 64 45 82
## 65 55 90
## 66 45 83
## 67 56 89
## 68 46 82
## 69 51 86
## 70 53 79
## 71 81 60
## 72 82 77
## 73 76 59
## 74 80 49
## 75 96 53
## 76 77 77
## 77 65 81
## 78 71 70
## 79 81 93
## 80 53 89
## 81 45 86
## 82 58 78
## 83 66 76
## 84 63 88
## 85 52 93
## 86 49 57
## 87 77 68
## 88 81 81
## 89 73 50
## 90 85 74
## 91 55 77
## 92 83 83
## 93 51 78
## 94 84 46
## 95 83 55
## 96 81 57
## 97 76 84
## 98 77 81
## 99 87 77
## 100 51 78
## 101 60 82
## 102 91 53
## 103 78 46
## 104 77 84
## 105 49 83
## 106 71 80
## 107 49 75
## 108 64 76
## 109 53 94
## 110 55 76
## 111 50 82
## 112 54 75
## 113 78 79
## 114 78 78
## 115 70 79
## 116 70 54
## 117 86 50
## 118 90 54
## 119 54 77
## 120 79 64
## 121 75 47
## 122 86 63
## 123 85 82
## 124 57 82
## 125 67 74
## 126 54 83
## 127 73 73
## 128 88 80
## 129 71 83
## 130 56 79
## 131 78 84
## 132 58 83
## 133 43 60
## 134 75 81
## 135 46 90
## 136 46 74
(3)(c) (2) Test the hypothesis of independence with a two-sided test at the 5% level using the Kendall correlation coefficient.
# Form consecutive pairs of waiting times
waiting_pairs <- matrix(faithful$waiting, ncol = 2, byrow = TRUE)
waiting_df <- data.frame(First_Waiting = waiting_pairs[, 1], Second_Waiting = waiting_pairs[, 2])
# Perform Kendall's correlation test
cor.test(waiting_df$First_Waiting, waiting_df$Second_Waiting, method = "kendall", alternative = "two.sided")
##
## Kendall's rank correlation tau
##
## data: waiting_df$First_Waiting and waiting_df$Second_Waiting
## z = -4.9482, p-value = 7.489e-07
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.2935579