library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
x_test<-c(2,4,10,20)
y_test<-c(0,0,0,36)
plot(x_test,type="b")
plot(y_test,type="b")
sum(x_test)/length(x_test)
## [1] 9
sum(y_test)/length(y_test)
## [1] 9
What do we notice? Aside from having the same mean, these two vectors have different distributions. We use the mean to describe a group of data, but this can also obscure more than it reveals.
We can think of the mean as a “statistical model” for what the data is telling us. But does the model fit?
plot(x_test,type="b")
abline(h=mean(x_test),col="red")
plot(y_test,type="b")
abline(h=mean(y_test),col="red")
We can quantify the distribution of a series using the var() command
var(x_test)
## [1] 65.33333
var(y_test)
## [1] 324
The fact that var(x) is so much lower than var(y) means that var(x) has less variance than y. The minimum amount of variance is “zero”, that is, no variance at all:
z_test<-c(1,1,1,1,1,1)
var(z_test)
## [1] 0
(there is no variance in “Z” because all the numbers are the same)
There is a better way to talk about variance. Standard deviation! var(x) calculates the deviation squared (good for relative comparisons), whereas standard deviation gives us the original units:
sd(x_test)
## [1] 8.082904
sd(y_test)
## [1] 18
sd(z_test)
## [1] 0
(you can take the numbers above and raise them to the “^2” (second power) get the var() score)
SD is a much more direct measurement of variance. Imagine X and Y are different categories of cars: x as a group varies about 8 cars from the mean, while y varies exactly 18 cars from the mean of y. This is much more concrete than “the variance of the set is (18)^2”.
Another way of thinking about SD is that the score represents how “far away” the “average” score is from the mean.
If I gave everyone a quiz, and the average score in class was 80, with a SD of 7, it means that around 68% of scores are +/-7 points from 80.
sd_data <- data.frame(
sd_range = factor(c("1 SD", "2 SD", "3 SD"), levels = c("1 SD", "2 SD", "3 SD")),
percentage = c(68.27, 95.45, 99.73) # These are the percentages for 1, 2, and 3 SD in a normal distribution
)
ggplot(sd_data, aes(x = sd_range, y = percentage)) +
geom_bar(stat = "identity", fill = "steelblue") +
geom_text(aes(label = paste0(percentage, "%")), vjust = -0.5) +
labs(title = "Percentage of Data within Standard Deviation Ranges",
x = "Standard Deviation Range",
y = "Percentage of Data (%)") +
theme_minimal()
# I would like to thank some robot for taking the time to write this for me
x_values <- seq(-4, 4, length = 1000)
# Compute the y-values (probability density) for the normal distribution
y_values <- dnorm(x_values)
# Create a data frame for plotting
normal_data <- data.frame(x = x_values, y = y_values)
# Create the ggplot
ggplot(normal_data, aes(x = x, y = y)) +
# Plot the normal distribution curve
geom_line(color = "blue", size = 1) +
# Shade the area within 1 standard deviation (68.27%)
geom_area(data = subset(normal_data, x >= -1 & x <= 1), aes(x=x, y=y), fill = "green", alpha = 0.5) +
# Shade the area within 2 standard deviations (95.45%)
geom_area(data = subset(normal_data, x >= -2 & x <= 2), aes(x=x, y=y), fill = "yellow", alpha = 0.3) +
# Shade the area within 3 standard deviations (99.73%)
geom_area(data = subset(normal_data, x >= -3 & x <= 3), aes(x=x, y=y), fill = "red", alpha = 0.2) +
# Add titles and labels
labs(title = "Normal Distribution with SD Ranges",
x = "Standard Deviations",
y = "Density") +
# Add vertical lines for SD ranges
geom_vline(xintercept = c(-1, 1), linetype = "dashed", color = "green", size = 1) +
geom_vline(xintercept = c(-2, 2), linetype = "dashed", color = "yellow", size = 1) +
geom_vline(xintercept = c(-3, 3), linetype = "dashed", color = "red", size = 1) +
# Customize the theme
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
So what’s the catch? The catch is: SD’s are not really accurate unless we are talking about normal distributions (like the one above).
data<-y_test
mean_data <- mean(data) # Mean = 9
sd_data <- sd(data) # Standard Deviation = 18
# Create a data frame for the histogram
df_hist <- data.frame(value = data)
# Generate values for the normal distribution curve
x_values <- seq(mean_data - 4 * sd_data, mean_data + 4 * sd_data, length.out = 1000)
normal_density <- dnorm(x_values, mean = mean_data, sd = sd_data)
# Scale the normal density to match the histogram's y-axis
bin_width <- 5
scaling_factor <- length(data) * bin_width
normal_density_scaled <- normal_density * scaling_factor
# Create a data frame for the normal distribution curve
df_norm <- data.frame(x = x_values, y = normal_density_scaled)
# Add a column indicating the standard deviation regions
df_norm$sigma_region <- with(df_norm, ifelse(
x >= mean_data - sd_data & x <= mean_data + sd_data, 'Within 1 SD',
ifelse(
(x >= mean_data - 2 * sd_data & x < mean_data - sd_data) | (x > mean_data + sd_data & x <= mean_data + 2 * sd_data), 'Within 2 SD',
ifelse(
(x >= mean_data - 3 * sd_data & x < mean_data - 2 * sd_data) | (x > mean_data + 2 * sd_data & x <= mean_data + 3 * sd_data), 'Within 3 SD',
'Beyond 3 SD'
)
)
))
# Define colors for each standard deviation region
sd_colors <- c('Within 1 SD' = 'green', 'Within 2 SD' = 'yellow', 'Within 3 SD' = 'orange', 'Beyond 3 SD' = 'red')
# Plot
ggplot() +
# Histogram of your data
geom_histogram(data = df_hist, aes(x = value), binwidth = bin_width, fill = 'skyblue', color = 'black', alpha = 0.7, boundary = 0) +
# Shaded areas under the normal curve for standard deviation regions
geom_area(data = df_norm, aes(x = x, y = y, fill = sigma_region), alpha = 0.5) +
# Normal distribution curve
geom_line(data = df_norm, aes(x = x, y = y), color = 'blue', size = 1) +
# Vertical lines at mean and standard deviations
geom_vline(xintercept = mean_data, linetype = 'dashed', color = 'black') +
geom_vline(xintercept = mean_data + sd_data, linetype = 'dotted', color = 'black') +
geom_vline(xintercept = mean_data - sd_data, linetype = 'dotted', color = 'black') +
geom_vline(xintercept = mean_data + 2 * sd_data, linetype = 'dotdash', color = 'black') +
geom_vline(xintercept = mean_data - 2 * sd_data, linetype = 'dotdash', color = 'black') +
geom_vline(xintercept = mean_data + 3 * sd_data, linetype = 'longdash', color = 'black') +
geom_vline(xintercept = mean_data - 3 * sd_data, linetype = 'longdash', color = 'black') +
# Annotations for standard deviation percentages
annotate('text', x = mean_data, y = max(normal_density_scaled) * 0.9, label = 'Mean', hjust = -0.1, color = 'black') +
annotate('text', x = mean_data + sd_data, y = max(normal_density_scaled) * 0.8, label = '+1 SD', hjust = -0.1, color = 'black') +
annotate('text', x = mean_data - sd_data, y = max(normal_density_scaled) * 0.8, label = '-1 SD', hjust = 1.1, color = 'black') +
annotate('text', x = mean_data + 2 * sd_data, y = max(normal_density_scaled) * 0.7, label = '+2 SD', hjust = -0.1, color = 'black') +
annotate('text', x = mean_data - 2 * sd_data, y = max(normal_density_scaled) * 0.7, label = '-2 SD', hjust = 1.1, color = 'black') +
annotate('text', x = mean_data + 3 * sd_data, y = max(normal_density_scaled) * 0.6, label = '+3 SD', hjust = -0.1, color = 'black') +
annotate('text', x = mean_data - 3 * sd_data, y = max(normal_density_scaled) * 0.6, label = '-3 SD', hjust = 1.1, color = 'black') +
# Labels and theme
labs(
title = 'Histogram with Superimposed Normal Distribution and SD Regions',
x = 'Value',
y = 'Frequency',
fill = 'Standard Deviation Regions'
) +
scale_fill_manual(values = sd_colors) +
theme_minimal() +
xlim(min(data) - 10, max(data) + 10)
## Warning: Removed 611 rows containing non-finite outside the scale range
## (`stat_align()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 611 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_vline()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_vline()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_vline()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
While they have the same means, the groups have a different distributions.
Lets talk for a moment about creating a normal distribution.
R can create a list of 1000 random numbers. For example:
test_norm<-rnorm(1000,mean=0)
tibble(test_norm)
## # A tibble: 1,000 × 1
## test_norm
## <dbl>
## 1 1.39
## 2 -0.105
## 3 0.00108
## 4 0.469
## 5 -0.255
## 6 -0.807
## 7 0.674
## 8 -1.71
## 9 0.797
## 10 1.23
## # ℹ 990 more rows
hist(test_norm)
rug(test_norm)
abline(v=mean(test_norm),col='red',lwd=4)
Remember we asked R to create a normal distribution with a “mean=0”. Did it work?
mean(test_norm)
## [1] -0.01104704
Very close to zero! Not perfect, but not terrible either.
No model is ever going to perfectly match the data. Reality is too complex – even when we ask a computer to pick random numbers, it will pick a few that throw off the mean by just a little bit.
What now?
We can test to see, mathematically, how close a model is to the actual data.
Another way of putting this: we can decide how much of the data is explained by the model. (Aside from just eyeing the numbers or the graph)
Believe it or not, this is the “main idea” of this entire course: data can be explained by models of data. Means (or averages) are just simple models. Models can be “fitted” to data.
One way to ask the question: “How well does ‘an average of 9’ describe each of our two examples?”
We can calculate the error - or - the VARIANCE:
x_errors<-x_test-9
print(x_test)
## [1] 2 4 10 20
print(c(9,9,9,9))
## [1] 9 9 9 9
print(x_errors)
## [1] -7 -5 1 11
Lets create a data frame with X, Y and the average for both (9):
xyz_test<-data.frame(x_test,y_test)
xyz_test$average<-9
xyz_test<-cbind(xyz_test,x_errors)
xyz_test<-xyz_test%>%mutate(y_errors=y_test-average)
head(xyz_test)
## x_test y_test average x_errors y_errors
## 1 2 0 9 -7 -9
## 2 4 0 9 -5 -9
## 3 10 0 9 1 -9
## 4 20 36 9 11 27
Don’t worry about the following code, just look at the graph:
ggplot(xyz_test,aes(x_test,average)) +geom_line() +
geom_pointrange(aes(ymin=average,ymax=x_test))+
geom_point(aes(y=x_test))+
geom_label(aes(label=x_test),nudge_y=xyz_test$x-xyz_test$average,nudge_x=-1)+
geom_abline(intercept = 9,slope=0,color='red')
Here’s the same graph for “Y”:
ggplot(xyz_test,aes(y_test,average)) +geom_line() +
geom_pointrange(aes(ymin=average,ymax=y_test))+
geom_point(aes(y=y_test))+
geom_label(aes(label=y_test),nudge_y=xyz_test$y-xyz_test$average,nudge_x=-1)+
geom_abline(intercept = 9,slope=0,color='red')
What do you notice? We can immediately tell that the model (an average of 9) does not fit the data very well.
The values of “x” and “y” are all a little different from the average (9). You can see, visually, how much error there is in the model.
IDEA: maybe we can just sum the errors in each measurement?
sum(x_errors)
## [1] 0
Whoops.
If we add all the errors together, we get exactly zero. The errors all cancel each other out.
Is there a way to fix this?
We can square the errors, so they no longer cancel out
x_errors_sq<-x_errors*x_errors
print(x_errors_sq)
## [1] 49 25 1 121
sum(x_errors_sq)
## [1] 196
This is called “the sum of squared errors”. As you can see, it will get bigger the more observations we have, so it’s best to divide by the number of observations to get a more accurate representation of error:
# don't worry about the "-1" at the end of this equation -- its explained in the reading. Just go with it for now.
sum(x_errors_sq)/length(xyz_test$x_test)-1
## [1] 48
This is the VARIANCE of x (2,4,10,20). Okay, so what?
We can compare models and see how big their variance is. Bigger the variance = less accurate model.
Remember model “y”? Lets test it:
#what are the errors of model y? It's the model (9) minus the values:
y_errors<-y_test-9
print(y_errors)
## [1] -9 -9 -9 27
#whats the square of errors?
y_error_sq<-y_errors*y_errors
print(y_error_sq)
## [1] 81 81 81 729
# whats the sum of the squares?
sum(y_error_sq)
## [1] 972
# whats the actual VARIANCE?
sum(y_error_sq)/length(xyz_test$y_test)-1
## [1] 242
Now we can judge which model is more accurate:
sum(x_errors_sq)/length(xyz_test$x_test)-1
## [1] 48
sum(y_error_sq)/length(xyz_test$y_test)-1
## [1] 242
Don’t get caught up in the equations – soon you’ll see that R does all this for you automatically.
var.test(xyz_test$x_test,xyz_test$y_test)
##
## F test to compare two variances
##
## data: xyz_test$x_test and xyz_test$y_test
## F = 0.20165, num df = 3, denom df = 3, p-value = 0.2213
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.01306067 3.11325077
## sample estimates:
## ratio of variances
## 0.2016461
R is telling us to not believe our eyes: the variance between X and Y cannot be meaningfully distinguished!
Who’s right?
Both are right. One possibility: x and y both deviate significantly from a normal distribution
DUE IN TWO WEEKS: The Testable Hypothesis Selection: