setwd("C:/Users/tsb477/Downloads")
cars_data<-mtcars

Homework recap: Many people had an issue with their plots because they chose non-continuous data (for example, the number of cylinders on a car; there is no such thing as 0.5 of a cylinder). For a scatter plot, both variables need to be continuous. Hint: anything with a decimal point is continuous.

This is what a lot of people’s plots looked like:

plot(cars_data$cyl, cars_data$hp)

This is what the professor wanted:

plot(cars_data$mpg, cars_data$drat)

Homework: Numbers only have one property (p. 24). Statistics isn't really a science; it's more of a technology that helps us understand the data around us.

Recap on Dispersion

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
x_test<-c(2,4,10,20)
y_test<-c(0,0,0,36)

The above code creates new vectors. The c() function combines a series of numbers into a single vector.

plot(x_test,type="b")

The type="b" argument draws both the points and the lines connecting them.

plot(y_test,type="b")

Both the x and y vectors have the same mean. This is how you get the mean manually:

sum(x_test)/length(x_test)
## [1] 9
sum(y_test)/length(y_test)
## [1] 9
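R also has a built-in mean() function that does the same calculation for us:

mean(x_test)
## [1] 9
mean(y_test)
## [1] 9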

What do we notice? Aside from having the same mean, these two vectors have different distributions. We use the mean to describe a group of data, but this can also obscure more than it reveals.

We can think of the mean as a “statistical model” for what the data is telling us. But does the model fit?

For the rest of the semester, we'll be thinking about how to use mathematical models to tell us information about the data. The average is the simplest of these models.

plot(x_test,type="b")
abline(h=mean(x_test),col="red")

The red line, created by abline(), is where the mean is. How well does it relate to the data? Not well: the points are far away from the mean, and the connecting line only crosses it once. This model is not great.

plot(y_test,type="b")
abline(h=mean(y_test),col="red")

This also doesn’t describe the data well at all!

We can quantify the spread of a series using the var() command. "var" stands for variance: it tells us how much distance there is between the values and the mean.

var(x_test)
## [1] 65.33333
var(y_test)
## [1] 324

The fact that var(x_test) is so much lower than var(y_test) means that x has less variance than y. The minimum possible variance is zero, that is, no variance at all:

z_test<-c(1,1,1,1,1,1)

var(z_test)
## [1] 0

(there is no variance in z_test because all the numbers are the same)

There is a better way to talk about spread: the standard deviation (sd)! var() reports the deviation in squared units (fine for relative comparisons), whereas the standard deviation is in the original units of the data:

sd(x_test)
## [1] 8.082904
sd(y_test)
## [1] 18
sd(z_test)
## [1] 0

(you can take the numbers above and raise them to the second power ("^2") to get back the var() score)
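For example:

sd(x_test)^2
## [1] 65.33333
sd(y_test)^2
## [1] 324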

SD is a much more direct measurement of spread. Imagine x and y are counts of cars in different categories: the x values sit, as a group, about 8 cars away from their mean, while the y values sit exactly 18 cars away from theirs. This is much more concrete than "the variance of the set is 18^2".

Another way of thinking about SD is that the score represents how “far away” the “average” score is from the mean.

If I gave everyone a quiz, and the average score in the class was 80 with an SD of 7, it means that around 68% of scores fall within +/-7 points of 80.
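Here's a quick sketch of that idea using simulated scores (quiz_scores is made-up data for illustration, not anything from class):

quiz_scores <- rnorm(1000, mean = 80, sd = 7)  # 1000 simulated quiz scores
mean(quiz_scores >= 73 & quiz_scores <= 87)    # share of scores within 7 points of 80, roughly 0.68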

You don't need to know how to do the wild stuff in the code below:

sd_data <- data.frame(
  sd_range = factor(c("1 SD", "2 SD", "3 SD"), levels = c("1 SD", "2 SD", "3 SD")),
  percentage = c(68.27, 95.45, 99.73) # These are the percentages for 1, 2, and 3 SD in a normal distribution
)

ggplot(sd_data, aes(x = sd_range, y = percentage)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = paste0(percentage, "%")), vjust = -0.5) +
  labs(title = "Percentage of Data within Standard Deviation Ranges",
       x = "Standard Deviation Range",
       y = "Percentage of Data (%)") +
  theme_minimal()

How did we get these numbers? It’s math (specifically calculus). Ew.
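If you're curious, R's pnorm() function will do that calculus for you. It returns the area under the normal curve to the left of a value, so subtracting two of them gives the area in between:

pnorm(1) - pnorm(-1)   # area within 1 SD, about 0.6827
pnorm(2) - pnorm(-2)   # area within 2 SD, about 0.9545
pnorm(3) - pnorm(-3)   # area within 3 SD, about 0.9973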

# I would like to thank some robot for taking the time to write this for me

x_values <- seq(-4, 4, length = 1000)

# Compute the y-values (probability density) for the normal distribution
y_values <- dnorm(x_values)

# Create a data frame for plotting
normal_data <- data.frame(x = x_values, y = y_values)

# Create the ggplot
ggplot(normal_data, aes(x = x, y = y)) +
  # Plot the normal distribution curve
  geom_line(color = "blue", linewidth = 1) +
  
  # Shade the area within 1 standard deviation (68.27%)
  geom_area(data = subset(normal_data, x >= -1 & x <= 1), aes(x=x, y=y), fill = "green", alpha = 0.5) +
  
  # Shade the area within 2 standard deviations (95.45%)
  geom_area(data = subset(normal_data, x >= -2 & x <= 2), aes(x=x, y=y), fill = "yellow", alpha = 0.3) +
  
  # Shade the area within 3 standard deviations (99.73%)
  geom_area(data = subset(normal_data, x >= -3 & x <= 3), aes(x=x, y=y), fill = "red", alpha = 0.2) +
  
  # Add titles and labels
  labs(title = "Normal Distribution with SD Ranges",
       x = "Standard Deviations",
       y = "Density") +
  
  # Add vertical lines for SD ranges
  geom_vline(xintercept = c(-1, 1), linetype = "dashed", color = "green", linewidth = 1) +
  geom_vline(xintercept = c(-2, 2), linetype = "dashed", color = "yellow", linewidth = 1) +
  geom_vline(xintercept = c(-3, 3), linetype = "dashed", color = "red", linewidth = 1) +
  
  # Customize the theme
  theme_minimal()

Another note: this is where the "margin of error" comes in. Example: if a political poll reports a 3% margin of error and that margin corresponds to one standard deviation, there's a 68.27% chance the poll is accurate to within +/-3%.

So what's the catch? The catch is that these SD rules are not really accurate unless we are talking about normal distributions (like the one above). Example with the number set [0 0 0 36]:

data<-y_test
mean_data <- mean(data)  # Mean = 9
sd_data <- sd(data)      # Standard Deviation = 18

# Create a data frame for the histogram
df_hist <- data.frame(value = data)

# Generate values for the normal distribution curve

x_values <- seq(mean_data - 4 * sd_data, mean_data + 4 * sd_data, length.out = 1000)
normal_density <- dnorm(x_values, mean = mean_data, sd = sd_data)

# Scale the normal density to match the histogram's y-axis
bin_width <- 5
scaling_factor <- length(data) * bin_width
normal_density_scaled <- normal_density * scaling_factor

# Create a data frame for the normal distribution curve
df_norm <- data.frame(x = x_values, y = normal_density_scaled)

# Add a column indicating the standard deviation regions
df_norm$sigma_region <- with(df_norm, ifelse(
  x >= mean_data - sd_data & x <= mean_data + sd_data, 'Within 1 SD',
  ifelse(
    (x >= mean_data - 2 * sd_data & x < mean_data - sd_data) | (x > mean_data + sd_data & x <= mean_data + 2 * sd_data), 'Within 2 SD',
    ifelse(
      (x >= mean_data - 3 * sd_data & x < mean_data - 2 * sd_data) | (x > mean_data + 2 * sd_data & x <= mean_data + 3 * sd_data), 'Within 3 SD',
      'Beyond 3 SD'
    )
  )
))

# Define colors for each standard deviation region
sd_colors <- c('Within 1 SD' = 'green', 'Within 2 SD' = 'yellow', 'Within 3 SD' = 'orange', 'Beyond 3 SD' = 'red')

# Plot
ggplot() +
  # Histogram of your data
  geom_histogram(data = df_hist, aes(x = value), binwidth = bin_width, fill = 'skyblue', color = 'black', alpha = 0.7, boundary = 0) +
  # Shaded areas under the normal curve for standard deviation regions
  geom_area(data = df_norm, aes(x = x, y = y, fill = sigma_region), alpha = 0.5) +
  # Normal distribution curve
  geom_line(data = df_norm, aes(x = x, y = y), color = 'blue', linewidth = 1) +
  # Vertical lines at mean and standard deviations
  geom_vline(xintercept = mean_data, linetype = 'dashed', color = 'black') +
  geom_vline(xintercept = mean_data + sd_data, linetype = 'dotted', color = 'black') +
  geom_vline(xintercept = mean_data - sd_data, linetype = 'dotted', color = 'black') +
  geom_vline(xintercept = mean_data + 2 * sd_data, linetype = 'dotdash', color = 'black') +
  geom_vline(xintercept = mean_data - 2 * sd_data, linetype = 'dotdash', color = 'black') +
  geom_vline(xintercept = mean_data + 3 * sd_data, linetype = 'longdash', color = 'black') +
  geom_vline(xintercept = mean_data - 3 * sd_data, linetype = 'longdash', color = 'black') +
  # Annotations for standard deviation percentages
  annotate('text', x = mean_data, y = max(normal_density_scaled) * 0.9, label = 'Mean', hjust = -0.1, color = 'black') +
  annotate('text', x = mean_data + sd_data, y = max(normal_density_scaled) * 0.8, label = '+1 SD', hjust = -0.1, color = 'black') +
  annotate('text', x = mean_data - sd_data, y = max(normal_density_scaled) * 0.8, label = '-1 SD', hjust = 1.1, color = 'black') +
  annotate('text', x = mean_data + 2 * sd_data, y = max(normal_density_scaled) * 0.7, label = '+2 SD', hjust = -0.1, color = 'black') +
  annotate('text', x = mean_data - 2 * sd_data, y = max(normal_density_scaled) * 0.7, label = '-2 SD', hjust = 1.1, color = 'black') +
  annotate('text', x = mean_data + 3 * sd_data, y = max(normal_density_scaled) * 0.6, label = '+3 SD', hjust = -0.1, color = 'black') +
  annotate('text', x = mean_data - 3 * sd_data, y = max(normal_density_scaled) * 0.6, label = '-3 SD', hjust = 1.1, color = 'black') +
  # Labels and theme
  labs(
    title = 'Histogram with Superimposed Normal Distribution and SD Regions',
    x = 'Value',
    y = 'Frequency',
    fill = 'Standard Deviation Regions'
  ) +
  scale_fill_manual(values = sd_colors) +
  theme_minimal() +
  xlim(min(data) - 10, max(data) + 10)
## Warning: Removed 611 rows containing non-finite outside the scale range
## (`stat_align()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 611 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_vline()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_vline()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_vline()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).

If data isn’t distributed like a bell curve, the standard deviation will not work!

Which line best describes the data?

While they have the same mean, the two groups have different distributions.

Let's talk for a moment about creating a normal distribution (basically a bell curve).

R can create a list of 1000 random numbers. For example, the rnorm() call below creates a list of 1000 numbers drawn from a normal distribution with a mean of 0 and stores it in test_norm:

test_norm<-rnorm(1000,mean=0)

tibble(test_norm)
## # A tibble: 1,000 × 1
##    test_norm
##        <dbl>
##  1     0.866
##  2    -0.442
##  3     0.499
##  4    -0.197
##  5    -2.61 
##  6     0.808
##  7    -0.979
##  8    -1.02 
##  9     0.400
## 10    -2.14 
## # ℹ 990 more rows
hist(test_norm)
rug(test_norm)
abline(v=mean(test_norm),col='red',lwd=4)

Remember we asked R to create a normal distribution with a “mean=0”. Did it work?

mean(test_norm)
## [1] 0.0530999

Very close to zero! Not perfect, but not terrible either.
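We could also check the spread. rnorm() uses a standard deviation of 1 by default, so sd(test_norm) should come out close to 1, but again not exactly:

sd(test_norm)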

IMPORTANT IDEA:

No model is ever going to perfectly match the data. Reality is too complex – even when we ask a computer to pick random numbers, it will pick a few that throw off the mean by just a little bit.

What now?

We can test to see, mathematically, how close a model is to the actual data.

Another way of putting this: we can decide how much of the data is explained by the model. (Aside from just eyeing the numbers or the graph)

Believe it or not, this is the “main idea” of this entire course: data can be explained by models of data. Means (or averages) are just simple models. Models can be “fitted” to data.

One way to ask the question: “How well does ‘an average of 9’ describe each of our two examples?”

We can do this by calculating the error, which is what the VARIANCE measures:

This calculates the distance of each x value from 9 (the mean):

x_errors<-x_test-9
print(x_errors)
## [1] -7 -5  1 11

Let's create a data frame with x, y, and the average for both (9). data.frame() basically creates a spreadsheet: each comma-separated argument becomes an additional column.

xyz_test<-data.frame(x_test,y_test)

xyz_test$average<-9

xyz_test<-cbind(xyz_test,x_errors)

head(xyz_test)
##   x_test y_test average x_errors
## 1      2      0       9       -7
## 2      4      0       9       -5
## 3     10      0       9        1
## 4     20     36       9       11

Don’t worry about the following code, just look at the graph:

ggplot(xyz_test,aes(x_test,average)) +geom_line() +
  geom_pointrange(aes(ymin=average,ymax=x_test))+
  geom_point(aes(y=x_test))+
  geom_label(aes(label=x_test),nudge_y=xyz_test$x_test-xyz_test$average,nudge_x=-1)+
  geom_abline(intercept = 9,slope=0,color='red')

This graph shows the physical distance between the points and the model (the mean). An accurate model would have the points very close to it; it would have the minimum amount of error.

Here’s the same graph for “Y”:

ggplot(xyz_test,aes(y_test,average)) +geom_line() +
  geom_pointrange(aes(ymin=average,ymax=y_test))+
  geom_point(aes(y=y_test))+
  geom_label(aes(label=y_test),nudge_y=xyz_test$y_test-xyz_test$average,nudge_x=-1)+
  geom_abline(intercept = 9,slope=0,color='red')

What do you notice? We can immediately tell that the model (an average of 9) does not fit the data very well.

The values of “x” and “y” are all a little different from the average (9). You can see, visually, how much error there is in the model.

IDEA: maybe we can just sum the errors in each measurement?

sum(x_errors)
## [1] 0

Whoops.

If we add all the errors together, we get exactly zero. The errors all cancel each other out. This will always happen!
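The same cancellation happens with y:

sum(y_test - 9)
## [1] 0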

Is there a way to fix this?

We can square the errors, so they no longer cancel out

x_errors_sq<-x_errors*x_errors

print(x_errors_sq)
## [1]  49  25   1 121
sum(x_errors_sq)
## [1] 196

This is called “the sum of squared errors”. As you can see, it will get bigger the more observations we have, so it’s best to divide by the number of observations to get a more accurate representation of error:

# don't worry about the "-1" in the denominator -- it's explained in the reading. Just go with it for now.

sum(x_errors_sq)/(length(xyz_test$x_test)-1)
## [1] 65.33333

This is the VARIANCE of x (2, 4, 10, 20); notice that it matches the var(x_test) output from earlier. Okay, so what?

We can compare models and see how big their variance is. The bigger the variance, the less accurate the model!

Remember model “y”? Lets test it:

#what are the errors of model y? It's the values minus the model (9):

y_errors<-y_test-9
print(y_errors)
## [1] -9 -9 -9 27
#what's the square of the errors?

y_error_sq<-y_errors*y_errors
print(y_error_sq)
## [1]  81  81  81 729
# what's the sum of the squares?

sum(y_error_sq)
## [1] 972
# what's the actual VARIANCE?

sum(y_error_sq)/(length(xyz_test$y_test)-1)
## [1] 324

Now we can judge which model is more accurate:

sum(x_errors_sq)/(length(xyz_test$x_test)-1)
## [1] 65.33333
sum(y_error_sq)/(length(xyz_test$y_test)-1)
## [1] 324

Don’t get caught up in the equations – soon you’ll see that R does all this for you automatically. We just went through the process the old fashioned way to better understand the math and reasoning.
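For example, the built-in var() function reproduces the numbers we just calculated by hand:

var(x_test)
## [1] 65.33333
var(y_test)
## [1] 324

R even has a function, var.test(), that compares the two variances for us: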

var.test(xyz_test$x_test,xyz_test$y_test)
## 
##  F test to compare two variances
## 
## data:  xyz_test$x_test and xyz_test$y_test
## F = 0.20165, num df = 3, denom df = 3, p-value = 0.2213
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.01306067 3.11325077
## sample estimates:
## ratio of variances 
##          0.2016461

R is telling us not to believe our eyes: the variances of x and y cannot be meaningfully distinguished (the p-value is well above 0.05)!

Who’s right?

Both are right. One possibility: x and y both deviate quite a bit from a normal distribution, and tests like this assume the data are roughly normal.
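If you wanted to check that hunch, R has a built-in normality test, shapiro.test() (just an illustration here, not something we've covered; a small p-value suggests the data are not normal):

shapiro.test(x_test)
shapiro.test(y_test)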

The p-value is essentially derived from the standard deviation / margin of error. Remember: standard deviations only work as advertised on bell-curve-shaped (normal) data!

HOMEWORK 4

DUE IN TWO WEEKS: The Testable Hypothesis Selection:

  1. Write up a brief testable hypothesis (including an alternate hypothesis) concerning some variables in the data you selected. This can be just long enough to explain the variables and how you think they're interacting. This is due in two weeks; submit it in a Word document via Canvas.

Look at the variables in your data, understand what they mean, and form a hypothesis based on your research question. You need a good research question before you can form a testable hypothesis. Not a good hypothesis: "I want to know the connection between x and y." What question are you trying to answer? Read the professor's Research Questions guide on Canvas/RPubs.