EPS 700 Lab 2

Author

Ao (Alan) Huang

rm(list = ls())

Section 1: Simulate Dice Rolls (2 points)

Step 1: Create a Six-Sided Die in R. (.25 points) To do this, save a vector with six values (1,2,3,4,5,6) to a label you will recognize. Paste your code below.

- Manually calculate the expected value of one roll of this die here. Optionally, you can check whether you are right by using the weighted.mean() function, with the die vector and a vector containing the probabilities as the two arguments. This is documented in the lab skeleton.

- What is the expected value of 10 rolls?
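
Worked by hand (a quick sketch of the manual calculation the prompt asks for): each face is equally likely, so the expected value is the simple average of the faces, and the expected sum of 10 rolls is just 10 times that.

# Manual calculation: EV of one roll = (1 + 2 + 3 + 4 + 5 + 6) / 6 = 21/6 = 3.5
# Expected SUM of 10 rolls = 10 * 3.5 = 35 (the expected mean stays 3.5)
(1 + 2 + 3 + 4 + 5 + 6) / 6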

set.seed(1)
#create a six-sided die#
die <- c(1, 2, 3, 4, 5, 6)
         
#optional: check EV#
probsdie <- c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6)
# weighted.mean(die,probsdie)
cat("The expected value of one roll:", weighted.mean(die,probsdie), "\n")
The expected value of one roll: 3.5 
#roll die 10 times; the mean of the rolls estimates the EV of 3.5#
tenth <- sample(die, 10, replace = TRUE)
cat("The mean of 10 simulated rolls:", mean(tenth), "\n")
The mean of 10 simulated rolls: 3 

Step 2: Roll the die. (.75 points) This can be done using the sample function. The sample() function takes a sample using arguments for the following: the vector or other element to draw from, the number of items to choose, a true/false indicator for whether the draw is with replacement, and an optional weighting argument which we will not use. For this exercise, draw from the vector you created in step 1. - Roll the die 5 times and save the results to a vector. Calculate the mean, and paste your code and results below. - Repeat the exercise for rolling the die 10 times, 1000 times, and 100000 times. Fill in the table below. Save each to a different value you will recognize.

#roll die 5 times#
fifth <- sample(die, 5, replace = TRUE)
cat("The mean of 5 rolls:", mean(fifth), "\n")
The mean of 5 rolls: 3.8 
#roll die 10 times#
tenth <- sample(die, 10, replace = TRUE)
cat("The mean of 10 rolls:", mean(tenth), "\n")
The mean of 10 rolls: 3.7 
#roll die 1000 times#
thousandth <- sample(die, 1000, replace = TRUE)
cat("The mean of 1000 rolls:", mean(thousandth), "\n")
The mean of 1000 rolls: 3.516 
#roll die 100000 times#
hundredthousandth <- sample(die, 100000, replace = TRUE)
cat("The mean of 100000 rolls:", mean(hundredthousandth), "\n")
The mean of 100000 rolls: 3.50495 

Step 3: Tabulate and graph the largest sample. (1 point) First, you will put the results of the 100000-time sample into a table using the data.frame() function and the table() function in unison. The basic logic is that we are telling R to turn the sample we saved previously into a table, then store that table as a data frame. This allows us to use it in graphs and plots! Both functions have many possible arguments, but we will be using the default values and will not need to specify any. You can review the code in the Lab Skeleton file. - Once you’ve created and saved your data frame, use the head() command to inspect it. Paste results here (use Courier size 9 for formatting!) - Use any graphing/plotting code with which you are comfortable to create a histogram of the results. Paste the code and the histogram below. How would you characterize the distribution? - Finally, conceptually (no need to code): If we were to take a sample of 30 dice rolls but save only the mean, and do that 10,000 times, what would the distribution of those saved means look like? Would it be the same or different from the histogram above?

#make a table which shows the distribution#
thotab <- data.frame(table(hundredthousandth))
head(thotab)
  hundredthousandth  Freq
1                 1 16685
2                 2 16573
3                 3 16522
4                 4 16737
5                 5 16748
6                 6 16735
#bar chart#
library(ggplot2)
ggplot(thotab, aes(x = hundredthousandth, y = Freq)) + 
  geom_col()

cat("The distribution looks like a uniform distribution", "\n")
The distribution looks like a uniform distribution 
# Simulate rolling the die, calculate means, and store them
sample_means <- numeric(10000)
for (i in 1:10000) {
  rolls <- sample(die, 30, replace = TRUE)
  sample_means[i] <- mean(rolls)
}

# Plot the distribution of sample means
ggplot(data = data.frame(sample_means), aes(x = sample_means)) +
  geom_histogram(binwidth = 0.1, color = "black", fill = "gray") +
  theme_minimal() +
  labs(title = "Distribution of Sample Means (30 dice rolls, 10,000 samples)",
       x = "Mean of Sample",
       y = "Frequency")

cat("The distribution looks like a normal distribution because of Central Limit Theorem (CLT). 
    \nRegardless of the original distribution of the population 
    \n(in this case, uniform distribution for dice rolls), 
    \nthe sampling distribution of the sample mean will         
    \napproach a normal distribution as the sample size increases.")
The distribution looks like a normal distribution because of Central Limit Theorem (CLT). 
    
Regardless of the original distribution of the population 
    
(in this case, uniform distribution for dice rolls), 
    
the sampling distribution of the sample mean will         
    
approach a normal distribution as the sample size increases.
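
As an optional sanity check (my own aside, not required by the lab), the CLT also predicts the spread of these sample means: the SD of a single roll is sqrt(35/12), about 1.71, so the standard error for n = 30 should be roughly 1.71 / sqrt(30), about 0.31. Reusing die and sample_means from above:

# Population SD of one fair-die roll: sqrt(E[X^2] - E[X]^2) = sqrt(35/12)
die_sd <- sqrt(mean(die^2) - mean(die)^2)
die_sd / sqrt(30)   # predicted standard error of the mean, roughly 0.312
sd(sample_means)    # observed spread of the 10,000 simulated means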

Section 2: Create and Explore a Probability Distribution (2 Points)

For this section, we’ll pretend we’ve opened a bag of multi-colored candies and counted out 100. The breakdown is as follows: Blue: 25 // Green: 15 // Red: 10 // Orange: 38 // Purple: 12. We’ll try to understand the frequency of the candies and what that implies about the proportion of all candies made.

Step 1: Create a probability distribution table. We will do this by creating two vectors: one for the color names, one for the values. Then, we will use the data.frame() function to combine them in a table. Paste your code and your table below.

colors <- c("Blue", "Green", "Red", "Orange", "Purple")
counts <- c(25, 15, 10, 38, 12)
proportions <- counts / sum(counts)
candy_distribution <- data.frame(Color = colors, Count = counts, Proportion = proportions)

print(candy_distribution)
   Color Count Proportion
1   Blue    25       0.25
2  Green    15       0.15
3    Red    10       0.10
4 Orange    38       0.38
5 Purple    12       0.12
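
A side note of my own (not part of the lab prompt): the optional prob argument to sample() that Section 1 skipped is exactly how draws from this distribution could be simulated.

# Simulate 10 candy draws with replacement, weighted by the proportions
set.seed(3)   # arbitrary seed, just for reproducibility
sample(colors, 10, replace = TRUE, prob = proportions)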

Step 2: Create a bar chart of your table. You can use any function that you are familiar with, but my example will use GGPlot. Note the use of the “identity” stat, which tells GGPlot to use the actual values in the table instead of a count. Paste your code and your bar chart below.

ggplot(candy_distribution, aes(x = Color, y = Count, fill = Color)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Candy Color Distribution", x = "Color", y = "Count") +
  scale_fill_manual(values = c("Blue" = "blue", "Green" = "green", 
  "Red" = "red", "Orange" = "orange", "Purple" = "purple"))

Step 3: Imagine you like Green candies twice as much as each of the others, and you want to quantify this preference. To do so, add a column to your probability table containing value weightings for each color. You can do this by creating a vector of the values, where all values are 1, except Green which is 2.

weightings <- rep(1, length(colors))  # Start with all weightings as 1
weightings[colors == "Green"] <- 2    # Set Green's weighting to 2

candy_distribution$Weightings <- weightings
print(candy_distribution)
   Color Count Proportion Weightings
1   Blue    25       0.25          1
2  Green    15       0.15          2
3    Red    10       0.10          1
4 Orange    38       0.38          1
5 Purple    12       0.12          1

Step 4: Using this valuation, calculate the expected values of 1, 10, and 50 draws from the bag (with replacement). You can use the vectors to do this easily, by multiplying them. (Hint: Remember that expected value is the sum of all possible (probability x value). The expected value of one draw, then, is the sum of the product of the weighted values and the probabilities. Hint 2: If all candies were the same value, the expected value of a draw would be that value. Expect a result that’s a bit higher than 1.)

expected_value_1_draw <- sum(candy_distribution$Proportion * candy_distribution$Weightings)

expected_value_10_draws <- expected_value_1_draw * 10
expected_value_50_draws <- expected_value_1_draw * 50

cat("Expected value of 1 draw: ", expected_value_1_draw, "\n")
Expected value of 1 draw:  1.15 
cat("Expected value of 10 draws: ", expected_value_10_draws, "\n")
Expected value of 10 draws:  11.5 
cat("Expected value of 50 draws: ", expected_value_50_draws, "\n")
Expected value of 50 draws:  57.5 
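
As an optional simulation check (assuming the candy_distribution table above), the average total value of 10 draws should land near the theoretical 11.5:

# Monte Carlo check: total value of 10 draws, repeated 10,000 times
set.seed(4)   # arbitrary seed, just for reproducibility
draw_totals <- replicate(10000,
  sum(sample(candy_distribution$Weightings, 10, replace = TRUE,
             prob = candy_distribution$Proportion)))
mean(draw_totals)   # should come out close to 11.5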

Step 6: Practice conditional probability/draw without replacement. You draw ten candies, and to your surprise they are all red. Calculate the new expected value of one draw from the 90 remaining candies in the bag, assuming the values associated with each color remain the same. Feel free to use R or manually calculate this. This is a bit of an R puzzle. Two hints: re-use code you’ve used already, and don’t forget that you can include arithmetic expressions as the elements of a vector.

candy_distribution_adjusted <- candy_distribution
candy_distribution_adjusted$Count[candy_distribution_adjusted$Color == "Red"] <-
  candy_distribution_adjusted$Count[candy_distribution_adjusted$Color == "Red"] - 10
candy_distribution_adjusted$Proportion <-
  candy_distribution_adjusted$Count / sum(candy_distribution_adjusted$Count)

expected_value_1_draw_adjusted <- sum(candy_distribution_adjusted$Proportion *
  candy_distribution_adjusted$Weightings)
cat("New expected value for 1 draw after drawing ten Red candies",
    "\n(without replacement):", expected_value_1_draw_adjusted, "\n")
New expected value for 1 draw after drawing ten Red candies 
(without replacement): 1.166667 
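
Checking this by hand (using the adjusted counts above): 90 candies remain, 15 of them Green worth 2 and 75 others worth 1.

# Hand check: (15 * 2 + 75 * 1) / 90 = 105/90
(15 * 2 + 75 * 1) / 90   # 1.166667, matching the R result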

Section 3: Calculate Z-Scores (1 Point)

This section is just nuts-and-bolts practice for creating and interpreting Z-Scores.

Step 1: Using R, calculate Z-Scores to complete the following table. I don’t use a z-score function. Instead, I just use R’s native arithmetic to do this. Recall that a Z score is just: (value – mean) / sd. To get the percentile, use the pnorm() function.

data <- data.frame(
  Mean = c(24, 100, 0, -15),
  StandardDeviation = c(4, 13, 1, 46),
  Value = c(27, 96, -0.35, 100)
)

# Calculate Z-Scores
data$ZScore <- (data$Value - data$Mean) / data$StandardDeviation

# Calculate Percentiles using the pnorm function
data$Percentile <- pnorm(data$ZScore)

# Print the updated data frame
print(data)
  Mean StandardDeviation  Value     ZScore Percentile
1   24                 4  27.00  0.7500000  0.7733726
2  100                13  96.00 -0.3076923  0.3791582
3    0                 1  -0.35 -0.3500000  0.3631693
4  -15                46 100.00  2.5000000  0.9937903
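
A small aside (not required by the lab): pnorm() also accepts mean and sd arguments directly, so a percentile can be computed without the intermediate z-score. For the first row of the table:

# Same percentile as pnorm(0.75), where 0.75 = (27 - 24) / 4
pnorm(27, mean = 24, sd = 4)   # 0.7733726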

Step 2: Create a vector and convert it to Z-scores. Create any vector of 20 values (build in some variance for best effect) for a ratio variable. Next, use arithmetic functions in R on the vector to calculate z scores, and save them to a new vector.

values <- rnorm(20, mean = 50, sd = 10)

values_mean <- mean(values)
values_sd <- sd(values)

# Use arithmetic functions to calculate z scores
z_scores_vector <- (values - values_mean) / values_sd

# Print the original values and their corresponding z scores
data.frame(OriginalValues = values, ZScores = z_scores_vector)
   OriginalValues    ZScores
1        47.57590  0.1246393
2        35.76870 -1.5132923
3        40.45875 -0.8626742
4        59.59374  1.7917915
5        51.17877  0.6244413
6        37.20838 -1.3135757
7        49.71483  0.4213586
8        41.67557 -0.6938726
9        56.79127  1.4030251
10       50.95910  0.5939680
11       51.53765  0.6742258
12       40.21648 -0.8962831
13       49.99136  0.4597193
14       47.73164  0.1462435
15       51.73596  0.7017352
16       54.49404  1.0843455
17       44.12552 -0.3540091
18       41.50675 -0.7172931
19       32.67445 -1.9425372
20       48.60965  0.2680441
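
For what it's worth, base R's scale() does the same centering and scaling in one call; a quick sketch, assuming the values vector above:

# scale() returns a one-column matrix, so coerce it back to a plain vector
z_alt <- as.numeric(scale(values))
all.equal(z_alt, z_scores_vector)   # TRUE: identical z-scores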

Step 3: Create a plot using the following code and paste it below. - How many values in your plot are above one standard deviation greater than the mean? plot(zscores, type="o", col="blue") (Feel free to change the color) (zscores is your vector of zscores) - Which values are more than two standard deviations above or below the mean?

plot(z_scores_vector, type="o", col="blue")

count_above_one_sd <- sum(z_scores_vector > 1)

# abs() catches values more than two SDs above OR below the mean
values_more_than_two_sd <- values[abs(z_scores_vector) > 2]

cat("Number of values above one SD:", count_above_one_sd, "\n")
Number of values above one SD: 3 
cat("Values more than two SDs from the mean:", values_more_than_two_sd, "\n")
Values more than two SDs from the mean:  
(The blank output means no value in this sample lies more than two SDs from the mean.)

Section 4: Predict the Weather (2 Points)

Here, we will upload and work with some fabricated Weather data. This data is taken from a month of observations in an unnamed coastal town. Each row represents one day, and the values are a reading taken at the same time each day.

Step 1. Download the Weather file from Blackboard onto your computer, then import it into R using the read_csv() function. - Inspect the file using head() and describe() (Note: ensure the psych package is loaded).

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(psych)

Attaching package: 'psych'

The following objects are masked from 'package:ggplot2':

    %+%, alpha
weather_data <- read_csv('WeatherLab.csv')
Rows: 31 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): Day, CloudCover, PrecipType, Anomalies
dbl  (6): Date, PrecipAmount(in), TempF, Humidity, WindSpeedKmH, Pressure
time (1): SunsetTime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(weather_data)
# A tibble: 6 × 11
   Date Day       CloudCover    PrecipType `PrecipAmount(in)` TempF Humidity
  <dbl> <chr>     <chr>         <chr>                   <dbl> <dbl>    <dbl>
1     6 Saturday  Mostly Cloudy Hail                      1      69       68
2    20 Saturday  Very Cloudy   Hail                      1.5    64       83
3    21 Sunday    Very Cloudy   Hail                      2      66       86
4    31 Wednesday Mostly Cloudy Hail                      1      71       84
5     1 Monday    Partly Cloudy None                      0      68       78
6     2 Tuesday   Clear         None                      0      87       47
# ℹ 4 more variables: WindSpeedKmH <dbl>, Pressure <dbl>, SunsetTime <time>,
#   Anomalies <chr>
describe(weather_data)
Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
                 vars  n    mean    sd  median trimmed   mad  min     max range
Date                1 31   16.00  9.09   16.00   16.00 11.86    1   31.00 30.00
Day*                2 31    4.10  2.07    4.00    4.12  2.97    1    7.00  6.00
CloudCover*         3 31    2.23  1.23    2.00    2.16  1.48    1    4.00  3.00
PrecipType*         4 31    2.29  0.69    2.00    2.36  1.48    1    3.00  2.00
PrecipAmount(in)    5 31    1.76  2.18    1.00    1.42  1.48    0    9.50  9.50
TempF               6 31   73.10  7.72   74.00   73.12  8.90   58   87.00 29.00
Humidity            7 31   63.90 20.75   68.00   64.20 23.72   28   98.00 70.00
WindSpeedKmH        8 31   10.39  4.20   11.00   10.36  2.97    2   22.00 20.00
Pressure            9 31 1010.06  4.58 1011.14 1009.97  6.81 1004 1016.64 12.64
SunsetTime         10 31     NaN    NA      NA     NaN    NA  Inf    -Inf  -Inf
Anomalies*         11 31    4.48  1.81    4.00    4.24  0.00    1   10.00  9.00
                  skew kurtosis   se
Date              0.00    -1.32 1.63
Day*             -0.04    -1.43 0.37
CloudCover*       0.31    -1.58 0.22
PrecipType*      -0.42    -0.97 0.12
PrecipAmount(in)  1.48     2.63 0.39
TempF            -0.08    -1.06 1.39
Humidity         -0.23    -1.33 3.73
WindSpeedKmH      0.22     0.45 0.75
Pressure          0.05    -1.73 0.82
SunsetTime          NA       NA   NA
Anomalies*        1.42     2.26 0.32

Step 2. Create a dummy variable for rainy days. A dummy variable is a variable that’s coded as either 1 or 0, depending on whether that observation meets certain criteria. In this case, we want to append a new variable that is 1 when the day was rainy, and 0 when it was not.

We will use the mutate() function from dplyr, which is similar to the transform() function we used in Lab 1. We will also use the ifelse command, which produces one answer if a certain condition is met, and another if it is not. The full code (also included in the skeleton) is as follows: weather %>% mutate(Rain = ifelse(PrecipType == 'Rain', 1, 0)) Hint: Don’t forget to assign the result of this operation to a dataframe. You can overwrite weather, or assign to a new df.

weather_data <- weather_data %>%
  mutate(Rain = ifelse(PrecipType == 'Rain', 1, 0))
head(weather_data)
# A tibble: 6 × 12
   Date Day       CloudCover    PrecipType `PrecipAmount(in)` TempF Humidity
  <dbl> <chr>     <chr>         <chr>                   <dbl> <dbl>    <dbl>
1     6 Saturday  Mostly Cloudy Hail                      1      69       68
2    20 Saturday  Very Cloudy   Hail                      1.5    64       83
3    21 Sunday    Very Cloudy   Hail                      2      66       86
4    31 Wednesday Mostly Cloudy Hail                      1      71       84
5     1 Monday    Partly Cloudy None                      0      68       78
6     2 Tuesday   Clear         None                      0      87       47
# ℹ 5 more variables: WindSpeedKmH <dbl>, Pressure <dbl>, SunsetTime <time>,
#   Anomalies <chr>, Rain <dbl>

Step 3. Calculate the percentage of days which are rainy. This is simply the mean of the new column you just created. (Think about why this might be).

percentage_rainy <- mean(weather_data$Rain) * 100

cat("Percentage of rainy days:", percentage_rainy, "%", "\n")
Percentage of rainy days: 41.93548 % 
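
Why the mean works (a one-line illustration of my own): summing a 0/1 dummy counts the 1s, and mean() divides that count by n, which is exactly the proportion of rainy days.

# e.g. two rainy days out of four observations: 2/4 = 0.5
mean(c(1, 0, 0, 1))   # 0.5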

Step 4. Assume the same rainy-day rate will hold over the next month, and that the next month is also 31 days. Use the binomial distribution to answer the following questions. (Refer to Table 9.3 in the Navarro book for the functions to do this in R.) 1. What is the probability that exactly 11 days will be rainy? 2. What is the probability that no more than 16 days will be rainy? 3. What is the number of rainy days such that it is equally likely that it will or will not rain at least that much? 4. What is the probability that between 10 and 17 days will be rainy? HINT: This will require subtracting one function from another.

p_rainy <- percentage_rainy / 100
n_days <- 31

# 1. Probability that exactly 11 days will be rainy
prob_11_rainy <- dbinom(11, size = n_days, prob = p_rainy)

# 2. Probability that no more than 16 days will be rainy
prob_at_most_16_rainy <- pbinom(16, size = n_days, prob = p_rainy)

# 3. The number of rainy days such that it is equally 
# likely that it will or will not rain at least that much
median_rainy_days <- qbinom(0.5, size = n_days, prob = p_rainy)

# 4. Probability that between 10 and 17 days will be rainy
prob_between_10_and_17 <- pbinom(17, size = n_days, prob = p_rainy) - 
  pbinom(9, size = n_days, prob = p_rainy)

cat("Probability exactly 11 days will be rainy:", prob_11_rainy, "\n")
Probability exactly 11 days will be rainy: 0.1133329 
cat("Probability no more than 16 days will be rainy:", prob_at_most_16_rainy, "\n")
Probability no more than 16 days will be rainy: 0.8979717 
cat("Number of rainy days for a 50% chance of at 
    least that many rainy days:", median_rainy_days, "\n")
Number of rainy days for a 50% chance of at 
    least that many rainy days: 13 
cat("Probability between 10 and 17 days will be rainy:", prob_between_10_and_17, "\n")
Probability between 10 and 17 days will be rainy: 0.8485221 
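
As an optional cross-check, question 4 can also be answered by summing the exact binomial probabilities for 10 through 17 rainy days, which should reproduce the subtraction result above:

# Equivalent to pbinom(17, ...) - pbinom(9, ...)
sum(dbinom(10:17, size = n_days, prob = p_rainy))   # 0.8485221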

Step 5. Investigate the relationship between Temperature and Humidity. Create a scatterplot of Temperature by Humidity. What would you say the relationship between these variables is, based on the scatterplot? But is this relationship true if we just look at Clear days? Filter by CloudCover == Clear and produce another scatterplot to investigate. Does the relationship hold, or is the relationship different for clear days?

# Scatterplot of Temperature by Humidity for the entire dataset
ggplot(weather_data, aes(x = TempF, y = Humidity)) +
  geom_point() +
  labs(title = "Scatterplot of Temperature by Humidity",
       x = "Temperature",
       y = "Humidity") +
  theme_minimal()

cat("For the entire dataset, the relationship between Temperature and Humidity is negative.")
For the entire dataset, the relationship between Temperature and Humidity is negative.
# Now, filter the data for clear days and create a scatterplot
clear_days <- weather_data %>% 
  filter(CloudCover == "Clear")

# Scatterplot of Temperature by Humidity for clear days
ggplot(clear_days, aes(x = TempF, y = Humidity)) +
  geom_point() +
  labs(title = "Scatterplot of Temperature by Humidity on Clear Days",
       x = "Temperature",
       y = "Humidity") +
  theme_minimal()

cat("For clear days, the relationship between Temperature and Humidity is \n
    still kind of negative, but very weak.")
For clear days, the relationship between Temperature and Humidity is 

    still kind of negative, but very weak.
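
To back up the visual impression with a number (optional; output not shown here), the Pearson correlations for the full data and for clear days only could be compared:

# cor() quantifies what the two scatterplots suggest
cor(weather_data$TempF, weather_data$Humidity)
cor(clear_days$TempF, clear_days$Humidity)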

Feedback

a. How long did this lab take you to complete?

3h+

b. How confident do you feel about the statistical concepts covered in this lab?

Very

c. How comfortable are you with the R code used in this lab?

Very

d. What sections, if any, did you find particularly frustrating/challenging?

None

e. What sections, if any, did you find particularly useful/illuminating?

CLT part, Section 1 Step 3 last question

The codes are also publicly available at: https://rpubs.com/AlanHuang/EPS700_Lab2