rm(list = ls())
EPS 700 Lab 2
Section 1: Simulate Dice Rolls (2 points)
Step 1: Create a Six-Sided Die in R. (.25 points) To do this, save a vector with six values (1,2,3,4,5,6) to a label you will recognize. Paste your code below.
-Manually calculate the expected value of one roll of this die here. Optionally, you can check whether you are right by using the weighted_mean() function, with the die vector and a vector containing the probabilities as the two arguments. This is documented in the lab skeleton.
- What is the expected value of 10 rolls?
set.seed(1)
#create a six-sided die#
<- c(1, 2, 3, 4, 5, 6)
die
#optional: check EV#
<-c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6)
probsdie # weighted.mean(die,probsdie)
cat("The expected value of one roll:", weighted.mean(die,probsdie), "\n")
The expected value of one roll: 3.5
#roll die 10 times#
<- sample(die, 10, replace = TRUE)
tenth cat("The expected value of 10 rolls:", mean(tenth), "\n")
The expected value of 10 rolls: 3
Step 2: Roll the die. (.75 points) This can be done using the sample function. The sample() function takes a sample using arguments for the following: the vector or other element to draw from, the number of items to choose, a true/false indicator for whether the draw is with replacement, and an optional weighting argument which we will not use. For this exercise, draw from the vector you created in step 1. - Roll the die 5 times and save the results to a vector. Calculate the mean, and paste your code and results below. -Repeat the exercise for rolling the die 10 times, 1000 times, and 100000 times. Fill in the table below. Save each to a different value you will recognize.
#roll die 5 times#
<- sample(die, 5, replace = TRUE)
fifth cat("The expected value of 5 rolls:", mean(fifth), "\n")
The expected value of 5 rolls: 3.8
<- sample(die, 10, replace = TRUE)
tenth cat("The expected value of 10 rolls:", mean(tenth), "\n")
The expected value of 10 rolls: 3.7
<- sample(die, 1000, replace = TRUE)
thousandth cat("The expected value of 1000 rolls:", mean(thousandth), "\n")
The expected value of 1000 rolls: 3.516
<- sample(die, 100000, replace = TRUE)
hundredthousandth cat("The expected value of 100000 rolls:", mean(hundredthousandth), "\n")
The expected value of 100000 rolls: 3.50495
Step 3: Tabulate and graph the largest sample. (1 point) First, you will put the results of the 100000-time sample into a table using the data.frame() function and the table() function in unison. The basic logic is that we are telling R to turn the sample we saved previously into a table, then store that table as a data frame. This allows us to use it in graphs and plots! Both functions have many possible arguments, but we will be using the default values and will not need to specify any. You can review the code in the Lab Skeleton file. - Once you’ve created and saved your data frame, use the head() command to inspect it. Paste results here (use Courier size 9 for formatting!) - Use any graphing/plotting code with which you are comfortable to create a histogram of the results. Paste the code and the histogram below. How would you characterize the distribution? - Finally, conceptually (no need to code): If we were to take a sample of 30 dice rolls but save only the mean, and do that 10,000 times, what would the distribution of those saved means look like? Would it be the same or different from the histogram above?
#make a table which shows the distribution#
<- data.frame(table(hundredthousandth))
thotab head(thotab)
hundredthousandth Freq
1 1 16685
2 2 16573
3 3 16522
4 4 16737
5 5 16748
6 6 16735
#bar chart#
library(ggplot2)
ggplot(thotab, aes(hundredthousandth)) +
geom_bar(aes(weight = Freq))
cat("The distribution looks like a uniform distribution", "\n")
The distribution looks like a uniform distribution
# Simulate rolling the die, calculate means, and store them
<- numeric(10000)
sample_means for (i in 1:10000) {
<- sample(die, 30, replace = TRUE)
rolls <- mean(rolls)
sample_means[i]
}
# Plot the distribution of sample means
ggplot(data = data.frame(sample_means), aes(x = sample_means)) +
geom_histogram(binwidth = 0.1, color = "black", fill = "gray") +
theme_minimal() +
labs(title = "Distribution of Sample Means (30 dice rolls, 10,000 samples)",
x = "Mean of Sample",
y = "Frequency")
cat("The distribution looks like a normal distribution because of Central Limit Theorem (CLT).
\nRegardless of the original distribution of the population
\n(in this case, uniform distribution for dice rolls),
\nthe sampling distribution of the sample mean will
\napproach a normal distribution as the sample size increases.")
The distribution looks like a normal distribution because of Central Limit Theorem (CLT).
Regardless of the original distribution of the population
(in this case, uniform distribution for dice rolls),
the sampling distribution of the sample mean will
approach a normal distribution as the sample size increases.
Section 2: Create and Explore a Probability Distribution (2 Points)
For this section, we’ll pretend we’ve opened a bag of multi-colored candies and counted out 100. The breakdown is as follows: Blue: 25 // Green: 15 // Red: 10 // Orange: 38 // Purple: 12. We’ll try to understand the frequency of the candies and what that implies about the proportion of all candies made.
Step 1: Create a probability distribution table. We will do this by creating two vectors: one for the color names, one for the values. Then, we will use the data.frame() function to combine them in a table. Paste your code and your table below.
<- c("Blue", "Green", "Red", "Orange", "Purple")
colors <- c(25, 15, 10, 38, 12)
counts <- counts / sum(counts)
proportions <- data.frame(Color = colors, Count = counts, Proportion = proportions)
candy_distribution
print(candy_distribution)
Color Count Proportion
1 Blue 25 0.25
2 Green 15 0.15
3 Red 10 0.10
4 Orange 38 0.38
5 Purple 12 0.12
Step 2: Create a bar chart of your table. You can use any function that you are familiar with, but my example will use GGPlot. Note the use of the “identity” stat, which tells GGPlot to use the actual values in the table instead of a count. Paste your code and your bar chart below.
ggplot(candy_distribution, aes(x = Color, y = Count, fill = Color)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(title = "Candy Color Distribution", x = "Color", y = "Count") +
scale_fill_manual(values = c("Blue" = "blue", "Green" = "green",
"Red" = "red", "Orange" = "orange", "Purple" = "purple"))
Step 3: Imagine you like Green candies twice as much as each of the others, and you want to quantify this preference. To do so, add a column to your probability table containing value weightings for each color. You can do this by creating a vector of the values, where all values are 1, except Green which is 2.
<- rep(1, length(colors)) # Start with all weightings as 1
weightings == "Green"] <- 2 # Set Green's weighting to 2
weightings[colors
$Weightings <- weightings
candy_distributionprint(candy_distribution)
Color Count Proportion Weightings
1 Blue 25 0.25 1
2 Green 15 0.15 2
3 Red 10 0.10 1
4 Orange 38 0.38 1
5 Purple 12 0.12 1
Step 4: Using this valuation, calculate the expected values of 1, 10, and 50 draws from the bag (with replacement). You can use the vectors to do this easily, by multiplying them. (Hint: Remember that expected value is the sum of all possible (probability x value). The expected value of one draw, then, is the sum of the product of the weighted values and the probabilities. Hint 2: If all candies were the same value, the expected value of a draw would be that value. Expect a result that’s a bit higher than 1.)
<- sum(candy_distribution$Proportion * candy_distribution$Weightings)
expected_value_1_draw
<- expected_value_1_draw * 10
expected_value_10_draws <- expected_value_1_draw * 50
expected_value_50_draws
cat("Expected value of 1 draw: ", expected_value_1_draw, "\n")
Expected value of 1 draw: 1.15
cat("Expected value of 10 draws: ", expected_value_10_draws, "\n")
Expected value of 10 draws: 11.5
cat("Expected value of 50 draws: ", expected_value_50_draws, "\n")
Expected value of 50 draws: 57.5
Step 6: Practice conditional probability/draw without replacement. You draw ten candies, and to your surprise they are all red. Calculate the new expected value of one draw from the 90 remaining candies in the bag, assuming the values associated with each color remain the same. Feel free to use R or manually calculate this. This is a bit of an R puzzle. Two hints: re-use code you’ve used already, and don’t forget that you can include arithmetic expressions as the elements of a vector.
<- candy_distribution
candy_distribution_adjusted $Count[candy_distribution_adjusted$Color == "Red"] <-
candy_distribution_adjusted$Count[candy_distribution_adjusted$Color == "Red"] - 10
candy_distribution_adjusted$Proportion <-
candy_distribution_adjusted$Count / sum(candy_distribution_adjusted$Count)
candy_distribution_adjusted
<- sum(candy_distribution_adjusted$Proportion
expected_value_1_draw_adjusted *candy_distribution_adjusted$Weightings)
cat("New expected value for 1 draw after drawing \n ten Red candies (without replacement):",
"\n") expected_value_1_draw_adjusted,
New expected value for 1 draw after drawing
ten Red candies (without replacement): 1.166667
Section 3: Calculate Z-Scores (1 Point)
This section is just nuts-and-bolts practice for creating and interpreting Z-Scores.
Step 1: Using R, calculate Z-Scores to complete the following table. I don’t use a z-score function. Instead, I just use R’s native calculation to do this. Recall that a Z score is just: mean – value / sd. To get the percentile, use the pnorm() function.
<- data.frame(
data Mean = c(24, 100, 0, -15),
StandardDeviation = c(4, 13, 1, 46),
Value = c(27, 96, -0.35, 100)
)
# Calculate Z-Scores
$ZScore <- (data$Value - data$Mean) / data$StandardDeviation
data
# Calculate Percentiles using the pnorm function
$Percentile <- pnorm(data$ZScore)
data
# Print the updated data frame
print(data)
Mean StandardDeviation Value ZScore Percentile
1 24 4 27.00 0.7500000 0.7733726
2 100 13 96.00 -0.3076923 0.3791582
3 0 1 -0.35 -0.3500000 0.3631693
4 -15 46 100.00 2.5000000 0.9937903
Step 2: Create a vector and convert it to Z-scores. Create any vector of 20 values (build in some variance for best effect) for a ratio variable. Next, use arithmetic functions in R on the vector to calculate z scores, and save them to a new vector.
<- rnorm(20, mean = 50, sd = 10)
values
<- mean(values)
values_mean <- sd(values)
values_sd
# Use arithmetic functions to calculate z scores
<- (values - values_mean) / values_sd
z_scores_vector
# Print the original values and their corresponding z scores
data.frame(OriginalValues = values, ZScores = z_scores_vector)
OriginalValues ZScores
1 47.57590 0.1246393
2 35.76870 -1.5132923
3 40.45875 -0.8626742
4 59.59374 1.7917915
5 51.17877 0.6244413
6 37.20838 -1.3135757
7 49.71483 0.4213586
8 41.67557 -0.6938726
9 56.79127 1.4030251
10 50.95910 0.5939680
11 51.53765 0.6742258
12 40.21648 -0.8962831
13 49.99136 0.4597193
14 47.73164 0.1462435
15 51.73596 0.7017352
16 54.49404 1.0843455
17 44.12552 -0.3540091
18 41.50675 -0.7172931
19 32.67445 -1.9425372
20 48.60965 0.2680441
Step 3: Create a plot using the following code and paste it below. - How many values in your plot are above one standard deviation greater than the mean? plot(zscores, type=“o”,col=“blue”) (Feel free to change the color) (zscores is your vector of zscores) - Which values are more than two standard deviations above or below the mean?
plot(z_scores_vector, type="o", col="blue")
<- sum(z_scores_vector > 1)
count_above_one_sd
<- values[z_scores_vector > 2 | z_scores_vector < -2]
values_more_than_two_sd
cat("Number of values above one SD:", count_above_one_sd, "\n")
Number of values above one SD: 3
cat("Values more than two SDs from the mean:", values_more_than_two_sd, "\n")
Values more than two SDs from the mean:
Section 4: Predict the Weather (2 Points)
Here, we will upload and work with some fabricated Weather data. This data is taken from a month of observations in an unnamed coastal town. Each row represents one day, and the values are a reading taken at the same time each day.
Step 1. Download the Weather file from Blackboard onto your computer, then import it into R using the Read_csv() function. - Inspect the file using head() and describe() (Note: ensure the Psych package is loaded).
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ lubridate 1.9.3 ✔ tibble 3.2.1
✔ purrr 1.0.2 ✔ tidyr 1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(psych)
Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':
%+%, alpha
<- read_csv('WeatherLab.csv') weather_data
Rows: 31 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Day, CloudCover, PrecipType, Anomalies
dbl (6): Date, PrecipAmount(in), TempF, Humidity, WindSpeedKmH, Pressure
time (1): SunsetTime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(weather_data)
# A tibble: 6 × 11
Date Day CloudCover PrecipType `PrecipAmount(in)` TempF Humidity
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 6 Saturday Mostly Cloudy Hail 1 69 68
2 20 Saturday Very Cloudy Hail 1.5 64 83
3 21 Sunday Very Cloudy Hail 2 66 86
4 31 Wednesday Mostly Cloudy Hail 1 71 84
5 1 Monday Partly Cloudy None 0 68 78
6 2 Tuesday Clear None 0 87 47
# ℹ 4 more variables: WindSpeedKmH <dbl>, Pressure <dbl>, SunsetTime <time>,
# Anomalies <chr>
describe(weather_data)
Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
vars n mean sd median trimmed mad min max range
Date 1 31 16.00 9.09 16.00 16.00 11.86 1 31.00 30.00
Day* 2 31 4.10 2.07 4.00 4.12 2.97 1 7.00 6.00
CloudCover* 3 31 2.23 1.23 2.00 2.16 1.48 1 4.00 3.00
PrecipType* 4 31 2.29 0.69 2.00 2.36 1.48 1 3.00 2.00
PrecipAmount(in) 5 31 1.76 2.18 1.00 1.42 1.48 0 9.50 9.50
TempF 6 31 73.10 7.72 74.00 73.12 8.90 58 87.00 29.00
Humidity 7 31 63.90 20.75 68.00 64.20 23.72 28 98.00 70.00
WindSpeedKmH 8 31 10.39 4.20 11.00 10.36 2.97 2 22.00 20.00
Pressure 9 31 1010.06 4.58 1011.14 1009.97 6.81 1004 1016.64 12.64
SunsetTime 10 31 NaN NA NA NaN NA Inf -Inf -Inf
Anomalies* 11 31 4.48 1.81 4.00 4.24 0.00 1 10.00 9.00
skew kurtosis se
Date 0.00 -1.32 1.63
Day* -0.04 -1.43 0.37
CloudCover* 0.31 -1.58 0.22
PrecipType* -0.42 -0.97 0.12
PrecipAmount(in) 1.48 2.63 0.39
TempF -0.08 -1.06 1.39
Humidity -0.23 -1.33 3.73
WindSpeedKmH 0.22 0.45 0.75
Pressure 0.05 -1.73 0.82
SunsetTime NA NA NA
Anomalies* 1.42 2.26 0.32
Step 2. Create a dummy variable for rainy days. A dummy variable is a variable that’s coded as either 1 or 0, depending on whether that observation meets certain criteria. In this case, we want to append a new variable that is 1 when the day was rainy, and 0 when it was not.
We will use the mutate() function from dplyr, which is similar to the transform() function we used in Lab 1. We will also use the ifelse command, which proudces one answer if a certain condition is met, and another if it is not. The full code (also included in the skeleton) is more as follows: weather %>% mutate(Rain = ifelse(PrecipType == ‘Rain’, 1,0)) Hint: Don’t forget to assign the result of this operation to a dataframe. You can overwrite weather, or assign to a new df.
<- weather_data %>%
weather_data mutate(Rain = ifelse(PrecipType == 'Rain', 1, 0))
head(weather_data)
# A tibble: 6 × 12
Date Day CloudCover PrecipType `PrecipAmount(in)` TempF Humidity
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 6 Saturday Mostly Cloudy Hail 1 69 68
2 20 Saturday Very Cloudy Hail 1.5 64 83
3 21 Sunday Very Cloudy Hail 2 66 86
4 31 Wednesday Mostly Cloudy Hail 1 71 84
5 1 Monday Partly Cloudy None 0 68 78
6 2 Tuesday Clear None 0 87 47
# ℹ 5 more variables: WindSpeedKmH <dbl>, Pressure <dbl>, SunsetTime <time>,
# Anomalies <chr>, Rain <dbl>
Step 3. Calculate the percentage of days which are rainy. This is simply the mean of the new column you just created. (Think about why this might be).
<- mean(weather_data$Rain) * 100
percentage_rainy
cat("Percentage of rainy days:", percentage_rainy, "%", "\n")
Percentage of rainy days: 41.93548 %
Step 4. Assume the same rainy-day rate will hold over the next month, and that the next month is also 31 days. Use the binomial distribution to answer the following questions. (Refer to Table 9.3 in the Navarro book for the functions to do this in R. 1. What is the probability that exactly 11 days will be rainy? 2. What is the probability that no more than 16 days will be rainy? 3. What is the number of rainy days such that it is equally likely that it will or will not rain at least that much? 4. What is the probability that between 10 and 17 days will be rainy? HINT: This will require subtracting one function from another.
<- percentage_rainy / 100
p_rainy <- 31
n_days
# 1. Probability that exactly 11 days will be rainy
<- dbinom(11, size = n_days, prob = p_rainy)
prob_11_rainy
# 2. Probability that no more than 16 days will be rainy
<- pbinom(16, size = n_days, prob = p_rainy)
prob_at_most_16_rainy
# 3. The number of rainy days such that it is equally
# likely that it will or will not rain at least that much
<- qbinom(0.5, size = n_days, prob = p_rainy)
median_rainy_days
# 4. Probability that between 10 and 17 days will be rainy
<- pbinom(17, size = n_days, prob = p_rainy) -
prob_between_10_and_17 pbinom(9, size = n_days, prob = p_rainy)
cat("Probability exactly 11 days will be rainy:", prob_11_rainy, "\n")
Probability exactly 11 days will be rainy: 0.1133329
cat("Probability no more than 16 days will be rainy:", prob_at_most_16_rainy, "\n")
Probability no more than 16 days will be rainy: 0.8979717
cat("Number of rainy days for a 50% chance of at
least that many rainy days:", median_rainy_days, "\n")
Number of rainy days for a 50% chance of at
least that many rainy days: 13
cat("Probability between 10 and 17 days will be rainy:", prob_between_10_and_17, "\n")
Probability between 10 and 17 days will be rainy: 0.8485221
Step 5. Investigate the relationship between Temperature and Wind Speed. Create a scatterplot of Temperature by Humidity. What would you say the relationship between these variables is, based on the scatterplot? But is this relationship true if we just look at Clear days? Filter by CloudCover == Clear and produce another scatterplot to investigate. Does the relationship hold, or is the relationship different for clear days?
# Scatterplot of Temperature by Humidity for the entire dataset
ggplot(weather_data, aes(x = TempF, y = Humidity)) +
geom_point() +
labs(title = "Scatterplot of Temperature by Humidity",
x = "Temperature",
y = "Humidity") +
theme_minimal()
cat("For the entire dataset, the relationship between Temperature and Humidity is negative.")
For the entire dataset, the relationship between Temperature and Humidity is negative.
# Now, filter the data for clear days and create a scatterplot
<- weather_data %>%
clear_days filter(CloudCover == "Clear")
# Scatterplot of Temperature by Humidity for clear days
ggplot(clear_days, aes(x = TempF, y = Humidity)) +
geom_point() +
labs(title = "Scatterplot of Temperature by Humidity on Clear Days",
x = "Temperature",
y = "Humidity") +
theme_minimal()
cat("For clear days, the relationship between Temperature and Humidity is \n
still kind of negative, but very weak.")
For clear days, the relationship between Temperature and Humidity is
still kind of negative, but very weak.
Feedback
a. How long did this lab take you to complete?
3h+
b. How confident do you feel about the statistical concepts covered in this lab?
Very
c. How comfortable are you with the R code used in this lab?
Very
d. What sections, if any, did you find particularly frustrating/challenging?
None
e. What sections, if any, did you find particularly useful/illuminating?
CLT part, Section 1 Step 3 last question
The codes are also publicly available at: https://rpubs.com/AlanHuang/EPS700_Lab2