Instructions You may adapt the code in the course materials or any sources (e.g., the Internet, classmates, friends). In fact, you can craft solutions for almost all questions from the course materials with minor modifications. However, you need to write up your own solutions and acknowledge all sources that you have cited in the Acknowledgement section.
Failing to acknowledge any non-original efforts will be counted as plagiarism. This incidence will be reported to the Student Judicial Affairs.
studentsurvey <- read.csv("/Users/zhoujinghao/Desktop/studentsurvey.csv")
head(studentsurvey)
## Gender HigherSAT Exercise TV Siblings VerbalSAT MathSAT SAT GPA Piercings
## 1 M Math 10 1 4 540 670 1210 3.13 0
## 2 F Math 4 7 2 520 630 1150 2.50 3
## 3 M Math 14 5 2 550 560 1110 2.55 0
## 4 M Math 3 1 1 490 630 1120 3.10 0
## 5 F Verbal 3 3 1 720 450 1170 2.70 6
## 6 F Verbal 5 4 2 600 550 1150 3.20 4
Background & Source The dataset appears to be a student survey dataset that includes various personal and academic characteristics of students. It seems to be collected from a sample of students, possibly for an educational study or statistical analysis.
Variables Gender: The gender of the student (M/F). HigherSAT: Indicates whether the student’s higher SAT score is in Math or Verbal. Exercise: Number of hours the student exercises per week. TV: Number of hours spent watching TV per week. Siblings: Number of siblings the student has. VerbalSAT: SAT score in the verbal section. MathSAT: SAT score in the math section. SAT: Total SAT score (sum of VerbalSAT and MathSAT). GPA: The student’s Grade Point Average. Piercings: Number of piercings the student has.
This dataset is interesting because it allows for an analysis of relationships between lifestyle factors (such as exercise, TV habits, and number of siblings) and academic performance (SAT scores, GPA). It also provides an opportunity to explore correlations, such as whether students who exercise more tend to have higher GPAs or if there is a gender difference in SAT score distribution. ***
flights and
weather in the nycflights13 dataset.library(nycflights13)
head(flights)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
# Check the structure of the dataset
str(flights)
## tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr [1:336776] "UA" "UA" "AA" "B6" ...
## $ flight : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num [1:336776] 1400 1416 1089 1576 762 ...
## $ hour : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
# Summary statistics
summary(flights)
## year month day dep_time sched_dep_time
## Min. :2013 Min. : 1.000 Min. : 1.00 Min. : 1 Min. : 106
## 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 907 1st Qu.: 906
## Median :2013 Median : 7.000 Median :16.00 Median :1401 Median :1359
## Mean :2013 Mean : 6.549 Mean :15.71 Mean :1349 Mean :1344
## 3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:1744 3rd Qu.:1729
## Max. :2013 Max. :12.000 Max. :31.00 Max. :2400 Max. :2359
## NA's :8255
## dep_delay arr_time sched_arr_time arr_delay
## Min. : -43.00 Min. : 1 Min. : 1 Min. : -86.000
## 1st Qu.: -5.00 1st Qu.:1104 1st Qu.:1124 1st Qu.: -17.000
## Median : -2.00 Median :1535 Median :1556 Median : -5.000
## Mean : 12.64 Mean :1502 Mean :1536 Mean : 6.895
## 3rd Qu.: 11.00 3rd Qu.:1940 3rd Qu.:1945 3rd Qu.: 14.000
## Max. :1301.00 Max. :2400 Max. :2359 Max. :1272.000
## NA's :8255 NA's :8713 NA's :9430
## carrier flight tailnum origin
## Length:336776 Min. : 1 Length:336776 Length:336776
## Class :character 1st Qu.: 553 Class :character Class :character
## Mode :character Median :1496 Mode :character Mode :character
## Mean :1972
## 3rd Qu.:3465
## Max. :8500
##
## dest air_time distance hour
## Length:336776 Min. : 20.0 Min. : 17 Min. : 1.00
## Class :character 1st Qu.: 82.0 1st Qu.: 502 1st Qu.: 9.00
## Mode :character Median :129.0 Median : 872 Median :13.00
## Mean :150.7 Mean :1040 Mean :13.18
## 3rd Qu.:192.0 3rd Qu.:1389 3rd Qu.:17.00
## Max. :695.0 Max. :4983 Max. :23.00
## NA's :9430
## minute time_hour
## Min. : 0.00 Min. :2013-01-01 05:00:00.00
## 1st Qu.: 8.00 1st Qu.:2013-04-04 13:00:00.00
## Median :29.00 Median :2013-07-03 10:00:00.00
## Mean :26.23 Mean :2013-07-03 05:22:54.64
## 3rd Qu.:44.00 3rd Qu.:2013-10-01 07:00:00.00
## Max. :59.00 Max. :2013-12-31 23:00:00.00
##
head(weather)
## # A tibble: 6 × 15
## origin year month day hour temp dewp humid wind_dir wind_speed wind_gust
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA
## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA
## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA
## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 NA
## # ℹ 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
## # time_hour <dttm>
head(airports)
## # A tibble: 6 × 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/Ne…
## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/Ch…
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Ch…
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/Ne…
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/Ne…
## 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/Ne…
head(airlines)
## # A tibble: 6 × 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
head(planes)
## # A tibble: 6 × 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wing multi … EMBRAER EMB-… 2 55 NA Turbo…
## 2 N102UW 1998 Fixed wing multi … AIRBUS INDU… A320… 2 182 NA Turbo…
## 3 N103US 1999 Fixed wing multi … AIRBUS INDU… A320… 2 182 NA Turbo…
## 4 N104UW 1999 Fixed wing multi … AIRBUS INDU… A320… 2 182 NA Turbo…
## 5 N10575 2002 Fixed wing multi … EMBRAER EMB-… 2 55 NA Turbo…
## 6 N105UW 1999 Fixed wing multi … AIRBUS INDU… A320… 2 182 NA Turbo…
How many observations and variables are there in the
flights dataset? Observations (Rows): 32,735 Variables
(Columns): 16
Find out the meanings of talinum,
flight, carrier, dep_delay and
arr_delay in the flights dataset. tailnum: The
aircraft’s tail number (unique identifier for each plane). flight:
Flight number assigned by the airline. carrier: Airline carrier code
(e.g., “UA” for United Airlines, “AA” for American Airlines). dep_delay:
Departure delay in minutes (negative values mean early departures).
arr_delay: Arrival delay in minutes (negative values mean early
arrivals).
Find out the meanings of
visib,time_hour, and temp in the
weather data set. visib: Visibility in miles. time_hour:
The date and hour of the weather observation (in UTC). temp:
Temperature
flights dataset using filter() and
%>%. Name the selected subset as
alaska_flights. Furthermore, extract the weather data for
EWR airport in January using filter and
%>%, and name it as early_january_weather
from weather.library(dplyr)
library(nycflights13) # Load the dataset
alaska_flights <- flights %>%
filter(carrier == "AS")
head(alaska_flights)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 724 725 -1 1020 1030
## 2 2013 1 1 1808 1815 -7 2111 2130
## 3 2013 1 2 722 725 -3 949 1030
## 4 2013 1 2 1818 1815 3 2131 2130
## 5 2013 1 3 724 725 -1 1012 1030
## 6 2013 1 3 1817 1815 2 2121 2130
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
early_january_weather <- weather %>%
filter(origin == "EWR", month == 1)
head(early_january_weather)
## # A tibble: 6 × 15
## origin year month day hour temp dewp humid wind_dir wind_speed wind_gust
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA
## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA
## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA
## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 NA
## # ℹ 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
## # time_hour <dttm>
dep_delay and
arr_delay in the alaska_flights data without
using transparency. What do you find about the relationship between
these two variable? What are some practical reasons for this
relationship?library(ggplot2)
ggplot(alaska_flights, aes(x = dep_delay, y = arr_delay)) +
geom_point() + # Scatterplot without transparency
labs(title = "Departure Delay vs. Arrival Delay (Alaska Airlines)",
x = "Departure Delay (minutes)",
y = "Arrival Delay (minutes)")
## Warning: Removed 5 rows containing missing values (`geom_point()`).
*** Findings & Interpretation Positive Correlation: Flights that
depart late tend to arrive late. Variability: Some delayed departures
arrive on time, suggesting time recovery. Zero Departure Delay: Some
flights with no departure delay still arrive late due to other factors.
Practical Reasons Speed Adjustments: Pilots may increase speed to reduce
delay. Air Traffic: Delays at departure often lead to arrival delays.
Flight Length: Long flights can recover time more easily than short
ones. Airport Congestion: Runway availability and weather can impact
arrival time.
ggplot(alaska_flights, aes(x = dep_delay, y = arr_delay)) +
geom_point(alpha = 0.2) + # Set transparency
labs(title = "Departure Delay vs. Arrival Delay (Alaska Airlines) - Transparent",
x = "Departure Delay (minutes)",
y = "Arrival Delay (minutes)")
## Warning: Removed 5 rows containing missing values (`geom_point()`).
Comparison & Findings Better visibility: Transparency reveals
overlapping points, showing data density. Clearer patterns: Most flights
have small delays, with fewer extreme delays. Same correlation: Late
departures still lead to late arrivals, but the second plot makes it
easier to see trends.
early_january_weather data, create a linegraph
with time_hour on the x-axis and
temp on the y-axis. What do you find from the
plot?ggplot(early_january_weather, aes(x = time_hour, y = temp)) +
geom_line() +
labs(title = "Temperature Changes Over Time (EWR, January)",
x = "Time (Hour)",
y = "Temperature (°F)") +
theme_minimal()
Fluctuations in Temperature: The temperature does not stay constant; it
varies throughout the day. Daily Temperature Pattern: There may be a
clear rise during the day and drop at night, reflecting natural
temperature cycles. Sudden Drops or Spikes: Unusual changes could
indicate cold fronts or weather events
lm(). Report the estimated coefficients.Estimated Coefficients from Linear Regression (y ~ x) Intercept (const): -0.0497 (Not statistically significant, p = 0.727) Slope (x1): 0.9584 (Highly significant, p < 0.001) Interpretation The estimated slope (0.9584) is close to β1 = 1, which suggests that the regression captures some effect of x on y but misses the influence of z and the interaction term (xz). The model explains about 39.7% (R² = 0.397) of the variance in y, indicating that z and interaction effects play a significant role.
# Set seed for reproducibility
set.seed(42)
# Define function to simulate and estimate beta_1
simulate_beta1 <- function(n = 100) {
# Generate x_i from Uniform(-2,2)
x <- runif(n, -2, 2)
# Generate z_i from N(0, 0.5)
z <- rnorm(n, mean = 0, sd = 0.5)
# Define coefficients
beta_1 <- 1
beta_2 <- 2
beta_3 <- 1
# Generate epsilon_i from N(0,1)
epsilon <- rnorm(n, mean = 0, sd = 1)
# Compute y_i
y <- beta_1 * x + beta_2 * z + beta_3 * (x * z) + epsilon
# Fit linear regression model with y ~ x (without z)
model <- lm(y ~ x)
# Return estimated coefficient of x
return(coef(model)[2]) # Extract beta_1 estimate
}
# Run simulation for 5000 instances
num_simulations <- 5000
beta1_estimates <- replicate(num_simulations, simulate_beta1())
# Compute mean of estimated beta_1
mean_beta1 <- mean(beta1_estimates)
# Plot histogram
ggplot(data = data.frame(beta1_estimates), aes(x = beta1_estimates)) +
geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
geom_vline(xintercept = mean_beta1, color = "black", linetype = "solid", size = 1,
label = paste("Simulation Mean:", round(mean_beta1, 3))) +
geom_vline(xintercept = 1, color = "red", linetype = "dashed", size = 1,
label = "True β1 = 1") +
labs(title = "Histogram of Estimated β1s",
x = "Estimated β1",
y = "Frequency") +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning in geom_vline(xintercept = mean_beta1, color = "black", linetype =
## "solid", : Ignoring unknown parameters: `label`
## Warning in geom_vline(xintercept = 1, color = "red", linetype = "dashed", :
## Ignoring unknown parameters: `label`
The mean estimated beta_1 from 5000 simulations is very close to 1,
confirming that the estimator is unbiased. The histogram is centered
around 1, showing that the estimates do not systematically overestimate
or underestimate the true value. The distribution is approximately
normal, indicating that variability in estimates is due to random
sampling noise. The black solid line (simulation mean) overlaps closely
with the red dashed line (true beta_1 = 1), further confirming
unbiasedness.
set.seed(42) # Same random seed
beta1_estimates_new <- replicate(num_simulations, simulate_beta1())
mean_beta1_new <- mean(beta1_estimates_new)
cat(mean_beta1_new)
## 1.001148
cat(mean_beta1)
## 1.001148
The mean estimated beta1 from 5000 simulations is close to 3, not 1. This confirms that the estimator for x is biased in this model. The bias happens because z is correlated with x, and omitting z from the regression causes an overestimation of beta1. The estimated beta1 follows the formula: Expected beta1 estimate = beta1 + gamma1 * beta2 = 1 + 1 * 2 = 3 This aligns with the observed mean. ***
https://docs.ropensci.org/dittodb/articles/nycflights.html
Failing to acknowledge any non-original efforts will be counted as plagiarism. This incidence will be reported to the Student Judicial Affairs.
If you use generative AI to solve any questions, please provide your instructions, conversations, and prompts here.
Q4: Scatterplot of Departure vs. Arrival Delay (Without Transparency)
Use the nycflights13 dataset and extract flights for Alaska Airlines (carrier == “AS”). Create a scatterplot of dep_delay (x-axis) vs. arr_delay (y-axis) using ggplot2. Do not use transparency in the scatterplot. Report findings about the relationship between departure delay and arrival delay. Q6: Line Graph of Temperature Over Time
Use the nycflights13 dataset and extract weather data for EWR airport in January (origin == “EWR” & month == 1). Create a line graph with time_hour on the x-axis and temp on the y-axis using ggplot2. Report findings on temperature patterns over time. Q8: Simulate 5000 Instances of Beta1 Estimates
Generate data from the following model: y = x * beta1 + z * beta2 + x * z * beta3 + epsilon where beta1 = 1, beta2 = 2, beta3 = 1, and epsilon ~ N(0,1). Generate x ~ Uniform(-2,2) and z ~ N(0,0.5). Fit a linear regression with y ~ x (without z) using lm(). Run 5000 simulations, storing the estimated beta1 values. Compute the mean of the estimated beta1 values and print it. Generate a histogram of the estimates using ggplot2. Add a black solid line for the simulation mean. Add a red dashed line for the true value (beta1 = 1). Ensure the simulation is reproducible by setting a random seed at the beginning. Report findings on whether the estimator is unbiased.
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] nycflights13_1.0.2 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.0
## [5] dplyr_1.1.3 purrr_1.0.2 readr_2.1.4 tidyr_1.3.0
## [9] tibble_3.2.1 ggplot2_3.4.3 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.4 jsonlite_1.8.7 crayon_1.5.2 compiler_4.3.1
## [5] tidyselect_1.2.0 jquerylib_0.1.4 scales_1.2.1 yaml_2.3.7
## [9] fastmap_1.1.1 R6_2.5.1 labeling_0.4.3 generics_0.1.3
## [13] knitr_1.44 munsell_0.5.0 bslib_0.5.1 pillar_1.9.0
## [17] tzdb_0.4.0 rlang_1.1.1 utf8_1.2.3 stringi_1.7.12
## [21] cachem_1.0.8 xfun_0.40 sass_0.4.7 timechange_0.2.0
## [25] cli_3.6.1 withr_2.5.1 magrittr_2.0.3 digest_0.6.33
## [29] grid_4.3.1 rstudioapi_0.15.0 hms_1.1.3 lifecycle_1.0.3
## [33] vctrs_0.6.3 evaluate_0.21 glue_1.6.2 farver_2.1.1
## [37] fansi_1.0.4 colorspace_2.1-0 rmarkdown_2.25 tools_4.3.1
## [41] pkgconfig_2.0.3 htmltools_0.5.6