Instructions You may adapt the code in the course materials or any sources (e.g., the Internet, classmates, friends). In fact, you can craft solutions for almost all questions from the course materials with minor modifications. However, you need to write up your own solutions and acknowledge all sources that you have cited in the Acknowledgement section.

Failing to acknowledge any non-original efforts will be counted as plagiarism. This incidence will be reported to the Student Judicial Affairs.

Pick a data set that you find interesting. This data set should contain at least four variables. Briefly explain the data set (background, source, variables, samples, etc.), and why it interests you.

The World Happiness Report interests me. It is a data set that quantifies happiness using many variables and is published annually by the UN. The data come from the WHR website and cover almost all countries in the world. Variables include log GDP, social support, life expectancy, perceptions of corruption, and more. It interests me because I like to learn about other cultures and how they differ from my own.

For Questions 2 to 8, we visit flights and weather in the nycflights13 dataset.

How many observations and variables are there in the flights dataset?
Find out the meanings of tailnum, flight, carrier, dep_delay and arr_delay in the flights dataset.
Find out the meanings of visib, time_hour, and temp in the weather data set.

?flights
tailnum: Plane tail number.
flight: Flight number.
carrier: Two-letter carrier abbreviation.
dep_delay, arr_delay: Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.

?weather
visib: Visibility in miles.
time_hour: Date and hour of the recording as a POSIXct date.
temp: Temperature in F.
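The first question in this part (the number of observations and variables) can be answered with dim(). A minimal sketch, assuming the nycflights13 package is installed:

```r
library(nycflights13)

# dim() returns c(rows, columns): rows are observations, columns are variables
dim(flights)   # 336776 19
```

So the flights dataset has 336,776 observations and 19 variables.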

Extract the entries for Alaska Airlines from the flights dataset using filter() and %>%. Name the selected subset as alaska_flights. Furthermore, extract the weather data for EWR airport in January using filter and %>%, and name it as early_january_weather from weather.

alaska_flights <- flights %>%
  filter(carrier == "AS") 
early_january_weather <- weather %>%
  filter(origin == "EWR" & month == 1)
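The carrier code "AS" used above can be checked against the airlines lookup table that ships with nycflights13. A small sketch, assuming nycflights13 and dplyr are installed; the pattern "Alaska" is just a convenient search string:

```r
library(nycflights13)
library(dplyr)

# Look up the two-letter carrier code by airline name
alaska_code <- airlines %>%
  filter(grepl("Alaska", name))

alaska_code   # the carrier column holds the code used in flights
```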

Create a scatterplot of dep_delay and arr_delay in the alaska_flights data without using transparency. What do you find about the relationship between these two variables? What are some practical reasons for this relationship?

plot(alaska_flights$dep_delay, alaska_flights$arr_delay,
     xlab = "Departure Delay (minutes)",
     ylab = "Arrival Delay (minutes)",
     main = "Scatterplot of Departure Delay vs Arrival Delay - Alaska Airlines")

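There is a positive, roughly linear relationship between departure delay and arrival delay. A flight that departs late will most likely arrive late as well, because the time in the air is roughly fixed and a late departure can only be partly made up en route.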
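The strength of the linear relationship can also be summarised with a correlation coefficient. This is only a sketch, assuming nycflights13 and dplyr are installed; cancelled flights have missing delays, so incomplete pairs are dropped:

```r
library(nycflights13)
library(dplyr)

alaska_flights <- flights %>%
  filter(carrier == "AS")

# Pearson correlation between departure and arrival delay,
# using only rows where both delays are recorded
delay_cor <- cor(alaska_flights$dep_delay, alaska_flights$arr_delay,
                 use = "complete.obs")
delay_cor
```

A value close to 1 would support the positive linear pattern seen in the scatterplot.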

Re-draw the plot in Part 4 with transparency set to be 0.2. Compare the two plots and explain your findings, if any.

plot(alaska_flights$dep_delay, alaska_flights$arr_delay,
     xlab = "Departure Delay (minutes)",
     ylab = "Arrival Delay (minutes)",
     main = "Scatterplot of Departure Delay vs Arrival Delay - Alaska Airlines",
     pch = 16, col = rgb(0, 0, 0, 0.2))

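With transparency, the points are darkest where the arrival delay is slightly below zero, so most flights arrive slightly ahead of schedule, and very long arrival delays are uncommon. This concentration is hard to see in the first plot because the overlapping points are all drawn at full opacity.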
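The visual impression that most flights arrive early can be checked numerically. A minimal sketch, assuming nycflights13 and dplyr are installed; share_early is a name introduced here for illustration:

```r
library(nycflights13)
library(dplyr)

alaska_flights <- flights %>%
  filter(carrier == "AS")

# Proportion of Alaska Airlines flights that arrived ahead of schedule
share_early <- mean(alaska_flights$arr_delay < 0, na.rm = TRUE)
share_early

# A negative median means the typical flight arrived early
median(alaska_flights$arr_delay, na.rm = TRUE)
```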

For the early_january_weather data, create a linegraph with time_hour on the x-axis and temp on the y-axis. What do you find from the plot?

ggplot(early_january_weather, aes(x = time_hour, y = temp)) +
  geom_line() +
  labs(x = "Time Hour", y = "Temperature (F)", title = "Temperature Variation in Early January (EWR)")

Each spike corresponds to the day-night cycle: temperatures rise during the day and fall at night. The temperature also tends to increase as the month goes on, but not consistently.
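Averaging by day removes the day-night cycle and makes any trend across the month easier to judge. A sketch, assuming nycflights13 and dplyr are installed; daily_temp, mean_temp and range_temp are names introduced here:

```r
library(nycflights13)
library(dplyr)

early_january_weather <- weather %>%
  filter(origin == "EWR" & month == 1)

# One row per day: the daily mean shows the trend,
# the daily range shows the size of the day-night swing
daily_temp <- early_january_weather %>%
  group_by(day) %>%
  summarise(mean_temp  = mean(temp, na.rm = TRUE),
            range_temp = max(temp, na.rm = TRUE) - min(temp, na.rm = TRUE))

daily_temp
```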

Generate data from the following model. \[y_i = x_i\beta_1 + z_i\beta_2+x_i z_i \beta_3+ \epsilon_i, i = 1,\ldots, 100\] where \(\beta_1=1\), \(\beta_2=2\), \(\beta_3 =1\), and \(\epsilon_i \sim {N}(0,1)\). For \(x_i\) and \(z_i\), generate them as \[ x_i \sim\ {\rm uniform} \ (-2,2) \ {\rm and}\ z_i =\tilde{\epsilon}_i,\] where \(\tilde{\epsilon}_i\sim {N}(0,0.5)\). Fit a linear regression with \(y\) as the outcome and \(x\) as the covariate (i.e., without \(z\)) using lm(). Report the estimated coefficients.

set.seed(169)
n <- 100
epsilon <- rnorm(n)
x <- runif(n, -2, 2)
z_tilde <- rnorm(n, 0, 0.5)

beta1 <- 1
beta2 <- 2
beta3 <- 1
y <- x * beta1 + beta2 * z_tilde + x * z_tilde * beta3 + epsilon
data <- data.frame(y, x, z_tilde)
model <- lm(y ~ x, data = data)
summary(model)

## 
## Call:
## lm(formula = y ~ x, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6607 -0.9577  0.0935  0.9829  4.6518 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.2873     0.1638  -1.754   0.0826 .  
## x             0.7619     0.1436   5.307 6.94e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.637 on 98 degrees of freedom
## Multiple R-squared:  0.2232, Adjusted R-squared:  0.2153 
## F-statistic: 28.16 on 1 and 98 DF,  p-value: 6.941e-07

Estimated coefficients: the intercept is -0.2873 and the coefficient of x is 0.7619.
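For comparison, the model that generated the data also contains z and the interaction of x and z. The sketch below repeats the data generation from Part 7 and fits both the short model from the report and the full model; it is an illustration only, and z here plays the role of z_tilde above:

```r
set.seed(169)
n <- 100
epsilon <- rnorm(n)
x <- runif(n, -2, 2)
z <- rnorm(n, 0, 0.5)

# True coefficients: beta1 = 1, beta2 = 2, beta3 = 1
y <- 1 * x + 2 * z + 1 * x * z + epsilon

short_fit <- lm(y ~ x)       # model fitted in the report (z omitted)
full_fit  <- lm(y ~ x * z)   # x, z and the x:z interaction

coef(short_fit)
coef(full_fit)
```

The full fit has four coefficients, (Intercept), x, z and x:z, which can be compared with the true values 0, 1, 2 and 1.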

Wrap the code in Part 7 into a function, and run a simulation of 5000 instances to obtain the estimated coefficients. Draw a histogram of the estimated coefficients \(\hat{\beta}_1\)s of \(x\) from your simulation, add a vertical line (solid, black) to represent the simulation mean, and add another vertical line (dashed, red) to represent the true value (i.e., 1). Report your findings (hint: you may want to review unbiasedness).

simulate_and_plot <- function(num_simulations = 5000) {
  # Storage for the estimated coefficients of x
  beta1_estimates <- numeric(num_simulations)

  # Fix the seed inside the function so every call repeats the same draws
  set.seed(169)

  for (i in 1:num_simulations) {
    n <- 100
    epsilon <- rnorm(n)
    x <- runif(n, -2, 2)
    z_tilde <- rnorm(n, 0, 0.5)
    beta1 <- 1
    beta2 <- 2
    beta3 <- 1
    y <- x * beta1 + beta2 * z_tilde + x * z_tilde * beta3 + epsilon

    data <- data.frame(y, x, z_tilde)
    model <- lm(y ~ x, data = data)

    # Keep only the estimated coefficient of x
    beta1_estimates[i] <- coef(model)[2]
  }

  hist(beta1_estimates, main = "Histogram of Beta1 Estimates",
       xlab = "Estimated Coefficient (Beta1)", col = "lightblue", border = "black")

  # Solid black line: simulation mean
  abline(v = mean(beta1_estimates), col = "black", lwd = 2)

  # Dashed red line: true value of beta1
  abline(v = 1, col = "red", lty = 2, lwd = 2)

  return(beta1_estimates)
}


beta1_estimates <- simulate_and_plot()

cat("Mean of estimated coefficients (Beta1):", mean(beta1_estimates), "\n")

## Mean of estimated coefficients (Beta1): 1.00195
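The simulation mean (1.00195) is very close to the true value of 1, which is consistent with the estimator being unbiased in this model. One way to judge "close" is to compare the gap with the Monte Carlo standard error of the mean. The sketch below is a self-contained illustration with a hypothetical helper sim_beta1(), an arbitrary seed, and fewer replications than the report, so its numbers will differ slightly from those above:

```r
# Hypothetical helper: one draw of the x-coefficient under the Part 7 model
sim_beta1 <- function(n = 100) {
  x <- runif(n, -2, 2)
  z <- rnorm(n, 0, 0.5)
  y <- x + 2 * z + x * z + rnorm(n)
  unname(coef(lm(y ~ x))[2])
}

set.seed(2024)  # arbitrary seed, chosen only for this sketch
est <- replicate(2000, sim_beta1())

# Monte Carlo standard error of the simulation mean
mc_se <- sd(est) / sqrt(length(est))
c(mean = mean(est), mc_se = mc_se)

# A gap within a few standard errors is consistent with unbiasedness
abs(mean(est) - 1) < 3 * mc_se
```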

Report the simulation mean from Part 8. Then run another simulation under the same setting to verify that your simulation is reproducible. Specifically, the new simulation mean should be identical to the mean from Part 8.

mean(beta1_estimates)

## [1] 1.00195

beta1_estimates2 <- simulate_and_plot()

mean(beta1_estimates2)

## [1] 1.00195
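The two means match because set.seed() is called inside the function, so every call starts from the same random number state. A minimal sketch of the same idea, using a hypothetical helper seeded_mean() rather than the full simulation:

```r
# Hypothetical helper: reseeding inside the function makes every call
# repeat exactly the same random draws
seeded_mean <- function(seed = 169) {
  set.seed(seed)
  mean(runif(100, -2, 2))
}

identical(seeded_mean(), seeded_mean())   # TRUE: same seed, same draws
```

One design consequence is that calling such a function also resets the global random number state for any code that runs afterwards.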

Generate data from a similar model.
\[y_i = x_i\beta_1 + z_i\beta_2+\epsilon_i, i = 1,\ldots, 100\] where \(\beta_1=1\), \(\beta_2=2\), and \(\epsilon_i \sim N(0,1)\). For \(x_i\) and \(z_i\), generate them as \[ x_i \sim\ {\rm uniform} \ (-2,2) \ {\rm and}\ z_i = \gamma_1 x_i+\gamma_0+\tilde{\epsilon}_i,\] where \(\gamma_1=1\), \(\gamma_0=1\), and \(\tilde{\epsilon}_i\sim N(0,0.5)\). Suppose we fit a regression model with \(y\) as the outcome and \(x\) as the covariate (i.e., without \(z\)). Examine whether, in this model, the estimator for the coefficient of \(x\) is unbiased using simulation.

simulate_and_check_bias <- function(num_simulations = 5000) {
  # Storage for the estimated coefficients of x
  beta1_estimates <- numeric(num_simulations)

  # Fix the seed so the simulation is reproducible
  set.seed(123)

  for (i in 1:num_simulations) {
    n <- 100
    epsilon <- rnorm(n)
    x <- runif(n, -2, 2)
    gamma1 <- 1
    gamma0 <- 1
    z_tilde <- rnorm(n, 0, 0.5)

    # z depends on x through gamma1
    z <- gamma1 * x + gamma0 + z_tilde

    beta1 <- 1
    beta2 <- 2
    y <- x * beta1 + z * beta2 + epsilon

    data <- data.frame(y, x)

    # z is deliberately left out of the fitted model
    model <- lm(y ~ x, data = data)

    beta1_estimates[i] <- coef(model)[2]
  }

  return(beta1_estimates)
}


beta1_estimates <- simulate_and_check_bias()


cat("Mean of estimated coefficients (Beta1):", mean(beta1_estimates), "\n")

## Mean of estimated coefficients (Beta1): 3.001196

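The estimator is biased: the simulation mean is about 3, which is far from the true value \(\beta_1 = 1\), unlike in Part 8. Here \(z\) depends on \(x\) and is left out of the fitted model, so the coefficient of \(x\) also picks up the effect of \(z\).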
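The value 3 can be explained by substituting \(z_i = \gamma_1 x_i + \gamma_0 + \tilde{\epsilon}_i\) into the model: \[y_i = \beta_2\gamma_0 + (\beta_1 + \beta_2\gamma_1)x_i + (\beta_2\tilde{\epsilon}_i + \epsilon_i),\] so the regression of \(y\) on \(x\) alone estimates \(\beta_1 + \beta_2\gamma_1 = 1 + 2 \cdot 1 = 3\) rather than \(\beta_1 = 1\). The sketch below checks this and also fits the model that includes \(z\); one_rep() is a hypothetical helper, and the seed and number of replications are arbitrary choices, not the ones used above:

```r
# One replication of the Part 10 design; returns the x-coefficient from
# the short regression (y ~ x) and from the long regression (y ~ x + z)
one_rep <- function(n = 100) {
  x <- runif(n, -2, 2)
  z <- 1 * x + 1 + rnorm(n, 0, 0.5)   # gamma1 = 1, gamma0 = 1
  y <- 1 * x + 2 * z + rnorm(n)       # beta1 = 1, beta2 = 2
  c(short = unname(coef(lm(y ~ x))[2]),
    long  = unname(coef(lm(y ~ x + z))["x"]))
}

set.seed(1)  # arbitrary seed for this sketch
est <- replicate(1000, one_rep())   # 2 x 1000 matrix with rows short, long

# The short regression should average near 3, the long regression near 1
sim_means <- rowMeans(est)
sim_means
```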

Acknowledgement

Failing to acknowledge any non-original efforts will be counted as plagiarism. This incidence will be reported to the Student Judicial Affairs.

I used ChatGPT.

If you use generative AI to solve any questions, please provide your instructions, conversations, and prompts here.

User: provide r code . For Questions 2 to 8, we visit flights and weather in the nycflights13 dataset. (i) How many observations and variables are there in the flights dataset? (ii) Find out the meanings of talinum, flight, carrier, dep_delay and arr_delay in the flights dataset. (iii) Find out the meanings of visib,time_hour, and temp in the weather data set.

ChatGPT: Certainly! To answer the questions using R, you can utilize the nycflights13 package which includes the flights and weather datasets. Ensure you have the package installed by running install.packages("nycflights13") if you haven’t already.

Here’s the R code:

# Install and load the nycflights13 package
install.packages("nycflights13")
library(nycflights13)

# (i) How many observations and variables are there in the flights dataset?
dim(flights)

# (ii) Find out the meanings of talinum, flight, carrier, dep_delay, and arr_delay in the flights dataset.
# You can use the ? operator to access the documentation
?flights

# (iii) Find out the meanings of visib, time_hour, and temp in the weather dataset.
?weather

Explanation:

dim(flights) provides the number of observations (rows) and variables (columns) in the flights dataset.
?flights opens the documentation for the flights dataset, where you can find the meanings of various variables.
?weather similarly opens the documentation for the weather dataset to find the meanings of variables like visib, time_hour, and temp.

User: ) Extract the entries for Alaska Airlines from the flights dataset using filter() and %>%. Name the selected subset as alaska_flights. Furthermore, extract the weather data for EWR airport in January using filter and %>%, and name it as early_january_weather from weather.

ChatGPT: Certainly! You can use the filter() function along with the %>% (pipe) operator from the dplyr package to accomplish this. Make sure to load the dplyr package if you haven’t already.

# Install and load the dplyr package
install.packages("dplyr")
library(dplyr)

# Extract the entries for Alaska Airlines from the flights dataset
alaska_flights <- flights %>%
  filter(carrier == "AS")  # Assuming "AS" is the code for Alaska Airlines

# Extract the weather data for EWR airport in January from weather dataset
early_january_weather <- weather %>%
  filter(origin == "EWR" & month == 1)

Explanation:

filter(carrier == "AS") selects rows in the flights dataset where the carrier is "AS" (assuming "AS" is the code for Alaska Airlines).
filter(origin == "EWR" & month == 1) selects rows in the weather dataset where the origin is "EWR" (Newark Airport) and the month is January.

User: how do i see the data set in r

ChatGPT: To view the contents of a dataset in R, you can use functions like head(), tail(), or simply type the dataset name. Here’s how you can do it:

View the first few rows of the dataset using head():

head(flights)

This will display the first 6 rows of the flights dataset.

View the last few rows of the dataset using tail():

tail(flights)

This will display the last 6 rows of the flights dataset.

View the entire dataset:

flights

Typing the dataset name and running the code will display the entire dataset in the console. However, be cautious with this approach for large datasets as it may overwhelm the console.

Subset of the dataset: If you want to see only specific columns or rows, you can use indexing. For example:

flights[, c("carrier", "dep_delay", "arr_delay")]

This will show only the specified columns for all rows.

Choose the method that suits your needs based on the size and structure of your dataset. The head() and tail() functions are particularly useful for quickly inspecting the beginning or end of a dataset.

User: how do i find the 2 letter code for alaskan airlines

ChatGPT: To find the two-letter code for Alaska Airlines, you can refer to the airlines dataset in the nycflights13 package. Here’s how you can do it:

# Load the nycflights13 package if not already loaded
library(nycflights13)

# View the airlines dataset
airlines

Look for the row corresponding to Alaska Airlines in the airlines dataset. The two-letter code, known as the IATA code, is typically found in the "carrier" column. Once you find the relevant row, note the two-letter code associated with Alaska Airlines.

Assuming “AS” is the two-letter code for Alaska Airlines, you can use this code in your previous tasks, such as filtering the flights dataset for Alaska Airlines flights.

User: Create a scatterplot of dep_delay and arr_delay in the alaska_flights data without using transparency. What do you find about the relationship between these two variable? What are some practical reasons for this relationship?

ChatGPT: To create a scatterplot of dep_delay and arr_delay in the alaska_flights dataset without using transparency, you can use the plot() function. Here’s the code:

# Assuming alaska_flights has been created previously
plot(alaska_flights$dep_delay, alaska_flights$arr_delay,
     xlab = "Departure Delay (minutes)",
     ylab = "Arrival Delay (minutes)",
     main = "Scatterplot of Departure Delay vs Arrival Delay - Alaska Airlines")

This code will create a basic scatterplot with dep_delay on the x-axis and arr_delay on the y-axis for Alaska Airlines flights.

Interpretation:

If the points cluster around a diagonal line from the bottom left to the top right, it suggests a positive correlation between departure delay and arrival delay.
If the points are scattered with no clear pattern, it suggests a weak or no correlation.

Practical Reasons for the Relationship:

Operational Issues: Delays in departure may lead to cascading effects on the entire flight schedule, resulting in delays at subsequent stages, including arrival. This could be due to issues such as aircraft turnaround time, ground operations, and scheduling constraints.

Air Traffic Management: Congested airspace or busy airports can contribute to delays in both departure and arrival. Air traffic management and airport congestion can impact the efficiency of flights.

Weather Conditions: Adverse weather conditions can affect both departure and arrival times. Airlines may experience delays due to weather-related issues such as storms, low visibility, or other weather-related challenges.

Maintenance and Turnaround Time: Aircraft maintenance requirements and the time required for aircraft turnaround between flights can influence both departure and arrival delays.

Analyzing the scatterplot and considering these practical reasons can provide insights into the factors influencing the relationship between departure delay and arrival delay for Alaska Airlines flights.

User: (5) Re-draw the plot in Part 4 with transparency set to be 0.2. Compare the two plot and explain your findings, if any.

ChatGPT: To redraw the scatterplot in Part 4 with transparency set to 0.2, you can use the alpha parameter in the plot() function. Here’s the updated code:

# Assuming alaska_flights has been created previously
plot(alaska_flights$dep_delay, alaska_flights$arr_delay,
     xlab = "Departure Delay (minutes)",
     ylab = "Arrival Delay (minutes)",
     main = "Scatterplot of Departure Delay vs Arrival Delay - Alaska Airlines",
     pch = 16, col = rgb(0, 0, 0, 0.2))

This code uses the alpha parameter to set the transparency level to 0.2 (ranges from 0 for fully transparent to 1 for fully opaque). The pch parameter sets the plotting character, and col sets the color with transparency.

Comparison and Interpretation:

Adding transparency (setting alpha to 0.2) allows you to visualize overlapping points more easily. This can be particularly useful when there is a concentration of points in certain regions.
By comparing the two plots (with and without transparency), you can observe whether there are clusters of points or patterns that were less apparent in the initial plot.
If transparency reveals certain patterns or concentrations of points, it may suggest specific conditions or factors contributing to the relationship between departure delay and arrival delay for Alaska Airlines flights.

Adjusting transparency is a helpful technique for enhancing the interpretability of scatterplots, especially when dealing with a large number of data points or areas of overlap.

User: (6) For the early_january_weather data, create a linegraph with time_hour on the x-axis and temp on the y-axis. What do you find from the plot?

ChatGPT: To create a line graph for the early_january_weather data with time_hour on the x-axis and temp on the y-axis, you can use the plot() or ggplot2 package. Here, I’ll provide an example using the base R plot() function:

# Assuming early_january_weather has been created previously
plot(early_january_weather$time_hour, early_january_weather$temp,
     type = "l",  # "l" for line graph
     xlab = "Time Hour",
     ylab = "Temperature (F)",
     main = "Temperature Variation in Early January (EWR)")

This code uses the plot() function with type = "l" to create a line graph. It plots the temperature (temp) against the time in hours (time_hour) for early January at Newark Airport (EWR).

Interpretation:

The line graph will show how the temperature varies over time in early January at EWR.
Patterns such as daily temperature fluctuations or trends can be observed.
If there are sharp changes or consistent patterns, they may indicate certain weather conditions during the specified period.

For more advanced and customizable visualizations, you might consider using the ggplot2 package:

# Load ggplot2 if not already loaded
library(ggplot2)

# Create a line graph with ggplot2
ggplot(early_january_weather, aes(x = time_hour, y = temp)) +
  geom_line() +
  labs(x = "Time Hour", y = "Temperature (F)", title = "Temperature Variation in Early January (EWR)")

This ggplot2 code achieves the same result but provides more flexibility for customization and additional features.

User: (7) Generate data from the following model. \[y_i = x_i\beta_1 + z_i\beta_2+x_i z_i \beta_3+ \epsilon_i, i = 1,\ldots, 100\] where \(\beta_1=1\), \(\beta_2=2\), \(\beta_3 =1\), and \(\epsilon_i \sim {N}(0,1)\). For \(x_i\) and \(z_i\), generate them as \[ x_i \sim\ {\rm uniform} \ (-2,2) \ {\rm and}\ z_i =\tilde{\epsilon}_i,\] where \(\tilde{\epsilon}_i\sim {N}(0,0.5)\). Fit a linear regression with \(y\) as the outcome and \(x\) as the covariate (i.e., without \(z\)) using lm(). Report the estimated coefficients.

ChatGPT: To generate data from the provided model and fit a linear regression model using R, you can follow these steps:

# Set the seed for reproducibility
set.seed(123)

# Generate data
n <- 100
epsilon <- rnorm(n)
x <- runif(n, -2, 2)
z_tilde <- rnorm(n, 0, 0.5)

# Generate y based on the provided model
beta1 <- 1
beta2 <- 2
beta3 <- 1
y <- x * beta1 + beta2 * z_tilde + x * z_tilde * beta3 + epsilon

# Create a data frame with the generated data
data <- data.frame(y, x, z_tilde)

# Fit a linear regression model
model <- lm(y ~ x, data = data)

# Display the summary of the regression model
summary(model)

In this code:

We generate 100 observations for the variables x, z_tilde, and epsilon based on the specified distributions.
We then use these variables to generate the response variable y according to the provided model.
We create a data frame (data) with y, x, and z_tilde.
Finally, we fit a linear regression model (lm()) with y as the outcome and x as the covariate.

The summary(model) command displays the estimated coefficients, standard errors, t-values, and p-values for the linear regression model. The estimated coefficient for x should provide an estimate of \(\beta_1\) in the generated model.

User: (8) Wrap the code in Part 7 into a function, and run a simulation of 5000 instances to obtain the estimated coefficients. Draw a histogram of the estimated coefficients \(\hat{\beta}_1\)s of \(x\) from your simulation, add a vertical line (solid, black) to represent the simulation mean, and add another vertical line (dashed, red) to represent the true value (i.e., 1). Report your findings (hint: you may want to review unbiasedness).

ChatGPT: Certainly! Here is the R code to wrap the simulation into a function, run the simulation 5000 times, and then create a histogram of the estimated coefficients for \(\beta_1\):

# Define a function for the simulation
simulate_and_plot <- function(num_simulations = 5000) {
  # Storage for estimated coefficients
  beta1_estimates <- numeric(num_simulations)

  # Set the seed for reproducibility
  set.seed(123)

  # Simulation loop
  for (i in 1:num_simulations) {
    # Generate data
    n <- 100
    epsilon <- rnorm(n)
    x <- runif(n, -2, 2)
    z_tilde <- rnorm(n, 0, 0.5)

    # Generate y based on the provided model
    beta1 <- 1
    beta2 <- 2
    beta3 <- 1
    y <- x * beta1 + beta2 * z_tilde + x * z_tilde * beta3 + epsilon

    # Create a data frame with the generated data
    data <- data.frame(y, x, z_tilde)

    # Fit a linear regression model
    model <- lm(y ~ x, data = data)

    # Store the estimated coefficient for x (beta1)
    beta1_estimates[i] <- coef(model)[2]
  }

  # Plot histogram of beta1 estimates
  hist(beta1_estimates, main = "Histogram of Beta1 Estimates",
       xlab = "Estimated Coefficient (Beta1)", col = "lightblue", border = "black")

  # Add a vertical line for the mean of the estimates
  abline(v = mean(beta1_estimates), col = "black", lwd = 2)

  # Add another vertical line for the true value
  abline(v = 1, col = "red", lty = 2, lwd = 2)

  # Return the estimated coefficients
  return(beta1_estimates)
}

# Run the simulation and obtain the estimated coefficients
beta1_estimates <- simulate_and_plot()

# Display mean of the estimated coefficients
cat("Mean of estimated coefficients (Beta1):", mean(beta1_estimates), "\n")

Findings:

The histogram visually represents the distribution of the estimated coefficients for \(\beta_1\) from the simulation.
The vertical solid black line represents the mean of the estimated coefficients, and the dashed red line represents the true value of \(\beta_1\) (which is 1 in this case).
If the mean of the estimated coefficients is close to the true value, it suggests that the estimator is unbiased. In this case, the mean should be close to 1, indicating unbiasedness.

User: (10) Generate data from a similar model.
\[y_i = x_i\beta_1 + z_i\beta_2+\epsilon_i, i = 1,\ldots, 100\] where \(\beta_1=1\), \(\beta_2=2\), and \(\epsilon_i \sim N(0,1)\). For \(x_i\) and \(z_i\), generate them as \[ x_i \sim\ {\rm uniform} \ (-2,2) \ {\rm and}\ z_i = \gamma_1 x_i+\gamma_0+\tilde{\epsilon}_i,\] where \(\gamma_1=1\), \(\gamma_0=1\) and \(\tilde{\epsilon}_i\sim N(0,0.5)\). Suppose we fit a regression model with \(y\) as the outcome and \(x\) as the covariate (i.e., without \(z\)). Examine whether, in this model, the estimator for the coefficient of \(x\) is unbiased using simulation.

ChatGPT: To examine whether the estimator for the coefficient of \(x\) is unbiased in the provided model, we can conduct a simulation similar to what we did in Part 8. In this case, we need to generate data from the specified model, fit a linear regression model, and check the bias of the estimated coefficient for \(x\).

Here’s the R code for the simulation:

# Define a function for the simulation
simulate_and_check_bias <- function(num_simulations = 5000) {
  # Storage for estimated coefficients
  beta1_estimates <- numeric(num_simulations)

  # Set the seed for reproducibility
  set.seed(123)

  # Simulation loop
  for (i in 1:num_simulations) {
    # Generate data
    n <- 100
    epsilon <- rnorm(n)
    x <- runif(n, -2, 2)
    gamma1 <- 1
    gamma0 <- 1
    z_tilde <- rnorm(n, 0, 0.5)

    # Generate z based on the provided model
    z <- gamma1 * x + gamma0 + z_tilde

    # Generate y based on the model
    beta1 <- 1
    beta2 <- 2
    y <- x * beta1 + z * beta2 + epsilon

    # Create a data frame with the generated data
    data <- data.frame(y, x)

    # Fit a linear regression model
    model <- lm(y ~ x, data = data)

    # Store the estimated coefficient for x (beta1)
    beta1_estimates[i] <- coef(model)[2]
  }

  # Return the estimated coefficients
  return(beta1_estimates)
}

# Run the simulation and obtain the estimated coefficients
beta1_estimates <- simulate_and_check_bias()

# Display mean of the estimated coefficients
cat("Mean of estimated coefficients (Beta1):", mean(beta1_estimates), "\n")

In this simulation:

We generate data according to the specified model.
We fit a linear regression model with \(y\) as the outcome and \(x\) as the covariate.
We check whether the mean of the estimated coefficients for \(x\) is close to the true value of \(\beta_1\) (which is 1 in this case).

If the mean of the estimated coefficients is close to the true value, it suggests that the estimator for the coefficient of \(x\) is unbiased.

User: Error in simulate_and_check_bias() : could not find function "simulate_and_check_bias"

ChatGPT: My apologies for any confusion. It seems like there might be an issue with the code execution. Let’s make sure you have copied and pasted the entire code correctly. I’ll provide it again for your reference:

# Define a function for the simulation
simulate_and_check_bias <- function(num_simulations = 5000) {
  # Storage for estimated coefficients
  beta1_estimates <- numeric(num_simulations)

  # Set the seed for reproducibility
  set.seed(123)

  # Simulation loop
  for (i in 1:num_simulations) {
    # Generate data
    n <- 100
    epsilon <- rnorm(n)
    x <- runif(n, -2, 2)
    gamma1 <- 1
    gamma0 <- 1
    z_tilde <- rnorm(n, 0, 0.5)

    # Generate z based on the provided model
    z <- gamma1 * x + gamma0 + z_tilde

    # Generate y based on the model
    beta1 <- 1
    beta2 <- 2
    y <- x * beta1 + z * beta2 + epsilon

    # Create a data frame with the generated data
    data <- data.frame(y, x)

    # Fit a linear regression model
    model <- lm(y ~ x, data = data)

    # Store the estimated coefficient for x (beta1)
    beta1_estimates[i] <- coef(model)[2]
  }

  # Return the estimated coefficients
  return(beta1_estimates)
}

# Run the simulation and obtain the estimated coefficients
beta1_estimates <- simulate_and_check_bias()

# Display mean of the estimated coefficients
cat("Mean of estimated coefficients (Beta1):", mean(beta1_estimates), "\n")

Please ensure that you are copying and running the entire code, and let me know if you encounter any further issues.

User: Mean of estimated coefficients (Beta1): 3.001196 . What would this value indicate

ChatGPT: The mean of the estimated coefficients for Beta1 being approximately 3.0012 suggests a potential issue with bias in the estimation process. In an unbiased estimation, you would expect the mean of the estimated coefficients to be close to the true value of the parameter being estimated.

In the specified simulation, the true value of Beta1 is 1. The fact that the mean of the estimated coefficients is significantly different (around 3) suggests a bias in the estimation process. A biased estimator systematically overestimates or underestimates the true parameter value. In this case, the estimator for the coefficient of x appears to be upward biased.

To address this issue, you may want to review the model specification, data generation process, or consider alternative estimation methods to obtain an unbiased estimate. It’s important to ensure that the assumptions of the linear regression model are met and that the model is correctly specified to avoid bias in parameter estimation.

Session information

sessionInfo()

## R version 4.1.3 (2022-03-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] nycflights13_1.0.2 forcats_0.5.1      stringr_1.4.0      dplyr_1.0.8       
##  [5] purrr_0.3.4        readr_2.1.2        tidyr_1.2.0        tibble_3.1.6      
##  [9] ggplot2_3.3.5      tidyverse_1.3.1   
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.2 xfun_0.39        bslib_0.3.1      haven_2.4.3     
##  [5] colorspace_2.0-3 vctrs_0.4.0      generics_0.1.2   htmltools_0.5.2 
##  [9] yaml_2.3.5       utf8_1.2.2       rlang_1.0.2      jquerylib_0.1.4 
## [13] pillar_1.7.0     withr_2.5.0      glue_1.6.2       DBI_1.1.2       
## [17] dbplyr_2.1.1     modelr_0.1.8     readxl_1.4.0     lifecycle_1.0.1 
## [21] munsell_0.5.0    gtable_0.3.0     cellranger_1.1.0 rvest_1.0.2     
## [25] evaluate_0.15    labeling_0.4.2   knitr_1.38       tzdb_0.3.0      
## [29] fastmap_1.1.0    fansi_1.0.3      highr_0.9        broom_0.8.0     
## [33] backports_1.4.1  scales_1.2.0     jsonlite_1.8.0   farver_2.1.0    
## [37] fs_1.5.2         hms_1.1.1        digest_0.6.29    stringi_1.7.6   
## [41] grid_4.1.3       cli_3.2.0        tools_4.1.3      magrittr_2.0.3  
## [45] sass_0.4.1       crayon_1.5.1     pkgconfig_2.0.3  ellipsis_0.3.2  
## [49] xml2_1.3.3       reprex_2.0.1     lubridate_1.8.0  assertthat_0.2.1
## [53] rmarkdown_2.13   httr_1.4.2       rstudioapi_0.13  R6_2.5.1        
## [57] compiler_4.1.3

STA 141A: Assignment 2

(Sam Davison 918184869)
