Instructions You may adapt the code in the course materials or any sources (e.g., the Internet, classmates, friends). In fact, you can craft solutions for almost all questions from the course materials with minor modifications. However, you need to write up your own solutions and acknowledge all sources that you have cited in the Acknowledgement section.

Failing to acknowledge any non-original efforts will be counted as plagiarism. This incidence will be reported to the Student Judicial Affairs.


library(nycflights13)
  1. Pick a data set that you find interesting. This data set should contain at least four variables. Briefly explain the data set (background, source, variables, samples, etc.), and why it interests you.

The dataset I chose is a pre-installed dataset in R Studio called mpg. I chose this dataset because I have an interest in cars. It has 234 observations and 11 columns (variables). The variables include manufacturer, model, displ, year, cyl, trans, drv, cty, hwy, fl, class. Each observation is a car.

dim(mpg)
## [1] 234  11
summary(mpg)
##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00

  1. For Questions 2 to 8, we visit flights and weather in the nycflights13 dataset.
  1. How many observations and variables are there in the flights dataset?
  2. Find out the meanings of talinum, flight, carrier, dep_delay and arr_delay in the flights dataset.
  3. Find out the meanings of visib,time_hour, and temp in the weather data set.

i.) There are 336776 observations in the flights dataset.

dim(flights)
## [1] 336776     19

ii.) tailnum is the plane tail number, flight is the flight number, carrier is the two letter carrier abbreviation, dep_delay is the departure delay, and arr_delay is the arrival delay.

iii.) visib is visibility in miles, time_hour is date and hour of the recording, and temp is temperature in F.

head(weather)
## # A tibble: 6 × 15
##   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust
##   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>
## 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4         NA
## 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06        NA
## 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5         NA
## 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7         NA
## 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7         NA
## 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5         NA
## # … with 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
## #   time_hour <dttm>

  1. Extract the entries for Alaska Airlines from the flights dataset using filter() and %>%. Name the selected subset as alaska_flights. Furthermore, extract the weather data for EWR airport in January using filter and %>%, and name it as early_january_weather from weather.
alaska_flights <- flights %>%
  filter(carrier == "AS")
early_january_weather <- weather %>%
  filter(origin == "EWR")

  1. Create a scatterplot of dep_delay and arr_delay in the alaska_flights data without using transparency. What do you find about the relationship between these two variable? What are some practical reasons for this relationship?

I notice that as departure delays increase, arrival delays increase as well. This is likely due to the fact that departure delays cause arrival delays, thus, when there are longer departure waits, there are longer arrival waits. I also noticed that most of the data points are concentrated around 0-50 minutes.

plot(alaska_flights$dep_delay, alaska_flights$arr_delay)


  1. Re-draw the plot in Part 4 with transparency set to be 0.2. Compare the two plot and explain your findings, if any.

Findings: after setting the transparency to 0.2, I can see that many of the data points are concentrated around 0-50 minutes

ggplot(alaska_flights, aes(dep_delay, arr_delay)) +
  geom_point(alpha = 0.2)
## Warning: Removed 5 rows containing missing values (geom_point).


  1. For the early_january_weather data, create a linegraph with time_hour on the x-axis and temp on the y-axis. What do you find from the plot?

I notice that temperature is the highest around July, which is accurate because that is when summer is. The lowest temperatures are around January, when winter is. The line graph allows me to see small oscillations in the data, which I assume represent the time of day. So at night it is cool and during the day it is warmer.

ggplot(early_january_weather, aes(time_hour, temp)) +
  geom_line()


  1. Generate data from the following model. \[y_i = x_i\beta_1 + z_i\beta_2+x_i z_i \beta_3+ \epsilon_i, i = 1,\ldots, 100\] where \(\beta_1=1\), \(\beta_2=2\), \(\beta_3 =1\), and \(\epsilon_i \sim \mathcal{N}(0,1)\). For \(x_i\) and \(z_i\), generate them as \[ x_i \sim\ {\rm uniform} \ (-2,2) \ {\rm and}\ z_i =\tilde{\epsilon}_i,\] where \(\tilde{\epsilon}_i\sim \mathcal{N}(0,0.5)\). Fit a linear regression with \(y\) as the outcome and \(x\) as the covariate (i.e., without \(z\)) using lm(). Report the estimated coefficients.

The estimated coefficient for the intercept is 0.954 and the estimated coefficient for xi is 1.222.

set.seed(1)

b1 <- 1
b2 <- 2
b3 = 1
ei <- runif(100, min = 0, max = 1)

xi <- runif(100, min = -2, max = 2)
ei_tilda <- runif(100, min = 0, max= 0.5)
zi <- ei_tilda

# regression model
yi <- xi*b1 + zi*b2 + xi*zi*b3 + ei
# estimated coefficients
lm(yi ~ xi)
## 
## Call:
## lm(formula = yi ~ xi)
## 
## Coefficients:
## (Intercept)           xi  
##       0.954        1.222

  1. Wrap the code in Part 7 into a function, and run a simulation of 5000 instances to obtain the estimated coefficients. Draw a histogram of the estimated coefficients \(\hat{\beta}_1\)s of \(x\) from your simulation, add a vertical line (solid, black) to represent the simulation mean, and add another vertical line (dashed, red) to represent the true value (i.e., 1). Report your findings (hint: you may want to review unbiasedness).

FINISH

regression <- function() {
  b1 <- 1
  b2 <- 2
  b3 = 1
  ei <- runif(100, min = 0, max = 1)
  
  xi <- runif(100, min = -2, max = 2)
  ei_tilda <- runif(100, min = 0, max= 0.5)
  zi <- ei_tilda
  
  # regression model
  yi <- xi*b1 + zi*b2 + xi*zi*b3 + ei
  # estimated coefficients
  lm(yi ~ xi)
}

coef_dist <- replicate(5000, summary(regression())$coefficients[2, 1])

hist(coef_dist)
abline(v = c(mean(coef_dist), 1), col = c('black', 'red'), lwd = 2, lty = "dashed")


  1. Report the simulation mean from Part 8. Then run another simulation under the same setting to verify that your simulation is reproducible. Specifically, the new simulation mean should be identical to the mean from Part 8.

The simulation mean from Part 8 was 1.25. When I reproduce the simulation, I get around the same mean at ~1.249.

mean(replicate(5000, summary(regression())$coefficients[2, 1]))
## [1] 1.250421

  1. Generate data from a similar model.
    \[y_i = x_i\beta_1 + z_i\beta_2+\epsilon_i, i = 1,\ldots, 100\] where \(\beta_1=1\), \(\beta_2=2\), and \(\epsilon_i \sim \mathcal{N}(0,1)\). For \(x_i\) and \(z_i\), generate them as \[ x_i \sim\ {\rm uniform} \ (-2,2) \ {\rm and}\ z_i = \gamma_1 x_i+\gamma_0+\tilde{\epsilon}_i,\] where \(\gamma_1=1\), \(\gamma_0=1\)and \(\tilde{\epsilon}_i\sim \mathcal{N}(0,0.5)\). Suppose we fit a regression model with \(y\) as the outcome and \(x\) as the covariate (i.e., without \(z\)). Examine whether, in this model, the estimator for the coefficient of \(x\) is unbiased using simulation.
b1 <- 1
b2 <- 2
ei <- runif(100, 0, 1)
xi <- runif(100, -2, 2)

y1 <- 1
y0 <- 1
ei_tilda <- runif(100, 0, 0.5)
zi <- y1*xi+ y0 + ei_tilda

yi <- xi*b1 + zi*b2 + ei

lm(yi ~ xi)
## 
## Call:
## lm(formula = yi ~ xi)
## 
## Coefficients:
## (Intercept)           xi  
##       3.069        2.976

Acknowledgement

Failing to acknowledge any non-original efforts will be counted as plagiarism. This incidence will be reported to the Student Judicial Affairs.

If you use generative AI to solve any questions, please provide your instructions, conversations, and prompts here.

Session information

sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] nycflights13_1.0.2 scales_1.2.1       forcats_0.5.2      stringr_1.4.1     
##  [5] dplyr_1.0.10       purrr_0.3.4        readr_2.1.2        tidyr_1.2.1       
##  [9] tibble_3.1.8       ggplot2_3.3.6      tidyverse_1.3.2   
## 
## loaded via a namespace (and not attached):
##  [1] lubridate_1.8.0     assertthat_0.2.1    digest_0.6.33      
##  [4] utf8_1.2.2          R6_2.5.1            cellranger_1.1.0   
##  [7] backports_1.4.1     reprex_2.0.2        evaluate_0.23      
## [10] httr_1.4.4          highr_0.9           pillar_1.8.1       
## [13] rlang_1.1.2         googlesheets4_1.0.1 readxl_1.4.1       
## [16] rstudioapi_0.14     jquerylib_0.1.4     rmarkdown_2.16     
## [19] labeling_0.4.2      googledrive_2.0.0   munsell_0.5.0      
## [22] broom_1.0.1         compiler_4.2.1      modelr_0.1.9       
## [25] xfun_0.33           pkgconfig_2.0.3     htmltools_0.5.4    
## [28] tidyselect_1.1.2    fansi_1.0.3         crayon_1.5.1       
## [31] tzdb_0.3.0          dbplyr_2.2.1        withr_2.5.0        
## [34] grid_4.2.1          jsonlite_1.8.7      gtable_0.3.1       
## [37] lifecycle_1.0.4     DBI_1.1.3           magrittr_2.0.3     
## [40] cli_3.6.1           stringi_1.7.8       cachem_1.0.6       
## [43] farver_2.1.1        fs_1.5.2            xml2_1.3.3         
## [46] bslib_0.4.0         ellipsis_0.3.2      generics_0.1.3     
## [49] vctrs_0.4.1         tools_4.2.1         glue_1.6.2         
## [52] hms_1.1.2           fastmap_1.1.0       yaml_2.3.5         
## [55] colorspace_2.0-3    gargle_1.2.1        rvest_1.0.3        
## [58] knitr_1.40          haven_2.5.1         sass_0.4.2