Instructions You may adapt the code in the course materials or any sources (e.g., the Internet, classmates, friends). In fact, you can craft solutions for almost all questions from the course materials with minor modifications. However, you need to write up your own solutions and acknowledge all sources that you have cited in the Acknowledgement section.
Failing to acknowledge any non-original efforts will be counted as plagiarism. This incidence will be reported to the Student Judicial Affairs.
library(nycflights13)
The dataset I chose is a pre-installed dataset in R Studio called mpg. I chose this dataset because I have an interest in cars. It has 234 observations and 11 columns (variables). The variables include manufacturer, model, displ, year, cyl, trans, drv, cty, hwy, fl, class. Each observation is a car.
dim(mpg)
## [1] 234 11
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
flights and
weather in the nycflights13 dataset.flights dataset?talinum, flight,
carrier, dep_delay and arr_delay
in the flights dataset.visib,time_hour,
and temp in the weather data set.i.) There are 336776 observations in the flights
dataset.
dim(flights)
## [1] 336776 19
ii.) tailnum is the plane tail number,
flight is the flight number, carrier is the
two letter carrier abbreviation, dep_delay is the departure
delay, and arr_delay is the arrival delay.
iii.) visib is visibility in miles,
time_hour is date and hour of the recording, and
temp is temperature in F.
head(weather)
## # A tibble: 6 × 15
## origin year month day hour temp dewp humid wind_dir wind_speed wind_gust
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA
## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA
## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA
## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 NA
## # … with 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
## # time_hour <dttm>
flights dataset using filter() and
%>%. Name the selected subset as
alaska_flights. Furthermore, extract the weather data for
EWR airport in January using filter and
%>%, and name it as early_january_weather
from weather.alaska_flights <- flights %>%
filter(carrier == "AS")
early_january_weather <- weather %>%
filter(origin == "EWR")
dep_delay and
arr_delay in the alaska_flights data without
using transparency. What do you find about the relationship between
these two variable? What are some practical reasons for this
relationship?I notice that as departure delays increase, arrival delays increase as well. This is likely due to the fact that departure delays cause arrival delays, thus, when there are longer departure waits, there are longer arrival waits. I also noticed that most of the data points are concentrated around 0-50 minutes.
plot(alaska_flights$dep_delay, alaska_flights$arr_delay)
Findings: after setting the transparency to 0.2, I can see that many of the data points are concentrated around 0-50 minutes
ggplot(alaska_flights, aes(dep_delay, arr_delay)) +
geom_point(alpha = 0.2)
## Warning: Removed 5 rows containing missing values (geom_point).
early_january_weather data, create a linegraph
with time_hour on the x-axis and
temp on the y-axis. What do you find from the
plot?I notice that temperature is the highest around July, which is accurate because that is when summer is. The lowest temperatures are around January, when winter is. The line graph allows me to see small oscillations in the data, which I assume represent the time of day. So at night it is cool and during the day it is warmer.
ggplot(early_january_weather, aes(time_hour, temp)) +
geom_line()
lm(). Report the
estimated coefficients.The estimated coefficient for the intercept is 0.954 and the estimated coefficient for xi is 1.222.
set.seed(1)
b1 <- 1
b2 <- 2
b3 = 1
ei <- runif(100, min = 0, max = 1)
xi <- runif(100, min = -2, max = 2)
ei_tilda <- runif(100, min = 0, max= 0.5)
zi <- ei_tilda
# regression model
yi <- xi*b1 + zi*b2 + xi*zi*b3 + ei
# estimated coefficients
lm(yi ~ xi)
##
## Call:
## lm(formula = yi ~ xi)
##
## Coefficients:
## (Intercept) xi
## 0.954 1.222
regression <- function() {
b1 <- 1
b2 <- 2
b3 = 1
ei <- runif(100, min = 0, max = 1)
xi <- runif(100, min = -2, max = 2)
ei_tilda <- runif(100, min = 0, max= 0.5)
zi <- ei_tilda
# regression model
yi <- xi*b1 + zi*b2 + xi*zi*b3 + ei
# estimated coefficients
lm(yi ~ xi)
}
coef_dist <- replicate(5000, summary(regression())$coefficients[2, 1])
hist(coef_dist)
abline(v = c(mean(coef_dist), 1), col = c('black', 'red'), lwd = 2, lty = "dashed")
The simulation mean from Part 8 was 1.25. When I reproduce the simulation, I get around the same mean at ~1.249.
mean(replicate(5000, summary(regression())$coefficients[2, 1]))
## [1] 1.250421
b1 <- 1
b2 <- 2
ei <- runif(100, 0, 1)
xi <- runif(100, -2, 2)
y1 <- 1
y0 <- 1
ei_tilda <- runif(100, 0, 0.5)
zi <- y1*xi+ y0 + ei_tilda
yi <- xi*b1 + zi*b2 + ei
lm(yi ~ xi)
##
## Call:
## lm(formula = yi ~ xi)
##
## Coefficients:
## (Intercept) xi
## 3.069 2.976
Failing to acknowledge any non-original efforts will be counted as plagiarism. This incidence will be reported to the Student Judicial Affairs.
If you use generative AI to solve any questions, please provide your instructions, conversations, and prompts here.
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] nycflights13_1.0.2 scales_1.2.1 forcats_0.5.2 stringr_1.4.1
## [5] dplyr_1.0.10 purrr_0.3.4 readr_2.1.2 tidyr_1.2.1
## [9] tibble_3.1.8 ggplot2_3.3.6 tidyverse_1.3.2
##
## loaded via a namespace (and not attached):
## [1] lubridate_1.8.0 assertthat_0.2.1 digest_0.6.33
## [4] utf8_1.2.2 R6_2.5.1 cellranger_1.1.0
## [7] backports_1.4.1 reprex_2.0.2 evaluate_0.23
## [10] httr_1.4.4 highr_0.9 pillar_1.8.1
## [13] rlang_1.1.2 googlesheets4_1.0.1 readxl_1.4.1
## [16] rstudioapi_0.14 jquerylib_0.1.4 rmarkdown_2.16
## [19] labeling_0.4.2 googledrive_2.0.0 munsell_0.5.0
## [22] broom_1.0.1 compiler_4.2.1 modelr_0.1.9
## [25] xfun_0.33 pkgconfig_2.0.3 htmltools_0.5.4
## [28] tidyselect_1.1.2 fansi_1.0.3 crayon_1.5.1
## [31] tzdb_0.3.0 dbplyr_2.2.1 withr_2.5.0
## [34] grid_4.2.1 jsonlite_1.8.7 gtable_0.3.1
## [37] lifecycle_1.0.4 DBI_1.1.3 magrittr_2.0.3
## [40] cli_3.6.1 stringi_1.7.8 cachem_1.0.6
## [43] farver_2.1.1 fs_1.5.2 xml2_1.3.3
## [46] bslib_0.4.0 ellipsis_0.3.2 generics_0.1.3
## [49] vctrs_0.4.1 tools_4.2.1 glue_1.6.2
## [52] hms_1.1.2 fastmap_1.1.0 yaml_2.3.5
## [55] colorspace_2.0-3 gargle_1.2.1 rvest_1.0.3
## [58] knitr_1.40 haven_2.5.1 sass_0.4.2