問題1. R のパッケージggplot2 を読込み, データmpg を利用する.

library(ggplot2)
## Warning: 패키지 'ggplot2'는 R 버전 4.2.3에서 작성되었습니다

問題1.(a) データmpg は何に関するデータかを説明せよ.

?mpg 
## httpd 도움말 서버를 시작합니다 ... 완료

上記のデータは、EPAにより作成された燃費に関するデータの一部であり、1999年から2008年に発売したモデルのデータで構成され車の指名度の指標として使われていることがわかる。

問題1.(b) class 列に関する集計表を作成し, 集計結果を可視化せよ.

xtabs(~class, data=mpg)
## class
##    2seater    compact    midsize    minivan     pickup subcompact        suv 
##          5         47         41         11         33         35         62
ggplot(data=mpg)+geom_bar(aes(x=class))

問題1.(c).class 列の水準別に, displ とhwy の散布図を描け.

Q1_cplot = 
  ggplot() +
  geom_point(mapping=aes(x=displ, y=hwy, color=class), 
             data=mpg,)+
  xlab('排気量') +
  ylab('高速道路での燃費')
  theme(legend.position = 'top')
## List of 1
##  $ legend.position: chr "top"
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE
Q1_cplot

問題1. (d) toyota, audi, volkswagen の3 社に関して, 各社のhwy の箱ひげ図を描け.

library(tidyverse)
## Warning: 패키지 'tidyverse'는 R 버전 4.2.3에서 작성되었습니다
## Warning: 패키지 'tibble'는 R 버전 4.2.3에서 작성되었습니다
## Warning: 패키지 'tidyr'는 R 버전 4.2.3에서 작성되었습니다
## Warning: 패키지 'readr'는 R 버전 4.2.3에서 작성되었습니다
## Warning: 패키지 'purrr'는 R 버전 4.2.3에서 작성되었습니다
## Warning: 패키지 'dplyr'는 R 버전 4.2.3에서 작성되었습니다
## Warning: 패키지 'stringr'는 R 버전 4.2.3에서 작성되었습니다
## Warning: 패키지 'forcats'는 R 버전 4.2.3에서 작성되었습니다
## Warning: 패키지 'lubridate'는 R 버전 4.2.3에서 작성되었습니다
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
mpg_sub <- mpg %>% select(manufacturer, cyl, hwy) %>% group_by(manufacturer)
mpg_sub_aud <- mpg_sub %>% filter(manufacturer %in% c("audi"))
mpg_sub_toy <- mpg_sub %>% filter(manufacturer %in% c("toyota"))
mpg_sub_vol <- mpg_sub %>% filter(manufacturer %in% c("volkswagen"))
boxplot(mpg_sub_toy$hwy,
        mpg_sub_aud$hwy, 
        mpg_sub_vol$hwy,
        names = c("Toyota", "Audi", "Volkswagen"),
        xlab = c("製造社"), 
        ylab= c("高速道路での燃費")
        )

問題2. R のパッケージnycflights13 を読込み、含まれるデータを利用する。

問題2.(a) flights データから, time_hour, origin, dest, tailnum, carrier 列を取り出し, ここにcarrier 列をキーとして, airlines データを結合せよ.

library(nycflights13)
## Warning: 패키지 'nycflights13'는 R 버전 4.2.3에서 작성되었습니다
library(dplyr)
flights_subset <- flights %>%
  select(time_hour, origin, dest, tailnum, carrier)
data(airlines)
flights_with_airlines <- flights_subset %>%
  left_join(airlines, by = "carrier")
print(flights_with_airlines)
## # A tibble: 336,776 × 6
##    time_hour           origin dest  tailnum carrier name                    
##    <dttm>              <chr>  <chr> <chr>   <chr>   <chr>                   
##  1 2013-01-01 05:00:00 EWR    IAH   N14228  UA      United Air Lines Inc.   
##  2 2013-01-01 05:00:00 LGA    IAH   N24211  UA      United Air Lines Inc.   
##  3 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      American Airlines Inc.  
##  4 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      JetBlue Airways         
##  5 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Delta Air Lines Inc.    
##  6 2013-01-01 05:00:00 EWR    ORD   N39463  UA      United Air Lines Inc.   
##  7 2013-01-01 06:00:00 EWR    FLL   N516JB  B6      JetBlue Airways         
##  8 2013-01-01 06:00:00 LGA    IAD   N829AS  EV      ExpressJet Airlines Inc.
##  9 2013-01-01 06:00:00 JFK    MCO   N593JB  B6      JetBlue Airways         
## 10 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      American Airlines Inc.  
## # ℹ 336,766 more rows

問題2.(b) (a) のデータに, airports データを利用して, 目的地(dest)の 空港の位置情報(緯度lat と経度lon) を追加せよ.

flights_subset <- flights %>%
  select(time_hour, origin, dest, tailnum, carrier)
data(airlines)
data(airports)
flights_with_airlines <- flights_subset %>%
  left_join(airlines, by = "carrier")
flights_with_airports <- flights_with_airlines %>%
  left_join(airports, by = c("dest" = "faa"))
print(flights_with_airports)
## # A tibble: 336,776 × 13
##    time_hour           origin dest  tailnum carrier name.x    name.y   lat   lon
##    <dttm>              <chr>  <chr> <chr>   <chr>   <chr>     <chr>  <dbl> <dbl>
##  1 2013-01-01 05:00:00 EWR    IAH   N14228  UA      United A… Georg…  30.0 -95.3
##  2 2013-01-01 05:00:00 LGA    IAH   N24211  UA      United A… Georg…  30.0 -95.3
##  3 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      American… Miami…  25.8 -80.3
##  4 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      JetBlue … <NA>    NA    NA  
##  5 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Delta Ai… Harts…  33.6 -84.4
##  6 2013-01-01 05:00:00 EWR    ORD   N39463  UA      United A… Chica…  42.0 -87.9
##  7 2013-01-01 06:00:00 EWR    FLL   N516JB  B6      JetBlue … Fort …  26.1 -80.2
##  8 2013-01-01 06:00:00 LGA    IAD   N829AS  EV      ExpressJ… Washi…  38.9 -77.5
##  9 2013-01-01 06:00:00 JFK    MCO   N593JB  B6      JetBlue … Orlan…  28.4 -81.3
## 10 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      American… Chica…  42.0 -87.9
## # ℹ 336,766 more rows
## # ℹ 4 more variables: alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>

問題2.(c). (a) のデータで, carrier をDL とUA に絞り, そこに, planesデータを結合し,もっとも古い機体を使っているのはどちらの航空会社か調べよ.

flights_subset <- flights %>%
  select(time_hour, origin, dest, tailnum, carrier)
data(airlines)
flights_with_airlines <- flights_subset %>%
  left_join(airlines, by = "carrier")
flights_DL_UA <- flights_with_airlines %>%
  filter(carrier %in% c("DL", "UA"))
data(planes)
flights_planes <- flights_DL_UA %>%
  left_join(planes, by = "tailnum")
oldest_aircraft <- flights_planes %>%
  group_by(carrier) %>%
  slice(which.min(year))
print(oldest_aircraft)
## # A tibble: 2 × 14
## # Groups:   carrier [2]
##   time_hour           origin dest  tailnum carrier name               year type 
##   <dttm>              <chr>  <chr> <chr>   <chr>   <chr>             <int> <chr>
## 1 2013-04-30 08:00:00 EWR    DTW   N675MC  DL      Delta Air Lines …  1975 Fixe…
## 2 2013-01-09 06:00:00 LGA    IAH   N14629  UA      United Air Lines…  1965 Fixe…
## # ℹ 6 more variables: manufacturer <chr>, model <chr>, engines <int>,
## #   seats <int>, speed <int>, engine <chr>

最も古い機体を運用している会社はUAである。

問題3. R のパッケージbaseballr を読込み, 大谷投手(id:660271) のデータを利用する.

  1. 2023 シーズン(7/26 時点) のすべての投球データを, statcast より入手せよ.
library(baseballr)
## Warning: 패키지 'baseballr'는 R 버전 4.2.3에서 작성되었습니다
otani <- scrape_statcast_savant_pitcher("2023-04-01", "2023-07-26", 660271)

(b)球種(pitch_type), 球速(release_speed) の2 列を取り出せ. ただし, 球速はマイル表示からkm 表示に変更すること. なお, 1 マイルは1.609344 km とせよ.

otani_selected_data <- otani %>%
  select(pitch_type, release_speed) %>%
  mutate(release_speed = release_speed * 1.609344)
print(otani_selected_data)
## ── MLB Baseball Savant Statcast Search data from baseballsavant.mlb.com ────────
## ℹ Data updated: 2023-07-28 16:57:01 KST
## # A tibble: 1,708 × 2
##    pitch_type release_speed
##    <chr>              <dbl>
##  1 ST                  135.
##  2 FF                  156.
##  3 ST                  134.
##  4 ST                  139.
##  5 FF                  155.
##  6 FS                  140.
##  7 ST                  135.
##  8 FC                  146.
##  9 FF                  157.
## 10 FC                  141.
## # ℹ 1,698 more rows
  1. 球種ごとに分けて, 球速のヒストグラムを描け.
ggplot(otani_selected_data, aes(x = release_speed, fill = pitch_type)) +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.7) +
  labs(title = "大谷翔平の2023年の投球のヒストグラム",
       x = "球速(km/h)", y = "回数") +
  theme(legend.position = 'top')

  1. 球種ごとに分けて, 投球数の集計と球速の平均値と標準偏差をまとめて出力せよ.
pitch_stats <- otani_selected_data %>%
  group_by(pitch_type) %>%
  summarize(
    pitch_count = n(),
    avg_release_speed = mean(release_speed),
    std_dev_release_speed = sd(release_speed)
  )
print(pitch_stats)
## # A tibble: 7 × 4
##   pitch_type pitch_count avg_release_speed std_dev_release_speed
##   <chr>            <int>             <dbl>                 <dbl>
## 1 CU                  58              121.                  5.45
## 2 FC                 305              143.                  4.75
## 3 FF                 526              156.                  3.01
## 4 FS                 103              143.                  3.71
## 5 SI                 119              152.                  3.22
## 6 SL                   1              135.                 NA   
## 7 ST                 596              135.                  3.39