問題1. R のパッケージggplot2 を読込み, データmpg を利用する.
library(ggplot2)
## Warning: 패키지 'ggplot2'는 R 버전 4.2.3에서 작성되었습니다
問題1.(a) データmpg は何に関するデータかを説明せよ.
?mpg
## httpd 도움말 서버를 시작합니다 ... 완료
上記のデータは、EPAにより作成された燃費に関するデータの一部であり、1999年から2008年に発売したモデルのデータで構成され車の指名度の指標として使われていることがわかる。
問題1.(b) class 列に関する集計表を作成し, 集計結果を可視化せよ.
xtabs(~class, data=mpg)
## class
## 2seater compact midsize minivan pickup subcompact suv
## 5 47 41 11 33 35 62
ggplot(data=mpg)+geom_bar(aes(x=class))
問題1.(c).class 列の水準別に, displ とhwy の散布図を描け.
Q1_cplot =
ggplot() +
geom_point(mapping=aes(x=displ, y=hwy, color=class),
data=mpg,)+
xlab('排気量') +
ylab('高速道路での燃費')
theme(legend.position = 'top')
## List of 1
## $ legend.position: chr "top"
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi FALSE
## - attr(*, "validate")= logi TRUE
Q1_cplot
問題1. (d) toyota, audi, volkswagen の3 社に関して, 各社のhwy の箱ひげ図を描け.
library(tidyverse)
## Warning: 패키지 'tidyverse'는 R 버전 4.2.3에서 작성되었습니다
## Warning: 패키지 'tibble'는 R 버전 4.2.3에서 작성되었습니다
## Warning: 패키지 'tidyr'는 R 버전 4.2.3에서 작성되었습니다
## Warning: 패키지 'readr'는 R 버전 4.2.3에서 작성되었습니다
## Warning: 패키지 'purrr'는 R 버전 4.2.3에서 작성되었습니다
## Warning: 패키지 'dplyr'는 R 버전 4.2.3에서 작성되었습니다
## Warning: 패키지 'stringr'는 R 버전 4.2.3에서 작성되었습니다
## Warning: 패키지 'forcats'는 R 버전 4.2.3에서 작성되었습니다
## Warning: 패키지 'lubridate'는 R 버전 4.2.3에서 작성되었습니다
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
mpg_sub <- mpg %>% select(manufacturer, cyl, hwy) %>% group_by(manufacturer)
mpg_sub_aud <- mpg_sub %>% filter(manufacturer %in% c("audi"))
mpg_sub_toy <- mpg_sub %>% filter(manufacturer %in% c("toyota"))
mpg_sub_vol <- mpg_sub %>% filter(manufacturer %in% c("volkswagen"))
boxplot(mpg_sub_toy$hwy,
mpg_sub_aud$hwy,
mpg_sub_vol$hwy,
names = c("Toyota", "Audi", "Volkswagen"),
xlab = c("製造社"),
ylab= c("高速道路での燃費")
)
問題2. R のパッケージnycflights13 を読込み、含まれるデータを利用する。
問題2.(a) flights データから, time_hour, origin, dest, tailnum, carrier 列を取り出し, ここにcarrier 列をキーとして, airlines データを結合せよ.
library(nycflights13)
## Warning: 패키지 'nycflights13'는 R 버전 4.2.3에서 작성되었습니다
library(dplyr)
flights_subset <- flights %>%
select(time_hour, origin, dest, tailnum, carrier)
data(airlines)
flights_with_airlines <- flights_subset %>%
left_join(airlines, by = "carrier")
print(flights_with_airlines)
## # A tibble: 336,776 × 6
## time_hour origin dest tailnum carrier name
## <dttm> <chr> <chr> <chr> <chr> <chr>
## 1 2013-01-01 05:00:00 EWR IAH N14228 UA United Air Lines Inc.
## 2 2013-01-01 05:00:00 LGA IAH N24211 UA United Air Lines Inc.
## 3 2013-01-01 05:00:00 JFK MIA N619AA AA American Airlines Inc.
## 4 2013-01-01 05:00:00 JFK BQN N804JB B6 JetBlue Airways
## 5 2013-01-01 06:00:00 LGA ATL N668DN DL Delta Air Lines Inc.
## 6 2013-01-01 05:00:00 EWR ORD N39463 UA United Air Lines Inc.
## 7 2013-01-01 06:00:00 EWR FLL N516JB B6 JetBlue Airways
## 8 2013-01-01 06:00:00 LGA IAD N829AS EV ExpressJet Airlines Inc.
## 9 2013-01-01 06:00:00 JFK MCO N593JB B6 JetBlue Airways
## 10 2013-01-01 06:00:00 LGA ORD N3ALAA AA American Airlines Inc.
## # ℹ 336,766 more rows
問題2.(b) (a) のデータに, airports データを利用して, 目的地(dest)の 空港の位置情報(緯度lat と経度lon) を追加せよ.
flights_subset <- flights %>%
select(time_hour, origin, dest, tailnum, carrier)
data(airlines)
data(airports)
flights_with_airlines <- flights_subset %>%
left_join(airlines, by = "carrier")
flights_with_airports <- flights_with_airlines %>%
left_join(airports, by = c("dest" = "faa"))
print(flights_with_airports)
## # A tibble: 336,776 × 13
## time_hour origin dest tailnum carrier name.x name.y lat lon
## <dttm> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 2013-01-01 05:00:00 EWR IAH N14228 UA United A… Georg… 30.0 -95.3
## 2 2013-01-01 05:00:00 LGA IAH N24211 UA United A… Georg… 30.0 -95.3
## 3 2013-01-01 05:00:00 JFK MIA N619AA AA American… Miami… 25.8 -80.3
## 4 2013-01-01 05:00:00 JFK BQN N804JB B6 JetBlue … <NA> NA NA
## 5 2013-01-01 06:00:00 LGA ATL N668DN DL Delta Ai… Harts… 33.6 -84.4
## 6 2013-01-01 05:00:00 EWR ORD N39463 UA United A… Chica… 42.0 -87.9
## 7 2013-01-01 06:00:00 EWR FLL N516JB B6 JetBlue … Fort … 26.1 -80.2
## 8 2013-01-01 06:00:00 LGA IAD N829AS EV ExpressJ… Washi… 38.9 -77.5
## 9 2013-01-01 06:00:00 JFK MCO N593JB B6 JetBlue … Orlan… 28.4 -81.3
## 10 2013-01-01 06:00:00 LGA ORD N3ALAA AA American… Chica… 42.0 -87.9
## # ℹ 336,766 more rows
## # ℹ 4 more variables: alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>
問題2.(c). (a) のデータで, carrier をDL とUA に絞り, そこに, planesデータを結合し,もっとも古い機体を使っているのはどちらの航空会社か調べよ.
flights_subset <- flights %>%
select(time_hour, origin, dest, tailnum, carrier)
data(airlines)
flights_with_airlines <- flights_subset %>%
left_join(airlines, by = "carrier")
flights_DL_UA <- flights_with_airlines %>%
filter(carrier %in% c("DL", "UA"))
data(planes)
flights_planes <- flights_DL_UA %>%
left_join(planes, by = "tailnum")
oldest_aircraft <- flights_planes %>%
group_by(carrier) %>%
slice(which.min(year))
print(oldest_aircraft)
## # A tibble: 2 × 14
## # Groups: carrier [2]
## time_hour origin dest tailnum carrier name year type
## <dttm> <chr> <chr> <chr> <chr> <chr> <int> <chr>
## 1 2013-04-30 08:00:00 EWR DTW N675MC DL Delta Air Lines … 1975 Fixe…
## 2 2013-01-09 06:00:00 LGA IAH N14629 UA United Air Lines… 1965 Fixe…
## # ℹ 6 more variables: manufacturer <chr>, model <chr>, engines <int>,
## # seats <int>, speed <int>, engine <chr>
最も古い機体を運用している会社はUAである。
問題3. R のパッケージbaseballr を読込み, 大谷投手(id:660271) のデータを利用する.
library(baseballr)
## Warning: 패키지 'baseballr'는 R 버전 4.2.3에서 작성되었습니다
otani <- scrape_statcast_savant_pitcher("2023-04-01", "2023-07-26", 660271)
(b)球種(pitch_type), 球速(release_speed) の2 列を取り出せ. ただし, 球速はマイル表示からkm 表示に変更すること. なお, 1 マイルは1.609344 km とせよ.
otani_selected_data <- otani %>%
select(pitch_type, release_speed) %>%
mutate(release_speed = release_speed * 1.609344)
print(otani_selected_data)
## ── MLB Baseball Savant Statcast Search data from baseballsavant.mlb.com ────────
## ℹ Data updated: 2023-07-28 16:57:01 KST
## # A tibble: 1,708 × 2
## pitch_type release_speed
## <chr> <dbl>
## 1 ST 135.
## 2 FF 156.
## 3 ST 134.
## 4 ST 139.
## 5 FF 155.
## 6 FS 140.
## 7 ST 135.
## 8 FC 146.
## 9 FF 157.
## 10 FC 141.
## # ℹ 1,698 more rows
ggplot(otani_selected_data, aes(x = release_speed, fill = pitch_type)) +
geom_histogram(binwidth = 1, position = "identity", alpha = 0.7) +
labs(title = "大谷翔平の2023年の投球のヒストグラム",
x = "球速(km/h)", y = "回数") +
theme(legend.position = 'top')
pitch_stats <- otani_selected_data %>%
group_by(pitch_type) %>%
summarize(
pitch_count = n(),
avg_release_speed = mean(release_speed),
std_dev_release_speed = sd(release_speed)
)
print(pitch_stats)
## # A tibble: 7 × 4
## pitch_type pitch_count avg_release_speed std_dev_release_speed
## <chr> <int> <dbl> <dbl>
## 1 CU 58 121. 5.45
## 2 FC 305 143. 4.75
## 3 FF 526 156. 3.01
## 4 FS 103 143. 3.71
## 5 SI 119 152. 3.22
## 6 SL 1 135. NA
## 7 ST 596 135. 3.39