[Task 1]
Using the R package dplyr, insert R chunks and write code that reproduces the output shown for each problem below.

Use the summarize() function to compute summary statistics.

[Task 2]
Using the Python package polars, insert Python chunks and write code that reproduces the output shown for each problem below.

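To take the intersection (logical AND) of two conditions, write (expression 1) & (expression 2); the parentheses are required.
Example: rows where column A equals 1 and column B equals 2: (pl.col("A") == 1) & (pl.col("B") == 2)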

Use select() or group_by().agg() to compute summary statistics.

1 High School Problems

The data are shown below.

d <- data.frame(
  name = c("太郎", "花子", "三郎", "良子", "次郎", "桜子", "四郎", "松子", "愛子"),
  school = c("南", "南", "南", "南", "南", "東", "東", "東", "東"),
  teacher = c("竹田", "竹田", "竹田", "竹田",  "佐藤", "佐藤", "佐藤", "鈴木", "鈴木"),
  gender = c("男", "女", "男", "女", "男", "女", "男", "女", "女"),
  math = c(4, 3, 2, 4, 3, 4, 5, 4, 5),
  reading = c(1, 5, 2, 4, 5, 4, 1, 5, 4) )

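library(DT)
datatable(d)  # show the data as an interactive table

library(data.table)
fwrite(d, file = "highschool.csv", sep = ",")  # write the CSV that the Python chunk reads

import polars as pl
# df = pl.DataFrame(r.d)  # alternative: pass the R data frame through reticulate
df = pl.read_csv("highschool.csv")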

1.1 Problem

Get the column names of the data.

[R hint] names

## [1] "name"    "school"  "teacher" "gender"  "math"    "reading"

[Python hint] names

## ['name', 'school', 'teacher', 'gender', 'math', 'reading']

1.2 Problem

Get the student (name) and mathematics (math) data.

[R hint] select

##   name math
## 1 太郎    4
## 2 花子    3
## 3 三郎    2
## 4 良子    4
## 5 次郎    3
## 6 桜子    4
## 7 四郎    5
## 8 松子    4
## 9 愛子    5

[Python hint] select

shape: (9, 2)

name	math
str	i64
"太郎"	4
"花子"	3
"三郎"	2
"良子"	4
"次郎"	3
"桜子"	4
"四郎"	5
"松子"	4
"愛子"	5

1.3 Problem

Get the data for every column except gender (gender).

[R hint] select

##   name school teacher math reading
## 1 太郎     南    竹田    4       1
## 2 花子     南    竹田    3       5
## 3 三郎     南    竹田    2       2
## 4 良子     南    竹田    4       4
## 5 次郎     南    佐藤    3       5
## 6 桜子     東    佐藤    4       4
## 7 四郎     東    佐藤    5       1
## 8 松子     東    鈴木    4       5
## 9 愛子     東    鈴木    5       4

[Python hint] drop

shape: (9, 5)

name	school	teacher	math	reading
str	str	str	i64	i64
"太郎"	"南"	"竹田"	4	1
"花子"	"南"	"竹田"	3	5
"三郎"	"南"	"竹田"	2	2
"良子"	"南"	"竹田"	4	4
"次郎"	"南"	"佐藤"	3	5
"桜子"	"東"	"佐藤"	4	4
"四郎"	"東"	"佐藤"	5	1
"松子"	"東"	"鈴木"	4	5
"愛子"	"東"	"鈴木"	5	4

1.4 Problem

Get the 3rd through 6th records.

[R hint] slice

##   name school teacher gender math reading
## 1 三郎     南    竹田     男    2       2
## 2 良子     南    竹田     女    4       4
## 3 次郎     南    佐藤     男    3       5
## 4 桜子     東    佐藤     女    4       4

[Python hint] [:]

shape: (4, 6)

name	school	teacher	gender	math	reading
str	str	str	str	i64	i64
"三郎"	"南"	"竹田"	"男"	2	2
"良子"	"南"	"竹田"	"女"	4	4
"次郎"	"南"	"佐藤"	"男"	3	5
"桜子"	"東"	"佐藤"	"女"	4	4

shape: (3, 6)

name	school	teacher	gender	math	reading
str	str	str	str	i64	i64
"三郎"	"南"	"竹田"	"男"	2	2
"良子"	"南"	"竹田"	"女"	4	4
"次郎"	"南"	"佐藤"	"男"	3	5

1.5 Problem

Get the 3rd, 5th, and 9th records.

[R hint] slice

##   name school teacher gender math reading
## 1 三郎     南    竹田     男    2       2
## 2 次郎     南    佐藤     男    3       5
## 3 愛子     東    鈴木     女    5       4

[Python hint] [[, , ]]

shape: (3, 6)

name	school	teacher	gender	math	reading
str	str	str	str	i64	i64
"三郎"	"南"	"竹田"	"男"	2	2
"次郎"	"南"	"佐藤"	"男"	3	5
"愛子"	"東"	"鈴木"	"女"	5	4

1.6 Problem

Sort the records so that the names are in alphabetical order.

[R hint] arrange

##   name school teacher gender math reading
## 1 三郎     南    竹田     男    2       2
## 2 四郎     東    佐藤     男    5       1
## 3 太郎     南    竹田     男    4       1
## 4 愛子     東    鈴木     女    5       4
## 5 松子     東    鈴木     女    4       5
## 6 桜子     東    佐藤     女    4       4
## 7 次郎     南    佐藤     男    3       5
## 8 良子     南    竹田     女    4       4
## 9 花子     南    竹田     女    3       5

[Python hint] sort

shape: (9, 6)

name	school	teacher	gender	math	reading
str	str	str	str	i64	i64
"三郎"	"南"	"竹田"	"男"	2	2
"四郎"	"東"	"佐藤"	"男"	5	1
"太郎"	"南"	"竹田"	"男"	4	1
"愛子"	"東"	"鈴木"	"女"	5	4
"松子"	"東"	"鈴木"	"女"	4	5
"桜子"	"東"	"佐藤"	"女"	4	4
"次郎"	"南"	"佐藤"	"男"	3	5
"良子"	"南"	"竹田"	"女"	4	4
"花子"	"南"	"竹田"	"女"	3	5

1.7 Problem

Sort the records by math score from highest to lowest (descending order).

[R hint] arrange, desc

##   name school teacher gender math reading
## 1 四郎     東    佐藤     男    5       1
## 2 愛子     東    鈴木     女    5       4
## 3 太郎     南    竹田     男    4       1
## 4 良子     南    竹田     女    4       4
## 5 桜子     東    佐藤     女    4       4
## 6 松子     東    鈴木     女    4       5
## 7 花子     南    竹田     女    3       5
## 8 次郎     南    佐藤     男    3       5
## 9 三郎     南    竹田     男    2       2

[Python hint] sort, descending

shape: (9, 6)

name	school	teacher	gender	math	reading
str	str	str	str	i64	i64
"四郎"	"東"	"佐藤"	"男"	5	1
"愛子"	"東"	"鈴木"	"女"	5	4
"太郎"	"南"	"竹田"	"男"	4	1
"良子"	"南"	"竹田"	"女"	4	4
"桜子"	"東"	"佐藤"	"女"	4	4
"松子"	"東"	"鈴木"	"女"	4	5
"花子"	"南"	"竹田"	"女"	3	5
"次郎"	"南"	"佐藤"	"男"	3	5
"三郎"	"南"	"竹田"	"男"	2	2

1.8 Problem

Sort the records by math and reading scores from highest to lowest (descending order), giving priority to the math score.

[R hint] arrange, desc

##   name school teacher gender math reading
## 1 愛子     東    鈴木     女    5       4
## 2 四郎     東    佐藤     男    5       1
## 3 松子     東    鈴木     女    4       5
## 4 良子     南    竹田     女    4       4
## 5 桜子     東    佐藤     女    4       4
## 6 太郎     南    竹田     男    4       1
## 7 花子     南    竹田     女    3       5
## 8 次郎     南    佐藤     男    3       5
## 9 三郎     南    竹田     男    2       2

[Python hint] sort, descending, [, ]

shape: (9, 6)

name	school	teacher	gender	math	reading
str	str	str	str	i64	i64
"愛子"	"東"	"鈴木"	"女"	5	4
"四郎"	"東"	"佐藤"	"男"	5	1
"松子"	"東"	"鈴木"	"女"	4	5
"良子"	"南"	"竹田"	"女"	4	4
"桜子"	"東"	"佐藤"	"女"	4	4
"太郎"	"南"	"竹田"	"男"	4	1
"花子"	"南"	"竹田"	"女"	3	5
"次郎"	"南"	"佐藤"	"男"	3	5
"三郎"	"南"	"竹田"	"男"	2	2

1.9 Problem

Extract only the name (name) and reading (reading) columns.

[R hint] select

##   name reading
## 1 太郎       1
## 2 花子       5
## 3 三郎       2
## 4 良子       4
## 5 次郎       5
## 6 桜子       4
## 7 四郎       1
## 8 松子       5
## 9 愛子       4

[Python hint] select, [, ]

shape: (9, 2)

name	reading
str	i64
"太郎"	1
"花子"	5
"三郎"	2
"良子"	4
"次郎"	5
"桜子"	4
"四郎"	1
"松子"	5
"愛子"	4

1.10 Problem

Compute the mean of the math score (math).

[R hint] mean

## [1] 3.777778

[Python hint] mean

## 3.7777777777777777

1.11 Problem

Compute the mean math score (math) for each teacher (teacher).

[R hint] group_by, summarize, mean

## # A tibble: 3 × 2
##   teacher math_mean
##   <chr>       <dbl>
## 1 佐藤         4   
## 2 竹田         3.25
## 3 鈴木         4.5

[Python hint] group_by, agg, mean, alias

shape: (3, 2)

teacher	math_mean
str	f64
"鈴木"	4.5
"竹田"	3.25
"佐藤"	4.0

1.12 Problem

Compute the number of students for each teacher (teacher).

[R hint] group_by, summarize, n

## # A tibble: 3 × 2
##   teacher     n
##   <chr>   <int>
## 1 佐藤        3
## 2 竹田        4
## 3 鈴木        2

[Python hint] group_by, agg, len

shape: (3, 2)

teacher	len
str	u32
"佐藤"	3
"竹田"	4
"鈴木"	2

1.13 Problem

Using a counting function, compute the number of students for each teacher (teacher).

[R hint] count

##   teacher n
## 1    佐藤 3
## 2    竹田 4
## 3    鈴木 2

[Python hint] value_counts

shape: (3, 2)

teacher	count
str	u32
"佐藤"	3
"竹田"	4
"鈴木"	2

1.14 Problem

Using the count function, compute the number of students by gender for each teacher (teacher).

[R hint] count

##   teacher gender n
## 1    佐藤     女 1
## 2    佐藤     男 2
## 3    竹田     女 2
## 4    竹田     男 2
## 5    鈴木     女 2

[Python hint] group_by, agg, len

shape: (5, 3)

teacher	gender	len
str	str	u32
"佐藤"	"女"	1
"竹田"	"男"	2
"竹田"	"女"	2
"佐藤"	"男"	2
"鈴木"	"女"	2

1.15 Problem

Get the math (math) and reading (reading) scores of the female students.

[R hint] filter, select

##   name gender math reading
## 1 花子     女    3       5
## 2 良子     女    4       4
## 3 桜子     女    4       4
## 4 松子     女    4       5
## 5 愛子     女    5       4

[Python hint] filter, col, select

shape: (5, 4)

name	gender	math	reading
str	str	i64	i64
"花子"	"女"	3	5
"良子"	"女"	4	4
"桜子"	"女"	4	4
"松子"	"女"	4	5
"愛子"	"女"	5	4

1.16 Problem

Get the reading (reading) scores of the male students at the South (南) high school.

[R hint] filter

##   name school teacher gender math reading
## 1 太郎     南    竹田     男    4       1
## 2 三郎     南    竹田     男    2       2
## 3 次郎     南    佐藤     男    3       5

[Python hint] filter, col, () & ()

shape: (3, 6)

name	school	teacher	gender	math	reading
str	str	str	str	i64	i64
"太郎"	"南"	"竹田"	"男"	4	1
"三郎"	"南"	"竹田"	"男"	2	2
"次郎"	"南"	"佐藤"	"男"	3	5

1.17 Problem

Get the data for the teachers (teacher) who have three or more students.

[R hint] group_by, filter, n() (returns the number of rows)

## # A tibble: 7 × 6
## # Groups:   teacher [2]
##   name  school teacher gender  math reading
##   <chr> <chr>  <chr>   <chr>  <dbl>   <dbl>
## 1 太郎  南     竹田    男         4       1
## 2 花子  南     竹田    女         3       5
## 3 三郎  南     竹田    男         2       2
## 4 良子  南     竹田    女         4       4
## 5 次郎  南     佐藤    男         3       5
## 6 桜子  東     佐藤    女         4       4
## 7 四郎  東     佐藤    男         5       1

[Python hint] (Hard) Run as two statements. group_by, agg, filter, col, select, is_in

shape: (7, 6)

name	school	teacher	gender	math	reading
str	str	str	str	i64	i64
"太郎"	"南"	"竹田"	"男"	4	1
"花子"	"南"	"竹田"	"女"	3	5
"三郎"	"南"	"竹田"	"男"	2	2
"良子"	"南"	"竹田"	"女"	4	4
"次郎"	"南"	"佐藤"	"男"	3	5
"桜子"	"東"	"佐藤"	"女"	4	4
"四郎"	"東"	"佐藤"	"男"	5	1

1.18 Problem

Create the total score (total) of math (math) and reading (reading).

[R hint] mutate

##   name school teacher gender math reading total
## 1 太郎     南    竹田     男    4       1     5
## 2 花子     南    竹田     女    3       5     8
## 3 三郎     南    竹田     男    2       2     4
## 4 良子     南    竹田     女    4       4     8
## 5 次郎     南    佐藤     男    3       5     8
## 6 桜子     東    佐藤     女    4       4     8
## 7 四郎     東    佐藤     男    5       1     6
## 8 松子     東    鈴木     女    4       5     9
## 9 愛子     東    鈴木     女    5       4     9

[Python hint] with_columns, col, alias

shape: (9, 7)

name	school	teacher	gender	math	reading	total
str	str	str	str	i64	i64	i64
"太郎"	"南"	"竹田"	"男"	4	1	5
"花子"	"南"	"竹田"	"女"	3	5	8
"三郎"	"南"	"竹田"	"男"	2	2	4
"良子"	"南"	"竹田"	"女"	4	4	8
"次郎"	"南"	"佐藤"	"男"	3	5	8
"桜子"	"東"	"佐藤"	"女"	4	4	8
"四郎"	"東"	"佐藤"	"男"	5	1	6
"松子"	"東"	"鈴木"	"女"	4	5	9
"愛子"	"東"	"鈴木"	"女"	5	4	9

1.19 Problem

Convert the math score (math) to a 100-point scale (new column name: math100).

[R hint] mutate

##   name school teacher gender math reading math100
## 1 太郎     南    竹田     男    4       1      80
## 2 花子     南    竹田     女    3       5      60
## 3 三郎     南    竹田     男    2       2      40
## 4 良子     南    竹田     女    4       4      80
## 5 次郎     南    佐藤     男    3       5      60
## 6 桜子     東    佐藤     女    4       4      80
## 7 四郎     東    佐藤     男    5       1     100
## 8 松子     東    鈴木     女    4       5      80
## 9 愛子     東    鈴木     女    5       4     100

[Python hint] with_columns, col, alias

shape: (9, 7)

name	school	teacher	gender	math	reading	math100
str	str	str	str	i64	i64	i64
"太郎"	"南"	"竹田"	"男"	4	1	80
"花子"	"南"	"竹田"	"女"	3	5	60
"三郎"	"南"	"竹田"	"男"	2	2	40
"良子"	"南"	"竹田"	"女"	4	4	80
"次郎"	"南"	"佐藤"	"男"	3	5	60
"桜子"	"東"	"佐藤"	"女"	4	4	80
"四郎"	"東"	"佐藤"	"男"	5	1	100
"松子"	"東"	"鈴木"	"女"	4	5	80
"愛子"	"東"	"鈴木"	"女"	5	4	100

2 Star Wars Problems

Answer the problems using data on the characters (people and robots) of the film "Star Wars".

library(dplyr)  # provides the starwars data set and select()
starwars |> select(-films) |> as.data.frame() -> d
datatable(d)

fwrite(d, file = "starwars.csv", sep = ",")

2.1 Problem

Show how many humans (Human) there are.

[R hint] filter, count

##  [1] "name"       "height"     "mass"       "hair_color" "skin_color"
##  [6] "eye_color"  "birth_year" "sex"        "gender"     "homeworld" 
## [11] "species"    "vehicles"   "starships"

##    n
## 1 35

[Python hint] filter, col, select, count

## ['name', 'height', 'mass', 'hair_color', 'skin_color', 'eye_color', 'birth_year', 'sex', 'gender', 'homeworld', 'species', 'vehicles', 'starships']

shape: (1, 1)

species
u32
35

2.2 Problem

Show how many male and female humans there are.
gender: the character's gender, recorded as feminine or masculine

[R hint] filter, group_by, count

## # A tibble: 2 × 2
## # Groups:   gender [2]
##   gender        n
##   <chr>     <int>
## 1 feminine      9
## 2 masculine    26

[Python hint] filter, col, group_by, len

shape: (2, 2)

gender	len
str	u32
"feminine"	9
"masculine"	26

2.3 Problem

Show which home planets (homeworld) are the most common: first and second place.

[R hint] count, slice_max

##   homeworld  n
## 1     Naboo 11
## 2  Tatooine 10
## 3      <NA> 10

[Python hint] group_by, len, top_k

shape: (3, 2)

homeworld	len
str	u32
"Naboo"	11
"Tatooine"	10
null	10

2.4 Problem

Show the mean height (height) of the aliens from the planet Naboo (homeworld) whose eye color (eye_color) is orange.

[R hint] filter, summarize, mean

##   mean(height)
## 1     208.6667

[Python hint] filter, () & (), select, mean

shape: (1, 1)

height
f64
208.666667

2.5 Problem

For all robots (Droid), show the median, mean, and standard deviation of height.

[R hint] filter, summarize, median, mean, sd, na.rm = T (option that removes NA values)

##   median  mean       sd
## 1     97 131.2 49.14977

[Python hint] filter, col, select([, , ]), median, mean, std, alias

shape: (1, 3)

median	mean	std
f64	f64	f64
97.0	131.2	49.149771

2.6 Problem

Extract only the numeric columns and display the top 5 records by age (birth_year).

[R hint] where(is.numeric), slice_max

##   height mass birth_year
## 1     66   17        896
## 2    175 1358        600
## 3    228  112        200
## 4    167   75        112
## 5    193   80        102

[Python hint] select, col, drop_nulls, top_k, nulls_last; import polars.selectors as cs; numeric

shape: (5, 3)

height	mass	birth_year
i64	f64	f64
66	17.0	896.0
175	1358.0	600.0
228	112.0	200.0
167	75.0	112.0
193	80.0	102.0

2.7 Problem

Extract only the numeric columns, delete the records that contain NA (not available), and compute the correlation table (correlation matrix).

[R hint] where(is.numeric), na.omit, cor

##                height      mass birth_year
## height      1.0000000 0.1016533 -0.4135510
## mass        0.1016533 1.0000000  0.4781391
## birth_year -0.4135510 0.4781391  1.0000000

[Python hint] select, col, drop_nulls, corr; import polars.selectors as cs

shape: (3, 3)

height	mass	birth_year
f64	f64	f64
1.0	0.101653	-0.413551
0.101653	1.0	0.478139
-0.413551	0.478139	1.0

2.8 Problem

Extract only the string columns, sort the names in alphabetical order, and display the records of the first four characters.

[R hint] select, where(is.character), arrange, head

##               name hair_color   skin_color eye_color    sex    gender homeworld
## 1           Ackbar       none brown mottle    orange   male masculine  Mon Cala
## 2       Adi Gallia       none         dark      blue female  feminine Coruscant
## 3 Anakin Skywalker      blond         fair      blue   male masculine  Tatooine
## 4     Arvel Crynyd      brown         fair     brown   male masculine      <NA>
##        species
## 1 Mon Calamari
## 2   Tholothian
## 3        Human
## 4        Human

[Python hint] select, sort, head; import polars.selectors as cs; string

shape: (4, 10)

name	hair_color	skin_color	eye_color	sex	gender	homeworld	species	vehicles	starships
str	str	str	str	str	str	str	str	str	str
"Ackbar"	"none"	"brown mottle"	"orange"	"male"	"masculine"	"Mon Cala"	"Mon Calamari"	null	null
"Adi Gallia"	"none"	"dark"	"blue"	"female"	"feminine"	"Coruscant"	"Tholothian"	null	null
"Anakin Skywalk…	"blond"	"fair"	"blue"	"male"	"masculine"	"Tatooine"	"Human"	"Zephyr-G swoop…	"Naboo fighter\|…
"Arvel Crynyd"	"brown"	"fair"	"brown"	"male"	"masculine"	null	"Human"	null	"A-wing"

Data Wrangling

Exercises

東京国際大学 データサイエンス教育研究所 竹田 恒

2024-09-13

