大規模言語モデルを利用した母語識別/NLI

05-20-2024

Native Language Identification with Large Language Models

Why this paper?

今やりたいことと被っている
- L2学習者のためのreadabilityの判定
- 日本語において同じ課題の再現可能性が検討できる
従来のNLP手法との比較ができる

学習者のインプット（読解）とアウトプット（作文）から特徴づける言語パターンを見つける

Native Language Identification with Large Language Models

初めてGPT-3.5とGPT-4を使った母語識別タスク
- GPT-4がTOEFL11ベンツマックで精度91.7%という新しい記録を達成
- zero-shot母語識別タスクは母語が未知でもいける
- GPT-4が自分の回答に言語的推論を提供できる能力を検証
GPT-4はESL作者が書いた英文章の母語推測に高い正解率を持っている

Method and Data

正解の例を提供しない
既存の知識と理解だけで推測

GPT-3.5 & 4
TOEFL11
- ETS Corpus of Non-native Written English
- 11国の英語学習者（Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, Turkish）が試験で書いた作文サンプル1100篇（総計12100篇*平均348単語）
- 実際に使用したサンプルは母語ごとに100篇、総計1100篇のテストデータ
Data Leakage
- 非公開なデータのため、テストセットのみopen-setタスクを実施

Experiment A

A 従来のNLI分類をLLMsにより再実施 :

closed-setタスク
- モデルの予測は事前に定義された11種類に限られている
- docをインプットとし、System promptとUser prompt例は以下:

You are a forensic linguistics expert that reads English texts written by non-native authors in order to classify the native language of the author as one of:

“ARA”: Arabic
“CHI”: Chinese
“FRE”: French
“GER”: German
“HIN”: Hindi
“ITA”: Italian
“JPN”: Japanese
“KOR”: Korean
“SPA”: Spanish
“TEL”: Telugu
“TUR”: Turkish

Use clues such as spelling errors, word choice, syntactic patterns, and grammatical errors to decide.

DO NOT USE ANY OTHER CLASS.
IMPORTANT: Do not classify any input as “ENG” (English). English is an invalid choice.

Valid output formats:
Class: “ARA”
Class: “CHI”
Class: “FRE”
Class: “GER”

<TOEFL11 ESSAY TEXT>

Classify the text as one of ARA, CHI, FRE, GER, HIN, ITA, JPN, KOR, SPA, TEL, or TUR. Do not output any other class - do NOT choose “ENG” (English). What is the closest native language of the author of this English text from the given list?

Experiment A

A 従来のNLI分類をLLMsにより再実施 :

Model	TOEFL11 Test Set
Random Guess Baseline	9.1%
SVM + Meta-Classifier [@malmasi-dras-2018-native]	86.8%
BERT + Meta-Classifier [@steinbakken-gamback-2020-native]	85.3%
GPT-2 [@lotfi-etal-2020-deep]	89.0%
Ours - GPT-3.5 (Zero-shot)	74.0%
Ours - GPT-4 (Zero-shot)	91.7%
Ours - GPT-3.5 (Open-set, Zero-shot)	73.4%
Ours - GPT-4 (Open-set, Zero-shot)	86.7%

Evaluation Metrics

正解率/Accuracy :

先行研究と一致する
データに偏りがない
GPT-4_closed-setの評価
- HindiとTelugu母語話者による英文章の区別が難しい
- Chinese, JapaneseとKorean母語話者による英文章もクラスターになっている

Experiment A

A 従来のNLI分類をLLMsにより再実施 :

GPT-3.5とGPT-4の比較：

GPT-3.5は今回のデータセットにおいて、12%の文書は最初に英語だと予測され、再分類するといつもフランス語を答える

Model TOEFL11 Test Set

Ours - GPT-3.5 (Zero-shot) 74.0%

Ours - GPT-4 (Zero-shot) 91.7%

Ours - GPT-3.5 (Open-set, Zero-shot) 73.4%

Ours - GPT-4 (Open-set, Zero-shot) 86.7%

Model	TOEFL11 Test Set
Ours - GPT-3.5 (Zero-shot)	74.0%
Ours - GPT-4 (Zero-shot)	91.7%
Ours - GPT-3.5 (Open-set, Zero-shot)	73.4%
Ours - GPT-4 (Open-set, Zero-shot)	86.7%

Experiment B

B 文長による影響を調査 :

GPT-4により実施
文頭から2000文字まで様々な長さの文節をインプット
1250文字まで長ければ長いほど正確率が高い（NLIタスクにおける最低文字数を提唱）

Experiment C

C Open-Set分類タスクにおける表現 :

open-setタスク
- モデルの予測は事前に定義されていない
- docをインプットとし、System promptとUser prompt例は以下:

You are a forensic linguistics expert that reads texts written by non-native authors in order to identify their native language.

Analyze each text and identify the native language of the author.

Use clues such as spelling errors, word choice, syntactic patterns, and grammatical errors to decide.

<TOEFL11 ESSAY TEXT>

出力フォーマットが指定されていないため後処理正規表記が必要

Experiment C

C Open-Set分類タスクにおける表現 :

Open-SetとClosed-Setの比較：

GPT-3.5とGPT-4どっちも精度が落ちているので指定外のクラス予測を詳しく見る必要がある

Model TOEFL11 Test Set

Ours - GPT-3.5 (Zero-shot) 74.0%

Ours - GPT-4 (Zero-shot) 91.7%

Ours - GPT-3.5 (Open-set, Zero-shot) 73.4%

Ours - GPT-4 (Open-set, Zero-shot) 86.7%

Model	TOEFL11 Test Set
Ours - GPT-3.5 (Zero-shot)	74.0%
Ours - GPT-4 (Zero-shot)	91.7%
Ours - GPT-3.5 (Open-set, Zero-shot)	73.4%
Ours - GPT-4 (Open-set, Zero-shot)	86.7%

Experiment C

C Open-Set分類タスクにおける表現 :

GPT-3.5とGPT-4の比較：

GPT-3.5
- 英語と予測した文書の中HindiとTeluguが一番多い
GPT-4
- 英語と予測しない

GPTモデルは実際の母語識別において応用できるようになっている

Feature Explainability

第二言語学習において

LLMsは母語識別の予測結果について解釈できるかどうか :

従来の研究は特徴量エンジニアリングと特徴量表現の作成に依存している
LLMsはzero-shotで即時にpromptに即した解釈が出力される

You must provide a guess. Output two named sections: (1) “Native Language” with the name of the language, and (2) “Reasoning” with a detailed explanation of your judgement with examples from the text.

The text is quite well-written, with only minor indicators pointing towards the author’s native language being German. The major clue is the author’s use of reflexive verbs in instances where they are not commonly used by native English speakers. For example, “the consumer’s responsibility to train itself” - a native English speaker would typically phrase this as “it’s the consumer’s responsibility to train themselves”.
This is a reflection of the regular use of reflexive verbs in German. In addition, the phrase, “they are fake, so why do we believe them anyway?” is syntax reminiscent of German structure. The use of colons before explanations, though not incorrect in English, also suggest German as it is more prevalent in German writing, such as in “realized: the advertisement”

自分の解読を正当化できるが、その分析に基づいた仮説は謹慎

Feature Explainability

第二言語学習において

LLMsは母語識別の予測結果について解釈できるかどうか :

手がかりになっている言語特徴：
- つづりエラー
- 頻出文法エラーパターン
- 翻訳と音訳による表現
人力による幻覚の検出は必要となる

Limitations and Future Work

promptをより細かく改善する
GPTsではなくopen-sourceなLLMs（Llama-2）においての実施（正確率は及ばないがその差異の分析はできる）
多言語（すでにEnglish、ArabicとChineseの枠は埋めているがJapaneseはまだ）