Tariff_Extraction

根据Case_Number对关税情况进行抓取

已有的工作总结

所有的工作可以分为两个模块:

  1. HTS编码抓取
  2. 税率抓取

因为这两个信息的存储方式存在很大差异,加上技术的限制,我们只能分开抓取

模块一:HTS编码抓取

  • 获取FR中China相关的「Notice」的结构化信息,包括

    • 发布部门

    • 发布时间

    • 通知标题

    • 通知摘要

    • 通知的网页链接、PDF链接

  • 第一次筛选:根据通知的标题、摘要检索是否包含关税信息-3000

  • 第二次筛选:下载并扫描所有PDF文档,检查是否包含HTS编码-900

    至此已经提取出所有「可能的」HTS编码,但是问题在于:

    • 这些编码是否是被征税的对象(可能存在豁免的情况)

    • 由于编码识别的规则比较僵硬,无法确定是否是HTS编码,也可能只是8位数字,比如说时间

    • 征税的税率无法识别(目的)

模块二:关税信息抓取

关税的抓取作为整个工作流程最繁琐的环节,难点在于这一信息为非结构信息,只能通过阅读文本提取,因此

  • 方案一:在纯人工的情况下,只能逐个阅读提取信息

  • 方案二:引入LLM提取结构化信息,但是怎么使用、对谁使用都是一个问题(垃圾进-垃圾出原则)

    • 900+文档的超长上下文会使得LLM出现幻觉的情况大幅提升,因此不能将所有文档一股脑全部投进去

    • 所以为了保证关税信息提取的准确性,只能考虑单个文档识别策略

  • 但是于此同时,仍然有一个问题,从整个关税流程的角度出发,识别「单个」通知的关税情况会丢失掉不同通知之间的关系,以USTR为例,如果将LIST1里面的关税清单分开识别,我们就没办法搞清楚不同清单属于哪一批次,因此在识别之前我们需要考虑如何强化不同关税通知之间的联系。

  • 结合对DOC_ITC文件的分析,发现这两个部门都会按照不同的关税行动对文件进行归档,归档的依据就是CASE_NUMBER,比如说:

    • DOC:

      • AD (反倾销): A-570-\d{3} (例如 A-570-967)

      • CVD (反补贴): C-570-\d{3} (例如 C-570-968)

    • ITC:

      • 701-TA-\d+ (反补贴)

      • 731-TA-\d+ (反倾销)抓取思路如下:

  • 按照不同部门对「已经识别到HTS」的900条文件按照部门进行归类
#导入所有涉及关税信息的文件,发现一共936条可能包含HTS编码
import pandas as pd
df = pd.read_csv("/Users/altiens/Desktop/US_tariff_crawl/US_tariff_crawl/china_notices_with_hts_fast.csv")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3484 entries, 0 to 3483
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   document_number     3484 non-null   object
 1   title               3484 non-null   object
 2   abstract            3484 non-null   object
 3   html_url            3484 non-null   object
 4   pdf_url             3484 non-null   object
 5   publication_date    3484 non-null   object
 6   agencies            3484 non-null   object
 7   agencies_id         3484 non-null   object
 8   multi_agencies      3484 non-null   int64 
 9   excerpts            3483 non-null   object
 10  has_tariff_content  3484 non-null   bool  
 11  matched_keywords    3484 non-null   object
 12  hts_codes           936 non-null    object
 13  hts_count           3484 non-null   int64 
 14  scan_status         3484 non-null   object
dtypes: bool(1), int64(2), object(12)
memory usage: 384.6+ KB
#选取出所有可能包含HTS编码的文件
#筛选出有HTS编码的行
df_hts = df[df["hts_codes"].notnull()]
df_hts.info()
<class 'pandas.core.frame.DataFrame'>
Index: 936 entries, 6 to 3478
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   document_number     936 non-null    object
 1   title               936 non-null    object
 2   abstract            936 non-null    object
 3   html_url            936 non-null    object
 4   pdf_url             936 non-null    object
 5   publication_date    936 non-null    object
 6   agencies            936 non-null    object
 7   agencies_id         936 non-null    object
 8   multi_agencies      936 non-null    int64 
 9   excerpts            936 non-null    object
 10  has_tariff_content  936 non-null    bool  
 11  matched_keywords    936 non-null    object
 12  hts_codes           936 non-null    object
 13  hts_count           936 non-null    int64 
 14  scan_status         936 non-null    object
dtypes: bool(1), int64(2), object(12)
memory usage: 110.6+ KB
#按照agencies列进行分组,按照降序排列
df_hts.groupby("agencies").size().sort_values(ascending=False)
#结果发现发布关税信息最多的部门是:DOC_ITC、ICT、USTR
agencies
Commerce Department; International Trade Administration                      581
International Trade Commission                                               240
Trade Representative, Office of United States                                 84
Homeland Security Department; U.S. Customs and Border Protection              16
Commerce Department; Industry and Security Bureau                              7
Transportation Department; National Highway Traffic Safety Administration      3
Health and Human Services Department; National Institutes of Health            2
Commerce Department                                                            1
Commerce Department; International Trade Commission                            1
Transportation Department; Federal Aviation Administration                     1
dtype: int64
  • 我这里先选择了文件数量最多的DOC_ITC(Commerce Department; International Trade Administration)。下面根据这两个部门联合发布的文件CASE_NUMBER进行归档:

    • 先扫描所有这一部门的PDF,并抓取出Case_Number

    • 根据扫描结果,结合CASE_NUMBER将DOC_ITC部门的文件可以拆分为280个文件夹

  • 最后,我们的文件夹层次为:

    • 所有包含HTS编码的文件

      • DOC_ICT

        • A-570-967/ (铝型材 AD)

          • 2016-06299.pdf (初裁)

          • 等等

      • USTR

        • 301 Investigation

          • LIST1

          • LIST2

      • ICT

      • 等等

  • 同时,在识别之前,我们还需要确定自己需要抓取的信息:

    Case_ID 案件号 (A-570-xxx)
    Filename 文件名 (含年份)
    Doc_Type 文件类型
    Doc_Date 发布日期
    Effective_Date 生效日期
    Rate_Change 税率是否改变 (T/F)
    China_Wide_Rate 中国普遍税率
    Company_Rates 个别企业税率
    Has_Exclusions 是否有豁免
    • 关于上面China_Wide_Rate和Company_Rate的区分,DOC所发起的关税行动相较于USTR的行动范围较小,一般是针对个体(数十个企业),在很多情况下会考虑差异化税率,也即不同的公司采用不同的税率,同时这一特点启发我们在识别USTR和DOC关税情况的时候需要采取不同的策略

    • 基于上面需要抓取的信息,Prompt设计如下:

          Analyze the document: '{filename}'.
      
          Text Context:
          {text_context[:30000]}
      
          Extract the following fields into JSON:
      
          1. "doc_type": (String) e.g., "Initiation", "Prelim Determination", "Final Determination", "Admin Review", "Rescission", "Exclusion Notice".
          2. "doc_publish_date": (String) YYYY-MM-DD.
          3. "effective_date": (String) YYYY-MM-DD. Look for "DATES: Effective [Date]" or similar. If same as publish date, verify. If not found, return null.
          4. "is_rate_change": (Boolean) Does this doc set/change rates?
      
          5. "china_wide_rate": (String) The rate for "China-wide entity". e.g., "33.2%".
          6. "has_differential_rates": (Boolean) Are there specific companies getting different rates than the China-wide rate?
          7. "company_specific_rates": (String) Summarize top 3 companies and their rates. e.g., "Company A: 10.5%, Company B: 12.3%". If list is huge, say "Multiple companies (range 10%-20%)".
      
          8. "has_exclusions": (Boolean) Does this document explicitly exclude/exempt certain products or HTS codes?
          9. "excluded_hts_codes": (List of Strings) Extract specific HTS codes mentioned as excluded. e.g. ["7604.21.0000", "7604.29.1000"]. If it refers to an attachment or general scope, describe briefly e.g. ["See scope appendix"].
      
          10. "summary": (String) 1-sentence summary.
      
          JSON Output:
  • 这里先识别所有DOC_ITC的文件,识别结果如下,这里展示A-570-967的情况

result = pd.read_excel("/Users/altiens/Desktop/Learning/Python_Modules_Structure/US_China_Tariff_Full_Analysis.xlsx")
result[result["Case_ID"] == 'A-570-967']
Case_ID Filename Doc_Type Doc_Date Effective_Date Rate_Change China_Wide_Rate Has_Diff_Rates Company_Rates Has_Exclusions Excluded_HTS Summary
630 A-570-967 2016-06299.pdf Initiation 2016-03-21 2016-03-21 False NaN 0.0 NaN True ['See scope appendix'] The Department of Commerce is initiating an an...
631 A-570-967 2016-14046.pdf Admin Review 2016-06-14 2016-06-14 True 33.28% 1.0 Allied Maker Limited: 86.01%, Birchwoods (Lin'... False NaN The Department of Commerce issues preliminary ...
632 A-570-967 2016-16145.pdf Admin Review 2016-07-07 2016-07-07 False NaN 0.0 No specific rates provided in this initiation ... False NaN The Department of Commerce initiates administr...
633 A-570-967 2016-18649.pdf Final Determination 2016-08-05 2016-08-05 True 33.28% 0.0 No company-specific rates provided; all export... False NaN The Department of Commerce finds that revoking...
634 A-570-967 2016-18669.pdf Admin Review 2016-08-05 2016-08-05 False 40.41% 0.0 No specific companies identified; all 38 compa... False NaN The Department of Commerce preliminarily finds...
635 A-570-967 2016-27346.pdf Prelim Determination 2016-11-14 2016-11-14 False NaN 0.0 NaN False NaN The Department preliminarily determines that h...
636 A-570-967 2016-28502.pdf Admin Review 2016-11-28 2016-11-28 True 33.28% 1.0 Allied Maker Limited: 86.01%, Birchwoods (Lin'... False NaN The Department of Commerce amends the final re...
637 A-570-967 2016-28504.pdf Final Determination 2016-11-28 2016-11-28 True 33.2% 1.0 Multiple companies (range not specified) False NaN The Department of Commerce issues final result...
638 A-570-967 2017-08352.pdf Continuation of Antidumping and Countervailing... 2017-04-25 2017-04-25 False NaN 0.0 NaN True ['See scope exclusions for aluminum extrusions... The Department of Commerce continues the antid...
639 A-570-967 2017-11665.pdf Admin Review 2017-06-06 2017-06-06 True 85.96% 1.0 tenKsolar (Shanghai) Co., Ltd.: 85.96%, PRC-wi... False NaN The Department of Commerce issued preliminary ...
640 A-570-967 2017-15683.pdf Final Determination 2017-07-26 2017-07-26 False NaN 0.0 NaN False NaN The Department of Commerce finds that heat-tre...
641 A-570-967 2017-24407.pdf Final Determination 2017-11-13 2017-11-13 True 85.96% 0.0 No differential rates; all entities under revi... False NaN The Department of Commerce issues final result...
642 A-570-967 2018-02521.pdf Admin Review 2018-02-08 2018-02-08 False 86.01% 0.0 No companies received separate rates; all part... False ['See scope appendix'] Commerce preliminarily determines no separate ...
643 A-570-967 2018-04390.pdf Initiation 2018-03-05 2018-03-05 False NaN 0.0 NaN True ['208.0', '295.0', '308.0', '355.0', 'C355.0',... Commerce initiates anti-circumvention inquirie...
644 A-570-967 2018-15573.pdf Initiation 2018-07-20 2018-07-20 True NaN 1.0 Changshu Changsheng Aluminium Products Co., Lt... True ['6603.90.8100', '7616.99.51', '8479.89.94', '... This document initiates and provides prelimina...
645 A-570-967 2018-16071.pdf Admin Review 2018-07-27 2018-07-27 False 86.01% 0.0 No specific rates set in this review False NaN The Department of Commerce issued final result...
646 A-570-967 2018-19571.pdf Final Determination 2018-09-10 2018-09-10 True NaN 0.0 Multiple companies (all 32.79% dumping margin ... True ['See scope exclusions for specific aluminum e... Commerce recalculates cash deposit rates for 2...
647 A-570-967 2019-09214.pdf Initiation 2019-05-06 2019-05-06 False NaN 0.0 NaN False NaN The Department of Commerce is initiating anti-...
648 A-570-967 2019-10275.pdf Prelim Determination 2019-05-17 2019-05-17 False NaN 0.0 NaN False NaN Commerce preliminarily determines that extrude...
649 A-570-967 2019-28234.pdf Final Determination 2019-12-31 2019-12-31 False 135.18% 0.0 No specific companies identified; all subject ... True ['Crude artificial corundum with particles > 3... Commerce determines that revoking the antidump...
650 A-570-967 2019-28265.pdf Admin Review 2019-12-31 2019-12-31 False 86.01% 0.0 No companies received separate rates; all part... False NaN Commerce preliminarily determines that no comp...
651 A-570-967 2020-11205.pdf Second Amended Final Scope Ruling 2020-05-26 2020-05-26 False NaN 0.0 NaN True ['See scope for components not covered'] Commerce issues a second amended final scope r...
652 A-570-967 2022-10179.pdf Initiation 2022-05-12 2022-05-12 False NaN 0.0 NaN True ['7605.19.000'] Commerce initiates changed circumstances revie...
653 A-570-967 2022-13139.pdf Preliminary Results of Changed Circumstances R... 2022-06-17 2021-04-30 False NaN 0.0 No company-specific rates provided True ['7605.19.000', '7604.29.1090', '7616.99.5190'] Commerce intends to partially revoke the antid...
654 A-570-967 2022-14299.pdf Final Determination 2022-07-06 2022-07-06 False NaN 0.0 NaN True ['7605.19.000', '7604.29.1090', '7616.99.5190'] Commerce revokes antidumping and countervailin...
655 A-570-967 2022-23810.pdf Continuation 2022-11-02 2022-11-02 False 50.43% 0.0 No company-specific rates mentioned True ['See scope exclusions for specific alloys and... Commerce and ITC determined that revoking the ...
656 A-570-967 2022-23826.pdf Final Determination 2022-11-02 2022-11-02 False 50.43% 0.0 No specific companies listed False NaN Commerce finds that revocation of the antidump...
657 A-570-967 2023-02212.pdf Admin Review 2023-02-03 2023-02-03 False 111.73% 1.0 TMI/TMM: no shipments, other exporters have se... False NaN Commerce preliminarily determines that TMI/TMM...