bkk_firmus_task.knit

BKK FIRMUS
|| Database from forms

VIRTUALLY FILLED FORMS

#PYTHON LIBRARIES
#reticulate::py_install(packages = "PyMuPDF", pip = TRUE)

import fitz  #pip install PyMuPDF
import pandas as pd
#import ace_tools as tools  # For displaying DataFrame
import glob
from pdf2image import convert_from_path
import os
from PyPDF2 import PdfReader
import pytesseract
import re
import janitor
import json

Extract Text from the PDF

Before applying the regular expressions, you need to extract the text from the filled PDF. You can do this using Python libraries such as:

pdfplumber (for text-based PDFs) PyMuPDF (fitz) (for structured text extraction) pdf2image + Tesseract OCR (if the PDF is a scanned image)

path_upto_file = "C:/Users/as/Nextcloud/04_PROJECTS/07_BKK FIRMUS/00_RESOURCES/01_INPUTS/02_Files/"


pdf_path = os.path.join(path_upto_file, "01_Virtually filled forms/vf_ba_01.pdf")

# Open the PDF
doc = fitz.open(pdf_path)

# Extract form field values
form_data = {}

for page in doc:  # Loop through all pages
    for field in page.widgets():  # Access form fields (widgets)
        if field.field_name:  # Ensure it's a valid form field
            form_data[field.field_name] = field.field_value if field.field_value else "N/A"
            
# Convert extracted data to a DataFrame
df = pd.DataFrame([form_data])

Store Multiple Forms

Now that you have a DataFrame with one form as a row, you can process multiple PDFs and append the extracted data.


# Define the folder path
folder_path = os.path.join(path_upto_file, "01_Virtually filled forms/example")

# Get all PDF files in the folder
pdf_files = glob.glob(os.path.join(folder_path, "*.pdf"))

# List to store extracted form data
form_responses = []

for pdf_path in pdf_files:
    doc = fitz.open(pdf_path)
    extracted_data = {}
    
    for page in doc:
        for field in page.widgets():
            if field.field_name:
                extracted_data[field.field_name] = field.field_value if field.field_value else "N/A"

    # Append extracted form data to list
    form_responses.append(extracted_data)

# Convert list of dicts to DataFrame
df_multi = pd.DataFrame(form_responses)

#REORGANIZING THE DATAFRAME:
# Define the desired order
desired_order = ['Eintrittsdatum', 'Frau', 'Herr', 'Divers', 'Unbestimmt', 'Name Mitglied', 'Vorname Mitglied', 'Geburtsdatum Mitglied', 'Straße Hausnummer Mitglied', 'PLZ, Ort Mitglied', 'Telefon privat', 'Telefon mobil Mitglied', 'E-Mail', 'Rentenversicherungsnummer falls zur Hand', 'Geburtsort, -land Mitglied', 'Geburtsname Mitglied', 'Staatsangehörigkeit Mitglied', 'ledig', 'verheiratet', 'getrennt lebend', 'geschieden', 'verwitwet', 'eingetragene Lebenspartnerschaft', 'lege ich bei', 'reiche ich nach', 'liegt bereits vor', 'Bisherige Krankenkasse Mitglied', 'Ort Krankenkasse', 'Bisherige Krankenkasse von', 'Bisherige Krankenkasse bis', 'Versichertennummer', 'pflichtig', 'freiwillig', 'privat', 'familienversichert', 'Ja', 'Nein', 'Ehegatte', 'Lebenspartner', 'Kinder', 'lege ich bei_3', 'reiche ich nach_3', 'bitte zusenden', 'Nachname Vorname Ehegatte/Partner', 'Krankenkasse des Ehegatten/Partners', 'Tätigkeit', 'beschäftigt seit', 'Name Arbeitgeber', 'Straße, Hausnummer Arbeitgeber', 'PLZ, Ort Arbeitgeber', 'Bruttoentgelt', 'Erste Beschäftigung als Arbeitnehmer/-in in Deutschland', 'Gesellschafterin undoder Geschäftsführerin einer GmbH', 'Ich beziehe Rente bzw habe Rente beantragt', 'Ich erhalte Versorgungsbezüge zB Betriebsrente Pension', 'Versorgungsbezüge von', 'Datum Mitgliedsantrag','Staatsangehörigkeit', 'Ich bin MutterVater eines Kindesmehrerer Kinder', 'Akt. Fachsemester', 'Monatlicher Gewinn', 'Stunden wöchentliche Studienzeit', 'Ich bin während des Studiums beschäftigt oder selbstständig tätig', 'Straße PLZ Ort SEPA', '(Fach-) Hochschule', 'PLZ  Ort', 'Fachrichtung', 'Vorname', 'Rentenversicherungsnummer', 'Geburtsdatum (TT.MM.JJJJ)', 'Ort bisherige Krankenkasse', 'Telefon mobil', 'Ich beziehe bzw beantrage Leistungen bei der Agentur für Arbeit', 'IBAN (22 Stellen)', 'Nachname Einwilligungserklärung', 'BNR', 'Geburtsdatum Einwilligungserklärung', 'Anschrift des Zahlungspflichtigen (des Kontoinhabers)', 'Geburtsname', 'Nachname, Vorname des Ehegatten/Partners', 'Monatliches Bruttoentgelt', 'Datum letztes Versicherungsverhältnis von', 'Vermittlernummer od', 'Name des Zahlungspflichtigen (Kontoinhaber)', 'Nachname', 'Straße  Hausnummer', 'Studium seit', 'Datum SEPA', 'Ort, Datum', 'Ich habe mich von der Krankenversicherungspflicht befreien lassen', 'Ich habe Anspruch auf Sachleistungen nach ausländischem Recht', 'Datum (TT.MM.JJJJ)', 'Bisherige Krankenkasse', 'Name des Kreditinstituts', 'KVNR / Mandatsreferenz', 'Datum Einwilligungserklärung', 'Studium bis', 'Vorname Einwilligungserklärung', 'Datum letztes Versicherungsverhältnis bis', 'Name Versicherter / Firma', 'Steuer-Identifikationsnummer', 'Vermittlernummer od. MA BKK firmus', 'Stunden wöchentliche Arbeitszeit', 'Geburtsort / -land', 'BIC (11 Stellen)']  

# Reorder the dataframe
df_multi_org = df_multi[desired_order]
# Ensure reticulate uses Pandas DataFrames
r_df_multi_org = r.df_multi_org = df_multi_org  # Make available in R

Resulting Database:

IMAGE FILLED FORMS

Alternative: If the PDF is a Scanned Image

If the PDF is an image-based form (scanned or handwritten), you will need OCR.

import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:/Program Files/Tesseract-OCR/tesseract.exe"

#print("Tesseract is working!")

Apply Regular Expressions to Extract Responses

Once you have the extracted text as a string, you can apply the regexes we defined to extract each blank field.


# Set paths
pdf_folder = os.path.join(path_upto_file, "02_Image filled forms/example")
poppler_path = r"C:/Program Files/poppler-24.08.0/Library/bin"
pytesseract.pytesseract.tesseract_cmd = r"C:/Program Files/Tesseract-OCR/tesseract.exe"

# Get all PDFs in the folder
pdf_files = [os.path.join(pdf_folder, f) for f in os.listdir(pdf_folder) if f.endswith(".pdf")]

# List to store extracted data
all_form_data = []

# Define regex patterns for text fields
regex_patterns = {
    "möchte ab dem": r"Ja,\s*ich möchte ab dem\s*(.*?)\s*Mitglied",
    "Name": r"Name\s*(.*?)\s*‘",
    "Vorname": r"/ab dem\s*(.*?)\s*!",
    "Geburtsdatum": r"Geburtsdatum\s*(.*?)\s*Straße \| Hausnummer",
    "Straße | Hausnummer": r"Straße \| Hausnummer\s*(.*?)\s*PLZ \| Ort",
    "PLZ | Ort": r"PLZ \| Ort\s*(.*?)\s*Telefon privat",
    "Telefon privat": r"Telefon privat\s*(.*?)\s*Telefon mobil",
    "Telefon mobil": r"Telefon mobil\s*(.*?)\s*E-Mail",
    "E-Mail": r"E-Mail\s*(.*?)\s*Rentenversicherungsnummer",
    "Rentenversicherungsnummer": r"Rentenversicherungsnummer\s*(.*?)\s*Geburtsort, -land",
    "Geburtsort, -land": r"Geburtsort, -land\s*(.*?)\s*Geburtsname",
    "Geburtsname": r"Geburtsname\s*(.*?)\s*Staatsangehörigkeit",
}

# Define regex patterns for checkboxes
checkbox_patterns = {
    "Ich beziehe Rente bzw. habe Rente beantragt": r"Ich beziehe Rente bzw\. habe Rente beantragt.*?(✓|X|On|Ja)?\s*(?:\n|$)",
    "Ich erhalte Versorgungsbezüge": r"Ich erhalte Versorgungsbezüge.*?(✓|X|On|Ja)?\s*(?:\n|$)",
    "Erste Beschäftigung als Arbeitnehmer": r"Es handelt sich um meine erste Beschäftigung als Arbeitnehmer.*?(✓|X|On|Ja)?\s*(?:\n|$)"
}

# Process each PDF
for pdf_path in pdf_files:
    #print(f"Processing: {pdf_path}")

    # Convert PDF to images
    images = convert_from_path(pdf_path, poppler_path=poppler_path)

    # Extract text from images
    raw_text = "\n".join([pytesseract.image_to_string(img, lang="deu") for img in images])

    # Extract form responses using regex
    extracted_data = {}

    # Extract text fields
    for field, pattern in regex_patterns.items():
        match = re.search(pattern, raw_text, re.DOTALL)
        extracted_data[field] = match.group(1).strip() if match else "Not Found"

    # Extract checkbox values
    for field, pattern in checkbox_patterns.items():
        match = re.search(pattern, raw_text, re.DOTALL)
        extracted_data[field] = "TRUE" if match and match.group(1) else "FALSE"

    # Append extracted data to list
    all_form_data.append(extracted_data)

# Convert all extracted data into a DataFrame
df_multi_ima = pd.DataFrame(all_form_data)

# Ensure reticulate uses Pandas DataFrames
r_df_multi_ima = r.df_multi_ima = df_multi_ima  # Make available in R

Resulting Database:

TEST RESULTS 25.04.04

UI TESTS

# Load all sheets into a dictionary
sheets_dict = pd.read_excel("00_data/01-BKK FIRMUS-TESTS LOGS.xlsx", sheet_name=None, engine="openpyxl")

# Define sheets to include
log_sheets = {"ba", "se", "so", "st"}  

# Add 'form' column (sheet name) before merging, then concatenate
logs_df = pd.concat(
    [sheets_dict[sheet].assign(form=sheet) for sheet in log_sheets if sheet in sheets_dict], 
    ignore_index=True)

# Apply clean_names() and remove rows where 'input' is NA
logs_df = (logs_df
    .clean_names()
    .replace(["", " ", "NA", "N/A"], pd.NA)  # Ensure "NA" strings are true NaNs
    .dropna(subset=["input"])  # Drop rows where input is NaN
)

# Format file_name column as two-digit string
logs_df["file_name"] = (
    pd.to_numeric(logs_df["file_name"], errors="coerce")
    .astype("Int64")  # nullable integer in case there are still NaNs
    .astype(str)
    .str.zfill(2)
)

logs_25_03_31 = logs_df[logs_df["time_stamp"].str.contains("2025-03-31", na=False)]

Ratings per Form, File and type on input

(hf = handfilled forms, im = virtually filled forms saved as images, no edition available)

OBSERVATIONS:

se_09_hf (0.79), ba_09_hf (0.77) and ba_10_hf (0.73): Very difficult handwritting, and Name/Vorname filled the other way around with model trying to solve it. (Example images below)

st_06_hf (0.75): Dates taken from different blanks rather than from the explected ones (right next to the fields), telephones swap,difficult handwritting and mistaken checkboxes.(Example image below-left)
so_01_im (0.10), so_03_im (0.10) and so_05_im (0.09): Schema broken in output.(Example image below-right)

Ratings per Form and Prompt (Empty or full)

# Optional: ensure types
logs_25_03_31['form'] = logs_25_03_31['form'].astype(str)
logs_25_03_31['file_name'] = logs_25_03_31['file_name'].astype(str)
logs_25_03_31['prompt_type'] = logs_25_03_31['prompt_type'].astype(str)

# Set plot size
plt.figure(figsize=(12, 8))

# Create the boxplot
sns.boxplot(
    data=logs_25_03_31,
    x='correct_ratio',
    y='form',
    hue='prompt_type',
    palette='crest', # https://seaborn.pydata.org/tutorial/color_palettes.html COLOR PALETTES
    order= sorted(logs_25_03_31['form'].unique())
)

# Add titles and labels
plt.title("")
plt.xlabel("Correct Ratio")
plt.ylabel("Form")
plt.legend(title="Prompt_type", bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.show()

BATCH TESTS

Ratings per Form, File and type on input

(hf = handfilled forms, im = virtually filled forms saved as images, no edition available)

#TAKE NEW OBSERVATIONS
logs_25_04_03 = logs_df[logs_df["time_stamp"].astype(str).str.contains("2025-04-03", na=False)]

OBSERVATIONS:

In these tests, all results below 0.8 are for handfilled cases, and most of them for the ones with worst handwritting (Fig. 1 and 2).

Figures 1-2

Most common issues are related to:
- Rentenversicherungsnummer, where the divisions are missinterpreted as ones, and the letter as a number.(Fig.3)
- Checkboxes, usually when they are compressed, or when the Xes are not as clear or as centered within the box.(Fig.4)
- Name/Voname pairs in cases where they were completed wrongly, the model corrects this and we don’t want that.(Fig.5)
- Private/Mobile Numbers switched, or mistaken when one is empty.
- Wrong dates when there is more than one, it sometimes get a different one rather than the one next to the requested field.

Figure 3

Figures 4-5

Figure 6

Ratings per Form and Prompt (Empty or full)

# Optional: ensure types
logs_25_04_03['prompt_type'] = logs_25_04_03['prompt_type'].astype(str)

# Set plot size
plt.figure(figsize=(12, 8))

# Create the boxplot
sns.boxplot(
    data=logs_25_04_03,
    x='correct_ratio',
    y='form',
    hue='prompt_type',
    palette='crest', # https://seaborn.pydata.org/tutorial/color_palettes.html COLOR PALETTES
    order= sorted(logs_25_04_03['form'].unique())
)

# Add titles and labels
plt.title("")
plt.xlabel("Correct Ratio")
plt.ylabel("Form")
plt.legend(title="Prompt_type", bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.show()

BKK FIRMUS
|| Database from forms

VIRTUALLY FILLED FORMS

Extract Text from the PDF

Store Multiple Forms

Resulting Database:

IMAGE FILLED FORMS

Alternative: If the PDF is a Scanned Image

Apply Regular Expressions to Extract Responses

Resulting Database:

TEST RESULTS 25.04.04

UI TESTS

Ratings per Form, File and type on input

OBSERVATIONS:

Ratings per Form and Prompt (Empty or full)

BATCH TESTS

Ratings per Form, File and type on input

OBSERVATIONS:

Ratings per Form and Prompt (Empty or full)

ARCHIVE

Display DataFrame

Print extracted form responses

Convert to DataFrame

Display DataFrame

Use the OCR-extracted text as a string

Define regex patterns**

Extract field values

Print extracted results

Handle Checkboxes (True/False)

Define improved regex patterns for checkboxes

Extract checkbox values

Print checkbox values

Store the Extracted Data

BKK FIRMUS || Database from forms

VIRTUALLY FILLED FORMS

Extract Text from the PDF

Store Multiple Forms

Resulting Database:

IMAGE FILLED FORMS

Alternative: If the PDF is a Scanned Image

Apply Regular Expressions to Extract Responses

Resulting Database:

TEST RESULTS 25.04.04

UI TESTS

Ratings per Form, File and type on input

OBSERVATIONS:

Ratings per Form and Prompt (Empty or full)

BATCH TESTS

Ratings per Form, File and type on input

OBSERVATIONS:

Ratings per Form and Prompt (Empty or full)

ARCHIVE

Display DataFrame

Print extracted form responses

Convert to DataFrame

Display DataFrame

Use the OCR-extracted text as a string

Define regex patterns**

Extract field values

Print extracted results

Handle Checkboxes (True/False)

Define improved regex patterns for checkboxes

Extract checkbox values

Print checkbox values

Store the Extracted Data

BKK FIRMUS
|| Database from forms