sadna_data_analysis_w

SADNA DATA ANALYSIS
|| Contracts task logs and study

CONTRACTS TASK LOGS
|| Prompts, responses, sources, ratings

This document is a series of reports analyzing the data from the tests carried out for the contracts task.

First steps:

We import the logs from the Gsheet we work on.

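#PACKAGES USED THROUGHOUT THE REPORT
library(readxl)      # read_excel(), excel_sheets()
library(dplyr)       # mutate(), filter(), group_by(), lag(), lead(), ...
library(janitor)     # clean_names()
library(stringr)     # str_detect(), str_remove()
library(lubridate)   # hours()
library(hms)         # as_hms()
library(ggplot2)     # plots
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), row_spec()

#FUNCTION TO IMPORT THE DATA FROM ONE SHEET OF THE XLSX FILE (APPLIED TO EVERY SHEET BELOW)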
data_from_xl= function (sheetname) {
  data.frame(read_excel(
  "00-data/01-CONTRACTS-PROMPTING REGISTRY.xlsx",
  sheet=sheetname,
  #range = "A1:D100",
  col_names = TRUE,
  col_types = NULL,
  na = "",
  trim_ws = TRUE,
  skip = 0) %>% 
    clean_names() %>% 
    mutate(field = sheetname)
  )
}

sheet_names = excel_sheets("00-data/01-CONTRACTS-PROMPTING REGISTRY.xlsx")%>% 
              setdiff(c("INFO+RATING","SUMMARY","SOURCE RATING"))


s_all_contracts <- lapply(sheet_names, data_from_xl)

#BINDING OBSERVATIONS FROM ALL SHEETS TO GET THE COMPLETE LOG (WITHOUT BACKUP)
s_all_contracts <- do.call(rbind, s_all_contracts) %>% 
                  as.data.frame()

s_all_contracts = s_all_contracts %>%
 mutate(time_stamp = as.POSIXct(time_stamp, format="%Y-%m-%d %H:%M:%S", tz="UTC"),
        prompt_rating = as.numeric(prompt_rating),
        prompt_rating = round(prompt_rating, digits = 2))


#ADDING THE FILE NAMES TO THE CASE NAMES
concatenate_matches <- function(file_name, text_all_sources) {
  if (is.na(text_all_sources)) {
    return(file_name)
  }
  matches <- c()
  
  if (str_detect(text_all_sources, " 5\\.pdf|0_566|5.txt")) {
    matches <- c(matches, "5.pdf")
  }
  if (str_detect(text_all_sources, "8.pdf|8.txt|8_.txt")) {
    matches <- c(matches, "8.pdf")
  }
  if (str_detect(text_all_sources, "9.pdf|9.txt")) { 
    matches <- c(matches, "9.pdf")
  }
  if (str_detect(text_all_sources, "20230913")) {
    matches <- c(matches, "20230913.pdf")
  }
  if (str_detect(text_all_sources, "swm")) {
    matches <- c(matches, "swm.pdf")
  }
  
  if (length(matches) > 0) {
    file_name <- paste(file_name, paste(matches, collapse = " - "), sep = " - ")
  }
  
  return(file_name)
}

s_all_contracts = s_all_contracts %>% 
  filter(!is.na(prompt), prompt!="") %>%
  filter(!is.na(output), output!="") %>%
  filter(!is.na(time_stamp))%>%
  mutate(case = file_name,
         file_name = mapply(concatenate_matches, file_name, text_all_sources))

We unify, clean and filter all logs to create a single database:

#MAKING SURE TIME STAMPS ARE STORED AS UTC DATE-TIMES
s_all_contracts = s_all_contracts %>%
 mutate(time_stamp = as.POSIXct(time_stamp, format="%Y-%m-%d %H:%M:%S", tz="UTC"))


#COMPLETING SOURCE N:
s_all_contracts = s_all_contracts %>%
  mutate(source_n_temp = rowSums(across(c(source_1, source_2, source_3, source_4, source_5, source_6, source_7, source_8, source_9, source_10), ~ . != "-")))

#Replace NA values in source_n AND ARRANGE/ORDER BY TIMESTAMP:
s_all_contracts = s_all_contracts %>%
  mutate(source_n = ifelse(is.na(source_n), source_n_temp, source_n)) %>%
  select(-source_n_temp) %>% 
  arrange(time_stamp) 

#TURNING HALLUCINATION AND TESTING TYPE INTO DUMMIES:
s_all_contracts = s_all_contracts %>%
  mutate(hallucination = ifelse(grepl("H", categories),1,0)) %>% 
                       #CREATE VARIABLES TO CHECK TIME DIFFERENCE BETWEEN EACH LOG AND THE PREV AND NEXT:
                       mutate(time_diff_prev = difftime(time_stamp, lag(time_stamp), units = "mins"),
                              time_diff_next = difftime(lead(time_stamp), time_stamp, units = "mins"))


#COMPLETING PARALLEL TESTING FIELD WITH TIME CHECKS
s_all_contracts = s_all_contracts %>%
  mutate(time_diff_prev = ifelse(field == lag(field), 12, time_diff_prev), #CONSIDERING THAT IF THE PREVIOUS OBSERVATION IS FOR THE SAME FIELD, THEN IT IS THE SAME PERSON, NO PARALLEL TESTING.
         time_diff_next = ifelse(field == lead(field), 12, time_diff_next),#CONSIDERING THAT IF THE NEXT OBSERVATION IS FOR THE SAME FIELD, THEN IT IS THE SAME PERSON, NO PARALLEL TESTING.
         within_10_minutes = if_else(time_diff_prev <= 10 | time_diff_next <= 10, 1, 0, missing = 0),
         parallel_testing = ifelse(is.na(parallel_testing)| parallel_testing != "p",
                                    ifelse(within_10_minutes == 1, 1, 0),
                                                                      1))

#ADDING THE CEST TIME STAMP AND CONVERTING RATINGS TO NUMERIC (parallel_testing IS ALREADY A 0/1 DUMMY AT THIS POINT)
s_all_contracts = s_all_contracts %>%
  mutate(time_stamp_CEST = time_stamp + hours(5),
         case_rating = as.numeric(case_rating),
         prompt_rating = as.numeric(prompt_rating)) %>% 
  filter(!is.na(case_rating))


#REMOVING VARIABLES THAT ARE NO LONGER NECESSARY AND OUTLIER OBSERVATIONS:
prompts_to_exclude <- c(
  "What are the two contract parties participating in the contract?",
  "What is the number of the contract?",
  "What is the number of the contract? Look for something like \"56690369\".",
  "Can you find \"56690369\" in the given document?",
  "Can you find \"56690369\" in the given document? If yes, which function or meaning does the number have?"
)


#CLEANING OUTLIERS  
s_all_contracts_raw = s_all_contracts %>% 
  filter(!prompt %in% prompts_to_exclude) %>% #SPECIFIC PROMPTS THAT ARE PECULIAR
  mutate(file_name = as.factor(file_name), #TRANSFORMING SOME VARIABLES INTO FACTORS
         source_n = as.factor(source_n),
         field = str_remove(field,"BKP"))%>%
  mutate_if(is.factor, droplevels)%>%
  select(-c(within_10_minutes,time_diff_prev,time_diff_next,model_input_sources_prompt))

s_all_contracts = s_all_contracts_raw %>% 
  filter(source_n %in% c(4, 6, 8, 10) & #KEEPING ONLY THE EXPECTED SOURCE NUMBERS
         !is.na(text_all_sources) #DROPPING OBSERVATIONS WITHOUT SOURCES
         )
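
The time-window rule used above to complete parallel_testing can be illustrated on a toy log. The following is a minimal base-R sketch, assuming made-up time stamps and field names: a log is flagged when a neighbouring log from a different field lies within 10 minutes, which is the idea behind within_10_minutes.

```r
# Toy log: time stamps and field names are made up for illustration
logs <- data.frame(
  time_stamp = as.POSIXct(c("2024-07-05 09:00:00", "2024-07-05 09:04:00",
                            "2024-07-05 09:30:00", "2024-07-05 09:33:00"),
                          tz = "UTC"),
  field = c("PARTIES", "DATES", "DATES", "DATES"),
  stringsAsFactors = FALSE
)
n <- nrow(logs)

# Minutes to the previous and to the next log (NA at the edges)
gap <- as.numeric(difftime(logs$time_stamp[-1], logs$time_stamp[-n], units = "mins"))
diff_prev <- c(NA, gap)
diff_next <- c(gap, NA)

# TRUE when the neighbouring log belongs to a different field (NA at the edges)
other_field <- logs$field[-1] != logs$field[-n]
other_prev <- c(NA, other_field)
other_next <- c(other_field, NA)

# A neighbour only counts if it is close in time AND comes from another field
close_prev <- !is.na(diff_prev) & diff_prev <= 10 & other_prev
close_next <- !is.na(diff_next) & diff_next <= 10 & other_next
logs$within_10_minutes <- as.integer(close_prev | close_next)
print(logs)
```

In this toy log the first two rows are flagged (different fields, 4 minutes apart), while the last two are not, because their only close neighbour belongs to the same field.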

Hallucination

HALLUCINATION
|| Relation to source number and parallel testing

First we look at the number of tests performed with each model and how many of them show hallucination:

#SELECTING ONLY THE MODELS WE WANT TO ANALYSE
s_all_contracts_v2 = s_all_contracts %>% 
                       filter(grepl("v2_",model))

#COUNTING OBSERVATIONS PER MODEL, HALLUCINATION FLAG AND SOURCE NUMBER
hallucination_sources = s_all_contracts_v2 %>% 
                       #filter(hallucination==1, !is.na(source_n)) %>%  
  group_by(model,hallucination,source_n) %>% 
  summarise(observations = n()) %>%
  ungroup()

hallucination_sources_m = hallucination_sources %>% 
  group_by(model) %>% 
  summarise(observations_model = sum(observations)) %>% 
  ungroup()

hallucination_sources = left_join(hallucination_sources, hallucination_sources_m, by = "model")

hallucination_sources = hallucination_sources %>%
  mutate(percentage = (observations/observations_model)*100) %>%
  mutate_if(is.numeric, round, digits = 2)
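
The percentage computed above is the share of each model's tests that falls in a given hallucination and source-number group. A minimal base-R sketch of the same calculation, assuming a made-up data frame with two hypothetical models "A" and "B", can be used to sanity-check that the percentages add up to 100 within each model:

```r
# Made-up observations: model names and values are illustrative only
toy <- data.frame(
  model = c("A", "A", "A", "A", "B", "B"),
  hallucination = c(1, 0, 0, 0, 1, 1),
  source_n = c(4, 4, 8, 8, 4, 8)
)
toy$observations <- 1

# Count observations per model, hallucination flag and source number
counts <- aggregate(observations ~ model + hallucination + source_n,
                    data = toy, FUN = sum)

# Total observations per model, repeated on every row of that model
counts$observations_model <- ave(counts$observations, counts$model, FUN = sum)

# Share of the model's tests represented by each group
counts$percentage <- round(100 * counts$observations / counts$observations_model, 2)
print(counts)

# Percentages should sum to 100 within each model
print(tapply(counts$percentage, counts$model, sum))
```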

Tests with hallucination per source number (number of observations and the percentage they represent)


REGRESSION (SECTION WITH ERRORS, NEEDS CORRECTING)

REGRESSIONS
|| Correlation of different variables with hallucination


#REGRESSION

#CLEANING THE DATABASE
s_all_contracts_v2 = s_all_contracts_v2 %>% 
  filter(source_n != "0" &
         source_n != "2" & 
         !is.na(text_all_sources)) %>% 
  #filter(file_name %in% c("Becker - Abbott (X4) - 8.pdf",
                          #"Becker - Beckman - 9.pdf",
                          #"Becker - Siwig - 20230913.pdf",
                          #"Becker - SWM - swm.pdf",
                          #"Becker - Abbott (X4) - 5.pdf"))  %>%
  mutate(file_name = as.factor(file_name),
         source_n = as.factor(source_n))%>%
  mutate_if(is.factor, droplevels)

#LOGISTIC REGRESSION INCLUDING PARALLEL TESTING, FILE NAMES AND NUMBER OF SOURCES AS VARIABLES. HALLUCINATION AS VARIABLE OF INTEREST:
regression_pt_fn_sn <- glm(hallucination ~ parallel_testing + file_name + source_n, data = s_all_contracts_v2, family = binomial)

Logistic Regression Results
Term	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	-2.5955721	0.7672023	-3.3831652	0.0007166
parallel_testing	-0.3702465	0.3867738	-0.9572687	0.3384316
file_nameBecker - Abbott (X4) - 20230913.pdf	-15.6002500	6522.6386561	-0.0023917	0.9980917
file_nameBecker - Abbott (X4) - 5.pdf	0.5124146	0.7542288	0.6793887	0.4968916
file_nameBecker - Abbott (X4) - 5.pdf - 9.pdf	-15.0343726	1330.2659712	-0.0113018	0.9909827
file_nameBecker - Abbott (X4) - 8.pdf	-0.6373430	0.7851345	-0.8117629	0.4169277
file_nameBecker - Abbott (X4) - 9.pdf	-15.5951041	4555.2107032	-0.0034236	0.9972684
file_nameBecker - Abbott (X4) - swm.pdf	-15.6002500	6522.6386561	-0.0023917	0.9980917
file_nameBecker - Beckman	-15.0755667	1331.4282229	-0.0113229	0.9909659
file_nameBecker - Beckman - 8.pdf	-15.0755667	6522.6386420	-0.0023113	0.9981559
file_nameBecker - Beckman - 9.pdf	0.3904204	0.7541464	0.5176984	0.6046687
file_nameBecker - Beckman - 9.pdf - 20230913.pdf	22.4268168	6522.6386601	0.0034383	0.9972566
file_nameBecker - Siwig	-15.1351506	1227.7746346	-0.0123273	0.9901645
file_nameBecker - Siwig - 20230913.pdf	-0.8544149	0.7992249	-1.0690543	0.2850452
file_nameBecker - SWM	-15.0755667	1190.8656487	-0.0126593	0.9898996
file_nameBecker - SWM - 20230913.pdf	3.4905018	1.5859688	2.2008641	0.0277456
file_nameBecker - SWM - swm.pdf	-1.6086162	0.8602144	-1.8700177	0.0614814
source_n6	-0.6967296	0.3177356	-2.1927966	0.0283220
source_n8	-0.8949298	0.2717318	-3.2934305	0.0009897
source_n10	-0.6970967	0.2959086	-2.3557839	0.0184837
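
Several file_name rows in the table have estimates of about -15 or +22 with standard errors in the thousands. In logistic regression this pattern typically appears when a factor level has no variation in the outcome (all 0 or all 1), often because it has very few observations. The following is a minimal sketch, assuming made-up file names and outcomes, of how such levels could be listed before fitting the model:

```r
# Made-up data: "f1", "f2", "f3" are hypothetical file names
toy <- data.frame(
  file_name = c("f1", "f1", "f1", "f2", "f2", "f2", "f3", "f3"),
  hallucination = c(0, 1, 0, 0, 0, 0, 1, 1)
)

# Cross-tabulate file name against the outcome
tab <- table(toy$file_name, toy$hallucination)
print(tab)

# Levels where only one outcome value occurs cannot be estimated reliably
no_variation <- rownames(tab)[rowSums(tab > 0) < 2]
print(no_variation)
```

Levels listed in no_variation could then be merged, dropped, or reported separately; which option is appropriate depends on the analysis.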

parallel_testing does not have a statistically significant effect on the likelihood of hallucination in this model (estimate -0.37, p = 0.34).
Among the file names, only “Becker - SWM - 20230913.pdf” differs significantly from the baseline file at the 5% level, with higher odds of hallucination (estimate 3.49, p = 0.028); “Becker - SWM - swm.pdf” is borderline, with lower odds (estimate -1.61, p = 0.061). The remaining file names, including “Becker - Beckman”, “Becker - Siwig” and “Becker - SWM”, show no significant effect. Levels with estimates around -15 or +22 and standard errors in the thousands could not be estimated reliably, most likely because those files have very few observations or no variation in hallucination. Becker-Beckman is the lightest file with 206KB.
All source_n levels (6, 8 and 10) are associated with significantly lower odds of hallucination than the baseline of 4 sources (p = 0.028, 0.001 and 0.018, respectively).

#ODDS RATIO
exp(-0.3702465)

[1] 0.6905641

The odds of hallucination are about 0.69 times as high (roughly 31% lower) when parallel testing is present, but this difference is not statistically significant.

More about parallel testing coefficient

The coefficient for parallel_testing is -0.3702. This indicates that the log-odds of hallucination decrease by 0.3702 when parallel_testing is 1, compared to when it is 0, holding file name and number of sources constant.
The standard error for the intercept is 0.7672, and for parallel_testing it is 0.3868.
The p-value for the intercept is 0.0007, indicating it is highly significant. The p-value for parallel_testing is 0.338, which is greater than 0.05, so the effect is not statistically significant.
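
As a worked example, the odds ratio and an approximate 95% Wald confidence interval can be derived from the parallel_testing row of the Logistic Regression Results table (estimate -0.3702465, standard error 0.3867738). This is a minimal sketch using only those two table values; the Wald interval is an approximation and is an assumption of this example, not something computed in the original analysis.

```r
# Values copied from the parallel_testing row of the results table
estimate <- -0.3702465
std_error <- 0.3867738

# Odds ratio: exponentiated log-odds coefficient
odds_ratio <- exp(estimate)

# Approximate 95% Wald interval on the log-odds scale, then exponentiated
z <- qnorm(0.975)
ci <- exp(estimate + c(-1, 1) * z * std_error)

print(round(odds_ratio, 2))
print(round(ci, 2))

# An interval that contains 1 is consistent with no effect
print(ci[1] < 1 && ci[2] > 1)
```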

PDF AI-tempv2

MODEL VERSIONS PERFORMANCE
|| PDF AI-tempv2_28.06.24, 05.07.24 and 07.07.24

Prompt Rating-Time Relation

The objective was to see whether prompt ratings showed any trend over the course of the day.

Prompts, files, number of tests, and individual testing were kept constant.
The plot also colors observations by the number of sources requested, to look for possible patterns there as well.

#V2-TIME-SOURCE N-MODELS

#DISPERSION PLOT TO SEE PERFORMANCE ALONG THE DAY
s_all_contracts_v2 <- s_all_contracts_v2 %>% 
  mutate(time_component = format(time_stamp_CEST, "%H:%M:%S"), # Clock time as "HH:MM:SS" (also works at midnight)
         time_component = as_hms(time_component),
         time_component = as.numeric(time_component) / 3600)  # Convert to hours

source_n_colors <- c("4" = "#ffd966", 
                     "6" = "#93c47d", 
                     "8" = "#6fa8dc", 
                     "10" = "#c27ba0")

#VERSION 28_06_24
performance_time_plot_28_06_24 = ggplot(s_all_contracts_v2 %>% 
                                          filter(model == "PDF AI-temp_v2_28.06.24"), 
                                        aes(x = prompt_rating, y = time_component, colour = as.character(source_n))) +
                        geom_point() +
                        labs(x = "Prompt Rating", y = "Timestamp", colour = "Number of sources") +
                        scale_x_continuous(limits = c(-0.1, 1.1), n.breaks = 10) +
                        scale_y_continuous(limits = c(0, 23),breaks = seq(0, 23, by = 2)) +
  scale_color_manual(values = source_n_colors) +  
  theme_minimal()+
  theme(axis.text.y = element_text())


#VERSION 05.07.24
performance_time_plot_05_07_24 = ggplot(s_all_contracts_v2 %>% 
                                          filter(model == "PDF AI-temp_v2_05.07.24"), 
                                        aes(x = prompt_rating, y = time_component, colour = as.character(source_n))) +
                        geom_point() +
                        labs(x = "Prompt Rating", y = "Timestamp", colour = "Number of sources") +
                        scale_x_continuous(limits = c(-0.1, 1.1), n.breaks = 10) +
                        scale_y_continuous(limits = c(0, 23),breaks = seq(0, 23, by = 2)) +
  scale_color_manual(values = source_n_colors) +  
  theme_minimal()+
  theme(axis.text.y = element_text())



#VERSION 07.07.24
performance_time_plot_07_07_24 = ggplot(s_all_contracts_v2 %>% 
                                          filter(model == "PDF AI-temp_v2_07.07.24"), 
                                        aes(x = prompt_rating, y = time_component, colour = as.character(source_n))) +
                        geom_point() +
                        labs(x = "Prompt Rating", y = "Timestamp", colour = "Number of sources") +
                        scale_x_continuous(limits = c(-0.1, 1.1), n.breaks = 10) +
                        scale_y_continuous(limits = c(0, 23),breaks = seq(0, 23, by = 2)) +
  scale_color_manual(values = source_n_colors) +  
  theme_minimal()+
  theme(axis.text.y = element_text())

[Plots: prompt rating against time of day, colored by number of sources, one per model version: PDF AI-temp_v2_28.06.24, PDF AI-temp_v2_05.07.24, PDF AI-temp_v2_07.07.24]

There do not appear to be any strong upward or downward trends in ratings over time within the observed time frame. The ratings do not show a clear pattern of increasing or decreasing as time progresses.
Concerning the number of sources, the sample of prompts we tested with 6 sources might be too small to draw conclusions, but its observations are clearly in the lower half of the rating range.
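
The visual impression of no trend could be complemented with a rank correlation between time of day and prompt rating. The following is a minimal sketch on made-up values (the hours and ratings are not taken from the logs); a Spearman coefficient close to 0 would be consistent with the absence of a monotonic trend.

```r
# Made-up example: hour of day (decimal) and prompt rating
toy <- data.frame(
  hour = c(9, 10, 11, 13, 14, 15, 16, 17),
  prompt_rating = c(0.6, 0.4, 0.7, 0.5, 0.6, 0.4, 0.7, 0.5)
)

# Spearman rank correlation; exact = FALSE because the ratings contain ties
trend <- cor.test(toy$hour, toy$prompt_rating,
                  method = "spearman", exact = FALSE)

print(unname(trend$estimate))  # rank correlation (rho)
print(trend$p.value)           # p-value for the null of no monotonic association
```

With the real data, the same call could be run once per model version on time_component and prompt_rating.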

Case ratings per model version

#CASE RATING-MODELS

model_colors <- c("PDF AI-temp_v2_28.06.24" = "#f6b26b", 
                  "PDF AI-temp_v2_05.07.24" = "#a2bf96", 
                  "PDF AI-temp_v2_07.07.24" = "lightblue4")

performance_model_case = ggplot(s_all_contracts_v2, 
                                        aes(x = case_rating , y = as.character(model), colour = as.character(model))) +
                        geom_point() +
                        labs(x = "Case Rating", y = "Model") +
  scale_color_manual(values = model_colors, guide = "none") +  # Apply the specified colors
  theme_minimal()

Prompt ratings per model version

performance_model_prompt <- ggplot(s_all_contracts_v2, 
                                   aes(x = prompt_rating, y = as.character(model), colour = as.character(model))) +
  geom_point() +
  labs(x = "Prompt Rating", y = "Model") +
  scale_color_manual(values = model_colors, guide = "none") +  # Apply the specified colors
  theme_minimal()

Case Rating-Files Relation

The objective was to see if particular Cases/Files had a better performance.

# Create the boxplot

#MODEL 28_06_24
ss_28_06_24 = s_all_contracts_v2 %>%
  filter(model == "PDF AI-temp_v2_28.06.24")

performance_plots = function(datax,colorx,titlex) {
  ggplot(datax, aes(x = case_rating, y = file_name)) +
  geom_boxplot(fill = colorx, color = colorx, alpha = 0.5) +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 3, fill = colorx) +
  theme_minimal() +
  scale_x_continuous(limits = c(-0.1, 1.1), n.breaks = 10) +
  labs(title = titlex,
       x = "Case Rating",
       y = "File Name") +
  geom_vline(xintercept = mean(datax$case_rating, na.rm = TRUE), linetype = "dashed", color = "black") # Mean case rating of the model being plotted
}


#MODEL 28_06_24
performance_files_plot_28_06_24 = performance_plots (ss_28_06_24,"#f9cb9c","Ratings by Case/File - 28.06.24")


#MODEL 05_07_24
ss05_07_24 = s_all_contracts_v2 %>%
  filter(model == "PDF AI-temp_v2_05.07.24")


performance_files_plot_05_07_24 = performance_plots (ss05_07_24,"#b6d7a8","Ratings by Case/File - 05.07.24")

#MODEL 07_07_24
ss07_07_24 = s_all_contracts_v2 %>% 
  filter(model == "PDF AI-temp_v2_07.07.24")

performance_files_plot_07_07_24 = performance_plots (ss07_07_24,"lightblue4","Ratings by Case/File - 07.07.24")


Plot References

Median: The central line in each box represents the median rating.
Interquartile Range (IQR): The box itself represents the middle 50% of the data, from the first quartile (Q1) to the third quartile (Q3).
Whiskers: The lines extending from the box (whiskers) show the range of the data within 1.5 times the IQR from the quartiles.
Outliers: Data points outside the whiskers are potential outliers, shown as individual dots.
Mean: The diamond shape inside the box represents the mean rating for each file.
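
The quantities listed above can be reproduced by hand. The following is a minimal base-R sketch on made-up ratings for a single file, assuming the usual convention of quartiles from quantile() and whiskers reaching the most extreme points within 1.5 times the IQR:

```r
# Made-up case ratings for a single file
ratings <- c(0.05, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80)

# Quartiles and interquartile range
q <- quantile(ratings, probs = c(0.25, 0.5, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# Points beyond these fences are drawn as potential outliers
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr
inside <- ratings[ratings >= lower_fence & ratings <= upper_fence]

box_summary <- list(
  median = q[2],
  q1 = q[1],
  q3 = q[3],
  whiskers = range(inside),  # most extreme points still inside the fences
  outliers = ratings[ratings < lower_fence | ratings > upper_fence],
  mean = mean(ratings)       # the diamond in the plots
)
print(box_summary)
```

Here the single low rating (0.05) falls below the lower fence and would be drawn as an individual dot, which also pulls the mean below the median.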

Prompts-Models

Prompt performance in each model version

Comparison of the same prompts run on different model versions. It is important to consider:

Not every prompt was run through every model, so some comparisons will be missing.
The number of tests run for a prompt may affect its rating, so comparisons between identical prompts with different numbers of runs may not be as representative.

prompts_model_version = s_all_contracts_v2 %>% 
  group_by(model, prompt) %>% 
  summarise(number_of_runs = n(),
            prompt_rating = first(prompt_rating)) %>% 
  arrange(prompt) %>% 
  mutate(row_color = case_when(
                      model == "PDF AI-temp_v2_28.06.24" ~ "#f9cb9c",
                      model == "PDF AI-temp_v2_05.07.24" ~ "#b6d7a8",
                      model == "PDF AI-temp_v2_07.07.24" ~ "lightblue4",
    TRUE ~ "#FFFFFF"  # Default color if the model does not match any of the specified ones
  ))

# Create the table excluding the row_color column
kable_output <- kable(prompts_model_version %>% select(-row_color), "html", escape = FALSE) %>%
  kable_styling(full_width = FALSE) %>%
  row_spec(0, background = "#D3D3D3")  # Header row color

# Apply row colors manually
for (i in seq_len(nrow(prompts_model_version))) {
  kable_output <- kable_output %>%
    row_spec(i, background = prompts_model_version$row_color[i])
}

Prompt ratings per model - Comparison table

References:
The “2x” in front of a prompt indicates that it was a second consecutive run of that prompt.
Row colors, for visual aid:
PDF AI-temp_v2_28.06.24 = #f9cb9c
PDF AI-temp_v2_05.07.24 = #b6d7a8
PDF AI-temp_v2_07.07.24 = lightblue4

model	prompt	number_of_runs	prompt_rating
PDF AI-temp_v2_05.07.24	2x Can you identify the start and end date of the contract in the provided information? If you are not able to extract a specific date, please return all the information that could help me identify or calculate both the commencement date and the termination or end date. Consider how this terms can be expressed in different laguages and in the context of a legal document. Please return only complete and correct answers.	5	0.54
PDF AI-temp_v2_28.06.24	2x Can you identify the start and end date of the contract in the provided information? If you are not able to extract a specific date, please return all the information that could help me identify or calculate both the commencement date and the termination or end date. Consider how this terms can be expressed in different laguages and in the context of a legal document. Please return only complete and correct answers.	30	0.54
PDF AI-temp_v2_07.07.24	Analyze the provided information to identify any numbers that could be “contract numbers.” Consider how “contract number” or similar terms might be expressed in different languages and common abbreviations used in contracts, for example numbers preceded by “NU” or “Nr.”. Extract every number that could potentially be a contract or agreement number. For each identified number, return it in the format: ‘Contract Number: XXX’. If no such number can be identified, respond with ‘Not available in the provided information’.	20	0.97
PDF AI-temp_v2_07.07.24	Analyze the provided information to identify any numbers that could be “contract numbers.” Consider how “contract number” or similar terms might be expressed in different languages and common abbreviations used in contracts. Extract every number that could potentially be a contract or agreement number. For each identified number, return it in the format: ‘Contract Number: XXX’. If no such number can be identified, respond with ‘Not available in the provided information’.	20	0.98
PDF AI-temp_v2_05.07.24	Based on the given information, identify the names of the two parties involved and the contract numbers associated with the agreement. Verify if any mention of “Local Laboratory” refers to a company named Labor Becker. Take into account how terms like “contract number” and “customer” might be expressed in different languages and their common abbreviations. Provide only the names of the two involved businesses or companies and the contract number. Ensure the responses are complete and accurate.	170	0.57
PDF AI-temp_v2_07.07.24	Based on the given information, identify the names of the two parties involved and the contract numbers associated with the agreement. Verify if any mention of “Local Laboratory” refers to a company named Labor Becker. Take into account how terms like “contract number” and “customer” might be expressed in different languages and their common abbreviations. Provide only the names of the two involved businesses or companies and the contract number. Ensure the responses are complete and accurate.	110	0.69
PDF AI-temp_v2_28.06.24	Based on the given information, identify the names of the two parties involved and the contract numbers associated with the agreement. Verify if any mention of “Local Laboratory” refers to a company named Labor Becker. Take into account how terms like “contract number” and “customer” might be expressed in different languages and their common abbreviations. Provide only the names of the two involved businesses or companies and the contract number. Ensure the responses are complete and accurate.	60	0.52
PDF AI-temp_v2_05.07.24	Based on the given information, identify the names of the two parties involved. Verify if any mention of “Local Laboratory” refers to a company named Labor Becker. Please provide the two partie in the following format: “contract party 1: XXX, contract party 2: XXX”. Please also provide the contract numbers associated with the agreement. Take into account how terms like “contract number” and “customer” might be expressed in different languages and their common abbreviations. Use the following format: “contract number: XXX”.	157	0.55
PDF AI-temp_v2_07.07.24	Based on the given information, identify the names of the two parties involved. Verify if any mention of “Local Laboratory” refers to a company named Labor Becker. Please provide the two partie in the following format: “contract party 1: XXX, contract party 2: XXX”. Please also provide the contract numbers associated with the agreement. Take into account how terms like “contract number” and “customer” might be expressed in different languages and their common abbreviations. Use the following format: “contract number: XXX”.	75	0.28
PDF AI-temp_v2_28.06.24	Based on the given information, identify the names of the two parties involved. Verify if any mention of “Local Laboratory” refers to a company named Labor Becker. Please provide the two partie in the following format: “contract party 1: XXX, contract party 2: XXX”. Please also provide the contract numbers associated with the agreement. Take into account how terms like “contract number” and “customer” might be expressed in different languages and their common abbreviations. Use the following format: “contract number: XXX”.	30	0.58
PDF AI-temp_v2_07.07.24	Based on the provided document, please identify and return the information which can help to conclude the end date or expiration date, or termination date, or expiry date, or end of term and the last effective date of this contract/agreement. If the information on the end date of the contract is not clear, look for the start date of the contract, or when this contract commence. Bear in mind that these terms may be in different languages, translate the original text first and then provide the answer.	20	0.91
PDF AI-temp_v2_07.07.24	Based on the provided information, could you identify the names of the two parties involved in the agreement? If there is a “Local Laboratory” mentioned, please search for its name, it might be Labor Becker, but only return that name if it appears in the document. Please return only the names of the two buisnesses or companies, complete answer, nothing else.	20	0.90
PDF AI-temp_v2_07.07.24	Based on the provided information, could you identify the names of the two parties involved in the agreement? If there is a “Local Laboratory” mentioned, please search for its name. Please also identify any numbers that could be “contract numbers.” Consider how “contract number” or similar terms might be expressed in different languages and common abbreviations used in contracts. Extract every number that could potentially be a contract or agreement number. For each identified number, return it in the format: ‘Contract Number: XXX’. If no such number can be identified, respond with ‘Not available in the provided information’. Please return only the names of the two businesses or companies and the number of the contract, complete answer, but nothing else and in German language. It is important that your response is complete and delivered in German.	1	0.35
PDF AI-temp_v2_07.07.24	Based on the provided information, could you identify the names of the two parties involved in the agreement? If there is a “Local Laboratory” mentioned, please search for its name. Please return only the names of the two buisnesses or companies, complete answer, but nothing else.	20	0.93
PDF AI-temp_v2_07.07.24	Based on the provided information, please identify and return the “end date” or “expiration date” or “end date” of the agreement. If the exact date is uncertain, present all relevant dates found with their corresponding references from the provided information. Consider variations in which “start date” or similar terms might be expressed in different languages or through common abbreviations used in contracts. Please ensure your response is complete and clearly stated.	20	0.93
PDF AI-temp_v2_07.07.24	Based on the provided information, please identify and return the “start date” or “commencement date” of the agreement. The date should be formatted as “Vertragsbeginn: dd/mm/yyyy”. If the exact date is uncertain, present all relevant dates found with their corresponding references from the provided information, formatted as “Vertragsbeginn (label or reference used in text): dd/mm/yyyy”. If no dates that could indicate the “start date” or “commencement date” are found, return “Nicht verfügbar in den bereitgestellten Informationen”. Consider variations in how “start date” or similar terms might be expressed in different languages or through common abbreviations used in contracts. Please ensure your response is complete, in the correct format, and fully written in German.	20	0.71
PDF AI-temp_v2_05.07.24	Based on the provided information, please identify the names of the two parties involved and the contract numbers associated with the agreement. If “Local Laboratory” is mentioned, determine if it refers to a company named Labor Becker. Consider variations and common abbreviations for “contract number” and “customer” in different languages. Return only the names of the two involved businesses or companies and the contract number. Ensure the answers are complete and accurate.	161	0.59
PDF AI-temp_v2_07.07.24	Based on the provided information, please identify the names of the two parties involved and the contract numbers associated with the agreement. If “Local Laboratory” is mentioned, determine if it refers to a company named Labor Becker. Consider variations and common abbreviations for “contract number” and “customer” in different languages. Return only the names of the two involved businesses or companies and the contract number. Ensure the answers are complete and accurate.	55	0.75
PDF AI-temp_v2_28.06.24	Based on the provided information, please identify the names of the two parties involved and the contract numbers associated with the agreement. If “Local Laboratory” is mentioned, determine if it refers to a company named Labor Becker. Consider variations and common abbreviations for “contract number” and “customer” in different languages. Return only the names of the two involved businesses or companies and the contract number. Ensure the answers are complete and accurate.	15	0.72
PDF AI-temp_v2_07.07.24	Basierend auf den bereitgestellten Informationen, können Sie bitte das “Startdatum” oder den “Beginntermin” der Vereinbarung identifizieren und zurückgeben? Das Datum sollte im Format “Vertragsbeginn: tt/mm/jjjj” angegeben werden. Falls das genaue Datum unsicher ist, präsentieren Sie bitte alle relevanten gefundenen Daten mit den entsprechenden Referenzen aus den bereitgestellten Informationen im Format “Vertragsbeginn (Label oder Referenz im Text): tt/mm/jjjj”. Wenn keine Daten gefunden werden, die auf das “Startdatum” oder den “Beginntermin” hinweisen könnten, geben Sie “Nicht verfügbar in den bereitgestellten Informationen” zurück. Berücksichtigen Sie dabei Variationen, wie “Startdatum” oder ähnliche Begriffe in verschiedenen Sprachen oder durch gängige Abkürzungen in Verträgen ausgedrückt werden könnten. Bitte stellen Sie sicher, dass Ihre Antwort vollständig, im richtigen Format und vollständig auf Deutsch verfasst ist.	20	0.56
PDF AI-temp_v2_07.07.24	Bitte nenne die zwei Vertragspartner, die in dem Vertrag involviert sind. Verwende dafür folgendes Format: “Vertragspartner 1: XX, Vertragspartner 2: XX”. Nenne ausschließlich die Namen der Vertragspartner, keine weiteren Informationen.	20	0.85
PDF AI-temp_v2_07.07.24	Bitte nenne mir die beiden Vertragspartner aus dem beigefügten Vertrag. Bitte nenne ausschließlich die Namen der beiden Parteien in folgendem Format: “Vertragspartner 1: xxx, Vertragspartner 2: xxx”	20	0.70
PDF AI-temp_v2_07.07.24	Bitte nenne mir die beiden Vertragspartner aus dem beigefügten Vertrag. Bitte nenne ausschließlich die Namen der beiden Parteien in folgendem Format: “Vertragspartner 1: xxx, Vertragspartner 2: xxx”. Bitte achte darauf, die korrekten und vollständigen Bezeichnungen der Vertragspartner zu nennen.	20	0.61
PDF AI-temp_v2_05.07.24	Can you determine the start and end dates of the agreement from the provided information? Please consider that “start date,” “commencement date,” and “end date” might be expressed in various languages and legal document contexts. Return the dates in the format “Start date: dd.mm.yyyy, End date: dd.mm.yyyy.” If either or both dates are not available, return “Start date: not available, End date: not available.” Ensure that your response is accurate and includes all potential start and end dates mentioned in the document.	5	0.69
PDF AI-temp_v2_28.06.24	Can you determine the start and end dates of the agreement from the provided information? Please consider that “start date,” “commencement date,” and “end date” might be expressed in various languages and legal document contexts. Return the dates in the format “Start date: dd.mm.yyyy, End date: dd.mm.yyyy.” If either or both dates are not available, return “Start date: not available, End date: not available.” Ensure that your response is accurate and includes all potential start and end dates mentioned in the document.	30	0.29
PDF AI-temp_v2_05.07.24	Can you identify the start and end date of the agreement in the provided information? Consider how “start date”, “commencement date” and “end date” might be expressed in different languages and in the context of a legal document. Please return the start and end dates in the format “Start date: dd.mm.yyyy, End date: dd.mm.yyyy”. If one or both dates are not mentioned, return “not available in the provided information”. Please make sure your response is correct and complete, and that you deliver all potential start and end dates mentioned in the document.	5	0.76
PDF AI-temp_v2_28.06.24	Can you identify the start and end date of the agreement in the provided information? Consider how “start date”, “commencement date” and “end date” might be expressed in different languages and in the context of a legal document. Please return the start and end dates in the format “Start date: dd.mm.yyyy, End date: dd.mm.yyyy”. If one or both dates are not mentioned, return “not available in the provided information”. Please make sure your response is correct and complete, and that you deliver all potential start and end dates mentioned in the document.	30	0.29
PDF AI-temp_v2_05.07.24	Can you identify the start and end date of the contract in the provided information? If you are not able to extract a specific date, please return all the information that could help me identify or calculate both the commencement date and the termination or end date. Consider how this terms can be expressed in different laguages and in the context of a legal document. Please return only complete and correct answers.	156	0.34
PDF AI-temp_v2_07.07.24	Can you identify the start and end date of the contract in the provided information? If you are not able to extract a specific date, please return all the information that could help me identify or calculate both the commencement date and the termination or end date. Consider how this terms can be expressed in different laguages and in the context of a legal document. Please return only complete and correct answers.	180	0.81
PDF AI-temp_v2_28.06.24	Can you identify the start and end date of the contract in the provided information? If you are not able to extract a specific date, please return all the information that could help me identify or calculate both the commencement date and the termination or end date. Consider how this terms can be expressed in different laguages and in the context of a legal document. Please return only complete and correct answers.	30	0.42
PDF AI-temp_v2_07.07.24	Ermitteln Sie anhand der bereitgestellten Informationen die Informationen, die auf das Ablaufdatum, das Kündigungsdatum, das Verfallsdatum, das Ende der Laufzeit und das letzte Gültigkeitsdatum hinweisen, und geben Sie diese zurück. Wenn die Informationen über das Enddatum des Vertrags nicht eindeutig sind, suchen Sie nach Vertragsbeginn und Laufzeit. Berücksichtigen Sie, dass diese Begriffe in verschiedenen Sprachen abgefasst sein können, übersetzen Sie zunächst den Originaltext und geben Sie dann die Antwort.	20	0.78
PDF AI-temp_v2_07.07.24	Hello! Please answer the following question. Does this contract mentions information regarding extension or renewal of this contract.	20	0.73
PDF AI-temp_v2_07.07.24	Hello. Please answer the following question. Does this contract has any mentioning’s of renewal or extension? If so please provide short and clear answer including details of renewal when there are any. If there are no information found, please give a short statement.	20	0.96
PDF AI-temp_v2_07.07.24	Hello. here is the document. Please provide me with information which covers explanation or terms of this contract renewal or extension terms. Please give a short statement only including requested information, nothing else.	20	0.43
PDF AI-temp_v2_07.07.24	Hi here is the document. Based on the following document when does this contract ends? What is the end or final day of this contract. Based on the given information in the text try to conclude the correct final date of the contact.	20	0.85
PDF AI-temp_v2_07.07.24	Hi! Please provide me all conditions of this contract renewal or extension.	20	0.59
PDF AI-temp_v2_07.07.24	How far in advance can this contract be terminated? Please provide all details about the contract termination, cancelation policy.	20	0.94
PDF AI-temp_v2_07.07.24	I have provided a contract document. Your task is to find and extract the start date of the contract. This date can be in any part of the contract. Return the date in the “MM/DD/YYYY” format, or state if the start date is not available.	20	0.36
PDF AI-temp_v2_07.07.24	I have uploaded a contract document. Please analyze the document and extract the start date of the contract. The start date is usually found in the introductory section or the recitals, but it may also be specified within the main body of the contract. Please provide the start date in the format “MM/DD/YYYY” or clearly state if the date is not found.	20	0.45
PDF AI-temp_v2_07.07.24	Ich habe hier ein Vertragsdokument. Bitte suche in diesem die genannte Kündigungsfrist	20	0.80
PDF AI-temp_v2_07.07.24	Identify and extract all potential contract numbers from the provided information. Return any sequences that could plausibly be contract numbers.	20	0.76
PDF AI-temp_v2_07.07.24	Is there any mention of an “end date” or “termination date” in the provided information? Consider various expressions of this concept within a legal contract, and account for different languages. Please identify the specified date or provide any reference that allows the determination or calculation of this date. Return your complete answer in German and in the following format:“Vertragsende: tt/mm/jjjj”.	20	0.56
PDF AI-temp_v2_07.07.24	Look for the number that appears more frequently. Could that be a contract number? List the most probable contract numbers that are found within the given information. If one or some of them has special symbols or letters, please include them.	20	0.88
PDF AI-temp_v2_07.07.24	Mit welchem Vorlauf kann dieses vertrag gekündigt werden? Bitte geben Sie alle Einzelheiten zur Vertragskündigung an.	20	0.89
PDF AI-temp_v2_07.07.24	Please check the provided document. Are there any information included which covers renewal or extension conditions of tis contract? If yes, please provide the renewal conditions as mentioned in this document.	20	0.62
PDF AI-temp_v2_05.07.24	Please identify the start and end date of the contract in the provided information. If you are not able to extract a specific date, please return all the information that could help me identify or calculate both the commencement date and the termination or end date. Consider how this terms can be expressed in different laguages and in the context of a legal document. Please return only complete and correct answers.	157	0.56
PDF AI-temp_v2_07.07.24	Please identify the start and end date of the contract in the provided information. If you are not able to extract a specific date, please return all the information that could help me identify or calculate both the commencement date and the termination or end date. Consider how this terms can be expressed in different laguages and in the context of a legal document. Please return only complete and correct answers.	60	0.63
PDF AI-temp_v2_28.06.24	Please identify the start and end date of the contract in the provided information. If you are not able to extract a specific date, please return all the information that could help me identify or calculate both the commencement date and the termination or end date. Consider how this terms can be expressed in different laguages and in the context of a legal document. Please return only complete and correct answers.	30	0.63
PDF AI-temp_v2_05.07.24	Please identify the start and end dates of the contract from the provided information. If exact dates cannot be extracted, return all relevant information that might help determine or calculate the commencement and termination dates. Consider how these terms might be expressed in various languages and within the context of a legal document. Ensure that the answers are complete and accurate.	155	0.51
PDF AI-temp_v2_28.06.24	Please identify the start and end dates of the contract from the provided information. If exact dates cannot be extracted, return all relevant information that might help determine or calculate the commencement and termination dates. Consider how these terms might be expressed in various languages and within the context of a legal document. Ensure that the answers are complete and accurate.	30	0.43
PDF AI-temp_v2_07.07.24	Please name both parties involved in the contract. Use the following format for your answer: “Contract party 1: XX, contract party 2: XX”. Just provide the names of the parties, no additional information.	19	0.84
PDF AI-temp_v2_07.07.24	Please provide me with all available conditions of this contract termination.	20	0.92
PDF AI-temp_v2_05.07.24	Please provide the full names of the two parties/companies involved in the contract. Please also provide the contract number given in the document. List all information below each other.	157	0.43
PDF AI-temp_v2_07.07.24	Please provide the full names of the two parties/companies involved in the contract. Please also provide the contract number given in the document. List all information below each other.	233	0.70
PDF AI-temp_v2_28.06.24	Please provide the full names of the two parties/companies involved in the contract. Please also provide the contract number given in the document. List all information below each other.	30	0.63
PDF AI-temp_v2_05.07.24	Please provide the start datum of the contract, and the end datum. If you can not find exact dates, please provide all related information.	157	0.39
PDF AI-temp_v2_07.07.24	Please provide the start datum of the contract, and the end datum. If you can not find exact dates, please provide all related information.	200	0.74
PDF AI-temp_v2_28.06.24	Please provide the start datum of the contract, and the end datum. If you can not find exact dates, please provide all related information.	30	0.31
PDF AI-temp_v2_07.07.24	Please search in the provided information and identify any numbers that could be contract numbers. Consider how “contract number” or similar terms might be expressed in different languages and common abbreviations used in contracts. Extract every number that could potentially be a contract or agreement number. For each identified number, return it in the format: ‘Contract Number: XXX’. If no such number can be identified, respond with ‘Not available in the provided information’. Return only complete and correct information, do not explain reasons or processes.	20	0.85
PDF AI-temp_v2_07.07.24	Suche ob es darin Angaben zur Verlängerung des Vertrags gibt. Antworte bitte auf Deutsch	20	0.92
PDF AI-temp_v2_07.07.24	The uploaded document is a contract. Please give me the start date of the contract. Keep in mind that i can be named differently.	20	0.58
PDF AI-temp_v2_07.07.24	The uploaded document is a contract. When does it start?	20	0.35
PDF AI-temp_v2_07.07.24	This is a section of a legal document. Can you identify and return the number used to reference the contract? Return only the correct contract number in the format: ‘Contract number: xxx’. If no contract number is found, return ‘Not Available in the provided information’.	20	0.78
PDF AI-temp_v2_07.07.24	Wann endet dieser Vertrag? Nenne bitte das konkrete Datum oder die Vertragslaufzeit. Gib deine Antwort auf Deutsch.	20	0.66
PDF AI-temp_v2_07.07.24	Welche Kündigungsfristen werden in diesem Vertrag genannt? Bitte nenne sie so ausführlich wie möglich in folgendem Format auf deutsch: “Kündigungsfristen: xxx” Wenn keine Kündigungsfristen genannt werden, schreibe bitte: “Keine Kündigungsfristen genannt.”	20	0.83
PDF AI-temp_v2_07.07.24	Welche Kündigungsfristen werden in diesem Vertrag genannt? Bitte nenne sie so ausführlich wie möglich in folgendem Format: “Kündigungsfristen: xxx”. Antworte bitte auf Deutsch.	20	0.89

Other considerations

On leftover info

In a test on 2024-07-10 09:00:08, the response cited two source files, 20230913 (2).pdf and 9.pdf. This should not be possible, since we upload one document at a time. Could this be leftover information from a previous upload?

Complete sources text

“Source 1 PDF: 20230913 (2).pdf Page: 10 Paragraph: 19 TERM AND TERMINATION 19.1 The Agreement shall come into force on the Commencement Date. The Agreement shall terminate on the earlier of: (a) end of Pre-Launch Phase (i.e. Launch Date) or (b) one (1) year after the Commencement Date. 19.2 Without prejudice to any of its other rights including rights to damages, either Party shall have the right to immediately terminate the Agreement by notice in writing to the other Party: - if the other Party commits any material or repeated breach of the Agreement and, if © ZIWIG CONFIDENTIAL 10…

Source 2 PDF: 9.pdf Page: 1 Paragraph: Ort, Datum Rechtsgültige Unterschrift und Stempel des Nutzers BECKMAN COULTER GMBH Zusatzvereinbarung zum Testmietvertrag Nr. 2-692, 24.08.2011 Seite 1 von 1…

Source 3 PDF: 20230913 (2).pdf Page: 12 Paragraph: 23 MISCELLANEOUS 23.1 Either Party may not assign or transfer any of its rights or obligations arising under the Agreement to a third party. 23.2 The Local Laboratory will be able to subcontract any part of its obligations under this Agreement to third party upon reception of prior written consent from Ziwig, per concerned third party. For avoidance of doubt, Logistic Partner is authorized as mentioned in Appendix 1. 23.3 The Agreement constitutes the entire agreement and understanding between the parties with respect to its subject matter and replaces and supersedes all prior agreements, arrangements, undertakings or statements regarding such subject matter, including the Term Sheet. 23.4 No variation of the Agreement shall be effective unless made in writing and signed by or on behalf of each of the parties. 23.5 If a provision of the Agreement is held to be illegal, invalid or unenforceable under any enactment or rule of law in any jurisdiction, such provision shall, to that extent, be deemed not to form part of the Agreement and the legality, validity and enforceability of the remainder of the Agreement shall not be affected. 23.6 No failure to exercise, and no delay in exercising, any right or remedy in connection with the Agreement shall operate as a waiver of that right or remedy…

Source 4 PDF: 20230913 (2).pdf Page: 11 Paragraph: . 19.3 This Agreement shall automatically terminate: - if the Operating Agreement is not entered into within a reasonable period after the Commencement Date; or - if the Operating Agreement with the Authorised Laboratory which details are specified in Appendix 1 terminates, and another Operating Agreement with another Authorised Laboratory is not entered into within a reasonable period of time. 19.4 In the event of termination: - the rights and obligations of each Party in relation to the other which have accrued prior to termination shall not be affected; - all rights granted to the Local Laboratory under the Agreement shall cease; - the Local Laboratory shall remove any reference to the Trademarks which may exist on premises used by the Local Laboratory, on all documents or marketing material; - the Local Laboratory shall without delay return to Ziwig or destroy all and any Confidential Information received during the performance of the Agreement and provide to Ziwiga written statement that it has done so….

Source 5 PDF: 20230913 (2).pdf Page: 12 Paragraph: 21 FORCE MAJEURE 21.1 Neither party shall be liable for any delay or failure to perform its obligations caused by any circumstances beyond its reasonable control, which could not reasonably be anticipated upon signature of the Agreement and whose effects cannot be avoided by appropriate measures. 21.2 In such an event the party unable to meet its obligations shall promptly notify the other in writing of the circumstances and the time for performance of the Agreement shall be automatically extended by a reasonable period. If the circumstances still exist 90 days after such notification, either party may terminate this Agreement with immediate effect on giving written notice to the other….

Source 6 PDF: 20230913 (2).pdf Page: 10 Paragraph: 16 CHANGE IN LAW 16.1 If a change in law occurs or is about to occur that significantly affects the conditions of performance of the Agreement by one Party, such Party may write to the other giving details of such change in law and its opinion of any necessary amendment to the terms of this Agreement, including financial terms. 16.2 As soon as practicable after receipt of such notice from either Party, the Parties shall discuss in good faith and endeavor to agree such amendments. If an agreement cannot be reached after a reasonable period of time, either Party may terminate the Agreement by notice in writing to the other. 17 LIABILITY OF THE PARTIES 17.1 The liability of the Parties arising out of or under this Agreement shall be respectively limited to €250,000 (two hundred and fifty thousand euros). This limit shall not apply to liability that cannot be limited under applicable law. 17.2 Ziwig shall not be liable to the Local Laboratory for any losses, damages, liabilities, claims, suits, settlements, whether or not resulting from third party claims, arising out of or resulting from Examinations or from actions or omissions of the Authorised Laboratory (unless arising out or resulting from the failure of Ziwig to comply with its obligations, representations or warranties relating to the Medical Device)….

Source 7 PDF: 20230913 (2).pdf Page: 8 Paragraph: 13 INSURANCE 13.1 Each Party shall provide and maintain, at its own expense, with an insurance company of repute, a professional indemnity insurance insuring the operations performed during the term of the Contract. Such insurance, however, shall not relieve or release either Party, or limit its liability to any and all of its obligations. 13.2 Proof of insurance evidencing limits and exclusions of liability shall be provided on demand. 14 INTELLECTUAL PROPERTY 14.1 All Intellectual Property Rights and know-how in and to the Medical Device, the Collection Kits, © ZIWIG CONFIDENTIAL 8…

Source 8 PDF: 20230913 (2).pdf Page: 9 Paragraph: 15 CONFIDENTIALITY 15.1 Each Party undertakes with the other Party that it will keep confidential and will not at any time (whether before or after the termination of the Agreement) disclose any information (“Confidential Information”) including but not limited to information in relation to the other Party’s business, clients, to the Medical Devices, the Sequencing, the Bio-informatics Analysis, the Specifications, supplied or made available to them by the other Party or in respect of which they may come into the possession of or have knowledge and that they will not use such information other than for the purposes permitted by the Agreement. 15.2 Each Party undertakes with the other that it will take those precautions it employs to protect their own confidential information and technology. 15.3 These provisions shall not apply to any information which: 16.3.1 at the time the disclosure is public knowledge or subsequently becomes public knowledge through no act or default on the part of the Receiving Party or on the part of the persons who have received it from them; 16.3.2 is lawfully received by the Receiving Party from a third party without breach of a confidential relationship with the Disclosing Party; or 16.3.3 was already known by the Receiving Party at the time of its disclosure; 16.3…”
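One way to check for this systematically is to list the distinct file names cited in each sources text and flag any response that cites more than one. The sketch below uses base R only. `cited_files` is a hypothetical helper (not part of the original analysis), and it assumes that every source is introduced as "PDF: <file>.pdf", as in the text above.

```r
# Hypothetical helper: distinct PDF file names cited in one sources string.
# Assumes each source is introduced as "Source <n> PDF: <file>.pdf".
cited_files <- function(sources_text) {
  hits <- regmatches(
    sources_text,
    gregexpr("(?<=PDF: ).*?\\.pdf", sources_text, perl = TRUE)
  )[[1]]
  unique(hits)
}

# Shortened toy version of the sources text shown above
example_sources <- paste(
  "Source 1 PDF: 20230913 (2).pdf Page: 10 Paragraph: 19 TERM AND TERMINATION",
  "Source 2 PDF: 9.pdf Page: 1 Paragraph: Ort, Datum",
  "Source 3 PDF: 20230913 (2).pdf Page: 12 Paragraph: 23 MISCELLANEOUS"
)

files <- cited_files(example_sources)
files

# TRUE flags a response that mixes more than one document
mixes_documents <- length(files) > 1
mixes_documents
```

Applied to the sources column of the full log, a check like this would show whether the 2024-07-10 case is isolated or part of a pattern.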

On Hallucination

Top 3 hallucinations so far 🏆 :

1️⃣ “I’m sorry for any confusion, but as an AI text model developed by OpenAI, I am unable to access real-time data or specific personal information.”

2️⃣ “Lab”

3️⃣ “To provide a comprehensive answer, I will break down the given information into relevant sections and address each point separately.”

Best language swap 🏆 :

1️⃣ “In dem angegebenen Text wird keine spezifische “Enddatum” oder “Kündigungsdatum” erwähnt. Ohne weitere Informationen oder den vollständigen Testmietvertrag Nr. 2-692 zur Verfügung zu haben, kann das genaue Vertragsende nicht bestimmt werden. Daher lautet die Antwort:

Vertragsende: Não especificado.”

Interfaces

NON-QUANTIZED MODEL TESTS ON DIFFERENT INTERFACES
|| Detection of possible effects of the access point on model performance

To see the effects of quantization on performance, we first aim to exclude other variables that could be related to the access points we use to reach the models. If the same non-quantized model shows a certain tendency in any of these interfaces, then we cannot attribute all variations in performance to quantization. To do this, we tested the same prompts and inputs on the same non-quantized model and version through two access points: Chat LMSYS and LLM_testing_app.

s_interf = s_all_contracts_raw %>% 
  filter(model %in% c("LLM TESTING APP-qwen1.5-7b-chat-2024.07.22","LMSYS-qwen1.5-7b-chat-2024.07.22")) %>% 
  mutate(prompt_n = sprintf("%02d", as.numeric(factor(prompt))),
         interface_model_version = as.factor(model))

s_interf_summary = s_interf %>% 
  group_by(interface_model_version, prompt) %>% 
  summarise(prompt_rating = first(prompt_rating),
            field_prompt_n = paste0("F", first(field), "-P", first(prompt_n)),
            .groups = 'drop') %>%
  arrange(field_prompt_n, prompt, interface_model_version)

Here we can see a summary of the ratings of each prompt per access point:


Non Quantized Model Performance on LMSYS and LLM_testing_app

Prompt Rating Relation Between Access Points

To see more clearly how the ratings of the same prompt relate across the two interfaces, we compute their ratio (LLM testing app rating / LMSYS rating):

s_interf_summary_2 = s_interf_summary %>%
  pivot_wider(names_from = interface_model_version, values_from = prompt_rating) %>% 
  # NOTE: despite its name, `difference` holds a ratio (LLM testing app / LMSYS); values above 1 favour the LLM testing app
  mutate(difference = `LLM TESTING APP-qwen1.5-7b-chat-2024.07.22` / `LMSYS-qwen1.5-7b-chat-2024.07.22`) %>% 
  select(prompt, field_prompt_n, difference) %>% 
  arrange(field_prompt_n)
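As a small illustration of how to read such a ratio, the base R sketch below uses made-up ratings for two hypothetical prompts (p1, p2) and two hypothetical interface labels (app, lmsys). It computes both the ratio and the plain difference so that the two can be compared; none of these values come from the logs.

```r
# Made-up ratings: two prompts, each rated on two interfaces
ratings <- data.frame(
  prompt        = c("p1", "p1", "p2", "p2"),
  interface     = c("app", "lmsys", "app", "lmsys"),
  prompt_rating = c(0.80, 0.40, 0.30, 0.60)
)

# One row per prompt, one rating column per interface
wide <- reshape(ratings, idvar = "prompt", timevar = "interface",
                direction = "wide")

# Ratio (as in the report) and plain difference, side by side
wide$ratio <- wide$prompt_rating.app / wide$prompt_rating.lmsys
wide$diff  <- wide$prompt_rating.app - wide$prompt_rating.lmsys
wide[, c("prompt", "ratio", "diff")]
```

A ratio above 1 means the first interface scored higher. Ratios are not symmetric around 1, and they become infinite when the rating in the denominator is 0, so the subtraction may be easier to compare across prompts.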

Prompt and Case Rating means per Interface

Input Types

INPUT TYPES IMPACT ON PERFORMANCE
|| Detection of possible effects of the input types on model performance

To see the effects of input types, we tested 6 prompts for each field on the same interface, model and version, and compared their prompt ratings.

s_input = s_all_contracts_raw %>% 
  filter(model %in% c("LLM TESTING APP_Wizard70b_2024.07.29", "PDF AI-temp_v2_07.07.24")) %>% 
  filter(time_stamp_CEST > as.Date("2024-07-27")) %>% 
  filter(prompt!="When is the last effective date of this contract? When does this contract ends? Please only provide the information which answers this question, nothing else.") %>% 
  mutate(prompt_n = sprintf("%02d", as.numeric(factor(prompt))),
         interface_model_version = as.factor(model))

s_input_summary = s_input %>% 
  group_by(prompt, input) %>% 
  summarise(prompt_rating = first(prompt_rating),
            field_prompt_n = paste0("F", first(field), "-P", first(prompt_n)),
            .groups = 'drop')%>%
  arrange(field_prompt_n, prompt, input) %>% 
  ungroup()

Here we can see a summary of the ratings of each prompt per input type:


Fields and Inputs

Prompt Rating per Field-Prompt and Input Type

Prompt Rating Relation Between Input Types

To see more clearly how the ratings of the same prompt relate across input types, we compute the ratio of txt input ratings to pdf input ratings:

s_input_summary_2 = s_input_summary %>%
  pivot_wider(names_from = input, values_from = prompt_rating) %>% 
  mutate(relation = `txt one` / `pdf one`) %>% 
  select(prompt, field_prompt_n, relation) %>% 
  arrange(field_prompt_n)

Prompt and Case Rating means per Input Type

PDF and TXT Input Files Paired T-Test

To analyze these results one step further, we ran a paired t-test:

t_test_result = t.test(prompt_rating ~ input, data = s_input_summary, paired = TRUE)

  Paired t-test

data:  prompt_rating by input
t = -3.0197, df = 35, p-value = 0.004701
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -0.15700895 -0.03076883
sample estimates:
mean difference 
    -0.09388889

CONCLUSIONS

The low p-value (0.0047) indicates a statistically significant difference between the ratings of the two input types at the 5% level, based on 36 prompt pairs (df = 35).
The 95% confidence interval (-0.157 to -0.031) does not include 0, so the observed difference is unlikely to be due to random chance alone.
The mean paired difference is about -0.094 rating points. Assuming R's default alphabetical ordering of the input levels (pdf one before txt one), this means that txt inputs were rated roughly 0.09 points higher on average than pdf inputs for the same prompt.
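The formula call above relies on the rows of s_input_summary being ordered so that the i-th pdf rating and the i-th txt rating belong to the same prompt. As far as we know, recent R releases also reject `paired = TRUE` in the formula method. A minimal sketch of a more explicit alternative, in base R with made-up ratings and hypothetical column names `pdf_one` and `txt_one`, puts each pair on one row first:

```r
# Made-up ratings: one row per prompt, one column per input type
paired <- data.frame(
  prompt  = c("p1", "p2", "p3", "p4", "p5"),
  pdf_one = c(0.60, 0.45, 0.80, 0.35, 0.70),
  txt_one = c(0.70, 0.50, 0.85, 0.50, 0.75)
)

# Keep only prompts rated under both input types so the pairs line up
paired <- paired[complete.cases(paired), ]

# Paired test on the two aligned vectors: differences are pdf_one - txt_one
res <- t.test(paired$pdf_one, paired$txt_one, paired = TRUE)

res$estimate   # mean of the paired differences
res$conf.int   # 95 percent confidence interval for that mean
res$p.value
```

With the pairs made explicit, the sign of the estimate is unambiguous: a negative value means the second vector (here txt) was rated higher.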

Quantization

QUANTIZATION IMPACT ON PERFORMANCE
|| Detection of possible effects of the use of quantized models on performance

To see the effects of quantization, we tested 6 prompts for each field on the same interface, model and version, and compared their prompt ratings. The models in each case are:

Qwen2-7B-Instruct (original, nonquantized)
Qwen2-7B-Instruct-GPTQ-Int4 (4 bit quantization)
Qwen2-7B-Instruct-GPTQ-Int8 (8 bit quantization)

First we test on the LLM testing app, and then on PDF AI.

s_quant = s_all_contracts_raw %>% 
  filter(str_detect(model, "NQ|Int4|Int8")) %>%
  filter(!model %in% "JETSON-PDF AI-Qwen2-7b-Instruct-GPTQ-Int4") %>% 
  filter(time_stamp_CEST<as.Date("20.08.2024", format = "%d.%m.%Y")) %>% 
  mutate(prompt_n = sprintf("%02d", as.numeric(factor(prompt))),
         interface_model_version = as.factor(model))

s_quant_llm_summary = s_quant %>% 
  filter(str_detect(model, "LLM")) %>% 
  group_by(prompt, model) %>% 
  summarise(prompt_rating = first(prompt_rating),
            input = first(input),
            field_prompt_n = paste0("F", first(field), "-P", first(prompt_n)),
            field = first(field),
            .groups = 'drop')%>%
  arrange(field_prompt_n, prompt, model) %>% 
  ungroup()

s_quant_pdf_summary = s_quant %>% 
  filter(str_detect(model, "PDF")) %>% 
  group_by(prompt, model) %>% 
  summarise(prompt_rating = first(prompt_rating),
            input = first(input),
            field_prompt_n = paste0("F", first(field), "-P", first(prompt_n)),
            field = first(field),
            .groups = 'drop')%>%
  arrange(field_prompt_n, prompt, model) %>% 
  ungroup()

Here we can see a summary of the ratings of each prompt per quantized-nonQ model on the LLM testing app:


And here we can see a summary of the ratings of each prompt per quantized-nonQ model on PDF AI:


Prompt Rating per Field-Prompt and Type of Model in the LLM testing app (Non-Q, GPTQ-Int4, GPTQ-Int8)

Prompt Rating per Field-Prompt and Type of Model in PDF AI (Non-Q, GPTQ-Int4, GPTQ-Int8)

Highest Prompt Ratings per field for the LLM testing app

Highest Prompt Ratings per field for PDF AI

Prompt Rating Relation Between Model Types (Non-Q, GPTQ-Int4, GPTQ-Int8)

To see more clearly how the ratings of the same prompt relate across the different model types (Non-Q, GPTQ-Int4, GPTQ-Int8):

quant_llm_performance_2 <- ggplot(s_quant_llm_summary, 
                            aes(x = prompt_rating, y = field_prompt_n, 
                                group = field_prompt_n, colour = as.character(model), shape = as.character(model))) +
  geom_point(size = 1) +  # One point per model for each field-prompt
  geom_line(aes(group = field_prompt_n), linewidth = 1, alpha = 0.6) +  # Line joining the ratings of the same field-prompt (linewidth requires ggplot2 >= 3.4.0)
  labs(x = "Prompt Rating", y = "Field-Prompt", colour = "Model (Q-NQ)", shape = "Model (Q-NQ)") +
  scale_colour_brewer(palette = "Set2") + 
  theme_minimal()+
  theme(legend.position = "bottom")+
  scale_x_continuous(limits = c(-0.1, 1.1), n.breaks = 10)

On LLM testing app

quant_pdf_performance_2 <- ggplot(s_quant_pdf_summary, 
                            aes(x = prompt_rating, y = field_prompt_n, 
                                group = field_prompt_n, colour = as.character(model), shape = as.character(model))) +
  geom_point(size = 1) +  # One point per model for each field-prompt
  geom_line(aes(group = field_prompt_n), linewidth = 1, alpha = 0.6) +  # Line joining the ratings of the same field-prompt (linewidth requires ggplot2 >= 3.4.0)
  labs(x = "Prompt Rating", y = "Field-Prompt", colour = "Model (Q-NQ)", shape = "Model (Q-NQ)") +
  scale_colour_brewer(palette = "Set2") + 
  theme_minimal()+
  theme(legend.position = "bottom")+
  scale_x_continuous(limits = c(-0.1, 1.1), n.breaks = 10)

On PDF AI

Prompt and Case Rating means per Model Type and Access Point

Quantization Impact on Model Performance

To analyze these results one step further, we fit a linear regression:

# Convert the predictors to factors so that lm() treats them as categorical
s_quant = s_quant %>% 
  mutate(model = as.factor(model),
         case = as.factor(case),
         field = as.factor(field))

quant_regression <- lm(prompt_rating ~ model + case + field, data = s_quant)

tidy_quant_regression <- tidy(quant_regression)

Summary of Linear Regression Model
term	estimate	std.error	statistic	p.value
(Intercept)	0.8108356	0.0115986	69.9080340	0.0000000
modelLLM TESTING APP-qwen2-7B-Instruct-GPTQ-Int8	0.0373813	0.0124466	3.0033316	0.0026912
modelLLM TESTING APP-qwen2-7B-Instruct-NQ	-0.0254064	0.0125403	-2.0259733	0.0428508
modelPDF AI- qwen2-7b-Instruct-GPTQ-Int4	-0.0125685	0.0108520	-1.1581730	0.2468803
modelPDF AI-qwen2-7b-Instruct-GPTQ-Int8	-0.0205166	0.0108789	-1.8858960	0.0593999
modelPDF AI-qwen2-7b-Instruct-NQ	-0.0065062	0.0108481	-0.5997545	0.5487125
caseBecker - Beckman	-0.0008213	0.0080770	-0.1016863	0.9190121
caseBecker - Siwig	0.0000797	0.0080725	0.0098734	0.9921229
caseBecker - SWM	0.0010136	0.0081199	0.1248283	0.9006674
field02	-0.1273303	0.0102041	-12.4783078	0.0000000
field04	0.0189324	0.0102042	1.8553506	0.0636386
field05	0.0646909	0.0102329	6.3218426	0.0000000
field06	-0.0415415	0.0102989	-4.0335996	0.0000562
field07	-0.1271262	0.0101898	-12.4757648	0.0000000

levels(s_quant$model)

[1] "LLM TESTING APP- qwen2-7B-Instruct-GPTQ-Int4"
[2] "LLM TESTING APP-qwen2-7B-Instruct-GPTQ-Int8" 
[3] "LLM TESTING APP-qwen2-7B-Instruct-NQ"        
[4] "PDF AI- qwen2-7b-Instruct-GPTQ-Int4"         
[5] "PDF AI-qwen2-7b-Instruct-GPTQ-Int8"          
[6] "PDF AI-qwen2-7b-Instruct-NQ"

[1] "Adjusted R-squared: 0.163353125023125"

          GVIF Df GVIF^(1/(2*Df))
model 1.001162  5        1.000116
case  1.000471  3        1.000079
field 1.001160  5        1.000116

CONCLUSIONS:

With LLM TESTING APP- qwen2-7B-Instruct-GPTQ-Int4 as the baseline (the first factor level), the coefficients in the table above show the estimated effect of each other model/interface combination on prompt rating, holding case and field constant.

modelLLM TESTING APP-qwen2-7B-Instruct-GPTQ-Int8:
- Estimate: 0.0373813
- Impact: This model type increases the prompt_rating by approximately 0.0374 units on average, compared to the baseline.
- p-value: 0.0026912 → This effect is statistically significant.
modelLLM TESTING APP-qwen2-7B-Instruct-NQ:
- Estimate: -0.0254064
- Impact: This model type decreases the prompt_rating by approximately 0.0254 units on average, compared to the baseline.
- p-value: 0.0428508 → This effect is statistically significant at the 5% level, although only marginally.
modelPDF AI- qwen2-7b-Instruct-GPTQ-Int4:
- Estimate: -0.0125685
- Impact: This model type decreases the prompt_rating by approximately 0.0126 units on average, compared to the baseline.
- p-value: 0.2468803 → This effect is not statistically significant.
modelPDF AI-qwen2-7b-Instruct-GPTQ-Int8:
- Estimate: -0.0205166
- Impact: This model type decreases the prompt_rating by approximately 0.0205 units on average, compared to the baseline.
- p-value: 0.0593999 → This effect is not statistically significant at the 5% level, although it is close to the threshold.
modelPDF AI-qwen2-7b-Instruct-NQ:
- Estimate: -0.0065062
- Impact: This model type decreases the prompt_rating by approximately 0.0065 units on average, compared to the baseline.
- p-value: 0.5487125 → This effect is not statistically significant.
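Because lm() uses the first factor level as the baseline, every coefficient above is measured against the Int4 model on the LLM testing app. To read the effect of quantization directly, one option is to make a non-quantized model the reference level before fitting. The sketch below shows the idea in base R on made-up data with shortened, hypothetical level names (Int4, Int8, NQ); with the real data, `ref` would have to be one of the full level names listed above.

```r
# Made-up ratings for three hypothetical model types, four tests each
toy <- data.frame(
  model = factor(rep(c("Int4", "Int8", "NQ"), each = 4)),
  prompt_rating = c(0.70, 0.72, 0.68, 0.70,
                    0.75, 0.77, 0.73, 0.75,
                    0.80, 0.82, 0.78, 0.80)
)

# Default baseline is the first level in alphabetical order
default_baseline <- levels(toy$model)[1]
default_baseline

# Make the non-quantized model the reference level instead
toy$model <- relevel(toy$model, ref = "NQ")

# Coefficients now read as "quantized model minus non-quantized model"
fit <- lm(prompt_rating ~ model, data = toy)
coef(fit)
```

Releveling does not change the fitted values or the adjusted R-squared. It only changes which comparisons the coefficient table reports.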

New Parsing

NEW PARSING SYSTEM
|| Effects of the new parsing system on prompt ratings

To analyze the effects of the new parsing system on the prompt ratings, we take a look at the highest rated prompts per field and input type:

new_parsing = s_all_contracts_raw %>% 
  filter(str_detect(model, "Int4")) %>% 
  filter(time_stamp_CEST>as.Date("20.08.2024", format = "%d.%m.%Y")) %>% 
  # Prompts tested with the new parsing system carry the prefix "np_"
  mutate(new_parsing = ifelse(startsWith(prompt,"np_"),1,0),
         prompt = sub("^np_", "", prompt),  # strip only the leading prefix
         prompt_n = sprintf("%02d", as.numeric(factor(prompt))),
         field_prompt_n = paste0("F", field, "-P", prompt_n))


new_parsing_summary = new_parsing %>%
  group_by(field, input, new_parsing) %>% 
  # highest rating per field, input type and parsing system;
  # the grouping columns are kept automatically
  summarise(prompt_rating = max(prompt_rating),
            .groups = 'drop')%>%
  arrange(field, input, new_parsing)

Prompt Rating Relation Between Old and New parsing system

To observe more clearly how the ratings relate between tests of the same prompt under the two parsing systems (new parsing / old parsing), we compute the ratio of the new-parsing rating to the old-parsing rating:

new_parsing_txt = new_parsing %>%
  filter(input == "txt one")%>%
  group_by(field_prompt_n,input,new_parsing) %>% 
  summarise(prompt_rating = first(prompt_rating)) %>% 
  ungroup() %>% 
  pivot_wider(names_from = new_parsing, values_from = prompt_rating) %>% 
  mutate(relation = `1` / `0`) %>% 
  #select(prompt, field_prompt_n, relation) %>% 
  arrange(field_prompt_n)

new_parsing_means = new_parsing_summary %>%
  group_by(input, new_parsing, field) %>% 
  summarise(prompt_rating = round(mean(prompt_rating),2),
            .groups = 'drop')%>%
  arrange(field, input, new_parsing) %>% 
  ungroup()
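As a small illustration of what the `relation` column expresses, the sketch below reshapes invented ratings to one row per prompt and divides the new-parsing rating by the old-parsing one. The ratings are hypothetical; only the column names follow the code above.

```r
# Hypothetical ratings: each prompt tested once with old (0) and once with new (1) parsing.
runs <- data.frame(
  field_prompt_n = c("F01-P01", "F01-P01", "F02-P03", "F02-P03"),
  new_parsing    = c(0, 1, 0, 1),
  prompt_rating  = c(0.80, 1.00, 0.90, 0.45),
  stringsAsFactors = FALSE
)

# One row per prompt, with columns prompt_rating.0 (old) and prompt_rating.1 (new).
wide <- reshape(runs, idvar = "field_prompt_n", timevar = "new_parsing",
                direction = "wide")

# relation > 1: the new parsing was rated higher; relation < 1: it was rated lower.
wide$relation <- wide$prompt_rating.1 / wide$prompt_rating.0
print(wide)
```

Note that a ratio is undefined or infinite when the old-parsing rating is 0, so such prompts would need separate handling.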

Prompt Ratings Means per Field, Input and Parsing System:

Prompt Ratings Means per Input and Parsing System:

CURRENT

Jetson + PDF AI + Qwens

Qwen models performance
|| Prompt ratings per model and field

To analyze the performance of the Qwen models, we take a look at the prompt ratings per model and field:

#I FILTER THE JETSON-PDF AI-QWEN OBSERVATIONS

jetson_pdf_qwen = s_all_contracts_raw %>% 
  filter(str_detect(model, "JETSON-PDF AI-Qwen")) %>%
  filter(!str_detect(model, "B")) %>% 
  mutate(model_field = paste0(gsub("JETSON-PDF AI-", "", model),"-",field)) %>% 
  arrange(model,field)

jetson_pdf_qwen_summary = jetson_pdf_qwen %>%
  group_by(field, model) %>% 
  # mean and max rating per field and model; elapsed_time is kept per test,
  # so the result has one row per observation rather than one per group
  summarise(mean_prompt_rating = round(mean(prompt_rating), digits=2),
            model_field = paste0(gsub("JETSON-PDF AI-", "", model),"-",field),
            max_prompt_rating = round(max(prompt_rating), digits=2),
            elapsed_time = elapsed_time,
            .groups = 'drop') %>%
  arrange(model,field)

Boxplot Element References

Box (Colored Rectangle): Each box represents the interquartile range (IQR) of the data, covering the middle 50% of observations for each “Model and Field” category. The lower boundary of the box marks the first quartile (Q1), or the 25th percentile. The upper boundary of the box marks the third quartile (Q3), or the 75th percentile.

Line Inside the Box (Median): The horizontal line within each box represents the median (or 50th percentile) of the prompt ratings for that particular model-field combination.

Whiskers (Vertical Lines Extending from the Box): The whiskers typically extend to the smallest and largest values within 1.5 times the IQR below Q1 and above Q3, respectively. This range is intended to capture the majority of typical values, but not outliers.

Points Outside the Whiskers (Outliers): The black dots outside the whiskers represent outliers. These are data points that fall more than 1.5 times the IQR beyond the quartiles. Outliers indicate prompt ratings that are unusually high or low compared to the rest of the data for that model-field combination.

Black Dots Inside the Box (Mean Points): The additional black dots inside each box correspond to the mean prompt rating for each “Model and Field.” This point helps you see how the mean compares to the median. If the mean is close to the median, the distribution is roughly symmetric; if the distribution is skewed, the mean will deviate from the median.
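As a minimal sketch of the elements described above, the quantities behind a single box can be computed in base R. The helper name `box_summary` and the ratings are invented for illustration, and the quartiles use R's default `quantile()` method, which is an assumption about how the plots were drawn.

```r
# Hypothetical prompt ratings for one model-field combination.
ratings <- c(0.80, 0.85, 0.90, 0.92, 0.95, 0.97, 1.00, 0.30)

box_summary <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.5, 0.75)))   # Q1, median, Q3
  iqr <- q[3] - q[1]                              # height of the box
  lower_fence <- q[1] - 1.5 * iqr
  upper_fence <- q[3] + 1.5 * iqr
  inside <- x[x >= lower_fence & x <= upper_fence]
  list(
    q1 = q[1], median = q[2], q3 = q[3],
    mean = mean(x),                               # the dot drawn inside the box
    whiskers = range(inside),                     # most extreme values within the fences
    outliers = x[x < lower_fence | x > upper_fence]
  )
}

b <- box_summary(ratings)
str(b)
```

In this invented sample the single low rating is flagged as an outlier and pulls the mean below the median, which is the kind of skew the mean dots are meant to reveal.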

To see with more clarity the best performing prompts for each field and model:

As we can see, most of the observations are grouped between 0.8 and 1.

Qwen models speed
|| Elapsed Time per Model and Field

Considering the importance of the time factor, we analyze the time elapsed per test for each model and field. It is important to note that we have some gaps in our data, since we started measuring elapsed time halfway through our tests of the Qwen models with Jetson + PDF AI.

We can see some clear differences between the 7b model and the larger models, 32b and 72b. The color distinction per field also allows us to evaluate the impact of this variable on elapsed time.

To get a clearer view, we can examine these values directly grouped by field:

The relation between the requested answer and the elapsed time is quite clear. To understand this, we can take a look at the questions/answers for each field:

10 Are there any attachments to the contract? - This is a yes or no answer, pretty straightforward.

09 Delivery conditions. - On this one we are asking for a complete list, so answers are long and must be thorough.

08 Payment conditions. - Similar to 09.

07 Is there a renewal clause (is the contract automatically renewed, for example)? - Mostly “no” answers; documents where the answer is yes may take longer to analyse and return a complete response.

06 How far in advance can the contract be terminated? Notice periods - In this case conditions also must be mentioned.

05 When does the contract end? - The answer tends to be short, but it can include some clauses and may require calculation.

04 When is the start of the contract? - Same as 05.

03 What is the subject matter of the contract (about what)? - This one is quite summarised in the best prompts, but it takes more “creative capacities” (?) and abstraction.

02 Contract Number - Short and direct.

01 Who are the contracting parties? - Short and direct.

To double check this, we can look at the answers’ length per field as well:

jetson_pdf_qwen$output_length <- nchar(jetson_pdf_qwen$output)

As we can see, elapsed time and output length seem to follow a similar pattern.
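One way to put a number on this visual impression is a rank correlation between mean output length and mean elapsed time per field. The sketch below uses invented records and a hypothetical helper, `length_time_correlation`; it is not computed from the logs.

```r
# Hypothetical test records; texts and times are invented for illustration.
tests <- data.frame(
  field = c("10", "10", "01", "01", "09", "09"),
  elapsed_time = c(2.0, 2.4, 3.8, 4.1, 19.0, 21.5),
  output = c("Yes", "No",
             "Company A and Company B", "Company A, Company C",
             "Delivery is made ex works within 14 days of the order, in weekly lots.",
             "Goods are delivered to the buyer's warehouse; partial deliveries are allowed."),
  stringsAsFactors = FALSE
)

length_time_correlation <- function(df) {
  df$output_length <- nchar(df$output)
  # mean elapsed time and mean output length per field
  per_field <- aggregate(cbind(elapsed_time, output_length) ~ field,
                         data = df, FUN = mean)
  # Spearman's rho only asks whether both variables rank the fields in the same order
  cor(per_field$elapsed_time, per_field$output_length, method = "spearman")
}

rho <- length_time_correlation(tests)
print(rho)
```

A value close to 1 would support the impression that fields with longer answers also take longer; with only ten fields, such a coefficient should be read as descriptive rather than as a test.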

We can also take a look at the input file sizes, to check whether there is a relation between them and the elapsed time. For that we have the following figures showing Elapsed Time per Field for each input file:

Considering that the input file sizes are:

- Becker - Abbott (X4) - 5.txt: 72KB
- Becker - Abbott (X4) - 8.txt: 4KB
- Becker - Beckman - 9.txt: 2KB
- Becker - Siwig - 20230913.txt: 36KB
- Becker - SWM - swm.txt: 96KB

There does not seem to be a clear relation between input file size and elapsed time. Other qualities of the input files may weigh more and remain to be examined.
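As a starting point for looking at qualities other than size on disk, a few simple descriptors of a text input can be collected in base R. The helper `describe_text_file` and the choice of descriptors (lines, words, characters) are assumptions made for illustration, not measures used in the tests.

```r
# Hypothetical helper: basic descriptors of a plain-text input file.
describe_text_file <- function(path) {
  lines <- readLines(path, warn = FALSE)
  words <- unlist(strsplit(lines, "[[:space:]]+"))
  data.frame(
    file    = basename(path),
    bytes   = file.info(path)$size,   # size on disk
    n_lines = length(lines),
    n_words = sum(nzchar(words)),     # empty strings from blank lines are not counted
    n_chars = sum(nchar(lines)),      # characters excluding line breaks
    stringsAsFactors = FALSE
  )
}

# Usage with a throwaway file instead of the real contracts.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Payment within 30 days.", "", "Delivery ex works."), tmp)
qualities <- describe_text_file(tmp)
print(qualities)
```

Applied to each input file with `lapply()` and bound with `rbind()`, this would give one row per file that can be compared against the elapsed times.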

Batch Tests

JETSON + PDF AI + Qwen models
|| Effects on time and performance of Individual vs Batch runs

Little data is available for analysing the effects on time of batch testing versus individual runs, since there is little overlap due to our late logging of this variable, but the results start to show a considerable decrease in time:

#I FILTER THE JETSON-PDF AI-QWEN OBSERVATIONS

batch_individual = s_all_contracts_raw %>% 
  filter(str_detect(model, "JETSON-PDF AI-Qwen"),
         !str_detect(model, "GPTQ")) %>% 
  mutate(batch = as.factor(ifelse(grepl("B", model), 1, 0)),
         model = gsub("B-", "", model),
         model_field = paste0(gsub("JETSON-PDF AI-", "", model),"-",field)) %>% 
  arrange(model_field)
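To quantify such a decrease, the mean elapsed time of batch runs can be divided by that of individual runs for each model-field combination. The following is a minimal sketch with invented times and labels; only the column names follow the code above.

```r
# Hypothetical elapsed times (seconds) for individual (0) and batch (1) runs.
runs <- data.frame(
  model_field  = rep(c("14b-02", "32b-02"), each = 4),
  batch        = rep(c(0, 0, 1, 1), times = 2),
  elapsed_time = c(20, 24, 10, 12,
                   40, 44, 30, 33),
  stringsAsFactors = FALSE
)

# Matrix of mean elapsed time: one row per model_field, columns "0" and "1".
mean_time <- tapply(runs$elapsed_time, list(runs$model_field, runs$batch), mean)

# Ratio below 1 means the batch runs were faster on average.
batch_ratio <- mean_time[, "1"] / mean_time[, "0"]
print(round(batch_ratio, 2))
```

Combinations that were only run in one mode would produce `NA` in the matrix, which makes the limited overlap visible instead of hiding it.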

For a closer look at the overlapping data:

In terms of Performance, measured by prompt ratings, we can see:

#I FILTER THE JETSON-PDF AI-QWEN OBSERVATIONS
batch_individual = batch_individual %>%
  filter (!str_detect(model,"-7b-|72b"),
          source_n == 8)

batch_perf = batch_individual %>% 
  filter(batch == 1)%>%
  mutate(prompt_clean = str_trim(str_to_lower(prompt)))
         
individual_perf = batch_individual %>% 
  filter(batch == 0)%>%
  mutate(prompt_clean = str_trim(str_to_lower(prompt)))%>%
  semi_join(batch_perf, by = "prompt_clean")

# Combine the filtered individual_perf with batch_perf
batch_individual_perf <- rbind(batch_perf, individual_perf) %>% 
                 mutate(model = gsub("B-", "", model),
                        model_field = paste0(gsub("JETSON-PDF AI-", "", model),"-",field),
                        model_field_batch = paste0(model_field,"-",batch)) %>% 
                 arrange(field, model, batch)

summary_b_i_p = batch_individual_perf %>% 
  group_by(model_field_batch) %>% 
  summarise(model_field = first(model_field),
            prompt_rating = first(prompt_rating),
            batch = first(batch),
            model = first(model)) %>% 
  arrange(model_field)

For fields with a large decrease in ratings, we take a look at the sources, to check if the answers are actually there:

Field 02:
- 14b-4sources → 1/5 incorrect answers, of those 0/1 contained the answers in the sources.
- 14b-8sources → 1/5 incorrect answers, of those 1/1 contained the answers in the sources.
- 32b-4sources → 2/5 incorrect answers, of those 1/2 contained the answers in the sources.
- 32b-8sources → 1/5 incorrect answers, of those 1/1 contained the answers in the sources.
Field 04:
- 14b-4sources → 1/5 incorrect answers, of those 0/1 contained the answers in the sources.
- 14b-8sources → 1/5 incorrect answers, of those 1/1 contained the answers in the sources.
- 32b-4sources → 1/5 incorrect answers, of those 0/1 contained the answers in the sources.
- 32b-8sources → 0/5 incorrect answers, of those 0/0 contained the answers in the sources.
Field 08:
- 14b-4sources → 3/5 incorrect answers, of those 0/3 contained the answers in the sources.
- 14b-8sources → 1/5 incorrect answers, of those 1/1 contained the answers in the sources.
- 32b-4sources → 2/5 incorrect answers, of those 1/2 contained the answers in the sources.
- 32b-8sources → 1/5 incorrect answers, of those 1/1 contained the answers in the sources.

Summarising:
- For 14b - 4sources, 0/5 incorrect answers contained the needed information in the sources.
- For 14b - 8sources, 3/3 incorrect answers contained the needed information in the sources.
- For 32b - 4sources, 2/5 incorrect answers contained the needed information in the sources.
- For 32b - 8sources, 2/2 incorrect answers contained the needed information in the sources.

Some conclusions: For tests with 4 sources, the issue seems to be that the sources are insufficient or not good enough, considering that the needed information is mostly not in them. For tests with 8 sources, the number of incorrect answers (in our sample) drops, and the problem is not a lack of information but the interpretation of the content.
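The per-field counts listed above can be re-tallied per model and source number with a few lines of base R. The counts are copied from the lists for fields 02, 04 and 08; the data frame layout and column names are only one possible way to hold them.

```r
# Incorrect answers per field and configuration, and how many of those
# had the answer available in the retrieved sources (from the lists above).
counts <- data.frame(
  field      = rep(c("02", "04", "08"), each = 4),
  config     = rep(c("14b-4sources", "14b-8sources",
                     "32b-4sources", "32b-8sources"), times = 3),
  incorrect  = c(1, 1, 2, 1,   1, 1, 1, 0,   3, 1, 2, 1),
  in_sources = c(0, 1, 1, 1,   0, 1, 0, 0,   0, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Sum over the three fields for each configuration.
totals <- aggregate(cbind(incorrect, in_sources) ~ config, data = counts, FUN = sum)

# Share of incorrect answers whose information was present in the sources.
totals$share_in_sources <- totals$in_sources / totals$incorrect
print(totals)
```

The totals reproduce the summary: 0/5 and 2/5 for the 4-source tests, 3/3 and 2/2 for the 8-source tests.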

To see all the results from the matching tests (model-field-source number) together, we can also separate the observations into batch and non-batch runs:

Even though ratings are very high in both cases 🎉🎊🎉, the batch tests show more dispersed data, reflecting the differences in ratings between the cases observed previously.

Finally, we consider the impact of Source Number on Elapsed Time, looking only at matching batch tests:

#I FILTER THE JETSON-PDF AI-QWEN OBSERVATIONS

batch = s_all_contracts_raw %>% 
  filter(str_detect(model, "B-JETSON-PDF AI-Qwen")) %>% 
  mutate(model_field = paste0(gsub("B-JETSON-PDF AI-", "", model),"-",field)) %>% 
  arrange(model_field)

Tests with 8 sources have the highest times in every case, but the observations are quite evenly distributed.

BEST RATINGS

BEST PROMPTS PER FIELD
|| Identification of best rated prompts by model per field

Considering the possibility of using different models to answer particular requests, we extract from our logs the best rated prompts, together with the models on which they were tested and other relevant data. It is important to note that the 70b, 72b and 7B models are excluded only from the small-model table (best_prompts_field_small).

best_prompts = s_all_contracts_raw %>%
  filter(!is.na(prompt_rating), !is.na(model)) %>% 
  group_by(prompt,field) %>% 
  # keep the best rated run of each prompt, so that model, input and output
  # come from that same run (first() would take them from the first logged run)
  slice_max(prompt_rating, n = 1, with_ties = FALSE) %>% 
  ungroup() %>% 
  select(prompt, field, prompt_rating, model, input, output, n_of_runs) %>% 
  filter(prompt_rating>0.72) %>% 
  filter(n_of_runs>4) %>% 
  arrange(field, prompt_rating)


# top 4 prompts per field, all models (NAs were already removed above)
best_prompts_field = best_prompts %>%
  group_by(field) %>% 
  top_n(4, wt = prompt_rating) %>%
  ungroup() %>% 
  arrange(field, prompt_rating)

# top 4 prompts per field, excluding the 70b, 72b and 7B models
best_prompts_field_small = best_prompts %>%
  filter(!str_detect(model, "70b|72b|7B")) %>% 
  group_by(field) %>% 
  top_n(4, wt = prompt_rating) %>%
  ungroup() %>% 
  arrange(field, prompt_rating)


# Write the data frame to the specified Google Sheet
#sheet <- sheet_write(best_prompts_field, ss = "1f5oLNE4YU0PdJU2vJy-V32-BLzno3ZJui-A_pg0tXZM", sheet = "R INPUT2")

#sheet <- sheet_write(best_prompts_field_small, ss = "1f5oLNE4YU0PdJU2vJy-V32-BLzno3ZJui-A_pg0tXZM", sheet = "R INPUT SMALL")
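When the best run of each prompt is reported together with its model, the model has to come from the same row as the maximum rating. The following is a minimal base R sketch of that selection, with invented prompts, models and ratings:

```r
# Hypothetical log: two prompts, each tested on two models.
log <- data.frame(
  prompt        = c("p1", "p1", "p2", "p2"),
  field         = c("01", "01", "01", "01"),
  model         = c("model-a", "model-b", "model-a", "model-b"),
  prompt_rating = c(0.80, 0.95, 0.90, 0.85),
  stringsAsFactors = FALSE
)

# Split by prompt and field, then keep the whole row with the highest rating,
# so that model (and any other column) belongs to the best run.
groups <- split(log, list(log$prompt, log$field), drop = TRUE)
best_rows <- do.call(rbind, lapply(groups, function(g) g[which.max(g$prompt_rating), ]))
rownames(best_rows) <- NULL
print(best_rows)
```

Taking the maximum rating and the first logged model separately would pair p1 with model-a here, although its best rating was obtained on model-b.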


