Extract strategies

Sometimes we need extra customization of the extraction process to achieve the best results for our use case. For example, we may have really complex financial reports with tables, and we want to extract the data from them in a structured way. In these cases, we can create an extract strategy that will define how to extract the data from the documents.

To make strategies reusable, we define them first and then reference them when processing documents. We can create as many strategies as we want and use them in different processing jobs.

What is an extract strategy?

An Extract Strategy is a reusable configuration that dictates how data is extracted from documents. It acts as a container for one or more extraction components. Currently, you can configure the following components within a strategy:

  • Visual extraction: useful to extract data from documents that have a visual structure, such as tables, images or forms. It leverages a visual model to understand the layout of the document and extract the data from it.
  • AI tables: extracts data from tables in documents in a smarter way. It uses AI models to understand the structure of the table and extract the data from it. Combined with visual extraction, it can extract data from complex tables that have a visual structure.

Visual extraction

Visual extraction is a component that uses a visual model to understand the layout of the document and extract data from it. It is particularly useful for documents that have a visual structure, such as tables, images or forms. We can enable it without any extra configuration, or it can be customized to our use case using the following parameters:

  • LLM: The generative_model field selects the LLM used to extract the data from the document. Optionally, user_keys can carry our own credentials for that LLM provider.

    Warning: the selected model must be a visual LLM (one that accepts images as input).

  • RULES: In this field we can define the rules that will be used to extract the data from the document. The rules should be defined using natural language, and they will be used to guide the visual model in extracting the data from the document. For example, we can define rules such as "describe the images in the documents as if you were an expert in the field of electronics" or "extract the data from the tables in the document and present it in a structured format".

Example of a visual extraction configuration:

{
  "name": "Visual Extraction Strategy",
  "llm": {
    "generative_model": "chatgpt-azure-4o"
  },
  "rules": [
    "Extract the text from the document ignoring the footers and headers",
    "Describe the images in the document as if you were an expert in the field of electronics"
  ]
}

When we enable visual extraction, all pages in the document will be processed using the visual model, and the data will be extracted according to the rules defined in the strategy. We recommend playing around with the rules and the LLM to find the best combination for your use case.
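When experimenting with several rule sets, it can help to assemble the configuration programmatically. The following is a minimal sketch that only reproduces the shape of the example above; build_visual_strategy is a hypothetical helper, not part of any Nuclia package, and dropping blank rules is our own choice rather than a documented requirement.

```python
import json


def build_visual_strategy(name, generative_model, rules):
    """Assemble a dict shaped like the visual extraction example (hypothetical helper)."""
    # Rules are free-form natural-language instructions; skip empty or blank entries.
    cleaned = [rule.strip() for rule in rules if rule and rule.strip()]
    return {
        "name": name,
        "llm": {"generative_model": generative_model},
        "rules": cleaned,
    }


config = build_visual_strategy(
    "Visual Extraction Strategy",
    "chatgpt-azure-4o",
    [
        "Extract the text from the document ignoring the footers and headers",
        "   ",  # blank rule, removed by the helper
    ],
)
print(json.dumps(config, indent=2))
```

Keeping the rules in a plain list makes it easy to try one variation at a time and compare the extracted output.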

AI Tables

AI Tables uses AI models to understand the structure of the tables in a document and extract their data. It is particularly useful for documents with complex tables, such as financial reports or invoices. When this component is active, we first detect the tables in the document and then extract their data using an LLM. Only pages that contain tables are processed. The extracted tables are rendered in Markdown and added to the document's extracted text with the type table. AI Tables can be combined with visual extraction to handle complex tables that have a visual structure.

AI Tables can be used as is or further customized using the following parameters:

  • LLM: The generative_model field selects the LLM used to extract the data from the tables in the document. Optionally, user_keys can carry our own credentials for that LLM provider.
  • RULES: In this field we can define the rules that will be used to extract the data from the tables in the document. The rules should be defined using clear and concise natural language, and they will be used to guide the LLM in extracting the data from the tables. For example, we can define rules such as "Convert all the extracted figures to euros" or "If the table has no title, make one up".
  • merge_pages and max_pages_to_merge: These parameters control how many pages of the document are merged together before the tables are extracted. The default value is 1, which means that each page is processed separately. With a higher value, the strategy merges the specified number of pages and extracts the tables from the merged content. This is useful for documents with tables that span multiple pages.

Example of an AI Tables configuration:

{
  "name": "AI Tables Strategy",
  "llm": {
    "generative_model": "chatgpt-azure-4o"
  },
  "rules": [
    "Make sure to convert all the extracted figures to euros",
    "If the table has no title, make one up"
  ]
}

As with visual extraction, we recommend playing around with the rules and the model to find the best combination for your use case.
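One way to picture page merging is as consecutive pages grouped into batches of a given size. The sketch below is only a mental model under that assumption: how the service actually groups pages is not specified here, and group_pages is a hypothetical helper, not an SDK function.

```python
def group_pages(page_numbers, max_pages_to_merge=1):
    """Split consecutive pages into batches of at most max_pages_to_merge (illustrative only)."""
    if max_pages_to_merge < 1:
        raise ValueError("max_pages_to_merge must be at least 1")
    return [
        page_numbers[start:start + max_pages_to_merge]
        for start in range(0, len(page_numbers), max_pages_to_merge)
    ]


pages = [1, 2, 3, 4, 5]
print(group_pages(pages))     # [[1], [2], [3], [4], [5]]
print(group_pages(pages, 2))  # [[1, 2], [3, 4], [5]]
```

Under this reading, a table that starts on page 1 and continues on page 2 would be seen as a whole only when both pages land in the same batch.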

Strategy creation and management

Before we can use an extract strategy, we need to create it. We can create as many strategies as we want and pick the right one for each processing job. Once created, a strategy cannot be modified, but we can delete it, and we can list the strategies that exist in a given Knowledge Box (KB).

Dashboard

To create a strategy, go to the AI Models section and open Extraction, then click the Create configuration button and fill in the fields with the desired configuration. The same section lists the strategies you have created and lets you delete those you no longer need.

CLI

To create a strategy using the CLI, you can use the command nuclia kb extract_strategies add with the desired configuration in JSON format. You can also list the strategies you have created with nuclia kb extract_strategies list, and delete them with nuclia kb extract_strategies delete.

nuclia kb extract_strategies add --config='{"name":"strategy1","vllm_config":{}}'
nuclia kb extract_strategies list
nuclia kb extract_strategies delete --id=1361c0c7-918a-4a7f-b44b-ba37437619fb
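Hand-writing JSON inside shell quotes gets error-prone once rules contain quotes or spaces. A possible approach, sketched here with the Python standard library only, is to serialize the configuration with json.dumps and let shlex handle the quoting. The command and the --config option are the ones shown above; everything else is illustrative.

```python
import json
import shlex

config = {"name": "strategy1", "vllm_config": {}}

# Compact separators keep the JSON on one line without extra spaces.
config_json = json.dumps(config, separators=(",", ":"))

# Building the command as a list avoids manual quoting entirely;
# it could be passed as-is to subprocess.run(command, check=True).
command = ["nuclia", "kb", "extract_strategies", "add", f"--config={config_json}"]

# shlex.join produces a copy-pasteable, safely quoted shell line.
print(shlex.join(command))
```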

SDK

To create a strategy using the SDK, you can use the add method of the extract_strategies object, passing the desired configuration as a Python dictionary with the same structure as the JSON configuration. You can also list the strategies you have created with the list method, and delete them with the delete method.

from nuclia import sdk

extract_strategies = sdk.NucliaKB().extract_strategies
print(extract_strategies.list())
# add() returns the ID of the new strategy, which delete() expects
config = {"name": "strategy1", "vllm_config": {}, "ai_tables": {"llm": {"generative_model": "chatgpt-azure-4o"}}}
strategy_id = extract_strategies.add(config=config)
extract_strategies.delete(id=strategy_id)
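Because strategies cannot be modified once created, changing one amounts to creating a replacement and deleting the original. The following is a minimal sketch that relies only on the add and delete calls shown above. replace_strategy and InMemoryStrategies are hypothetical names; the in-memory class stands in for sdk.NucliaKB().extract_strategies so the example runs without a Knowledge Box.

```python
import uuid


class InMemoryStrategies:
    """Hypothetical stand-in that mimics add(config=...) and delete(id=...)."""

    def __init__(self):
        self.items = {}

    def add(self, config):
        strategy_id = str(uuid.uuid4())
        self.items[strategy_id] = config
        return strategy_id

    def delete(self, id):
        del self.items[id]


def replace_strategy(extract_strategies, old_id, new_config):
    """Create the replacement first, so a failed add() leaves the old strategy in place."""
    new_id = extract_strategies.add(config=new_config)
    extract_strategies.delete(id=old_id)
    return new_id


strategies = InMemoryStrategies()
old_id = strategies.add(config={"name": "strategy1", "vllm_config": {}})
new_id = replace_strategy(
    strategies,
    old_id,
    {"name": "strategy1-v2", "ai_tables": {"llm": {"generative_model": "chatgpt-azure-4o"}}},
)
print(new_id != old_id)
```

The replacement gets a new ID, so later uploads need to pass that new ID instead of the old one.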

Use extract strategies for processing

Dashboard

To use an extract strategy for processing documents, just upload the document normally, enable Customize data extraction and select the strategy you want to use in the dropdown menu. Once the document is uploaded, it will be processed using the selected strategy.

CLI

To use an extract strategy for processing documents using the CLI, you can use the command nuclia kb upload file with the --extract_strategy option, passing the ID of the strategy you want to use.

nuclia kb upload file --path=FILE_PATH --extract_strategy=1361c0c7-918a-4a7f-b44b-ba37437619fb

SDK

To use an extract strategy for processing documents using the SDK, you can use the file method of the NucliaUpload object, passing the path to the file and the ID of the strategy you want to use.

from nuclia import sdk

# FILE_PATH is a placeholder for the path of the file to upload
upload = sdk.NucliaUpload()
upload.file(path=FILE_PATH, extract_strategy="1361c0c7-918a-4a7f-b44b-ba37437619fb")
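Since a strategy is reusable, the same ID can be passed for every file in a batch. The sketch below assumes only the file(path=..., extract_strategy=...) call shown above. upload_all and RecordingUpload are hypothetical names; the recording class stands in for sdk.NucliaUpload() so the example runs without credentials, and the file paths are made up.

```python
class RecordingUpload:
    """Hypothetical stand-in for sdk.NucliaUpload() that records each call."""

    def __init__(self):
        self.calls = []

    def file(self, path, extract_strategy=None):
        self.calls.append((path, extract_strategy))


def upload_all(upload, paths, strategy_id):
    """Upload every path with the same extract strategy and return the count."""
    count = 0
    for path in paths:
        upload.file(path=path, extract_strategy=strategy_id)
        count += 1
    return count


upload = RecordingUpload()
strategy_id = "1361c0c7-918a-4a7f-b44b-ba37437619fb"
uploaded = upload_all(upload, ["report-q1.pdf", "report-q2.pdf"], strategy_id)
print(uploaded)  # 2
```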