AI as a Helping Hand in Data Collection for CSRD/EU Taxonomy Reporting

Preparing a disclosure report for the CSRD or the EU Taxonomy is a challenging task: the regulations are complex, the number of data points to disclose is high, and the data to analyse is scattered across a company’s departments. In this article, we take a more technical perspective and discuss how Retrieval Augmented Generation, an AI technique, can assist with the data collection part of the ESG reporting journey.

The compliance journey for ESG (Environmental, Social, and Governance) reporting under the EU Green Deal can be daunting. Preparing a full disclosure report for the EU Taxonomy or the Corporate Sustainability Reporting Directive (CSRD) requires navigating complex regulations, collecting thousands of data points from a plethora of documents scattered across a company’s data sources, and devising feasible strategies to optimise the social and ecological impact of the company’s activities.

The information that needs to be disclosed is contained in documents owned by multiple departments (HR, finance and accounting, legal, supply chain management, etc.), frequently in a wide variety of formats (PDFs, spreadsheets, …) and in the form of free text, which hinders efficient data gathering: there is no central database that can simply be queried to obtain all the data points that need to be included in ESG reports.

Artificial Intelligence (AI) tools can offer significant assistance during this challenging journey by doing the bulk of the data collection. In this article, we will explore how AI can serve as a valuable ally in collecting the information required for CSRD/EU Taxonomy reporting. We will specifically focus on the Retrieval Augmented Generation (RAG) technique, shedding light on its application and delving into its inherent limitations.

ESG Reporting: The Benefits of AI for Data Collection

AI can accelerate many steps of ESG reporting, and data collection is no exception. Here are some noteworthy benefits.

Efficiency and Automation

AI tools can automate the collection and processing of large volumes of data, saving time and resources. This streamlined, automated approach ensures a smooth and efficient ESG reporting process, allowing organisations to focus on deriving meaningful insights rather than spending time on manual data tasks.

Data Integration

By leveraging AI capabilities, organisations can seamlessly gather data from diverse sources and formats, scattered across different departments. Advanced AI tools not only collect and transform the required data but can also establish a comprehensive ESG data model, encompassing both quantitative and qualitative information. They facilitate the collection, cleaning, standardisation, and centralisation of the necessary data, all conveniently housed in a single, accessible location.

Accuracy

Artificial Intelligence can help fortify the foundation of your ESG reporting, providing a robust framework for years to come. An AI solution ensures end-to-end data traceability and mitigates the risk of human error by taking on the bulk of data collection tasks in a tireless and ever-focused manner. Combined with appropriate human validation processes to verify the output of AI solutions, this meticulous approach not only enhances the reliability of your ESG reporting but also instils confidence in stakeholders, fostering a commitment to precision and accountability.

Scalability

Finally, AI offers scalability for ESG reporting, freeing companies from starting afresh each year. The inherent flexibility of AI tools empowers organisations to adeptly navigate the dynamic sustainability landscape within an ever-shifting regulatory environment. This adaptability ensures a scalable and future-proof approach, allowing seamless expansion and evolution in response to emerging ESG challenges and requirements.

Artificial Intelligence as a Tool to Facilitate Sustainability Reporting

Artificial Intelligence in general, and Large Language Model (LLM) solutions in particular, can considerably speed up the process of preparing the required reports for the EU Taxonomy or the CSRD.

LLMs, with the GPT (Generative Pre-trained Transformer) models as the most widely known instances, can extract and interpret information from almost any type of textual source, given the appropriate input. They excel at summarising information and answering questions, and can do so in multiple languages.

On account of the nature of the data these models are trained on, they can even infer relevant information that is not explicitly present in the input. For instance, if you present a sequence of words and numbers extracted from a table, without any layout information, a well-trained LLM is capable of correctly interpreting the sequence: it infers which numbers belong to which row labels and column headers, even though that layout information has been stripped away.

As an illustration, an LLM can transform the following hard-to-interpret sequence into a properly formatted table:

Prompt:

Your task is to format badly formatted strings into a more readable HTML format.

You receive a text that was extracted from a table, but all formatting has been lost.

Return the text in a valid HTML format, trying to infer the correct columns and rows.

Add NaN for empty cells. ### Input text: '2022 2021 Average number of employees Of which men, % Average number of employees Of which men, % Subsidiary cont. Belgium 52 52 46 51 Italy 1,211 67 1,167 67 Vietnam 83 41 81 41 Austria 370 59 330 60 Spain 133 77 108 77 India 18 75 21 71 Brazil 8 52 8 46 Guatemala 373 50 308 50 The Netherlands 53 72 37 77 Australia 10 61 9 62 Sweden 9 84 10 90 Bangladesh 26 54 25 54 Monaco 1 0 1 0 United Kingdom 6 66 15 67 Singapore 7 66 8 55 Slovenia 205 65 227 62 Zimbabwe 352 80 147 82 Total 8,834 64 7,650 64'

Here is the result:

[The LLM returns a valid HTML table with column groups for 2022 and 2021, each split into ‘Average number of employees’ and ‘Of which men, %’, and one row per subsidiary, ending with the totals row.]
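For readers who want to try this themselves, here is a minimal sketch of how such a prompt could be sent programmatically, using the OpenAI Python client. The model name, the truncated input string and the exact prompt wording are illustrative assumptions, not a prescription.

# Minimal sketch: sending the table-reconstruction prompt to an LLM.
# Assumptions: the OpenAI Python SDK (v1+) is installed and OPENAI_API_KEY is set;
# the model name is an illustrative choice.
from openai import OpenAI

client = OpenAI()

raw_text = (
    "2022 2021 Average number of employees Of which men, % "
    "Average number of employees Of which men, % Subsidiary cont. "
    "Belgium 52 52 46 51 Italy 1,211 67 1,167 67"  # truncated here for brevity
)

instructions = (
    "Your task is to format badly formatted strings into a more readable HTML "
    "format. You receive a text that was extracted from a table, but all "
    "formatting has been lost. Return the text in a valid HTML format, trying "
    "to infer the correct columns and rows. Add NaN for empty cells."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "system", "content": instructions},
        {"role": "user", "content": f"### Input text: '{raw_text}'"},
    ],
)

print(response.choices[0].message.content)  # the reconstructed HTML table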

These capacities make LLMs an excellent tool for assisting a company’s Corporate Social Responsibility (CSR) team in drafting their ESG reports.

Nevertheless, a standalone LLM is not sufficient to generate a company’s ESG report, or even to provide correct answers to questions regarding sustainability regulations. LLMs are trained on large amounts of historical data scraped from public web sources, which gives them a considerable amount of general knowledge.

However, their knowledge is never fully up to date, since it is capped at the information available in public sources at the time of their training. They do not have any knowledge of private data sources, such as the company documents that need to be consulted to establish an ESG report. Another major shortcoming of LLMs is that they tend to “hallucinate”, i.e., generate plausible but untrue information, which is detrimental in the context of ESG reporting.

One technique that allows these shortcomings to be overcome is Retrieval Augmented Generation (RAG): the information relevant for generating a correct response is retrieved from an up-to-date, external data source. The retrieved information is then passed to an LLM together with instructions on how to process it.

There are other, more traditional data mining techniques, such as entity extraction, that can be used for data collection for ESG reporting. Here we will focus on RAG, since it is the most suitable and comprehensive technique for tackling all aspects of ESG reporting.

Retrieval Augmented Generation: A Response to the Limitations of LLMs in Data Collection for ESG Reporting

Let’s now dive into the intricacies of a RAG solution: we will go over its two major components and explain in simple terms how these components work to collect data for CSRD/EU Taxonomy reporting.

An accurate and performant RAG system enables a company to analyse and get answers from its documents in a matter of seconds, making it easier to extract all the information needed for its CSRD and EU Taxonomy reports. Instead of having to manually search through hundreds of documents, the documents can simply be ingested into a RAG system, which the company’s CSR team can then question in natural language. The RAG solution retrieves the relevant information from the ingested documents and generates an adequate answer.

A RAG solution typically consists of two major components: a document retrieval system and an LLM for question answering. The retrieval system is responsible for finding the most relevant sources for answering a question in a corpus of preprocessed documents. Ideally, the documents are retrieved because their semantic content is relevant to the question. This can be accomplished by employing semantic search, also sometimes called vector search. The original question and the retrieved sources are subsequently presented to an LLM, with the instruction to craft an accurate, well-formatted response based on the provided sources.
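To make these two components concrete, here is a heavily simplified sketch of a RAG pipeline in Python. The embedding model, the document chunks, the prompt wording and the model name are all illustrative assumptions; a production system would add document chunking, persistence and error handling.

# Simplified RAG sketch: semantic retrieval followed by LLM question answering.
# Assumptions: sentence-transformers and the OpenAI SDK are installed; the
# chunks, model names and prompt are purely illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

chunks = [
    "In 2022 the company employed 8,834 people on average.",
    "Total waste produced in 2022 amounted to 1,200 tonnes.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(question: str, top_k: int = 1) -> list[str]:
    """Return the chunks whose embeddings are closest to the question."""
    query_vector = embedder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vectors @ query_vector  # cosine similarity (vectors are normalised)
    best = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in best]

def answer(question: str) -> str:
    """Retrieve relevant chunks and let the LLM craft an answer from them."""
    sources = retrieve(question)
    prompt = (
        "Answer the question using only the sources below. "
        "If the sources do not contain the answer, say so.\n\n"
        "Sources:\n- " + "\n- ".join(sources) + f"\n\nQuestion: {question}"
    )
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(answer("What was the total headcount in 2022?"))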

Semantic Search

The document retrieval technique known as semantic search relies on the meaning of text. It stands apart from keyword search in that a document does not need to contain the exact query words to qualify as a relevant match.

For instance, in a semantic search system, the query “total headcount” will match with documents mentioning “the total number of employees of the company”, but not with documents mentioning “total waste produced”, as the latter is entirely different in meaning. On the other hand, with a keyword-based approach, both documents can be retrieved since they both contain the word “total”, which is a keyword match for the query.
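As a rough illustration of this difference, the snippet below scores the two example sentences against the query with a general-purpose embedding model (a sketch; the model choice is an assumption). The semantically relevant sentence should score markedly higher than the one that merely shares the keyword “total”.

# Illustrative semantic similarity comparison.
# Assumption: sentence-transformers is installed; the model choice is arbitrary.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "total headcount"
documents = [
    "the total number of employees of the company",
    "total waste produced",
]

query_embedding = model.encode(query, convert_to_tensor=True)
document_embeddings = model.encode(documents, convert_to_tensor=True)

scores = util.cos_sim(query_embedding, document_embeddings)[0]
for document, score in zip(documents, scores):
    print(f"{score.item():.2f}  {document}")
# Expected outcome: the first sentence scores clearly higher than the second,
# even though both contain the word "total".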

Processing Documents for Semantic Search

To make the semantic content of documents searchable, they have to be transformed into a format that can be understood and manipulated by machines. Additionally, the chosen format needs to allow for calculating the semantic similarity between a query and a document.

Calculations can only be performed on numbers, not on text, which is why the documents are transformed into numerical vector representations that capture their meaning, also called ‘embeddings’.

You can think of an embedding as follows:

Each embedding has a certain number of dimensions, typically more than 500. Each dimension can be thought of as representing a certain concept, such as ‘number’ or ‘human’. The more relevant a concept is to the meaning of a word or sequence of words, the higher the weight for that dimension, and vice versa. ‘Number’ is more relevant to the meaning of ‘headcount’ than it is to the meaning of ‘waste’, hence its weight is higher for ‘headcount’.

The language models trained to calculate these embeddings learn from frequency and distributional patterns in large amounts of text data. The semantic dimensions of the resulting embeddings are entirely abstract, yet they represent the meaning of the corresponding text quite accurately, which makes them suitable for semantic search.

Once we have a numerical representation of the meaning of the text, all sorts of calculations can be applied. In a semantic search context, the distance between the query embedding and the document embeddings is calculated. The closer a document embedding is to the query embedding, the more similar it is in terms of meaning. The closest documents are retained for the Question Answering step of the RAG system.
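The toy example below makes this concrete with made-up three-dimensional embeddings; real embedding models produce vectors with hundreds of abstract dimensions, and the dimension labels and weights here are invented purely for illustration.

# Toy example: cosine similarity between made-up, low-dimensional embeddings.
# The dimension labels and weights are invented for illustration only.
import numpy as np

# Hypothetical dimensions: [number, human, waste]
query_headcount = np.array([0.9, 0.8, 0.0])  # "total headcount"
doc_employees = np.array([0.8, 0.9, 0.1])    # "total number of employees"
doc_waste = np.array([0.7, 0.0, 0.9])        # "total waste produced"

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 means identical direction (same meaning); values near 0 mean unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(query_headcount, doc_employees))  # ~0.99: kept for question answering
print(cosine_similarity(query_headcount, doc_waste))      # ~0.46: discarded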

Question Answering with an LLM

Once the relevant sources for the question are retrieved, the question and the sources are presented to an LLM with precise instructions and examples on how to process them. Additionally, some instructions concerning language can be included: for instance, it might be that the retrieved sources are in French and Dutch but that the ESG reports have to be written in English. In this case, the LLM can be instructed to always answer in English, disregarding the language of the input question and sources.
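As a sketch of what such instructions might look like in practice, the template below assembles the retrieved sources, the question and an English-only instruction into a single prompt. The wording is an assumption and would be tuned for each use case.

# Illustrative prompt template for the question answering step.
# The wording and the English-only instruction are examples, not a fixed recipe.
def build_qa_prompt(question: str, sources: list[str]) -> str:
    source_block = "\n\n".join(
        f"Source {i + 1}:\n{text}" for i, text in enumerate(sources)
    )
    return (
        "You answer questions for an ESG report.\n"
        "Use only the sources below; if they do not contain the answer, "
        "state that the information is missing.\n"
        "Always answer in English, regardless of the language of the "
        "question or the sources.\n\n"
        f"{source_block}\n\nQuestion: {question}\nAnswer:"
    )

print(build_qa_prompt(
    "Quel est l'effectif total ?",
    ["Le nombre total d'employés en 2022 était de 8 834."],
))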

Working with RAG solutions: Risks and Countermeasures

Even though RAG solutions are powerful tools to assist a company in its ESG reporting journey, there are some pitfalls. Here, we discuss some risks of working with RAG systems and the potential solutions to these risks.

  • LLMs are not always capable of discerning relevant information from irrelevant information. The LLM can be given precise instructions on how to deal with a complete lack of sources, but if irrelevant sources are found during the retrieval step, the final answer generated by the LLM can end up unnecessarily verbose or partially wrong.
  • LLMs are incapable of recognising that the provided sources are incomplete. It is thus possible that a generated answer only includes part of the information that is required for an ESG report.
  • Relevant context can get lost when preprocessing documents for semantic search. The length of the sources that can be ingested in a semantic retrieval system is limited: the embedding models used to transform text into numbers can only process spans of text up to a certain length, for instance, 512 tokens. Due to this limitation, the ingested documents have to be chunked into smaller pieces. During this chunking process, links between different parts of a document can get lost. For instance, a paragraph concerning the activities of a subsidiary of a company, without explicit mention of the fact that it only refers to the subsidiary, can be separated from the paragraph that provides the context necessary for correct interpretation. When the isolated paragraph is presented to the LLM, it has no indication that the described activities do not refer to the company as a whole, and it can thus wrongly generate an answer presenting the activities as pertaining to the main company.

Overcoming the Risks of RAG Solutions

The last issue, the loss of context during chunking, can be overcome by implementing a more advanced document retrieval system that, for instance, retrieves the preceding and following paragraphs for the most relevant matches to the query, or that enriches documents with metadata. The architecture of the most appropriate retrieval system depends largely on the nature of the documents it operates on. Identifying the best solution requires a deep understanding of the data and how it can be structured.
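One simple variant of such a system is sketched below: each chunk keeps track of its position in the source document, and a retrieved match is passed to the LLM together with its neighbouring chunks. The data structures, field names and window size are assumptions chosen for illustration.

# Sketch: expanding a retrieved chunk with its neighbours to preserve context.
# The chunking granularity, metadata fields and window size are illustrative.
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str    # which source document the chunk comes from
    position: int  # index of the chunk within that document
    text: str

def with_neighbours(match: Chunk, all_chunks: list[Chunk], window: int = 1) -> str:
    """Return the matched chunk plus up to `window` chunks before and after it."""
    neighbours = [
        c for c in all_chunks
        if c.doc_id == match.doc_id and abs(c.position - match.position) <= window
    ]
    neighbours.sort(key=lambda c: c.position)
    return "\n".join(c.text for c in neighbours)

# The matched paragraph about a subsidiary is handed to the LLM together with
# the preceding paragraph that names the subsidiary.
chunks = [
    Chunk("annual_report", 0, "Our subsidiary in Vietnam focuses on assembly."),
    Chunk("annual_report", 1, "Its activities employed 83 people on average."),
]
print(with_neighbours(chunks[1], chunks))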

The best way to counter the shortcomings of RAG solutions that are used in critical processes such as ESG reporting is to include a human validation step. Given that any AI solution can make mistakes, the generated answers should only be included in the final report after being validated by a person who verifies the correctness and completeness of the answers. Through the four-eyes principle, a RAG solution becomes a trustworthy and efficient assistant in ESG reporting.

Greenomy: Your AI-Powered Solution for ESG Reporting

The Greenomy platform includes several AI tools to guide companies through their sustainability reporting journey. Concretely, for CSRD and EU Taxonomy reporting, the platform includes a RAG solution: companies can upload their documents, and the RAG pipeline tries to find all required data points. The queries used to extract the data points are curated by a team of sustainability and legal experts, to ensure highly accurate data extraction.

Additionally, the platform provides a legal assistant. Sometimes, a required data point cannot be found by the RAG pipeline, for instance because the information is not contained in the provided documents. In this case, users have to provide the correct data point manually, while complying with the regulations. However, the regulations can be quite hard to interpret, in which case the legal assistant can help: users can ask any question regarding sustainability regulations, and the AI legal advisor provides a clear answer.

Finally, the disclosure reports need to include ESG strategies. It is not always clear what actions can be taken by a company to improve the social and ecological impact of their actions. Through a RAG system, users can look up strategies of industry peers to find inspiration or discover how their own company compares with their peers in terms of sustainability.


Book your demo and accelerate your green transition today
