Research - DICE

Explaining document classification decisions to users

Real-world Machine Learning applications often require explainable solutions. This need has become more apparent with the introduction of Deep Learning models, which are by nature not explainable. We work on two types of explainability. The first is explainability by design, where the model learns to extract pieces of the input text as justifications (rationales) that are short and coherent, yet sufficient for accurate prediction (usually document classification). The second is a post-processing step that tries to identify the parts of the input responsible for the model's decision, even though the model itself was not designed or trained to explain its predictions.
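As an illustration of the second, post-processing type of explainability, the sketch below scores each token by how much the prediction changes when that token is removed (often called occlusion or leave-one-out attribution). This is one common technique and not necessarily the method used in our research. The function `toy_score` and its keyword weights are hypothetical stand-ins for a trained classifier.

```python
from typing import Callable, List, Tuple


def toy_score(tokens: List[str]) -> float:
    """Hypothetical stand-in for a trained classifier's positive-class score."""
    weights = {"fraud": 2.0, "breach": 1.5, "penalty": 1.0}  # assumed values
    return sum(weights.get(tok.lower(), 0.0) for tok in tokens)


def occlusion_importance(
    tokens: List[str], score_fn: Callable[[List[str]], float]
) -> List[Tuple[str, float]]:
    """Score each token by the drop in model score when it is removed."""
    base = score_fn(tokens)
    importances = []
    for i, tok in enumerate(tokens):
        reduced = tokens[:i] + tokens[i + 1:]  # input with token i occluded
        importances.append((tok, base - score_fn(reduced)))
    return importances


tokens = "The report describes a fraud and a data breach".split()
importances = occlusion_importance(tokens, toy_score)

# Keep the two tokens whose removal hurts the score most.
top = sorted(importances, key=lambda pair: pair[1], reverse=True)[:2]
print(top)
```

The model is treated as a black box here: only its scores are queried, which is what distinguishes this approach from explainability by design.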

Regulatory document analysis for compliance applications

Thousands of new regulatory documents are published every year. To achieve compliance, organisations must invest considerable manual effort in retrieving and understanding all relevant new regulations. Specifically, retrieval involves trying different combinations of keywords to query general-purpose regulatory databases and then going through the results to single out the most relevant documents. We address this problem by working on novel Information Retrieval methodologies that receive documents as input (e.g., documents describing existing controls of an organisation, or legislation documents of a country).
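To make the idea of document-as-query retrieval concrete, the following minimal sketch ranks a small collection of regulations against a whole input document using TF-IDF vectors and cosine similarity. It is a simple baseline under stated assumptions, not the methodology developed in our research. The example texts are invented, the smoothed IDF formula is one common choice, and the query is included when computing IDF purely for brevity.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors using whitespace tokens and a smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query_doc: str, corpus: List[str]) -> List[Tuple[int, float]]:
    """Return (corpus index, similarity) pairs, most similar first."""
    vectors = tfidf_vectors(corpus + [query_doc])
    query_vec = vectors[-1]
    scores = [(i, cosine(query_vec, vec)) for i, vec in enumerate(vectors[:-1])]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Invented example data: three regulations and one internal control description.
regulations = [
    "banks must report suspicious transactions to the financial authority",
    "food labels must list all allergens",
    "personal data must be encrypted when stored by banks",
]
control = "our control encrypts personal data stored in customer databases"

ranking = rank(control, regulations)
print(ranking[0][0])  # index of the best-matching regulation
```

The user supplies the control description as-is and never has to guess which keywords to try.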

Analysis of annual financial reports for auditing applications

Annual financial reports play an important role in the financial audit process. Auditors usually check the numbers in financial statements. The text of the reports could also provide valuable information, but it is very time-consuming to check. In addition, the introduction of XBRL (eXtensible Business Reporting Language) as a requirement for tagging reported financial values introduces more challenges for auditors. We aim to automate the analysis and tagging of the text of financial reports using Machine Learning. To this end, we focus on the following research challenges: (a) classification of long texts with imbalanced class distribution, and (b) numeric entity recognition.
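One standard baseline for challenge (a) is to weight each class inversely to its frequency, so that errors on rare classes cost more during training. The sketch below computes such weights with the widely used "balanced" heuristic, n_samples / (n_classes * class_count). It is shown only as an illustration of the imbalance problem, not as our approach, and the label names are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical label distribution: 8 documents of one class, 2 of another.
labels = ["common"] * 8 + ["rare"] * 2
weights = balanced_class_weights(labels)
print(weights)
```

With this distribution the rare class receives a weight four times larger than the common one, which mirrors the 8:2 imbalance.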

Classification of documents with missing data

In many practical applications in the financial and legal domains, thousands of documents need to be annotated with one or more of possibly tens or even thousands of labels. In addition to being large, the label sets are frequently updated, which makes it impractical to maintain the correct labels for each document manually. Therefore, one would like to train document classifiers that assign labels automatically. Training such classifiers with machine learning methods is a challenge, not only because of the number of labels and their volatility, but also because of their highly imbalanced distribution. In effect, it is very difficult to obtain training data that adequately cover all classes. Our research focuses on text classification with few- and zero-shot learning capability to handle rare and unseen classes.
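The zero-shot setting can be sketched as follows: instead of learning from labelled examples, a document is matched against a short textual description of each label, so a new label only needs a description. This toy version uses Jaccard token overlap as a stand-in for a learned text encoder. The label names, descriptions, and similarity measure are all assumptions made for illustration and do not describe our models.

```python
from typing import Dict, Set


def token_set(text: str) -> Set[str]:
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def zero_shot_label(document: str, label_descriptions: Dict[str, str]) -> str:
    """Pick the label whose description is most similar to the document."""
    doc_tokens = token_set(document)
    return max(
        label_descriptions,
        key=lambda name: jaccard(doc_tokens, token_set(label_descriptions[name])),
    )


# Hypothetical labels; none of them needs any training examples.
label_descriptions = {
    "tax": "income tax filing and tax rates",
    "privacy": "personal data protection and privacy rights",
    "labour": "employment contracts and working hours",
}
document = "new rules on protection of personal data for online services"

predicted = zero_shot_label(document, label_descriptions)
print(predicted)
```

Adding an unseen class amounts to adding one more entry to `label_descriptions`, which is why this formulation suits label sets that change frequently.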