MIT 6.S191: Deep CPCFG for Information Extraction

Transcription for the video titled "MIT 6.S191: Deep CPCFG for Information Extraction".


Opening Remarks

Introduction (00:00)

Thank you. So I lead AI at EY globally, and today we're going to talk to you about some work we've done in information extraction, specifically deep CPCFG. So hopefully we'll introduce some concepts that you may not have come across before. And before we get started, maybe a couple of disclaimers. The views expressed in this presentation are mine and Freddie's. They're not necessarily those of our employer. And I'll let you read those other disclaimers. AI is very important for Ernst & Young. Many of you will be familiar with the firm. We are a global organization of more than 300,000 people in more than 150 countries. We provide a range of services to our clients that include assurance, consulting, strategy and transformations, and tax. Both in the context of the services that we deliver and in the context of our clients' own transformations, AI is incredibly important. We see a huge disruption happening for many industries, including our own, driven by AI. And because of that, we're making significant investments in this area. We've established a global network of AI research and development labs, as you see, in various parts of the world, and we also have a significant number of global AI delivery centers. There's huge energy and passion for AI within EY, so we have a community of more than four and a half thousand members internally, and we maintain meaningful relationships with academic institutions, including, for example, MIT. And we, of course, engage heavily with policymakers, regulators, legislators, and NGOs around issues related to AI. Some areas of particular focus for us: the first is document intelligence, which we'll talk about today. And this is really the idea of using AI to read and interpret business documents. The key phrase there is business documents. We're not necessarily talking about emails or web posts or social media posts or product descriptions. 
We're talking more about things like contracts, lease contracts, revenue contracts, employment agreements. We're talking about legislation and regulation. We're talking about documents like invoices and purchase orders and proofs of delivery. So there's a very wide range of these kinds of documents and hundreds, thousands, maybe tens of thousands of different types of documents here. And we see these in more than 100 languages in more than 100 countries in tens of different industries and industry segments. At EY, we have built some, I think, really compelling technology in this space. We've built some products that are deployed now in more than 85 countries and used by thousands of engagements. And this space is sufficiently important to us that we helped to co-organize the first ever workshop on document intelligence at NeurIPS in 2019. And of course we publish and patent in this area. Two other areas that are important to us that will not be the subject of this talk, but I thought I would just allude to are transaction intelligence. So the idea here is that we see many transactions that our clients execute. And we review those transactions for tax purposes and also for audit purposes. And we'll process hundreds of billions of transactions a year. So it's this enormous amount of transaction data and there's huge opportunities for machine learning and AI to help analyze those transactions, to help determine, for example, tax or accounting treatments, but also to do things like identify anomalies or unusual behavior or potentially identify fraud. Another area that's very important for us is trusted AI. So given the role that we play in the financial ecosystem, we are a provider of trust to the financial markets. It's important that we help our clients and ourselves and ecosystem at large build trust around AI. 
And we engage heavily with academics, NGOs, regulators, and legislators in order to achieve this. But the purpose of this talk is really to talk about document intelligence, and specifically we're going to talk about information extraction from what we call semi-structured documents, things like tax forms. The document you see on the screen in front of you here is a tax form.

Discussion On Information Extraction And Parsing

What is information extraction? (04:18)

There's gonna be information in this form that is contained in these boxes, and so we'll need to pull that information out. These forms tend to be relatively straightforward because the information is consistently located in these positions. But you see, to read this information or to extract this information, it's not just a matter of reading text. We also have to take layout information into account. And there are some complexities even on this document, right? There's a description of property here. This is a list, and so we don't know how many entries might be in this list. There might be one, there might be two, there might be more. Of course, this becomes more complicated when these documents, for example, are handwritten, or when they're scanned, or when they're more variable, like, for example, a check. So many of you may have personalized checks, and those checks are widely varied in terms of their background and in terms of their layout. Typically they're handwritten, and typically when we see them they're scanned, often at pretty poor quality. And so pulling out the information there can be a challenge, and again this is driven largely by the variability. We have documents like invoices. The invoice here is very, very simple, but the key thing to note here is that there are multiple line items. Each of these corresponds to an individual transaction under that invoice. And there may be zero line items, or there may be 10 or hundreds or thousands, or even in some cases tens of thousands of line items in an invoice. Invoices are challenging because a large enterprise might have hundreds or thousands, or even tens of thousands of vendors. 
And each one of those vendors will have a different format for their invoice. And so if you try to hand code rules to extract information from these documents, it tends to fail. And so machine learning approaches are designed really to deal with that variation in these document types and this complex information extraction challenge. Let me just talk about a couple of other examples here. And you'll see a couple of other problems. This receipt on the left-hand side is pretty typical, right? It's a scanned document. Clearly it's been crumpled or creased a little bit. The information that's been entered is offset by maybe half an inch. And so this customer ID is not lined up with the customer ID tag here. And so that creates additional challenges. And this document on the right-hand side is a pretty typical invoice. You'll see the quality of the scan is relatively poor. It contains a number of line items here. And the layout of this is quite different than the invoice we saw on the previous slide or on the receipt on the left-hand side here. So there's lots of variability. For the duration of the talk, we're gonna refer to this document, which is a receipt, and we're gonna use this to illustrate our approach. So our goal here is to extract key information from this receipt. And so let's talk a little bit about the kinds of information we wanna extract.

Types of information (headers, line items, etc) (07:19)

So the first kind is what we call header fields. And this includes, for example, the date of that receipt. When were these transactions executed? It includes a receipt ID or an invoice ID, and it might include this total amount. And these pieces of information are very important for accounting purposes, to make sure that we have paid the receipts and the invoices that we should. They're important for tax purposes: is this expense taxable or not taxable, have we paid the appropriate sales tax? And so we do care about pulling this information out. We refer to these as header fields because often they do appear at the top of the document. Usually it's at the top or the bottom, but they're pieces of information that typically appear once, right? There's one total amount for an invoice that we receive. There aren't multiple values for that. And so there's a fairly obvious way that we could apply deep learning to this problem. We can take this document and run it through optical character recognition, and optical character recognition services or vendors have gotten pretty good. And so they can produce essentially bounding boxes around tokens. So they produce a number of bounding boxes, and each bounding box contains a token. And some of these tokens relate to the information we want to extract. So this $5.10 is the total. And so what we could do is we could apply deep learning to classify these bounding boxes. The input to that deep learning could be the whole image, or it could be some context around that bounding box, but we can use it to classify these bounding boxes. And that can work reasonably well for this header information. But there are some challenges. So here, for example, there is this dollar amount, $5.10. There's also $4.50, $0.60, $0.60 over here, $1.50. If we independently classify all of those, we may get several of them being tagged as the receipt total. And how do we disambiguate those? So this problem of disambiguation is fundamental. 
And what often happens in these systems is that there is post-processing that encodes heuristics or rules that are human engineered to resolve these ambiguities. And that becomes a huge source of brittleness and a huge maintenance headache over time. And we'll say more about that later. The other kind of challenge we see here is fields like this vendor address. So this vendor address contains multiple tokens, and so we need to classify multiple tokens as belonging to this vendor address. And then we have the challenge of deciding which of those tokens actually belong to the vendor address. How many of them are there? And what order do we read them in to recover this address? So while a straightforward machine learning approach can achieve some value, it still leaves many, many problems to be resolved, and they are typically resolved with this hand-engineered post-processing. This becomes even more challenging for line items. And throughout the talk, we'll emphasize line items because this is where many of the significant challenges arise. So here we have two line items. They both correspond to transactions for buying postage stamps. Maybe they're different kinds of postage stamp. And each one will have a description, postage stamps. This one has a transaction number associated with it too. It will have a total amount for that transaction. It might have a quantity, it might have a unit price. So there are multiple pieces of information that we want to extract. So now we need to identify where that information is. We need to identify how many line items there are. We need to identify which line item this information is associated with. So is this 60 cents associated with this first line item or this second one? We as humans can read this, and computers obviously have a much harder time, especially given the variability. There are thousands of different ways in which this information might be organized. So this is the fundamental challenge. 
So these are the documents we want to read. And on the other side of this is the system of record data, right? Typically this information will be pulled out by human beings and entered into some database. This illustrates some kind of database schema. If this was a relational database, we might have two tables. The first table contains the header information and the second table contains all of the line items. So this is the kind of data that we might have in a system of record. This is both the information we might want to extract, but also the information that's available to us for training this system.

Representing document schemas (11:57)

For the purposes of this talk, it's going to be more appropriate to think of this as a document type schema. Think of it as JSON, for example, where we have the header information, the first three fields. And this is not exactly JSON schema, but it's meant to look like that. So these first three fields have some kind of type information. And then we have a number of line items, and the number of line items isn't specified: it may be zero and it may be more than one. And then each one of those has its own information. So our challenge then is to extract this kind of information from those documents, and the training data we have available is raw documents and this kind of information. And so I want to take a little aside for a second and talk about our philosophy of deep learning. Many people think about deep learning simply as large deep networks. We have a slightly different philosophy.
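
As a rough illustration of the document type schema described above, the sketch below models a receipt record with three header fields and a variable-length list of line items, then serializes it to JSON. The class and field names are assumptions chosen for readability, and the date and receipt ID values are placeholders; only the amounts come from the receipt discussed in the talk.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class LineItem:
    # Field names are hypothetical; the talk only names the kinds of information.
    description: str
    total_amount: float
    count: Optional[int] = None
    price: Optional[float] = None


@dataclass
class Receipt:
    # The three header fields mentioned in the talk.
    date: str
    receipt_id: str
    total_amount: float
    # Zero, one, or many line items.
    line_items: List[LineItem] = field(default_factory=list)


record = Receipt(
    date="2021-01-01",    # placeholder value, not from the talk
    receipt_id="R-0001",  # placeholder value, not from the talk
    total_amount=5.10,
    line_items=[
        LineItem("Postage stamps", 4.50, count=3, price=1.50),
        LineItem("Postage stamps", 0.60, count=1, price=0.60),
    ],
)

# The system of record data, in JSON form.
print(json.dumps(asdict(record), indent=2))
```

Note that the record says nothing about where anything sits on the page. It holds only the extracted information, which is also all that is available as a training target.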

Philosophy of end-to-end deep learning (12:35)

If you think about how classical machine learning systems were built, the first thing that we would do is decompose the problem into sub-pieces. And those sub-pieces in this case might include, for example, something to classify bounding boxes. It might include something to identify tables or extract rows and columns of tables. Each one of them then becomes its own machine learning problem. And in order to solve that machine learning problem, we have to define some learning objectives, and we have to find some data, and then we have to train that model. And so this creates some challenges, right? This data does not necessarily naturally exist, right? For these documents, we don't necessarily have annotated bounding boxes that tell us where the information is in the document. The data doesn't tell us which bounding boxes correspond to the information we want to extract. And so in order to train in this classical approach, we would have to create that data. We also have to define these objectives. And there may be a mismatch between the objectives we define for one piece of this problem and another. And that creates friction and error propagation as we start to integrate these pieces together. And then finally, typically these systems have lots of post-processing at the end that is bespoke to the specific document type and is highly engineered. So what happens is these systems are very, very brittle. If we change anything about the system, we have to change many things about the system. If you wanna take the system and apply it to a new problem, we typically have to re-engineer that post-processing. And for us, where we have thousands of different types of documents in 100-plus languages, we simply cannot apply engineering effort to every single one of these problems. We have to be able to apply exactly the same approach, exactly the same software, to every single one of them. 
And really, to me, this is the core value of deep learning as a philosophy: it's about end-to-end training. We have this idea that we can train the whole system end-to-end based on the problem we're trying to solve and the data we fundamentally have, the natural data that we have available. So again, we begin by decomposing the problem into sub-pieces, but we build a deep network, a network component, that corresponds to each of those sub-problems. And then we compose those networks into one large network that we train end to end. And this is great because the integration problem appears once, when we design this network, and it's easy to maintain. The data acquisition problem goes away because we're designing this as an end-to-end approach to model the natural data that exists for the problem. And of course, there are some challenges in terms of how we design these networks. And really, that's the key challenge that arises in this case: how do we build these building blocks and how do we compose them? And so we're gonna talk about how we do this in this case. It's about composing deep network components to solve the problem end to end. So here, what we do is we treat this problem as a parsing problem. We take in the documents and we're gonna parse them in two dimensions. And this is where some of the key innovations are, in parsing in two dimensions, where we disambiguate the different parses using deep networks. And so the deep network is going to tell us, of all the potential parse trees for one of these documents, which is the most probable, or the one that matches the data the best. And then we're going to simply read off from that parse tree the system of record data, right? No post-processing. We just literally read the parse tree and we read off that JSON data as output. So again, we have these documents on the left-hand side, right? These input documents, we run them through OCR to get the bounding boxes that contain tokens. 
So that's the input to the system. And then the output of the system is this JSON record that describes the information that we have extracted from the document. It describes the information we extracted. It doesn't describe the layout of the document. It just describes the information we have extracted. Okay, so that's the fundamental approach.

Context free grammars (CFG) (16:38)

And the machinery that we're gonna use here is context-free grammars. Anytime you wanna parse, you have to have a grammar to parse against. Context-free grammars, for those of you with a computer science background, are really the workhorse of computer science. They're the basis for many programming languages. And they're nice because they're relatively easy to parse. We won't get into the technicalities of a context-free grammar. I think that the key thing to know here is that they consist of rules. Rules have a left-hand side and a right-hand side. And the way we think about this is we can take the left-hand side and think of it as being made up of or composed of the right-hand side. So a line item is composed of a description and a total amount. A description can be simply a single token, or it can be a sequence of descriptions, right? A description can be multiple tokens, and the way we encode that in this kind of grammar is in this recursive fashion. Okay, so now we're going to apply this grammar to parse this kind of JSON. And we do have to augment this grammar a little bit to capture everything we care about, but still this grammar is very, very simple. It's a small, simple grammar, really to capture the schema of the information we want to extract. So now let's talk about how we parse a simple line item. We have a simple line item. It's postage stamps. We have three stamps, each at $1.50, for a total of $4.50. And so the first thing we do is, for each of these tokens, we identify a rule where that token appears on the right-hand side, and we replace it with the left-hand side. So if we look at this postage token, we replace it by D for description. We could have replaced it by T for total amount or C for count or P for price. In this case, I happen to know that D is the right symbol, so I'm going to use that for illustration purposes. 
But the key observation is that there is some ambiguity here, right? It's not clear which of these substitutions is the right one to do. And again, this is where the deep learning is going to come in to resolve that ambiguity. So the first stage of parsing is that everywhere we see the right-hand side of a rule, we substitute the left-hand side of the rule, resolving the ambiguity as we go. And the next step: by construction and for technical reasons, the rules in these grammars either have a single token on the right-hand side, or they have two symbols on the right-hand side. And so since we've dealt with all the tokens, we're now dealing with these pairs of symbols. And so we have to identify pairs of symbols that we substitute again with the left-hand side. So this is description description, which we substitute with a description, and likewise for count and price, which we substitute with U. Okay, so we just repeat this process and get a full parse tree where the final symbol here is a line item. So that tells us this whole thing is a line item made up of a description, count, price, and total amount. Okay, so as I said, there's some ambiguity. And one place where there's ambiguity here is 3 and $1.50. How do we know that this is in fact a count and a price? Right, this could just as easily have been a description and a description. So resolving this ambiguity is hard, but this is the opportunity for learning. This is where the learning comes in, which can learn that typically $1.50 is probably not part of the description. It probably relates to some other information we want to extract. So that's what we want the learning to learn. And so the way we do this is we associate every rule with a score. 
So each rule has a score, and then we try to use rules that have high scores so that we produce in the end a parse tree that has a high total score. So what we're actually going to do is we're going to model these scores with a deep network. So for every rule, we're going to have a deep network corresponding to that rule, which will give the score for that rule.
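
To make the substitution process concrete, here is a small sketch of a grammar in the form described above, where every rule has either a single token or two symbols on the right-hand side. The exact rule set is an assumption pieced together from the examples in the talk, not a copy of the grammar on the slides, and the greedy leftmost reduction is only for illustration. The talk's approach scores the alternatives instead of picking the first match.

```python
# Symbols that a single token can be replaced by:
# D = description, T = total amount, C = count, P = price.
TOKEN_SYMBOLS = ["D", "T", "C", "P"]

# Two-symbol rules as (left-hand side, right-hand side).
# This rule set is illustrative, not the exact grammar from the slides.
BINARY_RULES = [
    ("D", ("D", "D")),  # a description can be a sequence of descriptions
    ("U", ("C", "P")),  # count followed by price
    ("Q", ("D", "U")),  # description followed by count and price
    ("L", ("Q", "T")),  # line item with count, price, and total amount
    ("L", ("D", "T")),  # line item with just a description and total amount
]


def reduce_once(symbols):
    """Replace the leftmost adjacent pair that matches some rule's right-hand side."""
    for i in range(len(symbols) - 1):
        pair = (symbols[i], symbols[i + 1])
        for lhs, rhs in BINARY_RULES:
            if rhs == pair:
                return symbols[:i] + [lhs] + symbols[i + 2:]
    return None  # no rule applies anywhere


def reduce_all(symbols):
    """Apply reduce_once until one symbol remains or no rule applies."""
    current = list(symbols)
    steps = [current]
    while len(current) > 1:
        nxt = reduce_once(current)
        if nxt is None:
            break
        current = nxt
        steps.append(current)
    return steps


# "Postage stamps 3 $1.50 $4.50" with two different token taggings.
for tags in (["D", "D", "C", "P", "T"], ["D", "D", "D", "D", "T"]):
    print(" -> ".join(" ".join(step) for step in reduce_all(tags)))
```

Both taggings reduce to a line item L, which is exactly the ambiguity the rule scores are meant to resolve: the grammar alone cannot say whether 3 and $1.50 are a count and a price or part of the description.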

Parsing with deep learning (20:55)

Okay? Now, let me illustrate that on one simple example here. We have $1.50, and this could be a description, a total amount, a count, or a price. And we might intuitively think, well, this should be biased towards total amount or price because it's a monetary value. But the way we're going to resolve this is we're going to apply the deep networks corresponding to each of these rules. There are four deep networks. We're going to apply these deep networks, and each of them will return a score. And we expect that over time, they will learn that the deep network for total amount will have a higher score, and the deep network for price will have a higher score. So that's fundamentally the idea for this bottom set of rules, these token-based rules. For the more involved rules, where we have two terms on the right-hand side, we have a similar question about resolving ambiguity. And so there's two, I think, important insights here. The first is that, you know, we do have this ambiguity as to how we tag these first tokens. We could do C P or we could do D P. But we quickly see that there is no rule that has D P on the right-hand side. And so the grammar itself helps to correct this error, right? Because there is no rule that would allow this parse, the grammar will correct that error. And so the grammar allows us to impose some constraints about the kind of information these documents contain. It allows us to encode some prior knowledge of the problem, and that's really important and valuable. And the second kind of ambiguity is where, you know, it is allowed, but maybe it's not the right answer. So in this case, C P could be replaced by a U, and we're going to evaluate the model for this rule based on the left-hand side of this tree and the right-hand side of the tree. So this is gonna have two inputs, the left-hand side and the right-hand side. And likewise for this rule, which tries to model this as a description description. 
And so each of these models will have a score, which will help us to disambiguate between these two choices. So the question then arises, how do we score a full tree? And I'm gonna introduce a little notation for this. This is hopefully not too much notation, but the idea is we're gonna call the full parse tree T, and we're gonna denote the score for that tree by C(T). And I'm going to abuse notation here a little bit and reparameterize C as having three parameters. The first is the symbol at the root of the tree. And the other two are the span of the sentence, the span of the input covered by that tree, right? In this case, it's the whole sentence, so we use the indices 0 to n. Now, the same notation works for a subtree, right? Where again, the first term is the symbol at the root of that subtree, and we have the indices of the span. Here, it's not the whole sentence, it just goes from i to j. Okay, so this is the score for a subtree. Let's see if there's questions here real quick. So this is the score for a subtree, and we're going to define that again in terms of the deep network corresponding to the rule at the root of that subtree. But it's also made up of these other subtrees, so we're going to have terms that correspond to the left-hand side and the right-hand side of this tree. So we've defined this score now recursively in terms of the trees that make it up. And I do see there's a question here, and it says: by taking into account the grammar constraints, does this help the neural network learn and be more sample efficient? Yes, exactly. The point is that we know things about the problem that allow the network to be more efficient because we've applied that prior knowledge. And that's really helpful in complex problems like this. 
So, okay, as I said, we've defined a score for the entire tree in terms of these three terms. And it's key to note here that actually, what we want is for this to be the best possible score that could result in D at the root. And so in fact, we're gonna model the score as the max over all possible rules that result in a D and over all possible ways to parse the left tree and the right tree. And this might seem challenging, but actually we can apply dynamic programming fairly straightforwardly to find this in a fairly efficient manner. Okay, so we've now defined a scoring mechanism for parse trees for these line items. And what happens is that we're then going to be able to choose among all possible parse trees using that score. And the one with the highest score is the one that we consider the most likely. The one that's most likely is the one that contains the information we care about. And then we just read that information off the parse tree. And so now we're going to train the deep network in order to select those most likely or most probable parse trees. So just a reminder, this first term, as I've mentioned, is a deep network. There's a deep network for every single one of these rules. And then we have these recursive terms that again are defined in terms of these deep networks. So if we unroll this recursion, we will build a large network composed out of the networks for each of these individual rules. And we'll build that large network for each and every document that we're trying to parse. And so the deep network is this dynamic object that is being composed out of solutions to these subproblems in order to identify which is the correct parse tree. So how do we train that? It's fairly straightforward.

Learning objective and training (27:10)

We use an idea from structured prediction. And the idea in structured prediction is we have some structure, in this case, a parse tree, and we want to maximize the score of good parse trees and minimize the score of all other parse trees. So what this loss function here is trying to do, or this objective is trying to do, is maximize the score of the correct parse trees, the ones that we see in our training data, and minimize the scores of all other parse trees, the ones that we don't see in our training data. And this can be optimized using backpropagation and gradient descent or any of the machinery that we have from deep learning. So this now becomes a classic deep learning optimization problem. And the result of this is an end-to-end system for parsing these documents. Now, what I haven't talked about is the fact that these documents are actually in two dimensions. So far, I've just focused on one-dimensional data. So now I'll hand over to Freddie, and he will talk about how we deal with this two-dimensional nature of this data. Freddie.
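
The talk does not spell out the exact loss, so the following is only one common form of such an objective, a perceptron-style structured loss: the score of the current best tree minus the score of the correct tree. The linear `score` function and the two-dimensional feature vectors are assumptions standing in for the composed deep network, so that the update can be written by hand instead of with backpropagation.

```python
def score(weights, features):
    """Linear stand-in for the tree score produced by the composed network."""
    return sum(w * f for w, f in zip(weights, features))


def train(candidates, gold, weights, lr=0.1, max_steps=100):
    """Raise the gold tree's score and lower the best rival's until gold wins."""
    for _ in range(max_steps):
        best = max(candidates, key=lambda name: score(weights, candidates[name]))
        loss = score(weights, candidates[best]) - score(weights, candidates[gold])
        if loss <= 0.0:
            break  # the gold tree already has the highest score
        # Gradient of the loss with respect to the weights.
        grad = [b - g for b, g in zip(candidates[best], candidates[gold])]
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


# Two candidate parses of "3 $1.50", each with hypothetical features.
candidates = {
    "count_price": [1.0, 0.0],
    "description_description": [0.0, 1.0],
}
initial = [0.0, 0.5]  # starts out preferring the wrong parse
learned = train(candidates, gold="count_price", weights=initial)
print(learned)
```

After a few updates the correct parse has the higher score. In the setting described in the talk, the same idea applies with the tree scores coming from the dynamically composed network and the gradients coming from backpropagation.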

Dimensional parsing (28:21)

Well, thank you, Nigel. Okay, so this is the portion of the receipt that we were showing earlier on, and we are going to focus on the line items within this blue bounded region over here. So moving along, what we do is we apply OCR to the receipt and we get the bounding boxes that represent each token. On the left side are the line items as shown in the receipt. On the right side are the annotations that we are trying to parse. We're trying to get the parse to match these annotations. What you will see is that there are many boxes over here that we consider irrelevant because they wouldn't be shown in the annotations and they wouldn't be part of the information that we want to extract. So let's call these extra boxes, and to simplify the parsing, I'm going to remove them. And then after going through the 2D parsing, we're going to come back and see how we handle these extra boxes. So before talking about 2D parsing, let me just motivate why we need to do 2D parsing. So what you see over here is the 2D layout of the first line item. And if we were to take this line item and reduce it into a 1D sequence, what happens is that the description, which was originally contiguous in the 2D layout, is no longer contiguous in the 1D representation. You can see that the yellow region is contiguous in the 2D layout and is no longer contiguous in the 1D sequence. That's because there's this blue region over here, the 60 cents, which has truncated the description. So while we can add additional rules to handle this situation, it typically wouldn't generalize to all situations or all cases. A better solution would actually be to parse it in 2D. And parsing it in 2D would be more consistent with how we humans interpret documents in general. So we begin with the tokens. And as Nigel mentioned, we put them through the deep network. The deep network is going to give us what it thinks each token represents. 
So in this case, we get the classes for the tokens, and now we can begin to merge them. Beginning with the token in the top left corner, which is the word postage: we know that postage is a description. The most logical choice is to combine it with the token that is nearest to it, and that's stamps. So we can parse it in a horizontal direction. And we can do that because there's a rule in the grammar that says two description boxes can merge to form one description. Now, the next thing to do is we can either parse it to the right, as we can see over here, or we can parse it in a vertical direction. So which one do we choose? If we parse it horizontally, like what we showed over here, this works because there is a rule in the grammar that says that a line item can be made up simply of a description and a total amount. But what happens is that all the other boxes are left dangling there, and then they wouldn't belong to any line item. And in fact, this wouldn't be an ideal parse because then it wouldn't match the annotations that we have. So an alternative is to parse it in a vertical direction, as shown over here. What's important to note is that we do not hard-code the direction of the parse. Instead, the deep network is going to tell us which is the more probable parse. In this case, you can combine them together because we know that postage stamps is a description, and the deep network has told us that this string of numbers over here is a description, so you can join them to be a description again. The next thing to do is to look at the token 1 over here and 60 cents. We know that 1 is a count and 60 cents is a price, and we can join them because we have a rule in the grammar that says you can join a count and a price to get a symbol U. Then the next thing is we can join a description and a symbol U, and we get a symbol Q. Finally, let's not forget that we still have a token over there. 
We know that a token is a total amount. Finally, we can join horizontally. And as a result, get the whole line come over here. So moving along, recall that early on, we have simplified the parsing problem by removing all these extra boxes. They're not there. But what if we put them back? If we put them back, it complicates the parsing. It's not in the annotations. And I didn't show it earlier. So what do we do?
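Setting the extra boxes aside for a moment, the merge steps walked through above can be sketched as follows. This is only an illustration under stated assumptions: the `Box` class, the rule table, the symbol names `count_price` and `item_body`, and all coordinates are hypothetical. The sequence of merges is replayed by hand here, whereas in the talk the deep network scores which merge is more probable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    label: str
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


# Hypothetical binary rules: (left or top label, right or bottom label) -> parent.
RULES = {
    ("description", "description"): "description",
    ("count", "price"): "count_price",
    ("description", "count_price"): "item_body",
    ("item_body", "total_amount"): "line_item",
}


def direction(a, b):
    """Return how b sits relative to a, or None if neither case applies."""
    if a.x1 <= b.x0:
        return "horizontal"  # b lies entirely to the right of a
    if a.y1 <= b.y0:
        return "vertical"    # b lies entirely below a
    return None


def merge(a, b):
    """Merge two boxes if a rule allows it; return the parent box and direction."""
    parent = RULES.get((a.label, b.label))
    way = direction(a, b)
    if parent is None or way is None:
        raise ValueError(f"cannot merge {a.label} with {b.label}")
    box = Box(parent, f"{a.text} {b.text}",
              min(a.x0, b.x0), min(a.y0, b.y0),
              max(a.x1, b.x1), max(a.y1, b.y1))
    return box, way


# Made-up layout for one line item.
postage = Box("description", "POSTAGE", 0, 0, 1, 1)
stamps = Box("description", "STAMPS", 1, 0, 2, 1)
total = Box("total_amount", "0.60", 5, 0, 6, 1)
number = Box("description", "000123", 0, 1, 2, 2)
count = Box("count", "1", 0, 2, 1, 3)
price = Box("price", "0.60", 1, 2, 2, 3)

directions = []
desc, way = merge(postage, stamps); directions.append(way)
desc, way = merge(desc, number); directions.append(way)
cp, way = merge(count, price); directions.append(way)
body, way = merge(desc, cp); directions.append(way)
item, way = merge(body, total); directions.append(way)
print(item.label, "|", item.text, "|", directions)
```

Note that the rule table here is a simplified subset: it omits, for example, the rule mentioned in the talk that lets a line item be formed with simply the total amount.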

Handling noise in the parsing (33:20)

OK. So from earlier on, we already know that "postage" and "stamps" are descriptions, and we can join them to become a description again. Then there are these extra words over here, the "transaction number" words. What do we do about them? We introduce a new rule in the grammar, and the new rule says we allow a token to be noise. So noise becomes a class that the deep network can return to us. Say we know that the "transaction number" words are classified as noise. Then the next thing to do is join two noise tokens together to get one noise token, because we introduced a rule in the grammar that allows us to do that. Next, we're going to add another rule that says a description can be surrounded by some noise. In this case, I've added a rule over here, as you can see. The exclamation mark here represents a permutation on this rule. What this means is that we're not going to put a constraint on the order in which these right-hand-side symbols appear: noise can come before the description, or the description can come before the noise. In the example shown over here, the noise comes before the description, and we can combine the noise and the description together to get another description, the symbol D over here. Moving along, I can combine two descriptions and I get a description. So you can see that this is how we handle irrelevant tokens in the documents. Continuing with this logic, eventually we end up with the right parse trees for the line items in the documents. And this is what we get, matching the annotated information.
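One way to picture the permutation marker is as shorthand for listing every ordering of the right-hand side. The sketch below assumes exactly that reading. The `expand` helper, the rule table, and the left-to-right `reduce_sequence` loop are hypothetical, and the loop is a deliberately naive illustration, not the parser used in the talk.

```python
from itertools import permutations


def expand(parent, rhs, permute=False):
    """Turn one rule into ordered binary rules; with permute, allow any order."""
    orders = set(permutations(rhs)) if permute else {tuple(rhs)}
    return {order: parent for order in orders}


rules = {}
rules.update(expand("noise", ["noise", "noise"]))
rules.update(expand("description", ["description", "description"]))
# The permuted rule: noise may sit on either side of a description.
rules.update(expand("description", ["noise", "description"], permute=True))


def reduce_sequence(labels):
    """Repeatedly apply the first rule that matches two adjacent labels."""
    labels = list(labels)
    changed = True
    while changed and len(labels) > 1:
        changed = False
        for i in range(len(labels) - 1):
            parent = rules.get((labels[i], labels[i + 1]))
            if parent is not None:
                labels[i:i + 2] = [parent]
                changed = True
                break
    return labels


# "POSTAGE STAMPS", then the two "transaction number" words, then the number string.
example = ["description", "description", "noise", "noise", "description"]
reduced = reduce_sequence(example)
print(reduced)
```

With the noise rules in place, the irrelevant tokens are absorbed and the whole run collapses to a single description, so the rest of the grammar never has to mention them.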

Experimental results (35:23)

Okay, so finally, I would like to talk about some experimental results that we have. Our firm is mostly focused on invoices, and most of these invoices tend to be confidential documents. So while we believe that there could be other labs and other companies working on the same problem, it's really very hard to find a public data set to compare our results with. Fortunately, there is this interesting piece of work from Clova AI, a lab within a company in South Korea called Naver Corp. They also look at the problem of line items. And to their credit, they have released a data set of receipts that they have written up, and their paper is available as a preprint. The main difference between our approach and theirs is that they require every bounding box within the receipt to be annotated. That means for every bounding box, you go in and say this bounding box belongs to this class, and every bounding box needs to have its associated coordinates. In our case, all we do is rely on the system records in a JSON format, which doesn't have the bounding box coordinates. So effectively, we are relying on less information than their approach uses. And based on the results, we achieve pretty comparable results. As far as possible, we tried to implement the metrics as closely as possible to what they described. So I guess with that, I can hand it over to Nigel.

Great, thanks, Freddie. Let me see if I have control back. Okay, I think I do. So a number of people helped us with this work, and I want to acknowledge that help. And please do get in touch. We do lots of interesting AI work at EY, all around the globe, and we are hiring, so please do reach out to us. And we referenced some other work during the talk.

Interactive Session

Question and answering (38:00)

Lots of really interesting, great papers here. Definitely worth having a look at. And with that, there were a couple of questions in the chat that I thought were really great. So let me try and answer them here, because I think they also help to clarify the content.

So there was one question, which is: can we let the AI model learn the best grammar to parse, rather than defining the grammar constraints? It's a really good question, but actually the grammar comes from the problem. The grammar is intrinsic to the problem itself, and it can be automatically produced from the schema of the data we want to extract. So it's natural for the problem. It's not something we have to invent.

There was another question: is it possible to share networks across the rules? Again, really good question. I think there are a few ways to think about this. Number one is that each of these rules has its own network, and we share the weights across every application of that rule, whether it is applied multiple times in a single parse tree or across multiple parse trees from multiple documents. The other is that oftentimes we will leverage things like a language encoder, BERT for example, to provide an embedding for each of the tokens, and so there are lots of shared parameters there. So there are many ways in which these parameters are shared, and it ends up being possible to produce relatively small networks to solve even really complicated problems like this.

And there was a question as to whether the 2D parsing is done greedily. Again, really good question. The algorithm for parsing CFGs leverages dynamic programming, so it's not a greedy algorithm. It actually produces the highest-scoring parse tree, and it does that in an efficient manner. Naively, that algorithm looks like it would be exponential, but with the application of dynamic programming, I believe it's n cubed.
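As a rough illustration of why dynamic programming avoids greedy choices, here is a minimal one-dimensional CYK-style sketch that keeps the best score for every span and symbol. It is a simplification: the parser in the talk works over 2D boxes and takes its scores from a deep network, while the symbols, rules, and integer scores below are made up for the example.

```python
import math


def build_tree(chart, i, j, symbol):
    """Follow backpointers to recover the best tree for a span."""
    _, back = chart[(i, j, symbol)]
    if back is None:
        return (symbol, i)  # leaf: symbol and token index
    k, left, right = back
    return (symbol, build_tree(chart, i, k, left), build_tree(chart, k, j, right))


def best_parse(token_scores, rules, root):
    """Return (score, tree) for the highest-scoring parse, or None if there is none."""
    n = len(token_scores)
    chart = {}  # (start, end, symbol) -> (best score, backpointer)
    for i, scores in enumerate(token_scores):
        for symbol, score in scores.items():
            chart[(i, i + 1, symbol)] = (score, None)
    # Three nested loops over span length, start and split point: cubic in n.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for parent, left, right, rule_score in rules:
                    l = chart.get((i, k, left))
                    r = chart.get((k, j, right))
                    if l is None or r is None:
                        continue
                    score = l[0] + r[0] + rule_score
                    if score > chart.get((i, j, parent), (-math.inf, None))[0]:
                        chart[(i, j, parent)] = (score, (k, left, right))
    if (0, n, root) not in chart:
        return None
    return chart[(0, n, root)][0], build_tree(chart, 0, n, root)


# Hypothetical log-style scores per token: D=description, C=count, P=price, T=total.
token_scores = [
    {"D": -1},             # POSTAGE
    {"D": -2},             # STAMPS
    {"C": -1, "D": -30},   # 1
    {"P": -3, "T": -2},    # 0.60 (locally, T scores higher than P)
    {"T": -2, "P": -14},   # 0.60
]
# Hypothetical rules: (parent, left, right, rule score).
rules = [
    ("D", "D", "D", 0),
    ("CP", "C", "P", 0),
    ("B", "D", "CP", 0),
    ("I", "B", "T", 0),
]
score, tree = best_parse(token_scores, rules, "I")
print(score, tree)
```

In this toy input the fourth token scores higher as a total than as a price when looked at alone, yet the best complete tree labels it as a price, which is the kind of decision a greedy labelling would get wrong.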
And then there's a question: do the rules make any attempt to evaluate the tokenized data, for example, total actually equaling price times count, when evaluating the likelihood of a tree? Again, really good question. We have not done that yet, but it's something that we do have in mind to do, because that's a really useful constraint. It's something we know about the problem: a line item total tends to equal the unit count times the unit price. And so that constraint should be really valuable in helping with a problem like this.

And then a final question: are grammar rules generalizable to different document types? So again, these grammar rules are fundamental or natural to the problem, and they correspond to the schema of the information we want to extract. So that notion of generalizability of the grammar between document types is less important. So thank you. I'm happy to answer other questions. Hand it back to you, Alex.
