The paper End-to-end Differentiable Proving describes a break through way for computers to reason, aka learn first order logic from a knowledge base. Learning to answer questions such as what country is next to Germany or who was George Washington's nephew? All without being explicitly programmed to do so. The technique is called Neural Theorem Provers (NTPs) and utilizes a numerical representation of words, relationships and questions as well as symbolic one. This allows for the learned rules to be interpreted and encoded (if desired) by humans. It's another step towards the goal of software being able to reason like people and make logical conclusions given a dataset.
Artificial Intelligence applications today such as AlphaGo or Google’s VisionAPI are highly domain specific and lack any intuitive understanding of the predictions produced by these complex deep neural networks. The field of Natural Language Understanding (NLU) is rapidly advancing in an effort to build systems which are capable of reasoning about passages of text and answering questions about relationships of data stored within a knowledge base. Setting aside the philosophical question of what it means for software to “understand” a fact, we can see the ability for these systems to answer questions which require transitive reasoning or following a logical chain. The canonical example here is “Who is the grandfather of Bart Simpson?” Given the family tree this is easily answered by a human, but has yet been difficult for computers to learn without the specific relationship being coded in.
In 2017, the field took a big step forward with the insights described in the paper, End to End Differentiable Proving by Sebastian Riedel and Tim Rocktäschel. The authors explain a new technique which dynamically generates neural network modules to learn vector representations of entities (e.g. Bart Simpson, Homer, Springfield, etc) and relationships (e.g. Grandfather of, parent of, lives in, etc) allowing the software to answer questions requiring first order logic. They dub this new model, Neural Theorem Provers (NTPs). I know that’s a mouthful, but we’ll break that down step-by-step below.
What are the benefits?
In order for software to be able to reason there needs to be a way for applications to learn concepts from data and discover how things are related to one another. Currently, all machine learning enabled software relies on its creator to decide the optimization function. Which parameters matter and which do not? We need systems that can think through the logical consequences of its actions and be able to identify logical inconsistencies. Although this is a long way off, the first step is for the computer to learn transitive reasoning. For example, if A is larger than B and B is larger than C, A must be larger than C. Many systems and computer languages like Prolog were created to do just this, but now researchers are creating systems that learn logic from the data rather than requiring a person to program it in explicitly.
Let's review some quick definitions, then dive deeper into some of the prior developments leading up to these achievements, how they work and what this will enable in the future.
Subsymbolic: A numerical representation of a word, phrase or question that just contains a list of numbers usually in the form of a vector. For example, word vectors in word2vec.
Vector Space: All possible combinations of lists of numbers used for subsymbolic representations
Atom: a predicate symbol and a list of terms
Transitive: In math, if a relationship applies between two members in successive order than it also applies to members they connect. For example, if A has B and B has C, then A has C.
Logic chains: Connecting various facts through shared expositions.
Knowledge Base: A graph database which contains the relationships between entities. For example, Sally is Chris’ sister, Denver is located in Colorado, or the sky is blue. See this post for more info.
Prior work to build systems with transitive reasoning comes in two forms: 1) Statistical Relational Learning and 2) Neural Multi-hop reasoning. These each have their own strengths and weaknesses.
Statistical Relational Learning are models which cluster questions and answers together to identify patterns, as described in the paper, Statistical Predicate Invention. They contain symbolic representations of each entity and allow for easy interpretation of the logic rules and addition of new rules to capture domain specific knowledge.
The second class of solutions are neural multi-hop reasoning models, which learn reasoning chains from a vector space or also learn subsymbolic representations of questions before comparing with answers. The strength of these models is that related entities such as grandfather and grandpa or bigger and larger can be leveraged when searching for valid logic chains. The downside is that there are no symbolic representations of the logic chains. Therefore, it is very difficult to add rules and interpret the logic patterns which are provided. One example of this type of model is the Compositional Vector Space Models for Knowledge Base Completion.
The authors introduce a novel technique, called Neural Theorem Provers (NTPs), for learning sub-symbolic (aka vector representations) for entities and questions while maintaining the capability to relate those vectors back to their symbols. For example, the word truck could be represented by a 100 dimensional vector and after the logical rules are learned from the knowledge base, that vector can still be converted back into the word truck.
The method utilizes a backward chaining algorithm similar to the one implemented by Prolog’s logic solver. This algorithm is used to create neural networks capable of proving queries to a Knowledge Base. A knowledge base contains a set of relationships between entities, such as Augustine Washington was George Washington’s father. These neural network proofs are differentiable allowing the model to be trained end-to-end. In other words, all that needs to be done is feed it a knowledge base and it will do the rest.
The learned vectors representing each entity and predicate allow for greater transitivity because synonyms can be interchanged similar to how people naturally think. For example, the words larger and bigger could be interchanged within a query or fast and speedy or truck and automobile.
Since the system learns the rules, rather than just a model to predict the answer, these latent rules can be decoded into human interpretable symbolic rules. Thus allowing more rules to be hand crafted and included.
After introducing these concepts, the paper continues to explain how these NTPs are constructed, what optimizations were done to improve performance and a summary of the experimental results.
The Nuts & Bolts of the Differentiable Prover
Neural Theorem Provers (NTPs) consist of modules which take discrete objects and a proof state as input and then return a new proof state.
The NTPs are recursive in nature, traversing the combinations of predicates and related terms until matches are found. This is done through three key functions, AND, OR and Union. The OR module checks all possible combinations and if any one of them is true, it returns a valid proof. Within the OR function, a separate AND function is called for each combination of predicates and terms, but this function only returns true if ALL containing proofs are true. The unification module starts it off by combining various predicates and terms to see if they yield a valid proof.
The tree below, taken from the paper, outlines the example for the grandFatherOf predicate:
Each module dynamically generates a neural network which can learn which are true and which are not. This allows for the system to be trained end-to-end and all that the user needs to do is provide a dataset for it to learn from.
How did it do?
The performance of this new method was compared to prior methods using datasets containing relationships between Countries, Kinship, Nations and the Unified Medical Language System (UMLS). For example, in the countries dataset the software had to learn to answer questions showing whether a city or country was located within a region or sub-region.
The vanilla or plain Neural Theorem Prover (NTP) was not able to beat the standing champion, ComplEx which is a Neural Link Predictor. However, the authors modified the NTP to utilize the subsymbolic vector representations learned by ComplEx as a starting point then allow NTPs to focus on the multi-hop reasoning where it shines. This new adaptation was dubbed, NTP-lambda. The new loss functions looks like this:
The result was an increase in accuracy for the 2nd class of questions (S2) scored from 87.95 to 93.04 and the third class' (S3) score went from 48.44 to 77.26. Also, it was able to score 100% for all the first order logic questions! NTP-lambda performed similarly better on the Nations dataset, but failed to outperform ComplEx on the kinship or UMCL datasets.
Despite not performing as well in one of four categories, NTPs provide the added benefit of producing human interpretable rules.
The authors plan to address some of the computational limitations introduced in this method by exploring hierarchical attention, reinforcement learning and a monte carlo tree search as applied by the AlphaGo system. There’s also interest in applying this method to mathematical proofs both in symbolic and written form.
This achievement also opens the door for people in industry to begin applying this method to some of the large Knowledge Bases created by Google, Wikipedia or Amazon. Such a system could theoretically answer questions about relationships not explicitly stated within those databases. For example, where was the first person who hiked Mount Everest born?
What types of questions would you like a digital assistant be able to answer for you?