I haven’t written to the email list in years. This is what I’m working on. Please feel free to unsubscribe below.
This is part 1 (maybe of more?) in a series on tackling one of the most audacious problems out there: using GPT/LLM-style tools against data sources. I’m thinking/learning out loud on a lot of this and sharing here.
Combining Large Language Models (LLMs) and Knowledge Graphs is our primary approach for enhancing Retrieval-Augmented Generation (RAG) systems.
The Power of LLMs and Knowledge Graphs in RAG
I’ve been working with Neo4j for 3+ years now. I’ve been doing NLP things with Python and Neo4j in my free time, including working with other language models pre-GPT-4.
Cracking the code of Knowledge Graph Retrieval-Augmented Generation (KG-RAG) is key to getting better context-aware responses from LLMs like GPT-x and Claude.
- LLMs provide natural language understanding and generation capabilities.
- Knowledge Graphs offer structured, interconnected data representations.
- LangChain / LlamaIndex give us a low-friction way to chain together LLM agents and data sources into a mostly linear workflow.
- LangGraph (tooling by LangChain) allows us to think about how to weave LLM agent work into more of an ecosystem of workers, teams, and tools.
I am working on using LangGraph to create a “virtuous cycle” where LLM agents help enrich our knowledge graphs, ensure quality of data and information, and leverage knowledge graph contextual information to achieve more than they can relying on “general knowledge” from the model.
This combo allows for more intelligent information retrieval and generation, leading to better systems and quality outputs in advanced AI-powered applications like the ones we’re building at Inbound Found.
Semi-Automating Knowledge Graph Creation / Refinement
Creating and maintaining a knowledge graph is expensive: data engineering, db administration, pipelines. It’s just a lot, and it feels like the most time-consuming part of the process. However, by leveraging LLMs and other AI techniques, we can semi-automate various aspects of this workflow:
- Information Extraction: Use NER (Named Entity Recognition) models to automatically extract entities and relationships from unstructured text. (A minimal sketch of this step follows this list.)
- A lot of methods and tools for naming entities in text started feeling obsolete when GPT-4 came out. But then it became apparent that it’s just too expensive to ask GPT to extract the entities. Because of this, I’m going back to the basics and looking at how we can system-prompt cheaper LLMs with instructions like inside-outside-beginning (IOB) tagging to minimize the tokens/effort of getting to an 80/20 on entity recognition.
- We also have access to domain-specific taxonomies/terms as well as API data like the Wikipedia APIs, the Google Knowledge Graph API, and a few others, so we can “guess” at an entity and then cross-check it against a data source we can rely on.
- Entity Linking: Employ entity disambiguation techniques to link extracted entities to existing knowledge base entries. The only tricky part here is getting the data model right so we can have LLM agents “query” our Knowledge Graph without getting confused. I’ve found that the simpler the data model, the easier it is to ask GPT a question about the graph. I can say “Does Page A link to Page B?” and it can go query
MATCH (a:Page)-[:LINKS_TO]->(b:Page) RETURN *
Much easier than something like
MATCH (a:Page)-[:LINKS_TO]->(x:Page)-[:PROBABLY_SAME_AS]->(b:Page) RETURN *
- Relationship Inference: Utilize LLMs to infer potential relationships between entities based on context and existing knowledge. This is where I’m having a lot of fun with prompt engineering. I’ll come back to this.
- Data Model Evolution: Refine/expand KG schema based on new info/patterns/data as we build our automated extraction process.
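As a minimal sketch of that first extraction step, here’s spaCy NER feeding entities into Neo4j. The credentials, node label, and MERGE pattern are illustrative assumptions, not our production setup:

# Minimal sketch: spaCy NER as a cheap first pass, then MERGE entities into Neo4j.
# Requires: python -m spacy download en_core_web_sm
import spacy
from neo4j import GraphDatabase

nlp = spacy.load("en_core_web_sm")  # small, cheap model for an 80/20 first pass

text = "Abraham Lincoln was the president of the United States."
entities = [(ent.text, ent.label_) for ent in nlp(text).ents]
# e.g. [("Abraham Lincoln", "PERSON"), ("the United States", "GPE")]

# Hypothetical connection details and data model.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for name, label in entities:
        # MERGE keeps exact-duplicate strings from creating duplicate nodes;
        # near-duplicates ("Abe Lincoln") are the disambiguation problem below.
        session.run(
            "MERGE (e:Entity {name: $name}) SET e.ner_label = $label",
            name=name,
            label=label,
        )
driver.close()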
“Entity Disambiguation”
I’m so glad there is a word for this. I’ve been trying to solve what this means in a vacuum without realizing it’s a thing. Getting this right ensures that extracted entities are correctly linked to their corresponding unique entries in the knowledge base. It prevents duplicate nodes that mean the same thing, keeps everything consistent, helps us keep the data model simple, and keeps the db from getting messy.
One of the things we’re trying to do is figure out a way to simply and quickly disambiguate the entities we extract from documents, web scraping, and other relevant sources. We’re taking pretty much everything that comes into the graph database and vectorizing it with OpenAI’s Embeddings API. It’s not particularly expensive, but we would rather not use it on junk, and we would rather not put junk into the database (or, if we do, get rid of it quickly and cost-effectively).
The nice thing about vectors is that with enough information we can vectorize all the nodes and run cosine similarity on them to find high-similarity nodes that can be clustered together. For example, to flag and de-duplicate “Abe Lincoln” and “Abraham Lincoln” :USAPresident:Person nodes.
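As a minimal sketch of that naive pass (assuming we’ve already pulled each node’s embedding out of the graph; the vectors here are random stand-ins):

# Naive pairwise cosine similarity over node embeddings to flag duplicate candidates.
# Checks every pair, so it's O(n^2) in the number of nodes.
import numpy as np

nodes = {
    "Abe Lincoln": np.random.rand(1536),      # stand-ins for real OpenAI embeddings
    "Abraham Lincoln": np.random.rand(1536),
    "George Washington": np.random.rand(1536),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

names = list(nodes)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        sim = cosine(nodes[names[i]], nodes[names[j]])
        if sim > 0.95:  # threshold is a guess; needs tuning per embedding model
            print(f"possible duplicates: {names[i]!r} / {names[j]!r} ({sim:.3f})")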
But as with most things, it’s not good enough. We can have things with high cosine similarity that are not the same, and things with slightly lower cosine similarity that are the same. And the OpenAI Embeddings API does have an associated cost. Running cosine similarity is also not cheap; comparing all nodes to all other nodes is quadratic in the number of nodes.
So we need to combine it with other methods to nail a cost-effective and accurate-enough approach to entity disambiguation, typically called:
An “Ensemble Approach”
Enter FastRP (Fast Random Projection)
The FastRP (Fast Random Projection) method was shared in 2019 and offers a more scalable approach to generating node embeddings in large graphs. Neo4j lets us run it out of the box now. For context, node2vec is 4000x more expensive than FastRP. So I’m thinking we can use this as a first pass to generate similarity clusters based on KG context, e.g. if our duplicate President nodes show up similarly in a network graph, we should be able to flag them with FastRP.
Graph Structure Representation: FastRP generates low-dimensional vector representations of nodes that preserve the graph’s structural information. It’s pretty neat. This allows for disambiguation based on an entity’s position and relationships within the knowledge graph.
Embedding Similarity: By calculating the similarity between FastRP embeddings of candidate entities and the context of the mention, we can identify the most likely match.
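Neo4j ships FastRP as a procedure in the Graph Data Science (GDS) library. A minimal sketch of generating and persisting the embeddings from Python; the graph name, projection, and config values are placeholder assumptions:

# Sketch: generate FastRP embeddings with Neo4j GDS and write them back as a node property.
# Assumes the GDS plugin is installed; connection details are hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # Project the entities and all relationships into GDS's in-memory graph format.
    session.run("CALL gds.graph.project('entities', 'Entity', '*')")
    # Run FastRP and write each node's embedding back onto the node.
    session.run(
        """
        CALL gds.fastRP.write('entities', {
          embeddingDimension: 128,
          writeProperty: 'fastrp_embedding'
        })
        """
    )
driver.close()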
Imagine a graph with these in it based on extracting named entities from two bio blurbs about Abraham Lincoln.
# Abraham Lincoln was the president of the United States... He chopped down a cherry tree.
(Cherry Tree)-[CHOPPED_DOWN]-(Abraham Lincoln)-[PRESIDENT_OF]->(United States)

# Abe Lincoln was the president of the USA... He chopped down a tree.
(Tree)-[CHOPPED_DOWN]-(Abe Lincoln)-[PRESIDENT_OF]->(USA)
I can easily get these sentences loaded into my graph db but then I need to merge them somehow.
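Once something (string similarity, FastRP, or a human) has decided two nodes are the same entity, APOC’s refactor procedure can do the actual merge. A sketch, with the Lincoln nodes hard-coded for illustration:

# Sketch: collapse two nodes we've decided are duplicates into one node,
# keeping both nodes' relationships. Names/labels are hard-coded for illustration.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    session.run(
        """
        MATCH (a:Person {name: 'Abraham Lincoln'}), (b:Person {name: 'Abe Lincoln'})
        CALL apoc.refactor.mergeNodes([a, b], {properties: 'combine', mergeRels: true})
        YIELD node
        RETURN node
        """
    )
driver.close()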
Reducing dimensionality
FastRP compresses those structures down into simple matrix representations, so we should be able to identify similar ones quickly. The way I understand it, nodes can be represented as rows and columns in a table, with a 1 or 0 in each cell depending on whether there is an edge between those nodes. Other ways of reducing dimensionality (basically trying to compress the massive data representations from these crazy neural nets we all have access to now) are compute-intensive, while “very sparse random projection” cuts out a bunch of unnecessary calculations: since most nodes are not connected, most of the “0”s in those matrices don’t need to be thought through. idk.
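To make that concrete, here’s a toy version of the very sparse random projection trick. This is not what GDS’s FastRP does internally (the real thing also mixes in multi-hop neighborhood information); it just shows why sparsity makes the projection cheap:

# Toy "very sparse random projection" of an adjacency matrix.
import numpy as np
from scipy import sparse

n, d = 1000, 64            # n nodes, d embedding dimensions
s = int(np.sqrt(n))        # sparsity parameter from the sparse-projection literature

# Random sparse adjacency matrix (most entries 0, like most real graphs).
A = sparse.random(n, n, density=0.01, format="csr")

# Very sparse random projection matrix: entries are +sqrt(s), 0, or -sqrt(s),
# with 0 by far the most common, so most multiplications never happen.
rng = np.random.default_rng(42)
vals = rng.choice(
    [np.sqrt(s), 0.0, -np.sqrt(s)],
    size=(n, d),
    p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)],
)
R = sparse.csr_matrix(vals)

embeddings = A @ R          # each row is now a d-dimensional node embedding
print(embeddings.shape)     # (1000, 64)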
Complementary Signals: Combine FastRP embeddings with other disambiguation signals (a rough sketch of combining these follows below), such as:
- Text similarity between the mention and entity names
- Contextual embeddings from LLMs or an external source like Wikipedia’s vector store
- Entity popularity or prior probability, estimated with other graph algorithms that capture network characteristics
Collective Disambiguation: Use the graph structure captured by FastRP to perform collective disambiguation, ensuring global coherence across all entity mentions in a document.
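Here’s a rough sketch of what combining those signals could look like. This is a hypothetical scoring function; the weights are made up and would need tuning against labeled examples:

# Sketch: combine disambiguation signals into one ensemble score.
# All inputs assumed normalized to [0, 1]; weights are guesses, not tuned values.
def ensemble_score(
    fastrp_sim: float,    # FastRP embedding similarity (graph structure)
    string_sim: float,    # e.g. Sørensen–Dice between mention text and entity name
    context_sim: float,   # cosine similarity of contextual LLM embeddings
    popularity: float,    # entity popularity / prior probability signal
) -> float:
    weights = {"fastrp": 0.3, "string": 0.2, "context": 0.4, "popularity": 0.1}
    return (
        weights["fastrp"] * fastrp_sim
        + weights["string"] * string_sim
        + weights["context"] * context_sim
        + weights["popularity"] * popularity
    )

print(ensemble_score(0.8, 0.9, 0.7, 0.3))  # ~0.73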
Analyzing text similarity between the mention and existing entity names with “string similarity”
For finding text similarity between entity mentions and entity names in Neo4j, the apoc.text.sorensenDiceSimilarity or apoc.text.jaroWinklerDistance functions compare the entity mention and entity name strings. Both return a score between 0 and 1, but note the direction: Sørensen–Dice is a similarity (higher means more alike), while Jaro–Winkler here is a distance (lower means more alike). Example using Sørensen–Dice similarity:
MATCH (mention:Mention), (entity:Entity)
WITH mention, entity, apoc.text.sorensenDiceSimilarity(mention.text, entity.name) AS similarity
RETURN mention, entity, similarity
- Filter the results based on a similarity threshold to only consider entity candidates with a high enough similarity score, e.g.
MATCH (mention:Mention), (entity:Entity)
WITH mention, entity, apoc.text.sorensenDiceSimilarity(mention.text, entity.name) AS similarity
WHERE similarity > 0.75
RETURN mention, entity, similarity
- String similarity computation can be intensive for bigger graphs. So I’m trying to figure out if we can use FastRP first and then filter the potential candidates based on some simple criteria, like shared properties (e.g. the same last name). This is called “reducing the pairwise comparisons needed.” (A rough sketch of this two-stage idea follows after these asides.)
- Aside: looks like Levenshtein similarity (apoc.text.levenshteinSimilarity) is used for things like spell checking, so it could help for that? e.g.
RETURN apoc.text.levenshteinSimilarity("Abe Linconl", "Abe Lincoln") AS output;
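Here’s the kind of two-stage filter I have in mind, sketched against GDS’s k-NN procedure: nearest neighbors on FastRP embeddings first, string similarity only on the surviving candidate pairs. It assumes the projected graph still carries the FastRP embeddings (e.g. generated with gds.fastRP.mutate rather than write), and all names and thresholds are placeholder assumptions:

# Sketch: cut down pairwise comparisons with k-NN on FastRP embeddings first,
# then only run string similarity on those candidate pairs.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    result = session.run(
        """
        CALL gds.knn.stream('entities', {
          topK: 5,
          nodeProperties: ['fastrp_embedding']
        })
        YIELD node1, node2, similarity
        WITH gds.util.asNode(node1) AS a, gds.util.asNode(node2) AS b, similarity
        WHERE similarity > 0.9
        RETURN a.name AS a_name, b.name AS b_name,
               apoc.text.sorensenDiceSimilarity(a.name, b.name) AS string_sim
        """
    )
    for record in result:
        print(record["a_name"], record["b_name"], record["string_sim"])
driver.close()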
Implementing a Semi-Automated Entity Disambiguation Pipeline
Here’s what I’m thinking:
- Load a bunch of LinkedIn profile URLs for people who have the same or similar names!
- Scrape the HTML, convert it to JSON, and parse the structured data into node naming conventions like School, Company, etc.
- Use an NER model like spaCy’s to get a starting point of entity mentions to disambiguate
- Do “candidate generation,” where we try to identify some nodes the mentioned entity could refer to. For GSC keyword embeddings we could potentially start with cosine similarity here. We could also start with string similarity, and maybe only check n degrees out from the larger topic/thing, considering graph position? Not sure.
- Do “context embedding”: sentence- or paragraph-level embeddings capturing the surrounding text, with something like BERT or GPT-3
- Retrieve/generate an “entity embedding”… multiple methods? FastRP? Google Knowledge Graph API, Wikipedia2Vec, or OpenAI Embeddings API? It was suggested to generate summaries of the text with context on the fly, compress those into some vector representation with BERT, and just use that… not sure.
- Run similarity algos, probably cosine.
- Create an ensemble score based on everything?
- Create an approach for “collective disambiguation” that checks all mentions in the page/article/doc/transcript so we don’t have to assign the correct entity to the mentions one by one. Basically we create a subgraph/projected graph of similarity scores between mentions and candidate entities. Maybe PageRank for this (a toy sketch follows after this list).
- Aside: Look into “Loopy Belief Propagation”. From Claude “This is a message-passing algorithm that iteratively updates the beliefs (probabilities) of each node based on the messages it receives from its neighbors. In the disambiguation graph, mentions pass messages to their candidate entities indicating how likely they are to refer to them, and entities pass messages to other entities indicating how compatible they are. The algorithm converges to a stable set of beliefs, and the highest-probability candidate for each mention is selected.”
- Decision criteria finalized… update the graph. Should be based on the highest-scoring candidate for each mention, with some process for handling uncertain cases.
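And here’s a toy version of that collective step using PageRank over a mention/candidate graph with networkx. The graph, nodes, and weights are all fabricated for illustration:

# Toy collective disambiguation: build a mention/candidate graph and let
# PageRank spread support between compatible candidates. Scores are made up.
import networkx as nx

G = nx.Graph()
# Mention -> candidate edges, weighted by ensemble score.
G.add_edge("mention:Abe Lincoln", "cand:Abraham_Lincoln", weight=0.9)
G.add_edge("mention:Abe Lincoln", "cand:Abe_Lincoln_(musician)", weight=0.4)
G.add_edge("mention:USA", "cand:United_States", weight=0.95)
# Candidate -> candidate edge: these two are compatible (president of that country).
G.add_edge("cand:Abraham_Lincoln", "cand:United_States", weight=0.8)

ranks = nx.pagerank(G, weight="weight")

# Pick the highest-ranked candidate for each mention.
for mention in ("mention:Abe Lincoln", "mention:USA"):
    best = max(G.neighbors(mention), key=lambda c: ranks[c])
    print(mention, "->", best)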