I’m going back to yesterday’s analogy about converting dinner party conversations into speeches.
Let’s imagine you recorded all the conversations you’ve ever had: when they happened, who they were with, the points you made, and measures of how well those points were received.
Not only that, you can exercise some control over all that data, and ask questions like:
- What conversations have I had around what topics?
- For a given topic x, what conversations referenced or were referenced in other conversations?
- Did eyes light up or get heavy listening to you dribble on about x topic? Put more simply, which topics were most (and least) well received?
- Is there existing demand for your thinking on x topic? What about compared to y topic? Or z?
- Maybe most importantly, what would it take to wrangle all the conversations around each of the most well received topics into (good) speeches?
We can actually do a lot of this already.
Let’s start simple. We want to be able to ask meaningful questions, beginning with, “Which topics are the most popular?” and “Which conversations were most impactful?”
With this information, we can do a lot. We can sort topics by importance, then cluster similar conversations together based on topics, and then rank order how influential or important those conversations were on a topic by topic basis.
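To make that concrete, here’s the kind of query we’re working toward. This is only a sketch, run as Cypher from Python, and it assumes hypothetical Conversation and Topic nodes connected by a MENTIONS relationship — none of which exists yet:

```python
# Sketch only: assumes Conversation and Topic nodes linked by MENTIONS
# relationships, which we haven't actually modeled yet.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

POPULAR_TOPICS = """
MATCH (c:Conversation)-[:MENTIONS]->(t:Topic)
RETURN t.name AS topic, count(c) AS conversations
ORDER BY conversations DESC
LIMIT 10
"""

with driver.session() as session:
    for record in session.run(POPULAR_TOPICS):
        print(record["topic"], record["conversations"])

driver.close()
```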
As a disclaimer, I’m in the very early stages of learning how to not do this manually. For most of my marketing career, I have brute-force cobbled answers like this together with GA filters, spreadsheets, and pivot tables. A good day would be getting to one key insight.
I once spent days (weeks?) trying to slice and dice that kind of data in SQL with join statements to figure out why a client had gone into the red for the first time in decades.
So right now this feels a bit like being a caveman who uses his rock to smash other rocks, picking up an alien laser.
From what I can tell, it’s about a three step process:
1. Import and wrangle all the data into an organized, easily query-able format
In Neo4j this means some data refactoring for clarity, creating indexes so that data is more easily accessible, and merging duplicate records (which I found an existing procedure for!).
In addition, we enable full-text search on properties here, which will let us query for nodes later based on the text content they contain. A rough sketch of this housekeeping is below.
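Here’s roughly what that step-1 housekeeping might look like, run as Cypher from Python. The label and property names (Conversation, title, text) are assumptions about the model, and the index syntax varies a bit by Neo4j version:

```python
# Sketch of the step-1 housekeeping: a plain index, a full-text index, and
# duplicate merging via APOC. Labels/properties (Conversation, title, text)
# are assumptions; index syntax differs across Neo4j versions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

STATEMENTS = [
    # Plain index so lookups by title are fast (Neo4j 4.x+ syntax).
    "CREATE INDEX conversation_title IF NOT EXISTS "
    "FOR (c:Conversation) ON (c.title)",

    # Full-text index over the conversation text, for keyword search later
    # (procedure syntax used by Neo4j 3.5/4.x).
    "CALL db.index.fulltext.createNodeIndex("
    "'conversationText', ['Conversation'], ['text'])",

    # Merge duplicate conversations that share a title, using the existing
    # APOC procedure mentioned above.
    """
    MATCH (c:Conversation)
    WITH c.title AS title, collect(c) AS dupes
    WHERE size(dupes) > 1
    CALL apoc.refactor.mergeNodes(dupes, {properties: 'combine', mergeRels: true})
    YIELD node
    RETURN count(node) AS merged
    """,
]

with driver.session() as session:
    for statement in STATEMENTS:
        session.run(statement).consume()

driver.close()
```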
2. Enrich our data to make insights more accessible
This is one of the fun parts.
We could try to measure impact in a number of ways. If we’re talking about importance or influence, we have to consider what we mean by that.
Thinking about “importance” is really interesting in this context. Given a bunch of random data like a string of conversations, how would you identify what’s important? Unique things? Commonly cited things that aren’t so commonly cited in other types of conversations? Ideas that get a lot of mentions? Conversations that look like they sparked more conversations?
How do we discern importance from influence?
One way this is commonly done is by measuring “centrality” in a network. From my handy O’Reilly Graph Algorithms book:
Centrality algorithms are used to understand the roles of particular nodes in a graph and their impact on that network. They’re useful because they identify the most important nodes and help us understand group dynamics such as credibility, accessibility, the speed at which things spread, and bridges between groups. Although many of these algorithms were invented for social network analysis, they have since found uses in a variety of industries and fields.
Graph Algorithms (Hodler & Needham)
In it, they discuss “degree centrality,” used to measure connectedness; “closeness centrality,” used to determine a node’s proximity to the rest of a group of nodes; and “betweenness centrality,” which looks at the shortest paths between groups of nodes to find the bridges, the “control points,” sometimes called “brokers” in network science. And of course, our favorite and the reason Google is a monopoly: PageRank, used to measure the overall influence of a node by looking at how many inbound connections it has, and how many connections those connections have.
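Just to make that last one concrete, here’s roughly what asking Neo4j for PageRank scores looks like with the Graph Data Science library. It assumes Conversation nodes joined by some relationship (I’ve made up a REFERENCES type), which, as we’re about to see, we don’t actually have yet:

```python
# Rough sketch of running PageRank with the Neo4j Graph Data Science (GDS)
# library. Assumes Conversation nodes connected by a hypothetical REFERENCES
# relationship -- which, as noted next, we don't have yet.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

PROJECT = """
CALL gds.graph.project('conversations', 'Conversation', 'REFERENCES')
"""

PAGERANK = """
CALL gds.pageRank.stream('conversations')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).title AS conversation, score
ORDER BY score DESC
LIMIT 10
"""

with driver.session() as session:
    session.run(PROJECT).consume()  # in-memory graph projection (GDS 2.x syntax)
    for record in session.run(PAGERANK):
        print(record["conversation"], round(record["score"], 3))

driver.close()
```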
Now at first glance, centrality algorithms can’t really do anything for us. Our data have no existing connections between them. We just loaded a bunch of unorganized conversations into a database.
No relationships, no PageRank.
But we do have words. And technically, we can find the same or similar words, phrases, sentences, paragraphs, or whatever across documents and create the link relationships between them that way.
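Here’s a hedged sketch of that linking step. It assumes we’ve already pulled some kind of Term nodes out of each conversation’s text and attached them with a made-up CONTAINS relationship (that extraction is exactly the NLP work described next). Given that, conversations sharing vocabulary can be wired together:

```python
# Sketch: turn shared vocabulary into relationships. Assumes each Conversation
# is already connected to Term nodes via a hypothetical CONTAINS relationship
# (the NLP step described below is what would create those).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

LINK_BY_SHARED_TERMS = """
MATCH (c1:Conversation)-[:CONTAINS]->(t:Term)<-[:CONTAINS]-(c2:Conversation)
WHERE id(c1) < id(c2)
WITH c1, c2, count(t) AS shared
MERGE (c1)-[r:SHARES_TERMS]->(c2)
SET r.weight = shared
"""

with driver.session() as session:
    session.run(LINK_BY_SHARED_TERMS).consume()

driver.close()
```

Once relationships like SHARES_TERMS exist, the PageRank sketch above finally has something to rank.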
This brings us to annotation and natural language processing (NLP).
In this world, linguistics is treated as a basis for understanding entities and relationships.
For example, “Jim and Ann’s dog Roger loves street pizza.”
There is a ton of information in this one sentence. For one, we know that Jim, Ann, and Roger are related. We are certain Ann has a dog named Roger. And we know Roger loves pizza.
There is also a lot to be potentially misunderstood. For one, does Jim also love street pizza? Or is “Jim and Ann” a thing? “Street pizza” is also an unusual combination of words.
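Here’s a quick illustration of what a generic NLP model extracts (and misses) from that sentence, using spaCy’s small English model purely as a stand-in:

```python
# What a generic model makes of the sentence: named entities plus the
# dependency parse that relates them. spaCy's small English model is just
# a stand-in; it knows nothing about Roger or "street pizza."
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Jim and Ann's dog Roger loves street pizza.")

# Named entities the generic model recognizes (likely the people's names;
# it has no label for an inside joke like "street pizza").
for ent in doc.ents:
    print(ent.text, ent.label_)

# The dependency parse is where the relationships hide: who owns the dog,
# who does the loving, and what gets loved.
for token in doc:
    print(f"{token.text:>8} --{token.dep_}--> {token.head.text}")
```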
To enrich the data (the full-text conversations in our database), we have to choose our training data: essentially, which library we want to use to help us interpret meaning from our content.
The problem with your conversations is that a generic dataset of lots of conversations won’t interpret them well on its own; they need additional inputs. In the case of street pizza, we’d have to annotate it, given it’s a phrase Ann and I use to describe how Roger always manages to find stray slices of pizza on the ground when he gets walked in Philly.
Your expertise works the same way. You have a way you phrase things and an intended meaning behind it, and you have the “curse of knowledge” bias, which assumes that others know what you know.
So the data needs to have stopwords such as “um” and “like” removed, be run through a model trained on a generic corpus (library) of data, and be hand annotated for the important, distinct-to-you things across a representative sample of conversations. And boom, data enriched.
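A minimal sketch of that enrichment pass, again using spaCy as a stand-in: filter out filler words on top of the generic stopword list, pin your own phrases as custom entities, and keep the cleaned-up terms for the graph. The label name, phrases, and sample sentence are all illustrative, not a real training pipeline:

```python
# Minimal enrichment sketch with spaCy: filler-word filtering, a hand-annotated
# phrase pinned as its own entity, and a cleaned token list that could become
# Term nodes in the graph. Labels and phrases are illustrative only.
import spacy

nlp = spacy.load("en_core_web_sm")  # the generic corpus ("library") doing the heavy lifting

# Filler words a transcript is full of but a generic stopword list may miss.
FILLER = {"um", "uh", "like", "y'know"}

# Hand annotation, the cheap way: an entity ruler that tags phrases distinct
# to you before the statistical NER gets a chance to mangle them.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "INSIDE_JOKE", "pattern": "street pizza"},
])

doc = nlp("Um, so like Roger found street pizza again on the walk.")

# Cleaned terms: no stopwords, punctuation, or filler.
terms = [t.lemma_.lower() for t in doc
         if not t.is_stop and not t.is_punct and t.lower_ not in FILLER]
print(terms)

# Custom entities now show up alongside whatever the generic model finds.
print([(ent.text, ent.label_) for ent in doc.ents])
```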
That’s all we have time for. A rough step 3 next.
And of course, everything is easier in theory, but I’ve already found a handful of really interesting tutorials showing practical applications for similar situations, which make getting started playing with this stuff feel more within reach:
- A 3 Part Neo4j Based Analysis of Hillary Clinton’s emails by Rik Van Bruggen
- An insane walkthrough of creating a knowledge graph with multiple data sources to drive content recommendations by GraphAware’s Christophe Willemsen