- 1. Review site manually, get a lay of the site and typical page template structures
- 2. Crawl and extract text, meta data, metrics with ScreamingFrog
- 3. Collect and preprocess data from APIs and CSV exports, mostly with Python
- Do NLP keyword extraction things
- 4. Load all that CSV data into Neo4j mapping relationships between NLP, URL, and analytics data things
- 5. Start analyzing
- 5.1 Peek some SEO audit things to get a rough idea of site health
- 5.2 Do some graph data science things
- 6. Start enriching
- 7. Make rules for what gets kept, redirected, consolidated.
Designing a framework for getting content to perform is a tall order. It's labor intensive, there are dozens of little tradeoffs to be made at every turn (and the decision fatigue that comes with them), and it's very hard to predict the returns on effort because, quite frankly, no one has figured out how to do it well across a whole website.
To be fair, I’ll see sites do a really nice job at a piece of the puzzle.
Hubspot is good at internally linking. I know this because they are able to get shitty content to rank. Maybe that should be a post. “What we can learn from high ranking shitty content”
Which sounds like a neg, but is honestly a compliment – I wouldn’t know how to do that.
It’s a big messy problem and so the default response is to do nothing. A weirdly high leverage activity is mapping your content into an info architecture that reduces friction at every turn for users to get exposure to your thinking, free and paid.
If you’ve been following my progress, you know I’ve been working on this for about 1.5 years. So here’s my current process.
1. Review site manually, get a lay of the site and typical page template structures
Here I am looking for overall structure and what data exists in the source code or DOM. As I go, I collect xpath or csspaths for key structures, things like:
Meta tags ScreamingFrog might not grab by default like date published, modified date:
Signals sent by body class about whether it is a page or post, what the post template is named, the body of the article, sometimes this also includes the categories and tags.
Here I can see the post id, header features, the name of the post template, and clearly that it is a single “post.” Similarly, on “pages,” I can see classes demonstrating the URL is a “page-template-default page page-id-###”.
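If you want to sanity-check those body-class signals in Python before committing them to a ScreamingFrog config, here's a stdlib-only sketch (the HTML snippet and class names are illustrative WordPress-style examples, not from any particular site):

```python
from html.parser import HTMLParser

class TemplateSignals(HTMLParser):
    """Collect body classes and name/content meta tags from a page's HTML."""
    def __init__(self):
        super().__init__()
        self.body_classes = []
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body":
            self.body_classes = attrs.get("class", "").split()
        elif tag == "meta" and "content" in attrs:
            # WordPress themes often expose published/modified dates here
            name = attrs.get("property") or attrs.get("name")
            if name:
                self.meta[name] = attrs["content"]

html = '''<html><head>
<meta property="article:published_time" content="2020-10-02T09:00:00+00:00">
</head><body class="post-template-default single single-post postid-123">
</body></html>'''

parser = TemplateSignals()
parser.feed(html)
print(parser.body_classes)  # includes "single-post" and "postid-123"
print(parser.meta["article:published_time"])
```

The same class strings you see here are what I'd turn into CSSPath or XPath extractions later.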
I also like to look at the HTML structure within the body content. This hints at common patterns I can analyze that get at the expert author's process and the habits that leave footprints.
For one of my posts, I see a table of contents block:
Later, I will look at internal linking with a graph database so it’s good to know that a lot of these anchor links will be treated as “self-linking” links structurally, and they signal some article structure around headings.
Same with blockquotes and other semantic HTML. Semantic HTML is just descriptive HTML tags; they hint at the contents of the tag or the underlying structure of the page:
If someone uses blockquotes, I think that can signal interesting data. Especially if they use the <cite> tag which is built into the WordPress blockquote block structure:
I also like to look at any structures I can get at around Calls To Action (CTAs). Are there sidebar email signups? Boxes with links to landers? Inline links with get your free x?
I’ll ignore this if the CTA is just part of the templating, but if they seem custom, or inline, or related to the content, I’ll go deeper there.
2. Crawl and extract text, meta data, metrics with ScreamingFrog
Once I’ve done step 1, I’ll start working on the custom ScreamingFrog configuration for the website. I already have good starter saved configs for sites built with Squarespace, default WordPress, Genesis for WordPress, or Elementor for WordPress so I’ll load and work from those if applicable.
I’ll load in my “custom extraction” CSSPaths and XPaths, or RegEx if I can’t get at what I want reliably with CSS alone. I also extract things like text in italics and bold tags, and hx tags all the way down to h5, then run those through an ngram generator just to see if there are themes or patterns in how or why the author uses italics or bold font – is it to emphasize a word in a sentence? To drive home the main point of the article? Are italics at the end of an article reserved for outdated CTAs?
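The ngram generator part is simple enough to sketch with the stdlib (the italics snippets here are made up):

```python
import re
from collections import Counter

def ngrams(texts, n=2):
    """Count n-grams across a list of extracted snippets (e.g. all <em> text)."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

# Pretend these came out of a ScreamingFrog custom extraction of <em> tags
italics = ["get your free checklist", "free checklist inside", "really"]
print(ngrams(italics, 2).most_common(3))
```

Run it once for unigrams and once for bigrams and skim the top counts for patterns.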
I probably shouldn’t waste time here because it typically does not inform how I plan to organize content but I find it interesting to analyze nonetheless.
ScreamingFrog’s API integrations
I have a custom config for this. For Google Analytics, I will pull All Users. I used to pull Organic Traffic, but I can get that from the Google Search Console API a bit more accurately. I’ll also deselect goals and select all pageview and sessions data. This is because I want landing page sessions and session duration, and I want a count of pageviews irrespective of where the user landed, among other things.
Most of my clients (myself included) don’t do a great job of tracking goals or events, but if that data looks worthwhile, I will include key data as columns.
Google Search Console
I pull in data with the same config and time range as Google Analytics. For low-traffic sites, I’ll pull a range of three months to a year. For higher-traffic sites, it’ll be more like 1 to 3 months.
I get at this data much more accurately in a later step of the workflow, using the Google Search Console API and mapping keyword-to-landing-page relationships in my graph database. Getting these summary numbers here just gives me an easy way of looking at the ratio of organic traffic to all traffic (is there congruence? incongruence?) and at avg. positions / clicks / impressions to see whether there’s a big opportunity worth looking into (high impressions with an average position of 8 to 22 and low CTR).
I will also pull in Ahrefs’ page specific metrics. Again, I pull the complete link data into Neo4j and map these relationships, and plan to crawl and scrape all referring links at some point as I go deeper with analysis processes, but this is just to eyeball link depth throughout the site and make sure I don’t throw out pages that should be redirected or get more careful consideration.
As a protip – by default, some metrics are pre-checked in there that you don’t need (like domain rating being requested 500 times when you only need it once), and those API request limits can add up if you willy-nilly check off data you don’t actually need as columns.
For my purposes, I just pull in page rating, linking root domains, number of inbound links.
Keep in mind all the data that comes through is just a tiny snapshot of what is happening on a per page level.
3. Collect and preprocess data from APIs and CSV exports, mostly with Python
Like I said, I’ll typically get the article content in a column for my ScreamingFrog export. Once I’ve exported internal_html.csv from ScreamingFrog, I’ll just filter for the pages I want to analyze as they relate to content organization.
I have to really stop myself from getting distracted by technical SEO or dynamic content issues or lots of pagination junk so sometimes I’ll go back to ScreamingFrog and pull down just the sitemap.xml list of pages and posts and preprocess / enrich that data.
Do NLP keyword extraction things
Currently, I use two keyword extraction techniques. This is the quick and dirty Python script for TF-IDF and Gensim keywords. It’s not well commented, but I will add it to GitHub and document it better if bugged to. To run extractions on article content, you’ll need to extract entry content and set your article-text column name on line 6 or so:
content_text_1_list = data['content text 1']
This one is pretty simple, called TF-IDF. It gets a bad rap, but I won’t get into all the why there. It just gets misused, and then people are like “see, it doesn’t work.” I wrote more thoughts on / a description of TF-IDF in this context.
For analyzing a corpus (website) and documents in the context of that corpus (pages), it’s really useful for devaluing keywords that you use all the time across a bunch of pages and picking out the terms and scoring them high based on uniqueness to a given document or few documents.
I do TF-IDF for unigrams (one word keywords) and bigrams (two word keywords).
I’m sure I could get something useful going higher for some projects, but so far it generates too much noise; without a lot of additional filtering or iterative preprocessing, it hasn’t been worth it.
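For the curious, here’s roughly what the TF-IDF math boils down to – a stdlib-only sketch, not my actual script, with made-up mini-documents:

```python
import math
import re
from collections import Counter

def tokenize(text, n=1):
    """Split into lowercase n-gram terms (n=1 unigrams, n=2 bigrams)."""
    toks = re.findall(r"[a-z']+", text.lower())
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def tfidf(docs, n=1):
    """Score terms high when frequent in one doc but rare across the corpus."""
    doc_counts = [Counter(tokenize(d, n)) for d in docs]
    df = Counter()  # document frequency: how many docs contain each term
    for counts in doc_counts:
        df.update(counts.keys())
    N = len(docs)
    return [
        {t: (c / sum(counts.values())) * math.log(N / df[t])
         for t, c in counts.items()}
        for counts in doc_counts
    ]

docs = ["graph database seo audit", "seo audit checklist", "graph database queries"]
scores = tfidf(docs)
```

Notice that a term appearing in every document scores zero – that’s the devaluing of sitewide boilerplate terms described above.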
Gensim Text Summarization (Keywords)
This one is neat. It uses sort of a PageRank-as-TextRank approach to identify important terms that other words congregate around, the way PageRank looks at the pages other pages’ links point to in aggregate.
It has some config options, so I’ll typically set it to grab around 20 terms from each document and, if that looks like noise, reduce it to 10. Like TF-IDF, you can also grab what amounts to an importance or relevancy score for each term. Docs here. Keep in mind that I call one cell’s output CSV gensim_bigrams.csv, but it’s more like casting a wider net with unigrams mixed in.
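Gensim’s keywords() is a black box unless you peek at TextRank itself, so here’s a stdlib sketch of the core idea – a co-occurrence graph plus PageRank-style power iteration (toy text, simplified scoring, not Gensim’s actual implementation):

```python
import re
from collections import defaultdict

def textrank_keywords(text, window=2, top=5, iters=30, d=0.85):
    """Rank words by PageRank over a word co-occurrence graph, TextRank-style."""
    words = re.findall(r"[a-z']+", text.lower())
    graph = defaultdict(set)
    # connect each word to the words appearing within `window` positions after it
    for i, w in enumerate(words):
        for u in words[i + 1:i + 1 + window]:
            if u != w:
                graph[w].add(u)
                graph[u].add(w)
    scores = {w: 1.0 for w in graph}
    for _ in range(iters):
        scores = {
            w: (1 - d) + d * sum(scores[u] / len(graph[u]) for u in graph[w])
            for w in graph
        }
    return sorted(scores, key=scores.get, reverse=True)[:top]

text = "graph data science helps seo because graph queries relate seo data"
print(textrank_keywords(text))
```

Words that lots of other words congregate around end up with the highest scores, just like heavily linked pages do in PageRank.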
Google Search Console API
If you’ve ever spent time in GSC, you know it’s a pain in the ass. You end up exporting CSVs of data that summarize all page metrics or all search term metrics, without respect to which page has what metrics for which keyword.
I want the relationship between the page and the keyword with the metrics that apply just to that relationship, so I can compare it to other pages with some visibility on the same term or similar terms. This is where the GSC API shines.
I hacked at a “do NLP things to GSC terms python script” by JR Oakes to just grab and transform API data into a dataframe for CSV export, which looks like this:
As Ann would say: Yumzers.
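The transform itself is simple once you see the shape of the Search Analytics API response – a sketch with made-up example.com numbers (the real script is adapted from JR Oakes’):

```python
# Shape of a Search Analytics API response when you request
# dimensions=["page", "query"] (illustrative values, not real data)
response = {
    "rows": [
        {"keys": ["https://example.com/a/", "graph seo"],
         "clicks": 12, "impressions": 340, "ctr": 0.035, "position": 8.2},
        {"keys": ["https://example.com/b/", "graph seo"],
         "clicks": 1, "impressions": 90, "ctr": 0.011, "position": 21.4},
    ]
}

def flatten(response):
    """One record per page->keyword relationship, ready for CSV or Neo4j."""
    return [
        {"page": row["keys"][0], "query": row["keys"][1],
         "clicks": row["clicks"], "impressions": row["impressions"],
         "ctr": row["ctr"], "position": row["position"]}
        for row in response.get("rows", [])
    ]

rows = flatten(response)
```

Each flattened row is exactly the page-to-keyword relationship (with its own metrics) that the GSC UI won’t give you.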
For SEMrush, I log in, type in my domain > Organic Research > Positions > Export > CSV
This gives me some more SERP data, and a bunch of third party estimates that can be useful if taken with a grain of salt.
You probably can’t see that but it looks like:
Keyword,Position,Previous position,Search Volume,Keyword Difficulty,CPC,URL,Traffic,Traffic (%), Traffic Cost,Competition,Number of Results,Trends,Timestamp,SERP Features by Keyword
google custom search wordpress,71,71,70,52.99,0.00,https://contentaudience.com/technical/google-custom-search-results-for-wp-site-search/,0,0,0.00,0.06,66900000,"[100,20,20,20,20,20,20,20,20,20,20,20]",2020-10-02,"Image pack, Reviews, AdWords top, AdWords bottom, Video Carousel"
I have a power your WordPress site search with Google Custom Search article, and it ranks on the 8th page of Google (71st position) for “google custom search wordpress.”
You get some trends data showing the search spikes in January and then drops to about 20% of that the rest of the months with [100,20,20,20,20,20,20,20,20,20,20,20]
You also get the SERP features on the Google results page for that query – here: Image pack, Reviews, AdWords top, AdWords bottom, Video Carousel.
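Pulling those two fields out of the export is a couple of lines with the csv module (using a trimmed-down version of the row above – just a subset of the real columns):

```python
import csv
import io

semrush_csv = '''Keyword,Position,URL,Trends,SERP Features by Keyword
google custom search wordpress,71,https://contentaudience.com/technical/google-custom-search-results-for-wp-site-search/,"[100,20,20,20,20,20,20,20,20,20,20,20]","Image pack, Reviews, AdWords top, AdWords bottom, Video Carousel"
'''

for row in csv.DictReader(io.StringIO(semrush_csv)):
    # "[100,20,...]" -> monthly interest values as ints
    trends = [int(x) for x in row["Trends"].strip("[]").split(",")]
    # "Image pack, Reviews, ..." -> clean list of SERP features
    features = [f.strip() for f in row["SERP Features by Keyword"].split(",")]
    print(row["Keyword"], trends[0], features)
```

From there it’s easy to flag, say, January-spiking keywords or queries whose SERPs are crowded with ad features.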
In the future I plan to get this SERP data directly because I want A LOT more of it for content intelligence. Here is the data model I’ve been working on for my graph database – you can see just how much useful data can be extracted from a SERP:
At some point, this will help me crawl other URLs, model the click potential of a SERP, and identify patterns in how other ranked pages’ content may relate to their position amidst search features and within the overall SERP structure. This is why I’m crazy about graphs for SEO insights. It’s all about how data relates.
Right now SEOs sort of eyeball this on a per search query basis. “Hmm it looks like this bullet list gets the rich snippet because the words are short and punchy. Let’s do that.” Or rely on just a dump of the SERP features like SEMRush provides in that above example: “Image pack, Reviews, AdWords top, AdWords bottom, Video Carousel”
I just went on a tangent about SERP stuff there, but the bigger content organization goal of importing these keywords as related to those pages is to capture the current context of what Google thinks those pages are about, so we can explore topics extracted by multiple methods, each providing unique metrics, in the context of how terms relate to pages.
Coming Soon: Ahrefs
Of course, I want to enrich our website data with inbound links, the relationships those links represent, competitor links, etc., but why?
For that data to be truly valuable, I would need to go up a level or two – e.g. links to pages that link to pages that link to your page. And then also down a level – pages you link to. And then compare competitor content along the same dimensions. Graph all of it, along with NLP things to analyze texts in aggregate.
That would give me a much richer integrated source of truth to work on extracting insights from – like predicting who is most likely to link under what conditions, or key holes in your thinking that others who outrank you account for. Who are the key linkerati players elevating competitive content, the ones you should be rubbing elbows with?
I’m not there yet so settling for counts of linking root domains and page rating from Ahrefs as collected through ScreamingFrog’s Ahrefs API. And that’s more of an SEO as content amplification strategy exercise anyway.
4. Load all that CSV data into Neo4j mapping relationships between NLP, URL, and analytics data things
This is what I’ve been working on for the past year or so. My workflow here is:
- Collect my .csv files and properly name them into a folder called “domainname.com-MMDDYYYY”.
- Spin up a new local graph instance in Neo4j Desktop named as
“domainname.com v1” and add a plugin called APOC for advanced procedures. Start the DB
- Move .csv files from domain folder to new db instance’s import folder *(what Neo4j expects as your LOAD CSV filepath unless you reconfigure some security settings for Neo to access external folders.)
- Open my mega LOAD CSV scripts .cql file in VSCode, then navigate to the parent folder of the db instance in terminal and use cypher-shell from the command line to run the .cql and import all the columns and rows as node properties and relationships into our new db instance.
- Inevitably deal with errors and reprocess some CSVs or add uniqueness constraints in the db.
- Bask in the glory of successfully loading a bunch of connected data into a graph db.
The above probably seems like a lot. And right now I have it down to about 2-3 hours depending on hiccups. I’m sure I could get it under 30 minutes by automating more, or straight up making a SaaS tool that runs SF from the command line and executes all the Python scripts and API calls directly, but I’m still heavily focused on workflow design and improving the actual organization processes atm.
The data model looks close to this, except that all keyword phrases share a label :Term and also have their own label based on source, e.g. :GSCTerm
5. Start analyzing
5.1 Peek some SEO audit things to get a rough idea of site health
Get an overview of SEO issues, minor and major. I’ve talked about a better SEO audit process using graphs with Screaming Frog data.
It’s better because it is more flexible, easier to dive into and identify patterns around specific issues, and quickly get an export of culprit URLs, which often requires a lot of additional guesswork, poking around, and clicking back and forth in traditional SEO website audit tools.
It’s also better because technical SEO audits should not be separate from the bigger context of content audits. The approach should be blended. You have limited resources, what are the most important x to fix? Hard to know without scoring pages based on performance, potential, and relevance.
Are you going to fix a bunch of issues and then learn later that you should have just deleted all those pages? Naz.
Aside: Also, the more I learn, the more I think technical issues are a distraction to more important activities, at least for expert content. Which I hate to say, because for the past five years I have been wrapped in the warm blanket of safe, easier, more clearcut, technical SEO recommendations.
And that’s now why I say “peek” instead of doing some major technical exercise at the start of SEO projects, which is at least 50% an inefficient use of time and dev resources when done before any real planning or strategy. Sidebar rant over.
5.2 Do some graph data science things
Ooh, yea. I started playing with the latest version of Neo4j Desktop 1.3.10, updated all my databases to 4.1.3 Enterprise, replaced the old “Graph Algorithms” plugin with the new “Graph Data Science” (GDS) Library and boy is it cool.
Now you can pop over into the GDS library, which I simultaneously really like, and am also secretly annoyed that the barriers to entry are so low for others who want to do the same kind of work.
Anyway, still playing but super quick and easy to get up and running.
Here are some recipes I’m playing with to surface interesting website insights:
The GDS closeness centrality algorithm (see docs) is relatively simple. It looks at a node’s (URL’s) distance in link hops from all other pages and measures “the average farness (inverse distance) to all other nodes.”
Simple and cool.
What do you expect will have the highest closeness score? Yep, template links like in header and footer navigation.
What do we expect would have a null score? Yep, two for two you are. Orphan URLs with no links to them.
What about low scores? And again. You’re on fire. Low scores signal those URLs would be very hard for a user to find.
We can also re-calculate variations of closeness for better data. For example, I can re-run closeness algo on only link relationships found within the body of an article. This allows me to get away from thinking about internal linking as templating and into article content.
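To make closeness concrete, here’s a stdlib sketch on a toy link graph (made-up URLs). Note it runs on the reversed graph, so a page’s score reflects how easily it can be reached, and orphans score zero as described above:

```python
from collections import deque

def closeness(graph):
    """(reachable nodes) / (sum of hop distances): a simple closeness score."""
    scores = {}
    for start in graph:
        # BFS to get hop distances from `start` to everything reachable
        dist, queue = {start: 0}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in graph.get(node, []):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total = sum(dist.values())
        scores[start] = (len(dist) - 1) / total if total else 0.0
    return scores

def reverse(graph):
    """Flip link direction: rev[target] lists the pages linking to target."""
    rev = {n: [] for n in graph}
    for src, targets in graph.items():
        for t in targets:
            rev.setdefault(t, []).append(src)
    return rev

# Toy site: "/" is in the template nav everywhere; "/orphan" has no inlinks
links = {"/": ["/a", "/b", "/c"], "/a": ["/b"], "/b": ["/"],
         "/c": ["/"], "/orphan": ["/"]}
scores = closeness(reverse(links))
```

Filtering `links` down to only article-body links before scoring is exactly the “re-calculate variations” idea: same algorithm, different relationship set.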
My personal favorite, and very underrated. The betweenness algo finds nodes that act as bridges between other nodes – for us, pages that connect clusters of pages, which is useful beyond just crawlability and indexing of your site in Google. What do you expect to score high here? Yes, again, category archives, tag archives, and then what?
Pages with lists of internally linked URLs. Here are two that scored high for my site:
Okay, makes sense. I wrote a post with a draft of links to other posts and that one has the highest betweenness score.
This one is interesting:
The list of related articles on my research page are otherwise hard to find because I stripped out the /category/technical archive pages.
Betweenness doesn’t just score what pages have the most outbound links to other pages, it scores based on how important a page’s links are to finding those other pages. As an aside, when you strip dynamically created tag and archive pages from your site, you can orphan all those pages.
Betweenness helps you see those bottlenecks.
What about zero-betweenness pages? A URL with a 0 score means one or more of the following:
- neither inbound nor outbound links
- no outbound links
- no inbound links
- fewer outbound and inbound links than other pages
And betweenness factors in templating links. Even if a page links out to the navigation, or is linked from a blog archive, it can still get a zero score, because all pages are linked from the blog archive or sitemap.
Here’s an example where I thought “wait, I link to 3 posts in this article. How can it be zero?” and then realized it still had inboundfound.com URLs, so those links would not have counted as internal links for betweenness.
The other neat part is I can export all inlinks, pages with their betweenness scores and outlinks in aggregate to get a sense of why these pages are considered to be key bridges or bottlenecks between site content.
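Here’s a brute-force sketch of betweenness on a toy site (made-up URLs; fine for small graphs – GDS uses the much faster Brandes algorithm under the hood):

```python
from collections import deque
from itertools import permutations

def betweenness(graph):
    """For every ordered pair of pages, find all shortest link paths and
    credit each interior page with its share of those paths."""
    scores = {n: 0.0 for n in graph}
    for s, t in permutations(graph, 2):
        # BFS from s, recording every shortest-path predecessor
        dist, preds, queue = {s: 0}, {}, deque([s])
        while queue:
            node = queue.popleft()
            for nxt in graph.get(node, []):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    preds[nxt] = [node]
                    queue.append(nxt)
                elif dist[nxt] == dist[node] + 1:
                    preds[nxt].append(node)
        if t not in dist:
            continue
        # walk predecessor lists back from t to enumerate each shortest path
        paths = [[t]]
        while paths[0][-1] != s:
            paths = [p + [q] for p in paths for q in preds[p[-1]]]
        for p in paths:
            for node in p[1:-1]:  # interior pages act as bridges
                scores[node] += 1 / len(paths)
    return scores

# Two clusters joined only through "/hub"; "/orphan" bridges nothing
links = {"/a1": ["/hub"], "/a2": ["/hub"],
         "/hub": ["/a1", "/a2", "/b1", "/b2"],
         "/b1": ["/hub"], "/b2": ["/hub"], "/orphan": []}
scores = betweenness(links)
```

Delete “/hub” from that toy graph and both clusters orphan each other – which is exactly the bottleneck situation stripping archive pages can create.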
Internal PageRank and CheiRank
I spent a fair amount of time writing about this in a technical post called Graph First PageRank and CheiRank.
Conceptually though, for graph first SEO we need to forget about our preconceptions about PageRank to make it useful again.
PageRank is just one measure of influence as centrality based on network topology of our websites. We can use any key inputs for our weights and initial values.
Inputs can be normalized scores based on conversion data. We can even weight initial link relationships based on overall traffic patterns. That would allow us to visualize not just how traffic flows throughout a site, but also where it should be flowing more.
Similarly, CheiRank on existing traffic would give us a sense of the most communicative pages, not just based on aggregates of counts of links for predictive purposes, but real data.
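A sketch of that idea: plain PageRank power iteration, except the teleport weights come from your own data instead of being uniform (the link graph and the “conversion” weights below are invented):

```python
def pagerank(graph, weights=None, d=0.85, iters=50):
    """Power iteration; `weights` biases the teleport step (e.g. normalized
    conversion or traffic data) instead of the uniform 1/N."""
    nodes = list(graph)
    n = len(nodes)
    weights = weights or {node: 1 / n for node in nodes}
    scores = dict(weights)
    for _ in range(iters):
        new = {}
        for node in nodes:
            # each page splits its score evenly across its outbound links
            inbound = sum(scores[src] / len(graph[src])
                          for src in nodes if node in graph[src])
            new[node] = (1 - d) * weights[node] + d * inbound
        scores = new
    return scores

links = {"/": ["/a", "/b"], "/a": ["/"], "/b": ["/", "/a"]}
plain = pagerank(links)
# bias toward /b, as if it converted disproportionately well
biased = pagerank(links, weights={"/": 0.2, "/a": 0.2, "/b": 0.6})
```

Comparing `plain` to `biased` shows where authority would pool if the site’s link graph were judged by conversion behavior rather than raw topology; CheiRank is the same computation run on the reversed link graph.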
Re conversions, where do well worn conversion paths leak links? idk.
More thoughts there as they develop later. Onward.
6. Start enriching
I’ll look at categories, tags, author URLs, and pagination, and add labels to those URLs, or add new taxonomy relationships based on something I can filter by or get at – like “tag” nodes that relate to the URLs linked from those tag pages. MERGE here is a Cypher keyword meaning MATCH if it exists or CREATE if it doesn’t.
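A hedged sketch of what that looks like in Cypher (the file name, labels, and property names are all illustrative, not my actual schema):

```cypher
// Illustrative only: relate a :Tag node to every URL its tag archive links to
LOAD CSV WITH HEADERS FROM 'file:///tag_links.csv' AS row
MERGE (t:Tag {name: row.tag})        // MATCH if it exists, CREATE if not
MERGE (u:URL {address: row.url})
MERGE (t)-[:TAGS]->(u);
```

Because every clause is MERGE, re-running the load is idempotent: it won’t duplicate tags, URLs, or relationships.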
6.1 Take a first pass at taxonomies
I actually try to get a first pass from the client upfront. We walk through it. I ask them what they think their well-worn user journeys are. You’re the expert, and while we all have trouble putting things plainly on our sites, it always seems to come out naturally in conversation.
The process here varies based on the situation. If you have categories and tags, that actually gives me more to go on. I analyzed a site with 4k articles and just getting a count of pages with what tags was helpful for eyeballing some potential top level categories:
Filtering on “catalog,” I can see an array of tags that could be mapped with some ease to a series of well organized subcategories.
On the other end of the spectrum, I have 200 some posts and no reliable categories or tags – at least not topical ones. This is where the graph database and all that keyword extraction and mapping comes in handy.
7. Make rules for what gets kept, redirected, consolidated.
Here is also where I’ll start labeling URLs based on features. Low word count articles might get a “:TooShort” label.
Here is an example of labeling rulesets for content features:
Remember these labels are context specific. You may care more about word count and less about session duration, or have different step ranges that would be more appropriate.
Some of these are easy to add onto an internal_html.csv export from ScreamingFrog, right? I don’t need Neo4j for this part, so I may do this in spreadsheets going forward, then import those features as labels on the URL nodes and add the scores as properties.
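A sketch of such a ruleset as a plain Python function (the thresholds and field names are invented – tune them to your own context):

```python
def label_url(page):
    """Map a page's metrics to content-feature labels. Illustrative
    thresholds only; what counts as 'too short' is context specific."""
    labels = []
    if page.get("word_count", 0) < 300:
        labels.append("TooShort")
    if page.get("sessions", 0) == 0:
        labels.append("NoTraffic")
    if page.get("avg_session_duration", 0) >= 120:
        labels.append("Engaging")
    return labels

page = {"url": "/old-post/", "word_count": 180, "sessions": 0,
        "avg_session_duration": 0}
print(label_url(page))  # ['TooShort', 'NoTraffic']
```

Run it over every row of the export, then import the resulting labels onto the URL nodes.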
Now we’ll have lots of URLs with labels like:
So what do you do there? idk. I’m still working on that. And it’s very context dependent: how big do you want your sweeping cuts to be? Low-performing posts on topic x should get deleted, but you simply don’t have enough on topic y, and you need that category for your users, so you can’t just not have a core category represented.
Here is my current approach.
I create rules based on those labels like this:
The filters are meant to protect outlier situations. If all content related to topic x is underperforming and there isn’t a ton on that topic, but it’s important for organization – like we need that place – it’s like an endangered species that needs to be protected and given extra attention and resources.
Back to graphing next.