This is a nice peek into how powerful graphs are for site audits, and at how these capabilities are coming along, for anyone interested in a content audit + ongoing recommendations service.
I’m chipping away at feature parity with some of the existing smart crawl tools out there.
If you aren’t familiar, the term “feature parity” just means adding or rebuilding features from scratch that already exist in competing products (or a previous version). Like how Facebook added “Stories” to compete with Snapchat’s self-destructing content.
- How crawl reporting works
- Rich ScreamingFrog crawl data unlocked
- Find server errors, page not found errors, etc.
- Find pages with missing h1 tags
- Find pages with missing *whatever* tags
- Finding duplicate tags
- What it means when crawl reports differ
- Find duplicate h1 or h2 tags
- Find redirects to invalid URLs
- Find internal redirects to non-valid URLs like 4xx, 3xx, etc.
- Find pages with internal http link references
- Check for exact duplicate pages
- Lots of crawl tools can do this – so what?
- Enriching crawl data with analytics data
- Google Analytics + Neo4j
- Crawl data + Google Search Console + Google Analytics
How crawl reporting works
Moz Site Crawl, SEMRush Site Audit, and Ahrefs’ Site Audit, as well as other crawl tools I’m sure, are simple crawlers with some basic logic applied to generate a hierarchy of issues:
- errors / critical issues
- warnings
- notices
The problem with this hierarchy is that on one side it’s opinionated about what’s important, but on the other side it’s really about confidence. The less “sure” the tool is that something is an issue, the more likely it is to be called a notice than a critical error.
Crawl reports are almost always technical
Most of the reported errors or critical items are things like:
- error pages (4xx)
- redirects
- missing or duplicate tags (h1, meta descriptions, title tags)
- canonical issues
- thin content issues
And then typically you can filter by subdomain, or drill down into a specific issue by page. There is also some simple logic like, “if a page redirects to an error page, that can’t be right, so call it a critical error.”
All of these things are pretty easy to do with our own setup (though, to be fair, ScreamingFrog is doing the heavy lifting to get the data).
Okay, “cool,” but why?
I have no interest in replacing crawl reports. They serve a purpose: giving a sense of how many different types of issues there are, and hints as to where to look.
But I’m more interested in:
- organizing and prioritizing issues (how I want)
- finding the actual culprits behind issues as quickly as possible (by analyzing how those things are related)
- evaluating found issues in light of other data (like actual rankings)
- uncovering whether issues are in fact issues, or just noise
A year or so ago I watched a webinar by Dominic Woodman of Distilled, called A Year of SEO Split Testing Changed How I Thought SEO Worked. He shared findings from a series of SEO tests, applying changes to chunks of some pages while leaving others the same. Some tests were typical things we often do and some not so much – like things only done due to client insistence.
In most cases, I guessed the test results wrong, mostly because applying the same changes to different websites had different effects.
This is the world we live in. Something can increase traffic on a 100k page ecommerce site and tank it on a 2k page online publisher site.
Things are just… complicated.
Where crawl tools fall short
Crawl report tools fall short because they are opinionated about what’s important, but they aren’t that smart. For example, in literally every crawl tool ever, I can see something like “number of missing h1 tags.”
This alone means very little. The W3C HTML5 spec and Google have basically said they don’t care if you use lots of H1s on a page, or H2s instead of H1s, or whatever. What matters instead is the context, e.g. how you use headings relative to each other, so:
- h1 “SEO”
- h2 “Keyword Research”
- h3 “Long tail vs short tail”
- h2 “Link Building”
- h2 “Keyword Research”
is effectively the same as:
- h3 “SEO”
- h4 “Keyword Research”
- h5 “Long tail vs short tail”
- h4 “Link Building”
- h4 “Keyword Research”
Or so the SEO’s thinking goes.
Given that context, I can eyeball a report and know to ignore the “missing H1” items, knowing it’s actually not some big critical issue.
But if your expertise lies elsewhere and you aren’t interested in learning a bunch of SEO things, you can’t. It’s cheaper for you to swap an <h2> for an <h1> in your post.php template than to spend time figuring out whether it matters.
Aside: Having written all that, I’m actually not positive how “relative” heading tags shake out in the wild. Google says a bunch of things don’t matter that turn out to (like subdomains vs subdirectories).
Given the “relative value” concept about h tags on a page, what can we look at that does matter? What would be properly actionable? Let’s put a pin in that question and get a feel for what core things we can do first.
Rich ScreamingFrog crawl data unlocked
Find server errors, page not found errors, etc.
We can see all non-valid URLs by status code with a simple query:
MATCH (a:URL) WHERE a.status_code <> 200 RETURN a.address, a.status_code
We can use the same logic to look for 4xx pages, by checking for all status codes between 399 and 500:
MATCH (a:URL) WHERE 399 < a.status_code AND a.status_code < 500 RETURN a.address, a.status_code
Ditto for 5xx and 3xx.
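If we want one view across all of those, we can bucket non-200 URLs by status class in a single query (a sketch using the same URL nodes; the integer division is just a quick way to group 3xx/4xx/5xx):
MATCH (a:URL) WHERE a.status_code <> 200
RETURN a.status_code / 100 AS statusClass,
count(a) AS pages,
collect(a.address)[..5] AS sampleURLs
ORDER BY statusClass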
Find pages with missing h1 tags
Valid pages without H1 tags are also easy to find in Neo4j:
MATCH (a:URL) WHERE a.status_code = 200 AND a.h1_1 = ""
RETURN a.address
Technical aside: if you’re building your own, keep in mind that graphs prefer not to have empty properties loaded. When I modeled the data, I wanted to call out “this is missing” and decided to count the absence of h1s, metas, etc., as data. If you follow the graph paradigm and don’t add empty properties in this situation, you’d use a query like this instead:
MATCH (u:URL) WHERE NOT EXISTS(u.h1_1) RETURN u
Moving on. A simple a.address CONTAINS 'www' filter shows us how many are missing on www – ditto when we check our subdomain membership site (nation.domain.com). Removing AND a.h1_1 = "" shows us the total pages on www (142), so we know 118 of 142 pages are missing H1s.
Let’s see where H2s are missing:
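Here’s a sketch of that check (it assumes the loader stores ScreamingFrog’s H2-1 column as h2_1, mirroring the h1_1 naming above):
MATCH (a:URL) WHERE a.status_code = 200 AND a.h2_1 = ""
RETURN a.address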
We now know that pages are more likely to be missing H2s and posts (both on www. and nation. subdomains) are more likely to be missing H1s.
Find pages with missing *whatever* tags
At this point we can literally check counts of anything missing or matching a configuration on a valid page (as long as we pulled it from the CSV into the database).
Right now our script loads most of the URL properties included in ScreamingFrog’s internal_html.csv columns:
- Address
- Content
- Status Code
- Status
- Indexability
- Indexability Status
- Title 1
- Title 1 Length
- Title 1 Pixel Width
- Meta Description 1
- Meta Description 1 Length
- Meta Description 1 Pixel Width
- Meta Keywords 1
- Meta Keywords 1 Length
- H1-1
- H1-1 length
- H1-2
- H1-2 length
- H2-1
- H2-1 length
- H2-2
- H2-2 length
- Meta Robots 1
- X-Robots-Tag 1
- Meta Refresh 1
In the snippet above we can replace a.h1_1 with any node property and pull counts and lists of pages.
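For example, swapping in the meta description property gives us every valid page missing one, plus a count (a quick sketch; meta_description_1 is the same property used in the duplicate check below):
MATCH (a:URL)
WHERE a.status_code = 200
AND a.meta_description_1 = ""
RETURN count(a) AS missingCount,
collect(a.address) AS URLs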
Finding duplicate tags
This is a useful feature of crawl report tools. Duplicate title tags, meta descriptions, and content are pretty common, and all these tools report on those things.
Ironically though, most crawl reports don’t provide duplicate counts for other items, like heading tags.
Here’s how we can quickly check for duplicate properties with a simple query:
MATCH (a:URL)
WHERE a.address CONTAINS 'domain.com'
AND a.status_code = 200
WITH a.meta_description_1 as md,
count(a.meta_description_1) as dups,
collect(a.address) as URLs
WHERE dups > 1
RETURN md, dups, URLs ORDER BY dups DESC
HT to Joel D. for this data loading test unit snippet I hacked at to produce the above.
Here’s the summary – a count of duplicate meta descriptions:
And here is the list of URLs, thanks to the COLLECT function, e.g. COLLECT( address ) AS URLs, which consolidates all associated URLs.
If you’re an SEO, you’ll know how convenient it is to see the actual meta descriptions and the list of URLs, sorted by the count of duplicates.
With Neo4j, we can export all this exactly however we want. Meanwhile, most crawl reports look like this:
Ok, the list shows 20 duplicates for each of those three pages – but are they all the same? Or do we just happen to have multiple pages that have the same number of duplicates (possible given the rampant dynamic content issues on typical CMSs)?
No way to know at a glance.
To find out, you have to click an item to see its duplicates, make a mental note, then click another item to see if it has matches, one by one:
I’m not knocking SEMRush here. All crawl tools use a relational database format like this, making what we want hard to get at.
When you grow up with the world being analyzed in spreadsheets and tables, you shrug this off as normal. But it’s not normal anymore. Seemingly overnight, this approach has become outdated AF.
What it means when crawl reports differ
Here’s an interesting question: why are we showing more “duplicate meta descriptions” than SEMRush? It turns out they smartly exclude canonicals and pages missing meta descriptions:
By default if I have 45 pages with no meta descriptions, they’ll be counted as duplicates because they are all empty values.
We can fix our query quite easily: exclude missing meta descriptions with AND NOT a.meta_description_1 = "" and exclude canonicals with AND NOT (a)-[:REL_CANONICAL]->(). That means we should get the exact same results as SEMRush, right?
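Putting those together, the revised query might look something like this (same schema as before):
MATCH (a:URL)
WHERE a.address CONTAINS 'domain.com'
AND a.status_code = 200
AND NOT a.meta_description_1 = ""
AND NOT (a)-[:REL_CANONICAL]->()
WITH a.meta_description_1 as md,
count(a.meta_description_1) as dups,
collect(a.address) as URLs
WHERE dups > 1
RETURN md, dups, URLs ORDER BY dups DESC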
It’s closer.
We agree on all of the rows with 2 duplicates. But the nation subdomain duplicates I have as 47 and they have as 20 – even though the crawls were only a few days apart.
I exported the SEMRush data to see which 20 they were reporting on, but it turns out the CSV export only gives you 10 URLs. Might be a bug. *shrug emoji*
I’ll chalk it up to some filter or limit SEMRush has configured that I don’t – for example, I configured my crawl to follow nofollow links, so I probably found more pages, while SEMRush’s crawler probably “respected” those directives.
But what about duplicate heading tags? That’s useful, but not typically reported on… And as I care about NLP and content things, I want to know that stuff.
Find duplicate h1 or h2 tags
MATCH (a:URL) WHERE a.address CONTAINS 'everspaces.com'
AND a.status_code = 200
AND NOT a.h1_1 = ""
AND NOT (a)-[:REL_CANONICAL]->()
AND a.indexability = 'Indexable'
WITH a.h1_1 as md,
count(a.h1_1) as dups,
collect(a.address) as URLs
WHERE dups > 1
RETURN md, dups, URLs
This query didn’t turn up much except a trailing vs non-trailing slash issue (without a canonical listed) on our Podia membership site at the nation. subdomain – but it’s a good query because it shows how we can easily constrain our questions to only look at things that matter (e.g. 200 OK, non-canonicalized, indexable pages with duplicate h1s).
Find redirects to invalid URLs
I wrote yesterday about this. We can pretty easily find all internal redirects as well as where those redirects live so we can change them.
Find internal redirects to non-valid URLs like 4xx, 3xx, etc.
It’s pretty easy to chain together relationships like page A links to page B that redirects to page C where page C is not a valid (status OK) URL:
MATCH (a:URL)-[lt:LINKS_TO]->(b:URL)-[r:REDIRECTS_TO]->(c:URL)
WHERE c.status_code <> 200
RETURN b.address AS fromURL,
lt.anchor AS linkText,
c.address AS redirectedURL,
count(lt) AS inlinkCount,
collect(a.address) AS inlinks
ORDER BY inlinkCount DESC LIMIT 10
The first MATCH line is doing all the work; everything after RETURN is just us saying how we want to see that data and which properties to include.
Find pages with internal http link references
This finds pages that still contain links pointing at internal http:// URLs, along with which URLs they reference:
MATCH (a:URL)-[lt:LINKS_TO]->(b:URL)
WHERE b.address CONTAINS 'http://'
WITH a.address as src, count(lt) as links, collect(b.address) as dest
RETURN src, links, dest ORDER BY links DESC
Check for exact duplicate pages
ScreamingFrog records a hash:
Hash – Hash value of the page using the MD5 algorithm. This is a duplicate content check for exact duplicate content only. If two hash values match, the pages are exactly the same in content. If there’s a single character difference, they will have unique hash values and not be detected as duplicate content.
https://www.screamingfrog.co.uk/seo-spider/user-guide/tabs/
MATCH (a:URL)
WHERE a.address CONTAINS 'domain.com'
AND a.status_code = 200
WITH a.hash as hash,
count(a.hash) as dups,
collect(a.address) as URLs
WHERE dups > 1
RETURN hash, dups, URLs
But in many cases duplicate content is a fact of life. So we are less concerned with “is there duplicate content?” and more concerned with “is there duplicate content without proper use of canonicals?”
MATCH (a:URL)<-[rc:REL_CANONICAL]-(b:URL)
WHERE a.address CONTAINS 'domain.com'
WITH a.hash as hash,
count(a.hash) as dups,
a.address as destURLs,
collect(b.address) as sourceURLs
WHERE dups > 1
RETURN hash, dups, destURLs, sourceURLs
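To flip that around and surface the exact-duplicate groups that don’t declare any canonical at all, a variation like this should work (a sketch using the same hash property and REL_CANONICAL modeling):
MATCH (a:URL)
WHERE a.address CONTAINS 'domain.com'
AND a.status_code = 200
AND NOT (a)-[:REL_CANONICAL]->()
WITH a.hash as hash,
count(a.hash) as dups,
collect(a.address) as URLs
WHERE dups > 1
RETURN hash, dups, URLs ORDER BY dups DESC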
Lots of crawl tools can do this – so what?
If you somehow made it this far, you might still be thinking this.
Let’s take all the flexibility we have to ask whatever we want and get that data returned exactly how we want, and let’s layer in another data source:
Now we can start to prioritize issues and opportunities.
Enriching crawl data with analytics data
Google Search Console data in Neo4j
Screaming Frog makes this easy with an API integration. Of course, I want more than click, impression, and CTR counts (like loading the actual keyword phrases as relationships to my URL nodes), but that’s still on the wish list.
What would be cool to see that we could otherwise never see? How about:
What redirect URLs are still showing up in Google after a big CMS migration vs pages they are redirecting to
Let’s say we did a big migration from one CMS to another with lots of redirects to remap URL structures in the process.
And we want to see how Google is responding – essentially getting a status report on where the traffic is being routed.
That’s really hard data to get at. I mean, you could: with unlimited time and resources, we could dig through server log requests for a list of old URLs and line them up with URLs from GA.
But for getting a quick status report, that’s not going to work – data will be too old by the time it’s ready.
Neo4j to the rescue.
We can generate a query to look at something like which redirect URLs are still generating impressions in search. We can also return the impressions on the pages they redirect to:
MATCH (x:GSCReport)-[hr1:HAS_REPORT]-(a:URL)-[]->(b:URL)-[hr2:HAS_REPORT]-(y:GSCReport)
WHERE a.status_code = 301
AND x.impressions > 100
RETURN x, hr1, a, b, hr2, y LIMIT 5
This took just a moment to write and the data even looks frickin’ cool:
Some context: light green nodes are URLs, dark green are GSC reports, red lines are redirects, yellow lines are canonicals. Ignore the self-referencing links for now.
I can return a list of old to new URLs, ordered by the biggest difference in impression counts in Google search results.
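A tabular version might look something like this (a sketch: it assumes REDIRECTS_TO relationships between the old and new URLs, and numeric impressions properties on the GSC report nodes):
MATCH (a:URL)-[:HAS_REPORT]->(x:GSCReport),
(a)-[:REDIRECTS_TO]->(b:URL)-[:HAS_REPORT]->(y:GSCReport)
WHERE a.status_code = 301
RETURN a.address AS oldURL,
b.address AS newURL,
x.impressions AS oldImpressions,
y.impressions AS newImpressions,
x.impressions - y.impressions AS impressionGap
ORDER BY impressionGap DESC LIMIT 25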
This data never used to exist!
Okay, it “did,” but effectively it was too hard to get at to be worth the effort. We just don’t think in terms of “querying relationships of relationships of relationships” unless we’re talking about something linear like “3rd degree connections” on LinkedIn.
So you might never think to check.
Google Analytics + Neo4j
How many non-canonicals are driving organic traffic?
Google Search Console will give us a CSV dump of URLs where Google did not pick the canonical we set. But something like this will sort the pages we did not intend for Google to interpret as the primary page by how much search traffic they are drawing.
MATCH (a:URL)<-[rc:REL_CANONICAL]-(b:URL)-[hr:HAS_REPORT]->(gar:GAReport)
WHERE a.address CONTAINS 'littletikes.com'
AND a <> b
AND gar.ga_sessions > 0
WITH a.address as canonical,
b.address as nonCanonical,
gar.ga_sessions as hits
RETURN canonical, nonCanonical, hits ORDER BY hits DESC
To me, this one is a lot of fun.
Like most of these little snippets, I could get this info by exporting data from multiple sources, using filters and pivot tables, and matching things up in spreadsheets, but most people (myself included) would be too lazy to do this.
The cost to get this data was just too high unless you’re already aware of a major canonicals issue and have a sizeable budget with decent technical resources.
But this takes one saved snippet: run it, and I can drill down further, or simply return the MOST important items to be addressed. If you have limited resources, this ends up being a much more productive approach.
We can make it even more useful by showing traffic to both the canonical and the non-canonical versions and sorting by the largest discrepancies.
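Here’s a sketch of that comparison (the OPTIONAL MATCH and coalesce are there so canonical pages with no GA report still show up as zero):
MATCH (a:URL)<-[:REL_CANONICAL]-(b:URL)
WHERE a <> b
OPTIONAL MATCH (a)-[:HAS_REPORT]->(garA:GAReport)
OPTIONAL MATCH (b)-[:HAS_REPORT]->(garB:GAReport)
WITH a.address AS canonical,
b.address AS nonCanonical,
coalesce(garA.ga_sessions, 0) AS canonicalHits,
coalesce(garB.ga_sessions, 0) AS nonCanonicalHits
WHERE nonCanonicalHits > 0
RETURN canonical, nonCanonical, canonicalHits, nonCanonicalHits,
nonCanonicalHits - canonicalHits AS gap
ORDER BY gap DESC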
Surfacing landing pages with highest and lowest resulting average session durations
ScreamingFrog, when connected to the Google Analytics API during a crawl, can be configured to layer in some key metrics by specific dimensions.
For SEO purposes, we can return the average session duration by landing page (filtered to organic traffic), and export it with our crawled URL data as a column in the format h:mm:ss (e.g. 0:00:10 is 10 seconds).
To be honest I spent about half a day just figuring out how Neo4j now handles temporal data to get this as a duration property that we could run queries against.
The hours, minutes, and seconds needed to be parsed, converted to integers, and converted into a temporal (but queryable) format. But now all that legwork is done, forever! It will live right here, and no one will ever need to do this again.
LOAD CSV WITH HEADERS FROM "file:///internal_html.csv" AS sfia
WITH sfia, [item in SPLIT(sfia.`GA Avg Session Duration`, ':') | toInteger(item)] AS durComponents
MERGE (u:URL {address: sfia.`Address`})
MERGE (gap:GAReport:Report { address: sfia.`Address`})
SET gap.ga_avg_session_duration =
CASE sfia.`GA Avg Session Duration`
WHEN "" THEN null
ELSE duration({
hours: durComponents[0],
minutes: durComponents[1],
seconds: durComponents[2]
}) END
MERGE (u)-[hr:HAS_REPORT]->(gap);
Why is this data potentially useful?
It’s a good proxy for measuring the relative quality of traffic to pages.
Let’s say you have 500 pages. You want to know which ones are resulting in more engaged interactions with your site and which are falling short of what users were expecting.
With this loaded into Neo4j, I can ask, “how many pages had average sessions of more than three minutes?” (who’s really spending some time on the site, and where are they coming through?)
MATCH (gar:GAReport)<-[hr:HAS_REPORT]-(u:URL)
WITH gar.ga_avg_session_duration as dur,
u.address as addy
WHERE dur.seconds > 180
RETURN dur.seconds, addy ORDER BY dur.seconds DESC LIMIT 5
Or I can ask, “What category landing pages are resulting in users leaving after under 15 seconds on average?”
MATCH (gar:GAReport)<-[hr:HAS_REPORT]-(u:URL)
WITH gar.ga_avg_session_duration.seconds as dur,
u.address as addy
WHERE dur < 15
AND dur IS NOT NULL
AND addy CONTAINS 'category'
RETURN dur, addy ORDER BY dur ASC LIMIT 10
We can filter URLs by structure to look at just /tag/ pages or just /blog/ pages. If our URL structure doesn’t have that information, we can add those labels to target more effectively.
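As a quick sketch of that labeling idea (the :BlogPost label name is just illustrative), we can tag nodes based on the URL path:
MATCH (u:URL)
WHERE u.address CONTAINS '/blog/'
SET u:BlogPost
After that, any of the queries above can start from MATCH (b:BlogPost) instead of filtering on address strings every time.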
Crawl data + Google Search Console + Google Analytics
Check for crawl bloat, index bloat
Index bloat is a theoretical concept: you compare the number of pages on your site being indexed against the number of pages receiving traffic from Google.
A simple way to calculate it: index bloat = pages receiving traffic / pages indexed in Google
Same goes for crawl bloat: the number of indexable pages vs the number of indexed pages. Both are easy to pull from the graph we already have:
Get count of indexable pages from SF crawl
MATCH (a:URL)
WHERE a.indexability = 'Indexable' AND a.address CONTAINS 'https://'
RETURN count(a)
Get count of indexed pages in GSC
MATCH (a:URL)-[hr:HAS_REPORT]->(b:GSCReport)
WHERE b.impressions IS NOT NULL
RETURN count(b)
Get count of pages driving traffic in GA
MATCH (a:URL)-[hr:HAS_REPORT]->(b:GAReport)
WHERE b.ga_sessions > 0 AND a.address CONTAINS 'https://'
RETURN count(a)
To understand the full context, we need to know the number of indexable pages, the number of indexed pages, and the number of pages drawing search traffic or otherwise being used in a worthwhile way.
But that number is arbitrary without context or benchmarking.
We can take that further and create our own thresholds. We can calculate: index efficiency score = pages receiving more than x sessions over a period / pages indexed in Google.
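Here’s a rough sketch of that calculation in one query. The 10-session threshold stands in for x, and “indexed” is approximated as “has impressions in GSC,” as above:
MATCH (u:URL)-[:HAS_REPORT]->(gsc:GSCReport)
WHERE gsc.impressions IS NOT NULL
WITH count(DISTINCT u) AS indexedPages
MATCH (u2:URL)-[:HAS_REPORT]->(ga:GAReport)
WHERE ga.ga_sessions > 10
WITH indexedPages, count(DISTINCT u2) AS activePages
RETURN activePages, indexedPages,
toFloat(activePages) / indexedPages AS indexEfficiency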
We can also look at other patterns that may be associated with low to no traffic pages being indexed.
Once we start thinking about how things are connected, or might be connected, or should be connected, the possibilities are far outtt.