I spent about 20 hours trying to figure out how to properly load some crawl data into a graph database over the past two weeks.
Once I hit bottom and gave up, I got the answer within 5 minutes of asking the Neo4j slack community. I had made an assumption that blank meant null. Instead of
WHEN NOT NULL I needed something like
WHEN NOT "".
Internal redirects are inefficiencies
You link to a product page 50 times over the years, but at some point that product page URL changed, now you have 50 internal redirects.
You added a redirect so users don’t know the difference, but there are crawl efficiencies and depending on the link value model you subscribe to, potential decay in the power of a link that runs through an internal redirect (Google says this doesn’t happen directly but they also recommend minimizing redirects for mobile speed/performance).
If you’re an expert with a big site, chances are you would never see this problem, let alone be able to quickly generate a list to prioritize the worst culprits for quick fixes.
What internal redirects are the most frequent and where can I find them?
Within a few minutes of having Neo4j community save my butt and my URL1 > REDIRECTS_TO > URL2 properly loaded, I cobbled this query together:
MATCH (a:URL)-[lt:LINKS_TO]->(b:URL)-[r:REDIRECTS_TO]->(c:URL) RETURN b.address AS fromURL, c.address AS redirectedURL, count(lt) AS inlinkCount, collect(a.address) AS inlinks ORDER BY inlinkCount DESC LIMIT 10
This shows the most common internal redirects (sorted by most used), the from and to where URLs, as well as a list of pages with the redirecting links.
If one template link has no trailing slash and you can just add a trailing slash to fix 117 redirects, that’s pretty neat.
And we can actually do a lot more, not just with the data we’re asking for, but also with what we return:
Check for redirects going to invalid pages
We can filter our prioritized list by redirects that don’t end at a 200 (eg. errors and redirect chains).
MATCH (a:URL)-[lt:LINKS_TO]->(b:URL)-[r:REDIRECTS_TO]->(c:URL) WHERE c.status_code <> 200 RETURN b.address AS fromURL, lt.anchor AS linkText, c.address AS redirectedURL, count(lt) AS inlinkCount ORDER BY inlinkCount DESC LIMIT 10
This shows the redirecting URL culprit, where it goes, the referring pages to fix, and the anchor text for quick finding, AND sorts by most culprit URLs:
From too long to two minutes
I’ve been doing SEO for about 7 years, and I can honestly say, in the past, I would have just sent a list of all redirects to fix, and would not at all have wasted my time trying to prioritize fixes. The data was simply too hard to get at for the theoretical value.
But now it takes a minute to run and export a prioritized list. You can fix an issue causing the largest number of issues in another minute.