- Why scrape data you already have login-based access to
- Generate an initial list of paginated pages without crawling them to get it
- Choose what HTML elements to extract from seed URL pages
- Targeting what DOM elements we want to scrape with CSSPath
- Using Screaming Frog Custom Extraction with CSSPath for targeting
- Important! Exclude pages or URL patterns that could mess with data
- Crawl seed URLs in list mode
- Crawl an iframe if what we want is inside?
- Final CSSPath Extraction Settings for 5 main report pages
This guide is for logging into a platform and scraping the data you want from it. The use case we are following is logging into my email software (ConvertKit) and scraping the engagement data from it. Along the way, I provide tips, examples, and alternative options for finding whatever you might be looking to scrape.
Why scrape data you already have login-based access to
In a perfect world, we’d have easy access to the data hidden behind our logins on the websites we use: news and posts from our connections on LinkedIn and Twitter, data about the emails we’ve sent in Convertkit, and so on.
You can often get some of the data, whether by requesting it or finding a CSV export button, but frequently it’s not all of what you want.
The use cases for crawling and scraping data only available when you are logged in somewhere (where it’s your data) are near infinite. That said, some tools’ terms of use don’t allow scraping. They will flag your account if they think you are violating the terms, or put a temporary lock on your ability to access those pages. With those tools, it’s often more cost-effective to just pay for API access to their data. This post is not about how to game that type of login; it’s about getting your own data out of accounts when it’s not accessible by export or API.
ScreamingFrog is my crawl tool of choice, and it has a method for crawling password-protected websites.
If you’re just looking for how to authenticate with Screaming Frog, click that link above. This post is a deeper end-to-end walkthrough, from authenticating to getting your data out in a format you can use, and it covers some caveats and considerations for getting the most out of that process – logging in, what URLs to crawl, ways to include and exclude link clicks and pages to protect your data and avoid doing unintended things, as well as how to fine-tune extracting just the data that you want.
The example use case we’ll use is getting your content’s engagement data out of Convertkit.
The problem with Convertkit (one of the many problems) is a common one. You aren’t able to export all the data you want without tons of manual action.
Crawling and scraping engagement data out of Convertkit allows us to view all emails sent, create a summary of average opens, or sort by engagement metrics like clicks or open rates or even unsubscribes.
Poking around Convertkit logged in, we see a Broadcasts ( /campaigns ) page with a list of sent emails and info about them:
Clicking on a “View Report” link, we then see some stats and a right-hand sidebar with pages for the email’s content, recipients, opens, clicks, and unsubscribes.
Planning what we’ll crawl and scrape looks like this:
Broadcasts (sent emails) > Report summary > Report data pages
With data like this, I could even look at patterns in engagement by certain emails based on subject lines / implied topics of posts.
Authenticate in our Crawl Tool (ScreamingFrog)
For our purposes, we’ll be using ScreamingFrog, a freemium crawl tool. Go to Configuration > Authentication.
On the next modal, click the “Forms Based” tab, and click “Add” (bottom left).
On that page you’ll see a warning note:
WARNING! Potential data loss. If you log into a website and then crawl it, the SEO Spider will click every link on the page; this includes links to add or delete data. See here for information on our exclude feature.
Screaming Frog UI > Configuration > Authentication
Some portals have buttons or text links that alter data (think “Delete Tweet”) that we want to avoid. This is important, so we’ll cover it in more detail once we’re logged in.
Enter your login URL (you want a page with the login form on it).
Hit “OK” and wait for the page to load in the next modal window:
Enter your login info, log in and then hit “OK” again (bottom right this time).
Ensure you do not delete data accidentally by crawling willy-nilly
It’s super important that we inspect the pages we want to crawl and make sure that we are not sending the crawler to button events that trigger deletion of content, data, or subscribers. Basically, we want to be very fine-tuned in what we ask of our crawler and make lots of exclusions, or conversely, only include the pages we want.
The first thing I did was choose which pages I wanted to crawl and extract from. Normally, I’d crawl everything in one shot, with different extractions for different types of pages lumped together, but it makes sense to take our time and start with collecting the list of page URLs with the most important data.
Generate an initial list of paginated pages without crawling them to get it
The pages I want are the “View Report” URLs. But they are hidden under 21 pages of pagination with no way to grab all the URLs at once. If I run a crawl, I potentially lose control of where the crawler is crawling.
So instead we can generate a list of the paginated URLs and then extract the “View Report” URLs from them in a first round.
We can do this with js, python, etc., but I find Google Sheets sufficient:
By pasting =A1+1 down the first column, I can generate the page numbers up to the 21st page, and then by adding ="https://app.convertkit.com/campaigns?page="&A2 to column B (and filling down), I can generate that list of links.
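If you’d rather script it, here’s a minimal Python sketch doing the same thing as the formulas above (the page count of 21 comes from the pagination we saw in the Convertkit UI):

```python
# Minimal sketch: generate the 21 paginated Broadcasts URLs without crawling.
base = "https://app.convertkit.com/campaigns?page="
urls = [f"{base}{page}" for page in range(1, 22)]
print("\n".join(urls))
```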
For this step I now don’t have to worry about accidentally crawling anything outside of the listed URLs that might damage/delete something unintended.
We can crawl in “list mode” for just our generated URLs and know that we will only be crawling one page template type, ensuring we don’t accidentally delete or alter data.
Choose what HTML elements to extract from seed URL pages
Sidenote on collecting relational data: If we want to map relationships between users based on their engagement with our content, it might be easier to get that data from the subscriber’s activity page rather than the actual email stats page.
It just depends on the amount of content, the number of subscribers, and what data you care about. Enriching a CRM? Focus on users. Doing content strategy? Focus on content metrics.
Here I can scrape the “View Report” URLs for my next step, as well as the subject line, date and time sent, number of recipients, open rate, click rate, clicks, and unsubscribes.
For further info, like who opened what, we will have to drill down into specific report tabs. More on that in a moment.
The most important thing here is to pick out the patterns in the DOM that you want to target. Once you’re able to identify a repeatable pattern to be scraped on each URL, you can choose a targeting method.
Targeting what DOM elements we want to scrape with CSSPath
With a little more of a frontend background, I’m a bit more comfortable with CSS than XPath, so sometimes I will use CSSPath. Instead of Scrape Similar, I will right-click what I want and then click “Inspect.”
This brings up Chrome Dev Tools, often with the div or tag selected already.
As long as the intended element is selected in the DOM tree, I can often just click on the little “+” symbol in the Styles side box (next to “:hov” and “.cls”, top right of the black box below).
If it’s too general, like a link with no specified class, I can find a unique parent element to isolate and then target the child elements I want that way:
Here, you can see that broadcast-list class for the unordered list tag is unique, and that the a tag we want is inside an h3 tag. So targeting this should work: ul.broadcast-list > li > h3 > a
The “>” means immediate child element, so we don’t accidentally catch other a tags that may exist in sub-list items.
That’s the easiest way, as long as the templating gives you a class to hook into that isn’t also being used elsewhere on the page.
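If you want to double-check a selector before handing it to the crawler, a small sketch like this (with made-up markup approximating the Broadcasts list) shows what the child combinator does and doesn’t match:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Made-up markup: the ">" combinator should grab the report link inside the h3,
# but not a link buried in a nested sub-list under the same li.
html = """
<ul class="broadcast-list">
  <li>
    <h3><a href="/campaigns/123/report">My subject line</a></h3>
    <ul><li><a href="/somewhere-else">nested link we do not want</a></li></ul>
  </li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
for a in soup.select("ul.broadcast-list > li > h3 > a"):
    print(a["href"])  # prints only /campaigns/123/report
```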
Sidenote:
When to use CSSPath vs XPath vs Regex when targeting what you want to scrape
Sometimes you don’t get CSS classes or html tags unique to what you want on a page.
For example, getting a parent element with CSSPath is not possible; CSS selectors only traverse down the DOM. Depending on what you are scraping, you might want to use XPath instead.
If an item has a class but its parent doesn’t, and you want the parent element, XPath is going to be able to traverse up the DOM. Similarly, if you want to match on part of a class or other pattern, like <div id="some-constant-###">, then you can match with a ‘starts with’ type pattern in XPath on the “some-constant-” string.
To quickly find which elements to scrape with XPath, I often use the Scraper chrome extension with right-click “Scrape Similar.” It’s a good tool and will (usually) give you a good starting point for getting the XPath pattern you want.
For more complex pattern matching, Screaming Frog also has a regex option. Regex is nice if you just want something very specific.
I once wanted to find all the broken shortcodes on a WordPress site after a client deactivated a “Tweet This” plugin that had been used intermittently (those shortcodes always fall within square brackets). Regex is more reliably fine-tuned for grabbing symbols like “[” … “]” in the text output and conditionally grabbing the strings within them based on a match on “twitter” inside those brackets.
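As a rough illustration of that shortcode example (the sample text and shortcode name here are made up), a regex can conditionally grab bracketed strings only when they mention “twitter”:

```python
import re

# Hypothetical sample text with a leftover "Tweet This" style shortcode.
text = 'Intro paragraph. [tweet_this url="https://twitter.com/..."]Quote here[/tweet_this] More prose.'

# Match anything in square brackets that contains "twitter" somewhere inside.
broken_shortcodes = re.findall(r"\[[^\]]*twitter[^\]]*\]", text, flags=re.IGNORECASE)
print(broken_shortcodes)  # ['[tweet_this url="https://twitter.com/..."]']
```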
(End sidenote on XPath vs CSSPath vs RegEx)
Using Screaming Frog Custom Extraction with CSSPath for targeting
Because we are scraping simple templating and CSS classes exist in the HTML elements containing the text we want, we’re using CSSPath.
In Configuration > Custom > Extraction, we opt for “CSSPath”
And then make sure to add “href” as our HTML element to grab the link, and not the anchor text:
We said earlier that, in Convertkit, the broadcast-list class for the unordered list tag is unique, and that the a tag we want is inside an h3 tag: ul.broadcast-list > li > h3 > a
At this point we could just rip through the crawl, or we could also grab the other elements from these pages so we don’t have to collect them from the report URLs directly; either works.
I’m just doing a quick grab of all the stats links’ text in case it turns out to be more of a pain to get at later in the report structure. This will grab all the link text: ul.broadcast-list > li > div > .broadcast_stats > a
(Extract Text)
Important! Exclude pages or URL patterns that could mess with data
Looking at the page, I don’t see any issues with crawling all links on this page.
But I still want to make sure the crawler doesn’t crawl anything but the page type I am targeting, so I will write an “Include” regex for just that URL pattern, just in case.
In Configuration > Include, I add https://app.convertkit.com/campaigns\?page=[0-9]?[0-9]
This regex pattern should catch all URLs like https://app.convertkit.com/campaigns?page=1 through page=99.
Sidenote: The question mark, at least in (Java) regex, is a reserved operator meaning the preceding pattern is optional, so we use a backslash “\” to indicate that the following character is not the special operator, but that we want the literal “?” from the URL itself.
Then the next question mark in the above pattern means optional, as in the first character group matching 0 to 9, giving us the option of matching a single digit number or double digit number. If we wrote [0-9][0-9] instead of [0-9]?[0-9] or [0-9][0-9]? then we would only match URLs starting from page=10 to 99.
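A quick sketch to sanity-check the Include pattern (the unescaped dots in the original pattern match any character, so it still behaves as intended here):

```python
import re

# Same pattern as the Include rule above; "\?" escapes the literal "?" and
# "[0-9]?[0-9]" allows one- or two-digit page numbers.
pattern = re.compile(r"https://app.convertkit.com/campaigns\?page=[0-9]?[0-9]")

print(bool(pattern.search("https://app.convertkit.com/campaigns?page=1")))   # True
print(bool(pattern.search("https://app.convertkit.com/campaigns?page=21")))  # True
print(bool(pattern.search("https://app.convertkit.com/campaigns?page=")))    # False
```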
Crawl seed URLs in list mode
To reiterate for the skimmers here, we opt for list mode to not crawl all pages associated with the links on the page template we are targeting. We’re doing this to prevent data loss.
Go to Mode > List
Then go to Upload > Enter Manually…
Once in list mode, I like “Enter Manually…” because it allows me to edit the URLs after I’ve pasted them and before I crawl. “Paste” skips the ability to edit; it’s sort of a paste-and-go option.
Now that I have these ready, once I hit next, they should start crawling.
It works! BUT it did not grab the stats. No big deal, we have the report URLs and the date/times now.
Click “Export” (to CSV) then import to Google Sheets or Excel.
You’ll see that columns A through AR are default page info from Screaming Frog unrelated to what we’re extracting (and can be ignored as long as the status code columns are all 200s).
Scrolling to the right, and removing unneeded columns, this is what we’re looking for:
But, we need it in a list format to paste into Convertkit.
Noobs will spin their wheels here, often endlessly copying and pasting into a column in the sheet (I used to do this!). But there are many ways to get a bunch of columns and rows into one list.
I prefer the time saving magic of pasting the whole range into a new file in VSCode (a code editor) and then Find/Replace the tabbed spaces with line breaks using regex match. But you can use this trick with any text editor that allows regex (including Google Docs).
In VSCode, you toggle regex find replace with that light blue box below in the input field on the right side. That is the RegEx “on” button for Find/Replace.
In the Find box, I copied the space between the first report URL and the second. This is a “Tab” character that gets added in place of cell boundaries when the Google Sheets data is copied out or exported to CSV.
By putting “\n” in the Replace box, we are replacing the “Tab” space with the regex special operator for a line break. This also maintains our crawl order, keeping our report URLs in the right chronological sequence:
And now back to Google Sheets to concatenate with app.convertkit.com for full URLs with list in Column A and this in B: ="https://app.convertkit.com"&A1
We’re converting our “/campaigns/123/report” paths to full URLs by prepending the root domain:
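For the scripting-inclined, the same tab-to-line-break split and domain prefixing can be sketched in a few lines (the sample paths below are hypothetical):

```python
# "copied" stands in for the tab-separated report paths pasted out of the sheet.
copied = "/campaigns/111/report\t/campaigns/222/report\t/campaigns/333/report"  # hypothetical sample

paths = copied.split("\t")                                      # tab -> list, preserving order
full_urls = ["https://app.convertkit.com" + p for p in paths]   # prepend the root domain
print("\n".join(full_urls))
```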
Sidenote: I did notice one outlier in the pattern, a non-report URL in my list that was “/campaigns/3446545/steps” – it turns out it was a Broadcast draft I had never sent. And on that steps page there is a Delete button. If you have lots of broadcast drafts, make sure to check for and remove those URLs.
Now we want to inspect the template for the report URLs, which looks like:
If you remember our data plan, we want the email content, recipients, opens, and clicks pages from this report URL template:
In avoiding data loss, we also want to investigate any links/buttons that may delete or alter data on this page.
I noticed the resend to unopens button and duplicate buttons — if clicked, these go to pages that have “Delete” and “Next” buttons, and potentially (but unlikely) could accidentally generate drafts or send broadcasts.
So to avoid that, we want to make sure to exclude URLs ending in /resend and /duplicate from being crawled.
To do that, we go to Configuration > Exclude and add the URL pattern to exclude. The URL https://app.convertkit.com/campaigns/4319719/duplicate has a bunch of numbers in it, so the exclude rule would look like this:
https://app.convertkit.com/campaigns/[0-9]+/duplicate
Sidenote: We use the reserved “+” special operator to indicate that the preceding character group can be repeated any number of times, from once to infinite, catching all 42891379 type numbers.
Let’s go back to Configuration > Custom > Extract and add in the report stats data we want to grab, as well as the links users clicked the most:
.report-stat
.row.mt--30 .col-sm-12 a (Extract Text)
.row.mt--30 .col-sm-12 a (Extract Inner HTML > href)
This worked well on a sample of one URL.
Sidenote: Being extra safe when adding the 209 URLs in, I made sure to hit “Pause” after a couple of URLs were crawled and just poked around in Convertkit to make sure everything still looked in place and I didn’t accidentally do something wrong, like unsubscribing people or spinning up unintended broadcast drafts.
Exporting the crawl CSV and looking in Google Sheets, it’s mostly good:
But we have a fair number of columns like:
top links clicked link 2 | top links clicked text and count 2 | top links clicked link 3 | top links clicked text and count 3 | top links clicked link 4 |
/subscribers?subscribable_ids=1287477168 | https://www.instagram.com/p/CA1Ez5q…(5) | /subscribers?subscribable_ids=1287477170 | https://twitter.com/NetflixIsAJoke/status/1270… (4) | /subscribers?subscribable_ids=1287477167 |
The longer URLs are given a … instead of the full URL. To get the full URL path for the clicked links, I would have to crawl the “top links clicked” links and then scrape the next page with the full URL path of the link and who clicked it.
I can also click on Content to get the full body of the email that was sent. This would show how many links were in the email.
Many emails have a zero click rate, but that’s not a bad thing. A better metric might be clicks as a percentage of the total links available to click. Similarly, we also want to look at distinct list member activity. We don’t want to assume someone only engaged with some emails when, in reality, they joined the list after the earlier emails were sent.
Interesting.
Let’s just collect anything that could be of use so we can ask all the questions we might want later: recipients and opens and clicks and content on a per email basis.
We’re working from 209 of these report URLs: https://app.convertkit.com/campaigns/[0-9]+/report$
We want these URLs:
/campaigns/[0-9]+/report/content
/campaigns/[0-9]+/preview
/campaigns/[0-9]+/report/opens
/campaigns/[0-9]+/report/recipients
/campaigns/[0-9]+/report/clicks
And to make extra sure we exclude these URLs:
https://app.convertkit.com/campaigns/[0-9]+/resend
https://app.convertkit.com/campaigns/[0-9]+/duplicate
To identify what data is associated with what email, we can use the unique report identifier in the [0-9]+ numbers pattern, e.g. the 4319719 found here: https://app.convertkit.com/campaigns/4319719/report.
We need to append /content, /opens, /recipients, and /clicks to our /report URLs for our next list crawl and make sure we are extracting just the data we want from each.
From that /campaigns/[0-9]+/ pattern, we copy and paste the 209 URLs we have into a spreadsheet or code editor and replace /report with the above patterns we want, like /report/clicks, giving us 1,254 URLs (209 reports x 6 page template types).
We are also grabbing /campaigns/REPORT_ID/preview URLs because the /content pages populate the content via an iframe.
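Here’s a rough sketch of that URL expansion; report_urls stands in for the 209 report URLs exported from the first crawl:

```python
# Expand each /report URL into the six page templates we want to crawl.
report_urls = ["https://app.convertkit.com/campaigns/4319719/report"]  # ...plus the other 208

suffixes = ["", "/content", "/opens", "/recipients", "/clicks"]
crawl_list = []
for url in report_urls:
    for suffix in suffixes:
        crawl_list.append(url + suffix)
    # /preview hangs off /campaigns/ID rather than /campaigns/ID/report
    crawl_list.append(url.rsplit("/report", 1)[0] + "/preview")

print(len(crawl_list))  # 6 page types per report, so 209 reports -> 1,254 URLs
```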
Then we will have to grab more URLs from the /clicks page because those subscribers that clicked certain links are not listed on that page as you might expect.
To see the specific pages we are crawling and key elements we want, you can view those templates here:
We are also grabbing subscribers for the /clicks report, but that page doesn’t actually give us the emails that clicked. Instead, it gives us a link to those subscribers filtered by a related ID in this form: https://app.convertkit.com/subscribers?subscribable_ids=####&subscribable_type=Url
For those /clicks pages we’ll have to grab the URL where the clicking subscribers are kept and then get that as well. In addition, we’ll need to maintain the association between those /subscribers pages and their associated reports, because the report ID is not listed in the code on those pages anywhere.
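One way to keep that association is to key the /subscribers links by the report ID taken from the /clicks URL they were scraped from. A sketch, assuming a CSV export where the crawled URL sits in an “Address” column and the extracted subscriber link sits in a column I’ve called “subscribers link 1” (the file and column names are assumptions about your export):

```python
import csv
import re

# Map report ID -> list of /subscribers URLs scraped from that report's /clicks page.
report_to_subscriber_urls = {}

with open("clicks_extraction.csv", newline="", encoding="utf-8") as f:  # hypothetical export file
    for row in csv.DictReader(f):
        match = re.search(r"/campaigns/([0-9]+)/report/clicks", row["Address"])
        link = row.get("subscribers link 1", "")
        if match and link:
            report_to_subscriber_urls.setdefault(match.group(1), []).append(
                "https://app.convertkit.com" + link
            )

# Each key is a report ID; the values are the subscriber-list URLs to crawl next.
print(report_to_subscriber_urls)
```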
Sidenote/Rant: It’s a weird decision by Convertkit that the opens and recipients lists are housed in the email report, while clicks are only available within /subscribers pages, without any hint on those pages of what report they’re associated with. I’d just like to vent here that to get all the email report data we want, which is not at all a lot of data, we have to look at 6 pages per report. This is why we’re scraping: because Convertkit doesn’t respect the idea that we want our own data in an accessible way that doesn’t warrant a 3k-word tutorial to access. End rant.
Crawl an iframe if what we want is inside?
For /content, we see a subject line and body content. Inspecting the subject line reveals .broadcasts-show .preview_subject, .campaign_reports .preview_subject, so we’ll use the second one: .campaign_reports .preview_subject. Inspecting the body content shows that it is in an iframe.
SF Config can be set to include iframes, but I *think* that is for whether to follow those src URLs. If SF has trouble, we can just get the src link for the iframe and throw it on the list of URLs we want data from (https://app.convertkit.com/campaigns/4319719/preview).
Looking at the resulting /preview page, it’s actually a very clean HTML version of the sent email, so it may prove easier to work with than /content, since we should already have the subject line from the email report page title.
Final CSSPath Extraction Settings for 5 main report pages
Copy and paste-able version below, with a quick sanity-check sketch after the list.
body.campaign_reports-show .campaign_reports h3.preview_subject (Extract Text)
body.campaign_reports-show .col-sm-4 .report-stat (Extract Text)
body.campaign_reports-recipients table.table.subscribers tbody tr td > a (Extract HTML Element) href
body.campaign_reports-opens tbody tr td > a (Extract HTML Element) href
.message-content (Extract Inner HTML)
body.campaign_reports-clicks table.table.table-condensed.table--reports td.url > a (Extract HTML Element) href
body.campaign_reports-clicks table.table.table-condensed.table--reports tr > td:last-of-type > a (Extract HTML Element) href
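If you want to sanity-check these selectors outside of Screaming Frog, a rough sketch like this runs them against a page’s saved HTML (report.html is a hypothetical local copy of one report page; only the selectors for that page type will return matches):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# The final CSSPath selectors listed above.
selectors = [
    "body.campaign_reports-show .campaign_reports h3.preview_subject",
    "body.campaign_reports-show .col-sm-4 .report-stat",
    "body.campaign_reports-recipients table.table.subscribers tbody tr td > a",
    "body.campaign_reports-opens tbody tr td > a",
    ".message-content",
    "body.campaign_reports-clicks table.table.table-condensed.table--reports td.url > a",
    "body.campaign_reports-clicks table.table.table-condensed.table--reports tr > td:last-of-type > a",
]

with open("report.html", encoding="utf-8") as f:  # hypothetical saved page source
    soup = BeautifulSoup(f.read(), "html.parser")

for sel in selectors:
    print(f"{len(soup.select(sel)):>3} matches  {sel}")
```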
One-click download my ScreamingFrog configuration file (crawl-convertkit-internals.seospiderconfig – generated on Mac OS X with SF v13.1).
Running our ~1000 URLs through SF, we get an exportable file where each URL has the data as intended:
The only thing we are missing is subscriber email addresses. It looks like Convertkit tries to protect this data on certain page types with something like <span class="__cf_email__" data-cfemail="ab1c5dedcf1c2d3c2">[email protected]</span> in the source code.
So we either have to grab this data (and anything else we want) from the subscriber ID page, where the email is in the title tag and in the HTML template source, or run our crawler with JS rendering and scrape the rendered HTML, which would take longer.
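For context, that __cf_email__ markup looks like Cloudflare’s standard email-protection scheme, where the first hex byte of data-cfemail is an XOR key for the rest. A minimal decoding sketch, assuming it really is that scheme (the sample value below is made up, not the truncated one above):

```python
def decode_cfemail(cfemail: str) -> str:
    # First byte is the XOR key; every following byte XORed with it is a character.
    key = int(cfemail[:2], 16)
    return "".join(
        chr(int(cfemail[i:i + 2], 16) ^ key) for i in range(2, len(cfemail), 2)
    )

print(decode_cfemail("54393114312c35392438317a373b39"))  # -> me@example.com
```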
I’m interested to know things like whether certain users are more likely to click only on technical posts, or whether my wife is still reading my emails.
I also personally don’t really want the “who this is” data for users. I’m more interested in the data in aggregate and patterns in user behavior than who clicked what. So the subscriber ID is a good, unique identifier for me.
But, if you are enriching your CRM with your Convertkit data, you’ll want to crawl the subscriber URLs and pull down the email address.
What now?
Next, I will be mapping my scraped data into Neo4j with this data modeling:
More later!