Recently, I was trying to download a bunch of files whose URLs shared a common pattern. The general idea (I thought) was to allow an unlimited number of hops and use a regex to constrain the downloaded file set. Of course, I couldn't find a way to do infinite hops (this is a good thing), but I finally figured out another way to do it.
Look at the scope section of your crawler.cxml file. It starts like this, so you can find it in the file:
<!-- SCOPE: rules for which discovered URIs to crawl; order is very important because last decision returned other than 'NONE' wins. -->
The scope is an ordered list of decide rules: each discovered URI is run through the list, and the last rule to return a decision other than NONE determines whether it gets crawled.
To make it work, I set the number of hops to 1, and right before the org.archive.modules.deciderules.PrerequisiteAcceptDecideRule I added two regex rules: one to reject everything that didn't match the pattern, and one to accept everything that did. The second rule turned out to be necessary to work around the hops limit (e.g. when there's pagination to follow). That was sufficient to visit only the URLs that matched the pattern.
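As a sketch, the pair of rules might look like the following, inserted into the scope's rule list just before the PrerequisiteAcceptDecideRule bean. This assumes Heritrix 3's NotMatchesRegexDecideRule and MatchesRegexDecideRule classes; the URL pattern shown is a made-up placeholder, so substitute your own regex:

```xml
<!-- Reject any URI that does NOT match the target pattern. -->
<bean class="org.archive.modules.deciderules.NotMatchesRegexDecideRule">
  <property name="decision" value="REJECT" />
  <!-- hypothetical example pattern; replace with your own -->
  <property name="regex" value="^https?://example\.com/reports/.*\.pdf$" />
</bean>
<!-- Accept any URI that DOES match, overriding earlier REJECTs
     such as the one produced by the hops limit. -->
<bean class="org.archive.modules.deciderules.MatchesRegexDecideRule">
  <property name="decision" value="ACCEPT" />
  <property name="regex" value="^https?://example\.com/reports/.*\.pdf$" />
</bean>
```

Because the last non-NONE decision wins, the ACCEPT rule coming after the REJECT rule (and after the hops rule) is what lets matching URIs through even when the hop count would otherwise exclude them.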