Robots.txt is a text file that web admins place at the root of a website to instruct web crawlers about which parts of the site they should and should not visit. The file will also often recommend starting points for crawling jobs, such as the URL of the site’s sitemap.
The robots.txt file can also significantly ease browser automation tasks. By checking it, your scripts can determine which paths are most likely to lead to desirable content and which you should avoid. While the file is not legally binding, it is certainly recommended that you respect the requests of your fellow web citizens.
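As a minimal sketch of that check, Python's standard-library `urllib.robotparser` can parse a site's rules and answer "may I fetch this path?" questions. The rules and URLs below are hypothetical, stand-ins for whatever site your script targets:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice you'd fetch it
# from https://<site>/robots.txt before crawling.
rules = """
User-agent: *
Disallow: /admin/
Allow: /blog/
Sitemap: https://example.com/sitemap.xml
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Ask whether a given path is permitted for our user agent ("*")
print(parser.can_fetch("*", "https://example.com/blog/post-1"))   # True
print(parser.can_fetch("*", "https://example.com/admin/login"))   # False
```

In a real script you'd call `parser.set_url(...)` followed by `parser.read()` to pull the live file, then gate each navigation on `can_fetch` before spending a browser session on it.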
Most browser automation jobs don't intersect with robots.txt. After all, most jobs aren't scraping data, but interacting with web elements. That said, if you're designing a far-reaching web crawler, it pays to leverage the robot instructions of the sites you crawl. Most sites have every interest in guiding you to the relevant content, and you can save valuable cycles by skipping the pages they've blacklisted.
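One way a crawler might put those instructions to work: seed the crawl queue from the advertised sitemap and filter candidate URLs against the disallow rules. This is a sketch under assumed rules and URLs (the site, the `MyCrawler` agent name, and the candidate pages are all hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site we plan to crawl
rules = """
User-agent: *
Disallow: /cart/
Sitemap: https://example.com/sitemap.xml
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Seed the crawl from the sitemap(s) the site advertises
# (site_maps() is available in Python 3.8+; returns None if absent)
seeds = parser.site_maps() or []

# Skip any candidate page the site has asked crawlers to avoid
candidates = [
    "https://example.com/products/widget",
    "https://example.com/cart/checkout",
]
to_crawl = [url for url in candidates if parser.can_fetch("MyCrawler", url)]
# to_crawl keeps only the /products/ URL; /cart/ is disallowed
```

Starting from the sitemap gets you straight to the content the site wants indexed, and the `can_fetch` filter spares you the pages that would be dead weight anyway.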
How can BrowserCat help with robots.txt management?
BrowserCat does not get involved with how you manage your automation scripts. While we recommend you respect the robots.txt file of the sites you visit, whether you do so is ultimately your choice.
Every improvement we make to the platform is geared toward making it easier for you to automate web activities. That’s why we’ve gone to such great lengths to make spinning up our fleet of headless browsers a breeze, and why we lean on usage pricing, to ensure we’re only paid for the value you produce.
Like what you hear? Give BrowserCat a try today!
Tired of managing a fleet of fickle browsers? Sick of skipping e2e tests and paying the piper later?
Sign up now for free access to our headless browser fleet…