READING-NOTE

Web Scraping

Web scraping, web harvesting, or web data extraction is the process of automatically collecting data from websites.

Respect Robots.txt

You can find the robots.txt file on most websites. It usually lives in the root directory of a site, e.g. http://example.com/robots.txt.

If it contains lines like the ones shown below, the site does not want to be scraped.

```
User-agent: *
Disallow: /
```
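
As a minimal sketch, Python's standard-library urllib.robotparser can check whether a path may be fetched before you crawl it; the site URL and user-agent string below are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Placeholder target site; replace with the site you intend to scrape.
parser = RobotFileParser()
parser.set_url("http://example.com/robots.txt")
parser.read()

# can_fetch() returns False when robots.txt disallows the path for this agent.
if parser.can_fetch("my-crawler", "http://example.com/some/page"):
    print("Allowed to fetch")
else:
    print("Disallowed by robots.txt")
```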

Make the crawling slower, do not slam the server, treat websites nicely
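
One simple way to slow down is a randomized pause between requests. The sketch below uses the requests library; the URL list is a placeholder.

```python
import random
import time

import requests

urls = ["http://example.com/page1", "http://example.com/page2"]  # placeholder URLs

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    # Sleep 2-5 seconds between requests so we don't slam the server.
    time.sleep(random.uniform(2, 5))
```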

Do not follow the same crawling pattern

Incorporate some random clicks on the page, mouse movements and random actions that will make a spider look like a human.
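
A hedged sketch of such random actions using Selenium, assuming a Chrome driver is installed; the URL and the offset ranges are arbitrary and purely illustrative.

```python
import random

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome()
driver.get("http://example.com")  # placeholder URL

# Perform a few random mouse movements to vary the crawling pattern.
actions = ActionChains(driver)
for _ in range(3):
    actions.move_by_offset(random.randint(1, 50), random.randint(1, 50))
actions.perform()

driver.quit()
```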

Make requests through Proxies and rotate them as needed

Many requests coming from the same IP address will get you blocked, which is why we need to rotate between multiple addresses. When we send requests through a proxy machine, the target website does not see the original IP, which makes detection harder.

There are several methods that can change your outgoing IP.
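
As a minimal sketch with the requests library, assuming you already have a pool of working proxy endpoints (the addresses below are placeholders):

```python
import itertools

import requests

# Placeholder proxy pool; substitute real proxy endpoints.
proxies = itertools.cycle([
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
])

for url in ["http://example.com/page1", "http://example.com/page2"]:
    proxy = next(proxies)
    # Route both HTTP and HTTPS traffic through the current proxy.
    response = requests.get(url, proxies={"http": proxy, "https": proxy})
    print(url, "via", proxy, "->", response.status_code)
```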

Rotate User Agents and corresponding HTTP Request Headers between requests

The most basic ones are: