Neha Setia Nagpal

Experiencing a 403 Forbidden status code while crawling a website?

It indicates that the server understands the request but refuses to authorise it. Here's one of the reasons you might be experiencing it... 🧵

Introducing request headers...

When it comes to food, everyone has a different palate. That's why people pass on additional information about their preferences when ordering online.

Request headers are no different. When you request a website from your browser, they provide additional information about the request so the server can tailor the response accordingly.

To view request headers, follow this tutorial:

https://stackoverflow.com/questions/4423061/how-can-i-view-http-headers-in-google-chrome

The following fields are consistently used by most major browsers when initiating any connection:

  • The Host request header specifies the host and port number of the server to which the request is being sent.

  • Connection controls whether the network connection stays open after the current transaction finishes.

Connection: keep-alive

The value keep-alive indicates that the client would like to keep the connection open for subsequent requests.

  • Accept is used to specify which media types the client is able to understand. For example,

Accept: text/html
Accept: image/*


  • Accept-Encoding indicates the content encoding (usually a compression algorithm) that the client can understand. For example,


Accept-Encoding: gzip
Accept-Encoding: compress


  • Accept-Language indicates the natural language and locale that the client prefers. For example:

Accept-Language: en-US

  • User-Agent identifies you by telling the server about your application, operating system, vendor, and/or version. For example:

user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36
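
Putting these fields together, here is a minimal sketch of sending a browser-like set of headers with the Python Requests library. The URL is a placeholder and the header values are illustrative; copy whatever your own browser actually sends.

import requests

# Placeholder URL -- replace with a site you are allowed to crawl.
url = "https://example.com"

# Browser-like request headers mirroring the fields described above.
# Requests fills in Host and Connection automatically from the URL and session.
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.9",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
}

response = requests.get(url, headers=headers)
print(response.status_code)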

But what’s their role in Web Scraping?

Typically, the field that matters most in web scraping is User-Agent. The goal of a scraping project is to crawl a website politely without being identified as a bot, and to check whether you are a human, websites inspect the User-Agent field.

Scraping bots often keep the default user-agent set by the Python library making the request. For example, the urllib library sets it to something like Python-urllib/3.4. With that default value, websites identify you as a bot and block you. The trick is to set this value to something human-like.
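
To see this for yourself, the short sketch below uses the public httpbin.org echo service (which simply reports back the headers it receives) to print the default user-agent sent by urllib and by Requests. The exact version numbers depend on your installation.

import json
import urllib.request

import requests

ECHO_URL = "https://httpbin.org/user-agent"

# Default user-agent sent by urllib, e.g. "Python-urllib/3.10".
with urllib.request.urlopen(ECHO_URL) as resp:
    print(json.loads(resp.read())["user-agent"])

# Default user-agent sent by Requests, e.g. "python-requests/2.28.1".
print(requests.get(ECHO_URL).json()["user-agent"])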

So, How To Behave Human-like With User Agents?

  1. Use quality user-agents

  2. Rotate user-agents

  1. Stick to user-agents that match the browser you're using for scraping, so your requests match that browser's default behaviour. One of the many ways to set the user-agent is with the Requests library, which allows complete customisation of the HTTP headers (as in the header sketch above).

  2. Rotate user-agents to make the server believe the requests come from different users. If you use the same user-agent for every request, you will get blocked. You can perform user-agent rotation with plain Python or with Selenium; if you use Scrapy, the `scrapy-useragents` middleware handles it for you. See the sketch after this list.
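
Here is a minimal sketch of user-agent rotation with the Requests library. The user-agent pool and target URLs are placeholders; in practice you would maintain a larger, regularly refreshed list of quality user-agents.

import random

import requests

# Small pool of real browser user-agent strings (illustrative values).
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0",
]

# Placeholder target pages -- replace with pages you are allowed to crawl.
urls = ["https://example.com/page1", "https://example.com/page2"]

for url in urls:
    # Pick a different user-agent per request so the traffic does not look
    # like it all comes from a single client.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    response = requests.get(url, headers=headers)
    print(url, response.status_code)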

Pro-tip

If you ever encounter an extremely suspicious website, populating one of the commonly used but rarely checked headers such as Accept-Language might be the key to convincing it you’re a human.

To conclude, make sure you choose reputable user-agents and keep rotating them for seamless crawling. If you don't want to do it yourself, look for a proxy solution like @zytedata Smart Proxy Manager that does it for you.

Sign up for a 14-day free trial: https://www.zyte.com/smart-proxy-manager/
