Googlebot


Googlebot is the web crawler software used by Google, which collects documents from the web to build a searchable index for the Google Search engine. The name refers to two different types of web crawlers: a desktop crawler and a mobile crawler.

Behavior

A website will probably be crawled by both Googlebot Desktop and Googlebot Mobile. The subtype of Googlebot can be identified from the user agent string in the request. However, both crawler types obey the same product token ("Googlebot") in robots.txt, so a developer cannot selectively target either Googlebot Mobile or Googlebot Desktop using robots.txt.
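As a minimal sketch of this behaviour, using Python's standard urllib.robotparser module with hypothetical rules and URLs, a rule written against the shared "Googlebot" token answers for both subtypes:

    from urllib.robotparser import RobotFileParser

    # Hypothetical robots.txt rules addressed to the shared "Googlebot" token.
    rules = [
        "User-agent: Googlebot",
        "Disallow: /private/",
        "Allow: /",
    ]

    parser = RobotFileParser()
    parser.parse(rules)

    # Both Googlebot Desktop and Googlebot Mobile match the same product token,
    # so the answer is the same whichever subtype makes the request.
    print(parser.can_fetch("Googlebot", "https://example.com/private/page"))  # False
    print(parser.can_fetch("Googlebot", "https://example.com/public/page"))   # True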
If a webmaster wishes to restrict the information on their site available to Googlebot, or another well-behaved spider, they can do so with the appropriate directives in a robots.txt file, or by adding a robots meta tag such as <meta name="googlebot" content="noindex, nofollow"> to the web page. Googlebot requests to web servers are identifiable by a user agent string containing "Googlebot" and a host address containing "googlebot.com".
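Because the user agent string alone is easy to spoof, a common verification step is a reverse DNS lookup on the requesting IP address followed by a confirming forward lookup. A minimal sketch using Python's standard socket module (the sample IP address is illustrative only):

    import socket

    def is_verified_googlebot(ip):
        """Check that ip reverse-resolves to a host under googlebot.com (or
        google.com) and that the host name resolves back to the same ip."""
        try:
            host, _, _ = socket.gethostbyaddr(ip)    # reverse DNS: IP -> host name
            if not host.endswith((".googlebot.com", ".google.com")):
                return False
            return socket.gethostbyname(host) == ip  # forward DNS must round-trip
        except OSError:
            return False

    print(is_verified_googlebot("66.249.66.1"))  # an address in Google's crawler range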
Currently, Googlebot follows HREF links and SRC links. There is increasing evidence that Googlebot can execute JavaScript and parse content generated by Ajax calls as well. There are many theories regarding how advanced Googlebot's JavaScript processing is, with opinions ranging from minimal ability derived from custom interpreters to full rendering with a modern browser engine; at present, Googlebot uses a web rendering service based on the Chromium rendering engine. Googlebot discovers pages by harvesting every link on each page it finds and then following those links to other web pages. New web pages must either be linked from other known pages on the web or be manually submitted by the webmaster in order to be crawled and indexed.
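As an illustration of this harvesting step, the following sketch collects HREF and SRC attributes from a page and resolves them against the page's URL; the HTML snippet and URLs are invented for the example:

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkHarvester(HTMLParser):
        """Collect the targets of HREF and SRC attributes, the two kinds
        of links noted above, resolved relative to the page's URL."""
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.found = []

        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if name in ("href", "src") and value:
                    self.found.append(urljoin(self.base_url, value))

    harvester = LinkHarvester("https://example.com/")
    harvester.feed('<a href="/about">About</a> <img src="logo.png"> <script src="/js/app.js"></script>')
    print(harvester.found)
    # ['https://example.com/about', 'https://example.com/logo.png', 'https://example.com/js/app.js']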
A problem that webmasters with low-bandwidth hosting plans have often noted is that Googlebot can consume an enormous amount of bandwidth, causing websites to exceed their bandwidth limits and be taken down temporarily. This is especially troublesome for mirror sites, which host many gigabytes of data. Google provides Search Console, which allows website owners to throttle the crawl rate.
How often Googlebot crawls a site depends on the crawl budget, an estimate of how often the website is updated. Internally, Googlebot's development team uses several defined terms that together capture what "crawl budget" stands for. Since May 2019, Googlebot has used the latest Chromium rendering engine, which supports ECMAScript 6 features. This makes the bot more "evergreen", ensuring that its rendering engine does not fall behind the capabilities of current browsers.

Mediabot

Mediabot is the web crawler that Google uses to analyse the content of pages so that Google AdSense can serve contextually relevant advertising on them. Mediabot identifies itself with the user agent string "Mediapartners-Google/2.1".
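A server can therefore tell Mediabot traffic apart from ordinary search crawling by inspecting the User-Agent header, as in this hypothetical sketch:

    def crawler_type(user_agent):
        # "Mediapartners-Google" is the token Mediabot presents;
        # Google's search crawlers present "Googlebot" instead.
        if "Mediapartners-Google" in user_agent:
            return "Mediabot (AdSense)"
        if "Googlebot" in user_agent:
            return "Googlebot (Search)"
        return "other"

    print(crawler_type("Mediapartners-Google/2.1"))  # Mediabot (AdSense)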
Unlike other crawlers, Mediabot does not follow links to discover new crawlable URLs; instead, it only visits URLs that include the AdSense code. Where that content resides behind a login, the crawler can be given login credentials so that it is able to crawl protected content.
Mediabot usually first visits a page within seconds of the AdSense code first being called from that page. Thereafter it revisits pages on a regular but unpredictable basis. Changes made to a page therefore do not immediately cause changes to the ads displayed on it.
Ads can still be shown on a page even if Mediabot has not yet visited it. In that case, the ads chosen are based on a combination of the overall domain theme and keywords appearing in the URL string. If no ads can be matched to the page, either public service ads, blank space, or a solid color are shown, depending on the settings for that ad unit.