© 2003-2007 Googlerankings.com
Googlerankings.com is in no way affiliated with, and is not the property of, Google Inc.
In certain cases, on the way to having a penalty lifted (solving usability or content-related problems, removing unwanted content, requesting re-evaluation), removing an offending web page or blocking access to a faulty URL may become necessary. Such changes take effect in a web site's ranking during subsequent crawls and updates or, in the case of a ban, are viewed as the proper action taken when the content is re-evaluated. However, entirely removing a URL from the index is rarely necessary: the historic supplemental results and cached versions of a URL that already returns the server response of a deleted or permanently redirected page do not determine the outcome.

More generally, Google removes a URL from its index if its crawl requests consistently receive a 404 (Not Found) or 410 (Gone) HTTP status code from the server. A permanent redirect (status code 301) gradually transfers all parameters of the source URL to the target of the redirect, and once the transition is complete the source page may be dropped from the index. URLs of obsolete pages are likely to first become "historic" supplemental results, then fade out of the index completely. To remove valid URLs that remain accessible, the proper methods are the robots.txt disallow rules and/or META tags that limit Googlebot's indexing.

Known issues

Sometimes a web site owner feels that a page they administer, which is currently in the Google index, should no longer be listed as a result. Removing the page from the server, so that the server answers Googlebot's requests with a 404 (Not Found) or 410 (Gone) HTTP status code, will eventually mark the page as supplemental, and it will no longer be shown as a result for normal searches. Such historic supplemental pages may stay in the index, and may be reached by queries unique to the deleted page, for up to a year.
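The status codes mentioned above can be sketched as a small routing function. This is only an illustration of which response a server might send for a deleted, moved, or live page; the path tables and the `respond` helper are hypothetical names invented for the example, not part of any Google tool.

```python
from http import HTTPStatus

# Hypothetical path tables, for illustration only.
REMOVED = {"/old-page.html"}                          # deleted for good
MOVED = {"/old-location.html": "/new-location.html"}  # permanently redirected
LIVE = {"/", "/new-location.html"}                    # still served normally


def respond(path):
    """Return the (status, headers) pair a server could send for `path`."""
    if path in MOVED:
        # 301: tells crawlers the page now lives at the Location target.
        return HTTPStatus.MOVED_PERMANENTLY, {"Location": MOVED[path]}
    if path in REMOVED:
        # 410: the page was removed on purpose and will not return.
        return HTTPStatus.GONE, {}
    if path in LIVE:
        return HTTPStatus.OK, {}
    # 404: nothing is known about this path.
    return HTTPStatus.NOT_FOUND, {}


if __name__ == "__main__":
    for p in ("/old-page.html", "/old-location.html", "/", "/missing.html"):
        status, headers = respond(p)
        print(p, int(status), headers)
```

Checking the redirect table first is a design choice of this sketch: a path that has a new home is redirected rather than reported as gone.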
While a deleted page remains a historic supplemental result, a copy of its last crawled version stays in the Google cache. Such pages play no role in evaluating a web site for relevance or importance, and are generally ignored completely by the algorithms. Unless the deleted but still cached content raises legal issues, a copyright breach, defamation, or security problems with the information displayed, you should not need to remove the page entirely from the index. In those cases you may request that the cache be deleted through the Google webmaster help center, using the tool for removing URLs.

The manual URL removal system only works if the request corresponds to either a 404 (Not Found) server response or a Googlebot-related restriction set on the pages themselves or in the robots.txt file. Also note that excluded pages are still cached and indexed; they only remain hidden from users for an estimated period of at least 180 days (about six months). The URL removal tool thus cannot be used to clear the history of a web site, only to exclude it from the search result pages. Combining it with a reinclusion request when dealing with penalties and bans is therefore redundant and generally advised against.

Resolution

Completely removing an otherwise valid URL from the Google index, or preparing to remove a page that will soon be deleted, should be done by restricting the crawling, indexing, caching and display of the pages on a case-by-case basis. Placing the proper META tags in the HEAD section of a page communicates the necessary information to the algorithms; over subsequent crawls and updates the page is gradually excluded from the results and its associated cache is cleared as well. Robots.txt disallow rules are reported as a temporary block against the crawling of the affected URLs, and only work while they are in place.
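Before relying on page-level directives, it can help to confirm that a page really carries them. The following is a minimal sketch using only the Python standard library; the class and function names are made up for this example, and merging the generic "robots" tag with the "googlebot" tag is a simplification of this sketch, not a description of how Googlebot itself resolves the two.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the values of <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        if name in ("robots", "googlebot"):
            values = {
                v.strip().lower()
                for v in attr.get("content", "").split(",")
                if v.strip()
            }
            self.directives.setdefault(name, set()).update(values)


def googlebot_directives(html):
    """Return the directives addressed to Googlebot or to all robots."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    return found.get("robots", set()) | found.get("googlebot", set())


if __name__ == "__main__":
    page = '<html><head><META NAME="GOOGLEBOT" CONTENT="NOINDEX, NOARCHIVE"></head></html>'
    print(sorted(googlebot_directives(page)))
```

A page is a candidate for exclusion when the returned set contains "noindex"; an empty set means no page-level restriction was found.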
It is advised to first add the page-specific directives; once these have been recognized, the removal of the pages or the setup of the robots.txt disallow rules can follow. It is also important to note that if a URL is constantly referred to by links from other web sites, it will be tried against the robot directives over and over again, and as soon as those no longer restrict the crawl and the page at the URL is still in place, it will be re-indexed. About a year after a URL becomes unavailable for Googlebot to crawl, it is usually dropped from the index entirely, including the cache and occasional supplemental versions.

The following tag is to be placed in the HEAD section of the pages that are to be excluded from the index:

    <META NAME="GOOGLEBOT" CONTENT="NOINDEX, NOARCHIVE, NOSNIPPET">

Another method is the NOFOLLOW directive, which instructs Googlebot not to follow the links found on the page that carries it. Use with caution.

    <META NAME="GOOGLEBOT" CONTENT="NOFOLLOW">

Below are examples of robots.txt entries that block Googlebot from crawling certain URLs. The robots.txt file must be placed in the root directory of the host it applies to, so that it is reachable through www.example.com/robots.txt; crawlers do not look for it in subdirectories, and every path in it is read relative to that root. You can list individual URLs, use just the / character to extend the directive to every URL of the domain or subdomain, or use directory names with / at their end to exclude whole directories. You may also speed up the removal of URLs from the index by first restricting their crawl and/or deleting them, and then requesting their removal through the URL removal request page of Google Webmaster Tools.

An example robots.txt that disallows the crawl of the entire domain:

    # Disallow Googlebot
    User-agent: Googlebot
    Disallow: /

Another example, disallowing a directory (the directory name is a placeholder):

    # Disallow Googlebot
    User-agent: Googlebot
    Disallow: /directory/

Another example, disallowing specific files (the file names are placeholders):

    # Disallow Googlebot
    User-agent: Googlebot
    Disallow: /page.html
    Disallow: /directory/other-page.html
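A set of disallow rules can be checked before it is uploaded. The sketch below uses Python's standard `urllib.robotparser` module; the sample rules, the paths in them, and the `googlebot_may_fetch` helper are hypothetical and chosen only for illustration. The module implements generic robots.txt matching, so it approximates rather than reproduces Googlebot's own behaviour.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block one directory and one file for Googlebot only.
ROBOTS_TXT = """\
# Disallow Googlebot
User-agent: Googlebot
Disallow: /private/
Disallow: /old-page.html
"""


def googlebot_may_fetch(robots_txt, url):
    """Return True if the given rules allow Googlebot to crawl `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/private/page.html",
        "http://www.example.com/old-page.html",
        "http://www.example.com/index.html",
    ):
        print(url, googlebot_may_fetch(ROBOTS_TXT, url))
```

Running such a check against the URLs you intend to block, and against a few you intend to keep, helps catch a stray `Disallow: /` before it hides the whole site.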
Resources

Removing my content from the Google index (Google Webmaster Help Center): http://www.google.com/support/webmasters/bin/topic.py?topic=8459
Requesting removal of content from our index (Official Google Webmaster Central Blog)
HTTP/1.1 Status code definitions (W3C)
Experiences with the Google URL removal tool (Webmasterworld)
Another URL Removal cautionary tale (Webmasterworld)