When search engines crawl a site, they first look for a robots.txt file at the root of the domain. If found, they read its directives to learn which directories and files, if any, are blocked from crawling. You can create this file with our robots.txt generator. The generated file tells Google and other search engines which pages on your site should be excluded from their indexes and which should not. In other words, it is like the opposite of a sitemap, which lists the pages to include.
You can easily create a new robots.txt file for your site, or edit an existing one, using the robots.txt generator. Use the tool to generate Allow or Disallow directives (Allow by default; click to change) for user agents (use * for all, or click to select a single one) for specified content on your site. Click Add Directive to add a new directive to the list. To change an existing directive, click Remove Directive and then create a new one.
Create your own custom user agent directives
Our robots.txt generator lets you target Google and several other search engines based on your criteria. To specify alternate directives for a single crawler, click the User Agent list box (which displays * by default) and select the bot. When you click Add Directive, a custom section is added to the list with all of the generic directives copied into it. To change a generic Disallow directive into an Allow directive for the custom user agent, create a new user agent-specific Allow directive for that content; the corresponding Disallow directive will then be removed for that user agent.
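For instance, a generated file with a generic section and a Googlebot-specific override might look like the following sketch (the directory name is a placeholder):

```text
# Generic directives for all crawlers
User-agent: *
Disallow: /archive/

# Custom section: Googlebot may crawl the directory
# that is blocked for everyone else
User-agent: Googlebot
Allow: /archive/
```

Because crawlers apply only the most specific matching User-agent section, Googlebot follows the Allow rule while all other bots follow the generic Disallow.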
You can also add a link to your XML sitemap. Type or paste the full URL of your XML sitemap file into the XML Sitemap text box, then click Update to add the directive to your robots.txt list.
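In the file itself, the sitemap reference is a single directive (the URL below is a placeholder):

```text
Sitemap: https://www.example.com/sitemap.xml
```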
When done, click Export to save your new robots.txt file, then use FTP to upload it to the document root of your site's domain. With this file in place, Google and the other crawlers you specified will know which pages or directories of your site should not appear in search results.
The Ultimate Guide to Blocking Your Content in Search
We all try very hard to ensure that all of our content is crawled and indexed by search engines. So it's ironic that sometimes we also have to struggle to remove or prevent some content from being indexed.
The process of blocking content from search can be confusing, removal can be slow, and the whole thing is generally frustrating, especially if you don't know what options you have. Let's walk through the different options for removing content from search indexes and for preventing it from being indexed in the first place.
Find all affected URLs
Before starting the removal process, identify every URL that points to the content you want to remove. Think of it as canonicalization in reverse: if the content is older, it may be indexed under multiple URLs, for example:
- site.com/mystuff
- site.com/mystuff/
- www.site.com/mystuff
- www.site.com/mystuff/
- www.site.com/mystuff/Index.htm
- www.site.com/mystuff/index.htm
and many more options. Identify all URLs pointing to the content you want to remove so that you are ready to remove all links to it.
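As a quick illustration, a short script can enumerate the common variants for you. The url_variants helper below is hypothetical, and real sites may have additional forms (http vs. https, tracking parameters, and so on):

```python
def url_variants(host, path):
    """Enumerate common duplicate URL forms for one piece of content.

    Hypothetical helper for illustration: combines with/without "www.",
    with/without a trailing slash, and explicit index filenames.
    """
    hosts = [host, "www." + host]
    suffixes = ["", "/", "/index.htm", "/Index.htm"]
    return [h + "/" + path + s for h in hosts for s in suffixes]

# Print every variant so each one can be checked and removed.
for url in url_variants("site.com", "mystuff"):
    print(url)
```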
Remove indexed content from search
There are several ways to tell search engines that content is no longer available. Let's get started right away.
Remove it from the web server
The easiest way to remove content from search indexes is simply to remove it from your site. When the crawler returns to check the status of your published content, its request for the removed content will result in a 404 HTTP status, telling the crawler that the file could not be found. This starts the automatic (albeit slow) process of removing the URL from the index.
Configure your web server to return 404 (or 410) for the URL.
If you must leave the content on the server, you can configure the web server to return a 404 "Not Found" or 410 "Gone" status for the given URL anyway. How you set up a specific, non-default HTTP status for a URL depends on your platform; see your web server's documentation for details. Because the status is returned by the server itself, this method also works for non-HTML content such as PDFs and Microsoft Word DOC files.
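On Apache, for example, the stock mod_alias module can do this without deleting anything. A minimal .htaccess sketch (the path is illustrative, and other servers use different mechanisms):

```apacheconf
# Return "410 Gone" for a retired page that still exists on disk
Redirect gone /mystuff/index.htm
```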
Permanent URL redirection
Assigning a 301 (also known as permanent) redirect to a URL tells the crawler that the requested URL is no longer available and has been permanently replaced by a substitute (the URL receiving the redirected traffic).
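On Apache, a permanent redirect can likewise be declared with mod_alias. A minimal sketch with placeholder paths:

```apacheconf
# Permanently redirect the old URL to its replacement
Redirect 301 /mystuff/index.htm /stuff/
```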
All of the above methods will work after a while. They depend on waiting for the crawler to return to the site, requesting the affected URL to get a valid HTTP status code, and then on the search engine's algorithm to eventually clear the content. If the issue is urgent, such as when proprietary or confidential personal information is accidentally disclosed, you need to take immediate action to clean up that content. Here's how to do it:
Use search engine webmaster tools to remove specific pages
Google and Yandex offer tools for requesting the immediate removal of indexed content. To access them, you must be a registered user of Google Webmaster Tools and Yandex.Webmaster (this alone is reason enough to register your site now, before a pressing problem arises).
- Sign in to Google Webmaster Tools and click Site Configuration > Crawler Access > Remove URL.
- Click Create a new removal request, enter or paste the URL you want removed, and then click Continue. Remember that URLs are case-sensitive, so I recommend copying and pasting the URL you want to remove.
- In the drop-down list, select the type of removal you want (cache only, cache and search results, or the entire directory), and then click Submit Request. Your request will appear in a list in the tool, where you can track its status.
- The same principle applies to Yandex.
Please note that the URL removal tools provided by the search engines are intended for urgent removals. Beyond the methods above, there are other ways to remove content that also prevent it from being indexed in the first place. Let's explore them.
Block URLs to prevent duplicate content in the index
The most commonly used method of controlling crawler access to your site's content is with Robots Exclusion Protocol (REP) directives. This can be done in several ways:
Use a robots.txt file on your site
A robots.txt file is a simple text file containing crawl-exclusion directives for one or more REP-compliant crawlers (or general directives that apply to all of them). When the file is placed at the root of a website's domain (or subdomain), REP-compliant crawlers read it automatically before fetching any URLs (all major search engine crawlers are REP-compliant). If a target URL is blocked by robots.txt, the URL is not fetched.
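Python's standard library ships a REP parser, urllib.robotparser, which applies the same allow/block logic a compliant crawler would. This sketch uses made-up rules and URLs:

```python
from urllib.robotparser import RobotFileParser

# Parse a hypothetical robots.txt from a list of lines, just as a
# REP-compliant crawler would parse the fetched file.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A URL under the blocked directory must not be fetched.
print(rules.can_fetch("*", "https://www.example.com/private/report.pdf"))  # False
# Anything else may be crawled.
print(rules.can_fetch("*", "https://www.example.com/public/page.html"))   # True
```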
The robots.txt file (note that the protocol requires this filename to be lower case) lets webmasters block crawlers from one or more specific files in a directory, from entire directories, or from the entire site. (Note: according to Google, this is the only approved method for removing entire directories from their index.) It also supports wildcards, making it extremely versatile.
The most common robots.txt statement targets all crawlers (called "user agents" in the REP). It is followed by a specific directive, such as blocking access to a file, directory, or site. An example robots.txt directive for generic user agents looks like this:
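```text
# Block all REP-compliant crawlers from one (illustrative) directory
User-agent: *
Disallow: /private/
```

Here the User-agent: * line addresses every REP-compliant crawler, and the Disallow line blocks all URLs under the /private/ directory; an empty Disallow: value would instead allow the whole site.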