
How to “hide” a site from search engines

When they hear the words “search engine”, most people react with:

Oh! That’s Google!

Well, this is not entirely true. It is Google, yes, but there are lots of other search engines in the web world, and a full list is available on Wikipedia on the List of search engines page.

The truth is that none of the content you post on the internet stays hidden (hence the quotes in the title). However, there are a few techniques you can implement in order to “hide” your content from search engines, including password-protecting your site or using a robots.txt file on your server.
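For the password-protection route, a minimal sketch assuming an Apache server (the file paths and the user name are only illustrative) is an .htaccess file in the directory you want to protect:

# .htaccess: ask for credentials before serving anything in this directory
AuthType Basic
AuthName "Private area"
AuthUserFile /path/to/.htpasswd
Require valid-user

The matching credentials file can be created with the htpasswd utility, for example htpasswd -c /path/to/.htpasswd myuser. A search engine bot cannot index what it cannot fetch.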

Placing a robots.txt file in the root directory of your site, with content like the example below, is more of a technique to instruct search engine bots not to crawl it. But again, we are not living in an ideal world: not all bots respect these instructions, and you are at the mercy and goodwill of the crawlers. All you can do at this point is regularly check the server logs and manually deny the suspected crawlers.
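The simplest blocking setup asks every bot to stay away from the whole site:

# block all robots from the entire site
User-agent: *
Disallow: /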

More about robots.txt file syntax

The example above uses two keywords (User-agent and Disallow) through which all of the robots are instructed to ignore the content of the entire site. But of course, the robots.txt syntax is more flexible: the Disallow keyword can be used to block access to a custom path, and the Allow keyword can be used to grant access to a sub-directory of a blocked parent directory.

Below is a more elaborate example of a robots.txt:
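# block folder1, except its subfolder, and skip any URL ending in .jpg
User-agent: *
Disallow: /folder1/
Allow: /folder1/subfolder/
Disallow: /*.jpg$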

The above rules instruct all search engines not to read the content of folder1, but still read the content under subfolder inside folder1. The last line instructs the bots not to crawl URLs ending in .jpg (the $ character tells the robots that the rule only matches URLs ending in a specific way).

Note that robots.txt must be placed in the root directory of your site, and remember:

The only safe and reliable way of not having a site listed is not putting it on the internet.

