Crawler bots are like viruses: they spread of their own accord and index whatever website links they find. These digital spiders of the search engines thread their way through the web, indexing every link they can reach.
However, if you’re building a new website, or creating new pages for an existing site, you don’t want those pages to be indexed or shown in the search results. Why? Because these pages aren’t ready. Design, content, images, web links, SEO and much more may not be finished enough for the page to go live yet.
To ensure these unready pages don’t show up in any search, you’ve got to take control of the crawler bots. The bad news is that controlling Google is like controlling God (All Hail Google). The good news: you can control Google’s angels – the crawler bots.
God be like – “Go find all the good and bad websites!”
Three Ways to Control Crawler Bots
Before we get to controlling crawler bots, you’ve got to know the three main tools for giving them instructions.
1. Robots.txt
This one lives on your site, at yoursite.com/robots.txt. Robots.txt tells all the crawlers which pages on your site they should or shouldn’t access. But its instructions are only advisory, and Google and Bing don’t always respect them.
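For illustration, here’s what a minimal robots.txt might look like; the /drafts/ path is just a placeholder:

```
# Applies to all crawlers
User-agent: *
# Ask crawlers not to fetch anything under /drafts/
Disallow: /drafts/
```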
2. Meta Robots
These tags live in the header of each of your web pages. They tell the search engine not only whether to index the page, but also whether it should check out the links on the page. Meta robots tags give you finer control over search engines than robots.txt does.
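As a sketch, a meta robots tag sits inside the page’s head section; the page itself here is hypothetical:

```html
<head>
  <title>Draft Page</title>
  <!-- Don't index this page, but do follow its links -->
  <meta name="robots" content="noindex, follow">
</head>
```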
3. Nofollow Tag
This one lives on individual links. A nofollow tag doesn’t tell a search engine whether to index a page; instead, it tells the engine not to vouch for the linked page, so no PageRank or link equity is passed through that link.
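For illustration, a nofollow tag is just an attribute on the link itself; the URL below is a placeholder:

```html
<!-- rel="nofollow" tells search engines not to pass PageRank
     or link equity through this link -->
<a href="https://example.com/some-page" rel="nofollow">Some page</a>
```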
Controlling Crawler Bots
Here’s the deal: I lied. You can’t really control crawler bots (not in the sense you’re thinking) because they belong to the search engines. What you can do is instruct them so that your page isn’t indexed and mentioned in the search results.
The key is to use all three tools in the right manner so that the pages aren’t indexed.
– In robots.txt, you can give a command – “all user agents, you are not allowed to crawl yoursite.html”. That will be enough to prevent the page from being crawled; however, just because it’s not being crawled does not mean it won’t appear in a search result. Explaining why this happens would need another article of its own.
– <meta name="robots" content="noindex, follow">
Noindex tells the crawler not to index the page, while follow lets it still crawl the links on the page.
– Still find your page in the search results? And does the result show your link, but with a meta description of “we can’t include a meta description because of this site’s robots.txt file.”? This happens because the crawler never gets to see the noindex tag – the robots.txt “disallow” blocks it from fetching the page in the first place. Remove the disallow rule so the crawler can read the noindex tag.
– Add “meta noindex” to each page you want hidden. A search engine crawler still goes through all the pages, but the noindex directive keeps them from appearing in the search results.
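The robots.txt behavior described in the first point above can be sketched with Python’s standard-library parser; the rules and URLs here are placeholders, not real sites:

```python
# Minimal sketch: how a crawler interprets a robots.txt Disallow rule,
# using Python's built-in robots.txt parser.
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents
rules = """User-agent: *
Disallow: /yoursite.html
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The disallowed page may not be crawled...
print(parser.can_fetch("*", "https://example.com/yoursite.html"))  # False
# ...but other pages may be
print(parser.can_fetch("*", "https://example.com/other.html"))     # True
```

Note this only decides whether the page may be crawled – as the article says, a disallowed page can still surface in results if other sites link to it.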
These points are just the basics, and there is a lot you can do once you get into controlling crawler bots.
WebMaster – The Manual Way
If you’d rather not bother with the points above, you can take an easier path. Google and Bing’s webmaster tools allow you to manually pull a page link from the search results. However, this is a good idea only if there are just a few pages. If you want to keep more than 100 web pages out of the index and the search results, your best bet is to save precious time by controlling the crawler bots instead of submitting every link by hand.
Index Problems and Solutions
There are many, many index problems you may face. Here are two common ones I’ve come across, and their solutions.
1. Content isn’t Ready
The most common reason no one wants a page to be indexed is because the content is not ready.
- If you’re talking about tens of thousands of pages, use robots.txt to disallow these pages from being crawled.
- If you’re talking about a few hundred pages, use meta robots noindex on them.
2. Duplicate Content
If a search engine catches duplicate content on your site, it can hurt your site’s search rankings. The solution depends on what you need the duplicate page for.
- If you have two pages with the same title (like “Plumbing Service”) but only one is live, while the other is a draft (because you’re building a new and better website), noindexing the draft is the best solution.
- However, if you have a single page with duplicate content, then the best option is to add rel="canonical" pointing to the preferred version.
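For illustration, a canonical tag goes in the head of the duplicate page; the URL below is a placeholder:

```html
<!-- Tells search engines which version of the page is the
     preferred (canonical) one to index and rank -->
<link rel="canonical" href="https://example.com/plumbing-service">
```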
There is a lot of technicality when it comes to hiding your pages from search engines. If you aren’t well versed in it, it’s best to leave it to an SEO expert – overlooking these minor details can negatively affect your search rankings.