Cheap Windows Hosting Tutorial – Tips to Improve Your SEO With Robots.txt and Canonical Headers

By Alexia Pamelov



CheapWindowsHosting.com | In this post we will explain how to improve your SEO with robots.txt and canonical headers.

Search engine crawlers (aka spiders or bots) scan your site and index whatever they can. This happens whether you like it or not, and you may not want sensitive or autogenerated files, such as internal search results, showing up on Google.


Fortunately, crawlers check for a robots.txt file at the root of the site. If it’s there, they’ll follow the crawl instructions inside, but otherwise they’ll assume the entire site can be indexed.

Here’s a simple robots.txt file:

User-agent: *
Allow: /wp-content/uploads/
Disallow: /
  • The first line explains which agent (crawler) the rule applies to. In this case, User-agent: * means the rule applies to every crawler.
  • The subsequent lines set what paths can (or cannot) be indexed. Allow: /wp-content/uploads/ allows crawling through your uploads folder (images) and Disallow: / means no file or page should be indexed aside from what’s been allowed previously. You can have multiple rules for a given crawler.
  • The rules for different crawlers can be listed in sequence, in the same file.
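
You can sanity-check rules like these before deploying them, for example with Python's standard urllib.robotparser, which applies the same allow/disallow matching (example.com is a placeholder host; this is a quick sketch, not a full Google-accurate parser):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Feed the rules directly instead of fetching them over HTTP.
rp.parse("""User-agent: *
Allow: /wp-content/uploads/
Disallow: /
""".splitlines())

# Uploads are crawlable, everything else is blocked.
print(rp.can_fetch("*", "http://example.com/wp-content/uploads/pic.jpg"))  # True
print(rp.can_fetch("*", "http://example.com/secret/page.html"))            # False
```

Rules are checked in order, and the first matching Allow/Disallow line wins, which is why the uploads folder stays crawlable despite the blanket Disallow.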

Robots.txt Examples

This rule lets crawlers index everything. Because nothing is blocked, it’s like having no rules at all:

User-agent: *
Disallow:

This rule lets crawlers index everything under the “wp-content” folder, and nothing else:

User-agent: *
Allow: /wp-content/
Disallow: /

This lets a single crawler (Googlebot, Google's crawler user-agent) index everything, and blocks the site for everyone else:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

Some hosts may have default entries that block system files (you don’t want bots kicking off CPU-intensive scripts):

User-agent: *
Disallow: /tmp/ 
Disallow: /cgi-bin/ 
Disallow: /~uname/

Block all crawlers from a specific file:

User-agent: *
Disallow: /dir/file.html

Block Googlebot from indexing URLs with a query parameter (often a generated result, like a search):

User-agent: Googlebot
Disallow: /*?
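
Note that wildcards like * and $ are a Google extension; the original robots.txt spec (and standard parsers such as Python's urllib.robotparser) only do prefix matching. Google's wildcard matching can be approximated like this (a simplified sketch, not the full spec):

```python
import re

def blocked_by_pattern(path, pattern):
    # '*' matches any run of characters and '$' anchors the end;
    # otherwise a Disallow pattern matches as a prefix.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.match(regex, path) is not None

print(blocked_by_pattern("/search?q=seo", "/*?"))  # True
print(blocked_by_pattern("/about.html", "/*?"))    # False
```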

Google’s Webmaster tools can help you check your robots.txt rules.

Setting Up a “Crawler Friendly” CDN

Crawlers see your site the same way visitors do, including content loaded from a CDN. If your images are served from the CDN, Google will index them from the CDN URLs rather than from your origin server, since the origin no longer delivers the image files.

  • Interestingly, Google treats a subdomain of your own site used for static file delivery with more “respect” than a third-party domain used for the same purpose. It is therefore highly recommended that you set up a CNAME for your CDN files and add it to Google Webmaster Tools if you want to monitor the index rate for your images.
  • To make sure your CDN treats crawlers appropriately, ensure that nothing but images is accessible to crawlers on the CDN servers – unless you are using a full-site caching method of delivery. Your origin server has its own robots.txt at the root of the site, and it probably allows every page and image to be indexed. On the CDN, change your custom robots.txt settings (under the “SEO” tab in the control panel) and make sure that only images are “open” for indexing (and/or any HTML pages to which you’ve added a canonical header):
User-agent: *
Allow: /wp-content/uploads/
Disallow: /

Make sure this robots.txt rule goes on your CDN, not your origin server.

  • For WordPress sites using the Yoast SEO plugin, a short code snippet can filter the image URLs generated for the XML sitemap:
function wpseo_cdn_filter( $uri ) {
    return str_replace( 'http://domain.com', 'http://cdn.domain.com', $uri );
}
add_filter( 'wpseo_xml_sitemap_img_src', 'wpseo_cdn_filter' );

And update your existing sitemaps, since the code above will produce different URLs for images.
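
What the filter does is a plain string substitution on each sitemap image URL. The same transformation, sketched in Python for illustration (domain.com and cdn.domain.com are the placeholder hosts from the snippet above):

```python
def cdn_filter(uri, origin="http://domain.com", cdn="http://cdn.domain.com"):
    """Rewrite an origin image URL to its CDN equivalent."""
    return uri.replace(origin, cdn)

print(cdn_filter("http://domain.com/wp-content/uploads/photo.jpg"))
# http://cdn.domain.com/wp-content/uploads/photo.jpg
```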

  • For any HTML page served from the CDN, it is good to add a canonical header. While Google ignores canonicals on images, it honours them on HTML files, so we can use the rel=”canonical” header to indicate the original source of a page (rel=”canonical” works both inside HTML tags and as a separate HTTP header). Crawlers that index a file from the CDN will see the canonical URL and store that instead, improving your SEO.

Here’s a sample .htaccess configuration for your origin server:

<FilesMatch "\.(ico|pdf|flv|jpg|jpeg|png|gif|js|css|swf|webp|html)(\.gz)?(\?.*)?$">
   <IfModule mod_rewrite.c>
      RewriteEngine On
      RewriteCond %{HTTPS} !=on
      RewriteRule .* - [E=CANONICAL:http://%{HTTP_HOST}%{REQUEST_URI},NE]
      RewriteCond %{HTTPS} =on
      RewriteRule .* - [E=CANONICAL:https://%{HTTP_HOST}%{REQUEST_URI},NE]
   </IfModule>
   <IfModule mod_headers.c>
      Header set Link '<%{CANONICAL}e>; rel="canonical"'
   </IfModule>
</FilesMatch>
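
Once the header is in place, a crawler (or any HTTP client) will see the canonical target in the Link response header. A small sketch of how to pull it back out of a header value for verification (the URL is a placeholder):

```python
import re

def canonical_from_link_header(link_header):
    # Extracts the URL from a header value shaped like:
    #   <http://domain.com/page.html>; rel="canonical"
    match = re.search(r'<([^>]+)>\s*;\s*rel="canonical"', link_header)
    return match.group(1) if match else None

print(canonical_from_link_header('<http://domain.com/path/to/file.html>; rel="canonical"'))
# http://domain.com/path/to/file.html
```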

 

Purge the CDN so the new headers propagate to cached assets. To check that canonicals are applied to CDN assets as well, use curl as follows:

$ curl -I http://cdn.domain.com/path/to/file.html 
HTTP/1.1 200 OK 
Date: Sat, 27 Apr 2013 06:06:06 GMT 
Content-Type: text/html 
Connection: keep-alive 
Last-Modified: Sat, 09 Feb 2013 01:02:03 GMT 
Expires: Sun, 05 May 2013 11:12:13 GMT 
Link: <http://cdn.domain.com/path/to/file.html>; rel="canonical"
Cache-Control: max-age=604800
Server: NetDNA-cache/2.2 
X-Cache: HIT

All done!

 
