In this post we will explain how to improve your SEO with robots.txt and canonical headers.
Search engine crawlers (aka spiders or bots) scan your site and index whatever they can. This happens whether you like it or not, and you might not want sensitive or autogenerated files, such as internal search results, showing up on Google.
Fortunately, crawlers check for a robots.txt file at the root of the site. If it’s there, they’ll follow the crawl instructions inside; otherwise they’ll assume the entire site can be indexed.
Here’s a simple robots.txt file:
User-agent: *
Allow: /wp-content/uploads/
Disallow: /
User-agent: * means the rule applies to every crawler. Allow: /wp-content/uploads/ allows crawling of your uploads folder (images), and Disallow: / means no other file or page should be indexed aside from what’s been allowed previously. You can have multiple rules for a given crawler.
This rule lets crawlers index everything. Because nothing is blocked, it’s like having no rules at all:
User-agent: *
Disallow:
This rule lets crawlers index everything under the “wp-content” folder, and nothing else:
User-agent: *
Allow: /wp-content/
Disallow: /
This lets a single crawler (Googlebot, Google’s crawler) index everything, and blocks the site for everyone else:
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
Some hosts may have default entries that block system files (you don’t want bots kicking off CPU-intensive scripts):
User-agent: *
Disallow: /tmp/
Disallow: /cgi-bin/
Disallow: /~uname/
Block all crawlers from a specific file:
User-agent: *
Disallow: /dir/file.html
Block Googlebot from indexing URLs with a query parameter (which is often a generated result, like a search at /?s=term):
User-agent: Googlebot
Disallow: /*?
Google’s Webmaster Tools can help you check your robots.txt rules.
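You can also check from the command line exactly what crawlers will receive. A quick sketch with curl (domain.com is a placeholder for your own site); if you’re using the sample file from above, you should get it back verbatim:

$ curl http://domain.com/robots.txt
User-agent: *
Allow: /wp-content/uploads/
Disallow: /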
Crawlers see your site the way visitors do, including content loaded from CDNs. If your images are delivered from a CDN, Google fetches them from the CDN hostname rather than your origin server, since the origin no longer serves the image files. That means the CDN needs its own crawl rules:
User-agent: *
Allow: /wp-content/uploads/
Disallow: /
Make sure this robots.txt rule goes on your CDN, not your origin server.
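Since crawlers request robots.txt separately for each hostname, you can verify the CDN copy directly (cdn.domain.com is a placeholder matching the examples below):

$ curl http://cdn.domain.com/robots.txt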
If you use the WordPress SEO plugin by Yoast, this filter rewrites the image URLs in its XML sitemap to point at the CDN:

function wpseo_cdn_filter( $uri ) {
	// Swap the origin hostname for the CDN hostname in sitemap image URLs
	return str_replace( 'http://domain.com', 'http://cdn.domain.com', $uri );
}
add_filter( 'wpseo_xml_sitemap_img_src', 'wpseo_cdn_filter' );
Then regenerate your existing sitemaps, since the code above produces different URLs for your images.
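To confirm the rewrite took effect, you can grep the regenerated sitemap for the CDN hostname. A rough sketch; the sitemap path here is the plugin’s usual default and may differ on your install:

$ curl -s http://domain.com/post-sitemap.xml | grep cdn.domain.com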
Here’s a sample .htaccess configuration for your origin server:
<FilesMatch "\.(ico|pdf|flv|jpg|jpeg|png|gif|js|css|swf|webp|html)(\.gz)?(\?.*)?$">
  <IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteCond %{HTTPS} !=on
    RewriteRule .* - [E=CANONICAL:http://%{HTTP_HOST}%{REQUEST_URI},NE]
    RewriteCond %{HTTPS} =on
    RewriteRule .* - [E=CANONICAL:https://%{HTTP_HOST}%{REQUEST_URI},NE]
  </IfModule>
  <IfModule mod_headers.c>
    Header set Link '<%{CANONICAL}e>; rel="canonical"'
  </IfModule>
</FilesMatch>
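Before touching the CDN, it’s worth confirming the origin itself is emitting the header. A quick check with curl (the path is a placeholder); you should see something like:

$ curl -I http://domain.com/path/to/file.html | grep -i '^Link'
Link: <http://domain.com/path/to/file.html>; rel="canonical"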
Purge your CDN so the new headers propagate to its cached copies. To check and confirm canonicals are applied to CDN assets as well, use curl as follows:
$ curl -I http://cdn.domain.com/path/to/file.html
HTTP/1.1 200 OK
Date: Sat, 27 Apr 2013 06:06:06 GMT
Content-Type: text/html
Connection: keep-alive
Last-Modified: Sat, 09 Feb 2013 01:02:03 GMT
Expires: Sun, 05 May 2013 11:12:13 GMT
Link: <http://cdn.domain.com/path/to/file.html>; rel="canonical"
Cache-Control: max-age=604800
Server: NetDNA-cache/2.2
X-Cache: HIT

All done!