Domain Canonicalization

by Nathan Buggia ~ May 2, 2008

Pop quiz: what's the difference between the following URLs:

  • http://website.com
  • http://www.website.com
  • http://website.com/default.php
  • http://www.website.com/default.php

Give up? If you're a user, then chances are you expect all of those URLs to lead you to the same page. Robots, however, are not as good at determining if pages are the same, so they often store each separately. A big part of how search engines rank pages is based on how many external links those pages have. If other sites on the web link to the different versions of your home page, then search engines may calculate the value of each URL separately, based on the number of links to each version. This can effectively diminish the potential rank your page would have if it were found (and linked to) by only one URL.

The practice of consolidating all versions of a page under one URL is referred to as "canonicalization" (because you collapse all versions under the "canonical" or true version). The four examples listed above are the most common, but there are potentially many, many URLs that lead you to the same page. By adhering to several best practices, you should be able to address 90% of common site-wide canonicalization issues on your site and consequently improve how your site ranks.

Recommendation

The solution is to be explicit about the canonical form of your URLs. Following are four best practices to achieve this, with specific code and configuration examples.

  1. Select WWW or Non-WWW, then redirect the other option to your preferred version.
    The hard part is choosing if you want your site to be "www.website.com" or simply "website.com". There is no right answer for every company so you'll have to figure this out on your own (but, removing the "www." saves your customers 4 keystrokes, which really add up on a mobile device, and it makes your brand the first thing your customers see).

    Once you've selected, you then need to find a way to trap all requests to your application, check which form is being used, and if it is not the correct form, initiate a 301 Redirect to the correct form. For example, if the user types in wikipedia.org, they will automatically get redirected to www.wikipedia.org.

  2. Remove the default filename from the end of your URLs.
    All web servers allow you to select one or more default filenames to serve when the browser requests a directory. For example, this website is run on IIS, so when the user requests "http://janeandrobot.com" we really serve "http://janeandrobot.com/default.aspx".

    In the same code you use to enforce www vs. non-www, you should also check and see if the default filename is at the end of the URL and then trim it off. So, "http://janeandrobot.com/default.aspx" would be converted to "http://janeandrobot.com".

  3. Link internally to the canonical form of your URL.
    Make sure you always link to the proper canonical form of your URLs from within your site. This practice helps encourage external sites to link to the site using the correct version as well (since those linking to you often cut and paste from your pages or RSS feed.) Note there is a degree of diminishing returns here, so you don't need to spend the whole weekend hunting down every last URL. Just make sure to review your site's primary navigation, top landing pages and blog.

  4. Use Google Webmaster Tools to tell Google the correct form.
    Implementing these best practices on your site is ideal, since they address the problem for all search engines and give your customers a consistent, properly branded navigation experience. But what can you do if you reviewed steps 1-3 and found that it would take six months to implement on your production site? There is something that you can do today: using Google's Webmaster Tools, you can navigate to the "Tools" section and select "Set preferred domain." Here you can specify if you'd like Google to use "www.website.com" or "website.com" in their index and search results, as well as consolidate links to both versions. Note that while this will provide you short-term benefit from Google, it does not help you in Yahoo! or Live Search.
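As a rough illustration of steps 1 and 2, here is a minimal Python sketch of the check you might run on every incoming request. It assumes the non-www form was chosen as canonical, and the function names and the list of default filenames are hypothetical, so adapt both to your own server and framework.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the default filenames your web server is configured to serve.
DEFAULT_FILES = {"default.aspx", "default.php", "index.html"}


def canonicalize(url):
    """Return the canonical form of url (non-www, no default filename)."""
    parts = urlsplit(url)

    # Step 1: drop the "www." prefix (assumes non-www is the preferred form).
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]

    # Step 2: trim a default filename from the end of the path.
    path = parts.path
    head, _, last = path.rpartition("/")
    if last.lower() in DEFAULT_FILES:
        path = head  # "/folder/default.aspx" becomes "/folder"
    if path == "/":
        path = ""

    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


def redirect_for(url):
    """Return (301, target) when url is not canonical, otherwise None."""
    target = canonicalize(url)
    return (301, target) if target != url else None
```

A request handler would call redirect_for() with the full requested URL and, if it returns a value, answer with that status code and a Location header set to the target.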

Checking Your Website

To check your website to see if you're handling domain canonicalization correctly, you can use the Live HTTP Headers add-on for Firefox. 

Open the Live HTTP Headers tool, then try all the variations of the URL at several different levels to ensure they all redirect back to the appropriate canonical form. As you're checking each variation, look at the HTTP headers using the Firefox plug-in to ensure they are all 301 redirects (and not, for instance, 302 redirects).
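If you would rather script this check than click through it by hand, the sketch below shows one possible approach using only Python's standard library. The function names are hypothetical; fetch_status() makes a single HEAD request without following redirects, and classify() applies the rule described above: the canonical URL itself should return 200, and every other variation should return a 301 pointing at the canonical URL.

```python
import http.client
from urllib.parse import urlsplit


def fetch_status(url):
    """Request url once, without following redirects.

    Returns (status_code, Location header or None).
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        conn.request("HEAD", parts.path or "/")
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def classify(tested_url, canonical_url, status, location):
    """Turn one response into a "Success" or "Fail: ..." test result."""
    if tested_url == canonical_url:
        return "Success" if status == 200 else "Fail: canonical URL returned %d" % status
    if status != 301:
        return "Fail: expected a 301 redirect, got %d" % status
    # Ignore a trailing slash when comparing the redirect target.
    if (location or "").rstrip("/") != canonical_url.rstrip("/"):
        return "Fail: redirects to %s" % location
    return "Success"
```

For each variation you would call fetch_status() and pass the result to classify() along with the canonical URL you expect.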

Here's an example test case:

Canonical URL Form                         Test Case                                  Test Result
http://janeandrobot.com                    janeandrobot.com                           Success
                                           janeandrobot.com/default.aspx              Success
                                           www.janeandrobot.com                       Success
                                           www.janeandrobot.com/default.aspx          Success
http://janeandrobot.com/about.aspx         janeandrobot.com/about.aspx                Success
                                           www.janeandrobot.com/about.aspx            Success
http://janeandrobot.com/folder             janeandrobot.com/folder                    Success
                                           janeandrobot.com/folder/default.aspx       Success
                                           www.janeandrobot.com/folder                Success
                                           www.janeandrobot.com/folder/default.aspx   Success
http://janeandrobot.com/folder/test.aspx   janeandrobot.com/folder/test.aspx          Success
                                           www.janeandrobot.com/folder/test.aspx      Success
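The test cases above follow a simple pattern, so they can be generated rather than typed out. The sketch below is one way to do that in Python. The variations() helper is hypothetical, and it assumes that a path whose last segment contains a dot is a file (which gets only the www and non-www variants), while anything else is a directory (which also gets the default filename appended).

```python
from urllib.parse import urlsplit


def variations(canonical_url, default_file="default.aspx"):
    """List the URL variations to test for one canonical URL."""
    parts = urlsplit(canonical_url)
    host = parts.netloc
    if host.startswith("www."):
        host = host[4:]

    path = parts.path.rstrip("/")
    last_segment = path.rsplit("/", 1)[-1]

    paths = [path]
    if "." not in last_segment:
        # Directory-style URL: also test it with the default filename.
        paths.append(path + "/" + default_file)

    return [h + p for h in (host, "www." + host) for p in paths]


for case in variations("http://janeandrobot.com/folder"):
    print(case)
```

Each generated variation should then redirect (or, for the canonical form itself, resolve directly) to the canonical URL it was generated from.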

Examples

Canonicalization issues are very common, and being a Microsoft employee, I don't have to go far to find an example. Check out the website for Microsoft's annual Mix conference for web developers.

I was able to generate the table below by plugging the common URL variations into Yahoo's Site Explorer to find a list of links to each variation. 

URL Variation                          Links from within website   Links from outside websites
http://visitmix.com                    17,663                      59,498
http://www.visitmix.com                9,074                       22,179
http://visitmix.com/default.aspx       0                           22
http://www.visitmix.com/default.aspx   0                           12


Looking through these numbers yields some interesting insights:

  • Not consolidating the "www" and "non-www" versions is definitely hurting their ranking - you can tell because both versions have a substantial number of inlinks. Ranking is done on a logarithmic scale, so every additional link is more valuable than the one before. If they redirected all versions to one canonical form, search engines would see their home page as having 81,711 external links, which would be a substantial boost.

  • They are not good about using the same version of the URL within their site. If you're not cognizant of this on your site, others won't be either. Based on the internal link counts above, they use visitmix.com about 66% of the time internally, and www.visitmix.com the other 34%.
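The arithmetic behind these observations can be reproduced directly from the table. The short Python sketch below uses only the link counts listed above; the variable names are just for illustration.

```python
# (internal links, external links) for each URL variation, from the table.
links = {
    "http://visitmix.com": (17663, 59498),
    "http://www.visitmix.com": (9074, 22179),
    "http://visitmix.com/default.aspx": (0, 22),
    "http://www.visitmix.com/default.aspx": (0, 12),
}

# External links the home page would be credited with under one canonical URL.
external_total = sum(external for _, external in links.values())

# Share of internal links that already use the non-www form.
internal_total = sum(internal for internal, _ in links.values())
non_www_share = links["http://visitmix.com"][0] / internal_total

print(external_total)              # 81711
print(round(non_www_share * 100))  # 66
```

Running the same calculation on your own site's link counts gives a quick sense of how much link value is split across variations.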

Discussion

VaBeachKevin

May 9. 2008

This is one of the most overlooked items in my opinion. Great post.

Kittu

May 21. 2008

I've heard using this on page redirection may be considered as a 302 redirection in the eyes of crawlers, because at first crawler is going to that page and read the code then it gets the instruction to move to the directed page.
Whereas I know the safest way is to move yourself to some Linux server which will be using apache and it stores a file named .htaccess you can give instructions of redirection within that file, cuz whenever a request is generated the crawler first reads into the .htaccess file this tells the crawler which page is to shown for the requested one and thus it is the complete 301 redirection. :)

Vanessa Fox

May 22. 2008

Hi Kittu - You're absolutely right that a 301 is the way to go. There are multiple ways of implementing a 301 (including using .htaccess if your server is Apache, as you've described). We'll be posting follow up articles about implementation techniques.

As for what you mention in your first paragraph, when you use an on page meta refresh, crawlers may interpret that differently than you expect. We'll be diving into those details in our implementation article as well.

Ashley Berman Hale

June 10. 2008

Hm - interestingly enough you have a link or two pointing to janeandrobot.com/default.aspx that does not 301 to janeandrobot.com.

I just thought I would give you a heads up about that. It looks like you covered it in the test case, so it might be a server/load balancing issue. Checked your header on that page and it's still showing as 200.

/beep.

Sarah

June 11. 2008

Not only is this particular article/tutorial brilliant, but so far the entire Jane + Robot site says it all exactly as it always should have been said - and all in one place. Things I try to tell my clients every day, with varying degrees of success.

Even better, the site isn't only just articles, it's an authoritative resource that cites other documentation. THANK YOU.

Nathan Buggia
Author

June 12. 2008

@Ashley - good catch. As many of you know, implementing proper canonicalization can be a lot more difficult than just writing down the best practices :)

We're still working on fixing the canonicalization of this site, we currently are tracking down a bug in our content management system, hopefully it will be fixed soon!

g1smd

June 13. 2008

*** removing the "www." saves your customers 4 keystrokes ***

If you have the site-wide 301 redirect from non-www to www in place, then the visitor can omit typing the www in, and your redirect will deliver them to the correct URL and to the correct content anyway.

There are good reasons to use the www version as the canonical form, not least the ability to do:

site:domain.com -inurl:www

to make sure that no other forms, other than www that is, have been indexed.

You can't do that if your redirect runs the other way.

Josh

June 26. 2008

The reason we advise clients to always prefer www to non-www is that it makes people notice the URL in print advertising, signage, and other media. "www." is a very powerful visual cue to the presence of a URL. Having the brand "stand out" through the lack of www is only important in those cases where the URL is shown without other material, which is rare and inadvisable.

Nathan Buggia
Author

July 1. 2008

@Josh and @G1SMD - good points, you've come real close to selling me on "www"

Randy Cooper

July 7. 2008

I'll go with Josh on the www

I'm curious now though about the use of subdomains. I've heard from both camps (1. builds pagerank on the primary domain) and (2. considered totally separate)

RKF

July 7. 2008

I'm personally a fan of the non-www addresses. For most clients I'll use the www because they often assume it and print it on their marketing material. For me ... it's unnecessary. I think people cue off the .com more than the www, and having the www in print can make it more difficult visually for a client to remember the domain name (especially on a vehicle or billboard). The most important thing for them to remember is the domain name (because you DO have your .com registered, right?) when you have your redirects in place.

g1smd

July 12. 2008

Even if your site does use the www as the "real" address, you can still advertise the site without the www in both print media and broadcasting channels, and let the redirect fix up the URL after the user types it in.

For example, when I want to do a search at Google, I type google.com in to the browser, nothing more. I don't bother with the www, as Google's own redirect automatically adds it on for me and then lets me search.

Larry Swanson

July 23. 2008

I think the www vs. non-www decision should also consider your audience. If you're trying to reach web-savvy techies, then by all means omit the www, but if you're trying to reach less tech-savvy "civilians" and/or doing a lot of offline promotion, then keep the www (for the reasons RKF mentions above). In either case, it is important to be consistent in your usage across all media (you might call this "canonical branding") since you never know where someone will be when they jot down your URL and link to you.
