Difficult as it may be to believe, as of January 2007 Google is still unable to recognize that URLs which obviously lead to the same page are, in fact, the same page. So what’s a URL, and what’s the problem here?
URL (pronounced you-are-ell, or sometimes “earl,” as in Duke of) stands for Uniform Resource Locator. It’s the technical name for the address of a particular web page. For example, the URL of this site’s home page is https://www.tropicalwebworks.com, and the URL of this page is https://www.tropicalwebworks.com/2007/01/18/infamous-canonical-url/.
It’s common for any particular web page to be reachable at more than one URL. If this site were not configured optimally, the home page might be reachable at both https://www.tropicalwebworks.com and http://tropicalwebworks.com (notice the missing “www.”). Normal people would logically think that this would be desirable: after all, you don’t want people to get a “server not found” error if they try to reach your site without including the www part.
But Google sees these as two completely separate URLs that just happen to contain exactly the same content. There are two problems with such a situation:
- First, the “strength” of that page, and its ability to turn up in the search engine results, is diluted. Some of the page’s strength is allotted to one version, and some to the other, and neither “page” performs as well as it would if all the strength were concentrated in one page.
- And second, Google attempts to filter out pages containing duplicate content, based on the reasonable logic that people don’t want to see multiple results in their searches for the exact same thing. Thus, since both of these “pages” contain the exact same content, one of them will suffer in searches due to the dupe content filter.
It’s a double whammy. It’s not that your site actually has duplicate content; we might call this situation “virtual duplicate content.” But it’s all the same to Google: it’s duplicate content, period.
And if that’s not bad enough, many people link to their home page like this: http://www.example.com/index.html. Now Google sees yet another instance of duplicate content: http://www.example.com and http://www.example.com/index.html. So ultimately what Google sees is four “duplicate content” pages:
- http://www.example.com
- http://example.com
- http://www.example.com/index.html
- http://example.com/index.html
And all this before we’ve even gotten past the home page of your site!
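To make the point concrete, here is a purely illustrative Python sketch (not anything Google actually runs): a naive crawler that keys pages by their exact URL string would file these away as four separate entries, even though every one of them returns the same home page.

```python
# Illustration only: a naive crawler index keyed by the exact URL string.
# Every fetch below would return identical HTML, yet the index ends up
# with four "different" pages.
seen_pages = {}
for url in [
    "http://www.example.com",
    "http://example.com",
    "http://www.example.com/index.html",
    "http://example.com/index.html",
]:
    seen_pages[url] = "<same home page HTML>"

print(len(seen_pages))  # 4 entries, all with duplicate content
```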
It’s easy-peasy to configure the server to do what’s called a “301 permanent redirect” from the non-www version to the www version of your site. This technique, which is recommended by Google, tells Google that the two are indeed the same and keeps the poor Googlebot from deciding that you have duplicate content and splitting your page’s strength among more than one version. “301” refers to the status code that’s returned by the web server to the browser (or the spider, in this case), and it says, in effect, “Hey, the correct, permanent URL for the page you’re requesting is actually over there. Don’t index it at this URL.”
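In practice you would set this up in the web server’s own configuration (an Apache rewrite rule, for instance) rather than in application code, but if you want to see the mechanics of the 301 response laid bare, here is a minimal Python sketch. The www.example.com hostname is just the placeholder from above, and the port is arbitrary.

```python
# Minimal sketch of a non-www -> www canonical redirect, using only the
# Python standard library. Real sites would do this in the server config;
# this just makes the 301 status code and Location header visible.
from http.server import BaseHTTPRequestHandler, HTTPServer

CANONICAL_HOST = "www.example.com"  # the one version we want indexed

class CanonicalRedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        host = self.headers.get("Host", "")  # ignoring ports for simplicity
        if host != CANONICAL_HOST:
            # 301 Moved Permanently: "the correct, permanent URL is over there"
            self.send_response(301)
            self.send_header("Location", f"http://{CANONICAL_HOST}{self.path}")
            self.end_headers()
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<html><body>Home page</body></html>")

if __name__ == "__main__":
    HTTPServer(("", 8080), CanonicalRedirectHandler).serve_forever()
```

The only moving parts are the 301 status code and the Location header; that pair is the entire message Googlebot needs in order to consolidate the two versions.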
It’s likewise easy-peasy to link to your home page without the “index.html” (or other directory index name, such as home.htm or default.asp). For index pages in subdirectories, you simply link to the directory: http://www.example.com/subdirectory/, again leaving out the actual filename index.html.
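If your pages are generated from templates or scripts, a small helper can enforce this automatically. The sketch below is just one way to do it in Python; the list of index filenames is an assumption, so adjust it to whatever your own server actually uses.

```python
# Sketch of link canonicalization: strip common directory-index filenames
# so internal links point at the directory itself.
from urllib.parse import urlsplit, urlunsplit

INDEX_FILES = {"index.html", "index.htm", "index.php", "home.htm", "default.asp"}

def canonical_link(url: str) -> str:
    parts = urlsplit(url)
    path = parts.path
    last_segment = path.rsplit("/", 1)[-1]
    if last_segment.lower() in INDEX_FILES:
        path = path[: -len(last_segment)]  # drop the filename, keep the slash
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

# e.g. canonical_link("http://www.example.com/subdirectory/index.html")
#      -> "http://www.example.com/subdirectory/"
```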
I set up an appropriate 301 permanent redirect to the www version of every web site I develop. It’s not something I charge extra for, or something that I tout to my clients as being anything special. It’s about a 20-second task to set up the 301 properly. And I never link to directory index pages by filename. I don’t know why some of the big companies aren’t aware of this issue, or, if they are aware, why they don’t care enough to do it properly. It raises the question: if they’re so ignorant of, or uncaring about, a thing that is so simple to do right, in how many other areas are they incompetent?