Thursday, March 22, 2012

Canonical Link Element

0:07
Hi everybody. Welcome back to another video. We're doing this thing where when we speak at a conference
0:12
and we talk about something substantial, not just questions and answers, we talk through our presentation later
0:17
and put it up so people can follow along, watch the slides, and hopefully learn a little bit.
0:21
So today I wanted to talk about the canonical link element. And that's something that Google, Yahoo!, and Microsoft
0:28
all announced at SMX West that they will support. So, the date that we had this announcement was
0:37
February 12, 2009, and the funny thing about it is that Charles Darwin was born exactly 200 years before that day.
0:45
So I started out with a slide where I made a corny joke and I said, whether you think the web was intelligently
0:50
designed by Tim Berners-Lee, or whether you think the web needs to evolve, either way this is an open standard which
0:57
helps people improve the web. And so we sort of said, what is a big problem that faces people today,
1:05
webmasters, SEOs, site owners on the web? And it's pretty clear that duplicate content is one of the things that
1:11
people care about the most. So what is duplicate content? Well, I've got a slide here where I show, I think, eight
1:18
different URLs. You know, every single one of these URLs could return completely different content. In practice, we
1:26
as humans whenever we look at www.example.com or just regular example.com or /index or home.asp, we think of it as
1:34
the same page. And in practice, it usually is the same page. So technically it doesn't have to be, but almost always
1:41
web servers will return the same content for like these eight different versions of the URL.
1:46
So, that can cause a lot of problems in search engines if rather than having your backlinks all go to one page,
1:53
instead it's split between a www and a non-www version. And it's a really big headache. How do people solve this?
1:59
How do people fix this? Well, it turns out, and I'll dwell on this slide for just a few minutes, there are a lot
2:05
of ways to fix it. So, some people have joked that this canonical link element is kind of like, you know,
2:11
Spackle that fixes over the appearance of all the cracks in the wall. And the fact is there are a lot of
2:17
ways that you can fix things first and foremost, from the beginning, upstream where you don't need to fix it downstream
2:23
later on. There was a really funny quote by Jill Whalen at the conference where she said,
2:29
"Developers keep SEOs in business."
2:31
Right? And so whether you're a developer or an SEO there are some best practices that can make things a little bit
2:36
easier for your system so that you don't have to worry about this issue of duplicate content at all.
2:41
So, one is to try to make sure that your URLs are standardized (Microsoft sometimes calls them normalized), so that
2:48
in essence there's only one way to get to the content. If your content management system always generates consistent
2:55
URLs, and they're completely uniform, and you don't have to worry about having eight different versions in the
3:00
first place, that just saves you a lot of trouble. You don't have to worry about the issue coming up at all.
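
As a rough illustration of what "standardized" URLs could mean in practice, here is a minimal Python sketch that maps several variants of a homepage URL to one form. The preferred host, the list of default-document names, and the function name `normalize_url` are assumptions made for this example; they do not come from the talk's slides.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for this sketch: the preferred host and the default-document names.
PREFERRED_HOST = "www.example.com"
DEFAULT_DOCUMENTS = {"index.html", "index.htm", "index.asp", "home.asp"}


def normalize_url(url: str) -> str:
    """Map common variants of a URL on example.com to one standardized form.

    Port, userinfo, and fragment are dropped to keep the sketch short.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Treat the bare domain and the www host as the same site.
    if host in ("example.com", "www.example.com"):
        host = PREFERRED_HOST
    path = parts.path or "/"
    # Drop a trailing default document such as /index.html or /home.asp.
    head, _, last = path.rpartition("/")
    if last.lower() in DEFAULT_DOCUMENTS:
        path = head + "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))
```

With these assumptions, `http://example.com`, `http://www.example.com/index.html`, and `http://EXAMPLE.com/Home.asp` all normalize to `http://www.example.com/`.
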
3:05
So one way to do that is to fix your content management system or your software so that you only generate these URLs
3:11
in a very consistent way. Another thing to do is to think about your site. Suppose you have www.example.com and
3:19
non-www, just plain old example.com. Well if you link to www sometimes and non-www sometimes, it's natural that
3:26
search engines might get a little bit confused. So linking consistently, saying okay, my homepage is going
3:33
to be www.example.com/. Nothing else, that's it. And then making sure that all of your internal linking is consistent,
3:40
that alone can make a really big difference, so that you don't end up with two, three, four copies of each page.
3:45
If you do have, you know, home.asp or index.html, you can rewrite such that all those other URLs are 301 redirects
3:56
to a single URL. So, it's great if you can fix it at the beginning, it's great if you can link consistently so the
4:02
issue never comes up, but if duplicate URLs do occur, then you can use a 301, a permanent redirect as we refer to it,
4:09
to sort of standardize and glom together all of those URLs. And search engines will follow that 301 redirect,
4:15
and typically group them all together. Google also does a couple of extra things that some search engines don't do.
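
The 301 approach just described can be sketched as a small decision function: given the host and path of a request, either serve the page or answer with a permanent redirect to the single preferred URL. This is only an illustration in Python; the host name, the duplicate paths, and the function name `redirect_for` are assumptions, and a real site would normally configure this in its web server or framework.

```python
from http import HTTPStatus

# Assumptions for this sketch: the preferred host and the duplicate homepage paths.
CANONICAL_HOST = "www.example.com"
DUPLICATE_PATHS = {"/index.html", "/home.asp"}


def redirect_for(host: str, path: str):
    """Return (status, location) for a duplicate URL, or None to serve the page."""
    host = host.lower()
    # Collapse known duplicate homepage paths onto "/".
    target_path = "/" if path.lower() in DUPLICATE_PATHS else path
    if host == CANONICAL_HOST and target_path == path:
        return None  # already the single preferred URL
    # 301 tells browsers and crawlers that the move is permanent.
    return HTTPStatus.MOVED_PERMANENTLY, f"http://{CANONICAL_HOST}{target_path}"
```

For example, a request for `/home.asp` on `example.com` would be answered with a 301 whose location is `http://www.example.com/`, while `/tents` on `www.example.com` would be served directly.
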
4:21
So, in our Webmaster Tools, our webmaster console, which is totally free, doesn't cost anything at all,
4:28
you can specify, for example my site is mattcutts.com, you can specify if you prefer www.mattcutts.com or non-www,
4:36
so just mattcutts.com. That's a very easy setting, and that solves a lot of duplicate content issues right there.
4:42
And a little-known fact, not everybody realizes this, is that whenever you submit your URLs in what
4:48
we call a Sitemap, which is another standard that's supported by many major search engines, and it's a very simple
4:53
file, it can be as simple as a list of URLs, we take that list of URLs that you submit, and we say to ourselves,
5:00
oh, if we see a URL in that list, and then we see another version of it that's not in the list, we will prefer
5:06
URLs in the list that you gave us. So we sort of use it to break ties whenever you submit URLs from a Sitemap.
5:12
So there's at least a couple ways that you can give Google hints that try to help out with duplicate content.
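
The tie-breaking idea can be pictured with a few lines of Python. This is a sketch of the concept as described in the talk, not Google's actual implementation; the function name `pick_preferred` and the fallback to the first duplicate are assumptions.

```python
def pick_preferred(duplicates, sitemap_urls):
    """From a group of duplicate URLs, prefer one that appears in the Sitemap.

    duplicates: list of URLs believed to show the same content.
    sitemap_urls: set of URLs the site owner submitted in a Sitemap.
    """
    listed = [url for url in duplicates if url in sitemap_urls]
    # Assumed fallback: keep the first duplicate when the Sitemap gives no hint.
    return listed[0] if listed else duplicates[0]
```

Given the duplicates `http://example.com/` and `http://www.example.com/`, and a Sitemap that lists only the www version, this returns the www version.
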
5:18
But, that said, there will probably always be duplicate content issues that you can't fix. So, just to run through
5:26
a few example ones. Sometimes, you can't generate a permanent or 301 redirect. For example, at my old school account,
5:33
cs.unc.edu, I don't run the web server there. So I'd have to open a ticket or drop an email to the people that
5:39
administer that system and say hey, can you add a 301 redirect from this page to that page. A lot of free hosts,
5:45
you might not be able to generate a 301 redirect. And you can't help how people link to you. So for example,
5:52
you know, even if you link consistently to just the www version of your website, some other people might link to
5:59
the non-www version. And you can't really control that at all.
6:03
Uppercase versus lowercase paths. Microsoft IIS will support showing pages whether you link to home.asp capitalized
6:13
or lowercase, and sometimes even mixed case. And so if people link to different versions that are uppercase and
6:19
lowercase mixed, that can cause some issues. Session IDs are another really big factor. So I have seen,
6:26
at least in some search engines, a site with a one-page privacy policy. And that privacy policy was indexed
6:33
three thousand times, each time with a different session ID, because the privacy policy was slightly different each time.
6:41
So, you know, session IDs in general if you can avoid them are great. But sometimes you as the
6:47
search engine optimizer or the person who is responsible for the site can't get rid of them entirely.
6:52
Tracking codes, you know, if you're buying ads. Analytics, you know the UTM parameter, landing pages where they
6:58
have to be different landing pages for different ads, those are the sort of things that you sometimes can't get rid of.
7:04
And if you run an e-commerce site, suppose you have different products. You might have sort by descending price
7:10
or sort by ascending price, and sometimes you need to have different facets, different views of your data, and
7:16
conceptually it's really the same thing, it's just a different way to slice and dice it.
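
To make the session ID, tracking code, and sort order cases concrete, here is a minimal Python sketch that computes a clean URL by dropping query parameters that do not change what the page is about. The parameter names (`sessionid`, `sort`, `ctx`) are assumptions for illustration, and real sites will use different ones; `utm_` is the prefix used by the analytics tracking parameters mentioned above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed parameter names for this sketch; adjust to what a given site uses.
IGNORED_PARAMS = {"sessionid", "sort", "ctx"}
IGNORED_PREFIXES = ("utm_",)


def strip_noise_params(url: str) -> str:
    """Drop session, tracking, and sort parameters; keep everything else."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
        and not key.lower().startswith(IGNORED_PREFIXES)
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

Under these assumptions, `http://www.example.com/tents?color=red&sort=price_desc&utm_source=ad&sessionid=abc123` reduces to `http://www.example.com/tents?color=red`, which is the kind of URL a site might then declare as its preferred version.
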
7:21
Finally, there's breadcrumbs. So breadcrumbs are how did I get to this page? Am I coming to this red tent example
7:28
via tents, or am I coming to it via colors, or did I come to it because I was interested in accessories?
7:34
How did I land on this page? Even Google's own webmaster help documentation sometimes has a CTX parameter that says
7:41
here's how we got to this page. And that day, it was kind of funny, the Queen had just launched a new website:
7:50
royal.gov.uk. And so I wish the Queen the best, I want her to live long, and I wish the British monarchy the best,
7:59
however, someone at the Telegraph, telegraph.co.uk, had done an SEO audit of this site, and they had found
8:07
duplicate content issues. So you can see right here, just slash, royal.gov.uk/Home.aspx, and then at the very bottom
8:15
I almost made a ransom note style where I mixed uppercase and lowercase. And the royal website returned the same page
8:23
for all three of those URLs. So that was just a very simple example to illustrate that anybody can have these
8:29
sorts of issues.
8:31
So what's the answer? Let's, you know, I've buried the lede enough, how do people solve this particular problem?
8:37
Well, assuming you can't solve it any other way, and absolutely I encourage you to try to fix it upstream,
8:42
to try to link consistently. This is not something that you should just say, oh, now all my problems are solved,
8:47
I don't have to worry about anything else. But, if you can't solve your problems in other ways, there's a very
8:52
simple element, link element, where you can say my canonical, and that's a long word that means you know, my preferred,
9:00
or the primary, or the clean, the pretty version of the URL that I want to use, is not this ugly URL with a tracking
9:07
code or a session ID, it's this pretty URL right over here. And all you have to do is in the head element of this
9:14
document say you know what, even though this has a weird session ID, the pretty version, the canonical version of
9:20
this URL, is over here. And that's literally all it is. It's a very simple open standard. It's one simple element
9:28
that you add to the head of your document.
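
For readers who generate pages from code, the element described above can be produced with a one-line helper. This is a minimal Python sketch; the function name `canonical_link` and the example URLs are assumptions for illustration.

```python
from html import escape


def canonical_link(href: str) -> str:
    """Build the <link rel="canonical"> element for a page's <head>."""
    # Escape the URL so characters such as & are valid inside the attribute.
    return f'<link rel="canonical" href="{escape(href, quote=True)}">'


# A page served at an "ugly" URL with a session ID declares its clean version.
ugly_url = "http://www.example.com/tents?color=red&sessionid=abc123"
head_snippet = canonical_link("http://www.example.com/tents?color=red")
```

Here `head_snippet` holds `<link rel="canonical" href="http://www.example.com/tents?color=red">`, which would be placed inside the `<head>` of the page served at `ugly_url`.
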
9:31
Some interesting little tidbits. This is the director's cut so you get a little bit of extra info. Is this a tag?
9:38
Well, it's kind of, the technical name I believe is "element." But we're all friends here, nobody's going to abuse
9:45
you or you know make fun of you if you call it a canonical link tag versus a canonical link element. People often
9:52
speak about meta tags, right? And so meta tags are things that go in the head of the document as well. And so, if
9:59
a meta tag has a value that is a hyperlink, I think the most correct thing is not for it to be meta, but for it to
10:05
be called "link." And so that's why you see link rel="canonical" href= and the value. So now you know the official
10:12
name, but nobody's going to care if you just call it the canonical link tag.
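
To show what reading `link rel="canonical" href=` out of a page might look like, here is a small Python sketch built on the standard library's `html.parser`. It illustrates the general idea only and makes no claim about how any search engine parses pages; the class and function names are assumptions.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head and self.canonical is None:
            attributes = dict(attrs)
            # rel may hold several space-separated values.
            rel_values = (attributes.get("rel") or "").lower().split()
            if "canonical" in rel_values:
                self.canonical = attributes.get("href")

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_canonical(html_text: str):
    """Return the canonical href declared in the head, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    return finder.canonical
```

In this sketch, a link element that appears in the body rather than the head is ignored, matching the description of the element as something that goes in the head of the document.
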
10:18
One thing that's kind of interesting about this tag, let's just talk about a few high-order bits.
10:25
We don't promise we're going to abide by this 100%. Right? You know, if we see a webmaster and they've accidentally
10:31
shot themselves in the foot, you know maybe they've created an infinite loop, and it's very easy to create an
10:37
infinite loop, we reserve the right to do what we think is best. At least at Google, we are going to treat this as
10:42
a very strong hint. So unless we see some weird corner case or something where you're probably hurting your own site,
10:49
we probably would expect to respect this tag. So I think that in most cases, it will work quite well. But we do have
10:56
to reserve the final, sort of bottom-line ability to say no, we don't think this is what's best for the users.
11:03
Again, if you can fix it yourself upstream, that's much better. So look at all the other alternatives, the other
11:09
choices before you use this tag. Don't just say, oh, I can just slap everything with a canonical link tag and
11:14
boom, I'm done.
11:17
If you're a regular user, just like a mom-and-pop and you use WordPress or you use some shopping cart software,
11:24
it's probably best not to just roll up your sleeves and go digging into it and trying to fix it all yourself,
11:30
at least not quite yet. Wait a little while, because I think plugins will come out, people are talking about hey,
11:36
is WordPress able to add this to the core software, so maybe you don't even need a plugin? So if you're just a regular
11:41
user and you wait a few months, things should be fine. You know it's a brand-new element, so there's time for you
11:47
to sit down and cautiously deliberate and say okay, what kinds of duplicate content do I have, how can I fix it?
11:55
Take a little bit of time. Don't just jump right in and start, oh I'm going to point everywhere, I'm going to do everything.
12:00
There's enough time where this will be supported so you can plan ahead a little bit.
12:05
And as always, if we see people abusing it, we do reserve the right to change how we treat the tag, or to
12:11
not respect the tag. There is a nice way that we try to prevent abuse. We allow things within the same domain,
12:20
but we don't allow things to cross domains. So with 301s, there's always been this notion of can I hijack a site by
12:27
doing weird 301s, and can I steal the reputation of some other site? And at least right now, this element is not
12:34
really subject to that because you can only use it within the same domain. Now a natural question right after that,
12:41
is well, what about subdomains? Can I, you know, do things across different hostnames?
12:45
And the answer is yes, you can. So, I was talking to Tony Hsieh from Zappos, and they were talking about duplicate
12:51
content. And they have a server called zeta.zappos.com, which is sort of their staging software and might be the
12:56
next version. And they were saying, well, can I send my canonicalness, can I splat it from zeta.zappos.com to
13:03
www.zappos.com? And the answer is yes, you absolutely can.
13:08
Can you use it from https and send that to http? Totally, works great for that. It's on the same domain, so it's
13:16
no problem at all, at least within Google to use it for that purpose.
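
The same-domain rule described here (subdomains and https-to-http are fine, crossing domains is not) can be sketched in Python. The notion of "domain" below is a deliberately naive assumption, the last two labels of the hostname, which is wrong for suffixes such as .co.uk; a real check would need a public-suffix list. The function names are hypothetical.

```python
from urllib.parse import urlsplit


def site_domain(url: str) -> str:
    """Naive 'domain' of a URL: the last two labels of the hostname.

    Assumption for illustration only; incorrect for suffixes like .co.uk.
    """
    host = (urlsplit(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def canonical_allowed(page_url: str, canonical_url: str) -> bool:
    """True when both URLs share a domain, ignoring scheme and subdomain."""
    page_domain = site_domain(page_url)
    return page_domain != "" and page_domain == site_domain(canonical_url)
```

Under this sketch, pointing from `http://zeta.zappos.com/` to `http://www.zappos.com/` is allowed, as is pointing from an https URL to the http URL on the same host, while pointing from `example.com` to `example.org` is not.
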
13:19
And then what's the difference between this and a 301 or a permanent redirect? There's really not that much,
13:26
other than this is restricted to one domain. So 301s can cross domains; this is all within the same domain.
13:33
In fact, whenever I think about it, the mental model that I have is that this is essentially like a little mini
13:40
301 redirect that you can generate with this link element. So, you know, if you think about how Google handles 301s,
13:48
that's probably a pretty good guess of how we'll handle this particular element.
13:54
So, a few more questions, since you've got the time, you're watching the video. Do the pages have to be identical?
14:01
Bit for bit identical? No, they do not. Think again about this case where you have a catalog page and you can sort
14:08
by increasing price or decreasing price, those are conceptually pretty close to the same page. So if you want to say
14:14
map this to the same URL, and don't worry about the sort by parameter, you're more than welcome to do that.
14:22
They should be similar. You know, if we see, this is the only thing I can think of where there could be abuse,
14:26
is if you've got a cartoon page over here, and you've got something that's completely irrelevant to cartoons over
14:31
here and you try to combine them together. And you're not really gaining any advantage because you had PageRank on
14:36
this page and on that page. So it really doesn't make sense to combine them, but we do recommend that you use them
14:42
for similar pages. They don't have to be identical, but they should be similar.
14:46
A few sort of niggly bits. How about relative URLs versus absolute URLs? The answer to that is you can use either one.
14:55
We recommend absolute URLs. And there's a very simple reason. When you have relative URLs, you can move a URL and
15:02
everything stays the same relative to that URL. So essentially, you know the homepage can say /images or images.
15:10
And that will resolve relative to that particular page. But it's better to have an absolute URL because this is
15:17
a powerful tool, and you really want to say this URL goes to exactly this URL. So you want to specify that.
15:23
Whereas if it's relative, if you mess it up here, then you might mess it up somewhere else as well.
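
The difference between relative and absolute values can be seen with Python's `urljoin`, which resolves an href against the URL of the page it appears on, in the way browsers resolve relative links. The page URLs and the helper name `resolve_canonical` are assumptions for this sketch.

```python
from urllib.parse import urljoin


def resolve_canonical(page_url: str, href: str) -> str:
    """Resolve a canonical href against the URL of the page that declares it."""
    return urljoin(page_url, href)


# Two copies of the same page, reachable at different locations.
page_a = "http://www.example.com/shop/tents/red.html"
page_b = "http://example.com/archive/2009/red.html"

# A relative href ends up pointing somewhere different for each copy.
relative_a = resolve_canonical(page_a, "red-tent.html")
relative_b = resolve_canonical(page_b, "red-tent.html")

# An absolute href names exactly one URL no matter where the page is served.
absolute_href = "http://www.example.com/tents/red-tent.html"
absolute_a = resolve_canonical(page_a, absolute_href)
absolute_b = resolve_canonical(page_b, absolute_href)
```

Here `relative_a` is `http://www.example.com/shop/tents/red-tent.html` and `relative_b` is `http://example.com/archive/2009/red-tent.html`, while `absolute_a` and `absolute_b` are both the single absolute URL.
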
15:28
Can you follow a chain of canonical tags, or canonical elements, just like you can follow a chain of 301 redirects?
15:35
Yes, but again I don't recommend that, because if you have a big site and you have a big chain of 301 redirects,
15:41
it's easy for something to break. So, it's similar, something can break and you end up with consequences
15:47
that you didn't intend, so what I would recommend is absolute URLs, and going from the old URL to the new URL, one hop
15:55
and that's all you do. It's just simpler that way, and you know you want to play it safe. You don't want to
16:01
accidentally shoot yourself in the foot. So what are some ways you can shoot yourself in the foot? Well, what if
16:07
you say my canonical is over here, and that's a 404 page? Right, the page might not exist. What if you had an
16:13
infinite loop? This is canonical. No, this is canonical. And we've all seen those happen, you know, what is the
16:18
Civil War? Look up the War Between the States. What is the War Between the States? Look up the Civil War.
16:23
You know, and now you have to put the dictionary down and your head hurts. So try to avoid infinite loops.
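
The failure modes just listed (a canonical that points at a missing page, and two pages that each name the other) can be made concrete with a short Python sketch that follows a chain of canonical hints. Everything here is an assumption for illustration, including the function name, the hop limit, and the choice to fall back to the starting URL; it does not describe how any search engine actually resolves these cases.

```python
def resolve_chain(start, canonical_of, known_pages, max_hops=5):
    """Follow canonical hints from `start`; return (final_url, problem).

    canonical_of: dict mapping a URL to the canonical URL it declares.
    known_pages: set of URLs that exist (anything else acts like a 404).
    problem is None when the chain resolves cleanly.
    """
    seen = {start}
    current = start
    hops = 0
    while True:
        target = canonical_of.get(current)
        if target is None or target == current:
            return current, None  # no hint, or a harmless self-reference
        if target not in known_pages:
            return current, "target missing"  # canonical points at a 404
        if target in seen:
            return start, "loop"  # A says B, B says A: ignore the hints
        if hops == max_hops:
            return start, "chain too long"
        seen.add(target)
        current = target
        hops += 1
```

A single hop from an old URL to a new one resolves cleanly, which is the setup recommended above; a two-page loop or a missing target is reported instead of being followed.
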
16:28
What if I point to a URL that hasn't been crawled? You know, we'll try to crawl that URL. Or another corner case,
16:34
what if you said in the webmaster console, oh yeah, everything should be www.example.com, but then you specify your
16:42
canonicals as non-www, or without the www. So you can do all these sorts of things to almost shoot yourself in the
16:48
foot, and the answer is we will try to handle all of these corner cases in a reasonable way. The slide has some
16:54
Ghostbusters because there's the old saying, "Don't cross the streams," right? So think about this, take some time,
17:00
don't just throw canonical tags on willy-nilly on your site, you know, try to plan it out a little bit so that you
17:06
don't run into these corner cases.
17:09
So we're getting towards the end of the presentation. I just really wanted to send a shout out to Joachim, who is
17:14
the Google engineer who really did all the implementation, all the heavy lifting on this. Made sure that it worked
17:19
very nicely within a 301, and thought about all the corner cases. So, for example, someone said, well what if
17:25
I have a canonical, and I point to myself? Does that work? Yep, that works fine. What if I have a canonical and my
17:31
href is empty? Well, it turns out that parses as an error, which turns out to point to itself. So all this stuff
17:38
still works because Joachim did a really good design, but again, try to make sure that it's all absolute URLs and
17:44
everything's specified well. Also, I'd love to send a shout out to Greg Grothaus. It turns out when you dig into this,
17:51
a lot of people have proposed similar ideas. I saw at least one post out on the general web after we'd started
17:59
exploring this that said, hey, why don't you do this kind of a proposal? But Greg was really one of the people who
18:05
sparked the discussion at Google, who really pushed for it and had a great idea, and so I sort of think of him as
18:11
at least within Google, he really got the ball rolling and really sparked the wave of work on this, so I really
18:17
appreciate that. And of course all the people, you know, from Maile and Wysz and Adam and Riona who have worked on
18:23
the messaging and reached out to different people. At Yahoo!, Priyank, and a ton of people at Microsoft,
18:30
Nathan Buggia and a bunch of other people as well. My hope is that lots of search engines will support this.
18:35
So, Yahoo! and Microsoft have announced that they will support it, let's keep our fingers crossed for Ask, I'd love
18:41
for them to join in as well. Wikia, so Artur at Wikia had emailed us and sort of asked about doing canonical tags
18:49
anyway. And so it was really great that they could test it out while we were trying it out ourselves.
18:54
And then a ton of webmasters who always give us this sort of feedback on what they'd like to see.
18:59
On this last slide, I just list a bunch of resources, so Google, Yahoo!, and Microsoft all did blog posts about it.
19:06
There's an official Help Center documentation page. And, what we saw was, as people would come and have duplicate
19:13
content questions, Joost had come and sort of asked about an interesting corner case, we just said, hey, you know
19:19
what? We've got this thing coming out that might help with this. And so it was a very nice way to just do a sort
19:23
of very quiet beta test and see how well it worked. So, Joost happened to email just a few days before we were
19:30
ready to announce support, and so we gave him a heads-up about the possibility of this, and he turned around
19:35
plugins not just for WordPress, but also for Magento, which is an e-commerce shopping software, and Drupal, which
19:41
is another open-source content management system, which I think the White House just rolled out using Drupal.
19:46
So really appreciate the work that he's done as well. And in general, you know, be careful, be cautious, plan out
19:54
how you want to use this tag. But we don't intend to make any money off of it, we think it's just good for the web.
19:59
It'll lead to less duplicate content. It's an open standard, so any search engine that crawls the web can use this
20:06
information to help, you know, make the web more relevant and increase the relevancy of their search results.
20:09
And now you know as much as the audience knows when they attended SMX West.
20:14
Thanks very much for listening, and talk to you soon.
