My colleagues and I have recently been working to improve our practices when
it comes to semantic mark-up, document structure, metadata, and other fun
stuff. Part of this has entailed developing (or having Pierre, who was doing
an internship with us at the time, develop) metadata management features for
the CMS we use. This new plug-in has a number of features, which
basically boil down to automatically generating a number of <meta />
and
<link />
elements based on the client’s data.
This is all fairly normal – any CMS that doesn’t do this (either by default or with the help of a plug-in or extension of some sort) is pretty defective – but it can raise a few issues, especially when we make mistakes. In the rest of this post I’ll describe our mistake and it’s effects.
Generating metadata
Our metadata plug-in has four main features. It can:
- generate (and, if configured, let the user manage) page metadata;
- generate canonical URIs;
- embed Google Webmasters Tools verify; and
- embed Google Analytics tracking code.
The metadata the plug-in embeds include title, description, keywords, copyright, and authorship <meta />
elements, and a robots directive. If it’s enabled it’ll also output <meta />
and <script>
elements for Google Webmaster Tools and Google Analytics (we’re looking at extending this support for other tools including Xiti).
The problem, and the topic of this post, is our support for canonical URLs. The “Canonical URLs” feature is a exactly what it sounds like: a method which allows a document to specify the official, canonical URL by which it’s content should be indexed and accessed. This helps resolve the “duplicate content” problem and is supported and encouraged by Google (see Specify your canonical), Yahoo! (see Fighting Duplication: Adding more arrows to your quiver), Bing (see partnering to help solve duplicate content issues) and Ask.com (see Ask Is Going “Canonical”!), amongst others.
The problem
Our plug-in currently includes markup like this on every page it generates:
<link rev="canonical" href="http://example.com/An-Article.html" />
If your familiar with [X]HTML, you may have noticed a problem with this: we’ve
used the @rev
attribute instead of a @rel
. This is a small difference –
just a single character; just three bits difference in ASCII – but the effect
on the semantics of the metadata we’re generating and embedding on every page
is large.
rel
vs. rev
The @rel
attribute specifies the relation between the subject (the
document containing the <link>
) and the object (the URI specified in the
@href
). That is, <link rel="help" href="help.html" />
says that the help
for this document is help.html
. It’s like an arrow pointing from this
document, to another document.
The @rev
attribute, on the other hand, inverts this meaning. That is, <link rev="help" href="foo.html" />
says that this document is the help
for
foo.html
. It’s like an arrow pointing to this document from some other
document.
Allowing this reversed relation may make sense for some relationships (namely, those which are ignored by pretty much everything), but with something like canonical – which is interpreted by search engines – it can be dangerous. Our metadata plug-in is generating code which says, on every version of every page (and every page has as many as half a dozen URIs under SPIP), you shouldn’t use that page (the genuine canonical URL), I’m the real deal! That is, we’re telling the world at large that the canonical URL is not canonical. This is bad and wrong (though I suspect that the search engines probably just ignore such claims).
Needless to say, I’m going to fix this (or have one of the others do it) and push the new version to our sites first thing on Monday morning.
Why does such a thing exist?
In the mean time, I wondered why anyone could possibly want such a thing? I’ve
not read too much into this topic, but it seems reasonable to think that the
@rel
attribute was originally included for the sake of symmetry. Alas,
symmetry is great in a closed system – one where you can trust all claims
because you know that every part of the system can be trusted – but can cause
problems in an open system like the web – where you can’t necessarily trust
every participant in the system.
Given that we have @rev
(though it’s probably going to go away in HTML5,
and similar updates to XHTML), what is it used for? One use case is using URL
shortening services like TinyURL,
Bitly, Tr.im, etc. If you use
Twitter or some similar service you’re probably
familiar with these: they essentially give a shorter alias to a long URL.
Doing this with a reversed canonical is patently stupid (see Counting the
ways that rev=“canonical” hurts the
Web for the reasons
why its stupid), and I’m pretty sure no search engine or semantic web
application will actually use it (intentionally, at least), but there you go.
Conclusion
For more information, see the “rel” attribute frequently asked questions at the Microformats wiki, or search the web for such phrases as “canonical url rev rel”. There’s plenty of information out there.
Thanks to Pierre Rousset (who still needs a web-site I can link to when I
mention him) for his typo which lead me to this interesting topic, Mark
Aeschlimann who noticed and questioned it, and the
various people who’ve written about and documented <link>
, @rel
, @rev
and related topics.