This article shows you how to keep your multilingual websites clean and flexible while still serving them in a way that search engines understand.
Making the contents of your website available in multiple languages is hard enough (unless you are okay with machine-generated junk), but its organisation can sometimes feel even harder. How do I keep the source files organised? How do I keep multiple versions in sync? How do I serve them in a way that search engines understand? And, finally, how do I make the URLs more appealing while still meeting the other criteria and constraints?
Multilingual websites typically use one of the following URL patterns:
- https://en.example.com/section/subsection/page
- https://example.com/section/subsection/page.en.html
- https://example.com/section/subsection/page?lang=en
- https://example.com/en/section/subsection/page
First let’s see how these work, and what the pros and cons are. Then let’s discuss how to choose a scheme that you like even when you are constrained by the limitations of the CMS or site generation method you use.
The differences
The first pattern relies on subdomains. In effect, you are running N independent websites, which is generally discouraged because every additional site adds setup and maintenance cost. Also, if a search engine relies on backlinks to rank your site, one language version might not benefit from links to another, even though both serve the same quality content in different languages. So there is no reason to choose this pattern unless you intentionally want to keep the versions independent (e.g., because the contents differ considerably or different regulations apply).
Well, what about the rest? All of them are fine with respect to SEO and accessibility, as long as you set hreflang annotations, the HTML lang attribute and/or the Content-Language header.
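As a rough illustration of the hreflang part, here is a minimal Python sketch that builds the alternate-link tags for one page under the /en/ prefix scheme. The function name hreflang_links, the base URL and the language list are assumptions made for this example; your CMS or generator may produce these tags in its own way.

```python
from html import escape

def hreflang_links(base_url, path, languages):
    """Return one <link rel="alternate"> tag per language version of a page."""
    path = path.strip("/")
    return [
        f'<link rel="alternate" hreflang="{escape(lang)}" '
        f'href="{escape(base_url)}/{escape(lang)}/{escape(path)}">'
        for lang in languages
    ]

# Hypothetical site with English and Malayalam versions of the same page.
for tag in hreflang_links("https://example.com", "section/subsection/page", ["en", "ml"]):
    print(tag)
```

Each language version of the page would carry the full set of tags in its head section, including the tag that points to itself.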
The ?lang=en (query param) method is typical if your site is run using a dynamic CMS. The .en.html scheme is indicative of a static site generator or manually created files. Both will appear strange to the general audience, compared to the /en/ prefix scheme.
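To make the four patterns concrete, the following Python sketch works out which scheme a URL uses and which language it carries. It assumes two-letter lowercase language codes and a query parameter called lang, as in the examples above; the function name language_of and the scheme labels are invented for this illustration.

```python
import re
from urllib.parse import urlsplit, parse_qs

def language_of(url):
    """Return (scheme_label, language_code), or (None, None) if no language is found."""
    parts = urlsplit(url)
    # Subdomain: en.example.com
    m = re.match(r"^([a-z]{2})\.", parts.hostname or "")
    if m:
        return "subdomain", m.group(1)
    # File suffix: page.en.html
    m = re.search(r"\.([a-z]{2})\.html$", parts.path)
    if m:
        return "suffix", m.group(1)
    # Query parameter: ?lang=en
    lang = parse_qs(parts.query).get("lang")
    if lang:
        return "query", lang[0]
    # Path prefix: /en/...
    m = re.match(r"^/([a-z]{2})/", parts.path)
    if m:
        return "prefix", m.group(1)
    return None, None

print(language_of("https://example.com/en/section/subsection/page"))
```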
Choose one, show another
Your CMS or site generator might let you make a selection between these schemes. But what if it doesn’t? Also, what if you intentionally want to choose one scheme and expose another? This might be needed if an unappealing scheme is easier to maintain. For example, the /en/ scheme looks nice, but it’s harder to maintain if your site is directory-based, because you have to replicate and sync the directory structure across all languages (e.g., /en/section/subsection/page, /ml/section/subsection/page, etc). But if you choose the .en.html scheme, you keep all the versions of a page in the same directory, making maintenance much easier. This is where you can rely on your web server to hide what’s inside and expose a different interface.
Apache RewriteRule
The Apache RewriteRule helps you transform the URLs that the user requests, either as a visible redirection (so that the URL will change in the browser’s address bar), or in a totally transparent manner (so that the URL in the address bar remains the same).
Since writing the correct rules and avoiding redirection loops is tricky, let’s try the visible variant first (the user sees the pretty URL becoming the strange one), and then move on to making the process internal.
Here’s the RewriteRule snippet that you can put in the .htaccess file under your document root (the main directory where you keep your site’s files). Please note that Apache recommends placing rewrite rules in the main server configuration rather than in .htaccess files where possible; we’re using .htaccess only to keep things simple for the time being.
    RewriteEngine On
    RewriteRule ^([a-z][a-z])/(.+)$ /$2.$1.html [R=307]
If you are new to RewriteRule, think of it as a directive that maps a URL pattern to a transformation. Whenever Apache gets a request from a visitor, it matches the URL path against the pattern and, if there is a match, rewrites the URL according to the rule. The rule shown above matches every path that starts with two lowercase letters followed by a forward slash and at least one more character (the dot means any character and the plus means at least one). In an .htaccess file, the pattern is matched against the path without its leading slash, which is why the pattern does not begin with one. The second part of the rule (/$2.$1.html) is the replacement URL, with $2 and $1 referring to the second and first parenthesised parts of the pattern (the tail and the language code, respectively).
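If you would like to experiment with the pattern before touching the server, the same regular expression can be tried in Python. This is only a sketch of the matching and substitution step, not of Apache itself; the function name rewrite is made up for the example, and Python reads the captured groups through match.group() where Apache uses $1 and $2.

```python
import re

# Same pattern as in the RewriteRule; the path is given without its leading slash.
PATTERN = re.compile(r"^([a-z][a-z])/(.+)$")

def rewrite(path):
    """Return the rewritten path, or None when the rule does not match."""
    match = PATTERN.match(path)
    if match is None:
        return None
    lang, tail = match.group(1), match.group(2)
    return f"/{tail}.{lang}.html"

print(rewrite("en/section/subsection/page"))  # /section/subsection/page.en.html
print(rewrite("section/subsection/page"))     # None (no two-letter prefix)
```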
Try a redirect first
The third part ([R=307]), reserved for flags, currently says the rewrite should happen in the form of a visible 307 (temporary) redirect. We’ll change it after testing to make sure the matching and rewriting work. It’s easier to test while the rewrite is visible.
Let’s see if and how the URLs get rewritten by sending requests using the curl command. Alternatively, you can try the developer tools in your favourite browser.
Note: If the above .htaccess file results in an internal server error, the rewrite module may not be enabled on your server. On Ubuntu and other Debian-based distros, you can enable it with sudo a2enmod rewrite && sudo systemctl restart apache2.
Here’s what happens with curl requests:
    $ curl -I localhost/en/section/subsection/page
    HTTP/1.1 307 Temporary Redirect
    ...
    Server: Apache/2.4.52 (Ubuntu)
    Location: http://localhost/section/subsection/page.en.html
    ...
Now let’s try another language:
    $ curl -I localhost/ml/section/subsection/page
    HTTP/1.1 307 Temporary Redirect
    ...
    Server: Apache/2.4.52 (Ubuntu)
    Location: http://localhost/section/subsection/page.ml.html
    ...
At this point, a request for a language that you have no translation for is also redirected, and the 404 (Not Found) appears only when the client follows the redirect. Once we make the rewrite an internal process instead of a redirect, such URLs will return 404 on the first request itself.
Make it internal
Alright, let’s make the rewrite internal and invisible to the user. First, make sure you have the files where they are expected to be:
    ~/www$ tree section
    section
    └── subsection
        ├── page.en.html
        └── page.ml.html
(In my case, www/ inside the home directory is the document root; I believe you know where yours is. By the way, you can use a graphical file browser instead of tree to check your directory structure.)
Now, how does the htaccess file go?
    RewriteEngine On
    RewriteRule ^([a-z][a-z])/(.+)$ /$2.$1.html
Yes, the only change was to remove the flags part. Now we can test again to make sure there is no redirect this time:
    $ curl -I localhost/en/section/subsection/page
    HTTP/1.1 200 OK
    ...
    $ curl -I localhost/ml/section/subsection/page
    HTTP/1.1 200 OK
    ...
Here’s another test to make sure languages that we don’t have files for result in 404 as expected:
    $ curl -I localhost/xx/section/subsection/page
    HTTP/1.1 404 Not Found
    ...
Finally, here are tests that actually bring us the page body:
    $ curl localhost/en/section/subsection/page
    <h1>Welcome!</h1>
    $ curl localhost/ml/section/subsection/page
    <meta charset="utf-8">
    <h1>സ്വാഗതം!</h1>
The same is true from a browser (note how the URLs remain unchanged, although we are still serving page.en.html and page.ml.html).
Figure 1 shows how English and Malayalam pages get served from .en.html and .ml.html files while the browser still shows /en/-style URLs.
Enforcing the new pattern
Alright, the /en/-style URL works fine, without the user knowing what’s under the hood. But the .en.html URLs are still accepted unchanged if someone enters them manually or arrives through old bookmarks or external links. This is not a security issue or anything, but what if you want to enforce the new pattern, so that people (and search engines) will be inclined to bookmark and share your URLs this way?
Instinctively, one might modify the htaccess file like this:
    # WRONG! Read below.
    RewriteRule ^(.*)\.([a-z][a-z])\.html$ /$2/$1
    RewriteRule ^([a-z][a-z])/(.+)$ /$2.$1.html
(In the pattern, a dot means any character while a dot prefixed with a backslash means a literal dot.)
The htaccess file shown above is incorrect. First, it results in a loop: the user visits the .en.html version of a URL, Apache rewrites it as /en/, which then gets rewritten as .en.html, and so on. Fortunately, Apache detects this and returns 500 instead of looping forever. (If you are already familiar with RewriteRule: the L flag doesn’t help here, because in an .htaccess file the rules run again on the rewritten URL, but END does.)
    $ curl -I localhost/section/subsection/page.en.html
    HTTP/1.1 500 Internal Server Error
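The loop can be reproduced outside Apache with a simplified Python model. It assumes that each pass applies the first matching rule and then starts over with the new path, and that the server gives up after ten internal redirects (the default value of Apache’s LimitInternalRecursion directive). Real rule processing has more detail than this, and the function name follow_rewrites is invented for the sketch.

```python
import re

# The two rules from the incorrect .htaccess, as (pattern, replacement) pairs.
RULES = [
    (re.compile(r"^(.*)\.([a-z][a-z])\.html$"), r"/\2/\1"),
    (re.compile(r"^([a-z][a-z])/(.+)$"), r"/\2.\1.html"),
]
MAX_INTERNAL_REDIRECTS = 10  # assumed limit, matching Apache's default

def follow_rewrites(path):
    """Return the final path, or raise RuntimeError if the rewrites never settle."""
    for _ in range(MAX_INTERNAL_REDIRECTS):
        relative = path.lstrip("/")
        for pattern, replacement in RULES:
            if pattern.match(relative):
                path = pattern.sub(replacement, relative)
                break  # start a new pass with the rewritten path
        else:
            return path  # no rule matched, so this path would be served
    raise RuntimeError("500: internal redirect limit exceeded")

try:
    follow_rewrites("section/subsection/page.en.html")
except RuntimeError as error:
    print(error)
```

In this model, the two rules keep undoing each other until the limit is reached, which corresponds to the 500 response shown above.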
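The second problem is that what we’ve written here is an internal rewrite, not a redirect. Although we want the rewrite from /en/ to .en.html to be invisible, we want the reverse to be a visible and permanent redirect, so that we can enforce our preferred pattern. Turning the first rule into a redirect removes the internal loop, but one more detail matters: in an .htaccess file, Apache runs the rules again on an internally rewritten path, so the .en.html path produced by the second rule would match the redirect rule and send the browser back to the /en/ URL it had just requested. Adding the END flag to the second rule stops any further rewrite processing and prevents that. Here’s the modified htaccess:

    # Works fine
    RewriteRule ^(.*)\.([a-z][a-z])\.html$ /$2/$1 [R=308]
    RewriteRule ^([a-z][a-z])/(.+)$ /$2.$1.html [END]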
And here’s the curl status:
    $ curl -I localhost/section/subsection/page.en.html
    HTTP/1.1 308 Permanent Redirect
    ...
    Location: http://localhost/en/section/subsection/page
    ...
So, finally, if you try it from a browser, this is what happens now:
- A /en/ style URL serves the .en.html file, without causing any redirect (even the browser doesn’t know what’s happening behind the scenes).
- A .en.html style URL gets visibly redirected to the /en/-style URL, and then the above happens.
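The two cases above can be summarised in a small Python model. It is only a sketch of the intended behaviour, assuming that an internal rewrite is final and is not run through the rules again; the function name respond and its return values are made up for this illustration.

```python
import re

OLD_STYLE = re.compile(r"^(.*)\.([a-z][a-z])\.html$")  # e.g. section/page.en.html
PRETTY = re.compile(r"^([a-z][a-z])/(.+)$")            # e.g. en/section/page

def respond(path):
    """Classify a request path (given without its leading slash)."""
    old = OLD_STYLE.match(path)
    if old:
        tail, lang = old.groups()
        # Visible, permanent redirect: the address bar changes.
        return "redirect 308", f"/{lang}/{tail}"
    pretty = PRETTY.match(path)
    if pretty:
        lang, tail = pretty.groups()
        # Internal rewrite: this file is served, the address bar stays as it is.
        return "serve", f"/{tail}.{lang}.html"
    # Neither rule applies: the path is served unchanged.
    return "serve", f"/{path}"

print(respond("section/subsection/page.en.html"))
print(respond("en/section/subsection/page"))
```

Whether a served file actually exists is left out of the model; as seen earlier, a missing translation simply ends in a 404.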