Open Source Translation: AI is Approaching Human Language Quality Levels


Open source is all about overcoming limits and restrictions on usage. The same is true for translation and localization processes. Just as converting from one programming language to another should be fast, easy, and accurate, the same should be true for “migrating” from one natural language to another. Ofer Tirosh, CEO of Tomedes, has built his “smart human translation” company into an international powerhouse, with more than 50,000 clients in more than 100 countries and the ability to translate any language into any other, covering more than 200 language pairs.


How can the open source community expand into additional markets and languages through smart translation and localization techniques?

The logic of growing revenues and markets in an open source software business is simple. Once you have a program or product that works in one natural language, the incremental effort to add more languages and thus to enter additional markets is relatively trivial. That effort is measured in time and money. Both efforts are tiny in comparison with the initial effort to create working software. And there are plenty of great tools out there to expedite the process and increase your scalability.

What is the added value that Tomedes offers to the open source community?

I have academic training in industrial engineering. At the core of that discipline is the knowledge of how to build things to scale in an efficient and cost-effective manner. Open source as a practice is driven by a similar logic: the community has built self-organizing tools and best practices so that diverse developers can build on the shoulders of colleagues for mutual benefit. I built Tomedes with a similar logic: we manage a super-efficient network of professional linguists around the world. I don’t need to employ them, but we have streamlined the process of managing them and their projects so that the workflow has very little friction, the quality control is fantastic, and the result is very fast turnaround.

What are the special challenges that the open source community faces when it comes to translation and localization?

Well, I think there is probably a temptation to go straight out to freelance marketplaces and deal one-to-one with individual translators. This is certainly possible with platforms like Upwork, Freelancer, Fiverr and many, many more. The problem with that approach is the hidden cost of management time and effort, as well as the inability to exercise effective quality control. When you are dealing one-to-one with freelancers, it’s like reinventing the wheel for each project. You need to brief the freelancers, explain the job, do the contract, and deal with all the drafts, revisions, reviews, and iterations. You need to audit the final product externally and deal with scheduling and timing issues. You need to either provide translation tools or hope the translator has them. And then you do the same for every project and every new language market. This is often a case of “penny wise, pound foolish,” where the apparent savings from avoiding an external agency’s management fee lead to excessive lost time for your internal management, and usually lower-quality results.

Can you cite examples of open source projects in which translation and localization played a key role?

I will avoid citing individual examples, because the real answer is that every open source project that expands into additional markets and foreign languages needs to have a translation and localization project. There are dozens of tools out there — many of them free — for open source developers to manage the process of translating and localizing their products. So it’s really never a matter of whether or not to translate or localize. It’s a matter of how to do it best.

We have been using the terms translation and localization interchangeably. What is the difference between them?

In practice, people often do use the terms interchangeably. But there is a clear distinction: translation is a subset of localization. Translation refers just to the linguistic conversion, going from one language to another, or to many. Localization includes that, but it also includes converting metric conventions, currency, and cultural nuances — things that are not strictly linguistic but address the totality of differences between one locality and another.
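The distinction above can be made concrete with a small sketch: the same numeric value is presented differently per locale, independent of any translated text. The locale rules below are illustrative stand-ins, not a real locale database.

```python
# Illustrative locale rules; a real system would draw these from CLDR or similar.
LOCALES = {
    "en_US": {"decimal": ".", "group": ",", "currency": "$"},
    "de_DE": {"decimal": ",", "group": ".", "currency": "€"},
}

def format_price(amount: float, locale: str) -> str:
    """Render the same amount in a locale's number and currency conventions."""
    rules = LOCALES[locale]
    whole, frac = f"{amount:,.2f}".split(".")
    whole = whole.replace(",", rules["group"])  # swap in the locale's grouping mark
    return f"{rules['currency']}{whole}{rules['decimal']}{frac}"

print(format_price(1234.5, "en_US"))  # $1,234.50
print(format_price(1234.5, "de_DE"))  # €1.234,50
```

The point is that no translation happened at all, yet the output changed — that non-linguistic conversion is what localization adds on top of translation.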

Which software tools can be helpful to the open source community in performing translation and localization?

Let’s start with localization editors. These are software platforms to manage the various conversions from one local version of software to another. Software makes a clear distinction between the logic of the code on the one side and the text strings and numeric variables on the other. Code editors usually render this distinction by displaying strings and fields in a different color than the programming instructions. Localization editors take this a step further by extracting and organizing all the strings and numerics into well-ordered tables, where each row is a term or field, and each column is a language. So you end up with an open — that is, expandable — matrix, exportable to create new local versions of your code. Some of the better localization editors, or translation management systems (TMS), I have come across include POEditor, Smartling, and GlobalLink. There is specialization in this vertical as well, with focused products for localizing apps, like Applanga and Gengo, or for games, like Alconost or LocalizeDirect.
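The string matrix described above can be sketched in a few lines: rows are UI string keys, columns are languages, and new markets are added by filling in a column. The keys and translations here are made up for illustration.

```python
# A toy string table in the spirit of a localization editor:
# each row is a UI string key, each column a language.
string_table = {
    "greeting": {"en": "Hello", "es": "Hola", "fr": "Bonjour"},
    "farewell": {"en": "Goodbye", "es": "Adiós", "fr": "Au revoir"},
}

def export_locale(table: dict, lang: str) -> dict:
    """Produce the flat key -> string mapping for one language version."""
    return {key: translations[lang] for key, translations in table.items()}

def missing_strings(table: dict, lang: str) -> list:
    """Quality control: report keys not yet translated into `lang`."""
    return [key for key, translations in table.items() if lang not in translations]

print(export_locale(string_table, "es"))    # {'greeting': 'Hola', 'farewell': 'Adiós'}
print(missing_strings(string_table, "de"))  # ['greeting', 'farewell']
```

Real TMS platforms layer workflow, review, and file-format export (e.g. gettext `.po`, Android `strings.xml`) on top of essentially this data structure.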

What do you think of the quality of machine translation? Do you think that one day it will make human translation obsolete?

Well, the introduction of artificial intelligence and processing power has transformed machine translation into an art form that increasingly approaches natural language — I mean human speech — levels. Anyone who tried Google Translate five years ago and compares it to the product today will be stunned by the improvements. That said, there’s still a long way to go before software can match the level of a really good human translator. But, really, at this point, nothing would surprise me.

What are your thoughts on the open source race for neural machine translation? Does your company work with open source products for translation?

Well, there’s no doubt that neural machine translation — which was first proposed in a paper only in 2014! — already represents the state of the art at this point. Basically, NMT exploits a vast artificial neural network to predict the probability of a specific sequence of words occurring, typically modeling entire sentences in a unified, integrated model. This was a huge improvement over using a statistical model, which is how things were done in the past. NMT not only gives far better results; it’s also much more efficient in terms of processing power and therefore time. It uses a fraction of the memory of SMT, and because it’s trained end-to-end, performance and results are vastly improved.
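The “probability of a specific sequence of words” that an NMT model predicts factors by the chain rule: P(sentence) = Π P(word_i | preceding words). In a real system each conditional comes from a neural network’s softmax; the toy probability table below is a made-up stand-in just to show the arithmetic.

```python
import math

def toy_conditional(word: str, history: tuple) -> float:
    """Stand-in for the network's softmax over the next word (made-up numbers)."""
    table = {
        (): {"the": 0.5, "a": 0.5},
        ("the",): {"cat": 0.6, "dog": 0.4},
        ("the", "cat"): {"sat": 0.7, "ran": 0.3},
    }
    return table[history][word]

def sentence_log_prob(words: list) -> float:
    """Chain-rule factorization: sum of log P(word_i | words before it)."""
    logp = 0.0
    for i, w in enumerate(words):
        logp += math.log(toy_conditional(w, tuple(words[:i])))
    return logp

p = math.exp(sentence_log_prob(["the", "cat", "sat"]))
print(round(p, 3))  # 0.5 * 0.6 * 0.7 = 0.21
```

Decoding (e.g. beam search) then amounts to searching for the target-language sequence that maximizes this score given the source sentence.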

Facebook is open sourcing its tool chain of machine learning and artificial intelligence tools that it uses to power many of its own products, including its open source project based on the company’s machine translation systems. Do you expect Google, Amazon and others to follow suit?

Well, it’s an interesting and important development. Facebook Translate uniquely provides the ability to train a sequence-to-sequence model with attention, as well as a method to export this model to Caffe2 for production using ONNX, which is an open format for representing deep learning models. It also provides sample C++ code to load the exported model and run inference via beam search. That may sound like mumbo-jumbo to non-techies, but what it means in natural language is that the largest social network in the world is opening its box of tricks to pull in open source developers in a public framework. I don’t really know whether other big players will follow suit, but it’s certainly a development to watch.

What do you think are the prospects for the open source NMT framework that Systran and Harvard University are building together?

Well, Systran was, as you know, the pioneer in machine translation, going back decades, and Harvard is one of the hubs for advanced thinking on the subject. The fact that they are doing this in open source is very exciting. You can already try out their Pure Neural server, and it works like lightning in about 20 languages, with usually stunning results, in my limited experience. Mind-blowing!

What are your thoughts on community-based translation platforms like Zanata, which seeks to empower translators, content creators and developers to manage their own localization projects?

Well, I admire the Zanata initiative, and I wish them well. To some extent, it emulates the thinking that went into how I built up Tomedes. They also seem to be focusing on “smart human translation” and building up expert networks, while taking advantage of various machine translation and automation tools. I think when the project has matured, it has the potential to provide an end-to-end tool for translation consumers (that is, clients who need translations) as well as the human providers in the translation and localization ecosystem and supply chain. For now, I still think most professional customers who need to translate or localize their products will prefer not to “build their own.” Most would prefer a comprehensive ready-to-work solution that can support them 24/7 out of the box and has the quality controls and audit structures already built in. But I am watching this initiative with interest.

Thank you, Ofer. We’ll be watching you too!
