HTML Translation:Project Overview



To allow anyone to translate existing web content into other languages, so that non-technical people can deliver multilingual content and the amount of non-English content on the Internet increases.


Create a tool to convert HTML content into a translation format that can be used in a web-based translation tool, so that the content can be translated. Take various pieces of translatable content, expose them to various communities and, with varying levels of intervention, observe how much translation is achieved.

The Internet is dominated by English content, in many ways because more content is produced in English than in any other language. People who want to create multilingual content discover that the task is simply too daunting.

This project looked at two aspects of online content translation: the technical side, making existing monolingual content localisable, and the human side, monitoring whether people translate content once it is made available.

Why is translated content needed?

Providing web content gives a user access to new information. Using technologies explored in the first mile component of the FMFI project, this information can be accessed by many new Internet users. But once the content is displayed on a device we are confronted by the first of the first inch problems: language and literacy.

English content delivered to users who do not speak English is of no use. Thus in order for the delivery of content in the FMFI context to be relevant we should be able to deliver multilingual content.

The reality for most users who do not speak English is that they will never be able to increase their English literacy. But it is relatively easy to teach technology how to deliver multilingual content.

Content in a person's mother-tongue has many important consequences:

  • Better understanding of the message in the content
  • Higher levels of engagement
  • Positive affirmation about language and culture

The importance of these should not be underestimated where content is being delivered in e-Health and e-Government settings.

Current Situation

Most creators of websites are moving towards CMS (Content Management Systems) and wikis. There is still a large amount of content that is created in static HTML or with some form of custom content management. While the former are only now (2007) developing methods to manage multilingual content, the latter are definitely monolingual.

Thus the structure of content sites on the Internet is designed to manage monolingual content effectively. Unless custom built, it is almost certain that sites cannot manage multilingual content. When they can, they are usually designed from a programmer's, not a translator's, point of view. This results in systems that are suboptimal for translation: content takes longer to translate and update, and so costs more than is required. This higher cost itself discourages translation.

These systems cannot properly manage multilingual content and do not assist translators, and when translation support is built it is often created in such a way that it has a long-term negative impact on translation.


Three boundary partners were identified:

  1. Creative Commons
  2. Pootle User Community
  3. FMFI project partners

Each of these allowed a slightly different configuration around the translations. These different configurations were tried firstly to test the tools, and secondly to examine the level at which people would spontaneously translate content.

Creative Commons

Creative Commons is an organisation that focuses on creating an environment that fosters the creation of open content. To support this it created a number of content licenses, which are accessible through its website in a number of languages. Creative Commons used its own web-based translation system to translate its licenses. However, that system did not actively help translators.

This partner was chosen because a successful translation would be a high-profile success that would encourage wider adoption of the concept of translation through good tools. It would also be an engagement in which we would manage the complete flow of data, allowing us to limit outside issues.

This was critically successful in that Creative Commons has adopted the Pootle Translation Management Software, thus enabling more effective translation.

Pootle User Community

Pootle is a Translation Management System (TMS) developed and used by a number of established localisation teams. The task addressed was to translate the Pootle User's Guide, which is stored in a wiki. The content would be hosted in the TMS and reintegrated into the wiki. With this partner there would be an element of control over the management of the translations, since the project team also manages the development of the software. But there would be no control over the translators: they are all volunteers and could simply be encouraged to translate.

This engagement was also critically successful in that the guide was translated into 6 new languages. It also demonstrated that wiki content could be translated much more effectively than static HTML.

FMFI project partners

The last group was the highest risk. Here we were translating raw HTML content from the FMFI website. This made use of people who would not normally translate content and who, in the long term, would probably be the typical translation contributors. As the group was closely aligned to FMFI and was also a non-English group, it was hoped that it would rise to the challenge and translate English content into other languages such as Portuguese.

This work was not successful, which can be attributed to the lack of motivation to translate and the distance of the people from the concept of translation. In retrospect it is clear that non-translators need active motivation to move them from agreeing with the theory of multilingual content to actively participating in its creation.

Proposed Solution

The problem with most content management is that translation is brought in late to the party and squeezed to fit into the content creation workflow, ignoring all the needs of translators.

A Translation Management System is a system that manages the flow of translatable content through the translation workflow. It is designed specifically for translation and includes features such as glossary management (ensuring that terms are used consistently throughout a translation) and translation memory (reusing old translations). It allows management of rights for translators, sign-off, delegation, etc. In this project we used Pootle, a TMS that is developed primarily for Open Source Software localisation.

By allowing a content management system (CMS) or in fact static content to continue using the same processes and simply bringing a TMS alongside, we allow existing processes to continue and bring good translation processes alongside. Thus there is limited disruption on either side.

To achieve this parallel operation, the content to be translated must be handed over from the CMS to the TMS and, once it is translated, handed back in the reverse direction. To make this possible the project adapted a number of tools from the Translate Toolkit. Pootle makes use of standard translation formats, and thus the Translate Toolkit needed to be enhanced to allow HTML and later wiki syntax to be converted to these standard formats. With the converters built, any wiki or static content can be converted to a translation format that can be handled by Pootle.

Thus the final solution works as follows:

  1. Convert the static content to Gettext PO (by integrating and automating the tools developed earlier)
  2. Hand over to Pootle
  3. Manage the translation of the content and translate it into another language
  4. Take the translated content back from Pootle
  5. Convert the translated content into the static form understood by the website.

Thus now the content is available in another language.
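The five steps above can be illustrated with a minimal round-trip sketch. It is not the Translate Toolkit converter used in the project: it uses only the Python standard library, the names `TextExtractor`, `html_to_po` and `apply_translations` are hypothetical, and a plain dictionary stands in for the translations that Pootle would return in steps 2 to 4. It also assumes that the text in the HTML contains no entities or inline markup, which real pages usually do.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of each block element as one translation unit."""

    BLOCK_TAGS = {"p", "h1", "h2", "h3", "li"}

    def __init__(self):
        super().__init__()
        self.units = []      # extracted source strings, in document order
        self._buffer = None  # text pieces of the block currently being read

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS and self._buffer is not None:
            # Collapse runs of whitespace so line breaks in the HTML do not matter.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.units.append(text)
            self._buffer = None


def po_escape(text):
    """Escape backslashes and double quotes for use inside a PO string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def html_to_po(html):
    """Step 1: turn static HTML into Gettext PO text with empty translations."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    entries = ['msgid "%s"\nmsgstr ""\n' % po_escape(u) for u in parser.units]
    return "\n".join(entries)


def apply_translations(html, translations):
    """Step 5: put translated strings back into the original markup."""
    # Longest strings first, so a short string never overwrites part of a longer one.
    for source in sorted(translations, key=len, reverse=True):
        target = translations[source]
        if target:  # untranslated units stay in the source language
            html = html.replace(source, target)
    return html


page = "<html><body><h1>Welcome</h1><p>Read the guide.</p></body></html>"
po_text = html_to_po(page)  # this is what would be handed to the TMS
# Hypothetical result of steps 2 to 4, as returned by the translators.
translated = {"Welcome": "Bem-vindo", "Read the guide.": "Leia o guia."}
print(apply_translations(page, translated))
```

Replacing strings in the markup is the simplest way to show the reverse conversion; a real converter would rebuild the page from a template so that repeated or overlapping strings are handled correctly.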

Technology explained

Many people do not fully understand the translation process and how content is delivered in different languages. This section tries to address that and provide explanations of some of the concepts discussed.


A web server will present any content, dynamic or static, based on a request from a user. A little-known feature that can be used to deliver different versions of content is called content negotiation. Every web browser can be set to specify the user's preferred language, which it sends with each request in the Accept-Language header. A website that has multiple translations of the same content can then decide to present the appropriate version based on the user's preferred choice. Unfortunately, not many websites do content negotiation properly.
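As an illustration of language negotiation, the sketch below picks the best available translation from the value of an Accept-Language header. It is a simplified reading of the header format, not the behaviour of any particular web server; the function names and the `default` parameter are hypothetical, and the available languages are assumed to be given as lower-case tags.

```python
def parse_accept_language(header):
    """Return the language tags from an Accept-Language header, best first."""
    choices = []
    for position, item in enumerate(header.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag:
            continue
        quality = 1.0  # a tag without a q value has the highest preference
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        choices.append((quality, position, tag))
    # Highest quality first; equal qualities keep their order in the header.
    choices.sort(key=lambda c: (-c[0], c[1]))
    return [tag for quality, _, tag in choices if quality > 0]


def choose_language(header, available, default="en"):
    """Pick the first preferred language for which a translation exists."""
    for tag in parse_accept_language(header):
        if tag in available:
            return tag
        primary = tag.split("-")[0]  # fall back from "pt-br" to "pt"
        if primary in available:
            return primary
    return default


print(choose_language("pt-BR,pt;q=0.9,en;q=0.8", {"en", "pt", "af"}))
```

A site with English, Portuguese and Afrikaans versions would serve the Portuguese page to this browser, and fall back to the default only when none of the requested languages is available.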

Translation Management Software

A translation management system focusses on the management of the translation workflow and the translation assets.

Translators want to work quickly, with high reuse of old translations, but also consistently: they want to use the same terminology throughout and have common errors identified.

The needs of translators are not fully understood by programmers, so unfortunately many systems designed around content add the ability to translate without adding the features that make translators effective.

A TMS allows translators to leverage old translations. Every previous translation is stored in a translation memory. The translation memory is applied to any new text, and if a 100% match is obtained the translator will usually not see the text.
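A translation memory can be pictured as a store of source and target pairs that is searched for each new piece of text. The sketch below is an assumption-laden illustration, not Pootle's implementation: the class `TranslationMemory`, its `threshold` parameter and the use of `difflib` for similarity scoring are all choices made for this example.

```python
import difflib


class TranslationMemory:
    """Store earlier translations and look them up for new source text."""

    def __init__(self):
        self._store = {}  # source text -> translated text

    def add(self, source, target):
        self._store[source] = target

    def lookup(self, source, threshold=0.75):
        """Return (translation, score); (None, 0.0) if nothing is close enough."""
        if source in self._store:
            return self._store[source], 1.0  # a 100% match
        best_source, best_score = None, 0.0
        for candidate in self._store:
            score = difflib.SequenceMatcher(None, source, candidate).ratio()
            if score > best_score:
                best_source, best_score = candidate, score
        if best_source is not None and best_score >= threshold:
            return self._store[best_source], best_score
        return None, 0.0


memory = TranslationMemory()
memory.add("Save the file.", "Guardar o ficheiro.")

exact = memory.lookup("Save the file.")    # reused without translator input
fuzzy = memory.lookup("Save the files.")   # offered as a suggestion to edit
missing = memory.lookup("Completely unrelated sentence")
```

An exact match can be applied automatically, while a close match is only a suggestion that the translator still has to check and adjust.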

Translators' tools will also segment text, preferably at the sentence level. This means that if a change occurs in the original text, the translator only needs to translate the one sentence that changed. In most CMS systems the translator is presented with the complete text with no idea where and what changed, and is thus left to review the whole text, which is both time consuming and wasteful.
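The benefit of sentence-level segmentation can be shown with a small sketch. The segmentation rule here is deliberately naive (split after a full stop, exclamation mark or question mark), and the function names are hypothetical; real tools use more careful rules for abbreviations and other languages.

```python
import re


def split_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def segments_to_translate(new_text, already_translated):
    """Return only the sentences that have no stored translation yet."""
    return [s for s in split_sentences(new_text) if s not in already_translated]


# Sentences of the earlier version of the page, with their translations.
already_translated = {
    "Pootle is a web tool.": "O Pootle é uma ferramenta web.",
    "It manages translations.": "Gere traduções.",
    "It is free software.": "É software livre.",
}

updated_page = (
    "Pootle is a web tool. It manages translations and glossaries. "
    "It is free software."
)
print(segments_to_translate(updated_page, already_translated))
```

Only the changed middle sentence is returned, so the translator does not have to review the two sentences that stayed the same.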

When a translator translates they also want to be able to see hints for terminology. This terminology lookup is drawn from a glossary of terms that they have built up for the work that they are translating. This term list ensures consistency between different human translators across the whole system.
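Terminology lookup can be sketched as a search of the source text for known glossary terms. This is an illustrative assumption about how hints might be produced, not a description of Pootle's own matching; the function `glossary_hints` and the sample glossary are hypothetical.

```python
import re


def glossary_hints(source, glossary):
    """Return (term, translation) pairs for glossary terms found in the source."""
    hints = []
    for term, translation in glossary.items():
        # Match whole words only, so "file" is not found inside "profile".
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, source, flags=re.IGNORECASE):
            hints.append((term, translation))
    return sorted(hints)


glossary = {
    "file": "ficheiro",
    "server": "servidor",
    "translation memory": "memória de tradução",
}

hints = glossary_hints("Open the File to update the translation memory.", glossary)
```

Because every translator sees the same hints for the same terms, the glossary steers them towards the same word choices.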

Translation Process

The translation process in its simplest form operates as follows:

  1. A client supplies work (in our case an automated tool extracts text for translation)
  2. Translators examine the amount of work and plan the translation
  3. Work is assigned to a translator
  4. Any terms that need defining are extracted, defined, translated and added to the glossary
  5. The translation is performed
  6. A review of the translations is performed by another translator
  7. The translations are sent back to the client (again an automated process in this case)

Social Challenges

The hardest social challenge was to understand what motivates people to translate, and how to motivate people who do not usually translate to actually participate in translation.

Technology Challenges

The tools needed to operate smoothly so that users would regard this as an enabler and not as another process with complications.

With the Creative Commons work this proved difficult, as they used the Gettext format in a non-standard way. This is frustrating because it affects how Pootle can be used; many failures occurred because of this non-standard approach.

The biggest hurdle, which was not fully overcome, was the difficulty of converting HTML to Gettext PO. This is caused by the fact that HTML is in many cases not correct. Modern browsers have spent years working around these problems to mask them from most users, but our tools have not had the equivalent time and usage to iron out all of them. The correct approach is most likely not to look at HTML but rather at wiki syntax, XHTML and integration with CMS technology.
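To give a feel for the kind of incorrect HTML a converter has to cope with, the sketch below reports tags that are opened but never closed and closing tags that have no opening tag. It is a strict, simplified check written for this page using the Python standard library; `TagBalanceChecker`, `find_markup_problems` and the list of void tags are assumptions of the example and were not part of the project's tools.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Track open tags and record places where the markup does not balance."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of tags that are still open
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag in self.open_tags:
            # Everything opened after this tag was never closed.
            while self.open_tags[-1] != tag:
                self.problems.append("unclosed <%s>" % self.open_tags.pop())
            self.open_tags.pop()
        else:
            self.problems.append("stray </%s>" % tag)


def find_markup_problems(html):
    """Return a list of human-readable problems found in the markup."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    problems = list(checker.problems)
    problems.extend("unclosed <%s>" % tag for tag in reversed(checker.open_tags))
    return problems


print(find_markup_problems("<div><b>bold text</div>"))
```

A browser would silently display such a page, whereas a converter has to decide where the unclosed element ends before it can extract a clean piece of text for translation.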

Intellectual Property

All the tools created were already released under the GPL, so all software created continues to be Open Source. The content produced was released under open content licenses.

Pros and Cons of the developed systems

The system allows someone to add the ability to translate onto their existing content without the hassle of creating a translation system or converting their content to a CMS or other repository. Being a true TMS, it embodies all the features needed for good translation.

The system developed worked very well for wiki content. It was not tested enough with pure static content but could be deployed safely by someone able to diagnose and work around problems.

The biggest problem would be the level of skill needed to deploy a Pootle server. Although any system administrator can do this task, it would be beyond the skill level of the occasional website admin. As each site is unique, the integration of the site with Pootle needs to be automated using custom scripting, so some level of shell scripting is required. That said, once the system is operational the system administrator's task is relatively painless.