Facebook researchers have applied recent advances they’ve made in the unsupervised machine translation of human languages to a source code conversion system.
In a research paper recently distributed through ArXiv, boffins at the beleaguered ad biz describe a project called TransCoder.
TransCoder is a transpiler, also known as a transcompiler or source-to-source compiler. Various modern programming languages like Dart and TypeScript include transpilers that can convert source code into a different language.
TransCoder is intended for use with older languages like COBOL or Python 2 that don’t have this facility built-in, or where source code has to be integrated into a codebase in a different language where there’s no direct transpilation path.
One of the reasons for trying to build an automated code converter is that such work tends to be expensive. The paper points to the $750m and five years of time spent by the Commonwealth Bank of Australia to convert its platform from COBOL to Java.
A transpiler and subsequent tweaking could make such shifts faster and cheaper, it’s supposed, though the involvement of Accenture and SAP in the bank platform project probably didn’t help price-wise. Management fees don’t pay themselves, you know.
Over the past few years, Facebook AI boffins have devised a way to use neural networks to do unsupervised machine translation. Rather than feeding the system word pairs of text in, say, English and French, a neural network gets sentences from monolingual data sets in two different languages and maps them together in a data representation called a latent space. From this, the system can work out translation between the two tongues without supervision or data labeling.
Let the machine mind try
Marie-Anne Lachaux, Baptiste Rozière, Lowik Chanussot, and Guillaume Lample, part of a Facebook AI group based in France, have applied this approach to unsupervised training in TransCoder. Using open source code from GitHub projects, they’ve created a system that accurately translates functions between C++, Java, and Python.
“TransCoder could help port a project from Python to C++,” said Lachaux, Rozière, and Lample in an email to The Register. “It may make the code faster and also more maintainable since code written in strongly-typed languages can be easier to understand. However, TransCoder would not solve every issue around bad code quality.”
They pointed to code duplication, bad variable and function names, and suboptimal algorithms as issues TransCoder would not address.
TransCoder, they said, is intended to be an assistive tool for developers and is still at an early stage of development. “Currently, TransCoder is only able to translate at function-level, and cannot translate entire projects,” they said. “The generated functions and production code have to be tested; they are not guaranteed to be correct.”
The researchers said machine translation of human languages is now widely accepted, even among professional translators. They believe programmers will also adopt machine learning-based tools as they improve.
To test their system, they created a test set of 852 parallel functions and associated unit tests.
“Although never provided with parallel data, the model manages to translate functions with a high accuracy, and to properly align functions from the standard library across the three languages, outperforming rule-based and commercial baselines by a significant margin,” the paper explains.
The baselines used for comparison came from j2py, a Java-to-Python translation framework, and Tangible Software Solutions, a commercial source code converter that turns C++ into Java. The paper claims TransCoder “significantly outperforms both baselines in terms of computational accuracy, with 74.8 per cent and 68.7 per cent in the C++ → Java and Java → Python directions, compared to 61 per cent and 38.3 per cent for the baselines.”
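To make the metric concrete, here is a minimal sketch of how a computational accuracy score could be tallied, assuming it means the share of translated functions that return the same result as a reference implementation on every test input. The helper names and the toy functions are hypothetical illustrations, not taken from the paper or its test set.

```python
from typing import Callable, Iterable, Sequence, Tuple

# One case = (translated function, reference function, list of argument tuples).
Case = Tuple[Callable, Callable, Sequence[tuple]]


def passes_all_tests(candidate: Callable, reference: Callable,
                     test_inputs: Sequence[tuple]) -> bool:
    """Return True only if candidate matches reference on every input."""
    for args in test_inputs:
        try:
            if candidate(*args) != reference(*args):
                return False
        except Exception:
            # A crash in the translated function counts as a failure.
            return False
    return True


def computational_accuracy(cases: Iterable[Case]) -> float:
    """Percentage of cases whose translation passes all of its tests."""
    cases = list(cases)
    if not cases:
        return 0.0
    passed = sum(passes_all_tests(c, r, t) for c, r, t in cases)
    return 100.0 * passed / len(cases)


# Toy stand-ins for "translated" and "reference" functions (hypothetical).
def ref_abs(x):
    return abs(x)


def good_abs(x):
    return x if x >= 0 else -x


def ref_max(a, b):
    return max(a, b)


def bad_max(a, b):
    return a  # wrong whenever b > a


toy_cases = [
    (good_abs, ref_abs, [(-3,), (0,), (5,)]),
    (bad_max, ref_max, [(1, 2), (4, 3)]),
]
score = computational_accuracy(toy_cases)
print(f"{score:.1f} per cent")
```

Scoring by behaviour rather than by textual similarity means a translation that looks nothing like the reference still counts, as long as it computes the same results.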
Lachaux, Rozière, and Lample said TransCoder can help improve code performance by translating source code in Python, for example, into a language with less overhead, such as C++ or Java, that can be optimized by a compiler.
“We believe the automatic translations would typically be on par with human translations in terms of computational performance,” they said. “However, expert programmers could do more than just translate (e.g. improve the algorithm) or use some tricks to make the algorithm more efficient (e.g. bitwise operations on int instead of operations on boolean arrays) while our automatic translator would not.”
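The bitwise trick the researchers mention can be illustrated with a small sketch. The example below is our own, not one from the paper: it checks whether a word repeats a letter, first with a boolean array and then with the bits of a single integer. Both functions assume the input contains only lowercase a to z.

```python
def has_unique_letters_array(word: str) -> bool:
    """Straightforward version: one boolean flag per letter."""
    seen = [False] * 26
    for ch in word:
        idx = ord(ch) - ord("a")  # assumes ch is in 'a'..'z'
        if seen[idx]:
            return False
        seen[idx] = True
    return True


def has_unique_letters_bitmask(word: str) -> bool:
    """Hand-tuned version: one bit of an int per letter."""
    seen = 0
    for ch in word:
        bit = 1 << (ord(ch) - ord("a"))  # assumes ch is in 'a'..'z'
        if seen & bit:
            return False
        seen |= bit
    return True


for sample in ("abc", "hello", ""):
    print(sample, has_unique_letters_array(sample),
          has_unique_letters_bitmask(sample))
```

A function-by-function translator would be expected to carry the first form over more or less as written; swapping in the second is the kind of rewrite the researchers say would still fall to an expert programmer.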
They expect that TransCoder will be used to deal with legacy code by porting it to a more modern language. Facebook, they said, is one of many companies with legacy code, and they’re looking at ways its codebase could be improved through machine learning applications.
“We plan to release our source code and datasets,” said Lachaux, Rozière, and Lample. “We also plan to release the best version of our model for people who do not have the infrastructure to retrain it. We hope this will encourage further research in this direction.” ®