The world of machine translation has changed massively over the past few years. I still remember the days when Google Translate would produce some goofy or downright weird translations. Then the Transformer arrived in 2017, and with it, the world of machine translation changed.
While the famous “attention is all you need” paper is well known for introducing the powerful Transformer architecture, which later led to the models used in ChatGPT, it is rarely mentioned that its original purpose was to introduce a better model for machine translation. The original Transformer consisted of two components: an encoder and a decoder. While this encoder-decoder approach wasn’t new, the decision to rely solely on the attention mechanism certainly was.
tl;dr:
You should use a dedicated service if you are a business user and data privacy is important, or if the content to translate requires a high level of domain expertise. Specialized translation models still outperform LLMs on low-resource languages. LLMs are really good at translation, but you need to spend extra time on things like prompt engineering or fine-tuning.
Fast forward a few years, and the landscape is heavily dominated by decoder-only models like GPT, LLaMA, Claude and the like. Scaling these models up, training them on massive amounts of data and using them autoregressively turned out to make decoder-only models very suitable for a huge range of tasks, including machine translation.
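To make that concrete, here is a minimal sketch of prompting a decoder-only model to translate via the Hugging Face transformers library. The model name is only an example (any instruction-tuned chat model works the same way), and the prompt wording is an assumption rather than a recommended template.

```python
# Minimal sketch: translating with a decoder-only LLM, used autoregressively.
# The model name is an example; any instruction-tuned chat model works similarly.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [{
    "role": "user",
    "content": "Translate the following sentence into German:\n"
               "The weather in Munich is lovely today.",
}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Autoregressive decoding: the translation is produced token by token.
output = model.generate(inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```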
While the attention (pun intended) is focused on decoder-only models, encoder-decoder and encoder-only models (such as BERT) have certainly not been forgotten in 2025, at least judging by the trending models on sites like HuggingFace. Encoder-decoder models in particular still seem popular for machine translation, whether as open-source models or behind services like DeepL or Google Translate, which most likely use an encoder-decoder style model under the hood.
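For comparison, here is an equally minimal sketch using an open-source encoder-decoder model through the transformers translation pipeline. The NLLB checkpoint is just one example; MarianMT or MADLAD checkpoints follow the same pattern.

```python
# Minimal sketch: translation with an open-source encoder-decoder model (here NLLB).
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",  # encoder-decoder, ~200 languages
    src_lang="eng_Latn",
    tgt_lang="deu_Latn",
)

result = translator("The weather in Munich is lovely today.")
print(result[0]["translation_text"])
```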
This article will examine the viability of these different approaches, highlight the state of the art in neural machine translation in 2025 and answer the question: “Will LLMs dominate the translation space in the future?”
Important: To keep things simple, whenever “LLMs” are mentioned, this refers to decoder-only models such as GPT, LLaMA and the like, even though services like DeepL, Google Translate, etc. also deploy language models.
Comparing the translation capabilities of different models wasn’t as cut and dried as I initially thought it would be. It turns out that translation is a very delicate task with many things to look out for. Obviously it’s not just about using the right words and grammar, but also about matching the style of the source, the clarity of the text, and the context or special terminology that should be used. If you just want to know the meaning of a sentence you see while on holiday, a simple translation will suffice. But if you have to translate a million documents about complex topics, consistency and detail matter a lot.
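Automatic metrics only capture part of this. As a small, made-up illustration, surface metrics such as BLEU or chrF (here via the sacrebleu package) score overlap with a reference translation but say little about style, terminology or context, which is why neural metrics like COMET or human evaluation are usually added on top.

```python
# Minimal sketch: scoring one candidate translation against one reference with sacreBLEU.
# The sentences are made up; real evaluations use full test sets and multiple metrics.
import sacrebleu

hypotheses = ["The weather in Munich is lovely today."]
references = [["The weather is beautiful in Munich today."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```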
Something that is very clear from the get-go: models like GPT, Mistral, Claude and Llama are good translators, and LLMs will have an impact on the translation space. That’s without doubt. One thing that Hendy et al. found is that GPT-4 is competitive with dedicated translation services like Google Translate, but quite consistently lags behind on lower-resource languages. Robinson et al. report that the quality of GPT-4’s translations roughly correlates with the number of Wikipedia articles available for a language, reinforcing the point that LLMs are most useful for translation in common languages. This would also explain why open-source translation models like NLLB or MADLAD put such a huge focus on incorporating lower-resource languages.

Number of Wikipedia pages per language. It is very clear that some languages have only very little content and therefore don’t provide many translation pairs. Source: https://aclanthology.org/2023.wmt-1.40.pdf
Jiao et al. also highlight that despite the great translation capabilities of GPT models, dedicated services are still sometimes better at domain-specific tasks. This can probably be mitigated to some degree with fine-tuning and prompt engineering, but that requires extra effort and expertise, which raises the barrier to using LLMs for domain-specific translation.
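As a rough, hypothetical sketch of what such prompt engineering can look like, a domain glossary can be injected directly into the prompt before sending it to any LLM; the terms and wording below are made up and would come from your own terminology database in practice.

```python
# Minimal sketch: building a domain-aware translation prompt from a glossary.
# The glossary entries and prompt wording are illustrative assumptions.
glossary = {
    "Anlagevermögen": "fixed assets",
    "Rückstellungen": "provisions",
}

source_text = "Die Rückstellungen wurden gegen das Anlagevermögen verrechnet."

glossary_lines = "\n".join(f"- {de} -> {en}" for de, en in glossary.items())
prompt = (
    "You are a financial translator. Translate the text from German to English.\n"
    "Always use the following terminology:\n"
    f"{glossary_lines}\n\n"
    f"Text:\n{source_text}\n\n"
    "Translation:"
)

# `prompt` can now be sent to any LLM chat/completions endpoint.
print(prompt)
```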