Understanding Language Beyond Words: Building IdiomX for Multilingual AI and Idiom Interpretation

Maeline Lezy

External communication

27/05/2026

Ayman Sharara, a student in the MSc in Data Science & AI at DSTI School of Engineering, developed IdiomX as part of his Deep Learning project. IdiomX is a large-scale multilingual dataset designed to help AI understand idioms, retrieve figurative expressions, and interpret hidden meaning across languages.

When Language Stops Being Literal

What does “spill the tea” mean? (Not a kitchen disaster)
Idioms are expressions where words do not mean what they say, and that is exactly where AI starts to struggle.
Humans understand them naturally. Machines… not so much.

The Problem

Most existing idiom datasets are small, limited, not multilingual, and focused on simple tasks. In short, they do not reflect how people communicate.

Building IdiomX (190K+ examples): Step by Step

IdiomX was built through a structured and scalable pipeline:

Collection: extracting idioms from sources such as Wiktionary and WordNet, along with generating additional candidate idioms to improve coverage.
Cleaning and Normalization: filtering noise, deduplicating, and standardizing expressions.
LLM Enrichment: using OpenAI GPT-4.1-mini to generate meanings, contextual examples, and multilingual translations (English, Arabic, French).
Validation: combining semantic similarity scoring and rule-based checks to ensure consistency and quality.
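
The cleaning and normalization step can be pictured with a minimal Python sketch. The article does not list the actual rules, so everything here is an assumption for illustration: the function names `normalize_idiom` and `deduplicate`, the choice of Unicode NFKC folding, and the punctuation that gets trimmed.

```python
import re
import unicodedata


def normalize_idiom(text: str) -> str:
    """Return a canonical form of an idiom string (hypothetical rules)."""
    # NFKC folds compatibility characters such as full-width letters.
    text = unicodedata.normalize("NFKC", text)
    # Map curly quotes and apostrophes to their ASCII forms.
    text = text.translate(str.maketrans({"’": "'", "‘": "'", "“": '"', "”": '"'}))
    # Lowercase, then trim surrounding whitespace, quotes, and punctuation.
    text = text.lower().strip(" \t\n\"'.,;:!?")
    # Collapse internal runs of whitespace to a single space.
    return re.sub(r"\s+", " ", text)


def deduplicate(candidates: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized idiom and drop empties."""
    seen: set[str] = set()
    unique: list[str] = []
    for raw in candidates:
        key = normalize_idiom(raw)
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


raw_candidates = ["Spill the tea", "  spill  the TEA. ", "“Break a leg”", "break a leg!", ""]
print(deduplicate(raw_candidates))  # ['spill the tea', 'break a leg']
```

Deduplicating on the normalized form rather than the raw string is what lets spelling and punctuation variants of the same idiom collapse into one entry.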

The pipeline is modular and extensible, making it easy to scale to new languages and add richer annotations.
The full workflow was implemented using Python, combining data engineering pipelines with LLM-based enrichment and validation to ensure reproducibility and scalability.
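
To make the validation idea concrete, here is a small Python sketch that combines a similarity score with rule-based checks. It is not the project's code. The record fields, the rules, the `0.3` threshold, and the idea of comparing the generated meaning to a reference definition are all assumptions, and the bag-of-words cosine is only a stand-in for a real semantic similarity model.

```python
import math
import re
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over word counts; a stand-in for an embedding model."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def validate(record: dict, min_similarity: float = 0.3) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    idiom = record["idiom"].lower()
    meaning = record["meaning"].strip().lower()
    # Rule-based checks (hypothetical rules).
    if not meaning:
        problems.append("empty meaning")
    if meaning == idiom:
        problems.append("meaning repeats the idiom")
    if idiom not in record["example"].lower():
        problems.append("example does not contain the idiom")
    # Similarity check against a reference definition (assumed to exist).
    if bow_cosine(record["meaning"], record["reference"]) < min_similarity:
        problems.append("meaning drifts from the reference definition")
    return problems


good = {
    "idiom": "spill the tea",
    "meaning": "reveal gossip",
    "example": "Come on, spill the tea about the party.",
    "reference": "to reveal gossip or secrets",
}
bad = {
    "idiom": "break a leg",
    "meaning": "suffer a fracture",
    "example": "Good luck tonight!",
    "reference": "good luck, said to performers",
}
print(validate(good))  # []
print(validate(bad))
```

Returning a list of named problems instead of a single pass or fail flag makes it easier to see which check rejects the most records.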

More Than a Dataset: A Multi-Task Benchmark

IdiomX supports multiple tasks:

Task 1: Idiom Detection
TF-IDF + Logistic Regression vs DistilBERT vs RoBERTa, with RoBERTa selected for its strong contextual understanding.
Task 2: Context-to-Idiom Retrieval
Dense retrieval vs hybrid retrieval with reranking, where hybrid retrieval with a fine-tuned reranker achieved the best performance.
Task 3: Cross-Lingual Retrieval (Arabic to English)
Multilingual embeddings vs fine-tuned E5, with fine-tuned E5 showing the strongest semantic alignment.
Task 4: Idiom Interpretation
Given an idiom or idiomatic sentence, the system retrieves its meaning in English, Arabic, and French. Hybrid retrieval with reranking produced the strongest interpretation performance.
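
The two-stage shape of hybrid retrieval with reranking (Tasks 2 and 4) can be sketched in plain Python. This is an illustration of the structure only, under loud assumptions: the three index entries are made up, Jaccard word overlap stands in for a sparse retriever such as BM25, character-trigram cosine stands in for dense embeddings, and `toy_reranker` is a placeholder for the fine-tuned reranker the project actually used.

```python
import math
import re
from collections import Counter

# Toy index: idiom -> short meaning (illustrative entries, not IdiomX data).
IDIOMS = {
    "spill the tea": "reveal gossip or secrets",
    "break a leg": "wish someone good luck before a performance",
    "hit the sack": "go to bed to sleep",
}


def tokens(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def sparse_score(query: str, doc: str) -> float:
    """Lexical signal: Jaccard overlap of word sets (stand-in for BM25)."""
    q, d = set(tokens(query)), set(tokens(doc))
    return len(q & d) / len(q | d) if q | d else 0.0


def dense_score(query: str, doc: str) -> float:
    """Stand-in for embedding similarity: cosine over character trigrams."""
    def grams(text: str) -> Counter:
        joined = " ".join(tokens(text))
        return Counter(joined[i:i + 3] for i in range(len(joined) - 2))

    a, b = grams(query), grams(doc)
    dot = sum(a[g] * b[g] for g in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hybrid_retrieve(query: str, k: int = 2, alpha: float = 0.5) -> list[str]:
    """Stage 1: fuse sparse and dense scores, keep the top-k idioms."""
    scored = []
    for idiom, meaning in IDIOMS.items():
        doc = f"{idiom} {meaning}"
        score = alpha * sparse_score(query, doc) + (1 - alpha) * dense_score(query, doc)
        scored.append((score, idiom))
    scored.sort(reverse=True)
    return [idiom for _, idiom in scored[:k]]


def rerank(query: str, candidates: list[str], scorer) -> list[str]:
    """Stage 2: reorder the shortlist with a second, more careful scorer."""
    return sorted(candidates, key=lambda idiom: scorer(query, idiom), reverse=True)


def toy_reranker(query: str, idiom: str) -> float:
    """Placeholder for a fine-tuned reranker: compares the query to the meaning."""
    return sparse_score(query, IDIOMS[idiom])


query = "She could not wait to reveal the latest gossip about her boss"
shortlist = hybrid_retrieve(query)
print(rerank(query, shortlist, toy_reranker)[0])  # spill the tea
```

The point of the split is cost: the first stage scores every idiom cheaply, and the reranker only has to look at the short list that survives.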

Models were trained and evaluated on structured train and test splits with careful data selection to avoid leakage and ensure reliable benchmarking. The workflow spans raw data collection, model training, retrieval benchmarking, idiom interpretation, and deployment-ready artifacts.
Beyond these tasks, IdiomX can support chatbots, translation systems, idiom explanation assistants, language learning tools, sarcasm detection, and robots that interact with humans.
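
The article does not say how leakage between train and test splits was avoided. One common approach, shown here purely as an assumption, is to split by idiom rather than by sentence, so that every example of a given idiom lands on the same side. The function names, the record fields, and the 20% test fraction are hypothetical.

```python
import hashlib


def split_of(idiom: str, test_fraction: float = 0.2) -> str:
    """Assign an idiom to 'train' or 'test' deterministically from its hash."""
    digest = hashlib.sha256(idiom.lower().encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "test" if bucket < test_fraction else "train"


def grouped_split(examples: list[dict], test_fraction: float = 0.2):
    """Split examples so that no idiom appears in both train and test."""
    train, test = [], []
    for example in examples:
        side = split_of(example["idiom"], test_fraction)
        (test if side == "test" else train).append(example)
    return train, test


examples = [
    {"idiom": "spill the tea", "sentence": "Come on, spill the tea."},
    {"idiom": "spill the tea", "sentence": "He spilled the tea about the merger."},
    {"idiom": "break a leg", "sentence": "Break a leg out there tonight!"},
    {"idiom": "hit the sack", "sentence": "I am going to hit the sack early."},
    {"idiom": "hit the sack", "sentence": "They hit the sack right after dinner."},
]
train, test = grouped_split(examples)
overlap = {e["idiom"] for e in train} & {e["idiom"] for e in test}
print(len(train), len(test), overlap)
```

Hashing the idiom instead of shuffling sentences keeps the assignment stable across runs and across newly added examples, at the price of split sizes that only approximate the requested fraction.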

Why It Matters

Language is not just words; it is meaning, context, and sometimes sarcasm. If AI is going to understand humans, it needs to know that “break a leg” is encouragement, not a medical emergency. The project was developed using modern NLP and deep learning tools, integrating transformer models, embedding techniques, and retrieval architectures.

Idiom Interpretation Example

“Spill the tea”
English: reveal gossip
Arabic: كشف الأسرار
French: révéler des potins

This is Task 4 in practice: one idiom, interpreted in three languages.

Conclusion

IdiomX helps move AI beyond literal language toward real understanding.
It is a step toward making machines interpret language the way humans actually use it.

Project's GitHub

Hugging Face
