CLMIR 2025

Crosslingual Mathematical Information Retrieval

Mathematical Information Retrieval (MIR) is a research domain focused on retrieving relevant mathematical expressions, equations, and documents in response to user queries. However, most existing MIR systems operate within a single language, limiting accessibility for users working in multilingual environments. Cross-Lingual Mathematical Information Retrieval (CLMIR) aims to bridge this gap by developing systems that retrieve mathematical content across languages. This task focuses on English-Hindi cross-lingual retrieval, helping users find mathematical content in one language even when it was originally written in another.

Many Hindi-speaking students and researchers struggle to access high-quality study materials and research papers because most are written in English; likewise, valuable mathematical content written in Hindi remains inaccessible to English speakers. The challenge in building such a system lies in accurately translating mathematical expressions and text, understanding queries in both languages, matching mathematical concepts correctly, and overcoming the scarcity of large bilingual datasets.

Dataset Description:

The dataset for the CLMIR 2025 task is curated from the Math Stack Exchange corpus of ARQMath-1 and contains approximately 50,000 instances (Training Data). Each instance includes the body of scientific information, which contains mathematical equations, expressions, and supporting textual descriptions in Hindi, along with its associated search ID. To assess participant performance, 10 formula- and text-based queries in English (Validation Data) and 50 formula- and text-based queries in English (Test Data) will be provided. Participants are expected to submit a results file containing the relevant search results corresponding to each query.

The use case example demonstrates how the CLMIR system retrieves relevant mathematical content across languages. It shows a query in English and the corresponding relevant and irrelevant results in Hindi, highlighting the system’s ability to match both mathematical expressions and textual meaning accurately.
  • Query (English: Text + Formula): What is the integral of eˣ? ∫eˣ dx

  • CLMIR System Generated Results

    1. Search ID 1 (Relevant):
      समाकलन के मूल नियमों के अनुसार, यदि किसी फलन का अवकलन f(x) हो, तो उसका समाकलन हमें मूल फलन वापस देता है। इसी आधार पर, चूंकि eˣ का अवकलन स्वयं eˣ होता है, इसलिए उसका समाकलन भी वही रहेगा: ∫eˣ dx = eˣ + C
      (English: According to the basic rules of integration, if the derivative of a function is f(x), then integrating f(x) gives the original function back. On this basis, since the derivative of eˣ is eˣ itself, its integral remains the same: ∫eˣ dx = eˣ + C)
    2. Search ID 2 (Irrelevant):
      x² का समाकलन इस प्रकार है: ∫x² dx = x³/3 + C
      (English: The integral of x² is as follows: ∫x² dx = x³/3 + C)
    3. Search ID 3 (Irrelevant):
      ln(x) का अवकलज इस प्रकार है: d/dx (ln x) = 1/x
      (English: The derivative of ln(x) is as follows: d/dx (ln x) = 1/x)
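The ranking behaviour illustrated above can be sketched with a minimal example: embed the query and the candidate passages into a shared multilingual vector space and rank by cosine similarity. The three-dimensional vectors below are invented toy stand-ins, not the output of any real multilingual encoder; only the ranking logic is meant to be illustrative.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy "embeddings" for one English query and three Hindi passages,
# keyed by search ID. Real systems would obtain these from a
# multilingual sentence encoder covering English and Hindi.
query_vec = [0.9, 0.1, 0.0]     # query about the integral of e^x
corpus = {
    1: [0.85, 0.15, 0.05],      # relevant passage
    2: [0.30, 0.80, 0.10],      # irrelevant passage
    3: [0.10, 0.20, 0.90],      # irrelevant passage
}

# Rank search IDs by similarity to the query, highest first.
ranking = sorted(corpus, key=lambda sid: cosine(query_vec, corpus[sid]),
                 reverse=True)
print(ranking)  # → [1, 2, 3]: the relevant passage ranks first
```

In this sketch the relevant passage wins because its vector points in nearly the same direction as the query's; a deployed CLMIR system would add query translation or a shared cross-lingual embedding model on top of the same ranking step.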

Target Audience and Submission Guidelines:

The CLMIR 2025 task aims to engage participants from both academia and industry, including students, researchers, and academicians specializing in information retrieval, mathematical information retrieval, cross-lingual retrieval, and computational linguistics. It also seeks to attract industry professionals working on multilingual mathematical retrieval systems, promoting collaboration between research and industry. To ensure fairness and consistency, a participant cannot be a member of more than one team. Each team is required to submit a minimum of one run and may submit up to a maximum of three runs for evaluation. Participants are required to submit their results in a CSV (Comma-Separated Values) file format. The CSV file must follow the structure and column naming conventions with the following fields: QueryID, SearchID, Run Number, and Similarity Score.
      QueryID: Represents the unique identifier of the query
      SearchID: Denotes the identifier of the retrieved result corresponding to the query
      Run Number: Indicates the run number submitted by the participant (e.g., 1 for the first run)
      Similarity Score: A numerical value indicating the similarity between the query and the retrieved result
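A submission file following these fields could be assembled along the following lines. This is a minimal sketch using Python's standard csv module; the rows are invented placeholders, and the exact ID formats are an assumption (the task organizers' IDs should be used as provided).

```python
import csv
import io

# Hypothetical retrieval output for run 1: placeholder
# (QueryID, SearchID, Run Number, Similarity Score) rows.
results = [
    ("Q1", "S101", 1, 0.92),
    ("Q1", "S205", 1, 0.87),
    ("Q2", "S042", 1, 0.95),
]

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
# Header row uses the field names required by the task.
writer.writerow(["QueryID", "SearchID", "Run Number", "Similarity Score"])
writer.writerows(results)

submission = buf.getvalue()
print(submission)
```

Writing to an in-memory buffer keeps the sketch self-contained; in practice the same writer would target an open file, one CSV per run.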

Evaluation

The performance of participant systems in the CLMIR 2025 task will be evaluated using three key metrics: Precision@10 (P@10), Mean Average Precision (MAP), and Normalized Discounted Cumulative Gain (nDCG).
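For intuition, the three metrics can be sketched over a single ranked result list with binary relevance judgments. This is a simplified illustration, not the official scorer: graded relevance labels would change the nDCG gains, and MAP averages the per-query value shown here over all queries.

```python
from math import log2

def precision_at_k(rels, k=10):
    """P@k: fraction of the top-k retrieved results that are relevant."""
    return sum(rels[:k]) / k

def average_precision(rels):
    """AP for one query: mean of P@i taken at each rank i that holds a
    relevant result. MAP averages this value over all queries."""
    hits, score = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            score += hits / i
    return score / hits if hits else 0.0

def ndcg(rels, k=10):
    """nDCG@k with binary gains: DCG of the ranking divided by the DCG
    of the ideal ranking (all relevant results first)."""
    def dcg(rs):
        return sum(r / log2(i + 1) for i, r in enumerate(rs[:k], start=1))
    ideal_dcg = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal_dcg if ideal_dcg else 0.0

# Binary relevance of a toy 10-result ranking (1 = relevant).
rels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(precision_at_k(rels))    # 0.2
print(average_precision(rels)) # (1/1 + 2/3) / 2 ≈ 0.833
print(ndcg(rels))              # ≈ 0.92
```

The nDCG value stays high here because both relevant results sit near the top, while P@10 is low simply because only 2 of the 10 results are relevant; the three metrics reward different aspects of the same ranking.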

  1. Only one registration per team is required. Please ensure that the team information is complete at the time of registration.
  2. Each team can have at most 4 participants.
  3. Each team must have at least one member.
  4. A team can submit up to 3 different runs, but only one working note.
  5. Each team is required to submit a detailed description of their algorithm(s).
  6. Participants are allowed to use any external pretrained models.