Mathematical Information Retrieval (MIR) is a research domain that focuses on retrieving relevant mathematical expressions, equations, and documents based on user queries.
However, most existing mathematical information retrieval systems operate within a single language, limiting accessibility for users working in multilingual environments.
Cross-Lingual Mathematical Information Retrieval (CLMIR) aims to bridge this gap by developing systems that retrieve mathematical content across different languages.
This task focuses on English-Hindi cross-lingual retrieval, helping users find mathematical content in one language even if it was originally written in another.
Many Hindi-speaking students and researchers struggle to access high-quality study materials and research papers because most such resources are available only in English. Similarly,
valuable mathematical content written in Hindi remains inaccessible to English speakers. The challenge in building such a system lies in accurately translating mathematical
expressions and text, understanding queries in both languages, matching mathematical concepts correctly, and overcoming the lack of large bilingual datasets.
Dataset Description:
The dataset for the CLMIR 2025 task is curated from the Math Stack Exchange corpus of ARQMath-1 and contains approximately 50,000 instances (Training Data). Each instance includes
the body of a post, comprising mathematical equations, expressions, and supporting textual descriptions in Hindi, along with its associated search ID.
To assess the performance of participants, 10 formula- and text-based queries in English (Validation Data) and 50 formula- and text-based queries in English (Test Data) will be provided. Participants are expected
to submit a results file containing the relevant search results corresponding to each query.
The use case example demonstrates how the CLMIR system retrieves relevant mathematical content across languages. It shows a query in English and the corresponding relevant and irrelevant results
in Hindi, highlighting the system’s ability to match both mathematical expressions and textual meaning accurately.
Target Audience and Expected Submission Guidelines:
The CLMIR 2025 task aims to engage participants from both academia and industry,
including students, researchers, and academicians specializing in information retrieval,
mathematical information retrieval, cross-lingual retrieval, and computational linguistics.
It also seeks to attract industry professionals working on multilingual mathematical retrieval systems,
promoting collaboration between research and industry. To ensure fairness and consistency,
a participant cannot be a member of more than one team. Each team is required to submit a minimum
of one run and may submit up to a maximum of three runs for evaluation. Participants are required to
submit their results in a CSV (Comma-Separated Values) file. The CSV file must follow the prescribed structure
and column naming conventions, with the following fields: QueryID, SearchID, Run Number, and Similarity Score.
QueryID: the unique identifier of the query
SearchID: the identifier of the retrieved result corresponding to the query
Run Number: the run number submitted by the participant (e.g., 1 for the first run)
Similarity Score: a numerical value indicating the similarity between the query and the retrieved result
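As a minimal sketch of how a results file in this format could be produced, the snippet below writes a few rows with Python's standard csv module. All query IDs, search IDs, and scores shown are purely illustrative placeholders, not actual dataset identifiers.

```python
import csv

# Hypothetical submission rows; IDs and scores are illustrative only.
rows = [
    {"QueryID": "Q001", "SearchID": "S10234", "Run Number": 1, "Similarity Score": 0.92},
    {"QueryID": "Q001", "SearchID": "S10987", "Run Number": 1, "Similarity Score": 0.87},
    {"QueryID": "Q002", "SearchID": "S11456", "Run Number": 1, "Similarity Score": 0.95},
]

# Write one row per (query, retrieved result) pair with the required header.
with open("clmir2025_run1.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f, fieldnames=["QueryID", "SearchID", "Run Number", "Similarity Score"]
    )
    writer.writeheader()
    writer.writerows(rows)
```

Each run would be written as a separate file (or distinguished by the Run Number column), with results for every query grouped together.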
Evaluation
The performance of participant systems in the CLMIR 2025 task will be evaluated using three key metrics: Precision@10 (P@10),
Mean Average Precision (MAP), and Normalized Discounted Cumulative Gain (nDCG).