What is TDM?
Text and data mining (TDM) is defined in French law as “[…] the implementation of automated analysis techniques for digital text and data in order to extract information therefrom, including patterns, trends, and correlations” (Article L122-5-3 I. of the Intellectual Property Code, created by Order No. 2021-1518 of November 14, 2021).
In other words, TDM involves using computer tools to extract information, links, and recurring patterns from a corpus of text or data.
The 2019 European Directive on Copyright and Related Rights in the Digital Single Market (specifically Articles 3 and 4) introduces an exception for the use of text and data mining technologies aimed at facilitating research and innovation within the European Union.
Types of content covered
All texts and all types of digital content are covered by the regulatory framework: text corpora, data, still or moving images, sounds, music, software, etc.
Who can use TDM?
Staff at universities and research organizations, public libraries, museums, archives, and institutions that preserve film, audiovisual, or sound heritage may use TDM. No compensation is owed to the rights holder. Access to the documents must, however, be lawful (see “What is permitted?” below), and the mining must be carried out solely for research purposes.
What is permitted?
It is permitted to “reproduce content protected by intellectual property rights for the purpose of conducting data mining activities for scientific research, without having to obtain prior authorization from the ‘rights holders’ (the producers of the databases, the owners of the texts and/or data targeted by TDM: companies, publishers, …) or to obtain licenses from them.”
“Copies obtained on the basis of this ‘exception’ in favor of TDM may be retained for as long as necessary with an appropriate level of security by its beneficiaries, in particular to enable the conduct of new scientific research or to serve as a means of verifying results. […] thereby enhancing the reproducibility of research results and the cumulative nature of science” (see article on Text and Data Mining (in French) on the Ouvrir la Science website).
In practice, researchers are authorized to work on documents and data to which they have lawful access: because these are available in open access, because they were acquired by their institution, or because they are covered by their library’s subscriptions to online databases. Closed-access publications or data obtained through illicit means must not be subject to TDM.
Please note, however, that while the law permits researchers to work with these data and documents, database publishers may sometimes prohibit the bulk downloading of documents, also known as harvesting. It is therefore important to contact the relevant publishers before undertaking any text and data mining activities, which begin as soon as the data and documents are downloaded.
To assist you, we recommend notifying us of your TDM project before any bulk download operation, so that the content provider does not mistake the traffic for an attack and block your institution’s access.
When can TDM not be used?
- In the context of a for-profit partnership between public and private entities: “when a company is a shareholder or structurally affiliated with a research organization and thereby benefits from privileged access to data, then that organization cannot invoke the exception for scientific research purposes” (see article on Text and Data Mining (in French) on the Ouvrir la Science website).
- When the content to be mined contains protected data, for example non-anonymized personal data. The TDM exception must be applied in compliance with the GDPR and its obligations (see Regulation (EU) 2016/679 of the European Parliament and of the Council of April 27, 2016, on the protection of natural persons with regard to the processing of personal data and on the free movement of such data).
Resources available at the Université de Lorraine for conducting TDM operations
Université de Lorraine subscriptions to paid resources with a dedicated TDM framework
For any text and data mining activities on these resources, please contact us in advance to prevent the institution’s access from being blocked by the relevant content provider. Where the publisher requires one, an API key will be provided.
- Springer Archives and Nature: An API key can be provided for any TDM request: https://api.springernature.com/. More information at https://link.springer.com/
- Sage Archives: You can use the CrossRef Text and Data Mining API.
- Elsevier: A specific API is dedicated to TDM.
- ACM (Association for Computing Machinery): For any TDM request, an API will be provided by the publisher or by a trusted third party (see the Couperin agreement, in French).
- RSC – Royal Society of Chemistry: No specific API mentioned.
Resources accessible to the entire French Higher Education community
Istex, or Initiative d’EXcellence en Information Scientifique et Technique, is a platform operated by Inist-CNRS as part of a government funding program for the purchase of scientific journal archives. Its 30 million documents, divided into 50 publisher bundles and 2 open-access bundles, are accessible only to members of the Higher Education and Research community. The metadata, however, is visible to everyone. 63 languages are represented, but the majority of the documents are in English, German, and French. The documents cover a broad time span (from 1455 to today) and a wide range of disciplines (humanities and social sciences, basic and applied sciences, among others).
You can create corpora on Istex-Search, query web services via TDM Web Services, or run data processing tasks on your data via TDM Factory. Lodex then allows you to continue processing and analyzing your corpus and finally to visualize your data and corpus. Istex services are accessible to all members of the Higher Education and Research community.
Training on Istex tools is offered by the Inist-CNRS team. Tutorials (in French) are also available on the Callisto website.
Open Heritage Resources
Data BnF and Gallica are two initiatives of the Bibliothèque nationale de France (BnF).
Gallica is a digital library that offers free, open access to several million digitized documents from all eras and in all formats. Gallica can be searched via the BnF API portal.
Data BnF, on the other hand, is part of the Semantic Web and provides access to several million linked metadata records from the BnF catalog. The database can be queried using SPARQL.
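As an illustration of the SPARQL access mentioned above, here is a minimal Python sketch that builds a query and the corresponding request URL. The endpoint address (https://data.bnf.fr/sparql), the use of the foaf:name property, and the function names are assumptions made for this example; check the BnF documentation for the actual data model before relying on them.

```python
import json
import urllib.parse
import urllib.request

# Assumption: public SPARQL endpoint of Data BnF (verify in the BnF documentation).
ENDPOINT = "https://data.bnf.fr/sparql"


def build_author_query(name: str, limit: int = 10) -> str:
    """Return a SPARQL query listing resources whose foaf:name equals `name`."""
    # Escape backslashes and double quotes so the name is a valid SPARQL literal.
    safe_name = name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
        "SELECT DISTINCT ?person WHERE {\n"
        f'  ?person foaf:name "{safe_name}" .\n'  # assumption: names are exposed via foaf:name
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Encode the query as the standard `query` parameter of the SPARQL protocol."""
    return f"{endpoint}?{urllib.parse.urlencode({'query': query})}"


def run_query(query: str, timeout: float = 30.0) -> list:
    """Send the query and return the result bindings (requires network access)."""
    request = urllib.request.Request(
        build_request_url(query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]
```

Calling run_query(build_author_query("Victor Hugo")) would send the request; the two build functions can be used on their own to inspect the query before running it.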
Are you registered with the BnF research library? You may benefit from the services of the BnF DataLab, which are dedicated to researchers, ranging from corpus construction to data mining, using the BnF’s digital collections. For more information, visit the BnF DataLab page and review the application form.
How to do TDM? Overview of some analysis tools
Various open-source tools can be used for text and data analysis.
Software installed on your computer, such as RStudio (based on the R language) and Iramuteq (based on the R and Python languages), as well as fully online solutions like TDM Factory, are available to assist you with your text and data mining operations.
- Iramuteq is open-source software developed by LERASS (University of Toulouse 2, LabEx SMS), built on R and Python. Comprehensive documentation in French and numerous practical examples are available.
- Cortext Manager, software developed by LISIS (Université Gustave Eiffel), is a tool dedicated to data analysis and transformation. Available online, it is compatible with the Istex data download format, allowing for easier exploration of your data (see “Resources accessible to the entire French Higher Education community”). Cortext forum: https://docs.cortext.net/forum/
- TDM Factory “is an intuitive interface that allows you to upload your own data and easily apply text mining techniques to it” (TDM Factory website), developed by Inist-CNRS as part of the Istex initiative (see “Resources accessible to the entire French Higher Education community”).
This list is not intended to be exhaustive. Software may also exist within the laboratories themselves, which may have been created as part of previous research projects. Solutions to your software needs may therefore already exist within your research unit.
Practical tip
Don’t forget to structure your data properly before running your analyses: some tools (such as Iramuteq) require very specific input formats. To simplify the analysis, clean your files beforehand and format them correctly (UTF-8 or Latin character encoding; .tsv, .csv, or similar formats).
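The cleaning step described above could look like the following Python sketch, which converts the raw bytes of a CSV export into UTF-8 tab-separated text. The function name to_utf8_tsv, the Latin-1 fallback, and the whitespace normalization are illustrative choices, not requirements of any particular tool; always check the input format expected by the software you use.

```python
import csv
import io


def to_utf8_tsv(raw: bytes, delimiter: str = ",") -> str:
    """Decode CSV bytes (UTF-8, else Latin-1) and return tab-separated text."""
    try:
        text = raw.decode("utf-8-sig")  # also drops a leading byte-order mark
    except UnicodeDecodeError:
        # Assumption: non-UTF-8 input is Latin-1. This decoding never fails,
        # but it will mis-map characters if the file uses another encoding.
        text = raw.decode("latin-1")

    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    output = io.StringIO(newline="")
    writer = csv.writer(output, delimiter="\t", lineterminator="\n")
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        # Collapse runs of whitespace (including tabs and line breaks inside
        # a cell) so that each record stays on a single line.
        writer.writerow([" ".join(cell.split()) for cell in row])
    return output.getvalue()
```

For a file on disk, read it in binary mode and write the result back with an explicit encoding, for example open("corpus.tsv", "w", encoding="utf-8", newline="").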
How do I visualize my data?
Some of the tools presented above (RStudio, Lodex, Iramuteq) also allow you to visualize data.
Other open-source tools such as Gephi or VOSviewer specialize in data visualization, particularly in displaying networks and nodes.
The Open Science team at the Université de Lorraine offers VOSviewer training sessions throughout the year.
Bibliographic resources
- Institut de l’Information Scientifique et Technique. DoRANum – Aspects juridiques, éthiques, intégrité scientifique : Text and Data Mining [Online]. 2017. Available at: https://doi.org/10.13143/YWKR-5W34
- “Ouvrir la Science – La fouille de textes et de données à des fins de recherche : une pratique confirmée et désormais opérationnelle en droit français”. Available at: https://www.ouvrirlascience.fr/la-fouille-de-textes-et-de-donnees-a-des-fins-de-recherche-une-pratique-confirmee-et-desormais-operationnelle-en-droit-francais/
- Francom J. An Introduction to Quantitative Text Analysis for Linguistics: Reproducible Research Using R. Taylor & Francis, 2025. [Online]. Available at: https://directory.doabooks.org/handle/20.500.12854/143905
- Levshina Natalia. How to do linguistics with R: data exploration and statistical analysis. Amsterdam; Philadelphia (Pa.): John Benjamins Publishing Company, 2015. Available in the Université de Lorraine university libraries catalog.
- Schultz Emilien. Python pour les SHS : introduction à la programmation pour le traitement de données. Rennes: Presses universitaires de Rennes, 2020. Available in the Université de Lorraine university libraries catalog.
Data Mining at the Université de Lorraine: some practical examples
Examples of articles taken from an OpenAlex search dated November 21, 2025.
API query used
https://api.openalex.org/works?page=1&filter=title_and_abstract.search:text+and+data+mining,authorships.institutions.lineage:i90183372&sort=relevance_score:desc&per_page=10
- Dina N. Z., Yunardi R. T., Firdaus A. A., Juniarta N. “Measuring User Satisfaction of Educational Service Applications using Text Mining and Multicriteria Decision-Making Approach”. International Journal of Emerging Technologies in Learning (iJET) [Online]. September 6, 2021. Vol. 16, no. 17, p. 76. Available at: https://doi.org/10.3991/ijet.v16i17.22939
- Ostaszewski M., Niarakis A., Mazein A., Ravel J.-M. [et al.]. “COVID19 Disease Map, a computational knowledge repository of virus–host interaction mechanisms”. Molecular Systems Biology [Online]. October 1, 2021. Vol. 17, no. 10, p. e10387. Available at: https://doi.org/10.15252/msb.202110387
- Pateyron B., Weber M., Germain P. “Essai d’analyse lexicale et stemma codicum de quatre-vingt-trois rituels de Chevaliers Kadosh de la collation du fonds de l’atelier de recherches Sources”. Nouvelles perspectives en sciences sociales [Online]. April 1, 2016. Vol. 11, no. 1, p. 93-144. Available at: https://doi.org/10.7202/1035934ar
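The OpenAlex query shown above can be rebuilt programmatically, which makes it easier to change the search terms or the page size. The following Python sketch is one possible way to do so; the function names are hypothetical, and the parameters are simply those visible in the query URL. The fetch function assumes that the JSON response contains a "results" list whose items carry a "title" field.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.openalex.org/works"


def build_openalex_url(search: str, institution_id: str,
                       per_page: int = 10, page: int = 1) -> str:
    """Build a works query filtered by title/abstract search and institution."""
    filters = ",".join([
        f"title_and_abstract.search:{search}",
        f"authorships.institutions.lineage:{institution_id}",
    ])
    params = {
        "page": page,
        "filter": filters,
        "sort": "relevance_score:desc",
        "per_page": per_page,
    }
    # Keep ':' and ',' unescaped so the URL reads like the one quoted above;
    # spaces are encoded as '+'.
    return f"{BASE_URL}?{urllib.parse.urlencode(params, safe=':,')}"


def fetch_titles(url: str, timeout: float = 30.0) -> list:
    """Return the titles of the works on one result page (requires network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    # Assumption: each item in "results" exposes a "title" field.
    return [work.get("title") for work in payload.get("results", [])]
```

With search set to "text and data mining" and institution_id set to "i90183372", build_openalex_url returns the query URL given above. Results may differ from the list above, since the index changes over time.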
Example of a dataset created using TDM methods:
- Tchiedjo, Marie Laure; Thomas, Marielle; Pétronin, Florent; Kestemont, Patrick; Lecocq, Thomas, 2025, “Data Extracted from Scientific Articles on Worldwide Fish Polyculture”, https://doi.org/10.57745/8PDRLJ, Recherche Data Gouv, V2.
