GitHub Releases Multilingual Repository Dataset Covering Over 40 Million Repositories

2026-06-16 09:37

en.Wedoany.com Reported - GitHub has released the GitHub Multilingual Repositories Dataset, a repository-level metadata dataset designed to help researchers and developers discover public GitHub repositories that contain non-English natural language content. The data shows that the language distribution differs across READMEs, issues, and pull requests: Korean is the most common non-English language in issue text but ranks only fifth in READMEs, while Portuguese leads among non-English READMEs, appearing in over 3 million repositories. As AI plays an increasingly important role in how developers build software, multilingual developer content matters more than ever. The dataset is now published on GitHub under the CC0-1.0 license, fulfilling a commitment GitHub made in 2025, as part of Microsoft's European Digital Commitments, to make multilingual data more accessible, including for open-source AI developers.

This dataset is not a dump of repository content. It is a metadata dataset containing over 80 million classification records across more than 40 million repositories. For each public repository it provides three kinds of information. First, language classifications for three text sources: the README, the most commented issue, and the most commented pull request. The first 150 characters of each source serve as the input sample, and text shorter than 20 characters is excluded. Second, the results for each text source from three classifiers (fastText, gcld3, and lingua-py), each with a confidence score; only classifications with a confidence greater than 0.5 are included. Third, repository metadata: creation timestamp, disk usage, star count, fork count, primary programming language, SPDX license, issue and pull request counts, and snapshot date. GitHub intentionally does not merge the three classifiers into a single label, because the classifiers differ in coverage and confidence calibration, especially for low-resource languages. With all three results available, users can decide for themselves how strict to be.
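The sampling rule and the "choose your own strictness" idea can be sketched in a few lines of Python. This is a minimal illustration, not GitHub's pipeline: the `Classification` record, the function names, and the two-out-of-three voting rule are assumptions made for the example, and the article does not say whether the 20-character cutoff applies before or after truncation (the sketch applies it to the full text).

```python
from collections import Counter
from dataclasses import dataclass

SAMPLE_CHARS = 150     # sample length stated in the article
MIN_CHARS = 20         # shorter texts are excluded, per the article
MIN_CONFIDENCE = 0.5   # the dataset keeps only confidence > 0.5


def make_sample(text):
    """Return the first 150 characters, or None if the text is too short.

    Assumption: the 20-character cutoff is checked on the full text.
    """
    if len(text) < MIN_CHARS:
        return None
    return text[:SAMPLE_CHARS]


@dataclass(frozen=True)
class Classification:
    """Hypothetical record: one classifier's verdict for one text source."""
    classifier: str   # e.g. "fastText", "gcld3", "lingua-py"
    language: str     # language code; the dataset's real format may differ
    confidence: float


def agreed_language(results, min_votes=2, min_confidence=MIN_CONFIDENCE):
    """Return the language that at least `min_votes` classifiers agree on.

    Assumes at most one result per classifier. Returns None when no
    language reaches the requested level of agreement.
    """
    votes = Counter(
        r.language for r in results if r.confidence > min_confidence
    )
    if not votes:
        return None
    language, count = votes.most_common(1)[0]
    return language if count >= min_votes else None
```

Raising `min_votes` to 3 gives a stricter filter that keeps only unanimous labels, while lowering it to 1 accepts any single classifier above the confidence threshold.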

This dataset has several uses. It can help discover repositories that may contain developer documentation or collaboration in specific languages, and support studies of how non-English developer communities use issues, pull requests, and READMEs. It can be used to build evaluation sets for AI coding tools, documentation generators, or review assistants, all of which need to perform well across multiple languages. It can give policymakers data on the multilingual diversity of developers to support arguments for expanding language coverage, and it can be used to measure the representation of European and other underrepresented languages in open source. Language identification in software repositories is challenging, however. Repository text is often short and may contain badges, templates, installation commands, code snippets, usernames, or mixed-language content, and a 150-character sample may not represent the entire repository. For this reason the dataset should not be treated as a ground-truth benchmark for language identification; it is designed as a transparent discovery tool. Nor should it be used to infer sensitive attributes of repository owners, contributors, or communities, because its signals are repository-level metadata, not personal-level attributes.
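As a rough illustration of the "measure representation" use case, the sketch below ranks non-English languages separately for each text source, which is the kind of comparison behind a statement such as "most common in issues but only fifth in READMEs". The row layout `(repository, source, language)` and the language codes are hypothetical; the published dataset's actual schema may differ.

```python
from collections import defaultdict


def rank_languages(rows, exclude=("en",)):
    """Rank languages per text source by number of distinct repositories.

    `rows` is an iterable of (repository, source, language) tuples, a
    hypothetical flattening of the classification records. Languages in
    `exclude` are skipped so that only non-English content is ranked.
    """
    repos = defaultdict(set)
    for repo, source, language in rows:
        if language not in exclude:
            # A set ensures each repository is counted once per pair.
            repos[(source, language)].add(repo)

    ranking = defaultdict(list)
    for (source, language), names in repos.items():
        ranking[source].append((language, len(names)))

    # Sort by descending repository count, then by language code for
    # a stable order among ties.
    for source in ranking:
        ranking[source].sort(key=lambda item: (-item[1], item[0]))
    return dict(ranking)
```

Because these labels come from short, noisy samples, such counts are better read as discovery signals than as precise measurements.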

Many European languages remain underrepresented in online text used to build and evaluate AI systems, which may cause AI tools to perform well for some developers, languages, and communities while leaving others behind. Open data helps bridge this gap. The dataset was constructed because developer content differs from general web text; READMEs, issues, and pull requests contain the language of software collaboration, such as installation instructions, bug reports, feature requests, review comments, and community norms. These contexts help build AI systems that better understand how developers actually work. By making multilingual developer content signals more discoverable and analyzable, this dataset provides researchers, open-source developers, and model builders with tools to study language representation in software development, helping identify gaps, support better evaluations, and create more inclusive AI tools for developers in Europe and beyond.

GitHub will discuss this dataset and the broader importance of open data for multilingual AI on June 16 at the Open Innovation Dialogue Hub in Strasbourg. The event, co-organized by the Microsoft Open Innovation Center, the Council of Europe, and GitHub, will bring together policymakers, researchers, cultural institutions, and open innovation leaders to explore AI, language diversity, cultural heritage, and open data.

This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com
