Mapping the Public Goods Behind Education AI: Why Metadata Standards Matter

June 17, 2026 | By Xin Wei, Rebecca Griffiths and Jeremy Roschelle

Key Ideas

Our scan of multiple repositories identified a sizable, distributed set of candidate data-bearing public-good records for K-12 and broader education AI.
Public data infrastructure for responsible AI in schools includes not only primary student data, but also teacher professional development records, curriculum artifacts, education benchmarks, and related research resources.
The usability of these records is currently limited by inconsistent licensing, documentation, grade-span reporting, and artifact-type labeling, pointing to a need for shared metadata standards.

The public infrastructure underlying Artificial Intelligence (AI) in education is growing. Researchers, developers, and educational agencies around the world are depositing datasets, benchmarks, curriculum artifacts, assessment resources, and related research artifacts in open repositories at a scale that now amounts to a public-goods ecosystem.

We began with the K-12 AI Infrastructure Program’s frame of datasets, models, and benchmarks, asking what public resources are available to support responsible AI development for schools. As we expanded the scan across Hugging Face, GitHub, and Digital Object Identifier (DOI)-registered repositories, we narrowed the reported inventory to data-bearing records: datasets, benchmark datasets, curriculum and standards artifacts, assessment resources, tutoring research artifacts, and paper-linked data resources. Standalone models, apps, and code pipelines are not counted unless they are associated with a reusable data artifact.

Public Goods Are Broader Than We Think

We use “public goods” in two related ways. First, we treat datasets, models, and benchmarks as shared building blocks that can support responsible AI development for education when they are made available for appropriate reuse. Second, following digital public-goods standards and open data traditions, we focus on resources that are openly accessible, reusable, and governed with attention to privacy, safety, and public benefit.

In this blog, we report on the data-bearing part of that infrastructure, not the full universe of models, apps, or software tools. In our scan, teacher professional development records, such as teacher coaching feedback datasets or annotated classroom observation records, can inform instructional coaching systems. Curriculum and standards artifacts, such as state K-12 standards datasets, Texas Essential Knowledge and Skills (TEKS) resources, and Next Generation Science Standards (NGSS)-aligned science frameworks, support alignment research. Classroom observation datasets can inform pedagogical AI. Education benchmarks, assessment instruments, and tutoring research artifacts each support a different layer of responsible AI development for education.

What’s Working

Three findings emerged as we built the combined inventory.

The ecosystem is substantial. Our scan across Hugging Face, GitHub, and DOI-registered repositories returned more than 1,500 candidate data-bearing records related to education and AI.
The contributors are global. Records come from many countries, languages, and populations, from early childhood through adult learners. This work is spread across several kinds of platforms: AI-focused hubs like Hugging Face, general code repositories like GitHub, and research data archives like Zenodo and figshare.
Contributors are partially documenting their work. About 80% of records include at least limited data documentation, and many include methodological notes about data collection, populations, and intended use. However, inconsistent open licensing and cross-platform metadata limit their discoverability and reuse.

What Would Unlock More Value

Targeted coordination steps would make existing data-bearing public-good records substantially more reusable. These recommended steps would ask repositories, funders, and research groups to standardize information already being recorded unevenly across the ecosystem.

Make documentation more complete and comparable: Data-bearing records should state what the artifact contains; where it came from; who it represents; how it was collected; and any privacy, consent, access, or reuse limitations.
Standardize grade-span metadata: Currently, 61% of records in our inventory do not specify a grade span, leaving it unclear whether the learners are in early childhood, elementary school, middle school, high school, or postsecondary education, or are adult learners. A record is much harder to cite, compare, or build upon if downstream users cannot determine its target population.
Clarify license terms at the data-file level: Many records carry a license on the landing page but do not clearly state whether that license applies to the underlying data files. By writing explicit, machine-readable open-data licensing into grant requirements, philanthropic and government funders can ensure the resources they sponsor are legally reusable and increase the value of their investments.
Distinguish artifact types: Repositories and DOI registries should support clearer tagging for primary data, benchmark datasets, models, code, curriculum artifacts, review supplements, and papers. Most platforms have some version of a “type” field already; richer and more consistent use of these fields would help researchers find the exact kind of resource they need without sifting through mismatched formats.
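
To make the four recommendations above concrete, here is a minimal Python sketch of what a shared metadata record and a completeness check might look like. It is an illustration only, not a schema used in our scan or by any repository: the class name, field names, and controlled vocabularies below are all assumptions chosen to mirror the recommendations.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical controlled vocabularies mirroring the recommendations above.
ARTIFACT_TYPES = {
    "primary_data", "benchmark_dataset", "model", "code",
    "curriculum_artifact", "review_supplement", "paper",
}
GRADE_SPANS = {
    "early_childhood", "elementary", "middle", "high",
    "postsecondary", "adult",
}
# Free-text documentation fields: what, where from, who, how, and limits.
DOCUMENTATION_FIELDS = (
    "contents", "provenance", "population",
    "collection_method", "reuse_limitations",
)


@dataclass
class RecordMetadata:
    """Illustrative metadata for one data-bearing record (assumed schema)."""
    title: str
    artifact_type: Optional[str] = None
    grade_spans: List[str] = field(default_factory=list)
    data_license: Optional[str] = None  # license that applies to the data files
    contents: Optional[str] = None
    provenance: Optional[str] = None
    population: Optional[str] = None
    collection_method: Optional[str] = None
    reuse_limitations: Optional[str] = None


def missing_fields(record: RecordMetadata) -> List[str]:
    """Return the names of fields that are absent or outside the vocabulary."""
    problems = []
    if record.artifact_type not in ARTIFACT_TYPES:
        problems.append("artifact_type")
    if not record.grade_spans or any(
        span not in GRADE_SPANS for span in record.grade_spans
    ):
        problems.append("grade_spans")
    if not record.data_license:
        problems.append("data_license")
    for name in DOCUMENTATION_FIELDS:
        if not getattr(record, name):
            problems.append(name)
    return problems


def share_without_grade_span(records: List[RecordMetadata]) -> float:
    """Fraction of records that list no grade span at all."""
    if not records:
        return 0.0
    return sum(1 for r in records if not r.grade_spans) / len(records)


if __name__ == "__main__":
    # A made-up record with only a title, as many landing pages provide.
    sparse = RecordMetadata(title="Example classroom dataset")
    print(missing_fields(sparse))
```

A check like this could run when a record is deposited or when a funder reviews a deliverable, so that gaps in grade span, license, or artifact type surface before the record is published rather than after someone tries to reuse it.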

Our scan suggests that a growing ecosystem of data-bearing public resources is already in place. The next step is making these resources easier to discover, understand, and reuse to support responsible AI development in education.

Earlier this week, DrivenData launched a new platform in collaboration with the K-12 AI Infrastructure Program. The platform is designed to support the work of building AI that actually serves students in three key ways:

Gather and distribute core AI infrastructure comprising datasets and models.
Ground AI advances in learning science and student outcomes through curated benchmarks.
Build a collaborative community through challenges and community discussion forums.

Explore the new platform today!

