The problem
Patent citations to scholarly literature are one of the most valuable signals of research translation — a direct link between academic discovery and commercial application. But accessing this data has traditionally required expensive proprietary databases like Lens.org or PatSnap, putting it out of reach for many research funders and universities.
What I built
A local pipeline that processes the entire EPO global patent snapshot (142 million patent documents), resolves the non-patent literature (NPL) citations those documents contain to DOIs, and matches the resolved DOIs to OpenAlex records. The result: a comprehensive dataset linking patents to scholarly papers, built entirely from open sources.
Key figures
- 142 million patent documents processed from the EPO global snapshot
- 29.6 million non-patent literature citations resolved to DOIs
- 4.35 million unique scholarly papers matched to OpenAlex records
- Zero proprietary dependencies — fully reproducible from open data
How it works
The pipeline operates in several stages:
- Ingest: Parse the EPO’s bulk patent data files, extracting bibliographic records and non-patent literature citations
- Clean: Normalise citation strings, handling the enormous variety of formats used across patent offices worldwide
- Resolve: Match citation strings to DOIs using a combination of structured parsing and fuzzy matching
- Enrich: Link resolved DOIs to OpenAlex records, adding full metadata including authors, institutions, funding information, and citation networks
- Validate: Cross-check results against known benchmarks to ensure accuracy
Why it matters
For research funders, this opens up an entirely new dimension of impact assessment. You can now ask questions like:
- Which of our funded publications have been cited in patents?
- What industries are building on our research?
- How does our portfolio’s patent citation rate compare to field expectations?
- Where are the strongest translation pathways from our funded research to commercial application?
All without paying for proprietary databases — and with full transparency over the methodology.
Technical details
The pipeline is built in Python, using DuckDB for efficient local processing of large datasets. It runs on a single machine (no cloud infrastructure required) and can process the full EPO snapshot in under 48 hours. The code is designed to be modular, so individual components can be updated independently as data sources change.
Want to discuss this work?
I'm always happy to talk about methodology, data infrastructure, or how these approaches could apply to your organisation.