15 February 2026

Patent–publication citation linking from open data at global scale

Open Data · Patent Analysis · Data Pipeline · OpenAlex

The problem

Patent citations to scholarly literature are one of the most valuable signals of research translation — a direct link between academic discovery and commercial application. But accessing this data has traditionally required expensive proprietary databases like Lens.org or PatSnap, putting it out of reach for many research funders and universities.

What I built

A local pipeline that processes the entire EPO global patent snapshot (142 million patent documents) and resolves their non-patent literature (NPL) citations to DOIs, then matches those DOIs to OpenAlex records. The result: a comprehensive dataset linking patents to scholarly papers, built entirely from open sources.

Key figures

  • 142 million patent documents processed from the EPO global snapshot
  • 29.6 million non-patent literature citations resolved to DOIs
  • 4.35 million unique scholarly papers matched to OpenAlex records
  • Zero proprietary dependencies — fully reproducible from open data

How it works

The pipeline operates in several stages:

  1. Ingest: Parse the EPO’s bulk patent data files, extracting bibliographic records and non-patent literature citations
  2. Clean: Normalise citation strings, handling the enormous variety of formats used across patent offices worldwide
  3. Resolve: Match citation strings to DOIs using a combination of structured parsing and fuzzy matching
  4. Enrich: Link resolved DOIs to OpenAlex records, adding full metadata including authors, institutions, funding information, and citation networks
  5. Validate: Cross-check results against known benchmarks to ensure accuracy

Why it matters

For research funders, this opens up an entirely new dimension of impact assessment. You can now ask questions like:

  • Which of our funded publications have been cited in patents?
  • What industries are building on our research?
  • How does our portfolio’s patent citation rate compare to field expectations?
  • Where are the strongest translation pathways from our funded research to commercial application?

All without paying for proprietary databases — and with full transparency over the methodology.
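To make the first of those questions concrete, the query is just a join between patent citations and a funder's publication list. The real pipeline uses DuckDB; the same SQL reads almost identically in SQLite, which I use here so the sketch is self-contained. Table and column names are illustrative assumptions, not the pipeline's actual schema:

```python
import sqlite3

# Illustrative schema: table and column names are assumptions.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE patent_citations (patent_id TEXT, cited_doi TEXT);
    CREATE TABLE funded_papers   (doi TEXT, grant_id TEXT);
    INSERT INTO patent_citations VALUES
        ('EP1234567', '10.1000/a'), ('US7654321', '10.1000/b');
    INSERT INTO funded_papers VALUES
        ('10.1000/a', 'GRANT-42'), ('10.1000/c', 'GRANT-99');
""")

# "Which of our funded publications have been cited in patents?"
rows = con.execute("""
    SELECT f.doi, f.grant_id, c.patent_id
    FROM funded_papers f
    JOIN patent_citations c ON c.cited_doi = f.doi
""").fetchall()

print(rows)  # → [('10.1000/a', 'GRANT-42', 'EP1234567')]
```

The other questions are variations on the same join, aggregated by patent classification codes, applicant sectors, or field-normalised citation rates.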

Technical details

The pipeline is built in Python, using DuckDB for efficient local processing of large datasets. It runs on a single machine (no cloud infrastructure required) and can process the full EPO snapshot in under 48 hours. The code is designed to be modular, so individual components can be updated independently as data sources change.
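The modular design can be sketched as a list of stage functions applied in sequence, so any one stage can be swapped out when an upstream data source changes. The stage names mirror the list above, but the structure shown here is my own illustration of the pattern, not the pipeline's actual code:

```python
from typing import Callable, Iterable

Record = dict
Stage = Callable[[Iterable[Record]], Iterable[Record]]

def ingest(records: Iterable[Record]) -> Iterable[Record]:
    # The real stage would parse EPO bulk files; here we pass records through.
    yield from records

def clean(records: Iterable[Record]) -> Iterable[Record]:
    # Normalise each citation string by collapsing runs of whitespace.
    for r in records:
        r["citation"] = " ".join(r["citation"].split())
        yield r

def run_pipeline(records: Iterable[Record], stages: list[Stage]) -> list[Record]:
    """Thread records through each stage lazily, materialising only at the end."""
    for stage in stages:
        records = stage(records)
    return list(records)

out = run_pipeline([{"citation": "Smith  et al.,\n Nature  2019"}], [ingest, clean])
print(out)  # → [{'citation': 'Smith et al., Nature 2019'}]
```

Because the stages are generators, records stream through without holding the full 142-million-document snapshot in memory at once.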

Want to discuss this work?

I'm always happy to talk about methodology, data infrastructure, or how these approaches could apply to your organisation.