Build with AI

Document AI & RAG for a Newspaper Archive

A Saturday morning of workshops and corridor conversations with Mauritius's developer community at SWAN HQ in Port-Louis. My own session was on Document AI and RAG for a historical newspaper archive — turning a 1955 Le Figaro into something a researcher can actually query.

Ish Sookun

4 min read
Build with AI 2026 — Mauritius (Jochen & Devesh kick-starting with the opening address)

On Saturday, May 9, 2026, I took part in the Google Build with AI event organised by GDG Mauritius and hosted by SWAN at their head office in Port-Louis. 36 people attended. My presentation was titled "Document AI & RAG for a Newspaper Archive." I started the talk with a soft "Hello" and pulled up Google Chrome on the screen, with a tab showing Gemini and a question already typed in: "What did Le Figaro write about Bao Dai in 1955?"

There's no precise search result for 'What did Le Figaro write about Bao Đại in 1955?'. Let's broaden the search to look at French newspapers covering Bao Dai in late 1955, particularly around the October 1955 referendum where Ngo Dinh Diem deposed him.

Gemini answering the question about Le Figaro and Bao Dai.

I let that sink in for a moment. Then I switched to my terminal and ran a small Python script — same question, against a tool I had built and trained on actual 1955 issues of Le Figaro. The answer came back in about four seconds. Concise. Specific. Footnoted, with each citation linking back to the precise page in the original PDF where the claim came from.

Running the query using a Python script.

Different tool. Different answer. Same question.

That contrast was the whole reason I was doing that presentation.

A quick note on framing

I opened the talk by introducing myself and where I work — La Sentinelle Ltd, publisher of l'express, 5-Plus, Business Magazine, Turf Magazine, and a few other Mauritian titles. And I was up front about one thing: this is a personal hobby project. Nights and weekends, no La Sentinelle mandate. If it grows into something the company or other regional publishers want to adopt one day, brilliant. For now it's just me, a laptop, and a stubborn belief that queryable newspaper archives are worth building.

Why AI chatbots don't read old newspapers

Generally available chatbots — Gemini, ChatGPT, Claude — are remarkable at the surface of public knowledge. They've absorbed Wikipedia, scraped a substantial chunk of the open web, and learned to pattern-match across billions of documents. Ask them about the 1955 Vietnamese referendum and they'll tell you it happened in October, that Ngô Đình Diệm "won" with an implausible 98.2% of the vote, and that historians still raise eyebrows.

But ask them what Le Figaro specifically wrote about it on a specific date, with citations, and they fall over. Two reasons.

First — old newspapers are not deeply present in their training data. Major archives like the Bibliothèque nationale de France's Gallica have scanned millions of pages from periods like the 1950s, but those pages exist as PDFs of microfilm. The text inside is barely processed — at best, basic OCR with no document-aware structure, no notion of articles, columns, or jumps.

Second — even when text is available, "modern OCR" gets fooled by historical newspapers. Tools like AWS Textract or Azure Document Intelligence were trained on contemporary business documents — invoices, forms, magazines from the last decade. Put a seven-column 1955 broadsheet in front of any of them and you get column-bleed: the parser reads horizontally across columns, splicing fragments of unrelated articles together in the output.

Until you fix that gap, no chatbot can tell you what Le Figaro actually wrote.

The impossible question

This isn't an academic exercise. Every issue l'express has ever published is in our archive as a PDF — every photograph, every classified ad, every column from before independence. Almost none of it is searchable today.

The problem isn't unique to us. Le Cernéen ran from 1832 to 1982 — the oldest paper in the southern hemisphere, full stop. 150 years of Mauritian life, sitting in microfilm and PDFs that nobody can query today.

If a researcher asks: "Find me everything Le Figaro wrote about Bao Đại between 1953 and 1956 — with citations," what do they actually do?

They go to Paris. They sit in the BnF reading room. They flip through 1,460 issues of Le Figaro page by page, spending six weeks of a trained academic's time to answer a single research question. And they haven't even started thinking about cross-paper queries: what about l'express, Le Mauricien, Le Cernéen, the official gazette?

This used to need an army of grad students. With the right tools, today, it needs a Saturday afternoon.

Why anatomy comes first

Most demos of "RAG on documents" skip the part where you study what you're parsing. They run a default OCR, chunk the output, embed the chunks, and call it done.

Anatomy of a newspaper

For modern documents that mostly works. For 1955 newspapers it fails silently — your retrieval still runs, your answers look confident, and they're quietly wrong because every chunk is corrupted.

So I spent almost a third of the talk on the artifact itself. What is a 1955 newspaper, anatomically?

The header alone has twelve fields of structured metadata: title, edition number (Le Figaro ran multiple editions per day), year, day-of-year, the director of the time (Pierre Brisson in our specimen), regional cover prices for Morocco, Algeria, Tunisia, Spain and Italy, and the dépôt légal stamp showing when the BnF received the issue. Each of those needs to come out cleanly into a database column or you can't filter by date or group by edition.

Seven primary columns per page. Five article boundaries on the front page that all start on the same horizontal line, then run down their columns for thirty to sixty lines each, then continue on different pages entirely. Plain OCR reads horizontally. Newspapers are read vertically. Read horizontally across the band and you get garbage like:

L'Allemagne LE TRAITÉ DANS UNE INTERVIEW Gérard Dupriez AUJOURD'HUI / a fait hier D'ÉTAT EXCLUSIVE BAO DAI parricide SEIZE PAGES.

That's a real result from pdftotext on the Le Figaro of 10 May 1955. Four unrelated articles spliced together by horizontal reading order.
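The difference between the two reading orders is easy to reproduce on toy data. The following is a minimal sketch, not the code from the talk: the `Block` type, the fixed `column_width`, and the sample coordinates are all assumptions made for illustration, and a real layout parser infers columns rather than taking their width as a given.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x: float  # left edge of the text block on the page
    y: float  # top edge of the text block on the page


def horizontal_order(blocks):
    # Naive order: top to bottom, then left to right.
    # Lines from neighbouring columns end up interleaved.
    return [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]


def column_order(blocks, column_width):
    # Column-aware order: bucket each block by its left edge,
    # then read every column from top to bottom.
    return [
        b.text
        for b in sorted(blocks, key=lambda b: (int(b.x // column_width), b.y))
    ]


# Two columns, two lines each (hypothetical coordinates).
blocks = [
    Block("A1", 0, 0), Block("B1", 100, 0),
    Block("A2", 0, 10), Block("B2", 100, 10),
]
naive = horizontal_order(blocks)
fixed = column_order(blocks, column_width=100)
```

With these four blocks, `naive` interleaves articles A and B line by line, while `fixed` keeps each article contiguous.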

It gets worse — articles don't even live in one column. The Bao Đại exclusive interview starts in column 3 of page 1, runs into column 4, then jumps to "page 13, columns 3, 4, and 5" via a tiny « Suite page 13 » pointer the typesetter put at the bottom of the front-page fragment. One article. Two pages. Five columns. Spread across roughly a meter of physical newspaper if you laid it flat.
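Spotting the pointer itself is the easy part. As a rough sketch, assuming the pointer always follows the "Suite page N" wording seen in this issue (other issues may phrase it differently), a regular expression is enough to pull out the target page. The names `JUMP_RE` and `find_jump_target` are hypothetical.

```python
import re

# Matches "Suite page 13" and the variant "Suite en page 13", case-insensitively.
# Both the pattern and the variant are assumptions based on the single pointer quoted above.
JUMP_RE = re.compile(r"suite\s+(?:en\s+)?page\s+(\d+)", re.IGNORECASE)


def find_jump_target(fragment_text):
    """Return the page number a fragment jumps to, or None if there is no pointer."""
    match = JUMP_RE.search(fragment_text)
    return int(match.group(1)) if match else None


target = find_jump_target("... monarchie constitutionnelle. « Suite page 13 »")
```

The hard part, working out which columns on the target page belong to the same article, is not something a pattern match can settle.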

Then there are the ads — about 40% of an average page in this issue is paid content. Some are obvious. Many are dressed up as articles. Orlane sells beauty cream under the editorial-looking headline « À peau soignée, beau maquillage ». Pullnyl's « La révélation de l'année » reads like an editorial scoop until you notice it's selling nylon shirts at 1,200 francs. Frescafil organizes itself into numbered points like a structured editorial list; it's the Au Printemps department store's summer collection.

If your pipeline can't tell these from editorial, your knowledge base ships ad copy as facts — to historians, to researchers, to the public.

The pipeline

Once you accept all of that, the architecture pretty much writes itself.

Explaining the five stages to make the paper queryable

Five stages, end to end:

  1. Ingest. PDFs land in Cloud Storage, an Eventarc trigger fires Cloud Run, the pipeline starts.
  2. Parse. Document AI Layout Parser. Hierarchical block structure, cross-column reading order, French language hint to halve the OCR error rate on aged typography.
  3. Enrich. Gemini 2.5 Pro multimodal — and this is where the talk's actual contribution lives.
  4. Index. Cloud SQL Postgres 18 with pgvector for retrieval, BigQuery for analytics.
  5. Generate. Gemini grounded answers with inline citations.
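To make stage 1 concrete, here is a minimal sketch of the first thing an ingest handler has to do: decide whether an upload event is worth processing. It assumes the event payload carries the `bucket` and `name` fields used by Cloud Storage object events; the function name and the rule of skipping anything that is not a PDF are my own illustration, not the pipeline's actual code.

```python
from typing import Optional


def pdf_uri_from_storage_event(data: dict) -> Optional[str]:
    """Return the gs:// URI of a newly uploaded PDF, or None to skip the event."""
    bucket = data.get("bucket")
    name = data.get("name")
    if not bucket or not name:
        return None  # malformed or unrelated event
    if not name.lower().endswith(".pdf"):
        return None  # ignore thumbnails, manifests and other non-PDF uploads
    return f"gs://{bucket}/{name}"


# Hypothetical bucket and object names.
uri = pdf_uri_from_storage_event({"bucket": "archive", "name": "figaro/1955-05-10.pdf"})
```

Everything downstream (parse, enrich, index, generate) can then be keyed on that one URI.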

Stages 1, 4 and 5 are well-trodden — there are a hundred tutorials online. Stages 2 and 3 are where every team I've talked to falls down on newspaper data, so that's where I spent the most time.

Document AI Layout Parser gets us about 80% of the way. It segments columns correctly, groups body paragraphs, attaches photo captions to images. What it doesn't do — and can't, because it was trained on modern business documents — is recognize « Suite de la première page » pointers, distinguish advertorials from editorial, or stitch jumped articles back together.

That last 20% is what Stage 3 is for.

The trick: give Gemini both the Layout Parser JSON and the page image. The image gives Gemini visual context that JSON can't carry — the rules between articles, the boxed framing of ads, the way advertorials use a slightly different border style than editorial. The model can see what a 1955 newspaper reader saw.

The output is structured against a strict JSON schema: title, kicker, body, byline, kind classification (news / feature / opinion / advertorial / ad / listing), and a jump_target object if the article continues elsewhere.
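One way to hold the model to that schema is to validate every response before it reaches the database. The sketch below uses the field names listed above; the shape of `jump_target` (a `page` number and a list of `columns`) and the choice of which fields are optional are assumptions, since the exact schema is not spelled out here.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

KINDS = {"news", "feature", "opinion", "advertorial", "ad", "listing"}


@dataclass
class JumpTarget:
    page: int           # assumed field: page where the article continues
    columns: List[int]  # assumed field: columns it occupies on that page


@dataclass
class Article:
    title: str
    body: str
    kind: str
    kicker: Optional[str] = None
    byline: Optional[str] = None
    jump_target: Optional[JumpTarget] = None


def parse_article(raw: str) -> Article:
    """Parse one model response and reject any kind outside the closed set."""
    data = json.loads(raw)
    if data["kind"] not in KINDS:
        raise ValueError(f"unexpected kind: {data['kind']!r}")
    jump = data.get("jump_target")
    return Article(
        title=data["title"],
        body=data["body"],
        kind=data["kind"],
        kicker=data.get("kicker"),
        byline=data.get("byline"),
        jump_target=JumpTarget(jump["page"], list(jump["columns"])) if jump else None,
    )


sample = parse_article(json.dumps({
    "title": "Une interview exclusive",
    "body": "...",
    "kind": "news",
    "jump_target": {"page": 13, "columns": [3, 4, 5]},
}))
```

Rejecting unknown `kind` values at this point matters because later stages filter on that field.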

That kind classification turns out to be the linchpin of the whole system — but you don't see why until you get to retrieval.

Retrieval is hybrid by necessity

The query that runs against the database has two CTEs side by side.

The lexical CTE uses PostgreSQL's tsvector with the 'french' configuration — diacritics stripped, French stopwords filtered, the Snowball French stemmer applied. Exact-term precision. Type Bao Đại and every chunk containing that exact tokenisation surfaces.

The semantic CTE uses pgvector's cosine-distance operator on 768-dimensional embeddings from text-multilingual-embedding-002. Paraphrase tolerance. Ask "what did the South Vietnamese ruler say about elections" and the semantic search finds chunks discussing « régime républicain ou monarchie constitutionnelle » even when no terms overlap.

Both CTEs filter on kind IN ('news', 'feature', 'opinion') and a date window.
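For readers who want to picture the query, here is a rough sketch of the two CTEs, written as a Python string with psycopg-style named parameters. The table and column names (`chunks`, `body_tsv`, `embedding`, `issue_date`) are hypothetical, as is the helper that builds the parameters. The functions `plainto_tsquery` and `ts_rank` and the pgvector `<=>` cosine-distance operator are standard, but this is not the query from the demo.

```python
HYBRID_SQL = """
WITH lexical AS (
    SELECT id,
           ROW_NUMBER() OVER (
               ORDER BY ts_rank(body_tsv, plainto_tsquery('french', %(q)s)) DESC
           ) AS rnk
    FROM chunks
    WHERE body_tsv @@ plainto_tsquery('french', %(q)s)
      AND kind IN ('news', 'feature', 'opinion')
      AND issue_date BETWEEN %(start)s AND %(end)s
    ORDER BY rnk
    LIMIT %(n)s
),
semantic AS (
    SELECT id,
           ROW_NUMBER() OVER (ORDER BY embedding <=> %(qvec)s::vector) AS rnk
    FROM chunks
    WHERE kind IN ('news', 'feature', 'opinion')
      AND issue_date BETWEEN %(start)s AND %(end)s
    ORDER BY rnk
    LIMIT %(n)s
)
SELECT 'lexical' AS source, id, rnk FROM lexical
UNION ALL
SELECT 'semantic' AS source, id, rnk FROM semantic;
"""


def hybrid_params(question, query_vec, start, end, limit=50):
    """Build the parameter dict; the vector is sent in pgvector's '[x,y,...]' text form."""
    return {
        "q": question,
        "qvec": "[" + ",".join(str(x) for x in query_vec) + "]",
        "start": start,
        "end": end,
        "n": limit,
    }


params = hybrid_params("Bao Đại", [0.1, 0.2], "1955-01-01", "1955-12-31")
```

A real query vector would have 768 entries; two are used here only to keep the example readable. The query returns both ranked lists, leaving the fusion step to the caller.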

This is where the Gemini stage-3 classification earns its keep. When someone asks an editorial question, advertorial chunks are pre-filtered out before either ranker even sees them. Pullnyl's nylon-shirt ad copy never gets the chance to be served as "what Le Figaro wrote about 1955 fashion." You pay for the multimodal classification once at ingest; you collect the dividend on every retrieval forever.

The two CTEs feed into Reciprocal Rank Fusion with k=60, a way to combine rankers that produce scores on incommensurable scales using only rank positions and one smoothing constant. Each document scores the sum of 1/(k + rank) over the rankers that returned it, so RRF rewards documents that show up well in both. A document at rank 2 in lexical and rank 3 in semantic (about 0.0320) beats a document at rank 1 in lexical and rank 50 in semantic (about 0.0255). Calibrated retrieval rather than spiky retrieval.

Lexical for precision. Semantic for paraphrase. RRF to fuse them without having to calibrate the score scales.
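The fusion step is small enough to write out in full. This is a minimal sketch of Reciprocal Rank Fusion in plain Python, assuming each ranker hands back an ordered list of chunk ids; the function names are mine.

```python
def rrf_score(ranks, k=60):
    """Score of one document given its 1-based rank in each ranker that returned it."""
    return sum(1.0 / (k + rank) for rank in ranks)


def rrf_fuse(rankings, k=60):
    """Fuse several ordered id lists into one list, best fused score first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Document A: rank 2 lexical, rank 3 semantic.
# Document B: rank 1 lexical, rank 50 semantic.
lexical = ["B", "A"]
semantic = ["s1", "s2", "A"] + [f"s{i}" for i in range(4, 50)] + ["B"]
fused = rrf_fuse([lexical, semantic])
```

Here `fused` starts with A, then B: two good ranks outweigh one first place paired with a rank of 50. The filler ids that appear in only one list score at most 1/61 and land below both.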

The Mauritius angle

I closed the talk on why this matters here specifically.

Two hundred years of Mauritian newsprint sitting in PDFs and microfilm. Le Cernéen from 1832 to 1982 — the earliest record of life under British rule, slavery's abolition, indentured-labour arrivals, the long road to independence. Le Mauricien from 1908. l'express from 1963. Advance, Week-End, 5-Plus dimanche. Plus the holdings of the National Library of Mauritius, the Mahatma Gandhi Institute archives, the official gazette since 1773.

Almost none of it queryable today.

Building this index is, in a real sense, an act of recovery. We are putting our history into a form our great-grandchildren can actually ask questions of.

The pipeline I demonstrated runs end-to-end without copying historical content out of the region. africa-south1 is one network hop from Mauritius. For institutions like the National Library or the MGI, that data residency story isn't a nice-to-have — it's a precondition for cooperation.

The technology is here. The cost is accessible. The data residency story works. The only thing missing — for our island specifically — is for someone to start.

What's next

A few directions on my list:

  • A knowledge graph layer: entities, dates, geographies, navigable visually. Click a place, see every article that ever mentioned it.
  • Multimodal photo Q&A: ask "show me Cannes festival photos 1950 to 1960" and have Gemini read images and captions together.
  • Cross-paper queries: Le Figaro plus l'express plus Le Cernéen plus the gazette. Same question, multiple regional perspectives, side by side.
  • Era-aware retrieval: 1955 vocabulary differs from 2025 vocabulary in ways that matter for embedding quality. Time-conditioned embeddings are open research; there's a paper waiting to be written here.

The hobby project continues, and by DevFest 2026 I'm hoping to have something more structured than a Python script firing a single query. A proper service layer, a frontend a researcher can actually navigate, maybe the cross-paper piece running across two or three titles. We'll see how far I get.

Thanks

Thank you to SWAN HQ for hosting a great Build with AI event in Port-Louis, to the Google Developer Group community for putting the day together, and to everyone who came up after the talk with questions, war stories, and corrections that I'm still chewing on. The slides are at the end of this post; the demo source code goes up on GitHub this week.


Gallery

Attendees at the Build with AI 2026 edition
Jochen Kirstätter with opening remarks and Devesh from SWAN welcoming the attendees
Noor Yadallee doing a live demo of American Sign Language using MediaPipe and Gemini Nano from a Flutter application running from his mobile phone
Ish Sookun presenting a case study of asking a specific question on generic AI chatbots vs a model trained on newspaper archives
Ish Sookun using the example of the freely available Le Figaro archive of 1955 from the Bibliothèque nationale de France (BnF)
Build with AI 2026 — Mauritius / SWAN Head Office, Port-Louis (Group Picture)