Auto-Translating English RSS Feeds to Traditional Chinese with DeepL

I’ve been building a personal RSS aggregator that pulls from 17+ developer leadership sources — most of them English. Reading English is fine, but I wanted the aggregated view to feel more native, so I decided to wire in automatic translation.

DeepL was the obvious choice. I’ve used it manually for years, and it handles technical content noticeably better than Google Translate, especially nuanced phrasing. The free tier allows 500K characters per month, or roughly 16,000 characters per day. That is about 50 articles per day if each one sends around 330 characters (a title plus a short summary) through the API. Enough.

Detecting What Actually Needs Translation

Before translating anything, I needed to figure out whether the content was already in Chinese. Rather than pulling in a third-party library, I wrote a small custom detector that scans Unicode character ranges and returns a language code: ISO 639-3 for the CJK languages (cmn, jpn, kor) and plain 'en' as the fallback.

export function detectLanguage(text: string): string {
  if (!text || text.length < 10) return 'en';

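  // Check kana first: Japanese text also contains Han characters (kanji),
  // so testing for Han first would misclassify kanji-heavy Japanese as Chinese.
  const japaneseChars = text.match(/[぀-ゟ゠-ヿ]/g);
  if (japaneseChars && japaneseChars.length / text.length > 0.2) return 'jpn';

  // CJK Unified Ideographs (plus Extension A) cover both Traditional and Simplified Chinese
  const chineseChars = text.match(/[一-鿿㐀-䶿]/g);
  if (chineseChars && chineseChars.length / text.length > 0.3) return 'cmn';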

  const koreanChars = text.match(/[가-힯ᄀ-ᇿ]/g);
  if (koreanChars && koreanChars.length / text.length > 0.2) return 'kor';

  return 'en';
}

The approach is simple: if more than 30% of characters fall in the CJK Unified Ideographs range, it’s Chinese (cmn); similar thresholds apply for Hiragana/Katakana (Japanese) and Hangul (Korean); everything else defaults to English. No dependencies, no model downloads, and the character-ratio heuristic is surprisingly robust for the kind of content I’m ingesting.

I also set a minimum length threshold on the combined title and description. Under ~80 characters there isn’t enough signal for a reliable detection, so I skip translation entirely rather than risk a false positive burning API quota.
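
The gating step described above might look something like the following sketch. The names shouldTranslate and MIN_DETECT_LENGTH are hypothetical, the 80-character cutoff is the approximate figure mentioned above, and the detector is passed in as a parameter so the snippet stands on its own; in the real pipeline it would be detectLanguage.

```typescript
// Hypothetical gate: decide whether a feed item is worth sending to the translation API.
type Detector = (text: string) => string;

// Approximate cutoff from the text; below this there is too little signal to classify.
const MIN_DETECT_LENGTH = 80;

function shouldTranslate(title: string, description: string, detect: Detector): boolean {
  const combined = `${title} ${description}`.trim();

  // Too short to classify reliably: skip rather than spend quota on a guess.
  if (combined.length < MIN_DETECT_LENGTH) return false;

  // Already Chinese: nothing to translate. Whether 'jpn' or 'kor' items
  // should also be skipped is a policy choice this sketch does not make.
  return detect(combined) !== 'cmn';
}
```

Keeping the length check ahead of the detector call means short items never reach the heuristic at all.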

The Code Block Problem

Technical articles have code blocks, and you really don’t want those going through a translation API. DeepL will sometimes try to “translate” variable names or comments, which breaks the examples.

My solution was to strip code blocks before translating, replace them with numbered placeholders, translate the surrounding prose, then reinsert:

function stripCodeBlocks(text) {
  const blocks = [];
  const stripped = text.replace(/```[\s\S]*?```/g, (match) => {
    blocks.push(match);
    return `[[CODE_BLOCK_${blocks.length - 1}]]`;
  });
  return { stripped, blocks };
}

function reinsertCodeBlocks(text, blocks) {
  // Leave unknown placeholders untouched instead of inserting "undefined"
  return text.replace(/\[\[CODE_BLOCK_(\d+)\]\]/g, (match, i) => blocks[Number(i)] ?? match);
}
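
One risk with placeholder substitution is that the translation step may alter or drop a marker. A cheap safeguard, sketched here under the assumption that markers use the [[CODE_BLOCK_n]] format above, is to verify that every marker survived exactly once before reinserting. The function name placeholdersIntact is hypothetical.

```typescript
// Hypothetical helper: true only if markers 0..blockCount-1 each appear exactly once.
function placeholdersIntact(translated: string, blockCount: number): boolean {
  const counts: number[] = [];
  for (let i = 0; i < blockCount; i++) counts.push(0);

  const pattern = /\[\[CODE_BLOCK_(\d+)\]\]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(translated)) !== null) {
    const index = Number(match[1]);
    if (index >= blockCount) return false; // a marker that was never issued
    counts[index] += 1;
  }

  // Every issued marker must appear exactly once.
  return counts.every((count) => count === 1);
}
```

If the check fails, one option is to fall back to the untranslated original rather than reinserting into damaged text.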

For HTML content in RSS feeds, I used DeepL’s built-in tag_handling=html parameter instead. It preserves tags and only translates text nodes, which is cleaner than manual stripping for HTML.
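
For HTML feeds that embed code in pre or code elements, the request body could also ask DeepL to leave those elements alone. The sketch below only builds the JSON payload. The builder name is hypothetical, and it assumes DeepL's ignore_tags parameter is honored alongside tag_handling 'html', which is worth confirming against the API documentation before relying on it.

```typescript
// Subset of the JSON body accepted by DeepL's /v2/translate endpoint.
interface DeepLHtmlRequest {
  text: string[];
  target_lang: string;
  tag_handling: 'html';
  ignore_tags?: string[];
}

// Hypothetical builder. Assumption: ignore_tags works together with
// tag_handling 'html'; verify this in the DeepL docs before depending on it.
function buildHtmlRequest(html: string, targetLang: string): DeepLHtmlRequest {
  return {
    text: [html],
    target_lang: targetLang,
    tag_handling: 'html',
    ignore_tags: ['pre', 'code'], // elements whose contents should not be translated
  };
}
```

The result would be passed through JSON.stringify as the body of the POST request.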

Calling DeepL and Handling Rate Limits

The API call itself is straightforward:

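const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function translate(text, retriesLeft = 1) {
  const response = await fetch('https://api-free.deepl.com/v2/translate', {
    method: 'POST',
    headers: {
      'Authorization': `DeepL-Auth-Key ${process.env.DEEPL_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      text: [text],
      target_lang: 'ZH-HANT', // Traditional Chinese; plain 'ZH' leaves the variant unspecified
      tag_handling: 'html',
    }),
  });

  if (response.status === 429 && retriesLeft > 0) {
    // Retry-After is in seconds; fall back to 5 if the header is missing or malformed
    const retryAfter = parseInt(response.headers.get('Retry-After') ?? '', 10) || 5;
    await sleep(retryAfter * 1000);
    return translate(text, retriesLeft - 1); // retry once, then give up
  }

  // Surface any other failure instead of trying to parse an error body
  if (!response.ok) {
    throw new Error(`DeepL request failed with status ${response.status}`);
  }

  const data = await response.json();
  return data.translations[0].text;
}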

In practice I rarely hit 429 because I’m not translating in bulk bursts; articles trickle in as the feed updates. But the rate limit does occasionally trip in the morning when a lot of sources post at once, so the retry logic earns its keep. Running out of the monthly character quota is a different failure (DeepL reports it as HTTP 456), and retrying won’t help there.

Storing Both Versions

I store both the original and translated text in the article record (title_zh, content_zh, summary_zh columns alongside the originals). This lets me display the Chinese version by default while keeping the original for fallback or comparison. It also means I’m not re-translating on every read.

Deduplication is handled at the database level: the articles table has a UNIQUE constraint on the url column, so a second attempt to insert the same article simply fails with a constraint violation. No hash cache, no extra lookup — the constraint is the guard.

One thing I haven’t fully solved: technical terms. Some things translate poorly: “observability” became something approximating “可觀察性”, which is technically correct but not how Chinese engineers actually talk about it. My current workaround is keeping the original English term in parentheses for a short list of known jargon. It’s a bit manual, but the output reads better.
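
The parenthetical workaround could be applied as a post-processing pass over the translated text. The sketch below assumes a hand-maintained map from the translated term to its English original; the map contents and the name annotateJargon are hypothetical, not the list actually in use.

```typescript
// Hypothetical glossary: translated term -> original English jargon.
const JARGON: Record<string, string> = {
  '可觀察性': 'observability',
};

// Append the English term in parentheses after each occurrence of a known
// translated term, leaving occurrences that are already annotated unchanged.
function annotateJargon(translated: string, glossary: Record<string, string>): string {
  const SENTINEL = '\u0000'; // assumed never to appear in feed text
  let result = translated;
  for (const zh of Object.keys(glossary)) {
    const annotated = `${zh} (${glossary[zh]})`;
    // split/join replaces every occurrence without any regex escaping.
    result = result.split(annotated).join(SENTINEL); // protect existing annotations
    result = result.split(zh).join(annotated);       // annotate bare occurrences
    result = result.split(SENTINEL).join(annotated); // restore protected ones
  }
  return result;
}
```

Because already-annotated terms are protected first, running the pass twice on the same text does not stack parentheses.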

Translation Quality

Honestly, the quality surprised me. For leadership and engineering culture articles — the kind of long-form reflective writing that makes up most of my sources — DeepL’s output is clean enough that I don’t feel the need to check the original. For highly technical deep-dives with lots of specific tool names, it’s still better than Google Translate but occasionally produces phrasing that feels slightly off.

For a personal aggregator where I’m the only reader, that’s more than good enough.