🔤

Text Encoding Detector

Browser-based detector for unknown text encodings with built-in sample loader, conversion tips, and localized documentation.


What is Character Encoding

Character encoding is the system that maps characters to binary values for computer storage and transmission. Different encodings use different mappings: ASCII uses 7 bits (128 characters, English only); ISO-8859-1 (Latin-1) extends that to 8 bits (256 characters, Western European); GB2312/GBK cover simplified Chinese, Big5 traditional Chinese, and Shift-JIS Japanese; UTF-8 uses 1-4 bytes per character (universal, backward compatible with ASCII); UTF-16 uses 2 or 4 bytes. Mojibake (garbled text such as �) occurs when text encoded in one charset is decoded with another. A BOM (Byte Order Mark) is an optional signature at the start of a file that identifies its encoding. Proper encoding detection prevents data corruption and ensures correct text display across systems and languages.
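
To see the mismatch concretely, here is a minimal Python sketch (the word "café" is just an arbitrary example):

data = "café".encode("utf-8")    # the real bytes on disk: b'caf\xc3\xa9'

print(data.decode("utf-8"))      # café   <- decoded with the correct charset
print(data.decode("cp1252"))     # cafÃ©  <- same bytes, wrong charset: mojibake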

Features

🔍

Smart Detection

Automatically detect text encoding using statistical analysis, BOM detection, and character pattern recognition. Support UTF-8/16/32, GBK, GB2312, Big5, Shift-JIS, the ISO-8859 series, and Windows-1252, with confidence scores
🔄

Charset Conversion

Convert text between any supported encodings: UTF-8 ↔ GBK ↔ Big5, fix mojibake issues, add or remove BOM markers, handle surrogate pairs and combining characters correctly
🩺

Encoding Diagnosis

Identify encoding problems: invalid byte sequences, mixed encodings in the same file, BOM mismatches, and surrogate pair errors, with fix suggestions and detailed error reports
📦

Batch Processing

Process multiple files simultaneously with encoding detection, convert entire directories, preserve file structure, generate conversion reports with statistics and error logs
🎯Application Scenarios

🏢

Legacy system migration

Convert GBK/Big5 archives to UTF-8 while modernizing finance, ERP or government document systems.
🌐

Global content QA

Content and SEO teams verify that multilingual sites, feeds and emails declare the correct charset and avoid mojibake.
🧾

Data ingestion pipelines

Validate the encoding of partner logs, CSV exports and ETL inputs before pushing them into warehouses or Spark jobs.
🛠️

Developer debugging

Quickly pinpoint BOM conflicts or mixed encodings while reviewing Git diffs or database exports.

📋Usage Guide

1️⃣
Upload or Paste
Upload a text file or paste text content for encoding analysis
2️⃣
Detect Encoding
Click Detect to identify the encoding automatically, or specify it manually if known
3️⃣
Review Results
Check detected encoding, confidence level, BOM presence, and preview text
4️⃣
Convert if Needed
Select a target encoding, convert the text, and download or copy the result
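
The same detect-then-convert workflow can also be scripted outside the browser. Below is a rough sketch using the chardet library mentioned in the technical notes further down; input.txt and output.txt are placeholder file names.

import chardet   # pip install chardet

raw = open("input.txt", "rb").read()

guess = chardet.detect(raw)                  # e.g. {'encoding': 'GB2312', 'confidence': 0.99, ...}
print(guess["encoding"], guess["confidence"])

text = raw.decode(guess["encoding"])         # step 3: review a decoded preview
print(text[:200])

with open("output.txt", "w", encoding="utf-8") as f:   # step 4: convert to UTF-8
    f.write(text)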

📚Technical Introduction

🔤Encoding Standards

Character encodings evolved to support different languages: ASCII (1963, 7-bit, 128 characters, English). The ISO-8859 series (8-bit, 256 characters, regional: -1 Latin, -2 Central European, -5 Cyrillic, -6 Arabic). Double-byte character sets (DBCS) for Asian languages: GB2312 (1980, 6763 simplified Chinese characters), GBK (21886 characters, a GB2312 extension), Big5 (13060 traditional Chinese characters), Shift-JIS (Japanese, with complex byte rules). The Unicode consortium created a universal character set with several encodings: UTF-8 (variable 1-4 bytes, ASCII compatible, the web standard), UTF-16 (2 or 4 bytes, the Windows/Java default, with a BOM to signal byte order), UTF-32 (fixed 4 bytes, wasteful but simple). Modern systems prefer UTF-8 for storage and UTF-16 for in-memory processing.
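
UTF-8's variable width is easy to verify with a tiny Python check:

for ch in ("A", "é", "中", "😀"):
    print(ch, len(ch.encode("utf-8")), "bytes")   # 1, 2, 3 and 4 bytes respectively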

🔍Detection Algorithms

Encoding detection uses multiple techniques: 1) BOM detection: UTF-8 (EF BB BF), UTF-16 LE (FF FE), UTF-16 BE (FE FF), UTF-32 LE (FF FE 00 00). 2) Statistical analysis: character frequency distribution, byte patterns, valid byte sequences. Libraries like chardet (Python), ICU (C++), jschardet (JavaScript) use character n-grams and language models trained on sample texts. 3) Validation: check if bytes form valid sequences for encoding (UTF-8 has specific continuation byte rules, GB2312 has defined code ranges). 4) Heuristics: file extension (.txt), HTTP headers (charset), XML declaration (<?xml encoding="">). Confidence scores combine multiple signals. False positives occur with short texts or rare characters.
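
A toy Python version of two of those techniques, BOM sniffing and validation by trial decoding; real detectors such as chardet, jschardet and ICU layer statistical language models on top of this:

BOMS = [
    (b"\xef\xbb\xbf", "utf-8-sig"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),   # must be tested before UTF-16 LE (same first two bytes)
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

def sniff_bom(data):
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None                            # no BOM: fall back to statistics / validation

def plausible_encodings(data, candidates=("utf-8", "gbk", "big5", "cp1252")):
    ok = []
    for enc in candidates:
        try:
            data.decode(enc)               # does the data satisfy this encoding's byte rules?
            ok.append(enc)
        except UnicodeDecodeError:
            pass
    return ok

sample = "编码检测".encode("gbk")
print(sniff_bom(sample))                   # None (no BOM present)
print(plausible_encodings(sample))         # often more than one survives; statistics break the tie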

🔄Encoding Conversion

Converting between encodings requires two steps: 1) Decode the source bytes to Unicode code points using the source encoding. 2) Encode those code points into the target encoding. Challenges: unmappable characters (not all Unicode characters exist in legacy encodings) are handled with a replacement character (�), HTML entities, or an error. Normalization: Unicode has multiple representations of the same character (é can be the single code point U+00E9 or e plus a combining accent); NFC normalizes to the composed form, NFD to the decomposed form. BOM handling: add one for UTF-16/32; it is optional for UTF-8 (usually omitted). Line endings: CRLF (Windows) vs LF (Unix) need separate handling. Streaming conversion for large files processes chunks with stateful decoders that maintain context between chunks.
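
A compact Python sketch of that pipeline, standard library only; the error-handling modes shown are the usual options for unmappable characters:

import unicodedata

def convert(data, src, dst, errors="replace"):
    text = data.decode(src)                       # 1) source bytes -> Unicode code points
    text = unicodedata.normalize("NFC", text)     #    normalize to composed form (é = U+00E9)
    return text.encode(dst, errors=errors)        # 2) code points -> target bytes

utf8 = "你好 café".encode("utf-8")
print(convert(utf8, "utf-8", "gbk"))              # GBK bytes (unmappable chars, if any, become '?')
print(convert(utf8, "utf-8", "ascii"))            # b'?? caf?' (every non-ASCII char replaced)
print(convert(utf8, "utf-8", "ascii", errors="xmlcharrefreplace"))   # b'&#20320;&#22909; caf&#233;'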

🐛Mojibake and Fixes

Mojibake (文字化け, garbled text) results from an encoding mismatch: UTF-8 text interpreted as Latin-1 shows Ã© instead of é, and Chinese text turns into � or strings of accented Latin gibberish. Common causes: a server sends UTF-8 without a charset header, an editor saves with the wrong encoding, a database stores UTF-8 in a Latin-1 column. Fixes: 1) Re-decode with the correct encoding: if the text is UTF-8 but was decoded as Latin-1, re-encode it back to Latin-1 bytes, then decode those bytes as UTF-8. 2) Use an encoding detection library. 3) Check HTTP headers, HTML meta tags, and XML declarations. Prevention: use UTF-8 everywhere, declare the encoding explicitly, and validate data at boundaries. The ftfy (Python) library automatically fixes mojibake using statistical patterns.
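
The first fix is a one-liner in Python; the damage is simulated here so the sketch is self-contained (ftfy is the library named above):

broken = "café 中文".encode("utf-8").decode("cp1252")   # simulate the mis-decode
print(broken)                                            # prints 'cafÃ©' followed by gibberish for 中文

fixed = broken.encode("cp1252").decode("utf-8")          # reverse the wrong step, redo the right one
print(fixed)                                             # café 中文

# Or let ftfy guess the repair automatically (pip install ftfy):
# import ftfy
# print(ftfy.fix_text(broken))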

Frequently Asked Questions

How accurate is the encoding detection?

The detector combines BOM detection, statistical analysis, and byte-pattern validation to estimate the most likely charset. Each run also exposes a confidence score so you know when additional manual review is recommended.
💬

Can I convert the text after detection?

Yes. Once an encoding is identified you can choose any target charset, convert the text entirely in the browser, and download or copy the converted output without uploading files.
🔍

What does the Add or Remove BOM option do?

A BOM (Byte Order Mark) is an optional signature at the start of UTF files. Adding a BOM helps some Windows tools detect encoding, while removing it keeps files lean for UNIX environments. Toggle the option based on where the text will be used.
💡

Why do I still see garbled characters after conversion?

If text was previously decoded with the wrong charset, the damage may already be baked into the saved data. Try reloading the original file, make sure the correct source encoding is selected, and convert again. Mixed encodings inside a single file can also produce mojibake.
📚

Is any of my text uploaded or stored?

No. Detection and conversion happen entirely in your browser. Files never leave your device so confidential documents remain private.

💡Best Practices

💡

Always Use UTF-8

Use UTF-8 encoding everywhere by default - databases, files, HTTP headers, HTML meta tags. UTF-8 is universal, supports all languages, is backward compatible with ASCII, and is the web standard. Declare the encoding explicitly in HTML (<meta charset="UTF-8">) and HTTP headers (Content-Type: text/html; charset=utf-8). This prevents mojibake and ensures consistent text display across systems and languages.
🔍

Detect Before Converting

Always detect encoding before attempting conversion to avoid data corruption. Use confidence scores from detection libraries to verify accuracy. For ambiguous cases (low confidence), inspect sample text manually or try multiple encodings with visual verification. Never assume encoding based solely on file extension or origin - always validate.
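
One way to do that visual check, sketched in Python (sample.bin is a placeholder for the ambiguous file; adjust the candidate list to the languages you expect):

raw = open("sample.bin", "rb").read()[:400]        # a short slice is enough for a preview

for enc in ("utf-8", "gbk", "big5", "shift_jis", "cp1252"):
    try:
        print(f"{enc:10s} {raw.decode(enc)[:60]!r}")
    except UnicodeDecodeError as err:
        # note: the slice may also cut a multi-byte sequence in half; this is only a preview
        print(f"{enc:10s} invalid: {err.reason}")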
📝

Handle BOM Correctly

Check for a BOM (Byte Order Mark) at the start of the file as a strong encoding signal. The UTF-8 BOM (EF BB BF) is optional but helps with detection. UTF-16/32 needs a BOM, or an explicit LE/BE variant label, to determine byte order. Some systems expect a BOM, others reject it, so know your target system. Remove or add the BOM as needed during conversion.
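
In Python the standard codecs module carries the BOM constants, so adding or stripping one is a couple of lines (a small sketch):

import codecs

text = "hello"

with_bom = codecs.BOM_UTF8 + text.encode("utf-8")   # b'\xef\xbb\xbfhello' (some Windows tools want this)
print(with_bom.decode("utf-8-sig"))                 # 'utf-8-sig' strips a leading BOM if present: hello

# For UTF-16, prepend the BOM that matches the byte order you chose:
utf16 = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
print(utf16[:2])                                    # b'\xff\xfe'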
🩺

Test Mojibake Fixes

When fixing garbled text, identify the original and the misinterpreted encodings through pattern analysis. Common mojibake: UTF-8 read as Latin-1 (Ã© instead of é), Chinese shown as ? or random characters. Re-encode the garbled text with the encoding it was wrongly decoded as to recover the original bytes, then decode those bytes with the correct encoding. Test fixes on sample data before processing entire files. Some mojibake is irreversible if the data was already corrupted in the database.
⚠️

Validate at Boundaries

Validate encoding at system boundaries - file reads, HTTP requests, database queries, API calls. Use charset declarations in all data transfers. Sanitize and validate input text to detect invalid byte sequences early. Log encoding issues for debugging. Implement fallback strategies for encoding detection failures. Never mix encodings within single file or database column.
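
A sketch of strict validation at an ingestion boundary; read_upload() is a hypothetical source of raw bytes, swap in your own I/O:

def ensure_utf8(data, source):
    """Decode strictly and fail fast with a useful message instead of letting mojibake propagate."""
    try:
        return data.decode("utf-8")                 # strict error handling by default
    except UnicodeDecodeError as err:
        # log the offending offset and bytes so the producer can be fixed
        raise ValueError(
            f"{source}: invalid UTF-8 at byte {err.start}: {data[err.start:err.start + 4]!r}"
        ) from err

# text = ensure_utf8(read_upload(), "partner-feed")   # example call; read_upload() is hypothetical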

🔗Related Documents

🔤RFC 3629 - UTF-8 Specification - UTF-8 encoding format standard
🔧ICU Character Encoding Detection - International Components for Unicode detection library
📚Character Encoding Best Practices - W3C guidelines for handling character encodings
