Browser-based detector for unknown text encodings with built-in sample loader, conversion tips, and localized documentation.
Features
- Smart Detection: Automatically detect text encoding using statistical analysis, BOM detection, and character pattern recognition. Supported encodings include UTF-8/16/32, GBK, GB2312, Big5, Shift-JIS, the ISO-8859 series, and Windows-1252, each reported with a confidence score
- Charset Conversion: Convert text between any pair of supported encodings (e.g. UTF-8 ↔ GBK ↔ Big5), fix mojibake, add or remove BOM markers, and handle surrogate pairs and combining characters correctly
- Encoding Diagnosis: Identify encoding problems such as invalid byte sequences, mixed encodings within a single file, BOM mismatches, and surrogate-pair errors, with fix suggestions and detailed error reports
- Batch Processing: Process multiple files simultaneously with encoding detection, convert entire directories, preserve file structure, generate conversion reports with statistics and error logs
Use Cases
- Legacy system migration: Convert GBK/Big5 archives to UTF-8 while modernizing finance, ERP or government document systems.
- Global content QA: Content and SEO teams verify that multilingual sites, feeds and emails declare the correct charset and avoid mojibake.
- Data ingestion pipelines: Validate the encoding of partner logs, CSV exports and ETL inputs before pushing them into warehouses or Spark jobs.
- Developer debugging: Quickly pinpoint BOM conflicts or mixed encodings while reviewing Git diffs or database exports.
Usage Guide
- Upload or Paste: Upload a text file or paste text content for encoding analysis; see the sketch after this list for what happens under the hood
- Detect Encoding: Click detect to automatically identify encoding, or manually specify if known
- Review Results: Check detected encoding, confidence level, BOM presence, and preview text
- Convert if Needed: Select target encoding, convert text, download or copy converted result
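The following TypeScript sketch shows roughly what these steps do in the browser, using only standard web APIs (`File.arrayBuffer`, `TextDecoder`, `TextEncoder`, `Blob`); the function name and structure are illustrative, not the tool's actual code.

```ts
// Rough browser-side flow for the four steps above (illustrative names only).
async function detectAndConvert(file: File, sourceLabel: string): Promise<void> {
  // 1) Read the uploaded file as raw bytes.
  const bytes = new Uint8Array(await file.arrayBuffer());

  // 2)-3) Decode with the detected or manually chosen encoding and preview it.
  //        TextDecoder accepts labels such as "utf-8", "gbk", "big5", "shift_jis".
  const text = new TextDecoder(sourceLabel).decode(bytes);
  console.log(`Preview: ${text.slice(0, 200)}`);

  // 4) Re-encode as UTF-8 and offer the result as a download.
  const utf8 = new TextEncoder().encode(text);
  const url = URL.createObjectURL(new Blob([utf8], { type: "text/plain;charset=utf-8" }));
  console.log(`Download link: ${url}`);
}
```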
Technical Details
Encoding Standards
Character encodings evolved to support different languages. ASCII (1963) is a 7-bit code with 128 characters covering English. The ISO-8859 series extends it to 8 bits (256 characters) with regional variants: -1 Western European, -2 Central European, -5 Cyrillic, -6 Arabic. Double-byte character sets (DBCS) cover East Asian languages: GB2312 (1980, 6,763 simplified Chinese characters), GBK (21,886 characters, a GB2312 extension), Big5 (13,060 traditional Chinese characters), and Shift-JIS (Japanese, with complex byte-range rules). The Unicode Consortium defined a universal character set with several encoding forms: UTF-8 (variable 1-4 bytes, ASCII-compatible, the web standard), UTF-16 (2 or 4 bytes per character, the native string format in Windows and Java, usually written with a BOM), and UTF-32 (fixed 4 bytes, space-hungry but simple). Modern systems prefer UTF-8 for storage and interchange, and UTF-16 for in-memory processing.
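As a quick illustration of how the same character maps to different byte sequences, here is a small TypeScript snippet using the browser's `TextEncoder`/`TextDecoder` (which can decode, though not encode, legacy charsets such as GBK):

```ts
// "中" (U+4E2D) is three bytes in UTF-8 but two bytes in GBK.
const utf8 = new TextEncoder().encode("中");        // Uint8Array [0xE4, 0xB8, 0xAD]
const gbk  = new Uint8Array([0xd6, 0xd0]);          // the GBK encoding of "中"

console.log(new TextDecoder("utf-8").decode(utf8)); // "中"
console.log(new TextDecoder("gbk").decode(gbk));    // "中"
```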
Detection Algorithms
Encoding detection uses multiple techniques: 1) BOM detection: UTF-8 (EF BB BF), UTF-16 LE (FF FE), UTF-16 BE (FE FF), UTF-32 LE (FF FE 00 00); note that the UTF-32 LE signature begins with the UTF-16 LE one, so it must be checked first. 2) Statistical analysis: character frequency distribution, byte patterns, valid byte sequences. Libraries like chardet (Python), ICU (C++), and jschardet (JavaScript) use character n-grams and language models trained on sample texts. 3) Validation: check whether the bytes form valid sequences for the candidate encoding (UTF-8 has specific continuation-byte rules, GB2312 has defined code ranges). 4) Heuristics: file extension, HTTP headers (charset parameter), XML declaration (<?xml encoding="...">). Confidence scores combine multiple signals. False positives occur with short texts or rare characters.
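Below is a minimal TypeScript sketch of the first and third signals (BOM sniffing and validation), using only standard browser APIs; a production detector such as jschardet layers statistical analysis on top of checks like these. Function names are illustrative.

```ts
// Signal 1: BOM sniffing. UTF-32 LE is tested before UTF-16 LE because its
// signature (FF FE 00 00) starts with the UTF-16 LE signature (FF FE).
function sniffBom(bytes: Uint8Array): string | null {
  if (bytes.length >= 4 && bytes[0] === 0xff && bytes[1] === 0xfe &&
      bytes[2] === 0x00 && bytes[3] === 0x00) return "UTF-32LE";
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) return "UTF-8";
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) return "UTF-16LE";
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) return "UTF-16BE";
  return null;
}

// Signal 3: validation. Strict decoding rejects byte sequences that are not valid UTF-8.
function isValidUtf8(bytes: Uint8Array): boolean {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}
```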
Encoding Conversion
Converting between encodings is a two-step process: 1) decode the source bytes to Unicode code points using the source encoding; 2) encode those code points in the target encoding. Challenges include: unmappable characters (not every Unicode character exists in legacy encodings), handled with a replacement character (�), an HTML entity, or an error; normalization (Unicode allows multiple representations of the same character, e.g. é as the single code point U+00E9 or as e plus a combining accent), where NFC normalizes to the composed form and NFD to the decomposed form; BOM handling (typically added for UTF-16/32, optional and usually omitted for UTF-8); and line endings (CRLF on Windows vs LF on Unix), which must be handled separately. Streaming conversion of large files processes chunks with stateful decoders that carry context across chunk boundaries.
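Here is a minimal TypeScript sketch of the decode → normalize → encode pipeline, under the assumption that the target is UTF-8: browsers can decode many legacy charsets with `TextDecoder`, but `TextEncoder` only produces UTF-8, so encoding to GBK/Big5 would need a library such as iconv-lite. The function name is illustrative.

```ts
// Minimal decode → normalize → encode pipeline (illustrative; target fixed to UTF-8).
function gbkToUtf8(gbkBytes: Uint8Array, addBom = false): Uint8Array {
  // 1) Decode the source bytes to a Unicode string.
  const text = new TextDecoder("gbk").decode(gbkBytes);

  // 2) Normalize so that e.g. "é" is always the composed form U+00E9.
  const normalized = text.normalize("NFC");

  // 3) Encode the code points in the target encoding (UTF-8 here).
  const utf8 = new TextEncoder().encode(normalized);
  if (!addBom) return utf8;

  // Optionally prepend the UTF-8 BOM (EF BB BF).
  const withBom = new Uint8Array(utf8.length + 3);
  withBom.set([0xef, 0xbb, 0xbf], 0);
  withBom.set(utf8, 3);
  return withBom;
}
```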
Frequently Asked Questions
- How accurate is the encoding detection?
- The detector combines BOM detection, statistical analysis, and byte-pattern validation to estimate the most likely charset. Each run also exposes a confidence score so you know when additional manual review is recommended.
- Can I convert the text after detection?
- Yes. Once an encoding is identified you can choose any target charset, convert the text entirely in the browser, and download or copy the converted output without uploading files.
- What does the Add or Remove BOM option do?
- A BOM (Byte Order Mark) is an optional signature at the start of UTF-encoded files. Adding a BOM helps some Windows tools detect the encoding, while removing it keeps files lean for UNIX environments. Toggle the option based on where the text will be used; the sketch after this FAQ shows the byte-level effect.
- Why do I still see garbled characters after conversion?
- If the text was previously decoded with the wrong charset and then saved, the damage is already baked into the file. Try reloading the original file, make sure the correct source encoding is selected, and convert again. Mixed encodings inside a single file can also produce mojibake; the sketch after this FAQ shows how this kind of corruption happens.
- Is any of my text uploaded or stored?
- No. Detection and conversion happen entirely in your browser. Files never leave your device so confidential documents remain private.
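To make the BOM and mojibake answers above concrete, here is a minimal TypeScript sketch. The helper names (`stripUtf8Bom`, `addUtf8Bom`) are illustrative and not the tool's actual API; only standard browser APIs (`TextEncoder`, `TextDecoder`) are used.

```ts
const UTF8_BOM = new Uint8Array([0xef, 0xbb, 0xbf]);

// Remove a leading UTF-8 BOM (EF BB BF) if present.
function stripUtf8Bom(bytes: Uint8Array): Uint8Array {
  const hasBom =
    bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf;
  return hasBom ? bytes.subarray(3) : bytes;
}

// Prepend a UTF-8 BOM, avoiding a double BOM if one is already there.
function addUtf8Bom(bytes: Uint8Array): Uint8Array {
  const body = stripUtf8Bom(bytes);
  const out = new Uint8Array(body.length + 3);
  out.set(UTF8_BOM, 0);
  out.set(body, 3);
  return out;
}

// Mojibake in action: UTF-8 bytes decoded as Windows-1252 turn "é" into "Ã©".
// If that wrong decoding is saved back to disk, re-converting cannot undo it.
const utf8Bytes = new TextEncoder().encode("café");              // ... 0xC3 0xA9
console.log(new TextDecoder("windows-1252").decode(utf8Bytes));  // "cafÃ©"
console.log(new TextDecoder("utf-8").decode(utf8Bytes));         // "café"
```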
Related Documentation
- Unicode Standard - Official Unicode character encoding specification
- RFC 3629 - UTF-8 Specification - UTF-8 encoding format standard
- GB18030 Chinese Encoding - Chinese national standard character encoding
- ICU Character Encoding Detection - International Components for Unicode detection library
- Character Encoding Best Practices - W3C guidelines for handling character encodings