text-scrambler
Using Unicode confusable characters and a few other tricks, we can transform a text into another one that looks exactly the same to a human reader but is a different string for a machine.
Examples
Randomly replacing Latin characters with Greek or Cyrillic look-alikes and inserting zero-width joiners and non-joiners (ZWJ/ZWNJ).
Original text:
Herman Melville (August 1, 1819 – September 28, 1891) was an American novelist, short story writer, and poet of the American Renaissance period. Among his best-known works are Moby-Dick (1851), Typee (1846), a romanticized account of his experiences in Polynesia, and Billy Budd, Sailor, a posthumously published novella. Although his reputation was not high at the time of his death, the centennial of his birth in 1919 was the starting point of a Melville revival and Moby-Dick grew to be considered one of the great American novels.
Scrambled text with ZW(N)J added (it looks the same but is a completely different string):
Herman Melville (August 1, 1819 – September 28, 1891) was an American novelist, short story writer, and poet of the American Renaissance period. Among his best-known works are Moby-Dick (1851), Typee (1846), a romanticized account of his experiences in Polynesia, and Billy Budd, Sailor, a posthumously published novella. Although his reputation was not high at the time of his death, the centennial of his birth in 1919 was the starting point of a Melville revival and Moby-Dick grew to be considered one of the great American novels.
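The insertion step can be sketched in a few lines of Python. This is a minimal illustration, not the library's implementation: the function name `insert_zero_width` is made up, and placing one randomly chosen ZWJ (U+200D) or ZWNJ (U+200C) between every pair of adjacent characters is an assumption.

```python
import random

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER


def insert_zero_width(text, rng=None):
    """Return text with one ZWJ or ZWNJ between each pair of adjacent characters.

    Hypothetical helper for illustration only; the real library may place
    the invisible characters differently.
    """
    rng = rng or random.Random()
    if not text:
        return text
    pieces = [text[0]]
    for char in text[1:]:
        pieces.append(rng.choice((ZWJ, ZWNJ)))  # invisible separator
        pieces.append(char)
    return "".join(pieces)


original = "Herman Melville"
scrambled = insert_zero_width(original, random.Random(0))
print(original == scrambled)          # False
print(len(original), len(scrambled))  # 15 29
```

Under this assumption a text of n characters grows to 2n - 1 characters, while most renderers display it unchanged.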
Scrambled text with Latin letters replaced by their Cyrillic/Greek equivalents:
Неrman Melvіllе (Αuguѕt 1, 1819 – Septеmber 28, 1891) wаѕ an Аmеrісаn nοvеlist, shοrt story writеr, and poеt оf the Americаn Rеnaіssanсe pеriоd. Amоng his bеst-known works arе Μoby-Dісk (1851), Τyреe (1846), a rоmаnticizеd accоunt оf hіs eхрerіencеs in Ρоlynеѕiа, аnd Вilly Budd, Ѕаіlοr, а pοsthumously рublіshed nоvеllа. Although hiѕ reputation was nοt hіgh at thе tіme οf hіѕ dеаth, the сentennіаl оf hіs bіrth in 1919 waѕ thе stаrting point οf a Μelville revival and Μοby-Dick grew tο bе соnѕidеrеd οne оf the great American novels.
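The letter replacement can be sketched in the same way. The table `HOMOGLYPHS` below is a small hand-picked subset chosen for this example, and `replace_homoglyphs` with its `probability` parameter is a hypothetical helper, not the library's API.

```python
import random

# A small, hand-picked subset of Latin -> Cyrillic/Greek look-alikes.
# Illustrative only; the Unicode confusables list is much larger.
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u03bf",  # GREEK SMALL LETTER OMICRON
    "s": "\u0455",  # CYRILLIC SMALL LETTER DZE
    "i": "\u0456",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    "M": "\u039c",  # GREEK CAPITAL LETTER MU
    "H": "\u041d",  # CYRILLIC CAPITAL LETTER EN
}


def replace_homoglyphs(text, probability=0.5, rng=None):
    """Replace each mappable character with its look-alike with the given probability."""
    rng = rng or random.Random()
    return "".join(
        HOMOGLYPHS[c] if c in HOMOGLYPHS and rng.random() < probability else c
        for c in text
    )


scrambled = replace_homoglyphs("Herman Melville", probability=1.0)
print(scrambled == "Herman Melville")  # False
print(len(scrambled))                  # 15, the length is unchanged
```

Unlike the zero-width insertion, this substitution keeps the length of the text, so comparing lengths is not enough to detect it.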
Scrambled text with both changes:
Herman Μelvіllе (Аuguѕt 1, 1819 – September 28, 1891) waѕ an Αmeriсan noveliѕt, shοrt ѕtοry writеr, аnd pоеt οf the Аmerіcаn Rеnaissanсе pеriоd. Amοng hiѕ bеst-knοwn wоrkѕ arе Mоby-Dick (1851), Typее (1846), a rοmаntісіzed aсcоunt οf hіs exрerіеnces іn Роlynеsіа, and Вilly Βudd, Ѕаilοr, a pοѕthumоuѕly publiѕhеd nоvеlla. Althоugh his reрutatіon wаѕ nоt hіgh аt thе tіme of hіs deаth, thе сentennіal of hіѕ birth іn 1919 wаs the stаrtіng рοіnt οf a Μelvillе rеvivаl and Мoby-Dісk grеw tο be сonsidered оne of thе greаt Аmеriсаn novels.
It is worth noting that search engines cannot find the original web page from the scrambled text, and neither can free online plagiarism checkers. Searching Google for Μelvillе (with Cyrillic letters; copy and paste it) returns no match, whereas the original word Melville does.
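A tool that compares code points sees two different words, which is presumably why the search fails. The sketch below uses only the standard `unicodedata` module to list the non-ASCII characters of a word. The helper name `non_ascii_characters` is made up for this example, and the scrambled word is written with escapes, assuming a Greek capital Mu and a Cyrillic small ie as the substituted letters.

```python
import unicodedata


def non_ascii_characters(word):
    """Return (index, character, Unicode name) for every non-ASCII character."""
    return [
        (index, char, unicodedata.name(char, "UNKNOWN"))
        for index, char in enumerate(word)
        if ord(char) > 127
    ]


original = "Melville"
# Assumed substitutions for illustration: Greek capital Mu for "M"
# and Cyrillic small ie for the final "e".
scrambled = "\u039celvill\u0435"

print(original == scrambled)  # False
for index, char, name in non_ascii_characters(scrambled):
    print(index, name)
# 0 GREEK CAPITAL LETTER MU
# 7 CYRILLIC SMALL LETTER IE
```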
Using all of the Unicode confusable characters (see the link to the confusables list below), we can generate weird-looking text worthy of old spam messages:
𝚮𝒆𝕣m𝓪n 𝝡ҽ𝟙∨𝘪𝘐𞺀𝓮 ﴾𝓐𝞄𝓰ꞟ𑣁t 1, 181Ⳋ – Ꮥ𝖊𝞺𝐭𝖾mƄ𝔢𝔯 Ƨ𐌚ꓹ 1ଃ𝟿1] 𝘸𝐚𝚜 𝖺𝔫 Αmℯ𝔯𝓲ꮯ𝒶𝓷 nം𝝼𝔢𝙸is𝖙؍ 𐑈𝖍ꬽꭇ𝓽 𝓼𝖙ⲟr𑣜 𝐰𝓻і𝒕е𝕣٫ α𝒏𝕕 𝙥𝜊e𝕥 ﮨf 𝘵h𝗲 Αm𝐞𝐫ꙇ𝒸an 𖼵𝘦𝑛𝐚𝒾𝑠𑣁𝜶𝕟𝗰𝒆 𝟈𝖾r⍳ﮫᑯ𐩐 Αmo𝓃𝖌 𝓱Ꭵ𝐬 Ꮟ𝙚𝗌𝕥۔𝖐𝖓o𝑤𝐧 𑜎оꮁ𝐤𝗌 𝜶𝗿𝖾 𝕸໐Ꮟ𝙮Ⲻ𝖣𝑖𝔠𝒌 〔1𝟪51〕ꓹ 𝖳𝗒𝓹𝘦𝚎 〔1🯸𝟜6❳ꓹ 𝖆 𝕣ꬽm⍺𝘯𝘵іꮯ𝛊𝐳ⅇ𝙙 𝕒cᴄჿ𝚞𝚗𝐭 𞹤𝔣 𝚑ӏ𝓈 𝕖𝑥𝙥𝔢𝗿ꙇe𝓷c℮ꮪ 𝖎𝚗 𝙋𝘰Ӏγ𝓷𝖾𝔰𝚒𝗮؍ 𝛼𝔫𝖉 𝔅Ꭵ𝖑l𝔂 𝓑𝐮𝖉𝒹‚ Ꮥаꙇ𝘭𝝈𝗋, α 𝑝ꬽ𐑈𝓽һ𝛖m𞺄ᴜ𝔰𝗹𝑦 𝖕ᴜᏏ𝝞𝜄sh𝗲ꓒ 𝓃𝗈𝓋𝒆𐌉ו𝞪꘎ 𖽀𝜤𝑡һ𝙤𝑢ց𝘩 𝒉ιѕ 𝖗𝒆𝛠𝚞𝐭𝓪𝙩ɪﮨ𝓷 𑜊𝖺s 𝘯𞹤𝚝 𝐡𝜄ᶃ𝕙 𝖆𝘁 𝙩hꬲ 𝓉𝔦mе 𝞼ẝ ℎıƽ 𝐝𝕖𝖆𝚝𝔥ꓹ 𝙩Ꮒꬲ 𝗰ⅇ𝗻𝔱𝖊𝖓n𝛊𝙖𐌠 ﻫ𝘧 𝒽𝖎𝘴 bı𝚛𝓽𝘩 i𝐧 1𑣖1𝟵 𑜏α𝗌 𝗍𝐡ҽ 𝕤𝑡𝛂r𝓉Ꭵ𝚗ᶃ 𝛒ס𝜾𝗻𝖙 𝜊𝖋 𝙖 ꓟ𝙚ⵏ𝛎˛І𝘭ҽ 𝔯𝐞v𝞲𝚟𝖆l ɑ𝘯𝖽 𝑀ං𝒃𝚢‐𝐷ͺ𝚌𝗸 𝓰ꭈеᴡ 𝓉ﮭ ᑲℯ cℴ𝙣𝔰𑣃dⅇ𝔯℮ⅾ ﻬ𝓃℮ ੦𝙛 𝙩𝔥𝔢 𝚐ꮁℯ𝜶𝙩 𝞐m𝘦ᴦ𝜾𝙘𝕒𝐧 𝓃o𝓿ⅇ|𝒔ꓸ
Full documentation at https://text-scrambler.readthedocs.io
Installation
pip install text-scrambler
Quickstart
Python
>>> from text_scrambler import Scrambler
>>> scr = Scrambler()
>>> text = "This is an example"
>>> text_1 = scr.scramble(text, level=1)
>>> #############
>>> # adding only zwj/zwnj characters
>>> print(text, text_1, sep="\n")
This is an example
This is an example
>>> assert text != text_1
>>> print(len(text), len(text_1))
18 35
>>> # though the texts look similar, the second one has more characters
>>> #############
>>> text_2 = scr.scramble(text, level=2)
>>> # replacing some latin letters by their cyrillic/greek equivalent
>>> print(text_2)
Тhiѕ iѕ an ехаmple
>>> for char, char_2 in zip(text, text_2):
...     if char != char_2:
...         print(char, char_2)
...
T Т
s ѕ
s ѕ
e е
x х
a а
>>> #############
>>> text_3 = scr.scramble(text, level=3)
>>> # adding zwj/zwnj characters and replacing latin letters
>>> print(text_3)
Thіs iѕ аn eхаmple
>>> print(text, text_3, sep="\n")
This is an example
Thіs iѕ аn eхаmple
>>> assert text_3 != text
>>> #############
>>> text_4 = scr.scramble(text, level=4)
>>> # replacing all characters by any unicode looking like character
>>> print(text_4)
⊤𝒽𝐢𝘴 𝘪𝙨 𝞪ռ 𝙚⨯𝚊mρ𝟙ҽ
>>> #
>>> # generating several versions
>>> versions = scr.generate(text, 10, level=4)
>>> for txt in versions:
...     print(txt)
...
𝕋𝗵𝕚𝔰 𝙞ѕ ɑ𝗇 ꬲ𝗑𝒂m𝛠Ⲓ𝚎
𝔗һ𑣃ƽ ˛ꜱ 𝛼𝐧 𝐞𝖝𝛼m𝜌𝟏ℯ
Th𝓲𝔰 ⅈ𝔰 αn ꬲ⤬αm⍴𞸀e
𝗧𝗵i𝑠 i𝖘 ⍺𝘯 𝗲𝔁аm𝘱𝙸𝔢
⊤𝚑𝑖s ɪ𝚜 𝜶𝑛 𝖾𝘅𝒶m𝛒𝑙𝓮
𝘛h𝙞ꮪ ⅈ𝗌 𝗮𝐧 ꬲᕽ𝓪m𝜌⏽𝓮
𝙏𝕙і𝓈 ıꜱ 𝔞𝕟 𝗲𝕩𝛂mр𐌉𝚎
𝕿Ꮒℹ𝐬 𝗶𝗌 𝛼𝔫 𝗲𝐱𝓪m𝞎𝙡𝖊
⟙h𝜾ꮪ i𝘴 𝝰𝒏 𝙚ᕽ𝗮m𝗽𝗜𝗲
𝖳հ𝒊s 𝕚𝙨 𝖆𝑛 𝘦𝔁аm𝜌𝐈𝗲
>>> versions = scr.generate(text, 1000, level=1)
>>> assert len(versions) == len(set(versions))
>>> # all unique
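The behaviour of `generate` shown above, returning several distinct variants of one text, can be approximated as follows. This is a sketch under stated assumptions, not the library's code: `CONFUSABLES` is a tiny hand-made table, and `scramble_once` and `generate_versions` are hypothetical functions. The second one simply keeps sampling until it has collected the requested number of distinct strings.

```python
import random

# Tiny illustrative table; each letter maps to a few look-alikes.
CONFUSABLES = {
    "a": ["\u0430", "\u03b1"],  # Cyrillic a, Greek alpha
    "e": ["\u0435"],            # Cyrillic ie
    "i": ["\u0456", "\u0131"],  # Cyrillic i, Latin dotless i
    "s": ["\u0455"],            # Cyrillic dze
    "x": ["\u0445"],            # Cyrillic ha
}


def scramble_once(text, rng):
    """Replace each character by itself or by one of its look-alikes."""
    return "".join(rng.choice([c] + CONFUSABLES.get(c, [])) for c in text)


def generate_versions(text, n, rng=None, max_tries=10_000):
    """Collect n distinct scrambled versions, or raise if that is not possible."""
    rng = rng or random.Random()
    versions = set()
    for _ in range(max_tries):
        if len(versions) == n:
            break
        versions.add(scramble_once(text, rng))
    if len(versions) < n:
        raise ValueError("not enough distinct versions for this text")
    return sorted(versions)


versions = generate_versions("This is an example", 10, random.Random(42))
print(len(versions), len(set(versions)))  # 10 10
```

Collecting the results in a set guarantees uniqueness by code point, which is the same property the `assert` in the session above checks.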
Command line interface (CLI)
The scrambler can also be run from the command line on a UTF-8 encoded file:
$ python -m text_scrambler
usage: Usage : python -m text_scrambler file

Replace/insert the charaters of the file using the unicode confusable characters

positional arguments:
  file                  encoded in UTF-8

optional arguments:
  -h, --help            show this help message and exit
  -l LEVEL, --level LEVEL
                        1: insert non printable characters within the text
                        2: replace some latin letters to their Greek or Cyrillic equivalent
                        3: insert non printable characters and change the some latin to their Greek or Cyrillic equivalent
                        4: insert non printable chraracters change all possible letter to a randomly picked unicode letter equivalent
                        default=1
  -n N, --generate N    Scramble n times the string
                        default=1
Links
See https://en.wikipedia.org/wiki/Word_joiner for more info on word joiners
See https://unix.stackexchange.com/questions/469347/using-uniq-on-unicode-text for why the sort command is not a reliable way to check the uniqueness of these strings
See http://www.unicode.org/Public/security/revision-03/confusablesSummary.txt for the complete list of confusable characters.