UTF-8 Sampler

[ I Can Eat Glass ]
Most recent update: Sat Dec 21 17:37:33 2002

UTF-8 is an ASCII-preserving encoding method for Unicode (ISO 10646), the Universal Character Set (UCS). The UCS encodes most of the world's writing systems in a single character set, allowing you to mix languages and scripts within a document without needing any tricks for switching character sets. This web page is encoded directly in UTF-8.
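
As a small illustration of what "ASCII-preserving" means, here is a minimal Python sketch. It is not part of the original page, and the helper name utf8_hex and the sample strings are chosen here purely for illustration. It shows that ASCII text comes out byte-for-byte unchanged, while other characters become multi-byte sequences whose bytes are all 0x80 or above.

```python
# Minimal sketch: UTF-8 leaves ASCII bytes untouched and encodes every
# other character as a multi-byte sequence of bytes >= 0x80.
# utf8_hex is a hypothetical helper name used only for this example.

def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))

# An ASCII word, the euro sign, Latin small letter thorn, Georgian letter an.
samples = ["glass", "\u20ac", "\u00fe", "\u10d0"]
for s in samples:
    print(f"{s!r}: {utf8_hex(s)}")

# ASCII text is byte-for-byte identical whether encoded as ASCII or UTF-8.
assert "glass".encode("utf-8") == "glass".encode("ascii")
```

Because no byte of a multi-byte sequence falls in the ASCII range, software that only looks for ASCII delimiters can usually pass UTF-8 text through unharmed.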

Columbia University's Kermit 95 terminal emulation software can display UTF-8 plain text in Windows 95, 98, ME, NT, XP, or 2000 when using a monospace Unicode font like Andale Mono WT J or Everson Mono Terminal, or the less fully populated Courier New, Lucida Console, or Andale Mono. C-Kermit 7.0 and later can handle it too, if you have a Unicode display. As many languages as are representable in your font can be seen on the screen at the same time.

This, however, is a Web page. Some Web browsers can handle UTF-8, some can't. And those that can might not have a sufficiently populated font to work with (some browsers might pick glyphs dynamically from multiple fonts; Netscape 6 seems to do this). CLICK HERE for a survey of Unicode fonts for Windows.

First, the euro symbol:   €.

From the Anglo-Saxon Rune Poem (Rune version):

ᚠᛇᚻ᛫ᛒᛦᚦ᛫ᚠᚱᚩᚠᚢᚱ᛫ᚠᛁᚱᚪ᛫ᚷᛖᚻᚹᛦᛚᚳᚢᛗ
ᛋᚳᛖᚪᛚ᛫ᚦᛖᚪᚻ᛫ᛗᚪᚾᚾᚪ᛫ᚷᛖᚻᚹᛦᛚᚳ᛫ᛗᛁᚳᛚᚢᚾ᛫ᚻᛦᛏ᛫ᛞᚫᛚᚪᚾ
ᚷᛁᚠ᛫ᚻᛖ᛫ᚹᛁᛚᛖ᛫ᚠᚩᚱ᛫ᛞᚱᛁᚻᛏᚾᛖ᛫ᛞᚩᛗᛖᛋ᛫ᚻᛚᛇᛏᚪᚾ᛬

From Laȝamon's Brut (The Chronicles of England, Middle English, West Midlands):

An preost wes on leoden, Laȝamon was ihoten
He wes Leovenaðes sone -- liðe him be Drihten.
He wonede at Ernleȝe at æðelen are chirechen,
Uppen Sevarne staþe, sel þar him þuhte,
Onfest Radestone, þer he bock radde.

(CLICK HERE for another Middle English sample with some explanation of letters and encoding).

From the Tagelied of Wolfram von Eschenbach (Middle High German):

Sîne klâwen durh die wolken sint geslagen,
er stîget ûf mit grôzer kraft,
ich sih in grâwen tägelîch als er wil tagen,
den tac, der im geselleschaft
erwenden wil, dem werden man,
den ich mit sorgen în verliez.
ich bringe in hinnen, ob ich kan.
sîn vil manegiu tugent michz leisten hiez.

Some lines of Odysseus Elytis (Greek):

Τη γλώσσα μου έδωσαν ελληνική
το σπίτι φτωχικό στις αμμουδιές του Ομήρου.
Μονάχη έγνοια η γλώσσα μου στις αμμουδιές του Ομήρου.

από το Άξιον Εστί
του Οδυσσέα Ελύτη

The first stanza of Pushkin's Bronze Horseman (Russian):

На берегу пустынных волн
Стоял он, дум великих полн,
И вдаль глядел. Пред ним широко
Река неслася; бедный чёлн
По ней стремился одиноко.
По мшистым, топким берегам
Чернели избы здесь и там,
Приют убогого чухонца;
И лес, неведомый лучам
В тумане спрятанного солнца,
Кругом шумел.

Šota Rustaveli's Veṗxis Ṭq̇aosani, The Knight in the Tiger's Skin (Georgian):

ვეპხის ტყაოსანი შოთა რუსთაველი

ღმერთსი შემვედრე, ნუთუ კვლა დამხსნას სოფლისა შრომასა, ცეცხლს, წყალსა და მიწასა, ჰაერთა თანა მრომასა; მომცნეს ფრთენი და აღვფრინდე, მივჰხვდე მას ჩემსა ნდომასა, დღისით და ღამით ვჰხედვიდე მზისა ელვათა კრთომაასა.

And from the sublime to the ridiculous, here is a certain phrase in an assortment of languages (1):

  1. Sanskrit (5): काचं शक्नोम्यत्तुम् । नोपिहनिस्त माम् ।
  2. Sanskrit (standard transcription): kācaṃ śaknomyattum; nopahinasti mām.
  3. Classical Greek: ὕαλον ϕαγεῖν δύναμαι· τοῦτο οὔ με βλάπτει.
  4. Greek: Μπορώ να φάω σπασμένα γυαλιά χωρίς να πάθω τίποτα.
  5. Etruscan: (NEEDED)
  6. Latin: Vitrum edere possum; mihi non nocet.
  7. Esperanto: Mi povas manĝi vitron, ĝi ne damaĝas min.
  8. French: Je peux manger du verre, cela ne me fait pas mal.
  9. Provençal / Occitan: Pòdi manjar de veire, me nafrariá pas.
  10. Québécois: J'peux manger d'la vitre, ça m'fa pas mal.
  11. Walloon: Dji pou magnî do vêre, çoula m' freut nén må.
  12. Champenois: (NEEDED)
  13. Lorrain: (NEEDED)
  14. Picard: (NEEDED)
  15. Corsican: (NEEDED)
  16. Basque: Kristala jan dezaket, ez dit minik ematen.
  17. Catalan: Puc menjar vidre que no em fa mal.
  18. Spanish: Puedo comer vidrio, no me hace daño.
  19. Aragones: Puedo minchar beire, no me'n fa mal.
  20. Galician: Eu podo xantar cristais e non cortarme.
  21. Portuguese: Posso comer vidro, não me faz mal.
  22. Brazilian Portuguese: Consigo comer vidro. Não me machuca.
  23. Cabo Verde Creole: M' podê cumê vidru, ca ta maguâ-m'.
  24. Papiamentu: (NEEDED)
  25. Italian: Posso mangiare il vetro e non mi fa male.
  26. Roman: Me posso magna' er vetro, e nun me fa male.
  27. Sicilian: Puotsu mangiari u vitru, nun mi fa mali.
  28. Milanese: Sôn bôn de magnà el véder, el me fa minga mal.
  29. Venetian: Mi posso magnare el vetro, no'l me fa mae.
  30. Rheto-Romance: (NEEDED)
  31. Romanian: Pot să mănânc sticlă și ea nu mă rănește.
  32. Pictish: (NEEDED)
  33. Breton: (NEEDED)
  34. Cornish: Mý a yl dybry gwéder hag éf ny wra ow ankenya.
  35. Welsh: Dw i'n gallu bwyta gwydr, 'dyw e ddim yn gwneud dolur i mi.
  36. Manx Gaelic: Foddym gee glonney agh cha jean eh gortaghey mee.
  37. Old Irish (Ogham): ᚛᚛ᚉᚑᚅᚔᚉᚉᚔᚋ ᚔᚈᚔ ᚍᚂᚐᚅᚑ ᚅᚔᚋᚌᚓᚅᚐ᚜
  38. Old Irish (Latin): Con·iccim ithi nglano. Ním·géna.
  39. Irish: Is féidir liom gloinne a ithe. Ní dhéanann sí dochar ar bith dom.
  40. Scottish Gaelic: S urrainn dhomh gloinne ithe; cha ghoirtich i mi.
  41. Anglo-Saxon (Runes): ᛁᚳ᛫ᛗᚨᚷ᛫ᚷᛚᚨᛋ᛫ᛖᚩᛏᚪᚾ᛫ᚩᚾᛞ᛫ᚻᛁᛏ᛫ᚾᛖ᛫ᚻᛖᚪᚱᛗᛁᚪᚧ᛫ᛗᛖ᛬
  42. Anglo-Saxon (Latin): Ic mæg glæs eotan ond hit ne hearmiað me.
  43. Middle English: Ich canne glas eten and hit hirtiþ me nouȝt.
  44. English: I can eat glass and it doesn't hurt me.
  45. English (Braille): ⠊⠀⠉⠁⠝⠀⠑⠁⠞⠀⠛⠇⠁⠎⠎⠀⠁⠝⠙⠀⠊⠞⠀⠙⠕⠑⠎⠝⠞⠀⠓⠥⠗⠞⠀⠍⠑
  46. Lalland Scots / Doric: Ah can eat gless, it disnae hurt us.
  47. Glaswegian: (NEEDED)
  48. Gothic: (NEEDED)
  49. Old Norse (Runes): ᛖᚴ ᚷᛖᛏ ᛖᛏᛁ ᚧ ᚷᛚᛖᚱ ᛘᚾ ᚦᛖᛋᛋ ᚨᚧ ᚡᛖ ᚱᚧᚨ ᛋᚨᚱ
  50. Old Norse: Ek get etið gler án þess að verða sár.
  51. Norsk / Norwegian (Nynorsk): Eg kan eta glas utan å skada meg.
  52. Norsk / Norwegian (Bokmål): Jeg kan spise glass uten å skade meg.
  53. Føroyskt / Faroese: (NEEDED)
  54. Íslenska / Icelandic: Ég get etið gler án þess að meiða mig.
  55. Svensk / Swedish: Jag kan äta glas, det skadar mig inte.
  56. Dansk / Danish: Jeg kan spise glas, det gør ikke ondt på mig.
  57. Soenderjysk: Æ ka æe glass uhen at det go mæ naue.
  58. Frysk / Frisian: Ik kin glês ite, it docht me net sear.
  59. Nórdicg: Ljœr ye caudran créneþ ý jor cẃran.
  60. Nederlands / Dutch: Ik kan glas eten. Het doet me geen pijn.
  61. Afrikaans: Ek kan glas eet, maar dit maak my nie seer nie.
  62. Lëtzebuergescht / Luxemburgish: Ech kan Glas iessen, daat deet mir nët wei.
  63. Deutsch / German: Ich kann Glas essen, ohne mir weh zu tun.
  64. Ruhrdeutsch: Ich kann Glas verkasematucken, ohne dattet mich wat jucken tut.
  65. Sächsisch / Saxon: 'sch kann Glos essn, ohne dass'sch mer wehtue.
  66. Pfälzisch: Isch konn Glass fresse ohne dasses mer ebbes ausmache dud.
  67. Schwäbisch / Swabian: I kå Glas frässa, ond des macht mr nix!
  68. Bayrisch / Bavarian: I koh Glos esa, und es duard ma ned wei.
  69. Allemannisch: I kaun Gloos essen, es tuat ma ned weh.
  70. Schwyzerdütsch: Ich chan Glaas ässe, das tuet mir nöd weeh.
  71. Suomea / Finnish: Voin syödä lasia, se ei vahingoita minua.
  72. Hungarian: Meg tudom enni az üveget, nem lesz tőle bajom.
  73. Estonian: Ma võin klaasi süüa, see ei tee mulle midagi.
  74. Latvian: Es varu ēst stiklu, tas man nekaitē.
  75. Lithuanian: Aš galiu valgyti stiklą ir jis manęs nežeidžia
  76. Old Prussian: (NEEDED)
  77. Sorbian / Lusatian / Wendish: (NEEDED)
  78. Czech: Mohu jíst sklo, neublíží mi.
  79. Slovak: Môžem jesť sklo. Nezraní ma.
  80. Polska / Polish: Mogę jeść szkło i mi nie szkodzi.
  81. Slovenian: Lahko jem steklo, ne da bi mi škodovalo.
  82. Croatian: Ja mogu jesti staklo i ne boli me.
  83. Serbian (Latin): Mogu jesti staklo a da mi ne škodi.
  84. Serbian (Cyrillic): Могу јести стакло а да ми не шкоди.
  85. Macedonian: Можам да јадам стакло, а не ме штета.
  86. Russian: Я могу есть стекло, оно мне не вредит.
  87. Belarusian (Cyrillic): Я магу есці шкло, яно мне не шкодзіць.
  88. Belarusian (Lacinka): Ja mahu jeści škło, jano mne ne škodzić.
  89. Ukrainian: Я можу їсти шкло, й воно мені не пошкодить.
  90. Bulgarian: Мога да ям стъкло и не ме боли.
  91. Georgian: მინას ვჭამ და არა მტკივა.
  92. Armenian: Կրնամ ապակի ուտել և ինծի անհանգիստ չըներ։
  93. Albanian: Unë mund të ha qelq dhe nuk më gjen gjë.
  94. Turkish: Cam yiyebilirim, bana zararı dokunmaz.
  95. Turkish (Ottoman): جام ييه بلورم بڭا ضررى طوقونمز
  96. Marathi: मी काच खाऊ शकतो, मला ते दुखत नाही.
  97. Hindi: मैं काँच खा सकता हूँ, मुझे उस से कोई पीडा नहीं होती.
  98. Urdu(2): میں کانچ کھا سکتا ہوں اور مجھے تکلیف نہیں ہوتی ۔
  99. Pashto(2): زه شيشه خوړلې شم، هغه ما نه خوږوي
  100. Farsi / Persian: .من می توانم بدونِ احساس درد شيشه بخورم
  101. Arabic(2): أنا قادر على أكل الزجاج و هذا لا يؤلمني.
  102. Aramaic: (NEEDED)
  103. Hebrew(2): אני יכול לאכול זכוכית וזה לא מזיק לי.
  104. Yiddish(2): איך קען עסן גלאָז און עס טוט מיר נישט װײ.
  105. Ladino: (NEEDED)
  106. Gǝʼǝz: (NEEDED)
  107. Amharic: (NEEDED)
  108. Twi: Metumi awe tumpan, ɜnyɜ me hwee.
  109. Hausa (Latin): Inā iya taunar gilāshi kuma in gamā lāfiyā.
  110. Hausa (Ajami) (2): إِنا إِىَ تَونَر غِلَاشِ كُمَ إِن غَمَا لَافِىَا
  111. Yoruba(3): Mo lè je̩ dígí, kò ní pa mí lára.
  112. Malay: Saya boleh makan kaca dan ia tidak mencederakan saya.
  113. Tagalog: Kaya kong kumain nang bubog at hindi ako masaktan.
  114. Chamorro: Siña yo' chumocho krestat, ti ha na'lalamen yo'.
  115. Javanese: Aku isa mangan beling tanpa lara.
  116. Vietnamese (quốc ngữ): Tôi có thể ăn thủy tinh mà không hại gì.
  117. Vietnamese (nôm) (4): 些 𣎏 世 咹 水 晶 𦓡 空 𣎏 害 咦
  118. Mongolian: (NEEDED)
  119. Chinese: 我能吞下玻璃而不伤身体。
  120. Japanese: 私はガラスを食べられます。それは私を傷つけません。
  121. Korean: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요
  122. Thai: ฉันกินกระจกได้ แต่มันไม่ทำให้ฉันเจ็บ
  123. Hawaiian: Hiki iaʻu ke ʻai i ke aniani; ʻaʻole nō lā au e ʻeha.
  124. Marquesan: E koʻana e kai i te karahi, mea ʻā, ʻaʻe hauhau.
  125. Navajo: Tsésǫʼ yishą́ągo bííníshghah dóó doo shił neezgai da.
  126. Cherokee (and Cree, Ojibwa, Inuktitut, and other Native American languages): (NEEDED)
  127. Garifuna: (NEEDED)
  128. Gullah: (NEEDED)
  129. Lojban: mi kakne le nu citka le blaci .iku'i le se go'i na xrani mi

(Additions, corrections, completions, gratefully accepted.)

For testing purposes, some of these are repeated in a monospace font . . .

  1. Euro Symbol: €.
  2. Greek: Μπορώ να φάω σπασμένα γυαλιά χωρίς να πάθω τίποτα.
  3. Íslenska / Icelandic: Ég get etið gler án þess að meiða mig.
  4. Polish: Mogę jeść szkło, i mi nie szkodzi.
  5. Romanian: Pot să mănânc sticlă și ea nu mă rănește.
  6. Ukrainian: Я можу їсти шкло, й воно мені не пошкодить.
  7. Armenian: Կրնամ ապակի ուտել և ինծի անհանգիստ չըներ։
  8. Georgian: მინას ვჭამ და არა მტკივა.
  9. Hindi: मैं काँच खा सकता हूँ, मुझे उस से कोई पीडा नहीं होती.
  10. Hebrew(2): אני יכול לאכול זכוכית וזה לא מזיק לי.
  11. Yiddish(2): איך קען עסן גלאָז און עס טוט מיר נישט װײ.
  12. Arabic(2): أنا قادر على أكل الزجاج و هذا لا يؤلمني.
  13. Japanese: 私はガラスを食べられます。それは私を傷つけません。
  14. Thai: ฉันกินกระจกได้ แต่มันไม่ทำให้ฉันเจ็บ

In another test, we use HTML language tags to distinguish Bulgarian, Russian, and Serbian, which have different italic forms for lowercase б, г, д, п, and/or т:

Bulgarian:   [ бгдпт ]   [ бгдпт ]   Мога да ям стъкло и не ме боли.
Russian: [ бгдпт ]   [ бгдпт ]   Я могу есть стекло, это мне не вредит.
Serbian: [ бгдпт ]   [ бгдпт ]   Могу јести стакло а да ми не шкоди.

Finally, here is the Russian alphabet (uppercase only) coded in three different ways, which should look identical:

  1. АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ   (Literal UTF-8)
  2. АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ   (Decimal numeric character reference)
  3. АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ   (Hexadecimal numeric character reference)
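
The three codings above can be sketched in Python as follows. This is an illustration only, not how the page itself was produced; the helper names decimal_refs and hex_refs are invented here. It builds the same 32 uppercase letters (U+0410 through U+042F) as literal text, as decimal numeric character references, and as hexadecimal ones, then uses the standard html.unescape function to confirm that all three decode to the same string.

```python
import html

# The 32 uppercase letters shown above occupy U+0410 (A) through U+042F (YA).
ALPHABET = "".join(chr(cp) for cp in range(0x0410, 0x0430))

def decimal_refs(text: str) -> str:
    """Write each character as a decimal reference such as &#1040;."""
    return "".join(f"&#{ord(ch)};" for ch in text)

def hex_refs(text: str) -> str:
    """Write each character as a hexadecimal reference such as &#x410;."""
    return "".join(f"&#x{ord(ch):X};" for ch in text)

print(ALPHABET.encode("utf-8")[:4])   # first two letters as literal UTF-8 bytes
print(decimal_refs(ALPHABET)[:14])    # first two decimal references
print(hex_refs(ALPHABET)[:14])        # first two hexadecimal references

# All three forms decode to the same text, so they should look identical.
assert html.unescape(decimal_refs(ALPHABET)) == ALPHABET
assert html.unescape(hex_refs(ALPHABET)) == ALPHABET
```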

Notes:

  1. The numbering of the samples is arbitrary, done only to keep track of how many there are, and can change any time a new entry is added. The arrangement is also arbitrary but with some attempt to group related examples together. Bug #1: the (NEEDED) examples shouldn't count. Fix: Fill them in! Bug #2: All languages not listed are wanted, not just the ones that say (NEEDED).
  2. Correct right-to-left display of these languages depends on the capabilities of your browser. The period should appear on the left. In the monospace Yiddish example, the Yiddish digraphs should occupy one character cell. Note: unlike the other RTL examples, the Farsi phrase was entered "backwards".
  3. The third word is Latin letter small 'j' followed by small 'e' with U+0329, Combining Vertical Line Below. This displays correctly only if your Unicode font includes the U+0329 glyph and your browser supports combining diacritical marks. The Indic examples also include combining sequences.
  4. Includes Unicode 3.1 Plane 2 characters.
  5. Devanagari (used for writing Sanskrit and other Indic languages) requires complex rendering that most browsers are not capable of; furthermore it is far from settled how best to encode it in Unicode to achieve effects such as ligation. CLICK HERE for a thorough discussion.
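
Notes 3 and 4 can be checked with a short Python sketch using the standard unicodedata module. The helper names describe and plane are invented for this example, and U+20000 (the first code point of Plane 2) stands in for the Plane 2 characters of the nôm sample; that substitution is an assumption made to keep the example independent of any particular font or input method.

```python
import unicodedata

def describe(text: str) -> list:
    """List (code point, Unicode name, UTF-8 byte count) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), len(ch.encode("utf-8")))
        for ch in text
    ]

def plane(ch: str) -> int:
    """Return the Unicode plane number (0 is the BMP) of one character."""
    return ord(ch) >> 16

# Note 3: small 'e' followed by U+0329 is two code points that a capable
# renderer draws as a single glyph.
for row in describe("e\u0329"):
    print(row)

# Note 4: characters outside the BMP need four bytes in UTF-8.
# U+20000 is used here only as a stand-in for a Plane 2 character.
sample = "\U00020000"
print(describe(sample), "plane", plane(sample))
```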

Credits:
The "I can eat glass" phrase and the initial collection of translations: Ethan Mollick. Transcription / conversion to UTF-8: Frank da Cruz. Albanian: Sindi Keesan. Afrikaans: Johan Fourie. Anglo Saxon: Frank da Cruz. Arabic: Najib Tounsi. Armenian: Vaçe Kundakçı. Belarusian: Alexey Chernyak. Braille: Frank da Cruz. Bulgarian: Sindi Keesan, Guentcho Skordev. Cabo Verde Creole: Cláudio Alexandre Duarte. Chinese: Jack Soo. Cornish: Chris Stephens. Croatian: Marjan Baće. Czech: Stanislav Pecha. Dutch: Peter Gotink. Esperanto: Franko Luin. Estonian: Meelis Roos. Farsi/Persian: Payam Elahi. Finnish: Sampsa Toivanen. Galician: Laura Probaos. Georgian: Giorgi Lebanidze. Greek: Ariel Glenn, Constantine Stathopoulos, Siva Nataraja. Hebrew: Jonathan Rosenne. Hausa: Malami Buba, Tom Gewecke. Hawaiian: na Hauʻoli Motta, Anela de Rego, Kaliko Trapp. Hindi: Shirish Kalele. Hungarian: András Rácz. Icelandic: Andrés Magnússon. Irish: Michael Everson. Italian: Thomas De Bellis. Japanese: Makoto Takahashi. Korean: Jungshik Shin. Lëtzebuergescht: Stefaan Eeckels. Lithuanian: Gediminas Grigas. Lojban: Edward Cherlin. Macedonian: Sindi Keesan. Malay: Zarina Mustapha. Manx: Éanna Ó Brádaigh. Marathi: Shirish Kalele. Marquesan: Kaliko Trapp. Middle English: Frank da Cruz. Milanese: Marco Cimarosti. Navajo: Tom Gewecke. Nórdicg: Ywlyan Rott. Norwegian: Herman Ranes. Old Irish: Michael Everson. Old Norse: Andrés Magnússon. Pashto: N.R. Liwal. Pfälzisch: Dr. Johannes Sander. Polish: Juliusz Chroboczek. Québécois: Laurent Detillieux. Roman: Pierpaolo Bernardi. Romanian: Juliusz Chroboczek, Ionel Mugurel. Ruhrdeutsch: Timwi. Russian: Alexey Chernyak. Sanskrit: Siva Nataraja / Vincent Ramos. Sächsisch: André Müller. Schwäbisch: Otto Stolz. Scots: Jonathan Riddell. Serbian: Sindi Keesan, Ranko Narancic, Boris Daljevic, Szilvia Csorba. Slovak: G. Adam Stanislav. Slovenian: Albert Kolar. Tagalog: Jim Soliven. Thai: Alan Wood's wife. Turkish: Vaçe Kundakçı, Tom Gewecke, Merlign Olnon. Ukrainian: Michael Zajac. Urdu: Mustafa Ali. Vietnamese: Dixon Au, [James] Đỗ Bá Phước 杜 伯 福. Walloon: Pablo Saratxaga. Welsh: Geiriadur Prifysgol Cymru (Andrew). Yiddish: Mark David.

Tools Used to Create This Web Page:
The UTF-8-aware Kermit 95 terminal emulator on Windows, connected to a Unix host running the EMACS text editor. Kermit 95 displays UTF-8 and also allows keyboard entry of arbitrary Unicode BMP characters as 4 hex digits, as shown HERE. Hex codes for Unicode values can be found in The Unicode Standard (recommended) and the online code charts. When submissions arrive by email encoded in some other character set (Latin-1, Latin-2, KOI, various PC code pages, JEUC, etc.), I use the TRANSLATE command of C-Kermit on the Unix host (where I read my mail) to convert the character set to UTF-8 (I could also use Kermit 95 for this; it has the same TRANSLATE command). That's it -- no "Web authoring" tools, no locales, no "smart" anything. It's just plain text, nothing more. By the way, there's nothing special about EMACS -- any text editor will do, provided it allows entry of arbitrary 8-bit bytes as text, including the 0x80-0x9F "C1" range. EMACS 21.1 actually supports UTF-8; earlier versions don't know about it and display the octal codes; either way is OK for this purpose.
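
For readers without Kermit, the same two steps (converting a submission from another character set, and naming a BMP character by 4 hex digits) can be approximated in Python. This is only a sketch of the idea, not Kermit's TRANSLATE command or its keyboard entry mechanism. The function names to_utf8 and from_hex are invented here, and "koi8-r" and "latin-1" are Python codec names assumed to correspond to the KOI and Latin-1 character sets mentioned above.

```python
def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Re-encode bytes that arrived in source_charset as UTF-8."""
    return raw.decode(source_charset).encode("utf-8")

def from_hex(digits: str) -> str:
    """Turn exactly 4 hex digits into the BMP character they name."""
    if len(digits) != 4:
        raise ValueError("expected exactly 4 hex digits")
    return chr(int(digits, 16))

# A word as it might arrive in a KOI8-R encoded email, converted to UTF-8.
koi8_mail = "стекло".encode("koi8-r")
utf8_page = to_utf8(koi8_mail, "koi8-r")
print(utf8_page.decode("utf-8"))

# The euro sign entered by its code, 20AC. Its UTF-8 form contains a byte in
# the 0x80-0x9F range, which is why the editor must accept such bytes as text.
euro = from_hex("20AC")
print([f"0x{b:02X}" for b in euro.encode("utf-8")])
```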

Commentary:
Date: Wed, 27 Feb 2002 13:21:59 +0100
From: "Bruno DEDOMINICIS" <[email protected]>
Subject: Je peux manger du verre, cela ne me fait pas mal.

I just found out your website and it makes me feel like proposing an interpretation of the choice of this peculiar phrase.

Glass is transparent and can hurt as everyone knows. The relation between people and civilisations is sometimes effusional and more often rude. The concept of breaking frontiers through globalization, in a way, is also an attempt to deny any difference. Isn't "transparency" the flag of modernity? Nothing should be hidden any more, authority is obsolete, and the new powers are supposed to reign through loving and smiling and no more through coercion...

Eating glass without pain sounds like a very nice metaphor of this attempt. That is, frontiers should become glass transparent first, and be denied by incorporating them. On the reverse, it shows that through globalization, frontiers undergo a process of displacement, that is, when they are not any more speakable, they become repressed from the speech and are therefore incorporated and might become painful symptoms, as for example what happens when one tries to eat glass.

The frontiers that used to separate bodies one from another tend to divide bodies from within and make them suffer.... The chosen phrase then appears as a denial of the symptom that might result from the destitution of traditional frontiers.

Best,
Bruno De Dominicis, Paris, France

Other Unicode samplers:



UTF-8 Sampler / The Kermit Project / Columbia University / [email protected] / 21 December 2002