Scholarly Societies Project

Linguistic Considerations

Table of Contents
  1. Language Policy
  2. Character Encoding
    1. Encoding Standards
    2. Encoding Techniques
    3. Verification of the Encodings
    4. Proper Viewing of the Encodings
    5. Outstanding Issues
  3. Character Transliteration

1. Language Policy

Society websites that satisfy the Guidelines for Inclusion of Resources are included in the Project regardless of the languages in which they are written, provided that basic information about the society can be determined, such as an English translation of its name and its founding year.

To see which websites have text in various languages, consult the Scholarly Societies by Language page.

2. Character Encoding

Although English is the language in which the Scholarly Societies Project is presented, many of the societies covered have names that require diacritics or non-Latin scripts if they are to be displayed properly.

The subsections below discuss issues concerning the encoding of non-English characters.

2.1 Encoding Standards
Character Set: Western European characters with diacritics
Encoding Standard: These are encoded using the standard HTML entity references. A fairly complete list of these is found at ISO 8859-1 (Latin-1) Characters List (maintained at the University of Toronto). See also HTML 4.01 Entities Reference (maintained at W3Schools).
Examples:
  &eacute; = é
  &uuml; = ü

Character Set: All other character sets available in the Arial Unicode font (which, as of 2001, September 12, covers all characters in Unicode Standard 2.1)
Encoding Standard: These are encoded using character references of the form &#dddd; where dddd is the decimal value of the hexadecimal number given by the Unicode Standard 3.0 charts published by the Unicode Consortium.

Note: Although it is possible to encode the hexadecimal value directly (using the &#xhhhh; format, where hhhh is the hex value), this is not currently recommended, since some browsers do not support it.

Examples:
  &#1103; = я (Cyrillic)
  &#2349; = भ (Devanagari)
  &#4314; = ლ (Georgian)
  &#29702; = 理 (CJK Ideographs)
  &#54620; = 한 (Korean)
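
The two encoding standards described above can be sketched in a few lines of Python. This is only an illustration, not the Project's own tooling: the function names are hypothetical, and the named entities come from Python's standard html.entities table, which is assumed here to agree with the entity lists cited above.

```python
from html.entities import codepoint2name


def encode_char(ch: str) -> str:
    """Return an HTML reference for a single character."""
    cp = ord(ch)
    if cp <= 0xFF and cp in codepoint2name:
        # Latin-1 range: use the named entity, e.g. é -> &eacute;
        # (this also escapes &, <, > and " as a side effect).
        return f"&{codepoint2name[cp]};"
    if cp < 0x80:
        # Plain ASCII needs no reference.
        return ch
    # Everything else: decimal numeric reference of the form &#dddd;
    return f"&#{cp};"


def encode_text(text: str) -> str:
    """Encode a whole string, one character at a time."""
    return "".join(encode_char(ch) for ch in text)


print(encode_text("é ü я 한"))  # &eacute; &uuml; &#1103; &#54620;
```

The decimal form is produced rather than the &#xhhhh; form, in line with the note above about browser support.
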
2.2 Encoding Techniques
Situation at the Society Website: The society name is encoded in the original script as text somewhere at the website.
Encoding Technique: The text string is copied to the Macchiato Unicode UTF Converter, and the decimal code for the string of characters is retrieved, with the code for each character preceded by &# and followed by ;. The result is HTML code that will display a Unicode-compliant representation of the character string.

Situation at the Society Website: The only occurrence of the society's name in the original script is as part of a graphic (in which case the individual characters cannot be copied).
Encoding Technique: The characters in the graphic are matched one by one against the Unicode Charts published by the Unicode Consortium to identify the Unicode hexadecimal value for each character. Each value is then converted to a decimal value and encoded as in 2.1 above.

[Human-based pattern recognition of this sort can be rather time-consuming, especially when a large set like the CJK Unified Ideographs set (20,000+ characters) must be scanned.]

Situation at the Society Website: There is no occurrence of the society's name in the original script anywhere at the society website.
Encoding Technique: Other sources must be consulted in order to determine the society name in the original script. Priority is given to web resources that appear to be authoritative.

If the search is successful, one of the two techniques described above may then be employed in the encoding.
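
As a rough sketch of the first two techniques, the Python below does what the text describes: it wraps the decimal code of each character in &# and ;, and it converts a hexadecimal value read off a Unicode chart to the same decimal form. The function names are hypothetical, and the code only approximates the output described for the Macchiato converter; it is not that tool.

```python
def to_decimal_references(text: str) -> str:
    """Technique 1: text copied from the website -> decimal character references."""
    return "".join(f"&#{ord(ch)};" for ch in text)


def reference_from_chart(hex_value: str) -> str:
    """Technique 2: hexadecimal chart value (e.g. '7406') -> decimal reference."""
    return f"&#{int(hex_value, 16)};"


print(to_decimal_references("Русский"))
print(reference_from_chart("7406"))  # &#29702;
```

For a name found only in a graphic, each chart value would be passed to reference_from_chart in turn and the results concatenated.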

2.3 Verification of the Encodings
Once a first draft of an encoding of a society name has been completed, the resulting character string is tested to verify that:
  1. the string represents the society name and nothing more, and
  2. the society name is correctly rendered.
The preferred tools for verification are given below in order of priority.
  1. Type of Verification Tool: An online translation facility.
     Specific Tools: Altavista's Babelfish Translator
  2. Type of Verification Tool: An online dictionary, used word by word, and in conjunction with a grammar of the language where necessary.
     Specific Tools: Specific online dictionaries are located using Your Dictionary.com's Language Dictionaries (which links to 1800+ dictionaries covering 250+ languages)
  3. Type of Verification Tool: A print dictionary, used word by word, and in conjunction with a grammar of the language where necessary.
     Specific Tools: This is the last (but frequent) resort, since it relies on exact pattern recognition by a human, rather than by a machine.
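
Before a translation facility or dictionary is consulted, a mechanical check can confirm that the encoded string decodes to the intended characters and nothing more. The following Python is a minimal sketch of such a check, assuming the encoding uses standard HTML character references; the function names are hypothetical and this is not part of the Project's documented procedure.

```python
import html
import unicodedata


def decode_references(encoded: str) -> str:
    """Turn named and numeric character references back into text."""
    return html.unescape(encoded)


def matches_name(encoded: str, expected_name: str) -> bool:
    """True if the encoded string decodes to exactly the expected name."""
    return decode_references(encoded) == expected_name


def describe(encoded: str) -> list:
    """List each decoded character with its code point and Unicode name."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in decode_references(encoded)
    ]


for row in describe("&#1103;&#29702;"):
    print(row)
```

The Unicode names printed by describe make it easier to spot a character from the wrong script or an unintended extra character.
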
2.4 Proper Viewing of the Encodings
Specific Problem: The script is displaying as ????? (question marks), ||||| (vertical lines) or □□□□□ (square boxes).
Solutions: You need to verify that your computer has a Unicode font for the script in question. At the moment, the most comprehensive Unicode font is the Arial Unicode font, which includes all character sets in the Unicode Standard 2.1. There exist, however, numerous Unicode fonts for particular scripts; see, for example, Allan Wood's Unicode fonts for Windows computers.

Arial Unicode font is available with Microsoft Office XP and Microsoft Publisher 2002.
Scripts Affected: Any script for which your computer lacks a Unicode font.

Specific Problem: Conjunct glyphs are displaying as their separate components.
Solutions: If the problem is with Arabic conjunct glyphs, you may be able to solve the problem by switching to either (a.) Netscape 6.0 or higher, or (b.) Internet Explorer 5.0 or higher.
If the problem is with conjunct glyphs in Devanagari and other Indic scripts, you may be able to solve the problem by switching to the Microsoft Windows XP operating system. [For example, the Microsoft Windows 98 operating system definitely does not handle Devanagari conjunct glyphs properly.]
Scripts Affected: Arabic; Devanagari and other Indic scripts

Specific Problem: Contextual glyphs are displaying as their isolated forms, rather than changing as a function of their position in a word.
Solutions: If the problem is with Arabic glyphs, you can probably solve the problem by switching to either (a.) Netscape 6.0 or higher, or (b.) Internet Explorer 5.0 or higher.
Scripts Affected: Arabic

Specific Problem: Right-to-left scripts, like Hebrew and Arabic, are displaying backwards.
Solutions: If the problem is with Arabic or Hebrew, you can probably solve the problem by switching to either (a.) Netscape 6.0 or higher, or (b.) Internet Explorer 5.0 or higher.
Scripts Affected: Hebrew; Arabic
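
To anticipate which encoded names might display backwards in an older browser, one can test whether a string contains right-to-left characters. The Python below is a hypothetical helper, not something the Project describes using; it relies on the bidirectional classes in Python's unicodedata module, where "R" and "AL" mark strong right-to-left characters.

```python
import unicodedata

# Bidirectional classes for strong right-to-left characters:
# "R" (for example Hebrew letters) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}


def needs_rtl_support(text: str) -> bool:
    """True if any character in the text is strongly right-to-left."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


for name in ("עברית", "العربية", "Русский"):
    print(name, needs_rtl_support(name))
```
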
2.5 Outstanding Issues
Character Set: Certain character sets, or portions of a character set
Issues: The most comprehensive Unicode font is the Arial Unicode font. At the present time, it does not cover additional characters that were added to the Unicode Standard 2.1 to create the Unicode Standard 3.0, much less later versions.

3. Character Transliteration
The Editor is in the process of implementing appropriate standards for the transliteration of various non-Latin scripts. This will necessitate revising some existing transliterations in the Scholarly Societies Project. It will also require the creation of a large number of transliterations for recently encoded items.

The standards to be employed are listed below.

Character Set: Arabic
Languages Served: العربية (Arabic); فارسى (Persian)
Transliteration Standard: ISO 233-1984 (E): Documentation - Transliteration of Arabic characters into Latin characters

Character Set: Armenian
Languages Served: Հայերէն (Armenian)
Transliteration Standard: ISO 9985-1996: Information and documentation - Transliteration of Armenian characters into Latin characters

Character Set: Chinese
Languages Served: 汉语 (Chinese)
Transliteration Standard: The Pinyin system (as found in most modern Chinese dictionaries).

Character Set: Cyrillic
Languages Served: Беларуска (Belarusian); Български (Bulgarian); Русский (Russian); Македонски (Macedonian); Српски (Serbian); Українська (Ukrainian)
Transliteration Standard: ISO 9-1986 (E): Documentation - Transliteration of Slavic Cyrillic characters into Latin characters

Character Set: Georgian
Languages Served: საქართველო (Georgian)
Transliteration Standard: ISO 9984-1996: Information and documentation - Transliteration of Georgian characters into Latin characters

Character Set: Greek
Languages Served: Ελληνικά (Greek)
Transliteration Standard: ISO/R 843-1968 (E): International system for the transliteration of Greek characters into Latin characters

Character Set: Hebrew
Languages Served: עברית (Hebrew)
Transliteration Standard: ISO 259-1984 (E): Documentation - Transliteration of Hebrew characters into Latin characters

Character Set: Japanese
Languages Served: 日本語 (Japanese)
Transliteration Standard: The Hepburn system, with slight modifications, as found in the Kodansha Kanji Learner's Dictionary. Halpern, Jack (Ed.) Tokyo, New York, London: Kodansha International, 1999.

Character Set: Korean (Hangul)
Languages Served: 한국어 (Korean)
Transliteration Standard: ISO TR 11941-1996: Information and documentation - Transliteration of Korean characters into Latin characters

Character Set: Thai
Languages Served: ภาษาไทย (Thai)
Transliteration Standard: ISO 11940-1998: Information and documentation - Transliteration of Thai

First published 2001, September 12
Last amended 2003, November 1
Jim Parrott, Editor
Scholarly Societies Project