Computer-assisted teaching of Pinyin orthography

Peter Leimbigler, Ph.D.

Facilitating compliance with Pinyin orthography (汉语拼音正词法) for teachers and learners is a central task for Chinese language teachers and software designers. This article introduces algorithmic self-correcting software solutions for Pinyin input with non-standard spacing. The focus is on four aspects of Pinyin orthography in a computerized learning environment: 1) spacing rules for modal particles, 2) orthography of four-syllable fixed expressions, 3) number/measure-word combinations, and 4) an option to reflect tone changes (“sandhi”) in the Pinyin rendering.


Compliance with the standard of writing, both in Hanzi and in Pinyin, is a major effort for native speakers and learners of any language. In the case of Chinese, especially when textbooks and other publications require a phonetic transcription, the task is even more difficult, as writing the same text in Chinese characters (Hanzi) and in the standard phonetic transcription (Hanyu Pinyin) challenges the writer in totally different ways. In this presentation I focus on the orthography rules for Hanyu Pinyin, in particular on the question of how we can enhance implementation of the Pinyin orthography standard that was proclaimed in 1996 as Hànyǔ Pīnyīn zhèngcífǎ jīběn guīzé 汉语拼音正词法基本规则 by the China State Bureau of Quality and Technical Supervision of the State Council of China. The original 8-page document with the rules is available at For the software solutions discussed in the text of this article, the latest (May 2008) release version 5.1 of the Chinese text system KEY5 at is being used. The research for our software development, including the linguistic basis for the software algorithms, was mainly done on the basis of the works mentioned in the references below, with many contributions in theory and practice from my colleagues world-wide, and my R&D team in Ottawa, Canada.


1. Spacing rules for modal particles

While the general spacing rules of Pinyin orthography are well defined in books like Yin & Felley (1990) and in modern standard dictionaries (like the Xiàndài Hànyǔ Cídiǎn 现代汉语词典, 2002), they are not always implemented in teaching. For instance, when non-standard Pinyin input is used on the computer to teach Chinese – such as monosyllabic input, or continuous input without spacing – this may facilitate text entry in certain instances, but in the long run might prove detrimental to the learners’ Pinyin orthography skills.

More complicated than the general vocabulary-level spacing rules of Pinyin are the Pinyin spacing rules on the syntactic level that might even seem inconsistent at first sight. With these syntax-based rules, there is a lot of potential for confusion for teachers and learners alike, because we are here dealing with rules that cannot be readily looked up in a dictionary. For example, sooner or later every Chinese language teacher and student is confronted with the Pinyin spacing rules governing modal particles like le 了, zhe 着, and guo 过 which in most cases (but not in all cases!) are supposed to be appended directly to the preceding verb, without a space in between. We can observe the complexity of these spacing rules in some example sentences containing the particle le 了. The following three sentences all comply with the Pinyin spacing rules – but would you as an author of learning material write the Pinyin text like this? Or, imagine you have to explain to your students the logic behind the Pinyin renderings, in particular the le 了 spacing:

  1. 走进来了两位客人。 Zǒu jìnlái le liǎng wèi kèren.
  2. 来了两位客人。 Láile liǎng wèi kèren.
  3. 客人来了。 Kèren lái le.

Background explanation of the above le 了 spacing:

  1. “Zǒu jìnlái” is a verb + complement construction. The rule says that, if the construction is written as two units, then le 了 is written separate from it (Yin & Felley, 1990, pp. 303-304). Therefore the standard way of writing is “Zǒu jìnlái le …”
  2. The tense-marking particle le 了 is ordinarily written as one unit with the verb it follows (Yin & Felley, 1990, p. 276). Therefore the standard way of writing is “Láile …”
  3. In this sentence, the standard defines le 了 not as a tense-marking, but rather as a mood-marking particle at the end of a sentence, and sets out the following rule: the particle le 了, appearing at the end of a sentence or clause, is written by itself (Yin & Felley, 1990, p. 278). Therefore, the standard way of writing is “… lái le.”
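
The three rules above can be sketched as a small decision function. This is only an illustration in Python, not the KEY5 algorithm: the function name `attach_le` and both of its parameters are hypothetical, and the caller is assumed to know already how the verb construction is segmented and whether le 了 closes the sentence or clause.

```python
def attach_le(verb_units, clause_final):
    """Return the verb construction followed by 'le' with standard spacing.

    verb_units   -- the verb construction as already-segmented Pinyin units,
                    e.g. ["zǒu", "jìnlái"] or ["lái"] (assumed input format)
    clause_final -- True if le closes the sentence or clause (mood-marking)
    """
    construction = " ".join(verb_units)
    if clause_final:
        # Rule 3: sentence-final or clause-final le is written by itself.
        return construction + " le"
    if len(verb_units) > 1:
        # Rule 1: verb + complement written as two units, so le stands apart.
        return construction + " le"
    # Rule 2: tense-marking le is written as one unit with the verb.
    return construction + "le"


print(attach_le(["zǒu", "jìnlái"], clause_final=False))
print(attach_le(["lái"], clause_final=False))
print(attach_le(["lái"], clause_final=True))
```

The hard part in practice is deciding the two inputs, which requires syntactic analysis of the sentence; the sketch only shows how the spacing follows once that decision has been made.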

To facilitate compliance through software algorithms, in the interest of those who are writing and learning Chinese on the computer, two questions come to mind:

  1. In a Chinese software system with Pinyin entry, can we make Pinyin input with non-standard spacing convert to the correctly spaced Chinese-character version?
  2. Can we, through back-conversion from Hànzì 汉字 to Pinyin or in two-line “Hanzi with Pinyin” mode, show or teach the writer/student the standard Pinyin orthography?

The following examples show that the answer to both questions is affirmative, as our team has just (2008) implemented the respective self-correcting algorithms.

  1. Non-standard input “zoujinlaile liang wei keren.” converts correctly into 走进来了两位客人。 This back-converts to the standard form “Zǒu jìnlái le liǎng wèi kèren.”, thus providing the orthography teaching effect.
  2. Non-standard input “lai le liang wei keren.” converts correctly into 来了两位客人。 This back-converts to the standard form “Láile liǎng wèi kèren.”, thus providing the orthography teaching effect.
  3. Non-standard input “keren laile.” converts correctly into 客人来了。 This back-converts to the standard form “Kèren lái le.”, thus providing the orthography teaching effect.
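
The round trip in these three examples can be imitated with a spacing-insensitive lookup. The sketch below is a toy under stated assumptions: the table `SENTENCES` and the functions `spacing_key` and `convert` are hypothetical names, and whole sentences are stored as single entries, whereas a real system would segment the input into words. It only illustrates the idea that spacing is discarded on input and restored from stored standard forms on output.

```python
import re

# Hypothetical mini-table: toneless, spacing-free key -> (Hanzi, standard Pinyin)
SENTENCES = {
    "zoujinlaileliangweikeren": ("走进来了两位客人。", "Zǒu jìnlái le liǎng wèi kèren."),
    "laileliangweikeren": ("来了两位客人。", "Láile liǎng wèi kèren."),
    "kerenlaile": ("客人来了。", "Kèren lái le."),
}


def spacing_key(typed):
    """Drop spaces, punctuation and case so that any spacing gives the same key."""
    return re.sub(r"[^a-z]", "", typed.lower())


def convert(typed):
    """Return (Hanzi, standard Pinyin) for the typed input, or None if unknown."""
    return SENTENCES.get(spacing_key(typed))


for typed in ("zoujinlaile liang wei keren.", "lai le liang wei keren.", "keren laile."):
    print(typed, "->", convert(typed))
```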


2. Orthography of four-syllable fixed expressions

A further problem area in Hanyu Pinyin orthography is the Pinyin rendering of chéngyǔ 成语 “fixed idioms”. These are set four-character expressions, and despite the standardization efforts and detailed hyphenation/spacing rules (Yin & Felley, 1990, pp. 457-489) we find many inconsistencies in spacing and in the use of the hyphen in such expressions in current dictionaries.

We observe these inconsistencies in a large number of idioms; as one of thousands of similar borderline cases, we take a closer look at the idiom hè lì jīqún 鹤立鸡群 “crane-like stand in a flock of chickens” (stand out from the crowd, be exceptional). Like many four-character idioms, the expression has a wényán 文言 (classical Chinese) infrastructure, according to which logical semantic groupings should be either “hè lì jīqún” or “hè lì jī qún”. But we find three different ways of writing. Which one is right?

  1. In the Xiàndài Hànyǔ Cídiǎn 现代汉语词典 (2002) we find the Pinyin form “hè lì jī qún”, which reflects the wényán 文言 infrastructure;
  2. In the Xīn Shídài Hàn-Yīng Dà Cídiǎn 新时代汉英大词典 (2001) we find the Pinyin version “hèlì-jīqún”, which is obviously inspired by the hyphenation rules;
  3. In the ABC Hàn-Yīng Dà Cídiǎn 汉英大词典 (2003) this idiom is rendered in Pinyin as one long string “hèlìjīqún”, as the editors did not see enough evidence of symmetry.

In such cases, the software solution we implemented combines two different approaches. For input purposes, any combination of the four syllables (with or without spaces or hyphens) converts correctly to 鹤立鸡群. In back-conversion from Hànzì 汉字 to Pinyin or in two-line “Hanzi with Pinyin” mode, 鹤立鸡群 back-converts by default to “hè lì jīqún”, the version with the best semantic segmentation for an infrastructure-based understanding; however, if the student has set the system to the “standard of 1996”, which suggests rendering non-symmetrical non-hyphenated expressions as one string, the back-conversion result will be “hèlìjīqún”.
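
The two approaches can be sketched as follows. This is an illustration only, not the KEY5 implementation: the table `IDIOMS`, the mode names `"semantic"` and `"standard_1996"`, and the functions `idiom_to_hanzi` and `idiom_to_pinyin` are all hypothetical.

```python
import re

# Hypothetical idiom table: toneless key without spaces or hyphens -> entry
IDIOMS = {
    "helijiqun": {
        "hanzi": "鹤立鸡群",
        "semantic": "hè lì jīqún",       # default: segmentation by meaning
        "standard_1996": "hèlìjīqún",    # non-symmetrical idiom as one string
    },
}


def idiom_to_hanzi(typed):
    """Accept the four syllables with any mix of spaces and hyphens."""
    key = re.sub(r"[^a-z]", "", typed.lower())
    entry = IDIOMS.get(key)
    return entry["hanzi"] if entry else None


def idiom_to_pinyin(hanzi, mode="semantic"):
    """Back-convert an idiom according to the selected orthography mode."""
    for entry in IDIOMS.values():
        if entry["hanzi"] == hanzi:
            return entry[mode]
    return None


for typed in ("he li ji qun", "heli-jiqun", "helijiqun"):
    print(typed, "->", idiom_to_hanzi(typed))
print(idiom_to_pinyin("鹤立鸡群"))
print(idiom_to_pinyin("鹤立鸡群", mode="standard_1996"))
```

Storing both renderings per idiom keeps the input side tolerant while making the output side a matter of one user setting.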


3. Number/measure-word combinations

A number/measure-word algorithm that includes correct English translation of these combinations (including rendering of the numbers and singular/plural forms of the objects) is proving useful for teaching Pinyin orthography. For example, the non-tonal Pinyin input “san ke shu” produces 三棵树, which dictionary-tool-tips as “3 trees” and back-converts to “sān kē shù”. The non-tonal input “san ke zhu” converts to 三颗珠, which dictionary-tool-tips as “3 pearls” and back-converts to “sān kē zhū”.
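
A minimal sketch of such a gloss is shown below. It assumes a phrase of the shape one-character number + one-character measure word + noun, and the tables `NUMBERS`, `MEASURE_WORDS` and `NOUNS` as well as the function `gloss` are hypothetical stand-ins for a real dictionary.

```python
# Hypothetical mini-dictionaries for illustration only
NUMBERS = {"一": 1, "两": 2, "三": 3, "四": 4, "五": 5}
MEASURE_WORDS = {"棵", "颗", "位"}
NOUNS = {
    "树": ("tree", "trees"),
    "珠": ("pearl", "pearls"),
    "客人": ("guest", "guests"),
}


def gloss(phrase):
    """Render number + measure word + noun as an English count phrase."""
    number = NUMBERS[phrase[0]]
    if phrase[1] not in MEASURE_WORDS:
        raise ValueError("expected a measure word after the number")
    singular, plural = NOUNS[phrase[2:]]
    # The measure word has no English counterpart here; only number and noun
    # are rendered, with the noun in singular or plural form as required.
    return f"{number} {singular if number == 1 else plural}"


print(gloss("三棵树"))
print(gloss("三颗珠"))
print(gloss("一棵树"))
```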


4. Reflect tone sandhi in Pinyin text (on the KEY5 “Language Properties” panel)

This new feature solves a problem encountered by all learners and teachers of Chinese: the tonal changes that happen when, for example, two or more third-tone syllables follow one another. Further, the two characters bu4 不 (“not”) and yi1 一 (“one”) change their tones depending on the tone of the character that follows. Here are the rules with examples:

  1. When there are two 3rd tones in a row, the first one becomes 2nd tone. Examples: 你好 (nǐ + hǎo = ní hǎo), 很好 (hěn + hǎo = hén hǎo), 好懂 (hǎo + dǒng = háodǒng).
  2. bù 不 is 4th tone except when followed by another 4th tone, when it becomes second tone. Examples: 不对 (bù + duì = búduì), 不去 (bù + qù = bú qù), 不错 (bù + cuò = búcuò).
  3. yī 一 is 1st tone when alone, 2nd tone when followed by a 4th tone, and 4th tone when followed by any other tone. Examples: 一个 (yī + gè = yí gè), 一次 (yī + cì = yí cì), 一半 (yī + bàn = yíbàn), 一般 (yī + bān = yìbān), 一毛 (yī + máo = yì máo), 一会儿 (yī + huǐr = yì huǐr). [Note the exception: 一 yī remains first tone in purely numeric expressions, such as when followed by another digit.]

Note: According to the Pinyin standard (1996) these tone changes should, by default, not be reflected in the Pinyin tone marks, and we have kept to this rule in KEY5. However, with the new feature on the Language Properties panel we have given the user a tool to set the system to show the tone sandhi in Hanzi with Pinyin mode (to toggle, use the new H/P button on the KEY5 toolbar). The words/characters subject to tone change have a grey background when the feature is turned on.
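
The three rules and the on/off option can be sketched as below. This is an illustration rather than the KEY5 code: the function names and the `show_sandhi` flag are hypothetical, tones are written as digits (as in bu4 and yi1 above), and the input is assumed to be a list of (Hanzi, syllable) pairs so that 不 and 一 are not confused with homophones. The numeric exception for 一 is not handled, a neutral-tone follower leaves 一 unchanged, and a run of more than two third tones is simplified so that every third tone directly followed by another third tone becomes second tone.

```python
def split_tone(syllable):
    """'hao3' -> ('hao', 3); a syllable without a digit counts as neutral (5)."""
    if syllable and syllable[-1].isdigit():
        return syllable[:-1], int(syllable[-1])
    return syllable, 5


def apply_sandhi(pairs, show_sandhi=True):
    """Return the syllables of (hanzi, syllable) pairs, with sandhi if enabled."""
    if not show_sandhi:
        # Default of the 1996 standard: tone marks show the citation tones.
        return [syllable for _, syllable in pairs]
    parsed = [(hanzi, *split_tone(syllable)) for hanzi, syllable in pairs]
    result = []
    for i, (hanzi, base, tone) in enumerate(parsed):
        # Decisions use the citation tone of the following syllable.
        next_tone = parsed[i + 1][2] if i + 1 < len(parsed) else None
        new_tone = tone
        if hanzi == "不" and tone == 4 and next_tone == 4:
            new_tone = 2                      # rule 2: bù before 4th tone
        elif hanzi == "一" and tone == 1 and next_tone == 4:
            new_tone = 2                      # rule 3: yī before 4th tone
        elif hanzi == "一" and tone == 1 and next_tone in (1, 2, 3):
            new_tone = 4                      # rule 3: yī before tones 1 to 3
        elif tone == 3 and next_tone == 3:
            new_tone = 2                      # rule 1: 3rd tone before 3rd tone
        result.append(pairs[i][1] if new_tone == tone else f"{base}{new_tone}")
    return result


print(apply_sandhi([("你", "ni3"), ("好", "hao3")]))
print(apply_sandhi([("不", "bu4"), ("对", "dui4")]))
print(apply_sandhi([("一", "yi1"), ("般", "ban1")]))
print(apply_sandhi([("你", "ni3"), ("好", "hao3")], show_sandhi=False))
```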

In summary, it should be our goal to strive for the implementation of the Pinyin orthography standards. After all, it was Confucius who warned “Míng bù zhèng zé yán bù shùn, yán bù shùn zé shì bù chéng 名不正则言不顺, 言不顺则事不成”: if the words are not clear there’s no communication, and without communication things don’t get done.


References

Jiao, Fan. 2001. A Chinese-English Dictionary of Measure Words. Beijing: Sinolingua.

Zhou, Youguang. 2003. The Historical Evolution of Chinese Languages and Scripts (Pathways to Advanced Skills Series, vol. 8), translated by Zhang Liqing. Columbus, Ohio: Ohio State University National East Asian Language Resource Center.

Yin, Binyong & Felley, Mary. 1990. Chinese Romanization: Pronunciation and Orthography. Beijing: Sinolingua.

Yin, Binyong (Ed.). 2002. Xinhua pinxie cidian (Xinhua dictionary of pinyin spelling). Beijing: Shangwu Yinshuguan.

吴景荣、程镇球. 2001. 新时代汉英大词典. 北京,商务印书馆

现代汉语词典. 2002. 北京,商务印书馆

中华人民共和国国家标准 汉语拼音正词法基本规则
Basic rules for Hanyu Pinyin Orthography. 1996.
国家技术监督局 1996-01-22 批准、发布 1996-07-01 实施, at