July 12, 2021

Phonological Representation Database for Chinese Characters: Introduction to the database

This online database was constructed by Xiaowei Zhao (University of Richmond) and Ping Li (Pennsylvania State University). The purpose of this project is to build an easily accessible computational database of representations that accurately capture the phonological features of Mandarin Chinese characters. We hope that this database will be useful for psycholinguists interested in computational modeling of Chinese language processing or acquisition.

Similar to the PatPho system (Li & MacWhinney, 2002), our phonological representation builds on the idea of syllabic templates to avoid the problems caused by the variable phonemic length of Mandarin characters. In particular, based on their phonemic patterns, all possible sounds of Mandarin characters (monosyllables) are fit into a template of CVVVC[T] (C stands for consonant, V for vowel, and T for tone, which is optional) and then represented by numerical codes (real-valued or binary) based on each phoneme's phonological features. Our system can accurately and economically represent all possible sounds of single Mandarin characters. Although the template includes only one syllable (because almost all Mandarin characters are monosyllabic), our system can easily be extended to represent Chinese words of more than one syllable.
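
As a minimal sketch of the template idea, the snippet below places a syllable's phonemes into the five C V V V C slots and appends an optional tone. The alignment rule (initial consonant in the first slot, vowels left-justified in the middle, final consonant in the last slot), the padding symbol, the function name `fit_template`, and the small phoneme sets are all assumptions made for illustration; the database's own alignment conventions may differ.

```python
# Illustrative subsets only; the full inventory has 34 phonemes (see Table 2).
CONSONANTS = {"p", "t", "k", "m", "n", "ŋ", "s", "l"}
VOWELS = {"a", "i", "u", "o", "e", "y"}


def fit_template(phonemes, tone=None, pad="_"):
    """Place phonemes into C V V V C slots (hypothetical alignment rule)."""
    slots = [pad] * 5
    rest = list(phonemes)
    # An initial consonant, if present, fills the first C slot.
    if rest and rest[0] in CONSONANTS:
        slots[0] = rest.pop(0)
    # A final consonant, if present, fills the last C slot.
    if rest and rest[-1] in CONSONANTS:
        slots[4] = rest.pop()
    # Whatever remains must be one to three vowels, left-justified.
    if len(rest) > 3 or any(p not in VOWELS for p in rest):
        raise ValueError(f"cannot fit {phonemes!r} into CVVVC")
    for i, vowel in enumerate(rest):
        slots[1 + i] = vowel
    # The tone slot is optional and is only added when a tone is given.
    if tone is not None:
        slots.append(str(tone))
    return slots


print(fit_template(["k", "u", "a", "ŋ"]))  # ['k', 'u', 'a', '_', 'ŋ']
print(fit_template(["m", "a"], tone=1))    # ['m', 'a', '_', '_', '_', '1']
```

Because every syllable ends up with the same number of slots, each slot can later be replaced by a fixed-length feature code, which gives every syllable a vector of the same length.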

Hanyu Pinyin (the Chinese romanization system) is widely used in China and other parts of the world. It is a simple system and easy to learn, but it is so economical that many phonological patterns cannot be explicitly represented in it. For example, the symbol [e] represents four different IPA phonemes depending on its position in the syllable. To represent the phonological system more accurately, we re-transcribed the Pinyin symbols into symbols based on the IPA (International Phonetic Alphabet). The conversion table between Pinyin and IPA symbols is given in Table 1.
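
To illustrate why a context-sensitive re-transcription is needed, the sketch below looks up the vowel written as "e" in a few Pinyin finals. The IPA values follow commonly cited textbook descriptions of Mandarin and are an assumption here; they are not copied from Table 1, which remains the authoritative mapping for this database. The names `E_IN_FINAL` and `ipa_for_e` are hypothetical.

```python
# Broad IPA value of the letter "e" in selected Pinyin finals.
# ASSUMPTION: values follow common textbook descriptions, not Table 1.
E_IN_FINAL = {
    "e": "ɤ",    # open syllable, as in "ge"
    "ei": "e",   # before the glide i, as in "gei"
    "ie": "ɛ",   # after i, as in "xie"
    "üe": "ɛ",   # after ü, as in "lüe"
    "en": "ə",   # before a nasal coda, as in "gen"
    "eng": "ə",  # before a nasal coda, as in "geng"
}


def ipa_for_e(final):
    """Return the IPA vowel that Pinyin "e" stands for in the given final."""
    try:
        return E_IN_FINAL[final]
    except KeyError:
        raise ValueError(f"final {final!r} is not covered by this sketch")


for final in ("e", "ei", "ie", "en"):
    print(final, "->", ipa_for_e(final))
```

One Pinyin letter maps to several distinct vowels, so a feature coding built directly on Pinyin letters would assign the same code to sounds that differ articulatorily.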

This process yielded a total of 34 phonemes; they and their three-dimensional articulatory features are listed in Table 2. To convert the articulatory features into a numerical representation for each phoneme, we replaced the features (including tone features) with numerical values scaled to the range 0 to 1, as shown in Table 3. Thus, the closer the numerical values are, the more similar the articulatory features should be. We also provide an alternative coding of the features with binary values, as shown in Table 4.
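
The following sketch shows one way an ordered articulatory dimension could be mapped onto values between 0 and 1 so that neighboring categories receive nearby numbers. The even spacing, the category list, and the function name `scale_levels` are assumptions for illustration; the actual values used in the database are those in Table 3.

```python
def scale_levels(levels):
    """Map an ordered list of feature labels to evenly spaced values in [0, 1]."""
    if len(levels) == 1:
        return {levels[0]: 0.0}
    step = len(levels) - 1
    return {label: index / step for index, label in enumerate(levels)}


# Hypothetical ordering of places of articulation, front to back.
PLACE = scale_levels(
    ["bilabial", "labiodental", "alveolar", "retroflex", "palatal", "velar"]
)

# Adjacent places get closer codes than distant ones.
near = abs(PLACE["bilabial"] - PLACE["labiodental"])
far = abs(PLACE["bilabial"] - PLACE["velar"])
print(PLACE)
print(near < far)  # True
```

With such a coding, the numerical distance between two phonemes on a dimension reflects how far apart they are articulatorily, which is what a binary one-hot coding of the same categories would not express.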

With the above coding, we can represent the 34 phonemes as numerical values. Table 5 shows the results based on the real-valued codes (Table 3). To test the effectiveness of our representation of these phonemes, we conducted a cluster analysis of the 34 phonemes based on the coding in Table 3. The result in Figure 1 shows that our representation captures the phonological similarities among the phonemes: vowels are grouped together and consonants are grouped together. Within each category, similar phonemes are placed close to each other (e.g., the phonemes [p] and [p'] differ only in aspiration, and they fall on the same branch).
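
A cluster analysis of this kind can be sketched with a small agglomerative procedure. The example below uses average linkage over Euclidean distances, which is an assumption, since the text does not state which clustering method produced Figure 1. The five three-dimensional codes are invented for illustration and are not the values in Table 5.

```python
import math
from itertools import combinations

# HYPOTHETICAL codes for illustration only; see Table 5 for the real ones.
CODES = {
    "p": (1.0, 0.1, 0.1),
    "p'": (1.0, 0.1, 0.2),  # differs from "p" on one dimension only
    "m": (1.0, 0.1, 0.6),
    "a": (0.0, 0.5, 1.0),
    "i": (0.0, 0.1, 0.2),
}


def average_linkage(codes):
    """Repeatedly merge the two closest clusters; return the merge history."""
    def dist(a, b):
        # Mean pairwise Euclidean distance between members of a and b.
        total = sum(math.dist(codes[x], codes[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    clusters = [(name,) for name in codes]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        merges.append((a, b, dist(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


for left, right, distance in average_linkage(CODES):
    print(left, "+", right, f"at {distance:.3f}")
```

With these toy values, the two stops that differ only on one dimension merge first, the nasal joins them next, and the two vowels form their own branch before everything is joined, which mirrors the kind of grouping described for Figure 1.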

To further test the effectiveness of our method for representing the entire Mandarin phonological system in modeling studies, we used the real-valued representations of all Mandarin monosyllables (without tones) as input to a self-organizing feature map (SOFM) neural network with 50x60 nodes. The results after 200 epochs of training are shown in Figure 2. Our representation captures the basic phonological patterns of Mandarin Chinese: similar sound patterns are grouped together. A U-matrix analysis of the map (Figure 3) further supports the emergence of phonological categories on the map (light colors indicate the boundaries between categories).
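
For readers unfamiliar with self-organizing maps, here is a minimal sketch of SOFM training and a U-matrix computation in plain Python. The learning-rate schedule, Gaussian neighborhood, random initialization, small grid, and toy input vectors are all assumptions chosen to keep the example short; the text specifies only the 50x60 map size and the 200 training epochs.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the (row, col) of the node whose weight vector is closest to x."""
    cells = ((r, c) for r in range(len(weights)) for c in range(len(weights[0])))
    return min(cells, key=lambda rc: math.dist(weights[rc[0]][rc[1]], x))


def train_som(data, rows, cols, epochs=200, lr0=0.5, seed=0):
    """Train a SOM with linearly decaying learning rate and neighborhood."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    sigma0 = max(rows, cols) / 2
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)
        sigma = max(sigma0 * (1 - frac), 0.5)
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    # Gaussian neighborhood centered on the winning node.
                    h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2 * sigma ** 2))
                    w = weights[r][c]
                    for k in range(dim):
                        w[k] += lr * h * (x[k] - w[k])
    return weights


def u_matrix(weights):
    """Mean distance from each node to its 4-connected grid neighbors."""
    rows, cols = len(weights), len(weights[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            nbrs = [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if 0 <= r + dr < rows and 0 <= c + dc < cols]
            out[r][c] = sum(math.dist(weights[r][c], weights[nr][nc])
                            for nr, nc in nbrs) / len(nbrs)
    return out


# Toy inputs (hypothetical); real inputs would be the coded CVVVC templates.
data = [[1.0, 0.1, 0.1], [1.0, 0.1, 0.2], [1.0, 0.1, 0.6],
        [0.0, 0.5, 1.0], [0.0, 0.1, 0.2], [0.0, 0.9, 0.8]]
som = train_som(data, rows=6, cols=8, epochs=50)
print(best_matching_unit(som, data[0]))
print(max(max(row) for row in u_matrix(som)))
```

High values in the U-matrix mark nodes whose weights differ sharply from their neighbors, which is why such cells are read as boundaries between categories on the trained map.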

For more details about the system and other information, please refer to Zhao & Li (in preparation).

If you use data from the database, please cite the hyperlinked reference below.

  • Li, P., & MacWhinney, B. (2002). PatPho: A phonological pattern generator for neural networks. Behavior Research Methods, Instruments, and Computers, 34, 408-415.
  • Zhao, X., & Li, P. (in preparation). An online database of phonological representation for Mandarin Chinese monosyllables.