February 24, 2014

Introduction to the Database

Introduction to the Chinese Single-character Word Database (CSWD)

This online database is based on Liu's (2006) doctoral dissertation at Beijing Normal University and was constructed by Seth Levine (Johns Hopkins University) and Xiaowei Zhao (University of Richmond).

The main purpose of this database is to help you select appropriate materials for your experiment. You can quickly obtain indices for fifteen variables that are considered potentially important to word processing, which saves you from having to collect ratings for the characters yourself when you run an experiment. All that is left for you to do is select the variables pertinent to your study and let the database facilitate your research.

To date, the database includes only 2,390 single-character Chinese words (no two-character or multi-character words). These cover almost all the single-character nouns, verbs, and adjectives in modern Chinese and were selected from the Language Corpus System of Modern Chinese Studies (LCSMCS; Sun, Huang, Sun, Li, & Xing, 1997). For more details, please read Liu et al. (2007)*.

For each word, there are 16 categories: PinYin (Chinese pronunciation), grammatical category, word frequency, homophone density, phonological frequency, cumulative frequency, number of components, number of strokes, number of word formations, age of learning in textbooks, number of meanings, phonological regularity, age of acquisition, word concreteness, concept familiarity, and imageability.

  • Word frequency – the frequency of the character when used as a single-character word, extracted from LCSMCS (Sun, Huang, Sun, Li, & Xing, 1997).
  • Cumulative frequency – the summed frequency of all the words in which a character appears, extracted from the Balanced Corpus of Modern Chinese (Sun, 2006), which is the largest electronic database of Chinese to date (about 660 million characters).
  • Phonological frequency and homophone density of the characters were calculated from the Modern Chinese Frequency Dictionary (Wang, 1986). Phonological frequency refers to the total frequency of all the characters that share the same pronunciation. All three frequency indices are expressed as occurrences per million.
  • Age of learning refers to the time at which the learner is exposed to a given character, and this is determined by the time when a character is first introduced in standard school textbooks (People's Education Press, 2006).
  • Number of meanings was extracted from the Dictionary of Chinese Character Information (DCCI, Science Publishers, 1988). According to DCCI, about 53% of Chinese characters have one meaning, 21% have two meanings, 19% have three or more meanings, and the remaining 7% have no meanings of their own (i.e., they are characters bound to other characters to make words). This dictionary subdivided characters into six categories, based on the number of meanings that a character can have: 0 (no meaning), 1 (one meaning), 2 (two meanings), 3 (three to four meanings), 4 (five to eight meanings), and 5 (nine or more meanings).
  • Age of acquisition (AoA) refers to the time when the learner has acquired the meaning and pronunciation of the character.
  • Familiarity, concreteness, and imageability are standard lexical characteristics that have been examined in previous studies of alphabetic languages. The values for these three variables and age of acquisition were obtained by subjective ratings based on the method and procedure of Barca et al. (2002). A total of 480 native Chinese speakers (122 males) with a mean age of 20.3 years (range = 18-23 years) participated in the ratings.
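The six-level coding for number of meanings described above can be illustrated with a short Python sketch. It is only an illustration of the category boundaries stated in the DCCI description; the function name `meanings_category` and the idea of starting from a raw integer count are assumptions made here, not part of the database or the dictionary.

```python
def meanings_category(n_meanings: int) -> int:
    """Map a raw count of meanings to the six-level code (0-5).

    Hypothetical helper: the boundaries follow the DCCI description
    (0, 1, 2, 3-4, 5-8, 9 or more); the name and input are assumptions.
    """
    if n_meanings < 0:
        raise ValueError("number of meanings cannot be negative")
    if n_meanings <= 2:
        return n_meanings  # counts of 0, 1, and 2 are coded as themselves
    if n_meanings <= 4:
        return 3           # three to four meanings
    if n_meanings <= 8:
        return 4           # five to eight meanings
    return 5               # nine or more meanings


if __name__ == "__main__":
    for count in (0, 1, 2, 3, 4, 5, 8, 9, 20):
        print(count, "->", meanings_category(count))
```

Note that codes 3 to 5 are ordinal bins rather than counts, so a value of 4 in this column means "five to eight meanings", not "four meanings".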

For more details about the variables and other information please refer to Liu (2006) and Liu, Shu, & Li (2007).

If you use data from the database, please cite the hyperlinked reference below.


