# Aozora Bunko jukugo frequency by kanji

Here you can find, for each kanji, the compound words (or jukugo 熟語) in which it appears in the Aozora Bunko digital library collection of texts, and the frequency of each of these words.

► View table in your browser
► Download data for Windows (.7z, UTF-16 text files with CRLF line endings)
► Download data for Mac/Unix/Linux (.tar.bz2, UTF-8 text files with LF line endings)

Purpose

  1. Beginner or intermediate learners of Japanese can use this data to optimize their studies by learning the most common words first. That being said, (a) lists of common words already exist, some of them sorted by frequency, and (b) the source I used, Aozora Bunko, skews the data for common kanji, as it consists mostly of old books which don’t always reflect the contemporary usage of words in everyday contexts (conversation, news, contemporary novels…).

  2. Advanced learners of Japanese, and maybe even native speakers, who are learning rare kanji are more likely to find this data useful. Let’s take an example. Say you encounter the kanji 耆 and decide to learn its meanings and the words in which it is found. In this case, Japanese-English dictionaries such as the widely used EDICT won’t be of much use, since they naturally include few or no word entries for rare kanji. You can use an online monolingual Japanese dictionary such as Kotobank, but you will find more than 60 entries containing 耆, with no information about their respective frequencies. By looking at the data I provide here instead, you will see that the kanji 耆 appears in 14 different jukugo in the Aozora Bunko, the most common ones being 伯耆, 耆宿 and 耆婆.

Sources and methodology

The 14,000+ text files of the Aozora Bunko digital library were used for the corpus.

A large list of word entries was made by combining the EDICT dictionary, a list of yojijukugo extracted from 四字熟語辞典 ONLINE, a list of entries from the Dai Kan-Wa Jiten and a list of entries from the Hanyu Da Cidian, the last two made available by the Kanji Database Project.

I made a script which performs the following steps:

  1. For each kanji found in the corpus (including kanji outside the JIS X 0208 set), it searches for all sequences of one or more kanji ( and were considered valid kanji) containing that kanji. Occurrences of 々 and 々々 are expanded into the appropriate kanji so as not to miss any jukugo (for many jukugo with repeated kanji, the dictionaries used include both an entry with 々 and an entry without, but not systematically).

  2. For each sequence, the script looks for the longest substring which can be found in the dictionary entries.

  3. Statistics are computed using the number of texts in which a word is found, rather than by counting the total number of occurrences of the word in the whole corpus. This methodology is similar to the one I used in this project, where I explain why it gives more representative results.
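
The three steps above can be sketched roughly as follows. This is not the original script, only a minimal Python illustration written under several assumptions: the kanji character range in `KANJI_RUN`, the function names, and the rule that the matched substring must contain the kanji being indexed are all hypothetical choices made for the example, and 々 is the only non-kanji character accepted inside a sequence.

```python
import re
from collections import Counter

# Assumption: "kanji" means CJK Unified Ideographs plus Extension A,
# with the iteration mark 々 also allowed inside a sequence.
KANJI_RUN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff々]+")


def expand_iteration_marks(run):
    """Expand 々 (repeat previous kanji) and 々々 (repeat previous two kanji)."""
    out = []
    i = 0
    while i < len(run):
        if run.startswith("々々", i) and len(out) >= 2:
            out.extend(out[-2:])
            i += 2
        elif run[i] == "々" and out:
            out.append(out[-1])
            i += 1
        else:
            out.append(run[i])
            i += 1
    return "".join(out)


def longest_entry(run, kanji, dictionary):
    """Return the longest substring of `run` that contains `kanji`
    and is a dictionary entry, or None if there is no such substring."""
    for length in range(len(run), 0, -1):
        for start in range(len(run) - length + 1):
            candidate = run[start:start + length]
            if kanji in candidate and candidate in dictionary:
                return candidate
    return None


def jukugo_document_frequency(texts, dictionary):
    """Map each kanji to a Counter of word -> number of texts containing it."""
    freq = {}
    for text in texts:
        seen = set()  # (kanji, word) pairs found in this text, counted once
        for match in KANJI_RUN.finditer(text):
            run = expand_iteration_marks(match.group())
            for kanji in set(run) - {"々"}:
                word = longest_entry(run, kanji, dictionary)
                if word is not None:
                    seen.add((kanji, word))
        for kanji, word in seen:
            freq.setdefault(kanji, Counter())[word] += 1
    return freq
```

Calling `jukugo_document_frequency(texts, dictionary)` with a list of text strings and a set of dictionary entries returns, for each kanji, how many texts contain each matched word. Collecting `(kanji, word)` pairs in a per-text set before counting is what makes this a count of texts rather than a count of occurrences, as described in step 3.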

Caveats