# Aozora Bunko kanji frequency

Here you can find data representing the usage frequency of kanji as found in the Aozora Bunko digital library collection of texts.

The data can be accessed here:

► Kanji usage frequency table (.txt UTF-8)

Kanji outside the JIS X 0208 set are also taken into account.

Rationale

I was inspired by Dmitry Shpika’s Kanji usage frequency lists. Incidentally, I would like to thank him for putting up this data. Please check out his work. He made kanji usage data using several sources, including Twitter and Wikipedia. Very nice programming work!

However, after using his Aozora Bunko list for a while, I noticed two weaknesses which skewed the statistics somewhat.

First, some kanji radicals or elements that are not normally used on their own received relatively high rankings. One would expect such elements to occur rarely, if at all. For example, in Shpika’s list, 廴, a radical not used on its own, is stated to occur 1595 times and is ranked as the 2294th most common kanji. The explanation is simple: when a kanji outside the JIS X 0208 set appears in a text, the Aozora Bunko policy is to describe it by its simpler parts. For instance, 𢌞 (it may not be displayed correctly if you don’t have a suitable font installed) is written ※［＃「廴＋囘」、第4水準2-12-11］, where 廴＋囘 is the kanji decomposition and 第4水準2-12-11 is the JIS X 0213 code point (level 4, plane 2, row 12, cell 11). A counter that ignores this notation therefore counts 廴 and 囘 instead of 𢌞.
Second, the counting methodology is not quite right and tends to favor some kanji. Every table of kanji usage frequency I’ve found online, by Shpika or by others, is made by simply counting the number of times a given kanji is found in a whole text corpus and dividing by the total number of kanji in the corpus. However, the resulting data is biased and not really representative of the usage of each kanji, especially the less common ones. The reason is that if an uncommon kanji appears in a given book, chances are it appears several times in that book. This is especially the case for character names and place names. Let’s stretch this reasoning to an extreme and consider a book in which a character’s name contains a very rare kanji. Let’s say this kanji is so rare that it doesn’t appear in any of the several thousand other books in the collection. The character’s name may appear, say, a few dozen times in the whole book. The rare kanji will thus be counted a few dozen times even though no other author in the collection has ever used it.
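Both weaknesses can be reproduced on a toy corpus. The following is only an illustrative sketch, not the code used by Shpika or for the table on this page: the three texts are made up, `naive_counts` is a hypothetical helper, and the kanji test is a rough assumption that only covers the basic CJK Unified Ideographs block.

```python
import re
from collections import Counter

# Rough kanji test (assumption): basic CJK Unified Ideographs block only.
KANJI = re.compile(r"[\u4e00-\u9fff]")

def naive_counts(texts):
    """Count every kanji occurrence over the whole corpus."""
    counts = Counter()
    for text in texts:
        counts.update(KANJI.findall(text))
    return counts

# Made-up corpus, for illustration only.
texts = [
    "劉" * 30 + "山川",  # one book repeating a name kanji 30 times
    "山と川",
    "※［＃「廴＋囘」、第4水準2-12-11］る山",  # annotation left unparsed
]

counts = naive_counts(texts)
print(counts["劉"], counts["山"], counts["廴"])
```

In this toy corpus 劉 outranks 山 although it occurs in a single text, and 廴 gets a count purely because of the annotation. The kanji in 第4水準 are counted as well.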

Eliminating these biases is easy.

For the first issue, either remove all character decompositions from the texts, effectively ignoring them, or, better still, write a program to parse these decompositions and thus take into account kanji outside JIS X 0208. I chose the latter.
For the second issue, change the way kanji occurrences are counted: for each kanji, I count the number of texts in which it appears, rather than the total number of times it appears in the whole corpus.
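The two fixes could be sketched roughly as follows. This is not the program actually used to build the table; it is a minimal illustration under several assumptions: it only recognizes the annotation form quoted earlier (other Aozora Bunko annotation variants are ignored), it relies on Python's standard `euc_jisx0213` codec to turn a plane-row-cell position into a character, and the names `GAIJI`, `KANJI`, `resolve_gaiji`, `kanji_in_text` and `text_counts` are hypothetical.

```python
import re
from collections import Counter

# Matches only annotations of the form ※［＃「…」、第N水準P-R-C］ (assumption).
GAIJI = re.compile(r"※［＃「[^」]*」、第[34]水準([12])-(\d{1,2})-(\d{1,2})］")

# Rough kanji test (assumption): basic CJK block, Extension A, Extension B.
KANJI = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff\U00020000-\U0002a6df]")

def resolve_gaiji(match):
    """Replace an annotation with the character at its JIS X 0213 position."""
    plane, row, cell = (int(g) for g in match.groups())
    try:
        # EUC-JISX0213: plane 1 uses two bytes, plane 2 adds a 0x8F prefix.
        raw = bytes([row + 0xA0, cell + 0xA0])
        if plane == 2:
            raw = b"\x8f" + raw
        return raw.decode("euc_jisx0213")
    except ValueError:
        # Out-of-range or unassigned position: drop the annotation.
        return ""

def kanji_in_text(text):
    """Set of distinct kanji in one text, after resolving annotations."""
    return set(KANJI.findall(GAIJI.sub(resolve_gaiji, text)))

def text_counts(texts):
    """For each kanji, count the number of texts in which it appears."""
    counts = Counter()
    for text in texts:
        counts.update(kanji_in_text(text))
    return counts
```

Because each text contributes a set of kanji, a name repeated dozens of times in one book adds only 1 to that kanji's count, and the parts listed inside an annotation are never counted.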

Comparison of results

Comparing Shpika’s table with the new one shows the desired outcome has been achieved. Here are some examples regarding the two points mentioned above.

The 廴 character mentioned earlier doesn’t appear at all in the new table, as one would expect. Some other radicals appear in a small number of texts, either because they are used in very rare or archaic words or names, or because the author intentionally used them to describe a composite character.
The 劉 character, which is used almost exclusively in names, was overrepresented in Shpika’s list, where it ranked 1946th; it now ranks 3526th.

Of course, the two tables don’t show fundamentally different trends: common, everyday-use kanji are ranked first in both tables, and conversely, very rare kanji are found near the end of both tables. The difference between the results is most interesting for kanji in the middle, neither too common nor too rare.