Taiwan Hakka Language Corpus


Policy Goal:
To make the collection and organization of Taiwan’s Hakka language more complete, to sustainably preserve Hakka cultural assets, and to promote Hakka language teaching, value-added applications, and research and development, we plan to establish a Taiwan Hakka language corpus. A linguistic corpus records language as it is used at the time, so it is valuable for collecting and preserving a language. Moreover, because a corpus is systematic and convenient to use, it not only strengthens research but also gives language students a means of self-study. For these reasons, establishing a corpus is an urgent task for the Taiwan Hakka language, which is at serious risk of being lost.


Implementation Overview:
Since 2017 the Council has been planning the establishment of the Taiwan Hakka language corpus. It expects to collect at least 6 million written words and 400,000 spoken words covering the Sixian (including Nansixian, or South Sixian), Hailu, Dapu, Raoping and Zhao’an dialects of Hakka, and to complete the corpus gradually over five years in three stages.
Stage One (years one and two) deals with the content of the corpus, including the number, genre, style, subject and mode of the written and spoken words to be collected, and the dialects that differ in spoken language. Its end product is a systematic framework that covers the presentation of the completed corpus, the digitization of Hakka written texts and of the spoken language of the different dialects, hyphenation markers, part-of-speech markers and speech markers, metadata creation, the processing and storage of corpus content, and interface work.

Stage Two (years three and four) deals with the information system that accompanies the corpus, continually expanding, testing and fixing the Stage One corpus on a rolling basis and bringing down the error rate for hyphenation and markers.

Stage Three (year five) covers the overall assessment and revision of the corpus, including fixing and optimizing the corpus system interface, surveying users of the corpus website, and writing up the corpus content and user instructions.

The project is about to enter Stage Three. As of October 2021 the corpus comprised 5.8 million words of written language and 340,000 words of spoken language. The corpus system platform began going online at the end of 2021, and the remaining work of establishing the Hakka language corpus will continue according to the planned schedule.