Unprecedented Hakka corpus being constructed by Taiwan


Hakka Affairs Council (HAC) Minister Lee Yung-de attended a Nov. 29 presentation on the first-stage results of constructing Taiwan’s very own Hakka corpus. Minister Lee stated that, combined with artificial intelligence technology, the public-facing Hakka corpus will successfully preserve the heritage of the Hakka people while offering the general public quick and easy access to written and spoken materials on the Hakka.


The minister noted that the suppression of Hakka over the past 50 years caused the number of Hakka speakers to decline rapidly. To conserve and pass down Hakka through digitization, HAC has been building the Hakka corpus, Taiwan’s first national-language database, since 2017, Lee added. He explained that, with AI translation techniques, the corpus will help Hakka speakers communicate with users of more prevalent languages such as English and Japanese.


HAC Minister Lee Yung-de attended a presentation on constructing Taiwan’s very own Hakka corpus

At the Nov. 29 presentation, the principal investigators of the Hakka corpus project — National Chengchi University (NCCU) professors Lai Huei-ling (賴惠玲) of the Department of English, Liu Jyi-shane (劉吉軒) of the Department of Computer Science, and Liu Hui-wen (劉慧雯) of the Department of Journalism — demonstrated the progress of the corpus being constructed by their team.

Professor Lai stated that the Hakka corpus is the product of interdisciplinary cooperation, adding that its construction is a time-consuming process that relies on experts in the fields of linguistics, computer science, and communications. These professionals led the team in collecting materials, processing the data, and establishing the whole system. The target language can now be compared with other languages so that the corpus can be further utilized by the general public.

The raw Hakka linguistic data, including written text and spoken content, comes from TV programs, publications, recordings from field research, interviews, speeches, daily conversations, and oral histories shared by elders. These resources have to be transliterated and revised by native Hakka speakers through several complicated procedures before being added to the corpus.

Hence, the NCCU team recruited a group of Hakka teachers with different regional accents to take part in the project, help process the linguistic components, and preserve Hakka. The NCCU experts have also rigorously reexamined and corrected the raw data for the system’s machine-learning procedure.

At present, the project has successfully negotiated the copyrights for 316 publications and 149 articles and has processed 5 million words of written text and 100,000 words of spoken content. The corpus is scheduled to be made officially available by the end of 2022.