Hakka Affairs Council (HAC) Minister Lee Yung-de attended a Nov. 29 presentation on the first-stage results of constructing Taiwan’s very own Hakka corpus. Minister Lee stated that, combined with artificial intelligence technology, the public-facing Hakka corpus will successfully preserve the heritage of the Hakka people while offering the general public quick and easy access to written and spoken materials on the Hakka.

The minister noted that suppression of the Hakka language over the past 50 years caused the number of Hakka speakers to decline rapidly. To conserve and pass down the language through digitalization, HAC has been building the Hakka corpus, Taiwan’s first national-language database, since 2017, Lee added. He explained that, with AI translation techniques, the corpus will help Hakka speakers communicate with users of more prevalent languages such as English and Japanese.

At the Nov. 29 presentation, the three principal investigators of the Hakka corpus project demonstrated the progress their team has made in building the corpus. All three are National Chengchi University (NCCU) professors: Lai Huei-ling (賴惠玲) of the Department of English, Liu Jyi-shane (劉吉軒) of the Department of Computer Science, and Liu Hui-wen (劉慧雯) of the Department of Journalism.

Professor Lai stated that the Hakka corpus is the product of interdisciplinary cooperation, adding that its construction is a time-consuming process that relies on experts in the fields of linguistics, computer science, and communications. These professionals led the team in collecting materials, processing the data, and establishing the whole system. Hakka can now be compared with other languages, which allows the corpus to be further utilized by the general public.

The raw Hakka linguistic data, comprising both written text and spoken content, comes from TV programs, publications, recordings from field research, interviews, speeches, daily conversations, and oral histories shared by elders. These materials must be transliterated and revised by native Hakka speakers through several complex procedures before they are added to the corpus.

Hence, the NCCU team recruited a group of Hakka teachers with different regional accents to take part in the project, help process the linguistic components, and preserve the Hakka language. The NCCU experts have also rigorously reexamined and corrected the raw data for the system’s machine-learning process.

At present, the project team has negotiated the copyrights for 316 publications and 149 articles and has processed 5 million words of written text and 100,000 words of spoken content. The corpus is scheduled to be made officially available by the end of 2022.