“Some people speak without communicating.” Tan Lee is a Professor in the Department of Electronic Engineering at CUHK with over 30 years of research experience in speech technology, including the “Speech-to-Text” and “Text-to-Speech” technologies that we use on a daily basis. Although au fait with abstruse computerese, Prof Lee is mindful of making himself understood.
“Good voice” of CUHK
“Long-serving academic staff are sometimes so absorbed in their own lectures that they tend to talk non-stop. I often remind myself not to do so, albeit rather unsuccessfully.” His text message before the interview seemed to imply that it would be a lengthy conversation. It turned out to be a very lively two-hour chat, during which he shared his experience as the vice-leader of the CUHK Choir as a student, his research, interdisciplinary and knowledge transfer projects, and his recent work designing an individualised “phonograph” for a patient with laryngeal cancer.
“I’m very conscious of what and how I speak in class, especially in the past decade.” Tan Lee is careful not only about what he says, but also about whether students understand. The 12 exemplary teaching awards he has received in the past 20 years, including 9 awarded in a row, are testimony to his reputation.
Despite his fluency in Cantonese romanisation and the International Phonetic Alphabet (IPA), Tan Lee, trained as an electronic engineer, only took an interest in linguistics 30 years ago during his doctoral studies at CUHK. “My first doctoral research was on Cantonese speech recognition.”
From mathematics to Cantonese linguistics
Having spent the majority of his life at Ma Liu Shui, where CUHK is situated, Tan Lee was admitted to CUHK in 1984 to study mathematics after a one-year High Level matriculation course, and transferred to electronic engineering in Year 2. “I believe that if you excel in mathematics, you excel in anything. I think highly of mathematicians, such as Prof Conan Leung. He was so brilliant that he studied mathematics with Prof Yau Shing Tung even before he completed his undergraduate degree.” Notwithstanding his love of mathematics, having one year less in high school meant “disastrous” grades in his first year at university. He persuaded the department head to let him transfer to the more “pragmatic” study of electronic engineering.
“I went on to an MPhil at CUHK and worked as a teaching assistant at CityU (then City Polytechnic) afterwards.” As he pondered whether to pursue a PhD at the end of his work contract, he bumped into Prof Ching Pak Chung and Prof Chan Lai Wan at the train station. They told him of their prospective research project on Cantonese speech recognition, and he seized the chance to join their research team. “I hence began my doctoral study at CUHK in 1992.” Tan Lee quotes all names and dates from his commendable memory.
Although much of speech recognition is processed by computer, research in this field cannot do without a basic knowledge of linguistics. “I was clueless about Cantonese linguistics, so I sought help from Prof Eric Zee, an anthropologist who was also proficient in phonetics. I attended his seminars at CityU to learn as much as I could, albeit unsystematically.”
“When I started learning Cantonese romanisation, there was no standardised system. Jyutping was adopted in academia, but it was not used by laymen at all.” He illustrates with the variety of romanisations of Chinese surnames. “The surname ‘Zee’ of Prof Eric Zee is more commonly spelled ‘Tsui’; ‘Choi’, ‘Tsoi’ and ‘Choy’ refer to the same Chinese surname; so do ‘Cheung’ and ‘Chiang’.” This is a legacy of colonial administration.
“In the past, officers at Birth Registries wrote down names as best they could, according to the romanisation system that they had learnt. But there was no standardised system, hence the variations.” He then learnt IPA too. “I had to grasp it as it is applicable to all languages.”
Learning IPA, then Cantonese, helps to minimise embarrassment.
“There is a big difference between the long ‘aa’ and the short ‘a’, such as in ‘lau’ (a surname) and ‘laau’ (to scoop), and ‘faai’ (quick) and ‘fai’ (lame). Have you heard the joke about a washing machine? A person goes into a laundry and tells the shop assistant, ‘This piece of clothing can die this way or die the other way.’ There is a slight difference between ‘sei’ (to die) and ‘sai’ (to wash) in Cantonese.”
Tan Lee has been researching speech for over 30 years. “I study mainly the technologies of ‘Speech-to-Text’ and ‘Text-to-Speech’.” He points out that the former is trickier. Accent is one of the determining factors for the success of Speech-to-Text technology. “Context, accent, language, audio equipment and so on are some of the elements that may affect speech. Why does a mother talk to her young child slowly and repetitively? There is no single speech technology that can decipher all these issues.”
He emphasises the marvels of the human brain: “Understanding speech may seem easy-peasy. Yet the human brain is in fact busy with a vast array of assumptions and settings. For instance, since I already know what you will ask me in this interview, I can understand you even if you stutter.”
“But if you do not have these assumptions and settings for what I may say, then I have to speak very clearly. This is how the brain leads us to recognise speech. It is a very complicated technology. Can a computer program recognise it? If it can this time, how about the next? A tremendous amount of training data is thus required.” He expounds perspicuously in just a few sentences.
Crossing the line for good
Although there is no single speech technology that covers all areas of communication, Tan Lee is good at liaising with scholars in various disciplines to take on all kinds of challenges with his expertise in engineering. These include improving the speech processing of cochlear implants with audiologists, using speech processing to assess children with speech disorders, and identifying the communication characteristics of exemplary counsellors with educational psychologists.
“I suppose I am the most diverse researcher in terms of interdisciplinary projects. I have worked with all academic departments of CUHK except BA and Law.” He chuckles.
His recent project on retaining the voice of Jody, who was suffering from laryngeal cancer, has gained much public attention. The surgical treatment that Jody received involved the removal of the larynx, which meant she would lose her voice. Her son’s girlfriend sought help on the internet, which caught the attention of Matthew, a student of Tan Lee, and led to this project. The team made over 10 hours of recordings of Jody’s voice before the surgery, and reconstructed her voice digitally with AI and speech synthesis so that she might “talk” with others in her own voice using a “Text-to-Speech” application (see news coverage on Cable TV).
Once the world’s largest Cantonese speech database
How did Tan Lee’s team manage to compile all the scripts for Jody in less than 24 hours? “We already had these materials in hand, including written language, conversational and storytelling speech, as well as all the newspaper clippings from 1999-2000.”
He points out that CUHK was the first university to receive funding from the Hong Kong government’s Innovation and Technology Fund (ITF), and that this very first grant supported the building of a Cantonese speech database. The database, once the world’s largest of its kind, engaged over 1,000 people to create more than 400 hours of voice recordings.
“Apple acquired our database for its first-generation speech recognition application, though I have no idea how much information they have actually made use of.”
There is enormous potential for speech and deep learning technology. In spite of its possible misuse for deepfakes, such as faking the voice of the president of a bank for financial fraud, the technology can help desperate patients like Jody and their families, as well as people whose speech capacity is degenerating because of Parkinson’s disease or spinocerebellar ataxia (SCA).
“Imagine the elderly being able to hear the voice of their grandchildren who live abroad, or listen to them reading the newspaper… this technology will be helpful but there can be ethical issues too.”
He believes technology should do good for those in need. “And not for those who chat with Siri when they are bored.” He does not like to use Siri but avails himself of the Speech-to-Text function on his mobile phone: “It generates very colloquial Cantonese phrases.”
It is not uncommon for scholars to change tracks from academia to business. Although he has a wide range of avocations, such as singing and playing basketball, Tan Lee still relishes his position in the university. “I can learn a lot here.” He shares another recent collaboration, with Prof Harold Chui of the Department of Educational Psychology, that is “fun and novel”.
“They had collected many hours of recordings from counselling sessions. We analysed the speech, intonation and other characteristics of the counsellors. We found that the key to effective counselling was not what advice the counsellor gave but how the counsellor guided the recipients to express themselves. The factors included the intervals between responses, how often the counsellor repeated the recipient’s own wording and, even more so, the use of function words. We even had a special meeting on function words, which was definitely new for me. This made me more careful when talking to my children.”
Apart from inter-departmental collaboration, he was awarded funding from CUHK KPF in 2020 to create a personalised storytelling system with speech technology. The mobile application allows children and parents to listen to 100 children’s stories, and enables children to change the contents of the stories, such as colours and places, to enhance interaction between children and parents. “Daddy can assume the role of the big bad wolf,” Tan Lee grins.
Is it possible to synthesise the voices of parents who are away or too busy to read stories to their kids? The father of two immediately disapproves: “It will be best for parents themselves to read to their children. I do not wish the technology to deviate from its good intentions. I am therefore very vigilant about this.” A good technology is not only effective but also ethical, a point that will surely be well discussed in his general education course “Demystifying AI”.
【Scholarly keyword】Speech technology research
“It’s relatively more difficult to get research in speech technology published or reviewed since this community is quite small. There are generally a lot of researchers on digital imaging and visual technology in each institute, but far fewer on speech. The ratio of research efforts in speech to image is 1 to 10. There are only a handful of speech technology researchers in Hong Kong.” The reason, Tan Lee observes, is that it’s difficult to visualise the difference.
In the 1990s this was a stumbling block to tenure for even the most outstanding researchers. “It turned out to be a blessing in disguise. They became top management in Microsoft and Apple.” In contrast to the knock-backs in the university, speech engineers are very much sought after by technology companies. “Alibaba can easily hire a hundred of them in one go.”
Original text in Chinese：Kary Wong@ORKTS