Viettel AI wins double award at speech and language processing competition

The Vietnamese Language and Speech Processing (VLSP) competition is part of the annual International Conference on Vietnamese Language and Speech Processing, organized by the VLSP Club, a branch of the Vietnam Association for Information Technology. VLSP 2023 featured 10 competitions on speech and text processing, bringing together leading researchers, experts and technology development organizations.

Although this was Viettel AI's fourth time in the competition and it had already won three times, its engineers still ran into many difficulties because the structure of the competition categories had changed.

Specifically, the Speech Recognition and Emotion Recognition categories, which were separate last year, were merged into one this year. Teams had to solve both problems at once, recognizing both the text and the emotion of each sentence, so the workload and the difficulty both doubled.

Make use of all data, whether low or high quality

Beyond the change in category structure, this year's competition also focused on building models from scratch under limited data conditions, including raw, unlabeled and low-quality data. The organizers provided four groups of data that differ in quality and form: audio with no labels; audio paired with text; high-quality audio with standard emotion labels; and low-quality audio with emotion labels. Each dataset is clearly designated for a specific purpose and competition category, and together they total more than 300 hours. This is quite modest compared to standard datasets for training speech recognition, which usually run to 1,000-2,000 hours or more.

Each team had less than two months to develop and submit its solution, but the time actually available for researching solutions was much shorter because resources were scarce.

“This year, Viettel AI has devoted a lot of computing infrastructure resources to researching new technologies and to product development, while speech recognition is a technology that requires a lot of hardware resources,” said Mr. Dang Dinh Son, Artificial Intelligence Engineer, Virtual Assistant Platform, Viettel AI.

Photo: The Artificial Intelligence Engineering Group of the Virtual Assistant Platform Block, which represented Viettel AI in the Speech Recognition and Speech Emotion Recognition category at VLSP 2023.

Faced with data that was limited in both volume and quality, the research team settled early on a guiding principle: "utilize all data, regardless of low or high quality". Doing so required a training cycle that could process every dataset, and a single model that solves several problems rather than a separate model for each.

The results of pioneering technology mastery

With both data and resources scarce, the research team decided to build a processing pipeline that was simple rather than massive but, importantly, refined down to the smallest detail.

Viettel AI engineers carefully studied the latest research from leading conferences and journals around the world to find an approach. Combining this with data processing methods that had already proven effective for training models, the research team built a training cycle that processes all the available data. The cycle has three steps: building a pre-trained model that learns to represent voice features without labels; fine-tuning that pre-trained model for the two problems, speech recognition and emotion recognition; and inference.
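
The three-step cycle can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions, not Viettel AI's actual pipeline: the group names, the `DataGroup` type and the `plan_training` function are hypothetical, and the rule that every audio group can feed label-free pre-training is an assumption based on the team's stated aim of using all available data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataGroup:
    """One of the four data groups (names and fields are hypothetical)."""
    name: str
    has_transcript: bool
    has_emotion: bool
    quality: str  # "high", "low" or "unspecified"


# The four groups as the article describes them; the article does not state
# the quality of the first two, so it is marked "unspecified".
GROUPS = [
    DataGroup("unlabeled_audio", False, False, "unspecified"),
    DataGroup("audio_text", True, False, "unspecified"),
    DataGroup("emotion_audio_hq", False, True, "high"),
    DataGroup("emotion_audio_lq", False, True, "low"),
]


def plan_training(groups):
    """Assign each data group to the training steps that can use it."""
    plan = {"pretrain": [], "finetune_asr": [], "finetune_emotion": []}
    for group in groups:
        # Assumption: label-free pre-training needs audio only,
        # so every group can contribute to step 1.
        plan["pretrain"].append(group.name)
        # Step 2 fine-tunes one pre-trained model for two tasks,
        # each task using only the groups that carry its labels.
        if group.has_transcript:
            plan["finetune_asr"].append(group.name)
        if group.has_emotion:
            plan["finetune_emotion"].append(group.name)
    return plan


plan = plan_training(GROUPS)
for step, names in plan.items():
    print(f"{step}: {', '.join(names)}")
# Step 3 (inference) would then run the single fine-tuned model
# to produce both the transcript and the emotion label.
```

Under these assumptions no dataset goes unused: the unlabeled and low-quality audio still contributes to pre-training, while a single pre-trained model is fine-tuned for both tasks before inference.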

“Experience from solving data-scarcity problems while developing and deploying previous products also contributed significantly to helping the team settle on an approach. Conversely, the knowledge and results gained from the competition can potentially be applied immediately to Viettel AI products, so working on products while competing went quite smoothly,” said Mr. Bui Tien Dat, Virtual Assistant Platform Engineer, Viettel AI.

As a result, Viettel AI not only won first prize in the Speech Recognition and Speech Emotion Recognition categories but also achieved an impressive score of 89.18%; the next two teams scored 83.40% and 78.45%, respectively.

Mr. Son said the key factor was the Vietnamese-specific speech processing model that Viettel AI has been developing for a long time.

“Instead of using models and instructions from available research results, Viettel AI chose to build and develop its own model for Vietnamese speech processing. This model is constantly updated and optimized, and it keeps becoming more effective,” said Mr. Son.

Beyond the competition, this Viettel AI solution will serve as the basis for upgrading the virtual switchboard and Viettel virtual assistant products, helping them identify customers' emotions in conversation more accurately and then respond or choose words with the appropriate nuance. Conversations between humans and AI will thus become more natural, improving the user experience. The solution also opens up many new customer care applications, such as a system that automatically identifies customer complaints made to the switchboard so that they can be handled promptly or mined for information.

Photo: Mr. Bui Tien Dat, Virtual Assistant Platform Engineer at Viettel AI, represented the team in presenting the research results at the conference.

A representative of the unit said that Viettel AI will continue to develop its technology and constantly upgrade its products to increase accuracy and improve user experience and product efficiency.

Quoc Tuan
