Dr. Dao Duc Minh: 'Mastering Vietnamese data is the first step in developing and mastering Vietnamese technology'
Báo Thanh niên•27/05/2024
Having worked for a large artificial intelligence organization in the US, why did you decide to return to Vietnam to join VinBigdata?
While working in the US, I took part in many large government projects, but my results were often just a few steps in a much larger pipeline. Because the projects' confidentiality procedures were very strict, I often did not even know how the solutions I had developed were being used. When I returned in 2017, Vietnam was in a development stage and there were many big data and artificial intelligence problems that needed to be solved. I accepted Professor Vu Ha Van's invitation to work together on the goal of developing Vietnamese technology solutions that serve the lives of Vietnamese people. I find my return much more meaningful because I am able to work on problems with greater impact.
Dr. Dao Duc Minh at a workshop (photo courtesy of the interviewee)
In a strategy for developing artificial intelligence, what is the role and influence of big data?
Data plays a very important and valuable role in training artificial intelligence. To train a high-quality artificial intelligence model, we usually start by training on a large dataset. So to have quality artificial intelligence, we first need good data. Good data must meet standards of quantity and scale, quality, diversity and universality. Collecting and processing thousands of hours of data, starting from raw data cleaning, to produce the highest-quality data for an artificial intelligence model is very expensive and complicated. Conversely, to analyze big data we need artificial intelligence to process it accurately at large scale, producing results that better support decisions and predictions. For example, in developing ViVi, a virtual assistant for Vietnamese people, we had to collect and process tens of thousands of hours of high-quality audio data, from hundreds of thousands of voices across different regions, ages and genders, with content spanning hundreds of fields. Most recently, we launched ViGPT, "the first Vietnamese version of ChatGPT for end users", developed from a large language model fully owned by VinBigdata. This model was trained on 600 GB of refined Vietnamese data from many different fields. With our understanding of data and of the Vietnamese language, we found a new approach that let us launch ViGPT just 9 months after ChatGPT was released. This is the synergy between big data and artificial intelligence.
What is your view on linking research with practical value to serve the community?
I believe that technology research is only truly successful when it enters everyday life, solves social problems and improves people's lives. To create practical commercial products and solve business and social problems, we must always ask the question: what value will data bring to life? So far, we have researched a variety of products and solutions across industries and fields, notably ViGPT; VinDr, which provides AI solutions for medical imaging diagnosis; VinBase, a platform for artificial intelligence; and Vizone, a set of smart image analysis solutions.
With key personnel of VinBigdata at an event of Vingroup Corporation (photo courtesy of the interviewee)
The Fourth Industrial Revolution is unfolding rapidly on a global scale. What advantages do you think Vietnam has?
Compared with previous revolutions, I think Vietnam currently has many advantages for making a breakthrough in the Fourth Industrial Revolution and improving the country's position on the world map. The two keys to achieving this goal are data and people. Vietnam has nearly 100 million people, a high proportion of whom are young people who use phones and personal computers. In addition, we have prestigious experts in artificial intelligence and young, high-quality information technology personnel with a very good foundation in mathematics.
So what about the limitations?
The first visible limitation is that, despite our large population, we still have difficulty mastering data, specifically standardizing and synchronizing data across facilities, business units and administrative agencies. We also face other constraints such as limited investment resources, especially for high-performance computing infrastructure.
In your opinion, how important is mastering Vietnamese data in the journey of creating and mastering technology to serve the lives of Vietnamese people?
There are currently many leading artificial intelligence products from around the world, notably applications built on large language models, such as ChatGPT by OpenAI or Bard by Google. However, Vietnamese is not among the core languages these products are developed for. As a result, the quality of the Vietnamese-specific content they return suffers to some degree and is prone to errors, including, more dangerously, errors in basic knowledge. As Vietnamese people, we have the advantage of access to our own data sources. Only we can fully understand the characteristics of Vietnamese data and the needs and characteristics of Vietnamese people. Therefore, mastering Vietnamese data is truly the key to mastering core technologies, which are also the technologies that will serve Vietnamese people.
Internal training for VinBigdata members (photo courtesy of the interviewee)
How can specific data sources be accessed, especially when most Vietnamese people today use foreign social networking sites?
In fact, the largest source of human data today, not only for Vietnamese people, is the internet and social networks. However, we can still access and collect data from different sources, based on our understanding of the characteristics of Vietnamese data and depending on the requirements of each project. For example, OpenAI's GPT models have hundreds of billions, even trillions, of parameters, are trained on huge amounts of data and cost billions of dollars. Compared with them, we have chosen a completely different direction based on our research, capabilities and resources: creating a Vietnamese language model with an architecture of only a few billion parameters, trained on a 600 GB set of Vietnamese data that we collected and refined ourselves, yet with the same ability to process Vietnamese. The results show that our self-developed architecture can self-optimize, shorten the language model's training time and reduce costs while still ensuring model quality.
What challenges have you and your team encountered in researching and developing artificial intelligence products?
The first challenge is certainly time. The wave of artificial intelligence technology is arriving very quickly and is in a period of explosive growth. Around the world, leading technology companies have quickly launched highly polished products that are constantly updated and improved. If we are slow and do not launch products in time, we will certainly fall behind. On the other hand, if we want to create products that can be applied to solve practical social problems, we must also find and develop what makes each product outstanding, special and unique.
Presentation at Vietnam Artificial Intelligence Day (AI4VN 2023) (photo courtesy of the interviewee)
Many individuals and organizations in Vietnam and around the world have suffered great losses from data leaks. How do you view the issue of data security?
It can be said that every application today starts from data. When working with data, we must on the one hand apply it to create the best technology for everyday life, and on the other hand ensure data security for individuals and organizations. The human factor is a very important link in ensuring data security. It includes developers and the people who use the products. For developers, awareness of data security must be present from the very beginning of data collection and processing. Often, when no problem has occurred, we are not aware of how important data security is. But if a data leak occurs, the damage can be huge. Data breaches can result from technical problems or malicious attacks. When data is breached, individuals or organizations can have their information used for illegal purposes, while businesses can suffer financial losses from fixing the related problems, and even damage to their brand.
Dr. Dao Duc Minh and the VinBigdata team at an event (photo courtesy of the interviewee)
Beyond the aspiration to master technology to serve Vietnamese people, are there also plans for reaching the world?
Any organization or enterprise that wants to bring its products to the international market must comply with international standards. VinBigdata has strengths in solutions and technology, so setting a vision to conquer the world is natural. Of course, to deploy many different products and applications, we need the partnership of international organizations with many years of experience and an understanding of users around the world.
Thank you!