Editor: Run
HuggingFace is currently the most popular open-source community for machine learning, hosting 300,000 different machine learning models and more than 100,000 applications for users to access and use.
What would it be like if the 300,000 models on HuggingFace could be freely combined to complete new learning tasks together?
In fact, in 2016, when HuggingFace was launched, Professor Zhou Zhihua of Nanjing University proposed the concept of "learnware" to depict such a blueprint.
Recently, Professor Zhou Zhihua's team at Nanjing University launched one such platform: Beimingwu.
Beimingwu not only lets researchers and users upload their own models, as on HuggingFace, but also, built on the learnware dock system, matches and combines models according to user needs to handle users' learning tasks efficiently.
The platform's defining feature is the learnware system, which makes a breakthrough in adaptively matching and combining models according to user needs. A learnware consists of a machine learning model and a specification that describes the model, i.e., learnware = model + specification.
The specification of a learnware consists of two parts: the semantic specification and the statistical specification.
The semantic specification describes the type and function of the model in text; the statistical specification uses various machine learning techniques to characterize the statistical information contained in the model.
The specification describes the model's capabilities, so that the model can later be identified and reused to meet a user's needs without the user knowing anything about the learnware in advance.
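The "learnware = model + specification" structure can be illustrated with a small data-structure sketch. This is not the Beimingwu implementation: the class names, the fields, and the use of per-feature means as a statistical summary are all assumptions made for illustration, and a real system would use a much richer statistical specification.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class SemanticSpec:
    """Text description of the model's type and function (hypothetical fields)."""
    task_type: str         # e.g. "regression"
    data_type: str         # e.g. "table"
    description: str = ""


@dataclass
class StatisticalSpec:
    """Numeric summary of the training data; per-feature means are a toy stand-in."""
    feature_means: List[float]


@dataclass
class Learnware:
    """A learnware bundles a model with the specification that describes it."""
    model: Callable[[Sequence[float]], float]
    semantic: SemanticSpec
    statistical: StatisticalSpec


def summarize(rows: Sequence[Sequence[float]]) -> StatisticalSpec:
    """Build the toy statistical specification locally from the developer's data."""
    n = len(rows)
    return StatisticalSpec([sum(col) / n for col in zip(*rows)])


# The raw rows stay with the developer; only the summary goes into the learnware.
train_rows = [[1.0, 2.0], [3.0, 4.0]]
lw = Learnware(
    model=lambda x: sum(x),
    semantic=SemanticSpec("regression", "table", "toy model that sums its inputs"),
    statistical=summarize(train_rows),
)
```

In this sketch only the model and the two specifications are packaged together, which mirrors the idea that a learnware can be identified later from its specification alone.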
The specification is the core component of the learnware dock system. It connects all learnware-related processes in the system, including learnware uploading, organization, search, deployment and reuse. Just as Yanziwu (Swallow Dock) in the novel "Demi-Gods and Semi-Devils" is made up of many small islands, the specifications in Beimingwu are also like small islands.
Learnwares from different feature and label spaces form a large number of specification islands, all of which together constitute the specification world of the learnware dock system. In the specification world, if connections between different islands can be discovered and established, the corresponding specification islands can be merged. Under the learnware paradigm, developers from all over the world can share models with the learnware dock system, which helps users solve machine learning tasks efficiently by searching for and reusing learnwares, without having to build machine learning models from scratch. Beimingwu is the first systematic open-source implementation of learnware, providing a preliminary research platform for learnware-related research.
Developers who are willing to share can freely submit models. The learnware dock assists in generating specifications, forms learnwares, and stores them in the dock; developers do not need to disclose their training data to the dock in this process. Future users can complete their machine learning tasks by submitting requirements to the learnware dock and, with its assistance, searching for and reusing learnwares, also without disclosing their own data to the dock. In the future, when there are millions of learnwares in the dock, emergent behavior may appear: machine learning tasks for which no model has ever been specifically developed may be solved by reusing several existing learnwares.
Learnware dock system
Machine learning has been hugely successful in many fields, but it still faces many problems, such as the need for large amounts of training data and sophisticated training skills, the difficulty of continual learning, the risk of catastrophic forgetting, and the leakage of data privacy or ownership.
While each of these problems has its own line of research, the problems are coupled, so solving one of them can make the others worse. The learnware dock system aims to address many of them at the same time through a holistic framework:
Lack of training data and skills: Even ordinary users who lack training skills or have only a small amount of data can obtain a powerful machine learning model, because they can take well-performing learnwares from the learnware dock system and further adapt or refine them rather than building a model from scratch.
Continual learning: As well-performing learnwares trained on a variety of tasks are submitted, the knowledge in the learnware dock system is continuously enriched, which gives a natural path to continual and lifelong learning.
Catastrophic forgetting: Once a learnware is accepted, it is kept in the learnware dock system permanently, unless every aspect of its functionality can be replaced by other learnwares. As a result, old knowledge in the learnware dock system is always preserved and never forgotten.
Data privacy and ownership: Developers submit only models and do not share private data, so data privacy and ownership are well protected. Although the possibility of reverse engineering a model cannot be completely ruled out, the risk of privacy leakage is very small compared with many other privacy-preserving solutions.
As shown in the figure below, the system workflow is divided into the following two phases:
Submission phase: Developers voluntarily submit a variety of learnwares to the learnware dock system, and the system performs quality checks and further organizes these learnwares.
Deployment phase: After a user submits task requirements, the learnware dock system recommends learnwares that are helpful for the user's task according to the learnware specifications, and guides the user to deploy and reuse them.
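The two phases can be sketched as a toy in-memory dock. Everything here is a simplifying assumption rather than Beimingwu's actual API: the `ToyDock` class and its method names are hypothetical, the quality check only verifies that the model runs on a sample input and returns a number, the semantic specification is reduced to a (task type, data type) pair, and statistical matching is reduced to the Euclidean distance between per-feature means.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Model = Callable[[Sequence[float]], float]


class ToyDock:
    """Hypothetical in-memory dock: submit learnwares, then search by specification."""

    def __init__(self) -> None:
        # Learnwares are grouped by semantic key, a stand-in for specification islands.
        self._islands: Dict[Tuple[str, str], List[dict]] = {}

    def submit(self, name: str, model: Model, task_type: str, data_type: str,
               feature_means: List[float], check_input: Sequence[float]) -> bool:
        """Submission phase: run an assumed quality check, then organize the learnware."""
        try:
            out = model(check_input)
        except Exception:
            return False  # the model cannot even run on a sample input
        if not isinstance(out, (int, float)) or math.isnan(out):
            return False  # the model must return a usable number
        entry = {"name": name, "model": model, "feature_means": feature_means}
        self._islands.setdefault((task_type, data_type), []).append(entry)
        return True

    def search(self, task_type: str, data_type: str,
               user_means: Sequence[float]) -> Optional[str]:
        """Deployment phase: filter by semantic key, then pick the closest statistics."""
        candidates = self._islands.get((task_type, data_type), [])
        if not candidates:
            return None
        best = min(candidates,
                   key=lambda c: math.dist(c["feature_means"], user_means))
        return best["name"]


dock = ToyDock()
ok_a = dock.submit("store_a", lambda x: 2 * x[0], "regression", "table",
                   [10.0, 1.0], [1.0, 1.0])
ok_b = dock.submit("store_b", lambda x: x[0] + x[1], "regression", "table",
                   [0.0, 5.0], [1.0, 1.0])
ok_bad = dock.submit("broken", lambda x: x[5], "regression", "table",
                     [0.0, 0.0], [1.0, 1.0])  # fails the check: index out of range
match = dock.search("regression", "table", [9.0, 2.0])
no_match = dock.search("classification", "text", [0.0])
```

Note that the user passes only summary statistics to `search`, never raw data, which reflects the privacy property described above in a very reduced form.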
When searching, the learnware dock system first locates the relevant specification island through the semantic specification in the user requirements, and then precisely identifies the learnwares on that island through the statistical specification in the user requirements. Merging different specification islands means that the corresponding learnwares can be used for tasks with different feature and label spaces, i.e., they can be reused for tasks beyond their original purpose.
By making full use of the capabilities of the machine learning models shared by the community, the learnware paradigm builds a unified specification space to solve new users' machine learning tasks efficiently and in a unified way. As the number of learnwares grows, effectively organizing the learnware structure will significantly enhance the overall ability of the learnware dock system to solve tasks.
As shown in the figure below, the system architecture of Beimingwu consists of four layers, from the learnware storage layer up to the user interaction layer, and systematically realizes the learnware paradigm from the bottom up for the first time. The specific functions of the four layers are as follows:
Learnware storage layer: manages the learnwares stored in zip package format and provides access to the relevant information through the learnware database;
System engine layer: covers all the processes in the learnware paradigm, including learnware upload, checking, organization, search, deployment and reuse, and runs independently of the back end and front end in the form of the learnware Python package, providing rich algorithm interfaces for learnware-related tasks and research exploration;
System back-end layer: realizes the industrial-level deployment of Beimingwu, provides stable system services, and supports user interaction through the front end and the client by providing rich back-end APIs;
User interaction layer: implements a web-based front end and a command-line client, providing rich and convenient ways for users to interact with the system.
Experimental evaluation
In their paper, the research team also constructed various types of basic experimental scenarios to evaluate benchmark algorithms for specification generation, learnware identification, and reuse on tabular, image and text data.
Tabular data experiments
On a variety of tabular datasets, the team first evaluated the performance of identifying and reusing learnwares from the learnware dock system that have the same feature space as the user's task. Moreover, since tabular tasks usually come from different feature spaces, the research team also evaluated the identification and reuse of learnwares from different feature spaces.
Homogeneous cases
In the homogeneous case, the 53 stores in the PFS dataset act as 53 independent users. Each store uses its own test data as user task data and applies a unified feature engineering approach. These users can then search the dock system for homogeneous learnwares that have the same feature space as their task.
For users with no labeled data or only a limited amount of labeled data, the team compared different benchmark algorithms; the average loss over all users is shown in the figure below. The table on the left shows that the methods requiring no labeled data perform much better than randomly selecting and deploying a learnware from the market; the figure on the right shows that when the user's training data is limited, identifying and reusing one or more learnwares performs better than the user's self-trained model.
Heterogeneous cases
Depending on the similarity between the learnwares in the market and the user's task, heterogeneous cases can be further divided into different feature engineering scenarios and different task scenarios.
Different feature engineering scenarios: The results on the left of the figure below show that even when users lack labeled data, the learnwares in the system can deliver strong performance, especially with the averaging ensemble method that reuses multiple learnwares.
Different task scenarios: The right side of the figure above shows the loss curves of the user's self-trained model and several learnware reuse methods. The experiments show that heterogeneous learnwares are beneficial when the user has only a limited amount of labeled data, since that data helps align the learnwares with the user's feature space.
Image and text data experiments
In addition, the research team performed a basic evaluation of the system on an image dataset. The figure below shows that when users have scarce labeled data or only a limited amount of data (fewer than 2,000 instances), the learnware dock system can yield good performance.
The team also performed a basic evaluation of the system on a benchmark text dataset, with feature spaces aligned through a unified feature extractor. As shown in the figure below, even in the absence of labeled data, the performance achieved through learnware identification and reuse is comparable to that of the best learnware in the system.
In addition, compared with training a model from scratch, the learnware dock system reduces the number of samples required by about 2,000.
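The averaging ensemble reuse mentioned in the tabular experiments can be illustrated with a minimal sketch: the predictions of several identified learnwares are simply averaged. The function name `averaging_ensemble` and the toy models are hypothetical, and the sketch assumes that all models already accept the same input features, which leaves out the feature space alignment that heterogeneous cases require.

```python
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], float]


def averaging_ensemble(models: List[Model]) -> Model:
    """Combine several models into one that returns the mean of their predictions."""
    if not models:
        raise ValueError("at least one model is required")

    def predict(x: Sequence[float]) -> float:
        return sum(m(x) for m in models) / len(models)

    return predict


# Two toy "learnwares" that disagree slightly on the same input.
model_a: Model = lambda x: x[0] + 1.0
model_b: Model = lambda x: x[0] + 3.0

ensemble = averaging_ensemble([model_a, model_b])
prediction = ensemble([2.0])  # mean of 3.0 and 5.0
```

Averaging needs no labeled data from the user, which is consistent with the setting in which the article reports this method performing well.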