**Technology**
Researchers at New York University in the United States trained a multimodal artificial intelligence (AI) system through the eyes and ears of a single child, using recordings from a head-mounted camera the child wore from the age of 6 months until around the second birthday. The study, published in the latest issue of the journal Science, shows that the model, a neural network, can learn a substantial number of words and concepts from limited snippets of the child's experience: the recordings capture only about 1% of the child's waking hours, yet that was enough for genuine word learning.
AI systems such as GPT-4 can now learn and use human language, but they do so from massive amounts of language input, far more than children receive when learning to understand and speak. The best AI systems are trained on text containing trillions of words, whereas children hear only a few million words per year.
Because of this huge gap in data, researchers have been skeptical that recent advances in AI can tell us much about human learning and development. In this study, the research team used a head-mounted camera to record the child's first-person perspective weekly, starting at 6 months of age and ending at 25 months. The resulting dataset contains more than 60 hours of footage and about 250,000 word instances (i.e., the number of words spoken, many of them repeated). Each word instance is paired with the video frames the child was seeing when the word was spoken, across a variety of activities such as mealtimes, reading, and play.
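The article only says that word instances are paired with the frames the child was seeing at that moment; it does not describe the training objective. A common way to learn such word-image associations is a CLIP-style contrastive objective, and the minimal sketch below illustrates that idea. All module names, sizes, and the choice of objective are illustrative assumptions, not the authors' actual architecture.

```python
# Illustrative sketch (assumption): learn word-frame associations with a
# contrastive objective, so each word is pulled toward its co-occurring frame
# and pushed away from frames paired with other words.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyWordImageModel(nn.Module):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)   # toy word encoder
        self.image_enc = nn.Sequential(                 # toy frame encoder
            nn.Flatten(), nn.Linear(3 * 32 * 32, dim)
        )

    def forward(self, word_ids, frames):
        w = F.normalize(self.word_emb(word_ids), dim=-1)
        v = F.normalize(self.image_enc(frames), dim=-1)
        return w, v

def contrastive_loss(w, v, temperature=0.07):
    # Each word should match its own co-occurring frame, not other frames.
    logits = w @ v.t() / temperature
    targets = torch.arange(len(w))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy batch: 8 word instances paired with the 32x32 frames seen as they were spoken.
model = ToyWordImageModel()
word_ids = torch.randint(0, 1000, (8,))
frames = torch.rand(8, 3, 32, 32)
loss = contrastive_loss(*model(word_ids, frames))
loss.backward()
```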
After training, the team tested the model by giving it a target word and four candidate images and asking it to pick the image that matched the word. The results show that the model not only learned many of the words and concepts present in the child's everyday experience, but could also generalize them to visual examples quite different from those seen during training.
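The four-image test described above amounts to picking the candidate image whose embedding is closest to the target word's embedding. The sketch below shows that selection step; the embeddings here are random placeholders standing in for the outputs of a model trained on the child's recordings, and the function name and dimensions are assumptions for illustration.

```python
# Illustrative sketch of the 4-alternative test: choose the image most similar
# to the target word in the shared embedding space.
import torch
import torch.nn.functional as F

def choose_image(word_embedding, image_embeddings):
    w = F.normalize(word_embedding, dim=-1)
    v = F.normalize(image_embeddings, dim=-1)
    similarities = v @ w                    # cosine similarity of each image to the word
    return similarities.argmax().item()     # index of the best-matching image

word_vec = torch.rand(64)                   # placeholder embedding of the target word, e.g. "ball"
candidates = torch.rand(4, 64)              # placeholder embeddings of the four candidate images
print("model's choice:", choose_image(word_vec, candidates))
```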