Transformer authors' unicorn startup launches a powerful multimodal LLM

Mondo Finance Updated on 2024-02-16

Editor: Run.

[New Zhiyuan Guide] Adept AI, founded by authors of the Transformer paper, has launched Fuyu-Heavy, a multimodal large model claimed to be second only to GPT-4V and Gemini Ultra. It can not only accurately recognize images, especially UIs, but also has very strong mathematical reasoning ability.

2024 is indeed the year of multimodality for large models.

Another unicorn, Adept AI, has launched its multimodal large model, Fuyu-Heavy.

Founded by two Transformer authors who left Google, Adept AI aims to develop an AI agent that improves the productivity of office workers.

Fuyu-Heavy is the foundational multimodal model that will power their upcoming agent products.

They claim that Fuyu-Heavy is the most powerful multimodal model available today, apart from GPT-4V and the yet-to-be-released Gemini Ultra.

And it is less than 10% the size of those two models.

It excels at multimodal reasoning. Its killer feature is UI understanding, but it also performs well on more traditional multimodal benchmarks. In particular, Fuyu-Heavy scored even better than Gemini Pro on the MMMU benchmark.

On standard text-only benchmarks, Fuyu-Heavy outperforms its peers, even though part of its capacity is devoted to the image module.

Because Adept AI's flagship product targets enterprise users, and the company is committed to developing an AI agent that can greatly improve workplace productivity, they also specifically trained Fuyu-Heavy's UI comprehension ability.

And according to Adept, Fuyu-Heavy has very strong mathematical reasoning ability and can apply it in multimodal tasks. To demonstrate this, they shared a test question:

Prompt: Which foods are most likely to trigger food poisoning, based on the table below?

Fuyu-Heavy's output:

The table shows the number of people who ate each food and the number of people who got sick after eating it. To determine which food is most likely to have caused the food poisoning outbreak, we look for the food with the highest percentage of people who ate it and got sick.

Let's calculate the percentage of people who ate each food and got sick:

Cold chicken: 34 / 86 = 0.395, i.e. 39.5%

Potato salad: 38 / 54 = 0.704, i.e. 70.4%

Egg sandwich: 40 / 76 = 0.526, i.e. 52.6%

Fruit pie and cream: 12 / 32 = 0.375, i.e. 37.5%

Cheese: 12 / 48 = 0.25, i.e. 25%.

The food with the highest percentage of people who ate it and got sick is potato salad, at 70.4%. Therefore, potato salad is the most likely cause of the food poisoning.
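The attack-rate calculation above is straightforward to verify. A minimal sketch in Python, using the (sick, ate) counts quoted in the model's answer:

```python
# Attack-rate check for the food-poisoning example.
# Counts (number sick, number who ate) are taken from the article.
foods = {
    "cold chicken": (34, 86),
    "potato salad": (38, 54),
    "egg sandwich": (40, 76),
    "fruit pie and cream": (12, 32),
    "cheese": (12, 48),
}

# Percentage of people who ate each food and got sick.
rates = {food: sick / ate * 100 for food, (sick, ate) in foods.items()}

worst = max(rates, key=rates.get)
print(worst, round(rates[worst], 1))  # potato salad 70.4
```

Potato salad's 70.4% attack rate is nearly twice that of the next food, which is what makes the model's conclusion convincing.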

How Fuyu-Heavy was born

In October last year, Adept launched its own base model, Fuyu-8B, a very simple, efficient, and powerful natively multimodal model.

On the basis of Fuyu-8B, Adept spent four months scaling up the model's size and capabilities.

In addition to the standard problems of model scaling laws, they also had to deal with various problems arising from a natively multimodal architecture trained on both text and image data.

Image data puts pressure on the model. Feeding image data to a multimodal model is difficult: memory usage skyrockets, cloud storage ingress and egress are constrained, and even handling image coordinate formats consistently between training and inference is troublesome.

Image models are notoriously unstable, so they had to make significant adjustments to Fuyu's architecture and training process to cope.

Moreover, high-quality image pre-training data is very scarce, so they put a lot of effort into collecting, organizing, and even creating such data. There is also a delicate balance between text and image tasks: too much text data degrades image performance, and vice versa, and they had to find the right way to strike this balance at scale.

After four months of hard work solving these problems, Adept launched Fuyu-Heavy, and will soon release a product built on this model.

Assessment scores

According to Adept, Fuyu-Heavy's performance as a multimodal large model is second only to GPT-4V and Gemini Ultra.

Evaluated on text capability alone, its performance is roughly comparable to that of Gemini Pro.

On MT-Bench and AlpacaEval 1.0, which assess long-form conversational ability, Fuyu-Heavy outperforms Claude 2.

On MMMU, which evaluates a model's multimodal performance, Fuyu-Heavy surpasses Gemini Pro.

Amazing multimodal reasoning capabilities

Adept also shared examples of Fuyu-Heavy answering MMMU questions.

Janet Ludlow's company requires all analysts to value stocks using a two-stage dividend discount model (DDM) and the capital asset pricing model (CAPM).

Using the CAPM and DDM, Ludlow has valued QuickBrush at $63 per share.

She now has to value SmileWhite. Using the information in the table below, calculate SmileWhite's required rate of return:

Fuyu-Heavy replied:

The required rate of return for SmileWhite can be calculated using the capital asset pricing model (CAPM):

Required Return = Risk-Free Rate + Beta × (Expected Market Return - Risk-Free Rate)

The risk-free rate and expected market return are provided in the table; SmileWhite's beta is 1.15.

Let's calculate the required rate of return for SmileWhite:

Required rate of return = 4.50% + 1.15 × (14.50% - 4.50%)

Required rate of return = 4.50% + 1.15 × 10%

Required rate of return = 4.50% + 11.50% = 16%
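The CAPM arithmetic above can be checked in a couple of lines. A minimal sketch, using the rates quoted in the model's answer:

```python
# CAPM required-return check for the SmileWhite example.
# Rates are taken from the article's table; SmileWhite's beta is 1.15.
risk_free = 0.045      # risk-free rate, 4.50%
market_return = 0.145  # expected market return, 14.50%
beta = 1.15

# Required return = risk-free rate + beta * (market return - risk-free rate)
required = risk_free + beta * (market_return - risk_free)
print(f"{required:.2%}")  # 16.00%
```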

Fuyu-Heavy can also perform more complex calculations:

Table 11.47 summarizes a recent survey of the youngest entrepreneurs with an estimated net worth of $1 million or more, aged 17 to 30. Each cell in the table gives the number of entrepreneurs in a particular age group and net-worth bracket. We want to know whether age and net worth are independent.

What is the value of the chi-square (χ²) test statistic?

Fuyu-Heavy replied:

To answer this question, we need to conduct a chi-square test of independence.

The null hypothesis of this test is that age and net worth are independent, while the alternative hypothesis is that they are not.

The test statistic for the chi-square test of independence is:

χ² = Σ (O - E)² / E

where O is the observed frequency and E is the expected frequency, and the sum runs over all cells in the contingency table.

The expected frequency for each cell is calculated as: E = (row total × column total) / grand total.

The degrees of freedom for the chi-square test of independence are: df = (number of rows - 1) × (number of columns - 1).

At a given significance level (e.g., 0.05), the critical value of the chi-square distribution with df degrees of freedom can be found in a chi-square table or computed with a statistics package.

If the calculated chi-square value is greater than the critical value, we reject the null hypothesis and conclude that age and net worth are not independent. If the calculated chi-square value is less than or equal to the critical value, we cannot reject the null hypothesis and conclude that age and net worth are independent.

Let's calculate the expected frequency for each cell in the table:

Now, let's calculate the chi-square test statistic:

The degrees of freedom for this test are:

With 2 degrees of freedom, the critical value of the chi-square distribution at a significance level of 0.05 is about 5.99.

Since the calculated chi-square value (1.76) is less than the critical value (5.99), we cannot reject the null hypothesis, and we conclude that age and net worth are independent.
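The procedure the model describes can be sketched in Python. The article's contingency table was an image and its counts are lost, so the numbers below are made up purely to illustrate the computation:

```python
# Sketch of a chi-square test of independence.
# NOTE: these counts are hypothetical; the article's actual table is unavailable.
table = [
    [10, 20, 30],  # one net-worth bracket across three age groups
    [15, 25, 20],  # another net-worth bracket
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand = sum(row_totals)

# chi^2 = sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total.
chi_sq = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chi_sq += (observed - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(table) - 1) * (len(table[0]) - 1)
print(round(chi_sq, 3), df)  # 3.556 2
```

The decision rule is then exactly as stated above: compare `chi_sq` against the chi-square critical value for `df` degrees of freedom at the chosen significance level.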

This impressive computational ability and image-recognition accuracy make Adept's upcoming agent products something to genuinely look forward to!
