A lot has been said lately about the miracle of large language models. Most of these accolades are well deserved. Ask ChatGPT to describe general relativity and you will get a very good (and accurate) answer. That said, at the end of the day, ChatGPT is still a computer program (as are all other LLMs) that is blindly executing its instruction set. An LLM understands general relativity no better than your favorite pet does. Unfortunately, we use human-like words to describe the techniques engineers use to create them, for example, "machine learning" and "training." This is misleading because an LLM does not have a mind like a human's.
There is a certain irony here: how can a chatbot that cannot think properly summarize the findings of one of the smartest men of all time? To understand this paradoxical nature of LLMs, let's explore them in terms of their strengths, weaknesses, and dangers, and see how data and storage solutions like MinIO can be used to amplify the first and mitigate the other two. These are techniques every engineer should keep in mind when training, testing, and deploying LLMs for their organization.
The strength of LLMs is that they are trained to understand the probability distribution of words in the training set used to create them. If the training set is sufficiently large (i.e., a corpus of Wikipedia articles or public code on GitHub), then the models will have a vocabulary and corresponding probability distributions that make their results appear as if they possess a real-world understanding of the text they output. Let's examine another example in more detail, this time from philosophy. Ask ChatGPT the question, "What does 'cogito, ergo sum' mean, and who wrote it?" and you will get something similar to the text below.
"Cogito, ergo sum" is a Latin philosophical proposition that translates in English as "I think, therefore I am." This quote is associated with René Descartes, a French philosopher, mathematician, and scientist. Descartes expressed this view in his work On Method, published in 1637. This quote reflects Descartes' attempt to establish an unquestionable fundamental truth – the certainty of a person as a thinking being.LLMS uses probability distributions to produce such results. It works like this, they first look at the text in the question and determine that the word "cogito" is most likely to be the first word in the answer. From there, they look at the first word of the question and answer to identify the word that is most likely to be the next word. This continues until a special "End of Answer" character is determined to have the highest probability.
This ability to generate natural language responses from billions of probabilities is nothing to fear; rather, it is something to be harnessed for business value. The results get even better when you combine it with modern techniques. For example, using techniques like Retrieval Augmented Generation (RAG) and fine-tuning, you can teach an LLM about your specific business. Achieving these human-like results will require data, and your infrastructure will need a robust data storage solution.
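As a rough illustration of the RAG idea, the sketch below retrieves the business document most relevant to a question and prepends it to the prompt. The keyword-overlap retriever and the sample documents are toy assumptions for illustration; real systems use vector embeddings and a proper document store:

```python
documents = [
    "Our return policy allows refunds within 30 days of purchase.",
    "Support hours are 9am to 5pm EST, Monday through Friday.",
]

def retrieve(question: str, docs: list[str], k: int = 1) -> list[str]:
    # Score each document by how many words it shares with the question.
    words = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: len(words & set(d.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(question: str) -> str:
    # Prepend the retrieved context so the LLM answers from your data.
    context = "\n".join(retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# The assembled prompt is what gets sent to the LLM of your choice.
print(build_prompt("How many days do I have to return a purchase?"))
```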
These next-token prediction capabilities can be used not only to generate great text for your chatbot or marketing copy, but also to enable automated decision-making in your app. Given a cleverly constructed prompt containing a problem statement and information about callable APIs ("functions"), an LLM's understanding of language enables it to generate an answer explaining which "function" should be called. For example, in a conversational weather app, a user might ask, "Do I need a raincoat if I'm going to Fenway Park tonight?" With some clever prompting, an LLM can extract the location (Boston, MA) from the query and determine how to formulate a request to the weather.com precipitation API.
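Here is a sketch of what that might look like, following OpenAI's function-calling ("tools") convention. The get_precipitation function, its parameters, and the model name are hypothetical stand-ins for a real weather API wrapper:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

tools = [{
    "type": "function",
    "function": {
        "name": "get_precipitation",  # hypothetical weather API wrapper
        "description": "Get tonight's precipitation forecast for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City and state, e.g. Boston, MA"},
            },
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user",
               "content": "Do I need a raincoat if I'm going to Fenway Park tonight?"}],
    tools=tools,
)

# The model replies not with prose but with the function it wants called
# and the arguments it extracted, e.g. {"city": "Boston, MA"}.
print(response.choices[0].message.tool_calls)
```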
For a long time, the hardest part of building software was the interface between natural language and syntactic systems such as API calls. Ironically, that may now be one of the easiest parts. As with text generation, the quality and reliability of an LLM's function-calling behavior can be improved with fine-tuning and reinforcement learning from human feedback (RLHF).
Now that we understand what LLMs are good at and why, let's look at what LLMs cannot do.
LLMs cannot think, understand, or reason. This is the fundamental limitation of LLMs. Language models lack the ability to reason about a user's question; they are probability machines that produce very good guesses at an answer. No matter how good a guess is, it is still a guess, and whatever generates those guesses will eventually produce something that is not true. In generative AI, this is known as a "hallucination."
If a model is trained properly, hallucinations can be kept to a minimum. Fine-tuning and RAG also drastically reduce them. Bottom line: to train a model properly, fine-tune it, and provide it with relevant context (RAG), you need data and an infrastructure capable of storing it at scale and serving it performantly.
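As a sketch of the storage side, here is how training or RAG documents might be landed in MinIO using its official Python SDK. The endpoint, credentials, and bucket name are placeholders:

```python
from minio import Minio

client = Minio(
    "minio.example.com:9000",      # placeholder endpoint
    access_key="YOUR_ACCESS_KEY",  # placeholder credentials
    secret_key="YOUR_SECRET_KEY",
    secure=True,
)

bucket = "llm-training-data"
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)

# Land a local document in object storage so it can feed fine-tuning or RAG.
client.fput_object(bucket, "corpus/return-policy.txt", "return-policy.txt")
```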
Let's look at one more aspect of LLMs, which I classify as a danger because it affects our ability to test them.
The most popular use of LLMs is generative AI. Generative AI does not produce a specific answer that can be compared against a known result. This stands in stark contrast to other AI use cases, which make specific predictions that can be easily tested. Testing a model for image detection, classification, or regression is straightforward. But how do you test LLMs used for generative AI in a way that is unbiased, factual, and scalable? If you are not an expert yourself, how can you be sure that a complex answer generated by an LLM is correct? Even if you are an expert, human reviewers cannot take part in the automated tests that run in a CI/CD pipeline.
There are benchmarks in the industry that can help. GLUE (General Language Understanding Evaluation) is used to assess and measure the performance of LLMs. It consists of a set of tasks that evaluate a model's ability to process human language. SuperGLUE extends the GLUE benchmark with more challenging language tasks. These tasks involve coreference resolution, question answering, and more complex linguistic phenomena.
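As one possible way to wire a benchmark into automated tests, the sketch below pulls a GLUE task with the Hugging Face datasets library (an assumed tooling choice; the benchmark itself is tool-agnostic):

```python
from datasets import load_dataset

# SST-2 is GLUE's sentiment-classification task.
sst2 = load_dataset("glue", "sst2", split="validation")

for example in sst2.select(range(3)):
    # Each example pairs a sentence with a 0/1 sentiment label, giving a
    # known answer that a model's output can be scored against.
    print(example["sentence"], "->", example["label"])
```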
While the benchmarks above are helpful, a big part of the solution should be your own data collection. Consider logging every question and answer and building your own tests from curated results. This, too, requires a data infrastructure that can scale and perform.
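One possible shape for that logging, again using MinIO's Python SDK with placeholder credentials, is to write each question/answer pair as its own JSON object so it can later be curated into a test set:

```python
import io
import json
import uuid

from minio import Minio

client = Minio("minio.example.com:9000", access_key="YOUR_ACCESS_KEY",
               secret_key="YOUR_SECRET_KEY", secure=True)  # placeholders

def log_interaction(bucket: str, question: str, answer: str) -> None:
    # One JSON object per interaction, keyed by a random UUID.
    record = json.dumps({"question": question, "answer": answer}).encode("utf-8")
    client.put_object(
        bucket,
        f"qa-logs/{uuid.uuid4()}.json",
        io.BytesIO(record),
        length=len(record),
        content_type="application/json",
    )

log_interaction("llm-test-data",
                "What does 'cogito, ergo sum' mean?",
                "It is Latin for 'I think, therefore I am.'")
```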
There you have it: the strengths, weaknesses, and dangers of LLMs. If you want to take advantage of the first and mitigate the other two, you will need data, and a storage solution that can handle a lot of it.