Apply the software development lifecycle to the data provided to AI applications

Mondo Technology Updated on 2024-03-07

Learn how to work with AI data using the same discipline you apply to your code.

Artificial intelligence (AI) is hot, and your interest in it will have you chattering about forward and backward chaining, neural networks, deep learning, Bayesian logic, clustering, classifier systems, and more. These are all AI technologies, but what tends to get buried beneath AI's seemingly magical abilities is how much they depend on the sheer quantity and quality of the data behind them. In effect, AI requires big data. In my previous tutorial, Smart Data, Part 1, I took a deep dive into the role of data in AI.

In this tutorial, I'll show you how to apply an iterative version of the well-known software development life cycle (SDLC) to data for AI applications. While many AI algorithms use data in different ways, machine learning is the one that has contributed most to the current AI boom. For this reason, and for consistency with the previous tutorial, most of the discussion and examples here still focus on machine learning. The best way to think about effectively collecting, preparing, and using AI data is as something parallel to the software development life cycle. In the same spirit as the more recent move toward agile development, I prefer a well-defined, iterative approach to managing data for AI rather than a strict "waterfall" approach.

Developers should already be familiar with iterative SDLCs. After a project starts, it enters the planning and requirements quadrant, then continues to iterate throughout the life of the software, truly putting the "cycle" in life cycle. There are many variations on this concept, including the idea that deployment is simply an exit from the process after testing.

Of course, developers often think about these phases in terms of functional specifications, database schemas, interface and structure diagrams, programs, and test cases. At best, data flow diagrams are included in the design, but the data often seems to be processed piecemeal in the SDLC, which is a detriment to AI development.

The most important thing the team as a whole can do is include data requirements in the planning, along with a general problem definition and an understanding of the deployment environment. If your team is using one of the many AI techniques that require training instances, it should determine what kind of training data to acquire. That determination depends on what data is available, as well as on the problem statement and requirements: the more training data you have, and the more fully it covers the range of outcomes the deployed software must handle, the better your odds of success.

There may be a feedback loop in which a lack of applicable training data leads to a need to modify the solution, and the original requirements may be restored in future iterations. For example, say the team is developing an iris classifier program, but the only data available for training is the well-known iris dataset (see Part 1 for more information). The team may decide that for early iterations, the goal should be to identify Iris setosa from the other two species with a high level of confidence, while accepting a lower confidence level when distinguishing Iris virginica from Iris versicolor. After getting better technology or data in later iterations, expectations can be raised toward confidently assigning all three species.

During analysis and design, you should incorporate the raw data sources for the AI. When you start collecting this data, you'll begin to understand how it needs to be reviewed, enriched, maintained, and evaluated. Determine format limits and parameters as early as possible, such as minimum and maximum image dimensions or audio lengths. For data like the iris dataset, don't forget to control the units of measure. You don't want to mix measurements recorded in inches with measurements recorded in centimeters. Keep in mind the words of a veteran math teacher: every number in every place needs a unit label, whether in the data value itself (as an abstract data type) or in the data schema. In either case, you may want to include some sort of unit verification or conversion step in your data preparation code, or in your instructions for expert review.
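As a minimal sketch of the kind of unit check described above, the following Python example normalizes iris-style measurements to centimeters and flags values outside plausible bounds. The field names, the unit flag, and the bounds are illustrative assumptions, not part of any standard dataset schema.

```python
# Minimal sketch of unit verification and conversion for iris-style measurements.
# Column names, the "unit" field, and the plausible ranges are illustrative assumptions.

INCH_TO_CM = 2.54
PLAUSIBLE_CM = {"sepal_length": (4.0, 8.0), "petal_length": (1.0, 7.0)}

def normalize_record(record):
    """Convert a raw measurement record to centimeters and flag suspect values."""
    unit = record.get("unit", "cm")
    factor = INCH_TO_CM if unit == "in" else 1.0
    cleaned, flags = {}, []
    for field, (low, high) in PLAUSIBLE_CM.items():
        value_cm = record[field] * factor
        cleaned[field] = value_cm
        if not (low <= value_cm <= high):
            flags.append(f"{field}={value_cm:.1f}cm outside [{low}, {high}]")
    return cleaned, flags

# Example: one record logged in inches, one already in centimeters.
print(normalize_record({"sepal_length": 2.0, "petal_length": 0.6, "unit": "in"}))
print(normalize_record({"sepal_length": 5.1, "petal_length": 1.4, "unit": "cm"}))
```

Whether a flagged record is corrected automatically or routed to expert review is a design choice for your own pipeline.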

A very important but often overlooked aspect that should be included in data design is provenance. Where does each data item come from, and how reliably can you trace it? What chain of custody, in people or systems, did it pass through before it reached your application? How exactly did your application change it, whether through an algorithm or through expert review? If you find anomalies, sources of bias, or other issues, provenance can be key to understanding which data in a larger aggregate corpus you need to fix or discard. It is also important for continuous improvement of the system. As your application matures and you gain a deeper understanding of which technologies and processes are working and which aren't, you can develop the same understanding of your data sources.
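One lightweight way to make provenance concrete is to carry a small metadata record alongside every data item. The sketch below is only an illustration; the field names and the example values are assumptions, not a standard schema.

```python
# Sketch of a per-item provenance record; field names are illustrative assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Provenance:
    source: str                                           # where the raw item came from
    acquired_at: datetime                                 # when it entered the pipeline
    custody: list = field(default_factory=list)           # people/systems that handled it
    transformations: list = field(default_factory=list)   # what the app did to it

    def record(self, actor: str, action: str):
        """Append one step to the chain of custody and the transformation log."""
        self.custody.append(actor)
        self.transformations.append(action)

prov = Provenance(source="https://example.org/iris.csv",
                  acquired_at=datetime.now(timezone.utc))
prov.record("wrangling-job-17", "converted inches to centimeters")
prov.record("reviewer:akin", "confirmed species label")
print(prov)
```

With a record like this attached to each item, tracing an anomaly back to its source or discarding everything that passed through a faulty step becomes a simple query over the corpus.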

At this stage, you must also decide, or reconsider, which dimensions of the data to use in training samples or in the algorithm; Part 1 explains dimensions more completely. Does adding more detail to the data slow down performance? Does it improve the results, or does it compromise their reliability because of the curse of dimensionality? Analysis and design are the usual stages for establishing your dimensionality reduction technique.
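As one common example of a dimensionality reduction technique that might come out of this stage, the sketch below applies principal component analysis (PCA) to the iris measurements. It assumes scikit-learn is available, and PCA is just one of many possible choices.

```python
# Sketch: principal component analysis (PCA) as one possible dimensionality
# reduction step for the iris measurements. Assumes scikit-learn is installed.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)          # 150 samples, 4 measurement dimensions
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)           # project onto the 2 strongest components

print("original shape:", X.shape)          # (150, 4)
print("reduced shape:", X_reduced.shape)   # (150, 2)
print("variance explained:", pca.explained_variance_ratio_.sum())
```

The variance-explained figure is one quick way to judge whether dropping dimensions is compromising the signal your algorithm needs.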

A data flow diagram (DFD) is a very important but underutilized design artifact. Ed Yourdon and Larry Constantine first detailed data flow diagrams as a key part of the software engineering process in the influential late-1970s book Structured Design, which builds on earlier work by David Martin and Gerald Estrin. Developers in areas such as security recognize the importance of data flow diagrams for banking applications, where data must pass from the remote user's browser to more secure layers in the retail ledger and back-end reconciliation systems. A similar level of detailed description is an important part of defining the AI data preparation process, which in turn is an important factor in achieving success in your domain.

A DFD for AI data tends to be based on common data acquisition and preparation techniques. The following diagram is an abstract version of such a DFD; you should refine it into more concrete steps based on the nature of the problem domain, the nature of the data, and the algorithm you employ.

A data flow diagram is about data, not processes. The rounded boxes are processes, but they should be described in terms of how the data is handled. The arrows are the key: they show the steps the data goes through. The final repository of the data is your application's corpus. Although a DFD is not a process flow, I include a loop icon as a reminder that the process should run continuously, constantly fetching the new data needed for analysis (such as the results of dataset scoring) and injecting it into the corpus.
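To make that continuous loop concrete, here is a highly simplified Python skeleton in which the stages described below repeatedly feed the corpus. The function bodies are placeholders under assumed names, not a prescribed implementation.

```python
# Skeleton of the DFD stages as a repeating loop that feeds the corpus.
# All function bodies are placeholders; real stages would be far richer.

def acquire():
    """Pull raw items from digitization, scraping, or other sources."""
    return [{"text": "raw item", "source": "https://example.org"}]

def wrangle(items):
    """Normalize formats, attach metadata (including provenance), label items."""
    return [dict(item, labeled=True) for item in items]

def cleanse(items):
    """Drop or repair corrupted, incomplete, or inaccurate items."""
    return [item for item in items if item.get("text")]

def score_and_integrate(items, corpus):
    """Score items for fitness and merge the acceptable ones into the corpus."""
    corpus.extend(items)
    return corpus

corpus = []
for _ in range(3):  # in production this loop never really ends
    corpus = score_and_integrate(cleanse(wrangle(acquire())), corpus)
print(len(corpus), "items in corpus")
```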

To help you create a DFD specifically for your project, here are some of the abstract processes included in this example.

Data acquisition is the process of obtaining the raw data that ultimately goes into the corpus. This can be done by digitization or by data scraping (extracting data from web pages or elsewhere by brute force).

Data wrangling is the process of converting the data into the format required for input, detecting metadata features (including provenance) in a form the machine understands, and labeling the data. This is also where you try to identify discrete data items and eliminate duplication.

Data cleansing is the process of evaluating the identified and flagged data items to correct or delete data that is corrupted, incomplete, or inaccurate.

Scoring and integration refers to applying statistical analysis to ensure the overall health of the resulting corpus. Each data item can be scored on its fitness against the corpus's maintenance needs, and the whole collection of items to be added can be scored to ensure it is combined in a way that maximizes the efficiency and accuracy of the algorithm. A popular model for the scoring step is exploratory data analysis (EDA), which involves many combinations of visual representations of the relationships between different variables across the set. Performing a comprehensive EDA is an important factor in successful dimensionality reduction.
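For the scoring step, an exploratory data analysis can start with nothing more than summary statistics, missing-value checks, and pairwise plots. The sketch below is one minimal way to begin, assuming pandas, seaborn, and matplotlib are installed and using the iris dataset as a stand-in for your own corpus.

```python
# Minimal exploratory data analysis (EDA) sketch for a tabular corpus.
# Assumes pandas, seaborn, and matplotlib are installed; the iris dataset
# stands in for whatever corpus your application maintains.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("iris")              # 150 rows: 4 measurements + species

print(df.describe())                       # per-column summary statistics
print(df.isna().sum())                     # quick check for missing values
print(df.duplicated().sum())               # candidate duplicates to deduplicate

# Pairwise relationships between variables, colored by label -- the kind of
# visual EDA that informs dimensionality reduction decisions.
sns.pairplot(df, hue="species")
plt.show()
```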

In the rest of this tutorial, I continue through the later stages of the SDLC, but it is crucial to accept the principle that data analysis and preparation are ongoing activities in every phase of the AI SDLC.

One of the key SDLC lessons learned from the many software development methodologies that emerged in the early days is to put implementation in its place. Most people work in programming because they find it an exciting career. It can be thrilling to instruct a computer step by step and watch it do something remarkable. Even hunting down bugs and looking for improvements to an algorithm can be exciting.

This psychological factor means that developers tend to move into the implementation phase as quickly as possible, without paying enough attention to planning, analysis, design, structured testing, system management, and other aspects of a successful project. These other phases may seem boring compared to actual coding, but experience and engineering discipline have shown that common causes of project overruns and missed deadlines include a failure to pin down requirements clearly and a lack of attention to structured testing.

In the SDLC diagram, you can see that implementation is often just the tail end of analysis and design, and this emphasis extends to the data as well. The focus on data during the implementation phase is primarily to ensure that the coding of the algorithm faithfully reflects the design.

During testing, everything changes. In an AI application, testing is where you see whether all the data collection and preparation was successful. The training data is run through the algorithm and compared with the expected results. The expected results may have been obtained during the initial acquisition of the training corpus, or from previous iterations of the SDLC. The link between testing and evaluation is particularly important when developing AI applications, and this critical juncture is why AI often needs more iterations through the SDLC than other software before it can be used in production.
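In code, the mechanical part of this comparison is simply scoring the model's outputs against held-out expected results. The scikit-learn sketch below shows one way that check might look; the choice of classifier and the iris data are assumptions for illustration.

```python
# Sketch: mechanical test step -- compare actual outputs to expected results
# on held-out data. The classifier choice is an arbitrary assumption.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
predicted = model.predict(X_test)

print("accuracy:", accuracy_score(y_test, predicted))
print(confusion_matrix(y_test, predicted))   # where expectations and outputs diverge
```

The confusion matrix is a useful handoff point to the expert evaluation described next, because it shows exactly which categories the system confuses.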

At the heart of testing is a comparison of expected results with actual output: a mechanical assessment of whether the algorithm handled the training samples as expected. This is just the beginning of the process of ensuring that the AI generates reliable, practical results. Evaluation is typically supervised by experts and deals with unplanned data of the kind a real-world application might encounter. If you're developing a mobile app that answers spoken questions, you might have a set of phrase recordings from multiple known speakers, such as "How many people are there in Ghana?", and compare the voice response with the expected answer. The evaluation might then include having a new speaker ask the same question, or a slightly changed one such as "How many people are there in Nigeria?" An expert who evaluates the results is better able to understand the outcomes the app is likely to deliver under more realistic conditions.

The evaluation process typically continues as the application is deployed, first to beta users and then to customers. Feedback from all of these stages can guide subsequent planning phases and later iterations. Perhaps reports from the field indicate that the app cannot distinguish some speakers' question "How many people are there in Ghana?" from "What is the population of Guyana?" That could turn out to be an important sample for training and testing in subsequent iterations.

The first tutorial explained how important data is to creating AI and cognitive applications, how this importance has persisted throughout the history of the discipline, and how it relates to AI's successes and failures over that history. In this tutorial, I have explained how to handle AI data with a process that follows the same principles you apply to your code: apply a traditional, iterative SDLC to take advantage of the vast amount of data available today, while remaining vigilant about the hazards.
