In this series of articles we will explain, in plain words, terms that are obscure, almost hermetic, to many people, starting with supervised machine learning.
The goal of supervised machine learning is to create systems that, given a set of input variables, return an output value. These systems are called knowledge models, or simply models. And although they are programmed in some way, what is programmed is not the knowledge itself, but the algorithm that learns this knowledge from historical data, the training data. It is called supervised because these data include a large number of samples combining the input variables with their corresponding known output value. It is like studying for an exam by reading the answers to hundreds of similar previous exams. And how much data is a “large number”? Well, it depends, but the more data we have, and the more varied the cases it includes, the better the model’s predictions will be.
From a few hundred rows combining “colour,” “container,” and “liquid,” we could train a system that tells us the liquid when we provide the colour and the container. So when we say “white” and “bottle” the system predicts “milk”*, with “yellow” and “can” it yields “beer,” and from “red” and “carton” it yields “tomato sauce.” It is not a complex model, and not a very practical one either, but it helps to show the concept.
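To make the idea concrete, here is a minimal sketch of that toy system in Python. The data and function names are illustrative, and the “learning” is deliberately naive (it simply memorises the most frequent liquid seen for each colour–container pair); real algorithms generalise far beyond this, but the train-then-predict flow is the same.

```python
from collections import Counter, defaultdict

# Toy training data: each row pairs the input variables
# (colour, container) with the known output value (liquid).
training_data = [
    ("white", "bottle", "milk"),
    ("white", "bottle", "milk"),
    ("yellow", "can", "beer"),
    ("yellow", "can", "beer"),
    ("red", "carton", "tomato sauce"),
]

def train(samples):
    """'Learn' by counting which liquid appears most often
    for each (colour, container) combination."""
    counts = defaultdict(Counter)
    for colour, container, liquid in samples:
        counts[(colour, container)][liquid] += 1
    # The "model" is just the majority label per input combination.
    return {key: counter.most_common(1)[0][0]
            for key, counter in counts.items()}

def predict(model, colour, container):
    # Inputs never seen during training get a fallback answer.
    return model.get((colour, container), "unknown")

model = train(training_data)
print(predict(model, "white", "bottle"))  # milk
```

Note that the model answers “unknown” for combinations it never saw, which hints at why varied training data matters.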
If instead of colours, containers, and liquids, we had records from tens of thousands of customers, with variables such as “average payment delay,” “delay of the last three payments,” “form of payment,” “credit limit,” “outstanding balance,” and 10 or 20 other similar variables, plus the output variable “bad debt” with values “Yes” or “No,” we could build a model that warns us about the risk of non-payment by a customer. Much more interesting, isn’t it? Insurance companies have long used such models to predict accident rates and set their premiums, and banks use them to analyse loan risks, for instance.
Variables can be numeric, like a price, a delay, or a weight, or categorical, like a colour, a customer segment, or a form of payment. When the output variable is numeric, we speak of regression; when it is categorical, we speak of classification, which includes the particular case of binary classification, or decision (yes/no, true/false, 1/0).
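Since most algorithms work on numbers, categorical variables are usually converted to a numeric form before training. A common technique is “one-hot” encoding; the sketch below shows the idea with the colours from our toy example (the names are illustrative).

```python
# Known categories for one categorical variable.
colours = ["white", "yellow", "red"]

def one_hot(value, categories):
    """Represent a category as a vector of 0s with a single 1
    in the position of the matching category."""
    return [1 if value == c else 0 for c in categories]

print(one_hot("yellow", colours))  # [0, 1, 0]
```

Each category gets its own column, so the model never mistakes “red” for being numerically greater than “white.”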
Whatever their type, training data must be clean, tidy, well structured, and consistent, with no missing values and no scale distortions (we cannot mix thousands with millions, nor inches with feet). From these data, the data scientist will use the larger part to train the model, reserving a small portion to evaluate its quality: its accuracy when predicting cases for which we already know the correct answer.
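The train/evaluate split can be sketched in a few lines. Everything here is illustrative: the data is synthetic, and the “model” is just a threshold learned from the training portion, but the pattern of holding back data to measure accuracy is the general one.

```python
import random

# Hypothetical labelled samples: (input value, known output).
data = [(x, "big" if x > 50 else "small") for x in range(100)]

random.seed(42)        # fixed seed so the split is reproducible
random.shuffle(data)

# Use the larger part (80%) to train, reserve 20% to evaluate.
split = int(len(data) * 0.8)
train_set, test_set = data[:split], data[split:]

# A trivial "model": learn a threshold from the training data.
big_values = [x for x, label in train_set if label == "big"]
threshold = min(big_values)  # smallest value labelled "big"

def predict(x):
    return "big" if x >= threshold else "small"

# Accuracy: fraction of held-out samples predicted correctly.
correct = sum(predict(x) == label for x, label in test_set)
accuracy = correct / len(test_set)
print(f"accuracy = {accuracy:.2f}")
```

Because the test samples were never used for training, the accuracy measured on them is an honest estimate of how the model will behave on new data.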
There are many supervised learning methods and algorithms, each one better suited to certain problems and data types, and the data scientist must evaluate several models, always aligned with the customer’s objectives and resources, to choose the most adequate in each case before deploying it in a production system. The most accurate model is not always the most adequate if, for instance, its complexity or its heavy pre-processing load makes it not cost-effective to deploy.
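Comparing candidate models boils down to scoring each one on the same held-out data and picking the best, as in this minimal sketch (the two “models” and the data are made up for illustration):

```python
# Held-out samples: (input, known answer).
test_set = [(1, "yes"), (2, "no"), (3, "yes"), (4, "no"), (5, "yes")]

def model_a(x):
    # Candidate 1: always predicts "yes".
    return "yes"

def model_b(x):
    # Candidate 2: predicts by the parity of x.
    return "yes" if x % 2 == 1 else "no"

def accuracy(model, samples):
    return sum(model(x) == y for x, y in samples) / len(samples)

candidates = {"model_a": model_a, "model_b": model_b}
scores = {name: accuracy(m, test_set) for name, m in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores[best])  # model_b 1.0
```

In practice the comparison would also weigh training cost, prediction speed, and maintenance effort, not accuracy alone.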
At Melioth DS we use the CRISP-ML methodology to help you with the design, deployment, and maintenance of your knowledge models, always aligned with your business goals and needs.
*) There’s an expression in Spanish, “white and in a bottle… milk,” used to indicate that something is obvious, crystal clear.
Image by fullvector on Freepik