Hello everyone. Thank you for joining us today on the first day of Predictive Analytics Week 2024. Today we will be exploring reliable rule-based AI and automated machine learning. My name is David Peralta and I'm a Marketing Manager here at Minitab. A few housekeeping notes before I kick it off to Mikhail. If you are experiencing any technical issues, just refresh your page. If that doesn't work, exit out of the window and rejoin in a separate window; that usually does the trick. For those of you who are new with us, I want to quickly walk through the webinar console. At the very top left you should see me, and you'll eventually see Mikhail while he presents today. Below that, which I'll highlight now, is our request-a-demo window. That link takes you to our Talk to Minitab page, where you can submit the form if you'd like a one-on-one demo of the Predictive Analytics Module, or if you're interested in learning more about our other services and products like training and consulting. It goes directly to our team and we can connect with you one-on-one. Below that window is the related content window, which I've highlighted just now. As I mentioned, we have a bunch of content there, all relating to predictive analytics: one-pagers, infographics and more, if you'd like to learn a little more about predictive analytics. To the right of that window is our Q&A box, which I'll highlight now. You can submit any questions you have during the presentation; they go directly to us, and when we get to the Q&A session we'll try to cover as many of them as we can. To the right of that is our Minitab Exchange window. Minitab Exchanges are one-day, in-person events that we host in cities around the US. We just hosted our event in Philadelphia, so to those of you who joined us there last week, welcome back, good to see you again. Our next event is in Rosemont, IL on June 18th, and we'll also be hosting events in Columbus, Dallas and Anaheim this year. These events are complimentary, at no cost, so if you're interested in learning more, click on that window; it will take you to our website, where you can learn more and register. And lastly, above that is our survey. We have about eight questions that we like to ask our webinar attendees. It gives us an idea of what kind of messaging you're interested in hearing from us and helps us tailor future webinars to what you want to hear, so we really appreciate your feedback. And with that, I'd like to introduce our speaker, Mikhail Golovnya. Mikhail is a Senior Advisory Data Scientist at Minitab. He has been prototyping new machine learning algorithms and modeling automation for the past 20 years, and he has been a major contributor to Minitab's ongoing search for technological improvement among the most important algorithms in machine learning. With that, I'll kick it off to Mikhail to begin today's presentation.

Well, thank you, David. It's certainly my pleasure to talk to you again. As David said, this time we're going to focus on a new set of topics. Predictive Analytics Week is always an interesting event because we only have about 45 minutes to cover a wide range of different topics.
So let me just get started. The main theme of today is going to be artificial intelligence. If you look at the Google Trends curve for artificial intelligence over time, you can ask what's going on: why, in the past several years, has there been this surging interest in artificial intelligence, to the point that we hear about it on the news, from philosophers, from conspiracy theorists, from people on the street, from people who are otherwise far away from traditional computer science or predictive analytics or anything of that nature? Obviously, if you correlate these trends with other events, the reason is the release of so-called ChatGPT, or generative AI. It is generative AI that grabbed the world's attention recently, and it's generative AI that has great potential but also presents all kinds of challenging realities. Now, I'm not going to talk about ChatGPT for the most part today, but I do know that it can become a very powerful tool in certain lines of work. My wife, who works as a marketer and has decades of experience, uses ChatGPT in many situations as a personal assistant. But she also relies on all those years of work experience to guide the thing and make sure it actually gives her the result she wants. We'll talk more about this on the third day of PA Week; there will be some interesting discussion toward the very end about the philosophy of artificial intelligence. But let's table those issues for now. I'm not going to talk about generative AI at large; let's focus on AI as something that has a much wider scope.

In general, you can identify different types of AI, and in fact there are different categorizations out there. The whole subject is not new; people have been talking about intelligence and intelligent machines for decades, and to some extent even for centuries, and again I'll talk about that on the last day. But roughly, you can start with what we call reactive machines, or expert systems. That's when we use the power of computers and storage to gather information from different experts and encode it as a set of rules. Just as a beginner goes to school to learn, instead of learning from an actual live person we can incorporate a lot of that knowledge into rule-based systems and then produce very useful results. Then we have machine learning systems, and that will be the main topic of our discussion today: what artificial intelligence elements enter into modern machine learning systems? The very term machine learning is, in a sense, just another way to signify artificial intelligence, because as you'll see, a lot of AI is centered around clever marketing and equivocation, where the same word is used with different meanings in different senses. Now, I happen to be an expert in machine learning systems; I've been doing that work for close to a quarter of a century at this point. I'm also a classically trained statistician and a computer scientist, so I have seen all the different angles on this, and to me machine learning systems serve as a very powerful practical application, whatever we choose to call it.
Artificial intelligence, machine learning, predictive analytics, data mining, data analysis: the label is irrelevant. To me, the most important thing is that I have seen many examples in real life of how it can be used to increase our productivity and to develop a real, good understanding of the phenomena we are investigating. Related to that are neural networks. That's a very special area of artificial intelligence, and again it goes back about half a century. The first neural networks were the famous perceptrons; they ran on big machines even before the advent of personal computers, and they already used elements of these systems. People came up with a clever name for it, the neural network, because presumably it mimics some of the machinery that connects the neurons in our brain. The remaining 50 years of neural networks have been marked by a series of stages: as the machines became more and more powerful, more and more opportunities opened up. But in the end what you have is a large expansion in the scope and capability of these things. Instead of working with hundreds of neurons, we can now work with millions, in some cases possibly even billions of neurons, and many, many different layers. The names have changed too; for the past ten years or so it has been called deep learning, or TensorFlow, all these different things. So neural networks are a very interesting area of application of artificial intelligence. Then we enter the territory of ChatGPT, Watson and other things, where you have AI systems designed and trained for specific tasks, usually image recognition, language recognition, personal assistance, or information gathering and recombination. Beyond that we reach the higher levels of AI, and this is what all the fuss is about. To me, that is still in the realm of science fiction. It's interesting that most people today, for some reason, ignore the most practical uses of AI that have been established for decades and instead look at these fantastic applications, most of which have not even happened yet. Again, we will talk about that on the last day; for now this is just a useful overview.

And that brings me to the very first, simple poll question: what types of AI are you most likely to use in your work? Feel free to answer; you can make multiple choices here, because in my line of work I use multiple different tools. Just find those boxes and click on them. I'll give you a little time to do that. There's a lot of terminology here, but just think about it. There are rule-based systems, which are just different guides and wizards; when you install a Microsoft Windows system, you're already relying on a certain sort of guide. Then we have machine learning systems; machine learning has to do with predictive modeling, building models, predicting results, things like that. Then we have that special area of image, language and assistance systems; I'll just call it ChatGPT, although previously there was the Watson computer from IBM that preceded all of this. OK. So I'll continue.
We have about half, more than half of our audience already submitted their answers, so let's have a quick look. OK, it's actually not very surprising to me; that's exactly what I would expect. Machine learning systems are the highest percentage here, although of course we may have some selection bias in an audience coming to this webinar. We do have a fair number using image recognition systems and deep learning, and even though there has been a lot of talk about deep learning all these years, let's be honest: it's a very powerful technique that fits only a small number of specialized applications, and beyond that it's overkill to even try to use it.

All right. Let me move on and highlight that, historically, Minitab Statistical Software has already been implementing a lot of these different artificial intelligence systems; they just weren't called artificial intelligence at the time they were introduced, but they really are. To give you an example: can anyone think of an example of an expert machine in Minitab that mimics the decision-making ability of a human expert in a specific domain? Some of you may or may not know Minitab, so I'll give it away right away: it's the Assistant menu. We have the Assistant menu in Minitab that guides the beginning statistician, or someone who just wants to learn about statistics, on how to run and interpret different statistical tests. A major hurdle when you study statistics is to wrap your head around all these different tests: normality tests, t-tests, proportion tests, two-group tests and so on. Sometimes even for me, if I haven't been using a test for a while, it may be difficult to decide which one to use, and that's where you rely on the Assistant menu and similar things; it's very powerful machinery in Minitab. Recently we've added another cool rule-based AI in Minitab called Graph Builder, and I'll show it a little bit later just to give you a flavor of it.

OK, but the main topic of my conversation with you today is going to focus on what type of AI we have been including within the machine learning, or predictive analytics, side of Minitab. Let me start by introducing two major business problems. In fact, we talked about this last year, although last year was primarily focused on the algorithms; this year is focused on the bigger picture, like how to use those algorithms in the most effective way. Anyone interested in investing in data collection and management eventually faces the following common needs. This is not an exhaustive list, but these are very common things that we deal with every day. Problem number one is to find the most accurate predictive model subject to some natural constraints. Why do we call predictive analytics predictive? Because we want to predict something, and there's usually a measure of success associated with it. Ideally we want to be as accurate as possible, but not all real-life examples allow 100% accuracy, and therefore the natural problem that often arises is to find the most accurate predictive model subject to certain natural constraints. Nowadays there is an entire business that has flourished around this.
All these machine learning and predictive analytics competitions, with the Kaggle platform being one of the most successful in that regard, center on this issue of finding the best algorithm, the best model. Problem number two is related to problem number one, but it has a very different flavor. We live in the data age, and in many real-life scenarios a lot of data is being collected, both useful and absolutely useless. Everyone is obsessed with collecting data, privacy laws have been written around collecting data, and so on. One of the common issues that generates is: how do we handle a large number of variables in a data set, call them attributes, features, variables or whatever you like? What is the most predictive, useful subset, so that we are not overwhelmed by the sheer number of different inputs available to us? So there are two different applications; they're related to each other but have very distinct flavors, and this is where modern artificial intelligence comes in, known as automated machine learning. That type of machine learning specifically addresses the two problems above. There are also extensions of it, but within the scope of our gathering today it's impossible to cover every part, so I chose to focus on the things that are of the most interest to everyone.

So let's look at the first of these AutoML capabilities: automatic discovery of the best predictive model, be it regression or classification or whatever type of model you're working with. In this case, we use machine learning and automation around it to discover the algorithm that produces the best model in terms of overall predictive accuracy and success. Let me go ahead and start our next poll. Don't worry, this is just the middle poll; there will be one more toward the end for the other type of automated machine learning. In your own line of work, how many predictive analytics algorithms do you routinely use? And it's OK if you're not using them at all. There are four different choices, and I'll give you some time to answer. Some of you may not have used them at all. Some of you may have your own favorite one or two; classically trained statisticians, for example, sometimes prefer to use logistic regression or linear regression all the time. Some of us have a handful of favorites, and there are also daring, experimental ones among us who want to use pretty much everything out there. So let's see. OK, we have 55% responding, so let's look at this. "None" is at 47%, so about half of the audience, and that's fine; that's why I'm keeping things fairly general here. If you are interested in a somewhat more general overview of predictive analytics, do listen to the recordings from last year, because that's where I described the different flavors of predictive analytics and the different algorithms that we have. By the way, it's interesting that many of us use one or two; let me do a quick update and check the numbers. OK, perfect, it's still more or less the same picture, and then there's the handful-of-favorites group. I would place myself into that third category.
I do have a handful of favorites, but that's mostly because over my career as a data scientist I've looked at a lot of different algorithms and ultimately converged on my preferred subset, so to say. So every time there is a new data set I need to work with, I normally pull out that expert deck that I have, and you'll see in a moment what it is, and just work with that.

Let's move on to our first example. We're going to look at a data set that deals with delinquency prediction. You have a set of bank accounts; it's a pretty old data set that has been publicly available for years. If you're interested in the data sets, we'll figure out afterwards whether we can make them available. But this is a very interesting data set, and you don't have to be using predictive analytics routinely to understand what's going on. You have about 108,000 different bank accounts, and those accounts are being monitored for activity on a month-by-month basis. We have observed the past behavior of all the accounts, flagging those that experienced 90 days past due or another type of severe delinquency within the two years following the point of observation. Approach it from the point of view of the bank: you have these tens of thousands of accounts, and based on the historical records, you know that when something happens on an account at the point of observation, within the next two years that account may go bad for 90 days or more. Going bad means the bank stops receiving payments. I believe these are credit card accounts, so in this case the person had some credit card activity and then stopped sending payments, including the minimum payment, for three months or longer. Clearly that account went upside down, something went wrong, and that's a very bad thing for the bank because now it has to absorb the cost. So it's a very practical application of predictive analytics, and banking and finance were among the first clients and businesses to latch on to practical applications of predictive analytics.

Looking at this particular data set, I picked a smaller subset of variables just to illustrate the point. I have the event of interest, delinquency, which is a 0/1 response. We have mostly zeros, accounts in good standing, but there is a handful of ones that went belly up, so to say. Then we look at, in this case, only six descriptors: age of the borrower in years; debt ratio, which is essentially what the bank looks at when you apply for a loan, how much you pay in fixed payments every month relative to your total income, which lets them judge how heavily loaded you are with debt obligations; monthly income; number of open credit lines, which essentially includes all of your different credit instruments, credit cards, loans, mortgages, car loans and so on, a really useful, fairly discrete indicator; and then the number of mortgages, because those are in a special category.
If someone already has a mortgage, it means you are basically a homeowner, and that puts you into a different league, so to say. And then also the number of dependents. So there is just a handful of very simple variables, and our objective is to build the model that best explains delinquency.

So let's see what happens. First of all, you always do the descriptive stage first. What you see here is a set of histograms for age, debt ratio and so on, for all six variables; the ones in red are delinquent accounts and the ones in blue are normal accounts. We always start with the descriptive stage because it gives us a lot of information about the data set. For example, the age profile shows a distribution that covers pretty much all the ages and is representative of what happens in society. Same with monthly income. Same with debt ratio, where you see a sizable number of accounts with a pretty large percentage of debt. What you can also see is that you cannot really guess the model from these distributions; the delinquent accounts appear to be spread across all of these regions, and that's why we're trying to build a model. If you look at mutual dependencies, here I picked what we call a correlogram: you take all six variables of interest, look at the mutual correlations and color-code them. By the way, all of this is available in Minitab; I'll talk about it a little later. You can see how debt ratio and the number of mortgages are directly correlated with each other. You can also see other relationships, like age and number of dependents being negatively correlated. So we start with the descriptive stage just to understand a little better what's going on in the data set.

Then we can always build a baseline model. In this case I built a conventional logistic regression model: I simply took those six variables and threw them into the logistic regression routine. You may or may not understand exactly what's going on in the output, but the interesting thing is that there is only one bottom-line number every time you build a predictive model. Despite all of the voluminous output that statisticians sometimes get deep into, for the rank-and-file business user, including myself, any time I build a model this is the number I'm looking at: what is the test, or independently validated, performance of my model? That number is the bottom line, and in this case it's the area under the ROC curve, or AUC. There's no time to delve into details, but that number varies between 0.5 and 1.0. A value of 0.5 means your model is as good as random, which means you might as well toss a coin to predict who is going to go delinquent; anything less than 0.5 is nonsense. A value of 1.0 would mean you have a model that made a perfect prediction, ranking all of the delinquent accounts above all of the good accounts in terms of predicted probability of delinquency, or propensity to be delinquent. In this case it's 0.65. It's OK, but it's not stellar; ideally I want to see something higher. This is the baseline algorithm that I unleashed on this very simple problem.
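To make that baseline step concrete for readers who like to script things, here is a minimal sketch of the same idea outside of Minitab, using Python and scikit-learn. The file name and column names are assumptions based on the description above, not the actual layout of the data set used in the webinar.

```python
# Minimal baseline sketch: logistic regression with an honestly held-out test AUC.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

df = pd.read_csv("delinquency.csv")                      # hypothetical file name
predictors = ["Age", "DebtRatio", "MonthlyIncome",        # assumed column names
              "OpenCreditLines", "Mortgages", "Dependents"]
df = df.dropna(subset=predictors + ["Delinquent"])        # crude NA handling, just for the sketch
X, y = df[predictors], df["Delinquent"]                   # 0/1 response described above

# Hold out an independent test set so the reported AUC is independently validated.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])
print(f"Baseline logistic regression, test AUC: {auc:.3f}")   # around 0.65 in the webinar example
```

The only number worth quoting from a run like this is that holdout AUC, which plays exactly the bottom-line role described above.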
At this point we can introduce elements of artificial intelligence, which technically is a very simple procedure that runs all the different algorithms we have, automatically, for you. Instead of picking all those settings by hand, you simply configure your search space in Minitab. If you're interested, you can study it; we have separate webinars and a lot of help and documentation. It is very simple: you just say, OK, I want to run these algorithms, go and do it. Depending on the size of your data set, you may have to take a long coffee break or do some other things while the machinery does the work for you. When the machinery comes back, remember, there is that one number, one final bottom line, in this case the area under the ROC curve, independently validated, and that is what the table summarizes here. It shows that TreeNet, which is the stochastic gradient boosting algorithm, came out on top. A closely related implementation known out in the field is XGBoost; that's a slightly different flavor, but the same category of algorithm. We call it TreeNet, the original technology introduced by Jerome Friedman back, what, 25 years ago at this point. We were among the first to implement it commercially, with Jerry's blessing, so to say, and now it's part of Minitab under the distinct name TreeNet, but it is stochastic gradient boosting, the same family as the XGBoost generally known out there.

As you can see, it became very clear that using automated machine learning, which is the specific category of artificial intelligence that to me is the most useful one in this area, we solved problem number one: we found the most accurate model, and it gave us a five-point boost in accuracy. Plain logistic regression, without any custom shaping, gave me 0.65 for the area under the ROC curve, whereas TreeNet boosted it up to 0.70, and five points is a huge increase. Just talk to any banking or finance people: if you tell them you can increase the performance of their models by five points of area under the ROC curve, you'll generate a lot of excitement. And in this case it was solved for us by a simple type of search. There is no time to go deeper, but we also augment that AI by letting you search for interactions and do interaction discovery. If you run that process here, you discover that the original unrestrained model that includes all interactions is only a tiny bit better than the model where all interactions have been disabled and all variables are forced to be additive, albeit with nonlinearities allowed. So in this case we don't really need interactions; we can achieve almost the same level of performance with a generalized additive model, and that's one of the cool features available in TreeNet. As for why TreeNet has become one of my favorite algorithms, and I won't dwell on it too much, it gives us a lot of insight: not only do you get models that are highly accurate, you also get plots that allow you to explain and interpret what's going on. In this case, it turns out that people with a small number of credit cards, especially zero credit cards, have the highest probability of going delinquent.
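Minitab's Discover Best Model runs this search for you inside the product. Purely as a rough analogue, not Minitab's implementation and with arbitrary hyperparameters, a scripted version of "fit several candidate algorithms, keep the one with the best independently validated AUC, then look at its dependence plots" could look like the sketch below, reusing the train/test split and assumed column names from the previous sketch.

```python
# Rough analogue of an automated model search: try several candidate algorithms
# and rank them by independently validated AUC.
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay
from sklearn.metrics import roc_auc_score

candidates = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Random forest": RandomForestClassifier(n_estimators=300, random_state=0),
    # Same family as TreeNet / XGBoost: stochastic gradient boosted trees.
    "Gradient boosting": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

for name, auc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} test AUC = {auc:.3f}")

# The interpretation step: dependence plots for the boosted model, e.g. how
# predicted delinquency risk moves with open credit lines and age.
PartialDependenceDisplay.from_estimator(
    candidates["Gradient boosting"], X_train, ["OpenCreditLines", "Age"])
plt.show()
```

Whichever candidate wins, the single holdout AUC remains the bottom-line number, exactly as in the Minitab workflow described above.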
That spike at zero credit cards is most likely because of identity theft and that kind of thing going on. Same thing with the debt ratio, but reversed: if your debt ratio is below, say, 10%, you're in a good zone, but beyond that you start entering a danger zone. As for age, here we see our first nonlinearity. Remember we got a five-point boost in model accuracy; why did that happen? It's because age exhibits a reversal of trend. The previous curves can be modeled by pieces of S-shaped sigmoid curves, which are the underlying feature of logistic regression. This one cannot, because with age there is a middle region, anyone between 25 and 35, with the highest probability of going delinquent, while whoever is younger has a lower probability and whoever is older does too. The same goes for monthly income and the others. So as you can see, and if you're interested we can have a separate discussion on this, there are a lot of useful insights here. From the data set, I looked at a couple of descriptive statistics, then took a coffee break and let the software search, using artificial intelligence rules, to identify my best model. Then I came back, looked at the best model, went to the plots, was happy with the performance, and I really liked the insights. That's a very cool way to work with these things; it's what we offer, and I use it in my daily work pretty much all the time. So that solved our first business problem.

Let's move on quickly to the second business problem, which deals with how many variables you need to manage in your predictive analytics work, or even in your general work; it doesn't have to be predictive analytics. Have a quick look at this poll and think about the data sets you're dealing with: do they have a single-digit number of predictors, or tens, or hundreds, or thousands? I've seen people, say in genomics research, who work with tens of thousands or even hundreds of thousands of gene expressions or SNP expressions or anything of that nature. So have a quick look and think about your experience as practitioners. I understand not all of you have been actively working with data sets, but this should give us some interesting indication of the audience here. OK, and again, this pretty much agrees with my expectations. We have about half of the audience staying within the manageable range of tens. Now, tens could be 50, 60, 70, 80, and that's still difficult to manage. Some of us are in single digits, and I understand some of those data sets really are in single digits, especially on the classical side, or in cases where we simply don't have enough access to the points of measurement. And a small number of us go into the thousands. Let me do a quick update here. Yeah, that's pretty much what we have.

So let's quickly look at this problem. Specifically, I have a data set from a project we worked on several years ago. There was a Gartner conference, and they asked us to provide some insights. The project has to do with understanding which global, country-level indicators are associated with hunger.
Some countries are prosperous, and some countries in the world are at a very low level, with a lot of suffering and other problems. So Gartner was interested and provided a cool data set that covers roughly the first 18 years of the 21st century, and we wanted to look at which of these different social and economic indicators are associated with world hunger. There was a lot of merging work; the whole thing was a mess, and a ton of different things had to be reconciled. But in the end we had a data set with 98 variables and about 5,000 observations, one row per country or region per year: 18 or 19 years for each of a couple hundred countries in the world, plus regions, boils down to about 5,000 observations. The challenge here is the sheer number of variables, and those shown are just a handful of all the different indicators provided. What I like about this data set is that it's politically neutral; we don't rely on any biases or anything like that. So for anyone genuinely interested in making the world a better place, this is great material to look at and study, and then draw your own conclusions depending on where your preferences lie, so to say.

Now, descriptive statistics. As you can see, there's a lot going on here, and we are challenged by the sheer number of variables. There were additional challenges along the way: variables with lots of different categories, variables with missing values, variables with extreme values where a handful of values stick out, and various types of dependencies. This is where we relied on machinery that was added to Minitab in the past couple of years or so, called Graph Builder. It's rule-based AI machinery, and it's very easy to use. In fact, Graph Builder is now the very top entry in the Graph menu, followed by all the other types of graphs. And to be honest, the moment I started using Graph Builder, I ignored all the other menu items unless I needed something highly technical and specialized; this gives me all I need. You just pick the variables you want, in any combination, and it automatically suggests all the different types of graphs you can construct on them. Or you can go into individual categories, or scroll down and look at all the different types of charts. It may take a little time, but it's highly interactive, especially in this case where the data set is only 5,000 observations. You throw those things in, you can pick "by" variables, you can color-code things, and you don't even need to remember the specific type of graph you're looking for; just by studying those little pictures you can quickly identify something of interest. In this case, a correlogram shows how things are correlated with each other, or you can look at histograms and pick individual categories there too. Now, the webinar system here doesn't allow me to run videos, so I just took a collection of screenshots to show you those things.
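Graph Builder does this kind of first-pass exploration interactively inside Minitab. For readers who script their exploration instead, a rough equivalent of the correlogram-plus-histograms pass might look like the sketch below; the file name is hypothetical and the plots are generic, not anything Graph Builder itself produces.

```python
# A scripted first-pass exploration, loosely analogous to the correlogram and
# histogram views suggested by Graph Builder.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("country_indicators.csv")   # hypothetical file name
numeric = df.select_dtypes("number")

# Correlogram: pairwise correlations, color-coded from -1 to +1.
corr = numeric.corr()
fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(corr, cmap="RdBu_r", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90, fontsize=6)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns, fontsize=6)
fig.colorbar(im, ax=ax, label="Pearson correlation")

# Histograms of every numeric variable: a quick check for skew, gaps and outliers.
numeric.hist(bins=30, figsize=(14, 10))
plt.tight_layout()
plt.show()
```

The goal is the same as described in the webinar: spot variables with odd shapes, heavy missingness or strong mutual correlation before any modeling starts.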
Graph Builder is a very useful tool, very straightforward, a joy to use, and it takes a lot of the grunt effort and unnecessary mental load away from you, while you still sit as the final judge. That's the whole thing about artificial intelligence today: it's not really artificial intelligence, it's something that helps your intelligence become a lot more productive and gives you enough information to think about. But again, we'll talk about that in a couple of days.

Graph Builder allowed us to quickly identify variables that are related to the target yet are not useful for prediction. For example, we know that if a country experiences undernourishment, which is the formal way of indicating the poverty rate, so to say, then there will be all kinds of things related to it: a large percentage of stunted and wasted people, and so on. And there will certainly be a negative correlation with the percentage of overweight people in the population, which is the plague impacting all of the prosperous, affluent Western countries of the world. So these are very useful displays, and there are a lot of different things we can do with them. Anyway, to make a long story short, and I'm not going to bog you down with details, we eliminated the variables that were clearly not useful and then set up predictor discovery. Again, you simply say: we are working with this response, here is the list of the 98 or so variables, those are all the predictors we're going to try, go ahead and do it. You set up a basic configuration of runs; we usually do it with TreeNet, because TreeNet has the best capacity to explore and exploit all of those predictors. In the end, you may have to take a long coffee break or run it overnight, it doesn't matter; let the machine do the job. And in the end it showed a very interesting thing, which happens almost all the time when you start with tens or hundreds of predictors and hand them to this type of automated search to discover the best subset. What you see here is the elimination scheme; there are different schemes you can select, and again I'll leave the details out of scope. It became very clear that a model with only six predictors can be as predictive as all the other models, and I'll skip the issues of confounding and other things. When you look at that model, remember it's a TreeNet model, it actually shows that there is very little overfitting, and it shows the only six variables that survived the variable selection process. This time we're getting 87% R-squared, because it's a regression problem: we are trying to predict, basically, the percentage of hungry people in each country or region. That is very good performance for a regression model, and because it's a TreeNet model, we can also look at the individual plots. I'm not going to bog you down with details; I cut and pasted these slides from the presentation we did for Gartner. But you'll discover some common-sense things, like electrification being very important, which is usually associated with developed countries. Then you'll discover things like mortality, deaths per population; again, poverty is highly associated with those types of phenomena.
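Minitab's Discover Key Predictors automates exactly this elimination inside the product. As a hedged, outside-of-Minitab analogue only, and not TreeNet's actual elimination scheme, recursive feature elimination with cross-validation captures the same idea of "how small a predictor subset can we keep without giving up validated accuracy". The file and response column names below are assumptions.

```python
# Rough analogue of automated predictor discovery: recursively drop the weakest
# predictors and keep the smallest subset that preserves cross-validated R^2.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold

df = pd.read_csv("country_indicators.csv")           # hypothetical file name
df = df.dropna()                                      # crude handling, just for the sketch
y = df["UndernourishmentPct"]                         # assumed response column
X = df.drop(columns=["UndernourishmentPct"]).select_dtypes("number")

selector = RFECV(
    estimator=GradientBoostingRegressor(random_state=0),
    step=5,                                           # drop five predictors per round
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="r2",
)
selector.fit(X, y)

kept = list(X.columns[selector.support_])
print(f"Kept {len(kept)} of {X.shape[1]} predictors: {kept}")
print(f"Best cross-validated R^2: {selector.cv_results_['mean_test_score'].max():.3f}")
```

As in the webinar, a run like this can take a while on wide data sets, which is where the long coffee break comes in.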
Another interesting plot was the one for CO2 emissions. And again, I will not be political about this; there are all kinds of different takes on it. But if you look at the bare data here, based on this data set, it seems that the real critical region where poverty kicks in is when you have no CO2 emissions at all, which means there is no industry at all. Once you start introducing some level of industrialization, not necessarily at a level that harms the environment, be it CO2 or nitric oxide or other real emissions, the picture changes. When we talk about CO2, we sometimes forget that it's just something plants use to breathe, so it's actually quite recyclable, whereas there are other emissions that really do produce damage, and those can include all kinds of poisons and contaminants. I'll leave it at that, but you'll see that we don't even have to worry about that distinction when we look at world hunger, because a lot of those countries just don't have any emissions to begin with. Then, when you go to the male-to-female labor ratio, the main insight of this model is that we need to avoid the extremes: it is not good when only males or only females are the main workforce; you want to stay in that middle segment, with a lot of leeway back and forth. And the final insight is that if you look at the effectiveness of government, the level of corruption and the rule of law, there is a very clear trend: poverty is associated with a lot of corruption and a general breakdown in society in terms of the rule of law.

So on that note, as you can see, we solved our second problem, and it was a data set that was fun to look at and understand. In total, six relevant predictors out of the initial 97 were identified in this particular example. And again, probably 90% of our time and effort went into assembling the data set, doing merges and other things; running the actual models, looking at them, and putting together the final report for Gartner was a walk in the park, with a lot of cool findings. You just need to take a coffee break and wait for that AI to do the job for you.

So what's the overall conclusion? At Minitab, we have been relying on traditional AI for decades. The most recent incarnations of it, so to say, are the Graph Builder rule-based AI, which allows you to quickly explore variables and their mutual dependencies, as you saw here, and two very powerful flavors of AutoML: Discover Best Model, which lets you rely on the combined knowledge of all the different algorithms and lets the system identify the most useful ones for you, and Discover Key Predictors, another very powerful feature that helps you deal with situations where you have tens of variables, hundreds of variables, in some cases thousands or tens of thousands. All of that is a useful part of artificial intelligence that will save you a huge amount of manual modeling effort and give you a real advantage over your competitors. And on this lovely note, I will switch to Dave for a short Q&A session.

Awesome. Thank you, Mikhail. We're going to kick off our Q&A session in just a minute. I do want to remind everybody in the webinar to answer our survey questions.
So take the time to submit your questions, but also don't forget that survey so we can get your feedback. All right, here's a good question for you, Mikhail: is there any requirement on data set size for any of the algorithms you used today?

Well, here is the thing, and that's why I sometimes ask how many observations you have in your data set; there is a way to think about it. Classical statistical techniques like linear regression and logistic regression were designed for data sets with, say, no more than 100 records, because that's where they shine. You impose your own assumptions, you have all the distributional assumptions and other things, but you just don't have enough resolution within the data to capture fancy nonlinearities. As data sets become larger and larger, traditional techniques tend to fail because it's hard to guess what's going on, and that's where machine learning techniques start shining. So I would say once you go into hundreds of records, you already have a good opportunity for modern machine learning techniques to outperform conventional techniques, and then the sky is the limit. Once you get into millions of records, you can experience all sorts of resource issues, but machine learning techniques start at the level of, say, a couple hundred records and go up from there. Now, there's no hard rule. I've seen cases where people got good results even with 50 or 60 records, but certainly not below 30 records; at that point you're just wasting your time trying to find something when you don't have enough records to even consider it.

Great, thanks, Mikhail. Another question here: I've been using Minitab Statistical Software for a long time but have never used the predictive analytics algorithms. How can I access the tools you mentioned today?

OK. Minitab offers a Predictive Analytics add-on. It is a little bit extra that you need to pay for, but once it's activated, it opens up Random Forests, TreeNet and MARS regression, plus the automated ML features. Otherwise, in the basic version you only have access to traditional classical techniques like linear and logistic regression; we also include CART there, classification and regression trees, which is a pretty unique, cool technique that comes with the base package. So I would encourage you to look into activating the PA add-on. In fact, I have a vested interest here; I'm always advising, hey, why don't we make it part of the main version, because everyone needs this stuff. But that's the current situation we have.

Great. Thanks, Mikhail. I have another question here: you did not discuss data cleaning before jumping into models. Are some of these methods robust to non-prepped data?

Well, here's the thing: data cleaning is a necessary evil no matter what kind of problem you have. Yes, the machine learning techniques we use, the tree-based techniques specifically, CART, MARS, TreeNet, Random Forests, have a lot of cleverness placed inside of them and the ability to handle missing values, variables with a large number of distinct categories, and outlying records.
But it doesn't mean you don't have to do data cleaning at all. There is still some common-sense data cleaning that we all need to rely on. These tools are robust with respect to data quality, but they are not going to eliminate data cleaning, because data cleaning involves a lot of common-sense work; otherwise you'll have a garbage-in, garbage-out type of scenario. That said, outliers are not an issue, because all these techniques are rank-based. Missing values are not a real issue either, because they all know how to handle missing values; they don't just drop those records the way normal regression would. But you may still experience some side effects, and that's why I showed you Graph Builder. That's the tool I use all the time to understand the situation in my data set a little better, and then work with my data people, or sometimes do it myself, which I don't necessarily like because we have a division of labor, right? But at least I can talk to them and say, hey Joe, can you help me? Something odd is going on with these variables and it doesn't make any sense; can you check that? So I did not talk about data cleaning because it is outside the scope of this session, but a lot of these tools can be used to assist with data cleaning.

Yeah, and there was another related question: are there any tools to help with data cleanup? We do have Minitab Connect, which is another tool that can connect to different data sources and also help automate some of that data prep, so that your data is cleaned up before you use these algorithms. Just wanted to answer that question as well.

And in fact, David, I think it would be a great topic to dedicate one day of next year's PA Week to how to do data cleaning using PA tools, because there are several tricks I could talk about; it would just take us a whole other hour to discuss.

Yeah, perfect. We'll definitely try to do that for next year, but that's all the time we have for questions. Before I let all the attendees go, I did want to touch on our Minitab Exchange in-person events. If you are based in the US and enjoyed today's presentation, you will really enjoy one of our upcoming in-person Minitab Exchange events. Topics include best practices in Minitab Statistical Software, industry expert speaker spotlights, interactive workshops and more, and the agenda is designed to help you expand your skills and leverage your data. Our next event in the US will be in Rosemont, IL on June 18th, and in the coming months we will be in Columbus, OH, Dallas, TX and Anaheim, CA. All of these events are complimentary. You can register today by clicking on the Minitab Exchange link on the webinar dashboard, which is on your bottom right. We will also have Exchange events in four countries across Europe, so visit our Events web page or contact us for details on the agenda and locations for those events. You can also indicate in the survey if you're interested in attending any of them.

Finally, to close this webinar, I'd like to share some information for those who may not be as familiar with Minitab. At Minitab, we help customers around the world leverage the power of data analysis to gain insights and make a significant impact on their organizations.
By unlocking the value of data, Minitab enables organizations to improve performance, develop life-changing innovations, and meet their commitments to delivering high-quality products and services and outstanding customer satisfaction. Thank you all again for attending the webinar, and we hope to see you back soon.