Science. Now I will hand it over to Russ for today's presentation. Thank you, Matilde. Hello everyone. Today we're going to be talking about quantizing edge neural networks with QKeras. So here's the agenda I'll be covering today. We'll start by talking about the inferencing computational load, then move on to hardware acceleration. We'll talk about quantization of neural networks, overflow and saturation, and then we'll introduce the QKeras package and talk about how that can be used to improve the characteristics of the neural network as we quantize it. And then we'll talk about measuring power, performance, and area, and explain our results on an example design that we've run through this. So inferencing of convolutional neural networks has a very large computational load. If we take a look at relatively small neural networks that are doing, for example, object detection, these generally require a small number, single-digit billions, of multiply-accumulate operations. As we go to larger networks, they'll extend into the tens of billions of multiply-accumulate operations. The convolution algorithm takes each input pixel and performs a number of operations on it to get an output pixel, and then these are stacked in a number of layers that the data is run through. There's actually a significant amount of fan-out. So with these billions of multiply-accumulate operations that we need to perform, if we look at most edge systems, the compute capability that we've got in general-purpose CPUs is usually not going to be enough to run a single inference within a second. So if we've got multiple inferences that we need to do per second, running this in software simply isn't going to work. It's just too large a computational load. So we need to bring in some form of hardware acceleration in order to perform these inferences and meet the timing and power demands that we've got.
Hardware is generally going to be faster and more efficient than our software, and we achieve this increase in speed and greater efficiency in a couple of ways. On the face of it, it might not be obvious, because processors are made out of a particular type of silicon. The accelerator, if it's on the same chip, will be based on the same silicon, and the multiplier in the CPU is going to be the same as a multiplier in our accelerator hardware. The way that we get performance out of moving things into hardware is to take advantage of parallelism. So instead of having a single multiplier, as we would in a CPU running software, within hardware we can lay down 10, 20, 50, or even hundreds of multipliers and have those operate all at the same time, and that is going to dramatically speed up the execution of the algorithm. We can also connect these different processing elements, multipliers, or other computational units together so that they feed into one another. This eliminates the need to take intermediate data, push it back into caches or main memory, and then bring it back again when it's needed, and saving that data movement time can also significantly speed up the execution of our algorithm. Now, to get efficiency, one of the things we can do is tailor the hardware very specifically to the problem. A CPU needs to be able to run any program that anybody's going to bring to it at any point in time, whereas if we're targeting a particular accelerator to a very specific job, we can tailor it exactly to that problem. So while the processor needs to have the full range of numeric representation, if we've got hardware and we know that none of the numbers are going to get larger than 100, we can actually throw away the high-order bits, and this makes the hardware much smaller and therefore more efficient.
Now if we look at convolutional neural networks and take a look at the benefit that they can get from hardware accelerators that we bring to the party, very typically we're going to see speedups of 20 to 100 times. In special cases, you can see gains of over 10,000-fold. Exactly what speedup you get is going to depend on the design constraints and how much hardware you can bring to the party, but it's really going to run much, much faster than it would in software on a CPU. The other thing I'll observe here is that CNNs are embarrassingly parallel. If we take a look at all of the multiply-accumulate operations that we need to do within a convolution, we can perform all of those multiplications in parallel. There's no data dependency, which means if we could lay down enough multipliers and we could get data moved into them fast enough, we could run these algorithms very, very quickly. Also, the data reuse is very high within a layer. We're basically going to bring in some data and reuse it in a number of different operations. And insofar as we're able to keep that data local and not write it back out to main memory or caches, we get a benefit from that as well. Now, in order to bring hardware acceleration into your design, you've got a number of alternatives. You could bring in a discrete component. An example of this would be Google's Edge TPU. As you can see there, it's a very small component. It's also very low power, but it does mean that you'll be going off chip and back on chip, which has some issues with it. But it certainly is one way to accelerate the inferencing for convolutional neural networks. Another approach might be to acquire some IP. There are a number of IP vendors, and there are also a number of open source projects that have accelerators available for you to use. One example would be NVIDIA's Deep Learning Accelerator, or NVDLA.
They have a small version of this that's targeted at edge-type systems, and this can be configured in a variety of ways to implement an accelerator for machine learning and inferencing. Another alternative, of course, is you can build it from scratch. You could start with RTL and design the accelerator just as you would build any other hardware. But what we did was generate our accelerator from a C representation of the inferencing algorithm using High-Level Synthesis. So High-Level Synthesis is a technique that takes an abstract representation of the algorithm. Typically it's going to be C, C++, or SystemC, but really there's no reason why it would be limited to these languages. It just has to be a representation of the algorithm, and most of the commercial high-level synthesis tools do use these three languages. It will then compile this into a format that it can manipulate and target, and ultimately generate a set of RTL that implements the same algorithm described in the C or C++. Now, with High-Level Synthesis, that compilation process will handle many of the low-level details that hardware developers need to take care of when they're doing implementation. Many of these are things that aren't representative of exploring the architecture space or the creative work. It's merely stuff that needs to be done in order to get working RTL, and so folks often don't mind handing that off to a compiler. So what it's going to do is read in the C or C++ algorithm through a traditional parsing process. It's going to create a control data flow graph from there, and it's then going to create an RTL representation targeted at the specific technology that you are going to implement your accelerator in. It's going to be able to go in and identify opportunities for parallelism, pipelining, and resource sharing, as well as inferring interfaces and memories.
As we look at the challenges in putting together an AI accelerator, we've got a lot of competing goals. We've got the accuracy, the power, the area, the throughput, and so forth. We might want it to have a very small area, but that means that our throughput might suffer. We'd like it to be very accurate, but that might mean that it consumes more power. So building this accelerator means we want to be able to find the optimal tradeoff between these different goals, and how do we go about doing that? High-Level Synthesis is going to allow us to explore a much broader design space. We've got a single model for both the architecture and the implementation, and we have a set of constraints, or knobs on the tool, that we can turn to realize different architectures, and we can do this very quickly. With an RTL approach, changing that RTL to a dramatically different architecture would involve lots of very low-level programming, and so it really just isn't practical to explore the entire potential target space for the architecture. With High-Level Synthesis it may be changing a constraint, it may be changing a few lines of C code, and we can realize a dramatically different implementation. We can look at that and figure out whether it better meets the set of tradeoffs that we're interested in realizing, and in this way High-Level Synthesis really lets us take a look at a much broader design space than would otherwise be possible. It also lets us very quickly simulate the behavior in C, C++, or SystemC by bringing in bit-accurate representations of the data elements and bit-accurate representations of all the operations. These are all modeled at the C level, so before we take it down to RTL, we can run a complete bit-accurate representation
of the algorithm targeting a particular quantization, and then we can observe the effects on the algorithm. We can look at how long this is going to take, how large it is going to be, and what the change in accuracy is based on the fact that we changed the number of bits at a particular point in the algorithm. So we start with a floating-point model and make sure that that works correctly, and then we'll move it into a bit-accurate representation using integer or fixed-point numbers, or maybe floating-point numbers, though we'll see in a few slides that that's not desirable. Then ultimately we can use that to run a verification and get the true hardware behavior. This is typically going to run 30 to 1000 times faster than the RTL implementation, which means we can evaluate a lot of different architectural tradeoffs before we commit to RTL. Once we have decided on exactly what we're going to target, within the High-Level Synthesis tool we have the ability to unroll loops. We can identify specific loops that we might want to completely unroll, and that means we're going to perform each iteration of the loop in parallel. Sometimes the loop will have more iterations than we can completely unroll, so we might want to pipeline it, and there we'll have multiple stages through the various steps inside the loop being executed concurrently. We can also do a combination of partially unrolling and pipelining a loop, and so this gives us a great deal of control over the hardware that's going to be implemented. If we have two loops next to each other that run the same number of iterations and operate on the same data, these can be merged, and the compiler is smart enough to figure that out.
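To make the difference between the loop options described above concrete, here is a rough Python sketch of the textbook cycle-count estimates for a sequential, a fully unrolled, and a pipelined loop. The function name, the 3-cycle body latency, and the initiation interval of 1 are illustrative assumptions and are not taken from the talk or from any particular High-Level Synthesis tool, which would schedule the loop from the actual design and target technology.

```python
def loop_cycles(iterations, body_latency, mode, initiation_interval=1):
    """Rough cycle estimate for one loop (hypothetical helper, not a tool API).

    sequential: one iteration after another on a single datapath
    unrolled:   every iteration has its own hardware and runs in parallel
    pipelined:  a new iteration starts every `initiation_interval` cycles
    """
    if mode == "sequential":
        return iterations * body_latency
    if mode == "unrolled":
        return body_latency
    if mode == "pipelined":
        return body_latency + (iterations - 1) * initiation_interval
    raise ValueError(f"unknown mode: {mode}")


# Assumed example: the 9 multiply-accumulates of one 3x3 kernel window,
# with an assumed 3-cycle loop body.
for mode in ("sequential", "unrolled", "pipelined"):
    print(mode, loop_cycles(9, 3, mode))
```

The unrolled case is fastest but needs one copy of the loop body per iteration, which is why pipelining or partial unrolling is usually the compromise when area matters.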
Now as we go towards scheduling this, the compiler does have awareness of the ultimate target technology and the clock rate that you're trying to target, and so as it builds its RTL representation, it can build that so that it will go through RTL synthesis and timing closure very easily. It also has the ability to identify sharing opportunities for resources, both registers and operators, that are mutually exclusive throughout the execution of the algorithm. Rather than putting down multiple units and taking up more area, it can identify that they can be shared, and it will put in all the logic necessary to do that. So it gives us the ability to create a very efficient implementation of the accelerator that we're looking for, balancing all those different tradeoffs that we talked about a couple of slides ago. One of the other challenges in moving convolutional neural network inference into hardware is the large weight and bias databases that we're bringing into this. For a very small convolutional neural network, we might have several megabytes of weight and bias data that we need to process. If we have large neural networks, these can range up into the gigabytes. There's also intermediate storage that needs to be considered, so it's not just the weight and bias database. As we're performing these computations, there will be intermediate results that we need to store, and these can be very large as well. So what we find is that an inferencing accelerator is generally going to be dominated by memory. Most of the area that we're going to put in silicon will be memories to hold the weight and bias database and intermediate results, and we're going to find that our performance limitation is probably moving around all of this data. And so the size of this data tends to dominate both our power consumption and our performance.
We're also going to find that the multipliers are very large. Now, we want to bring a large number of multipliers into the mix because that lets us do more in parallel, but these are going to tend to dominate both the area and the power for the computational logic that we're using. So the observation that I'll make here is that anything we can do that's going to reduce the size of the data is really going to benefit the design. This is some information that NVIDIA published in a DAC paper back in 2017, looking at the energy and area of different operations. Some of the key points that I'll call out: a fixed-point multiplier is about half the area of the equivalent floating-point multiplier, and multipliers are generally proportional to the square of the size of the factors. So a 32-bit fixed-point multiplier is going to be about 16 times larger and more power hungry than an 8-bit multiplier. And if we compare that 8-bit multiplier to a 32-bit floating-point multiplier, it's going to be about 32 times smaller. So if we can quantize our data from the 32-bit representation of the original algorithm down to 8 bits, it represents the ability to really shrink down the multipliers in the design, which again are going to be our biggest component in the logic. It's also going to reduce the amount of data storage that we need, both for the weights and biases and for the intermediate results. Those are going to scale linearly with the size. So if we can drop from 32 bits down to 8 bits and go from floating point to fixed point, we're going to save about 32x on our multipliers and about 4x on our data storage and movement costs. Now, if you take a look at the normalized energy costs, you're going to see that going off chip to get data, that DRAM access, is 200 times more costly in terms of energy than an ALU operation. And so really, anything that scales down the size of the data helps.
If we can pack more of it in, that is going to significantly benefit the design. Now just a quick review of numeric representation. Here I've got the IEEE 32-bit and 64-bit floating-point representations as well as a fixed-point representation. For a fixed-point representation, we're really going to be able to set it to be any size we want. We can have any number of integer bits and any number of fractional bits. Now, in the example algorithm we'll be talking about in a few slides, the biggest number that we ever saw through the entire processing of a fairly large data set was 59.804. If we look at IEEE 32-bit floating point, we can see that the largest number it can contain is about 3.4 × 10^38. I was going to write this out with zeros and commas, but it just kind of got ridiculous. But the point here is that much of the total space that can be represented by IEEE 32-bit floating-point numbers is actually wasted here, because again, our numbers never get really large within an inferencing algorithm. We typically try to keep the values between positive one and minus one. Occasionally we're going to have summations that get up into the 30s or 40s, and once in a while up to 60 or 100, but between layers we're probably normalizing the data back down to the one to minus one range. So representing these numbers as floating point means we're going to have a whole lot of wasted space within the design, and moving to fixed point and quantizing it means we can save all of that. Now, when we're running with floating-point numbers, overflow is rarely going to be an issue, right?
Even with these large convolutional neural networks, we're never going to get up to a number that's 10 to the 38th or larger, and so we don't have to worry about overflow. As we move to a fixed-point representation, where we're more closely modeling the span of numbers that we're going to encounter, the possibility of getting some new data that's going to cause one of these operations to overflow becomes much, much higher, so we do now need to worry about overflow. One approach is of course to over-engineer. We can simply add on some bits and hope for the best. What's probably a better idea is to use saturating math. With saturating math, rather than simply overflowing and producing an incorrect result, if it overflows, we assign it the largest possible value that could be held by that representation. So here's an example. If we took 62.5 and added two to it, using a 10-bit fixed-point number with seven integer bits and three fractional bits, and we were not using saturating math, what we'd find is that the sum would be minus 1.5. What should have been a large positive number became a small negative number, and every computation downstream is going to be likewise incorrect. Now if we use saturating math, that 62.5 + 2.0 would end up being 63.875. It's still wrong, but it's a lot closer to correct. One of the things researchers have found is that within a convolutional neural network, the accuracy of the numbers around zero tends to be very important, but if a number gets big, whether it's big negative or big positive, the actual precision doesn't impact the accuracy of the inference all that much. So simply knowing that it was a big number, whether it was 64.5 or 63.875, turns out not to impact our inference much, and the saturating approach is a good way to allow us to reduce the size of our data.
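The overflow behavior described above can be reproduced with a few lines of Python. This is a minimal sketch, assuming a signed two's-complement fixed-point format with 10 total bits, 7 integer bits (including the sign), and 3 fractional bits; the function names are made up for illustration. With plain two's-complement wraparound this sketch gives -63.5 for 62.5 + 2.0, which differs from the figure quoted in the talk, but the point is the same: the sum flips sign without saturation and clamps to 63.875 with it.

```python
WIDTH = 10                   # total bits, as in the talk's example
INT_BITS = 7                 # integer bits, sign bit included (assumption)
FRAC_BITS = WIDTH - INT_BITS
SCALE = 1 << FRAC_BITS       # 8 steps per unit
RAW_MIN = -(1 << (WIDTH - 1))        # -512, i.e. -64.0
RAW_MAX = (1 << (WIDTH - 1)) - 1     #  511, i.e. 63.875


def fixed_add(a, b, saturate):
    """Add two values in the assumed 10-bit fixed-point format."""
    raw = int(a * SCALE) + int(b * SCALE)
    if saturate:
        # Clamp to the largest or smallest representable value.
        raw = max(RAW_MIN, min(RAW_MAX, raw))
    else:
        # Two's-complement wraparound, as a plain adder would do.
        raw = (raw - RAW_MIN) % (1 << WIDTH) + RAW_MIN
    return raw / SCALE


print(fixed_add(62.5, 2.0, saturate=False))  # wraps to a negative value
print(fixed_add(62.5, 2.0, saturate=True))   # clamps to the maximum
```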
So here's the example network that we did some work on, and I'll explain the results of that in a few slides. This is the MNIST character recognition. If you've spent much time around neural network inferencing, you've probably run into this as an example and probably coded it up yourself. We take a 28-by-28-pixel representation of a handwritten digit, and this neural network is going to go through a 2D 3-by-3 convolution and ReLU, and that's going to create 8 output images. We're then going to run it through another 2D convolution. This is also 3-by-3 with a ReLU, and that's going to output 3 images. We're going to run that through a dense layer and then softmax to create probabilities, and we'll end up with a result of 10 probabilities for the 10 different digits that it potentially could be. And so this is our base neural network. Now, if we take that neural network that I just described and load it up into Keras or TensorFlow or whatever your favorite machine learning platform is, what we would find is that we end up with a bunch of weights. Here we didn't use biases, but typically you'd end up with both weights and biases, and these would be floating-point numbers. Now we can simply take those numbers and push them into a fixed-point representation, and this is what we'll refer to as post-training quantization. When we do this, there are two things we can do. Number one, we can go to the high-order end of these numbers and eliminate the most significant bits that never get set. Again, in our example, we never saw a number higher than 59, so 6 integer bits, actually 7 with the sign bit, is all that we need to represent the numbers that we've got. The second thing that we can do is reduce the precision.
In other words, take the low-order bits of the number and start to remove those bits and see what it does to the accuracy of our inferences. Recall that we can run these simulations at the C level, and that gives us the ability to measure the accuracy of these changes very quickly. We do that using the Algorithmic C data types. This is a set of C++ classes that allow us to model arbitrary-width integers, fixed-point, and floating-point numbers, and allow us to run C execution of these in a bit-accurate manner. This is really an alternative to building an RTL implementation to see exactly what the resulting hardware would do. These are available as open source on GitHub, and you can download them and start using them. They're just out there for anybody to use. And as I said, it allows you to observe the bit-accurate algorithm behavior. Now, in this particular case, we had a test set of 10,000 images that we used for testing the accuracy of this network with the various bit widths. And there's a mistake in the slide. Here it says it took 3 minutes to run those 10,000 inferences. That was when I was originally compiling it using GCC without -O3. When I turned on the compiler optimizations, that dropped down to 10 seconds, so 10 seconds to do the 10,000 inferences and look at the accuracy of those inferences. If we were to do the same thing in RTL, it would take 8 weeks. So you can see, 10 seconds versus 8 weeks, we can work through a lot of different combinations staying up at the C level and running with these Algorithmic C data types. And so here's exactly how we got this implemented. A little snippet of code here where we defined our feature type and our weight type. That's the highlighted line there. We just do a typedef and we use the ac_fixed data type, and it's templatized, and so we give it a number of bits.
The word size is going to be the total width in bits for that representation, and then comes the number of integer bits. The fractional bits will be whatever's left. Then the "true" represents whether we want it signed or not. In this case we did. We want to do this as a rounding operation. Normally integer math does truncation, and as we get down to a small number of bits, that truncation does start to impact the accuracy of the results. And then we want to do it, as we described, as saturating math. So simply by redefining this, we now have this C type that has all the operations overloaded and so forth, and we can recompile this with different word sizes and different numbers of integer bits and look at the accuracy of our inference. First I want to look at the impact of saturating math versus non-saturating math. Here what we did was recompile the algorithm with a different number of integer bits, from 16 down to one. What we see is that at 16 bits we get the same accuracy for the network as the floating-point representation, and that's 98 point, I believe it's one five. As we shrink down to 7 bits, the accuracy doesn't change, because all we're doing is removing bits that were never being set. Now with the plain fixed-point values, we see that as we drop down to six bits and then to five bits, the accuracy of the network falls off really dramatically. However, with the saturating fixed point, when we go down to six bits we don't lose any accuracy at all. When we go down to five bits, it drops, but it's still up in the 90% range. So the saturating math on the fixed-point representation actually ends up letting us save one or two bits, depending on the neural network, and again, shaving those one or two bits is significant in the size and the energy that's used in the design. So here's a table where we varied the number of integer bits and the number of fractional bits and looked at the accuracy.
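The four choices described for the fixed-point type (total width, integer bits, rounding versus truncation, and saturation) can be mimicked in Python to get a feel for their effect. This is only an approximate emulation under stated assumptions, not the Algorithmic C library itself: the `quantize` helper is hypothetical, rounding is taken as round-half-up and truncation as rounding toward minus infinity, and the sample weights are made-up values.

```python
import math


def quantize(value, width, int_bits, rounding=True, saturate=True):
    """Map a float onto a signed fixed-point grid (hypothetical helper).

    width:    total number of bits, sign included
    int_bits: integer bits, sign included; the rest are fractional
    """
    scale = 2 ** (width - int_bits)
    scaled = value * scale
    raw = math.floor(scaled + 0.5) if rounding else math.floor(scaled)
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    if saturate:
        raw = max(lo, min(hi, raw))          # clamp on overflow
    else:
        raw = (raw - lo) % (1 << width) + lo  # wrap on overflow
    return raw / scale


def mean_abs_error(values, width, int_bits, rounding):
    """Average quantization error over a list of values."""
    errors = [abs(v - quantize(v, width, int_bits, rounding)) for v in values]
    return sum(errors) / len(errors)


# Made-up weights in the roughly -1..1 range mentioned in the talk.
SAMPLE_WEIGHTS = [0.3, -0.7, 0.12, 0.9, -0.05]

for frac_bits in (5, 3, 1):
    width = 1 + frac_bits  # one sign/integer bit plus the fractional bits
    print(frac_bits,
          mean_abs_error(SAMPLE_WEIGHTS, width, 1, rounding=True),
          mean_abs_error(SAMPLE_WEIGHTS, width, 1, rounding=False))
```

Rounding never does worse than truncation on in-range values, and the gap matters more as the fractional bits shrink, which matches the remark above that truncation starts to hurt at small bit widths.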
So we kind of did a sweep across all the different values. Here we're training with Keras and then doing a post-training quantization using saturating operations, and what we see is that we still have reasonably good accuracy at 5 integer bits and 5 fractional bits, or 6 integer bits and 4 fractional bits. So without retraining the network, we can get down to about 10 bits and still retain most of the accuracy of the original floating-point representation. But what researchers have observed is that if you retrain using the same math that you're using in the inferencing, you actually get better results. And so what we wanted to do now was to retrain this network using the same quantization that we had in the inferencing system. In order to do that, we need a training system that is quantization aware and that uses math equivalent to these AC data types that we've described. And that is a package called QKeras. QKeras, or Quantized Keras, is a Python package that extends the original Keras that you know about. It has replacement layers that are literally drop-in replacements for the original Keras layers, but they do quantization-aware math internally, so it's minimally intrusive to your design. You can actually modify a Keras representation into a QKeras representation in a few minutes. Now, it currently has support for all the commonly used layers and activations. If you're using something obscure, you may find it doesn't have support, but the team is working on enhancing that to support the full set of Keras layers. They're also working on tools for automating the search for the optimal quantization and for performance estimation. So here on the left is my original Keras representation of the neural network that I just described. It's the two convolutional 2D 3-by-3 layers, followed by the dense layer with a softmax activation. And on the right we have the QKeras implementation.
What we've done is taken the Conv2D and replaced it with a QConv2D, and the Dense layer has been replaced with QDense. All of the other parameters remain exactly the same as they were in the original neural network. But we've added a parameter to each one of them called the kernel_quantizer, and to that we assign quantized_bits with a pair X and Y, X being the total width of the fixed-point number that we're going to be using and Y being the integer portion. Now this looks a lot like the AC data types, and that's because what's being used inside this layer is the functional equivalent of those AC data types. So we can take this in our machine learning framework and retrain the network, but it will now be quantization aware. So this is the full scan. We did a scan from 8 bits down to zero on the fractional and integer side and looked at our accuracy at each of those points, and this is the post-training quantization. Again, you'll see that 5 fractional and 5 integer bits, or 6 integer and 4 fractional, represents sort of the smallest we can go and still get good results. This is what happens when we retrain the network using QKeras. Here what we see is that we can actually go down to three integer bits and no fractional bits and get 97.6% accuracy, or go to three fractional bits with zero integer bits and likewise get 97.66%. So we can take this all the way down to three bits. We ultimately did the representation as 4, since 3 just seems scary small. But the difference here is really dramatic. Being able to drop down from 10 bits to three or four is going to drop the size of the multipliers by about a factor of 10, so this is going to have a really big impact on the ultimate hardware that we're going to construct for this. Now, the folks who do QKeras: it's a guy named Claudionor Coelho who was leading the team over at Google. This is being built at Google.
This is an example of some of the results that they just published, not on a tiny little MNIST network, but on a large neural network that's being used in industry. This was done on an edge system in a particle collider. It's actually a really interesting article. I'd encourage you to take a look at it if you have the time. So Claudionor Coelho currently works for Palo Alto Networks. He used to work for Google on the QKeras team, and prior to that he worked at Synopsys on synthesis and hardware design. So he really understood the need to both quantize this and effectively be able to reduce the size of the accelerators around neural networking. And as you can see here, using post-training quantization he could get down to about 14 bits on this algorithm, and then the accuracy just fell off a cliff, whereas with QKeras there's virtually no degradation going down to 6 bits. So again, this is going to translate into much smaller, much more power-efficient hardware. So we've determined the accuracy, and we could do all of that at a very abstract level. We could do that at the C level. Now we want to look at what the impact of this is on the power and the performance. What we did was use our high-level synthesis compiler to create a variety of RTL implementations for the various bit-width configurations. When we did this, we adjusted the transfer of the data as we were going down in bit width. We had a 32-bit bus on this system. As we went down to 16 bits or less, we packed two data values on the bus. When we got down to 10 bits, we packed three data values on the bus so that we could minimize the time to transfer the data. We also added more multipliers to match the arriving data rate. When we had one data value coming in, we had nine multipliers active to do the 3-by-3 convolution. When we were passing two data elements, we had 18 multipliers. With 3 data elements we'd have 27 multipliers, and so forth.
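The packing arithmetic described above is easy to check with a short Python sketch. It assumes only what the talk states, a 32-bit bus and nine multipliers per incoming value for the 3-by-3 convolution; the helper names and the bit-field layout (lowest value in the lowest bits) are illustrative assumptions, not the actual design.

```python
BUS_WIDTH = 32        # bus width stated in the talk
MULTS_PER_VALUE = 9   # one 3x3 convolution window per incoming value


def values_per_transfer(data_bits, bus_width=BUS_WIDTH):
    """How many whole data values fit in one bus word."""
    return bus_width // data_bits


def multipliers_needed(data_bits):
    """Multipliers required to keep up with the arriving data rate."""
    return values_per_transfer(data_bits) * MULTS_PER_VALUE


def pack(raw_values, data_bits):
    """Pack raw bit patterns into one bus word, first value in the low bits."""
    if len(raw_values) > values_per_transfer(data_bits):
        raise ValueError("too many values for one bus word")
    mask = (1 << data_bits) - 1
    word = 0
    for i, raw in enumerate(raw_values):
        word |= (raw & mask) << (i * data_bits)
    return word


def unpack(word, data_bits, count):
    """Recover `count` raw bit patterns from a packed bus word."""
    mask = (1 << data_bits) - 1
    return [(word >> (i * data_bits)) & mask for i in range(count)]


for bits in (32, 16, 10, 8):
    print(bits, values_per_transfer(bits), multipliers_needed(bits))
```

For 16-bit and 10-bit data this gives two and three values per transfer, and 18 and 27 multipliers, in line with the numbers quoted in the talk.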
The performance numbers that I'm going to give you are end-to-end inference, so from an embedded microcontroller saying "go" to getting the actual probabilities back. This was synthesized to the gate level for a 12-nanometer ASIC library to determine the size, and we collected circuit activity in logic simulation for doing the dynamic power analysis. The power numbers are post place and route but pre clock-tree synthesis. I'm now going to hand it over to my colleague Ashe to talk about some of the details of the back-end process around all of this. Take it away, Ashley. Thanks, Russ. So you must have noticed that all these neural network designs are pretty heavy in terms of memory usage. They store lots of intermediate data, and that really requires lots of storage. That's one of the key factors that gives us a problem in the traditional P&R flow, place and route, starting from RTL to GDSII. But here we are talking about C. So the traditional flow is that you start with the C, convert it through Catapult synthesis into RTL, do your RTL synthesis, generate all your memory views, do the P&R, and take your flow through your regular tool usage. But if you really want to run these different iterations of different combinations and variations, it takes many months to do all of that, and on top of that you basically have to have a big team to make that implementation. So here at Mentor Graphics we developed a flow where we just take the C code, convert it into RTL, and then generate the memories based on the given technology. We do different variations and combinations of designs, standard cell libraries, and memories, then generate a list of recipes and see what the power number is for each, and then we select our best recipe, where you get the best selected memories.
You choose which technology you want to work on and then take your flow into your final place-and-route set of tools. So here is the detail of the flow which we have developed, and this is a completely automatic flow. You just need to do some preparation in the beginning for a specific technology, where you write out some memory templates for the different types of memory compiler of the given technology. After that it's all pretty straightforward and fully automatic. You take your C code, with the Catapult characterized library, and generate different variations in directives. On top of that, the Catapult input file generates different RTLs. Those RTLs are then taken into our Oasys memory constructor, which generates the different types of memories required by the different RTL configurations, and where you can select the maximum memory size that you want. For example, if you have a 64 KB, 32-bit-wide memory, but you really don't want to have that big a memory in the design, you may want to split it into four, and that is also possible. So you just generate those memories, and this is all automatic. You don't need any manual effort there. You just need to give some initial instructions in the form of a Tcl file and generate all the memory views. The flow replaces the behavioral models of the memories in the RTL with the physical memories, instantiates them, and then takes your design into the Oasys synthesis tool and runs the P&R flow there until pre-CTS. Then we take that design into PowerPro and generate all the power numbers based on different combinations of memories, different floorplans, different shapes, and different utilizations. You look at your PPA number, and whatever best recipe you select from there based on the power number, you can use for the real implementation. We have tried this on different designs.
Designs coming from CPU cores, and some IoT designs as well, were implemented in our flow, and we got a very good success rate there. So this is one of the floorplan examples. As I mentioned on the previous slide, the flow generates the floorplan automatically based on the incoming memories. If you look at this floorplan, it has more than 150 memories. Doing this floorplan by hand takes one engineer a couple of days, so just imagine if you have 16 RTLs and each has 150 macros: it will take a couple of people a couple of weeks. But with the automatic flow which we have developed, it takes a couple of hours, maybe one or two hours, and you get all the floorplans pretty quickly. So that's all from my side, and I'll hand back over to Russ.

All right. One other observation on that floorplan is that it actually was kind of uninteresting, in that it was just a whole bunch of boxes, and those boxes were memory elements. Even at a very high level you can take a look at it and observe that. And that was the floorplan for the accelerator plus a processor plus all of the memory necessary to support it, and as you can see, it just becomes a very memory-dominated design. So again, anything we can do to shrink the data is going to help in the ultimate implementation of this system.

These are the results. I'm going to start with the top line of this table, which was a software implementation on a microcontroller, running this inference purely in software using floating-point numbers. There we have a speed of 174 milliseconds, a power of 0.66 milliwatts, and an area of about 10,000 square microns, with 128 microjoules per inference. I also have listed, down the right side of the table, the accuracy for both post-training quantization and quantization-aware training. The first of those columns is "accuracy, Keras weights".
That column represents post-training quantization, while the "accuracy, QKeras weights" column is trained using the specific data types. What we can see is that as we go into the hardware implementation, any hardware beats the software by about a factor of 80 in terms of performance and also about a factor of 80 in terms of energy per inference. Now, our power does go up, because we've brought in a whole bunch of big multipliers. With our 32-bit ac_fixed values, those are 32-bit multipliers and we've got nine of them, so we significantly increase the size of the design. As we go to ac_fixed<16,8>, what we're doing at that point is doubling the number of multipliers, but the design size goes down because those multipliers are 16-bit multipliers instead of 32-bit multipliers. So we can see that even though we're adding more multipliers and getting more work done per clock, our area is going down, our power is going down, and our energy is going down. And if you look at the accuracy numbers, you certainly wouldn't want to go much under 10 bits for the post-training quantization, whereas with the network trained using the AC data types we can go all the way down to an ac_fixed<4,2> and still get very good accuracy, and it represents the smallest and most energy-efficient design that we could build.

So to wrap things up: post-training quantization is not a bad thing. It really works, and it offers some benefit. Clearly, going down to 10 bits from 32 bits is a big improvement over using floating point. But if we can retrain using the data types that we're going to be using in inference, we're going to end up with an even better implementation. The design power, performance, and area tend to be dominated by the data that we're using: the data that we're multiplying and the data that we're moving around. Smaller features are going to allow us to build a far more efficient design, and we've got a path for being able to measure the results.
The performance, the power, the area, and the accuracy can all be measured using the high-level synthesis, place-and-route flow, and power analysis tools that Ajay described. It is now practical to look at trading off accuracy against power, performance, and area for ASIC and FPGA designs. All right, that's our presentation for today. I'd like to thank you for attending, and we'll start taking some questions, which I see are kind of piling up in the Q&A box.

Yes, so please do continue submitting those via the Q&A widget. We do have a couple of poll questions, and we really would appreciate your feedback in answering those. I will hand it off now to Russ and Ajay. We also have Mike Fingeroff and Stuart Clubb on the line, ready to answer all of your questions. Thank you so much for joining.

Let's see, I'll start with question number 12. Nordine was asking which processor is in the mix here. This was done using a RISC-V processor called the Ibex core. It is a three-stage-pipeline, virtual memory core, so a fairly low-end core. It did have a floating-point unit in it, so the floating-point numbers that I gave you are not from emulating floating point in software. The processor was not implemented in Catapult. We just went and got the RTL from the lowRISC folks.

Let's see, question number 11. Pascal asks whether the area of the SRAMs was included in the presentation, and yes, it was. The area is the processor, the accelerator, and all of the memories that were included in the design: the memory used for weights and biases, the intermediate storage, and the program and data space for the processor.

Rex asks: on pages 23 and 24, you get more accuracy with fewer integer and fractional bits. Is this a rounding or statistical error? Yes, it is. Each time you retrain the network, you're going to get a slightly different accuracy, and so as we're retraining with quantization-aware training, you're going to see that vary a fair amount. It's just a statistical effect.
There is both a rounding error and the fact that we're doing retraining sessions, each of which gives you a slightly different neural network.

Jovan asks: how do you move from a Keras or QKeras model into the C++ for the HLS? This is done manually today. I can imagine a universe where it would be automated, but today, as we run this through the high-level synthesis tool, it's really necessary for the designer to fully understand the algorithm that's being run through the compiler in order to take full advantage of all the capability of high-level synthesis. So grabbing some C++ that they don't know from a third party is really not desirable. But what you find is that each of these layers is a relatively small amount of code. I think the convolution was the biggest, at 70 or 80 lines of C++. We also have the ability to check that what we implemented and ran through high-level synthesis exactly matches the original Keras model, so you can run those two in parallel and prove the equivalence between them.

Let's see, I've run out of questions below number 8. How is the transformation from floating point to fixed point done: linear, logarithmic, or some other transformation? This is just a simple rounding. If we've got, say, one bit in the fractional portion of our fixed-point number, we're going to look at the floating-point value and round it so that the low-order bit is either a one or a zero. If the fraction is below a quarter, that bit will go to a zero; if it is a quarter to three quarters, it'll be a one; and over three quarters it'll round back to zero, carrying up into the next integer. So it's just a rounding representation.

I'll take number 13. The question is from a person at Silicon Labs: how is a synthesized solution integrated into a memory-mapped, burst-based SoC chip? We start from the synthesized RTL.
We instantiate the physical memories in place of those behavioral memories, take the complete synthesized gate-level netlist into our P&R solution, and do the whole P&R and CTS. I hope that answers the question.

All right, somebody asks: do you also support an FPGA flow? The RTL that's generated by our high-level synthesis tool can be targeted at either an ASIC or an FPGA flow. Of course, in an FPGA flow we wouldn't have the back-end process; that would be done using the FPGA vendor's tools. But yes, we do support an FPGA flow.

There was a question at the beginning: have you compared a CNN directly synthesized versus executing it on a synthesized accelerator? I think you're asking how our synthesized accelerator would compare to an off-the-shelf accelerator. That's actually something I'm working on right now, but I don't have numbers for it at the moment.

Let's see. Is the tool relevant for designing programmable neural network accelerators where the network topology can be controlled by software? Yes, it is. What you generate in your accelerator is really completely under your control, whether you're generating some very dedicated hardware that's just going to implement one neural network, or something rather generalized that you could reprogram from software by setting registers in it. You can build either using this approach. In this case, we were really targeting something very specific, going for low energy, but you certainly could use the approach for something more programmable.

Are third-party libraries needed for QKeras to export the inference model to Catapult? No. When you run it in QKeras, what you're going to get is a set of weights and biases, and coming out of QKeras they're going to be a set of fixed-point numbers in a particular representation. These can then be moved into the design built with Catapult, or into any design, really.
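As an illustration of the "simple rounding" described earlier in the Q&A, and of what a weight looks like as a fixed-point number in a particular representation, here is a minimal Python sketch. It assumes a signed format with `width` total bits and `int_bits` integer bits (sign bit included), in the style of `ac_fixed<W,I>`, and it assumes round-to-nearest with saturation. Note that `ac_fixed` itself defaults to truncation and wrap-around unless rounding and saturation modes are selected, so these choices, like the function names, are assumptions of the sketch and not a statement of how the design in the talk was configured.

```python
import math


def to_code(x: float, width: int, int_bits: int) -> int:
    """Return the integer code for x in a signed fixed-point format.

    width    : total number of bits (sign bit included)
    int_bits : number of integer bits (sign bit included)
    Rounding is to nearest (ties rounded up); out-of-range values saturate.
    """
    frac_bits = width - int_bits
    scaled = x * (2.0 ** frac_bits)
    code = math.floor(scaled + 0.5)          # round to nearest, ties up
    lo = -(1 << (width - 1))                 # most negative code
    hi = (1 << (width - 1)) - 1              # most positive code
    return max(lo, min(hi, code))            # saturate instead of wrapping


def quantize(x: float, width: int, int_bits: int) -> float:
    """Return the real value that the fixed-point code represents."""
    frac_bits = width - int_bits
    return to_code(x, width, int_bits) / (2.0 ** frac_bits)


if __name__ == "__main__":
    # One fractional bit: values land on multiples of 0.5.
    for x in (0.2, 0.3, 0.8):
        print(x, "->", quantize(x, width=4, int_bits=3))
```

With one fractional bit, 0.2 rounds to 0.0, 0.3 rounds to 0.5, and 0.8 rounds to 1.0, which is the below-a-quarter, quarter-to-three-quarters, and over-three-quarters behavior described in the answer. The integer codes returned by `to_code` are the kind of values that could be written into a weight table for the hardware.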
QKeras was originally built to support the TensorFlow accelerator, so Google's Edge TPU is where it was originally targeted, and because the benefits are pretty significant, it works well for the Catapult flow as well. So yeah, there are no third-party libraries involved.

For the smallest solution, how many hardware MACs were implemented? That was four bits, so we were delivering eight data values, and there were nine multipliers per data value, so 72 multipliers. But these are just 4-bit multipliers, so they're relatively small, 4 bits in and 8 bits out, and so it ends up being fairly efficient.

If we use Catapult, can we have access to the C code used in your example? Yeah, I'm actually happy to pass out the C code for the example. We'll probably put that on a GitHub site somewhere or make it part of the examples that we've got within the product.

Let's see. Michael Scott asks: if the target is an FPGA, does the PPA flow still work quickly, given that the number of resources is fixed but the timing depends on the routing congestion? For FPGAs, you're going to be using the back-end flow of the FPGA vendor, so the PPA flow that we've described is not going to be appropriate. There you're going to have to run through the vendor's tools.

Matthew Kim observes that TensorFlow also supports quantization-aware training. When I first took this on, I looked at trying to use TensorFlow quantization-aware training and wasn't able to get it to work. With QKeras, I did have access to Claudionor Coelho, and he was able to get me over a couple of hiccups. So I have not directly used the TensorFlow quantization-aware training, but it's good to know that they've got it. I'll look into using that again; it might have improved since the last time I tried it.
Nordine asks: did you perform a comparison between what's generated by Catapult and other IPs such as NVDLA? We haven't done that yet; we're working on it. It's tough to get a direct and meaningful comparison because one system is built for high configurability and the other is sort of preconfigured, so finding the right points is an interesting dance. But yes, we're working on a comparison. We don't have it ready now.

OK, I think I've exhausted all the questions, and we're a little bit over time. I'd like to thank everybody for showing up, and I hope you found this useful.

Yes, thank you so much, Russ and Ajay, for that. We'll leave the room open just a couple more minutes to give you a chance to grab those slides if you'd like. And there we go. Thank you so much for coming today. Wishing you all a wonderful day or evening, and we'll catch you later, maybe at our virtual seminar. Thank you. All right, I'll close the room now. Thank you everyone for coming.

Inferencing for Convolutional Neural Networks (CNNs) is notoriously compute intensive. This makes them an ideal candidate for hardware acceleration, which is faster and more power efficient than running software on general-purpose CPUs. Training and inferencing are typically done using floating-point representations of the features, weights, and biases. Using a fixed-point representation reduces the size and power of the operators in the accelerator. With a purpose-built accelerator, the size of fixed-point operators can be anything; they are not limited to 8 or 16 bits. QKeras, or quantized Keras, is a library built on TensorFlow that allows developers to specify quantized fixed-point operations for each layer. It enables training and inferencing with reduced-precision representations. This webinar will describe how to use QKeras and High-Level Synthesis to produce a bespoke quantized CNN accelerator, and will compare the accuracy, power, performance, and area of different quantizations.

**What you will Learn**

- How to determine the optimal operand sizing for a hardware accelerator deploying a neural network using QKeras
- How to determine the area, performance, and energy of a neural network accelerator
- How to compare software performance against hardware accelerated performance, and make informed trade-off decisions

**Who Should Attend**

- Developers of neural networks that will be deployed on the edge or in other contexts where low power and efficiency are required in addition to high performance.