Hello everyone, and welcome to today's webinar, Edge Machine Learning Accelerator SoC Design Using Catapult HLS. My name is Matilda Karsenti and I am a marketing programs manager here at Siemens EDA. Before we start, I'd like to go over a few items regarding the interface. At the bottom of your browser, in the center, you'll see icons in a widget dock. These are the interactive tools you will use throughout the webinar. You can minimize, maximize, and scale each widget to customize your viewing experience. I'd like to call your attention to the Q&A widget, which should be open and is located on the left side of your screen. You are encouraged to submit your questions to the presenter throughout the webinar using this widget. Second is the resources widget, located at the right side of the widget dock. There you will find a PDF slide deck as well as other useful resources and links.
Our presenter today is Kartik Prabhu. Kartik is a PhD student in electrical engineering at Stanford University, advised by Professor Priyanka Raina. He received his master's degree in electrical engineering from Stanford University in 2021 and his Bachelor of Science degree in computer engineering from Georgia Tech in 2018. His research focuses on designing energy-efficient hardware accelerators for machine learning. Now I will hand it over to Kartik for today's presentation.
All right, thank you all for joining. I'm really excited to be here. My name is Kartik, I'm a PhD student at Stanford advised by Professor Priyanka Raina, and today I'll be talking to you about our experience using Catapult HLS to design our edge machine learning accelerator. If you're interested in learning more about our work, please check out our VLSI 2021 publication, detailed below. First I'll start with some motivation and background for this work, then I'll give you an overview of our system, and then I'll talk about how we went about designing the accelerator.
So I'll talk about our design space exploration, then how we implemented this design in HLS, and then verification and how we integrated it into our SoC.
First, let's talk about why machine learning at the edge is important. There are several benefits. First, user data can be kept on device and private. Second, applications don't need to communicate with the cloud and will therefore be responsive. Third, there's a lot of user-specific data at the edge that can be used to improve the application. And finally, processing at the edge consumes less power than sending data over to the cloud. However, there are also several challenges associated with machine learning at the edge. First, we need to be able to perform not only inference but also training at the edge. Second, we have a limited power budget, so there's a great need for energy efficiency. Third, the device needs to function at a high throughput. And finally, DNN model sizes can be very large and require a lot of memory.
For these reasons, machine learning at the edge is challenging with today's memory technologies like SRAM, DRAM, and flash. Resistive RAM, or RRAM, offers a possible solution, and in this work we show the benefits of on-chip nonvolatile RRAM. RRAM is a voltage-controlled device that can store zeros and ones by changing its resistance state, unlike traditional memories, which store information as charge. This memory technology has several benefits. First, it's nonvolatile, which allows for aggressive power gating, and this is really important for edge systems where inferences are infrequent. Next, it has a high density, since only one transistor is required per cell. This is unlike an SRAM cell, which requires six transistors. It also has a low read energy, around 8 picojoules per byte, which is several times lower than DRAM, and a high bandwidth. And unlike these other traditional memories, RRAM has been demonstrated to scale down to the two-nanometer node. However, it also presents some challenges.
RRAM has a high write energy, around 40.3 nanojoules per byte, and a high write latency, around 0.4 microseconds per byte. In addition, it has low endurance, which means that cells can only endure around 10K to 100K writes before permanent cell failure. These limitations pose a challenge to memory-intensive applications with many writes, for example, DNN training.
So now I'll give an overview of our system, including the key contributions and the results of our work. Our chip is the first non-volatile DNN chip for edge AI inference and training. The key contributions of this work are, first, an RRAM-optimized machine learning SoC with two megabytes of on-chip RRAM and a DNN accelerator. Next, in order to scale up to run large DNNs, we demonstrate Illusion, which is a scalable multi-chip approach where we partition the neural network onto different chips, and in doing so we're able to provide the illusion of a single chip with an abundant amount of on-chip memory while demonstrating minimal energy and latency overheads. And finally, we demonstrate the first on-device RRAM training using a special AI training algorithm called low-rank training.
So, going down into our SoC. On our SoC we have two megabytes of RRAM, and we use this to store the DNN weights and CPU instructions. We also have 512 kilobytes of SRAM, and we use this to store the activations for the DNN. We have our DNN accelerator, and this is the block that we've designed with Catapult HLS. We have a 64-bit RISC-V CPU. And we also have two 16-bit bidirectional chip-to-chip links, which we use for inter-chip communication. Because all the weights and CPU instructions are stored in RRAM, we have no off-chip DRAM accesses.
For a single chip, here are the actual measured silicon results. On ImageNet, for ResNet-18, we measure an energy of 8.1 millijoules per image with a latency of 60 milliseconds per image. And we achieve an average power of 136 milliwatts.
The efficiency is 2.2 TOPS per watt. This figure here shows us the breakdown of power between the different components in our system, and you can see that roughly 70% of the power is consumed by our DNN accelerator. We show that our system consumes orders of magnitude less power compared to a CPU or GPU.
Now, moving on to the multi-chip situation, we actually demonstrate two different mappings, or ways to partition this neural network onto the different chips. We show that with mapping one, we only incur a 4% execution time overhead and a 5% energy overhead with respect to the ideal case, which is a single chip that has all DNN weights stored on chip. Similarly, for the second mapping, we're able to operate 20% faster but require 35% higher energy. And our training algorithm, low-rank training, reduces EDP, which is the energy-delay product, by 340x, and the number of weight updates by 101x. In doing so, we still maintain the exact same accuracy as conventional SGD.
We fabricated the system in 40-nanometer CMOS technology, and we have a total die area of around 29 square millimeters. On the left you can see our die shot annotated with the different blocks in our system, and on the right is our packaged chip.
So now on to the main part, which is how we designed this accelerator. First, let's start with understanding the problem. The neural network that we target in this work is ResNet-18, which is a convolutional neural net that consists of 18 layers, and like the name suggests, a convolutional neural network is composed of convolutions. You can see on the left here that a convolution operation for a batch size of one is composed of six nested for loops. And these loops can be reordered, tiled, and unrolled.
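As a rough illustration of the six nested loops mentioned here, a direct convolution for a batch size of one could be written as follows. This is a minimal sketch, not the accelerator's code: the `ConvShape` struct, the function name, the flat tensor layouts, and the absence of padding are all assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layer shape; the field names are illustrative only.
struct ConvShape {
    int in_channels, out_channels;
    int out_height, out_width;
    int filter_height, filter_width;
    int stride;
};

// Direct convolution for a batch size of one, without padding.
// input:   [in_channels][in_height][in_width]
// weights: [out_channels][in_channels][filter_height][filter_width]
// returns: [out_channels][out_height][out_width]
std::vector<int32_t> conv2d_reference(const ConvShape& s,
                                      const std::vector<int8_t>& input,
                                      const std::vector<int8_t>& weights) {
    // Input size implied by the output size, filter size, and stride.
    const int in_height = (s.out_height - 1) * s.stride + s.filter_height;
    const int in_width  = (s.out_width - 1) * s.stride + s.filter_width;
    std::vector<int32_t> output(
        static_cast<std::size_t>(s.out_channels) * s.out_height * s.out_width, 0);

    for (int oc = 0; oc < s.out_channels; ++oc)                // output channels
      for (int oy = 0; oy < s.out_height; ++oy)                // output rows
        for (int ox = 0; ox < s.out_width; ++ox)               // output columns
          for (int ic = 0; ic < s.in_channels; ++ic)           // input channels
            for (int fy = 0; fy < s.filter_height; ++fy)       // filter rows
              for (int fx = 0; fx < s.filter_width; ++fx) {    // filter columns
                  const int iy = oy * s.stride + fy;
                  const int ix = ox * s.stride + fx;
                  const int32_t x = input[(ic * in_height + iy) * in_width + ix];
                  const int32_t w = weights[((oc * s.in_channels + ic) * s.filter_height + fy)
                                            * s.filter_width + fx];
                  output[(oc * s.out_height + oy) * s.out_width + ox] += x * w;
              }
    return output;
}
```

Reordering these loops, splitting them into tiles, or unrolling some of them changes the hardware that results, but not the values that are computed.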
Unrolling means that you perform the loop iterations in parallel in hardware, and all of this can be done in many different ways. These form a huge number of logically equivalent implementations with different energy efficiencies. In addition, the memory hierarchy presents another tradeoff here, as large memories allow for larger tile storage, which means that you can increase data reuse, but this results in an increased access cost.
So now, on the hardware side, how should we go about accelerating this computation? What we can do is use systolic arrays. Systolic arrays are a very common hardware architecture for parallelization. Essentially, in a systolic array we have an array of processing elements, or PEs, that compute in lockstep. And systolic arrays exploit a lot of data reuse, which makes them a good architecture to use here. So let's see how this actually works. Suppose we have weights stored inside of each PE, and we stream the inputs into the array. Notice how the inputs here are skewed, which is necessary in order to maintain the lockstep nature of the systolic array. Essentially, what happens is each PE multiplies its incoming input with the weight stored inside in order to produce a partial sum that then gets sent down, and it continues pushing these inputs over to the right. As the partial sums flow down through the array, they continue being accumulated on, until finally you've produced a final output here.
So now on to the design space exploration part. We actually have a very large design space here, and there are a few different axes that we need to optimize over. First is resource allocation, and this refers to how many compute elements we should have in our systolic array, how many levels of memory hierarchy we should have, and how big all the buffers in our memory hierarchy should be. Then, on the dataflow side, what kind of dataflow are we going to have through our systolic array?
Is it going to be weight stationary, output stationary, or row stationary? And finally, how are we going to be tiling these different operations? Essentially, how are we going to exploit the different kinds of reuse that we have?
To explore this very large design space we used Interstellar, which is an open-source tool that performs a pruned search over the design space. Interstellar basically requires three inputs. First, it takes in the neural network specification. This specification describes all the different layers in your neural network, along with the sizes of the different tensors. It also takes in a design space specification, which is essentially a range of different memory sizes and hierarchies that the tool should explore. And finally, it takes in the component energy models, which are essentially, for the given technology that you're targeting, the cost of a MAC, the cost of a register file access, and the cost of the memories. With these three things, Interstellar then tells us, first, the optimal architectural parameters that we need, for example, the number of MACs that we should have in our array, our memory sizes, and so on. It also tells us the optimal dataflow that we should be using.
So we performed this design space search, and these are the results. What we saw is that we need a 16-by-16 systolic array, and each PE within the systolic array should have a nine-wide vector MAC. Along with this, for the buffers, we need a 16-kilobyte input buffer, no weight buffer, and a 32-kilobyte accumulation buffer. Then, on the dataflow side, we need to unroll the filter computation inside of each PE, and we need to unroll the input channels vertically along the systolic array and the output channels horizontally along the systolic array. So essentially, this determines what each PE in the array will be doing.
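The mapping described here (filter taps unrolled inside each PE, input channels along the rows, output channels along the columns) can be sketched as a purely functional model. This is an assumption-laden illustration rather than the actual design: it ignores timing and input skewing entirely, and the names `Window`, `pe_step`, and `array_pass` are made up for the example.

```cpp
#include <array>
#include <cstdint>

constexpr int kRows = 16;  // input channels, unrolled vertically
constexpr int kCols = 16;  // output channels, unrolled horizontally
constexpr int kTaps = 9;   // 3x3 filter taps, unrolled inside each PE

using Window = std::array<int8_t, kTaps>;

// One PE: a nine-wide vector MAC added to the partial sum arriving from above.
inline int32_t pe_step(const Window& x, const Window& w, int32_t psum_in) {
    int32_t acc = psum_in;
    for (int t = 0; t < kTaps; ++t) {
        acc += int32_t(x[t]) * int32_t(w[t]);
    }
    return acc;
}

// Functional (not cycle-accurate) model of one pass through the array.
// windows[r]    : 3x3 input window taken from input channel r
// weights[r][c] : 3x3 filter connecting input channel r to output channel c
// result[c]     : partial sum for one output pixel of output channel c
std::array<int32_t, kCols> array_pass(
        const std::array<Window, kRows>& windows,
        const std::array<std::array<Window, kCols>, kRows>& weights) {
    std::array<int32_t, kCols> psum{};  // zero-initialized
    for (int r = 0; r < kRows; ++r) {        // partial sums flow down the rows
        for (int c = 0; c < kCols; ++c) {    // every column works in parallel in hardware
            psum[c] = pe_step(windows[r], weights[r][c], psum[c]);
        }
    }
    return psum;
}
```

Layers with more than 16 input channels would need several such passes, with the results added together in an accumulation buffer.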
Each PE will be performing a nine-MAC reduction and producing one single output pixel. Each PE in a given column will be operating on a different input channel, and along the rows, each PE in a row will be using a different output channel.
So now that we have that, here's a high-level block diagram that we can draw for our accelerator. First we're going to need an input buffer, which will consist of address generators and an SRAM. These address generators are used to index into the large 512-kilobyte SRAM in order to fetch a tile of activations. These activations are then stored into the input double buffer and reused several times. The use of the double buffer here basically allows us to overlap the computation time and the I/O time, and as a result, we don't stall the system. Next, these activations from the input buffer are passed into the second part, which is the systolic array. In the systolic array, our PEs use the weights that have been fetched from RRAM to compute partial sums. We store these partial sums into an accumulation buffer and accumulate on them several times until we have a final output activation ready. These output activations are then passed into a final block that implements the post-convolution operations. This will be things like an activation function such as ReLU, and for some layers it'll be identity, and for some layers it'll be max pooling or average pooling. And finally, these outputs are sent out to be stored back into the large 512-kilobyte SRAM.
We designed this accelerator in C++ and we were able to have it functional in just two to three months. This was really big because it was much faster than writing RTL. So you might think, why should we even use Catapult here, right? The datapath logic isn't really that complex, and it could easily be expressed in RTL.
However, writing the control logic in RTL is really complicated and time consuming, especially when you consider that all the blocks within the accelerator consist of numerous deeply nested for loops. The nice thing about HLS is that it handles all these low-level details for you, and it can easily generate this control logic. In addition, we made heavy use of C++ templates. We parameterized the data types, the memory sizes, and the systolic array dimensions. This really shows another strength of HLS: it's very easy to make design changes. For example, in our implementation, we noticed that if we slightly increased the buffer sizes in order to accommodate slightly larger tiles, we could reduce the latency significantly, and this wouldn't greatly increase our memory access energy. We were actually able to implement this change and get a huge performance benefit with only a single line of code.
Even late in the design cycle, we can make these changes pretty easily. For example, we noticed issues in our implementation when we were all the way into physical design. In our initial design for the systolic array, we had broadcast the weights to all the PEs, and this actually resulted in a lot of wires, and our place-and-route tool struggled with congestion and meeting timing. So we were able to make a relatively complex change such that we were basically streaming the weights into the systolic array using a shift register. We were able to make this change in just a few days and have it completely verified. So this really shows one of the great benefits, which is how easy it is to make these design changes. And since we've parameterized everything, this also means that we can reuse a lot of the pieces that we wrote for our accelerator in different designs in the future.
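To make the template parameterization concrete, here is a minimal sketch of how data types, array dimensions, and buffer sizes might be gathered into one configuration type. Every name in it (`AcceleratorConfig`, `BaselineConfig`, `LargerTileConfig`, `mac`) is hypothetical, and the larger buffer sizes are arbitrary example values. A real Catapult design would more likely use bit-accurate types such as the Algorithmic C `ac_int` types instead of `int32_t`; plain integers are used here only to keep the example self-contained.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical compile-time description of the accelerator.
template <typename InputT, typename WeightT, typename AccumT,
          int Rows, int Cols,
          std::size_t InputBufferBytes, std::size_t AccumBufferBytes>
struct AcceleratorConfig {
    using input_type  = InputT;
    using weight_type = WeightT;
    using accum_type  = AccumT;
    static constexpr int rows = Rows;
    static constexpr int cols = Cols;
    static constexpr int num_pes = Rows * Cols;
    static constexpr std::size_t input_buffer_entries = InputBufferBytes / sizeof(InputT);
    static constexpr std::size_t accum_buffer_entries = AccumBufferBytes / sizeof(AccumT);
};

// The sizes reported by the design space search in the talk.
using BaselineConfig =
    AcceleratorConfig<int8_t, int8_t, int32_t, 16, 16, 16 * 1024, 32 * 1024>;

// Trying larger tiles is a one-line change (these sizes are made up).
using LargerTileConfig =
    AcceleratorConfig<int8_t, int8_t, int32_t, 16, 16, 32 * 1024, 64 * 1024>;

// A multiply-accumulate written once against the configuration's types.
template <typename Config>
typename Config::accum_type mac(typename Config::input_type x,
                                typename Config::weight_type w,
                                typename Config::accum_type acc) {
    using Acc = typename Config::accum_type;
    return acc + static_cast<Acc>(x) * static_cast<Acc>(w);
}
```

Because every block is written against `Config`, swapping `BaselineConfig` for `LargerTileConfig` at the top level resizes the buffers everywhere at once.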
So now, moving on to configurability. There are a lot of different types of layers in ResNet-18, so our accelerator needs to be flexible enough to run them all. For example, we have three-by-three layers, which are the most common in ResNet-18, but we also have seven-by-seven layers and one-by-one convolutions. In order to implement this configurability to handle these different types of layers, we define a sort of ISA here. Essentially, the instructions that we have carry information on the tensor dimensions, the blocking scheme, the memory addresses for each of the different tensors, and the post-convolution operations: are we performing a ReLU, are we performing max pooling or average pooling, or any of these other operations? In C++ it's pretty simple to implement this. We just use a struct, so in C++ land, whenever we're trying to run our accelerator, we just pass in the struct. In hardware, this basically synthesizes to registers that we can then write to. What we do on our SoC is memory map these registers, which allows us to set them directly from the CPU. So it's actually very convenient to be able to control this accelerator from the CPU on our SoC.
In terms of verification, we implemented a very tight verification loop that closely integrates both our software and our hardware. We started with a ResNet-18 model in PyTorch, and we trained and quantized it in PyTorch. For each of the different layers, we take the input, weight, and output tensors and dump these NumPy arrays to a file. Then we have a C++ testbench that just reads in these NumPy arrays and writes them into memory. We then implement each of the layers in our ISA, and we can run through the entire end-to-end ResNet-18 inference.
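The instruction struct described above might look something like the following. This is only a sketch: the talk names the kinds of information carried (tensor dimensions, blocking scheme, memory addresses, post-convolution operation), but the field names, field widths, and the `output_dim` helper are assumptions for illustration. How the struct becomes memory-mapped registers is tool and SoC specific and is not shown.

```cpp
#include <cstdint>

// Post-convolution operations mentioned in the talk.
enum class PostOp : uint8_t { Identity, Relu, MaxPool, AvgPool };

// Hypothetical per-layer instruction; all field names and widths are assumed.
struct LayerInstruction {
    // Tensor dimensions
    uint16_t in_channels;
    uint16_t out_channels;
    uint16_t in_height;
    uint16_t in_width;
    uint8_t  filter_size;   // 1, 3, or 7 for the layer types mentioned
    uint8_t  stride;
    uint8_t  padding;
    // Blocking (tiling) scheme
    uint16_t tile_height;
    uint16_t tile_width;
    // Memory addresses of the tensors
    uint32_t input_addr;
    uint32_t weight_addr;
    uint32_t output_addr;
    // Post-convolution operation
    PostOp   post_op;
};

// Standard convolution output-size formula, applied to one spatial dimension.
inline uint16_t output_dim(uint16_t in_dim, uint8_t filter, uint8_t stride, uint8_t padding) {
    return static_cast<uint16_t>((in_dim + 2 * padding - filter) / stride + 1);
}
```

As a usage example, a 7-by-7 layer with stride 2 and padding 3 on a 224-wide input gives `output_dim(224, 7, 2, 3) == 112`, while a 3-by-3 layer with stride 1 and padding 1 leaves a 56-wide input at 56.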
And the really nice thing about this is that all the RTL verification infrastructure that we need is automatically generated for us by Catapult, and we don't really need to write any additional testbenches. So really there's just one C++ testbench that we write, and Catapult takes care of creating an RTL version for us.
Now, about debugging. Debugging in C++ was enough for us to catch pretty much every single bug that we had in our accelerator. This was really good because it's way easier to debug at this higher level of abstraction. Debugging in C++ using GDB is way faster and easier than trying to debug RTL, especially when you're trying to parse through waveforms, which takes up a lot of time. And given the additional complexity of thinking in terms of different clock cycles, it's much easier to debug in C++. In debugging, most of the time is spent localizing bugs, so it's very useful to do all that you can to help localize them. So essentially what we do is that, in addition to our PyTorch model, we have a simple gold model that performs a very simple convolution based only on the parameters that we pass in. Given an instruction, it performs a convolution. By doing this, we're able to check that we correctly mapped each of the layers from our PyTorch model of ResNet-18 to the ISA for our accelerator. It seems kind of simple, but there were a lot of times when we made some minor mistake here that ended up being very difficult to debug, because it turned out the bug wasn't in the design; it was actually in how we mapped the layer over to the ISA. Once we've established that we've correctly mapped our layers, we then compare the gold model with our C++ design. If these don't match, this basically tells us that we have an actual design bug, and we can further narrow this down with different unit tests to help us localize which block the error occurs in.
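The staged comparison described above (framework output against the gold model, then gold model against the C++ design) can be sketched as a small helper. This is an illustrative assumption about how such a check might be organized, not the team's testbench; `first_mismatch`, `Fault`, and `localize` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Index of the first differing element, or -1 if the tensors are identical.
// A length difference counts as a mismatch at the end of the shorter tensor.
long first_mismatch(const std::vector<int32_t>& expected,
                    const std::vector<int32_t>& actual) {
    const std::size_t n = std::min(expected.size(), actual.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (expected[i] != actual[i]) return static_cast<long>(i);
    }
    if (expected.size() != actual.size()) return static_cast<long>(n);
    return -1;
}

enum class Fault { None, IsaMapping, Design };

// Compare the stages in order so that a failure points at one cause:
//   framework vs. gold model  -> the layer was mapped to the ISA incorrectly
//   gold model vs. C++ design -> the design itself has a bug
Fault localize(const std::vector<int32_t>& framework_out,
               const std::vector<int32_t>& gold_out,
               const std::vector<int32_t>& design_out) {
    if (first_mismatch(framework_out, gold_out) >= 0) return Fault::IsaMapping;
    if (first_mismatch(gold_out, design_out) >= 0) return Fault::Design;
    return Fault::None;
}
```

Returning the index of the first mismatch, rather than just pass or fail, helps narrow a design bug down to a particular tile or output channel.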
And finally, we check the C++ design against the synthesized RTL design, and once again, all the verification infrastructure for comparing the RTL to C++ comes from Catapult. It's pretty rare that these don't match, but the times that they don't match basically suggest that there's some sort of C++ bug, for example, a bad memory access. The thing with this is that sometimes it will work in our C++ simulation, even though it's technically undefined behavior. We only see these types of bugs pop up when we synthesize the RTL and compare it. But there are actually tools now in Catapult that'll help you find any of these bad memory accesses, with the Design Checker, I think it's called.
So now, going over to simulation. This is also something that's really great. If you were to perform an RTL simulation for ResNet-18, and we parallelized across every single layer, meaning we run all the layers of ResNet in parallel, it still takes us about one hour to finish. And because we parallelize across all the layers, we're still going to consume 18 cores, and we're also going to need 18 licenses. Meanwhile, on the C++ simulation side, it only takes us 10 seconds to run the entire ResNet-18 inference, and this is with no parallelization needed. So this is a really big boost, and being able to verify your design within seconds is really important to being able to iterate quickly. This was really nice for us because we could verify the accuracy of our quantized model in a very reasonable amount of time. We can basically take the ImageNet dataset, push it through our accelerator, and check that we still get a high accuracy on this dataset. We could also tweak our design here in order to improve the accuracy, so we could see what the effect of using different bit widths was.
For example, for accumulation, we could see that we didn't really need a lot of bits, and we were able to get by and maintain our accuracy with fewer bits. So it was really big for us to be able to do this. This really shows one of the huge benefits of having this fast simulation time, because when you're simulating at the C++ level it's orders of magnitude faster than trying to do an RTL simulation. And like I mentioned before, we also demonstrate training, and for training it's important for us to run many samples through our accelerator. So we need some way to push many samples through and see the accuracy over time, and trying to do this in just RTL would have been pretty much impossible given the amount of time it takes. So this was really a big deal.
And then integration. We designed the rest of the SoC using Chipyard, which is an open-source framework for generating RISC-V SoCs. Catapult has a nice way to integrate into other designs with its AXI interface generation. Basically, all this means is that your accelerator block exposes an AXI interface, so you just connect up these AXI channels to the rest of your SoC. Catapult will generate all the additional hardware that's required for interfacing your HLS block with the rest of the SoC. So it was really nice that we could have an easy integration.
To summarize, this was our design and verification flow. Given a C++ design and our C++ testbench, we were able to quickly verify that our design was correct. Once we knew we had a functional, working design, we could push it through Catapult HLS and produce RTL. This RTL was then verified with Synopsys VCS and Catapult, like I mentioned.
Catapult generates all the infrastructure needed to verify the RTL as well, so there was no additional work needed there. Then, with this RTL, we can push it through logic synthesis with DC to get a netlist. This netlist can then be pushed through place and route with Innovus, and finally we get a netlist and parasitics. Along with the switching activity from our verification, we can then put it through our power analysis tool, for which we use PrimePower, and basically understand what the power consumption of our design is for each of the different layers in ResNet. The really nice thing was that this flow was totally push-button after we got it set up. Pretty much within 12 hours we could figure out how a design change affected the overall power for each of the different layers. So that was something really cool that we could do.
To summarize, here's how we ended up with our final accelerator design. In the end we used 8-bit weights and 8-bit inputs, and we had an accumulation precision of 18 bits. Like you see here, we had a 16-by-16 systolic array where inside of each PE we were performing this nine-MAC reduction. The other thing to note here is that using this dataflow also allowed us to exploit the reuse that's inherent as you're striding over an image. With these stride registers on the left side, we were able to cut down the number of accesses to our input buffer. So this is the final design that we ended up with.
So now, talking a little bit about the challenges. Although it was relatively quick for us to have a functional design, there was actually a significant learning curve to getting a better design. It's relatively easy to get started with HLS, but it takes quite a lot of time to understand what the tool does.
And this is really important for you to be able to get good performance out of the tool and have a very high-performance design. You essentially need to understand how the tool has synthesized your design and how the C++ code that you've written corresponds to the actual hardware. This is really critical, because otherwise you might synthesize something and then see that you're not getting the performance you want, or it's taking up a lot of area, and you have no idea why. So it's really important that you understand what the tool does. It's become easier recently with the Design Analyzer from Catapult, which helps you better understand how your C++ code is actually being turned into hardware, but this is something that just takes some time to learn.
So we continued to improve our design after tapeout, and in particular we tried to understand how to better optimize the performance. We improved our memory interfaces, and we were able to reduce the area and power of our accelerator. The other thing that we did was increase our dataflow flexibility. For example, I had mentioned before that the PEs in our accelerator perform a nine-MAC reduction. This is really optimized for a three-by-three filter, which is the majority case in ResNet-18. But when you look at ResNet-50, for example, you have a lot of one-by-one convolutions, and in this case our nine-MAC PE doesn't really work well. So in order to get around this we had to increase the flexibility in our dataflow, and essentially what we did is that inside of each PE we also chose to unroll the channels. This flexibility actually needed a lot of control logic that would have been really tricky to do in RTL, but in HLS it wasn't too difficult. And these changes gave us significant improvements.
We basically reduced our energy from 8 millijoules down to 1.8 millijoules, significantly cut down our latency from 60 milliseconds to 15, and also reduced our power. This took a few more months of better understanding the tool and understanding how to debug the performance and optimize it.
So compared to the state of the art, how well does our new optimized design do? We compare it to Simba, which is in 16 nanometer, while our design is in 40 nanometer, and we see that our core energy for ResNet-50 inference is about three times less than Simba's. For single-chip inference, we're within around 20% of the latency, and we also have a pretty comparable multi-chip throughput. This is when we've scaled for frequency and MACs in order to have a fair comparison between the designs. So this shows that with optimizations we did pretty well, and we're pretty comparable to the state of the art.
So, to conclude: would we use Catapult again? The answer is yes, we will, and the reason is that we have a very high design productivity. It's especially easy to make design changes, your verification time is greatly reduced, and debugging is a lot easier. And finally, you have very fast simulation times. With all of this, the key benefit of HLS is really that you can iterate really fast, try new designs, and see how those affect your power and your latency, and it's very easy to keep on iterating and getting a better design. This is really the key reason why we would continue using HLS: to have better design productivity and be able to make changes quickly. So that concludes our presentation, and I'm happy to take any questions you might have.
Awesome, thank you so much for that presentation, Kartik. In a moment we'll have Kartik as well as Professor Priyanka Raina and Sandeep Garg on the line to answer your questions.
Quite a few have come in, so we'll try to get through as many as we can, and do continue to submit any additional ones via the Q&A widget. We also have a few poll questions and would appreciate your feedback on the presentation and future topics.
The first question we have is: what changes would you like to have in Catapult HLS for getting better designs or easier optimization?
I think it's pretty easy right now to try different types of designs and understand how the changes that you make to your design architecture impact your overall latency and your power. So I think that part is really nice, but what's really crucial is to have a better understanding of how the tool is actually working through your design and implementing it in hardware. I would say that's probably the main thing, and the Design Analyzer is a great step towards helping you understand better. But I think a little bit more help for the user in understanding the results of the tool would be really helpful.
OK, so next: any compromises on using saturation and rounding for the data types?
Yeah, that's a great question. If you look at the distribution of weights for a neural network, it's typically like a Gaussian distribution. Because of that, you don't really need the entire range of your bit width. For example, we have 8-bit inputs and weights, and we don't need a very high precision to represent the result of this computation, so we were able to get by with 18 bits for our accumulation. We arrived at that number by essentially trying to see what the lowest precision is at which we can operate without losing too much accuracy. Basically, we could model the different accumulation precisions and see how that changes our overall accuracy.
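One way to model different accumulation precisions in a C++ simulation is a saturating accumulator whose width is a template parameter. The sketch below is an assumption about how such an experiment could be set up, not the design's actual accumulator; `SatAccumulator` is a hypothetical name and the saturation counter is added purely for illustration. For context, a product of two signed 8-bit values fits in 16 bits, so summing N products with no possibility of overflow needs roughly 16 + ceil(log2 N) bits; a narrower accumulator relies on typical values being much smaller than the worst case.

```cpp
#include <cstdint>

// Signed accumulator that clamps to the range of a Bits-wide integer.
template <int Bits>
struct SatAccumulator {
    static_assert(Bits >= 2 && Bits <= 32, "width must be between 2 and 32 bits");
    static constexpr int64_t kMax = (int64_t{1} << (Bits - 1)) - 1;
    static constexpr int64_t kMin = -(int64_t{1} << (Bits - 1));

    int64_t value = 0;
    int saturations = 0;  // how many times the result had to be clamped

    // Add one 8-bit by 8-bit product, clamping on overflow.
    void mac(int8_t x, int8_t w) {
        int64_t next = value + int64_t{x} * int64_t{w};
        if (next > kMax) {
            next = kMax;
            ++saturations;
        } else if (next < kMin) {
            next = kMin;
            ++saturations;
        }
        value = next;
    }
};
```

Running the same quantized layer through `SatAccumulator<18>` and a wider reference such as `SatAccumulator<32>`, then comparing the final accuracy, gives a rough picture of how many bits the accumulation really needs.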
Ideally, we would like to reduce that precision as much as possible, so we're not wasting any area and power on unnecessary precision. So that was something that was very convenient for us to do.
Next question: regarding the entire design flow, what step took the most time?
The most time was really spent in the place-and-route tool. Across the entire flow, there was probably only around 30 minutes for HLS to run. One thing that I generally found was that if your design takes over 30 to 40 minutes, you've probably done something wrong in your design, or it's inferring some unnecessary hardware. This tells you that you need to go back to your code and see what you're doing wrong. Generally, HLS doesn't take that long if you write things correctly. But yeah, in our case it took around 30 minutes for just the Catapult part, and the majority of the time was really spent in place and route. I think that was probably around 6 hours for the entire design.
OK, and then another question is: how is the C++ testbench reused for RTL verification?
Essentially, you write the C++ testbench, and it's pretty straightforward in your C++ verification. But when you go over into RTL, Catapult generates this verification infrastructure that compares the outputs from your C++ version to your RTL version, and in doing this you can see exactly whether your design matches up. So if you see that your C++ verification passes, and you see that your outputs from C++ match the outputs from RTL, you know that your RTL design also works correctly.
OK, another question regarding data types: did we use floating point, or was the design done entirely in fixed point?
We quantized this design in PyTorch, and we basically just use integer 8 compute for everything. That's part of quantization.
You have certain scale factors that you apply at the end to your outputs, so everything's basically done in integer 8. So other question was did you use a standard protocol for the CPU accelerator communication? Where did you design a custom one? So we used axe and catapult basically automatically generates all this interface design for you. So all you really need to write is some sort of RTL that just plugs in your AXI interfaces that you see from your accelerator with your SoC access interface. That way it's really easy to integrate your design into your associate. OK, another question. How did you build C code for the entire Resnet? Did catapult crash while compiling this large C code? And some sort of idea of how large the C code is. On so. Essentially the I think. I. Basically we have one. So it's the same C code for all the different layers of Resnet. But we have this configurability that basically allows us to try different layers and map it onto our accelerator in different ways. And combo number crashed on us because. It's not necessarily like a. It's it's essentially just like one design, right? It's not an entire RESNET layer that is hard coded in hardware. It's basically the different layers can be mapped onto this accelerator. And in terms of size. I don't remember off the top of my head exactly how many, but I would say it's probably in the thousands of lines of C++ code I would say. So other question was. The other question is asking about post HLS. I. Synthesis debug experience, either on FPGA or an ASIC so. We did have a lot of debugging. We had to do for our trip, but actually none of the the silicon debugging was on. The HLS block was actually all on the memory block, which is. On it, which is really good because you know it proved to us that the HLS tool worked correctly. And we didn't really see any issues there. OK and experience using HLS pragmas. Does catapult have user-friendly pragmas? So I would say for the most part it there is. 
There's not too many that you really need to understand. So let's say the main one is knowing how to unroll the loops and that is a pretty simple. Pretty simple program to do it and it makes a lot of sense to use it. And pipelining, and but also that's a little bit more complicated than enrolling. And it's a little bit more to understand, but I would say it's still relatively user friendly and it made sense. OK, so question regarding how did we measure the power breakdown for West Net 18. So what we did was. So I'm assuming this question is talking about the figure that we showed with the simulated power breakdown. And essentially what we did was we took our design that we had. Like all the way to the end of the VLSI flow, so we had a final gate level netlist along with parasitics. And we simulated the layers of Resnet on it and we average across all the different layers. On the power for all the different layers, in order to get this power breakdown between the different blocks here so it's it's a very accurate one, since it's like a post post layout gate level sync that we did here. And another question regarding how problematic is the low endurance? So how many training iterations can they associate with stand? So I think there's a. There's a lot here that we that I didn't cover in this presentation regarding the endurance part of this, so we were actually able to demonstrate in our training. That we were able to train a data set all the way up to in the jet accuracy. So we were able to successfully train our neural network. And we didn't really face any endurance issues there, and there's also some other specialized hardware that we have on our trip to help us with endurance. But For more information, please look at our VLSI submission that details more of the other parts of the chip as well. So another question was, how easy was it to write your design in C while allowing it to be synthesized by catapult? 
Are there C features Catapult should support to make things easier for ML, or are things good enough already?
I think what would be very useful is to have certain hardware architectures already supported through Catapult. For example, being able to just generate a systolic array through Catapult, with maybe some mechanism that allows you to specify what sort of PEs you want. Being able to specify different hardware structures would be very helpful, as something that is already supported, where perhaps you just use a library that is already available. That would be very useful to make things easier.
So another question was: do you generate a different accelerator per layer?
No, we have just one accelerator, and we have this configurability that allows us to map different layers onto this one accelerator. This is what Interstellar, the design space exploration tool we're using, does. It looks at all the different layers in our network and tells us what the best dataflow would be, and each of these layers requires a different blocking scheme. The whole point of having this configurability is to be flexible enough to support any of these blocking schemes for the different layers that we have.
So one question was: would you say that this behavior of Catapult, taking more than 30 minutes, implies that there must have been an error in the design? Or do you see that as an inconvenience with Catapult?
I think if it takes more than 30 minutes, generally it means that it's doing something that you probably don't want it to do. Ultimately this means there's an issue in your C code that for some reason is making it try to synthesize all this additional hardware that is not needed. So this basically means that you're probably not doing something very efficiently.
And you need to revisit your code and see what's basically causing this issue. So it's not really an inconvenience of Catapult, but rather that you've made some sort of bad design choice: you've written the code in a way that Catapult can't design correctly.
There are a lot of questions. I think we can go through a few more, but it's up to you, Kartik, and your time and your schedule. If you still have time, we would love to run through some more. If not, we can always respond to them via email. I'll leave it up to you.
Yeah, I can take a couple more.
Awesome. So another question here is: how did the address generators work? Did they take advantage of data locality, say by issuing burst accesses?
Yeah, so basically our address generators fetch a tile from the main SRAM that we have. With Catapult we can specify the burst size that we want, so we're able to burst and fetch, for example, an entire row of the tile at once. By doing this, we cut down on the number of AXI transactions that we have. So yes, we are able to take advantage of data locality.
OK, so another question: within a struct, can you clarify which groups are one-hot?
I think within the struct it may not necessarily be possible to do that, but depending on how you write your C code, you might be able to get the hardware that you want for this. So if you only ever do either max pool or average pool, and you write your C code in such a way that it's only doing one of those two things, that basically gives you the behavior that you want.
So other question: data type exploration versus silicon size and accuracy, how is this performed?
Because we parameterized everything, we could pretty easily change the different data types in our design. And because we could quickly simulate the actual end-to-end inference, we could get a quick understanding of what the overall accuracy would be. Similarly, because in our flow we could pretty much push-button go to the end of the VLSI flow and get area metrics, it was pretty easy for us to do this kind of exploration, where we try different data types and see the effect on area and accuracy.
So is CHIMERA open source?
We can't open source any of the technology-related details, because that has IP from the foundry and we can't disclose that. But we might be able to open source the actual accelerator, since that doesn't have any technology-specific information; in particular, we might be able to release one of the versions of our accelerator.
So did we do any UVM RTL verification using the HLS RTL? Did you hook up the HLS to your UVM environment?
We actually didn't do any UVM verification. We just used our C testbench and the Catapult-generated verification infrastructure to do all our HLS verification. And then, of course, once we plugged our accelerator into our SoC, we did some additional verification at the full-system level, but that did not include any of the C code.
So I think we should wrap up now with the questions.
OK, sounds good. We have received your questions, though, so don't worry. We'll get back to them via email if we weren't able to answer them via text or orally just now. Sorry about that; there were quite a few, I must say. So thank you so much again, Kartik, for the presentation today and for taking the time to answer all of the questions coming in from our audience. Thank you so much for joining us today. We really appreciate it. Thank you, everyone.

High-Level Synthesis (HLS) offers a fast path from specification to physical design ready RTL by enabling a very high design and verification productivity. HLS handles several lower-level implementation details, including scheduling and pipelining, which allows designers to work at a higher level of abstraction on the more important architectural details. Designing at the C++ level allows for rapid iterations, thanks to faster simulations and easier debugging, and the ability to quickly explore different architectures. HLS is a perfect match for designing DNN accelerators, given its ability to automatically generate the complex control logic that is often needed. This webinar will describe the design and verification of the systolic array-based DNN accelerator taped out by our group, the performance optimizations of the accelerator, and the integration of the accelerator into an SoC. Our accelerator achieves 2.2 TOPS/W and performs ResNet-18 inference in 60 ms and 8.1 mJ.  
