Name: Harvard University: Effective SW/HW Co-Design of Specialized ML Accelerators using Catapult HLS
Start: 2022-02-08T10:00:00.000-08:00
End: 2022-02-08T11:08:00.000-08:00

Hello everyone, and welcome to today's webinar. Harvard University is here presenting on effective software/hardware co-design of specialized ML accelerators using Catapult High-Level Synthesis. My name is Mathilde Karsenti and I am a marketing programs manager here at Siemens. Before we start, I would like to go over a few items regarding the interface. At the bottom of your browser, in the center, you'll see icons in a widget dock. These are the interactive tools you will use throughout the webinar. You can minimize, maximize and scale each widget to customize your viewing experience. I'd like to call your attention to the Q&A widget. The Q&A window should be open and is located on the left side of your screen. You're encouraged to submit questions to the presenter throughout the webinar using this widget. Your questions will be answered during the Q&A session right after the webinar. Please be aware that questions you submit will disappear, but will come back with an answer if answered by text. Second is the resource widget, located at the right side of the widget dock. There you will find the PDF slide deck as well as other resources and links for you. Right next to that widget you will see the help widget. If you are experiencing any technical issues, please click on it for assistance, and do know that simply refreshing your browser generally resolves any issue you may be experiencing. This presentation is being recorded and you will receive a link to it in a follow-up email later today. If we are not able to get to your question during the live Q&A, we will respond to it via email.
Our presenter today is Thierry Tambe from Harvard University. He is an electrical engineering PhD candidate, advised by Professor Gu-Yeon Wei and Professor David Brooks. His current research interests focus on designing energy-efficient and high-performance algorithms, hardware accelerators and systems on chip for machine learning, and natural language processing in particular.
He also bears a keen interest in agile SoC design methodologies. Prior to debuting his doctoral studies, Thierry was an engineer at Intel in Hillsboro, OR, where he designed various analog/mixed-signal architectures for high-bandwidth memory and peripheral interfaces on Xeon and Xeon Phi HPC SoCs. Thierry is a 2021 NVIDIA PhD Graduate Fellow. Now I will hand it over to Thierry for today's presentation.
Thank you so much, Mathilde, for the introduction. Hi everyone, my name is Thierry Tambe and it's a pleasure to give this presentation, entitled Effective Software/Hardware Co-Design of Specialized ML Accelerators Using HLS. This presentation was put together with my colleagues at Harvard, so I want to acknowledge them, along with my advisors, David Brooks and Gu-Yeon Wei. So let's get started.
As a result of the slowdown of Moore's law, modern systems on chip are getting more and more complex. To keep delivering gains in energy efficiency, the number of accelerators or IP blocks has been increasing steadily over the years. The picture here on the left shows a snapshot of the number of IP blocks in Apple SoCs between 2010 and 2017. And as SoCs get more and more specialized, they require longer and possibly more laborious development efforts. That begs the question of how to improve designer productivity. One way to do this is to create more portable or reusable IPs. Obviously they have to be really easy to port across process nodes and computing platforms, and it would be preferable that the IP we use is soft by nature. This means that with very little effort we can easily recalibrate the hardware to adjust to different timing and PPA margins, which is often too inflexible in the case of hard IP. Another way to improve designer productivity is to raise the level of abstraction for chip design and verification.
And so, if you can design hardware by describing it in a high-level language, then design space exploration is much quicker, and that leads to an appreciable reduction in design and verification cycles. That's exactly what High-Level Synthesis enables us to do. In HLS, hardware is described in a high-level, object-oriented programming language, namely C++ or SystemC. Actually, when we were teaching an advanced VLSI class at Harvard, many students did not have a Verilog background, but a lot of them were familiar with C++, so they were able to thrive while using HLS. For many software and compiler engineers, especially those working on the front end of the compilation stack, HLS will be an easier entry point. In fact, we have seen a growing interest in high-level domain-specific languages such as Halide, and HLS is often used in these scenarios as a back end in the DSL flow. It's also important to mention that the risk of using HLS is very low, because HLS tools have vastly improved over the years and the portfolio of successful tapeouts using HLS has grown enormously over the last few years, in academia and also in industry.
These are some of the benefits that we observe from using HLS. You get very high productivity because the hardware description and the testbench are done at a high level of abstraction, so typically you end up writing much less code compared to handcrafted RTL. You also get to do really rapid design space exploration, and we'll show some examples later on. And there is very fast verification throughput, because the majority of bugs are actually caught prior to the High-Level Synthesis execution. In our experience, we were able to tape out a research test chip in a matter of months, in great part thanks to using HLS.
So I think we can distill the overall benefits of High-Level Synthesis into four characteristics or advantages. One is modularity, and I will go into this later on. The idea here is that we can instantiate already available synthesizable hardware libraries. There are also the benefits of DSE facilitation, that is, design space exploration facilitation, verification facilitation, and an easy handshake with software. In the following slides, as I talk about how HLS is being used for optimal software/hardware co-design, I will be showing these tags to drive and emphasize the point.
A very good example of how HLS is being used to design accelerators is the object-oriented HLS design flow, which was pioneered at NVIDIA. This flow has also heavily influenced the way we design accelerators at Harvard. As I mentioned, the key trait is that the hardware is described in C++ or SystemC with a SystemC testbench. This enables very fast verification of the hardware because this is C++-based simulation, and, as we mentioned, there are a number of things that you can do to catch the overwhelming majority of hardware bugs at this level as opposed to further down the flow. HLS also facilitates reuse of well-maintained synthesizable libraries such as MatchLib. So as the hardware designer, you don't need to reinvent every single wheel, because MatchLib makes the design more modular and correct by construction. As part of MatchLib we also have these latency-insensitive channels, which guarantee functional correctness under arbitrary communication delays. HLS tools also automatically do the hardware pipelining and scheduling optimizations. My personal favorite trait is that the SystemC testbench you design to verify the SystemC hardware can be reused.
To verify the HLS-generated RTL, we don't need to create a separate testbench. And if the source code is properly designed, you can expect that the cycle-level performance of the RTL will be within 3% of the original source code. So overall, HLS promotes higher-level investigations and design space exploration.
So this is the goal of this talk. I will provide a concrete example of how we applied the object-oriented High-Level Synthesis software/hardware co-design approach in many of our chip tapeouts, and I will provide some more details about that. Along with that, we will share our experience with HLS using Catapult HLS from Siemens. We will talk about some of the challenges that we encountered, and we will provide recommendations for achieving optimal PPA. We went through a lot of trial and error to get to a good design, so we hope that those learnings will be helpful to the audience. As you know, there are a number of HLS tools out there, so the learnings may well apply to other HLS tools, but what I'm about to talk about is essentially our experience with Catapult HLS.
We will also focus on SystemC-based design. As you know, designers may code their hardware in either C++ or SystemC depending on their needs. C++ has a higher resemblance to the software model and is even more abstract compared to SystemC, so it can be a good option in case the data path is not too complex. But most of our designs are much more structured and much more complex, so the tapeouts that we have done were done using SystemC, and this will be the focus of this presentation.
Now, some examples of HLS-based design at Harvard. We've done a number of designs for many applications.
As you can see here, these include speech-to-text and natural language processing, and those have already been open sourced at these links. We have also designed accelerators similar to the GPU, convolutional neural network accelerators, and accelerators for security, sorting, mapping, and also cryptocurrency mining. And, as I mentioned earlier, we have successfully integrated HLS IPs in our tapeouts. An example is shown right here, where we integrated two IPs in this SoC. HLS allowed us to really explore and quickly iterate on the main differentiating features of the SoC, which were these HLS blocks. There are also some AXI components from MatchLib that you can instantiate to connect the IPs with the rest of the SoC, so it is actually easy to merge your HLS IP with the rest of the SoC via AXI channels.
With that said, this is the outline for the rest of my talk. I will talk about the approach we used to do software/hardware co-design and the verification flow with High-Level Synthesis, then I will spend some time sharing our actual experience, and then share some recommendations.
Let's first start with the software/hardware co-design and verification flow. In order to design hardware accelerators, we need to have a deep understanding of the software mechanics. High-Level Synthesis is a powerful medium that we can use to link these two worlds, so let me explain what I mean by that. Say that we are interested in designing a hardware accelerator for a particular application. In this example the application is speech-to-text, and we want the hardware to be very energy efficient. In the ML framework we can model this efficiency via quantization, clustering or sparsity. Then, when it comes time to design the hardware accelerator in SystemC, it is really important to save time by reusing what is already available.
Synthesizable components are the key here, and we've had great success using MatchLib and HLSLibs. MatchLib is a synthesizable hardware library. It was pioneered at NVIDIA, it is now embedded in the Catapult installation, and it contains a lot of useful hardware components that you will want to instantiate in your design. My personal favorites are the latency-insensitive channels, which allow you to connect your SystemC blocks together. It also provides various data types and other components, as well as AXI interfaces that you can use to connect the accelerator with the rest of the SoC. The full list of hardware blocks can be found at this link. We have heavily used MatchLib in our designs and it was a huge time saver.
We have also used HLSLibs, which provides a lot of synthesizable math functions. We've done some accelerators to model recurrent neural networks, and if you look into the equations of these networks you will see that they use a lot of nonlinear functions such as tanh or sigmoid. We were able to find those functions already modeled in HLSLibs. For example, there are piecewise linear functions for tanh and sigmoid that you can easily plug into your design to model these nonlinear functions. Layer normalization is also very popular in natural language processing, and one of the operations that you have to do there is the inverse square root, which can also be found in HLSLibs. So we were able to use these libraries together to design our accelerators.
Once we have finished designing the accelerators, we can start working on the testbench, and obviously we want to use actual data. We often start with random data, but then we switch to real data to check the hardware against the software.
We make sure the hardware behaves the same as the software by exporting the weights and input activations from the ML framework in NumPy format, and we instantiate a library that allows you to reuse these NumPy arrays in your testbench. In this testbench, the way to close the loop between the software modeling and the hardware implementation is by checking whether the output of the SystemC implementation matches the expectation coming from the software model. If they match, we can proceed to HLS. Otherwise, you have to go back to the SystemC implementation and fix the bugs.
Then comes HLS with Catapult, which allows us to essentially loop in this environment until we have satisfactory PPA. Some of the things that we check, obviously, are functionality, area, power, delay and throughput. You can parameterize your source code so that you can easily do design space exploration. Some of the things that we parameterize in hardware are the number of PEs, the size of the MAC, the memory size, or the precision of the data path. So, with relatively fast velocity, you can quickly iterate in this environment until you get the desired PPA. This essentially summarizes the way we apply HLS at Harvard to design hardware accelerators for machine learning, so let me provide some concrete examples.
Here are some of the positive outcomes from our designs using the software/hardware co-design flow that I just described. In this work we used HLS to design an accelerator to efficiently run attention-based sequence-to-sequence models commonly used in speech-to-text applications. This accelerator is called FlexASR. You can see here that it was integrated in this system on chip that we taped out, and the post-silicon results were quite good, as you can see here on the right.
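The parameterization and the golden-model check described in this part of the talk can be sketched in plain C++. This is only an illustrative stand-in, not the speaker's SystemC code: the macro names `NUM_LANES` and `FRAC_BITS`, the fixed-point scheme, and every function name below are assumptions made for the example, and a real flow would load the reference data exported from the ML framework instead of filling arrays by hand.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical design knobs. A Makefile could override them, for example with
// -DNUM_LANES=32 -DFRAC_BITS=6, to sweep the design space.
#ifndef NUM_LANES
#define NUM_LANES 16
#endif
#ifndef FRAC_BITS
#define FRAC_BITS 8
#endif

constexpr std::size_t kNumLanes = NUM_LANES;  // width of the MAC data path
constexpr int kFracBits = FRAC_BITS;          // fractional bits of precision

using FloatVec = std::array<double, kNumLanes>;

// Round a real value to a signed fixed-point integer with kFracBits
// fractional bits.
inline long long Quantize(double x) {
  return std::llround(x * static_cast<double>(1LL << kFracBits));
}

// Stand-in for the hardware model: a fixed-point multiply-accumulate.
inline double FixedPointDot(const FloatVec& weights,
                            const FloatVec& activations) {
  long long acc = 0;
  for (std::size_t i = 0; i < kNumLanes; ++i) {
    acc += Quantize(weights[i]) * Quantize(activations[i]);
  }
  // Each product carries 2 * kFracBits fractional bits.
  return static_cast<double>(acc) /
         static_cast<double>(1LL << (2 * kFracBits));
}

// Stand-in for the software reference produced by the ML framework.
inline double GoldenDot(const FloatVec& weights, const FloatVec& activations) {
  double acc = 0.0;
  for (std::size_t i = 0; i < kNumLanes; ++i) {
    acc += weights[i] * activations[i];
  }
  return acc;
}

// Testbench-style check: does the hardware model match the reference within
// a tolerance chosen for the selected precision?
inline bool MatchesGolden(const FloatVec& weights, const FloatVec& activations,
                          double tolerance) {
  assert(tolerance >= 0.0);
  return std::fabs(FixedPointDot(weights, activations) -
                   GoldenDot(weights, activations)) <= tolerance;
}
```

A testbench would call `MatchesGolden` on each exported input and only move on to synthesis once every comparison passes. Rebuilding with different `-D` values then gives a new design point without touching the source.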
The top row here shows the latency of FlexASR, highlighted in blue, and we compare its performance against CPU, GPU and FPGA. The bottom row shows the energy consumption of the HLS-generated design, FlexASR, again compared against CPU, GPU and FPGA. Overall we saw that FlexASR, which is our design, was able to produce orders of magnitude of speedup and energy efficiency compared to CPU, GPU and FPGA.
We also had another tapeout where we used Catapult HLS to design an accelerator. In this work the accelerator was used to efficiently execute MobileNet inference, and post-silicon results showed that the design produced much smaller energy per inference compared to SIMD on the CPU and also the embedded FPGA, which we had also integrated on the SoC.
In the flow that I outlined, we essentially ship the machine learning data over to the Catapult environment. I want to highlight an alternative approach, which is to do it the other way around, by plugging the HLS design directly into the ML framework. In case you want to find out more about it, you can read this white paper here.
With that said, I would like to now proceed to share our experience. We went through a lot of trial and error to achieve a design that we were satisfied with, so I think some of the learnings here could be very beneficial to the community. The first thing that I want to highlight is that there can be a bit of a learning curve to write compiling code that also passes RTL verification. In other words, it's pretty easy to write compiling code that fails RTL verification. One of the things we have witnessed is that, in case you have a case statement, you don't want to push and pop, that is, send and receive data, from the same channel in the same case statement.
What you want to do is split them into two different case statements. Also, the default is very important. It's always important to mention the next-state transition in the default, because we found out that in case you omit the next-state transition in the default case statement, the SystemC simulation will pass, but the RTL simulation will fail. So there are things like that which take some learning to get a grasp of.
We've also noticed that there is a bit of a learning curve to grasp the proper coding style for optimal PPA. It's pretty easy to write compiling code that produces unoptimized PPA. One example is in case you have a single thread that executes a very long and complex data path, or in case you have operations executing directly on a reference argument. Let me give a concrete example. Suppose that you want to design a vector accumulator, as shown right here. It's pretty simple. You have a vector of 16 elements, and you want to accumulate all 16 elements into the scalar output, called out here. What you don't want to do is accumulate directly on the reference output called out. If you do that, you will see that there will be additional logic in the generated RTL. The proper way to do this is to first accumulate into a temporary variable and then assign the temporary variable to the reference output, and now you will have the correct implementation. These simple coding tricks actually affect PPA. Between these two designs, you will see a smaller area and 27% smaller power consumption in case the design is coded in the approach shown on the right, that is, by storing the output first in a temporary variable and then assigning the temporary variable to the output reference.
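The two coding styles for the vector accumulator can be sketched as follows. This is a plain C++ approximation of what the slide presumably showed, not the speaker's actual source: the type alias, the function names and the 32-bit element type are assumptions. Both functions compute the same sum, and the difference the talk reports shows up only after synthesis.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kVectorSize = 16;  // the talk's example uses 16 elements
using Vec = std::array<std::int32_t, kVectorSize>;

// Style the talk advises against: every loop iteration reads and writes
// `out`, which is a reference argument.
inline void AccumulateOnReference(const Vec& in, std::int32_t& out) {
  out = 0;
  for (int i = 0; i < kVectorSize; ++i) {
    out += in[i];
  }
}

// Style the talk recommends: accumulate into a local temporary, then assign
// the temporary to the reference output once.
inline void AccumulateViaTemp(const Vec& in, std::int32_t& out) {
  std::int32_t acc = 0;
  for (int i = 0; i < kVectorSize; ++i) {
    acc += in[i];
  }
  out = acc;
}

// Runs both styles and checks that they agree functionally.
inline std::int32_t AccumulateChecked(const Vec& in) {
  std::int32_t on_reference = 0;
  std::int32_t via_temp = 0;
  AccumulateOnReference(in, on_reference);
  AccumulateViaTemp(in, via_temp);
  assert(on_reference == via_temp);
  return via_temp;
}
```

Because the two versions are functionally identical, a C++ simulation alone cannot tell them apart. The area and power gap would only be visible in the post-synthesis reports.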
This was done in the GF 12 nm process.
Now, some recommendations. Number one, I would say, is to definitely make heavy use of hardware library components. We have heavily used the MatchLib and HLSLibs libraries. This really allows you to do correct-by-construction design and tremendously boosts productivity. Again, you don't need to reinvent every single wheel; some of these components have already been heavily validated. On this slide I'm showing the number of instantiations of these hardware libraries in our FlexASR design. You can see that we heavily use the Connections channels, which allow you to link different SystemC blocks with each other. We've also used a lot of memory components, and we also use quite a few modules from HLSLibs, mostly nonlinear functions.
Number two is that, to facilitate design space exploration, it is really critical to parameterize the key hardware features. It's easy to do. In the source code you just have to define a variable for each of the key hardware features that you want to parameterize, and then in the Makefile you can assign values to these hardware features. This allows you to try various permutations and variations of the design and to see which design point gives the best performance.
Number three is to avoid overly complex SC_THREADs. When we do an HLS SystemC design, obviously we want to achieve an initiation interval of 1. The initiation interval, for those who don't know, is the number of cycles between successive operations, and it is desirable to have an II of 1 for high throughput. But in some cases, when the thread is very complex and very long, the tool will be incapable of synthesizing these loops unless the initiation interval is relaxed.
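To make the effect of the initiation interval concrete, here is a small back-of-the-envelope model in C++. It uses the usual textbook formula for a pipelined loop, not anything reported by Catapult, and the function names and the latency value in the usage note are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Cycles needed to push `iterations` loop iterations through a pipeline in
// which the first result appears after `latency` cycles and a new iteration
// starts every `ii` cycles (the initiation interval).
inline std::uint64_t PipelinedCycles(std::uint64_t iterations,
                                     std::uint64_t latency,
                                     std::uint64_t ii) {
  assert(ii >= 1);
  if (iterations == 0) {
    return 0;
  }
  return latency + (iterations - 1) * ii;
}

// Steady-state throughput in results per cycle for a given initiation
// interval.
inline double ResultsPerCycle(std::uint64_t ii) {
  assert(ii >= 1);
  return 1.0 / static_cast<double>(ii);
}
```

With a hypothetical latency of 10 cycles and 1000 iterations, `PipelinedCycles(1000, 10, 1)` gives 1009 cycles while `PipelinedCycles(1000, 10, 4)` gives 4006. In this idealized model, relaxing the II from 1 to 4 costs close to a factor of four in throughput; a real design also changes latency and clock period when it is restructured, so measured gains can differ.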
An example is shown right here. In case you have a very long thread, you may have to relax the initiation interval to a high value, and this will degrade the throughput. By breaking up these long data paths into shorter data paths, you can eventually achieve higher throughput by having a smaller initiation interval. Let me show you a concrete example. We had an SC_MODULE here with neural network pooling, and we had a single thread in which we put all the operations inside this single module. We ended up having to use an initiation interval of four. By moving some of these operations to a different SC_THREAD, we were able to achieve an initiation interval of one across these SC_THREADs, and this ended up giving us much faster operation. This design was 2.5x faster, with a slight increase in area, obviously, because you have much tighter pipelining, and also a slight increase in power. So the takeaway here is to avoid long SC_THREADs and to break those into shorter paths.
However, multi-threaded designs present challenges. A common problem that you will run into is that, in case you have a global variable, you cannot share the global variable among threads. If you run HLS, you will run into this problem: the variable cannot be shared in multiple threads. The solution for that is to use a combinational channel. In MatchLib, which we use quite a lot, you can instantiate a combinational channel, push data into it in one SC_THREAD, and then pop that same data in a different SC_THREAD. That's the way to get around the issue of variables that cannot be shared between different threads.
Also, number four is random stalling.
We have noticed that a significant number of post-HLS functional failures were eliminated by enabling random stalling. Often you will have design bugs that are only observed post-HLS. Random stalling, as the name says, randomly stalls the ports and channels during SystemC simulation, and this can catch elusive bugs prior to HLS. The way to enable it is simply to use this flag in your Makefile. We had great success with this feature, and it can enable you to have fast verification throughput and catch the majority of your bugs prior to running HLS.
Number five is that, in case you have a very large and complex design, it is best to code it in a modular manner and do HLS from the bottom up. You can then synthesize and verify each module individually, from the bottom to the top. That's the approach that we have followed in our tapeouts and designs. It's also advised to make heavy use of C++ templates. This makes the source code much more legible and also easier to understand.
So I guess this is my penultimate slide: some tips on getting started with HLS. I would highly advise studying open source examples. There are a lot out there with supporting testbenches, and some examples are shown here, including MatchLib and HLSLibs. Also, if you have a Catapult installation, under MGC_HOME, in the shared examples, you will see a lot of examples, particularly in C++, that you can use. It can also be a bit of a challenge to grasp proper syntax, but I found that the Blue Book provides a lot of good advice on proper syntax for various functionalities.
It also covers design constraints, and you can find a link here as well.
With that said, I would like to thank my colleagues at Harvard for helping with this presentation, and also my advisors, David Brooks and Gu-Yeon Wei, for contributing to it. I would also like to acknowledge Siemens, as well as the NVIDIA ASIC and VLSI Research group, for their selfless support, which we really appreciate. And finally, I would like to thank our sponsors, SRC, JUMP ADA, DARPA CRAFT and DSSoC, and also NSF, for sponsoring our research. Thank you very much for listening, and I'll be happy to take some questions.
Great, thank you so much, Thierry. I think Sandeep wanted to say a couple of words real quick.
Mathilde, can we first maybe go through the questions which remain to be answered?
Yes, of course. So thank you, Thierry, for that presentation. In a moment, Thierry, Stuart Swan, Ellie Burns and Sandeep Garg will answer your questions orally. We've been answering a bunch of your questions via text, as you've been seeing via the Q&A widget. Do continue submitting any questions you may have and we'll try to get through as many as we can. We also have a few poll questions and would love your participation in those. Once again, use the interactive widgets at the bottom of your screen to submit your questions and download the slide deck, which is in the resource widget. And as I said before, the recorded version will be available on demand. Today you'll receive an email with the link to view that. So now I will pass it on to Thierry and everyone else. Go ahead with the questions.
OK, let me help. Let's go through the first one. Is supervised, unsupervised or reinforcement machine learning being used? I think that's a question for Thierry.
Is supervised, unsupervised or reinforcement machine learning used? So I described some of the accelerators that we designed using HLS.
Those accelerators were designed mostly for supervised learning. Essentially, we train a model in the ML framework, and then we design the hardware, mostly for machine learning inference right after supervised training. I'm not sure if this answers the question, but yes, the applications that we designed accelerators for were for inference of supervised models.
Next one: what's the main challenge or difference if you want to design an HLS-based accelerator for an image processing or object detection system instead of NLP or speech recognition?
Yeah, so I would say that the common feature among all of these applications is probably vector multiplications or matrix multiplications. So you have to be careful to make sure that the part of the hardware that handles matrix multiplications is properly designed. Between these different applications, the difference will be in terms of the type of nonlinearity used. Some applications may use more tanh, and other applications have different nonlinear functions that you have to use. But I would say that the main kernel is still matrix multiplication, so that's the part that you have to be really careful to optimize in HLS.
Great, OK. In HLS we rely on the existence of a testbench, presumably a set of stimuli developed for the high-level abstraction model. These are used to challenge the produced RTL. What if these stimuli are weak and do not cover the space of design states?
I can maybe take a stab at that, OK? So basically there have been a lot of questions in the Q&A about the verification approach.
One question was basically, is there a formal tool that compares the pre-HLS SystemC or C++ model against the RTL that's generated? We do have some formal checks that we can do, but it's not a complete proof that the entire design is exactly equivalent. What most customers do is use a traditional verification methodology, whether it be a C++ or SystemC testbench or SystemVerilog UVM, to do functional tests and functional coverage of the pre-HLS C++ or SystemC model, and also code coverage of the pre-HLS model. Doing code coverage at the pre-HLS level basically ensures that the verification is thorough even before you've generated any RTL. Then we repeat the same set of tests and the same functional coverage closure and RTL code coverage closure after HLS. We have a number of techniques that we apply when we do that. One of them, which I'll just mention right now, is that we do random stall injection into the RTL DUT to stress the DUT and cause additional states to occur, and resets to occur at unexpected times. Through these techniques, and some formal unreachability analysis techniques and so forth, we can achieve very high coverage closure numbers on the RTL. So that's the general set of techniques that are used.
And I think that might cover it, Mathilde. We have a couple of other questions, like, can you auto-generate tests to assure full coverage? I'm hoping that was answered, in that we have a very detailed coverage closure methodology, which Stuart just described. Is there anything else you want to add, Stuart, on the question of whether you can auto-generate tests to assure full coverage closure?
Well, basically what we do is we stress the RTL. The general philosophy is to do as much as you can in the pre-HLS C++, and that gets your efficiency way up because it's easier to debug, it's early in the design flow, et cetera. But then you basically repeat the same scenarios and use the same verification metrics that traditional handwritten RTL uses, in terms of code coverage closure, focused expression coverage, line coverage, you name it. We do all of that at the RTL level.
I just want to throw out there real quickly: there are so many questions that I don't think we'll be able to answer them all during this meeting. But there was a large set of questions about whether you should be using SystemC or C++ for HLS. We don't have enough time to get into all the details on that, but we have a large number of customers that are using both. Basically, if you're trying to do time-based modeling, or need to be able to model time-based behaviors, then you would need to use SystemC. What MatchLib does is it has this notion of throughput accuracy, which I think Thierry described, and so you can do performance analysis and so forth in the pre-HLS model. There are some designs which are data-path dominated, that maybe don't have a lot of control in them and so forth, that are well suited to C++. There is some overhead with SystemC, and so we have a fair number of customers that are just using straight C++ to do those designs. So that's sort of the high-level criteria.
Yeah, I think there are a couple of other questions, Mathilde, right? Stuart, do you want to take any? OK, go ahead, Mathilde.
Which one should we do next? "Are the expected machine learning frameworks..." go ahead. I think it would be interesting for Thierry to comment on this one: if using SystemC, why not directly use Verilog? I'd like Thierry to comment on his use of MatchLib and what value he saw in the performance and the scalability of that as opposed to his Verilog experience, since you're a very experienced Verilog designer. Yeah, so it's not just MatchLib. It's also the fact that with SystemC you can easily handshake with the software model. I believe that can be done easily at the C++ level, as opposed to doing this in Verilog. And I believe the verification would also be much faster, as well as the processing, if you were to run an actual application. So those are, I guess, some of the benefits of using SystemC as opposed to Verilog. So the important part there is that SystemC can be pin-level, at the level of abstraction of Verilog, but it can also be at a much higher level of abstraction than Verilog, which enables all of this. So it's not just the language; the level of abstraction is equally important. Yeah. I see a question here. Somebody is asking what kind of throughput you were able to get from your design. Yes, I showed that early on, on a slide that we don't show here. OK, I think I actually forgot to mention it. So, for example, we were able to achieve a throughput of seven teraflops per watt, so that was actually pretty good. You can see here we have orders-of-magnitude speedup and energy efficiency over CPU, GPU, and FPGA. So we were able to really meet our throughput objective by using HLS. And on the next slide as well, I believe for the CC accelerator we had a throughput on the order of, if I remember, something around 100 GOPS per watt.
Yeah, so essentially we were able to meet our throughput objective with HLS. Awesome. Let's see here, which questions we should answer next. I will let you take the questions. You know, maybe we could spend one minute on, I think it's your last slide, about additional examples, because I just want to make a quick comment on what's available out there. I think a lot of the questions that people have here are about this: if you're modeling timing behavior in MatchLib, how is it really different from Verilog RTL? It would be helpful, I think, to look at some of the examples that are mentioned here, so maybe we could just quickly mention a few. I do want to point out that FlexASR, which is the source code for what Thierry is presenting today, obviously a non-trivial but realistic design, is available on GitHub. There are a number of other examples there. I think, though, that if people are just getting started and want to dip their toe in to see what MatchLib looks like, the link there that has Accellera in it is basically a kit that I created that lets you run some hello-world-type examples with little AXI bus fabrics, a discrete cosine transform, and things like that. So there are a lot of introductory examples and materials there, and I think if you take a look at those designs, some of the questions that we have about how this is really different from pure C++ modeling, and how it is different from Verilog RTL modeling, will become a lot clearer. So you may want to take a look at those examples. Do you want to make any more comments about some of those other links there, Thierry? Yeah, I want to emphasize this because I know there is definitely interest in C++. One thing that I've noticed is this.
The examples under the Catapult installation tend to be more C++ examples. So in case folks are interested in that, definitely check out the examples in the Catapult installation. Great, let's see here. "Can you auto-generate tests to assure full coverage?" Well, I think we answered that when I commented on verification. OK. Yeah, I think the coverage really will only be as good as the way you design the testbench, so it is really critical that the testbench is written to provide as wide a coverage as possible. So, definitely, effort will need to be given to make sure that we have sufficient coverage in the testbench. Yeah, and just to repeat, I think one of the things that's unique about this flow, that people maybe aren't used to if they've been around other types of C++ models, is that we actually try to stress the SystemC model. Thierry mentioned the random stall injection that's built into MatchLib, and that's a very effective way to stress-test the pre-HLS SystemC model. So much of the overall verification effort ends up moving much earlier in the process; that's a difference. Hey Thierry, there's one here asking, and I think it's probably a pretty common ask: during your course of using HLS, did you specifically compare power and area between handcrafted code and HLS code? Yeah, good question. I don't have concrete data on that, although I do have data in terms of simple designs. I would say if the scope of the design is pretty simple, let's say, you know, a MAC or a carry adder, obviously the PPA, the area, performance, and power, between the HLS-generated RTL and the handcrafted RTL will be very, very similar, and there might be a little bit of difference as the design gets bigger and more complex.
Yeah, you might see a little bit of difference. Although I don't have data on that, I do have anecdotal evidence that the difference could be around, you know, 30 to 5% in terms of cycle behavior. Which, by the way, is still pretty good. Yeah, it's actually pretty good. Awesome, great. You said that Catapult... never mind, let's see. Ellie, do you see one that we should ask next? I'm kind of going through them. I think we'll have to go through and probably take some of them offline. Maybe the one from Jason, who is worried that maintaining templated libraries could end up being difficult because the tools are evolving rapidly. So maybe Thierry, you could take a minute to talk about that, or maybe Stuart, and about whether pragmas should be separated. So, really understanding the value of the C++ templated libraries and how you use those, and whether that's a maintenance nightmare or whether it really ups your productivity. Oh, the C++ templates, definitely. I would say they not only make the source code easier to read, but they also boost productivity, because you can reuse those same templates and change, for example, just the data type. So those are definitely very useful; especially in a very complex design, you want to make use of C++ templates. Just for disambiguation, because the question here also talks about pragmas: is that separate? Yes, it's completely separate in terms of how you would optimize your design by using the various directives and pragmas. We obviously want to achieve the best performance, power, and area, so we use pragmas as a way to do that. The list of pragmas is very long, obviously, but the thing that we really care about is throughput.
Also, when we have a multiply-accumulate, we want to direct the tool to use either combinational or sequential elements. So there are pragmas that you can invoke in your SystemC code to essentially instruct the tool to synthesize the hardware in a particular way. So, as a designer using HLS, it's important to have some familiarity with these pragmas. I was going to say, but you didn't embed the tool pragmas into your design, right? They were probably Tcl, or external, so you didn't actually have tool pragmas in your source code, or tool maintenance because of embedded pragmas. Is that correct? Is that how you implemented that, Thierry? That's correct, we didn't have... yes. Essentially we simply write the pragma in the source code, so we didn't have any maintenance issues. I would just say, firstly, on templates: templates definitely help in terms of overall productivity, and we use them extensively for things like data types and other things. But on pragmas, after you've been using HLS for a little while, there are some pragmas that you actually do want to embed directly in the source code because, effectively, they never change. For example, if you're doing an AXI transactor, you always want the throughput for the AXI transactor to be with an II of 1, and there's no reason not to just stick that directly into the source text. But if you're doing design space exploration, then you want to be setting those HLS directives either via the GUI or in a Tcl file, because it enables greater flexibility. So you learn how to adapt for the particular usage. I see. I guess I'd just add one thing on this.
It would be preferable to put the pragma in the Tcl file, for instance, in case the source code needs to be portable across multiple technologies, right? Because if the pragma is embedded directly in the source code and you want to change process technology, then, depending on the timing margins of the new process, you might not be able to schedule the design in HLS. So a better approach would be to put it in the Tcl file that you use for the run. So Mathilde, I know we have a lot of questions left, but we are at the top of the hour. I'm just wondering if we should maybe ask one last question and then try to answer the rest in email, or answer them here. I don't know if there's a way that all people can see them, Mathilde. Is there a way, if we answer all of these eventually, that they can come back to this platform and see them? If we send to all, like we have been doing, yes. But what I can also do is add an Excel spreadsheet with the answers to the resource list and the on-demand version. Probably tomorrow we'll have that added to the resource list, so as people come back, all of the questions and answers will be there for you to see and download at will. Just to let you know, we will make it anonymous, so whoever asked a question remains anonymous. Sound good? OK, Ellie, do you want to recommend the last question? Oh boy. I guess I would like one that would help get some input from Thierry. Thierry, this whole thing was about hardware-software co-design. There's a question on how we model the on-chip CPU software execution and the hardware-software interface performance in order to come up with the most efficient hardware and software. Do you have any thoughts on how you did that? Did you just run your software against it? How did you model that, and why was it valuable to you?
So you asked how I modeled the CPU software execution? Yeah, so the software modeling is done using an ML framework. There are a lot out there: there's TensorFlow, PyTorch, right? These are used for training various ML models for various applications, such as speech-to-text, object detection, and so forth. So once we have designed the hardware in SystemC, what we really care about is making sure that we have correct functionality between the hardware model and the application software. What we would do is check the numerical values of the output coming out of the SystemC testbench that we wrote against the output coming out of the application software. So I do spend some time to actually train the model and perhaps add the various efficiencies that I want to see in the software, such as quantization, pruning, and so forth. Once I'm satisfied with the software behavior, I then use the software output to compare against the output coming out of the hardware model in SystemC. OK, I hope that answers the question. Super. So again... go ahead, sorry. No, I was going to say, it's five after, so I wanted to be respectful. Sandeep? OK, you're having audio issues, so you can't do that. Hello, can you hear me? I can. It's a little bit... Yeah, go ahead. OK, all right. So as Ellie was saying, it's five after 11, so just to be respectful of time, let's close this presentation now. As Mathilde indicated earlier, we will be able to answer all the questions that are left unanswered. We'll engage with you and try to answer them after the event. Also, the recording of the presentation will be available.
Anyone who registers, even after this event is over, can access it. Lastly, I just want to thank Thierry. Thank you very much for such a great presentation; very informative. Thank you Stuart Swan, Ellie, and Mathilde for providing such great support. And audience, you are great. Thank you for all the great questions. Hopefully you found it informative and useful. If you want to engage with us after the event, please feel free to reach out and we will try to help you be successful with HLS. All right, thanks. And the questions that weren't answered, we will answer directly. So hang on, give us maybe until tomorrow, and we will get them all answered, because I still see a couple more really good ones in here from bilagi, from field, so OK. Yes, and just to let you know, if you have any questions later on when you come back to rewatch the webinar on demand, feel free to still submit questions via the Q&A window. It will send an email to us and we'll answer them in due time. So there you go. Thank you all so much for your time and for joining us today. It was a wonderful, successful event. I'll leave the room open for another minute in the event that you haven't downloaded the slides yet or want to grab any other resources from the resource widget. But thank you again for coming, and thank you so much, Thierry, for your hard work on this webinar. It was really wonderful. Thank you, thank you for the invitation. Thank you. Thank you, everyone. Great, thanks, Thierry.
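The output check described in the last answer of the Q&A, comparing the numerical output of the hardware model's testbench against the output of the application software, could look roughly like the sketch below. The names `CompareResult` and `compare_outputs` are hypothetical, and the use of an absolute tolerance is an assumption (for example, to allow for quantization in the hardware model); the speaker did not say whether the comparison was exact or tolerance-based.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result record for one comparison run.
struct CompareResult {
    bool pass;                   // true if every element is within tolerance
    std::size_t first_mismatch;  // index of the first failing element, or the vector size if none
    double max_abs_error;        // largest absolute difference observed
};

// Compares the software (golden) output against the hardware model output,
// element by element. `tolerance` is an assumed absolute error bound.
CompareResult compare_outputs(const std::vector<double>& golden,
                              const std::vector<double>& hw,
                              double tolerance) {
    CompareResult r{true, golden.size(), 0.0};
    if (golden.size() != hw.size()) {
        // A length mismatch is reported as a failure at index 0.
        r.pass = false;
        r.first_mismatch = 0;
        return r;
    }
    for (std::size_t i = 0; i < golden.size(); ++i) {
        const double err = std::fabs(golden[i] - hw[i]);
        if (err > r.max_abs_error) r.max_abs_error = err;
        if (err > tolerance && r.pass) {
            r.pass = false;
            r.first_mismatch = i;
        }
    }
    return r;
}
```

In a flow like the one described, `golden` would be loaded from values dumped by the ML framework after training, quantization, and pruning, and `hw` would be collected by the SystemC testbench. Both steps are outside the scope of this sketch.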

The slowdown of Moore’s law coupled with the surging democratization of machine learning has spurred the rise of application-driven architectures, as CMOS scaling alone is no longer sufficient to achieve desired performance and power targets. In order to keep delivering energy efficiency gains, specialized SoCs are exhibiting skyrocketing design complexity with increasing development efforts. In this webinar, we will shed light on our agile algorithm-hardware co-design and co-verification methodology powered by High-Level Synthesis (HLS), which enabled us to reduce front-end VLSI design efforts by orders of magnitude during the tapeout of three generations of edge AI many-accelerator SoCs. With a particular focus on accelerator design for Natural Language Processing (NLP), we will share details on proven practices and overall learnings from a high-productivity digital VLSI flow which leverages Catapult HLS in order to efficiently close the loop between the application’s software modeling and the hardware implementation. Finally, we will mention some of the HLS challenges we encountered, offer recommendations cultivated from our learnings, and highlight internal and external efforts to further improve HLS user experience and ASIC design productivity.

What you will learn:

Proven practices enabling quick and correct-by-construction ML accelerator design via HLS.
Approaches for using HLS in closing algorithm-hardware verification loops.
HLS challenges we encountered and learnings.
Ongoing and future efforts to address current HLS limitations.

Who should attend:

Anyone interested in building ASICs using HLS
RTL design and verification engineers