All right, welcome everyone to today's webinar. We'll get started in just a minute while we make sure everyone is here, and thank you so much for joining us.

OK, hello everyone, and welcome to today's webinar from Cornell University: Building Sparse Linear Algebra Accelerators with HLS. My name is Mathilde Karsenti and I'm a marketing programs manager here at Siemens. Before we start, I'd like to go over a few items regarding the interface. At the bottom of your browser, in the center, you'll see icons in a widget dock. These are the interactive tools you will use throughout the webinar. You can minimize, maximize, and scale each widget to customize your viewing experience. I'd like to call your attention to the Q&A widget. The Q&A window should be open and is located on the left side of your screen. You're encouraged to submit questions to the presenter throughout the webinar using this widget; your questions will be answered during the Q&A session right after the webinar. Please be aware that a question you submit will disappear, but it will come back with an answer if it is answered by text. Second is the resources widget, located at the right side of the widget dock, where you will find the PDF slide deck as well as other resources. Right next to that widget you will see the help widget; if you are experiencing any technical issues, please click on it for assistance. Refreshing your browser generally resolves most issues. This presentation is being recorded and you will receive a link to it in a follow-up e-mail later today. If we are not able to get to your question during the live Q&A, we will respond to it via e-mail.

Our presenter today is Yixiao Du. Yixiao is a second-year PhD student in the Cornell Computer Systems Laboratory, advised by Professor Zhiru Zhang. His current research interests include efficient acceleration of sparse workloads with High-Level Synthesis. He received his bachelor's degree in microelectronics science and engineering from the University of Electronic Science and Technology of China. In addition to Yixiao, we have several experts on the line who will help answer your questions: Stuart Clubb, Petri Solanki, Michael Fingeroff, as well as Khalid Islam. So thank you so much, and Yixiao, I'll hand it over to you.

Alright, thanks for the introduction, Mathilde. Again, this is Yixiao from Cornell University, and I'm going to talk about how to build efficient sparse linear algebra accelerators with High-Level Synthesis.

By now we all know that Moore's law is heading toward a slowdown, or toward an end, which calls for architectures with higher efficiency: domain-specific accelerators, or DSAs for short. A DSA is a hardware computing engine that is specialized for a particular domain of applications. As this figure shows, measuring energy efficiency as performance per watt on two hot topics, deep learning (ResNet-50) and genomics (the Smith-Waterman algorithm), domain-specific accelerators perform far better than their CPU baselines. However, today's mainstream accelerators mainly target compute-intensive workloads, for example the tensor cores on GPUs and the TPU products from Google. Please allow me to elaborate on the term "compute-intensive" using the roofline model.
The roofline model is plotted between computational throughput, measured in operations per second, and operational intensity (OI), defined as the number of operations performed per byte of memory traffic. The higher the OI, the more compute the accelerator does for each loaded byte, which creates more opportunities for data reuse such as caching. That is exactly where the mainstream accelerators stand: their performance is limited by the speed of computation. On the other hand, if the OI is low, the performance is limited by the speed of memory, namely memory bandwidth. Such memory-bound sparse workloads have plenty of parallelism but are hard to accelerate, and those sparse workloads are the focus of our work. The reason is that leveraging sparsity can reduce the amount of data to be processed by orders of magnitude, as you can see from this spectrum of density.

Among those sparse workloads, sparse linear algebra (SLA) operations are often the dominant computation. The most prevalent and important SLA operations include two kinds of matrix-vector multiplications, depending on whether the vector is dense or sparse: we call them SpMV and SpMSpV for short. Similarly, we have two kinds of matrix-matrix multiplications, namely SpMM and SpGEMM. Our group has developed various sparse linear algebra accelerators. Tensaurus is an accelerator for mixed sparse-dense workloads including SpMM, and MatRaptor focuses on SpGEMM; Tensaurus and MatRaptor put more focus on architectural research. We have also built FPGA prototype accelerators, namely GraphLily and HiSparse, which are mainly about SpMV. In addition, we are building a versatile sparse accelerator that supports multiple SLA operations in one piece of hardware. In this talk I'll focus on our most recent works: HiSparse and the versatile sparse accelerator.

Here is the outline of today's talk. We'll first do a deep dive on SpMV using HiSparse, developed with Catapult HLS, and then describe our ongoing work, the versatile sparse accelerator. Finally, I will leave some room for discussion.

We pick SpMV as the case study mainly because it is a very good representative: it has several characteristics that are shared among many sparse workloads. For this example, I listed all the compute you need to do to generate the first five elements of the output vector. Looking at this compute pattern, you can see it has a low operational intensity, because for each nonzero we have two arithmetic operations but must perform four memory accesses. Moreover, each element of the matrix A is used only once, so you can immediately tell there is no room for any data reuse on A. Furthermore, the access pattern to the input vector B is random, because it depends on where the nonzeros are located in matrix A. And finally, we have read-after-write carried dependencies on the results. All of these features make it really hard to accelerate sparse workloads, including SpMV, and they lead to a situation called memory-bandwidth bound: the performance of SpMV, as a representative SLA workload, is mainly limited by the available memory bandwidth.
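To make the access-pattern argument concrete, here is a plain C++ reference loop for CSR-based SpMV (an illustrative sketch, not the accelerator code): per nonzero there is one multiply and one add, but roughly four memory accesses, and the read of the dense vector is data-dependent.

```cpp
#include <vector>

// Minimal CSR SpMV reference kernel (y = A * x), to make the operational-
// intensity point concrete. Per nonzero: 2 arithmetic ops (multiply + add)
// versus roughly 4 memory accesses (column index, matrix value, x[col],
// and the access to y), so the OI is well below 1 op/byte.
void spmv_csr(const std::vector<int>&   row_ptr,   // size nrows + 1
              const std::vector<int>&   col_idx,   // size nnz
              const std::vector<float>& values,    // size nnz
              const std::vector<float>& x,
              std::vector<float>&       y) {
    const int nrows = static_cast<int>(row_ptr.size()) - 1;
    for (int i = 0; i < nrows; ++i) {
        float acc = 0.0f;
        // Streaming access to col_idx and values (each value used once),
        // random access into x that depends on the nonzero positions.
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            acc += values[k] * x[col_idx[k]];
        }
        y[i] = acc;   // carried dependency if rows update shared results
    }
}
```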
Our approach to efficient sparse processing is, first, to leverage the high aggregate bandwidth of multiple memory channels. In order to better utilize that available bandwidth, we co-design the sparse matrix format and the accelerator architecture for high bandwidth utilization. Finally, we build dynamically executing pipelines to handle the irregularities, by which I mean the random accesses and the carried dependencies. Once those are handled, we can ensure high compute utilization.

For the matrix format, there is a commonly used format called compressed sparse row, or CSR for short. However, CSR is not very suitable for multi-channel acceleration. Here is an example using a four-by-four matrix accessed by two processing engines (PEs), and this is the access pattern. Since CSR relies on the row pointer to denote the start location of each row, each PE has to look up the row pointer array before loading the actual data, which causes non-streaming accesses. Moreover, since rows in CSR are stored in contiguous memory locations, by definition, two PEs can end up accessing the same memory channel at the same time, causing channel conflicts, so they compete for bandwidth.

To resolve these problems we designed a specialized sparse matrix format called CPSR. We first get rid of the row pointer array by inserting next-row markers into the stream of nonzeros, which ensures fully streaming accesses; that corresponds to the "streams of rows" part of the format's name. This access is also vectorized, which I will explain in the next slide when we talk about the architecture. Furthermore, we store different rows cyclically across different channels, according to the work distribution across the PEs, which ensures conflict-free access to multiple memory channels, so the PEs won't compete for bandwidth; that corresponds to the "cyclic" keyword in the format's name. In this way we ensure fully streaming and conflict-free accesses to the different memory channels.
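A rough sketch of this CPSR-style packing, written as plain C++ for illustration: the field names, the marker encoding, and the per-channel (rather than per-PE, vectorized) granularity are assumptions, not the exact HiSparse definition.

```cpp
#include <vector>
#include <limits>

// Starting from CSR, drop the row pointer array, append an explicit
// next-row marker after each row, and assign rows cyclically to channels
// (row i goes to channel i % num_channels).
struct Entry {
    int   col;   // column index, or a reserved value meaning "next row"
    float val;
};
constexpr int NEXT_ROW = std::numeric_limits<int>::max();  // assumed sentinel

std::vector<std::vector<Entry>> pack_cpsr(const std::vector<int>&   row_ptr,
                                          const std::vector<int>&   col_idx,
                                          const std::vector<float>& values,
                                          int num_channels) {
    std::vector<std::vector<Entry>> streams(num_channels);
    const int nrows = static_cast<int>(row_ptr.size()) - 1;
    for (int i = 0; i < nrows; ++i) {
        auto& s = streams[i % num_channels];            // cyclic row distribution
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            s.push_back({col_idx[k], values[k]});       // streaming nonzeros
        s.push_back({NEXT_ROW, 0.0f});                  // explicit next-row marker
    }
    return streams;  // each stream is read fully sequentially by one channel
}
```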
As for the other side of the co-design, here is the high-level architecture of our accelerator. We group PEs into clusters that access memory in a packed, vectorized manner, so inside one cluster there are actually many PEs. Inside each cluster we also have on-chip buffers for the input vector and the result vector, to reduce off-chip memory traffic, because we do have some data reuse on those vectors. Besides the clusters, we also have a vector loader and a result draining unit for feeding the clusters with the input vector and writing the results back to memory.

Since we have on-chip buffering, you might ask: what if the matrix is larger than the on-chip buffers? For those inputs we tile the SpMV operation. In this example the matrix is tiled into four parts and the vectors are split into two parts each, corresponding to the sizes of the on-chip buffers. We start with the first tile of the matrix and the first half of the input vector. However, since accesses must stay within one matrix tile, we need to switch to the next row at tile boundaries. This leads to more next-row markers, and these markers increase the memory footprint and also take a toll on bandwidth efficiency. Then we move on to the second tile of the matrix and the second half of the vector, and we write the results to the first half of the result vector.

But as we access the third tile of the matrix, we need to load the first half of the vector again; that is just the definition of matrix-vector multiplication. So tiling also causes repetitive accesses to the input vector: in this example the input vector is accessed twice. Therefore we want the on-chip buffers to be as large as possible to minimize the tiling overhead. For the result buffer this is straightforward: different rows are assigned to different PEs, so as the number of PEs grows, the size of the result buffer naturally scales up. The vector buffer, however, is hard to scale, because all the PEs randomly access the same vector; they are essentially accessing a shared memory.

To deal with this we have two options. The first is to duplicate the dense vector for each PE. This ensures very high on-chip bandwidth, but it places a very high demand on on-chip memory capacity. The other option is to share the dense vector among a set of PEs, in other words, banking across PEs. This increases on-chip memory usage efficiency, but we then have to deal with bank conflicts. In HiSparse we take the latter option, because we really want to minimize the tiling overhead: SpMV is fundamentally bandwidth bound. To achieve this, we split each PE into two parts, part A and part B, shown on the left-hand and right-hand sides of the figure at the bottom. The first part sends read requests to the vector buffer banks, and the second part collects the responses and does the actual compute. We insert two shuffle units in the middle to handle the bank conflicts; you can think of a shuffle unit as an on-chip network that routes requests and responses to and from the vector buffer.

However, implementing the shuffle unit is not an easy task in HLS. Here is a counterexample: we simply write a shuffle unit that forwards each input payload to its target inside a for loop. When this design is pushed through the tool flow, the tool assumes the worst-case traffic pattern and processes all the inputs sequentially. If we have N inputs, the initiation interval (II) of the pipeline will be N. This is mainly because there is no arbitration logic telling the tool how to handle the potential conflicts. So in HiSparse we explicitly implement the arbitration logic, plus a resend mechanism that sends denied payloads back to the input of the arbiter so we don't lose them.

Here is the detailed architecture of our shuffle unit. This part corresponds to the two input muxes at the input of the arbiter, which realize the ability to resend denied payloads back to the arbiter input: if a payload is denied access, the corresponding input line is not read in the current cycle. In the middle we have the arbiter, and this arbiter can be pipelined over multiple cycles so that we can meet our timing targets. The arbitration logic inside the arbiter is not the key point here, because the arbiter itself is a feed-forward path; there is no feedback inside the arbiter. We use a round-robin arbitration policy, but in principle you can implement any arbitration policy you want.

We also need a crossbar to actually route the packets, and you may notice these registers on the feedback path of the shuffle unit. They are required to balance the latency of the feed-forward path, which is the arbiter, against the feedback path, which carries the arbiter outputs and the grant signals that control the muxes. You can find a really good example of this in the Catapult user documentation, so I'm omitting the actual code for those registers for simplicity. This is the first dynamic execution pipeline we built for our HiSparse accelerator.
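Conceptually, one cycle of such a shuffle unit behaves roughly like the untimed C++ sketch below. The lane count, the payload fields, the single-cycle framing, and the simplified backpressure on the inputs are illustrative assumptions; the real unit is split into pipelined stages with registers on the feedback path, as just described.

```cpp
#include <array>
#include <optional>

constexpr int N_LANES = 8;                 // assumed number of PEs / banks

struct Payload {
    int   bank;    // destination vector-buffer bank
    int   addr;    // address within the bank
    float data;    // request or response payload
};

struct ShuffleState {
    std::array<std::optional<Payload>, N_LANES> pending;  // denied payloads to resend
    int rr_ptr = 0;                                       // round-robin priority pointer
};

// One "cycle": explicit arbitration plus a resend path for denied payloads.
void shuffle_cycle(ShuffleState& st,
                   std::array<std::optional<Payload>, N_LANES>& in,     // fresh inputs
                   std::array<std::optional<Payload>, N_LANES>& out) {  // one slot per bank
    out.fill(std::nullopt);
    std::array<bool, N_LANES> bank_taken{};
    for (int k = 0; k < N_LANES; ++k) {
        int i = (st.rr_ptr + k) % N_LANES;
        // Resent payloads take the input mux; the fresh input line is only
        // read when lane i has nothing pending from a previous denial.
        bool from_pending = st.pending[i].has_value();
        std::optional<Payload>& req = from_pending ? st.pending[i] : in[i];
        if (!req) continue;
        int b = req->bank % N_LANES;
        if (!bank_taken[b]) {            // granted: route through the crossbar
            bank_taken[b] = true;
            out[b] = *req;
            req.reset();
        } else if (!from_pending) {      // denied fresh input: park it for resend
            st.pending[i] = *req;
            req.reset();
        }                                // a denied pending payload simply stays pending
    }
    st.rr_ptr = (st.rr_ptr + 1) % N_LANES;  // rotate round-robin priority
}
```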
The other dynamic execution pipeline we built handles the carried dependencies. Again, this is a counterexample: here we are accumulating into an output buffer. However, the buffer may take multiple cycles to read and write, causing a carried dependency, and as a result HLS will increase the pipeline II to the sum of the read latency and the write latency in order to resolve the read-after-write dependency. To deal with this data hazard in the pipeline, if you were writing RTL you would immediately think of building a data forwarding mechanism. But building data forwarding in HLS is quite challenging, mainly because the pipeline registers are not visible at the source level, so we don't know where to retrieve the information about past issued writes.

Since that is the key problem, we built a specialized data structure called the in-flight write queue (IFWQ). This data structure is just a shift register that shifts once per cycle, so it behaves exactly the same way as the pipeline registers, and we store the information about past issued writes in this queue so that we can retrieve it for data forwarding. In the middle you can see the actual data forwarding logic. This is not the fancy part; it is the same logic any RTL designer would put in their forwarding path. Note that in this code I'm showing the functionality in a more abstracted way due to space limitations, so you see some member functions on the IFWQ data structure, but those functions are straightforward and simply implement the forwarding. Finally, at the end of the loop, we update the in-flight write queue with the latest issued write and shift the entire queue once, to maintain behavior similar to the pipeline registers. That's all about the implementation of HiSparse.
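As a behavioral recap of the in-flight write queue idea, here is a plain C++ sketch. The queue depth, field names, and the single accumulation loop are assumptions for illustration; the accelerator version is written against HLS datatypes and streams.

```cpp
#include <array>

constexpr int DEPTH = 4;   // assumed read+write latency of the result buffer

struct IFWQEntry {
    bool  valid = false;
    int   addr  = -1;
    float data  = 0.0f;
};

struct InflightWriteQueue {
    std::array<IFWQEntry, DEPTH> q{};

    // Return the freshest in-flight value for `addr`, if any (data forwarding).
    bool lookup(int addr, float& fwd) const {
        for (int i = 0; i < DEPTH; ++i)          // q[0] is the most recent write
            if (q[i].valid && q[i].addr == addr) { fwd = q[i].data; return true; }
        return false;
    }

    // Record the write issued this cycle and shift, mimicking pipeline registers.
    void push(int addr, float data) {
        for (int i = DEPTH - 1; i > 0; --i) q[i] = q[i - 1];
        q[0] = {true, addr, data};
    }
};

// Accumulation loop using the IFWQ: read either the forwarded value or the
// (possibly stale) buffer contents, add, write back, and log the write.
void accumulate(float* buffer, const int* addrs, const float* vals, int n) {
    InflightWriteQueue ifwq;
    for (int i = 0; i < n; ++i) {
        float cur;
        if (!ifwq.lookup(addrs[i], cur))   // forward from an in-flight write...
            cur = buffer[addrs[i]];        // ...or fall back to the buffer read
        float next = cur + vals[i];
        buffer[addrs[i]] = next;
        ifwq.push(addrs[i], next);
    }
}
```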
Now let's move on to how we implemented and evaluated HiSparse with High-Level Synthesis. For HiSparse we use 16 memory channels for the sparse matrix and two memory channels for the input and result vectors. For the on-chip buffers, we have a 512-kilobyte result buffer in total, which is 32 kilobytes per cluster times 16 clusters; since different rows are assigned to different PEs, as I mentioned before, it naturally scales up. For the vector buffer we only have 32 kilobytes logically, because the vector buffer is duplicated across clusters, so we end up with 16 copies of a 32-kilobyte vector buffer. We found that sharing one vector buffer across all the PEs in the system is not a good idea for either timing or routability. We have four processing engines per cluster, so 64 PEs in total.

We then did the ASIC design with Synopsys Design Compiler using the RTL generated by Catapult HLS, targeting the GlobalFoundries 14 nm technology with a frequency target of 1 GHz, and all designs met timing. The SRAMs are generated by the Arm memory compiler. By "all designs" I mean that we tried designs with different numbers of clusters to study scalability; the largest design, with 16 clusters, has an area slightly over 3.5 mm² and a power of around 5 watts.

For the performance evaluation, we compared this accelerator with a CPU baseline, the Intel MKL library using 32 threads on a Xeon Gold machine with 282 GB/s of memory bandwidth, and a GPU baseline, the NVIDIA cuSPARSE library running on a GTX 1080 Ti with 484 GB/s of bandwidth. Finally, we also compare the ASIC design to our FPGA prototype, which is implemented on a Xilinx FPGA with 285 GB/s; note that the FPGA prototype was not developed with Catapult. This table shows the throughput comparison between our ASIC design, MKL, cuSPARSE, and the FPGA prototype. Our design runs 15x faster than MKL and 3.7x faster than cuSPARSE. I have to say that this design is simulated with ideal memory, because we don't have a memory model in any HLS tool. Since the target clock frequency is 1 GHz, we scaled the performance numbers as if we had 256 GB/s of bandwidth, which is pretty typical for a system with 16 memory channels. We also compare bandwidth efficiency, measured as throughput per unit bandwidth: our ASIC design has over 8x higher bandwidth efficiency than MKL and is also 3.7x more bandwidth efficient than cuSPARSE. As for power, the ASIC design only contains the accelerator kernel itself, so the 5.2 watts is estimated only for the accelerator. In a real system there would be memory controllers, maybe memory crossbars, and other external logic to make the whole system work, so we don't think it is a fair comparison of energy efficiency between the ASIC design and the other designs in this table, but you can look at the energy efficiency numbers as a reference. And that's all for the results.

Good question: someone asked, did you do a tape-out of this? No, we didn't do a tape-out. OK, great, thank you. Before I move on to the other part, which is our ongoing work, are there any other remaining questions about HiSparse that you want addressed right now?

How would you compare Vitis HLS and Catapult HLS in the context of your sparse application? Ah, that's a good question. I'll give a brief answer for now, and we can discuss it more in the final discussion panel. I don't feel a really big difference between Vitis and Catapult. They do have some syntax differences and differ in the way they treat your design, but as for designing the actual logic at the heart of the accelerator, I don't feel much difference.

OK, I see many questions here; we can save them for the Q&A at the end. OK, thanks. So let me move on to our ongoing work, which is the versatile sparse accelerator.
There are two reasons that drive us to build a versatile sparse accelerator. First, real-world applications usually involve different sparse linear algebra operations, as shown in the table below. Second, these applications cover a wide range of sparsity; here I'm referring back to the spectrum of density again. This means that even if we have one accelerator that works well for one application, it might not perform well for another, even if those applications share the same SLA operation. Previous work on SLA accelerators mainly focuses on either one operation or one fixed compute pattern. By compute pattern I mean the different ways of generating the output. Taking SpGEMM as an example, there are three prevalent compute patterns: inner product, which generates the output element by element; outer product, which generates the output matrix by matrix; and row-wise product, also known as Gustavson's algorithm, which generates the output row by row.

Our goal is to build a unified sparse accelerator with configurable compute patterns. The major design challenge is design space exploration, because each SLA operation has various configurations, such as which compute pattern to use and how to do the tiling, and simulation of SLA operations is slow. Our approach is therefore analytical modeling for rapid design space exploration. We built analytical performance models for our accelerator that take in the characteristics of the matrices, which SLA operation you are evaluating, the compute pattern, the tiling scheme, and the architectural parameters of the accelerator, and predict the performance. The time it takes to get a performance estimate is significantly shorter with our model than with the traditional simulation-based approach.

As for the hardware design of this sparse accelerator, we pick the compute pattern that achieves the highest OI according to our analytical model, because these workloads are typically bandwidth bound, so a higher OI means higher throughput. We build an accelerator with a shared datapath for all the operations and compute patterns listed in this table. The accelerator is based on high-bandwidth memory and has several HBEs, meaning HBM burst engines, to move data in and out of memory. We also have access units and processing engines that are designed in a flexible way, so you can reconfigure them to support different SLA operations and different compute patterns. We did the design of this accelerator with Catapult as well; it runs at 2 GHz in the GlobalFoundries 14 nm technology. We also simulated the performance of this accelerator in the gem5 simulator under 256 GB/s of bandwidth, so here we are actually using a real memory performance model. This table shows the comparison of our versatile accelerator against state-of-the-art accelerators on their most performant sparse linear algebra operation: our accelerator delivers up to 14x speedup and 7x bandwidth efficiency improvement compared to the state-of-the-art devices.
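To give a back-of-the-envelope flavor of the roofline-style reasoning the analytical model builds on, here is a toy estimator in plain C++. The structures, the placeholder numbers, and the simple min(peak, OI x bandwidth) bound are illustrative simplifications; the actual model also accounts for tiling reuse, format overheads, and the specific architecture parameters.

```cpp
#include <algorithm>
#include <cstdio>

struct Candidate {
    double ops;            // total arithmetic operations for this compute pattern
    double bytes_moved;    // total off-chip traffic, including reloads from tiling
};

struct Arch {
    double peak_gops;      // peak compute throughput (Gop/s)
    double bandwidth_gbps; // off-chip bandwidth (GB/s)
};

// Roofline bound: throughput is limited by either compute or OI * bandwidth.
double predict_gops(const Candidate& c, const Arch& a) {
    double oi = c.ops / c.bytes_moved;                    // operations per byte
    return std::min(a.peak_gops, oi * a.bandwidth_gbps);
}

int main() {
    Arch arch{512.0, 256.0};           // assumed 256 GB/s HBM system, placeholder peak
    Candidate inner{2.0e9, 8.0e9};     // hypothetical inner-product SpGEMM traffic
    Candidate rowwise{2.0e9, 3.0e9};   // hypothetical row-wise (Gustavson) traffic
    std::printf("inner-product: %.1f Gop/s, row-wise: %.1f Gop/s\n",
                predict_gops(inner, arch), predict_gops(rowwise, arch));
    // The pattern with the higher predicted throughput (highest OI) is chosen.
    return 0;
}
```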
Finally, I would like to discuss what we think would be worthwhile enhancements to modern HLS tools.

The first is a memory model, because a memory model is essential for evaluating such memory-bound sparse designs. To model memory, the memory model has to run concurrently with the user logic to service requests. As this example shows, if we use traditional C simulation methods, the whole simulation just gets stuck in one of these loops. What we really need is the ability to run the test memory and the user logic concurrently: they don't have to be truly parallel, but they have to run in a concurrent way. This has to be enabled in C simulation to capture the golden results of the user logic; otherwise you are forced to fall back to RTL simulation. In RTL simulation, the user logic can be generated by normal high-level synthesis, and for the memory model I imagine the tool could generate non-synthesizable RTL code just for modeling the performance of the test memory. Since this memory is only used for performance testing, there is no need for it to be synthesizable; in this example I'm using delay statements in Verilog.

The other thing we would like to discuss is how to generate an on-chip buffer organization using HLS. Here I'm mainly referring to automatically generating the partitioning logic, the interleaved address mapping, or the banking mechanism I just showed in my design. Ideally we would like to have something like this: we have a buffer, which is an array in HLS, and we attach a pragma or maybe a directive to that array saying that we would like to bank it into N parts with an interleaved address mapping and use round-robin arbitration to resolve any potential bank conflicts. If this were possible, the rest of the design would be much simpler: we would just access this banked buffer exactly as we would if it were not banked, and we hope this could generate those shuffle units automatically for us. These are the two things we found missing in both Vitis and Catapult. They are hard problems for HLS, but it would be really good to have them for building more complex, larger-scale systems.
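For illustration, the kind of source code being asked for might look like the sketch below. The hls_bank directive in the comment is hypothetical; it does not exist in Catapult or Vitis today, which is exactly the point, and the buffer size and loop are placeholders.

```cpp
// Hypothetical banked-buffer directive: the tool would split the array into
// banks, interleave the address mapping, and synthesize the arbitration
// (i.e., the shuffle units) automatically. Nothing like this exists yet.
void pe_cluster(const int* col_idx, const float* mat_val, float* result, int nnz) {
    static float vec_buf[32 * 1024 / sizeof(float)];   // 32 KB on-chip vector buffer

    // Desired (non-existent) directive, shown as a comment on purpose:
    // #pragma hls_bank variable=vec_buf banks=8 mapping=interleaved arbiter=round_robin

    for (int k = 0; k < nnz; ++k) {
        // The source simply indexes the buffer as if it were monolithic;
        // bank conflicts would be resolved by tool-generated arbitration.
        result[k] = mat_val[k] * vec_buf[col_idx[k]];
    }
}
```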
To conclude my talk: the first takeaway is that hardware specialization is a very promising way to do efficient sparse processing, because we have the ability to customize the memory hierarchy and the data layout, which leads to higher bandwidth utilization. Second, it is possible to build an entire system using High-Level Synthesis, and HiSparse can serve as a useful reference; there is a link to the code base of the HiSparse ASIC using Catapult, but this GitHub repository is not open source yet, so if you are interested in looking at the code you can contact us by e-mail. Finally, we discussed some potential enhancements to modern HLS tools, mainly about how to model the behavior and performance of memory in C simulation and how to generate different on-chip buffer organizations automatically with HLS.

I would like to thank my contributors and collaborators within Cornell, Hanchen Jin, Zhao Yuan, Yuan Jiang, and Chang Dong, and also outside Cornell, Dr. Nitish Srivastava from Google, Yuwei Hu, and Professor Jason Cong from UCLA. I would also like to specially thank Siemens EDA for inviting me to give this talk, and many thanks to Sandeep and Mathilde for organizing this event. Finally, I want to thank our sponsors for supporting our research. With this, that will be the end of my presentation, and I'm happy to take any questions.

Actually, there are already a few. Yes, quite a few. Thank you so much for that presentation, Yixiao. In a moment, Yixiao and our other experts will answer your questions live, so please continue submitting those via the Q&A widget. We also have a few poll questions and would appreciate your feedback on those and on future presentation topics. Once again, please use the interactive widgets at the bottom of your screen to submit your questions and download the slide deck. The recorded version of the webinar will be available on demand within 24 hours; you will receive an e-mail with the link to view the recording at any time thereafter. I will now pass it over to you, Yixiao, and our experts for our live Q&A.

Alright. So, should we start at the beginning? I'll go ahead and ask you the first question, Yixiao: how would you compare Vitis HLS and Catapult HLS in the context of your sparse application?

Sure, good question; I mentioned it briefly during my presentation. There is not a huge difference in terms of expressiveness or ease of use when building the sparse accelerator: you have to deal with the randomness and the data layout manually, and that's common with any HLS design tool. The main difference is that Vitis relies more on pragmas inserted in the source code, which might make the design more readable but also makes the source more design-specific. In Catapult, the optimizations are mainly given by applying directives to objects or operations in your design, and the directives are written in a separate script, which may make the source code less intuitive in terms of what optimizations you are applying, but the good point is that you can use the same piece of source code to generate different kinds of designs without editing it. That is, I personally think, the biggest difference between Vitis and Catapult; in terms of using them, they are pretty much the same.

OK, cool. Can you address what factors make the FPGA performance so much lower than the ASIC? Sure. The main factor is frequency: the FPGA design runs at around 230 MHz, while the ASIC runs at 1 GHz. That is the main difference. Also, the ASIC design is evaluated in simulation, which might remove some overheads; that can be another small reason.

OK, cool. What are the real-life applications that involve sparse matrices and vectors? Could any matrix be decomposed into a sparse matrix through some transformation? How could real data be transformed to be acceptable by the accelerator? And what is the expected range of computation throughput enhancement? OK, there are actually several questions in there, thanks for asking them. For the real-world applications, let me pull up one of my backup slides. Graph processing is a really good representative of sparse matrix-vector multiplication. These are three representative graph processing applications: BFS, shortest path, and the PageRank algorithm developed by Google. If you are familiar with PageRank, you know that in the back end it is calling SpMV, and for the other two workloads there is a framework called GraphBLAS that expresses graph workloads in the language of sparse linear algebra. So basically you can do all of this with sparse linear algebra, and although it is not the focus of this talk, all three of these algorithms are supported by GraphLily using sparse matrix-vector multiplication. So this can serve as an example of real-world applications.
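As a concrete illustration of that mapping, one PageRank iteration written directly as SpMV plus element-wise work might look like the sketch below (plain C++; no dangling-node handling, and this is not the GraphLily code).

```cpp
#include <vector>

// One PageRank iteration: rank_out = (1 - d)/n + d * (A * rank_in).
// Row i of the CSR matrix lists the in-neighbors of vertex i, with
// val[k] = 1 / outdegree(neighbor), so the inner loop is plain SpMV.
void pagerank_iteration(const std::vector<int>&   row_ptr,
                        const std::vector<int>&   col_idx,
                        const std::vector<float>& val,
                        const std::vector<float>& rank_in,
                        std::vector<float>&       rank_out,
                        float damping = 0.85f) {
    const int n = static_cast<int>(rank_out.size());
    for (int i = 0; i < n; ++i) {
        float acc = 0.0f;                                    // SpMV part
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            acc += val[k] * rank_in[col_idx[k]];
        rank_out[i] = (1.0f - damping) / n + damping * acc;  // element-wise part
    }
}
```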
As for decomposing a matrix through some transformation, there might be a way. I know there are ways to transform a matrix into some triangular form, and if there are enough zeros in the triangular form it might be worthwhile to treat it as a sparse matrix. But real-world sparse matrices just naturally come out as a representation of connections or relationships, like graphs, so the sparsity, or rather the actual data layout of the sparse matrix, is not something you can easily manipulate for many sparse workloads. As for transforming, I do hear of transformations like condensing a sparse matrix into a dense one, so the dimensions of the new matrix are smaller but it is denser. There is a trade-off there: the sparser the matrix, the more irregularity we have in the compute, but the denser the matrix, the more compute we have to do. So it is a trade-off, but yes, it is generally possible.

And how could real data be transformed to be acceptable by the accelerator? That goes back to our format slide; let me play all the animations here. To transform a real matrix into this format, assume we are starting from CSR. We just need to find the end location of each row, maybe I should use the pointer, like here, and here, and here, and finally at the very end, and insert next-row markers at these locations; after that we can get rid of the row pointer array entirely. We also need to change the memory layout: previously we were storing the matrix row by row, but here we do it in a cyclic-partition manner. So you need to know how many processing elements there are in your system and, on the fly, row by row, assign rows cyclically to PEs and also to channels. That is how we construct our matrix format.

And the final question: what is the expected range of computation throughput enhancement? I'm not sure I understand it correctly; are you asking about the expected speedup compared to previous designs? If it's about speedup, I can say it depends on the input matrix. Like I said, sparse workloads are pretty irregular, so when switching to a different matrix the actual performance improvement can vary a lot. Let me switch to this slide. You can see here, on the pokec dataset, if you look at all the other designs, they basically behave in a similar way: on most of the datasets we perform pretty well, but when it comes to pokec the performance drops by nearly half.
That is mainly decided by the nonzero distribution in the matrix, and it's really hard to predict given your target application. That is also a main motivation for building the versatile accelerator: we would like to tackle this problem with a modeling and resource-sharing approach, to make the accelerator suitable for a larger range of matrices with different sparsities. So if you want to calculate a speedup, it can be as low as two or three times, something like that; I'm talking about the FPGA prototype here. With the ASIC there is no way to compare directly against the prototype, because it is still missing some necessary parts of the system, but you can see the same pattern: it totally depends on the matrix. Sometimes it's two times faster, sometimes it's three times or even 3.5 times faster. I think that answers the question.

All right, next question: did you compare HiSparse to state-of-the-art sparse accelerators? Yes, we did. The reason I didn't show it in this presentation is that this presentation is more focused on the ASIC design with High-Level Synthesis, and we know that is not a fair comparison because the ASIC design runs on ideal memory. But we did compare to state-of-the-art accelerators in the HiSparse paper, although against FPGA baselines; if you are interested you can go back and check the numbers listed in the paper.

OK, great. Was the FPGA implementation entirely manual? Yes, entirely manual. The code is handwritten, and the floorplanning and all the other implementation steps are manually tuned to give us better frequency and timing.

OK, awesome. Do you do the CPSR conversion ahead of the computation, and will this address load balancing? Good question. Yes, we do the conversion ahead of computation, but that conversion happens only once. If you think about graph algorithms, the structure of the graph is given and will not change across many calls of the same algorithm. For example, PageRank is an iterative algorithm, so it takes several iterations to finish on the same graph; the typical iteration count is tens, maybe twenties or thirties, of iterations, and the preprocessing overhead can be amortized across those invocations. And if you are using a sparse matrix for machine learning, the overhead can probably be ignored, because the same machine learning model will be running in the cloud or on the edge for hours and days, processing thousands, maybe millions, of input instances. We also measured and evaluated the preprocessing overhead in the HiSparse paper, so you can look there and see exactly how many milliseconds it takes to transform the original matrix into our CPSR format.

As for load balancing, that is actually the main reason we assign rows cyclically to different processing elements; you can see here we are assigning rows cyclically to different PEs, and that is mainly designed for load balancing. There is also a previous study, let me see, OK, I mentioned MatRaptor here.
MatRaptor is also one of our previous works, and it analyzed this problem of load balancing and found that if you assign work cyclically, the load across different PEs ends up pretty well balanced. Based on that, MatRaptor also developed its own matrix format for row-wise product.

OK, great. Moving on: did you benchmark your design with matrices from the sparse matrix collection, from fluidics or FEM applications? So these matrix datasets, let me find that slide, here it is, mainly come from SuiteSparse and OGB, which stands for Open Graph Benchmark. We picked some real-world graphs, as you can infer from the names: mouse-gene is the gene network of a mouse, and ogbn-products is a product relationship graph from the Open Graph Benchmark, typically used for recommender systems. So these matrices cover engineering and science as well as real-world graphs, social networks, and so on.

OK, awesome. If the FPGA was not built using HLS, what did you use for that part? Here I need to clarify: the point I wanted to make is that the FPGA prototype was not built using Catapult HLS, because we are targeting a Xilinx device and had to use Vitis instead. Previously there was a question about the difference between Vitis and Catapult, and I answered that there is pretty much no big difference between them. So we also used HLS for our FPGA prototype, just a different tool.

OK, great. You said timing was met; was that based on fully annotated post-physical-design data, or just timing data out of Design Compiler? That is actually post-physical-design data. Did you place and route it? Yes, we placed and routed it, so that is post-physical-design data; the reports I looked at during the flow are post-physical-design data. And that is for the entire system: we are not doing what some people do, closing timing part by part; we push the entire design through the flow.

Cool. Is the NoC designed using HLS? Let me clarify. If we're talking about the shuffle unit, yes, it is designed in HLS. As for the NoC, the network-on-chip I'm showing in this figure from the last part, there we are referring to a mesh NoC topology.

OK, great. Did you iterate a lot to find the right architecture? Did Catapult help you with that? Since we're here, let's talk about the versatile accelerator first. In that design, the main challenge is how to integrate different compute patterns and different kernels together, so we mainly rely on our performance models instead of using Catapult to evaluate them. But coming back to the HiSparse ASIC design, we did several design iterations using Catapult, and like I mentioned before, the ability to generate different designs from the same piece of source code helped us a lot.

Awesome. OK: were trade studies performed to determine the suitability of untimed C++ versus SystemC with clocking? I have to say we didn't try SystemC, because writing C++ is what we believe is more natural to do.
In terms of clocking, I can think of this: SystemC allows you to use different clocks for different modules in your design, whereas with C++ the tool will assume the whole design is synchronized under the same clock. But I remember that in Catapult you can also assign different clocks to different C functions or blocks using directives. So I can only say that in terms of expressiveness or flexibility there should be no difference between them, but I'm not sure, since we didn't try a SystemC design.

OK, great. Is there any example design using this in a real CNN, e.g., AlexNet and so on? I can say there are machine learning accelerators that leverage sparsity, but we haven't seen a real-world CNN design that uses this particular design. There are, however, lots of CNN accelerators that also leverage sparsity in their weights.

OK, groovy. What were some of the reasons for you to use HLS and not manual RTL? Would you continue using HLS in your future research projects as well? Also a good question. We chose HLS mainly because of its productivity: you write fewer lines of code and you can focus more on the high-level functionality or behavior of the design. For example, if you have really done RTL design, you know it is such a pain to deal with multi-cycle SRAMs, whereas in HLS you just put an array there and the tool does everything for you. That is basically why we picked HLS, mainly for productivity. And yes, we will continue to use High-Level Synthesis in our future research projects, because it makes a huge difference compared to RTL.

Awesome. Let's see here: how difficult or easy was it for you and the rest of the research team to ramp up on Catapult HLS? I would say that if you are familiar with architecture design or RTL design, it is pretty easy to get hands-on with High-Level Synthesis, because it is easy to understand what kind of hardware the code is generating or representing. It might be a little harder for a person who comes from a software background; those people treat the C code differently from how we hardware engineers see it. That is also why I was, not exactly proposing, but discussing the potential enhancements to modern HLS tools, especially the memory-modeling part. For someone from hardware like me, it might be straightforward, or at least not that hard, to build the shuffle units, but someone from the software domain might know what banking or interleaving means yet not know how to do it in HLS. So the answer depends on the person's background: with a strong hardware background it would be fairly easy.

Alrighty. Is CPSR done ahead of computation? Does this eliminate load imbalancing? We just saw a similar question; I think we covered that: it helps load balancing, but imbalance is not completely eliminated.

OK. For the C++ memory emulator that you mentioned, would you, for example, envision it as an additional module for Algorithmic C? Yeah, it would be good for that to be an additional module alongside the AC libraries.
Although I would say it might not necessarily be part of Algorithmic C itself, because it's not actually doing arithmetic. Maybe it would be part of a different library provided by the HLS tools that gives you a class representing an abstract memory model, where you can define key parameters like read latency, write latency, burst lengths, and so on. And for this green box of non-synthesizable RTL, the tool would just pull some RTL snippets from that library and swap in non-synthesizable but simulatable RTL. So it would be good to have that as part of a library.

OK, cool. The HiSparse ASIC is specific to a single application, which is SpMV. What changes can you make in the design so that it becomes more general for any sparse application? Thanks, good question. Speaking of generalizing HiSparse, there are actually two ways of doing that. One way is that we can already use SpMV to do inner-product-based SpMM; that is one way of extending it beyond SpMV. The other way is to make some modifications to the architecture: we can take out the essential parts, like the shuffle unit, which is essential for handling random accesses, and that is pretty common across all sparse workloads. Actually, in GraphLily, which I mentioned before, there is also an SpMSpV accelerator; it is fairly small compared to the SpMV one, and it is built by reusing parts of the SpMV accelerator. So in this way you can also extend HiSparse: take out its building blocks and plug and play them together to form your own sparse linear algebra accelerator.

OK. Have you experimented with other tools to increase productivity, for example Chisel? If so, why did you choose HLS in the end? This is a similar question. Chisel is essentially doing high-level RTL, and to be honest we didn't do much investigation into Chisel, but we do believe HLS gives us higher productivity compared to Chisel or other high-level RTL tools; that is the main reason. Also, with HLS you are using C++ to write your test bench as well, which may be better for software-hardware co-design: you can write your preprocessing code and any accelerator scheduling code in C++, so later, when your hardware design is turned into something, maybe not necessarily silicon, maybe a prototype, you can reuse the same test bench and slightly change it into your real application code that calls the accelerator. That is another reason I can think of for why we chose HLS.

OK, thank you so much for answering that. Just a couple more questions; I think we'll stay on a little longer if that's OK with you. OK: it would be interesting to run on an RTX 4090, with memory clocked at 10.5 GHz and a peak clock of 2.8 GHz, and look at latency. Right, that's actually interesting. Let me answer this question this way. We know that sparse workloads are bandwidth bound rather than compute bound, so the real question is how much memory bandwidth is available on an RTX 4090. It's one of the latest cards; I remember the bandwidth on the 4090 is close to 600 or 700 gigabytes per second.
So if we ran the same workload on that graphics card, the performance would definitely be much higher. But if you calculate the bandwidth efficiency, I mean throughput per unit bandwidth, that is actually an interesting comparison to make: it might be higher than the 1080, but it might not be as high as the FPGA or ASIC designs with their custom memory layouts.

OK. Which data type was used in the designs for benchmarking? Are there major differences between using fixed versus floating point? Another good question. To fully answer this would take a bit longer, so let me try to explain in a few sentences. We are using fixed point for all the designs, and the real difference, let me go back to this slide, is in this data forwarding technique. With fixed-point designs we are able to make this addition happen within one cycle, so the entire data forwarding scheme works. With floating point on FPGAs, you definitely need multiple cycles to do that addition, so you have to stall the pipeline when accumulating into the same value, and that hurts the performance a bit. On ASIC designs, however, depending on your technology node and your target frequency, it is also possible to make the floating-point addition fully combinational. We actually tried that: we pushed the design with a combinational floating-point adder, here in this diagram, under a 1 GHz clock, and the slack is only -0.3 nanoseconds. Negative, so we didn't meet timing, but it is really hopeful: if we tighten the timing target, probably to 1.5 GHz, there may still be negative slack at that target, but we would be able to run the design on a 1 GHz clock. In that case it would deliver the same performance as the fixed-point design. So it is really an implementation-specific problem. On FPGAs, where you don't have the ability to meet timing with a combinational adder, there are other ways to deal with this carried dependency; we also discuss them in the HiSparse paper. With floating point they hurt the performance or the resource utilization a little bit.

OK. Has this been used in any CNNs? If so, how much acceleration did you get? That is similar to a previous question. We didn't specifically apply this accelerator to CNNs; this accelerator mainly targets really sparse workloads, for example graphs, where the density can range from around 10^-8 to 10^-6. For CNNs the density is around 10^-2, let me pull up this slide, right, around 10^-2, maybe a few percent of density. So we didn't explicitly apply our accelerator to CNNs, but there are CNN accelerators that leverage sparsity to accelerate.

OK, cool. Alright, just a couple more: how does BFS compare to DFS methods for your research? Let's see, I'm not quite sure about this question: is it about how we conduct our research, or about why we talk about BFS and DFS?
The reason I mentioned BFS and DFS before is that there was a question about whether there are any real-world applications using these sparse kernels, and I answered yes and showed an example: BFS is representative. It might also be possible to express DFS in terms of sparse linear algebra, but the question is how efficient that is. BFS in sparse linear algebra is pretty efficient: it boils down to sparse matrix-vector multiplication with some small element-wise kernels, and that's it. DFS might be more complicated. If you're asking about how we conduct our research, that is a more general question, so maybe I can talk about it with the example of exploring the design space. When we are doing design space exploration, it is actually a BFS-like process: in the beginning you want to explore more possibilities, like trying different optimization directives on the same piece of code in Catapult, but as the design stage goes further you become more confident about what kind of optimizations you want to use. You finally decide the architecture: we decided to share the vector, we decided to do banking, and after that you go deeper and deeper, optimizing the pipeline and doing really fine-grained tuning, like balancing the latency between the feed-forward and feedback paths. Yeah, that's it.

OK. You used HBM; would this accelerator have an advantage if you used conventional DDR memory? This is a really good question, by the way. I think the key part is: an advantage with respect to which platform? What is the baseline you are comparing to? If we use DDR for our accelerator and compare it to other platforms that also use DDR, then it will be pretty much the same, a fair comparison. If we compare our accelerator using DDR, with less bandwidth, to an HBM one, then it is not fair to just look at the performance numbers; we have to compare bandwidth efficiency. And that is only about performance: the advantage can also show up in other aspects, like energy efficiency. If you use fewer channels with less bandwidth, you can probably save some area on the memory controllers, or save some energy, because with less bandwidth you have less compute to do. So it really depends on your design target, and it is hard to decide what "advantage" means in different application situations. For high-performance computing, where you want performance to be as high as possible and you don't care too much about energy, then using HBM will be a better choice than DDR.

OK, cool. Did you look at the latency performance? It would be interesting to compare running on an RTX 4090, or wait, did we already answer that one? I think we did, although I think the question is more geared toward latency rather than throughput, so maybe you can comment on that. In our comparison we are showing throughput here, but initially we did measure the latency of processing one graph. As for running on that RTX graphics card, I think the major problem is how to get the card: it has just been released, so we would take it, assuming it doesn't melt; there have been some problems with the power delivery, where sometimes the power pins just melt.
Right, so there have been some problems with that latest graphics card, the latest GPU. As for latency: the raw data we measure is latency, but when making comparisons it is more common to compare throughput, because throughput can be averaged and used as a representative number for your design, assuming it will carry over to other datasets. If you measure latency, the comparison is limited to exactly those datasets you picked. That is basically why we focus more on throughput, but yes, to begin with you have to measure latency to calculate the throughput.

OK, great. And how did you measure power? Let me go to the slide, here it is. For the ASIC design, this power is an estimate from Design Compiler after the physical implementation. We are not feeding a simulation trace into something like PrimeTime to get an exact power number, because that would take too much time; we are still at an early stage of the design, we haven't evaluated it with a real memory model, and we are missing several key components in the system, so we didn't do that power simulation for now. For the other designs, FPGA, GPU, and CPU, we have vendor-provided tools that can directly report the running power of those platforms.

OK, great. Last two questions and then we'll conclude today. Have you thought of combining your work with compute-in-memory? A great question. I think this architecture might be suitable for compute-in-memory or processing-in-memory, PIM, architectures. With PIM you typically exploit even more data-level parallelism: instead of a few tens of memory channels, you have access to maybe tens, twenties, or hundreds of DRAM banks or DRAM ranks, depending on what your PIM architecture is. So I think this architecture can be applied to PIM, but you would have to design a specialized data layout, namely a matrix format, that is suitable for exploiting the data-level parallelism of compute-in-memory architectures. That can be difficult, because the number of ranks and banks is typically at the level of hundreds in a real memory system.

OK, I think that about does it for today. All right, very good. Well, thank you everyone for attending today, and thank you, Yixiao, of course, and our experts for answering all of the questions we received. This is going to conclude today's webinar. We will leave the event running for another minute to give you a chance to submit any last-minute questions, which we will answer via e-mail, and also to download any of the resources in our resources widget. Thank you for attending and have a wonderful day or evening.

Sparse linear algebra (SLA) operations are essential in many applications such as data analytics, graph processing, machine learning, and scientific computing. However, it is challenging to build efficient hardware accelerators for SLA operations since they typically exhibit low operational intensity and irregular compute and data access patterns. In particular, some of these challenges are not well studied in the context of High-Level Synthesis (HLS). 
 
In this talk, we first introduce HiSparse, an accelerator for sparse-matrix dense-vector multiplication (SpMV). To achieve a high bandwidth utilization, we co-design the sparse storage format and the accelerator architecture. We further demonstrate the use of Catapult HLS to build a high-throughput pipeline that can handle irregular data dependencies and access patterns. Building on our SpMV accelerator, we further develop a versatile sparse accelerator that supports multiple SLA operations with run-time configurability across different compute patterns. Our architecture design is guided by a novel analytical model which enables rapid exploration of the design configuration search space. According to our evaluation using both HLS implementation and simulation, our sparse accelerators deliver promising speedup with increased bandwidth and energy efficiency when compared to CPU and GPU executions.
