Hello everyone, and welcome to today's webinar: architectural improvements for low power and functional safety of dataflow CNN accelerators using HLS. My name is Mathilde Karsenti, and I'm a marketing programs manager here at Siemens. Before we start, I'd like to go over a few items regarding the interface. At the bottom center of your browser you'll see icons in a widget dock. These are the interactive tools you will use throughout the webinar; you can minimize, maximize, and scale each widget to customize your viewing experience. I'd like to call your attention to the Q&A widget. The Q&A window should be open and is located on the left side of your screen. You're encouraged to submit questions to the presenter throughout the webinar using this widget; your questions will be answered during the Q&A session right after the presentation. Please be aware that questions you submit will disappear from the window, but they will come back if answered by text. Second is the resources widget, located at the right side of the widget dock, where you will find the PDF slide deck as well as other resources and links. Right next to that widget you will see the help widget; if you are experiencing any technical issues, please click on it for assistance. As an FYI, a simple refresh of your browser generally resolves any issues you may be experiencing. This presentation is being recorded, and you will receive a link to it in a follow-up e-mail later today. If we are not able to get to your questions during the live Q&A, we will respond to them via e-mail.

Our presenter today is Dionysis Philippus. Dionysis is a PhD student in Electrical and Computer Engineering at Democritus University of Thrace. He received his diploma and Master of Science degree from the same department in 2019 and 2021, respectively. His experience involves the design of networks-on-chip and customized hardware accelerators for data clustering algorithms. His current research focuses on the design of power-efficient convolutional neural network accelerators, as well as the design of customized floating-point arithmetic units using HLS. Now I will hand it over to Dionysis for today's presentation.

Thanks, and thanks for the introduction. Today I will present some of the work we are doing on accelerators for convolutional neural networks.

First of all, convolutional neural networks are very important today. They can be found in many machine learning applications, such as classification, image segmentation, and object detection, almost everywhere, and their use keeps growing. Overall, the architecture of a convolutional network is composed of smaller layers: each convolution layer consists of the convolution itself, a nonlinear function, and some normalization or pooling, but the dominant part of the layer is the convolution itself, so everyone tries to optimize this specific part of the CNN layer to make it more efficient.

To improve convolutional neural networks, many different architectures have been proposed. Two of the most important are spatial dataflow accelerators and systolic accelerators. In spatial dataflow accelerators, each layer is mapped onto dedicated hardware: there is a dedicated engine for each layer, composed of dot-product units and a local memory for the weights.
Each layer computes its result and then pushes its output to the next layer, so every layer has its own hardware. On the other hand we have systolic accelerators, where the convolution is transformed into a general matrix multiplication problem. A DMA engine pushes the data in the correct order into an array of processing elements, where the computation takes place. This array is in fact an array of multiply-accumulate units, which compute the outputs and send them to the output buffer.

In our work we target spatial dataflow pipelined accelerators, and, as we will see later, we try to optimize some not-so-common convolutions. But before that, let's look at the overall architecture of these accelerators. For each layer we have a dedicated engine composed of input-feature line buffers, where the streamed input is stored until it is no longer needed; a window buffer, where the inputs that participate in the current output are kept; and dot-product units, where the input features from the window buffer and the weights are multiplied and accumulated to produce the output. These architectures are common because they optimize memory reuse and data reuse.

In this presentation I will first talk about how we can optimize the buffering for strided and dilated convolutions so that we can reduce power consumption. In addition, we will look at another aspect of convolutional networks, their functional safety, and how we can design efficient checkers to verify their correctness. Lastly, I will talk about some floating-point arithmetic operators that we designed to help us in these implementations.

All of our work is based on high-level synthesis: our designs are written in C++ and synthesized using Catapult HLS. This approach helped us a lot to check the correctness of our designs quickly, as C++ simulations are much easier and much faster than RTL simulation. In addition, the SCVerify flow allowed us to verify the RTL produced by Catapult. And lastly, thanks to the available directives, we can perform a very wide design-space exploration and examine many different architectures of our design, in order to select the best one or to see the differences between similar architectures.

Now let's start with the spatial variants of convolution. In general, a convolution can be seen as a kernel sliding over the input feature map. I will present all the examples for a 2D convolution, but everything I will describe can easily be scaled to a whole CNN layer. Every time all the pixels that participate in an output are available, that output is computed. In the pipelined dataflows that we study, the pixels are pushed into the engine in a streamed fashion: every time a new pixel arrives it enters the active region and an output is produced, then the next pixel arrives and the next output is computed, and this continues until the end of the convolution.
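To make the streamed engine concrete, here is a minimal, purely illustrative C++ sketch of a 3x3 convolution over a row-wise streamed single-channel image, using two line buffers and a 3x3 window buffer. The names, the fixed image size, and the plain float arithmetic are assumptions made for the example and are not the presenter's actual HLS code.

```cpp
#include <cstdio>

// Illustrative sketch (not the presenter's code): a 3x3 convolution over a
// row-wise streamed image, using two line buffers plus a 3x3 window buffer,
// as in the spatial dataflow engines described above.
constexpr int W = 8;          // image width (assumed fixed for the sketch)
constexpr int H = 6;          // image height
constexpr int K = 3;          // kernel size

float line_buf[K - 1][W];      // the two most recently completed rows
float window[K][K];            // pixels that participate in the current output
float weights[K][K] = {{1, 0, -1}, {2, 0, -2}, {1, 0, -1}}; // example kernel

// Called once per arriving pixel; returns true when an output is produced.
bool push_pixel(float pix, int row, int col, float &out) {
  // Shift the window one column to the left.
  for (int r = 0; r < K; ++r)
    for (int c = 0; c < K - 1; ++c)
      window[r][c] = window[r][c + 1];

  // New rightmost column: two older rows come from the line buffers,
  // the newest row is the arriving pixel itself.
  window[0][K - 1] = line_buf[0][col];
  window[1][K - 1] = line_buf[1][col];
  window[2][K - 1] = pix;

  // Update the line buffers for future rows.
  line_buf[0][col] = line_buf[1][col];
  line_buf[1][col] = pix;

  // A full KxK window exists once enough rows and columns have arrived.
  if (row >= K - 1 && col >= K - 1) {
    float acc = 0.0f;
    for (int r = 0; r < K; ++r)
      for (int c = 0; c < K; ++c)
        acc += window[r][c] * weights[r][c];   // dot-product unit
    out = acc;
    return true;
  }
  return false;
}

int main() {
  for (int row = 0; row < H; ++row)
    for (int col = 0; col < W; ++col) {
      float out;
      float pix = static_cast<float>(row * W + col); // dummy streamed input
      if (push_pixel(pix, row, col, out))
        std::printf("out[%d][%d] = %g\n", row - K + 1, col - K + 1, out);
    }
}
```

Each arriving pixel shifts the window one column; the two older rows of the new column come from the line buffers, and an output is produced once a full 3x3 window of the image is available.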
Moving on from this basic convolution flow, which has been studied a lot and for which many accelerators have been proposed, we tried to study some variants of the convolution that are maybe less common, and to find ways to optimize their implementation. First of all, we have strided convolutions, where the operation is very similar to the standard one, with the main difference that the filter now slides with a bigger stride over the input image. As the animation shows, the output is reduced compared to the original output, and the kernel slides with the bigger stride along both the x and the y axis of the input. The next spatial variant is dilated convolution, where the filter itself is inflated with zeros. Looking at the example, the 3x3 kernel is now applied to a much bigger part of the input image, and the intermediate pixels of the image, the ones not marked with an X, are simply multiplied by zero values. In this kind of convolution the output is again reduced; it is like using a 5x5 filter with a lot of zeros in it. This operation also continues until the end of the input.

Now let's see how we approached the first variant, the strided convolution. Our main idea is based on the fact that, as the kernel slides with a bigger stride over the input, each coefficient of the filter is multiplied with only a part of the input image. Looking at the different colors, the blue pixels of the input will only ever be multiplied with the blue coefficients of the filter, and similarly each of the other colors only multiplies its matching color. In that way we can split the whole operation into smaller convolutions, each one consisting of the pixels we grouped together and the part of the filter that those pixels convolve with. By performing a standard convolution, now with a stride equal to one, on each of the smaller groups of inputs and filters, we can recreate the output of the strided convolution by adding the individual partial results.
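This decomposition can be sanity-checked with a few lines of plain C++. The sketch below is illustrative only (the sizes, names, and random test data are assumptions): it computes a stride-2, 3x3 convolution directly and then as the sum of the four stride-1 sub-convolutions over the decomposed input and kernel groups, and asserts that the two results match.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Illustrative check of the stride-S decomposition described above (not the
// presenter's HLS code). Input pixels and kernel coefficients are grouped by
// (row % S, col % S); each group is convolved with stride 1 and the partial
// results are added to recreate the strided output.
constexpr int S = 2;   // stride
constexpr int K = 3;   // kernel size
constexpr int N = 10;  // input size
constexpr int O = (N - K) / S + 1; // output size of the strided convolution

float in[N][N], w[K][K];

// Reference: direct strided convolution.
float direct(int i, int j) {
  float acc = 0.0f;
  for (int m = 0; m < K; ++m)
    for (int n = 0; n < K; ++n)
      acc += w[m][n] * in[i * S + m][j * S + n];
  return acc;
}

// Decomposed: sum over the S*S (a,b) groups of stride-1 sub-convolutions.
float decomposed(int i, int j) {
  float acc = 0.0f;
  for (int a = 0; a < S; ++a)
    for (int b = 0; b < S; ++b)
      for (int p = 0; a + p * S < K; ++p)            // kernel rows in group (a,b)
        for (int q = 0; b + q * S < K; ++q) {        // kernel cols in group (a,b)
          float sub_w  = w[a + p * S][b + q * S];              // kernel group
          float sub_in = in[(i + p) * S + a][(j + q) * S + b]; // input group
          acc += sub_w * sub_in;
        }
  return acc;
}

int main() {
  for (int r = 0; r < N; ++r) for (int c = 0; c < N; ++c) in[r][c] = std::rand() % 7;
  for (int r = 0; r < K; ++r) for (int c = 0; c < K; ++c) w[r][c]  = std::rand() % 5 - 2;
  for (int i = 0; i < O; ++i)
    for (int j = 0; j < O; ++j)
      assert(direct(i, j) == decomposed(i, j));
  std::puts("strided decomposition matches the direct convolution");
}
```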
Our target is to find a way to efficiently share the resources of a standard engine in order to perform this operation. If we look at a standard engine with its line buffers and window buffer, as I described earlier, then for a standard convolution each pixel inside the window buffer is shifted to the register next to it, and pixels from the line buffers are pushed into the window buffer in order to produce an output. The same design, however, can be changed to implement a convolution with a bigger stride: the line buffers can now push inputs to different registers of the window buffer, and by grouping them with colors we can see how the same hardware can be shared to compute a convolution with stride two or stride three.

To get there, we start from the assumption that we have a separate, smaller standard engine for each of the channels created by splitting the initial input; for example, for a stride of two we have a different engine for each of the four channels. Because the input is streamed and the arriving inputs belong to the same row, channels A and B, which refer to the same row of the input, can be unified: as the inputs arrive, one pixel goes to channel A and the next goes to channel B. In that way we merge the line buffers of the two channels, and we add some multiplexing in order to forward the data to the correct window buffer. Moving on, we can see that the number of registers implementing the four separate window buffers is equal to the number of coefficients of the kernel, since they all refer to the same initial convolution. So we rearrange these registers and create a single, bigger window buffer that is shared. And now we arrive back at the same picture: the same initial engine, but with multiplexing that allows us to efficiently move the pixels into the corresponding registers in order to produce an output. In that way we can greatly reduce the switching activity inside the engine.

To explain this, I will show an example of a 3x3 kernel convolved with an input with a stride of two. In order to compute the first output, we need all the pixels inside the red square to be available. As the pixels of the first row start to arrive, they are pushed into the line buffers, and channel A is activated whenever the arriving pixel belongs to channel A. Then the second input arrives; it is pushed into the same line buffer, but now channel B is active, as this pixel belongs to channel B. The same operation continues until the end of the first row, and during that time only one line buffer is active, because all the arriving pixels refer to it, so we can gate the rest of the design. In addition, the window buffer remains gated, since we do not yet have all the pixels required for an output, so there is no need to push data into it. Moving to the second row, the window buffer again remains gated, as there is still no output to produce, but now the arriving inputs are pushed into the second line buffer, and channels C and D are activated in turn, instead of A and B. So again only one line buffer is active. When we move to the third row, the arriving pixels complete the red square and the first output can be computed. As the new pixels arrive, they are pushed into the window buffer as well, and in addition they update the line buffer that corresponds to their row. While the pixels arrive, the part of the window buffer that does not refer to the active channels is gated, so only part of its columns is active at any time. And when the pixel that completes the square arrives, the output is computed by multiplying the pixels inside the window buffer with the corresponding filter coefficients. The operation continues in the same way until the end of the row. Because we have a stride of two, no output is computed in the next row, so the window buffer remains gated and the input pixels only move into the corresponding line buffers, until the last row of the input, where the rest of the outputs are computed.
Overall, when the convolution finishes, we can see that only part of the engine was active at any time, and in this way we reduce the switching activity of our design. We quantified this with the analysis we performed: using the PowerPro power analysis and optimization tool, we showed that we can reduce the total power of the design by 10% up to around 30%, depending on the size of the kernel that is used as well as the stride of the convolution. Because we implemented this design to be programmable in terms of the stride, there is some overhead in the case where it computes a standard convolution: about 5% in power and around 2.5% in the area of the design. That is due to the extra multiplexing we inserted into the architecture in order to allow the data to move this way inside the engine.

Now let's move on to our second design, for dilated convolutions, where we created another architecture in order to again improve efficiency. Similar to the previous case, we can split the initial dilated convolution into smaller ones. In this case, as the kernel is inflated with zeros, we can observe that, as the kernel slides over the input, the pixels that participate together in the same outputs form groups, and pixels of one group are never multiplied together with pixels from another group. So as the kernel slides over the input, the blue pixels form one group that is always used together, so we gather them into one smaller feature map; the yellow pixels are always used together, so they are grouped into another one; and similarly the rest of the pixels of the input are grouped, creating four smaller channels. Having these smaller groups available, we can now perform a standard convolution on each one, using the same filter that we had for the input but without the zeros this time. So instead of performing a convolution with a 5x5 filter, due to the dilation we perform smaller convolutions with 3x3 filters, and by rearranging the outputs produced by each individual convolution we can recreate the output of the initial dilated convolution.
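As with the strided case, the dilated decomposition can be checked with a short, illustrative piece of plain C++ (sizes, names, and random test data are assumptions): each input group, formed by the pixels at rows and columns congruent to (a, b) modulo the dilation, is convolved with the original 3x3 kernel at stride 1, and its outputs land at the interleaved positions of the dilated convolution's output.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Illustrative check of the dilated-convolution decomposition described above
// (not the presenter's HLS code). Input pixels are grouped by
// (row % d, col % d); each group is convolved with the original, non-inflated
// KxK kernel at stride 1, and the sub-outputs interleave to form the output
// of the dilated convolution.
constexpr int D = 2;   // dilation
constexpr int K = 3;   // kernel size
constexpr int N = 12;  // input size
constexpr int O = N - D * (K - 1); // output size of the dilated convolution

float in[N][N], w[K][K];

// Reference: direct dilated convolution.
float direct(int i, int j) {
  float acc = 0.0f;
  for (int m = 0; m < K; ++m)
    for (int n = 0; n < K; ++n)
      acc += w[m][n] * in[i + m * D][j + n * D];
  return acc;
}

// Decomposed: output (i, j) with i = p*D + a, j = q*D + b comes from the
// stride-1 convolution of input group (a, b) with the plain kernel.
float decomposed(int i, int j) {
  const int a = i % D, p = i / D;
  const int b = j % D, q = j / D;
  float acc = 0.0f;
  for (int m = 0; m < K; ++m)
    for (int n = 0; n < K; ++n)
      acc += w[m][n] * in[(p + m) * D + a][(q + n) * D + b]; // group (a, b)
  return acc;
}

int main() {
  for (int r = 0; r < N; ++r) for (int c = 0; c < N; ++c) in[r][c] = std::rand() % 7;
  for (int r = 0; r < K; ++r) for (int c = 0; c < K; ++c) w[r][c]  = std::rand() % 5 - 2;
  for (int i = 0; i < O; ++i)
    for (int j = 0; j < O; ++j)
      assert(direct(i, j) == decomposed(i, j));
  std::puts("dilated decomposition matches the direct convolution");
}
```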
For our architecture, we again start from the assumption that we have a single convolution engine for each channel. In this case, however, only one channel can produce an output at a time, due to the row-wise streaming arrival of the data, so we can use a single parallel multiply-accumulate tree that is shared across the separate engines. Similar to the previous case, channels A and B refer to the same row of the input, so again their line buffers can be shared, and by adding some multiplexing we can forward the pixels from the line buffers to the correct window buffer. Apart from that, because the pixels arrive from the same row and we only move to the next row when a row of input is finished, the pair of channels A and B is never active at the same time as the pair C and D. So we use only two window buffers instead of four, shared between the channels of the different rows. Again, some extra multiplexing logic is introduced that allows us to send the data from the line buffers to the appropriate window buffer so that the outputs can be computed. And lastly, the row-wise flow of the input means that the outputs are produced in the correct sequence: each time an output is produced, it is already in the position it would have in the initial convolution. So the rearrangement that we need to do on the output happens by itself, and no extra hardware is required for it. Overall, with this architecture we show that the number of registers required to implement these window buffers is smaller than the number that would be required if we designed an engine to compute a 5x5 convolution using the inflated kernel.

To see how this sharing works, let's walk through an example. Say we are at a cycle of the operation where the gray pixels have already arrived at the engine and are stored in the line buffers. Now pixel A7 arrives and is going to be pushed into the window buffer, and the rest of the required pixels are pushed into the window buffer from the line buffers; only one of the window buffers is active. The pixels are pushed in and the window buffer shifts itself, staying active for some cycles, and then, as the next pixel arrives, the upper window buffer remains gated and only the second one activates, receiving the new input and shifting its values. Following the same flow of data as in the previous example, the first window buffer becomes active again, and with the arrival of pixel A8 all the required pixels of the input are available to produce the first output of the dilated convolution. So the output is computed by the multiply-add tree and forwarded to the output. Then the next pixel arrives, it is pushed to the lower window buffer, and the second output is computed. Now, as we move to the next row, the upper window buffer is used by channel C instead of channel A. Since channel A will never be active during the arrival of the pixels of this row, its content can be overwritten, so we do not lose any data and the operation continues. The same happens with the lower window buffer, which is now used by channel D instead of channel B. Overall, we can see that only a small part of the design is used on each iteration of the convolution.

We used our proposed architecture to implement a variant of VGG-16 with a lot of dilated kernels. Again, the code was written in C++, it was synthesized through Catapult, and the power analysis was performed with PowerPro. Our design was synthesized both for ASIC and for FPGA implementations, and, as we can see, for convolutions with larger dilation we get a very significant reduction in the overall power consumption of the design.

With that, I conclude the architectures for low-power variants of convolution and move on to the second part, where I will discuss how we can check the functional safety of our design, which is very important when designing CNN accelerators.
For functional safety, we decided to implement an online checker that operates in parallel with the convolution engine. It uses the inputs of the convolution to predict a checksum, and then compares this predicted checksum with the actual checksum computed from the outputs of the convolution, checking whether they match; so it can tell us whether or not there was an error in the computation. Again, it is important that this checker works in parallel with the convolution engine and has no dependency on its operation at all.

In general, when looking at a convolution operation, in this picture the white squares represent the input image, the green squares represent the filter, and the yellow and red squares represent the output pixels produced by the operation. The middle, red part of the output is the actual result of the convolution: when convolving a filter with an image, the actual result consists of the pixels for which the center coefficient of the filter lies inside the input image. By summing these red results we obtain the checksum of the actual output of the convolution. The state-of-the-art checkers for convolutions try to predict exactly this value by performing, in parallel with the convolution, the operations that calculate the summation of the output. Now, apart from these results, there are some extra results that belong to the periphery of the output; they are not used, they are simply discarded. However, we have proven in our work that by calculating the sum of these periphery results and subtracting it from the product of the sum over the whole input image and the sum of the coefficients of the filter, we obtain exactly the same value. So we can now predict the output checksum implicitly, and with that we can reduce the total number of operations needed to calculate it. In fact, our engine can choose whether to use the explicit or the implicit computation of the checksum, whichever requires fewer operations.

The overall architecture of the checker is presented in this slide. The checker is completely in parallel with the convolution engine: it receives the same input that arrives at the engine, and it uses the output of the engine to perform the check. The architecture consists of an accumulator in which all the input features are accumulated, and a second group of accumulators, one per filter coefficient, in which we accumulate the input pixels that the corresponding coefficient only touches in the discarded periphery outputs. Then, by subtracting these values from the total and multiplying with the coefficients of the filter, we produce the predicted checksum, and by comparing it with the actual checksum we can tell whether there was a fault, an error, in the convolution itself.
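The checksum relation described above can be illustrated with a small, self-contained C++ sketch (sizes, names, and random data are assumptions; this is not the checker RTL): the sum of all valid outputs is predicted from the inputs alone by subtracting, for each kernel coefficient, the border inputs that this coefficient never touches from the total input sum, and the prediction is compared against the checksum of the computed outputs.

```cpp
#include <cstdio>
#include <cstdlib>

// Illustrative sketch of the checksum idea described above (not the actual
// checker hardware): the checksum of the valid convolution outputs is
// predicted implicitly from the inputs, by subtracting, per kernel
// coefficient, the border inputs that coefficient never touches from the
// total input sum.
constexpr int K = 3;            // kernel size
constexpr int N = 8;            // input size
constexpr int O = N - K + 1;    // valid output size

float in[N][N], w[K][K];

int main() {
  for (int r = 0; r < N; ++r) for (int c = 0; c < N; ++c) in[r][c] = std::rand() % 9;
  for (int r = 0; r < K; ++r) for (int c = 0; c < K; ++c) w[r][c]  = std::rand() % 5 - 2;

  // "Actual" checksum: sum of the outputs the convolution engine produces.
  float actual = 0.0f;
  for (int i = 0; i < O; ++i)
    for (int j = 0; j < O; ++j)
      for (int m = 0; m < K; ++m)
        for (int n = 0; n < K; ++n)
          actual += w[m][n] * in[i + m][j + n];

  // Predicted checksum, computed from the inputs only.
  float total = 0.0f;                       // accumulate every input pixel once
  for (int r = 0; r < N; ++r) for (int c = 0; c < N; ++c) total += in[r][c];

  float predicted = 0.0f;
  for (int m = 0; m < K; ++m)
    for (int n = 0; n < K; ++n) {
      // Inputs that coefficient (m, n) never multiplies in a valid output:
      // everything outside the O x O block that starts at row m, column n.
      float periphery = 0.0f;
      for (int r = 0; r < N; ++r)
        for (int c = 0; c < N; ++c)
          if (r < m || r >= m + O || c < n || c >= n + O) periphery += in[r][c];
      predicted += w[m][n] * (total - periphery);
    }

  std::printf("actual = %g, predicted = %g -> %s\n", actual, predicted,
              actual == predicted ? "no fault detected" : "fault detected");
}
```

The sketch recomputes the per-coefficient periphery sums by brute force for clarity; the checker described above accumulates them on the fly from the streamed inputs, which is what keeps its operation count low.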
Apart from that, we generalized this architecture so that it can also be used for strided convolutions. This is based on the decomposition into multiple channels that I presented earlier, and the main difference is that we now need multiple accumulators, one accumulating the pixels of each channel created by the decomposition. In the previous slide I showed how the checker works for a single 2D convolution, but it can very easily be generalized to a whole CNN layer: for each output feature map we perform multiple standard convolutions and at the end we add their results to produce the output feature map. So by attaching our checker in parallel to each of these convolutions, we can compute the predicted checksums, add them up at the end, compare the sum against the checksum of the actual output, and detect whether there was an error anywhere in the complete CNN layer. The operation of the online checker remains completely parallel to every channel of the convolution.

As I said earlier, our goal was to minimize the number of operations required to perform the check. In this slide I show that, depending on the size of the image, the implicit computation of the checksum, the one we proved we can use, requires a greatly reduced number of operations compared to the state-of-the-art explicit checksum. In addition, the fact that we do not need any extra memory or registers to store intermediate results or reuse outputs makes our design less susceptible to faults.

To prove the efficiency of the proposed design, we performed extensive simulations in which we injected faults into the registers of both the convolution engine and the checker, in order to see whether the checker could detect them. The way we inject faults is that the probability of a fault, a bit flip, being present in a flop is the same for every register of the design. Because the checker has a lot less hardware and far fewer registers than the convolution itself, the probability of a fault landing there is much smaller than for the convolution. As the table shows, when two faults are injected randomly into the design, the checker detects the majority of them, in all cases above 90% of the injected faults, and when we move to four faults the detection rate is almost 100%. To study how our design is less susceptible to faults than other state-of-the-art implementations, we compared our work with one from the bibliography in which intermediate results are stored in order to reduce the number of operations; the memory needed for that implementation makes it more susceptible to faults. Overall, we can see that our design is very good at detecting faults compared to other implementations, and the initial value that appears in this graph is in fact a limitation of the checksum method itself, not a problem of the proposed architecture.
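As a rough illustration of the fault model used in these experiments (uniform single bit flips in stored values), a bit flip can be mimicked in a C++ simulation by XOR-ing one randomly chosen bit of a register's contents; this toy snippet is only an assumption of how such an injector might look, not the actual framework used for the reported results.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Toy illustration of the bit-flip fault model described above (uniform
// single-bit upsets in stored values); not the injection framework used
// for the reported experiments.
float flip_random_bit(float value) {
  std::uint32_t bits;
  std::memcpy(&bits, &value, sizeof bits); // view the stored word as raw bits
  bits ^= 1u << (std::rand() % 32);        // flip one uniformly chosen bit
  std::memcpy(&value, &bits, sizeof bits);
  return value;
}

int main() {
  float weight = 0.5f;                     // e.g. a value held in a register
  std::printf("before: %g  after: %g\n", weight, flip_random_bit(weight));
}
```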
Now, having presented all our architectures, I will move on to some good practices that I found useful during my work on these designs.

First of all, to perform the decoding in the case of stride and dilation, some modulo operations are needed. Writing straightforward C++ code, the natural way to write this decoder would be like the one shown in the figure. However, that code produces hardware in which a modulo operator is present, which is wasteful when you know that every value of the stride or the dilation will be a power of two. In that case we decided it is better to use a more customized implementation, where a case statement is used and, depending on the value of the stride or dilation, the decoding becomes much simpler than using an actual modulo operator (a minimal sketch of such a decoder is shown after these tips).

Another good tip is about how to achieve the sharing of the multiply-add tree that I presented for the dilated convolution. We have a filter function that performs the actual dot-product operation, and we know that for dilated convolutions the if statement will be taken only once across the iterations of the first loop, so the filter function will be called only once. However, with that coding style the sharing is not clear: we cannot instruct the compiler that this function will be used only once. Using that style, we get 18 registers implementing the window buffers, two multiply-add trees, and an overall design area of about 9,000 square micrometers. Instead, we can use a local array and assign the needed values to that array before calling the filter function. By using this local array and calling the filter function on it, only one multiply-add tree appears in the produced RTL, and the total area of the design drops to around 5,000 square micrometers.

Lastly, when we have a multidimensional array in our CNN implementations and we want to pack some of the inputs into a bigger word in order to store it in a memory, say packing the two last dimensions into a single word, the obvious approach is to use a local variable, reading one input at a time and packing it into the wider variable, and then storing this word into the memory. But with that code we see an undesired memory read as a result. To avoid that, we implemented a custom data type based on recursive template metaprogramming that maps the multidimensional array onto a single one-dimensional array, and these templates let us use the data type in our code in the same way we would use a standard C++ array. By doing so, Catapult HLS inferred exactly the implementation we wanted, without the extra memory read.
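Here is the decoder sketch referenced in the first tip above. It is illustrative only: plain unsigned integers stand in for the fixed-width HLS types of the real design, and the supported strides are assumed to be 1, 2, and 4.

```cpp
#include <cstdio>

// Sketch of the stride/dilation decoding tip above (not the exact code from
// the design). The decoder maps an arriving pixel's column index to its
// channel. The generic form infers a modulo operator; when the stride is
// restricted to powers of two, a case statement reduces the decode to
// selecting the low bits of the column counter.

unsigned decode_generic(unsigned col, unsigned stride) {
  return col % stride;                 // synthesizes a general modulo unit
}

unsigned decode_pow2(unsigned col, unsigned stride) {
  switch (stride) {
    case 1:  return 0;                 // standard convolution: one channel
    case 2:  return col & 0x1u;        // col % 2
    default: return col & 0x3u;        // col % 4 (stride of 4 assumed here)
  }
}

int main() {
  for (unsigned col = 0; col < 8; ++col)
    std::printf("col %u -> channel %u (stride 2)\n", col, decode_pow2(col, 2));
}
```

With the case statement, the tool can implement the decode as a small multiplexer over the low bits of the column counter instead of a general modulo or divider unit.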
Moving to the last part of this presentation, I will talk about floating-point arithmetic for fused dot products. For convolutions, the main operation is the dot product, and when we write this operation in C++ we would write something like the code presented on the slide. With this code, either a serial chain of multiply and add operators is implemented, or a fused multiply-accumulate chain in case an FMA operator is selected for the implementation. In both cases, rounding and normalization are performed at the output of each operator. We wanted fused operations where the rounding and the normalization are performed only once, at the end of each dot product. In addition, we wanted implementations where the input and output precisions of the data can differ. To do that, we decided to create a new Fast-Float library especially for high-level synthesis; this library is built on top of the AC datatypes and it is currently public on GitHub. Its main new feature for high-level synthesis is a fused, vectorized dot-product unit, which is part of the library. We implemented a unit that performs the multiplications of the inputs in parallel and then uses a single reduction tree to compute the result, rounding it only at the end before pushing it to the output of the unit. The operator is designed in a templated manner, meaning it can be used for floating-point numbers of any precision and for dot products with any number of terms, and its use is very simple, as the example shows for a matrix-vector multiplication implemented with the dot function of the Fast-Float library.

One of our most recent works is based on systolic array architectures. Very briefly, a systolic array is composed of multiple processing elements, each performing a multiply-accumulate operation. For the weight-stationary dataflow in particular, the array is preloaded with the weights of the convolution, and then, as the inputs arrive, the outputs are produced; on each iteration a multiply-accumulate operation is performed in every processing element. Looking at one column of such a systolic array, each processing element pushes its output to the one below it, and when we talk about floating-point operations, each processing element is a multi-cycle fused multiply-accumulate with increased latency relative to the pipeline of the architecture. So we designed a new version of this pipeline, targeting especially low-precision floating point, where the operation can be performed efficiently in a two-stage pipeline while still achieving the required frequency, even 1 GHz. What we wanted was to improve the initiation interval of this pipeline. What we achieved is that, by speculating on the exponent, which is the main dependency in the pipeline, and by fixing this speculation in the next cycle of the pipeline of each processing element, the two pipeline stages of the processing elements can be overlapped, achieving an initiation interval equal to one. And with that, presenting our most recent work, which I hope we will make public in the next few weeks, I will conclude my presentation. Thank you very much for your attention.

Awesome.
Thank you so much, Dionysis, for that presentation. In a moment he will be joined by Georgios, who is an associate professor at Democritus University, as well as Stuart Clubb, to answer your questions, so please continue submitting those through the Q&A widget.

Very good, thanks, Mathilde. So we had a question early on that I had put to the side. The question was about the systolic array and whether the MAC units were reconfigurable: could they support different bit-size multiplications?

Are we talking about the last design?

No, it was pretty early on, like seven minutes in.

Maybe Giorgos can help a bit here, but the design is not ready for multi-precision at the moment. There is no true dependency that prevents it; we just have not designed it yet.

Thank you, I will mark that as answered. OK. We had a few questions about the if statements between the two for loops, and Georgios, I think you answered those: of course, with the loops actually unrolled, there was a pragma to unroll the loop, so that is just conditionally running those loops and doing the striding, right? There was also a question about fault injection and the details thereof, and Georgios, you referred to your paper. Is there a paper, and do we have the link for that?

I think that is on my web page, gdimitrak.github.io; that can help also. Overall, the fault injection at the moment involves, although there is no true restriction for that, only bit flips on the registers and memory cells, the flip-flops. So the more registers or the more buffers a design has, the more susceptible it is to a bit flip. Part of the efficiency of the architecture we have is that we do not need to reuse the results, because we have a very small number of additions in any case. Not reusing means fewer registers, which means a lower probability of being hit by a bit flip. We could also inject faults in combinational logic and see how they propagate; there is no true limitation, but the experiments we present refer only to bit flips on registers.

Cool, OK. And another question that was answered is: are any of these Catapult HLS designs open sourced? And I think, Georgios, again, you are going to make those available on the IC-Lab-DUTH GitHub, right?

The timing depends on the units, how fast we can do that.

Got to clean up that code. OK. All right, there are a couple of questions that have rolled in, so let's take a look. What is the difference between CCS_BLOCK and CCS_DESIGN? Can we combine CCS_BLOCK with sc_main? So, Dionysis, would you like me to take a stab at that?

From my experience, I think that when we choose CCS_BLOCK in the SCVerify flow, we note to the flow that this part of the code, this function, should be tested as a hierarchical block of the design, and the checking is performed there. CCS_DESIGN is a macro that is used in the test bench, at the main function of the code, to instruct the flow that the function we wrap with CCS_DESIGN is in fact the top level of our design, of our architecture. If I remember correctly, that is the difference.

Yeah, that's right.
So CCS_DESIGN was the original way of doing the wrapping for SCVerify in the C++ flow. That was then expanded with the concept of CCS_BLOCK, which meant that whichever function was currently the top level of the hierarchy could be pulled into SCVerify. So, for example, if you had a test bench and then an instantiation that called sub-blocks, which is very common in a C++ hierarchical design, then whichever sub-hierarchy was the current top level could still be verified with SCVerify. That is a little different in the sc_main SystemC flow, where you basically have one DUT. Thanks for asking that question; I'll flag it as answered.

There's a question from Balaji. Balaji, good to see you here. In your development, how did you study the logic and architecture inferred from the C code and make improvements?

Ah, I don't understand the question, sorry.

I think the question is: how were you able to look at what you got out of Catapult from the C++ that you actually wrote, and determine where you needed to make either logical or architectural changes?

OK. After each step of our design, after any change to the code, I studied the resources that were used and the schedule that was achieved by the architecture, and lastly I looked at the simulation waveforms and tried to understand whether the flow of the data was what I wanted to achieve, or whether there were any differences. By combining all of this I decided on the improvements.

Can I help a bit on that? First of all, the exploration was done by experience; we have a lot of experience in optimizing designs, this is what we have been doing for the last twenty years, at least me, Dionysis is much younger. And the second part is that all the designs we have are implemented down to layout. For each design we have a complete layout for ASIC, or full FPGA implementations in prototypes running the design, after taking the RTL from Catapult. So in all cases the numbers we get are from static timing analysis post-layout, which is the most accurate way of seeing the results we can expect from these designs.

Cool, thank you. And that probably leads on to one of the questions sitting in the inbox right now, which is: were the designs fabricated, and if yes, in which technology, and what kind of frequencies were you able to achieve?

No, they are not fabricated; they are laid out in either 40 nanometers or 28 nanometers, and some of the designs also in 20 nanometers. The frequencies, if I remember correctly, differ per design; let's say we may have 500 megahertz or one gigahertz, and for FPGA they are much lower, say 100 or 150 megahertz, something like that. The top frequency we can get is somewhere in the middle.

Very good, I'll flag that as answered. Thank you, Balaji, appreciate the question and the comment there. All right, I'm not seeing any other questions that have come in. Mathilde, back to you to close.

Yeah, OK. So once again, please continue using our interactive widgets at the bottom of your screen to submit any last questions you may have, and of course download the slide deck. The recorded version of the webinar will be available on demand within a couple of hours.
You'll receive an e-mail with the link to view the recording thereafter, at any time. So that essentially concludes today's presentation, since we don't have any more questions. Thank you all for joining us, and special thanks to Dionysis and Giorgos for answering all the questions, as well as Stuart, always a fun time.

Oh, one of the questions was about the FPGA technology. Which one exactly? Yes, thank you.

All right, great. Well, thank you all for your time today. I will now close this webcast, wishing you a great day or a great evening, wherever you may be. Thank you so much.

Thank you all. Great session. Bye now.