Hello everyone and welcome to today's webinar. Today's webinar, from the Ultra Video Group at Tampere University, is "Managing the Complexity of an HEVC Encoder (Kvazaar) Implementation on FPGA with Catapult HLS." My name is Matilde Carcenti and I am a Marketing Programs Manager here at Siemens. Before we start, I'd like to go over a few items regarding the interface. At the bottom of your browser, in the center, you'll see icons in the widget dock. These are the interactive tools you will use throughout the webinar. You can minimize, maximize, and scale each widget to customize your viewing experience. I'd like to call your attention to the Q&A widget. The Q&A window should be open and is located on the left side of your screen. You're encouraged to submit questions to the presenter throughout the webinar using this widget; your questions will be answered during the Q&A session right after the webinar. Please be aware that questions you submit will disappear, but will come back with an answer if answered by text. Second is the Resources widget, located at the right side of your widget dock. There you will find all kinds of resources from today's webinar, including the PDF slide deck as well as links to various other resources that Dr. Panu has provided. Right next to that widget you will see the Help widget. If you are experiencing any technical issues, please click on it for assistance. Do know that a simple refresh of your browser tends to resolve most issues you may be experiencing. This presentation is being recorded and you will receive a link to it in a follow-up e-mail later today. If we are not able to get to your question during the live Q&A, we will respond to it via e-mail.

Our presenter today is Dr. Panu Sjövall. He received his Master of Science degree in Automation Engineering from Tampere University of Technology in 2015 and his PhD degree in Computing and Electrical Engineering from Tampere University in 2023. Currently, he is a postdoctoral researcher at Tampere University, working in the leading academic video coding group in Finland, the Ultra Video Group. His current research interests include high-level synthesis, video coding, hardware and system-on-chip design, FPGAs, and Linux kernel driver development. So now I will hand it over to Dr. Panu.

Thank you, thank you for the introduction. Let's see. I've had some lag on my laptop, so hopefully nothing is going to happen during this presentation, but we'll get through it. And yes, as Matilde already presented, this is my topic for the day. It's a bit of a mouthful; I had a few suggestions, but this was the most interesting one. The gist of today's presentation is how we managed the complexity of a very complex HEVC encoder application with Catapult HLS. We didn't use any other HLS tool; we relied on Catapult for our complexity management, and our final implementation was done on FPGAs. So we went beyond simulations — we got our final application really done. Let's dive into that. But before we get to it, I'll just briefly introduce myself. My name is Panu and I'm from Tampere, Finland — I've marked it for you there, on the northern side of Europe. As you already heard, I'm working as a postdoc at the university, and you can see a few images there: during summer it's very green, during winter very white, a lot of snow.
I'm part of the Ultra Video Group, which is a research group here at the university. The group is led by Associate Professor Jarno Vanne, who is my supervisor and supervised my PhD thesis. To be humble about it, we are the number one academic video coding group in Finland — which is safe to say, since we do have some good merits and good publications. If you want to see those publications or learn more about us, I asked for the Ultra Video Group links to be put in the resources box, or you can just take a look at the QR code here for the UVG website. About my research achievements: I've had my name on 14 publications, and ten of those publications were included in my PhD thesis, which I defended successfully last March — so I'm a brand new PhD. The title of that thesis was a feasibility study of an HLS implementation of a real-time HEVC encoder on an FPGA. Again, a real mouthful of keywords, but the idea was to study how we can actually use HLS to implement the HEVC encoder on FPGA, fully using HLS — and that is essentially the topic of today. If you want to read more about my thesis, again there's the QR link, and hopefully it's in the resources box ready for you. Ten of my publications are included in the thesis; it should be open access, so you can read those even if you don't have access to, say, IEEE Xplore or the ACM journals. And if I miss something during this presentation, or am lacking some descriptions of the hardware, most of it is also explained in my thesis and the included publications.

My laptop is getting pretty laggy at the moment, but hopefully you can now see the image of our group. Just to point out the leader: Associate Professor Jarno Vanne is here, and I'm located here. This is our group, and there are plenty of us: three postdocs at the moment, and plenty of PhD students and research assistants working on different areas of video coding, as the name Ultra Video Group might imply. To get more familiar with what our expertise is: as I said, it's mostly video-related, so video coding, video processing, video streaming, and then a lot of applications using those video technologies. Today we're talking about HEVC encoding, but there's also the newest standard, VVC, that we are working on, and then plenty of keywords — AI-based coding, multiple resolutions, 360-degree processing, and all of that. Again, if you want to know more about what we do, I'd like you to go to the group website.

But now we can get to the actual topic, and why this topic is important — why it was important for us to research it in the first place. The first part is of course high-level synthesis. From publications we know that high-level synthesis is very good for data-intensive algorithms; that was very apparent from seeing what has been implemented with HLS. For us, the important question was whether we can actually implement the whole HEVC encoder using HLS. That would help us with the complexity of the HEVC encoder, which is a very complex video compression algorithm, and by using HLS for the whole application we can reduce the design and verification effort. We didn't want to start our journey using the traditional route of just Verilog and VHDL; we wanted to do something else.
It also helped that our reference algorithm — our HEVC encoder — was pretty much done at the point we started implementing it in hardware. We had working C reference code from which we could take the reference algorithms for the HLS implementation, so we could hopefully just take those algorithms from the code, put them together, and compile them. That was not quite the case — I will go through a few examples later on — but at least we had some kind of working reference algorithm done, so we didn't have to develop that. And finally, we were actually implementing something on FPGA. We are going for ASIC at the moment — I will cover that later on — but the FPGAs were important to us in the sense that we didn't want to stay in simulation; we wanted a proof-of-concept platform that can actually encode 4K video at high frame rates. So: can we use HLS to implement the HEVC encoder on FPGA? That was the topic. Next I'll cover some HEVC background. You might be familiar with this, but most likely you're not fully aware of what HEVC does and why we need it. I'm not going to cover background on HLS — hopefully you're more aware of that, and if you have questions, I'm sure the Siemens people will answer them. After the HEVC background, I will dive in and show you the complexity of our hardware design, before talking about how that complexity was managed.

So, HEVC — why do we need it? Think of your mobile phone. If you're taking pictures or videos with it, there's a camera sensor, and that sensor produces raw data. If we wanted to store that raw data, we would run out of storage very quickly; it's very data intensive, and there's a lot of data coming from the sensor itself. If you wanted to transmit that raw, high-resolution, high-quality video over a network, it would be impossible. At low resolutions it might be doable, but for 4K, 8K, even 16K, there's no way. To solve that problem we have the HEVC video encoder, which compresses the video. There are multiple video coding standards, like the previous standard H.264, or AVC; HEVC is the most widespread at the moment, and then there's also the newer VVC, and of course other algorithms. But to solve the problem we need to compress, or encode, the raw video in order to minimize the file size and be able to transmit it somewhere. The idea is that you're capturing video with your mobile phone, the phone itself compresses the video using some algorithm in its own software or hardware, and after that you can transmit it to the cloud for storage or backup. When you want to view it, because it's compressed — encoded — you need to decompress, or decode, it again, using hardware or software, in order to get it back to the raw format and play it back. That's the idea of the HEVC video encoder, of video encoding in practice.

Then, to dive a bit deeper into what HEVC actually does: how does encoding work? The simple explanation is that you take the raw video, divide it into frames, divide those frames into blocks, and then it depends on whether you're doing intra prediction or inter prediction.
With intra prediction, we work frame by frame, block by block, exploiting the spatial redundancy within a frame and a block to compress the video into a smaller size. Basically, you take the reference pixels above and to the left of a block and predict the contents of the block from them. With inter prediction, you exploit the temporal redundancy between frames and blocks. Basically, if you have a static background and only a small number of objects moving, you're not encoding everything again — you just say that these objects move to this place, which saves a lot of space, or rather improves the compression efficiency. That's the simple explanation; it's actually even more complex than that. There's a variety of coding tools in HEVC, and the bolded ones here represent what we implemented in hardware. At the moment it supports just intra encoding, which means we needed to implement the intra prediction, the discrete cosine transform, quantization, the reconstruction path — inverse quantization and inverse discrete cosine transform — and the entropy encoder. If you wanted to implement the inter side, there are also the motion estimation algorithms, motion compensation, and a bunch of buffers everywhere. This is the reason why we need HLS: we want to be able to handle this complexity.

And then this is an advertisement slide for what the Ultra Video Group does. As I said, we're using the Kvazaar HEVC encoder as the reference implementation. Kvazaar is actually an award-winning, number one open-source HEVC video encoder — you can see the link here — and it was developed fully by us, the Ultra Video Group at Tampere University in Finland. So if you need an open-source HEVC video encoder, please use ours; it's a very good encoder. That's not the only reason we used it as the reference implementation — it was developed in the same group at the same university — but it is the best open-source option you can have.

So that was quickly the background for HEVC, and now we can actually dive into the hardware design. It does get a bit complex — with HLS it was manageable, as you will see after the block diagrams, but it did get a bit complex. Here you can see the top level of our design. I will have a link to the journal paper that explains each block in detail, but I will try to be quick and show you how we managed to do everything with this design, and I will try to draw something here to illustrate what I'm talking about. So let's start — this is a bad color, let's try the red one. Here you can see a few interfacing blocks. We implemented almost everything with HLS, and most likely we could have implemented even these blocks with HLS, but these are the ones we needed at the beginning to be able to interface with our encoder, and I was very familiar with VHDL at that point, so it was kind of natural to use VHDL for some blocks. In hindsight I could have used HLS for these too, but unfortunately these ones are done only with VHDL. So we have some interfacing blocks and some configuration blocks — the intra config that is responsible for configuring the intra search core, and a few configuration blocks for the CABAC one — and then of course the DMA that is connected to the PC, the attached desktop or server or whatever device, using PCIe. Alternatively, we can have these Ethernet parsers, so we can get actual Ethernet packets from the network, and the DMA's and the parser's responsibility is to send data to these memories here — the memories needed for actually doing the intra search, the intra encoding, on the hardware. So basically that's the interfacing to our HLS blocks, and the same thing at the bottom, basically in reverse order: those are the results we need from our encoder, and again the DMA and Ethernet TX blocks send the data back.
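As an aside, here is a rough sketch of the intra prediction idea mentioned earlier — predicting a block's contents from the reconstructed reference pixels above it and to its left. This shows only the DC mode; real HEVC adds planar and 33 angular modes plus boundary filtering, and the function name and interface here are hypothetical, not the Kvazaar or hardware API:

```cpp
#include <cstdint>

// DC intra prediction for an NxN block: the whole block is filled with the
// average of the 2*N reference pixels directly above and to the left of it.
template <int N>
void intra_predict_dc(const uint8_t above[N], const uint8_t left[N],
                      uint8_t pred[N][N]) {
  int sum = N;  // rounding offset, half of the divisor 2*N
  for (int i = 0; i < N; ++i) {
    sum += above[i] + left[i];
  }
  const uint8_t dc = static_cast<uint8_t>(sum / (2 * N));

  // Every pixel of the predicted block gets the same DC value.
  for (int y = 0; y < N; ++y) {
    for (int x = 0; x < N; ++x) {
      pred[y][x] = dc;
    }
  }
}
```

In the hardware described here, a cost metric such as SAD against the original pixels is then used to choose between this mode and the other prediction modes.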
Now let's dive into the actual HLS implementations, starting with the intra search core, which looks something like this. There should be some familiar names here: intra prediction, transform. As I said, HLS was very good for data-intensive stuff, but we wanted to use it for everything, so even for the control side we used HLS. When we look into the control unit, there are several blocks, and all of the green blocks are implemented using Catapult HLS: initialization, scheduling, execution, and RDO, which is the rate-distortion optimization block, plus some stacks for storing data we get from the encoding process, and some reference border generation blocks. All of these were done with HLS, and basically the idea was to minimize the size of the projects and the time per iteration that we spend with an HLS project. All of these blocks are separate projects, so we generate the blocks with Catapult individually and then connect them with just glue logic in VHDL, instantiating the memories somewhere else. That was the control unit, and the same applies to the intra prediction: again, everything implemented with HLS, but the connections happen somewhere else. We are not compiling or generating the RTL for the whole block at once, but for each of these blocks individually; it seemed like the better, more manageable approach for us.

And then finally the transform unit. The previous ones were single-block projects, but here the DCT and IDCT are the biggest blocks we implemented with HLS, and those utilize the hierarchy that you can have in a Catapult project — submodules — so all of those blocks are generated at once. Then again, the quantization, coefficient cost, and reconstruction blocks were not. Getting back to the top level and moving to the CABAC: the intra search already had some control parts, but most of its algorithms are data intensive, whereas CABAC in itself is very sequential, very control heavy. Still, HLS was very capable of generating even this block, and the performance we achieved was very comparable to the related work, so we were very happy with how we were able to implement it using HLS. If we take a look at the binarization units here, we again had multiple HLS blocks, but we also utilized the clock domain crossings — the multiple clocks that you can have within a Catapult project — and we were able to relax the clock constraints and get better routing on the FPGAs by splitting the clock domains. Clock domain 1 might be running at a lower clock frequency than the more demanding unit in clock domain 2 with a higher clock. The same thing also applies in the encoding unit, where we implemented an arbiter that gets the data from clock domain 1 and clock domain 2 and then sends it to the actual CABAC process running in clock domain 3, which runs at an even higher frequency than domains 1 and 2, generating the binarization of the commands coming from the intra search core. So that's the explanation of our hardware.
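As a side note on the transform unit mentioned above: the heart of the DCT/IDCT blocks is exactly the kind of data-parallel arithmetic HLS handles well. Here is a minimal, generic sketch of a 1-D forward transform written as a matrix multiply — the real HEVC DCT uses fixed integer basis matrices (4/8/16/32-point) and partial-butterfly decompositions, so the basis matrix, shift, and function name here are placeholders, not the actual implementation:

```cpp
#include <cstdint>

// Generic N-point 1-D transform: each output coefficient is the dot product
// of the input vector with one row of the integer basis matrix, followed by
// a rounding right shift (shift > 0 assumed) back to 16-bit range.
template <int N>
void forward_transform_1d(const int16_t basis[N][N], const int16_t src[N],
                          int16_t dst[N], int shift) {
  for (int i = 0; i < N; ++i) {    // one output coefficient per basis row
    int32_t acc = 0;
    for (int j = 0; j < N; ++j) {  // dot product with basis row i
      acc += basis[i][j] * src[j];
    }
    dst[i] = static_cast<int16_t>((acc + (1 << (shift - 1))) >> shift);
  }
}
```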
Getting back to the top level, you can see that the design shown here is a single instance, but on the final FPGA we are using, an Arria 10, you can actually have three of these instances running in parallel. That was a very nice way to get more speed: we didn't have to parallelize a single instance even further, we could just have more instances of the same design. And as I said, if you want a more in-depth explanation of all of those blocks, I'll ask you to read this ACM journal paper, the high-level synthesis implementation of an embedded real-time HEVC encoder on FPGA for media applications — again, we're good at coming up with very complex titles — and this publication is also included in the thesis, so if you don't have access to this one, take a look at my thesis; it's at the end, so you can read it there.

And now we can actually get to how we managed that, how we were able to do it. In managing the complexity, an important thing about what we were doing was application suitability. HEVC coding is a very data-intensive process, and as I said previously, from reading publications we knew that HLS is very good at those data-intensive, number-crunching algorithms. Here you can see that most of the algorithms are like that: the DCT, quantization, dequantization, inverse DCT, and so on are very data intensive, and they have worked very well with HLS previously. Just as an example, this is the implementation of quantization and dequantization — a generalization, and the easiest block we implemented — but it's nice to see that I can actually fit the amount of code you need for the quantization on a single slide. It's just a for loop, some calculations, slicing, and clipping, and that's it; it was very simple to implement with HLS. But as I said, there are also the sequential and control-heavy parts of HEVC, namely the encoding control, which takes care of sending the blocks — the raw data from the frame — to the encoding pipeline: doing the intra prediction, choosing which modes to use, and deciding what kind of structure the final video has. As I said, it divides the frames into blocks; in HEVC those blocks are basically 64x64, and they can be split into smaller blocks. The deeper you split, the better quality you get; the bigger the blocks you keep, the better compression you get. So depending on the content of your video, you might need to split all the blocks down to the smallest 4x4 ones, or, if it's just a static color, you don't need to go down to those smaller blocks — you stay at the bigger blocks and get better compression efficiency. The encoding controller takes care of that based on the results it gets from the transform phase — the DCTs and quantizations — and the intra prediction itself. And then you have the arithmetic encoding, CABAC, which is responsible for generating the HEVC bit stream according to the results that the encoding control provides it: the intra prediction results and the actual structure of the final video.
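To make the "for loop, some calculations, slicing, and clipping" remark above concrete, here is a hedged sketch of what such a quantization loop typically looks like — illustrative only, with made-up parameter names rather than the real Kvazaar or hardware interface, and without the QP-dependent scaling tables the actual encoder uses:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>

// One pass over the transform coefficients: multiply by a scaling factor,
// add a rounding offset, shift down, clip, and restore the sign.
void quantize_block(const int16_t coeff[], int16_t level[], int num_coeffs,
                    int scale, int add, int shift) {
  for (int i = 0; i < num_coeffs; ++i) {
    const int sign = (coeff[i] < 0) ? -1 : 1;
    int64_t abs_level =
        (static_cast<int64_t>(std::abs(coeff[i])) * scale + add) >> shift;
    // Clip to the 16-bit range allowed for quantized levels.
    abs_level = std::min<int64_t>(abs_level, 32767);
    level[i] = static_cast<int16_t>(sign * abs_level);
  }
}
```

A loop like this maps almost directly to a pipelined datapath, which is why it is singled out as the easiest block to implement with HLS.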
As I said, we achieved very good results with Catapult HLS even on the control-intensive parts. I don't know if it was a happy surprise — we didn't have any doubts that it couldn't do it — and comparing our results with the related work, we were as good or better, so we were already happy with how Catapult HLS performs for these very control-heavy parts. To give you an in-depth look at what the binarization unit contains: it was about 5,000 lines of code, with around 30 for loops and 55 conditional channel writes — those are the writes that generate the CABAC binarization data, the binarization of the data coming from the intra encoding to the actual encoding units. So there's a lot of writing of data and a lot of if-else clauses, 170 in total in the whole binarization unit. You might think it's very sequential, very control heavy, very dependent on previous stages — and it worked perfectly.

And that's the idea behind managing the complexity: using HLS for everything helped a lot, so we didn't have to implement some parts with VHDL or Verilog, but could do everything with HLS. The next part of managing the complexity was that we relied on the idea that optimizing the algorithms and the system will outperform micro-architectural optimizations, and that we should spend our time where we can gain more performance. We didn't try to micro-manage small parts of different algorithms; we tried to optimize the full algorithm, the whole system. If we figured out something good to do at the system level, we did that — optimizing the whole algorithm rather than just small portions the way you would with VHDL or Verilog. What's fun about HEVC is that the standard defines only the decoding process. So it basically doesn't matter how we do the encoding, as long as the CABAC works perfectly, because that is part of the decoding process. How we do the encoding only affects the compression ratio and basically the speed of the encoder: if we do very simple things, we get good speed but bad compression efficiency, and if we spend a lot more time on the encoding, searching the structure of the video more thoroughly, we get better compression but might not reach those frame rates. With HEVC and HLS it was possible to create these three major versions of the encoder — each a fully working encoder, each improving on the previous one — and it was really easy to do. This just proves that optimizing the whole system, the whole behavior of those algorithms, is better than just micro-managing small things.

Basically, the first version of our encoder was just a proof-of-concept version, implementing all the intra encoding tools for luma pixels — luminance, since we're encoding YUV data, not RGB, so the Y component. The second version optimized the use of the DSP units, mostly in the DCT and intra prediction, and duplicated the most timing-critical resources. We figured out some bottlenecks in the video encoding pipeline and were able to duplicate those resources, which improved the encoding pipeline and parallelism, and we also added support for the chroma pixels, the U and V components.
Then for the third version we improved even further: we simplified some complex memory structures, made code changes, rewrote some parts so they could be pipelined better with Catapult, generated the RTL again, and improved the chroma coding. And then finally the implementation of CABAC — you might notice that some of the pieces are missing from the first and second versions; I will come back to that later, but basically the missing parts of the encoder were run in software, and the third version is the final one that actually implements everything in hardware, fully embedded. That's the final product we had.

And I have a few examples of how we did this. As I said, we started with a verified reference algorithm, our Kvazaar HEVC encoder. We took the reference algorithms from Kvazaar, which are fully working, and incorporated them into our testbench, basically for generating golden data. Then we also took that same algorithm and started, most likely, with design X that you can see on the left — just trying to compile it and see how Catapult works, by compiling the same algorithm used in Kvazaar HEVC. Then we started to explore the solution more: the design structure — is it a flat design, do we split it into different projects, do we use the internal hierarchy that you can have in a Catapult project — and the possible algorithm modifications — do we need to modify the algorithm to get more performance? We used design exploration to get good directives: how much do we pipeline things, how much do we unroll the loops, what do we map arrays to — registers or memories — all of that. Then, by managing the trade-off between parallelism, area, and speed, we tried to get the best performance out of our implementation. I already mentioned the design structure question — how many projects, or whether to use hierarchy. Compilation time was also important: we didn't want to spend that much time compiling with Catapult, so we tried to keep the designs small, compile fast, do multiple iterations, and explore all the alternatives we could.

To show you examples of how this worked: in the first design, you can see the intra prediction unit that we had in the first working version. It has the intra prediction control, the prediction algorithms, SAD — sum of absolute differences — as the metric for choosing the best mode, and RAMs for storing the data. In the second version, we realized there were some bottlenecks in the video encoding pipeline, so we separated some blocks from the prediction control — very easy to do with HLS: removing something, creating a new project, getting more speed, a more optimized pipeline — and added some memory between the prediction control and the prediction blocks for better scheduling of the control and the actual prediction of the data. In the final form, we rewrote the prediction blocks to be able to pipeline the main loop with an initiation interval of one, and in the end we replaced those very resource-heavy RAM blocks with a single FIFO. If we had been using traditional design methods, I don't think we would have made these huge structural changes that easily; with HLS it was very easy to do.
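As a loose illustration of the kind of knobs mentioned above — pipelining, loop unrolling, and mapping arrays to registers or memories — here is a small SAD-style loop annotated with Catapult-style source pragmas. The pragma spellings are given from memory, and in practice the same constraints are often set as directives in the Catapult project rather than in the source, so treat this as a sketch to be checked against the tool documentation, not as the project's actual code:

```cpp
#include <ac_int.h>  // Catapult's algorithmic C datatypes

// 4x4 SAD between original and predicted pixels; 16 * 255 fits in 12 bits,
// so a 13-bit unsigned accumulator is wide enough.
#pragma hls_design top
void sad_4x4(const ac_int<8, false> orig[16],
             const ac_int<8, false> pred[16],
             ac_int<13, false> &sad) {
  ac_int<13, false> acc = 0;

  // Exploration knobs: pipeline this loop with an initiation interval of 1,
  // or unroll it fully for more parallelism at the cost of area. The arrays
  // themselves can be mapped to registers or to RAM resources from the tool.
#pragma hls_pipeline_init_interval 1
  SAD_LOOP:
  for (int i = 0; i < 16; ++i) {
    ac_int<8, false> a = orig[i];
    ac_int<8, false> b = pred[i];
    acc += (a > b) ? ac_int<8, false>(a - b) : ac_int<8, false>(b - a);
  }
  sad = acc;
}
```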
And then a very important thing when you're implementing hardware is how you do the verification — how you can reduce the effort you spend on it. This is basically the design flow we had to help with our verification. First, there is unit verification in the HLS flow: basically running just the HLS implementation against the testbench and verifying our HLS code — or the RTL, although in most cases we didn't do anything with the RTL. We usually didn't verify the RTL in simulation: if the C algorithm passed the testbench, we took it as working RTL, put it on the FPGA, and were happy about it. On the FPGA we also did some unit verification, of course, because you can have just a single RTL block there, but we can also do system verification there: we connect the RTL blocks generated by Catapult, and then we have the soft-core processor taking code from the reference code — actually running the same testbench that we run in the HLS flow, with small modifications of course — so we can generate data for the interfaces exactly the same way the testbenches do and verify the data coming from those blocks. So, very easy system verification on the FPGA. And to add to that, we can do software/hardware co-verification. You can do that solely on the FPGA, but we also implemented our PCIe driver connection to a desktop PC, so we can run the same Kvazaar HEVC encoder on the CPU on the desktop, on the soft core on the FPGA, and simultaneously on the hardware. So we have multiple ways to verify that our hardware works exactly the same as it should in software — we were going for a one-to-one implementation, exactly the same results in hardware as in software.

And then the devices — another very fun thing with HLS is that it provides platform independence. We started with a Cyclone II; it was simply familiar to me from my studies, and I started with small implementations of the intra prediction on it. Then, noticing that we needed better FPGAs, we implemented the DCT and IDCT on an Arria II GX, and then moved on to a Cyclone V SoC that has those ARM processors. That board is still used in our teaching; we have incorporated it into the system design course that I'm partially lecturing and assisting with exercises. The students basically take the intra prediction hardware accelerator, incorporate it into the Cyclone V, and try to maximize the speed of HEVC encoding on the ARM, using the intra prediction as the hardware accelerator. As a side branch of that, we are moving from the Cyclone V SoCs to AMD Xilinx Zynq boards, again generating the RTL from Catapult. But on the main branch we continued — we needed bigger boards — to Arria V SoCs, to have fully embedded encoding on a single device, and then we moved on to Arria V PCIe, connecting those FPGAs to PCs to get the extra processing power we needed at the time. And the final product is the Arria 10 PCIe that we are still using, aiming at that fully embedded HEVC encoding and video encoding acceleration. To sum up, the design reusability is much better than with traditional design approaches. When a new platform was adopted, we just generated the RTL again without modifying the existing source code. The modifications to the source code were only about improving our design; when we had to move from device to device, platform to platform, we just needed to generate — re-optimize — the RTL again.
And the changes were driven by this being a research process and by the device availability that we, as a university, have. More on the platform independence: at the moment the current design is for Intel FPGAs, and we are planning to take the same design to an ASIC. That is our objective now. We are doing that in the SoC Hub initiative, and the idea is to migrate the existing HEVC encoder to ASIC within the SoC Hub project. So, a small introduction to SoC Hub and what it's about: it's an initiative to develop systems-on-chip, a joint effort to design new SoCs for 6G, AI, imaging, and security applications, and the idea is to boost the competence and expertise in SoCs, chips, and embedded systems in Finland, among the partners — companies and the university included. It is also powered by Siemens. There again you can see the QR code, so if you want to get more familiar with SoC Hub, please visit the website, or just click the link in the resources box — hopefully the link is there. And to give you a small view of what has happened in SoC Hub: there have already been two tapeouts, the Ballast and Tackle chips, the Headsail is upcoming, and the porting — the migration — of the Kvazaar hardware encoder is hopefully going to happen with the Bold design. So we have already had our first chips implemented and are still going strong. If you want to know more about this, please visit the website.

And finally, about the increased productivity: the advantages I explained add up to a compelling productivity increase over traditional RTL design. If you want to learn more about this, Sakari Lahti has a very nice survey paper about the state of high-level synthesis, and from the conclusions drawn across all the publications that were part of the study, HLS was reported to give more than four times higher productivity, in terms of system performance and development time, when compared to traditional RTL design, and the average development time of an HLS project is only one third of a manual one. So HLS does provide good results. To summarize our own increased productivity: as I explained, we had three major versions. The total number of commits in our repository for the HLS code was 480; that covers 41,000 lines of HLS code, and when we generated the RTL from that it came to over half a million lines of Verilog. As I explained, the manual part includes the seven and a half thousand lines of code for the glue logic and the FPGA interfacing. The effort is a hard one to pin down, because it has been a research process — we haven't been working on this 24/7, we've had publications in between and have been building our knowledge of HLS — but we estimated that about 21 person-months of effort were spent on the implementation. And just to conclude everything I've gone through here: without HLS it would have been very challenging to manage the project schedule and the overall design complexity. In other words, if I had started doing this with traditional methods, I wouldn't have finished my PhD thesis yet.

I think I've spent 45 minutes, so I will quickly go through the applications — what we did. We're not just talking about how to implement something, and we're not just running things in simulation; we actually implemented something that works.
So basically, the first proof-of-concept design that we had — our encoding rig — used one PC with two FPGAs, and we were able to encode three 4K30FPS cameras in real time, stream them over the network, and play them back on three different laptops. We were able to do all that with a single PC. And because we're using PCIe FPGAs, we can also use them in different platforms. Here's a showcase of what we have done on the server side, with an Intel processor: with two FPGAs we are able to do 4K at 120 FPS with good quality, and even on a lower-end device like the NVIDIA AGX Orin board, with a single FPGA attached, we can encode 4K at 30 FPS. But the PCIe setups are not everything, and if you want to know more about this, please check my publications — I have more than one of them talking about it, and these images are also included in them — but the encoder also supports networking, so 40 Gigabit Ethernet. This basically shows what we did in one of our publications, where we constructed a cloud encoding service with the two FPGAs we had at hand. But this is not limited the way it is with PCIe cards, where you might be able to fit two or three of them into a desktop PC before running out of space or lanes; with networking we can add as many servers, as many FPGAs, as we want, and we can scale our encoding speed accordingly. And that was it — that was my presentation. Here's my contact information if you want to contact me by e-mail or LinkedIn, or you can go to the ultravideo.fi site or our GitHub page. Thank you so much.

Thank you so much, Panu. That was amazing, very cool. So in a moment, Stuart and Petri will both join to help answer questions alongside Panu. We also have a few poll questions and would appreciate your feedback on the presentation as well as future topics, so we'll go through those. And as I said before, all of the resources from the presentation are listed in the resources widget, but if you were using your camera to grab those QR codes, that's great. As I said before, this is being recorded, and you'll receive a link to the recording a little later today. And if you have any further questions, please submit those, and let's get started.

OK, cool. So let's start off with an easy one. Hi, Demo, thanks for the question: was this designed with C, C++, or SystemC? It was done with C++; we didn't use SystemC at all. Basically, the reference algorithm is implemented in C, so I would say most of the code is actually C, but we also incorporated C++ features. Very cool. Okay, probably as a continuation of that — I think we've seen this question repeated twice — is the source code for this project available, or open for a license? Not for the hardware part, but for the software part, of course, and the publications explain what has been done. But the hardware code is not public, unfortunately, at least not yet. To open it, there's most certainly nothing stopping us, but we would just need to do that. The software version, of course, is open source. Thank you. OK, let's see, we've got a good one from Gregory at Samsung: regarding the algorithm changes, did any of those affect the compression quality, i.e. bit rate, signal-to-noise ratio? If so, how did you balance an algorithm change for speed and complexity against lowered compression performance?
A good one — the RTL designer is constantly struggling with that trade-off in deciding what the implementation is going to be. Yeah, it's hard to manage what to do and how to get the most out of the FPGA. We were going more for speed. So we're — can you hear me? Yep. Good, sorry, I wasn't hearing anything for a while. The design decisions were basically made based on how much we implemented from the original source code. Running the source code tells us what the compression ratio would be, so implementing more coding tools of course gives us a better compression ratio and better quality. It's a hard matter to fully open up here — what the design choices were for the final implementation — but it's basically based on the presets of Kvazaar, our HEVC encoder, and we were going more for speed than for the best quality or the placebo-style presets; basically the real-time encoding presets.

Okay, super. Let's see what else we have — that one is answered. Are you going to use the Zynq board for the system design course? We want to use it, yes. That's mostly related to the teaching, but we haven't had time to fully port it to the board yet. It's kind of working, but we haven't had enough time to fully implement it on that yet. Okay, sounds good. We have more of those boards; we are running out of the Cyclone V ones — they're breaking down on us. Okay. The question here always comes up: did you compare the area between the high-level synthesis design and a traditional Verilog design? I think probably the answer is we wouldn't have got the job done with Verilog. For single algorithms, yes, of course: the related work gives the area figures they had for single coding tools, and of course we have some comparisons to fully implemented ASIC designs, which were maybe done in simulation. But to showcase it: this is the only known HLS-implemented HEVC encoder on FPGA, fully implemented. Cool. OK, how complex was it to optimize or fix any timing issues in the path until you had the necessary performance or latency? It really depends. Adding more parallelism — if there's a timing issue, add more parallelism and you get more speed — trying to figure out what the critical paths are, and going for better devices: some devices are just not fast enough to do this. That's why we progressed from lower-grade FPGAs to higher-grade FPGAs, and that of course increased the speed substantially. Very good. Good question here — hi Matt: do you foresee any challenges moving to ASIC? Would more handwritten RTL be required, or will the design target an ASIC platform with essentially the same source code? I think it's going to be the same source code; I don't see any problems for us. Of course we're using DSPs — we're using DSPs because we're running out of FPGA logic, and there are a lot of DSPs available on the FPGAs — but on an ASIC it really just depends on how much area the final design is going to take and whether we have the money to do that. So we might need to optimize something, but that's fully doable in HLS itself. The interfacing is pretty simple — it's just memories. Just memories. Simple. It's simple, very good. OK, this might be an interesting one: can you briefly describe the additional features compared to Vivado HLS? Unfortunately I haven't used Vivado HLS, because I did this for Intel, right?
From the research perspective, it's a problem that we're not using a bunch of HLS tools to compare how they do against each other. But if my understanding is correct, with Vivado HLS you need to use Xilinx boards, whereas with Catapult we can target both of those. Even the Intel side has some HLS compilers of their own, but aren't those more for OpenCL? I'm not that familiar with them — that's a problem. But then again, if you read my PhD thesis, my point was that the application was so complex that porting it between different HLS tools was not realistic; what we showcase is that whatever tool you use, I'm sure you can do better with HLS than with traditional design. Mm-hm, sounds good. A question just came in from Carsten: did you compare the results against Cadence C-to-Silicon? Probably not, because C-to-Silicon doesn't do C. As I said, Catapult HLS was the only HLS tool we used for generating our RTL. Very cool. Okay. We're seeing a lot of questions from Nandini — I don't know if you can see them — specifically about the deblocking filter hardware architecture: how can I speed up the architecture? And also: I want to reduce the number of access cycles between the memory and the architecture of the deblocking filter. So actually, for the deblocking part, the thing is that we're doing just intra, and deblocking belongs to the decoding side. There's basically no feedback loop from the decoded picture on the intra side, so we didn't implement deblocking, because we're not supporting inter encoding yet. The deblocking, for us, comes from the decoder itself, so it's not important to do it on the encoding side. That makes sense. Nandini, I hope that answered some of the questions you've got associated with deblocking; if not, please feel free to rephrase them, but he didn't do anything with deblocking. I mean, deblocking is usually windowing and line buffering, so it really comes down to this: if you're keeping up with the pixel rate, then with the filter you apply on the window — or the various window sizes, depending on the blocks being thrown out from decode — you're pretty much free to use whatever algorithm you want to smooth that behavior, right? And this is part of the fun of doing video. Okay, what else do we have here? What is the usage of Kvazaar to produce test vectors and stimulus for verification of the design? I'm guessing the broader question is how verification was done — that might be the more relevant part. You've got C code, you've got RTL: how do you know it works, other than just throwing it on the FPGA board? Well, as I said, HEVC is a fun thing in that you can do almost anything on the encoding side; it just affects the compression ratio and the efficiency, so even if you do something badly, it still mostly does its job. Our idea for verifying the single units was to verify them exhaustively: sending random data and verifying that our design does exactly the same thing as the reference algorithm does. Then, when we go to the bigger parts, we just connect those components and run on the FPGA — running multiple sequences, taking the data from the hardware and running the same thing in software at the same time, and just comparing the two, because the implementation in hardware is one-to-one with the software, so we can do that. And the same applies when we connect the PCIe FPGA to a bigger design: we can still do that — we can hardware-accelerate with it, or we can verify it. We can do all of that.
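As a rough sketch of the unit-verification idea described here — pushing the same random stimulus through the reference function and through the design under test and comparing the outputs — the functions below are simple stand-ins for illustration, not the actual Kvazaar or project interfaces; in the real flow one side would be the untouched Kvazaar C function and the other the HLS model or the Catapult-generated RTL driven through co-simulation:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Stand-ins for the two implementations being compared against each other.
static void reference_quant(const int16_t *coeff, int16_t *level, int n) {
  for (int i = 0; i < n; ++i) level[i] = static_cast<int16_t>(coeff[i] / 8);
}
static void dut_quant(const int16_t *coeff, int16_t *level, int n) {
  for (int i = 0; i < n; ++i) level[i] = static_cast<int16_t>(coeff[i] / 8);
}

int main() {
  const int N = 16 * 16;
  int16_t coeff[N], ref_out[N], dut_out[N];

  // Random stimulus, repeated many times, outputs compared sample by sample.
  for (int test = 0; test < 10000; ++test) {
    for (int i = 0; i < N; ++i)
      coeff[i] = static_cast<int16_t>((std::rand() % 65536) - 32768);

    reference_quant(coeff, ref_out, N);
    dut_quant(coeff, dut_out, N);

    for (int i = 0; i < N; ++i) {
      if (ref_out[i] != dut_out[i]) {
        std::printf("Mismatch in test %d at coefficient %d\n", test, i);
        return 1;
      }
    }
  }
  std::printf("All tests passed\n");
  return 0;
}
```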
Okay, neat. I realize we've gone a little past the top of the hour. Matilde, let's see if there's anything else coming in. It's fine — if we have some more questions, we can get through them. Let's take a look. VVC contains neural networks: do you think the implementation of the codec in FPGA is very much needed, or will hardware/software co-design be required? Unfortunately I'm not that familiar with that — I haven't gone over to the VVC side yet, though I do know about it. And if you want to know more about that, we're working on the UVG version of an open-source VVC encoder, but on the hardware side we haven't had any discussions about it yet, so unfortunately I can't answer anything about that. All right. I think there was one question that came in really early, which was: can you explain the functionality of each module in the HEVC encoder in brief? Probably not in brief, but if you want to know more, please read my thesis and my publications; everything is explained in full detail there. That sounds very good. Okay. There's a question here that I will answer live. Hi Richard, thanks for the question: does Catapult handle mallocs in C++? No, you can't do dynamic memory allocation in synthesizable code; you have to specify a specific array size and work with that. Your follow-on question from that — can I map arrays onto Intel embedded memories such as M20Ks and such like? Yes: memories, be they Xilinx, Intel, or your own ASIC memories — one-read one-write, dual port, single port, read-write resolution and all that kind of stuff — are all handled. So I hope that covers that. And we utilize those a lot, as I said; from the hardware blocks' point of view it's exactly that — just memories. Just memories, yeah. Okay. There's a question here: what's the state of the art in HLS verification — verifying C HLS models without giving up all the techniques and tools established in the industry for SystemVerilog, block-level RTL verification, including UVM, coverage-driven verification, constrained random, assertion-based, et cetera? There are a number of tools and technologies. A couple of years ago we created a product called Catapult Coverage. Catapult Coverage does HLS-aware — so instantiation-aware, indexing-aware — coverage: basically your statement coverage, your branch coverage, your expression coverage, as well as borrowing from SystemVerilog in terms of cover points, bins, and crosses for functional verification. Constrained random is interesting — there was some work, of course, that went on with portable stimulus and such like, and this is actually one of the places where SystemC and MatchLib start to come into the picture: being able to use MatchLib to have a pin-level view that actually connects to UVM and allows you to use UVM basically to drive the DUT. We've also seen customers use UVM with the C++ as the DUT and still collect coverage from that. So there's a lot more — talk to us and we can tell you more about the HLS verification solutions that Siemens is providing as well. Okay, as I talked, two more came in. Let's see — this is a question on VVC, so I think we're going to take it. OK: VVC has a larger block size compared to HEVC — what's the benefit of having a larger block size? Well, compression. Most likely the compression will just be better, because you're compressing bigger blocks, right? OK.
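Picking up the malloc question above: since dynamic allocation is not synthesizable, a worst-case size fixed at compile time is used instead, and the resulting array can then be mapped to registers or to an on-chip memory (for example an Intel M20K) from the tool's directives. A minimal sketch of that idea — the function and constant names are made up for illustration:

```cpp
#include <cstdint>

// Worst-case block size fixed at compile time instead of a runtime malloc.
constexpr int MAX_CTU_PIXELS = 64 * 64;

void process_block(const uint8_t *in, uint8_t *out, int num_pixels) {
  // Instead of: uint8_t *buf = (uint8_t *)malloc(num_pixels);
  // use a fixed-size array, which the HLS tool can map to registers or a RAM.
  uint8_t buf[MAX_CTU_PIXELS];

  for (int i = 0; i < num_pixels; ++i) {  // caller keeps num_pixels <= MAX_CTU_PIXELS
    buf[i] = in[i];
  }
  for (int i = 0; i < num_pixels; ++i) {
    out[i] = static_cast<uint8_t>(255 - buf[i]);  // placeholder computation
  }
}
```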
The next one I'm not quite sure about: "Is there any architecture possibility, is there, like in real RTL, how we planned?" Ranga Babu, I'm not sure what your question is there, sorry about that. Okay. Click — that one has been answered. Click — that one has been answered. OK, that drops us down to one last one in the bucket: I want to reduce the number of access cycles between the memory and the architecture of a deblocking filter, because we know that the deblocking filter is more computationally... OK, Nandini, this is again the deblocking filter. Yeah, give us a call — we can probably help, but not in this session, as they say. OK, looks like we're out. So thank you, Dr. Panu, Stuart, and Petri, for taking the time to answer all these questions that we received. This is going to conclude today's webinar. We will leave the event running for just another minute to give you a chance to download any of the resources, and we'll leave it at that. Thank you so much for taking the time to be here with us today, and we look forward to seeing you at our next session. Thank you.