An Open-Source FPGA-Optimized Out-of-Order RISC-V Soft Processor (2019) [pdf] (u-tokyo.ac.jp)
239 points by varbhat on Jan 5, 2021 | 74 comments


To push this up from the comments: if you're interested in why this is important or what the authors are trying to do, the PDF where they describe their approach and architecture is really interesting.

http://www.rsg.ci.i.u-tokyo.ac.jp/members/shioya/pdfs/Mashim...


If I understand this correctly:

Typically, I think of an FPGA as something used to accelerate specialized operations. But sometimes, in the middle of one of these specialized operations, you might want to do something more general, like run a network stack, without returning to the CPU. A soft processor like this allows you to run an ordinary network stack (with ordinary code) inside the FPGA.

Is that right?

I thought one of the things people used FPGAs for was accelerating network stacks, so I don't quite know why you'd want to use a soft processor for that. But it does make sense that you'd want to be able to run ordinary code in an FPGA (as part of a larger FPGA operation that is not ordinary code).

EDIT: Also, I don't understand this statement: "for example, one main compute kernel, which is too complex to deploy on dedicated hardware, is run by specialized soft processors". What do the authors mean "too complex to deploy on dedicated hardware"?


FPGAs are frequently not connected to a dedicated CPU at all, and even when they are, the link may have lower bandwidth or higher latency than you would like. In these cases you usually have a bunch of management and logic tasks which are better suited to a CPU (basically anything with a large number of different serial steps and complex control flow will probably synthesize poorly directly to an FPGA: this is probably what the authors described as 'too complex for dedicated hardware'), so you synthesize a CPU in the FPGA fabric, which usually has a direct memory map to whatever registers you need in the FPGA design to accomplish your goals.

This is common enough that basically every FPGA vendor also sells FPGA SoCs which have a hard CPU attached to the fabric, and if you are using one of those then you will generally not synthesize a CPU in the fabric, because it'll usually be slower and less power efficient than the hard CPU you have. But that hard CPU also isn't free, and if your CPU compute requirements are not particularly high then the soft CPU might be more efficient for your use case (or a hard CPU may not be an option in the range you need).


It's definitely common to accelerate specialized operations without the overhead of a general processor, but it's also possible to effectively use them as a much more flexible microprocessor when you need it.

If you go looking for a microcontroller for your project, you have to choose among what is available. Maybe a given microcontroller has two hardware UART interfaces and one SPI interface. If I don't need any UARTs but instead need a CAN bus interface, that microcontroller won't work for me. Sure, I can bitbang the protocol on GPIO ports, but that uses up a lot of the limited processing power on the microcontroller... which usually means a more expensive microcontroller.
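
To see why bitbanging eats CPU time, here's a minimal C sketch of a bit-banged SPI byte write as a stand-in (the register addresses and delay loop are made up purely for illustration; a real part would have its own GPIO map):

    #include <stdint.h>

    /* Hypothetical memory-mapped GPIO set/clear registers; the addresses
       are placeholders and depend entirely on the part you pick. */
    #define GPIO_SET (*(volatile uint32_t *)0x40020018u)
    #define GPIO_CLR (*(volatile uint32_t *)0x4002001Cu)

    static void delay_half_period(void)
    {
        /* Busy-wait to pace the bus; the CPU does nothing useful here. */
        for (volatile int i = 0; i < 100; i++) ;
    }

    /* Shift one byte out MSB-first, toggling the clock pin by hand. */
    static void spi_bitbang_write(uint32_t sck_mask, uint32_t mosi_mask, uint8_t byte)
    {
        for (int i = 7; i >= 0; i--) {
            if (byte & (1u << i)) GPIO_SET = mosi_mask; else GPIO_CLR = mosi_mask;
            GPIO_SET = sck_mask;     /* clock high: peripheral samples MOSI */
            delay_half_period();
            GPIO_CLR = sck_mask;     /* clock low */
            delay_half_period();
        }
    }

The core is fully occupied for the entire transfer, which is exactly the processing power you would rather spend on your application.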

There is a threshold that you can cross where a small FPGA is cheaper than a microcontroller that has enough pins and processing power for your application. This does come with an additional upfront design cost of also writing (but more often integrating) the soft cores but sometimes that makes sense.

Sometimes peripherals just don't exist at the price point you need. Try to find a microcontroller that has an MMIO controller for under $5 (I probably couldn't do it at under $10, but I haven't gone looking recently); it's rare that they're needed, but sometimes a design requires one.

There has also been a lot of recent interest in doing formal verification of hardware logic. A lot of microprocessors, and even the full CPU in whatever device you're reading this on, have a lot of undocumented black boxes and undefined behaviours, both of which prevent that verification from being meaningful.


>What do the authors mean "too complex to deploy on dedicated hardware"?

It costs time and effort to translate a software function to HDL/FPGA, so it's not always worth doing. For example you could do TCP/IP in hardware, but unless you have particular performance requirements (say HFT) you're probably better off with a soft core and a tested software TCP/IP stack.

Also each feature translated to hardware takes up space in the fabric. When you crunch numbers on an FPGA, it's ideal if you can lay out the entire sequence of operations as one big pipeline, so you can keep throughput as high as possible. Sufficiently long or complex sequences may not translate efficiently to FPGA.


Generally softcores are used for command and control of the FPGA. For example, if you have a framegrabber and would like to adjust the shutter speed, fps, etc., you would create a softcore and run normal firmware to set up the registers based on the user's choices.

You can think of the FPGA as an ASIC and the softcore as controlling that ASIC. The hot data path and heavy processing are done in the FPGA fabric, and the processing options for that "ASIC" are configured by the softcore firmware.
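
As a rough sketch of what that firmware might look like in C (the register map and addresses are invented purely for illustration; a real design would define them to match the RTL):

    #include <stdint.h>

    /* Hypothetical memory-mapped control registers exposed by the
       framegrabber logic in the fabric; the addresses are made up. */
    #define FRAMEGRABBER_BASE 0x43C00000u
    #define REG_SHUTTER_US (*(volatile uint32_t *)(FRAMEGRABBER_BASE + 0x00))
    #define REG_FRAME_RATE (*(volatile uint32_t *)(FRAMEGRABBER_BASE + 0x04))
    #define REG_CONTROL    (*(volatile uint32_t *)(FRAMEGRABBER_BASE + 0x08))
    #define CTRL_ENABLE    (1u << 0)

    /* Softcore firmware: translate the user's choices into register writes. */
    void framegrabber_configure(uint32_t shutter_us, uint32_t fps)
    {
        REG_CONTROL   &= ~CTRL_ENABLE;  /* pause capture while reconfiguring */
        REG_SHUTTER_US = shutter_us;
        REG_FRAME_RATE = fps;
        REG_CONTROL   |= CTRL_ENABLE;   /* resume with the new settings */
    }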


Making a custom chip requires one-time mask fees of up to $30 million for a 5 nm chip. If your volume is high, you can amortize that cost; if not, you probably go to an FPGA. An FPGA has a high per-part cost but no custom mask costs, and you can reprogram it in a few hours or days instead of waiting the ~2 months it takes for a new custom chip to be manufactured.

A high end FPGA already has hardened CPU blocks, USB and PCIE interfaces, and lots of other things built in. Then it has a large area of generic reconfigurable logic that you can customize to do whatever you want.

This reconfigurable logic will not be as fast as a custom chip, but it is still far faster than software and can be used to implement your own CPU (assuming you don't want to use the hardened CPU, or got an FPGA without one).


Ok, we've changed the URL to that from https://github.com/rsd-devel/rsd, which doesn't give much background.


Is there a quantification of "high performance"?

It will obviously be much lower than the IPC of an actual high performance CPU (modern x86-64), but how big is the difference? And how does it compare to typical mobile processors?


It seems broadly comparable to an ARM Cortex-A9 in width and reorder depth, but it seems to be using a lot more pipeline stages to do that. Probably that's because they had fewer engineer-years to invest in the design than ARM did. It looks like it's using pure automatic synthesis, so the clock speed will tend to be lower on equivalent process nodes too. The big question is how accurate the prefetcher and branch predictors are. It's entirely possible for a 2-issue out-of-order core to be slower than a 2-issue in-order core if the latter is significantly better at speculating.

Still, that's a good bit of work they should be proud of putting out there, and I hope other people build on it.

EDIT: Oh, wait, they don't mention register renaming. Hmm, well, I guess no speculating over multiple iterations of a loop then.

EDIT2: No, the linked PDF mentions a rename unit. http://www.rsg.ci.i.u-tokyo.ac.jp/members/shioya/pdfs/Mashim...


I believe the closest measurement would be in Table VII on page 8 of the paper: http://www.rsg.ci.i.u-tokyo.ac.jp/members/shioya/pdfs/Mashim...


"In comparison to the BOOM, the RSD achieved 2.5-times higher DMIPS and 1.9-times higher DMIPS/MHz", which should be comparable to ARM cores around 4 years ago.


Note that BOOM is now on v3[1], which claims to be 3.93 DMIPS/MHz, or about twice RSD.

[1] https://people.eecs.berkeley.edu/~krste/papers/SonicBOOM-CAR...


RV32IM, 2.04 DMIPS/MHz, 95.3 MHz, 15K LUTs, 8K LUT FFs, 6 BRAMs on Zynq 7020.
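
For an absolute figure, just multiplying the two quoted numbers:

    2.04 DMIPS/MHz x 95.3 MHz ≈ 194 DMIPS for this Zynq 7020 build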

I'm curious if this will work on a Lattice ECP5 - I'm not really sure if Synplify supports SystemVerilog to the same degree as the Xilinx tools. The ECP5 is interesting because it's a $10 FPGA.


FWIW Google is working on improving the support for SystemVerilog in open source tools.

On the developer tooling side, there is https://github.com/google/verible for linting, code formatting and code indexing.

On the actual compilation side, there are https://github.com/alainmarcel/Surelog and https://github.com/alainmarcel/UHDM, which are being coupled with open source tools like Yosys to allow targeting Xilinx 7 Series and Lattice ECP5 FPGAs with fully open source flows such as symbiflow.github.io


At a glance I think it would fit although I might be thinking of the bigger ECP5s which are in the same price bracket as some Zynqs


It's targeting small FPGAs, comparisons to full custom designs wouldn't be very fair.


"high-performance" is relative, so they need to compare it to something at all.


I would imagine it is high-performance within its own domain: soft microprocessors. There isn't much danger of something like this being used outside of domains where it is relevant, so just calling it "high-performance" gets the message across fairly well to anyone who would actually need it.

I can't comment on how well it performs myself without testing it, but a quick skim of the paper reveals that it apparently performs well in comparison to its rival open-source out-of-order soft processor.

Comparing soft processors to other soft processors is fairly easy if they can both run on the same hardware, but comparing them to real silicon is inherently kind of meaningless, as they don't really compete at the moment, and the performance of the design in absolute terms will depend on the FPGA it is implemented on. Nonetheless, you could compare the raw numbers presented in the paper for curiosity's sake and see that indeed, it isn't very fast compared to modern silicon processors.


Sorry, what I meant was: They need to present benchmark results (preferably against other soft-core processors, of course) to claim the "high-performance" attribute. A single row of a synthetic benchmark against only two other contenders is a little ... lacking.


This has 5 backend pipelines, which I would say is getting there speed-wise. I think AMD Zen has 10 execution pipelines to play with, although the exact structure is usually not commented upon so I don't know how long they are.

It's 32 bit so it's not desktop-class necessarily but this should blow a microcontroller out of the sky, for scale.


> This has 5 backend pipelines, which I would say is getting there speed-wise.

It's pretty easy to slap down pipelines. What is far harder is keeping them all fed and running without excessive stalling and bubbles whilst avoiding killing your max frequency and blowing through your area and power goals.


If I understand correctly, most designs effectively #define the number of frontend and backend pipelines. That means you are free to make as many as you like, for as much performance as you like, but some parameters might scale very badly outside what the core was designed for. For example, I would imagine the number of gates and power use to go up to unfeasibly high levels above the quoted numbers.


It's a bit harder than that. You have to worry about forwarding results from one pipeline to another bypassing the registers if you want to avoid having to deal with stalls waiting for results. The transistor cost of the bypass network grows as the square of the number of pipelines so it can be pretty significant, potentially being larger than the cost of the execution pipelines themselves.
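
To make the quadratic growth concrete (back-of-the-envelope, assuming a fully bypassed design with two source operands per pipe): each of the N results has to be forwardable to each of the 2N operand inputs, so you get roughly 2N^2 forwarding paths:

    2 pipes: 2 results x 4 operand inputs = 8 bypass paths
    5 pipes: 5 results x 10 operand inputs = 50 bypass paths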

Many modern designs aren't fully bypassed and involve clusters of pipelines to manage this. IBM's recent Power chips and Apple's ARM cores are particularly known to do this.


Does optimal pipeline topology vary with the specific workload, or is it more dependent on the instruction set semantics and result visibility?

Is this something that could be automatically optimized via simulation?

Is it something that could be made dynamic and shared across a bunch of schedulers, so that cores could move between big/little during execution?


It's highly dependent on both workload and instruction set.

I'm sure it could be automatically optimized in theory, even without the solution being AI complete, but I don't think we have any idea how to do it right now.

No, not unless you're reflashing an FPGA. You'd have better luck sharing subcores for threadlets I think.


I've been wondering if it could make sense to put a small ALU with each register (when they are not virtual/renamed but maybe then too?). This would allow instructions that use the destination register as a source to only use one read port, potentially allowing higher IPC. Has anyone looked into this and if so what did the analysis show?


It has two integer pipes. This core is similar in performance to the dual-issue in-order Cortex-M7, also 2 DMIPS/MHz. I guess I view this core as a technology demonstration that you can have a practical (not too large, not too low a clock) out-of-order processor in an FPGA.

On the other hand, I think there is still space for a small in-order dual-issue FPGA RISC-V with 2 DMIPS / MHz performance.

Actually there is one:

https://opencores.org/projects/biriscv

1.9 DMIPS / MHz..


Do you really want pipelining on a microcontroller though?


Why not? I think the Cortex-M3 has a 3-stage pipeline; obviously I would not expect the $0.01 Chinese ones to have one.

It's worth saying here that a big OoO CPU is pipelined differently from a small/old RISC processor. Even in discussions about compiler optimizations people still use terminology like "pipeline stall", but a modern CPU has a pipeline that handles fetching an instruction window, finding dependencies, doing register renaming, and execution; that pipeline is not like an old IF->ID->EX->MEM->WB and it won't stall the way an original Pentium (P5) did. The execution pipes themselves have a more familiar structure.


Not sure about Cortex-M3, but I can confirm that the Cortex-M4 has a pretty basic 3 stage in order pipeline that gets flushed on branches. So unless there are caches between the core and memory (I think some STM32 have that), that CPU is still trivially deterministic.


Yes, the M4s from ST have the ART Accelerator (a flash cache), which makes them less deterministic (but much faster) for flash access. See https://www.st.com/en/microcontrollers-microprocessors/stm32...


I am glad they are using SystemVerilog. It is hard for me to understand why SiFive chose Chisel as their RTL language; I think that quietly slows down RISC-V adoption. I honestly tried to understand the advantages of Chisel, but I cannot see any. There is an answer on Stack Overflow regarding Chisel's benefits, and it is just embarrassing [1].

[1] https://stackoverflow.com/questions/53007782/what-benefits-d...


I wrote a long blog post about the VexRiscv RISC-V CPU and how its design methodology is radically different than traditional RTL languages.

The VexRiscv is written in SpinalHDL, which is a close relative of Chisel.

The advantage of SpinalHDL/Chisel is that it supports plug and play configurability that’s impossible with languages like SystemVerilog or VHDL.

You can read about it here: https://tomverbeure.github.io/rtl/2018/12/06/The-VexRiscV-CP...

That said: there must be at least 50 open source RISC-V cores out there, and only a small fraction is written in Chisel. I don’t see how the use of Chisel has held back RISC-V in any meaningful way.


I really want to bonk on the head anyone praising SystemVerilog. This collective undertaking of corrupt committee members locked us into this horrible language and these grossly outdated tools, forever. There is not the slightest bit of good in that language, neither for design nor for verification.


That's a big call. Care to support it?


There are tons of people using SystemVerilog for RISC-V. The majority of the work done is with SystemVerilog and not Chisel. Having lots of cores in Chisel, SystemVerilog, and many other languages (VHDL, Bluespec, and so on) is a huge benefit for RISC-V.

SiFive values programmability above everything and for that Chisel is pretty clearly an advantage.


Please note I am a software dev, not a hardware guy; I just got my first FPGA during the holidays and am just beginning to play with it.

> There is an answer on Stack Overflow regarding Chisel benefits, it is just embarrassing [1].

I don't understand what is embarrassing about the answer? As a software guy, the above answer makes sense to me. Some problems you want to use C (or similar) for, and some problems you want to use a scripting language for, and then again sometimes the right tool is Erlang, Rust, or Go...

But like I said, that's my software guy perspective, so I am wondering what I missed?


I just read the answer as well. It's not "embarrassing", but it basically doesn't answer the question. Instead, it argues that the question is equivalent to asking what's the point of Python vs. C.

So in the end, the answer doesn't provide any specific answer regarding SystemVerilog and Chisel. All I found is one mention of negotiating parameters, which Verilog doesn't do. I would have loved to hear a lot more about examples of what Chisel makes more convenient than SystemVerilog.


SiFive really didn’t choose chisel as much as created it. It’s basically a company created by Krste and some former graduate students.

I like Chisel as a concept, but the learning curve is too high: Scala is kind of a mess, and when you add a custom DSL plus lots of functional programming on top of a different hardware design methodology, it becomes overwhelming to your typical CE/EE, who probably doesn't have that exposure. I simply ran out of time to learn it properly.

It's also a second-class citizen when it comes to RTL tools. The verification engineers have to work with the generated Verilog, and it looks like a nightmare. There have been some improvements recently, but the engineers knowledgeable enough to work on this stuff seem pretty bandwidth-constrained.

The biggest headwind to chisel is the breadth of knowledge required to work and improve it IMO.

I’m really hoping that pymtl gets a firrtl backend soon. Python has a pretty decent record for building DSLs.


I don't have any relation to Chisel, but the answer basically said you can take advantage of the Scala ecosystem to create hardware designs. That includes the ability to write arbitrary Scala to generate parameterized designs and the ability to create Chisel libraries and publish them as if they were regular Scala libraries. If you don't like Scala (like me), none of this matters for a "from scratch" design.


Google built one of their TPUs in Chisel [1].

TL;DW: Chisel is beautiful/fun to write in, with a definite productivity bonus, but it has a pretty large learning curve and had a much greater verification cost, partly because it's an HLS (most have that problem) and also because of the lack of tooling. Both of those costs are gradually being reduced (though in my opinion, not enough to keep verification from being a PITA).

[1] https://www.youtube.com/watch?v=x85342Cny8c


Facts:

Chisel is NOT HLS at all. Chisel is one of many languages that generate HDL; that is, you write code that builds a circuit, whereas in e.g. Verilog you just describe the circuit (Verilog has a limited ability to do this dynamically with generate statements).

An HLS is one that raises the abstraction level. Almost all of them today allow you to write "lightly" annotated C[++] that gets translated into a circuit. Almost universally, the timing relationship isn't explicit at all.
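
For a flavor of what that annotated C looks like (a Vivado-HLS-style sketch; treat the pragma and the exact kernel as illustrative, not canonical):

    /* A dot-product kernel written as lightly annotated C.  The tool infers
       the datapath and schedule; timing is not explicit in the source. */
    int dot_product(const int a[64], const int b[64])
    {
        int acc = 0;
        for (int i = 0; i < 64; i++) {
            #pragma HLS PIPELINE II=1   /* ask for one loop iteration per clock */
            acc += a[i] * b[i];
        }
        return acc;
    }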

Opinions:

All existing HDLs and HLSes are terrible, and there's fertile ground for creating something that really advances the art. Personally I'm looking for something that is more productive than HDLs, but with more control than an HLS. Some promising examples: Handel-C, Google's XLS (assuming the promised development happens), and Silice.


What is an HLS?


High Level Synthesis


The gif of the Konata pipeline visualizer seems to show pretty much one instruction per cycle most of the time... Many parts of the trace show as low as 0.2 instructions/cycle..

Wouldn't we expect much higher numbers (more parallelism) considering the number of frontend/backend pipelines?


Spoiler alert: even your x86 desktop cores struggle to hit more than 1 IPC on most workloads. That's just the reality of cache misses, instruction dependencies, branch prediction, and more.


All those abbreviations on the block diagram make it very difficult to interpret. A key map in the image would be great, or at least some markdown directly below it.


For people in the industry: how likely are we to get RISC-V servers/VMs/laptops/desktops in the next 5-10 years? You know, go on a PC Part Picker and assemble a RISC-V desktop, for example.


I'm not in the industry - just dabbling, but SiFive have a mini-ITX with all the expected interfaces[1] available for pre-order. The Mouser link lists March for the initial deliveries.

The Getting Started Guide indicates it comes with a micro-SD with a bootable Linux image, but mostly goes on to describe console access. That said, it does recommend a GPU, but it's unclear whether it can boot to a graphical desktop out of the box.

[1] https://www.sifive.com/boards/hifive-unmatched


Can you go on PC Part Picker and build an Arm desktop yet?


Part of ARM's licensing generally prevents socketed CPU chips.


I haven't tried; I assume not. But aren't hardware fashions changing rapidly now, compared to 2005-2020, when x86 was everywhere I mentioned?


Sure, order a Raspberry Pi 4 and an external USB SSD ;)


So, fun fact: I was recently curious how the RasPi4 compared to my 15-year-old Dell Precision M65 (Intel Core Duo T2400).

I ran the sysbench CPU test on each, and the M65 trounced the RasPi4, being over 3x as fast in single-core (and about 1.5x as fast in multicore, which makes sense with the T2400 being 2-core to the RasPi's 4-core).

So the RasPi4 (a cheap-class SoC) remains slower than a performance-class PC from 14 years prior. Moore's law certainly helped in performance-per-watt and performance-per-dollar, but if pure performance is what you want... I don't think there's anything available to consumers outside of Apple's offerings.


Well done to the authors for making a surprisingly readable core for once.


I guess somebody will implement the 64-bit GC extensions to run Linux on it.


Ahem, you need full supervisor support as well, with virtual memory (page table walkers, TLBs, etc). And atomics. And floating point (etc).

This is all non-trivial and would make the design ~ twice as big and likely impact the cycle times in a rather sad way. But possible of course.

Anecdata: Full RocketChip (RV64GC) built for ECP5 85F comes in at 54k LUTs (out of 84k) and clocked at 14.8 MHz. However, the cycle time is related to the FPU which assumes retiming which yosys can't do. Without the FPU it's a more reasonable 50-60 MHz.


I don't think it has an MMU (didn't see a TLB or table walker in the source), so a lot more work is needed than just the extra instructions.


GC extensions? What does 'GC' stand for?


G is short for IMAFD plus the Zicsr/Zifencei extensions (base integer ISA, multiplication, atomics, floats, doubles), and C is for compressed instructions.

See https://en.wikipedia.org/wiki/RISC-V#Design .


Ah sorry I get it - ISA extensions - I thought it was compiler extensions.


I know Yosys has limited support for SystemVerilog, but has anyone had success synthesizing this with a FOSS toolchain? If not, what features are missing?


Is anyone working on low power open risc-v implementations? (Ideally including manufacturing, i.e. a physical device that I could buy/build on top of)


Seeed has some RISC-V Arduino-alikes [1], but my memory is that they have teased a more substantial, Linux-capable board here for Q1.

1: https://www.seeedstudio.com/SeeedStudio-GD32-RISC-V-Dev-Boar...



One of the two ULP (Ultra Low Power) co-processors on ESP32-S2 chips is based on RISC-V. Boards using ESP32-S2 are available for relatively little, e.g. see https://www.mouser.com/ProductDetail/Espressif-Systems/ESP32....

It's admittedly a niche use-case but it's an option for playing with RISC-V hardware...


If the FPGA is a closed design, are we really that much better off?


Even if the FPGA design were fully public, you won't necessarily find the required fab around the corner, so what's there to gain from the FPGA design? If you're seeking trust, you'd need to establish trust in the whole chain: design, mask production, chip fab, transport... that's a tall order.


There are fully reverse-engineered FPGAs and open source toolchains to drive them.

The real question is: Is there anything hidden in the silicon? That's something you can only solve by owning your own fab - the US military approach.


The argument here usually is: with a fixed silicon chip, the vendor can hide a backdoor in various locations and have it triggered by various events (e.g. a particular sequence of incoming ICMP packets overwrites the first byte of the response with the contents of some register). With an FPGA, the vendor can't really know where a particular register is located, or where incoming packets are processed, as that is highly dependent on the synthesised CPU design and can even be non-deterministic.

This does not mean that there is no way the vendor can backdoor the chip you are getting, but it does narrow the possibilities significantly.


Good luck hiding an effective backdoor in an FPGA. The attacker (the FPGA fab) has no idea of how it's going to be programmed.


The usual thing the military worries about is a "kill switch" (a very unlikely sequence of bits) which disables the hardware completely. The idea is that at the beginning of a war, the kill signal is broadcast by the enemy by every means possible which brings all your electronics to a halt.

This can be hidden in an FPGA - for example attached to the input pins or SERDES - without needing to know anything about the application.

(Article: https://spectrum.ieee.org/semiconductors/design/the-hunt-for...)


Triggering a malfunction is incredibly easy compared to a proper backdoor. A kill signal could also be injected through side channels e.g. a power line, and the kill mechanism could be implemented in many other semiconductors than an FPGA.



