In this lab, we will explore the Chipyard framework.
Chipyard is an integrated design, simulation, and implementation framework for open-source hardware development, developed here at UC Berkeley. It is open-sourced online and is based on the Chisel and FIRRTL hardware description libraries, as well as the Rocket Chip SoC generation ecosystem. It brings together much of the work on hardware design methodology from Berkeley over the last decade, along with useful tools, into a single repository that guarantees version compatibility between the projects it includes as submodules.
A designer can use Chipyard to build, test, and tape out (manufacture) a RISC-V-based SoC. This includes RTL development integrated with Rocket Chip, cloud FPGA-accelerated simulation with FireSim, and physical design with the Hammer framework.
Chisel is the primary hardware description language used at Berkeley. It is a domain-specific language built on top of Scala. Thus, it provides designers with the power of a modern programming language to write complex, parameterizable circuit generators that can be compiled into synthesizable Verilog. You will be writing a few lines of basic Chisel code in this lab; however, it is NOT the focus of this lab. This lab aims to familiarize you with the Chipyard framework as a whole.
Here are some resources to learn more about Chisel:
Students interested in designing accelerators and other IP (writing significant RTL) are especially encouraged to consult these resources.
Throughout the rest of the course, we will be developing our SoC using Chipyard as the base framework.
There is a lot in Chipyard so we will only be able to explore a part of it in this lab, but hopefully you will get a brief sense of its capabilities.
In particular, the lab provides a brief overview of Chipyard’s diverse features, and then guides one through designing, verifying, and incorporating an accelerator into an SoC (via RoCC and MMIO interfaces).
## Getting Started
First, we will need to set up our Chipyard workspace. All of our work will occur on the BWRC compute cluster. Make sure you have access and can connect to the BWRC compute cluster before starting this lab. For this lab, and the course in general, please work in the /tools/C/<your username> directory, where you should use your EECS IRIS account username. If you do not see a directory under your username, please create one; if you run into any issues, contact the BWRC sysadmins. DO NOT work out of the home directory.
0) SSH into a BWRC login server: bwrcrdsl-#.eecs.berkeley.edu (Make sure you have the campus VPN on, if accessing the BWRC cluster off campus.)
This script is responsible for setting up the tools and environment used in this lab (and more generally by the course). Specifically, it does the following right now:
Conda is an open-source package and environment management system that allows you to quickly install, run, and update packages and their dependencies. We activate the base conda environment for the course’s conda instance.
We use commercial tools such as VCS from a common installation location on the BWRC compute cluster. We add this location to the path and source relevant licenses.
TAs will manage further changes to this script to simplify environment/workflow setup in the coming weeks.
You will need to source this script in every new terminal & at the start of every work session.
<your username>@bwrcrdsl-#:/tools/C/<your username> $ cd ee194-lab1
Optionally, set the repo path as an environment variable by running export chipyard=/tools/C/<your username>/ee194-lab1. We will be referring to the repo path as $chipyard from now on. If you do not wish to set up this environment variable, you will need to write out /tools/C/<your username>/ee194-lab1 every time we use $chipyard.
In Chipyard, we use the Conda package manager to help manage system dependencies. Conda allows users to create an “environment” that holds system dependencies like make, gcc, etc. We’ve also installed a pre-built RISC-V toolchain into it. We want to ensure that everyone in the class is using the same version of everything, so everyone will be using the same conda environment by activating the environment specified above. You will need to do this in every new terminal & at the start of every work session.
The init-submodules-no-riscv-tools.sh script will initialize and check out all of the necessary git submodules. This will also validate that you are on a tagged branch; otherwise, it will prompt for confirmation. When updating Chipyard to a new version, you will also want to rerun this script to update the submodules. Using git directly will try to initialize all submodules; this is not recommended unless you expressly desire this behavior.
Git submodules allow you to keep other Git repositories as subdirectories of another Git repository. For example, the above script initializes the rocket-chip submodule, which is its own Git repository that you can look at here. If you look at the .gitmodules file at $chipyard, you can see
An env.sh file should exist in the top-level repository ($chipyard). This file sets up necessary environment variables such as PATH for the current Chipyard repository. This is required by future Chipyard steps such as the make system to function correctly.
Over the course of the semester, we will find ourselves working with different Chipyards, such as one for this lab, and one for the SoCs we build this semester.
You should source the env.sh file in the Chipyard repository you wish to work in by running the above command every time you open a new terminal or start a new work session.
## Chipyard Repo Tour
You will mostly be working out of the generators/ (for designs), sims/vcs/ (for simulations)*, and vlsi/ (for physical design) directories.
However, we will still give a general repo tour to get you familiar with Chipyard as a whole.
*VCS is a proprietary simulation tool provided by Synopsys, while Verilator is an open-source tool. There are some subtle differences from the user perspective, but VCS is usually faster, so we’ll be using that throughout the course. Everything done with VCS can easily also be done in Verilator (the subdirectory structure is the same as well).
You may have noticed while initializing your Chipyard repo that there are many submodules. Chipyard is built to allow the designer to generate complex configurations from different projects including the in-order Rocket Chip core, the out-of-order BOOM core, the systolic array Gemmini, and many other components needed to build a chip.
Thankfully, Chipyard has some great documentation, which can be found here.
You can find most of these in the $chipyard/generators/ directory.
All of these modules are built as generators (a core driving point of using Chisel), which means that each piece is parameterized and can be fit together with some of the functionality in Rocket Chip (check out the TileLink and Diplomacy references in the Chipyard documentation).
SoC Architecture
- Tiles
  - A tile is the basic unit of replication of a core and its associated hardware.
  - Each tile contains a RISC-V core and can contain additional hardware such as private caches, a page table walker, and a TileBus (specified using configs).
- TileLink is an open-source chip-scale interconnect standard (i.e., a protocol defining the communication interface between different modules on a chip)
  - Comparable to industry-standard protocols such as AXI/ACE
  - Supports multi-core, accelerators, peripherals, DMA, etc.
- Interconnect IP in Chipyard
  - Library of TileLink RTL generators provided in RocketChip
    - RTL generators for crossbar-based buses
    - Width-adapters, clock-crossings, etc.
    - Adapters to AXI4, APB
- Constellation
  - A parameterized Chisel generator for SoC interconnects
- Open-source L2 cache that communicates over TileLink (developed by SiFive, iykyk)
  - Directory-based coherence with MOESI-like protocol
  - Configurable capacity/banking
  - Support broadcast-based coherence in no-L2 systems
  - Support incoherent memory systems
- DRAM
  - AXI-4 DRAM interface to external memory controller
  - Interfaces to DRAM simulators such as DRAMSim/FASED
- Peripherals and IO
  - Chipyard Peripheral User Manual, put together by Yufeng Chi, who took the Sp22 iteration of this class. This is a living document, so feel free to add comments on sections that you don't understand or would like to see added.
  - Open-source RocketChip + SiFive blocks:
    - Interrupt controllers
    - JTAG, Debug module, BootROM
    - UART, GPIOs, SPI, I2C, PWM, etc.
  - TestChipIP: useful IP for test chips
    - Clock-management devices
    - SerDes
    - Scratchpads
  - Documentation of the peripheral devices can be found here
## Config Exercise
Configs describe what goes into our final system and what parameters our designs are elaborated with. You can find the configs in `$chipyard/generators/chipyard/src/main/scala/config`.
Look at the configs located in `$chipyard/generators/chipyard/src/main/scala/config/RocketConfigs.scala`, specifically `BigRocketConfig`.
```
class BigRocketConfig extends Config(
  new freechips.rocketchip.subsystem.WithNBigCores(1) ++ // single rocket-core
  new chipyard.config.AbstractConfig)                    // builds one on top of another, so the single rocket-core is built on top of the AbstractConfig
```
`RocketConfig` is part of the "Digital System configuration" depicted below. It is built on top of `AbstractConfig`, which contains the config fragments for IO Binders and Harness Binders (depicted below). Each line that adds something to the overall system, such as `freechips.rocketchip.subsystem.WithNBigCores(1)`, is called a config fragment.
| Question | Answer | How we found the answer? |
|---|---|---|
| Is UART enabled? If so, which config fragments enabled it? | | We grep for `AbstractConfig` in `$chipyard/generators/chipyard/src/main/scala/` and find `AbstractConfig` at `$chipyard/generators/chipyard/src/main/scala/config/AbstractConfig.scala`. We search for UART and find the corresponding config fragments. |
| How many bytes are in a block for the L1 DCache? How many sets are in the L1 DCache? Ways? | 64 Block Bytes, 64 Sets, 4 Ways | We don't see anything about L1 DCaches in `AbstractConfig`. We grep for `WithNBigCores` at `$chipyard/generators/rocket-chip/src/main/scala/`. We find it in `$chipyard/generators/rocket-chip/src/main/scala/subsystem/Configs.scala`. We see that the fragment instantiates a dcache with `DCacheParams`. We notice it passes in `CacheBlockBytes` to `blockBytes`. So, we grep for `CacheBlockBytes` in `$chipyard/generators/rocket-chip/src/main/scala/` and see its value. Then, we grep for `DCacheParams` and find it in `$chipyard/generators/rocket-chip/src/main/scala/rocket/HellaCache.scala`, where we find the `nSets` and `nWays` fields. |
| Is there an L2 used in this config? What size? | Yes. 1 bank, 8 ways, 512 KB. | We once again start looking at `RocketConfig`, which leads us to `AbstractConfig`. Looking at the comments of the various config fragments, we see the comment `/** use Sifive LLC cache as root of coherence */` next to `new freechips.rocketchip.subsystem.WithInclusiveCache ++` (You can read more about SiFive here). We could have grepped in the generators directory for `WithInclusiveCache` or noticed that a `rocket-chip-inclusive-cache` submodule existed under `$chipyard/generators`. Navigating through it, we eventually find the `WithInclusiveCache` class at `$chipyard/generators/rocket-chip-inclusive-cache/design/craft/inclusivecache/src/Configs.scala`. |
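As a quick sanity check on the cache numbers above, total capacity is block bytes times sets times ways. The C sketch below works that arithmetic. The type and function names are hypothetical (they are not part of Chipyard), and the 64-byte block size used for the L2 in the usage note is an assumption, since the table only gives its capacity, ways, and banks.

```c
#include <stdint.h>

/* Hypothetical helper type for illustration; not part of Chipyard. */
typedef struct {
    uint32_t block_bytes; /* bytes per cache block (line) */
    uint32_t sets;        /* number of sets */
    uint32_t ways;        /* associativity */
} cache_geometry_t;

/* Total data capacity in bytes: block size x sets x ways. */
static uint64_t cache_capacity_bytes(cache_geometry_t g) {
    return (uint64_t)g.block_bytes * g.sets * g.ways;
}

/* Number of sets implied by a capacity: capacity / (block size x ways). */
static uint32_t cache_sets(uint64_t capacity_bytes, uint32_t block_bytes,
                           uint32_t ways) {
    return (uint32_t)(capacity_bytes / ((uint64_t)block_bytes * ways));
}
```

With the L1 DCache values from the table (64-byte blocks, 64 sets, 4 ways), `cache_capacity_bytes` gives 16384 bytes, i.e. 16 KiB. For a single-bank 512 KiB, 8-way cache with assumed 64-byte blocks, `cache_sets(512 * 1024, 64, 8)` gives 1024 sets.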
Inspect `MysteryRocketConfig` & answer the following questions. You should be able to find the answers or clues to the answers by grepping in `$chipyard/generators/chipyard/src/main/scala/` or `$chipyard/generators/rocket-chip/src/main/scala/`.
**1. How many bytes are in a block for the L1 DCache? How many sets are in the L1 DCache? Ways?**
**2. How many bytes are in a block for the L1 ICache? How many sets are in the L1 ICache? Ways?**
**3. What type of L1 Data Memory does this configuration have?**
**4. Does this configuration support custom boot? If yes, what is the custom boot address?**
**5. What accelerator does this config contain? Is this accelerator connected through RoCC or MMIO?**
**6. Is there an L2 cache? If so, what are the dimensions?**
**7. Is UART enabled?**
**8. Does this config include an FPU (floating point unit)?**
**9. Does this config include a multiply-divide pipeline?**
## Running Some Commands
Let's run some commands!
> *So far, we have been working on login servers. From this point on, we will be running some more compute-intensive commands on compute servers. Prepend all compute-heavy commands (everything run in the `vlsi/` and `sims/` directories) with `srun -p ee194 --pty`. This submits the job to a special queue of compute servers for the class so we don't crash the login servers and mess up ongoing research work (or cause each other to lose valuable work :))*
We'll be running the `CONFIG=RocketConfig` config (the `-j16` executes the run with more threads). All commands should be run in `$chipyard/sims/vcs`. Run
```
@bwrcrdsl-#:$chipyard/sims/vcs $ srun -p ee194 --pty make CONFIG=RocketConfig -j16
```
> *Notes: [error] `Picked up JAVA_TOOL_OPTIONS: -Xmx8G -Xss8M -Djava.io.tmpdir=` is not a real error. You can safely ignore it.*
[FIRRTL](https://github.com/chipsalliance/firrtl) is used to translate Chisel source files into another representation--in this case, Verilog. Without going into too much detail, FIRRTL is consumed by a FIRRTL compiler (another Scala program) which passes the circuit through a series of circuit-level transformations. An example of a FIRRTL pass (transformation) is one that optimizes out unused signals. Once the transformations are done, a Verilog file is emitted and the build process is done. If you are familiar with LLVM, you can think of FIRRTL as an HDL version of LLVM (depicted below).
After the run is done (can take ~20 minutes), check the `$chipyard/sims/vcs/generated-src` folder. Find the directory of the config that you ran and you should see the following files:
- `ChipTop.v`: Synthesizable Verilog source
- `TestHarness.v`: TestHarness
- `XXX.dts`: device tree string
- `XXX.memmap.json`: memory map
Answer the following questions:
**1. Looking only at the emitted files, how many bytes are in a block for the L1 DCache? How many sets are in the L1 DCache?**
**2. Looking only at the emitted files, how many bytes are in a block for the L1 ICache? How many sets are in the L1 ICache?**
**3. Try to find the top-level Verilog modules that correspond to the ICache/DCache. What are they called? *Hint: what modules look like they represent memories?***
## Chipyard Simulation
Simple RISC-V tests can be found under `$RISCV/riscv64-unknown-elf/share/riscv-tests/isa/` and can be run as:
```
@bwrcrdsl-#:$chipyard/sims/vcs $ srun -p ee194 --pty make run-binary CONFIG=RocketConfig BINARY=$RISCV/riscv64-unknown-elf/share/riscv-tests/isa/rv64ui-p-simple
```
**1. What are the last 10 lines of the `.out` file generated by the assembly test you ran? It should include the `*** PASSED ***` flag.**
**2. How many cycles did the simulation take to complete?**
**3. What is the hexadecimal representation of the last instruction run by the CPU?**
In summary, when we run something like:
```
@bwrcrdsl-#:$chipyard/sims/vcs $ srun -p ee194 --pty make run-binary CONFIG=RocketConfig BINARY=$RISCV/riscv64-unknown-elf/share/riscv-tests/isa/rv64ui-p-simple
```
This command first elaborates the design and creates Verilog.
This is done by converting the Chisel code, embedded in Scala, into a FIRRTL intermediate representation which is then run through the FIRRTL compiler to generate Verilog.
Next, it runs VCS to build a simulator out of the generated Verilog that can run RISC-V binaries.
Finally, it runs the test specified by `BINARY` and outputs the results as an `.out` file.
This file will be emitted to the `$chipyard/sims/vcs/output/` directory.
Many Chisel-based designs in Chipyard look something like a Rocket core connected to some kind of "accelerator" (e.g., a DSP block like an FFT module).
When building something like that, you would typically build your "accelerator" generator in Chisel, and unit test it using ChiselTesters.
You can then write integration tests (e.g., a baremetal C program) which can then be simulated with your Rocket Chip and "accelerator" block together to test end-to-end system functionality.
Chipyard provides the infrastructure to help you do this for both VCS (Synopsys) and Verilator (open-source). The same infrastructure enables a few other applications as depicted below.
- SW RTL Simulation: RTL-level simulation with VCS or Verilator. If you design anything with Chipyard, you should be running SW RTL simulation to test.
- Hammer VLSI flow: Tapeout a custom config in some process technology.
- FPGA prototyping: Fast, non-deterministic prototypes (we won't be doing this in this class).
- FireSim: Fast, accurate FPGA-accelerated simulations (we won't be using this in this class, but if you're curious about FireSim, check out its documentation [here](https://fires.im/) and feel free to reach out to a TA to learn more).
## In summary...
- Configs: Describe parameterization of a multi-generator SoC.
- Generators: Flexible, reusable library of open-source Chisel generators (and Verilog too).
- IOBinders/HarnessBinders: Enable configuring IO strategy and Harness features.
- FIRRTL Passes: Structured mechanism for supporting multiple flows.
- Target flows: Different use-cases for different types of users.
### [Optional] More Complicated Configs & Tests
Complete this section if you want to see some more complicated systems. Navigate to `$chipyard/generators/chipyard/src/main/scala/config/TutorialConfigs.scala`. We'll be running the `CONFIG=TutorialNoCConfig` config, which adds one of the aforementioned Constellation topologies into our system. Run
```
@bwrcrdsl-#:$chipyard/sims/vcs $ srun -p ee194 --pty make CONFIG=TutorialNoCConfig -j16
```
and inspect the generated files at `$chipyard/sims/vcs/generated-src`.
To run some more interesting tests, first, go to `$chipyard/tests` and run the following commands.
```
@bwrcrdsl-#:$chipyard/tests $ cmake -S ./ -B ./build/ -D CMAKE_BUILD_TYPE=Debug
@bwrcrdsl-#:$chipyard/tests $ cmake --build ./build/ --target all
```
The exact semantics of these commands aren't too important now, but they tell [CMake](https://cmake.org/), a build system, to set up a project with debug symbols included and to build all of the tests. CMake is especially helpful when managing large code bases and is one of the leading tools for C/C++ build management.
> *Note: if you are wondering, the `.riscv` binaries are actually [ELF files](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format). We are naming it with the .riscv extension to emphasize that it is a RISC-V program.*
Afterwards, you should see the `.riscv` bare-metal binaries compiled in the new build folder along with a bunch of `.dump` files which contain the corresponding disassemblies. Go back to `$chipyard/sims/vcs` and try running (prepended with `srun -p ee194 --pty`):
- `make CONFIG=TutorialNoCConfig run-binary-hex BINARY=../../tests/build/fft.riscv` Runs tests on the FFT accelerator that's connected through MMIO.
- `make CONFIG=TutorialNoCConfig run-binary-hex BINARY=../../tests/build/gcd.riscv` Runs tests on a GCD module that's connected through MMIO.
- `make CONFIG=TutorialNoCConfig run-binary-hex BINARY=../../tests/build/streaming-fir.riscv` Runs [FIR](https://en.wikipedia.org/wiki/Finite_impulse_response) tests.
- `make CONFIG=TutorialNoCConfig run-binary-hex BINARY=../../tests/build/nic-loopback.riscv` Runs the [NIC](https://en.wikipedia.org/wiki/Network_interface_controller) loopback tests.
# Designing a Custom Accelerator
In this section, we will design a simple "accelerator" that treats its 64-bit values as vectors of eight 8-bit values. It takes two 64-bit vectors, adds them, and returns the resultant 64-bit sum vector. (As you might have realized, this isn't a very practical accelerator.) This accelerator will sit on a Rocket Tile and communicate through the RoCC interface. At the end of the lab, you will have the option to implement a 32-bit version that uses memory-mapped IO (MMIO) instead.
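To make the intended behavior concrete, here is a minimal software sketch of the operation in C. It assumes each 8-bit lane wraps around on overflow with no carry into the neighboring lane, which matches how the optional exercise later in this lab describes the current accelerator. The function name is hypothetical and the second operand in the usage note is chosen purely for illustration.

```c
#include <stdint.h>

/* Software sketch of the accelerator's behavior (hypothetical name):
 * treat each 64-bit operand as eight independent 8-bit lanes and add
 * them lane by lane. */
static uint64_t vector_add_model(uint64_t a, uint64_t b) {
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        uint8_t x = (uint8_t)(a >> (8 * lane));
        uint8_t y = (uint8_t)(b >> (8 * lane));
        /* The cast truncates to 8 bits, so the lane wraps modulo 256
         * and nothing carries into the next lane. */
        uint8_t sum = (uint8_t)(x + y);
        result |= (uint64_t)sum << (8 * lane);
    }
    return result;
}
```

For example, `vector_add_model(0x0F0D0B0907050301ULL, 0x0102030405060708ULL)` yields `0x100F0E0D0C0B0A09`, while `vector_add_model(0xFF, 0x01)` yields `0` rather than `0x100`, because the overflow of the lowest lane is discarded.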

Note that the idea here is to learn how to incorporate a custom accelerator in an SoC by writing an accelerator generator and effectively utilizing the simplicity and extensibility of Chipyard. Our emphasis here is NOT on designing an accelerator from scratch, as that involves learning how to write RTL of significant size and complexity in Chisel, which might not be useful to the majority of the class.
If you are interested in designing an accelerator or other IP for the course, you should be prepared to write significant RTL. We encourage you to look at sections 1 through 5 of the [Chisel Bootcamp](https://github.com/freechipsproject/chisel-bootcamp) and implement `TODOs` by yourself.
If not, feel free to utilize answers provided in the lab.
# RoCC Design
- RoCC stands for Rocket Custom Coprocessor.
- A block using the RoCC interface sits on a Rocket Tile.
- Such a block uses custom non-standard instructions reserved in the RISC-V ISA encoding space.
- It can communicate using a ready-valid interface with the following:
  - A core on the Rocket Tile, such as BOOM or Rocket Chip (yes, it's an overloaded name :))
  - L1 D$
  - Page Table Walker (available by default on a Rocket Tile)
  - SystemBus, which can be used to communicate with the outer memory system, for instance
For more on RoCC, we encourage you to refer to:
1. Sections 6.5 and 6.6 of the Chipyard docs, and related examples
2. Bespoke Silicon Group's [RoCC Doc V2](https://docs.google.com/document/d/1CH2ep4YcL_ojsa3BVHEW-uwcKh1FlFTjH_kg5v8bxVw/edit)
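To illustrate what a "custom non-standard instruction" looks like at the bit level, the C sketch below packs the fields of a RoCC instruction into a 32-bit word. The four `custom` major opcodes are the ones reserved by the RISC-V ISA; the function name is hypothetical, and which of the four opcodes this lab's accelerator uses is determined by its config fragment, so `custom0` in the usage note is only an example.

```c
#include <stdint.h>

/* RISC-V major opcodes reserved for custom extensions. */
#define OPCODE_CUSTOM0 0x0Bu
#define OPCODE_CUSTOM1 0x2Bu
#define OPCODE_CUSTOM2 0x5Bu
#define OPCODE_CUSTOM3 0x7Bu

/* Pack RoCC instruction fields into a 32-bit word (hypothetical helper).
 * Layout, from most to least significant bits:
 *   funct7[31:25] rs2[24:20] rs1[19:15] xd[14] xs1[13] xs2[12]
 *   rd[11:7] opcode[6:0]
 * xd/xs1/xs2 flag whether rd/rs1/rs2 refer to core integer registers. */
static uint32_t rocc_encode(uint32_t opcode, uint32_t funct7,
                            uint32_t rd, uint32_t rs1, uint32_t rs2,
                            uint32_t xd, uint32_t xs1, uint32_t xs2) {
    return ((funct7 & 0x7Fu) << 25) |
           ((rs2 & 0x1Fu) << 20) |
           ((rs1 & 0x1Fu) << 15) |
           ((xd & 1u) << 14) |
           ((xs1 & 1u) << 13) |
           ((xs2 & 1u) << 12) |
           ((rd & 0x1Fu) << 7) |
           (opcode & 0x7Fu);
}
```

As a usage example, `rocc_encode(OPCODE_CUSTOM0, 0, 10, 11, 12, 1, 1, 1)` (destination `x10`, sources `x11` and `x12`, all three register flags set) produces `0x00C5F50B`. Recognizing this layout helps when reading disassembly of a RoCC test binary.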
Here's an overview of the `custom-acc-rocc` directory inside `$chipyard/generators/`.
```
custom-acc-rocc/
  baremetal_test/            <------ (4) bare-metal functional tests
    functionalTest.c
  project/                   <------ project properties/settings
  src/                       <------ source code
    main/                    <------ Chisel RTL
      scala/
        Configs.scala        <------ (3) Config to include this accelerator
        CustomAccRoCC.scala  <------ RoCC Scaffolding RTL
        VectorAdd.scala      <------ (1) Accelerator RTL
    test/                    <------ Chisel tests
      scala/
        TestVectorAdd.scala  <----- (2) Basic unit test
  target/                    <------ output from build system
```
Let us begin by inspecting `src/main/scala/CustomAccRoCC.scala`.
`LazyRoCC` and `LazyRoCCModuleImp` are *abstract* classes that allow us to separate the implementation of a RoCC accelerator from the definition and implementation of the RoCC interface.
`customAcceleratorModule` provides the implementation of our specific accelerator module. For ease of understanding, we define all functionality in a module called `vectorAdd`, and wire up RoCC I/O signals to `vectorAdd` I/O signals.
Answer the following question:
**1. Find the file `LazyRoCC.scala`. The RoCC interface is defined in `RoCCIO`. Which fields of `RoCCIO` does our accelerator use?**
## Accelerator RTL
Let us now **implement the accelerator** in `src/main/scala/VectorAdd.scala`, as described above. Your task here is to fill in all blocks/lines marked `/* YOUR CODE HERE */`.
Questions to consider:
- What kinds of inputs/outputs does the `vectorAdd` module use? You should inspect the `io` field of the module for this.
- Does this module use ready-valid interfaces for I/O? How many ready-valid interfaces, and in which directions?
In no particular order, here are the required lines of code:
## Testing
The next logical step is testing the `vectorAdd` module to ensure it behaves as expected. There are two main ways to test your design:
1. using ChiselTest
2. baremetal functional testing: baremetal here refers to the fact that your tests directly run on the hardware, i.e., no OS underneath.
The former is more useful for fine-grained module-specific testing while the latter is more useful to test the accelerator as a whole, and its interactions with the rest of the SoC. Both kinds of tests will be run in RTL simulation.
We will unit test with ChiselTest right now, and come back to baremetal testing when integrating our accelerator with the rest of the SoC.
### ChiselTest
ChiselTest is the batteries-included testing and formal verification library for Chisel-based RTL designs. It emphasizes tests that are lightweight (minimizes boilerplate code), easy to read and write (understandability), and compose (for better test code reuse). You can find the repo [here](https://github.com/ucb-bar/chiseltest), an overview [here](https://www.chisel-lang.org/chiseltest/) and API documentation [here](https://www.chisel-lang.org/api/chiseltest/latest).
Let us now write a unit test using ChiselTest in `src/test/scala/TestVectorAdd.scala`.
`VectorAddTest` is our test class here, and `"Basic Testcase"` is the name of our only test case. A test case is defined inside a `test()` block, and takes the DUT as a parameter. There can be multiple test cases per test class, and we recommend one test class per Module being tested, and one test case per individual test.
Here, we will be using Verilator as our simulator backend, and generate waveforms in an fst file.
Most simulation testing infrastructure is based on setting signals, advancing the clock, checking signals, and asserting their values. ChiselTest does the same with `poke`, `step`, `peek`, and `expect`, respectively.
**Complete the unit test named "Basic Testcase"** in `TestVectorAdd.scala` by filling in all lines marked `/* YOUR CODE HERE */`.
In no particular order, here are the required lines of code:
"h_0F_0D_0B_09_07_05_03_01".U
c.clock.step(1)
true.B
Before we run any tests, we must keep in mind that our RTL is written in Chisel whereas most simulator backends and VLSI tools expect Verilog/SystemVerilog. Thus, we compile our code from Chisel down to an Intermediate Representation (FIRRTL), and finally to the relevant Verilog/SystemVerilog.
To compile the design and run our tests, we use the Scala Build Tool (sbt). `$chipyard/build.sbt` (in the root Chipyard directory) contains project settings, dependencies, and sub-project settings. Feel free to search for `custom_acc_rocc` to find the sub-project entry.
In a new terminal window inside **the root Chipyard directory**, run:
```
@bwrcrdsl-#:$chipyard $ srun -p ee194 --pty sbt
```
Give it a minute or so to launch the sbt console and load all settings.
In the sbt console, set the current project by running:
```
sbt:chipyardRoot> project custom_acc_rocc
```
To compile the design, run `compile` in the sbt console, as follows:
```
sbt:custom_acc_rocc> compile
```
This might take a while as it compiles all dependencies of the project.
To run all tests, run `test` in the sbt console, as follows:
```
sbt:custom_acc_rocc> test
```
(You can use `testOnly ` to run specific ones.) Test outputs will be visible in the console. You can find waveforms and test files in `$chipyard/test_run_dir/`.
Exit the sbt console with:
```
sbt:custom_acc_rocc> exit
```
Use `gtkwave` to inspect the waveform at `$chipyard/test_run_dir/Basic_Testcase/vectorAdd.fst`.
**Please ensure your accelerator passes the basic test case before proceeding.**
## Integrating our Accelerator
Now that our accelerator works, it is time to incorporate it into an SoC. We do this by:
1. Defining a config fragment for our accelerator
1. Defining a new config that uses this config fragment
Inside `$chipyard/generators/custom-acc-rocc`, inspect `src/main/scala/Configs.scala`. `WithCustomAccRoCC` is our config fragment here.
Answer the following questions:
**2. What does `p` do here? (Think about how it could be used, consider the object-oriented, generator-based style of writing, and feel free to look through other generators in Chipyard for examples.)**
**3. Give the 7-bit opcode used for instructions to our accelerator. Searching for the definition of `OpcodeSet` will be useful.**
We want to add our accelerator to a simple SoC that uses Rocket. To do this, we must make our config fragment accessible inside the chipyard generator. Open `$chipyard/build.sbt`. At line 161, add `custom_acc_rocc` to the list of dependencies of the chipyard project.
Next, navigate to `$chipyard/generators/chipyard/src/main/scala/config/RocketConfigs.scala`. **Define `CustomAccRoCCConfig`** such that it adds our accelerator to `RocketConfig`. The previous step made `customAccRoCC` available as a package here.
Hint: `CustomAccRoCCConfig` should look like the following:
```
class CustomAccRoCCConfig extends Config(
/* YOUR CODE HERE */
)
```
### Baremetal Functional Testing
Inside `$chipyard/generators/custom-acc-rocc`, let us inspect `baremetal_test/functionalTest.c`. `rocc.h` contains definitions for different kinds of RoCC instructions and the custom opcodes. We use the same test case as before, but we test integration of the whole system as values are loaded into registers on the Rocket core, sent to the RoCC accelerator, and results from the accelerator are loaded into a register.
Since our accelerator reads two source registers and writes to one destination register, we use `ROCC_INSTRUCTION_DSS`.
Inline assembly instructions in C are invoked with the `asm volatile` command. Before the first instruction, and after each RoCC instruction, the fence command is invoked. This ensures that all previous memory accesses will complete before executing subsequent instructions, and is required to avoid mishaps as the Rocket core and coprocessor pass data back and forth through the shared data cache. (The processor uses the “busy” bit from your accelerator to know when to clear the fence.) A fence command is not strictly required after each custom instruction, but it must stand between any use of shared data by the two subsystems.
While one can compute results for each test case a priori, and test for equality against the accelerator's results, such a strategy is neither reliable nor scalable as tests become complex - such as when using random inputs or writing multiple tests. Thus, there lies significant value in writing a functional model that performs the same task as the accelerator, but in software. Of course, care must be taken in writing a correct functional model that adheres to the spec.
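The idea of checking an implementation against a functional model on random inputs can be sketched on a host machine as follows. This is only an illustration under stated assumptions: all names are hypothetical, the lanes are assumed to wrap modulo 256, and two plain C functions stand in for the accelerator (one correct bit-trick implementation and one deliberately wrong one). In the real bare-metal test, the candidate would instead wrap the RoCC instruction.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*vec_op_fn)(uint64_t, uint64_t);

/* Small deterministic PRNG (xorshift64) so failures are reproducible.
 * The state must be nonzero. */
static uint64_t xorshift64(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Reference model: add eight 8-bit lanes, each wrapping modulo 256. */
static uint64_t lane_add_model(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int lane = 0; lane < 8; lane++) {
        uint8_t s = (uint8_t)((a >> (8 * lane)) + (b >> (8 * lane)));
        r |= (uint64_t)s << (8 * lane);
    }
    return r;
}

/* Stand-in candidate 1: same result computed with a bit trick. Add the
 * low 7 bits of every lane, then fix up each lane's top bit with XOR. */
static uint64_t swar_lane_add(uint64_t a, uint64_t b) {
    const uint64_t low7 = 0x7F7F7F7F7F7F7F7FULL;
    const uint64_t msb = 0x8080808080808080ULL;
    return ((a & low7) + (b & low7)) ^ ((a ^ b) & msb);
}

/* Stand-in candidate 2: deliberately wrong, carries leak across lanes. */
static uint64_t plain_add(uint64_t a, uint64_t b) {
    return a + b;
}

/* Run random trials and count inputs where candidate and model differ. */
static size_t count_mismatches(vec_op_fn candidate, vec_op_fn model,
                               size_t trials, uint64_t seed) {
    uint64_t state = seed ? seed : 1;
    size_t mismatches = 0;
    for (size_t i = 0; i < trials; i++) {
        uint64_t a = xorshift64(&state);
        uint64_t b = xorshift64(&state);
        if (candidate(a, b) != model(a, b)) {
            mismatches++;
        }
    }
    return mismatches;
}
```

Calling `count_mismatches(swar_lane_add, lane_add_model, 1000, 42)` reports no mismatches, while passing `plain_add` as the candidate reports a nonzero count. A fixed seed keeps any failing input pair reproducible, which matters when a simulation run takes minutes.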
**Inspect `$chipyard/tests/rocc.h`**.
Answer the following question:
**4. What does the last argument of `ROCC_INSTRUCTION_DSS` stand for? In what situation would you need to use that argument?**
Next, we compile our test by running the following in the `baremetal_test` directory:
```
@bwrcrdsl-#:$chipyard/generators/custom-acc-rocc/baremetal_test $ riscv64-unknown-elf-gcc -fno-common -fno-builtin-printf -specs=htif_nano.specs -c functionalTest.c
@bwrcrdsl-#:$chipyard/generators/custom-acc-rocc/baremetal_test $ riscv64-unknown-elf-gcc -static -specs=htif_nano.specs functionalTest.o -o functionalTest.riscv
```
Here, we're using a version of gcc with the target architecture set to riscv (without an OS underneath). This comes as part of the riscv toolchain. Since we want a self-contained binary, we compile it statically.
Now, let's disassemble the executable `functionalTest.riscv` by running:
```
@bwrcrdsl-#:$chipyard/generators/custom-acc-rocc/baremetal_test $ riscv64-unknown-elf-objdump -d functionalTest.riscv | less
```
Inspect the output. Answer the following question:
**5. What is the address of the `ROCC_INSTRUCTION_DSS`?**
Looking through `` and looking for `opcode0` should be helpful.
It's time to run our functional test. Let us use VCS this time around. Navigate to `$chipyard/sims/vcs`, run:
```
@bwrcrdsl-#:$chipyard/sims/vcs $ srun -p ee194 --pty make -j16 CONFIG=CustomAccRoCCConfig BINARY=../../generators/custom-acc-rocc/baremetal_test/functionalTest.riscv run-binary-debug
```
It might take a few minutes to build and compile the test harness, and run the simulation.
Inside `$chipyard/sims/vcs`, for each config:
- `generated-src` contains the test harness
- `output` contains output files (log/output/waveform) for each config.
**Inspect the log and output for our config.** Do the results of the accelerator and model match? (`*** PASSED ***` in the .out file and output values matching in the .log file should indicate this.)
**Inspect the waveform (.fsdb) for our config** using `verdi -ssf `. Synopsys has transitioned to a new waveform viewer called Verdi that is much more capable than DVE. Verdi uses an open file format called *fsdb* (Fast Signal Database), and hence VCS has been set up to output simulation waveforms in fsdb.
In the bottom pane of your Verdi window, navigate to `Signal > Get Signals...`. Follow the module hierarchy to the correct module.
```
TestDriver
.testHarness
.chiptop
.system
.tile_prci_domain
.tile_reset_domain
.rocket_tile
.customAccRoCC
```
# [Optional] Adding Modulo Arithmetic
We highly encourage those interested in designing accelerators and other custom IP to do the following exercise.
Currently, our accelerator wraps around the range [0, 255], i.e., when the sum of two numbers exceeds 255, you get the result modulo 256. Let's say we desire values saturating at 255. Implement this feature. (Note: As the sum of two unsigned ints is always >= 0, we don't have to worry about the lower bound.)
You should do the following:
1. Modify RTL
2. Write 1 more unit test
3. Modify the functional model
4. Write an integration test that uses random numbers in the entire range as inputs.
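For step 3, one possible shape of a saturating functional model is sketched below. It assumes the same lane-wise layout as before (four unsigned 8-bit values packed into a 32-bit word); the function name `vec_add_saturating_model` is hypothetical, and your RTL and spec remain the reference for the exact behavior.

```c
#include <stdint.h>

/* Hypothetical saturating model: each 8-bit lane is added independently
 * and clamped to 255 instead of wrapping around. */
uint32_t vec_add_saturating_model(uint32_t a, uint32_t b) {
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        uint32_t sum = x + y;            /* at most 510, so no overflow */
        if (sum > 0xFFu) {
            sum = 0xFFu;                 /* saturate at the upper bound */
        }
        result |= sum << (8 * lane);
    }
    return result;
}
```

Inputs that straddle the boundary, such as a lane pair of 0x7F and 0x80 (exactly 255) or 0x80 and 0x80 (just over), are good candidates for the extra unit test.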
# [Optional] MMIO Design
This is a completely optional design exercise. It digs more deeply into the Chipyard infrastructure and, although an MMIO accelerator is not difficult in concept, it is rather tricky to integrate. There will be limited support for this part in favor of helping the class understand and internalize the previous parts.
Often, an accelerator or peripheral block is connected to the rest of the SoC with a memory-mapped interface over the system bus.
This allows the core and external IO to configure and communicate with the block.
```
generators/
  chipyard/
    src/main/scala/
      example/GCD.scala    <--------- If you want to see another example
      unittest/
      config/              <--------- (3) Where we'll test our design
      DigitalTop.scala     <--------- (2) Where we'll connect our design to the rest of the SoC.
      ExampleMMIO.scala    <--------- (1) Where we'll design & setup our accelerator.
```
## Setting up & designing our accelerator
Navigate to `$chipyard/generators/chipyard/src/main/scala/ExampleMMIO.scala`, where we'll be designing our MMIO accelerator. Remember, the goal is to design an "accelerator" that takes in two 32-bit values as vectors of 4 8-bit values. The accelerator takes in 32-bit vectors, adds them, and returns the result.
Most of the logic of the accelerator will go in `VecAddMMIOChiselModule`. This module will be wrapped by the `VecAddModule` which interfaces with the rest of the SoC and determines where our MMIO registers are placed.
**Add the necessary FSM logic into `VecAddMMIOChiselModule`.** Notice how `VecAddMMIOChiselModule` has the trait `HasVecAddIO`. The bundle of input/output signals in `HasVecAddIO` is how the accelerator interfaces with the rest of the SoC.
**Inspect `VecAddModule`.** There are 3 main sections: setup, hooking up inputs/outputs, and a regmap. Setup defines the kinds of wires/signals we're working with. We hook up input/output signals as necessary: we feed x and y into the accelerator along with a reset signal and the clock; we expect the result of the addition; we also use a ready/valid interface to signify when the accelerator is busy or available to process further instructions. `VecAddTopIO` is used only to see whether the accelerator is busy or not. Then we have the regmap:
* `RegField.r(2, status)` is used to create a 2-bit, read-only register that captures the current value of the status signal when read.
* `RegField.w(params.width, x)` exposes a plain register via MMIO, but makes it write-only.
* `RegField.w(params.width, y)` associates the decoupled interface signal `y` with a write-only memory-mapped register, causing `y.valid` to be asserted when the register is written.
* `RegField.r(params.width, vec_add)` "connects" the decoupled handshaking interface `vec_add` to a read-only memory-mapped register. When this register is read via MMIO, the ready signal is asserted. This is in turn connected to `output_ready` on the VecAdd module through the glue logic.
RegField exposes polymorphic `r` and `w` methods that allow read- and write-only memory-mapped registers to be interfaced to hardware in multiple ways.
Since the ready/valid signals of `y` are connected to the `input_ready` and `input_valid` signals of the accelerator module, respectively, this register map and glue logic has the effect of triggering the accelerator algorithm when `y` is written. Therefore, the algorithm is set up by first writing `x` and then performing a triggering write to `y`.
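From the software side, the write-`x`-then-write-`y` sequence might look like the sketch below. Everything specific in it is an assumption made for illustration: the register order (status, x, y, result at consecutive 32-bit words), the meaning of the two status bits, and the function name `vecadd_mmio`. The actual offsets and status encoding are whatever the regmap in `VecAddModule` defines.

```c
#include <stdint.h>

/* Hypothetical word offsets within the peripheral's register block. */
enum { VECADD_STATUS = 0, VECADD_X = 1, VECADD_Y = 2, VECADD_RESULT = 3 };

/* Assumed status encoding: bit 1 = input ready, bit 0 = output valid. */
#define VECADD_INPUT_READY  0x2u
#define VECADD_OUTPUT_VALID 0x1u

/* Runs one vector add through the memory-mapped registers at regs. */
uint32_t vecadd_mmio(volatile uint32_t *regs, uint32_t x, uint32_t y) {
    /* Wait until the accelerator can accept new operands. */
    while ((regs[VECADD_STATUS] & VECADD_INPUT_READY) == 0) { }

    regs[VECADD_X] = x;  /* plain register write, no side effect */
    regs[VECADD_Y] = y;  /* triggering write: starts the computation */

    /* Wait until the result is available, then read it back. */
    while ((regs[VECADD_STATUS] & VECADD_OUTPUT_VALID) == 0) { }
    return regs[VECADD_RESULT];
}
```

Passing the register block as a pointer keeps the routine testable on a host machine with a plain array standing in for the hardware; on the target, the pointer would be the peripheral's base address cast to `volatile uint32_t *`.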
## Connecting our design to the rest of the SoC
Once you have these classes, you can construct the final peripheral by extending the `TLRegisterRouter` and passing the proper arguments. The first set of arguments determines where the register router will be placed in the global address map and what information will be put in its device tree entry (`VecAddParams`). The second set of arguments is the IO bundle constructor (`VecAddTopIO`), which we create by extending `TLRegBundle` with our bundle trait. The final set of arguments is the module constructor (`VecAddModule`), which we create by extending `TLRegModule` with our module trait. Notice how we can create an analogous AXI4 version of our peripheral.
`VecAddParams` is where we define where our MMIO accelerator will be placed. `address` determines the base of the module's MMIO region (0x2000 in this case). Each register router has a default size of 4096 bytes. Everything from `address` to `address` + 4096 is accessible, but only the regions defined in the regmap (as previously described) will do anything (reads/writes to other regions will be no-ops).
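As a small illustration of the address window, the following sketch checks whether an address falls inside the peripheral's region and computes its offset from the base. The base (0x2000) and size (4096) come from the text above; the macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define VECADD_BASE 0x2000u  /* base address given in VecAddParams */
#define VECADD_SIZE 4096u    /* default register router size in bytes */

/* True if addr lies in [VECADD_BASE, VECADD_BASE + VECADD_SIZE). */
bool in_vecadd_region(uint32_t addr) {
    return addr >= VECADD_BASE && addr < VECADD_BASE + VECADD_SIZE;
}

/* Byte offset of addr from the base; only meaningful inside the region. */
uint32_t vecadd_offset(uint32_t addr) {
    return addr - VECADD_BASE;
}
```

An address such as 0x2008 is inside the window at offset 8, while 0x3000 is the first address past it.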
**Copy paste the following two code blocks into `ExampleMMIO.scala`**
```
class VecAddTL(params: VecAddParams, beatBytes: Int)(implicit p: Parameters)
  extends TLRegisterRouter(
    params.address, "vecadd", Seq("ucbbar,vecadd"),
    beatBytes = beatBytes)(
      new TLRegBundle(params, _) with VecAddTopIO)(
      new TLRegModule(params, _, _) with VecAddModule)
```
```
class VecAddAXI4(params: VecAddParams, beatBytes: Int)(implicit p: Parameters)
  extends AXI4RegisterRouter(
    params.address,
    beatBytes=beatBytes)(
      new AXI4RegBundle(params, _) with VecAddTopIO)(
      new AXI4RegModule(params, _, _) with VecAddModule)
```
Now, we have to hook up everything to the SoC. Rocket Chip accomplishes this using the cake pattern. This basically involves placing code inside traits. In the Rocket Chip cake, there are two kinds of traits: a `LazyModule` trait and a module implementation trait.
The `LazyModule` trait runs setup code that must execute before all the hardware gets elaborated. For a simple memory-mapped peripheral, this just involves connecting the peripheral’s TileLink node to the MMIO crossbar.
**Copy paste the following two code blocks into `ExampleMMIO.scala`**
```
trait CanHavePeripheryVecAdd { this: BaseSubsystem =>
  private val portName = "vecadd"
  // Only build if we are using the TL (nonAXI4) version
  val vecadd = p(VecAddKey) match {
    case Some(params) => {
      if (params.useAXI4) {
        val vecadd = LazyModule(new VecAddAXI4(params, pbus.beatBytes)(p))
        pbus.toSlave(Some(portName)) {
          vecadd.node :=
          AXI4Buffer () :=
          TLToAXI4 () :=
          // toVariableWidthSlave doesn't use holdFirstDeny, which TLToAXI4() needs
          TLFragmenter(pbus.beatBytes, pbus.blockBytes, holdFirstDeny = true)
        }
        Some(vecadd)
      } else {
        val vecadd = LazyModule(new VecAddTL(params, pbus.beatBytes)(p))
        pbus.toVariableWidthSlave(Some(portName)) { vecadd.node }
        Some(vecadd)
      }
    }
    case None => None
  }
}
```
```
trait CanHavePeripheryVecAddModuleImp extends LazyModuleImp {
  val outer: CanHavePeripheryVecAdd
  val vecadd_busy = outer.vecadd match {
    case Some(vecadd) => {
      val busy = IO(Output(Bool()))
      busy := vecadd.module.io.vec_add_busy
      Some(busy)
    }
    case None => None
  }
}
```
Note that the `VecAddTL` class we created from the register router is itself a `LazyModule`. Register routers have a TileLink node simply named “node”, which we can hook up to the Rocket Chip bus. This will automatically add address map and device tree entries for the peripheral. Also observe how we have to place additional AXI4 buffers and converters for the AXI4 version of this peripheral.
Now we want to mix our traits into the system as a whole. This code is from `generators/chipyard/src/main/scala/DigitalTop.scala`.
**Copy paste `with chipyard.example.CanHavePeripheryVecAdd` into `DigitalTop` & `with chipyard.example.CanHavePeripheryVecAddModuleImp` into `DigitalTopModule`**
Just as we need separate traits for `LazyModule` and module implementation, we need two classes to build the system. The `DigitalTop` class contains the set of traits which parameterize and define the `DigitalTop`. Typically these traits will optionally add IOs or peripherals to the DigitalTop. The `DigitalTop` class includes the pre-elaboration code and also a `lazy val` to produce the module implementation (hence `LazyModule`). The `DigitalTopModule` class is the actual RTL that gets synthesized.
And finally, we create a configuration class in `$chipyard/generators/chipyard/src/main/scala/config/RocketConfigs.scala` that uses the WithVecAdd config fragment defined earlier.
**Copy paste the following**
```
class VecAddTLRocketConfig extends Config(
  new chipyard.example.WithVecAdd(useAXI4=false, useBlackBox=false) ++ // Use VecAdd Chisel, connect TileLink
  new freechips.rocketchip.subsystem.WithNBigCores(1) ++
  new chipyard.config.AbstractConfig)
```
## Testing Your MMIO
Now we're ready to test our accelerator! We write our test program in `$chipyard/tests/examplemmio.c`. Look through the file and make sure you understand its flow.
**Add in a C reference solution for our accelerator**
To generate the binary file of the test, run the following two commands in the terminal:
`riscv64-unknown-elf-gcc -std=gnu99 -O2 -fno-common -fno-builtin-printf -Wall -specs=htif_nano.specs -c examplemmio.c -o examplemmio.o`
`riscv64-unknown-elf-gcc -static -specs=htif_nano.specs examplemmio.o -o examplemmio.riscv`
Then, navigate to `$chipyard/sims/verilator` and run `make CONFIG=VecAddTLRocketConfig BINARY=../../tests/examplemmio.riscv run-binary-debug` to run the test. This may take a while. When the simulation finishes, the terminal should print whether you passed the test.
**Please submit:**
1. The entirety of the code for `VecAddMMIOChiselModule`.
2. Your entire C reference solution.
3. A screenshot of your test passing.
# END OF CHIPYARD LAB (due EoD 1/30)