Post-Implementation Functional Simulation

This tutorial describes how to use Verilator, an open-source hardware simulator, to verify the functional correctness of a circuit implemented by VPR.

Functional simulation is useful for:

Verifying that the VTR flow has correctly implemented the circuit’s intended logic
Quickly checking correctness before running a full timing simulation
Debugging unexpected circuit behaviour after implementation

Unlike timing simulation, functional simulation does not require a Standard Delay Format (SDF) file; only the post-implementation Verilog netlist and VTR’s primitive definitions are needed.

Note

This tutorial requires Verilator 5.006 or later. Verilator can be installed from your distribution’s package manager or built from source: https://github.com/verilator/verilator

On Ubuntu, install it with:

$ sudo apt install verilator
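
A packaged Verilator may be older than the 5.006 minimum on some releases, so it can be worth checking the installed version before continuing. The following is a minimal sketch, not part of VTR: version_ge is a hypothetical helper, it assumes that verilator --version prints the version number as its second word, and it relies on the -V (version sort) option of GNU sort.

```shell
# version_ge A B: succeed when version A is greater than or equal to version B.
# Sorting the pair in version order puts the smaller one first, so A >= B
# exactly when B comes out first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v verilator >/dev/null 2>&1; then
    # Assumption: the output looks like "Verilator <version> <date> ..."
    installed=$(verilator --version | awk '{print $2}')
    if version_ge "$installed" 5.006; then
        echo "Verilator $installed is new enough"
    else
        echo "Verilator $installed is older than 5.006; consider building from source" >&2
    fi
else
    echo "verilator not found on PATH" >&2
fi
```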

The Design

The circuit for this tutorial is a 4-bit ripple-carry adder with the following ports:

Inputs: cin (carry-in), a0-a3 and b0-b3 (4-bit operands, a0 / b0 are the least-significant bits)
Outputs: s0-s3 (sum bits, s0 least-significant) and cout (carry-out)

When synthesis compiles a Verilog file to BLIF, each bus port is split into individual single-bit signals. Declaring single-bit ports from the start (rather than bus ports such as input [3:0] a) keeps the post-implementation netlist port names simple and easy to reference in the testbench.

Create a working directory and save the design as top.v:

$ mkdir func_sim_tut
$ cd func_sim_tut

Listing 32 4-bit ripple-carry adder top.v.

module top (
    input  cin,
    input  a0, a1, a2, a3,
    input  b0, b1, b2, b3,
    output s0, s1, s2, s3,
    output cout
);
    wire c0, c1, c2;

    // Bit 0
    assign s0 = a0 ^ b0 ^ cin;
    assign c0 = (a0 & b0) | ((a0 ^ b0) & cin);

    // Bit 1
    assign s1 = a1 ^ b1 ^ c0;
    assign c1 = (a1 & b1) | ((a1 ^ b1) & c0);

    // Bit 2
    assign s2 = a2 ^ b2 ^ c1;
    assign c2 = (a2 & b2) | ((a2 ^ b2) & c1);

    // Bit 3
    assign s3 = a3 ^ b3 ^ c2;
    assign cout = (a3 & b3) | ((a3 ^ b3) & c2);
endmodule
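
Before running the flow, the bit-level equations in top.v can be checked against integer addition without any simulator. The sketch below mirrors the per-bit sum and carry expressions in shell arithmetic and counts disagreements over all 512 input combinations. The function names ripple_add and count_mismatches are hypothetical and exist only for this illustration.

```shell
# ripple_add A B CIN: evaluate the per-bit equations from top.v and print
# the 5-bit result {cout, s3, s2, s1, s0} as a decimal number.
ripple_add() {
    local a=$1 b=$2 c=$3 sum=0 i=0 ai bi s
    while [ "$i" -lt 4 ]; do
        ai=$(( (a >> i) & 1 ))
        bi=$(( (b >> i) & 1 ))
        s=$(( ai ^ bi ^ c ))                      # s_i = a_i ^ b_i ^ carry-in
        c=$(( (ai & bi) | ((ai ^ bi) & c) ))      # carry-out of bit i
        sum=$(( sum | (s << i) ))
        i=$(( i + 1 ))
    done
    echo $(( sum | (c << 4) ))                    # final carry is cout
}

# count_mismatches: compare ripple_add with a + b + cin for every input.
count_mismatches() {
    local cin a b bad=0
    for cin in 0 1; do
        a=0
        while [ "$a" -lt 16 ]; do
            b=0
            while [ "$b" -lt 16 ]; do
                if [ "$(ripple_add "$a" "$b" "$cin")" -ne $(( a + b + cin )) ]; then
                    bad=$(( bad + 1 ))
                fi
                b=$(( b + 1 ))
            done
            a=$(( a + 1 ))
        done
    done
    echo "$bad"
}

ripple_add 3 5 0      # prints 8
count_mismatches      # prints 0
```

This checks only the source equations, not the implemented netlist, so it complements the simulation rather than replacing it.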

Generating the Post-Implementation Netlist

Copy the VTR flagship architecture into the working directory:

$ cp $VTR_ROOT/vtr_flow/arch/timing/k6_frac_N10_frac_chain_mem32K_40nm.xml .

Note

Replace $VTR_ROOT with the root directory of the VTR source tree.

Run the full VTR flow using run_vtr_flow.py. The VPR option --gen_post_synthesis_netlist on instructs VPR to write the post-implementation Verilog netlist alongside the usual place-and-route output files:

$ source $VTR_ROOT/.venv/bin/activate
$ python3 $VTR_ROOT/vtr_flow/scripts/run_vtr_flow.py \
      top.v \
      k6_frac_N10_frac_chain_mem32K_40nm.xml \
      --gen_post_synthesis_netlist on

After the flow completes, the post-implementation netlist and its SDF file should be present in the temp directory:

$ ls temp/top_post_synthesis.v temp/top_post_synthesis.sdf
temp/top_post_synthesis.sdf  temp/top_post_synthesis.v

Note

The SDF file (top_post_synthesis.sdf) contains back-annotated timing delays and is used for timing simulation. Functional simulation does not require it.

Inspecting the Post-Implementation Netlist

The generated netlist expresses the circuit using FPGA primitives defined in $VTR_ROOT/vtr_flow/primitives.v. Below is a representative snippet:

Listing 33 Snippet of top_post_synthesis.v.

module top (
    input \cin ,
    input \a0 ,
    input \a1 ,
    input \a2 ,
    input \a3 ,
    input \b0 ,
    input \b1 ,
    input \b2 ,
    input \b3 ,
    output \s0 ,
    output \s1 ,
    output \s2 ,
    output \s3 ,
    output \cout
);

    // ... wire declarations and interconnect ...

    LUT_K #(
        .K(5),
        .LUT_MASK(32'b00000000010000010000000000010100)
    ) \lut_s0  (
        .in({ \lut_s0_input_0_4 , 1'bX, \lut_s0_input_0_2 ,
              \lut_s0_input_0_1 , 1'bX }),
        .out(\lut_s0_output_0_0 )
    );

    // ... more LUT_K cells ...

endmodule

The module is named after the top-level module of the design that VTR packed, placed, and routed (top in this tutorial). Port names beginning with \ are standard Verilog escaped identifiers: the backslash and the terminating whitespace are not part of the name, so \cin followed by a space is the same identifier as plain cin. The testbench can therefore connect to these ports using names without the backslash.
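
To see which names the testbench has to connect to, the port declarations can be listed with the backslash removed. This is a small sketch under the assumption that each port is declared on its own line, as in the snippet above. list_ports is a hypothetical helper, not a VTR tool.

```shell
# list_ports FILE: print "direction name" for every input/output declared on
# its own line, dropping the leading backslash of escaped identifiers.
list_ports() {
    sed -n -E 's/^[[:space:]]*(input|output)[[:space:]]+\\?([^[:space:],;]+).*/\1 \2/p' "$1"
}

# Only run when the netlist from this tutorial is present.
if [ -f temp/top_post_synthesis.v ]; then
    list_ports temp/top_post_synthesis.v
fi
```

For the snippet above, this would print lines such as "input cin" and "output cout".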

The primitives used here, LUT_K (look-up table) and fpga_interconnect (routing wire), are defined in primitives.v. If the architecture’s carry chain is used, adder primitives from primitives.v may also appear.
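
The LUT_MASK parameter can be read as a truth table. Assuming the LUT output is the mask bit selected by the input vector, with bit 0 being the rightmost digit (check primitives.v for the authoritative definition), the sketch below lists which input values produce a 1. lut_set_bits is a hypothetical helper written for this illustration.

```shell
# lut_set_bits MASK: print the indices of the 1 bits in a binary mask string,
# counting from 0 at the rightmost digit.
lut_set_bits() {
    local mask=$1 i=0 out=""
    while [ -n "$mask" ]; do
        case $mask in
            *1) out="$out $i" ;;    # last remaining digit is mask bit i
        esac
        mask=${mask%?}              # drop the digit just examined
        i=$(( i + 1 ))
    done
    echo "${out# }"
}

lut_set_bits 00000000010000010000000000010100   # prints: 2 4 16 22
```

In binary these indices are 00010, 00100, 10000 and 10110. If the two unused inputs (positions 0 and 3, tied to 1'bX in the snippet) are taken as 0, the output is 1 exactly when an odd number of inputs 1, 2 and 4 are high. That is consistent with a three-input XOR such as s0 = a0 ^ b0 ^ cin.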

Creating a Testbench

The testbench tb.sv drives every possible combination of the 9 input bits (2⁹ = 512 vectors), computes the expected sum using ordinary arithmetic, and flags any mismatch:

Listing 34 Testbench tb.sv.

`timescale 1ns/1ps
module tb;

    // DUT I/O
    logic cin;
    logic a0, a1, a2, a3;
    logic b0, b1, b2, b3;
    logic s0, s1, s2, s3, cout;

    logic [4:0] expected, actual;
    int errors;

    // Instantiate the post-implementation netlist.
    // The escaped port names (\cin, \a0, ...) are identical to the
    // unescaped names used below, so the named connections work directly.
    top dut (
        .cin (cin),
        .a0  (a0), .a1 (a1), .a2 (a2), .a3 (a3),
        .b0  (b0), .b1 (b1), .b2 (b2), .b3 (b3),
        .s0  (s0), .s1 (s1), .s2 (s2), .s3 (s3),
        .cout(cout)
    );

    initial begin
        errors = 0;

        // Sweep all 2^9 = 512 input combinations
        for (int c = 0; c < 2; c++) begin
            cin = c[0];
            for (int a = 0; a < 16; a++) begin
                {a3, a2, a1, a0} = 4'(a);
                for (int b = 0; b < 16; b++) begin
                    {b3, b2, b1, b0} = 4'(b);
                    #1; // allow combinational outputs to settle

                    expected = 5'(a + b + c);
                    actual   = {cout, s3, s2, s1, s0};

                    if (actual !== expected) begin
                        $display("FAIL: a=%0d b=%0d cin=%0d  expected=%0d  got=%0d",
                                 a, b, c, expected, actual);
                        errors++;
                    end
                end
            end
        end

        if (errors == 0) begin
            $display("All 512 tests PASSED.");
            $finish;          // exit 0 — success
        end else begin
            $display("%0d test(s) FAILED.", errors);
            $fatal(1);        // exit 1 — failure
        end
    end

endmodule

Line numbers below count from the `timescale directive at the top of tb.sv. Lines 16-22 instantiate the DUT (Device Under Test), the post-implementation netlist module top, using named port connections. Line 36 computes the expected 5-bit result using integer arithmetic, and line 37 assembles the actual 5-bit result from the individual DUT output bits. Lines 39-43 use $display (not $error) to report each mismatch. $display is non-fatal, so every one of the 512 vectors is checked and all failures are reported before the simulation exits. Lines 48-54 select the final exit path: $finish ends the simulation with exit code 0 (success), while $fatal(1) ends it with a non-zero exit status (failure), making the simulation suitable for use as a CI test.

Simulating with Verilator

Compile the testbench, post-implementation netlist, and VTR primitives together. Verilator’s --binary flag produces a standalone simulation executable in a single step:

$ verilator --binary -sv \
      tb.sv \
      temp/top_post_synthesis.v \
      $VTR_ROOT/vtr_flow/primitives.v \
      --top-module tb \
      --Mdir sim_build \
      -j $(nproc)

The -j $(nproc) flag parallelises the build across all available CPU cores. The resulting simulation binary is sim_build/Vtb. Run it to execute all 512 test vectors:

$ ./sim_build/Vtb
All 512 tests PASSED.

Note

primitives.v uses Verilog specify blocks to model timing. Verilator ignores these for functional simulation, so no timing information is back-annotated, which is the desired behaviour here.

If any test vector fails, the simulation prints each failure, reports the total count, and exits with a non-zero return code:

FAIL: a=3 b=5 cin=0  expected=8  got=7
1 test(s) FAILED.
%Fatal: tb.sv:53: $fatal called

Such a failure would indicate a bug in VPR’s implementation of the circuit, assuming the input Verilog design and the testbench are themselves correct and consistent with each other. It can then be investigated by examining the post-implementation netlist or by running the full timing simulation with waveform capture.
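
Because the simulation binary reports failure through its exit status, it can be wrapped in a small script for CI. The following is a minimal sketch: run_sim and the default log name sim.log are hypothetical choices. It treats any non-zero exit status, or any line beginning with FAIL: in the output, as a failure.

```shell
# run_sim BINARY [LOGFILE]: run a simulation executable, keep its output in a
# log file, and return non-zero if it exited abnormally or printed a FAIL line.
run_sim() {
    local bin=$1 log=${2:-sim.log} status=0
    "$bin" >"$log" 2>&1 || status=$?
    cat "$log"
    if [ "$status" -ne 0 ] || grep -q '^FAIL:' "$log"; then
        echo "functional simulation FAILED (exit status $status)" >&2
        return 1
    fi
    echo "functional simulation passed"
}

# Only run when the binary built in this tutorial is present.
if [ -x ./sim_build/Vtb ]; then
    run_sim ./sim_build/Vtb
fi
```

Checking the log as well as the exit status is a defensive choice; the exit status alone is sufficient for the testbench in this tutorial.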