RTL-Synchronized Transaction Reference Models

DesignCon East 2003

System-on-Chip and ASIC Design Conference




Experiences with RTL-Synchronized Transaction Reference Models


Dave Whipp,

NVIDIA Corporation.


Abstract

Verification methods are increasingly being built on the transaction abstraction. But this abstraction is not sufficient to adequately capture the intricate timing relationships between interacting transactions. This gap is typically bridged using a reification approach that either adds timing details to the model or replaces a part of the model with a more precise implementation. This paper discusses the alternative approach of imposing timing information as an input to the scheduler from an RTL simulation. The paper focuses on the practical experiences drawn from two projects in which the approach was used. It describes the mechanisms used to convert a functionally accurate model into a transaction model, and the problems that arise when we attempt to impose timing on a supposedly timing-agnostic model.

Author's Biography

Dave Whipp has recently joined NVIDIA as Senior Verification Engineer. This paper is based on work at Fast-Chip Inc., where he experienced the rise, and then the fall, of a Silicon Valley startup. Prior experience includes verification and modeling for Infineon's TriCore RISC/DSP microprocessor, and for GEC Plessey Semiconductor's System-ASIC (SoC) project. Since graduating from UMIST in 1993, he has authored a number of papers [e.g. (1,2)] on modeling for conferences and user-groups, and can be found on several internet discussion forums.


Introduction

Functional, or Cycle Accurate? This is a basic question asked when discussing the C model of an ASIC. Often we find that both abstractions are required, so we build two models, generally with little or no shared code between them.

When I joined Fast-Chip, in November 2000, the team had created a functionally accurate C-model. There was no customer pressure for a cycle-accurate variant. The verification group, however, was troubled. The PolicyEdge network services processor achieves its performance through parallelism, and precise timing of state changes determines the behavior of interacting threads. Although the model had enabled verification of isolated features, it did not seem possible to use it as a reference for random testing. Creating a new cycle-accurate model would impose an undesirable schedule slip.

We decided to use an alternative approach: to use timing information from the RTL simulation to determine how interactions between threads should be resolved.

Figure 1: synchronizing a single-port memory -- delays may be variable, but an arbiter controls the sequence

Verification Flow

Our verification flow was typical. A test script defines a set of transactions that stimulate the design. These transactions are first applied to the C model. If the result is reasonable, then the same test is run on the RTL. Finally, a fuzzy compare checks that the two simulations are sufficiently similar.

A fuzzy compare is adequate when we can partition the outputs of the simulations into streams of in-order transactions. It breaks down, however, when no such streams exist, and when interactions between streams result in functional effects. In these cases, it is necessary to introduce timing information into the reference model. We modified our flow to achieve this, without creating a cycle-accurate model.
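
As a minimal illustration (the structure and names here are invented, not taken from our testbench), a per-stream compare of this kind checks only content and order, never timestamps:

    struct txn
    {
        int payload;    /* simplified: real transactions carry many fields */
    };

    /* Returns 1 if one stream's transactions match in content and order;
     * cycle times are deliberately ignored. */
    int compare_stream(const struct txn *model, int n_model,
                       const struct txn *rtl, int n_rtl)
    {
        int i;
        if (n_model != n_rtl)
            return 0;                      /* missing or extra transaction */
        for (i = 0; i < n_model; i++)
            if (model[i].payload != rtl[i].payload)
                return 0;                  /* functional mismatch */
        return 1;
    }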

This new flow first runs a test on the RTL. If the RTL simulation completes successfully (that is, it completes with no assertion/checker errors), then we process its output to extract a flow of synchronization data. The test is then run on the C model (now modified to be transaction accurate), with the synchronization data as an additional input. This input controls the scheduler, ensuring that the order of interactions between threads mirrors that of the equivalent interactions in the RTL. Because of this synchronization, the fuzzy compare becomes trivial with respect to timing differences.

The Transaction Model

The original model was a single-threaded application. This thread implemented packet processing (and other functionality) as a series of operations, decisions and loops. This end-to-end model was adequate for isolated processing, but was unable to support concurrent, interacting, transactions.

Two approaches suggest themselves. The first is to introduce multiple threads; the second is to refactor the code to use callbacks from a scheduler. For reasons of portability and risk-management, we chose to restrict ourselves to single-threaded solutions. Multi-threaded code is notorious for its bugs, due to interaction between threads. Furthermore, we would need to suppress the non-deterministic aspects of thread interaction.

The single-threaded approach has its own risks. The execution-flow of a functional model is expressed explicitly in the code. The introduction of a scheduler abstracts this flow. The simulation of multiple threads with a homegrown scheduler forces us to abandon language level control flow (e.g. while loops); and curtails our use of the stack as the execution context [cf. (12)]. Fortunately, the transactions of the PolicyEdge are sufficiently simple that the single-threaded approach remains viable.

Refactoring to a Transaction Model

To refactor is to modify the design of existing code without changing its behavior. Martin Fowler defined it (3) as a mechanical process of small steps supported by tests. The conversion of the original functional model to the transaction model is a set of such steps. For each transaction:

  1. Move local variables into a context structure. Create an instance (on the heap, not the stack) at start of transaction -- and delete at end.
  2. Replace iterative loops with recursive functions.
  3. For each function that requires synchronization (directly or indirectly), replace the call with a request/callback pair.

After each step, it is possible to run a complete regression: the behavior of the code will be unchanged. The following example demonstrates these steps. Keep in mind the mechanical nature of the transform. Whilst we were not able to automate it via a script, the rules of the transformation are mindless:

    int classify_packet(Packet* pkt, Rule *rule)
    {
        int hop_count = 0;

        while (rule && hop_count++ < MAX_HOP_COUNT)
        {
            int field = extract(pkt, rule);
            int result = interpret(field, rule);

            if (result != ITERATE)
            {
                return result;
            }

            rule = rule->next;
        }
        return DELETE_PACKET;
    }

This code classifies a packet according to a list of rules. The "extract" function may need to wait for the requested part of the packet to be input to the system: if multiple instances of this function are to interact (while "extract" is waiting), then we must transform the code according to our set of steps. Space does not permit me to show the intermediate steps, but the footprints are clearly visible in the resulting code. First, the "context" structure of step 1:

    struct context
    {
        Packet *pkt;
        Rule *rule;
        int hop_count;
        int field;
        int result;
        void (*callback) (int result);
    };

The final field in the structure is a callback, used when the classification is complete. Even though "extract" is the only part that will block, the need for callbacks ripples up to the root of the transaction. After replacing the loop with recursion, the resulting code is a set of fragments that implements the transaction. The recursive nature of the classify-packet iteration remains, broken only if "request_extract" uses the scheduler (and therefore returns, unwinding the stack, before actually calling the interpret step):

    void request_classify_packet (Packet* pkt, Rule *rule, void (*callback)(int))
    {
        struct context *self = calloc(1, sizeof(struct context));
        /* initialise context */
        self->pkt = pkt;
        self->rule = rule;
        self->callback = callback;
        classify_packet_iteration_begin(self);
    }

    void classify_packet_iteration_begin (struct context *self)
    {
        if (self->rule && self->hop_count++ < MAX_HOP_COUNT) 
        {
            request_extract(self, &classify_packet_iteration_interpret);
        }
        else
        {
            self->result = DELETE_PACKET;
            classify_packet_iteration_end(self);
        }
    }


    void classify_packet_iteration_interpret (struct context *self)
    {
        self->result = interpret(self->field, self->rule);
        self->rule = self->rule->next;
        classify_packet_iteration_end(self);
    }

    void classify_packet_iteration_end (struct context *self)
    {
        if (self->result == ITERATE)
        {
            classify_packet_iteration_begin(self);
        }
        else
        {
            /* free self; call callback, with result */
            void (*callback)(int) = self->callback;
            int result = self->result;
            free(self);
            callback(result);
        }
    }
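
For completeness, this is one way a caller might use the refactored entry point; the harness function and the use of printf are illustrative, not part of the model:

    #include <stdio.h>

    /* Completion callback: invoked when the final step of the classification
     * has been synchronized. */
    static void classification_done(int result)
    {
        printf("classification result: %d\n", result);
    }

    /* The call returns immediately if request_extract defers to the
     * scheduler; classification_done fires later, when the corresponding
     * synch-messages have been consumed. */
    void start_one_classification(Packet *pkt, Rule *rules)
    {
        request_classify_packet(pkt, rules, &classification_done);
    }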

The fragmentation of the function-aspect is clearly visible in the resulting code. In addition to its opaque control flow, the code size is greater. The PolicyEdge C-model grew from 18.8 K lines of C to 29.2 K when we introduced transactions (these figures are from our CVS repository, and are not exclusively a result of the refactoring: bug fixes, and some other enhancements, were made over the same period). One of the goals of future work will be to minimize the overhead incurred by the introduction of explicit transactions.

The Scheduler

The scheduler acts as a buffer between any two stages of a transaction. Instead of accessing a shared resource directly, the prior step calls a submit function. The scheduler will call the actual next-step function at some time in the future. That time is supplied as a synch-message, which contains the name of the step and a key that identifies the instance of the transaction (this key is needed only when multiple instances of the transaction can be pending on a step, and may be serviced out-of-order). Internally, the scheduler maintains queues of contexts and their associated callbacks. The synch-message selects the appropriate context/callback for the transaction.
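
A minimal sketch of such a scheduler core is shown below. This is not the Fast-Chip code: the names (submit_step, synch_step), the flat pending table, and its fixed size are all invented for illustration.

    #include <assert.h>
    #include <string.h>

    #define MAX_PENDING 256

    /* Each pending entry pairs a transaction context with the callback that
     * will advance it; a synch-message selects the entry by step name and key. */
    struct pending
    {
        const char *step;                /* name of the synch point          */
        int key;                         /* transaction instance identifier  */
        void *ctx;                       /* context structure of the txn     */
        void (*callback)(void *ctx);     /* next fragment of the transaction */
    };

    static struct pending pending[MAX_PENDING];
    static int n_pending;

    /* Called by the model instead of invoking the next step directly. */
    void submit_step(const char *step, int key, void *ctx, void (*cb)(void *))
    {
        assert(n_pending < MAX_PENDING);
        pending[n_pending].step = step;
        pending[n_pending].key = key;
        pending[n_pending].ctx = ctx;
        pending[n_pending].callback = cb;
        n_pending++;
    }

    /* Called for each synch-message: run the matching next step, if any. */
    int synch_step(const char *step, int key)
    {
        int i;
        for (i = 0; i < n_pending; i++)
        {
            if (strcmp(pending[i].step, step) == 0 && pending[i].key == key)
            {
                struct pending p = pending[i];
                pending[i] = pending[--n_pending];   /* remove the entry     */
                p.callback(p.ctx);
                return 1;                            /* transaction advanced */
            }
        }
        return 0;     /* not pending yet: the scheduler must read more stimuli */
    }

The real scheduler maintained a queue per synch point; the flat table here simply keeps the sketch short.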

During a simulation, the scheduler interleaves two input sources: commands from a test-stimuli file, and synch-messages from the synch-file. These two input sources are both file-streams, and are read using blocking IO. The rule that correctly interleaves the two sources is to read the synch-file until a synch-message names a transaction that does not yet exist, and then to read a record from the test-stimuli file. If the stimulus creates the required transaction, then we synchronize it (i.e. call the callback), and continue reading the synch-file again. If the stimulus does not create the transaction, then we continue reading (and buffering) more stimuli until the required transaction is created. If the end of file is reached, or if the buffer space is exhausted, then the test fails.
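
Sketched in the same illustrative style (all helper names and the buffer limit are invented; synch_step is the function from the previous sketch), the interleaving rule reduces to a pair of nested loops:

    #include <stdio.h>

    #define MAX_STIMULI_BUFFER 1024

    /* Hypothetical helpers, declared only to keep the sketch self-contained. */
    int  read_synch_record(FILE *f, char *step, int *key); /* blocking read   */
    int  read_stimulus_record(FILE *f);                    /* blocking read   */
    void apply_stimulus(void);           /* may create new pending txns       */
    void fail_test(const char *reason);
    int  synch_step(const char *step, int key);  /* from the previous sketch  */

    void run_simulation(FILE *synch_file, FILE *stimuli_file)
    {
        char step[64];
        int key;
        int buffered = 0;

        while (read_synch_record(synch_file, step, &key))
        {
            /* Read (and buffer) stimuli until the named transaction exists. */
            while (!synch_step(step, key))
            {
                if (!read_stimulus_record(stimuli_file) ||
                    ++buffered > MAX_STIMULI_BUFFER)
                {
                    fail_test("synchronized transaction was never created");
                    return;
                }
                apply_stimulus();
            }
            buffered = 0;
        }
    }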

In practice, a C model that requires a synch-file (extracted from an RTL simulation) is too specialized. The verification group is only one user of the simulator. To enable other users to run the model (for example, our customers), we created a mode in which the same set of synchronization points was scheduled by priority.

Transaction Synchronization in Action

The introduction of transaction synchronization was our response to the urgent need to move to the next stage of verification. Cycle-accurate timing is a don't-care behavior, which we exploited to simplify verification. By definition, if a condition is a don't-care, then the RTL implementation is valid. Therefore, we simply use whatever timing the RTL provides. The transaction model defines what behaviors are legal: the RTL tells us which legal behavior to choose.

Although the precise timing is fungible, this freedom is not usually absolute. There may be fairness constraints on arbiters; and there may be performance requirements that constrain the number of cycles permitted for a transaction. These constraints are orthogonal to synchronization: both can be checked by assertions within the testbench, independently of the model. Being a property of sequence, not of absolute time, fairness can be checked by either the testbench or the model. Whichever approach we use, the issue can be ignored for synchronization.

The need to run tests with complex interactions was urgent, so we did not have time to redesign the testbench. Fortunately, the existing testbench output a log of activity within the RTL. The simplest way of extracting synchronization information was to filter this log file through a Perl script. As we made progress synchronizing the transactions, a number of common themes emerged.

Pipeline Delays and Priorities

The testbench's log file included timestamps that indicate when an event occurred. The RTL implements transactions as pipelines and state-machines, and we need to find the point in time when transactions appear to interact. It is sometimes easier for the testbench to output a message a few cycles before or after the actual access to a resource. In almost all cases, this simplification could be overcome by an adjustment to the timestamp (adding or subtracting a fixed number of cycles) of one or both of the messages.

Figure 2: synchronizing a dual-port memory -- balancing the delays is easy if they are constant

Even when delays are balanced, the interaction ordering may still need adjustment. When two threads access a resource in the same cycle we need to know which has priority. An example of this is a dual ported memory: if it is valid for a read and write to occur simultaneously to the same address, then only priority can tell us which value will be returned by the read. We sometimes found that adjusting a priority for one resource would break a priority relationship with another. In most cases, we could juggle the priorities to form a consistent group. Occasionally, we were forced to create a new synchronization point, which increased fragmentation of the model.

As a result of these adjustments, we created a new log file: stripped of unused lines, and sorted by the adjusted timestamps and priorities.
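
In our flow this adjust-and-sort step lived in the Perl filter script; the C-flavoured sketch below shows only its essence, with illustrative field names and per-step offsets:

    #include <stdlib.h>

    struct synch_rec
    {
        long timestamp;   /* cycle reported by the testbench monitor          */
        int  priority;    /* breaks ties when two events share a cycle        */
        int  step;        /* which synch point this record will drive         */
        int  key;         /* transaction instance identifier                  */
    };

    static int cmp_rec(const void *a, const void *b)
    {
        const struct synch_rec *x = a, *y = b;
        if (x->timestamp != y->timestamp)
            return (x->timestamp < y->timestamp) ? -1 : 1;
        return x->priority - y->priority;   /* same cycle: priority decides   */
    }

    /* Apply a fixed per-step offset (to balance pipeline delays), then sort. */
    void adjust_and_sort(struct synch_rec *recs, int n,
                         const long *offset_for_step)
    {
        int i;
        for (i = 0; i < n; i++)
            recs[i].timestamp += offset_for_step[recs[i].step];
        qsort(recs, n, sizeof recs[0], cmp_rec);
    }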

Queue Sizes

RTL-synchronization makes use of the don't-care nature of cycle timing. However, don't-cares are not limited to cycle times. Other implementation artifacts can also be don't-cares. One such detail is the size of queues.

Instead of defining the sizes of queues in the model, we implemented a dual-natured synch point. When the pending transaction is "add element to queue", the corresponding synch points become "accept" and "drop". Although we defined minimum-size restrictions (e.g., it is an error to drop an element when adding to an empty queue), we tended not to restrict the upper bound. This decoupling provides another dimension along which designers can modify the implementation without affecting the model.
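
A sketch of such a dual-natured synch point, under the assumption that the model tracks only its own logical queue depth and leaves the real depth to the RTL (the names are invented):

    #include <assert.h>

    struct model_queue
    {
        int depth;           /* elements currently held by the model          */
        int pending_add;     /* "add" requests awaiting an accept/drop synch  */
    };

    void queue_request_add(struct model_queue *q)
    {
        q->pending_add++;    /* submitted; the outcome comes from the RTL     */
    }

    void queue_synch_accept(struct model_queue *q)
    {
        assert(q->pending_add > 0);
        q->pending_add--;
        q->depth++;          /* no upper bound is enforced here               */
    }

    void queue_synch_drop(struct model_queue *q)
    {
        assert(q->pending_add > 0);
        assert(q->depth > 0);   /* dropping when adding to an empty queue is an error */
        q->pending_add--;       /* element discarded by the model             */
    }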

Transaction Tracking

The synch-file, which controls the scheduler, uses a simple record-per-line format. When multiple instances of a transaction could be valid at a given step, this record must contain a transaction identifier. This identifier is usually available as a signal within the RTL. For example, a memory request may include a tag that identifies the context to which the reply will be matched. However, there are some cases where no such tag exists. In these cases, it was necessary to set up a virtual tag in the testbench, which tracks the transaction as it progresses through the chip.

Our testbench language, Perl, enabled this transaction tracking. GreenLight's Pivot (6) is a Perl module that provides an abstraction over the Verilog PLI, similar to that of Cadence's TestBuilder (7,8) for C++. Using Pivot, we represent transactions as Perl objects, and the pipelines as a sequence of Perl arrays. An object moves through the system by being shifted from one array and pushed onto the next. The use of arrays (rather than scalar variables) introduces slack into the transaction tracking: the different arrays represent significant points in the transaction path, not individual pipeline stages.

Non-Causal behavior

Figure 3: Speculative read: only in stage 4 can the model decide it needs the data requested in stage 1

In a hardware design, effect can appear to precede cause -- because the RTL may read a memory speculatively. In extreme cases, the design reads a memory on every cycle. In contrast, the transaction model accesses the resource only when it knows that the access is required. The filtering script can usually remove the excess accesses: but it cannot synchronize an access that is part of a transaction before the transaction itself has been started.

The solution was to rearrange the order of messages. The problems associated with speculation only become apparent when two or more resources (memories) are involved. In the diagram, a "control" RAM determines the use of values from a "data" RAM. If we shift the (extracted) timings of one relative to the other, then we avoid the non-causal aspect of speculation. Of course, we also need to filter the accesses to the "data" RAM, to synchronize only those reads that are actually used.

This approach, of changing the perceived ordering of the transactions, can be likened to creative accounting: and it did occasionally backfire. When we move the timing of one port, we also need to move the timing of all the other ports. This movement requires slack in other, unrelated, transactions; and can cause ripple effects until such slack is found. As a practical matter, we chose not to analyze these effects too closely. We made an unchecked assumption that the necessary slack would be available. The result was occasional, and obscure, false-fails during random testing. When such problems arose, we were forced to add the correct (implementation-specific) transactions to the model: this was usually less painful than it had initially appeared.

Testbench "forces"

To stress the design, it was sometimes useful to force values onto nets during a simulation. Many of these forces were intended only to increase backpressure on the pipelines, to hit more corner-cases. This increased pressure affected the timing of the transactions: but any functional changes were a result of these timing differences. For these cases, the synchronization points we already had were sufficient to propagate the forces into the C simulator.

Figure 4: white-box testing: we inject values into the increment-path, and sample values used by the client

However, other forces were not limited to timing effects. As an example, consider a background process that walks through the memory, checking/updating its contents. This may take many milliseconds to walk through the entire address range: each access being triggered by a timer tick. If only a small number of addresses are actually used by a test, then we are able to stress the design by limiting the background process to touch only those addresses that are in use. But when the addresses being touched are forced away from their natural progression, then the C model will indicate a functional error (wrong address ticked). We needed a way to propagate the forced addresses into the model.

Our solution was a consequence of the history of the model. It had begun life as a functional simulator, and had only later been converted to transactions. A deterministic functional model cannot have background processes, so the authors had instead chosen to include a "do tick" command in its control language: users were required to issue this command periodically.

We created a script that interleaved "do tick" commands with the original test script to create a new test script. We then used this new test as the control stimuli for the C model. Thus, the addresses became a primary input to the simulator: but the timing remained under the control of the synch-file. In a second project, we attempted to include the (functional) address information within the synch-file input: this approach worked, but was not significantly simpler. However, we continue to explore the potential of functional information within the synchronization-stream.

Triage and Debug

The most obvious benefit of transaction synchronization is the reduced modeling effort required to track the evolving design. However, there are additional implications beyond the model. The process of debugging is changed by the use of synchronized transactions. When a test fails, the first activity is to triage the class of failure. It turns out that the most pervasive class is the synchronization failure: the test fails because the RTL executed a transaction-step that the model did not expect. A consequence of this classification was a perception that synchronization is a continuous headache.

The reality is somewhat different. The huge number of issues reported as synchronization failures is a testament to the power of the technique: minor deviations from expected behavior are quickly detected. Many of these deviations would be missed under a cycle-accurate modeling approach, because cycle-accurate modelers almost always end up looking at the RTL to discover the exact cycle behavior.

Many of the synchronization errors were in the testbench or the model. We often needed to adjust the pipeline delay, or priority, of a monitor in the testbench. One fact that I have not previously emphasized is that the extraction of synchronization messages occurs as post-processing of the simulation log file. If we believe that a problem is a testbench/model issue, then we do not need to rerun the RTL simulation: we can manually re-order lines in the log file. If this fixes the problem then we will adjust the priorities/delays accordingly. The ability to use the transaction log in this replay mode is one reason for the efficiency of our verification methodology.

A second attempt

After the successful tapeout of the PolicyEdge, it was natural to follow a similar approach on the next project. Having seen the effectiveness of the transaction model, we decided to build the model using fragmented transactions: but to omit the initial construction of the functional model, and thus the subsequent refactoring.

This was a mistake. In constructing the transactions early, we did not have the benefit of seeing the RTL's reification. Although we made intelligent guesses, we tended to fragment the transactions more than was required. The resulting model had functional errors, which would have been easy to debug in a functional environment (where it is possible to single-step through a loop): but which became obscured by the fragmented transactions. In hindsight, we should have ignored the perceived need for transactions, and followed the rule of YAGNI (4). Conversion to transactions had been demonstrated to be trivial (via refactoring), so we should not have pre-empted the process.

More successful was our foresight in the testbench. Knowing that we would need to track the transactions, we were careful to create monitors on the appropriate interfaces, and to include transaction identifiers that could be used to synchronize the model. As a result of this methodology, the script that extracted the synchronization information was significantly simpler than its predecessor.

Future Directions

Methodology

Embracing RTL-synchronization did more than just simplify the problem of cycle-precise verification. It changed our approach to verifying a complex ASIC.

Our points of verification became accesses to resources. Where previously we had verified transactions at the interfaces to modules, we now verified behavior deep inside the blocks. This de-emphasis of interfaces gave flexibility to the design team. Designers became free to move functionality between blocks. They were able to reorganize the hierarchy in response to feedback from the back-end flow. One function actually changed clock domains several times, and ended up partitioned across two domains.

This flexibility was not without limits. Verification is only one constraint on the design process. Our approach enables the designer to restructure a design in response to the real physical constraints of the design, rather than worrying about the impact of such a change on the verification flow.

A second effect of this de-emphasis of module interfaces is that we delayed writing unit testbenches. A unit-level testbench exercises a block by stimulating the interface. With fluid interfaces the production of these tests would be largely a wasted effort. RTL-Synchronization enables excellent localization of faults from system-level tests. This, combined with a philosophy that prioritized end-to-end transactions over functional completeness, allowed us to postpone unit testing until deep into the physical design process. By that time, we had detailed coverage information, from the system tests, to guide our efforts; and the interfaces had stabilized in the back-end flow.

These methodology comments are a result of observing the experimental methodology on just one chip. The effects could be a result of the personalities of the designers and verification engineers involved; and it may be specific to the nature of the design. Broader experience is needed to see if it is more widely applicable.

Technology

The modeling mistakes of the second project taught us an important lesson: avoid fragmenting the code until there is a demonstrated need to do so. The fragmentation is a result of simulating thread interactions in a single-threaded application.

There is a perception that multi-threaded applications are more difficult to create than single-threaded ones. This is due to the need to synchronize access to shared variables. Thus using a threading library, such as "pthreads" (10), could create more problems than it solves. We desire the benefits of multiple stacks, without the problems of multiple threads (12). An alternative is the Quick-Threads library (9), used as the heart of SystemC (11). Neither pthreads nor SystemC provides a file-synch capability, but this can be implemented as a wrapper-layer on either.
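
As an indication of what such a wrapper-layer might look like on pthreads (this is speculation, not an existing library facility; all names are invented, and the key value -1 is reserved here to mean "no event"):

    #include <pthread.h>
    #include <string.h>

    static pthread_mutex_t synch_mutex = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  synch_cond  = PTHREAD_COND_INITIALIZER;
    static char current_step[64];
    static int  current_key = -1;            /* -1: no event outstanding      */

    /* Called from a transaction thread in place of a scheduler callback:
     * blocks until the dispatcher releases this step/key. */
    void wait_for_synch(const char *step, int key)
    {
        pthread_mutex_lock(&synch_mutex);
        while (current_key != key || strcmp(current_step, step) != 0)
            pthread_cond_wait(&synch_cond, &synch_mutex);
        current_key = -1;                    /* consume the event             */
        pthread_cond_broadcast(&synch_cond); /* let the dispatcher continue   */
        pthread_mutex_unlock(&synch_mutex);
    }

    /* Called by the dispatcher thread for each record of the synch-file;
     * it waits until the target thread has consumed the event, so that the
     * ordering of the file is preserved exactly. */
    void release_synch(const char *step, int key)
    {
        pthread_mutex_lock(&synch_mutex);
        strncpy(current_step, step, sizeof current_step - 1);
        current_step[sizeof current_step - 1] = '\0';
        current_key = key;
        pthread_cond_broadcast(&synch_cond);
        while (current_key != -1)
            pthread_cond_wait(&synch_cond, &synch_mutex);
        pthread_mutex_unlock(&synch_mutex);
    }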

An RTL-synchronized transaction model provides cycle accuracy only when that accuracy is supplied as its synch-input. This is not a solution that can be supplied to users of the model who do not have access to the RTL (nor the infrastructure to create the synch-file). Nevertheless, the fact that the model can become cycle-accurate suggests that it contains sufficient hooks on which to hang cycle-accuracy: it should be possible to create a pure "timing model", which creates the synch-input as a plug-in to the model. If a true cycle-accurate model is not required until late in the project (e.g. until after tapeout), then the plug-in timing approach becomes very attractive: the model can be made cycle-accurate, after-the-fact, without needing to modify it. Even when performance modeling requires a cycle-model early in the project, an architecture that treats timing as an orthogonal aspect seems advantageous from a software engineering viewpoint.

If a cycle-model can be a plug-in, can we find other useful scheduling policies? One that I find interesting is a "random" synch-model, which chooses any legal synch-point at each step. This would enable us to stress the architecture beyond the requirements of a specific micro-architecture.
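
A sketch of how such a policy could plug in, reusing the pending[] table from the earlier scheduler sketch (again, the names are invented): the core loop is policy-agnostic, and the random policy simply picks any legal entry.

    #include <stdlib.h>

    /* A policy returns the index of the pending entry to advance, or -1.
     * pending[] and n_pending are from the earlier scheduler sketch. */
    typedef int (*synch_policy)(void);

    /* Random policy: stress the architecture by choosing any legal synch point. */
    static int random_policy(void)
    {
        return (n_pending > 0) ? rand() % n_pending : -1;
    }

    /* Drive the model until the chosen policy has nothing left to schedule;
     * callbacks may submit further steps, so the loop keeps going. */
    void run_with_policy(synch_policy next_choice)
    {
        int i;
        while ((i = next_choice()) >= 0)
        {
            struct pending p = pending[i];
            pending[i] = pending[--n_pending];
            p.callback(p.ctx);
        }
    }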

Summary

I have described a verification flow where we use a transaction model to enable cycle-accurate comparison between the RTL and the reference model, without needing to create a cycle-accurate model. This exploits the don't-care nature of cycle timing, and of specific other implementation artifacts. The resulting design flow enables RTL engineers to modify their pipelining without requiring corresponding changes to the reference model. Towards the end of the project, when timing closure becomes the focus, this decoupling removes the verification team from the critical path of the project.

Earlier in the project, rapid development of an architectural model is possible by postponing consideration of thread interactions until we have tests that fail for its lack. I have described a refactoring that converts a functional model into a synchronizable transaction model. In order to implement this synchronization, we extract transaction commit points from an RTL simulation: currently by parsing its transaction log. A transaction-based testbench ensures that the required information is available.

Verification against the transaction model at every commit point can be very finicky. The dominant failure mode for tests in the regression became the "synchronization error" -- indicating that the transaction captured from the RTL did not match the transaction modeled in the C simulator. We developed a number of techniques for working with this failure mode, ultimately allowing significant debug without needing to rerun the simulation (or even look at waveforms). Using these techniques, verification based on comparison against the RTL-synchronized transaction model becomes more efficient than using either a timed or an untimed functional model.

Acknowledgements

In writing this paper, I wish to acknowledge the people who made it possible. Firstly, Fast-Chip Inc., for providing the environment in which to experiment. Secondly, the verification team at Fast-Chip, who worked with me to make my ideas become reality, especially Chris Refino, the primary coder of the C simulator. Finally, Green Light LLC, whose testbench environment makes experimentation so easy.

References

  1. "Constructing High Level Macrocell Models Using the Shlaer-Mellor Method", Dave Whipp -- www.esscirc.org/papers-97/22.pdf
  2. "Bottom-Up Modeling", Dave Whipp -- www.projtech.com/pubs/confs/2002.html
  3. "Refactoring: improving the design of existing code", Martin Fowler -- ISBN:0-201-48567-2
  4. YAGNI: a practice of extreme programming (5) -- www.c2.com/cgi/wiki?YouArentGonnaNeedIt
  5. "Extreme Programming Explained", Kent Beck -- ISBN:0-201-61641-6
  6. GreenLight LLC -- www.greenl.com
  7. TestBuilder -- www.testbuilder.net
  8. "The Transaction Based Methodology", Dhananjay S. Brahme et al -- www.testbuilder.net/whitepapers/tbv00tr2.pdf
  9. "Tools and Techniques for Building Fast Portable Threads Packages", David Keppel -- www.cs.washington.edu/research/compiler/papers.d/quickthreads.html
  10. "Multithreaded Programming with PThreads", Lewis/Berg -- ISBN:0-13-680729-1
  11. SystemC -- www.systemc.org
  12. "Cooperative Task Management without Manual Stack Management", Atul Adya et al -- research.microsoft.com/~adya/pubs/usenix2002-fibers.pdf
