When I joined Fast-Chip in November 2000, the team had created a functionally accurate C model. There was no customer pressure for a cycle-accurate variant. The verification group, however, was troubled. The PolicyEdge network services processor achieves its performance through parallelism, and the precise timing of state changes determines the behavior of interacting threads. Although the model had enabled verification of isolated features, it did not seem possible to use it as a reference for random testing. Creating a new cycle-accurate model would have imposed an undesirable schedule slip.
We decided to use an alternative approach: to use timing information from the RTL simulation to determine how interactions between threads should be resolved.
A fuzzy compare is adequate when we can partition the outputs of the simulations into streams of in-order transactions. It breaks down, however, when no such streams exist, or when interactions between streams have functional effects. In these cases, it is necessary to introduce timing information into the reference model. We modified our flow to achieve this without creating a cycle-accurate model.
This new flow first runs a test on the RTL. If the RTL simulation completes successfully (that is, it completes with no assertion/checker errors), then we process its output to extract a flow of synchronization data. The test is then run on the C model (modified to be transaction accurate), with the synchronization data as an additional input. This input controls the scheduler, ensuring that the order of interactions between threads mirrors that of the equivalent interactions in the RTL. Because of this synchronization, the fuzzy compare becomes trivial with respect to timing differences.
Two approaches suggest themselves for modeling transaction interaction. The first is to introduce multiple threads; the second is to refactor the code to use callbacks from a scheduler. For reasons of portability and risk management, we chose to restrict ourselves to single-threaded solutions. Multi-threaded code is notorious for bugs caused by interaction between threads. Furthermore, we would have needed to suppress the non-deterministic aspects of thread interaction. The decisive factor, however, was a requirement to run the C model on embedded platforms.
The single-threaded approach has its own risks. The execution flow of a functional model is expressed explicitly in the code; the introduction of a scheduler abstracts this flow. Simulating multiple threads with a homegrown scheduler forces us to abandon language-level control flow (e.g. while loops), and curtails our use of the stack as the execution context [cf. (12)]. Fortunately, the transactions of the PolicyEdge are sufficiently simple that the single-threaded approach remains viable.
int classify_packet(Packet *pkt, Rule *rule)
{
    int hop_count = 0;
    while (rule && hop_count++ < MAX_HOP_COUNT)
    {
        int field = extract(pkt, rule);     /* may block, waiting for input */
        int result = interpret(field, rule);
        if (result != ITERATE)
        {
            return result;
        }
        rule = rule->next;
    }
    return DELETE_PACKET;
}
This code classifies a packet according to a list of rules. The extract function may need to wait for the requested part of the packet to be input to the system. If multiple instances of this function are to interact (while extract is waiting), then we must transform the code according to our set of steps. Space does not permit me to show the intermediate steps, but their footprints are clearly visible in the resulting code. First, the context structure of step 1:
struct context
{
    Packet *pkt;
    Rule   *rule;
    int     hop_count;
    int     field;
    int     result;
    void  (*callback)(int result);
};
The final field in the structure is a callback, used when the classification is complete. Even though extract is the only part that will block, the need for callbacks ripples up to the root of the transaction. After replacing the loop with recursion, the resulting code is a set of fragments that implements the transaction. The recursive nature of the classify_packet iteration remains, broken only if request_extract uses the scheduler (and therefore returns, unwinding the stack, before actually calling the interpret step):
void request_classify_packet(Packet *pkt, Rule *rule, void (*callback)(int))
{
    struct context *self = calloc(1, sizeof(struct context));
    /* initialise context */
    self->pkt = pkt;
    self->rule = rule;
    self->callback = callback;
    classify_packet_iteration_begin(self);
}
void classify_packet_iteration_begin(struct context *self)
{
    if (self->rule && self->hop_count++ < MAX_HOP_COUNT)
    {
        request_extract(self, &classify_packet_iteration_interpret);
    }
    else
    {
        self->result = DELETE_PACKET;
        classify_packet_iteration_end(self);
    }
}
void classify_packet_iteration_interpret(struct context *self)
{
    self->result = interpret(self->field, self->rule);
    self->rule = self->rule->next;
    classify_packet_iteration_end(self);
}
void classify_packet_iteration_end(struct context *self)
{
    if (self->result == ITERATE)
    {
        classify_packet_iteration_begin(self);
    }
    else
    {
        /* free self; call callback, with result */
        int result = self->result;
        void (*callback)(int) = self->callback;
        free(self);
        callback(result);
    }
}
The fragmentation of the function aspect is clearly visible in the resulting code. In addition to its opaque control flow, the code size is greater: the PolicyEdge C model grew from 18.8 KLOC to 29.2 KLOC when we introduced transactions. (These figures are extracted from our CVS repository, and are not exclusively a result of the refactoring; bug fixes and some other enhancements were made over the same period.) One goal of future work will be to minimize the overhead incurred by the introduction of explicit transactions.
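The stack-unwinding behavior described above can be demonstrated with a toy scheduler. The following is an illustrative sketch, not the PolicyEdge code: sched_defer, sched_run, and the simplified context structure are assumptions introduced for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEFERRED 8

struct context;
typedef void (*step_fn)(struct context *);

struct context {
    int value;      /* stands in for the packet/rule state */
    int result;
    step_fn next;   /* continuation registered with the scheduler */
};

/* toy single-threaded scheduler: a stack of deferred continuations */
static struct context *deferred[MAX_DEFERRED];
static size_t n_deferred;

static void sched_defer(struct context *self, step_fn next)
{
    self->next = next;
    deferred[n_deferred++] = self;
}

static void sched_run(void)
{
    while (n_deferred > 0) {
        struct context *self = deferred[--n_deferred];
        self->next(self);
    }
}

/* The blocking leaf: instead of waiting for data, it registers the
   next step with the scheduler and returns, unwinding the stack. */
static void request_extract(struct context *self, step_fn next)
{
    sched_defer(self, next);
}

static void step_interpret(struct context *self)
{
    self->result = self->value * 2;   /* stand-in for interpret() */
}

void request_classify(struct context *self)
{
    request_extract(self, step_interpret);
    /* control returns here immediately; step_interpret has NOT yet run */
}
```

The key observation is that after request_classify returns, the interpret step has still not executed; it runs only when the scheduler dispatches the deferred continuation.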
During a simulation, the scheduler interleaves two input sources: commands from a test-stimuli file, and synch messages from the synch file. Both sources are file streams, read using blocking I/O. The rule that correctly interleaves the two sources is to read the synch file until the synchronized transaction does not yet exist, and then to read a record from the test-stimuli file. If the stimulus creates the required transaction, then we synchronize it (i.e. call the callback) and resume reading the synch file. If the stimulus does not create the transaction, then we continue reading (and buffering) more stimuli until the required transaction is created. If the end of file is reached, or if the buffer space is exhausted, then the test fails.
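This interleaving rule can be sketched in a few lines. In this hypothetical reduction, transactions are plain integer identifiers, the two file streams are arrays, and the pending-transaction set plays the role of the stimulus buffer; none of these names come from the actual model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 16

/* transactions created by stimuli but not yet synchronized */
static int pending[MAX_PENDING];
static size_t n_pending;

static bool pending_remove(int id)
{
    for (size_t i = 0; i < n_pending; i++) {
        if (pending[i] == id) {
            pending[i] = pending[--n_pending];
            return true;
        }
    }
    return false;
}

/* Interleave the two streams: consume the synch stream while the
   required transaction already exists; otherwise read stimuli
   (buffering the transactions they create) until it does.  Returns
   false on stimulus EOF or buffer exhaustion -- the test-failure rule. */
bool run_schedule(const int *synch, size_t n_synch,
                  const int *stim, size_t n_stim)
{
    size_t s = 0;
    n_pending = 0;
    for (size_t i = 0; i < n_synch; i++) {
        while (!pending_remove(synch[i])) {
            if (s == n_stim || n_pending == MAX_PENDING)
                return false;
            pending[n_pending++] = stim[s++];   /* stimulus creates a txn */
        }
        /* here the real model would invoke the transaction's callback */
    }
    return true;
}
```

The real scheduler works on blocking file streams rather than arrays, but the control structure is the same: the synch stream drives, and the stimulus stream is pulled only on demand.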
In practice, a C model that requires a synch file (extracted from an RTL simulation) is too specialized. The verification group is only one user of the simulator. To enable other users (for example, our customers) to run the model, we created a mode in which the same set of synchronization points is scheduled by priority.
Although the precise timing is flexible, the freedom is not usually absolute. There may be fairness constraints on arbiters, and there may be performance requirements that constrain the number of cycles permitted for a transaction. These constraints are orthogonal to synchronization: both can be checked by assertions within the testbench, independently of the model. Being a property of sequence rather than of absolute time, fairness can alternatively be checked by the model. Whichever approach is used, these issues can be ignored for synchronization.
The need to run tests with complex interactions was urgent, so we did not have time to redesign the testbench. Fortunately, the existing testbench output a log of activity within the RTL. The simplest way of extracting synchronization information was to filter this log file through a Perl script. As we made progress synchronizing the transactions, a number of common themes emerged.
Even when delays are balanced, the interaction ordering may still need adjustment. When two threads access a resource in the same cycle, we need to know which has priority. We frequently found that adjusting a priority for one resource would break a priority relationship with another. In most cases, we could juggle the priorities to form a consistent group. Occasionally, we were forced to create a new synchronization point, which increased fragmentation of the model.
As a result of these adjustments, we created a new log file, stripped of unused lines and sorted by the adjusted timestamps and priorities.
Instead of defining the sizes of queues in the model, we implemented a dual-natured synch point. When the pending transaction is "add element to queue", the corresponding synch points become "accept" and "drop". Although we defined minimum-size restrictions (e.g., it is an error to drop an element when adding to an empty queue), we tended not to restrict the upper bound. This decoupling provides another dimension for designers to modify the implementation without affecting the model.
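A minimal sketch of such a dual-natured synch point follows. The names (model_queue, queue_add, SYNCH_ACCEPT/SYNCH_DROP) are illustrative, and the fixed-capacity array is a bound of the model's own storage, not a claim about the RTL queue's size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum synch { SYNCH_ACCEPT, SYNCH_DROP };

#define QUEUE_CAP 64   /* model storage bound only; not the RTL's size */

struct model_queue {
    int elems[QUEUE_CAP];
    size_t depth;
};

/* Resolve an "add element" transaction against the synch point read
   from the RTL.  The model imposes no upper bound of its own: the RTL
   decides, through the synch stream, whether the element fits.
   Returns false when a minimum-size restriction is violated. */
bool queue_add(struct model_queue *q, int elem, enum synch point)
{
    if (point == SYNCH_DROP)
        return q->depth > 0;   /* dropping into an empty queue is an error */
    if (q->depth == QUEUE_CAP)
        return false;          /* model buffer exhausted (not an RTL check) */
    q->elems[q->depth++] = elem;   /* SYNCH_ACCEPT */
    return true;
}
```

Because the accept/drop decision arrives from the synch stream, designers can resize the hardware queue without touching the model.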
Our testbench language, Perl, enabled this transaction tracking. GreenLight's Pivot (6) is a Perl module that provides an abstraction over the Verilog PLI, similar to that of Cadence's TestBuilder (7, 8) for C++. Using Pivot, we represent transactions as Perl objects, and the pipelines as a sequence of Perl arrays. An object moves through the system by being shifted from one array and pushed onto the next. The use of arrays (rather than scalar variables) introduces slack into the transaction tracking: the different arrays represent significant points in the transaction path, not individual pipeline stages.
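The shift/push tracking idiom translates directly into any language. Below is a hedged C rendering of the idea (the Perl original uses per-point arrays of objects); point names, capacities, and function names are all invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IN_FLIGHT 8
#define N_POINTS 3      /* e.g. ingress, classify, egress */

/* Each point holds a FIFO of transaction ids currently between it and
   the next significant point -- the slack mentioned in the text. */
static int point[N_POINTS][MAX_IN_FLIGHT];
static size_t depth[N_POINTS];

void push_txn(int p, int id)   /* transaction reaches point p */
{
    point[p][depth[p]++] = id;
}

static int shift_txn(int p)    /* oldest transaction leaves point p */
{
    int id = point[p][0];
    depth[p]--;
    for (size_t i = 0; i < depth[p]; i++)
        point[p][i] = point[p][i + 1];
    return id;
}

/* Advance: the oldest transaction at point p moves to point p + 1.
   FIFO order enforces in-order progress between significant points,
   while allowing several transactions in flight between them. */
int advance(int p)
{
    int id = shift_txn(p);
    push_txn(p + 1, id);
    return id;
}
```

Because several ids can sit in one FIFO at once, the monitor does not need to know the exact pipeline stage a transaction occupies, only which significant point it has passed.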
The solution was to rearrange the order of messages. It was not possible to move the access to memory (because that would break synchronization against write accesses), but it was often possible to move the start point of the enclosing transaction. This approach can be likened to creative accounting, and it did occasionally backfire, resulting in obscure false fails during random testing. When such problems arose, we were forced to add the correct (implementation-specific) transactions to the model; this was usually less painful than it had initially appeared.
However, other forces were not limited to timing effects. As an example, consider a background process that walks through the memory, checking and updating its contents. It may take many milliseconds to walk the entire address range, each access being triggered by a timer tick. If only a small number of addresses are actually used by a test, then we can stress the design by limiting the background process to touch only those addresses that are in use. But when the addresses being touched are forced away from their natural progression, the C model will indicate a functional error (wrong address ticked). We needed a way to propagate the forced addresses into the model.
Our solution was a consequence of the history of the model. It had begun life as a functional simulator, and had only later been converted to transactions. A deterministic functional model cannot have background processes, so the authors had instead chosen to include a "do tick" command in its control language; users were required to issue this command periodically.
We created a script that interleaved "do tick" commands with the original test script to create a new test script. We then used this new test as the control stimuli for the C model. Thus the addresses became a primary input to the simulator, but the timing remained under the control of the synch file. In a second project, we attempted to include the (functional) address information within the synch-file input; this approach worked, but was not significantly simpler. However, we continue to explore the potential of functional information within the synchronization stream.
The reality is somewhat different. The huge number of issues reported as synchronization failures is a testament to the power of the technique: minor deviations from expected behavior are quickly detected. Many of these deviations would be missed under a cycle-accurate modeling approach, because cycle-accurate modelers almost always end up looking at the RTL to discover the exact cycle behavior.
Many of the synchronization errors were in the testbench or the model. We often needed to adjust the pipeline delay, or priority, of a monitor in the testbench. One fact that I have not previously emphasized is that the extraction of synchronization messages occurs as post-processing of the simulation log file. If we believe that a problem is a testbench/model issue, then we do not need to rerun the RTL simulation: we can manually reorder lines in the log file. If this fixes the problem, we adjust the priorities/delays accordingly. The ability to use the transaction log in this replay mode is one reason for the efficiency of our verification methodology.
This was a mistake. In constructing the transactions early, we did not have the benefit of seeing the RTL's reification. Although we made intelligent guesses, we tended to fragment the transactions more than was required. The resulting model had functional errors that would have been easy to debug in a functional environment (where it is possible to single-step through a loop), but which became obscured by the fragmented transactions. In hindsight, we should have ignored the perceived need for transactions and followed the rule of YAGNI (4). Conversion to transactions had been demonstrated to be trivial (via refactoring), so we should not have pre-empted the process.
More successful was our foresight in the testbench. Knowing that we would need to track the transactions, we were careful to create monitors on the appropriate interfaces, and to include transaction identifiers that could be used to synchronize the model. As a result of this methodology, the script that extracted the synchronization information was significantly simpler than its predecessor.
There is a perception that multi-threaded applications are more difficult to create than single-threaded ones, due to the need to synchronize access to shared variables. Thus using a threading library such as pthreads (10) could create more problems than it solves. We desire the benefits of multiple stacks without the problems of multiple threads (12). An alternative is the QuickThreads library (9), used at the heart of SystemC (11). Although neither pthreads nor SystemC provides a file-synch capability, we have implemented it as a wrapper. This wrapper uses semaphores to constrain thread execution, so it is somewhat inefficient.
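A semaphore wrapper of this kind can be sketched with POSIX primitives. This is a hypothetical reconstruction, not our actual wrapper: each worker thread waits on its own semaphore, and a scheduler releases exactly one thread per synch record, waiting for it to yield back before releasing the next, so only one thread ever runs at a time.

```c
#include <pthread.h>
#include <semaphore.h>
#include <assert.h>
#include <stddef.h>

#define N_THREADS 2
#define N_STEPS   4

static sem_t go[N_THREADS];   /* scheduler -> thread: you may run one step */
static sem_t done;            /* thread -> scheduler: step complete */

static int trace[N_STEPS];    /* order in which steps actually ran */
static int n_trace;

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    for (int step = 0; step < N_STEPS / N_THREADS; step++) {
        sem_wait(&go[id]);        /* block until the synch stream picks us */
        trace[n_trace++] = id;    /* safe: only one thread runs at a time */
        sem_post(&done);          /* yield back to the scheduler */
    }
    return NULL;
}

/* Replay a synch stream: each entry names the thread allowed to run
   one step.  This serializes execution, which is why the wrapper is
   somewhat inefficient. */
void run_synch(const int *synch, int n)
{
    pthread_t tid[N_THREADS];
    sem_init(&done, 0, 0);
    for (long i = 0; i < N_THREADS; i++) {
        sem_init(&go[i], 0, 0);
        pthread_create(&tid[i], NULL, worker, (void *)i);
    }
    for (int i = 0; i < n; i++) {
        sem_post(&go[synch[i]]);  /* release exactly one thread ... */
        sem_wait(&done);          /* ... and wait for it to yield back */
    }
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(tid[i], NULL);
}
```

The semaphore handshake makes the interleaving deterministic, trading away all parallelism: precisely the inefficiency noted above.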
If a cycle-accurate model is not required until late in the project (e.g. not until after tapeout), then the plug-in timing approach becomes very attractive: the model can be made cycle-accurate, after the fact, without needing to modify it. Even when performance modeling requires a cycle model early in the project, an architecture that treats timing as an orthogonal aspect seems advantageous from a software engineering viewpoint.
If a cycle model can be a plug-in, could we create alternative plug-ins to implement other scheduling policies? One that I find interesting is a random synch model, which would choose any legal synch-point at each step. This would enable us to stress the architecture beyond the requirements of a specific micro-architecture.
Early in the project, rapid development of an architectural model is possible by postponing consideration of thread interactions until we have tests that fail for lack of them. I have described a refactoring that converts a functional model into a synchronizable transaction model. To implement this synchronization, we extract transaction commit points from an RTL simulation, currently by parsing its transaction log. A transaction-based testbench ensures that the required information is available.
Verification against the transaction model at every synch-point can be very finicky. The dominant failure mode for tests in the regression became the synchronization error, indicating that the transaction captured from the RTL did not match the transaction modeled in the C simulator. We developed a number of techniques for working with this failure mode, ultimately allowing significant debug without needing to rerun the simulation (or even look at waveforms). Using these techniques, verification based on comparison against the RTL-synchronized transaction model becomes more efficient than using either a timed or an untimed functional model.