When I joined Fast-Chip in November 2000, the team had created a functionally accurate C model. There was no customer pressure for a cycle-accurate variant. The verification group, however, was troubled. The PolicyEdge network services processor achieves its performance through parallelism, and the precise timing of state changes determines the behavior of interacting threads. Although the model had enabled verification of isolated features, it did not seem possible to use it as a reference for random testing. Creating a new cycle-accurate model would have imposed an undesirable schedule slip.
We decided to use an alternative approach: to use timing information from the RTL simulation to determine how interactions between threads should be resolved.
A fuzzy compare is adequate when we can partition the outputs of the simulations into streams of in-order transactions. It breaks down, however, when no such streams exist, or when interactions between streams have functional effects. In these cases, it is necessary to introduce timing information into the reference model. We modified our flow to achieve this without creating a cycle-accurate model.
This new flow first runs a test on the RTL. If the RTL simulation completes successfully (that is, it completes with no assertion/checker errors), then we process its output to extract a flow of synchronization data. The test is then run on the C model (modified to be transaction accurate), with the synchronization data as an additional input. This input controls the scheduler, ensuring that the order of interactions between threads mirrors that of the equivalent interactions in the RTL. Because of this synchronization, the fuzzy compare becomes trivial with respect to timing differences.
Two approaches suggest themselves for modeling transaction interaction. The first is to introduce multiple threads; the second is to refactor the code to use callbacks from a scheduler. For reasons of portability and risk management, we chose to restrict ourselves to single-threaded solutions. Multi-threaded code is notorious for bugs caused by interaction between threads. Furthermore, we would have needed to suppress the non-deterministic aspects of thread interaction. The decisive factor, however, was a requirement to run the C model on embedded platforms.
The single-threaded approach has its own risks. The execution flow of a functional model is expressed explicitly in the code; the introduction of a scheduler abstracts this flow. Simulating multiple threads with a homegrown scheduler forces us to abandon language-level control flow (e.g. while loops), and curtails our use of the stack as the execution context [cf. (12)]. Fortunately, the transactions of the PolicyEdge are sufficiently simple that the single-threaded approach remains viable.
int classify_packet(Packet *pkt, Rule *rule)
{
    int hop_count = 0;
    while (rule && hop_count++ < MAX_HOP_COUNT)
    {
        int field = extract(pkt, rule);     /* may block, waiting for input */
        int result = interpret(field, rule);
        if (result != ITERATE)
        {
            return result;
        }
        rule = rule->next;
    }
    return DELETE_PACKET;
}
This code classifies a packet according to a list of rules. The extract function may need to wait for the requested part of the packet to be input to the system. If multiple instances of this function are to interact (while extract is waiting), then we must transform the code according to our set of steps. Space does not permit me to show the intermediate steps, but their footprints are clearly visible in the resulting code. First, the context structure of step 1:
struct context
{
    Packet *pkt;
    Rule   *rule;
    int     hop_count;
    int     field;
    int     result;
    void  (*callback)(int result);
};
The final field in the structure is a callback, used when the classification is complete. Even though extract is the only part that will block, the need for callbacks ripples up to the root of the transaction. After replacing the loop with recursion, the resulting code is a set of fragments that implements the transaction. The recursive nature of the classify_packet iteration remains, broken only if request_extract uses the scheduler (and therefore returns, unwinding the stack, before actually calling the interpret step):
void request_classify_packet(Packet *pkt, Rule *rule, void (*callback)(int))
{
    struct context *self = calloc(1, sizeof(struct context));
    /* initialise context */
    self->pkt = pkt;
    self->rule = rule;
    self->callback = callback;
    classify_packet_iteration_begin(self);
}
void classify_packet_iteration_begin(struct context *self)
{
    if (self->rule && self->hop_count++ < MAX_HOP_COUNT)
    {
        request_extract(self, &classify_packet_iteration_interpret);
    }
    else
    {
        self->result = DELETE_PACKET;
        classify_packet_iteration_end(self);
    }
}
void classify_packet_iteration_interpret(struct context *self)
{
    self->result = interpret(self->field, self->rule);
    self->rule = self->rule->next;
    classify_packet_iteration_end(self);
}
void classify_packet_iteration_end(struct context *self)
{
    if (self->result == ITERATE)
    {
        classify_packet_iteration_begin(self);
    }
    else
    {
        /* free self; call callback, with result */
        int result = self->result;
        void (*callback)(int) = self->callback;
        free(self);
        callback(result);
    }
}
The fragmentation of the function aspect is clearly visible in the resulting code. In addition to its opaque control flow, the code size is greater: the PolicyEdge C model grew from 18.8 KLOC to 29.2 KLOC when we introduced transactions. (These figures are extracted from our CVS repository, and are not exclusively a result of the refactoring; bug fixes and some other enhancements were made over the same period.) One goal of future work will be to minimize the overhead incurred by the introduction of explicit transactions.
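The stack-unwinding behavior described above can be demonstrated with a toy scheduler. The following is an illustrative sketch, not the PolicyEdge code: sched_defer, sched_run, and the simplified context structure are assumptions introduced for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEFERRED 8

struct context;
typedef void (*step_fn)(struct context *);

struct context {
    int value;      /* stands in for the packet/rule state */
    int result;
    step_fn next;   /* continuation registered with the scheduler */
};

/* toy single-threaded scheduler: a stack of deferred continuations */
static struct context *deferred[MAX_DEFERRED];
static size_t n_deferred;

static void sched_defer(struct context *self, step_fn next)
{
    self->next = next;
    deferred[n_deferred++] = self;
}

static void sched_run(void)
{
    while (n_deferred > 0) {
        struct context *self = deferred[--n_deferred];
        self->next(self);
    }
}

/* The blocking leaf: instead of waiting for data, it registers the
   next step with the scheduler and returns, unwinding the stack. */
static void request_extract(struct context *self, step_fn next)
{
    sched_defer(self, next);
}

static void step_interpret(struct context *self)
{
    self->result = self->value * 2;   /* stand-in for interpret() */
}

void request_classify(struct context *self)
{
    request_extract(self, step_interpret);
    /* control returns here immediately; step_interpret has NOT yet run */
}
```

The key observation is that after request_classify returns, the interpret step has still not executed; it runs only when the scheduler dispatches the deferred continuation.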
During a simulation, the scheduler interleaves two input sources: commands from a test-stimuli file, and synch messages from the synch file. Both sources are file streams, read using blocking I/O. The rule that correctly interleaves the two sources is to read the synch file until the synchronized transaction does not yet exist, and then to read a record from the test-stimuli file. If the stimulus creates the required transaction, then we synchronize it (i.e. call the callback) and resume reading the synch file. If the stimulus does not create the transaction, then we continue reading (and buffering) more stimuli until the required transaction is created. If the end of file is reached, or if the buffer space is exhausted, then the test fails.
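This interleaving rule can be sketched in a few lines. In this hypothetical reduction, transactions are plain integer identifiers, the two file streams are arrays, and the pending-transaction set plays the role of the stimulus buffer; none of these names come from the actual model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 16

/* transactions created by stimuli but not yet synchronized */
static int pending[MAX_PENDING];
static size_t n_pending;

static bool pending_remove(int id)
{
    for (size_t i = 0; i < n_pending; i++) {
        if (pending[i] == id) {
            pending[i] = pending[--n_pending];
            return true;
        }
    }
    return false;
}

/* Interleave the two streams: consume the synch stream while the
   required transaction already exists; otherwise read stimuli
   (buffering the transactions they create) until it does.  Returns
   false on stimulus EOF or buffer exhaustion -- the test-failure rule. */
bool run_schedule(const int *synch, size_t n_synch,
                  const int *stim, size_t n_stim)
{
    size_t s = 0;
    n_pending = 0;
    for (size_t i = 0; i < n_synch; i++) {
        while (!pending_remove(synch[i])) {
            if (s == n_stim || n_pending == MAX_PENDING)
                return false;
            pending[n_pending++] = stim[s++];   /* stimulus creates a txn */
        }
        /* here the real model would invoke the transaction's callback */
    }
    return true;
}
```

The real scheduler works on blocking file streams rather than arrays, but the control structure is the same: the synch stream drives, and the stimulus stream is pulled only on demand.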
In practice, a C model that requires a synch file (extracted from an RTL simulation) is too specialized. The verification group is only one user of the simulator. To enable other users (for example, our customers) to run the model, we created a mode in which the same set of synchronization points is scheduled by priority.
Although the precise timing is flexible, the freedom is not usually absolute. There may be fairness constraints on arbiters, and there may be performance requirements that constrain the number of cycles permitted for a transaction. These constraints are orthogonal to synchronization: both can be checked by assertions within the testbench, independently of the model. Being a property of sequence rather than of absolute time, fairness can alternatively be checked by the model. Whichever approach is used, these issues can be ignored for synchronization.
The need to run tests with complex interactions was urgent, so we did not have time to redesign the testbench. Fortunately, the existing testbench output a log of activity within the RTL. The simplest way of extracting synchronization information was to filter this log file through a Perl script. As we made progress synchronizing the transactions, a number of common themes emerged.
Even when delays are balanced, the interaction ordering may still need adjustment. When two threads access a resource in the same cycle, we need to know which has priority. We frequently found that adjusting a priority for one resource would break a priority relationship with another. In most cases, we could juggle the priorities to form a consistent group. Occasionally, we were forced to create a new synchronization point, which increased fragmentation of the model.
As a result of these adjustments, we created a new log file, stripped of unused lines and sorted by the adjusted timestamps and priorities.
Instead of defining the sizes of queues in the model, we implemented a dual-natured synch point. When the pending transaction is "add element to queue", the corresponding synch points become "accept" and "drop". Although we defined minimum-size restrictions (e.g., it is an error to drop an element when adding to an empty queue), we tended not to restrict the upper bound. This decoupling provides another dimension for designers to modify the implementation without affecting the model.
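A minimal sketch of such a dual-natured synch point follows. The names (model_queue, queue_add, SYNCH_ACCEPT/SYNCH_DROP) are illustrative, and the fixed-capacity array is a bound of the model's own storage, not a claim about the RTL queue's size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum synch { SYNCH_ACCEPT, SYNCH_DROP };

#define QUEUE_CAP 64   /* model storage bound only; not the RTL's size */

struct model_queue {
    int elems[QUEUE_CAP];
    size_t depth;
};

/* Resolve an "add element" transaction against the synch point read
   from the RTL.  The model imposes no upper bound of its own: the RTL
   decides, through the synch stream, whether the element fits.
   Returns false when a minimum-size restriction is violated. */
bool queue_add(struct model_queue *q, int elem, enum synch point)
{
    if (point == SYNCH_DROP)
        return q->depth > 0;   /* dropping into an empty queue is an error */
    if (q->depth == QUEUE_CAP)
        return false;          /* model buffer exhausted (not an RTL check) */
    q->elems[q->depth++] = elem;   /* SYNCH_ACCEPT */
    return true;
}
```

Because the accept/drop decision arrives from the synch stream, designers can resize the hardware queue without touching the model.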
Our testbench language, Perl, enabled this transaction tracking. GreenLight's Pivot (6) is a Perl module that provides an abstraction over the Verilog PLI, similar to that of Cadence's TestBuilder (7, 8) for C++. Using Pivot, we represent transactions as Perl objects, and the pipelines as a sequence of Perl arrays. An object moves through the system by being shifted from one array and pushed onto the next. The use of arrays (rather than scalar variables) introduces slack into the transaction tracking: the different arrays represent significant points in the transaction path, not individual pipeline stages.
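The shift/push tracking idiom translates directly into any language. Below is a hedged C rendering of the idea (the Perl original uses per-point arrays of objects); point names, capacities, and function names are all invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IN_FLIGHT 8
#define N_POINTS 3      /* e.g. ingress, classify, egress */

/* Each point holds a FIFO of transaction ids currently between it and
   the next significant point -- the slack mentioned in the text. */
static int point[N_POINTS][MAX_IN_FLIGHT];
static size_t depth[N_POINTS];

void push_txn(int p, int id)   /* transaction reaches point p */
{
    point[p][depth[p]++] = id;
}

static int shift_txn(int p)    /* oldest transaction leaves point p */
{
    int id = point[p][0];
    depth[p]--;
    for (size_t i = 0; i < depth[p]; i++)
        point[p][i] = point[p][i + 1];
    return id;
}

/* Advance: the oldest transaction at point p moves to point p + 1.
   FIFO order enforces in-order progress between significant points,
   while allowing several transactions in flight between them. */
int advance(int p)
{
    int id = shift_txn(p);
    push_txn(p + 1, id);
    return id;
}
```

Because several ids can sit in one FIFO at once, the monitor does not need to know the exact pipeline stage a transaction occupies, only which significant point it has passed.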
The solution was to rearrange the order of messages. It was not possible to move the access to memory (because that would break synchronization against write accesses), but it was often possible to move the start point of the enclosing transaction. This approach can be likened to creative accounting, and it did occasionally backfire, resulting in obscure false fails during random testing. When such problems arose, we were forced to add the correct (implementation-specific) transactions to the model; this was usually less painful than it had initially appeared.
However, other forces were not limited to timing effects. As an example, consider a background process that walks through the memory, checking and updating its contents. It may take many milliseconds to walk the entire address range, each access being triggered by a timer tick. If only a small number of addresses are actually used by a test, then we can stress the design by limiting the background process to touch only those addresses that are in use. But when the addresses being touched are forced away from their natural progression, the C model will indicate a functional error (wrong address ticked). We needed a way to propagate the forced addresses into the model.
Our solution was a consequence of the history of the model. It had begun life as a functional simulator, and had only later been converted to transactions. A deterministic functional model cannot have background processes, so the authors had instead chosen to include a "do tick" command in its control language; users were required to issue this command periodically.
We created a script that interleaved "do tick" commands with the original test script to create a new test script. We then used this new test as the control stimuli for the C model. Thus the addresses became a primary input to the simulator, but the timing remained under the control of the synch file. In a second project, we attempted to include the (functional) address information within the synch-file input; this approach worked, but was not significantly simpler. However, we continue to explore the potential of functional information within the synchronization stream.
The reality is somewhat different. The huge number of issues reported as synchronization failures is a testament to the power of the technique: minor deviations from expected behavior are quickly detected. Many of these deviations would be missed under a cycle-accurate modeling approach, because cycle-accurate modelers almost always end up looking at the RTL to discover the exact cycle behavior.
Many of the synchronization errors were in the testbench or the model. We often needed to adjust the pipeline delay, or priority, of a monitor in the testbench. One fact that I have not previously emphasized is that the extraction of synchronization messages occurs as post-processing of the simulation log file. If we believe that a problem is a testbench/model issue, then we do not need to rerun the RTL simulation: we can manually reorder lines in the log file. If this fixes the problem, we adjust the priorities/delays accordingly. The ability to use the transaction log in this replay mode is one reason for the efficiency of our verification methodology.
This was a mistake. In constructing the transactions early, we did not have the benefit of seeing the RTL's reification. Although we made intelligent guesses, we tended to fragment the transactions more than was required. The resulting model had functional errors that would have been easy to debug in a functional environment (where it is possible to single-step through a loop), but which became obscured by the fragmented transactions. In hindsight, we should have ignored the perceived need for transactions and followed the rule of YAGNI (4). Conversion to transactions had been demonstrated to be trivial (via refactoring), so we should not have pre-empted the process.
More successful was our foresight in the testbench. Knowing that we would need to track the transactions, we were careful to create monitors on the appropriate interfaces, and to include transaction identifiers that could be used to synchronize the model. As a result of this methodology, the script that extracted the synchronization information was significantly simpler than its predecessor.
There is a perception that multi-threaded applications are more difficult to create than single-threaded ones, due to the need to synchronize access to shared variables. Thus using a threading library such as pthreads (10) could create more problems than it solves. We desire the benefits of multiple stacks without the problems of multiple threads (12). An alternative is the QuickThreads library (9), used at the heart of SystemC (11). Although neither pthreads nor SystemC provides a file-synch capability, we have implemented it as a wrapper. This wrapper uses semaphores to constrain thread execution, so it is somewhat inefficient.
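A semaphore wrapper of this kind can be sketched with POSIX primitives. This is a hypothetical reconstruction, not our actual wrapper: each worker thread waits on its own semaphore, and a scheduler releases exactly one thread per synch record, waiting for it to yield back before releasing the next, so only one thread ever runs at a time.

```c
#include <pthread.h>
#include <semaphore.h>
#include <assert.h>
#include <stddef.h>

#define N_THREADS 2
#define N_STEPS   4

static sem_t go[N_THREADS];   /* scheduler -> thread: you may run one step */
static sem_t done;            /* thread -> scheduler: step complete */

static int trace[N_STEPS];    /* order in which steps actually ran */
static int n_trace;

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    for (int step = 0; step < N_STEPS / N_THREADS; step++) {
        sem_wait(&go[id]);        /* block until the synch stream picks us */
        trace[n_trace++] = id;    /* safe: only one thread runs at a time */
        sem_post(&done);          /* yield back to the scheduler */
    }
    return NULL;
}

/* Replay a synch stream: each entry names the thread allowed to run
   one step.  This serializes execution, which is why the wrapper is
   somewhat inefficient. */
void run_synch(const int *synch, int n)
{
    pthread_t tid[N_THREADS];
    sem_init(&done, 0, 0);
    for (long i = 0; i < N_THREADS; i++) {
        sem_init(&go[i], 0, 0);
        pthread_create(&tid[i], NULL, worker, (void *)i);
    }
    for (int i = 0; i < n; i++) {
        sem_post(&go[synch[i]]);  /* release exactly one thread ... */
        sem_wait(&done);          /* ... and wait for it to yield back */
    }
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(tid[i], NULL);
}
```

The semaphore handshake makes the interleaving deterministic, trading away all parallelism: precisely the inefficiency noted above.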
If a cycle-accurate model is not required until late in the project (e.g. not until after tapeout), then the plug-in timing approach becomes very attractive: the model can be made cycle-accurate, after the fact, without needing to modify it. Even when performance modeling requires a cycle model early in the project, an architecture that treats timing as an orthogonal aspect seems advantageous from a software engineering viewpoint.
If a cycle model can be a plug-in, could we create alternative plug-ins to implement other scheduling policies? One that I find interesting is a random synch model, which would choose any legal synch-point at each step. This would enable us to stress the architecture beyond the requirements of a specific micro-architecture.
Early in the project, rapid development of an architectural model is possible by postponing consideration of thread interactions until we have tests that fail for lack of them. I have described a refactoring that converts a functional model into a synchronizable transaction model. To implement this synchronization, we extract transaction commit points from an RTL simulation, currently by parsing its transaction log. A transaction-based testbench ensures that the required information is available.
Verification against the transaction model at every synch-point can be very finicky. The dominant failure mode for tests in the regression became the synchronization error, indicating that the transaction captured from the RTL did not match the transaction modeled in the C simulator. We developed a number of techniques for working with this failure mode, ultimately allowing significant debug without needing to rerun the simulation (or even look at waveforms). Using these techniques, verification based on comparison against the RTL-synchronized transaction model becomes more efficient than using either a timed or an untimed functional model.