Newbie question - differing simulation results...

Gareth Owen

Hi,

I'm new to Verilog, coming from a software background. Here's the problem I'm trying to solve:

I have two synchronous clocks, one running at 100 MHz and one at 50 MHz.
I have a source of 8-bit values, generated on the 100 MHz clock.

I want to produce a stream of 16-bit values synced with the 50 MHz clock,
each containing two sequential 8-bit values. So I wrote some naive code and simulated it.

Under cver (GPL version) it does sort of what I expect.
output: XXFF 0302 0504 0706 0807

Under Icarus Verilog it really doesn't:
output: 01XX 0101 0303 0505

Can anyone
a) point out what I'm misunderstanding about this.
b) give me some pointers about how much of the behaviour of "simultaneous events" and synchronous clocks I can rely on w.r.t. synthesis (Xilinx Vivado) and physical systems with clock jitter etc.

// Cut here. Hopefully google news won't mess this up too much...

`timescale 1ns / 100ps

module data_merge(byte_in,
                  byte_clk,

                  word_out,
                  word_clk);
    input [7:0] byte_in;
    input byte_clk,word_clk;

    output [15:0] word_out;
    reg [15:0] word_out;

    reg [7:0] MSB;
    reg [7:0] LSB;

    reg next_is_MSB;

    initial
        next_is_MSB = 1;

    always @(posedge byte_clk)
    begin
        if(next_is_MSB)
            MSB = byte_in;
        else
            LSB = byte_in;

        next_is_MSB = ~next_is_MSB;
    end

    always @(posedge word_clk)
    begin
        word_out[15:8] = MSB;
        word_out[7:0] = LSB;
    end

endmodule


module data_merge_tb();

    reg clk_100;
    reg clk_50;

    reg [7:0] byte;
    wire [15:0] word;

    always
        #5 clk_100 = ~clk_100;

    always
        #10 clk_50 = ~clk_50;

    always @(posedge clk_100)
        byte = byte + 1;

    always
        #10000 $finish;

    data_merge dm(.byte_clk(clk_100),
                  .word_clk(clk_50),

                  .byte_in(byte),
                  .word_out(word)
                  );

    initial
    begin
        // $dumpfile("output.vcd");
        // $dumpvars(0,data_merge_tb);

        byte = 0;
        clk_100 = 1;
        clk_50 = 1;
    end

endmodule
 
Gareth Owen wrote:
[snip original post and test bench code]

First of all, there is a race condition at time zero because you
initialize the clocks to 1, but use them as rising edge. Then
in the module itself you initialize next_is_MSB in an initial
block. So at time zero, the initial block in the TB and the
initial block in the module under test can happen in any order.
This wouldn't normally matter except that if the clocks are
initialized to 1 (as in your TB) there is a posedge detected
(from X to 1) at time zero. In the module, the startup behavior
will change if the clocks are initialized before or after
next_is_MSB. If the clocks are initialized first, then the
first clock edge occurs while next_is_MSB is still X and
thus at time zero all registered outputs will be XX. Then
next_is_MSB gets initialized (still at time zero but after
the clock edge) to 1 so on the next clock edge, the MSB
gets loaded. In the opposite case, next_is_MSB would be loaded
to 1 first, and then still at time zero the clock goes high
so the MSB gets loaded immediately at time zero and on the
next clock next_is_MSB will have toggled low and the LSB gets loaded.
Note that depending on how the inputs are initialized, the
registers at time zero could end up with garbage.

The normal work-around for this ugly start-up behavior is to
initialize rising-edge clocks to 0 in the initial block.
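
A minimal sketch of that work-around, assuming the same 100 MHz and 50 MHz periods as the posted test bench (the module name `clk_init_tb` is made up for illustration). One detail worth noting: if both clocks simply start at 0 and keep the original `#5` and `#10` toggles, the 50 MHz rising edges land on 100 MHz falling edges, so the slower clock is given a 5 ns head start here to keep the rising edges aligned.

```verilog
`timescale 1ns / 100ps

// Hypothetical test bench fragment: both clocks start low, so no posedge
// (X -> 1) is seen at time zero while initial blocks are still running.
module clk_init_tb;
    reg clk_100;
    reg clk_50;

    // 100 MHz: low at t = 0, rising edges at 5, 15, 25, ... ns
    initial clk_100 = 1'b0;
    always #5 clk_100 = ~clk_100;

    // 50 MHz: low at t = 0, rising edges at 5, 25, 45, ... ns,
    // i.e. aligned with every other rising edge of clk_100
    initial begin
        clk_50 = 1'b0;
        #5 clk_50 = 1'b1;
        forever #10 clk_50 = ~clk_50;
    end

    initial #200 $finish;
endmodule
```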

Now in real hardware there would be other issues with this
simulation. In a Xilinx FPGA, clock outputs of the same
DCM or PLL will be in phase on their rising edge, which is
how you modelled them in the test bench, however they do
not behave nicely at time zero or in fact until the DCM or
PLL has locked. So again the normal way of dealing with this
is to add reset logic, possibly driven by the DCM or PLL
"locked" signal to keep all registers in a known state until
the clocks are valid.
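
As a rough sketch of what such reset logic might look like (not taken from any Xilinx reference design; the module name `reset_gen` and its port names are assumptions), the reset below is asserted immediately when `locked` is low and released only after four clean clock edges:

```verilog
// Hypothetical reset generator, one instance per clock domain.
module reset_gen (
    input  wire clk,
    input  wire locked,   // assumed: the "locked" output of the DCM or PLL
    output wire rst       // active-high, released synchronously to clk
);
    reg [3:0] sync = 4'b0000;

    always @(posedge clk or negedge locked) begin
        if (!locked)
            sync <= 4'b0000;            // assert reset as soon as lock is lost
        else
            sync <= {sync[2:0], 1'b1};  // shift in ones once the clock is valid
    end

    assign rst = ~sync[3];              // stays high for four edges after lock
endmodule
```

With something like this in place, a flag such as next_is_MSB could be set under `if (rst)` inside its clocked block instead of in a separate initial block.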

--
Gabor
 
On Thursday, January 8, 2015 at 4:19:13 PM UTC, gabor wrote:


Thanks Gabor,

That's extremely useful to me. "Add reset logic" has been on my TODO for a while - I guess that's the next priority.

In hardware, the two clocks are generated from a common oscillator using the ClockWizard DCM. Assuming I eliminate the race conditions at time zero with reset logic, I assume that any propagation delay that might cause the edge phase to vary slightly from zero is dealt with by the synthesis?

Thanks again,
Gareth
 
Gareth Owen wrote:
On Thursday, January 8, 2015 at 4:19:13 PM UTC, gabor wrote:

[snip helpful advice]
--
Gabor

[snip]

Not by "synthesis" as Xilinx uses the term, but assuming you have at a
minimum a PERIOD constraint on the clock, the router will work to fix
hold time errors if there is too much clock-to-clock skew.

One other note, when you use the clock wizard the default is to add
global clock buffers for each DCM output. This generally makes the
clock skew very small, but there can be some issues if the tools
don't pick optimum locations for the DCM and its associated buffers.
Make sure to look at any warnings you get during place and route.

Finally, I would typically avoid using two clocks for code of this
nature, as everything could be accomplished using only the higher
frequency clock. If you have a lot of logic running on the lower
frequency clock and the high frequency clock would cause issues
meeting timing, then I might use your approach. However 100 MHz
is not considered "high" for most modern FPGAs so I don't think it
would be necessary in your case.
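
A minimal sketch of the single-clock alternative, under some assumptions: the module name `data_merge_1clk`, the synchronous reset `rst` and the `word_valid` strobe are all invented for illustration, and the first byte of each pair is assumed to be the low byte.

```verilog
// Hypothetical single-clock packer: everything runs on the 100 MHz clock
// and a one-cycle strobe marks each completed 16-bit word.
module data_merge_1clk (
    input  wire        clk,         // 100 MHz byte clock
    input  wire        rst,         // assumed synchronous, active-high
    input  wire [7:0]  byte_in,
    output reg  [15:0] word_out,
    output reg         word_valid   // high for one clk cycle per word
);
    reg       have_lsb;
    reg [7:0] lsb;

    always @(posedge clk) begin
        if (rst) begin
            have_lsb   <= 1'b0;
            word_valid <= 1'b0;
        end else begin
            word_valid <= 1'b0;
            if (!have_lsb) begin
                lsb      <= byte_in;            // first byte of the pair
                have_lsb <= 1'b1;
            end else begin
                word_out   <= {byte_in, lsb};   // second byte completes the word
                word_valid <= 1'b1;
                have_lsb   <= 1'b0;
            end
        end
    end
endmodule
```

Downstream logic would then use `word_valid` as a clock enable rather than running from a second clock.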

--
Gabor
 
GaborSzakacs <gabor@alacron.com> writes:

Finally, I would typically avoid using two clocks for code of this
nature, as everything could be accomplished using only the higher
frequency clock. If you have a lot of logic running on the lower
frequency clock and the high frequency clock would cause issues
meeting timing, then I might use your approach.

Thanks again for the advice.

The use of the slower clock is motivated not by timing constraints, but
by the application, and the desire to use Xilinx's standard video
processing blocks.

The 100 MHz oscillator is on an external camera sensor, which generates
50 MHz/16 bpp output as two 8-bit values at 100 MHz. So from a video-processing
perspective the 50 MHz clock is the rate of pixel production, and this is
the clock that is needed to drive the standard video processing IP
blocks.

I can't see any other way to use that standard IP - but as I mentioned,
I'm very new at this.

Gareth
 
On 1/10/2015 6:47 AM, Gareth Owen wrote:
[snip]

To get this right, you will need to synchronize the data packing
(MSB vs LSB) to the camera module. Your code as written doesn't
seem to have any way of preventing the bytes from being swapped. Usually
a camera module would have signals like "line valid" and "frame valid"
and you would need to start capturing data (MSB first or LSB first
as defined by the camera) at the assertion of "line valid".
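
A rough sketch of that idea, assuming a `line_valid` input that is synchronous to the byte clock, is low for at least one cycle before each line, and that the camera sends the high byte first (the module name `byte_phase` and all of these assumptions are illustrative, not from any particular sensor's datasheet):

```verilog
// Hypothetical byte-phase tracker: the MSB/LSB flag is re-armed whenever
// line_valid is low, so the first byte of every line lands in MSB.
module byte_phase (
    input  wire       byte_clk,
    input  wire       line_valid,   // assumed: high while pixel bytes are driven
    input  wire [7:0] byte_in,
    output reg  [7:0] MSB,
    output reg  [7:0] LSB
);
    reg next_is_MSB;

    always @(posedge byte_clk) begin
        if (!line_valid) begin
            next_is_MSB <= 1'b1;    // assumed MSB-first; use 1'b0 for LSB-first
        end else begin
            if (next_is_MSB)
                MSB <= byte_in;
            else
                LSB <= byte_in;
            next_is_MSB <= ~next_is_MSB;
        end
    end
endmodule
```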

The other possibility is that the video processing IP may have
a clock enable or "data valid" input that allows you to run it
on every other cycle of the 100 MHz clock. If not, then you
will need two clocks. The tools should take care of the clock
crossing for these related clocks. You will have to ensure that
the 100 MHz clock applied to the I/O registers allows them to
meet setup/hold timing as supplied by the camera module. This
means you need OFFSET IN BEFORE constraints for the 100 MHz clock.
At least that's how they were called in ISE. I'm not using
Vivado yet, because it seems like it's not quite ready for
prime time (check Xilinx forums to see the kinds of bugs and
deficiencies being reported).

--
Gabor
 
On Sunday, January 11, 2015 at 5:04:45 PM UTC, gabor wrote:

To get this right, you will need to synchronize the data packing
(MSB vs LSB) to the camera module. Your code as written doesn't
seem to have any way of preventing the bytes to be swapped. Usually
a camera module would have signals like "line valid" and "frame valid"
and you would need to start to capture data (MSB first or LSB first
as defined by the camera) at the assertion of "line valid".

Yes, that's correct. My real code is more complex than what I posted, but I'm happy that I can get the byte-swap logic correct as long as my intuition on ordering/simultaneity is sound, so I posted a cut-down version so as not to get distracted by those "implementation details."

In reality, it's a state machine whose states map (roughly) to the frame-valid and horizontal- & vertical- blanking periods.

Thanks for your help, I think I'm going to be able to achieve what I want to do now.

Gareth
 
On 1/8/2015 8:18 AM, GaborSzakacs wrote:

The normal work-around for this ugly start-up behavior is to
initialize rising-edge clocks to 0 in the initial block.

If you need the clock to start high (you are using both edges) then the
other thing you can do is use a non-blocking assignment for the clock
initialization. Depending on exactly how the logic gets initialized you
may need to use some delay in the clock non-blocking assignment to
compensate for initialization delays.

The other big issue is you are modeling flip-flops with blocking
assignments instead of non-blocking assignments. Add in a bunch of race
conditions at the clock edges and that is what is causing the simulation
mismatches. You are operating in the non-determinism region of the
simulators and no order is specified.
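
A small stand-alone illustration of the difference between the two assignment types (the module name `nba_demo` and its signals are made up for this example). Both blocks try to swap a pair of registers on the same clock edge; only the non-blocking version behaves like two flip-flops.

```verilog
module nba_demo;
    reg clk;
    reg a, b;   // updated with blocking assignments
    reg c, d;   // updated with non-blocking assignments

    // Blocking: the second statement sees the value just written to a.
    always @(posedge clk) begin
        a = b;
        b = a;
    end

    // Non-blocking: both right-hand sides are read before either update.
    always @(posedge clk) begin
        c <= d;
        d <= c;
    end

    initial begin
        clk = 0;
        a = 0; b = 1;
        c = 0; d = 1;
        #1 clk = 1;                 // single rising edge
        #1 $display("a=%b b=%b c=%b d=%b", a, b, c, d);
        // prints: a=1 b=1 c=1 d=0
        $finish;
    end
endmodule
```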

This is a common issue with programmers who have not done much parallel
programming (and many new to Verilog). You need to remember most of
these statements get translated into hardware that runs in parallel not
serially like you wrote it in the Verilog source. For me, I think about
how things get translated (basic synthesis in my head) and then assume
there is some delay along with slight mismatches in the clock edges. I
then think about what I need to do to design a system that works under
those constraints. Keeping this all in mind leads to robust designs.

For example:

always @(posedge byte_clk)
begin
if(next_is_MSB)
MSB = byte_in;
else
LSB = byte_in;

next_is_MSB = ~next_is_MSB;
end

This code looks like three flip-flops and I'll ignore that next_is_MSB
is not initialized correctly for synthesis. You are changing the state
of next_is_MSB at the same time you are using it to decide if this byte
is the MSB or not. Now think about the race created by this dependency
with real hardware. How would you fix that? The code as written works
because you are using blocking assignments, but that is not how you
model flip-flops. You have a similar issue when sampling the MSB and LSB
with the word clock:

always @(posedge word_clk)
begin
word_out[15:8] = MSB;
word_out[7:0] = LSB;
end

The two clocks are aligned so MSB and LSB are changing at the same time
they are being sampled by the word clock and depending on the simulation
order you can get different results. In hardware you have setup/hold,
propagation delay and the skew between the clocks, clock jitter, clock
tree delay, etc. that have to be dealt with. It may be better to sample
the bytes on the negative edge so that you have significant setup/hold
time and all the other issues can just be ignored because they in
aggregate are significantly smaller than the clock period. If you don't
have negative edge flip-flops then the synthesis tool needs to work a
bit harder/be smarter to make sure the new MSB/LSB signals arrive after
these two flip-flops are activated (it usually just adds buffers). Using
a negative edge would alleviate the need for buffers so could be lower
power and area.
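
One possible reading of the negative-edge suggestion, as a sketch only. With the clocks aligned as in the posted test bench, the falling edge of the 50 MHz clock coincides with a rising edge of the 100 MHz clock, so this version assumes it is the byte capture that moves to the falling edge of byte_clk. The module name `data_merge_negedge` is invented, and the initial block is for simulation only.

```verilog
// Hypothetical variant: MSB/LSB change on the falling edge of byte_clk,
// half a byte period away from every rising edge of word_clk.
module data_merge_negedge (
    input  wire        byte_clk,
    input  wire        word_clk,
    input  wire [7:0]  byte_in,
    output reg  [15:0] word_out
);
    reg [7:0] MSB;
    reg [7:0] LSB;
    reg       next_is_MSB;

    initial
        next_is_MSB = 1'b0;         // simulation only; use a reset in hardware

    always @(negedge byte_clk) begin
        if (next_is_MSB)
            MSB <= byte_in;
        else
            LSB <= byte_in;
        next_is_MSB <= ~next_is_MSB;
    end

    // MSB and LSB have been stable for half a byte period at this edge.
    always @(posedge word_clk)
        word_out <= {MSB, LSB};
endmodule
```

Whether byte_in itself is stable at the falling edge depends on how the source drives it, which would still need to be checked against the camera's timing.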

After fixing things up a bit and using Icarus Verilog I get the
following which I assume is the expected output:

XXXX 0100 0302 0504 0706 ...

Here is my changed code. It updates the initial clock assignments to use
non-blocking assignments and the flip-flops to use non-blocking assignments.
The initial polarity of next_is_MSB also needs to be changed because with
non-blocking assignments you are always using the previous value, not the
new value. The fact that next_is_MSB is not initialized in the same
block as the update would prevent my tool, and I believe most other tools, from
synthesizing the data_merge block.

`timescale 1ns / 100ps

module data_merge(byte_in,
                  byte_clk,

                  word_out,
                  word_clk);
    input [7:0] byte_in;
    input byte_clk,word_clk;

    output [15:0] word_out;
    reg [15:0] word_out;

    reg [7:0] MSB;
    reg [7:0] LSB;

    reg next_is_MSB;

    initial
        next_is_MSB = 0;

    always @(posedge byte_clk)
    begin
        if(next_is_MSB)
            MSB <= byte_in;
        else
            LSB <= byte_in;

        next_is_MSB <= ~next_is_MSB;
    end

    always @(posedge word_clk)
    begin
        word_out[15:8] <= MSB;
        word_out[7:0] <= LSB;
    end

endmodule


module data_merge_tb();

    reg clk_100;
    reg clk_50;

    reg [7:0] byte;
    wire [15:0] word;

    always
        #5 clk_100 = ~clk_100;

    always
        #10 clk_50 = ~clk_50;

    always @(posedge clk_100)
        byte <= byte + 1;

    always
        #10000 $finish;

    data_merge dm(.byte_clk(clk_100),
                  .word_clk(clk_50),

                  .byte_in(byte),
                  .word_out(word)
                  );

    initial
    begin
        // $dumpfile("output.vcd");
        // $dumpvars(0,data_merge_tb);

        byte = 0;
        clk_100 <= 1;
        clk_50 <= 1;
    end

endmodule

As a final comment: in general, when you are an inexperienced Verilog
user and you find different simulators giving different results, it is
usually because of timing issues/races in your code, not the simulators.
Yes, simulators have bugs, but in my experience most simulators get the
basics correct. Now if 1 + 1 is not 2, 0 or -2 depending on the context,
there is a problem.

Cary
 
On Thursday, January 15, 2015 at 9:47:20 PM UTC, Cary R. wrote:

The other big issue is you are modeling flip-flops with blocking
assignments instead of non-blocking assignments.

[snip]

This is a common issue with programmers who have not done much parallel
programming (and many new to Verilog). You need to remember most of
these statements get translated into hardware that runs in parallel not
serially like you wrote it in the Verilog source.

This describes me perfectly.

It may be better to sample the bytes on the negative edge so that you
have significant setup/hold
time and all the other issues can just be ignored because they in
aggregate are significantly smaller than the clock period....
Using a negative edge would alleviate the need for buffers so could be
lower power and area.

More great advice - I'd have never thought of that...

After fixing things up a bit and using Icarus Verilog I get the
following which I assume is the expected output:

XXXX 0100 0302 0504 0706 ...

Here is my changed code.
[snip code and excellent explanation]

That's really great of you. Thank you. Thanks to your and Gabor's insights, I'm slowly modifying my thinking from my procedural mindset. It's really helping.

As a final comment. In general when you are an inexperienced Verilog
user if you find different simulators giving different results it is
usually because of timing issues/races in your code not the simulators.

Oh, I was more than aware of this. Any suggestion the simulators were at fault was purely unintentional. I was using the simulator differences to motivate the fact that my code was -- to use the software vernacular -- exhibiting "undefined behaviour".

Thank you for your advice. I'm getting there...

Gareth
 
