Now, back to the original topic at hand, before Stuart and Ales so
rudely created a gschem and gEDA turf war over schematics not being
part of PCB's charter -- I was unaware that gEDA had the right to
dictate design to the PCB project, which existed long before gEDA. I'll
let DJ, Harry, Stuart, and Ales work their turf war out ... leave me out
of it, please.
I strongly considered adding schematics into PCB some four years back,
while Harry was still maintaining it himself out of JHU. When Harry
dropped off the face of the earth for a while, I even considered
starting a PCB project at sf.net, until one day I saw Harry had created
one. Harry believed strongly that schematics do not belong in PCB,
and as chief maintainer of the sf.net project, that was his choice.
Before starting FpgaC on sf.net, I was strongly tempted to pick up and
continue to support an Xaw version, and add schematics in, as an sf.net
project called PCB2 ... fork the project, since Harry and I have very
different goals and expectations about UIs and the types of designs
PCB should support/produce. I asked very clearly on the PCB user forum
whether Xaw was dead, trying to get a clear idea of whether the project
would remain only a crippled Gtk version ... and got no answer. I was
actually surprised to find DJ had done a Lesstif variation, and had
deviated strongly (forked) from the old/Gtk UI.
In the end I decided I would be more useful digging TMCC out of the
grave, and bringing it forward a decade to be useful with today's FPGA
products.
TMCC/FpgaC suffers badly from the same working set problems I described
for PCB. Very small changes in a project's code can push FpgaC compile
times from a few minutes to hours ... and in one case from 45 minutes
to over a day and a half, simply by exceeding the working set size of
the L2 cache. Interestingly enough, the same C code does the same thing
to GCC, at a slightly different boundary point.
Student and other toy projects frequently contain simple algorithms
that are fine inside a typical processor's L2 cache these days ... but
when the data set grows just slightly, they fail horribly,
performance-wise. In this case, linearly searching a linked list works
fine up to about 90-95% of the L2 cache size. When you exceed that
threshold, performance drops and run times increase roughly 10X or
more, because of the nature of LRU or pseudo-LRU cache replacement
policies.
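This cliff is easy to see on a given machine with a toy benchmark: walk
a linked list repeatedly while its total footprint is swept past the L2
size, and the time per node visit jumps once the list no longer fits.
The sketch below is only an illustration (the node size, footprints, and
visit counts are arbitrary choices, not FpgaC's real structures):

/* Sketch only: time repeated sequential walks of a linked list as its
 * total footprint sweeps past a typical L2 size.  Node size, footprints,
 * and visit counts are arbitrary illustrations, not FpgaC's structures. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct node { struct node *next; char pad[56]; };   /* roughly 64 bytes */

static double walk_seconds(struct node *head, long passes)
{
    struct timespec t0, t1;
    volatile long touched = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long p = 0; p < passes; p++)
        for (struct node *n = head; n; n = n->next)
            touched += n->pad[0];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void)
{
    for (long kb = 64; kb <= 2048; kb *= 2) {       /* 64K .. 2M footprints */
        long count = kb * 1024 / (long)sizeof(struct node);
        long passes = 50000000L / count;            /* ~50M node visits each */
        struct node *head = NULL;
        for (long i = 0; i < count; i++) {
            struct node *n = malloc(sizeof *n);
            n->pad[0] = (char)i;
            n->next = head;
            head = n;
        }
        printf("%5ldK footprint: %.3f s for ~50M node visits\n",
               kb, walk_seconds(head, passes));
        while (head) { struct node *d = head; head = head->next; free(d); }
    }
    return 0;
}

On most machines the per-visit time stays roughly flat until the
footprint passes the L2 size, then jumps sharply.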
Consider, for example, a small cache of 4 "bins" of marbles taken from a
bowl of 300 marbles. If we first reference a certain red marble, it's
taken from the bowl and placed in a cache bin after searching the 300
marbles for it. We keep using and replacing the red marble, avoiding
the search of the bowl. Later we also use a green, blue, and yellow
marble, which take the three remaining bins in the cache. Because of
the nature of the task, we always use red, green, blue, and yellow in
that order, always taking from the cache, and replacing in the cache.
When our working set expands to five marbles, we have a cache failure,
which goes like this. We access the red, green, blue, and yellow marbles
in order from the cache, then we need a white marble. The red marble is
least recently used, so it's removed from the cache and replaced with
the white marble. We then repeat our cycle, next needing the red marble,
which is no longer in the cache, so we must fetch it from the bowl and,
due to the LRU algorithm, replace the green marble with the red
marble. However, next we need the green marble, which forces the blue out
of the cache. Next we need the blue marble, forcing the yellow out of
the cache. Next we need the yellow, forcing the white out of the cache
... and so on, with every cache reference faulting, requiring a lengthy
access and search of the bowl.
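The marble analogy is simple enough to simulate directly. A minimal
sketch of a 4-slot LRU cache, accessed cyclically, shows the behavior:
with 4 distinct items nearly every access hits after warm-up, while
with 5 items every access misses (this is a toy model, not real cache
hardware):

/* Toy simulation of the marble example: a 4-slot LRU cache accessed
 * cyclically.  With 4 distinct items every access hits after warm-up;
 * with 5, LRU evicts exactly the item needed next, so every access misses. */
#include <stdio.h>

#define SLOTS 4

static int cache[SLOTS];      /* item id held in each slot, -1 = empty */
static int age[SLOTS];        /* larger = more recently used */
static int now;               /* global "time" for recency stamps */

static int access_item(int item)
{
    int lru = 0;
    for (int i = 0; i < SLOTS; i++) {
        if (cache[i] == item) {          /* hit: refresh recency and return */
            age[i] = ++now;
            return 1;
        }
        if (age[i] < age[lru])           /* remember least recently used slot */
            lru = i;
    }
    cache[lru] = item;                   /* miss: evict LRU, load new item */
    age[lru] = ++now;
    return 0;
}

static void run(int distinct, int rounds)
{
    int hits = 0;
    for (int i = 0; i < SLOTS; i++) { cache[i] = -1; age[i] = 0; }
    for (int r = 0; r < rounds; r++)
        for (int item = 0; item < distinct; item++)
            hits += access_item(item);
    printf("%d distinct items: %d hits of %d accesses\n",
           distinct, hits, rounds * distinct);
}

int main(void)
{
    run(4, 100);   /* fits the cache: 396 hits of 400 (only the cold misses) */
    run(5, 100);   /* one item too many: 0 hits of 500 */
    return 0;
}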
LRU algorithms fail horribly with sequential searches of the cached
working set, resulting in a very sharp reduction in performance as the
working set is exceeded. In FpgaC's case, the primary data structures
are linked lists, which are frequently searched completely to verify the
lack of duplicate entries when a new item is created. When the working
set under these linked lists exceeds the processor's L2 cache size, run
times jump by more than a factor of 10 on many machines these days ...
the ratio of L2 cache performance to memory performance. Thus,
depending on the host processor's L2/L3 cache size, there are critical
points for FpgaC where the run time to compile an incrementally small
increase in program size jumps dramatically. The fix for this is
relatively simple, and will occur soon: replace the linear searches
with tree or hash searches, so that the entire working set is no longer
referenced and the LRU replacement failure mode is never invoked.
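As a sketch of what that fix looks like (the names and sizes here are
hypothetical, not FpgaC's actual symbol-table code), a hashed duplicate
check walks only one short bucket chain instead of the whole list, so
each lookup touches a tiny fraction of the working set:

/* Sketch of replacing the full linear scan with a hashed duplicate check.
 * Names, sizes, and the table itself are illustrative assumptions, not
 * FpgaC's actual symbol-table code. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 4096

struct sym {
    struct sym *next;                  /* chain within one bucket */
    char *name;
};

static struct sym *buckets[NBUCKETS];

static unsigned hash_name(const char *s)
{
    unsigned h = 5381;                 /* djb2 string hash */
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Return 1 if name was newly inserted, 0 if it already existed.  Only one
 * short bucket chain is walked, so the per-lookup working set stays small
 * even when the whole table is far larger than L2. */
static int insert_unique(const char *name)
{
    unsigned h = hash_name(name);
    for (struct sym *s = buckets[h]; s; s = s->next)
        if (strcmp(s->name, name) == 0)
            return 0;                  /* duplicate: nothing added */
    struct sym *s = malloc(sizeof *s);
    s->name = strdup(name);
    s->next = buckets[h];
    buckets[h] = s;
    return 1;
}

int main(void)
{
    printf("%d\n", insert_unique("net_clk"));   /* 1: new entry */
    printf("%d\n", insert_unique("net_clk"));   /* 0: duplicate found */
    return 0;
}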
Similar problems exist at several levels in the responsiveness of PCB.
Any event which forces a search of the design space will require the
working set to hold all the objects that must be searched. When that
working set grows past the various cache sizes, noticeable increases in
latency will result, to the point that they are visible in the UI ...
that point will vary depending on the particular machine being used
(L1/L2/L3 cache sizes, and the relative latency of a reference that
faults). Developers who only use and test on fast processors, with
large caches and fast native memory, will not notice the extremely
jerky performance that someone using a P2 333 Celeron (128K cache)
with a 66MHz processor bus and fast page mode memory will encounter.
Faster 4GHz processors with 512K caches will fail equally noticeably,
with designs some 4-10 times larger.
Certain operations will fail harder: those which invoke a series of X
output, as they also incur the memory overhead of part of the kernel,
some shared libraries, the X server, and the display driver in the
"working set" for those operations. While a 512K cache is four times
larger, the available working set is the cache size minus the X working
set, meaning that for small cache sizes there might not be much working
set left at all, while doubling the cache size may actually increase the
usable working set by 10X or more (for example, if the X side of those
operations occupies roughly 112K, a 128K cache leaves only about 16K for
PCB's own data, while a 256K cache leaves about 144K, nearly 10X as
much).
Just taking a guess, PCB + kernel + Xaw + X server probably has a
working set somewhere around or slightly larger than 128K for very
basic PCB tasks. Thus we will see cache flushing and reloading between
X calls, with locally OK cache performance at both ends. As the L2/L3
cache grows to 512K this is probably less of a problem.
What does become a problem is when the PCB<==>X working set gets
continually flushed by every call, such as making a long series of
little calls to the X server, faulting all the way to the X server and
faulting all the way back ... calling performance drops like a rock ...
a factor of 10 or more. This happens when the task at either, or both,
of the PCB or X server ends requires a slightly larger working set,
pushing the total working set for LRU into the worst-case failure mode.
I suspect that the Gtk failure modes do this, by including the Gtk
overhead in the working set, such that every PCB-to-Gtk-to-X-server
call faults round trip, and runs at native memory performance. The
reason I believe this is that in my testing a 550MHz PIII machine with
SDRAM is only about twice as slow as a 2GHz P4 machine with DDR SDRAM
in this failure mode ... rather than the 4-6X normal computation
difference when running at CPU speed from L1 cache, or even L2 cache.
With synchronous calls to Gtk and the X server, it's difficult for PCB
to keep its event processing in real time.
I have a several-day class I used to teach regularly that discusses in
detail designing for the hysteresis problems that occur with step
discontinuities in the processor load vs. throughput function, which is
quite useful for recognizing and designing architectural solutions to
problems of this class.
So ... application architecture with respect to working set sizes is a
critical performance issue. Algorithm choices which conserve working
set, and avoid sequential LRU faulting, are a critical design issue.
And carefully managing data set representation for compactness, even at
the cost of a number of CPU cycles for packing/unpacking, can greatly
push off the working set failure threshold with careful design (a small
packing sketch follows below). Consolidating processes to minimize the
frequency and location of long-working-set calls to external processes
(including them as threads if necessary) is critical, so that they
align with places in the UI interaction where latency hiding is
transparent, rather than highly visible.
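As a purely illustrative example of trading cycles for working set (the
field widths and names below are hypothetical, not PCB's actual data
representation), two coordinates and a flag byte known to fit in
20+20+8 bits can be packed into a single 64-bit word instead of a
12-byte struct; unpacking costs a few shifts and masks per access, but
more objects then fit in the same cache:

/* Illustrative only: trading a few shifts and masks per access for a
 * smaller per-object footprint.  Field widths and names are hypothetical,
 * not PCB's actual data representation. */
#include <stdint.h>
#include <stdio.h>

/* Unpacked: two 32-bit coordinates plus 32-bit flags = 12 bytes/object. */
struct point_unpacked { int32_t x, y; uint32_t flags; };

/* Packed: 20-bit x, 20-bit y, 8-bit flags in one 64-bit word = 8 bytes,
 * one third less per object, so 50% more objects fit in the same cache. */
typedef uint64_t point_packed;

static point_packed pack(uint32_t x, uint32_t y, uint32_t flags)
{
    return ((uint64_t)(x & 0xFFFFF))
         | ((uint64_t)(y & 0xFFFFF) << 20)
         | ((uint64_t)(flags & 0xFF) << 40);
}

static void unpack(point_packed p, uint32_t *x, uint32_t *y, uint32_t *flags)
{
    *x     = (uint32_t)(p & 0xFFFFF);          /* a handful of shift/mask   */
    *y     = (uint32_t)((p >> 20) & 0xFFFFF);  /* cycles per access, in     */
    *flags = (uint32_t)((p >> 40) & 0xFF);     /* exchange for fitting more */
}                                              /* objects in the cache      */

int main(void)
{
    uint32_t x, y, f;
    point_packed p = pack(123456, 654321, 0x5A);
    unpack(p, &x, &y, &f);
    printf("x=%u y=%u flags=0x%X  (%zu vs %zu bytes per object)\n",
           x, y, f, sizeof(point_packed), sizeof(struct point_unpacked));
    return 0;
}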
fpga_toys@yahoo.com wrote:
DJ Delorie wrote:
fpga_toys@yahoo.com writes:
It really needs to be the same tool,
Or at least *seem* like it's the same tool. Otherwise, I agree.
On larger designs, memory is being pushed to maintain the lists and
objects already instantiated. Paging severely cuts into performance.
When running as a separate application, there is substantial page
replication introduced for every data page across a long list of shared
library instances, plus replication of the netlists. Likewise,
performance is critically tied to working set; having a second
application running concurrently with an equally large working set will
provoke substantial cache thrashing, which will show up as
memory-latency-induced jerkiness in the UI as the cache is flushed out
and reloaded between contexts. While these may seem like parameters of
the application architecture that can be ignored, perceived UI
performance is heavily dependent on them. Similarly, communication
between separate applications results in context switches, which cause
additional cache thrashing by including large sections of the kernel in
the working set. Consider that the processor is some 20-100 times
faster than L2/L3 cache these days, and the cache is frequently another
10-50 times or more faster than memory. Exceeding cache working sets
effectively turns the machine into a 50MHz processor again.
There are substantial performance reasons suggesting that it should be
the same application (just a different thread at most) to conserve
memory resources and improve performance. While they may not be
critical for toy student projects, for many real-life projects, which
are much larger, they become critical UI problems. The sample
ProofOfConcept design I sent you is about 1/5 the size of several
production designs I have done using PCB.
When the typical desktop CPU comes standard with 10MB or more of L2
cache, these issues might go away. Last time I checked, that was only
available on high-end Itanium processors, well outside the reach of
most mortals in cost (or me, right now).