Max number of concurrent page sizes...

Don Y wrote:
I've been canvassing processors to determine the maximum number
of concurrently active page sizes supported in the hardware.
Most common values are:
- 0 (boring processors :>)
- 1 (old school)
- 2 (modern common)
with, so far, a maximum of *7* supported.

Anyone know of current hardware that supports a greater number?
(PMMUs, only -- not interested in segmented architectures)
 
Don Y <blockedofcourse@foo.invalid> wrote in message:
> I've been canvassing processors to determine the maximum number
> of concurrently active page sizes supported in the hardware.
> Most common values are:
> - 0 (boring processors :>)
> - 1 (old school)
> - 2 (modern common)
> with, so far, a maximum of *7* supported.
>
> Anyone know of current hardware that supports a greater number?
> (PMMUs, only -- not interested in segmented architectures)

You should be getting ready for that lake effect snow.

;)
 
On 11/17/2022 10:20, Don Y wrote:
> I've been canvassing processors to determine the maximum number
> of concurrently active page sizes supported in the hardware.
> Most common values are:
> - 0 (boring processors :>)
> - 1 (old school)
> - 2 (modern common)
> with, so far, a maximum of *7* supported.
>
> Anyone know of current hardware that supports a greater number?
> (PMMUs, only -- not interested in segmented architectures)

Hi Don,
the 32 bit power cores I use support just a 4k page size; however, they
also have block address translation, which is quite a blessing (e.g.
some of the most vital OS code can be put into a block starting at
0; block size can get really huge, with the same protection bits
as pages, a block having priority over the pages it overlaps).
The 64 bit core I want to use does not have that but has a few page
sizes; I have yet to think over how it will work (still finishing
that programming I told you about, have been finishing it for ages now,
latest target is end of this year :). At the moment I already miss
block address translation, but I have not thought it through in depth yet.
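
For reference, a minimal sketch of what setting up one of those blocks
amounts to, per the classic 32-bit PowerPC BAT layout as I recall the
OEA encoding (the helper names are mine, not a real API):

    /* Sketch: building a 32-bit PowerPC BAT register pair (classic
       OEA layout).  bat_upper()/bat_lower() are illustrative names. */
    #include <stdint.h>

    #define BAT_BL_256M  0x7FFu      /* block length: 11 ones = 256MB */
    #define BAT_Vs       (1u << 1)   /* valid in supervisor mode */
    #define BAT_Vp       (1u << 0)   /* valid in user mode */
    #define BAT_PP_RW    0x2u        /* read/write page protection */

    /* Upper BAT word: BEPI (effective addr bits) | BL | Vs | Vp */
    static uint32_t bat_upper(uint32_t ea, uint32_t bl, uint32_t valid)
    {
        return (ea & 0xFFFE0000u) | (bl << 2) | valid;
    }

    /* Lower BAT word: BRPN (physical addr bits) | WIMG | PP */
    static uint32_t bat_lower(uint32_t pa, uint32_t wimg, uint32_t pp)
    {
        return (pa & 0xFFFE0000u) | (wimg << 3) | pp;
    }

    /* e.g. map EA 0 -> PA 0, 256MB, supervisor-only, R/W: write
       bat_upper(0, BAT_BL_256M, BAT_Vs) and bat_lower(0, 0, BAT_PP_RW)
       to IBAT0U/IBAT0L (or DBAT0U/DBAT0L) with mtspr. */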
 
On 11/17/2022 2:38 PM, Dimiter_Popoff wrote:
> the 32 bit power cores I use support just a 4k page size; however, they

4K seems to be the most commonly supported size (where pages are
supported). Some of the ARMs had "tiny" page support but that
seems to be an obsolescent feature (?)

On the other end, you can find support for 2GB pages (WTF?).
This, of course, limits the uses of the MMU.

> also have block address translation, which is quite a blessing (e.g.
> some of the most vital OS code can be put into a block starting at
> 0; block size can get really huge, with the same protection bits
> as pages, a block having priority over the pages it overlaps).

Is there a more efficient addressing mode for pages residing
that low in memory (like "page 0" for the 68xx's?). I.e.,
what advantage does the "address translation" bring -- the *protection*
mechanisms obviously bring something to the party...

> The 64 bit core I want to use does not have that but has a few page
> sizes; I have yet to think over how it will work (still finishing
> that programming I told you about, have been finishing it for ages now,
> latest target is end of this year :).

The good thing about targets is you can always MOVE THEM! :>

> At the moment I already miss
> block address translation, but I have not thought it through in depth yet.

With a PMMU, you should be able to relocate <whatever> to any address
range you like. There will be a cost, however, in the TLB lookups
(and cache misses, etc.)

But, I would assume your mapping is largely static? Set it once at
initialization and then forget?

You don't, for example, give each process its own (overlapping!) address
space org'ed at SOMEHEXCONSTANT?

I rely on the VMM system, heavily, at runtime. So, its hardware
implementation is of particular interest to me...
 
On 11/18/2022 3:26, Don Y wrote:
> On 11/17/2022 2:38 PM, Dimiter_Popoff wrote:
>> the 32 bit power cores I use support just a 4k page size; however, they

> 4K seems to be the most commonly supported size (where pages are
> supported).  Some of the ARMs had "tiny" page support but that
> seems to be an obsolescent feature (?)

> On the other end, you can find support for 2GB pages (WTF?).
> This, of course, limits the uses of the MMU.

>> also have block address translation, which is quite a blessing (e.g.
>> some of the most vital OS code can be put into a block starting at
>> 0; block size can get really huge, with the same protection bits
>> as pages, a block having priority over the pages it overlaps).

> Is there a more efficient addressing mode for pages residing
> that low in memory (like "page 0" for the 68xx's?).  I.e.,
> what advantage does the "address translation" bring -- the *protection*
> mechanisms obviously bring something to the party...

There is, the lowest 64k, for load/store. The people who did the
power architecture during the '80s at IBM did something the rest
have yet to catch up with.

>> The 64 bit core I want to use does not have that but has a few page
>> sizes; I have yet to think over how it will work (still finishing
>> that programming I told you about, have been finishing it for ages now,
>> latest target is end of this year :).

> The good thing about targets is you can always MOVE THEM!  :>

You can move them, I am not so sure it is a good thing though :).

>> At the moment I already miss
>> block address translation, but I have not thought it through in depth yet.

> With a PMMU, you should be able to relocate <whatever> to any address
> range you like.  There will be a cost, however, in the TLB lookups
> (and cache misses, etc.)

There are several BAT registers, so you can have one or more regions
translated where there is no need to do tablewalks, which is very
convenient. It also has other implications, say for interrupt
latency. Place your interrupt handling code in a BAT translated
area (dps allows you that) and your interrupt code will never
cause a tablewalk etc.

> But, I would assume your mapping is largely static?  Set it once at
> initialization and then forget?

Not many things are static in dps; one of the first things it does
during boot is to process a file, setup.syst, which can have a line
saying, say,
paging 30
This means you will have a 2^30 byte (1G) virtual memory space mapped into
the physical memory (e.g. 128M). All code is position independent,
so each task is happy to reside anywhere in the 2G (up to 4G on that one)
address space; it does get spawned with a system data section and
a user data section, which can reside anywhere as well; from there
on a task can allocate memory for itself, optionally registered
in its history record buffer (so it will be deallocated upon task
kill). You can mark memory regions non-swappable (page translated ones;
BAT translated regions can never be swapped anyway).
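
If it helps, the arithmetic of that "paging N" line is just a power of
two; a toy rendering (all names invented -- setup.syst's real grammar
is dps's own):

    /* Toy sketch of the "paging N" idea: N is the log2 of the virtual
       space, so "paging 30" => a 2^30 byte (1G) virtual space carved
       into 4k clusters.  Illustrative only. */
    #include <stdio.h>

    #define CLUSTER_SIZE 4096u          /* 4k clusters, per the description */

    static void paging_line(const char *line, unsigned long phys_bytes)
    {
        unsigned n;
        if (sscanf(line, "paging %u", &n) == 1) {
            unsigned long long vsize = 1ULL << n;
            printf("virtual space: 2^%u = %llu bytes (%llu clusters), "
                   "mapped onto %lu bytes physical\n",
                   n, vsize, vsize / CLUSTER_SIZE, phys_bytes);
        }
    }

    int main(void)
    {
        paging_line("paging 30", 128ul * 1024 * 1024);   /* e.g. 128M */
        return 0;
    }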

> You don't, for example, give each process its own (overlapping!) address
> space org'ed at SOMEHEXCONSTANT?

Each process does have its own memory but with *no overlapping*. I am
firmly against doing it; it buys you nothing except backward
compatibility to older systems you have - which I don't.
A 4G address space is adequate for a 32 bit machine,
and a 2^64 address space is more than enough for a 64 bit
machine, so I see no need to allow tasks to reside at overlapping
addresses.

> I rely on the VMM system, heavily, at runtime.  So, its hardware
> implementation is of particular interest to me...

Well of course, this is one of the fundamentals. It has to allow
you to protect pages, allow/disallow access to BAT translated
areas (e.g. you may want to allow system code to be read/executed
in user mode but not written to, same or different for supervisor
mode etc.; the power architecture provides for all that, don't know
about the rest. I remember some ARM cores would do caching based
on logical address, which makes them useless for many purposes;
perhaps they have better ones nowadays).
 
On 11/18/2022 9:03 AM, Dimiter_Popoff wrote:
>> Is there a more efficient addressing mode for pages residing
>> that low in memory (like "page 0" for the 68xx's?).  I.e.,
>> what advantage does the "address translation" bring -- the *protection*
>> mechanisms obviously bring something to the party...

> There is, the lowest 64k, for load/store. The people who did the
> power architecture during the '80s at IBM did something the rest
> have yet to catch up with.

The modern equivalent is the "base 4GB" (32b) space.  :>

>>> The 64 bit core I want to use does not have that but has a few page
>>> sizes; I have yet to think over how it will work (still finishing
>>> that programming I told you about, have been finishing it for ages now,
>>> latest target is end of this year :).

>> The good thing about targets is you can always MOVE THEM!  :>

> You can move them, I am not so sure it is a good thing though :).

Depends on whether or not you ever want to be DONE!  (or "late")

>>> At the moment I already miss
>>> block address translation, but I have not thought it through in depth yet.

>> With a PMMU, you should be able to relocate <whatever> to any address
>> range you like.  There will be a cost, however, in the TLB lookups
>> (and cache misses, etc.)

> There are several BAT registers, so you can have one or more regions
> translated where there is no need to do tablewalks, which is very
> convenient. It also has other implications, say for interrupt
> latency. Place your interrupt handling code in a BAT translated
> area (dps allows you that) and your interrupt code will never
> cause a tablewalk etc.

I wire down those TLB pages -- as well as the pages containing the code
and data.

>> But, I would assume your mapping is largely static?  Set it once at
>> initialization and then forget?

> Not many things are static in dps; one of the first things it does
> during boot is to process a file, setup.syst, which can have a line
> saying, say,
> paging 30
> This means you will have a 2^30 byte (1G) virtual memory space mapped into
> the physical memory (e.g. 128M). All code is position independent,
> so each task is happy to reside anywhere in the 2G (up to 4G on that one)
> address space; it does get spawned with a system data section and
> a user data section, which can reside anywhere as well; from there

But, they can't REALLY reside "anywhere" as you won't allow them to
conflict with previous allocations.

If you allow overlap, then they can truly reside "anywhere" (except
the kernel access points)

> on a task can allocate memory for itself, optionally registered
> in its history record buffer (so it will be deallocated upon task
> kill). You can mark memory regions non-swappable (page translated ones;
> BAT translated regions can never be swapped anyway).

But you don't use the VMM system to move data/objects between processes (?).
E.g., when I want to push a packet to a network address, the memory that
contains that packet EXITS my address space and ENTERS the address space
of the network process that will get it transported to the target.

So, I can scribble on that memory immediately after the call and not worry
about whether the original memory contents or the "scribbled" contents
will actually get transmitted. This lets me turn all calls into "by value"
instead of "by reference". WITHOUT incurring the copy cost.
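
The mechanics, in toy-kernel terms: unmap the pages from the sender and
map them into the receiver. The as_* primitives below are hypothetical;
they just show why "by value" needn't cost a copy:

    /* Sketch of zero-copy "pass by value" via remapping: the sender's
       pages leave its address space and enter the receiver's, so
       neither ever sees the other's later writes.  The as_* calls are
       hypothetical toy-kernel primitives, not a real API. */
    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096u

    typedef struct addr_space addr_space_t;

    extern uintptr_t as_unmap(addr_space_t *as, uintptr_t va); /* -> frame */
    extern void      as_map(addr_space_t *as, uintptr_t va, uintptr_t pa);
    extern uintptr_t as_pick_free_va(addr_space_t *as, size_t pages);

    /* Move 'pages' pages at src_va from 'src' into 'dst'; afterwards
       the sender can scribble at src_va without affecting the message. */
    uintptr_t grant_pages(addr_space_t *src, uintptr_t src_va,
                          addr_space_t *dst, size_t pages)
    {
        uintptr_t dst_va = as_pick_free_va(dst, pages);
        for (size_t i = 0; i < pages; i++) {
            uintptr_t pa = as_unmap(src, src_va + i * PAGE_SIZE);
            as_map(dst, dst_va + i * PAGE_SIZE, pa);
        }
        return dst_va;          /* where the receiver sees the data */
    }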

[Imagine processing live video and passing frames of data from one
process ("image enhancement") to another ("scene analysis") and
another ("motion detection") and another ("capture") -- each process
independently acting at a rate appropriate to their responsibilities
and resources without worrying about some other process/bug stomping
on something "important".]

Likewise, I can accept an entire process address space from another
node and just drop it in place, without worrying about "if it will fit"
(in the context of previous allocations).

>> You don't, for example, give each process its own (overlapping!) address
>> space org'ed at SOMEHEXCONSTANT?

> Each process does have its own memory but with *no overlapping*. I am
> firmly against doing it; it buys you nothing except backward
> compatibility to older systems you have - which I don't.
> A 4G address space is adequate for a 32 bit machine,
> and a 2^64 address space is more than enough for a 64 bit
> machine, so I see no need to allow tasks to reside at overlapping
> addresses.

I've found the biggest advantage to be that I don't have to "bake in"
decisions at design/compile time. I.e., if you want to have disjoint
address spaces (neglecting the fact that you may want to share a single
code image in multiple processes), then you need some way of partitioning
the single address space, /a priori/. So, if you want to support 1000
processes, then you set aside 10 address bits to determine which-is-which.
Another one for the kernel/user distinction. Your 32b machine is now a
real-mode '286 (~20 bit addresses).

If you want to support multiple nodes -- processes from any of which can
be migrated ONTO your node -- then you need to set aside enough additional
address bits to allow for their non-overlapping address spaces (and, have
to ensure that each node KNOWS what portion of the SHARED, SINGLE address
space they can use for their processes). I.e., treat the "system" as having
a single, unified address space -- though physically distributed across
multiple devices.

[I already need 9 bits to identify a "node number" so we'll have set
aside 20 address bits, already (10 process, 1 user/kernel, 9 node).
A larger installation could easily use more bits to support a greater
number of nodes]

If you assume a page is 4K (12 bits), then you've exhausted all 32 of the bits
available -- and only support a single page per process! (not very effective
as a means of transferring memory contents between processes :> )
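
To spell out that bit budget (field widths as in the example; the
layout itself is purely illustrative):

    /* The address-bit budget from the example above: with 4K pages,
       node + process + kernel/user + page offset already consumes all
       32 bits, leaving nothing to select a page within a process. */
    #include <stdio.h>

    int main(void)
    {
        int node = 9, process = 10, mode = 1, offset = 12;  /* 4K page */
        int used = node + process + mode + offset;
        printf("used %d of 32 bits; %d left for page-within-process\n",
               used, 32 - used);                /* -> used 32, 0 left */
        return 0;
    }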

Also, partitioning the address space leaks information. If YOUR code,
executing at 0x03678xxx (process 678 on node 03), accesses a block of memory
at 0xNNPPPxxx (copied into YOUR local address space), then you know it
originated in process PPP on node NN. This is information you shouldn't
know or see (as, someday, an exploit may be discovered that plagues node
NN or some particular process PPP regardless of the node on which it resides)

E.g., I have a separate mechanism that prevents the system from migrating
certain processes to nodes that aren\'t PHYSICALLY secure (e.g., located
in places where they can be compromised). But, this is handled at
configuration time -- because it is site dependent. If a malevolent
process could identify when an exploitable process has been migrated
onto a vulnerable node, then security is silently compromised.

[If your system is closed and you -- or a trusted few -- are the sole
developers, then you don't have to worry about malevolent actors. OTOH,
if foreign software can be installed with the *intent* of adding value,
then you have to assume some of that software can be deliberately or
accidentally malevolent in its actions]

If you assume a process should be able to manipulate multiple *pages*
(as interfaces to other processes), then a process needs additional
addressing bits beyond those required to isolate a byte within *a*
page.

>> I rely on the VMM system, heavily, at runtime.  So, its hardware
>> implementation is of particular interest to me...

> Well of course, this is one of the fundamentals. It has to allow
> you to protect pages, allow/disallow access to BAT translated
> areas (e.g. you may want to allow system code to be read/executed
> in user mode but not written to, same or different for supervisor
> mode etc.; the power architecture provides for all that, don't know
> about the rest. I remember some ARM cores would do caching based
> on logical address, which makes them useless for many purposes;
> perhaps they have better ones nowadays).

Protecting memory is only a small part of the problem. That attempts
to prevent surreptitious/accidental \"communication\" between processes
(benevolent or malevolent).

But, you also need to be able to constrain *explicit* communication.
Should process X be able to talk to the network stack? More generally,
should it be able to talk to process Y?

And, if it can talk to a particular process, what should it be allowed
to say/request? There's likely no harm in letting an arbitrary process
check the health of the battery. But you probably shouldn't be so
trusting/lax as to allow processes to disconnect the battery (and
interfere with charging, or kill power).

As the hardware mechanisms are too crude to give that sort of fine-grained
access control (permission bits per function invocation?), you have to build
mechanisms to implement and enforce those features. Mechanisms are typically
active -- more *processes* (a "wider" PPP) to act as gatekeepers.
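
A gatekeeper, at its core, is just a default-deny, per-client,
per-operation permission check done in software. A toy sketch (names,
table, and ops all invented for illustration):

    /* Sketch of a gatekeeper: fine-grained per-operation permissions
       that MMU hardware cannot express.  All names invented. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    enum battery_op { BATT_QUERY_HEALTH = 1 << 0,
                      BATT_DISCONNECT   = 1 << 1 };

    #define ANY_CLIENT 0xFFFFFFFFu

    struct acl_entry { uint32_t client; uint32_t allowed; };

    /* anyone may check health; only client 42 (say, the power
       manager) may disconnect the battery */
    static const struct acl_entry acl[] = {
        { ANY_CLIENT, BATT_QUERY_HEALTH },
        { 42,         BATT_QUERY_HEALTH | BATT_DISCONNECT },
    };

    static bool permitted(uint32_t client, enum battery_op op)
    {
        for (size_t i = 0; i < sizeof acl / sizeof acl[0]; i++)
            if ((acl[i].client == client || acl[i].client == ANY_CLIENT)
                && (acl[i].allowed & (uint32_t)op))
                return true;
        return false;                    /* default deny */
    }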
 
On 11/19/2022 1:25, Don Y wrote:
> On 11/18/2022 9:03 AM, Dimiter_Popoff wrote:
>>> Is there a more efficient addressing mode for pages residing
>>> that low in memory (like "page 0" for the 68xx's?).  I.e.,
>>> what advantage does the "address translation" bring -- the *protection*
>>> mechanisms obviously bring something to the party...

>> There is, the lowest 64k, for load/store. The people who did the
>> power architecture during the '80s at IBM did something the rest
>> have yet to catch up with.

> The modern equivalent is the "base 4GB" (32b) space.  :>

>>>> The 64 bit core I want to use does not have that but has a few page
>>>> sizes; I have yet to think over how it will work (still finishing
>>>> that programming I told you about, have been finishing it for ages now,
>>>> latest target is end of this year :).

>>> The good thing about targets is you can always MOVE THEM!  :>

>> You can move them, I am not so sure it is a good thing though :).

> Depends on whether or not you ever want to be DONE!  (or "late")

>>>> At the moment I already miss
>>>> block address translation, but I have not thought it through in depth yet.

>>> With a PMMU, you should be able to relocate <whatever> to any address
>>> range you like.  There will be a cost, however, in the TLB lookups
>>> (and cache misses, etc.)

>> There are several BAT registers, so you can have one or more regions
>> translated where there is no need to do tablewalks, which is very
>> convenient. It also has other implications, say for interrupt
>> latency. Place your interrupt handling code in a BAT translated
>> area (dps allows you that) and your interrupt code will never
>> cause a tablewalk etc.

> I wire down those TLB pages -- as well as the pages containing the code
> and data.

>>> But, I would assume your mapping is largely static?  Set it once at
>>> initialization and then forget?

>> Not many things are static in dps; one of the first things it does
>> during boot is to process a file, setup.syst, which can have a line
>> saying, say,
>> paging 30
>> This means you will have a 2^30 byte (1G) virtual memory space mapped into
>> the physical memory (e.g. 128M). All code is position independent,
>> so each task is happy to reside anywhere in the 2G (up to 4G on that one)
>> address space; it does get spawned with a system data section and
>> a user data section, which can reside anywhere as well; from there

> But, they can't REALLY reside "anywhere" as you won't allow them to
> conflict with previous allocations.

Well, anywhere without overlapping already allocated regions, of course.
Memory is allocated and deallocated all the time; the *first* call
I wrote, back in 1993 (or was it 1994), is "allcm$": the caller asks
for a number of bytes (a 32 bit number on 32 bit machines) and gets
in return an address where this memory starts and the actual amount
of memory allocated (which will be cluster aligned; useful to preserve
for the subsequent deallocation of that piece). Ever since then, allocation
proceeds by worst fit, so you will always have large contiguous
pieces to allocate from.
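
Worst fit over a cluster bitmap is simple enough to sketch (toy code,
invented names -- allcm$ itself is certainly not this):

    /* Toy worst-fit allocator over a cluster bitmap (1 bit per 4k
       cluster): always carve from the LARGEST free run, so large
       contiguous regions survive as long as possible. */
    #include <stdbool.h>
    #include <stddef.h>

    #define NCLUSTERS 32768u                /* e.g. 128M / 4k */
    static bool used[NCLUSTERS];            /* the bitmap */

    /* returns the first cluster of the allocation, -1 if nothing fits */
    long alloc_clusters(size_t want)
    {
        size_t best_start = 0, best_len = 0, i = 0;
        while (i < NCLUSTERS) {             /* scan the free runs */
            if (used[i]) { i++; continue; }
            size_t start = i;
            while (i < NCLUSTERS && !used[i]) i++;
            if (i - start > best_len) { best_len = i - start; best_start = start; }
        }
        if (best_len < want) return -1;
        for (size_t j = 0; j < want; j++)   /* take from the largest run */
            used[best_start + j] = true;
        return (long)best_start;
    }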

> If you allow overlap, then they can truly reside "anywhere" (except
> the kernel access points)

>> on a task can allocate memory for itself, optionally registered
>> in its history record buffer (so it will be deallocated upon task
>> kill). You can mark memory regions non-swappable (page translated ones;
>> BAT translated regions can never be swapped anyway).

> But you don't use the VMM system to move data/objects between processes
> (?).

No, but the address space being just one, I can pass pointers quite
easily. Sending an IP packet via Ethernet goes without copying; it
can also be scattered in a number of pieces, and this goes out without
copying, too.
Incoming packets are another matter; you can check how many bytes
you have available on a TCP connection and request all or a part
of that to be put at a certain address; the TCP stack gathers
these from the buffered packets - which can be in multiple Ethernet
pools, having arrived in any sequence etc. - and copies the data
where you want it, releasing the buffers etc.
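
For comparison, the POSIX analog of that "ask how much is buffered,
then have the stack copy it where you want" pattern (dps's own calls
are different, of course):

    /* POSIX analog: FIONREAD reports how many bytes the stack has
       buffered on the connection; read() then has the kernel gather
       them into the caller's buffer. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/ioctl.h>

    void drain_tcp(int sock)
    {
        int avail = 0;
        if (ioctl(sock, FIONREAD, &avail) == 0 && avail > 0) {
            char *buf = malloc((size_t)avail);
            if (buf != NULL) {
                ssize_t got = read(sock, buf, (size_t)avail);
                printf("gathered %zd of %d buffered bytes\n", got, avail);
                free(buf);
            }
        }
    }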

> E.g., when I want to push a packet to a network address, the memory that
> contains that packet EXITS my address space and ENTERS the address space
> of the network process that will get it transported to the target.
>
> So, I can scribble on that memory immediately after the call and not worry
> about whether the original memory contents or the "scribbled" contents
> will actually get transmitted.  This lets me turn all calls into "by value"
> instead of "by reference".  WITHOUT incurring the copy cost.

This is an interesting approach of course. I am not sure I follow
why it is necessary but obviously you must have had your reasons.

> [Imagine processing live video and passing frames of data from one
> process ("image enhancement") to another ("scene analysis") and
> another ("motion detection") and another ("capture") -- each process
> independently acting at a rate appropriate to their responsibilities
> and resources without worrying about some other process/bug stomping
> on something "important".]

Well, on the way out I can do that without copying. Doing it on the way
in from Ethernet without copying, just handing the scattered data as it
came in packets to the processing code, seems of little value to me;
you must be collecting the data in some way, which involves copying,
as IP packets are of arbitrary length and don't come page aligned.

> Likewise, I can accept an entire process address space from another
> node and just drop it in place, without worrying about "if it will fit"
> (in the context of previous allocations).

>>> You don't, for example, give each process its own (overlapping!) address
>>> space org'ed at SOMEHEXCONSTANT?

Each process also has a "common" data section; in fact, what you call a
process I call a task, and I call a process the group of tasks sharing
that common section. It is allocated - and can happen to come anywhere,
cluster aligned etc. - upon the first task of the group (I have been phasing
out the "process" term for the group for nearly 30 years now...).

>> Each process does have its own memory but with *no overlapping*. I am
>> firmly against doing it; it buys you nothing except backward
>> compatibility to older systems you have - which I don't.
>> A 4G address space is adequate for a 32 bit machine,
>> and a 2^64 address space is more than enough for a 64 bit
>> machine, so I see no need to allow tasks to reside at overlapping
>> addresses.

> I've found the biggest advantage to be that I don't have to "bake in"
> decisions at design/compile time.  I.e., if you want to have disjoint
> address spaces (neglecting the fact that you may want to share a single
> code image in multiple processes),

But neglecting the ability to run multiple instances of the same program
module (each task has at least one program module, with a system maintained
module descriptor etc.) is not acceptable at all for me. While you can
define a module as being non-reentrant, I have never written one.
E.g. you can have 10 shell instances, each run off the same module
(the OS keeps track of which file it was loaded from; if you run
another file of the same name it will get a new module), each allocated
just its own user and system data sections; there is also the initialized
data section, which can be part of the file from which the module
is loaded; if it is there, it is also loaded (into the user data section).


> then you need some way of partitioning
> the single address space, /a priori/.  So, if you want to support 1000
> processes, then you set aside 10 address bits to determine which-is-which.
> Another one for the kernel/user distinction.  Your 32b machine is now a
> real-mode '286 (~20 bit addresses).

It does not work like that in dps. You have a list of task descriptors,
each containing the program module ID, user stack, system stack, and much
other data.
Where these are allocated is immaterial to the code; it just uses
the addresses it has been given at run time.
The address space is divided into clusters - 4k per cluster at the
moment; there is a bitmap for the entire memory - so you don't have
to think about dividing the address space etc. It is all there;
the OS makes sure it is not fragmented by doing worst fit.
I can't see how this can be improved on, to be honest.

> If you want to support multiple nodes -- processes from any of which can
> be migrated ONTO your node -- then you need to set aside enough additional
> address bits to allow for their non-overlapping address spaces (and, have
> to ensure that each node KNOWS what portion of the SHARED, SINGLE address
> space they can use for their processes).  I.e., treat the "system" as having
> a single, unified address space -- though physically distributed across
> multiple devices.
>
> [I already need 9 bits to identify a "node number" so we'll have set
> aside 20 address bits, already (10 process, 1 user/kernel, 9 node).
> A larger installation could easily use more bits to support a greater
> number of nodes]
>
> If you assume a page is 4K (12 bits), then you've exhausted all 32 of the
> bits available -- and only support a single page per process!  (not very
> effective as a means of transferring memory contents between processes :> )
>
> Also, partitioning the address space leaks information.  If YOUR code,
> executing at 0x03678xxx (process 678 on node 03), accesses a block of
> memory at 0xNNPPPxxx (copied into YOUR local address space), then you know it
> originated in process PPP on node NN.  This is information you shouldn't
> know or see (as, someday, an exploit may be discovered that plagues node
> NN or some particular process PPP regardless of the node on which it
> resides)

I think I see where you are going now. Well, if you have enough
MMU hardware and not too many tasks to run, this could save some
(much?) of the code an OS otherwise needs to do all that. My approach
is to do it in code, though.

> E.g., I have a separate mechanism that prevents the system from migrating
> certain processes to nodes that aren't PHYSICALLY secure (e.g., located
> in places where they can be compromised).  But, this is handled at
> configuration time -- because it is site dependent.  If a malevolent
> process could identify when an exploitable process has been migrated
> onto a vulnerable node, then security is silently compromised.
>
> [If your system is closed and you -- or a trusted few -- are the sole
> developers, then you don't have to worry about malevolent actors.  OTOH,
> if foreign software can be installed with the *intent* of adding value,
> then you have to assume some of that software can be deliberately or
> accidentally malevolent in its actions]

I have only made provisions against hostile code; no need to implement
all the protections for now. The overhead would be tolerable, though
perhaps not unnoticeable; I am not sure. The key is that I have thought
it through so it can be done - perhaps never, if dps dies with me
(likely so).

> If you assume a process should be able to manipulate multiple *pages*
> (as interfaces to other processes), then a process needs additional
> addressing bits beyond those required to isolate a byte within *a*
> page.

>>> I rely on the VMM system, heavily, at runtime.  So, its hardware
>>> implementation is of particular interest to me...

>> Well of course, this is one of the fundamentals. It has to allow
>> you to protect pages, allow/disallow access to BAT translated
>> areas (e.g. you may want to allow system code to be read/executed
>> in user mode but not written to, same or different for supervisor
>> mode etc.; the power architecture provides for all that, don't know
>> about the rest. I remember some ARM cores would do caching based
>> on logical address, which makes them useless for many purposes;
>> perhaps they have better ones nowadays).

> Protecting memory is only a small part of the problem.  That attempts
> to prevent surreptitious/accidental "communication" between processes
> (benevolent or malevolent).
>
> But, you also need to be able to constrain *explicit* communication.
> Should process X be able to talk to the network stack?  More generally,
> should it be able to talk to process Y?

For memory protection I rely on the page/BAT bits controlling access;
the rest is down to software checks, on a case by case basis.
At a higher level than the tasks/processes discussed so far, dps
has its runtime object processing system: you can write interactions
between objects (which may be within the same task or not), which
you can limit at will for each object type etc. (dps objects may
have little in common with the other "objects" people know about;
an object is a piece of memory with a standard header and an object
unique part. Once code tries to create such an object (by just
providing the standard header to the OS), the object descriptor
is either found or created by searching a defined sequence of
paths etc.; this is done recursively if the newly found object
is of an unknown type). Then you can tell the object to do
this or that; if it can't do it, its parent object will try
to do it, and so on. It has evolved for 25+ years now, and I
have been getting better (still am) at using that system.
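
A rough C rendering of that delegation idea, as I understand it
(standard header, type descriptor, "try the parent if I can't");
the real dps objects surely differ:

    /* Sketch of delegation: every object starts with a standard
       header; a request the object's type can't handle is passed to
       its parent type, and so on up the chain.  Structures invented
       purely for illustration. */
    #include <stddef.h>

    struct obj;                                    /* forward decl */

    struct obj_type {
        const char            *name;
        const struct obj_type *parent;             /* delegation chain */
        int (*handle)(struct obj *self, int request, void *arg);
    };

    struct obj {                                   /* the standard header */
        const struct obj_type *type;
        /* ...object-unique part follows in memory... */
    };

    /* Tell the object to do 'request'; if its type can't, the parent
       type tries, and so on.  Returns nonzero on success. */
    int obj_request(struct obj *o, int request, void *arg)
    {
        for (const struct obj_type *t = o->type; t != NULL; t = t->parent)
            if (t->handle && t->handle(o, request, arg))
                return 1;
        return 0;                                  /* nobody could do it */
    }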

> And, if it can talk to a particular process, what should it be allowed
> to say/request?  There's likely no harm in letting an arbitrary process
> check the health of the battery.  But you probably shouldn't be so
> trusting/lax as to allow processes to disconnect the battery (and
> interfere with charging, or kill power).
>
> As the hardware mechanisms are too crude to give that sort of fine-grained
> access control (permission bits per function invocation?), you have to
> build mechanisms to implement and enforce those features.  Mechanisms are
> typically active -- more *processes* (a "wider" PPP) to act as gatekeepers.
 
On 11/18/2022 5:59 PM, Dimiter_Popoff wrote:

Check your mail...
 
