Bit Swizzling...

R

Rick C. Hodgin

Guest
I\'m not sure where to ask this question, so I pushed it out to several
groups. You need not reply to all of them if you don\'t think it is a
topical subject.

I included comp.compilers in a previous message, but it apparently holds
the message until it passes moderation. So, I\'ve not included it here.
A duplicate message may post if/when the comp.compilers moderator John
Levine approves it.

-----
Are there any algorithms which take a known-at-compile-time sequence
of bitwise operations on an 8-bit to 64-bit quantity, and optimize
them down to their minimal set of operations?

For example, if I have an 8-bit byte and I want to swizzle the bits
thusly:

Input: 07 06 05 04 03 02 01 00
Output: 05 04 07 02 01 03 00 06

I can easily swizzle the data using a brute force method:

v = get_value();
o = 0;
swizzle1(o, v, 0, 6);
swizzle1(o, v, 1, 0);
swizzle1(o, v, 2, 3);
swizzle1(o, v, 3, 1);
swizzle1(o, v, 4, 2);
swizzle1(o, v, 5, 7);
swizzle1(o, v, 6, 4);
swizzle1(o, v, 7, 5);

// Untested, off the top of my head
void swizzle(unsigned char& o, unsigned char v, int s, int d)
{
o |= (((v & (1 << s)) >> s) << d);
}

And, of course, that algorithm can be optimized based on relative
values of s and d, and if s is 0, etc.

However, there also exists a minimal set of steps to complete that
swizzling because in the operation above, bits 5 and 4 are used in
sequence. It could be done using a swizzle2() operation, for example
and handle 2 bits at a time.

Are there any existing algorithms which examine the operations that
must be conducted and the optimized / minimal sequence of steps to
conduct it?

Thank you in advance.

--
Rick C. Hodgin
 
On 05/09/2020 01:33, Rick C. Hodgin wrote:

-----
Are there any algorithms which take a known-at-compile-time sequence
of bitwise operations on an 8-bit to 64-bit quantity, and optimize
them down to their minimal set of operations?

For example, if I have an 8-bit byte and I want to swizzle the bits
thusly:

    Input:   07 06 05 04 03 02 01 00
   Output:   05 04 07 02 01 03 00 06

If it\'s just 8-bit bytes, then it\'s easy: use any method to build a map
that translates one 8-bit value to its swizzled version. I assume it
will be used many millions of times.

I can easily swizzle the data using a brute force method:

    v = get_value();
    o = 0;
    swizzle1(o, v, 0, 6);
    swizzle1(o, v, 1, 0);
    swizzle1(o, v, 2, 3);
    swizzle1(o, v, 3, 1);
    swizzle1(o, v, 4, 2);
    swizzle1(o, v, 5, 7);
    swizzle1(o, v, 6, 4);
    swizzle1(o, v, 7, 5);

    // Untested, off the top of my head
    void swizzle(unsigned char& o, unsigned char v, int s, int d)
    {
        o |= (((v & (1 << s)) >> s) << d);
    }

It took me a while to figure out what this means, which appears to be
take bit s of v, and store it as bit d of o. (But o has to start at 0
since |= means it can\'t change a 1 in o to 0.)

In my language it would be simply this:

o.[d] := v.

although there, it can write both 1s and 0s into o.

And, of course, that algorithm can be optimized based on relative
values of s and d, and if s is 0, etc.

However, there also exists a minimal set of steps to complete that
swizzling because in the operation above, bits 5 and 4 are used in
sequence.  It could be done using a swizzle2() operation, for example
and handle 2 bits at a time.

Are there any existing algorithms which examine the operations that
must be conducted and the optimized / minimal sequence of steps to
conduct it?

The algorithm that works on what, bits of C++ code? What is its input?
What are the aims: to make it fast? Will the mapping always be known
constants, or variables? Will all bits be translated or just some?

And, what would be its output?

If it\'s just reordering the bits in a word (where you don\'t map, for
example, both bits 5 and 7 to bit 3), that doesn\'t sound too hard
(although it probably is!). But importantly, this isn\'t done in-place.

Up to 8 or 16 bits, probably you can use tables, but the tables need to
be set up.

Otherwise the simplest thing I can think of, for an N-bit word, is a
table of translation pairs. For your 8-bit example this will just be the
parameters to your swizzle function, but as data:

(0,6)
(1,0)
(2,3)
(3,1)
(4,2)
(5,7)
(6,4)
(7,5)

This be input to a routine that will do the job, or it can be the input
to an algorithem that generates, for example, one giant logical expression.

I can probably just about do that last part, if it doesn\'t need to be
optimised, example:

y = ((x&1)<<6)|((x&2)>>1)|((x&4)<<1)|((x&8)>>2)|
((x&16)>>2)|((x&32)<<2)|((x&64)>>2)|((x&128)>>2);

where x is the input word, and y is the output word. I did this with a
12-line script based on that data input. (Not tested; may need extra
parens.)

I\'m sure gcc-O3 will find some optimisations in there if there are any.
 
On 9/6/20 3:31 AM, davidlovemore@gmail.com wrote:
On Saturday, September 5, 2020 at 5:45:43 PM UTC+1, Rick C. Hodgin wrote:
Are there any algorithms which take a known-at-compile-time sequence
of bitwise operations on an 8-bit to 64-bit quantity, and optimize
them down to their minimal set of operations?

There\'s some good resources on bit permutations including this online calculator here:
http://programming.sirrida.de/calcperm.php

Source code for a more sophisticated permutation code generator is also available linked from the page.

Some notes here on the online generator:
http://programming.sirrida.de/bit_perm.html#calculator

Exactly what was I looking for. Thank you.

--
Rick C. Hodgin
 
On 9/4/2020 7:33 PM, Rick C. Hodgin wrote:
I\'m not sure where to ask this question, so I pushed it out to several
groups.  You need not reply to all of them if you don\'t think it is a
topical subject.

I included comp.compilers in a previous message, but it apparently holds
the message until it passes moderation.  So, I\'ve not included it here.
A duplicate message may post if/when the comp.compilers moderator John
Levine approves it.

-----
Are there any algorithms which take a known-at-compile-time sequence
of bitwise operations on an 8-bit to 64-bit quantity, and optimize
them down to their minimal set of operations?

For example, if I have an 8-bit byte and I want to swizzle the bits
thusly:

    Input:   07 06 05 04 03 02 01 00
   Output:   05 04 07 02 01 03 00 06

I can easily swizzle the data using a brute force method:

    v = get_value();
    o = 0;
    swizzle1(o, v, 0, 6);
    swizzle1(o, v, 1, 0);
    swizzle1(o, v, 2, 3);
    swizzle1(o, v, 3, 1);
    swizzle1(o, v, 4, 2);
    swizzle1(o, v, 5, 7);
    swizzle1(o, v, 6, 4);
    swizzle1(o, v, 7, 5);

    // Untested, off the top of my head
    void swizzle(unsigned char& o, unsigned char v, int s, int d)
    {
        o |= (((v & (1 << s)) >> s) << d);
    }

And, of course, that algorithm can be optimized based on relative
values of s and d, and if s is 0, etc.

However, there also exists a minimal set of steps to complete that
swizzling because in the operation above, bits 5 and 4 are used in
sequence.  It could be done using a swizzle2() operation, for example
and handle 2 bits at a time.

Are there any existing algorithms which examine the operations that
must be conducted and the optimized / minimal sequence of steps to
conduct it?

Thank you in advance.

https://www.techopedia.com/definition/22413/swizzling

--
Copyright 2020 Pete Olcott
 
On 9/8/2020 11:25 AM, olcott wrote:
On 9/4/2020 7:33 PM, Rick C. Hodgin wrote:

I\'m not sure where to ask this question, so I pushed it out to several
groups.  You need not reply to all of them if you don\'t think it is a
topical subject.

I included comp.compilers in a previous message, but it apparently holds
the message until it passes moderation.  So, I\'ve not included it here.
A duplicate message may post if/when the comp.compilers moderator John
Levine approves it.

-----
Are there any algorithms which take a known-at-compile-time sequence
of bitwise operations on an 8-bit to 64-bit quantity, and optimize
them down to their minimal set of operations?

For example, if I have an 8-bit byte and I want to swizzle the bits
thusly:

     Input:   07 06 05 04 03 02 01 00
    Output:   05 04 07 02 01 03 00 06

I can easily swizzle the data using a brute force method:

     v = get_value();
     o = 0;
     swizzle1(o, v, 0, 6);
     swizzle1(o, v, 1, 0);
     swizzle1(o, v, 2, 3);
     swizzle1(o, v, 3, 1);
     swizzle1(o, v, 4, 2);
     swizzle1(o, v, 5, 7);
     swizzle1(o, v, 6, 4);
     swizzle1(o, v, 7, 5);

     // Untested, off the top of my head
     void swizzle(unsigned char& o, unsigned char v, int s, int d)
     {
         o |= (((v & (1 << s)) >> s) << d);
     }

And, of course, that algorithm can be optimized based on relative
values of s and d, and if s is 0, etc.

However, there also exists a minimal set of steps to complete that
swizzling because in the operation above, bits 5 and 4 are used in
sequence.  It could be done using a swizzle2() operation, for example
and handle 2 bits at a time.

Are there any existing algorithms which examine the operations that
must be conducted and the optimized / minimal sequence of steps to
conduct it?

Thank you in advance.


https://www.techopedia.com/definition/22413/swizzling

glTexParameteri(Target, GL_TEXTURE_SWIZZLE_R, Format.Swizzle[0]);
glTexParameteri(Target, GL_TEXTURE_SWIZZLE_G, Format.Swizzle[1]);
glTexParameteri(Target, GL_TEXTURE_SWIZZLE_B, Format.Swizzle[2]);
glTexParameteri(Target, GL_TEXTURE_SWIZZLE_A, Format.Swizzle[3]);
https://gli.g-truc.net/0.7.0/index.html

--
Copyright 2020 Pete Olcott
 
(Too many xpost groups, had to remove one)

Rick,

> o |= (((v & (1 << s)) >> s) << d);

If you reverse the two tables (having the output bits in order from high to
low) you could left-shift the output by one and than OR the output with the
right-shifted input masked with 1. In this specific case (all eight bits
swizzeled) you do not even need to clear the output.

My C(++) isn\'t worth anything, but I imagine it could look something like
this :

o <<= 1
o |= (v >> s) & 1

That takes, at least on a x86, 4 machine instructions per bit.

The thing with writing C(++) that should work /everywhere/ is that you can\'t
use optimalisations for a specific processor.

If you would use the x86, which has instructions that use the Carry flag as
the ninth bit, you would only need 2 machine instructions per bit (rotate
desired source bit into carry, rotate carry bit into target).

And a remark : you\'ve made that \"swizzle\" a function. Which means you will
probably have a relative large overhead, possibly doubling if not tripling
the ammount of instructions executed for each bit extraction and insertion
(starting with the \"call\" and \"return\"). Rewiting it as a macro would
probably be a good idea.

In the case of an x86 and some smart-ass usage of its nine-bit rotate
instructions means that a full 8 bit swizzle uses only 16 instructions (or
17 if you want the source to be unchanged afterwards) - both code /and/
execution ...

Regards,
Rudy Wieser
 
On 08/09/2020 20:26, R.Wieser wrote:
(Too many xpost groups, had to remove one)

Rick,

o |= (((v & (1 << s)) >> s) << d);

If you reverse the two tables (having the output bits in order from high to
low) you could left-shift the output by one and than OR the output with the
right-shifted input masked with 1. In this specific case (all eight bits
swizzeled) you do not even need to clear the output.

My C(++) isn\'t worth anything, but I imagine it could look something like
this :

o <<= 1
o |= (v >> s) & 1

That takes, at least on a x86, 4 machine instructions per bit.

The thing with writing C(++) that should work /everywhere/ is that you can\'t
use optimalisations for a specific processor.

If you would use the x86, which has instructions that use the Carry flag as
the ninth bit, you would only need 2 machine instructions per bit (rotate
desired source bit into carry, rotate carry bit into target).

A good compiler can sometimes (not always) do that kind of thing in the
generated code, if it recognises the pattern. Often, however, these
kind of small instructions are effectively free on x86 processors as you
wait for memory (of course that depends very heavily on the rest of the
code and how you are using this function).

And a remark : you\'ve made that \"swizzle\" a function. Which means you will
probably have a relative large overhead, possibly doubling if not tripling
the ammount of instructions executed for each bit extraction and insertion
(starting with the \"call\" and \"return\"). Rewiting it as a macro would
probably be a good idea.

Such a function would be (or should be) inlined by the compiler - using
an inline function is almost always a better choice than a macro if you
don\'t /need/ it to be a macro.

In the case of an x86 and some smart-ass usage of its nine-bit rotate
instructions means that a full 8 bit swizzle uses only 16 instructions (or
17 if you want the source to be unchanged afterwards) - both code /and/
execution ...

Regards,
Rudy Wieser
 
On 9/8/20 2:26 PM, R.Wieser wrote:
(Too many xpost groups, had to remove one)

Rick,

o |= (((v & (1 << s)) >> s) << d);

If you reverse the two tables (having the output bits in order from high to
low) you could left-shift the output by one and than OR the output with the
right-shifted input masked with 1. In this specific case (all eight bits
swizzeled) you do not even need to clear the output.

My C(++) isn\'t worth anything, but I imagine it could look something like
this :

o <<= 1
o |= (v >> s) & 1

That takes, at least on a x86, 4 machine instructions per bit.

The thing with writing C(++) that should work /everywhere/ is that you can\'t
use optimalisations for a specific processor.

I appreciate your response, Rudy. I\'m writing my own CAlive language,
and my request for an algorithm is to have my internal intermediate
language able to then generate optimum code sequences for a target ISA
given the constraints of that ISA.

It\'s a fairly complex task which is why I reached out for it. I\'m
adding swizzle operator support so you can name an operator and apply
inlined swizzle operation to data (as an operator and not as a func-
tion call).

And a remark : you\'ve made that \"swizzle\" a function. Which means you will
probably have a relative large overhead, possibly doubling if not tripling
the ammount of instructions executed for each bit extraction and insertion
(starting with the \"call\" and \"return\"). Rewiting it as a macro would
probably be a good idea.

It was an example only of a worst-case scenario for swizzling data. I\'m
sorry that wasn\'t more clear. :)

--
Rick C. Hodgin
 
Rick,

> I appreciate your response, Rudy. I\'m writing my own CAlive language,

I already thought I recognised your name (and its weight around here), and
by it wondered if I could tell you anything at all.

and my request for an algorithm is to have my internal intermediate
language able to then generate optimum code sequences for a target ISA
given the constraints of that ISA.

It seems to me that I concentrated on the wrong thing. Thanks for not
biting my head off. :)

Regards,
Rudy Wieser
 

Welcome to EDABoard.com

Sponsor

Back
Top