Key words: Struct and union, bit manipulation using union, bit-fields in C structs
Topics at a glance:
- The wonders of union:
- Vagaries of behavior
- Same memory, different perspectives
- Elegant use of unions for mitigating bit manipulation
- The struct datatype and its mysterious bit-field
Struct and Union
After arrays and their inherent address abstraction mechanisms, I’ll now turn to unions and structures. These are also composite data types in C. Unions and structures can hold members of different data types, whereas all the elements of an array must be of the same data type. Let’s start with unions.
Unions are sometimes described as super variables. As far as I know, the union is the only data structure defined by the C language whose behavior changes over time. At one point a union can behave as an integer, at another point the same union can behave as a float or a char, and sometimes even as an array or a structure. That is why I say that unions show multiple personalities. Let me demonstrate this with an example.
```c
typedef union
{
    int i;
    char c;
    float f;
} my_union;

int main(void)
{
    my_union u1;

    u1.i = 5;      // from here onwards u1 behaves as an integer, until the next assignment
    // ...
    u1.f = 2.57F;  // from here onwards u1 behaves as a float, until the next assignment
    // ...
    u1.c = 'a';    // from here onwards u1 behaves as a character, and so on...
    // ...
    return 0;
}
```
In the above example, once you assign 2.57F to member ‘f’, the integer value previously stored through ‘i’ is overwritten by the IEEE 754 floating point representation of 2.57F. In other words, once you assign valid data to any member of the union, the union behaves as a variable of that member’s data type until the next assignment. Reading any other member at this point, such as ‘i’ after assigning a float value to ‘f’, or ‘f’ after assigning a char value to ‘c’, just reinterprets whatever bytes happen to be there and rarely gives a meaningful value. The reason for this is actually the most important feature of a union.
The members of a union always occupy the same block of memory, regardless of which members are declared inside it.
The below code demonstrates this. I have declared a union type as follows:
```c
typedef union
{
    int i;
    float f;
    char c;
} my_union;
```
I have written the following code to manipulate this union:
```c
#include <stdio.h>

// my_union is the typedef declared above

int main(void)
{
    my_union u1;
    unsigned char *byte;   // unsigned, so bytes >= 0x80 are not sign-extended
    size_t b_count;

    // use as integer
    puts("As integer");
    u1.i = 5;
    byte = (unsigned char *)&u1.i;
    for (b_count = 0; b_count < sizeof(my_union); ++b_count)
    {
        printf("Byte %u : 0x%X\n", (unsigned)b_count, (unsigned)byte[b_count]);
    }

    // use as float
    puts("As float");
    u1.f = 5.125F;
    byte = (unsigned char *)&u1.f;
    for (b_count = 0; b_count < sizeof(my_union); ++b_count)
    {
        printf("Byte %u : 0x%X\n", (unsigned)b_count, (unsigned)byte[b_count]);
    }

    // use as char
    puts("As char");
    u1.c = 'A';
    byte = (unsigned char *)&u1.c;
    for (b_count = 0; b_count < sizeof(my_union); ++b_count)
    {
        printf("Byte %u : 0x%X\n", (unsigned)b_count, (unsigned)byte[b_count]);
    }

    puts("\nWhere is the union 'u1' stored ?\n");
    // %p with a void * is the portable way to print an address
    printf("Address of u1.i is %p\n", (void *)&u1.i);
    printf("Address of u1.f is %p\n", (void *)&u1.f);
    printf("Address of u1.c is %p\n", (void *)&u1.c);
    return 0;
}
```
The above code produced the following result when run (the address value and its exact format will differ between runs and platforms):
As integer
Byte 0 : 0x5
Byte 1 : 0x0
Byte 2 : 0x0
Byte 3 : 0x0
As float
Byte 0 : 0x0
Byte 1 : 0x0
Byte 2 : 0xA4
Byte 3 : 0x40
As char
Byte 0 : 0x41
Byte 1 : 0x0
Byte 2 : 0xA4
Byte 3 : 0x40
Where is the union 'u1' stored ?
Address of u1.i is 0x461CC44
Address of u1.f is 0x461CC44
Address of u1.c is 0x461CC44
- When used as an integer, printing the contents of the union shows how 0x5 is stored in 4 bytes of memory on a typical little-endian system, i.e. least significant byte first. (Refer to chapter 1 for more details on how an integer is stored.)
- When used as a float, it shows how 5.125F is stored in IEEE-754 single precision format on a little-endian system. (Refer to chapter 1 for more details on how an IEEE 754 float is stored.)
- When used as a character, it shows how ‘A’ (the ASCII value of ‘A’ is 0x41) is stored in memory.
- The results also show that once the union is used as a char, reading it as 4-byte data gives undesirable results. In the output for char, bytes 2 and 3 of the union still hold remnants of the older float value (byte 1 was already 0x0 in the float, so it only looks cleared). This happens because all the members of the union are stored in exactly the same memory (here, 0x461CC44).
- Unions are not self-managing. The programmer must keep track of the current state of the union and treat it as the data type appropriate to that state. In the above example, after copying the character ‘A’ into the union, the code should not use the union as a 4-byte data type. If you want to change the type, assign a value to a member of the suitable data type first.
- The language just gives you a facility to use the same memory for storing different types of data, but of course not at the same time. In the course of execution the union’s behavior changes, i.e. it takes the shape of the last assigned data type.
It is the programmer’s responsibility to use the union wisely!
Bit manipulation using union
I will now explain a typical use case where a union is the most appropriate choice for a programmer. Any guesses?
If you are an embedded programmer, it has probably already struck you. I am talking about manipulating the register values of a typical processor.
Bit-fields in C structs: A use-case
Using a union in conjunction with a structure’s bit-fields is a very powerful programming idiom in the embedded world!
Please see the below declaration of a union.
Note: Bit ordering and alignment are implementation (underlying platform) dependent. Understand the attributes of your target platform, such as byte alignment and byte ordering, and your compiler’s support for bit-fields. Here, the target platform is a 32-bit Intel x86 CPU and the code is compiled with gcc 7.4.0.
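```c
// Plain char is used here. Whether it is signed is implementation-defined,
// which is why the code that prints these values masks them first.
// Bit-fields of type char are also implementation-defined; gcc accepts them.
typedef char byte;
#define REG8_MASK 0x000000FFU

typedef union register_8_bits__
{
    byte value;          // the whole register

    struct bits__
    {
        byte b0 : 1;
        byte b1 : 1;
        byte b2 : 1;
        byte b3 : 1;
        byte b4 : 1;
        byte b5 : 1;
        byte b6 : 1;
        byte b7 : 1;
    } bits;              // the same register, one bit at a time

    struct nibbles__
    {
        byte low  : 4;
        byte high : 4;
    } nibbles;           // the same register, one nibble at a time
} register_8_bits;
```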
See how easily and naturally the code manipulates the register contents as bytes, bits and nibbles!
```c
#include <stdio.h>

// register_8_bits and REG8_MASK are declared above

int main(void)
{
    register_8_bits my_reg;

    my_reg.value = 0x00;  // clear register
    printf("Register value : %X\n", my_reg.value & REG8_MASK);

    my_reg.bits.b2 = 1;   // sets b2
    my_reg.bits.b7 = 1;   // sets b7

    printf("Register value : %X\n", my_reg.value & REG8_MASK);
    printf("Register value low : %X\n", my_reg.nibbles.low & 0xFU);
    printf("Register value high : %X\n", my_reg.nibbles.high & 0xFU);
    return 0;
}
```
The result of the above code is as below:
Register value : 0
Register value : 84
Register value low : 4
Register value high : 8
Awesome, isn’t it?
Let’s understand how this happens. But before that, let me explain the significance of this idiom in the embedded world.
Usually programmers set and clear bits, manipulate nibbles and so on using the language’s bit manipulation operators, such as left shift <<, right shift >>, hexadecimal/binary masks such as 0xFF, and various combinations of these. In embedded programming, updating register contents is a routine task. Let us see how a C program manipulates bits using the typical C bit manipulation conventions.
```c
unsigned char my_reg = 0x00;  // 8 bit value, initialised so that the reads below are defined

// Assume that bits are packed with msb b7 on the left to lsb b0 on the right

// for setting bit 2 and not changing other bit positions
my_reg = my_reg | 0x04;       // 0x04 is 0000 0100 in binary

// for clearing bit 5
// 0x20 is 0010 0000 in binary;
// note that the ~ operator gives
// the number's 1's complement.
// ~0x20 => 1101 1111 in binary = 0xDF (looking at the low 8 bits)
my_reg = my_reg & ~(0x20);

// a different way using the bit shift << operation
// for setting b1
my_reg = my_reg | (0x01 << 1);

// or in a little cryptic way using C's shorthand notation
my_reg |= (0x01 << 1);
```
Now, using our union approach, the same operations can be done in a simpler and more natural way.
```c
#include <stdio.h>

// The same operations using the above union
int main(void)
{
    register_8_bits my_reg;

    // clearing register to all 0
    my_reg.value = 0x0;

    // for setting bit 2
    my_reg.bits.b2 = 1;
    printf("b2 set, Register value : 0x%X\n", my_reg.value & 0xFFU);

    // for clearing bit 2
    my_reg.bits.b2 = 0;
    printf("b2 cleared, Register value : 0x%X\n", my_reg.value & 0xFFU);

    // for setting b1
    my_reg.bits.b1 = 1;
    printf("b1 set, Register value : 0x%X\n", my_reg.value & 0xFFU);

    // for setting high nibble bits to '1110'
    my_reg.nibbles.high = 0xE;  // 0xE is 1110 in binary
    printf("nibble high is set to 0xE, Register value : 0x%X\n", my_reg.value & 0xFFU);
    // NOTE: The mask with 0xFF is required because the char is promoted to int
    // when passed to printf(). Where plain char is signed, a value with b7 set
    // is sign-extended and would otherwise print as 0xFFFFFFE2.

    // for setting the entire 8 bits to 0xFF
    my_reg.value = (byte)0xFF;
    printf("Value set to 0xFF, Register value : 0x%X\n", my_reg.value & 0xFFU);
    // NOTE: The mask with 0xFF is required here for the same reason.
    return 0;
}
```
Let us see the result of the above code using union approach:
b2 set, Register value : 0x4
b2 cleared, Register value : 0x0
b1 set, Register value : 0x2
nibble high is set to 0xE, Register value : 0xE2
Value set to 0xFF, Register value : 0xFF
Now, what do you say? Which approach is better? I would definitely choose the union approach over direct bit manipulation, as the union approach is more intuitive.
Just a few points on C’s bit-fields:
- Make sure that your C compiler supports bit-fields properly. Almost all mainstream C compilers, such as gcc, clang and MSVC, support them. Bit-fields are defined in the C standard and are a very handy tool at times.
- Byte ordering (endianness), bit ordering, alignment etc. are platform/implementation dependent, as I’ve mentioned above.
- Never try to take the address of a member declared as a bit-field. A bit-field need not start on a byte boundary, so it has no address of its own, and the C standard does not allow the ‘&’ (address-of) operator on it. Compilers reject such code with an error.
So what is happening under the hood with the above union, register_8_bits? The answer is simple: a union in C guarantees that the compiler allocates the exact same memory region for all the members declared inside it. Here, the 8-bit value, the member struct bits and the member struct nibbles are all allocated to the same memory region. With a struct’s bit-fields, the C compiler guarantees that only the ‘N’ bits specified after the ‘:’ are touched when that specific struct member is used in the code, i.e. bit-fields allow you to refer to specific bit positions of a variable.
So if you need to be precise about bit positions in your code, there is direct language support for it.
Place the members aptly inside a union along with a struct’s bit-fields, and every piece of the puzzle falls into place!
Enjoyed this chapter? Let me know in the comments below. Thanks! 🙂
When talking about type punning with unions one should really mention that this leads to undefined behaviour in general. In fact the advocated mechanism for bit manipulation violates strict aliasing rules. While this technique is common in practice, it only works because compilers make special exceptions here.
A similarly common technique (type punning via reinterpret_cast) has bitten many when compilers started to be more strict on aliasing rules in order to do more aggressive optimization.
Dear reader, you are right. I have mentioned the same in this chapter. One should be very careful while using unions for type punning. Also, bit manipulation, ordering, alignment etc. are dependent on the underlying platform/compiler. Thanks for the feedback. Do read the other chapters also.
Well articulated! Didn’t know unions are so useful this way when it comes to bit manipulations. Thank you.
Glad to know this dear reader. Hope you will read other chapters also.
So good!