Optimizing Software for Lower Power Consumption
By Vit Stepanek of STMicroelectronics
Low Power MCU Design Series
Introduction
How do you reach the lowest power consumption with a microcontroller? The first step is to understand the difference between current consumption and energy usage.
Let’s compare two hypothetical MCUs. MCU A is 16-bit and MCU B is 32-bit.
                          MCU A     MCU B
Active mode current       2 mA      2.1 mA
Low Power mode current    1 uA      0.6 uA
If we only look at active mode current consumption, MCU A leads by a small margin. If we compare Low Power mode, however, MCU B turns out to be the clear winner. Obviously, this is a very basic comparison of power consumption between two MCUs.
We can obtain a more accurate comparison by examining the energy each MCU consumes in 1 second of operation (the figures below assume a nominal 1 V supply, so charge in uC corresponds directly to energy in uJ). Once again, we must specify certain conditions before making any calculations. In the first scenario, let's assume that the MCU stays in active mode for 10% of the time, with the rest in low-power mode. In the second scenario, the MCU spends only 0.1% of the time in active mode. As before, MCU A is the winner in scenario 1, but loses out to MCU B in scenario 2.
                            MCU A       MCU B
Energy (10% active mode)    200.9 uJ    210.5 uJ
Energy (0.1% active mode)   3 uJ        2.7 uJ
The amount of time an MCU spends in the various power-down modes is known as the power profile of the application and is critical for calculating overall energy usage and for benchmarking. Unfortunately, there is no universal rule for what a power profile should look like; it depends on the actual application in which the MCU is used.
Taking the calculations a step further, let’s assume that MCU B has a more powerful core (higher CoreMark score) and it accomplishes the same 1 second task with less time in active mode, which translates into less energy consumed.
                    MCU A       MCU B
Active mode time    10%         7%
Energy used         200.9 uJ    147.6 uJ
As the table shows, MCU B is now the clear winner. The purpose of this hypothetical illustration is to show the importance of MCU active time in total energy consumption over a system’s lifetime. If we reduce the time spent in an MCU’s active mode, we can significantly reduce the total energy consumption. There are several ways to accomplish this goal through the use of compiler, software, and hardware optimizations.
Compiler Optimization
The easiest way for a design engineer to reduce current consumption is through the code optimization offered by the compiler. Since the goal is to minimize the time the MCU spends in active mode, selecting compiler optimization for speed is the right course of action. If we look at the output of a well-established compiler, the settings for maximum speed and minimum size produce vastly different code, which results in a significant difference in the energy consumed by the MCU.
Optimization    Size of test code    Energy consumed
Max Speed       5456 B               40.6 uJ
Min Size        3944 B               77.6 uJ
In this test, the MCU spends up to 10% of its time in active mode executing the calculations of a PID regulator loop using floating-point numbers. Compilers vary in their optimization features and implementation. Though most are good at what they do, a compiler should never be blindly trusted in a critical low-power application. Know which compiler features are good for low power and which are not, and always check the compiler output.
Software Optimization
Consider an application in which the MCU, running at 16 MHz, is woken up every second to perform a task that is 800 instructions long. One of the design criteria is that the device must operate for up to 5 years from a single battery. If you were able to eliminate just a single instruction (assuming one clock cycle) from this piece of code, the resulting energy saving would allow the system to be powered for an additional 2.3 days over the 5-year lifecycle of the product. Although one instruction and 2.3 days over 5 years might not seem significant, the focus should always be on the parts of the code that force the MCU into active mode, and on optimizing that code for speed. This is also, in effect, optimizing for high performance.
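As a sanity check, the arithmetic behind that figure can be reproduced. The sketch below assumes an active current of 2 mA and a sleep current of 0.1 uA (illustrative values, not taken from any specific datasheet); with those numbers, one clock cycle saved per one-second wake-up buys roughly 2.3 extra days of sleep over five years:

```c
#include <assert.h>

/* Extra days of battery life gained by saving one clock cycle per
 * wake-up. All parameters are hypothetical:
 *   f_hz       - core clock frequency
 *   wakes_per_s - wake-ups per second
 *   years      - product lifetime
 *   i_active_a - active-mode current (A)
 *   i_sleep_a  - sleep-mode current (A)
 */
double extra_days(double f_hz, double wakes_per_s, double years,
                  double i_active_a, double i_sleep_a)
{
    double wakes   = wakes_per_s * years * 365.0 * 24.0 * 3600.0;
    double t_saved = wakes / f_hz;                  /* active seconds saved */
    double q_saved = t_saved * (i_active_a - i_sleep_a); /* charge, coulombs */
    double t_sleep = q_saved / i_sleep_a;           /* extra sleep seconds  */
    return t_sleep / (24.0 * 3600.0);
}
```

With f = 16 MHz, one wake-up per second, 5 years, 2 mA active and 0.1 uA sleep, this comes out at about 2.28 days, matching the figure quoted above.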
Even efficient compilers often miss a variety of techniques and tricks for optimizing code for low power. For example, if there is plenty of memory, avoid using "small/short" variables. Shrinking a variable to one byte saves no power, because the entire memory array must be powered regardless; and although it saves one byte of data memory, it typically costs additional instructions, since the compiler has to keep the value within its narrow range. On 32-bit MCU architectures such as ARM, it is simply not efficient to work with single-byte variables: the registers are 32 bits wide and there is enough memory to use them.
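A minimal illustration of the point, assuming a 32-bit target: the two routines below differ only in the width of the loop counter. On a Cortex-M class core, the uint8_t counter typically forces the compiler to emit extra masking or extension instructions (such as UXTB) to keep the value within 8 bits, while the natural-width counter maps directly onto a core register. Both are functionally identical:

```c
#include <assert.h>
#include <stdint.h>

/* Narrow counter: i must wrap at 256, so extra instructions are
 * needed on a 32-bit core to enforce 8-bit behavior. */
uint32_t sum_u8_counter(const uint32_t *data, uint8_t n)
{
    uint32_t sum = 0;
    for (uint8_t i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* Natural-width counter: fits a 32-bit register with no masking. */
uint32_t sum_u32_counter(const uint32_t *data, uint32_t n)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < n; i++)
        sum += data[i];
    return sum;
}
```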
In another example, the modulo operator returns the remainder of an integer division. In most MCU instruction sets, the division instruction takes many clock cycles to execute, and on cores without a hardware divider it must be emulated in software. Replacing the division with more, but simpler, single-cycle instructions actually saves time and therefore power.
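A common instance is a circular-buffer index. The two functions below produce the same sequence, but the second avoids the division hidden inside the modulo operator, replacing it with a compare and a conditional reset (buffer size and names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define BUF_SIZE 8u  /* hypothetical circular-buffer length */

/* Division-based: (i + 1) % BUF_SIZE compiles to a divide, or to a
 * library call on cores without a hardware divider. */
uint32_t next_index_mod(uint32_t i)
{
    return (i + 1u) % BUF_SIZE;
}

/* Division-free: an increment, a compare, and a conditional reset,
 * each a simple single-cycle instruction. */
uint32_t next_index_cmp(uint32_t i)
{
    i++;
    if (i >= BUF_SIZE)
        i = 0u;
    return i;
}
```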
Another well-known technique for speed optimization is in-lining. To in-line a function means to insert its body at every place the function is called, rather than generating code to call the function. Consider three functions, Function A, Function B, and Function C: Function A is executed, Function B is called from within its body, and Function C is in turn called from within Function B, with a return at the end of each. By in-lining, at least two calls and two returns are removed, with a corresponding increase in code size but a decrease in execution time. Most compilers perform in-lining automatically, but only if the corresponding optimization settings are enabled.
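The call chain just described can be sketched as follows (the function bodies are placeholders; only the structure matters). The second version shows what the compiler effectively produces once B and C are folded into A, with the two inner calls and returns gone:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical call chain: A calls B, which calls C. */
static int32_t function_c(int32_t x) { return x + 1; }
static int32_t function_b(int32_t x) { return function_c(x) * 2; }
int32_t function_a(int32_t x)        { return function_b(x) - 3; }

/* The same computation after in-lining B and C into A: two calls
 * and two returns are eliminated at the cost of a larger body. */
static inline int32_t function_a_inlined(int32_t x)
{
    return ((x + 1) * 2) - 3;
}
```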
The next widely used method is loop-unrolling, whose main purpose is to decrease the number of conditional branches. Suppose we want to copy a 64-byte array. If the loop body is replicated eight times, the code size grows, but the loop executes only eight times instead of sixty-four, removing fifty-six conditional jumps. Check the final code generated by the compiler, because it usually applies only a certain level of unrolling. The most efficient approach with regard to execution speed (though certainly not code size) would be to unroll the whole loop in this case.
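A sketch of the 64-byte copy, rolled and unrolled by a factor of eight (both versions produce identical results; the names are illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LEN 64u

/* Plain loop: 64 iterations, 64 end-of-loop conditional branches. */
void copy_rolled(uint8_t *dst, const uint8_t *src)
{
    for (uint32_t i = 0; i < LEN; i++)
        dst[i] = src[i];
}

/* Unrolled by 8: only 8 iterations, so 56 conditional branches are
 * removed at the cost of a larger loop body. */
void copy_unrolled(uint8_t *dst, const uint8_t *src)
{
    for (uint32_t i = 0; i < LEN; i += 8u) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
        dst[i + 4] = src[i + 4];
        dst[i + 5] = src[i + 5];
        dst[i + 6] = src[i + 6];
        dst[i + 7] = src[i + 7];
    }
}
```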
Besides variable types, instruction usage, and code flow, the arithmetic inside your code can also be optimized. The next example is a typical, and perhaps the simplest, case of common subexpression elimination. First, sum variables a, b, and c, and store the result in variable x (two ADD instructions). Then sum variables a, b, and d, and store the result in variable y (another two ADD instructions). The sum of a and b can instead be computed once and reused, adding variable c to this intermediate result in the first case and variable d in the second. This saves at least one ADD instruction. Compilers usually do this automatically, and advanced compilers can optimize much more complex equations and mathematical expressions. Certain situations, however, require calling mathematical functions from the standard library.
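The elimination described above, written out (the variable and function names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Naive: four ADDs, because a + b is computed twice. */
void sums_naive(int32_t a, int32_t b, int32_t c, int32_t d,
                int32_t *x, int32_t *y)
{
    *x = a + b + c;
    *y = a + b + d;
}

/* With the common subexpression factored out: three ADDs. */
void sums_cse(int32_t a, int32_t b, int32_t c, int32_t d,
              int32_t *x, int32_t *y)
{
    int32_t t = a + b;  /* computed once, reused twice */
    *x = t + c;
    *y = t + d;
}
```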
An example of calling mathematical functions from the standard library is the calculation of a geometric mean, especially in single-precision floating point. To calculate the geometric mean of six values, you take the sixth root of their product, and the sixth root can be decomposed as the square root of the cube root of the argument, which uses the optimized math routines.
What is this good for?
Let’s check the available functions in the standard library. To raise the argument to any power, you could use the standard function “pow”, but what happens if you use the cube-root and square-root sequence instead? The cube- and square-root functions are preferable because their algorithms are well optimized, whereas the standard “pow” function implements a universal algorithm that can calculate any power of a given number.
What is the final result?
The execution time of the “powf” function is three times longer than that of the optimized cube-root and square-root sequence, measured with speed optimizations enabled on a state-of-the-art commercial compiler. This gain is mostly down to the standard library implementation, although a sufficiently smart compiler might also recognize such a calculation and substitute the more efficient functions.
Optimization of Peripherals
In advanced microcontrollers, there can be additional bus masters besides the core. The most powerful of these are DMA controllers, which can automatically handle simple tasks that would otherwise need to be handled by polling or interrupts. In this basic example, the ADC measures the signal on an analog input pin, the sampled values are processed, and the results are sent out over a USART/SPI communication peripheral.
STEP 1 – SAMPLING:
ADC and DMA are ON
ADC samples the preconfigured analog input channel
DMA stores the converted values in the embedded SRAM
(CORE is kept SLEEPING during this period)
STEP 2 – PROCESSING:
ADC and DMA are turned OFF
CORE is woken up and the sampled data are processed (averaging, geometric mean, or any other filtering…)
(CORE is RUNNING during this period)
STEP 3 – SENDING:
USART/SPI and DMA are turned ON
Processed data are transmitted over USART/SPI
DMA feeds the data stored in SRAM to the communication peripheral
(CORE is kept SLEEPING during this period)
While the DMA is operating, the core can perform other tasks or be halted, so the average power consumption and the total amount of energy are significantly decreased. Sampling and sending may account for the largest portion of the application's execution profile, so keeping the core asleep during those phases yields the biggest savings.
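The three steps above can be sketched in code. The hardware interaction is stubbed out with hypothetical functions (a real application would configure the ADC, DMA, and USART/SPI through the vendor's HAL, and the core would enter a sleep mode while the DMA runs); only the overall structure is the point here:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_SAMPLES 8u

/* STEP 1 - stub: pretend the ADC + DMA filled the SRAM buffer while
 * the core was sleeping. (Hypothetical function; real code would use
 * the vendor HAL and a wait-for-interrupt sleep.) */
static void adc_dma_sample(uint16_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = (uint16_t)(100u + i);  /* fake converted values */
}

/* STEP 3 - stub: the DMA would feed the peripheral from SRAM while
 * the core sleeps again. */
static void uart_dma_send(uint16_t value)
{
    (void)value;
}

/* STEP 2 - the only part the core is actually awake for. */
uint16_t process_average(const uint16_t *buf, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += buf[i];
    return (uint16_t)(sum / n);
}

uint16_t run_cycle(void)
{
    uint16_t samples[N_SAMPLES];
    adc_dma_sample(samples, N_SAMPLES);                 /* core sleeps */
    uint16_t avg = process_average(samples, N_SAMPLES); /* core runs   */
    uart_dma_send(avg);                                 /* core sleeps */
    return avg;
}
```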
Conclusion
Take advantage of everything the architecture and the compiler offer, and stay aware of their drawbacks. By squeezing energy savings out of the architecture, the peripherals, the compiler, and every clock cycle of your code, you can reach the lowest power consumption at the highest performance.