A 6–140-nW 11 Hz–8.2-kHz DVFS RISC-V Microprocessor Using Scalable Dynamic Leakage-Suppression Logic

Daniel S. Truesdell, Student Member, IEEE, Jacob Breiholz, Student Member, IEEE, Sumanth Kamineni, Student Member, IEEE, Ningxi Liu, Student Member, IEEE, Albert Magyar, Student Member, IEEE, and Benton H. Calhoun, Senior Member, IEEE

Abstract—This letter presents a RISC-V microprocessor implemented using a proposed scalable dynamic leakage suppression (SDLS) logic style. Together with a custom adaptive clock generator and voltage scaling controller, the SDLS RISC-V microprocessor realizes a fully integrated modified dynamic voltage and frequency scaling (DVFS) scheme that enables nW-level performance flexibility for battery-less IoT sensing nodes in energy-scarce environments. At the nominal core VDD of 0.6 V, the core can scale its performance from 6 nW at 11-Hz operating frequency to 140 nW at 8.2-kHz operating frequency. Across the supply voltage range, the core is capable of delivering a minimum power of 840 pW, a maximum frequency of 41.5 kHz, and a minimum energy of 13.4 pJ/cycle.

Index Terms—Digital logic, dynamic voltage and frequency scaling (DVFS), IoT, ultralow power.

I. INTRODUCTION

Modern IoT sensor nodes enable continuous background monitoring of physiological and environmental signals, such as temperature, humidity, blood oxygen, ExG, air quality, and motion. These signals are typically low bandwidth and only need to be sampled and preprocessed at Hz–kHz rates [1] in order to detect and react to real-time events. Because of the low throughput requirements, sensing nodes in these applications can be powered directly from harvested energy, which provides a compact form factor and low-maintenance operation. However, ambient energy sources can fluctuate unpredictably and cause harvested power to drop to the nW level. Therefore, to maintain continuous battery-less sensing, a node must be able to dynamically scale its performance in response to changing energy-harvesting conditions, even at ultralow-power levels. For the above applications, this simply means sensing and processing as much data as possible with the available harvested power. Conventional static CMOS logic can be scaled via dynamic voltage and frequency scaling (DVFS), but the minimum power consumption of CMOS processors is limited to hundreds of nW due to leakage currents, with further reduction requiring sleep modes or total shutdown [2], [3]. Dynamic leakage suppression (DLS) logic [6], [7] was proposed to facilitate continuous operation at sub-nW power levels by reducing leakage. However, its performance is only weakly dependent on supply voltage, which renders traditional DVFS ineffective and limits the maximum operating frequency to the Hz range, severely limiting the node’s computational ability even if plenty of harvested power is available. Speed can be boosted in DLS gates by adding bypass switches that allow the gate to be dynamically reconfigured as CMOS logic [8], but this provides poor scaling granularity (2-point) since gates are forced to stay in the slow DLS mode unless harvested power is high enough to support full CMOS operation. Although dithering can improve performance flexibility in this case, the high peak power consumption during the CMOS on-state can rapidly deplete the energy reserve of a small storage capacitor, leaving the node without enough energy to transmit data and vulnerable to total shutdown if ambient energy availability suddenly decreases. As a result, the efficient nW-range performance scaling required by battery-less sensor nodes is currently unreachable (Fig. 1). To span the performance gap between CMOS and DLS logic, this letter proposes a performance-scalable DLS logic implementation co-designed with an adaptive clock generator (ACG) and a custom voltage scaling controller (VSC) that enable high-granularity DVFS at ultralow-power levels.

II. ARCHITECTURE AND DESIGN

A. System Architecture

Fig. 2 shows the system architecture. The SDLS processor implements a BottleRocket RISC-V core (RV32IMC, derived from the Rocket processor [9]), which interfaces to a custom 8-kb 6T SRAM macro and an uncore domain that includes the system interconnect, SPI master and GPIO peripherals, and a memory-mapped DVFS control register that allows the core to tune the DVFS mode and clock settings. The core and uncore logic are synthesized from an SDLS standard cell library with fully automated place and route, and use static CMOS-style clock buffers in order to minimize insertion delay and improve slew rate relative to SDLS inverters. The core, uncore, and SRAM all use a 0.6-V nominal supply voltage. The power and timing management subsystem operates at a 1.2-V supply and consists of a tunable reference current generator, a VSC, and an ACG.

B. Scalable Dynamic Leakage Suppression Logic

The proposed SDLS logic design, shown in Fig. 3, implements a new voltage-scaling approach in which two complementary bias voltages, \( V_{CN} \) and \( V_{CP} \), are used to transition the gate across a continuous range between DLS and CMOS operating regimes, trading off between leakage power and gate delay. Compared to traditional DVFS, this constant-\( V_{DD} \) approach eliminates the need for level shifters and avoids regulator efficiency losses caused by varying supply levels. In the SDLS logic gate, transistors \( M_{CN} \) and \( M_{CP} \) are added in parallel to the traditional DLS transistors \( M_{HN} \) and \( M_{FP} \), similar to the approach in [8]. In addition to the proposed bias-voltage approach for modulating gate performance, we introduce new design modifications to improve performance scalability and reduce gate area. First, the bodies of the internal pMOS transistors \( M_{PX} \) are tied to \( V_{DD} \), rather than to node \( n1 \), in order to save area by sharing internal n-wells with negligible performance degradation. Second, LVT devices are used for the external transistors \( M_{HN} \), \( M_{CN} \), \( M_{FP} \), and \( M_{CP} \), allowing increased sensitivity to \( V_{CN} \) and \( V_{CP} \) for a larger performance tuning range while simultaneously decreasing the large transistor sizes required in [7] and [8]. Fig. 4(a) demonstrates the transient operation of an SDLS inverter during a falling input transition. In the steady state when the input \( A \) is high and \( V_C \) is 0 V (\( V_{CN} = 0 \) V and \( V_{CP} = 1 \) V), the internal node \( n1 \) settles to \( V_{DD}/2 \), pushing \( M_{CN} \), \( M_{HN} \), and \( M_{PX} \) into super-cutoff mode (negative \( V_{GS} \)) and reducing leakage through the pull-up network. When \( A \) transitions low, \( M_{PX} \) turns on, allowing \( Y \) and \( n1 \) to converge. This creates positive feedback by increasing \( V_{GS} \) of \( M_{HN} \), allowing more current to leak through the pull-up network and further charging \( Y \) until it has fully transitioned. If \( V_C \) is increased (\( V_{CN} > 0 \) V and \( V_{CP} < 1 \) V), transistor \( M_{CN} \) pulls the steady-state voltage of \( n1 \) higher than \( V_{DD}/2 \). While this weakens the super-cutoff effect in the pull-up network and increases leakage, it also accelerates the convergence of \( Y \) and \( n1 \), allowing a quicker feedback response from \( M_{HN} \) that results in a shorter gate delay. In addition to speeding up \( M_{HN} \)’s feedback response, \( M_{CN} \) provides increased on-current throughout the transition, which further improves speed. The same functionality applies during a rising-edge input transition, with \( V_{CP} \) tuning the steady-state voltage of \( n2 \).
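
The complementary relationship between the two bias voltages can be sketched numerically. The snippet below is a minimal illustration rather than the authors' implementation: it assumes that \( V_{CN} \) equals \( V_C \) and that \( V_{CP} \) mirrors it about a 1-V full-scale value, which is consistent with the operating point quoted above (\( V_C = 0 \) V giving \( V_{CN} = 0 \) V and \( V_{CP} = 1 \) V). The function name `complementary_bias` and the parameter `v_full` are hypothetical.

```python
def complementary_bias(v_c, v_full=1.0):
    """Return (v_cn, v_cp) in volts for a control setting v_c.

    Assumption (not stated explicitly in the text): V_CN equals V_C and
    V_CP is its mirror image about the full-scale value v_full.
    """
    if not 0.0 <= v_c <= v_full:
        raise ValueError("v_c must lie between 0 and v_full")
    v_cn = v_c            # nMOS-side bias rises with V_C
    v_cp = v_full - v_c   # pMOS-side bias falls by the same amount
    return v_cn, v_cp


if __name__ == "__main__":
    for v_c in (0.0, 0.25, 0.5, 1.0):
        v_cn, v_cp = complementary_bias(v_c)
        print(f"V_C={v_c:.2f} V -> V_CN={v_cn:.2f} V, V_CP={v_cp:.2f} V")
```

With `v_c = 0` the gate sits at the DLS end of the range (lowest leakage, longest delay); raising `v_c` moves it toward CMOS-like operation, trading leakage for speed.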

C. Adaptive Clock Generator and Voltage Scaling Controller

The VSC, shown in Fig. 5(a), generates \( V_{CN} \) and \( V_{CP} \) (each up to a maximum voltage \( V_{DIV} \)) by selecting two complementary reference voltages from a gate leakage-based resistive divider and then driving the \( V_{CN} \) and \( V_{CP} \) nodes to the selected voltages with tunable bang–bang pA-level switched current sources. At the nominal \( V_{DD} \) of 0.6 V, a \( V_{DIV} \) of 1.0 V is used to overdrive \( V_{CN} \) and \( V_{CP} \) for a stronger cutoff effect. Fig. 5(b) shows the measured \( V_{CN} \) and \( V_{CP} \) waveforms while sweeping \( V_{SEL} \). The ACG, shown in Fig. 6, replicates the critical path delay of the SDLS core and tracks the voltage ripple on \( V_{CN} \) and \( V_{CP} \) to maintain the maximum operating frequency of the core. The replica path incorporates two main tuning methods that allow for variable tracking accuracy depending on the desired amount of tuning and calibration. First, the period and duty cycle are tunable by counting a programmable number of cycles of a three-stage SDLS ring oscillator that runs from the same \( V_{DD} \), \( V_{CN} \), and \( V_{CP} \) as the processor core. This number is separately tunable for the high and low levels of the clock output. Second, a bank of 12 different SDLS ring oscillator cells provides selectivity over the delay sensitivity to \( V_C \) to help ensure a close match between the replica path delay and the actual critical path delay across \(V_C\) mode and \(V_C\) ripple. Each oscillator cell is designed to achieve a different \(V_C\) sensitivity by using HVT devices with different widths for \(M_{CN}\) and \(M_{CP}\). The path can be calibrated at any number of \(V_C\) modes, with near-optimum tracking being achieved by tuning the replica path at each \(V_C\) mode for a given program. Fig. 6 shows the measurements for each oscillator cell (sweeping sel_osc), the minimum frequency tuning resolution, and the power of the ACG. Apart from the three-stage SDLS ring oscillator cells, the clock generator logic is implemented with standard CMOS gates to reduce delay overhead. Fig. 7 shows a transient measurement of the VSC and ACG, demonstrating a 100-ms transition time.
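
The count-based clock tuning described above can be illustrated with a short sketch. It assumes an idealized behavior in which the clock output stays high for `n_high` ring-oscillator cycles and low for `n_low` cycles, and it ignores the delay of the surrounding CMOS counter logic. The names `ClockSetting` and `clock_timing` and the example oscillator frequency are hypothetical and are not taken from the measured design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockSetting:
    """Programmable cycle counts for the high and low clock phases."""
    n_high: int
    n_low: int


def clock_timing(f_osc_hz, setting):
    """Return (clock frequency in Hz, duty cycle) for an idealized ACG.

    Assumption: one output period spans exactly n_high + n_low cycles of
    the ring oscillator, with no extra delay from the counter logic.
    """
    if f_osc_hz <= 0:
        raise ValueError("oscillator frequency must be positive")
    if setting.n_high < 1 or setting.n_low < 1:
        raise ValueError("cycle counts must be at least 1")
    total = setting.n_high + setting.n_low
    f_clk = f_osc_hz / total          # period = total oscillator cycles
    duty = setting.n_high / total     # fraction of the period spent high
    return f_clk, duty


if __name__ == "__main__":
    # Hypothetical 1-kHz oscillator; adding one count to the low phase
    # lengthens the period by exactly one oscillator cycle.
    for setting in (ClockSetting(3, 1), ClockSetting(3, 2)):
        f_clk, duty = clock_timing(1000.0, setting)
        print(setting, f"f_clk={f_clk:.1f} Hz, duty={duty:.2f}")
```

Because the oscillator shares \( V_{DD} \), \( V_{CN} \), and \( V_{CP} \) with the core, its frequency (`f_osc_hz` in this sketch) shifts along with the core's critical path delay, so a fixed pair of counts can keep tracking the core as \( V_C \) changes.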

III. MEASUREMENT RESULTS

The SDLS microprocessor test chip was fabricated in a 65-nm low-power process. Fig. 8 shows an annotated chip micrograph. Fig. 9 shows oscilloscope measurements of \(V_{CN}\), \(V_{CP}\), the adaptive clock output, and GPIO bits during runtime DVFS while executing a self-checking Fibonacci sequence program. The entire program, including DVFS instructions, was written in C and compiled with a standard open-source toolchain. To fully evaluate the SDLS core across supply voltage and DVFS mode, a simpler GPIO bit toggling

TABLE I

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>Technology</td>
<td>Core</td>
<td>System**</td>
<td>Core</td>
<td>System**</td>
<td>Core</td>
<td>System**</td>
</tr>
<tr>
<td></td>
<td>65-nm</td>
<td>90-nm</td>
<td>65-nm</td>
<td>90-nm</td>
<td>130-nm</td>
<td>45-nm</td>
</tr>
<tr>
<td>Architecture</td>
<td>RISC-V</td>
<td>RISC-V</td>
<td>ARM Cortex-M0+</td>
<td>ARM Cortex-M3</td>
<td>ARM Cortex-M0+</td>
<td>ARM Cortex M3</td>
</tr>
<tr>
<td></td>
<td>0.45mm²</td>
<td>0.87mm²</td>
<td>0.33mm²</td>
<td>0.24mm²</td>
<td>0.16mm²</td>
<td>0.83mm²</td>
</tr>
<tr>
<td>Operating Voltage</td>
<td>0.3V – 0.9V</td>
<td>0.2V – 1.1V</td>
<td>0.16V –1.5V</td>
<td>0.2V – 0.5V</td>
<td>0.5V – 1.0V</td>
<td>0.45V – 0.9V</td>
</tr>
<tr>
<td>Performance Scaling</td>
<td>Modified DVFS</td>
<td>Dual-Mode</td>
<td>None</td>
<td>None</td>
<td>DVFS + AVS</td>
<td>None</td>
</tr>
<tr>
<td>Scaling Granularity</td>
<td>7-point + dithering</td>
<td>2-point</td>
<td>None</td>
<td>None</td>
<td>None</td>
<td>None</td>
</tr>
<tr>
<td></td>
<td>(100% step)</td>
<td>(100% step)</td>
<td>(100% step)</td>
<td>(100% step)</td>
<td>(100% step)</td>
<td>(100% step)</td>
</tr>
<tr>
<td>Minimum Active Power</td>
<td>840pW @ 0.3V</td>
<td>4.66nW @ 0.3V</td>
<td>595pW @ 0.3V</td>
<td>127.1pW @ 0.3V</td>
<td>16.8µW @ 0.3V</td>
<td>346µW @ 0.3V</td>
</tr>
<tr>
<td></td>
<td>0.3V / 10kHz</td>
<td>0.45V / 2kHz</td>
<td>0.33V / 10kHz</td>
<td>0.2V / 800kHz</td>
<td>0.5V / 1kHz</td>
<td>0.45V / 40kHz</td>
</tr>
<tr>
<td>Minimum Energy</td>
<td>13.4pJ/cycle @ 0.5V, 2.0kHz</td>
<td>38.84pJ/cycle @ 0.5V, 2.0kHz</td>
<td>14.0pJ/cycle @ 0.45V, 19kHz</td>
<td>44.7pJ/cycle @ 0.45V, 19kHz</td>
<td>8.9µJ/cycle @ 0.37V, 13.7kHz</td>
<td>23.0µJ/cycle @ 0.3V, 3.5kHz</td>
</tr>
</tbody>
</table>

Fig. 12. Measured core DVFS tuning range versus $V_{DD}$.

Fig. 13. Measured performance in comparison with the state-of-the-art low-power processors.

program is used. Fig. 10 shows the measured power, energy, and frequency of the core and uncore for each of the DVFS modes while running at the nominal $V_{DD}$ of 0.6 V. At this $V_{DD}$, the SDLS core consumes 6-nW total power to run at 11 Hz in the minimum DVFS mode and 140-nW total power at 8.2 kHz in the maximum DVFS mode, while the uncore domain power increases from 4.86 to 93.9 nW. Fig. 11 shows the full power breakdown at these minimum and maximum DVFS modes, including the measured power of the ACG and the simulated power of the VSC, which consumes 3-nW total power from its 1.2-V supply across all $V_C$ modes. Fig. 12 shows the achievable core power and frequency range from DVFS at each $V_{DD}$. The minimum achievable core power is 840 pW, which occurs in the minimum DVFS mode at 0.3-V $V_{DD}$ while running at 6 Hz. At this point, the uncore domain consumes 690 pW and the clock generator consumes 130 pW, bringing the total minimum system power to 4.66 nW after the addition of the 3-nW VSC. While most processors require a high operating frequency (and, therefore, high power) to reduce energy consumption, the SDLS core achieves a competitive minimum energy of 13.4 pJ/cycle at 0.5-V $V_{DD}$ while running at 2.07 kHz in the maximum DVFS mode, consuming just 27.9 nW. Increasing $V_{DD}$ allows the core to reach higher frequencies (up to 41.5 kHz when $V_{DD} = 0.9$ V and $V_{C} = 1.1$ V) during DVFS, but results in high dynamic power because the SDLS gates have a higher intrinsic gate capacitance than static CMOS gates. Table I shows a detailed summary of this letter with comparison to the state-of-the-art low-power and low-energy processors.
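
The headline figures above can be cross-checked with simple arithmetic: energy per cycle is power divided by clock frequency, and the minimum system power is the sum of the four contributions quoted in the text. The sketch below uses only the rounded numbers reported in this section; the helper name `energy_per_cycle_pj` is hypothetical, and small differences from the reported values are expected because the inputs are rounded.

```python
def energy_per_cycle_pj(power_nw, freq_hz):
    """Energy per clock cycle in pJ, given power in nW and frequency in Hz."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    # nW / Hz = nJ per cycle; multiply by 1e3 to express it in pJ.
    return power_nw / freq_hz * 1e3


# Core operating points quoted in the text: (power in nW, frequency in Hz).
core_points = {
    "0.6 V, minimum DVFS mode": (6.0, 11.0),
    "0.6 V, maximum DVFS mode": (140.0, 8200.0),
    "0.5 V, maximum DVFS mode": (27.9, 2070.0),
}

# Minimum system power at 0.3 V: core + uncore + clock generator + VSC (nW).
min_system_power_nw = sum([0.84, 0.69, 0.13, 3.0])

if __name__ == "__main__":
    for label, (p_nw, f_hz) in core_points.items():
        print(f"{label}: {energy_per_cycle_pj(p_nw, f_hz):.1f} pJ/cycle")
    print(f"minimum system power: {min_system_power_nw:.2f} nW")
```

From these rounded inputs, the 0.5-V point works out to roughly 13.5 pJ/cycle, in line with the reported 13.4-pJ/cycle minimum, and the power sum reproduces the 4.66-nW system total. The 0.6-V points give about 545 pJ/cycle at 11 Hz and about 17 pJ/cycle at 8.2 kHz, which shows how strongly leakage dominates the energy per cycle at the slowest setting.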

IV. CONCLUSION

This letter presented a RISC-V microprocessor in 65-nm CMOS, implemented with a custom SDLS standard cell logic family designed to enable constant-$V_{DD}$ performance scaling at ultralow-power levels. A co-designed VSC and ACG were also presented that enable fully integrated DVFS at 0.6 V from 6 nW/11 Hz to 140 nW/8.2 kHz. Overall, the core design achieves performance limits of 840-pW minimum power and 13.4-pJ/cycle minimum energy. This performance helps enable a new area of the design space for continuous and reliable battery-less IoT sensing nodes (Fig. 13).

REFERENCES