Instruction Timing

1 Learning Outcomes

Practice interpreting waveforms in timing diagrams.
Given an instruction, identify the critical path through the single-cycle datapath.
Approximate instruction timing based on the five phases of instruction execution.

How should we time our single-cycle datapath? How should we set the clock frequency? In this section, we develop an approximation of instruction timing using the five steps to a RISC-V instruction.

2 Timing Diagram for `add`

First, let’s consider the delays in our beloved add instruction. Review the add datapath in Figure 1.

Figure 1: The add datapath, updated from an earlier section’s simple add-only datapath.

Figure 2 shows the waveforms for executing an add x1, x2, x3 instruction at address 0x100, followed by add x6, x7, x9 at address 0x104.

Figure 2: Timing diagram for `add`. The waveforms cover two consecutive add instructions and show the PC, instruction, register read, ALU, and writeback signals stabilizing within a clock period. Only relevant signal waveforms are shown.

Explanation of Figure 2

Instruction Fetch (IF).
- On the rising edge of the clock, update the program counter register with its input signal. After some delay $t_{\texttt{clk-to-q}}$, the value of the program counter, 0x100, appears at the output signal pc.
- Concurrently perform the following:^[1]
  - Increment PC to the next instruction with the simple adder. After some delay, the signal pc + 4 is ready with 0x104 at the input to the PCSel mux.^[2]
  - Fetch an instruction from IMEM. After some delay, inst[31:0] is updated with the machine code for add x1, x2, x3.
Instruction Decode (ID). Concurrently perform the following:^[3]
- Retrieve the values of the source registers rs1 and rs2 from RegFile. After some delay, the output signals R[rs1] and R[rs2] are ready with the values of registers x2 and x3.
- Decode the instruction to determine the control logic signals.
Execute (EX).
- Use the two muxes to select the appropriate input signals to the ALU. After some delay, these input signals carry the two source register values.^[4]
- After some more delay, the ALU’s output signal alu is set to the sum of the two source registers x2 and x3.
Memory (MEM). (We don’t access memory, so do not incorporate memory delay in our analysis.)
Write Back (WB). Use the WBSel mux to select the alu output signal as the wb signal to the wdata input of the RegFile. After some delay, the signal is set to the sum of the two source registers.
Additionally, account for setup time needed to hold the wb signal stable before the rising clock edge.
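
The walkthrough above amounts to a running sum of delays along each signal path. The following is a minimal Python sketch of that bookkeeping. The delay values are hypothetical picosecond figures chosen for illustration (they are not read from Figure 2), and the names `DELAYS_PS` and `add_signal_arrivals` are made up for this example. As in Figure 2, it assumes control decoding finishes before the RegFile read.

```python
# Hypothetical component delays in picoseconds (illustrative only, not taken from Figure 2).
DELAYS_PS = {
    "clk_to_q": 30,
    "add": 100,      # simple adder that computes pc + 4
    "imem": 200,
    "regfile": 100,
    "mux": 25,
    "alu": 200,
    "setup": 20,
}


def add_signal_arrivals(d):
    """Return when each signal settles for `add`, in ps after the rising clock edge."""
    t = {}
    t["pc"] = d["clk_to_q"]                           # IF: PC register output
    t["pc + 4"] = t["pc"] + d["add"]                  # IF: adder runs concurrently with IMEM
    t["PC input"] = t["pc + 4"] + d["mux"]            # through the PCSel mux
    t["inst"] = t["pc"] + d["imem"]                   # IF: instruction fetch
    t["R[rs1], R[rs2]"] = t["inst"] + d["regfile"]    # ID: register read
    t["ALU inputs"] = t["R[rs1], R[rs2]"] + d["mux"]  # EX: ASel and BSel muxes in parallel
    t["alu"] = t["ALU inputs"] + d["alu"]             # EX: sum is ready
    t["wb"] = t["alu"] + d["mux"]                     # WB: through the WBSel mux
    return t


arrivals = add_signal_arrivals(DELAYS_PS)
for signal, time_ps in arrivals.items():
    print(f"{signal:>16}: stable at {time_ps} ps")

# Both register inputs must be stable one setup time before the next rising edge.
min_period_ps = max(arrivals["wb"], arrivals["PC input"]) + DELAYS_PS["setup"]
print(f"earliest next rising edge: {min_period_ps} ps")
```

With these made-up numbers, the wb signal settles long after the PC input does, so the write-back path determines the earliest possible next clock edge.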

3 Critical path delay by instruction

Different instructions use different components of the datapath. We now update our definition of critical path to consider the path between clocked element inputs and outputs that matter for the given instruction. For example, accessing DMEM does not matter for an add, whereas setting up the RegFile data to write back does not matter for sw.

Table 1: Timing descriptions of components.

Delay	Description
$t_{\texttt{clk-to-q}}$	clk-to-q delay to transfer register input value to the output.
$t_{\texttt{setup}}$	Setup time to hold the register input stable before the rising clock edge.
$t_{\texttt{mux}}$	Propagation delay through a mux; assume the same delay for all muxes.
$t_{\texttt{add}}$	Propagation delay through the simple adder that increments PC to the next instruction.
$t_{\texttt{RegFile}}$	Delay to read a register value from RegFile.
$t_{\texttt{IMEM}}$	Delay to read the instruction from IMEM.
$t_{\texttt{DMEM}}$	Delay to read a word from DMEM.
$t_{\texttt{ALU}}$	Propagation delay through the ALU.
$t_{\texttt{Imm}}$	Propagation delay through the immediate generator.
$t_{\texttt{BrComp}}$	Propagation delay through the branch comparator.

Compute the delay along the critical path for each instruction.

add
lw
beq

Options:

A. $t_{\texttt{clk-to-q}} + t_{\texttt{add}} + t_{\texttt{IMEM}} + t_{\texttt{RegFile}} + t_{\texttt{BrComp}} + t_{\texttt{ALU}} + t_{\texttt{DMEM}} + t_{\texttt{mux}} + t_{\texttt{setup}}$
B. $t_{\texttt{clk-to-q}} + t_{\texttt{IMEM}} + t_{\texttt{RegFile}} + 2 \cdot t_{\texttt{mux}} + t_{\texttt{ALU}} + t_{\texttt{setup}}$
C. $t_{\texttt{clk-to-q}} + t_{\texttt{IMEM}} + \max\{t_{\texttt{RegFile}}, t_{\texttt{Imm}}\} + t_{\texttt{ALU}} + 2 \cdot t_{\texttt{mux}} + t_{\texttt{DMEM}} + t_{\texttt{setup}}$
D. $t_{\texttt{clk-to-q}} + t_{\texttt{IMEM}} + \max\{t_{\texttt{RegFile}}, t_{\texttt{Imm}}\} + t_{\texttt{ALU}} + 3 \cdot t_{\texttt{mux}} + t_{\texttt{setup}}$
E. Something else

Show Answer for add

B. There are two “loops” that we consider:^[3]

The PC update loop, measured from the PC output to the PC input: $t_{\texttt{clk-to-q}} + t_{\texttt{add}} + t_{\texttt{mux}} + t_{\texttt{setup}}$
The loop through the ALU, measured from the PC output to the RegFile input: $t_{\texttt{clk-to-q}} + t_{\texttt{IMEM}} + t_{\texttt{RegFile}} + t_{\texttt{mux}} + t_{\texttt{ALU}} + t_{\texttt{mux}} + t_{\texttt{setup}}$

$$
\begin{aligned}
\text{Critical path delay} =& t_{\texttt{clk-to-q}} \\
& + \max \{ t_{\texttt{add}} + t_{\texttt{mux}}, t_{\texttt{IMEM}} + t_{\texttt{RegFile}} + t_{\texttt{mux}} + t_{\texttt{ALU}} + t_{\texttt{mux}} \} \\
& + t_{\texttt{setup}} \\
=& t_{\texttt{clk-to-q}} + t_{\texttt{IMEM}} + t_{\texttt{RegFile}} + t_{\texttt{mux}} + t_{\texttt{ALU}} + t_{\texttt{mux}} + t_{\texttt{setup}} \\
=& t_{\texttt{clk-to-q}} + t_{\texttt{IMEM}} + t_{\texttt{RegFile}} + 2 \cdot t_{\texttt{mux}} + t_{\texttt{ALU}} + t_{\texttt{setup}}
\end{aligned}
$$

The critical path uses the longer loop through the ALU.
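
To see the max over the two loops in action, here is a small Python sketch. The function name `add_critical_path_ps` and all numeric delays are hypothetical, chosen only for illustration. The second call uses a deliberately unrealistic slow adder to show what would happen if the PC update loop were the longer one.

```python
def add_critical_path_ps(t_clk_to_q, t_add, t_imem, t_regfile, t_mux, t_alu, t_setup):
    """Critical path delay for `add`: clk-to-q, then the longer of the two loops, then setup."""
    pc_loop = t_add + t_mux                                   # PC output -> adder -> PCSel mux
    alu_loop = t_imem + t_regfile + t_mux + t_alu + t_mux     # PC output -> ... -> WBSel mux
    return t_clk_to_q + max(pc_loop, alu_loop) + t_setup


# Hypothetical delays in picoseconds.
typical = add_critical_path_ps(
    t_clk_to_q=30, t_add=100, t_imem=200, t_regfile=100, t_mux=25, t_alu=200, t_setup=20
)
print(f"typical case: {typical} ps")          # the ALU loop dominates

# Same delays, except for an artificially slow PC adder.
slow_adder = add_critical_path_ps(
    t_clk_to_q=30, t_add=1000, t_imem=200, t_regfile=100, t_mux=25, t_alu=200, t_setup=20
)
print(f"slow-adder case: {slow_adder} ps")    # now the PC update loop dominates
```

Answer B corresponds to the first case, in which the adder is much faster than the IMEM, RegFile, and ALU in series.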

Figure 3: The beq datapath, updated from an earlier section’s simpler datapath.

Show Answer for beq

E. Something else.

We leave this derivation to you. Note that you may need to introduce new placeholder delays for control logic.

Figure 4: The lw datapath, updated from an earlier section’s simpler datapath.

Show Answer for lw

C. Load uses hardware in all five phases of the datapath. We still consider the two “loops” through the datapath^[3]:

The PC update loop, still measured from the PC output to the PC input.
The much longer loop, measured from the PC output through the ALU and DMEM, to the RegFile input. We now consider additional hardware for loads:
- Instruction Decode: The immediate generation block sets imm concurrently with the RegFile retrieving the source register value R[rs1]. We denote this delay as the larger of the two, $\max\{t_{\texttt{RegFile}}, t_{\texttt{Imm}}\}$.
- Execute: The ALU output computes the memory address, so we incur $t_{\texttt{ALU}}$.
- Memory: The DMEM read now matters, so we incur DMEM read time, $t_{\texttt{DMEM}}$.

$$
\begin{aligned}
\text{Critical path delay} =& t_{\texttt{clk-to-q}} \\
& + \max \{ t_{\texttt{add}} + t_{\texttt{mux}}, \\
& t_{\texttt{IMEM}} + t_{\texttt{Imm}} + t_{\texttt{mux}} + t_{\texttt{ALU}} + t_{\texttt{DMEM}} + t_{\texttt{mux}}, \\
& t_{\texttt{IMEM}} + t_{\texttt{RegFile}} + t_{\texttt{mux}} + t_{\texttt{ALU}} + t_{\texttt{DMEM}} + t_{\texttt{mux}} \} \\
& + t_{\texttt{setup}} \\
=& t_{\texttt{clk-to-q}} + t_{\texttt{IMEM}} + \max\{t_{\texttt{RegFile}}, t_{\texttt{Imm}}\} + t_{\texttt{ALU}} + 2 \cdot t_{\texttt{mux}} + t_{\texttt{DMEM}} + t_{\texttt{setup}}
\end{aligned}
$$
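
The last step of the derivation factors the shared terms out of the max. The Python sketch below checks numerically that the three-way max and the factored form (option C) agree. All delays are hypothetical picosecond values, and the function names are made up for illustration. The two forms agree only when the PC update loop is not the longest path, which holds for these numbers.

```python
def lw_three_way_max_ps(d):
    """Critical path for `lw` written as a max over the three candidate paths."""
    pc_loop = d["add"] + d["mux"]
    imm_path = d["imem"] + d["imm"] + d["mux"] + d["alu"] + d["dmem"] + d["mux"]
    reg_path = d["imem"] + d["regfile"] + d["mux"] + d["alu"] + d["dmem"] + d["mux"]
    return d["clk_to_q"] + max(pc_loop, imm_path, reg_path) + d["setup"]


def lw_option_c_ps(d):
    """Factored form: only the RegFile and Imm delays differ between the two long paths."""
    return (
        d["clk_to_q"] + d["imem"] + max(d["regfile"], d["imm"])
        + d["alu"] + 2 * d["mux"] + d["dmem"] + d["setup"]
    )


# Hypothetical delays in picoseconds.
delays_ps = {
    "clk_to_q": 30, "add": 100, "imem": 200, "regfile": 100, "imm": 50,
    "mux": 25, "alu": 200, "dmem": 200, "setup": 20,
}

print(lw_three_way_max_ps(delays_ps), lw_option_c_ps(delays_ps))
```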

4 The single-cycle datapath clock is slow

To determine the clock frequency for the single-cycle datapath, we compute the delay of each instruction’s critical path, then set the clock period to the worst-case delay over all instructions.

To put some numbers to our earlier analysis, we simplify our time estimates with Table 2, which assumes that the timing of each of the five steps to a RISC-V instruction is dominated by a major functional hardware unit.

Table 2: Assume each of the five steps is dominated by a major hardware unit. Multiplexers, the control unit, PC accesses, immediate generation, and branch comparison incur minimal delay.

Step	Operation time	Major hardware unit
Instruction Fetch (`IF`)	200 ps	Read an instruction word from IMEM.
Instruction Decode (`ID`)	100 ps	Read register values from the RegFile.
Execute (`EX`)	200 ps	Perform arithmetic/logical operations in the ALU.
Memory Access (`MEM`)	200 ps	Read data from or write data to DMEM.
Write Back (`WB`)	100 ps	Write back to the RegFile. For single-cycle, we assume this is the delay of the WBSel mux and setup time.

We can then produce the simplified timing diagram in Figure 5 for an instruction that uses all phases—like our lw instruction from earlier. We can additionally construct Table 3, which shows the time required for various instruction formats.

Figure 5: Approximate timing diagram for the five steps to a RISC-V instruction in the single-cycle datapath. The diagram labels the IF, ID, EX, MEM, and WB intervals used to approximate instruction delay.

Table 3: (P&H Figure 4.28) Total time for each instruction, calculated from the simplified time for each phase.

Instruction	IF (200ps)	ID (100ps)	EX (200ps)	MEM (200ps)	WB (100ps)	Total
`add`	X	X	X		X	600ps
`beq`	X	X	X			500ps
`jal`	X	X	X		X	600ps
`lw`	X	X	X	X	X	800ps
`sw`	X	X	X	X		700ps

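Table 3 shows the shortest time in which each instruction could complete. However, the single-cycle datapath, like all synchronous digital systems, shares a single clock, so the clock period must accommodate the slowest instruction. Every instruction therefore takes 800 ps, the time needed for lw.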
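
The arithmetic behind these totals can be sketched in a few lines of Python. The phase times come from Table 2. The dictionary names are made up for illustration, and only four of the instructions are listed to keep the sketch short.

```python
# Phase times in picoseconds, from Table 2.
PHASE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Phases that each instruction actually uses.
PHASES_USED = {
    "add": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
    "lw":  ["IF", "ID", "EX", "MEM", "WB"],
    "sw":  ["IF", "ID", "EX", "MEM"],
}

# Total time per instruction is the sum of the phases it uses.
totals_ps = {
    inst: sum(PHASE_PS[phase] for phase in phases)
    for inst, phases in PHASES_USED.items()
}

# The shared clock must fit the slowest instruction.
clock_period_ps = max(totals_ps.values())

# 1 / (1 ps) = 1000 GHz, so frequency in GHz is 1000 / period in ps.
max_freq_ghz = 1000 / clock_period_ps

for inst, total in totals_ps.items():
    print(f"{inst}: {total} ps")
print(f"clock period: {clock_period_ps} ps, max frequency: {max_freq_ghz} GHz")
```

Under these simplified phase times, the clock frequency can be at most 1.25 GHz, even though an add alone would finish in 600 ps.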

We further note that each instruction’s critical path often involves accessing major hardware units in sequence. In other words, for most of each clock period, much of our hardware is idle and not computing additional data!
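
As a rough way to quantify this idleness, the Python sketch below computes the fraction of an 800 ps clock period during which each phase's hardware is busy. It assumes, as a simplification that goes beyond Table 2, that each unit works only during its own phase. The names `CLOCK_PERIOD_PS` and `busy_fraction` are made up for illustration.

```python
# Phase times in picoseconds, from Table 2.
PHASE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Clock period set by the slowest instruction (lw) under the Table 2 approximation.
CLOCK_PERIOD_PS = 800


def busy_fraction(phases_used):
    """Fraction of the clock period each phase's hardware is active (0.0 if unused)."""
    return {
        phase: (PHASE_PS[phase] / CLOCK_PERIOD_PS if phase in phases_used else 0.0)
        for phase in PHASE_PS
    }


lw_busy = busy_fraction(["IF", "ID", "EX", "MEM", "WB"])
add_busy = busy_fraction(["IF", "ID", "EX", "WB"])

for phase in PHASE_PS:
    print(f"{phase}: lw {lw_busy[phase]:.1%}, add {add_busy[phase]:.1%}")
```

Even for lw, which uses every phase, no unit is busy for more than a quarter of the period under this assumption.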

We address these performance issues and more in our pipelined datapath design up next. Stay tuned!

Footnotes

[1] These processes take a comparable amount of time, though which is longer depends on the specific technology. In Figure 2, the adder happens to complete faster than the IMEM memory fetch.
[2] Note that the waveforms represent bundles of wires carrying a hexadecimal value (contrast this with the clock’s binary high-low signal). The wires in the PC output bundle pc update at the same time, because the flip-flops are wired in parallel. By contrast, the pc+4 output does not stabilize simultaneously. Because the adder cascades single-bit adders in series, the least significant bits stabilize sooner than the more significant bits. In timing diagrams, we always show the transition to the correct value. For pc+4, this occurs after the propagation delay of the most significant bit.
[3] In Figure 2, the control logic decoding of the instruction happens to complete faster than the RegFile register read. We will assume this precedence in later analysis.
[4] There are two multiplexers, controlled by ASel and BSel, respectively. Both propagation delays occur concurrently, so we count only one mux’s propagation delay.