Tuesday, February 16, 2016

PSoC 5LP DFB assembler

The PSoC 5LP DFB and assembler

The digital filter block (DFB) is one of the most powerful elements on the PSoC 5LP. It is available through either the DFB component or the Filter component in PSoC Creator. With the Filter component, the block can be configured with cascaded FIR or biquad filters. This is a very capable configuration, but the filter parameters cannot be changed at run-time using the component API. The DFB component, on the other hand, is programmed using the supplied assembler, and all parameters stored in DFB RAM can be changed dynamically at run-time. The assembler syntax is documented in the component and in the chip data sheets, and the component contains a simulator for testing user-supplied code. The problem with these tools is that the data-flow architecture and VLIW instruction syntax have a somewhat steep learning curve, and there are not many commented examples available to guide a new user through the learning process.

It is not really very complicated, but it is a pipelined architecture, and it's important to know at every cycle/instruction which data element is available in which section of the pipeline.

The main elements are
  • a dual input stage, of which only one input can be connected to the data path at a time. An input register can only be read once before it is written again from the main bus (I might be wrong here, but this is my experience from my early learning process).
  • multiplexers controlling what data is routed to the inputs of the RAM, the MAC and the ALU
  • a pair of RAM blocks that can be read and written independently, connected to
  • a 24x24 multiply accumulator block (MAC) using q23 data format, followed by
  • a 24 bit arithmetic logic unit (ALU) and a shifter
  • an output stage
There is a control store and state handling but that can mostly be left to the assembler in basic applications.

Number format

The numerical format is signed 24 bit. Multiply and accumulate operations return bits 23 to 46 of the signed product, that is, the top 24 bits minus one. A natural interpretation of this data is as signed fractional numbers in q23 format, that is, a sign bit followed by a binary point and 23 fractional bits. In the simulator these values are written as hexadecimal values.
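To make the q23 format concrete, here is a minimal host-side sketch in C of the conversions and of a multiply that keeps bits 23 to 46 of the product. The function names are hypothetical and not part of any Cypress API. The sketch assumes truncation (no rounding) and an arithmetic right shift for negative values, which is what common compilers provide. It does not wrap or saturate the result to 24 bits.

```c
#include <assert.h>
#include <stdint.h>

static const double Q23_SCALE = 8388608.0; /* 2^23 */

/* Convert a value in [-1.0, 1.0) to q23, truncating toward zero. */
int32_t q23_from_double(double x)
{
    assert(x >= -1.0 && x < 1.0);
    return (int32_t)(x * Q23_SCALE);
}

/* Convert a q23 value (already sign-extended to 32 bits) to double. */
double q23_to_double(int32_t q)
{
    return (double)q / Q23_SCALE;
}

/* Model of a q23 multiply: form the full product and drop the low 23 bits.
 * Assumption: >> on a negative int64_t is an arithmetic shift. */
int32_t q23_mul(int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * (int64_t)b;
    return (int32_t)(product >> 23);
}
```

With this model, 0.5 times 0.5 is `q23_mul(0x400000, 0x400000)`, which gives `0x200000` (0.25). Small operands such as `0x000100` times `0x000100` give 0, because the whole product lies below bit 23.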

First lessons learned

Some of the things that are mentioned in the documentation but, at least to me, are not obvious:
  • Input and Output channel A is numbered 1, and B is numbered 0.
  • Input is the same as Staging register, Output is Holding register.
  • The DFB component code uses 3x8 bit byte access to the Staging and Holding registers in the LoadInputValue and GetOutputValue functions, but it works well using 32 bit access.
  • In test examples for the simulator, use values where the product is not zero in the upper 24 bits, otherwise the MAC product stays zero and it seems nothing is happening.
  • The two input buffers cannot be read at the same time, and each can be read only once before it is reloaded from the exterior bus. This means that if an input value must be used several times, it has to be stored/held somewhere. This can be in one of the RAM blocks or in the ALU. Of course, it's not possible to hold more than one value in the ALU, and only for a few cycles, until some other data must flow through the ALU.
  • All output from the MAC must go through the ALU.
  • Writing a value to a RAM location puts this same value on the output of the RAM during the same cycle.
  • Addressing a specific RAM location in one of the RAM buffers is a bit involved
    • acu(clear, ...)  for location 0
    • acu(incr, ...)    for next location
    • acu(decr, ...) for previous location
    • acu(read, ...) addr(xx)  to read ACU RAM row xx as RAM A register address
    • acu(write, ...) addr(xx) to write current RAM A register address to ACU RAM row xx
  • The saturation logic is for the ALU; the MAC will overflow in the accumulation even with saturation detection enabled.
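The last point is easier to see with numbers. The sketch below only illustrates the general difference between wrapping and saturating two's complement arithmetic on 24 bit values. It is not a model of the actual DFB hardware (the real accumulator width is not modeled), and the function names are hypothetical.

```c
#include <stdint.h>

/* Keep the low 24 bits and reinterpret them as a signed value,
 * which is what an overflowing (wrapping) 24 bit result looks like. */
int32_t wrap24(int64_t v)
{
    uint32_t u = (uint32_t)v & 0x00FFFFFFu;
    return (u & 0x00800000u) ? (int32_t)u - 0x01000000 : (int32_t)u;
}

/* Clamp to the signed 24 bit range instead of wrapping. */
int32_t sat24(int64_t v)
{
    if (v > 0x7FFFFF) return 0x7FFFFF;
    if (v < -0x800000) return -0x800000;
    return (int32_t)v;
}
```

Adding 0.75 (`0x600000`) and 0.5 (`0x400000`) gives -0.75 (`-0x600000`) when wrapped, but the largest positive value `0x7FFFFF` when saturated.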

Example: Squaring the input

The steps in this code are:
  • Wait for input
  • Send input buffer A to ALU
  • Route the ALU output to both MAC ports and clear the accumulator; this places the product in the MAC with no accumulation.
  • Route the result through ALU to the Output register.

The asm code for the DFB block:
initial:
acu(clear, clear) dmux(sa,sa) alu(set0) mac(hold)
acu(setmod, setmod) dmux(sa,sa) alu(hold) mac(clra) jmp(eob, waitForNew)

// Wait for data to be written to Staging Register Input 1
waitForNew:
acu(clear,clear) dmux(sa,sa) alu(hold) mac(hold) jmpl(in1,dataRead)

dataRead:
// Read staging register A into ALU
acu(hold, hold) addr(1) dmux(sa,ba) alu(setb) mac(hold)

// Multiply ALU out with ALU out and place in cleared MAC ACC
acu(hold, hold) dmux(sa,sa) alu(setb) mac(clra)            

//Move MAC o/p to ALU
acu(hold, hold) dmux(sm,sm) alu(seta) mac(hold)             

//Wait for ALU output
acu(hold, hold) dmux(sm,sm) alu(hold) mac(hold)

// Write the MAC content to holding register A
acu(hold, hold) addr(1) dmux(sa,sa) alu(hold) mac(hold) write(bus) jmp(eob,waitForNew)

Note: It is possible to send the staging register directly to both ports of the MAC, saving one instruction, but the assembler generates a warning, so I guess more testing is needed.
Use the assembler and simulator in the DFB component to test your code carefully before trying it on the live chip. The code reads from staging register A, so some data should be placed in the simulator bus data area 'Bus1' (far out to the right).


To use this, the following code is put in the program:
float finput = 0.3, foutput;
uint32 input, output;

DFB_1_Start();
DFB_1_SetInterruptMode(DFB_1_HOLDA);

Loop code:
input = finput*(1<<23);            /* float to q23 */
DFB_1_LoadInputValue(1, input);

/* Wait until the DFB signals that holding register A is ready */
while (!(DFB_1_GetInterruptSource() & DFB_1_HOLDA) ) ;

output = DFB_1_GetOutputValue(1);
foutput = ((float)output)/(1<<23); /* q23 to float */
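
The q23 to float conversion above treats the value as unsigned. That is fine for the squaring example, since the result is never negative, but a signed result needs sign extension from bit 23 first. A minimal sketch, assuming the 24 bit result sits in the low three bytes of the value that was read; the helper names are hypothetical and not part of the component API:

```c
#include <stdint.h>

/* Sign-extend a 24 bit two's complement value held in the low 24 bits. */
int32_t q23_sign_extend(uint32_t raw)
{
    raw &= 0x00FFFFFFu;
    return (raw & 0x00800000u) ? (int32_t)raw - 0x01000000 : (int32_t)raw;
}

/* Convert a raw 24 bit q23 value to float, handling negative values. */
float q23_raw_to_float(uint32_t raw)
{
    return (float)q23_sign_extend(raw) / 8388608.0f; /* divide by 2^23 */
}
```

For example, `q23_raw_to_float(0xC00000)` gives -0.5, where the unsigned conversion would give 1.5.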

To use 32 bit access we can use the following code to write and read the staging and holding registers:
*( (reg32 *) DFB_1_DFB__STAGEA) = input;
output = *(  (reg32 *) DFB_1_DFB__HOLDA);

Example: First order LP filter

A basic first order filter calculates the recurrence relation: 
\[ y_{n} = a_1 y_{n-1} + b_0 x_{n} \]
The filter coefficients are stored in the first two locations of RAM-A, and the previous output \( y_{n-1} \) is remembered in the ALU hold register between loops.

When new input is available, \( y_{n-1} \) is routed from the ALU output (shift) to MAC input B and multiplied with \( a_1 \) (RAM-A[0]) without accumulation, mac(clra). Next, the new input is routed to MAC input B, multiplied with \( b_0 \) (RAM-A[1]), and added to the previous product. The result is then transferred to the ALU and stored in the holding register.

For the example we use \( a_1 = 0.9 \) and \( b_0 = 0.1 \) for a DC unity gain filter.
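
As a quick check of the unity DC gain, assume a constant input \( x \) and a settled output \( y_n = y_{n-1} = y \) in the recurrence relation:
\[ y = a_1 y + b_0 x \quad\Rightarrow\quad \frac{y}{x} = \frac{b_0}{1 - a_1} = \frac{0.1}{1 - 0.9} = 1 \]
Any coefficient pair with \( b_0 = 1 - a_1 \) gives the same DC gain.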

0x733333 // a1 = 0.9
0x0CCCCC // b0 = 0.1

The asm code for the DFB block:
area data_a
org 0
dw 0x733333 // a1 = 0.9
dw 0x0CCCCC // b0 = 0.1

// Clear ALU and MAC
initial:
acu(clear, clear) dmux(sa,sa) alu(set0) mac(hold)
acu(setmod, setmod) dmux(sa,sa) alu(hold) mac(clra) jmp(eob, waitForNew)

// Wait for data to be written to Staging Register Input 1
waitForNew:
acu(clear,clear) dmux(sa,sa) alu(hold) mac(hold) jmpl(in1,dataRead)

dataRead:
// Multiply ALU out with RAMA[0] and place in cleared MAC ACC
acu(hold, hold) dmux(sra,sa) alu(setb) mac(clra)

// Read staging register A to MAC port B and multiply with RAMA[1]
acu(incr, hold) addr(1) dmux(sra,ba) alu(hold) mac(macc)

//Move MAC o/p to ALU
acu(hold, hold) dmux(sm,sm) alu(seta) mac(hold)

//Wait for ALU output
acu(hold, hold) dmux(sm,sm) alu(hold) mac(hold)

// Write the MAC content to holding register A
acu(hold, hold) addr(1) dmux(sa,sa) alu(hold) mac(hold) write(bus) jmp(eob,waitForNew)
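
To have something to compare the simulator output against, here is a minimal reference model of the same recurrence in C, using the coefficients from the data area. It is a sketch under assumptions: the names are hypothetical, the result is truncated (no rounding) when bits 23 and up are taken from the accumulator, an arithmetic right shift is assumed for negative values, and the pipeline timing of the DFB is not modeled.

```c
#include <stdint.h>

#define A1_Q23 0x733333 /* a1 = 0.9 in q23 */
#define B0_Q23 0x0CCCCC /* b0 = 0.1 in q23 */

/* Filter state: the previous output y[n-1] in q23. */
typedef struct {
    int32_t y_prev;
} lp1_state;

/* One filter step: y[n] = a1*y[n-1] + b0*x[n], all values in q23. */
int32_t lp1_step(lp1_state *s, int32_t x)
{
    int64_t acc = (int64_t)A1_Q23 * s->y_prev; /* like mac(clra): product only */
    acc += (int64_t)B0_Q23 * x;                /* like mac(macc): accumulate   */
    s->y_prev = (int32_t)(acc >> 23);          /* keep bits 23 and up          */
    return s->y_prev;
}
```

Starting from a zeroed state and feeding a constant 0.5 (`0x400000`), the first output is `0x066666` (about 0.05), and the output then climbs towards the input value. Because the two truncated coefficients sum to slightly less than 1.0, the model settles a few counts below the input.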


To monitor performance, a semaphore can be output during the calculation and cleared in the wait loop. This is then connected to an output pin and monitored with a logic analyzer or an oscilloscope.

