Sunday, December 11, 2016

Dynamic Systems and Mechanics with Free and Open tools

Note: This blog post is still a work in progress. I didn't plan to publish it until it was a bit more complete and polished, but since it is already here I will keep it public and keep modifying it when I have the time.

Mixed control and mechanics problems, like controlling an inverted pendulum, a robotic arm or a flying drone, give rise to mechanical problems that can be solved by standard methods, but the calculations can get quite involved, and translating the formulas into C code without errors needs very careful work. The free CAS Maxima can help with the calculations and can also generate expressions that are almost runnable code. The following example shows how the equations of motion for an inverted pendulum can be derived with the aid of Maxima.

Deriving the equations of motion in Maxima

The first command tells Maxima that p1, p2, p3 and p4 are functions of t; this is important for taking derivatives later. The next line is the Lagrangian.


The equations of motion are given by
\( \frac{d}{dt} \left [ \frac{\partial L}{\partial p_i} \right ] -  \frac{\partial L}{\partial q_i} = F_i \)


Rewriting the equations of motion as a dynamical system


The resulting dynamical system, where the definitions of p1 and p2 as the time derivatives of q1 and q2 have been inserted:

\( \begin{align*}
\frac{d}{d\,t}\,q1 &= p1 \\
\frac{d}{d\,t}\,p1 &= \frac{\left( F+l\,m\,{p2}^{2}\,\sin\left( q2\right) \right) \,I+{l}^{2}\,F+\left( {l}^{3}\,m\,{p2}^{2}-g\,{l}^{2}\,m\,\cos\left( q2\right) \right) \,\sin\left( q2\right) }{\left( I+{l}^{2}\right) \,M+m\,I+{l}^{2}\,m\,{\sin\left( q2\right) }^{2}} \\
\frac{d}{d\,t}\,q2 &= p2 \\
\frac{d}{d\,t}\,p2 &= \frac{g\,l\,\sin\left( q2\right) \,M-l\,\cos\left( q2\right) \,F+\left( g\,l\,m-{l}^{2}\,m\,{p2}^{2}\,\cos\left( q2\right) \right) \,\sin\left( q2\right) }{\left( I+{l}^{2}\right) \,M+m\,I+{l}^{2}\,m\,{\sin\left( q2\right) }^{2}}
\end{align*} \)

Converting our system into C code

The Maxima 'fortran' command gives output that can quite easily be changed into a usable C expression.


      ds2(1) = ['diff(p1,t,1) = ((F+l*m*p2**2*sin(q2))*I+l**2*F+(l**3*m*
     1   p2**2-g*l**2*m*cos(q2))*sin(q2))/((I+l**2)*M+m*I+l**2*m*sin(q2)
     2   **2),'diff(p2,t,1) = (g*l*sin(q2)*M-l*cos(q2)*F+(g*l*m-l**2*m*p
     3   2**2*cos(q2))*sin(q2))/((I+l**2)*M+m*I+l**2*m*sin(q2)**2)] 

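There are some Fortran constructs to remove, like the continuation markers at the start of each wrapped line (which can split an identifier such as p2 across two lines) and ** as the power operator, but all the elements of the formulas are in place.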


A good discussion of using Maxima together with Octave/Matlab can be found here:

Saturday, November 19, 2016

PSoC 5LP - DFB assembler and some more ACU techniques

Saving ACU state and using per loop block addressing

Working with the DFB assembler for a while reveals how clever and powerful the architecture is. The ACU registers and the ACU RAM can be used to generate address patterns that point into the RAMA and RAMB memory blocks.

We can use the ACU and ACU RAM to create several blocks of RAM memory that are used by consecutive loops. A possible configuration using three blocks, with 3 and 4 elements respectively, is shown in the figure below.
  • ACU RAM[0] saves the start positions for the current block. This can be read into ACU REG with acu(read,read) addr(0) instead of using acu(clear,clear) to address the base of the block
  • ACU RAM[1] saves the size of the blocks. It is read into FREG and used to increase the memory pointers at the end of a loop
  • ACU RAM[2] saves the last position in the last block. This is used by the ACU modulo arithmetic to loop back to the beginning of the first block.

At the end of a loop, before waiting for new input, the current base position is loaded and incremented with the values in FREG. This new memory base is then saved into ACU RAM[0] and used in the next loop.
When new input data has arrived, the addresses are read from ACU RAM[0] at the start of the next loop. This gives correct RAM addressing even if the DFB has been paused and restarted to change memory parameters between the loops. The code also tests if the updated base value is 0, indicating the end of a major loop.
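To make the bookkeeping easier to follow, here is a small host-side C model of the base address update for one of the two RAMs. It is only a simplified sketch: the function name is made up, and the wrap rule (go back to 0 when the next base would pass the last address) is my reading of the description above, not a verified model of the ACU modulo hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Advance a block base address by one block.
   Wraps to 0 when the next base would pass last_addr.
   Returns 1 when the base wrapped (end of a major loop), otherwise 0. */
int next_block_base(uint8_t *base, uint8_t block_size, uint8_t last_addr)
{
    assert(base != 0 && block_size > 0);

    unsigned next = (unsigned)*base + block_size;
    if (next > last_addr) {
        *base = 0;          /* back to the first block */
        return 1;
    }
    *base = (uint8_t)next;  /* base of the next block */
    return 0;
}
```

With a block size of 4 and a last address of 0x13, the base steps through 4, 8, 12 and 16 and then returns to 0, which is the condition the assembler code tests for with acubeq.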

The following assembler code shows the control flow. The value of the current RAMB base address is written to holding register A, but no other useful work is done. The process input section only reads and discards the input from staging register A. The ACU magic is marked in orange.

// ACU and block addressing
// Every loop uses a separate memory block in RAMA and RAMB
// The base address of this block is updated and saved at the end of the loop, before waiting for input data
// This base address is read at the start of the loop, in case there was a Pause/Resume event while waiting for input
// At any point in the loop the base address can be reloaded 
// with "acu(read, read) addr(0)"
area acu
org 0
dw 0x0000    // Memory location to save block base address
dw 0x0403    // Size of block, equals the increase of block base address for each loop
dw 0x130E    // REGM values, maximal RAMA and RAMB addresses before wraparound

area data_a
org 0

area data_b
org 0
dw 0x0000
dw 0x0001
dw 0x0002
dw 0x0003
dw 0x0004
dw 0x0005
dw 0x0006
dw 0x0007
dw 0x0008
dw 0x0009
dw 0x000A
dw 0x000B
dw 0x000C
dw 0x000D
dw 0x000E
dw 0x000F

acu(clear, clear) dmux(sa,sa) alu(set0) mac(hold)
acu(loadf, loadf) addr(1) dmux(sa,sa) alu(set0) mac(hold)
acu(loadm, loadm) addr(2) dmux(sa,sa) alu(set0) mac(clra) jmp(eob,wait_input)

loop_start:
acu(read, read) addr(0) dmux(sa,sa) alu(hold) mac(hold) jmp(eob,process_input)

process_input:
// Only outputs current RAMB base for testing
acu(hold, hold) dmux(sa, sa) alu(clearsem, 001) mac(hold)
acu(hold, hold) addr(1) dmux(sa ,ba) alu(setb) mac(hold)
// Set ALU to RAMB[ACUB]
acu(hold, hold) dmux(sa, sra) alu(setb) mac(hold)
acu(hold, hold) dmux(sa, sa) alu(hold) mac(hold)
acu(hold, hold) addr(1) dmux(sa, sa) alu(hold) mac(hold) write(bus) jmp(eob,loop_end)

loop_end:
// Move acu registers to point to next memory block, and save in ACU RAM[0]
acu(read, read) addr(0) dmux(sa,sa) alu(hold) mac(hold) write(da)
acu(addf, addf) dmux(sa,sa) alu(hold) mac(hold) jmp(acubeq, major_loop_end)

// Fall-through block when the major loop has not ended
acu(write, write) addr(0) dmux(sm,sm) alu(hold) mac(hold) jmp(eob,wait_input)

major_loop_end:
// Set alu 1 to show that we have detected end of the major loop,
// used in component simulator for testing
acu(hold, hold) dmux(sa, sa) alu(set1) mac(hold)
acu(write, write) addr(0) dmux(sa,sa) alu(hold) mac(hold) jmp(eob,wait_input)

wait_input:
acu(hold, hold) dmux(sa,sa) alu(setsem, 001) mac(hold) jmpl(in1,loop_start)

Monday, November 7, 2016

PSoC 5LP DFB assembler - ACU as a loop counter

This time we will use the address calculation unit (ACU) as a loop counter. We create a PSoC Creator project to test our ideas. The DFB assembler code is tried out in the component simulator until everything seems to work as planned, and then we place the code onto a real chip under control of the debugger.

Test project

The test project is set up to transfer ADC readings to the DFB using DMA. A new output value from the DFB is signaled by raising an interrupt. The interrupt flag can be read from the main loop or handled by an interrupt handler. In the test code the interrupt status is checked in the main loop and DFB output data is saved in an SRAM array.

Example - Mean value of N samples

This example calculates the mean value of successive blocks of 10 samples.
Samples are read from staging register A, multiplied with 0.1 (1/N) and accumulated in the MAC.
The address calculation register ACUB is used as the loop counter. Modulo arithmetic for ACUB is enabled and the final loop counter value N (10) is loaded to the MREG register from ACU RAM[0] during setup. When the loop counter reaches 10 the accumulated value is written to output register A and the accumulator is cleared before waiting for the next input value.
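For checking simulator or chip output against known input, a host-side reference model in C can be handy. The sketch below is my own and rests on two assumptions: the MAC keeps the full-width products in the accumulator, and the output is the accumulator shifted down by 23 bits with truncation. The function and macro names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define Q23_ONE_TENTH 0x0CCCCC  /* 0.1 in q23, the same constant as RAMA[0] */

/* Mean of n q23 samples: each sample is multiplied by 1/n (given in q23)
   and summed in a wide accumulator, then scaled back to q23.
   Note: >> on a negative value is implementation-defined in C; common
   compilers shift arithmetically, which is what this sketch relies on. */
int32_t q23_block_mean(const int32_t *samples, size_t n, int32_t inv_n_q23)
{
    assert(samples != NULL && n > 0);

    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int64_t)samples[i] * inv_n_q23;
    return (int32_t)(acc >> 23);
}
```

Because 0x0CCCCC is slightly less than 0.1, the mean of ten samples of 0.5 (0x400000) comes out as 4194300 rather than 4194304 in this model.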

// Calculates average of 10 inputs
area acu
org 0
dw 0x000A

area data_a
org 0
dw 0x0CCCCC // RAMA[0] = 0.1

acu(clear, clear) dmux(sa,sa) alu(set0) mac(hold)
acu(hold, loadm) addr(0) dmux(sa,sa) alu(hold) mac(clra)
acu(setmod, setmod) dmux(sa,sa) alu(hold) mac(hold) jmp(eob,wait_input)

process_input:
// Read staging register A to MAC port B and multiply-accumulate with RAMA[0]
acu(hold, incr) addr(1) dmux(sra,ba) alu(clearsem, 001) mac(macc)
// Move MAC o/p to ALU
// When ACUB is 10 then block is complete and output is written to holding register
acu(hold,hold) dmux(sm,sm) alu(seta) mac(hold) jmp(acubeq,write_output)

wait_input:
// Use semaphore0 to signal that DFB is waiting for input
acu(hold,hold) dmux(sa,sa) alu(setsem, 001) mac(hold) jmpl(in1,process_input)

write_output:
//Wait for ALU output
acu(hold, hold) dmux(sm,sm) alu(hold) mac(hold)
// Write the MAC content to holding register A and clear the MAC
acu(clear, clear) addr(1) dmux(sa,sa) alu(set0) mac(hold) write(bus)
acu(clear, clear) dmux(sa,sa) alu(hold) mac(clra) jmp(eob,wait_input)

A few notes:
  1. The acubeq condition is true BOTH when ACUBREG is equal to MREG and when it is 0, so we catch the end of the loop when ACUBREG equals MREG and reset it in code.
  2. Placing the wait_input state after the process input state eliminates one jump.
  3. The jump when waiting for input is done with a loop jump, jmpl(in1,process_input), that transfers control to the beginning of the block if the condition is not true. Here the block is this single instruction.
  4. The jump when checking the loop count, jmp(acubeq,write_output), is a simple jmp that falls through to the next block (wait_input) if the condition is not set.
  5. Each sample is processed in at most 4 cycles, including the write of the output. If the DFB runs at 48 MHz, this corresponds to about 12 MSamp/s.
  6. Mux settings are only important in two instructions in this example:
      1. When reading from RAMA to channel A and the input register A to channel B, passing these values to the MAC input: dmux(sra,ba). The mux3 settings are not used.
      2. When outputting the accumulated value from the MAC to ALU input A: dmux(sm,sm). Only the mux3a setting is used.

The example project can be found at:

Tuesday, February 16, 2016

PSoC 5LP DFB assembler

The PSoC 5LP DFB and assembler

The digital filter block (DFB) on the PSoC 5LP is a very powerful element on the PSoC chip. It is available through the DFB component or the Filter component in PSoC Creator. The Filter component can be configured with cascaded FIR or BiQuad filters. This is a very powerful configuration, but the filter parameters cannot be changed at run-time using the component API. The DFB component, on the other hand, is programmed using the supplied assembler, and all parameters stored in DFB RAM can be dynamically changed at run-time. The assembler syntax is documented in the component and in the chip data sheets, and the component contains a simulator to test the user supplied code. The problem with these tools is that the data-flow architecture and VLIW instruction syntax have a somewhat steep learning curve, and there are not many commented examples available to guide the new user through the learning process.

It is not really very complicated, but it is a pipelined architecture and it is important to know at every cycle/instruction what data element is available in what section of the pipeline.

The main elements are
  • a dual input stage where only one input can be connected to the data path at a time, and where an input register can only be read once before it is written again from the main bus (I might be wrong here, but this is my experience from my early learning process)
  • multiplexers controlling what data is routed to the input of the RAM, MAC and the ALU
  • a pair of RAM blocks, they can be read and written independently and connected to 
  • a 24x24 multiply accumulator block MAC using q23 data format followed by
  • a 24 bit arithmetic logic unit ALU and a shifter
  • an output stage 
There is a control store and state handling but that can mostly be left to the assembler in basic applications.

Number format

The numerical format is signed 24 bit, and multiply and accumulate operations return the signed bits 23:46 of the product, that is, the top 24 bits minus one. A natural interpretation of this data is as signed fractional numbers in q23 format, that is, a sign bit followed by a binary point and 23 fractional bits. For the simulator these values are written as hexadecimal values.
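As a sketch of how the q23 format can be handled on the host side, the following C helpers convert between float and q23 and sign-extend a raw 24-bit value. The helper names are my own and are not part of the DFB component API. Saturating out-of-range input is also a choice of this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define Q23_SCALE 8388608.0f  /* 2^23 */

/* Convert a float in [-1, 1) to q23, saturating values outside the range. */
int32_t float_to_q23(float f)
{
    if (f >= 1.0f) return 0x7FFFFF;
    if (f < -1.0f) return -0x800000;
    return (int32_t)(f * Q23_SCALE);   /* truncates toward zero */
}

/* Sign-extend a raw 24-bit register value to a 32-bit signed integer. */
int32_t q23_from_raw(uint32_t raw)
{
    raw &= 0xFFFFFFu;
    return (raw & 0x800000u) ? (int32_t)raw - 0x1000000 : (int32_t)raw;
}

/* Convert a sign-extended q23 value back to float. */
float q23_to_float(int32_t q)
{
    return (float)q / Q23_SCALE;
}

/* Spot checks against the hexadecimal constants used in the examples. */
void q23_self_check(void)
{
    assert(float_to_q23(0.1f) == 0x0CCCCC);
    assert(float_to_q23(0.9f) == 0x733333);
    assert(q23_from_raw(0xFFFFFF) == -1);
}
```

The sign extension matters when a negative result is read back as an unsigned 32-bit word: without it, 0xC00000 would be read as +1.5 instead of -0.5.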

First lessons learned

Some of the things that are mentioned in the documentation but, at least to me, are not obvious:
  • Input and Output channel A is numbered 1, and  B is numbered 0.
  • Input is the same as Staging register, Output is Holding register.
  • The DFB component code uses 3x8 bit byte access to the Staging and Holding registers in the  LoadInputValue and GetOutputValue functions, but it works well using 32 bit access.
  • In test examples for the simulator, use values where the product is not zero in the upper 24 bits, otherwise the MAC product stays zero and it seems nothing is happening.
  • The two input buffers cannot be read at the same time, and they can each be read only once before being reloaded from the exterior bus. This means that if an input value must be used several times it has to be stored/held somewhere. This can be in one of the RAM blocks or in the ALU. Of course it is not possible to hold more than one value in the ALU, and only for a few cycles, until some other data must flow through the ALU.
  •  All output from the MAC must go through the ALU
  • Writing a value to a RAM location puts this same value on the output of the RAM during the same cycle.
  • Addressing a specific RAM location in one of the RAM buffers is a bit involved
    • acu(clear, ...)  for location 0
    • acu(incr, ...)    for next location
    • acu(decr, ...) for previous location
    • acu(read, ...) addr(xx)  to read ACU RAM row xx as RAM A register address
    • acu(write, ...) addr(xx) to write current RAM A register address to ACU RAM row xx
  •  The saturation logic is for the ALU, the MAC will overflow in the accumulation even with saturation detection enabled.

Example: Squaring the input

The steps in this code are:
  • Wait for input
  • Send input buffer A to ALU
  • Route ALU to both MAC ports and clear, this places the product in the MAC with no accumulation.
  • Route the result through ALU to the Output register.

The asm code for the DFB block
acu(clear, clear) dmux(sa,sa) alu(set0) mac(hold)
acu(setmod, setmod) dmux(sa,sa) alu(hold) mac(clra) jmp(eob, waitForNew)

waitForNew:
// Wait for data to be written to Staging Register Input 1
acu(clear,clear) dmux(sa,sa) alu(hold) mac(hold) jmpl(in1,dataRead)

dataRead:
// Read staging register A into ALU
acu(hold, hold) addr(1) dmux(sa,ba) alu(setb) mac(hold)

// Multiply ALU out with ALU out and place in cleared MAC ACC
acu(hold, hold) dmux(sa,sa) alu(setb) mac(clra)            

//Move MAC o/p to ALU
acu(hold, hold) dmux(sm,sm) alu(seta) mac(hold)             

//Wait for ALU output
acu(hold, hold) dmux(sm,sm) alu(hold) mac(hold)

// Write the MAC content to holding register A
acu(hold, hold) addr(1) dmux(sa,sa) alu(hold) mac(hold) write(bus) jmp(eob,waitForNew)

Note: It is possible to send the staging register directly to both ports of the MAC, saving one instruction, but the assembler generates a warning so I guess more testing is needed.
Use the assembler and simulator in the DFB block to test your code carefully before trying it on the live chip. The code reads from staging register A, so some data should be placed in the simulator bus data area 'Bus1' (far out to the right).

To use this, the following code is put in the program:
float finput = 0.3, foutput;
uint32 input, output;

input = finput*(1<<23); /* float to q23 */
Loop code:
input = finput*(1<<23);            /* float to q23 */
DFB_1_LoadInputValue(1, input);

while (!(DFB_1_GetInterruptSource() & DFB_HOLDA) ) ;

output = DFB_1_GetOutputValue(1);
foutput = ((float)output)/(1<<23); /* q23 to float */

To use 32 bit access we can use the following code to write and read the staging and holding registers:
*( (reg32 *) DFB_1_DFB__STAGEA) = input;
output = *(  (reg32 *) DFB_1_DFB__HOLDA);
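
To know what value to expect from the squaring example, the operation can be modeled on the host. The sketch below assumes the result is the 48-bit product shifted down by 23 bits with truncation, as described in the number format section. The function name is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model of squaring a q23 value.
   Note: squaring -1.0 (-0x800000) gives +1.0, which does not fit in q23;
   this sketch returns 0x800000 in that case and leaves handling to the caller. */
int32_t q23_square(int32_t x)
{
    assert(x >= -0x800000 && x <= 0x7FFFFF);

    int64_t product = (int64_t)x * x;   /* always non-negative */
    return (int32_t)(product >> 23);
}
```

For the input 0.3 used above (0x266666 in q23), this model gives 754974 (0x0B851E), which is approximately 0.09.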

Example: First order LP filter

A basic first order filter calculates the recurrence relation: 
\[ y_{n} = a_1 y_{n-1} + b_0 x_{n} \]
The filter coefficients are stored in the first two locations of RAM-A and the previous output  \( y_{n-1} \) is remembered in the ALU hold register between loops.

When new input is available, \( y_{n-1} \) is routed from the ALU output (shift) to MAC input B and multiplied with \( a_1 \) (RAM-A[0]) without accumulation, mac(clra). Next, the new input is routed to MAC input B, multiplied with \( b_0 \) (RAM-A[1]) and added to the previous product. The result is then transferred to the ALU and stored in the holding register.

For the example we use \( a_1 = 0.9 \) and \( b_0 = 0.1 \) for a DC unity gain filter.

0x733333 // a1 = 0.9
0x0CCCCC // b0 = 0.1
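
The recurrence can also be checked on the host with a small C reference model before running the assembler code. This is a sketch under the assumption that both products are summed at full width and the sum is then truncated to q23. The names are my own.

```c
#include <assert.h>
#include <stdint.h>

#define LP_A1 0x733333  /* a1 = 0.9 in q23 */
#define LP_B0 0x0CCCCC  /* b0 = 0.1 in q23 */

/* One filter step: y[n] = a1*y[n-1] + b0*x[n], all values in q23.
   *y_prev holds y[n-1] on entry and y[n] on return.
   Relies on arithmetic right shift for negative sums. */
int32_t lp_filter_step(int32_t *y_prev, int32_t x)
{
    assert(y_prev != 0);

    int64_t acc = (int64_t)LP_A1 * *y_prev;  /* first product, no accumulation */
    acc += (int64_t)LP_B0 * x;               /* second product, accumulated */
    *y_prev = (int32_t)(acc >> 23);
    return *y_prev;
}
```

Since the two truncated coefficients sum to one LSB less than 1.0, the step response in this model settles a few counts below the input value rather than exactly on it.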

The asm code for the DFB block:
area data_a
org 0
dw 0x733333 // a1 = 0.9
dw 0x0CCCCC // b0 = 0.1

initial:
// Clear ALU and MAC
acu(clear, clear) dmux(sa,sa) alu(set0) mac(hold)
acu(setmod, setmod) dmux(sa,sa) alu(hold) mac(clra) jmp(eob, waitForNew)

waitForNew:
// Wait for data to be written to Staging Register Input 1
acu(clear,clear) dmux(sa,sa) alu(hold) mac(hold) jmpl(in1,dataRead)

dataRead:
// Multiply ALU out with RAMA[0] and place in cleared MAC ACC
acu(hold, hold) dmux(sra,sa) alu(setb) mac(clra)

// Read staging register A to MAC port B and multiply with RAMA[1]
acu(incr, hold) addr(1) dmux(sra,ba) alu(hold) mac(macc)

//Move MAC o/p to ALU
acu(hold, hold) dmux(sm,sm) alu(seta) mac(hold)

//Wait for ALU output
acu(hold, hold) dmux(sm,sm) alu(hold) mac(hold)

// Write the MAC content to holding register A
acu(hold, hold) addr(1) dmux(sa,sa) alu(hold) mac(hold) write(bus) jmp(eob,waitForNew)

To monitor performance, a semaphore can be output during the calculation and cleared in the wait loop. This is then connected to an output pin and monitored with a logic analyzer or an oscilloscope.

Saturday, February 6, 2016

The Cypress PSoC 5LP

I recently started to work on a project that needs several analog input and output channels connected to some sensors and a PID control loop, and it will probably work better with 5 V than with the 3.3 V that is currently popular for Cortex-M systems. So I decided to dig out a CY8CKIT-059 PSoC 5LP Prototyping Kit that has been gathering dust waiting for the right project to come along.

This is actually an amazing chip, even if the processor is a standard Cortex-M3 that runs at up to 80 MHz (there are signs that Cypress is working on a PSoC7 with an M7 core and probably a price to match). The 64K RAM and 256K FLASH sizes are modest, but what makes this chip special are the configurable analog and digital blocks on the chip. There are three ADCs, one with high impedance buffers, two DACs and four opamps, all connected to an analog switching network. The digital side has a number of universal digital blocks to implement your own digital logic or preconfigured communication interfaces, and a digital filter block, DFB. This is a 24 bit datastream co-processor with a multiply-accumulate unit and an ALU. The analog and digital I/O can be run from 1.71 to 5.5 volts.

The vendor supplied toolchain PSoC Creator only runs under Windows, but it is not a bad experience, even though I am always skeptical of systems that generate code where it is hard to know what one can change and how. I often find it less trouble to implement stuff directly from the data sheet than to learn the ins and outs of the library calls. In order to use the extra analog and digital blocks in the chip I think it is really necessary to use the vendor supplied toolchain, and without these extras the chip is not very special.


Add the fact that the CY8CKIT-059 PSoC 5LP Prototyping Kit can be bought for $10, and this is definitely a system worth trying out, even if a USB connector made from four strips on top of the circuit board is not the most professional and stable connection method. That can be fixed with an old USB cable and a soldering iron, and the price will still be attractive. The on-board programmer and debugger is also a programmable PSoC 5LP and could be used for a project requiring very few I/O pins.

Getting started is quite easy: install some USB drivers and the PSoC Creator and PSoC Programmer software with example projects, then open an example project and hit the debug button. I started with the "CE95277 ADC and UART" project and soon had the board sending ADC samples over USB serial to my PC. Getting used to all the tools and panels takes a few days, but the help functions are easy to access and components have datasheets and code examples that open with a right click.

Friday, August 28, 2015

Getting started with the STM32F7 Discovery board

The other week I got a brand new STM32F7 Discovery board, the fabled Cortex M7 had arrived and performance bliss was at hand ...

... except there were a few minor stumbling stones on the road to paradise ...

This is not an ad for ST; I did pay the full Mouser price.

  • ST programming tools for the ST-Link interface are not available for Mac OS X, and the board does not expose the JTAG/SWD of the Cortex-M7; it is only available through the onboard ST-Link adapter chip. The GCC ARM Embedded toolchain was happy to generate code after importing some system defines and startup code from the STM32Cube_FW_F7_V1.1.0 library and a bit of tweaking of Makefiles and linker scripts. It was possible to flash the binaries to the board using the USB file system interface, but then there is no debug connection, and it refused to load code that was linked to use fast ITCM flash access.
  • With some work it was possible to add the chip and flash configuration details to the free stlink utility and also tweak the stlink RAM based flash writing code to handle the 64 bit flash access so that flash is correctly written, all running on Mac OS X.
  • The example code in the STM32Cube_FW_F7_V1.1.0 library does not by default use the external 25 MHz crystal clock, only the on-chip internal 16 MHz oscillator. The PLL is also not configured by default, meaning the chip runs at 16 MHz, and not the 196 or 216 MHz that should be possible.
  • For optimal performance the linker settings should be configured to make the code use the ART accelerated ITCM flash access at 0x00200000 instead of the normal, slow, flash access at 0x08000000. When the linker was configured this way the ST onboard USB flash loader refused to load code into flash, but the stlink utility worked.
  • The external SDRAM for the Discovery board has a different configuration than the one in the "system_stm32f7xx.c" supplied by ST. It seems ST has used the settings for another STM32F7 board and not been careful to make the changes for the F7 Discovery board.
Virtual USB COM port was quite easy; it turns out that the USB subsystem is almost identical to the one on STM32F4 chips.
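
For the clock configuration point above, the arithmetic behind a PLL setup can be sketched in C. The divider names (PLLM, PLLN, PLLP) follow the usual STM32 naming as I understand it, and the divider values in the usage note are only one combination that works out numerically. The limits on the intermediate VCO frequencies are not checked here and should be taken from the reference manual.

```c
#include <assert.h>
#include <stdint.h>

/* System clock produced by a PLL of the form
   f_out = f_source / PLLM * PLLN / PLLP.
   Pure arithmetic only, no register access and no range checks. */
uint32_t pll_sysclk_hz(uint32_t source_hz, uint32_t pllm,
                       uint32_t plln, uint32_t pllp)
{
    assert(pllm != 0 && pllp != 0);

    uint64_t vco_hz = (uint64_t)source_hz * plln / pllm;
    return (uint32_t)(vco_hz / pllp);
}
```

With the external 25 MHz crystal, pll_sysclk_hz(25000000, 25, 432, 2) gives 216 MHz. The same target from the internal 16 MHz oscillator would be pll_sysclk_hz(16000000, 16, 432, 2).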

After these tweaks I was able to run some simple tests where the chip executed between 200 and 300M instructions/sec. I think this can be a great chip for audio synthesis.

Next will be the big LCD display, but that must wait for when I have lots of free time again.

Monday, August 10, 2015

More USB and MIDI

Time goes fast. New and more powerful microprocessors are introduced, and the favourites of a year ago are fading and starting to collect dust in a drawer filled with useful but not quite exciting boards. These days the Cortex-M3 of the Maple board feels a bit old. The Teensy 3.1 is a great board and the Arduino type libraries and support are superb, but it has somewhat limited memory and no hardware floating point. The boards I use mostly for audio generation at the moment are the Netduino Plus 2 (without the Netduino bootloader and .NET libraries), the STM32F4 Discovery and the XMC4500 Relax Kit Lite. All of these have Cortex-M4F processors, more than 128KB RAM and 1024KB of flash, and they are all cheap and good value for money.

I need stable USB MIDI, and after wrangling with the device libraries and development platforms, with endless hierarchies of folders and files that are usually targeted at Windows and often lacking USB MIDI, I decided to write yet another lightweight USB implementation. I know this might sound stupid and a waste of effort, but having some free time due to vacation and a rather rainy summer, I went ahead.

The code is written from scratch and uses only some MCU specific header files borrowed unchanged from the manufacturer supplied code libraries. When time allows and I have done some more testing, a few minimal example programs will be added.

I have uploaded the code to my Cortal Dendrites repository on Github:

Happy Coding :)