I recently had a short look at the Forth programming language. Is it possible to do multithreading with synchronization primitives in Forth?

For example, is it possible to do n-by-n matrix multiplication with multiple threads in Forth? If so, what is the basic mechanism, or programming patterns?

For the stated goal the multi-threading has to be pre-emptive. Simple Forths have a PAUSE-ing task loop that runs tasks one after the other, never overlapping. That is surprisingly useful, but not in this case.

Modern professional Forths can do multi-threading, but I know of only one with special primitives that make it easier.

The example matrix multiplication given earlier is not a demonstration of multi-threading.

To my knowledge (*), only the iForth compiler has special multi-threading primitives (OCCAM based), and it comes with examples that really run x times faster on n-core processors (where x < n). For the matrix code I would use its PAR .. ENDPAR, with the threads accessing rows and columns that stay far apart in memory to prevent cache pollution. There is another primitive that automatically splits up DO-LOOPs for you, in the way needed for this task. An example of the PAR .. ENDPAR syntax for 8 threads is:

```
0 VALUE jj   \ first row of the block of 8 rows currently being computed

: mmul2 ( F: -- r )
  a3 /size DFLOATS ERASE   \ clear the result matrix
  /rsz 0 DO
    I TO jj
    PAR   \ eight STARTP .. ENDP sections, each updating its own row of a3
      STARTP /rsz 0 DO a1 jj /rsz * I + DFLOAT[] DF@ a2 I /rsz * DFLOAT[] a3 jj /rsz * DFLOAT[] /rsz DAXPY_sse2 LOOP ENDP
      STARTP /rsz 0 DO a1 jj 1+ /rsz * I + DFLOAT[] DF@ a2 I /rsz * DFLOAT[] a3 jj 1+ /rsz * DFLOAT[] /rsz DAXPY_sse2 LOOP ENDP
      STARTP /rsz 0 DO a1 jj 2+ /rsz * I + DFLOAT[] DF@ a2 I /rsz * DFLOAT[] a3 jj 2+ /rsz * DFLOAT[] /rsz DAXPY_sse2 LOOP ENDP
      STARTP /rsz 0 DO a1 jj 3 + /rsz * I + DFLOAT[] DF@ a2 I /rsz * DFLOAT[] a3 jj 3 + /rsz * DFLOAT[] /rsz DAXPY_sse2 LOOP ENDP
      STARTP /rsz 0 DO a1 jj 4 + /rsz * I + DFLOAT[] DF@ a2 I /rsz * DFLOAT[] a3 jj 4 + /rsz * DFLOAT[] /rsz DAXPY_sse2 LOOP ENDP
      STARTP /rsz 0 DO a1 jj 5 + /rsz * I + DFLOAT[] DF@ a2 I /rsz * DFLOAT[] a3 jj 5 + /rsz * DFLOAT[] /rsz DAXPY_sse2 LOOP ENDP
      STARTP /rsz 0 DO a1 jj 6 + /rsz * I + DFLOAT[] DF@ a2 I /rsz * DFLOAT[] a3 jj 6 + /rsz * DFLOAT[] /rsz DAXPY_sse2 LOOP ENDP
      STARTP /rsz 0 DO a1 jj 7 + /rsz * I + DFLOAT[] DF@ a2 I /rsz * DFLOAT[] a3 jj 7 + /rsz * DFLOAT[] /rsz DAXPY_sse2 LOOP ENDP
    ENDPAR
  8 +LOOP
  \ sum all elements of a3; the total is returned as a checksum
  0e a3 /size 0 ?DO DF@+ F+ LOOP DROP ;
```
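To make the data flow of the code above easier to follow, here is a minimal single-threaded sketch in standard Forth. It assumes the floating-point word set with a separate FP stack, as in gforth. It is not iForth code: `daxpy` is a hypothetical plain-Forth stand-in for `DAXPY_sse2`, with its stack effect guessed from how `DAXPY_sse2` is called above, and `n`, `ma`, `mb`, `mc`, `row`, `mmul-row`, `mmul-serial` and `demo` are names made up for illustration. The point it shows is that row jj of the result depends only on row jj of the first matrix, which is why each row can be handed to a different thread without locking.

```forth
\ Sketch only: standard Forth, floating-point (ext) word set assumed.
2 CONSTANT n                                   \ tiny n-by-n example
DFALIGN HERE n n * DFLOATS ALLOT CONSTANT ma   \ left operand
DFALIGN HERE n n * DFLOATS ALLOT CONSTANT mb   \ right operand
DFALIGN HERE n n * DFLOATS ALLOT CONSTANT mc   \ result

: row ( base r -- addr )  n * DFLOATS + ;      \ address of row r

\ y[i] := a*x[i] + y[i] for cnt elements (assumed meaning of DAXPY_sse2)
: daxpy ( x-addr y-addr cnt -- ) ( F: a -- )
  0 ?DO
    OVER DF@ FOVER F*          \ F: a a*x[i]
    DUP DF@ F+                 \ F: a a*x[i]+y[i]
    DUP DF!                    \ store into y[i], F: a
    SWAP DFLOAT+ SWAP DFLOAT+  \ advance both pointers
  LOOP 2DROP FDROP ;

\ One row of the product: mc[jj,:] += ma[jj,i] * mb[i,:] for every i.
\ Only row jj of mc is written, so separate rows are independent work.
: mmul-row ( jj -- )
  n 0 DO
    ma OVER row I DFLOATS + DF@     \ F: ma[jj,i]
    mb I row  mc 2 PICK row  n daxpy
  LOOP DROP ;

: mmul-serial ( -- )
  mc n n * DFLOATS ERASE            \ all-zero bits is 0.0 for IEEE doubles
  n 0 DO I mmul-row LOOP ;

\ Usage: ma = [[1 2] [3 4]], mb = identity, so mc should end up equal to ma.
: demo ( -- )
  1e ma 0 row DF!   2e ma 0 row DFLOAT+ DF!
  3e ma 1 row DF!   4e ma 1 row DFLOAT+ DF!
  1e mb 0 row DF!   0e mb 0 row DFLOAT+ DF!
  0e mb 1 row DF!   1e mb 1 row DFLOAT+ DF!
  mmul-serial ;
```

In the PAR .. ENDPAR version, each STARTP .. ENDP section plays the role of one `mmul-row` call, with the eight sections covering rows jj through jj+7.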

For 1024 x 1024 matrices the multi-threaded version (mmul2) is roughly 3.5 times faster than the single-threaded version (mmul1) in the run below: 0.114 versus 0.400 seconds.

```
FORTH> TESTS
DOT/AXPY using 64 bits floats.
Vector size = 1048576
mul0 (dot) : 6.8719411200000000000e+0013 0.133 seconds elapsed.
mul1 (dot_sse2) : 6.8719411200000000000e+0013 0.106 seconds elapsed.
mmul0 (axpy) : 5.6294941655040000004e+0014 0.981 seconds elapsed.
mmul1 (axpy_sse2) : 5.6294941655040000004e+0014 0.400 seconds elapsed.
mmul2 (Paxpy_sse2) : 5.6294941655040000004e+0014 0.114 seconds elapsed. ok
```

(*) Rumor has it that MPE and Forth Inc recently added similar functionality.