
Designing a Many-Core Processor

[Change log]
2024/08/14 Changed the processor core to one based on mini16sc
2024/06/02 Added the KV260 145-core version
2023/08/31 Added the Kria KV260 97-core version
2019/05/10 Added the 501-core version
2019/01/24 First published

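I implemented a many-core SoC using the Mini16SC-CPU.

Because each core consumes few resources, I was able to implement a 145-core processor on the Kria KV260.
In this configuration the registers and the data path are 32 bits wide.

Using the minimum-configuration core of the 16-bit Processor Element, I was able to implement 501 cores on the BeMicro-CVA9.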

Board                Processor cores     Clock frequency
Kria KV260           145                 200 MHz (runs on hardware)
Terasic DE0-CV       33                  140 MHz (runs on hardware)
Terasic DE0-CV       65 (16-bit PE)      140 MHz (runs on hardware)
BeMicro-CVA9         171                 100 MHz (runs on hardware)
BeMicro-CVA9         501 (16-bit PE)     100 MHz (runs on hardware)
Kintex UltraScale+   129                 500 MHz (no VGA, evaluated in Vivado)

SoC configuration


About the target boards

This project supports the following FPGA boards:
Terasic DE0-CV
BeMicro-CVA9
Kria KV260

I/O voltage jumper settings

● BeMicro CV A9
On the BeMicro CV A9, the design assumes that the board's I/O voltage is set to 3.3 V.
Refer to p. 23 of the BeMicro CV A9 Hardware Reference Guide and confirm that pin 1 and pin 2 of the VCCIO selection jumper (J11) are connected.

Synthesis and execution

Download the source code: mini16_manycore.tar.gz

Source code for the BeMicro CV A9 and DE0-CV (Mini16CPU core version): download (github)

In a terminal, extract the archive:

tar xf mini16_manycore.tar.gz

Move to the directory for your board and run make.

cd mini16_manycore/<board directory name>

make

Then open the project file in the vendor's tool, synthesize the design, and transfer it to the board.

Project files:
Terasic DE0-CV: mini16_manycore/de0-cv/DE0_CV_start.qpf
BeMicro-CVA9: mini16_manycore/bemicro_cva9/bemicro_cva9_start.qpf
Kria KV260: mini16_manycore/kv260/project_1/project_1.xpr

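For the DE0-CV and BeMicro-CVA9: the clock is set fairly high, so depending on the synthesis tool's random seed the design may not meet timing. If that happens, increase Assignments:Settings:Compiler Settings:Advanced Settings:Fitter Initial Placement Seed in Quartus by 1 and try again, repeating a few times. A placement and routing result that meets timing should usually turn up within about 10 attempts.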

Simulation with the Verilog simulator Icarus Verilog

With Icarus Verilog you can develop and simulate the design without an FPGA board.
Install iverilog and gtkwave by following the steps in "Using the Icarus Verilog compiler", then run:

cd mini16_manycore/testbench

make run
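(for the 16-bit version, use make run16)
This runs the simulation. Open the generated wave.vcd in gtkwave, then drag and drop the signals you want to see from the signal list on the left into the waveform pane on the right to inspect them.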

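Connecting to a Raspberry Pi or PC

The FPGA can be connected over UART to a Raspberry Pi, or to a PC with a USB serial cable, so that programs can be transferred and run from there.

Connecting other I/O

Transferring and running programs over UART

On the Raspberry Pi or PC set up as described above, run: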

cd mini16_manycore/<board directory name>

make run

This compiles the tools, compiles the program, transfers it, and runs it.

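How to program this CPU

A simple assembler that runs on Java is included under mini16_manycore/asm.
OpenJDK 8.0 or later must be installed to run it.
Create a class that extends the AsmLib class, and write the initialization settings in init(), the program in program(), and the data in data(). AsmTop.java must be modified as well.
Move to the mini16_manycore directory and run make to generate the program binaries (default_code_mem.v, default_data_mem.v).
When using UART, make run transfers the program after building it.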

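Parallel program example: drawing the Mandelbrot set

A demo program that draws the Mandelbrot set is included under mini16_manycore/asm.
MasterProgram.java is the program for the master core; it controls the PEs.
PEProgram.java is the program for the PEs; it computes the Mandelbrot set and draws to the frame buffer.
BootProgram.java is a master core program that transfers the PE program to the PEs. When the UART interface is used, this program runs first, followed by the program in MasterProgram.java. (See mini16_manycore/tools/Makefile.)
If the board is connected to a PC over UART, running make run under the mini16_manycore directory compiles, transfers, and runs all of the programs.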

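In the 16-bit version, a program that fills the screen with colors runs instead of the Mandelbrot set.
By default it draws without waiting for vertical sync, so flickering stripes are visible.
In asm/MasterProgram.java:
Changing private int DEBUG = 0; to 1 adds a wait and outputs the frame count over UART.
Changing private int WAIT_VSYNC = 0; to 1 makes the program wait for vertical sync before drawing the next frame, which removes the stripes.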
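
The text above does not say how the demo divides the image among the PEs. As a rough illustration of one way a many-core Mandelbrot renderer can split the work, the sketch below hands out rows to cores in an interleaved pattern and runs an integer escape-time loop. The class name, the Q12 fixed-point format, and the row-interleaving scheme are assumptions made for this example; they are not taken from PEProgram.java.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: this is not the code from PEProgram.java.
final class MandelbrotSplitSketch {
  static final int FRAC_BITS = 12;          // assumed Q12 fixed-point format
  static final int ONE = 1 << FRAC_BITS;

  // Rows handled by one core when rows are interleaved across all cores.
  static List<Integer> rowsForCore(int coreId, int cores, int height) {
    List<Integer> rows = new ArrayList<>();
    for (int y = coreId; y < height; y += cores) {
      rows.add(y);
    }
    return rows;
  }

  // Escape-time iteration count for c = (cx, cy), both given in Q12.
  // Intended for |cx|, |cy| <= 2.0 so the 32-bit products do not overflow.
  static int iterations(int cx, int cy, int maxIter) {
    int x = 0;
    int y = 0;
    for (int i = 0; i < maxIter; i++) {
      int xx = (x * x) >> FRAC_BITS;
      int yy = (y * y) >> FRAC_BITS;
      if (xx + yy > 4 * ONE) {
        return i;                           // |z|^2 > 4: the point escaped
      }
      int xy = (x * y) >> FRAC_BITS;
      x = xx - yy + cx;                     // Re(z^2 + c)
      y = 2 * xy + cy;                      // Im(z^2 + c)
    }
    return maxIter;                         // treated as inside the set
  }

  public static void main(String[] args) {
    System.out.println("rows for core 1 of 4: " + rowsForCore(1, 4, 8));
    System.out.println("iterations at c = 0: " + iterations(0, 0, 50));
  }
}
```

With interleaved rows, every core gets a similar mix of cheap and expensive rows, which tends to balance the load better than giving each core one contiguous band.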

Source code

This source code is released under the BSD 2-Clause license. For the complete source code, download mini16_manycore.tar.gz or see
https://github.com/miya4649/mini16_manycore.

mini16_pe.v : Processor Element
// SPDX-License-Identifier: BSD-2-Clause
// Copyright (c) 2019 miya All rights reserved.

module mini16_pe
  #(
    parameter WIDTH_D = 16,
    parameter DEPTH_I = 8,
    parameter DEPTH_D = 8,
    parameter DEPTH_M2S = 8,
    parameter DEPTH_FIFO = 7,
    parameter CORE_ID = 0,
    parameter MASTER_W_BANK_BC = 63,
    parameter DEPTH_V_F = 16,
    parameter DEPTH_B_F = 15,
    parameter DEPTH_V_M = 17,
    parameter DEPTH_B_M = 11,
    parameter DEPTH_V_S_R = 10,
    parameter DEPTH_B_S_R = 8,
    parameter DEPTH_V_S_W = 9,
    parameter DEPTH_B_S_W = 8,
    parameter DEPTH_V_M2S = 9,
    parameter DEPTH_B_M2S = 8,
    parameter FIFO_RAM_TYPE = "auto",
    parameter M2S_RAM_TYPE = "auto",
    parameter DEPTH_REG = 5
    )
  (
   input                          clk,
   input                          reset,
   input                          soft_reset,
   input                          fifo_req_r,
   output                         fifo_valid,
   output [WIDTH_D+DEPTH_V_F-1:0] fifo_r_data,
   input [DEPTH_V_M-1:0]          addr_i,
   input [WIDTH_D-1:0]            data_i,
   input                          we_i
   );

  localparam WIDTH_I = 16;
  localparam TRUE = 1'b1;
  localparam FALSE = 1'b0;
  localparam ONE = 1'd1;
  localparam ZERO = 1'd0;
  localparam FFFF = {WIDTH_D{1'b1}};

  wire [DEPTH_I-1:0]     cpu_i_r_addr;
  wire [WIDTH_I-1:0]     cpu_i_r_data;
  wire [DEPTH_V_S_W-1:0] cpu_d_r_addr;
  reg [WIDTH_D-1:0]      cpu_d_r_data;
  wire [DEPTH_V_S_W-1:0] cpu_d_w_addr;
  wire [WIDTH_D-1:0]     cpu_d_w_data;
  wire                   cpu_d_we;
  wire [DEPTH_V_S_W-DEPTH_B_S_W-1:0] cpu_d_w_bank;
  wire [DEPTH_V_S_R-DEPTH_B_S_R-1:0] cpu_d_r_bank;

  // cpu data write
  reg [DEPTH_D-1:0]  mem_d_w_addr;
  reg [WIDTH_D-1:0]  mem_d_w_data;
  reg                mem_d_we;
  assign cpu_d_w_bank = cpu_d_w_addr[DEPTH_V_S_W-1:DEPTH_B_S_W];
  always @(posedge clk)
    begin
      mem_d_w_addr <= cpu_d_w_addr[DEPTH_D-1:0];
      mem_d_w_data <= cpu_d_w_data;
      s2mfifo_data_w <= {cpu_d_w_addr[DEPTH_V_F-1:0], cpu_d_w_data};
      if (cpu_d_we == TRUE)
        begin
          case (cpu_d_w_bank)
            0:
              begin
                // mem_d
                mem_d_we <= TRUE;
                s2mfifo_we <= FALSE;
              end
            default:
              begin
                // fifo
                mem_d_we <= FALSE;
                s2mfifo_we <= TRUE;
              end
          endcase
        end
      else
        begin
          mem_d_we <= FALSE;
          s2mfifo_we <= FALSE;
        end
    end

  // cpu data read
  wire [DEPTH_D-1:0] mem_d_r_addr;
  wire [WIDTH_D-1:0] mem_d_r_data;
  wire [WIDTH_D-1:0] shared_m2s_r_data;
  assign mem_d_r_addr = cpu_d_r_addr[DEPTH_D-1:0];
  assign cpu_d_r_bank = cpu_d_r_addr[DEPTH_V_S_R-1:DEPTH_B_S_R];
  always @(posedge clk)
    begin
      case (cpu_d_r_bank)
        // mem_d
        0: cpu_d_r_data <= mem_d_r_data;
        // shared_m2s
        1: cpu_d_r_data <= shared_m2s_r_data;
        // register
        default: cpu_d_r_data <= s2mfifo_item_count;
      endcase
    end

  // data from master
  reg shared_m2s_we;
  reg mem_i_we;
  reg [DEPTH_V_M-1:0] addr_i_d1;
  reg [WIDTH_D-1:0]   data_i_d1;
  reg [DEPTH_V_M-1:0] addr_i_d2;
  reg [WIDTH_D-1:0]   data_i_d2;
  reg                 we_i_d1;
  wire [DEPTH_V_M-DEPTH_B_M-1:0] core_bank;
  wire [DEPTH_V_M2S-DEPTH_B_M2S-1:0] m2s_bank;
  assign core_bank = addr_i_d1[DEPTH_V_M-1:DEPTH_B_M];
  assign m2s_bank = addr_i_d1[DEPTH_V_M2S-1:DEPTH_B_M2S];

  always @(posedge clk)
    begin
      addr_i_d1 <= addr_i;
      data_i_d1 <= data_i;
      addr_i_d2 <= addr_i_d1;
      data_i_d2 <= data_i_d1;
      we_i_d1 <= we_i;
    end

  always @(posedge clk)
    begin
      if ((we_i_d1 == TRUE) && ((core_bank == CORE_ID) || (core_bank == MASTER_W_BANK_BC)))
        begin
          case (m2s_bank)
            0:
              begin
                shared_m2s_we <= TRUE;
                mem_i_we <= FALSE;
              end
            default:
              begin
                shared_m2s_we <= FALSE;
                mem_i_we <= TRUE;
              end
          endcase
        end
      else
        begin
          shared_m2s_we <= FALSE;
          mem_i_we <= FALSE;
        end
    end

  mini16sc_cpu
    #(
      .WIDTH_I (WIDTH_I),
      .WIDTH_D (WIDTH_D),
      .DEPTH_I (DEPTH_I),
      .DEPTH_D (DEPTH_V_S_W),
      .DEPTH_REG (DEPTH_REG)
      )
  mini16sc_cpu_0
    (
     .clk (clk),
     .reset (reset),
     .soft_reset (soft_reset),
     .mem_i_r_addr (cpu_i_r_addr),
     .mem_i_r_data (cpu_i_r_data),
     .mem_d_r_addr (cpu_d_r_addr),
     .mem_d_r_data (cpu_d_r_data),
     .mem_d_w_addr (cpu_d_w_addr),
     .mem_d_w_data (cpu_d_w_data),
     .mem_d_we (cpu_d_we)
     );

  default_pe_code_mem
    #(
      .DATA_WIDTH (WIDTH_I),
      .ADDR_WIDTH (DEPTH_I)
      )
  mem_i
    (
     .clk (clk),
     .addr_r (cpu_i_r_addr),
     .addr_w (addr_i_d2[DEPTH_I-1:0]),
     .data_in (data_i_d2[WIDTH_I-1:0]),
     .we (mem_i_we),
     .data_out (cpu_i_r_data)
     );

  default_pe_data_mem
    #(
      .DATA_WIDTH (WIDTH_D),
      .ADDR_WIDTH (DEPTH_D)
      )
  mem_d
    (
     .clk (clk),
     .addr_r (mem_d_r_addr),
     .addr_w (mem_d_w_addr),
     .data_in (mem_d_w_data),
     .we (mem_d_we),
     .data_out (mem_d_r_data)
     );

  rw_port_ram
    #(
      .DATA_WIDTH (WIDTH_D),
      .ADDR_WIDTH (DEPTH_M2S),
      .RAM_TYPE (M2S_RAM_TYPE)
      )
  shared_m2s
    (
     .clk (clk),
     .addr_r (mem_d_r_addr[DEPTH_M2S-1:0]),
     .addr_w (addr_i_d2[DEPTH_M2S-1:0]),
     .data_in (data_i_d2),
     .we (shared_m2s_we),
     .data_out (shared_m2s_r_data)
     );

  reg s2mfifo_we;
  reg [WIDTH_D+DEPTH_V_F-1:0] s2mfifo_data_w;
  wire [DEPTH_FIFO-1:0] s2mfifo_item_count;
  fifo
    #(
      .WIDTH (WIDTH_D+DEPTH_V_F),
      .DEPTH_IN_BITS (DEPTH_FIFO),
      .MAX_ITEMS (((1 << DEPTH_FIFO) - 7)),
      .RAM_TYPE (FIFO_RAM_TYPE)
      )
  s2mfifo
    (
     .clk (clk),
     .reset (reset),
     .req_r (fifo_req_r),
     .we (s2mfifo_we),
     .data_w (s2mfifo_data_w),
     .data_r (fifo_r_data),
     .valid_r (fifo_valid),
     .full (),
     .item_count (s2mfifo_item_count),
     .empty ()
     );

endmodule
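
The PE listing above selects the target of each CPU data access from the upper bits of the address. The following is a small behavioral model of that decode, written in Java for readability. The class, enum, and method names are made up for this sketch; the bank numbering follows the case statements in mini16_pe.v, and the widths used in main are the module's default parameter values.

```java
// Hypothetical behavioral model of the bank decode in mini16_pe.v.
final class PeBankDecodeSketch {
  enum ReadTarget { MEM_D, SHARED_M2S, FIFO_ITEM_COUNT }
  enum WriteTarget { MEM_D, S2M_FIFO }

  // Extracts addr[depthV-1:depthB], the bank field of a virtual address.
  static int bank(int addr, int depthV, int depthB) {
    return (addr >>> depthB) & ((1 << (depthV - depthB)) - 1);
  }

  // Mirrors the "cpu data read" case: 0 = mem_d, 1 = shared_m2s,
  // anything else = FIFO item count register.
  static ReadTarget readTarget(int addr, int depthVSR, int depthBSR) {
    switch (bank(addr, depthVSR, depthBSR)) {
      case 0:
        return ReadTarget.MEM_D;
      case 1:
        return ReadTarget.SHARED_M2S;
      default:
        return ReadTarget.FIFO_ITEM_COUNT;
    }
  }

  // Mirrors the "cpu data write" case: bank 0 = mem_d, anything else = FIFO.
  static WriteTarget writeTarget(int addr, int depthVSW, int depthBSW) {
    return bank(addr, depthVSW, depthBSW) == 0
        ? WriteTarget.MEM_D
        : WriteTarget.S2M_FIFO;
  }

  public static void main(String[] args) {
    // Default mini16_pe parameters: DEPTH_V_S_R = 10, DEPTH_B_S_R = 8,
    // DEPTH_V_S_W = 9, DEPTH_B_S_W = 8.
    System.out.println(readTarget(0x105, 10, 8));
    System.out.println(writeTarget(0x100, 9, 8));
  }
}
```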
mini16_soc.v : SoC
// SPDX-License-Identifier: BSD-2-Clause
// Copyright (c) 2019 miya All rights reserved.

module mini16_soc
  #(
    parameter CORES = 32,
    parameter UART_CLK_HZ = 50000000,
    parameter UART_SCLK_HZ = 115200,
    parameter WIDTH_M_D = 32,
    parameter WIDTH_P_D = 32,
    parameter DEPTH_M_I = 11,
    parameter DEPTH_M_D = 11,
    parameter DEPTH_P_I = 10,
    parameter DEPTH_P_D = 8,
    parameter DEPTH_M2S = 8,
    parameter DEPTH_FIFO = 4,
    parameter DEPTH_S2M = 9,
    parameter DEPTH_U2M = 11,
    parameter VRAM_BPP = 3,
    parameter VRAM_WIDTH_BITS = 8,
    parameter VRAM_HEIGHT_BITS = 9,
    parameter PE_FIFO_RAM_TYPE = "auto",
    parameter PE_M2S_RAM_TYPE = "auto",
    parameter VRAM_RAM_TYPE = "auto",
    parameter PE_DEPTH_REG = 5
    )
  (
   input                 clk,
   input                 reset,
`ifdef USE_UART
   input                 uart_rxd,
   output                uart_txd,
`endif
`ifdef USE_VGA
   input                 clkv,
   input                 resetv,
   output                vga_hs,
   output                vga_vs,
   output                vga_de,
   output [VRAM_BPP-1:0] vga_color,
`endif
   output [15:0]         led
   );

  // instruction width
  localparam WIDTH_I = 16;
  // register file depth
  localparam DEPTH_REG = 5;
  // I/O register depth
  localparam DEPTH_IO_REG = 5;
  localparam DEPTH_VRAM = (VRAM_WIDTH_BITS + VRAM_HEIGHT_BITS);
  // UART I/O addr depth
  localparam DEPTH_B_U = max(DEPTH_M_I, DEPTH_U2M);
  // UART I/O Virtual memory depth
  localparam DEPTH_V_U = (DEPTH_B_U + 2);
  localparam CORE_BITS = $clog2(CORES + 6);
  localparam DEPTH_B_F = max(DEPTH_VRAM, DEPTH_S2M);
  localparam DEPTH_B_M2S = max(DEPTH_P_I, DEPTH_M2S);
  localparam DEPTH_V_M2S = (DEPTH_B_M2S + 1);
  // Master write addr depth
  localparam DEPTH_B_M_W = max(DEPTH_V_M2S, max(DEPTH_M_D, DEPTH_IO_REG));
  // Master read addr depth
  localparam DEPTH_B_M_R = max(DEPTH_M_D, max(DEPTH_IO_REG, max(DEPTH_U2M, DEPTH_S2M)));
  // Master virtual memory write depth
  localparam DEPTH_V_M_W = (DEPTH_B_M_W + CORE_BITS);
  // Master virtual memory read depth
  localparam DEPTH_V_M_R = (DEPTH_B_M_R + 2);
  localparam DEPTH_V_F = (DEPTH_B_F + 1);
  localparam DEPTH_V_M = max(DEPTH_V_M_W, DEPTH_V_M_R);
  localparam DEPTH_B_S_R = max(DEPTH_P_D, DEPTH_M2S);
  localparam DEPTH_V_S_R = (DEPTH_B_S_R + 2);
  localparam DEPTH_B_S_W = max(DEPTH_V_F, DEPTH_P_D);
  localparam DEPTH_V_S_W = (DEPTH_B_S_W + 1);
  localparam PE_ID_START = 4;

  localparam MASTER_W_BANK_BC = ((1 << CORE_BITS) - 1);
  localparam MASTER_W_BANK_MEM_D = 0;
  localparam MASTER_W_BANK_IO_REG = 1;
  localparam MASTER_R_BANK_MEM_D = 0;
  localparam MASTER_R_BANK_IO_REG = 1;
  localparam MASTER_R_BANK_U2M = 2;
  localparam MASTER_R_BANK_S2M = 3;
  localparam UART_IO_ADDR_RESET = ((1 << DEPTH_B_U) + 0);
  localparam UART_BANK_MEM_I = 0;
  localparam UART_BANK_U2M = 2;
  localparam FIFO_BANK_S2M = 0;
  localparam FIFO_BANK_VRAM = 1;
  localparam IO_REG_R_UART_BUSY = 0;
  localparam IO_REG_R_VGA_VSYNC = 1;
  localparam IO_REG_R_VGA_VCOUNT = 2;
  localparam IO_REG_W_RESET_PE = 0;
  localparam IO_REG_W_LED = 1;
  localparam IO_REG_W_UART = 2;
  localparam IO_REG_W_SPRITE_X = 3;
  localparam IO_REG_W_SPRITE_Y = 4;
  localparam IO_REG_W_SPRITE_SCALE = 5;

  localparam TRUE = 1'b1;
  localparam FALSE = 1'b0;
  localparam ONE = 1'd1;
  localparam ZERO = 1'd0;

  function integer max (input integer a1, input integer a2);
    begin
      if (a1 > a2)
        begin
          max = a1;
        end
      else
        begin
          max = a2;
        end
    end
  endfunction

  // LED
  assign led = io_reg_w[IO_REG_W_LED];

  // Master IO reg
  reg [WIDTH_M_D-1:0] io_reg_r[0:((1 << DEPTH_IO_REG) - 1)];
  reg [WIDTH_M_D-1:0] io_reg_w[0:((1 << DEPTH_IO_REG) - 1)];

  // Master read
  wire [DEPTH_V_M_R-DEPTH_B_M_R-1:0] master_d_r_bank;
  assign master_d_r_bank = master_d_r_addr[DEPTH_V_M_R-1:DEPTH_B_M_R];
  always @(posedge clk)
    begin
      case (master_d_r_bank)
        MASTER_R_BANK_MEM_D:
          begin
            master_d_r_data <= master_mem_d_r_data;
          end
        MASTER_R_BANK_IO_REG:
          begin
            master_d_r_data <= io_reg_r[master_d_r_addr[DEPTH_IO_REG-1:0]];
          end
`ifdef USE_UART
        MASTER_R_BANK_U2M:
          begin
            master_d_r_data <= u2m_r_data;
          end
`endif
        default:
          begin
            master_d_r_data <= {{(WIDTH_M_D-WIDTH_P_D){1'b0}}, s2m_r_data};
          end
      endcase
    end

  // Master mem_d write
  reg [DEPTH_V_M_W-1:0] master_d_w_addr_d1;
  reg [WIDTH_M_D-1:0] master_d_w_data_d1;
  reg                 master_d_we_d1;
  always @(posedge clk)
    begin
      master_d_w_addr_d1 <= master_d_w_addr;
      master_d_w_data_d1 <= master_d_w_data;
      master_d_we_d1 <= master_d_we;
    end

  always @(posedge clk)
    begin
      if (reset == TRUE)
        begin
          master_mem_d_we <= FALSE;
        end
      else
        begin
          if ((master_d_we == TRUE) && (master_d_w_bank == MASTER_W_BANK_MEM_D))
            begin
              master_mem_d_we <= TRUE;
            end
          else
            begin
              master_mem_d_we <= FALSE;
            end
        end
    end

  // Master IO reg read
  always @(posedge clk)
    begin
`ifdef USE_UART
      io_reg_r[IO_REG_R_UART_BUSY] <= uart_io_busy;
`endif
`ifdef USE_VGA
      io_reg_r[IO_REG_R_VGA_VSYNC] <= vga_vsync;
      io_reg_r[IO_REG_R_VGA_VCOUNT] <= vga_vcount;
`endif
    end

  // Master IO reg write
  wire [WIDTH_M_D-1:0] io_reg_w_data;
  wire [DEPTH_IO_REG-1:0] io_reg_w_addr;
  reg io_reg_we;
  assign io_reg_w_data = master_d_w_data_d1;
  assign io_reg_w_addr = master_d_w_addr_d1[DEPTH_IO_REG-1:0];
  always @(posedge clk)
    begin
      if ((master_d_we == TRUE) && (master_d_w_bank == MASTER_W_BANK_IO_REG))
        begin
          io_reg_we <= TRUE;
        end
      else
        begin
          io_reg_we <= FALSE;
        end
      if (io_reg_we == TRUE)
        begin
          io_reg_w[io_reg_w_addr] <= io_reg_w_data;
        end
    end

`ifdef USE_UART
  // Master IO reg write: UART TX we
  always @(posedge clk)
    begin
      if (reset == TRUE)
        begin
          uart_io_tx_we <= FALSE;
        end
      else
        begin
          if ((io_reg_we == TRUE) && (io_reg_w_addr == IO_REG_W_UART))
            begin
              uart_io_tx_we <= TRUE;
            end
          else
            begin
              uart_io_tx_we <= FALSE;
            end
        end
    end
`endif

  // harvester
  reg [DEPTH_V_F-1:0] s2m_w_addr;
  reg [WIDTH_P_D-1:0] s2m_w_data;
  reg s2m_we;
  reg vram_we;
  wire [DEPTH_V_F-DEPTH_B_F-1:0] harvester_w_bank;
  assign harvester_w_bank = harvester_w_addr[DEPTH_V_F-1:DEPTH_B_F];
  always @(posedge clk)
    begin
      s2m_w_addr <= harvester_w_addr;
      s2m_w_data <= harvester_w_data;
      if (harvester_we == TRUE)
        begin
          if (harvester_w_bank == FIFO_BANK_S2M)
            begin
              s2m_we <= TRUE;
              vram_we <= FALSE;
            end
          else
            begin
              s2m_we <= FALSE;
              vram_we <= TRUE;
            end
        end
      else
        begin
          s2m_we <= FALSE;
          vram_we <= FALSE;
        end
    end

  wire harvester_r_valid [0:CORES-1];
  wire [WIDTH_P_D+DEPTH_V_F-1:0] harvester_r_data [0:CORES-1];
  wire [CORES-1:0] harvester_r_req;
  wire [DEPTH_V_F-1:0] harvester_w_addr;
  wire [WIDTH_P_D-1:0] harvester_w_data;
  wire harvester_we;
  wire [CORE_BITS-1:0] harvester_cs;

  harvester
    #(
      .CORE_BITS (CORE_BITS),
      .CORES (CORES),
      .WIDTH (WIDTH_P_D),
      .DEPTH (DEPTH_V_F)
      )
  harvester_0
    (
     .clk (clk),
     .reset (reset),
     .cs (harvester_cs),
     .r_data (harvester_r_data[harvester_cs]),
     .r_valid (harvester_r_valid[harvester_cs]),
     .r_req (harvester_r_req),
     .w_addr (harvester_w_addr),
     .w_data (harvester_w_data),
     .we (harvester_we)
     );

  wire [WIDTH_P_D-1:0] s2m_r_data;
  rw_port_ram
    #(
      .DATA_WIDTH (WIDTH_P_D),
      .ADDR_WIDTH (DEPTH_S2M)
      )
  shared_s2m
    (
     .clk (clk),
     .addr_r (master_d_r_addr[DEPTH_S2M-1:0]),
     .addr_w (s2m_w_addr[DEPTH_S2M-1:0]),
     .data_in (s2m_w_data),
     .we (s2m_we),
     .data_out (s2m_r_data)
     );

`ifdef USE_UART
  // UART IO: write to mem_i
  reg uart_io_tx_we;
  wire uart_io_busy;
  wire [31:0] uart_io_rx_addr;
  wire [31:0] uart_io_rx_data;
  reg [31:0] uart_io_rx_addr_d1;
  reg [31:0] uart_io_rx_data_d1;
  wire uart_io_rx_we;
  reg master_mem_i_we;
  wire [DEPTH_V_U-DEPTH_B_U-1:0] uart_io_rx_bank;
  assign uart_io_rx_bank = uart_io_rx_addr[DEPTH_V_U-1:DEPTH_B_U];

  always @(posedge clk)
    begin
      uart_io_rx_addr_d1 <= uart_io_rx_addr;
      uart_io_rx_data_d1 <= uart_io_rx_data;
    end

  always @(posedge clk)
    begin
      if (reset == TRUE)
        begin
          master_mem_i_we <= FALSE;
        end
      else
        begin
          if ((uart_io_rx_we == TRUE) && (uart_io_rx_bank == UART_BANK_MEM_I))
            begin
              master_mem_i_we <= TRUE;
            end
          else
            begin
              master_mem_i_we <= FALSE;
            end
        end
    end

  // u2m write
  always @(posedge clk)
    begin
      if (reset == TRUE)
        begin
          u2m_we <= FALSE;
        end
      else
        begin
          if ((uart_io_rx_we == TRUE) && (uart_io_rx_bank == UART_BANK_U2M))
            begin
              u2m_we <= TRUE;
            end
          else
            begin
              u2m_we <= FALSE;
            end
        end
    end

  // UART IO: reset master
  reg reset_master;
  always @(posedge clk)
    begin
      if (reset == TRUE)
        begin
          reset_master <= FALSE;
        end
      else
        begin
          if ((uart_io_rx_we == TRUE) && (uart_io_rx_addr == UART_IO_ADDR_RESET))
            begin
              reset_master <= uart_io_rx_data[0];
            end
        end
    end

  uart_io
    #(
      .CLK_HZ (UART_CLK_HZ),
      .SCLK_HZ (UART_SCLK_HZ)
      )
  uart_io_0
    (
     .clk (clk),
     .reset (reset),
     .uart_rxd (uart_rxd),
     .tx_data (io_reg_w[IO_REG_W_UART][7:0]),
     .tx_we (uart_io_tx_we),
     .uart_txd (uart_txd),
     .uart_busy (uart_io_busy),
     .rx_addr (uart_io_rx_addr),
     .rx_data (uart_io_rx_data),
     .rx_we (uart_io_rx_we)
     );
`endif

`ifdef USE_VGA
  // sprite
  localparam SPRITE_BPP = 3;
  wire [SPRITE_BPP-1:0] color_all;
  // vga
  wire                  vga_vsync;
  wire [WIDTH_M_D-1:0]  vga_vcount;
  wire [32-1:0]         ext_vga_count_h;
  wire [32-1:0]         ext_vga_count_v;

  sprite
    #(
      .SPRITE_WIDTH_BITS (VRAM_WIDTH_BITS),
      .SPRITE_HEIGHT_BITS (VRAM_HEIGHT_BITS),
      .BPP (SPRITE_BPP),
      .RAM_TYPE (VRAM_RAM_TYPE)
      )
  sprite_0
    (
     .clk (clk),
     .reset (reset),
     .bitmap_length (),
     .bitmap_address (s2m_w_addr[DEPTH_VRAM-1:0]),
     .bitmap_din (s2m_w_data[VRAM_BPP-1:0]),
     .bitmap_dout (),
     .bitmap_we (vram_we),
     .bitmap_oe (FALSE),
     .x (io_reg_w[IO_REG_W_SPRITE_X]),
     .y (io_reg_w[IO_REG_W_SPRITE_Y]),
     .scale (io_reg_w[IO_REG_W_SPRITE_SCALE]),
     .ext_clkv (clkv),
     .ext_resetv (resetv),
     .ext_color (color_all),
     .ext_count_h (ext_vga_count_h),
     .ext_count_v (ext_vga_count_v)
     );

  vga_iface
    #(
  `ifdef VGA_720P
      .VGA_MAX_H (1650-1),
      .VGA_MAX_V (750-1),
      .VGA_WIDTH (1280),
      .VGA_HEIGHT (720),
      .VGA_SYNC_H_START (1390),
      .VGA_SYNC_V_START (725),
      .VGA_SYNC_H_END (1430),
      .VGA_SYNC_V_END (730),
      .PIXEL_DELAY (2),
  `else
      .VGA_MAX_H (800-1),
      .VGA_MAX_V (525-1),
      .VGA_WIDTH (640),
      .VGA_HEIGHT (480),
      .VGA_SYNC_H_START (656),
      .VGA_SYNC_V_START (490),
      .VGA_SYNC_H_END (752),
      .VGA_SYNC_V_END (492),
      .PIXEL_DELAY (2),
  `endif
  `ifdef CLIPV512
      .CLIP_ENABLE (1),
      .CLIP_V_E (512),
  `endif
      .BPP (VRAM_BPP)
      )
  vga_iface_0
    (
     .clk (clk),
     .reset (reset),
     .vsync (vga_vsync),
     .vcount (vga_vcount),
     .ext_clkv (clkv),
     .ext_resetv (resetv),
     .ext_color_in (color_all),
     .ext_vga_hs (vga_hs),
     .ext_vga_vs (vga_vs),
     .ext_vga_de (vga_de),
     .ext_vga_color_out (vga_color),
     .ext_count_h (ext_vga_count_h),
     .ext_count_v (ext_vga_count_v)
     );
`endif

  // Master core
  wire [DEPTH_V_M_W-1:0] master_d_w_addr;
  wire [WIDTH_M_D-1:0] master_d_w_data;
  wire master_d_we;
  wire [DEPTH_M_I-1:0] master_i_r_addr;
  wire [WIDTH_I-1:0] master_i_r_data;
  wire [DEPTH_V_M_R-1:0] master_d_r_addr;
  reg [WIDTH_M_D-1:0] master_d_r_data;
  wire [DEPTH_V_M_W-DEPTH_B_M_W-1:0] master_d_w_bank;
  assign master_d_w_bank = master_d_w_addr[DEPTH_V_M_W-1:DEPTH_B_M_W];
  mini16sc_cpu
    #(
      .WIDTH_I (WIDTH_I),
      .WIDTH_D (WIDTH_M_D),
      .DEPTH_I (DEPTH_M_I),
      .DEPTH_D (DEPTH_V_M),
      .DEPTH_REG (DEPTH_REG)
      )
  mini16sc_cpu_master
    (
     .clk (clk),
`ifdef USE_UART
     .soft_reset (reset_master),
`else
     .soft_reset (FALSE),
`endif
     .reset (reset),
     .mem_i_r_addr (master_i_r_addr),
     .mem_i_r_data (master_i_r_data),
     .mem_d_r_addr (master_d_r_addr),
     .mem_d_r_data (master_d_r_data),
     .mem_d_w_addr (master_d_w_addr),
     .mem_d_w_data (master_d_w_data),
     .mem_d_we (master_d_we)
     );

  default_master_code_mem
    #(
      .DATA_WIDTH (WIDTH_I),
      .ADDR_WIDTH (DEPTH_M_I)
      )
  master_mem_i
    (
     .clk (clk),
     .addr_r (master_i_r_addr),
`ifdef USE_UART
     .addr_w (uart_io_rx_addr_d1[DEPTH_M_I-1:0]),
     .data_in (uart_io_rx_data_d1[WIDTH_I-1:0]),
     .we (master_mem_i_we),
`else
     .addr_w ({DEPTH_M_I{1'b0}}),
     .data_in ({WIDTH_I{1'b0}}),
     .we (FALSE),
`endif
     .data_out (master_i_r_data)
     );

  wire [WIDTH_M_D-1:0] master_mem_d_r_data;
  reg master_mem_d_we;
  default_master_data_mem
    #(
      .DATA_WIDTH (WIDTH_M_D),
      .ADDR_WIDTH (DEPTH_M_D)
      )
  master_mem_d
    (
     .clk (clk),
     .addr_r (master_d_r_addr[DEPTH_M_D-1:0]),
     .addr_w (master_d_w_addr_d1[DEPTH_M_D-1:0]),
     .data_in (master_d_w_data_d1),
     .we (master_mem_d_we),
     .data_out (master_mem_d_r_data)
     );

`ifdef USE_UART
  reg u2m_we;
  wire [WIDTH_M_D-1:0] u2m_r_data;
  rw_port_ram
    #(
      .DATA_WIDTH (WIDTH_M_D),
      .ADDR_WIDTH (DEPTH_U2M)
      )
  shared_u2m
    (
     .clk (clk),
     .addr_r (master_d_r_addr[DEPTH_U2M-1:0]),
     .addr_w (uart_io_rx_addr_d1[DEPTH_U2M-1:0]),
     .data_in (uart_io_rx_data_d1[WIDTH_M_D-1:0]),
     .we (u2m_we),
     .data_out (u2m_r_data)
     );
`endif

  generate
    genvar i;
    for (i = 0; i < CORES; i = i + 1)
      begin: mini16_pe_gen
        mini16_pe
             #(
               .WIDTH_D (WIDTH_P_D),
               .DEPTH_I (DEPTH_P_I),
               .DEPTH_D (DEPTH_P_D),
               .DEPTH_M2S (DEPTH_M2S),
               .DEPTH_FIFO (DEPTH_FIFO),
               .CORE_ID (i + PE_ID_START),
               .MASTER_W_BANK_BC (MASTER_W_BANK_BC),
               .DEPTH_V_F (DEPTH_V_F),
               .DEPTH_B_F (DEPTH_B_F),
               .DEPTH_V_M (DEPTH_V_M),
               .DEPTH_B_M (DEPTH_B_M_W),
               .DEPTH_V_S_R (DEPTH_V_S_R),
               .DEPTH_B_S_R (DEPTH_B_S_R),
               .DEPTH_V_S_W (DEPTH_V_S_W),
               .DEPTH_B_S_W (DEPTH_B_S_W),
               .DEPTH_V_M2S (DEPTH_V_M2S),
               .DEPTH_B_M2S (DEPTH_B_M2S),
               .FIFO_RAM_TYPE (PE_FIFO_RAM_TYPE),
               .M2S_RAM_TYPE (PE_M2S_RAM_TYPE),
               .DEPTH_REG (PE_DEPTH_REG)
               )
        mini16_pe_0
             (
              .clk (clk),
              .reset (reset),
              .soft_reset (io_reg_w[IO_REG_W_RESET_PE][0]),
              .fifo_req_r (harvester_r_req[i]),
              .fifo_valid (harvester_r_valid[i]),
              .fifo_r_data (harvester_r_data[i]),
              .addr_i (master_d_w_addr_d1),
              .data_i (master_d_w_data_d1),
              .we_i (master_d_we_d1)
              );
      end
  endgenerate

endmodule
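
The address widths in mini16_soc.v are all derived from a handful of parameters through max() and $clog2. To make the arithmetic concrete, the sketch below recomputes some of those localparams in Java for the module's default parameter values and shows the base address the master would write to in order to reach one PE or to broadcast to all of them. The class and method names are hypothetical; the formulas are copied from the localparam definitions in the listing above.

```java
// Hypothetical recomputation of derived localparams from mini16_soc.v,
// using the default values in the module's parameter list.
final class SocAddressWidthSketch {
  static final int CORES = 32;
  static final int DEPTH_M_D = 11;
  static final int DEPTH_P_I = 10;
  static final int DEPTH_P_D = 8;
  static final int DEPTH_M2S = 8;
  static final int DEPTH_S2M = 9;
  static final int DEPTH_U2M = 11;
  static final int VRAM_WIDTH_BITS = 8;
  static final int VRAM_HEIGHT_BITS = 9;
  static final int DEPTH_IO_REG = 5;
  static final int PE_ID_START = 4;

  // Equivalent of Verilog $clog2 for n >= 1.
  static int clog2(int n) {
    return 32 - Integer.numberOfLeadingZeros(n - 1);
  }

  static final int CORE_BITS = clog2(CORES + 6);
  static final int DEPTH_VRAM = VRAM_WIDTH_BITS + VRAM_HEIGHT_BITS;
  static final int DEPTH_B_F = Math.max(DEPTH_VRAM, DEPTH_S2M);
  static final int DEPTH_V_F = DEPTH_B_F + 1;
  static final int DEPTH_B_M2S = Math.max(DEPTH_P_I, DEPTH_M2S);
  static final int DEPTH_V_M2S = DEPTH_B_M2S + 1;
  static final int DEPTH_B_M_W =
      Math.max(DEPTH_V_M2S, Math.max(DEPTH_M_D, DEPTH_IO_REG));
  static final int DEPTH_V_M_W = DEPTH_B_M_W + CORE_BITS;
  static final int DEPTH_B_S_W = Math.max(DEPTH_V_F, DEPTH_P_D);
  static final int DEPTH_V_S_W = DEPTH_B_S_W + 1;
  static final int MASTER_W_BANK_BC = (1 << CORE_BITS) - 1;

  // Master write base address that selects PE number peIndex (0-based).
  static int peWriteBase(int peIndex) {
    return (peIndex + PE_ID_START) << DEPTH_B_M_W;
  }

  // Master write base address that every PE accepts (broadcast bank).
  static int broadcastBase() {
    return MASTER_W_BANK_BC << DEPTH_B_M_W;
  }

  public static void main(String[] args) {
    System.out.println("CORE_BITS = " + CORE_BITS);
    System.out.println("DEPTH_V_M_W = " + DEPTH_V_M_W);
    System.out.println("PE 0 base = " + peWriteBase(0));
    System.out.println("broadcast base = " + broadcastBase());
  }
}
```

The broadcast bank is the all-ones value of the core field, which matches the (core_bank == MASTER_W_BANK_BC) test in mini16_pe.v.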
harvester.v : handles data transfer from the PEs
// SPDX-License-Identifier: BSD-2-Clause
// Copyright (c) 2019 miya All rights reserved.

module harvester
  #(
    parameter CORE_BITS = 8,
    parameter CORES = 32,
    parameter WIDTH = 32,
    parameter DEPTH = 8
    )
  (
   input                   clk,
   input                   reset,
   output [CORE_BITS-1:0]  cs,
   input [WIDTH+DEPTH-1:0] r_data,
   input                   r_valid,
   output reg [CORES-1:0]  r_req,
   output [DEPTH-1:0]      w_addr,
   output [WIDTH-1:0]      w_data,
   output reg              we
   );

  localparam TRUE = 1'b1;
  localparam FALSE = 1'b0;
  localparam ONE = 1'd1;
  localparam ZERO = 1'd0;

  // fifo to s2m core select
  reg [CORE_BITS-1:0] core;
  reg [CORE_BITS-1:0] core_d1;
  reg [CORE_BITS-1:0] core_d2;
  reg [CORE_BITS-1:0] core_d3;
  always @(posedge clk)
    begin
      core_d1 <= core;
      core_d2 <= core_d1;
      core_d3 <= core_d2;
      if (reset == TRUE)
        begin
          core <= ZERO;
        end
      else
        begin
          if (core == CORES - 1)
            begin
              core <= ZERO;
            end
          else
            begin
              core <= core + ONE;
            end
        end
    end

  assign cs = core_d3;
  assign w_addr = harvester_r_data_fetch_d1[WIDTH+DEPTH-1:WIDTH];
  assign w_data = harvester_r_data_fetch_d1[WIDTH-1:0];

  reg [WIDTH+DEPTH-1:0] harvester_r_data_fetch;
  reg [WIDTH+DEPTH-1:0] harvester_r_data_fetch_d1;
  reg r_valid_d1;

  always @(posedge clk)
    begin
      r_req[core] <= TRUE;
      r_req[core_d1] <= FALSE;
      r_valid_d1 <= r_valid;
      we <= r_valid_d1;
      harvester_r_data_fetch <= r_data;
      harvester_r_data_fetch_d1 <= harvester_r_data_fetch;
    end

endmodule
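
The harvester above visits the PE FIFOs in a fixed round-robin order, one core per clock. The Java sketch below models only that polling order: it ignores the three-cycle delay between the request (core) and the select (core_d3), and it treats an empty FIFO as r_valid being low. The class and method names are made up for this illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical behavioral model of the round-robin polling in harvester.v.
final class HarvesterSketch {
  // Polls one FIFO per cycle, in core order, and returns the items in the
  // order they would be written out (to shared_s2m or VRAM in the SoC).
  static List<Integer> drain(List<Deque<Integer>> fifos, int cycles) {
    List<Integer> written = new ArrayList<>();
    int core = 0;
    for (int t = 0; t < cycles; t++) {
      Integer item = fifos.get(core).poll();  // null models r_valid == 0
      if (item != null) {
        written.add(item);
      }
      // Same wrap-around as the "core" counter in the Verilog.
      core = (core == fifos.size() - 1) ? 0 : core + 1;
    }
    return written;
  }

  public static void main(String[] args) {
    List<Deque<Integer>> fifos = new ArrayList<>();
    fifos.add(new ArrayDeque<>(Arrays.asList(10, 11)));
    fifos.add(new ArrayDeque<>());
    fifos.add(new ArrayDeque<>(Arrays.asList(30)));
    System.out.println(drain(fifos, 6));
  }
}
```

Because each PE is polled only once every CORES cycles, a PE that writes faster than that has to rely on its FIFO depth, which is presumably why the PE exposes the FIFO item count as a readable register.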
asm/MasterProgram.java : Mandelbrot set demo: master core program
// SPDX-License-Identifier: BSD-2-Clause
// Copyright (c) 2019 miya All rights reserved.

import java.lang.Math;

public class MasterProgram extends AsmLib
{
  private int DEBUG = 0;
  private int WAIT_VSYNC = 0;
  private int VGA_HEIGHT_BITS = 9;

  private int M2S_BC_ADDR_H;
  private int M2S_BC_ADDR_SHIFT;
  private int S2M_ADDR_H;
  private int S2M_ADDR_SHIFT;
  private int U2M_ADDR_H;
  private int U2M_ADDR_SHIFT;
  private int IO_REG_W_ADDR_H;
  private int IO_REG_W_ADDR_SHIFT;
  private int IO_REG_R_ADDR_H;
  private int IO_REG_R_ADDR_SHIFT;

  private void f_get_m2s_bc_addr()
  {
    // output: R3:m2s_bc_addr
    int m2s_bc_addr = 3;
    // m2s_bc_addr = M2S_BC_ADDR_H << M2S_BC_ADDR_SHIFT;
    label("f_get_m2s_bc_addr");
    lib_set_im(m2s_bc_addr, M2S_BC_ADDR_H);
    as_sli(m2s_bc_addr, M2S_BC_ADDR_SHIFT);
    lib_nop(2);
    as_mvsi(m2s_bc_addr, MVS_SL);
    lib_return();
  }

  private void f_get_m2s_core_addr()
  {
    // input: R3:core id(0-(N-1))
    // output: R3:m2s_core_addr
    int core_id = 3;
    int m2s_core_addr = 3;
    int tmp0 = SP_REG_MVIL;
    // m2s_core_addr = ((core_id + PE_ID_START) << DEPTH_B_M_W) + (M2S_BANK_M2S << DEPTH_B_M2S);
    label("f_get_m2s_core_addr");
    as_addi(core_id, PE_ID_START);
    as_mvi(tmp0, M2S_BANK_M2S);
    as_sli(m2s_core_addr, DEPTH_B_M_W);
    as_sli(tmp0, DEPTH_B_M2S);
    as_nop();
    as_mvsi(m2s_core_addr, MVS_SL);
    as_mvsi(tmp0, MVS_SL);
    as_add(m2s_core_addr, tmp0);
    lib_return();
  }

  private void f_get_s2m_addr()
  {
    // output: R3:s2m_addr
    int s2m_addr = 3;
    // s2m_addr = S2M_ADDR_H << S2M_ADDR_SHIFT;
    label("f_get_s2m_addr");
    lib_set_im(s2m_addr, S2M_ADDR_H);
    as_sli(s2m_addr, S2M_ADDR_SHIFT);
    lib_nop(2);
    as_mvsi(s2m_addr, MVS_SL);
    lib_return();
  }

  private void f_get_io_reg_w_addr()
  {
    // input: R3: device reg num
    // output: R3:io_reg_w_addr
    int io_reg_w_addr = 3;
    int tmp0 = LREG0;
    // io_reg_w_addr = (IO_REG_W_ADDR_H << IO_REG_W_ADDR_SHIFT) + R3;
    label("f_get_io_reg_w_addr");
    lib_set_im(tmp0, IO_REG_W_ADDR_H);
    as_sli(tmp0, IO_REG_W_ADDR_SHIFT);
    lib_nop(2);
    as_mvsi(tmp0, MVS_SL);
    as_add(io_reg_w_addr, tmp0);
    lib_return();
  }

  private void f_get_io_reg_r_addr()
  {
    // input: R3: device reg num
    // output: R3:io_reg_r_addr
    int io_reg_r_addr = 3;
    int tmp0 = LREG0;
    // io_reg_r_addr = (IO_REG_R_ADDR_H << IO_REG_R_ADDR_SHIFT) + R3;
    label("f_get_io_reg_r_addr");
    lib_set_im(tmp0, IO_REG_R_ADDR_H);
    as_sli(tmp0, IO_REG_R_ADDR_SHIFT);
    lib_nop(2);
    as_mvsi(tmp0, MVS_SL);
    as_add(io_reg_r_addr, tmp0);
    lib_return();
  }

  private void f_get_u2m_addr()
  {
    // output: R3:u2m_addr
    int u2m_addr = 3;
    // u2m_addr = U2M_ADDR_H << U2M_ADDR_SHIFT;
    label("f_get_u2m_addr");
    lib_set_im(u2m_addr, U2M_ADDR_H);
    as_sli(u2m_addr, U2M_ADDR_SHIFT);
    lib_nop(2);
    as_mvsi(u2m_addr, MVS_SL);
    lib_return();
  }

  private void example_led()
  {
    /*
    led_addr = (MASTER_W_BANK_IO_REG << DEPTH_B_M_W) + IO_REG_W_LED;
    counter = 0;
    shift = 18;
    do
    {
      led = counter >> shift;
      mem[led_addr] = led;
      counter++;
    } while (1);
    */

    int led_addr = 3;
    int counter = 4;
    int shift = 5;
    int led = 6;
    as_nop();
    lib_init_stack();
    as_mvil(IO_REG_W_LED);
    as_mv(led_addr, SP_REG_MVIL);
    lib_call("f_get_io_reg_w_addr");
    as_mvi(counter, 0);
    lib_set_im(shift, 18);
    label("example_led_L_0");
    as_mv(led, counter);
    as_sr(led, shift);
    as_addi(counter, 1);
    as_nop();
    as_mvsi(led, MVS_SR);
    as_st(led_addr, led);
    lib_ba("example_led_L_0");
    // link library
    f_get_io_reg_w_addr();
  }

  private void example_helloworld()
  {
    as_nop();
    int from_u2m = 1;
    if (from_u2m == 1)
    {
      lib_call("f_get_u2m_data");
    }
    lib_init_stack();
    as_mvi(R4, MASTER_R_BANK_MEM_D);
    as_sli(R4, DEPTH_B_M_R);
    lib_set_im(R3, addr_abs("d_helloworld"));
    as_mvsi(R4, MVS_SL);
    as_add(R3, R4);
    lib_call("f_uart_print_32");
    lib_call("f_halt");
    // link library
    f_uart_char();
    f_uart_print_32();
    f_halt();
    f_get_u2m_data();
  }

  private void example_helloworld_data()
  {
    label("d_helloworld");
    if (WIDTH_M_D == 32)
    {
      string_data32("Hello, world!\r\n");
    }
    else
    {
      string_data16("Hello, world!\r\n");
    }
  }

  private void f_reset_pe()
  {
    /*
    addr_reset = MASTER_W_BANK_IO_REG;
    addr_reset <<= DEPTH_B_M_W;
    addr_reset += IO_REG_W_RESET_PE;
    mem[addr_reset] = reset_value;
    */

    int reset_value = LREG0;
    int addr_reset = LREG1;
    label("f_reset_pe");
    as_mvi(addr_reset, MASTER_W_BANK_IO_REG);
    as_sli(addr_reset, DEPTH_B_M_W);
    lib_nop(2);
    as_mvsi(addr_reset, MVS_SL);
    as_addi(addr_reset, IO_REG_W_RESET_PE);
    as_st(addr_reset, reset_value);
    lib_return();
  }

  // copy data from U2M to MEM_D
  // call before lib_init_stack()
  public void f_get_u2m_data()
  {
    int addr_dst = LREG0;
    int addr_src = LREG1;
    int size = LREG2;
    int data = LREG3;
    int compare = LREG4;
    label("f_get_u2m_data");
    as_mvi(size, 1);
    as_mvi(addr_src, U2M_ADDR_H);
    as_sli(addr_src, U2M_ADDR_SHIFT);
    as_mvi(addr_dst, 0);
    as_sli(size, DEPTH_M_D);
    as_mvsi(addr_src, MVS_SL);
    as_nop();
    as_mvsi(size, MVS_SL);
    label("f_get_u2m_data_L_0");
    as_ld(data, addr_src);
    as_subi(size, 1);
    as_addi(addr_src, 1);
    as_ld(data, addr_src);
    as_st(addr_dst, data);
    as_cnz(compare, size);
    as_addi(addr_dst, 1);
    lib_bc(compare, "f_get_u2m_data_L_0");
    lib_return();
  }

  public void f_reset_vga()
  {
    /*
    addr_ioreg = MASTER_W_BANK_IO_REG;
    addr_ioreg <<= DEPTH_B_M_W;
    addr_sp_x = addr_ioreg;
    addr_sp_y = addr_ioreg;
    addr_sp_s = addr_ioreg;
    addr_sp_x += IO_REG_W_SPRITE_X;
    addr_sp_y += IO_REG_W_SPRITE_Y;
    addr_sp_s += IO_REG_W_SPRITE_SCALE;
    mem[addr_sp_x] = 64;
    mem[addr_sp_y] = 0;
    mem[addr_sp_s] = (WIDTH_P_D == 32) ? 7 : 5;
    */

    int addr_ioreg = LREG0;
    int addr_sp_x = LREG1;
    int addr_sp_y = LREG2;
    int addr_sp_s = LREG3;
    int x = LREG5;
    label("f_reset_vga");
    as_mvi(addr_ioreg, MASTER_W_BANK_IO_REG);
    as_sli(addr_ioreg, DEPTH_B_M_W);
    lib_set_im(x, 64);
    as_mvsi(addr_ioreg, MVS_SL);
    as_mv(addr_sp_x, addr_ioreg);
    as_mv(addr_sp_y, addr_ioreg);
    as_mv(addr_sp_s, addr_ioreg);
    as_addi(addr_sp_x, IO_REG_W_SPRITE_X);
    as_addi(addr_sp_y, IO_REG_W_SPRITE_Y);
    as_addi(addr_sp_s, IO_REG_W_SPRITE_SCALE);
    as_st(addr_sp_x, x);
    as_sti(addr_sp_y, 0);

    if (WIDTH_P_D == 32)
    {
      as_sti(addr_sp_s, 7);
    }
    else
    {
      as_sti(addr_sp_s, 5);
    }
    lib_return();
  }

  private void f_init_core_id()
  {
    /*
      depends: f_get_m2s_core_addr()
    */

    int addr_core_id = LREG0;
    int next_core_offset = LREG1;
    int i = LREG2;
    int cores = LREG3;
    int addr_cores = LREG4;
    int para = LREG5;
    int compare = LREG6;
    /*
    R3 = cores - 1;
    lib_call("f_get_m2s_core_addr");
    addr_core_id = R3;
    addr_cores = R3 + 1;
    next_core_offset = 1 << DEPTH_B_M_W;
    i = CORES;
    para = PARALLEL;
    do
    {
      i--;
      mem[addr_core_id] = i;
      mem[addr_cores] = para;
      addr_core_id -= next_core_offset;
      addr_cores -= next_core_offset;
    } while (i != 0);
    */

    label("f_init_core_id");
    lib_push(SP_REG_LINK);
    lib_push(R3);
    lib_set_im(cores, CORES);
    lib_set_im(para, PARALLEL);
    lib_set_im(R3, CORES - 1); // cores - 1
    lib_call("f_get_m2s_core_addr");
    as_mv(addr_core_id, R3);
    as_mv(addr_cores, R3);
    as_mvi(next_core_offset, 1);
    as_sli(next_core_offset, DEPTH_B_M_W);
    as_mv(i, cores);
    as_addi(addr_cores, 1);
    as_mvsi(next_core_offset, MVS_SL);
    label("f_init_core_id_L_0");
    as_subi(i, 1);
    as_st(addr_core_id, i);
    as_st(addr_cores, para);
    as_sub(addr_core_id, next_core_offset);
    as_sub(addr_cores, next_core_offset);
    as_cnz(compare, i);
    lib_bc(compare, "f_init_core_id_L_0");
    lib_pop(R3);
    lib_pop(SP_REG_LINK);
    lib_return();
  }

  private void m_vga_flip(int reg_task_id)
  {
    int task_id = reg_task_id;
    int addr_sp_y = LREG0;
    int tmp0 = LREG1;
    int page = LREG2;

    /*
    addr_sp_y = (MASTER_W_BANK_IO_REG << DEPTH_B_M_W) + IO_REG_W_SPRITE_Y;
    page = -(((task_id & 1) ^ 1) << VGA_HEIGHT_BITS);
    *addr_sp_y = page;
    */


    as_mvi(addr_sp_y, MASTER_W_BANK_IO_REG);
    as_sli(addr_sp_y, DEPTH_B_M_W);
    as_mv(tmp0, task_id);
    as_andi(tmp0, 1);
    as_mvsi(addr_sp_y, MVS_SL);
    as_xori(tmp0, 1);
    // vga_height = 1 << VGA_HEIGHT_BITS
    as_sli(tmp0, VGA_HEIGHT_BITS);
    as_mvi(page, 0);
    as_addi(addr_sp_y, IO_REG_W_SPRITE_Y);
    as_mvsi(tmp0, MVS_SL);
    // sp_y = 0(page0), -vga_height(page1)
    as_sub(page, tmp0);
    as_st(addr_sp_y, page);
  }

  private void m_wait_vsync()
  {
    /*
    addr_vsync = (MASTER_R_BANK_IO_REG << DEPTH_B_M_R) + IO_REG_R_VGA_VSYNC;
    vsync_pre = 0;
    do
    {
      vsync = mem[addr_vsync];
      vsync_start = ((vsync == 0) && (vsync_pre == 1));
      vsync_pre = vsync;
    } while (!vsync_start);
    note: the loop branch below tests !vsync_start, i.e. ((vsync == 1) || (vsync_pre == 0))
    */

    int addr_vsync = LREG0;
    int vsync = LREG1;
    int vsync_start = LREG2;
    int vsync_pre = LREG3;
    int compare = LREG4;
    as_mvi(addr_vsync, MASTER_R_BANK_IO_REG);
    as_sli(addr_vsync, DEPTH_B_M_R);
    as_mvi(vsync_pre, 0);
    as_nop();
    as_mvsi(addr_vsync, MVS_SL);
    as_addi(addr_vsync, IO_REG_R_VGA_VSYNC);
    as_ld(vsync, addr_vsync);
    lib_nop(2);
    label("m_wait_vsync_L_0");
    as_ld(vsync, addr_vsync);
    as_cnz(vsync_start, vsync);
    as_cnz(compare, vsync_pre);
    as_mv(vsync_pre, vsync);
    as_xori(compare, -1);
    as_or(compare, vsync_start);
    lib_bc(compare, "m_wait_vsync_L_0");
  }

  private void m_init_mandel_param()
  {
    /*
      PE m2s memory map:
      3: scale
      4: cx
      5: cy
     */

    int addr_m2s_root = 3;
    int addr_scale = LREG0;
    int addr_cx = LREG1;
    int addr_cy = LREG2;
    int scale = LREG3;
    int cx = LREG4;
    int cy = LREG5;

    as_mv(addr_scale, addr_m2s_root);
    as_mv(addr_cx, addr_m2s_root);
    as_mv(addr_cy, addr_m2s_root);
    lib_ld(scale, "d_mandel_scale");
    as_addi(addr_scale, 3);
    as_addi(addr_cx, 4);
    as_addi(addr_cy, 5);
    lib_ld(cx, "d_mandel_cx");
    lib_ld(cy, "d_mandel_cy");
    as_st(addr_scale, scale);
    as_st(addr_cx, cx);
    as_st(addr_cy, cy);
  }

  private void m_update_mandel_param()
  {
    int addr_m2s_root = 3;
    int addr_scale = LREG0;
    int scale = LREG1;
    int scale_mask = LREG2;
    int compare = LREG3;

    /*
    addr_scale = addr_m2s_root + 3;
    scale -= 1;
    if (scale == 0)
    {
      scale = 256;
    }
    m2s[addr_scale] = scale;
    mem["d_mandel_scale"] = scale;
    */


    as_mv(addr_scale, addr_m2s_root);
    lib_ld(scale, "d_mandel_scale");
    as_addi(addr_scale, 3);
    as_subi(scale, 1);
    as_cnz(compare, scale);
    as_mvil(256);
    as_mv(SP_REG_MVC, scale);
    as_xori(compare, -1);
    as_mvc(SP_REG_MVIL, compare);
    as_mv(scale, SP_REG_MVC);
    as_st(addr_scale, scale);
    lib_st("d_mandel_scale", scale);
  }

  private void master_thread_manager()
  {
    /*
      PE m2s memory map:
      0: core_id
      1: parallel
      2: task_id
      user parameters
      3: scale
      4: cx
      5: cy

      s2m memory map:
      0 - PARALLEL-1: Incremented task_id from PE
     */

    int addr_m2s_root = 3;
    int addr_s2m_root = 4;
    int addr_task_id = 5;
    int addr_s2m = 6;
    int task_id = 7;
    int pe_ack = 8;
    int i = 9;
    int compare = 10;

    /*
    reset_pe(1);
    if (ENABLE_UART == 1)
    {
      lib_call("f_get_u2m_data");
    }
    f_reset_vga();
    init_core_id();
    addr_m2s_root = M2S_BC_ADDR_H << M2S_BC_ADDR_SHIFT;
    addr_task_id = addr_m2s_root + 2;
    addr_s2m_root = MASTER_R_BANK_S2M << DEPTH_B_M_R;
    m_init_mandel_param();
    task_id = 0;
    mem[addr_task_id] = task_id;
    reset_pe(0);
    do
    {
      i = PARALLEL;
      task_id++;
      addr_s2m = addr_s2m_root;
      do
      {
        i--;
        do
        {
          pe_ack = mem[addr_s2m] - task_id;
          if (WIDTH_P_D < 32) pe_ack &= 1;
        } while (pe_ack != 0);
        addr_s2m++;
      } while (i != 0);
      m_update_mandel_param();
      if (WAIT_VSYNC == 1) m_wait_vsync();
      m_vga_flip(task_id);
      mem[addr_task_id] = task_id;
    } while (1);
    */

    as_nop();
    lib_init_stack();
    as_mvi(LREG0, 1);
    lib_call("f_reset_pe");

    if (ENABLE_UART == 1)
    {
      lib_call("f_get_u2m_data");
    }

    lib_call("f_reset_vga");
    lib_call("f_init_core_id");
    lib_set_im(addr_m2s_root, M2S_BC_ADDR_H);
    as_mvi(addr_s2m_root, MASTER_R_BANK_S2M);
    as_sli(addr_s2m_root, DEPTH_B_M_R);
    as_sli(addr_m2s_root, M2S_BC_ADDR_SHIFT);
    as_mvi(task_id, 0);
    as_mvsi(addr_s2m_root, MVS_SL);
    as_mvsi(addr_m2s_root, MVS_SL);
    m_init_mandel_param();
    as_mv(addr_task_id, addr_m2s_root);
    as_addi(addr_task_id, 2);
    as_st(addr_task_id, task_id);
    as_mvi(LREG0, 0);
    lib_call("f_reset_pe");
    label("master_thread_manager_L_0");
    lib_set_im(i, PARALLEL);
    as_addi(task_id, 1);
    as_mv(addr_s2m, addr_s2m_root);
    label("master_thread_manager_L_1");
    as_subi(i, 1);
    label("master_thread_manager_L_2");
    as_ld(pe_ack, addr_s2m);
    lib_nop(2);
    as_ld(pe_ack, addr_s2m);
    as_sub(pe_ack, task_id);

    if (WIDTH_P_D < 32)
    {
      as_andi(pe_ack, 1);
    }

    as_cnz(compare, pe_ack);
    lib_bc(compare, "master_thread_manager_L_2");
    as_addi(addr_s2m, 1);
    as_cnz(compare, i);
    lib_bc(compare, "master_thread_manager_L_1");

    m_update_mandel_param();

    if (DEBUG == 1)
    {
      lib_push(R3);
      as_mv(R3, task_id);
      lib_call("f_uart_hex_word_ln");
      as_mvi(R3, 1);
      as_mvil(21);
      as_sl(R3, SP_REG_MVIL);
      lib_nop(2);
      as_mvsi(R3, MVS_SL);
      lib_call("f_wait");
      lib_pop(R3);
    }

    if (WAIT_VSYNC == 1)
    {
      m_wait_vsync();
    }

    m_vga_flip(task_id);

    as_st(addr_task_id, task_id);

    lib_ba("master_thread_manager_L_0");

    lib_call("f_halt"); // not reached: the lib_ba above loops forever

    // link library
    f_halt();
    f_init_core_id();
    f_reset_pe();
    f_get_m2s_core_addr();
    f_reset_vga();
    if (ENABLE_UART == 1)
    {
      f_get_u2m_data();
    }
    if (DEBUG == 1)
    {
      f_uart_char();
      f_uart_hex();
      f_uart_hex_word();
      f_uart_hex_word_ln();
      f_wait();
    }
  }

  @Override
  public void init(String[] args)
  {
    super.init(args);
    M2S_BC_ADDR_SHIFT = DEPTH_B_M2S;
    M2S_BC_ADDR_H = ((MASTER_W_BANK_BC << DEPTH_B_M_W) + (M2S_BANK_M2S << DEPTH_B_M2S)) >>> M2S_BC_ADDR_SHIFT;
    S2M_ADDR_H = MASTER_R_BANK_S2M;
    S2M_ADDR_SHIFT = DEPTH_B_M_R;
    U2M_ADDR_H = MASTER_R_BANK_U2M;
    U2M_ADDR_SHIFT = DEPTH_B_M_R;
    IO_REG_W_ADDR_H = MASTER_W_BANK_IO_REG;
    IO_REG_W_ADDR_SHIFT = DEPTH_B_M_W;
    IO_REG_R_ADDR_H = MASTER_R_BANK_IO_REG;
    IO_REG_R_ADDR_SHIFT = DEPTH_B_M_R;
  }

  @Override
  public void program()
  {
    set_filename("default_master_code");
    set_rom_width(WIDTH_I);
    set_rom_depth(DEPTH_M_I);
    //example_led();
    //example_helloworld();
    master_thread_manager();
  }

  @Override
  public void data()
  {
    set_filename("default_master_data");
    set_rom_width(WIDTH_M_D);
    set_rom_depth(DEPTH_M_D);
    label("d_mandel_scale");
    dat(256);
    label("d_mandel_cx");
    dat(161 << 6);
    label("d_mandel_cy");
    dat(49 << 6);
    example_helloworld_data();
  }
}
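
The handshake that master_thread_manager() runs against the PEs can be hard to see through the scheduled instruction stream. The following is a minimal plain-Java model of it, offered as a sketch and not as part of the project. The class FrameBarrierModel and all of its method names are hypothetical, and the m2s/s2m memories are modeled as ordinary fields and arrays. It follows the pseudocode comments in master_thread_manager() and in pe_thread_manager() of asm/PEProgram.java: each PE increments its task_id and writes it to its s2m slot, the master waits until every slot matches its own incremented task_id, then flips the display page and broadcasts the task_id to release the PEs.

```java
// Hypothetical software model of the master/PE frame barrier (not project code).
final class FrameBarrierModel
{
  private final int[] s2m;      // s2m[core]: task_id last acknowledged by that PE
  private final int[] peTaskId; // each PE's private task_id register
  private int m2sTaskId;        // m2s offset 2: task_id broadcast by the master
  private int masterTaskId;     // the master's task_id register

  FrameBarrierModel(int parallel)
  {
    s2m = new int[parallel];
    peTaskId = new int[parallel];
  }

  // PE side: task_id++; mem[s2m_addr] = task_id;
  void peAck(int core)
  {
    peTaskId[core]++;
    s2m[core] = peTaskId[core];
  }

  // PE side: the inner wait loop exits once the broadcast task_id matches.
  boolean peReleased(int core)
  {
    return m2sTaskId == peTaskId[core];
  }

  // PE side: page the PE draws into (page = task_id & 1).
  int peDrawPage(int core)
  {
    return peTaskId[core] & 1;
  }

  // Master side: task_id++ at the top of the outer loop.
  void masterNextFrame()
  {
    masterTaskId++;
  }

  // Master side: true when every PE has acknowledged the current task_id.
  boolean masterAllAcked()
  {
    for (int ack : s2m)
    {
      if (ack != masterTaskId)
      {
        return false;
      }
    }
    return true;
  }

  // Master side: broadcast task_id and return the page shown after the flip,
  // following the m_vga_flip comment (sp_y = 0 selects page 0).
  int masterFlipAndRelease()
  {
    m2sTaskId = masterTaskId;
    return (masterTaskId & 1) ^ 1;
  }
}
```

In this model the page returned by masterFlipAndRelease() is always the opposite of peDrawPage() for a released PE, which is the double buffering the pseudocode describes.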
asm/PEProgram.java: Mandelbrot set demo, PE program
// SPDX-License-Identifier: BSD-2-Clause
// Copyright (c) 2019 miya All rights reserved.

import java.lang.Math;

public class PEProgram extends AsmLib
{
  private int FIFO_ADDR;
  private int VRAM_ADDR_H;
  private int VRAM_ADDR_SHIFT;
  private int M2S_ADDR_H;
  private int M2S_ADDR_SHIFT;
  private int ITEM_COUNT_ADDR_H;
  private int ITEM_COUNT_ADDR_SHIFT;
  private int S2M_ADDR_H;
  private int S2M_ADDR_SHIFT;
  private int IMAGE_WIDTH_BITS;
  private int IMAGE_HEIGHT_BITS;
  private int IMAGE_WIDTH_HALF_BITS;
  private int IMAGE_HEIGHT_HALF_BITS;
  private int IMAGE_WIDTH;
  private int IMAGE_HEIGHT;
  private int IMAGE_WIDTH_HALF;
  private int IMAGE_HEIGHT_HALF;

  private void m_mandel_core()
  {
    int x = 11;
    int y = 12;
    int scale = 13;
    int count = 14;
    int cx = 15;
    int cy = 16;
    int a = 17;
    int b = 18;
    int aa = 19;
    int bb = 20;
    int c = 21;
    int x1 = 22;
    int y1 = 23;
    int cmask = 24;
    int max_c = 25;
    int pc = 26;
    int tmp1 = 27;
    int tmp2 = 28;
    int tmp3 = 29;
    int compare = 30;

    // const
    int FIXED_BITS = 13;
    int FIXED_BITS_M1 = 12;
    int MAX_C = 4;

    /*
    a = 0;
    b = 0;
    aa = 0;
    bb = 0;
    scale = 256;
    count = 100;
    cmask = 252;
    max_c = MAX_C << FIXED_BITS;
    x1 = ((x - IMAGE_WIDTH_HALF) * scale) + cx;
    y1 = ((y - IMAGE_HEIGHT_HALF) * scale) + cy;
    do
    {
      pc = c;
      b = ((a * b) >> FIXED_BITS_M1) - y1;
      a = aa - bb - x1;
      aa = (a * a) >> FIXED_BITS;
      bb = (b * b) >> FIXED_BITS;
      c = aa + bb;
      count--;
      x1 += scale;
      pc -= c;
      pc >>= 5;
      limit = (c <= max_c) && (count >= 0) && (pc != 0);
    } while (limit);

    as_mvi(a, 0);
    as_mvi(b, 0);
    as_mvi(aa, 0);
    as_mvi(bb, 0);
    as_mv(x1, x);
    as_mv(y1, y);
    lib_set_im(count, 100);
    lib_set_im(tmp1, IMAGE_WIDTH_HALF);
    lib_set_im(tmp2, IMAGE_HEIGHT_HALF);
    as_mvi(max_c, MAX_C);
    as_sli(max_c, FIXED_BITS);
    as_sub(x1, tmp1);
    as_sub(y1, tmp2);
    as_mul(x1, scale);
    as_mul(y1, scale);
    as_add(x1, cx);
    as_add(y1, cy);
    label("m_mandel_L_0");
    as_mv(pc, c);
    as_mul(b, a);
    as_srai(b, FIXED_BITS_M1);
    as_sub(b, y1);
    as_mv(a, aa);
    as_sub(a, bb);
    as_sub(a, x1);
    as_mv(aa, a);
    as_mul(aa, a);
    as_sri(aa, FIXED_BITS);
    as_mv(bb, b);
    as_mul(bb, b);
    as_sri(bb, FIXED_BITS);
    as_mv(c, aa);
    as_add(c, bb);
    as_subi(count, 1);
    as_add(x1, scale);
    as_mv(tmp1, max_c);
    as_sub(pc, c);
    as_sub(tmp1, c);
    as_sri(pc, 5);
    as_cnm(compare, tmp1);
    as_cnm(tmp2, count);
    as_cnz(tmp3, pc);
    as_and(compare, tmp2);
    as_and(compare, tmp3);
    lib_bc(compare, "m_mandel_L_0");
    */


    as_mv(x1, x);
    as_mv(y1, y);
    lib_set_im(tmp1, IMAGE_WIDTH_HALF);
    lib_set_im(tmp2, IMAGE_HEIGHT_HALF);
    as_mvi(max_c, MAX_C);
    as_sli(max_c, FIXED_BITS);
    as_sub(x1, tmp1);
    as_sub(y1, tmp2);
    as_mvsi(max_c, MVS_SL);
    as_mul(x1, scale);
    as_mul(y1, scale);
    as_mvi(a, 0);
    as_mvi(b, 0);
    as_mvi(aa, 0);
    as_mvi(bb, 0);
    as_mvsi(x1, MVS_MUL);
    as_mvsi(y1, MVS_MUL);
    as_add(x1, cx);
    as_add(y1, cy);
    lib_set_im(count, 100);
    label("m_mandel_core_L_0");
    as_mul(b, a);
    as_subi(count, 1);
    as_mv(pc, c);
    as_mv(a, aa);
    as_sub(a, bb);
    as_sub(a, x1);
    as_mvsi(b, MVS_MUL);
    as_srai(b, FIXED_BITS_M1);
    as_add(x1, scale);
    as_mv(aa, a);
    as_mvsi(b, MVS_SRA);
    as_sub(b, y1);
    as_mv(bb, b);
    as_mul(bb, b);
    as_mul(aa, a);
    as_mv(tmp1, max_c);
    lib_nop(3);
    as_mvsi(bb, MVS_MUL);
    as_sri(bb, FIXED_BITS);
    as_mvsi(aa, MVS_MUL);
    as_sri(aa, FIXED_BITS);
    as_mvsi(bb, MVS_SR);
    as_nop();
    as_mvsi(aa, MVS_SR);
    as_mv(c, aa);
    as_add(c, bb);
    as_sub(pc, c);
    as_sri(pc, 5);
    as_sub(tmp1, c);
    as_cnm(compare, tmp1);
    as_mvsi(pc, MVS_SR);
    as_cnm(tmp2, count);
    as_cnz(tmp3, pc);
    as_and(compare, tmp2);
    as_and(compare, tmp3);
    lib_bc(compare, "m_mandel_core_L_0");
  }

  private void m_fill_vram()
  {
    int MAX_ITEM = 3;
    int item_count_addr = 3;
    int task_id = 4;
    int my_core_id = 7;
    int parallel = 8;
    int vram_addr = 9;
    int i = 10;
    int page = 11;
    int item_count = 11;
    int tmp0 = 12;
    int compare = 13;
    /*
    lib_push(vram_addr);

    page = task_id & 1;
    i = (1 << (IMAGE_WIDTH_BITS + IMAGE_HEIGHT_BITS)) - 1 - my_core_id;
    vram_addr += (page << (IMAGE_WIDTH_BITS + IMAGE_HEIGHT_BITS)) + i;
    item_count_addr = ITEM_COUNT_ADDR_H << ITEM_COUNT_ADDR_SHIFT;
    do
    {
      // fifo full check
      do
      {
        item_count = mem[item_count_addr];
        item_count -= MAX_ITEM;
      } while (item_count >= 0);

      mem[vram_addr] = task_id;
      vram_addr -= parallel;
      i -= parallel;
    } while (i >=0);
    lib_pop(vram_addr);
    */


    lib_push(vram_addr);

    as_mvi(i, 1);
    lib_set_im(tmp0, IMAGE_WIDTH_BITS + IMAGE_HEIGHT_BITS);
    as_sl(i, tmp0);
    as_mv(page, task_id);
    as_andi(page, 1);
    as_mvsi(i, MVS_SL);
    as_sl(page, tmp0);
    as_subi(i, 1);
    as_sub(i, my_core_id);
    as_mvsi(page, MVS_SL);
    as_add(page, i);
    as_add(vram_addr, page);

    as_mvi(item_count_addr, ITEM_COUNT_ADDR_H);
    as_sli(item_count_addr, ITEM_COUNT_ADDR_SHIFT);
    lib_nop(2);
    as_mvsi(item_count_addr, MVS_SL);

    label("m_fill_vram_L_0");

    as_ld(item_count, item_count_addr);
    lib_nop(2);
    as_ld(item_count, item_count_addr);
    as_subi(item_count, MAX_ITEM);
    as_cnm(compare, item_count);
    lib_bc(compare, "m_fill_vram_L_0");

    as_st(vram_addr, task_id);
    as_sub(vram_addr, parallel);
    as_sub(i, parallel);
    as_cnm(compare, i);
    lib_bc(compare, "m_fill_vram_L_0");
    lib_pop(vram_addr);
  }

  private void m_mandel()
  {
    int task_id = 4;
    int m2s_addr = 5;
    int my_core_id = 7;
    int parallel = 8;
    int vram_addr = 9;
    int i = 10;
    int page = 11;
    int x = 11;
    int y = 12;
    int scale = 13;
    int count = 14;
    int cx = 15;
    int cy = 16;
    // temp
    int tmp0 = 17;
    int param_addr = 17;
    int compare = 17;
    int wait_counter = 18;
    // const
    int WAIT = 8;
    /*
    lib_push_regs(4, 6); // push R4-R9
    // get param
    scale = mem[m2s_addr + 1];
    cx = mem[m2s_addr + 2];
    cy = mem[m2s_addr + 3];

    page = task_id & 1;
    vram_addr += (page << (IMAGE_WIDTH_BITS + IMAGE_HEIGHT_BITS)) + (1 << (IMAGE_WIDTH_BITS + IMAGE_HEIGHT_BITS)) - 1 - my_core_id;
    i = (1 << (IMAGE_WIDTH_BITS + IMAGE_HEIGHT_BITS)) - 1 - my_core_id;
    do
    {
      x = i & ((1 << IMAGE_WIDTH_BITS) - 1);
      y = i >> IMAGE_WIDTH_BITS;
      m_mandel_core();
      mem[vram_addr] = count;
      vram_addr -= parallel;
      i -= parallel;
    } while (i >=0);
    lib_pop_regs(4, 6);
    */


    lib_push_regs(4, 6);

    // get param
    as_mv(param_addr, m2s_addr);
    as_addi(param_addr, 1);
    as_ld(scale, param_addr); // ld param_addr
    as_addi(param_addr, 1);
    as_ld(scale, param_addr); // ld param_addr+1
    as_addi(param_addr, 1);
    as_ld(scale, param_addr); // write scale, ld param_addr+2
    as_ld(cx, param_addr); // write cx, ld param_addr+2
    as_nop();
    as_ld(cy, param_addr); // write cy

    as_mvi(i, 1);
    as_mv(page, task_id);
    as_mvi(tmp0, 1);
    as_mvil(IMAGE_WIDTH_BITS + IMAGE_HEIGHT_BITS);
    as_sl(i, SP_REG_MVIL);
    as_sl(tmp0, SP_REG_MVIL);
    as_andi(page, 1);
    as_mvsi(i, MVS_SL);
    as_mvsi(tmp0, MVS_SL);
    as_subi(i, 1);
    as_sl(page, SP_REG_MVIL);
    as_sub(i, my_core_id);
    as_nop();
    as_mvsi(page, MVS_SL);
    as_add(page, tmp0);
    as_subi(page, 1);
    as_sub(page, my_core_id);
    as_add(vram_addr, page);
    label("m_mandel_L_0");

    // Throttle each pixel with a short delay loop: the PE has become too fast for the Harvester.
    lib_set_im(wait_counter, WAIT);
    label("m_mandel_L_1");
    as_subi(wait_counter, 1);
    as_cnm(compare, wait_counter);
    lib_bc(compare, "m_mandel_L_1");

    as_mv(x, i);
    as_mv(y, i);
    lib_set_im(tmp0, (1 << IMAGE_WIDTH_BITS) - 1);
    as_sri(y, IMAGE_WIDTH_BITS);
    as_and(x, tmp0);
    as_nop();
    as_mvsi(y, MVS_SR);

    m_mandel_core();

    as_st(vram_addr, count);
    as_sub(vram_addr, parallel);
    as_sub(i, parallel);
    as_cnm(compare, i);
    lib_bc(compare, "m_mandel_L_0");
    lib_pop_regs(4, 6);
  }

  private void pe_thread_manager()
  {
    int task_id = 4;
    int m2s_addr = 5;
    int s2m_addr = 6;
    int my_core_id = 7;
    int parallel = 8;
    int vram_addr = 9;
    // temp
    int master_task_id = 10;
    int diff = 11;
    int compare = 12;
    /*
    as_nop();
    lib_init_stack();
    m2s_addr = m_get_m2s_addr();
    vram_addr = m_get_vram_addr();
    s2m_addr = m_get_s2m_addr();
    my_core_id = mem[m2s_addr];
    s2m_addr += my_core_id;
    m2s_addr++;
    parallel = mem[m2s_addr];
    m2s_addr++;
    task_id = mem[m2s_addr];
    if (my_core_id >= parallel) goto "pe_thread_manager_L_end"
    do
    {
      task_id++;
      mem[s2m_addr] = task_id;
      do
      {
        master_task_id = mem[m2s_addr];
        diff = master_task_id - task_id;
      } while (diff != 0);
      m_mandel();
    } while (1);
    */

    as_nop();
    lib_init_stack();
    // get m2s,vram,s2m addr
    as_mvi(m2s_addr, M2S_ADDR_H);
    as_sli(m2s_addr, M2S_ADDR_SHIFT);
    as_mvi(s2m_addr, S2M_ADDR_H);
    as_sli(s2m_addr, S2M_ADDR_SHIFT);
    as_mvsi(m2s_addr, MVS_SL);
    as_ld(my_core_id, m2s_addr); // 1st
    as_mvsi(s2m_addr, MVS_SL);
    as_mvi(vram_addr, VRAM_ADDR_H);
    as_sli(vram_addr, VRAM_ADDR_SHIFT);

    as_ld(my_core_id, m2s_addr); // 2nd
    as_addi(m2s_addr, 1);
    as_ld(parallel, m2s_addr); // 1st
    as_mvsi(vram_addr, MVS_SL);
    as_mv(diff, my_core_id);
    as_add(s2m_addr, my_core_id);
    as_ld(parallel, m2s_addr); // 2nd
    as_addi(m2s_addr, 1);
    as_ld(task_id, m2s_addr); // 1st
    as_sub(diff, parallel);
    as_cnm(compare, diff);
    as_ld(task_id, m2s_addr); // 2nd
    lib_bc(compare, "pe_thread_manager_L_end");
    label("pe_thread_manager_L_0");
    as_addi(task_id, 1);
    as_st(s2m_addr, task_id);
    label("pe_thread_manager_L_1");
    as_ld(master_task_id, m2s_addr); // 3rd
    as_mv(diff, master_task_id);
    as_sub(diff, task_id);
    as_cnz(compare, diff);
    lib_bc(compare, "pe_thread_manager_L_1");

    if (WIDTH_P_D == 32)
    {
      m_mandel();
    }
    else
    {
      m_fill_vram();
    }

    lib_ba("pe_thread_manager_L_0");
    label("pe_thread_manager_L_end");
    lib_call("f_halt");
    // link
    f_halt();
  }

  @Override
  public void init(String[] args)
  {
    super.init(args);
    DEPTH_REG = opts.getIntValue("pe_depth_reg");
    REGS = (1 << DEPTH_REG);
    SP_REG_STACK_POINTER = (REGS - 1);
    STACK_ADDRESS = ((1 << DEPTH_P_D) - 1);
    LREG0 = opts.getIntValue("lreg_start") + 0;
    LREG1 = opts.getIntValue("lreg_start") + 1;
    LREG2 = opts.getIntValue("lreg_start") + 2;
    LREG3 = opts.getIntValue("lreg_start") + 3;
    LREG4 = opts.getIntValue("lreg_start") + 4;
    LREG5 = opts.getIntValue("lreg_start") + 5;
    LREG6 = opts.getIntValue("lreg_start") + 6;

    FIFO_ADDR = (PE_W_BANK_FIFO << DEPTH_B_S_W);
    VRAM_ADDR_SHIFT = DEPTH_B_S_W - 3;
    VRAM_ADDR_H = ((FIFO_ADDR + (FIFO_BANK_VRAM << DEPTH_B_F)) >>> VRAM_ADDR_SHIFT);
    M2S_ADDR_H = PE_R_BANK_M2S;
    M2S_ADDR_SHIFT = DEPTH_B_S_R;
    ITEM_COUNT_ADDR_H = PE_R_BANK_ITEM_COUNT;
    ITEM_COUNT_ADDR_SHIFT = DEPTH_B_S_R;
    S2M_ADDR_SHIFT = DEPTH_B_S_W - 3;
    S2M_ADDR_H = ((FIFO_ADDR + (FIFO_BANK_S2M << DEPTH_B_F)) >>> S2M_ADDR_SHIFT);
    IMAGE_WIDTH_BITS = opts.getIntValue("image_width_bits");
    IMAGE_HEIGHT_BITS = opts.getIntValue("image_height_bits");
    IMAGE_WIDTH_HALF_BITS = (IMAGE_WIDTH_BITS - 1);
    IMAGE_HEIGHT_HALF_BITS = (IMAGE_HEIGHT_BITS - 1);
    IMAGE_WIDTH = (1 << IMAGE_WIDTH_BITS);
    IMAGE_HEIGHT = (1 << IMAGE_HEIGHT_BITS);
    IMAGE_WIDTH_HALF = (1 << IMAGE_WIDTH_HALF_BITS);
    IMAGE_HEIGHT_HALF = (1 << IMAGE_HEIGHT_HALF_BITS);
  }

  @Override
  public void program()
  {
    set_filename("default_pe_code");
    set_rom_width(WIDTH_I);
    set_rom_depth(DEPTH_P_I);
    pe_thread_manager();
  }

  @Override
  public void data()
  {
    set_filename("default_pe_data");
    set_rom_width(WIDTH_P_D);
    set_rom_depth(DEPTH_P_D);
  }
}
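
The scheduled instruction stream in m_mandel_core() is easier to follow next to a plain-Java version of the arithmetic. The sketch below transcribes the pseudocode comment of that method and is not part of the project. The class name MandelModel and the method signatures are hypothetical. Three assumptions are made: c starts at 0 (the PE code reuses whatever the register holds), multiplication wraps at 32 bits, and as_cnm tests for a non-negative value, as the `while (i >= 0)` comments in m_mandel() suggest.

```java
// Hypothetical reference model of m_mandel_core() (not project code).
final class MandelModel
{
  static final int FIXED_BITS = 13;    // Q13 fixed point: 1.0 == 1 << 13
  static final int FIXED_BITS_M1 = 12; // shifting by one bit less doubles a*b
  static final int MAX_C = 4;          // escape threshold for a*a + b*b

  // Returns the remaining iteration count that the PE stores into VRAM.
  static int mandelCount(int x, int y, int scale, int cx, int cy,
                         int imageWidthBits, int imageHeightBits)
  {
    int a = 0, b = 0, aa = 0, bb = 0;
    int c = 0; // assumption: the original leaves c uninitialized
    int count = 100;
    int maxC = MAX_C << FIXED_BITS;
    int x1 = (x - (1 << (imageWidthBits - 1))) * scale + cx;
    int y1 = (y - (1 << (imageHeightBits - 1))) * scale + cy;
    boolean limit;
    do
    {
      int pc = c;
      b = ((a * b) >> FIXED_BITS_M1) - y1; // as_srai: arithmetic shift
      a = aa - bb - x1;
      aa = (a * a) >>> FIXED_BITS;         // as_sri: logical shift
      bb = (b * b) >>> FIXED_BITS;
      c = aa + bb;
      count--;
      x1 += scale;
      pc = (pc - c) >>> 5;                 // change of c since the last pass
      limit = (maxC - c >= 0) && (count >= 0) && (pc != 0);
    } while (limit);
    return count;
  }

  // Pixel index to x coordinate, as m_mandel() does before each kernel call.
  static int pixelX(int i, int imageWidthBits)
  {
    return i & ((1 << imageWidthBits) - 1);
  }

  // Pixel index to y coordinate.
  static int pixelY(int i, int imageWidthBits)
  {
    return i >>> imageWidthBits;
  }
}
```

With scale, cx and cy all 0, the loop exits after one pass because c does not change, so the model returns 99. With cx = 1 << 13 and scale 0, a alternates between -1.0 and 0 in Q13, and the loop runs until count drops below zero.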