
Designing a Many-Core Processor


[Update history]
2019/01/24 First published

I implemented a many-core SoC using the Mini16-CPU designed in the previous article.

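Because each core consumes few resources, I was able to implement a 33-core processor on the Terasic DE0-CV and a 171-core processor on the BeMicro-CVA9.
These are upper limits set by the number of multiplier blocks on each FPGA; there are still logic resources to spare.
The registers and data path are 32 bits wide, and almost all of the Mini16-CPU options are enabled.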

Board                 Processor cores   Clock frequency
Terasic DE0-CV        33                140 MHz (running on hardware)
BeMicro-CVA9          171               100 MHz (running on hardware)
Kintex UltraScale+    129               500 MHz (no VGA, evaluated in Vivado)
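
The multiplier-bound core counts for the two Cyclone V boards can be sanity-checked with a little arithmetic. The sketch below is illustrative only. The DSP block counts (66 and 342) and the cost of 2 DSP blocks per core are assumptions that do not come from this article, the split into one master plus PEs is inferred from the CORES values used in the build instructions, and the class and method names are hypothetical.

```java
// Illustrative arithmetic only; the DSP figures below are assumptions.
class CoreBudget {
  // Assumption: each 32-bit core (multiplier enabled) uses 2 DSP blocks.
  static final int DSP_BLOCKS_PER_CORE = 2;

  // Upper bound on cores when multipliers are the limiting resource.
  static int maxCores(int dspBlocks) {
    return dspBlocks / DSP_BLOCKS_PER_CORE;
  }

  // Assumption: one core is the master, the remaining ones are PEs.
  static int peCount(int totalCores) {
    return totalCores - 1;
  }

  public static void main(String[] args) {
    String[] boards = {"Terasic DE0-CV", "BeMicro-CVA9"};
    int[] dspBlocks = {66, 342}; // assumed DSP block counts of the two devices
    for (int i = 0; i < boards.length; i++) {
      int cores = maxCores(dspBlocks[i]);
      System.out.println(boards[i] + ": " + cores + " cores = 1 master + "
          + peCount(cores) + " PEs");
    }
  }
}
```

Under these assumptions the totals come out at 33 and 171, and the PE counts at 32 and 170, which are the values set for CORES and PARALLEL when building.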

SoC configuration


Target boards

This project supports the following FPGA boards.
Terasic DE0-CV
BeMicro-CVA9

I/O voltage jumper setting

●BeMicro CV A9
The design assumes that the board I/O voltage of the BeMicro CV A9 is set to 3.3V.
Refer to p.23 of the BeMicro CV A9 Hardware Reference Guide and make sure that pin 1 and pin 2 of the VCCIO selection jumper (J11) are connected.

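Synthesis and execution

Source code download: mini16_manycore.tar.gz

●Building on Ubuntu is supported. The gcc, make, and OpenJDK8.0 packages are assumed to be installed.
●Terasic DE0-CV and BeMicro-CVA9
Quartus Prime is assumed to be installed as described in "Installing Altera's FPGA development tool Quartus Prime on Ubuntu".

In a terminal, extract the archive: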

tar xf mini16_manycore.tar.gz

In mini16_manycore/asm/AsmLib.java, set "public static final int CORES" to 32 (DE0-CV) or 170 (CVA9), and in MasterProgram.java set "private static final int PARALLEL" to 32 (DE0-CV) or 170 (CVA9). Then run make in the mini16_manycore directory.
Open the project file in the vendor's tool, then synthesize the design and program the board.

Project files:
Terasic DE0-CV: mini16_manycore/de0-cv/DE0_CV_start.qpf
BeMicro-CVA9: mini16_manycore/bemicro_cva9/bemicro_cva9_start.qpf

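The clock is set fairly high, so depending on the synthesis tool's random seed the design may not meet timing. If that happens, increase Assignments > Settings > Compiler Settings > Advanced Settings > Fitter Initial Placement Seed in Quartus by 1 and try again a few times. A "winning" place-and-route result should usually turn up within about 10 attempts.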

Simulation with the Verilog simulator Icarus Verilog

With Icarus Verilog you can develop and simulate the design even without an FPGA board.
Install iverilog and gtkwave as described in "Using the Icarus Verilog compiler", then run:

cd mini16_manycore/testbench

make run

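This runs the simulation. Open the generated wave.vcd in gtkwave, then drag and drop the signals you want to see from the signal list on the left of the window onto the waveform view on the right to inspect them.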

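Connecting a Raspberry Pi or PC

The FPGA can be connected over UART to a Raspberry Pi, or to a PC with a USB-serial cable, so that programs can be transferred and run from it.

Connecting other I/O

Transferring and running programs over UART

On the Raspberry Pi or PC set up as described above, run: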

cd mini16_manycore

make run

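This compiles the tools, compiles the programs, transfers them, and runs them.

How to program this CPU

A simple assembler that runs on Java is included under mini16_manycore/asm.
OpenJDK 8.0 or later must be installed to run it.
Create a class that extends the AsmLib class, and put the initialization settings in init(), the program in program(), and the data in data(). AsmTop.java must also be updated.
Change to the mini16_manycore directory and run make to generate the program binaries (default_code_mem.v, default_data_mem.v).
When using UART, run make run to transfer the program after it is built.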

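Parallel program example: drawing the Mandelbrot set

A demo program that draws the Mandelbrot set is included under mini16_manycore/asm.
MasterProgram.java is the program for the master core; it controls the PEs.
PEProgram.java is the program for the PEs; it computes the Mandelbrot set and draws it to the frame buffer.
BootProgram.java is a master-core program that transfers the PE program to the PEs. When the UART interface is used, this program runs first, followed by the program in MasterProgram.java. (See mini16_manycore/tools/Makefile.)
If the board is connected to a PC over UART, running make run in the mini16_manycore directory compiles all of the programs, transfers them, and runs them.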
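
The division of labour described above, with PEs that each compute part of the picture, can be tried out on a PC. The sketch below is not the code in PEProgram.java. The fixed-point format, the interleaved assignment of pixels to PEs, the coordinate range, and all names are assumptions chosen for illustration; it only restricts itself to integer multiply, add, and shift, which the CPU provides.

```java
// PC-side sketch of a PE-style Mandelbrot computation (all names hypothetical).
class MandelbrotSketch {
  static final int FRAC = 12;        // assumed number of fractional bits
  static final int ONE = 1 << FRAC;  // fixed-point representation of 1.0

  // Escape-time iteration using only integer multiply, add and shift.
  static int iterations(int cx, int cy, int maxIter) {
    int x = 0, y = 0;
    for (int i = 0; i < maxIter; i++) {
      int x2 = (x * x) >> FRAC;
      int y2 = (y * y) >> FRAC;
      if (x2 + y2 > 4 * ONE) {
        return i;                    // |z|^2 > 4: the point has escaped
      }
      int xy2 = (x * y) >> (FRAC - 1); // 2*x*y
      x = x2 - y2 + cx;
      y = xy2 + cy;
    }
    return maxIter;
  }

  // Assumed work split: PE number pe handles pixels pe, pe+parallel, ...
  // The outer loop stands in for PEs that would run concurrently.
  static int[] render(int width, int height, int parallel, int maxIter) {
    int[] image = new int[width * height];
    for (int pe = 0; pe < parallel; pe++) {
      for (int i = pe; i < image.length; i += parallel) {
        int px = i % width;
        int py = i / width;
        int cx = -2 * ONE + px * 3 * ONE / width;      // real axis: -2.0 .. 1.0
        int cy = -3 * ONE / 2 + py * 3 * ONE / height; // imaginary axis: -1.5 .. 1.5
        image[i] = iterations(cx, cy, maxIter);
      }
    }
    return image;
  }

  public static void main(String[] args) {
    int w = 64, h = 24, maxIter = 30;
    int[] img = render(w, h, 32, maxIter);
    for (int y = 0; y < h; y++) {
      StringBuilder line = new StringBuilder();
      for (int x = 0; x < w; x++) {
        line.append(img[y * w + x] == maxIter ? '#' : '.');
      }
      System.out.println(line);
    }
  }
}
```

Each pixel depends only on its own coordinates, so the result is the same for any number of PEs. That independence is what makes the problem easy to spread across many small cores.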

Source code

These source files are released under the BSD 2-Clause license. To get all of the source code, download mini16_manycore.tar.gz or see
https://github.com/miya4649/mini16_manycore.

mini16_cpu.v : CPU core
/*
  Copyright (c) 2018-2019, miya
  All rights reserved.

  Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
  PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/


module mini16_cpu
  #(
    parameter WIDTH_I = 16,
    parameter WIDTH_D = 16,
    parameter DEPTH_I = 8,
    parameter DEPTH_D = 8,
    parameter DEPTH_REG = 5,
    parameter REGFILE_RAM_TYPE = "auto",
    parameter ENABLE_MVIL = 1'b0,
    parameter ENABLE_MUL = 1'b0,
    parameter ENABLE_MULTI_BIT_SHIFT = 1'b0,
    parameter ENABLE_MVC = 1'b0,
    parameter ENABLE_WA = 1'b0,
    parameter ENABLE_INT = 1'b0,
    parameter FULL_PIPELINED_ALU = 1'b0
    )
  (
   input                    clk,
   input                    reset,
   input                    soft_reset,
   output reg [DEPTH_I-1:0] mem_i_r_addr,
   input [WIDTH_I-1:0]      mem_i_r_data,
   output reg [DEPTH_D-1:0] mem_d_r_addr,
   input [WIDTH_D-1:0]      mem_d_r_data,
   output reg [DEPTH_D-1:0] mem_d_w_addr,
   output reg [WIDTH_D-1:0] mem_d_w_data,
   output reg               mem_d_we
   );

  localparam TRUE = 1'b1;
  localparam FALSE = 1'b0;
  localparam ONE = 1'd1;
  localparam ZERO = 1'd0;
  localparam FFFF = {WIDTH_D{1'b1}};
  localparam SHIFT_BITS = $clog2(WIDTH_D);
  localparam BL_OFFSET = 1'd1;

  // opcode
  localparam I_NOP  = 5'h00; // 5'b00000;
  localparam I_ST   = 5'h01; // 5'b00001;
  localparam I_MVC  = 5'h02; // 5'b00010;
  localparam I_BA   = 5'h04; // 5'b00100;
  localparam I_BC   = 5'h05; // 5'b00101;
  localparam I_WA   = 5'h06; // 5'b00110;
  localparam I_BL   = 5'h07; // 5'b00111;
  localparam I_ADD  = 5'h08; // 5'b01000;
  localparam I_SUB  = 5'h09; // 5'b01001;
  localparam I_AND  = 5'h0a; // 5'b01010;
  localparam I_OR   = 5'h0b; // 5'b01011;
  localparam I_XOR  = 5'h0c; // 5'b01100;
  localparam I_MUL  = 5'h0d; // 5'b01101;
  localparam I_MV   = 5'h10; // 5'b10000;
  localparam I_MVIL = 5'h11; // 5'b10001;
  localparam I_LD   = 5'h17; // 5'b10111;
  localparam I_SR   = 5'h18; // 5'b11000;
  localparam I_SL   = 5'h19; // 5'b11001;
  localparam I_SRA  = 5'h1a; // 5'b11010;
  localparam I_CNZ  = 5'h1c; // 5'b11100;
  localparam I_CNM  = 5'h1d; // 5'b11101;

  // special register
  localparam SP_REG_CP   = 0;
  localparam SP_REG_MVIL = 1;

  // debug
`ifdef DEBUG
  reg [DEPTH_I-1:0] mem_i_r_addr_d1;
  reg [DEPTH_I-1:0] mem_i_r_addr_s1;
  always @(posedge clk)
    begin
      mem_i_r_addr_d1 <= mem_i_r_addr;
      mem_i_r_addr_s1 <= mem_i_r_addr_d1;
    end
`endif

  // stage 1 fetch
  reg  [WIDTH_I-1:0]   inst_s1;
  wire [DEPTH_REG-1:0] reg_d_s1;
  wire [DEPTH_REG-1:0] reg_a_s1;
  wire [4:0]           op_s1;
  wire                 is_im_s1;
  assign reg_d_s1 = inst_s1[15:11];
  assign reg_a_s1 = inst_s1[10:6];
  assign is_im_s1 = inst_s1[5];
  assign op_s1 = inst_s1[4:0];
  generate
    if (ENABLE_WA == TRUE)
      begin
        always @(posedge clk)
          begin
            if (reset == TRUE)
              begin
                inst_s1 <= ZERO;
              end
            else
              begin
                if (wait_en_s2 == TRUE)
                  begin
                    inst_s1 <= ZERO;
                  end
                else
                  begin
                    inst_s1 <= mem_i_r_data;
                  end
              end
          end
      end
    else
      begin
        always @(posedge clk)
          begin
            if (reset == TRUE)
              begin
                inst_s1 <= ZERO;
              end
            else
              begin
                inst_s1 <= mem_i_r_data;
              end
          end
      end
  endgenerate

  // stage 2 wait counter
  wire wait_en_s2;
  reg [4:0] wait_count_m1;
  reg [9:0] wait_counter_s2;
  generate
    if (ENABLE_WA == TRUE)
      begin
        assign wait_en_s2 = (wait_counter_s2 == ZERO) ? FALSE : TRUE;
        always @(posedge clk)
          begin
            if (reset == TRUE)
              begin
                wait_counter_s2 <= ZERO;
                wait_count_m1 <= ZERO;
              end
            else
              begin
                if (op_s1 == I_WA)
                  begin
                    wait_counter_s2 <= reg_a_s1;
                    wait_count_m1 <= reg_a_s1 - ONE;
                  end
                else
                  begin
                    if (wait_en_s2 == TRUE)
                      begin
                        wait_counter_s2 <= wait_counter_s2 - ONE;
                      end
                  end
              end
          end
      end
  endgenerate

  // stage 2 set reg read addr
  reg [DEPTH_REG-1:0] reg_addr_a_s2;
  reg [DEPTH_REG-1:0] reg_addr_b_s2;
  generate
    if (ENABLE_MVC == TRUE)
      begin
        always @(posedge clk)
          begin
            reg_addr_b_s2 <= reg_a_s1;
            if (op_s1 == I_MVC)
              begin
                reg_addr_a_s2 <= SP_REG_CP;
              end
            else
              begin
                reg_addr_a_s2 <= reg_d_s1;
              end
          end
      end
  endgenerate

  // stage 2 delay
  reg [4:0]           op_s2;
  reg                 is_im_s2;
  reg [DEPTH_REG-1:0] reg_d_s2;
  reg [DEPTH_REG-1:0] reg_a_s2;
  always @(posedge clk)
    begin
      op_s2 <= op_s1;
      is_im_s2 <= is_im_s1;
      reg_d_s2 <= reg_d_s1;
      reg_a_s2 <= reg_a_s1;
    end

  // stage 3 set dest reg addr
  reg [DEPTH_REG-1:0] reg_addr_d_s3;
  always @(posedge clk)
    begin
      if (reset == TRUE)
        begin
          reg_addr_d_s3 <= ZERO;
        end
      else
        begin
          if ((ENABLE_MVIL == TRUE) && (op_s2 == I_MVIL))
            begin
              reg_addr_d_s3 <= SP_REG_MVIL;
            end
          else
            begin
              reg_addr_d_s3 <= reg_d_s2;
            end
        end
    end

  // stage 3 delay
  reg [4:0]           op_s3;
  reg                 is_im_s3;
  reg [DEPTH_REG-1:0] reg_a_s3;
  always @(posedge clk)
    begin
      op_s3 <= op_s2;
      is_im_s3 <= is_im_s2;
      reg_a_s3 <= reg_a_s2;
    end
  reg [DEPTH_REG-1:0] reg_d_s3;
  generate
    if (ENABLE_MVIL == TRUE)
      begin
        always @(posedge clk)
          begin
            reg_d_s3 <= reg_d_s2;
          end
      end
  endgenerate

  // stage 4 fetch reg_data
  wire [WIDTH_D-1:0] reg_data_a_s_s3;
  wire [WIDTH_D-1:0] reg_data_b_s_s3;
  reg [WIDTH_D-1:0]  reg_data_a_s4;
  reg [WIDTH_D-1:0]  reg_data_b_s4;
  always @(posedge clk)
    begin
      reg_data_a_s4 <= reg_data_a_s_s3;
      if (reset == TRUE)
        begin
          reg_data_b_s4 <= ZERO;
        end
      else
        begin
          if ((ENABLE_MVIL == TRUE) && (op_s3 == I_MVIL))
            begin
              reg_data_b_s4 <= {reg_d_s3, reg_a_s3, is_im_s3};
            end
          else if (is_im_s3 == TRUE)
            begin
              reg_data_b_s4 <= $signed(reg_a_s3);
            end
          else
            begin
              reg_data_b_s4 <= reg_data_b_s_s3;
            end
        end
    end

  // stage 4 load address
  always @(posedge clk)
    begin
      if (reset == TRUE)
        begin
          mem_d_r_addr <= ZERO;
        end
      else
        begin
          if (op_s3 == I_LD)
            begin
              mem_d_r_addr <= reg_data_b_s_s3;
            end
        end
    end

  // stage 4 delay
  reg [4:0]           op_s4;
  reg [DEPTH_REG-1:0] reg_addr_d_s4;
  always @(posedge clk)
    begin
      op_s4 <= op_s3;
      reg_addr_d_s4 <= reg_addr_d_s3;
    end

  // stage 5 execute store
  always @(posedge clk)
    begin
      if (reset == TRUE)
        begin
          mem_d_w_addr <= ZERO;
          mem_d_w_data <= ZERO;
          mem_d_we <= FALSE;
        end
      else
        begin
          case (op_s4)
            I_ST:
              begin
                mem_d_w_addr <= reg_data_a_s4;
                mem_d_w_data <= reg_data_b_s4;
                mem_d_we <= TRUE;
              end
            default:
              begin
                mem_d_w_addr <= ZERO;
                mem_d_w_data <= ZERO;
                mem_d_we <= FALSE;
              end
          endcase
        end
    end

  // stage 5 calc BL address
  reg [DEPTH_I-1:0] bl_addr_s5;
  always @(posedge clk)
    begin
      bl_addr_s5 <= mem_i_r_addr + BL_OFFSET;
    end

  // stage 5 execute branch
  wire cond_true_s4;
  assign cond_true_s4 = (reg_data_a_s4 != ZERO) ? TRUE : FALSE;
  always @(posedge clk)
    begin
      if (reset == TRUE)
        begin
          mem_i_r_addr <= ZERO;
        end
      else
        begin
          // branch
          if ((ENABLE_INT == TRUE) && (soft_reset == TRUE))
            begin
              mem_i_r_addr <= ZERO;
            end
          else if ((op_s4 == I_BA) || (op_s4 == I_BL) || ((op_s4 == I_BC) && (cond_true_s4)))
            begin
              mem_i_r_addr <= reg_data_b_s4;
            end
          else if ((ENABLE_WA == TRUE) && (op_s4 == I_WA))
            begin
              mem_i_r_addr <= mem_i_r_addr - wait_count_m1;
            end
          else
            begin
              mem_i_r_addr <= mem_i_r_addr + ONE;
            end
        end
    end

  // stage 5 delay
  reg [4:0]           op_s5;
  reg [DEPTH_REG-1:0] reg_addr_d_s5;
  reg [WIDTH_D-1:0]   reg_data_a_s5;
  reg [WIDTH_D-1:0]   reg_data_b_s5;
  always @(posedge clk)
    begin
      op_s5 <= op_s4;
      reg_addr_d_s5 <= reg_addr_d_s4;
      reg_data_a_s5 <= reg_data_a_s4;
      reg_data_b_s5 <= reg_data_b_s4;
    end
  reg cond_true_s5;
  generate
    if (ENABLE_MVC == TRUE)
      begin
        always @(posedge clk)
          begin
            cond_true_s5 <= cond_true_s4;
          end
      end
  endgenerate

  // stage 6 compare
  reg flag_cnz_s6;
  reg flag_cnm_s6;
  always @(posedge clk)
    begin
      if (reg_data_b_s5 == ZERO)
        begin
          flag_cnz_s6 <= FALSE;
        end
      else
        begin
          flag_cnz_s6 <= TRUE;
        end

      if (reg_data_b_s5[WIDTH_D-1] == 1'b0)
        begin
          flag_cnm_s6 <= TRUE;
        end
      else
        begin
          flag_cnm_s6 <= FALSE;
        end
    end

  // stage 6 reg we
  reg reg_we_s6;
  wire stage6_reg_we_cond;
  generate
    if (ENABLE_MVC == TRUE)
      begin
        assign stage6_reg_we_cond = ((op_s5[4:3] != 2'b00) || (op_s5 == I_BL) || ((op_s5 == I_MVC) && (cond_true_s5 == TRUE)));
      end
    else
      begin
        assign stage6_reg_we_cond = ((op_s5[4:3] != 2'b00) || (op_s5 == I_BL));
      end
  endgenerate
  always @(posedge clk)
    begin
      if (stage6_reg_we_cond)
        begin
          reg_we_s6 <= TRUE;
        end
      else
        begin
          reg_we_s6 <= FALSE;
        end
    end

  // stage 6 delay
  reg [4:0]           op_s6;
  reg [DEPTH_REG-1:0] reg_addr_d_s6;
  reg [WIDTH_D-1:0]   reg_data_a_s6;
  reg [WIDTH_D-1:0]   reg_data_b_s6;
  reg [DEPTH_I-1:0]   bl_addr_s6;
  always @(posedge clk)
    begin
      op_s6 <= op_s5;
      reg_addr_d_s6 <= reg_addr_d_s5;
      reg_data_a_s6 <= reg_data_a_s5;
      reg_data_b_s6 <= reg_data_b_s5;
      bl_addr_s6 <= bl_addr_s5;
    end

  // stage 6 pre-execute
  reg [WIDTH_D-1:0] reg_data_add_s6;
  reg [WIDTH_D-1:0] reg_data_sub_s6;
  reg [WIDTH_D-1:0] reg_data_and_s6;
  reg [WIDTH_D-1:0] reg_data_or_s6;
  reg [WIDTH_D-1:0] reg_data_xor_s6;
  generate
    if (FULL_PIPELINED_ALU == TRUE)
      begin
        always @(posedge clk)
          begin
            reg_data_add_s6 <= reg_data_a_s5 + reg_data_b_s5;
            reg_data_sub_s6 <= reg_data_a_s5 - reg_data_b_s5;
            reg_data_and_s6 <= reg_data_a_s5 & reg_data_b_s5;
            reg_data_or_s6  <= reg_data_a_s5 | reg_data_b_s5;
            reg_data_xor_s6 <= reg_data_a_s5 ^ reg_data_b_s5;
          end
      end
  endgenerate

  // stage 7 execute
  reg [WIDTH_D-1:0] reg_data_w_s7;
  always @(posedge clk)
    begin
      case (op_s6)
        I_ADD:
          begin
            if (FULL_PIPELINED_ALU == TRUE)
              begin
                reg_data_w_s7 <= reg_data_add_s6;
              end
            else
              begin
                reg_data_w_s7 <= reg_data_a_s6 + reg_data_b_s6;
              end
          end
        I_SUB:
          begin
            if (FULL_PIPELINED_ALU == TRUE)
              begin
                reg_data_w_s7 <= reg_data_sub_s6;
              end
            else
              begin
                reg_data_w_s7 <= reg_data_a_s6 - reg_data_b_s6;
              end
          end
        I_AND:
          begin
            if (FULL_PIPELINED_ALU == TRUE)
              begin
                reg_data_w_s7 <= reg_data_and_s6;
              end
            else
              begin
                reg_data_w_s7 <= reg_data_a_s6 & reg_data_b_s6;
              end
          end
        I_OR:
          begin
            if (FULL_PIPELINED_ALU == TRUE)
              begin
                reg_data_w_s7 <= reg_data_or_s6;
              end
            else
              begin
                reg_data_w_s7 <= reg_data_a_s6 | reg_data_b_s6;
              end
          end
        I_XOR:
          begin
            if (FULL_PIPELINED_ALU == TRUE)
              begin
                reg_data_w_s7 <= reg_data_xor_s6;
              end
            else
              begin
                reg_data_w_s7 <= reg_data_a_s6 ^ reg_data_b_s6;
              end
          end
        I_SR:
          begin
            reg_data_w_s7 <= sr_result_s6;
          end
        I_SL:
          begin
            reg_data_w_s7 <= sl_result_s6;
          end
        I_SRA:
          begin
            reg_data_w_s7 <= sra_result_s6;
          end
        I_CNZ:
          begin
            reg_data_w_s7 <= {WIDTH_D{flag_cnz_s6}};
          end
        I_CNM:
          begin
            reg_data_w_s7 <= {WIDTH_D{flag_cnm_s6}};
          end
        I_BL:
          begin
            reg_data_w_s7 <= bl_addr_s6;
          end
        I_MUL:
          begin
            if (ENABLE_MUL == TRUE)
              begin
                reg_data_w_s7 <= mul_result_s6;
              end
            else
              begin
                reg_data_w_s7 <= reg_data_b_s6;
              end
          end
        I_LD:
          begin
            reg_data_w_s7 <= mem_d_r_data;
          end
        // I_MV, I_MVIL
        default:
          begin
            reg_data_w_s7 <= reg_data_b_s6;
          end
      endcase
    end

  // stage 7 delay
  reg [DEPTH_REG-1:0] reg_addr_d_s7;
  reg reg_we_s7;
  always @(posedge clk)
    begin
      reg_addr_d_s7 <= reg_addr_d_s6;
      reg_we_s7 <= reg_we_s6;
    end

  wire [DEPTH_REG-1:0] reg_file_addr_r_a;
  wire [DEPTH_REG-1:0] reg_file_addr_r_b;
  generate
    if (ENABLE_MVC == TRUE)
      begin
        assign reg_file_addr_r_a = reg_addr_a_s2;
        assign reg_file_addr_r_b = reg_addr_b_s2;
      end
    else
      begin
        assign reg_file_addr_r_a = reg_d_s2;
        assign reg_file_addr_r_b = reg_a_s2;
      end
  endgenerate
  r2w1_port_ram
    #(
      .DATA_WIDTH (WIDTH_D),
      .ADDR_WIDTH (DEPTH_REG),
      .RAM_TYPE (REGFILE_RAM_TYPE)
      )
  reg_file
    (
     .clk (clk),
     .addr_r_a (reg_file_addr_r_a),
     .addr_r_b (reg_file_addr_r_b),
     .addr_w (reg_addr_d_s7),
     .data_in (reg_data_w_s7),
     .we (reg_we_s7),
     .data_out_a (reg_data_a_s_s3),
     .data_out_b (reg_data_b_s_s3)
     );

  wire [WIDTH_D-1:0] mul_result_s6;
  generate
    if (ENABLE_MUL == TRUE)
      begin
        delayed_mul
          #(
            .WIDTH_D (WIDTH_D)
            )
        delayed_mul_0
          (
           .clk (clk),
           .a (reg_data_a_s4),
           .b (reg_data_b_s4),
           .out (mul_result_s6)
           );
      end
  endgenerate

  wire [WIDTH_D-1:0] sr_result_s6;
  wire [WIDTH_D-1:0] sl_result_s6;
  wire [WIDTH_D-1:0] sra_result_s6;
  reg [WIDTH_D-1:0]  sr_result_s6_reg;
  reg [WIDTH_D-1:0]  sl_result_s6_reg;
  reg [WIDTH_D-1:0]  sra_result_s6_reg;
  generate
    if (ENABLE_MULTI_BIT_SHIFT == TRUE)
      begin

        delayed_sr
          #(
            .WIDTH_D (WIDTH_D),
            .SHIFT_BITS (SHIFT_BITS)
            )
        delayed_sr_0
          (
           .clk (clk),
           .a (reg_data_a_s4),
           .b (reg_data_b_s4[SHIFT_BITS-1:0]),
           .out (sr_result_s6)
           );

        delayed_sl
          #(
            .WIDTH_D (WIDTH_D),
            .SHIFT_BITS (SHIFT_BITS)
            )
        delayed_sl_0
          (
           .clk (clk),
           .a (reg_data_a_s4),
           .b (reg_data_b_s4[SHIFT_BITS-1:0]),
           .out (sl_result_s6)
           );

        delayed_sra
          #(
            .WIDTH_D (WIDTH_D),
            .SHIFT_BITS (SHIFT_BITS)
            )
        delayed_sra_0
          (
           .clk (clk),
           .a (reg_data_a_s4),
           .b (reg_data_b_s4[SHIFT_BITS-1:0]),
           .out (sra_result_s6)
           );
      end
    else
      begin
        always @(posedge clk)
          begin
            sr_result_s6_reg <= {1'b0, reg_data_a_s5[WIDTH_D-1:1]};
            sl_result_s6_reg <= {reg_data_a_s5[WIDTH_D-2:0], 1'b0};
            sra_result_s6_reg <= {reg_data_a_s5[WIDTH_D-1], reg_data_a_s5[WIDTH_D-1:1]};
          end
        assign sr_result_s6 = sr_result_s6_reg;
        assign sl_result_s6 = sl_result_s6_reg;
        assign sra_result_s6 = sra_result_s6_reg;
      end
  endgenerate

endmodule

module delayed_mul
  #(
    parameter WIDTH_D = 16
    )
  (
   input                           clk,
   input signed [WIDTH_D-1:0]      a,
   input signed [WIDTH_D-1:0]      b,
   output reg signed [WIDTH_D-1:0] out
   );

  reg signed [WIDTH_D-1:0]         sa;
  reg signed [WIDTH_D-1:0]         sb;

  always @(posedge clk)
    begin
      sa <= a;
      sb <= b;
      out <= sa * sb;
    end
endmodule

module delayed_sr
  #(
    parameter WIDTH_D = 16,
    parameter SHIFT_BITS = 4
    )
  (
   input                    clk,
   input [WIDTH_D-1:0]      a,
   input [SHIFT_BITS-1:0]   b,
   output reg [WIDTH_D-1:0] out
   );

  reg [WIDTH_D-1:0]         sa;
  reg [SHIFT_BITS-1:0]      sb;

  always @(posedge clk)
    begin
      sa <= a;
      sb <= b;
      out <= sa >> sb;
    end
endmodule

module delayed_sl
  #(
    parameter WIDTH_D = 16,
    parameter SHIFT_BITS = 4
    )
  (
   input                    clk,
   input [WIDTH_D-1:0]      a,
   input [SHIFT_BITS-1:0]   b,
   output reg [WIDTH_D-1:0] out
   );

  reg [WIDTH_D-1:0]         sa;
  reg [SHIFT_BITS-1:0]      sb;

  always @(posedge clk)
    begin
      sa <= a;
      sb <= b;
      out <= sa << sb;
    end
endmodule

module delayed_sra
  #(
    parameter WIDTH_D = 16,
    parameter SHIFT_BITS = 4
    )
  (
   input                    clk,
   input [WIDTH_D-1:0]      a,
   input [SHIFT_BITS-1:0]   b,
   output reg [WIDTH_D-1:0] out
   );

  reg signed [WIDTH_D-1:0]  sa;
  reg [SHIFT_BITS-1:0]      sb;

  always @(posedge clk)
    begin
      sa <= a;
      sb <= b;
      out <= sa >>> sb;
    end
endmodule
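
The fetch stage of mini16_cpu.v above splits each 16-bit instruction word into four fields: bits [15:11] are reg_d, bits [10:6] are reg_a, bit [5] is is_im, and bits [4:0] are the opcode. The following sketch mirrors that layout on a PC. The class, its method names, and the encode() helper are hypothetical and are not part of the project's assembler; only the bit positions and the two opcode values are taken from the listing.

```java
// Field layout of a Mini16 instruction word, as read from mini16_cpu.v.
class Mini16Decode {
  static final int I_ADD = 0x08;   // opcode values copied from the listing
  static final int I_MVIL = 0x11;

  static int op(int inst)   { return inst & 0x1f; }          // inst[4:0]
  static boolean isIm(int inst) { return ((inst >> 5) & 1) != 0; } // inst[5]
  static int regA(int inst) { return (inst >> 6) & 0x1f; }   // inst[10:6]
  static int regD(int inst) { return (inst >> 11) & 0x1f; }  // inst[15:11]

  // When is_im is set, the reg_a field is used as a signed 5-bit
  // immediate (-16..15), matching $signed(reg_a_s3) in stage 4.
  static int imm5(int inst) {
    return (regA(inst) << 27) >> 27; // sign-extend bit 4
  }

  // For MVIL the 11 bits {reg_d, reg_a, is_im} form a zero-extended immediate.
  static int imm11(int inst) {
    return (inst >> 5) & 0x7ff;
  }

  // Hypothetical helper that packs the four fields into a word.
  static int encode(int regD, int regA, boolean isIm, int op) {
    return ((regD & 0x1f) << 11) | ((regA & 0x1f) << 6)
        | ((isIm ? 1 : 0) << 5) | (op & 0x1f);
  }

  public static void main(String[] args) {
    int inst = encode(3, -1, true, I_ADD); // add the immediate -1 to register 3
    System.out.println(String.format("word=0x%04x op=0x%02x rd=%d imm=%d",
        inst, op(inst), regD(inst), imm5(inst)));
  }
}
```

In the listing the register file's read port A is addressed by reg_d, port B by reg_a, and the result is written back to reg_d. The ALU operations therefore read as "rd = rd op (ra or immediate)".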
mini16_pe.v : Processor Element
/*
  Copyright (c) 2019, miya
  All rights reserved.

  Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
  PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/


module mini16_pe
  #(
    parameter WIDTH_D = 16,
    parameter DEPTH_I = 8,
    parameter DEPTH_D = 8,
    parameter DEPTH_M2S = 8,
    parameter DEPTH_FIFO = 7,
    parameter CORE_ID = 0,
    parameter MASTER_W_BANK_BC = 63,
    parameter DEPTH_V_F = 16,
    parameter DEPTH_B_F = 15,
    parameter DEPTH_V_M_W = 17,
    parameter DEPTH_B_M_W = 11,
    parameter DEPTH_V_S_R = 10,
    parameter DEPTH_B_S_R = 8,
    parameter DEPTH_V_S_W = 9,
    parameter DEPTH_B_S_W = 8,
    parameter DEPTH_V_M2S = 9,
    parameter DEPTH_B_M2S = 8,
    parameter FIFO_RAM_TYPE = "auto",
    parameter REGFILE_RAM_TYPE = "auto"
    )
  (
   input                          clk,
   input                          reset,
   input                          soft_reset,
   input                          fifo_req_r,
   output                         fifo_valid,
   output [WIDTH_D+DEPTH_V_F-1:0] fifo_r_data,
   input [DEPTH_V_M_W-1:0]        addr_i,
   input [WIDTH_D-1:0]            data_i,
   input                          we_i
   );

  localparam WIDTH_I = 16;
  localparam DEPTH_REG = 5;
  localparam TRUE = 1'b1;
  localparam FALSE = 1'b0;
  localparam ONE = 1'd1;
  localparam ZERO = 1'd0;
  localparam FFFF = {WIDTH_D{1'b1}};

  wire [DEPTH_I-1:0]     cpu_i_r_addr;
  wire [WIDTH_I-1:0]     cpu_i_r_data;
  wire [DEPTH_V_S_W-1:0] cpu_d_r_addr;
  reg [WIDTH_D-1:0]      cpu_d_r_data;
  wire [DEPTH_V_S_W-1:0] cpu_d_w_addr;
  wire [WIDTH_D-1:0]     cpu_d_w_data;
  wire                   cpu_d_we;
  wire [DEPTH_V_S_W-DEPTH_B_S_W-1:0] cpu_d_w_bank;
  wire [DEPTH_V_S_R-DEPTH_B_S_R-1:0] cpu_d_r_bank;

  // cpu data write
  reg [DEPTH_D-1:0]  mem_d_w_addr;
  reg [WIDTH_D-1:0]  mem_d_w_data;
  reg                mem_d_we;
  assign cpu_d_w_bank = cpu_d_w_addr[DEPTH_V_S_W-1:DEPTH_B_S_W];
  always @(posedge clk)
    begin
      mem_d_w_addr <= cpu_d_w_addr[DEPTH_D-1:0];
      mem_d_w_data <= cpu_d_w_data;
      s2mfifo_data_w <= {cpu_d_w_addr[DEPTH_V_F-1:0], cpu_d_w_data};
      if (cpu_d_we == TRUE)
        begin
          case (cpu_d_w_bank)
            0:
              begin
                // mem_d
                mem_d_we <= TRUE;
                s2mfifo_we <= FALSE;
              end
            default:
              begin
                // fifo
                mem_d_we <= FALSE;
                s2mfifo_we <= TRUE;
              end
          endcase
        end
      else
        begin
          mem_d_we <= FALSE;
          s2mfifo_we <= FALSE;
        end
    end

  // cpu data read
  wire [DEPTH_D-1:0] mem_d_r_addr;
  wire [WIDTH_D-1:0] mem_d_r_data;
  wire [WIDTH_D-1:0] shared_m2s_r_data;
  assign mem_d_r_addr = cpu_d_r_addr[DEPTH_D-1:0];
  assign cpu_d_r_bank = cpu_d_r_addr[DEPTH_V_S_R-1:DEPTH_B_S_R];
  always @(posedge clk)
    begin
      case (cpu_d_r_bank)
        // mem_d
        0: cpu_d_r_data <= mem_d_r_data;
        // shared_m2s
        1: cpu_d_r_data <= shared_m2s_r_data;
        // register
        default: cpu_d_r_data <= s2mfifo_item_count;
      endcase
    end

  // data from master
  reg shared_m2s_we;
  reg mem_i_we;
  reg [DEPTH_V_M_W-1:0] addr_i_d1;
  reg [WIDTH_D-1:0]     data_i_d1;
  reg [DEPTH_V_M_W-1:0] addr_i_d2;
  reg [WIDTH_D-1:0]     data_i_d2;
  reg                   we_i_d1;
  wire [DEPTH_V_M_W-DEPTH_B_M_W-1:0] core_bank;
  wire [DEPTH_V_M2S-DEPTH_B_M2S-1:0] m2s_bank;
  assign core_bank = addr_i_d1[DEPTH_V_M_W-1:DEPTH_B_M_W];
  assign m2s_bank = addr_i_d1[DEPTH_V_M2S-1:DEPTH_B_M2S];

  always @(posedge clk)
    begin
      addr_i_d1 <= addr_i;
      data_i_d1 <= data_i;
      addr_i_d2 <= addr_i_d1;
      data_i_d2 <= data_i_d1;
      we_i_d1 <= we_i;
    end

  always @(posedge clk)
    begin
      if ((we_i_d1 == TRUE) && ((core_bank == CORE_ID) || (core_bank == MASTER_W_BANK_BC)))
        begin
          case (m2s_bank)
            0:
              begin
                shared_m2s_we <= TRUE;
                mem_i_we <= FALSE;
              end
            default:
              begin
                shared_m2s_we <= FALSE;
                mem_i_we <= TRUE;
              end
          endcase
        end
      else
        begin
          shared_m2s_we <= FALSE;
          mem_i_we <= FALSE;
        end
    end

  mini16_cpu
    #(
      .WIDTH_I (WIDTH_I),
      .WIDTH_D (WIDTH_D),
      .DEPTH_I (DEPTH_I),
      .DEPTH_D (DEPTH_V_S_W),
      .DEPTH_REG (DEPTH_REG),
      .ENABLE_MVIL (TRUE),
      .ENABLE_MUL (TRUE),
      .ENABLE_MULTI_BIT_SHIFT (TRUE),
      .ENABLE_MVC (TRUE),
      .ENABLE_WA (TRUE),
      .ENABLE_INT (TRUE),
      .FULL_PIPELINED_ALU (FALSE),
      .REGFILE_RAM_TYPE (REGFILE_RAM_TYPE)
      )
  mini16_cpu_0
    (
     .clk (clk),
     .reset (reset),
     .soft_reset (soft_reset),
     .mem_i_r_addr (cpu_i_r_addr),
     .mem_i_r_data (cpu_i_r_data),
     .mem_d_r_addr (cpu_d_r_addr),
     .mem_d_r_data (cpu_d_r_data),
     .mem_d_w_addr (cpu_d_w_addr),
     .mem_d_w_data (cpu_d_w_data),
     .mem_d_we (cpu_d_we)
     );

  default_pe_code_mem
    #(
      .DATA_WIDTH (WIDTH_I),
      .ADDR_WIDTH (DEPTH_I)
      )
  mem_i
    (
     .clk (clk),
     .addr_r (cpu_i_r_addr),
     .addr_w (addr_i_d2[DEPTH_I-1:0]),
     .data_in (data_i_d2[WIDTH_I-1:0]),
     .we (mem_i_we),
     .data_out (cpu_i_r_data)
     );

  default_pe_data_mem
    #(
      .DATA_WIDTH (WIDTH_D),
      .ADDR_WIDTH (DEPTH_D)
      )
  mem_d
    (
     .clk (clk),
     .addr_r (mem_d_r_addr),
     .addr_w (mem_d_w_addr),
     .data_in (mem_d_w_data),
     .we (mem_d_we),
     .data_out (mem_d_r_data)
     );

  rw_port_ram
    #(
      .DATA_WIDTH (WIDTH_D),
      .ADDR_WIDTH (DEPTH_M2S)
      )
  shared_m2s
    (
     .clk (clk),
     .addr_r (mem_d_r_addr[DEPTH_M2S-1:0]),
     .addr_w (addr_i_d2[DEPTH_M2S-1:0]),
     .data_in (data_i_d2),
     .we (shared_m2s_we),
     .data_out (shared_m2s_r_data)
     );

  reg s2mfifo_we;
  reg [WIDTH_D+DEPTH_V_F-1:0] s2mfifo_data_w;
  wire [DEPTH_FIFO-1:0] s2mfifo_item_count;
  fifo
    #(
      .WIDTH (WIDTH_D+DEPTH_V_F),
      .DEPTH_IN_BITS (DEPTH_FIFO),
      .MAX_ITEMS (((1 << DEPTH_FIFO) - 7)),
      .RAM_TYPE (FIFO_RAM_TYPE)
      )
  s2mfifo
    (
     .clk (clk),
     .reset (reset),
     .req_r (fifo_req_r),
     .we (s2mfifo_we),
     .data_w (s2mfifo_data_w),
     .data_r (fifo_r_data),
     .valid_r (fifo_valid),
     .full (),
     .item_count (s2mfifo_item_count),
     .empty ()
     );

endmodule
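
In mini16_pe.v above, the upper bits of a data address select where an access goes. On reads, bank 0 is the local data memory, bank 1 is the shared_m2s memory written by the master, and any other bank returns the FIFO item count. On writes, bank 0 is the local data memory and any other bank pushes the address and data into s2mfifo. The sketch below models only that selection. The bit widths are computed from the default parameters in the mini16_soc listing and assume the top level does not override them. The class and enum names are hypothetical, and the one-cycle register delays of the hardware are not modelled.

```java
// Combinational model of the PE's data-address bank selection.
class PeAddressMap {
  // Derived from mini16_soc defaults (assumed not overridden):
  // DEPTH_B_S_R = max(DEPTH_P_D, DEPTH_M2S) = 8, DEPTH_V_S_R = 8 + 2
  static final int DEPTH_B_S_R = 8, DEPTH_V_S_R = 10;
  // DEPTH_B_S_W = max(DEPTH_V_F, DEPTH_P_D) = 18, DEPTH_V_S_W = 18 + 1
  static final int DEPTH_B_S_W = 18, DEPTH_V_S_W = 19;

  enum ReadTarget { LOCAL_DATA_MEM, SHARED_M2S, FIFO_ITEM_COUNT }
  enum WriteTarget { LOCAL_DATA_MEM, FIFO_TO_MASTER }

  // Equivalent of the Verilog part-select addr[hi-1:lo].
  static int bits(int addr, int hi, int lo) {
    return (addr >>> lo) & ((1 << (hi - lo)) - 1);
  }

  static ReadTarget readTarget(int addr) {
    switch (bits(addr, DEPTH_V_S_R, DEPTH_B_S_R)) {
      case 0:  return ReadTarget.LOCAL_DATA_MEM;
      case 1:  return ReadTarget.SHARED_M2S;
      default: return ReadTarget.FIFO_ITEM_COUNT;
    }
  }

  static WriteTarget writeTarget(int addr) {
    return bits(addr, DEPTH_V_S_W, DEPTH_B_S_W) == 0
        ? WriteTarget.LOCAL_DATA_MEM
        : WriteTarget.FIFO_TO_MASTER;
  }

  public static void main(String[] args) {
    System.out.println("read  0x100 -> " + readTarget(0x100));
    System.out.println("write 0x40000 -> " + writeTarget(0x40000));
  }
}
```

Under these assumptions a PE program reaches its own data memory with small addresses, and any write with bit 18 set is queued for the rest of the SoC instead.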
mini16_soc.v : SoC
/*
  Copyright (c) 2019, miya
  All rights reserved.

  Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
  PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/


module mini16_soc
  #(
    parameter CORES = 32,
    parameter UART_CLK_HZ = 50000000,
    parameter UART_SCLK_HZ = 115200,
    parameter WIDTH_M_D = 32,
    parameter WIDTH_P_D = 32,
    parameter DEPTH_M_I = 11,
    parameter DEPTH_M_D = 11,
    parameter DEPTH_P_I = 10,
    parameter DEPTH_P_D = 8,
    parameter DEPTH_M2S = 8,
    parameter DEPTH_FIFO = 4,
    parameter DEPTH_S2M = 8,
    parameter DEPTH_U2M = 11,
    parameter WIDTH_VRAM = 3,
    parameter DEPTH_VRAM = 17,
    parameter MASTER_REGFILE_RAM_TYPE = "auto",
    parameter PE_REGFILE_RAM_TYPE = "auto",
    parameter PE_FIFO_RAM_TYPE = "auto"
    )
  (
   input  clk,
   input  reset,
`ifdef USE_UART
   input  uart_rxd,
   output uart_txd,
`endif
`ifdef USE_VGA
   input  clkv,
   input  resetv,
   output vga_hs,
   output vga_vs,
   output vga_r,
   output vga_g,
   output vga_b,
`endif
   output [15:0] led
   );

  localparam WIDTH_I = 16;
  localparam DEPTH_REG = 5;
  localparam DEPTH_IO_REG = 5;
  localparam DEPTH_B_U = max(DEPTH_M_I, DEPTH_U2M);
  localparam DEPTH_V_U = (DEPTH_B_U + 2);
  localparam CORE_BITS = $clog2(CORES + 6);
  localparam DEPTH_B_F = max(DEPTH_VRAM, DEPTH_S2M);
  localparam DEPTH_V_F = (DEPTH_B_F + 1);
  localparam DEPTH_B_M2S = max(DEPTH_P_I, DEPTH_M2S);
  localparam DEPTH_V_M2S = (DEPTH_B_M2S + 1);
  localparam DEPTH_B_M_W = max(DEPTH_V_M2S, max(DEPTH_M_D, DEPTH_IO_REG));
  localparam DEPTH_V_M_W = (DEPTH_B_M_W + CORE_BITS);
  localparam DEPTH_B_M_R = max(DEPTH_M_D, max(DEPTH_IO_REG, max(DEPTH_U2M, DEPTH_S2M)));
  localparam DEPTH_V_M_R = (DEPTH_B_M_R + 2);
  localparam DEPTH_B_S_R = max(DEPTH_P_D, DEPTH_M2S);
  localparam DEPTH_V_S_R = (DEPTH_B_S_R + 2);
  localparam DEPTH_B_S_W = max(DEPTH_V_F, DEPTH_P_D);
  localparam DEPTH_V_S_W = (DEPTH_B_S_W + 1);
  localparam PE_ID_START = 4;

  localparam MASTER_W_BANK_BC = ((1 << CORE_BITS) - 1);
  localparam MASTER_W_BANK_MEM_D = 0;
  localparam MASTER_W_BANK_IO_REG = 1;
  localparam MASTER_R_BANK_MEM_D = 0;
  localparam MASTER_R_BANK_IO_REG = 1;
  localparam MASTER_R_BANK_U2M = 2;
  localparam MASTER_R_BANK_S2M = 3;
  localparam UART_IO_ADDR_RESET = ((1 << DEPTH_B_U) + 0);
  localparam UART_BANK_MEM_I = 0;
  localparam UART_BANK_U2M = 2;
  localparam FIFO_BANK_S2M = 0;
  localparam FIFO_BANK_VRAM = 1;
  localparam IO_REG_R_UART_BUSY = 0;
  localparam IO_REG_R_VGA_VSYNC = 1;
  localparam IO_REG_R_VGA_VCOUNT = 2;
  localparam IO_REG_W_RESET_PE = 0;
  localparam IO_REG_W_LED = 1;
  localparam IO_REG_W_UART = 2;
  localparam IO_REG_W_SPRITE_X = 3;
  localparam IO_REG_W_SPRITE_Y = 4;
  localparam IO_REG_W_SPRITE_SCALE = 5;

  localparam TRUE = 1'b1;
  localparam FALSE = 1'b0;
  localparam ONE = 1'd1;
  localparam ZERO = 1'd0;

  function integer max (input integer a1, input integer a2);
    begin
      if (a1 > a2)
        begin
          max = a1;
        end
      else
        begin
          max = a2;
        end
    end
  endfunction

  // LED
  assign led = io_reg_w[IO_REG_W_LED];

  // Master IO reg
  reg [WIDTH_M_D-1:0] io_reg_r[0:((1 << DEPTH_IO_REG) - 1)];
  reg [WIDTH_M_D-1:0] io_reg_w[0:((1 << DEPTH_IO_REG) - 1)];

  // Master read
  wire [DEPTH_V_M_R-DEPTH_B_M_R-1:0] master_d_r_bank;
  assign master_d_r_bank = master_d_r_addr[DEPTH_V_M_R-1:DEPTH_B_M_R];
  always @(posedge clk)
    begin
      case (master_d_r_bank)
        MASTER_R_BANK_MEM_D:
          begin
            master_d_r_data <= master_mem_d_r_data;
          end
        MASTER_R_BANK_IO_REG:
          begin
            master_d_r_data <= io_reg_r[master_d_r_addr[DEPTH_IO_REG-1:0]];
          end
`ifdef USE_UART
        MASTER_R_BANK_U2M:
          begin
            master_d_r_data <= u2m_r_data;
          end
`endif
        default:
          begin
            master_d_r_data <= {{(WIDTH_M_D-WIDTH_P_D){1'b0}}, s2m_r_data};
          end
      endcase
    end

  // Master mem_d write
  reg [DEPTH_V_M_W-1:0] master_d_w_addr_d1;
  reg [WIDTH_M_D-1:0] master_d_w_data_d1;
  reg                 master_d_we_d1;
  always @(posedge clk)
    begin
      master_d_w_addr_d1 <= master_d_w_addr;
      master_d_w_data_d1 <= master_d_w_data;
      master_d_we_d1 <= master_d_we;
    end

  always @(posedge clk)
    begin
      if (reset == TRUE)
        begin
          master_mem_d_we <= FALSE;
        end
      else
        begin
          if ((master_d_we == TRUE) && (master_d_w_bank == MASTER_W_BANK_MEM_D))
            begin
              master_mem_d_we <= TRUE;
            end
          else
            begin
              master_mem_d_we <= FALSE;
            end
        end
    end

  // Master IO reg read
  always @(posedge clk)
    begin
`ifdef USE_UART
      io_reg_r[IO_REG_R_UART_BUSY] <= uart_io_busy;
`endif
`ifdef USE_VGA
      io_reg_r[IO_REG_R_VGA_VSYNC] <= vga_vsync;
      io_reg_r[IO_REG_R_VGA_VCOUNT] <= vga_vcount;
`endif
    end

  // Master IO reg write
  wire [WIDTH_M_D-1:0] io_reg_w_data;
  wire [DEPTH_IO_REG-1:0] io_reg_w_addr;
  reg io_reg_we;
  assign io_reg_w_data = master_d_w_data_d1;
  assign io_reg_w_addr = master_d_w_addr_d1[DEPTH_IO_REG-1:0];
  always @(posedge clk)
    begin
      if ((master_d_we == TRUE) && (master_d_w_bank == MASTER_W_BANK_IO_REG))
        begin
          io_reg_we <= TRUE;
        end
      else
        begin
          io_reg_we <= FALSE;
        end
      if (io_reg_we == TRUE)
        begin
          io_reg_w[io_reg_w_addr] <= io_reg_w_data;
        end
    end

`ifdef USE_UART
  // Master IO reg write: UART TX we
  always @(posedge clk)
    begin
      if (reset == TRUE)
        begin
          uart_io_tx_we <= FALSE;
        end
      else
        begin
          if ((master_d_we == TRUE) && (master_d_w_addr == ((MASTER_W_BANK_IO_REG << DEPTH_B_M_W) + IO_REG_W_UART)))
            begin
              uart_io_tx_we <= TRUE;
            end
          else
            begin
              uart_io_tx_we <= FALSE;
            end
        end
    end
`endif

  // harvester
  reg [DEPTH_V_F-1:0] s2m_w_addr;
  reg [WIDTH_P_D-1:0] s2m_w_data;
  reg s2m_we;
  reg vram_we;
  wire [DEPTH_V_F-DEPTH_B_F-1:0] harvester_w_bank;
  assign harvester_w_bank = harvester_w_addr[DEPTH_V_F-1:DEPTH_B_F];
  always @(posedge clk)
    begin
      s2m_w_addr <= harvester_w_addr;
      s2m_w_data <= harvester_w_data;
      if (harvester_we == TRUE)
        begin
          if (harvester_w_bank == FIFO_BANK_S2M)
            begin
              s2m_we <= TRUE;
              vram_we <= FALSE;
            end
          else
            begin
              s2m_we <= FALSE;
              vram_we <= TRUE;
            end
        end
      else
        begin
          s2m_we <= FALSE;
          vram_we <= FALSE;
        end
    end

  wire harvester_r_valid [0:CORES-1];
  wire [WIDTH_P_D+DEPTH_V_F-1:0] harvester_r_data [0:CORES-1];
  wire [CORES-1:0] harvester_r_req;
  wire [DEPTH_V_F-1:0] harvester_w_addr;
  wire [WIDTH_P_D-1:0] harvester_w_data;
  wire harvester_we;
  wire [CORE_BITS-1:0] harvester_cs;

  harvester
    #(
      .CORE_BITS (CORE_BITS),
      .CORES (CORES),
      .WIDTH (WIDTH_P_D),
      .DEPTH (DEPTH_V_F)
      )
  harvester_0
    (
     .clk (clk),
     .reset (reset),
     .cs (harvester_cs),
     .r_data (harvester_r_data[harvester_cs]),
     .r_valid (harvester_r_valid[harvester_cs]),
     .r_req (harvester_r_req),
     .w_addr (harvester_w_addr),
     .w_data (harvester_w_data),
     .we (harvester_we)
     );

  wire [WIDTH_P_D-1:0] s2m_r_data;
  rw_port_ram
    #(
      .DATA_WIDTH (WIDTH_P_D),
      .ADDR_WIDTH (DEPTH_S2M)
      )
  shared_s2m
    (
     .clk (clk),
     .addr_r (master_d_r_addr[DEPTH_S2M-1:0]),
     .addr_w (s2m_w_addr[DEPTH_S2M-1:0]),
     .data_in (s2m_w_data),
     .we (s2m_we),
     .data_out (s2m_r_data)
     );

`ifdef USE_UART
  // UART IO: write to mem_i
  reg uart_io_tx_we;
  wire uart_io_busy;
  wire [31:0] uart_io_rx_addr;
  wire [31:0] uart_io_rx_data;
  reg [31:0] uart_io_rx_addr_d1;
  reg [31:0] uart_io_rx_data_d1;
  wire uart_io_rx_we;
  reg master_mem_i_we;
  wire [DEPTH_V_U-DEPTH_B_U-1:0] uart_io_rx_bank;
  assign uart_io_rx_bank = uart_io_rx_addr[DEPTH_V_U-1:DEPTH_B_U];

  always @(posedge clk)
    begin
      uart_io_rx_addr_d1 <= uart_io_rx_addr;
      uart_io_rx_data_d1 <= uart_io_rx_data;
    end

  always @(posedge clk)
    begin
      if (reset == TRUE)
        begin
          master_mem_i_we <= FALSE;
        end
      else
        begin
          if ((uart_io_rx_we == TRUE) && (uart_io_rx_bank == UART_BANK_MEM_I))
            begin
              master_mem_i_we <= TRUE;
            end
          else
            begin
              master_mem_i_we <= FALSE;
            end
        end
    end

  // u2m write
  always @(posedge clk)
    begin
      if (reset == TRUE)
        begin
          u2m_we <= FALSE;
        end
      else
        begin
          if ((uart_io_rx_we == TRUE) && (uart_io_rx_bank == UART_BANK_U2M))
            begin
              u2m_we <= TRUE;
            end
          else
            begin
              u2m_we <= FALSE;
            end
        end
    end

  // UART IO: reset master
  reg reset_master;
  always @(posedge clk)
    begin
      if (reset == TRUE)
        begin
          reset_master <= FALSE;
        end
      else
        begin
          if ((uart_io_rx_we == TRUE) && (uart_io_rx_addr == UART_IO_ADDR_RESET))
            begin
              reset_master <= uart_io_rx_data[0];
            end
        end
    end

  uart_io
    #(
      .CLK_HZ (UART_CLK_HZ),
      .SCLK_HZ (UART_SCLK_HZ)
      )
  uart_io_0
    (
     .clk (clk),
     .reset (reset),
     .uart_rxd (uart_rxd),
     .tx_data (io_reg_w[IO_REG_W_UART][7:0]),
     .tx_we (uart_io_tx_we),
     .uart_txd (uart_txd),
     .uart_busy (uart_io_busy),
     .rx_addr (uart_io_rx_addr),
     .rx_data (uart_io_rx_data),
     .rx_we (uart_io_rx_we)
     );
`endif

`ifdef USE_VGA
  // sprite
  localparam SPRITE_BPP = 3;
  wire [SPRITE_BPP-1:0] color_all;
  // vga
  wire                  vga_vsync;
  wire [WIDTH_M_D-1:0]  vga_vcount;
  wire [32-1:0]         ext_vga_count_h;
  wire [32-1:0]         ext_vga_count_v;

  sprite
   #(
    .SPRITE_WIDTH_BITS (8),
    .SPRITE_HEIGHT_BITS (9),
    .BPP (SPRITE_BPP)
    )
  sprite_0
    (
     .clk (clk),
     .reset (reset),
     .bitmap_length (),
     .bitmap_address (s2m_w_addr[DEPTH_VRAM-1:0]),
     .bitmap_din (s2m_w_data[WIDTH_VRAM-1:0]),
     .bitmap_dout (),
     .bitmap_we (vram_we),
     .bitmap_oe (FALSE),
     .x (io_reg_w[IO_REG_W_SPRITE_X]),
     .y (io_reg_w[IO_REG_W_SPRITE_Y]),
     .scale (io_reg_w[IO_REG_W_SPRITE_SCALE]),
     .ext_clkv (clkv),
     .ext_resetv (resetv),
     .ext_color (color_all),
     .ext_count_h (ext_vga_count_h),
     .ext_count_v (ext_vga_count_v)
     );

  vga_iface
   #(
    .BPP (3),
    .BPC (1)
    )
  vga_iface_0
    (
     .clk (clk),
     .reset (reset),
     .vsync (vga_vsync),
     .vcount (vga_vcount),
     .ext_clkv (clkv),
     .ext_resetv (resetv),
     .ext_color (color_all),
     .ext_vga_hs (vga_hs),
     .ext_vga_vs (vga_vs),
     .ext_vga_de (),
     .ext_vga_r (vga_r),
     .ext_vga_g (vga_g),
     .ext_vga_b (vga_b),
     .ext_count_h (ext_vga_count_h),
     .ext_count_v (ext_vga_count_v)
     );
`endif

  // Master core
  wire [DEPTH_V_M_W-1:0] master_d_w_addr;
  wire [WIDTH_M_D-1:0] master_d_w_data;
  wire master_d_we;
  wire [DEPTH_M_I-1:0] master_i_r_addr;
  wire [WIDTH_I-1:0] master_i_r_data;
  wire [DEPTH_V_M_W-1:0] master_d_r_addr;
  reg [WIDTH_M_D-1:0] master_d_r_data;
  wire [DEPTH_V_M_W-DEPTH_B_M_W-1:0] master_d_w_bank;
  assign master_d_w_bank = master_d_w_addr[DEPTH_V_M_W-1:DEPTH_B_M_W];
  mini16_cpu
    #(
      .WIDTH_I (WIDTH_I),
      .WIDTH_D (WIDTH_M_D),
      .DEPTH_I (DEPTH_M_I),
      .DEPTH_D (DEPTH_V_M_W),
      .DEPTH_REG (DEPTH_REG),
      .ENABLE_MVIL (TRUE),
      .ENABLE_MUL (TRUE),
      .ENABLE_MULTI_BIT_SHIFT (TRUE),
      .ENABLE_MVC (TRUE),
      .ENABLE_WA (TRUE),
      .ENABLE_INT (TRUE),
      .FULL_PIPELINED_ALU (FALSE),
      .REGFILE_RAM_TYPE (MASTER_REGFILE_RAM_TYPE)
      )
  mini16_cpu_master
    (
     .clk (clk),
`ifdef USE_UART
     .soft_reset (reset_master),
`else
     .soft_reset (FALSE),
`endif
     .reset (reset),
     .mem_i_r_addr (master_i_r_addr),
     .mem_i_r_data (master_i_r_data),
     .mem_d_r_addr (master_d_r_addr),
     .mem_d_r_data (master_d_r_data),
     .mem_d_w_addr (master_d_w_addr),
     .mem_d_w_data (master_d_w_data),
     .mem_d_we (master_d_we)
     );

  default_master_code_mem
    #(
      .DATA_WIDTH (WIDTH_I),
      .ADDR_WIDTH (DEPTH_M_I)
      )
  master_mem_i
    (
     .clk (clk),
     .addr_r (master_i_r_addr),
`ifdef USE_UART
     .addr_w (uart_io_rx_addr_d1[DEPTH_M_I-1:0]),
     .data_in (uart_io_rx_data_d1[WIDTH_I-1:0]),
     .we (master_mem_i_we),
`else
     .addr_w ({DEPTH_M_I{1'b0}}),
     .data_in ({WIDTH_I{1'b0}}),
     .we (FALSE),
`endif
     .data_out (master_i_r_data)
     );

  wire [WIDTH_M_D-1:0] master_mem_d_r_data;
  reg master_mem_d_we;
  default_master_data_mem
    #(
      .DATA_WIDTH (WIDTH_M_D),
      .ADDR_WIDTH (DEPTH_M_D)
      )
  master_mem_d
    (
     .clk (clk),
     .addr_r (master_d_r_addr[DEPTH_M_D-1:0]),
     .addr_w (master_d_w_addr_d1[DEPTH_M_D-1:0]),
     .data_in (master_d_w_data_d1),
     .we (master_mem_d_we),
     .data_out (master_mem_d_r_data)
     );

`ifdef USE_UART
  reg u2m_we;
  wire [WIDTH_M_D-1:0] u2m_r_data;
  rw_port_ram
    #(
      .DATA_WIDTH (WIDTH_M_D),
      .ADDR_WIDTH (DEPTH_U2M)
      )
  shared_u2m
    (
     .clk (clk),
     .addr_r (master_d_r_addr[DEPTH_U2M-1:0]),
     .addr_w (uart_io_rx_addr_d1[DEPTH_U2M-1:0]),
     .data_in (uart_io_rx_data_d1[WIDTH_M_D-1:0]),
     .we (u2m_we),
     .data_out (u2m_r_data)
     );
`endif

  generate
    genvar i;
    for (i = 0; i < CORES; i = i + 1)
      begin: mini16_pe_gen
        mini16_pe
             #(
               .WIDTH_D (WIDTH_P_D),
               .DEPTH_I (DEPTH_P_I),
               .DEPTH_D (DEPTH_P_D),
               .DEPTH_M2S (DEPTH_M2S),
               .DEPTH_FIFO (DEPTH_FIFO),
               .CORE_ID (i + PE_ID_START),
               .MASTER_W_BANK_BC (MASTER_W_BANK_BC),
               .DEPTH_V_F (DEPTH_V_F),
               .DEPTH_B_F (DEPTH_B_F),
               .DEPTH_V_M_W (DEPTH_V_M_W),
               .DEPTH_B_M_W (DEPTH_B_M_W),
               .DEPTH_V_S_R (DEPTH_V_S_R),
               .DEPTH_B_S_R (DEPTH_B_S_R),
               .DEPTH_V_S_W (DEPTH_V_S_W),
               .DEPTH_B_S_W (DEPTH_B_S_W),
               .DEPTH_V_M2S (DEPTH_V_M2S),
               .DEPTH_B_M2S (DEPTH_B_M2S),
               .FIFO_RAM_TYPE (PE_FIFO_RAM_TYPE),
               .REGFILE_RAM_TYPE (PE_REGFILE_RAM_TYPE)
               )
        mini16_pe_0
             (
              .clk (clk),
              .reset (reset),
              .soft_reset (io_reg_w[IO_REG_W_RESET_PE][0]),
              .fifo_req_r (harvester_r_req[i]),
              .fifo_valid (harvester_r_valid[i]),
              .fifo_r_data (harvester_r_data[i]),
              .addr_i (master_d_w_addr_d1),
              .data_i (master_d_w_data_d1),
              .we_i (master_d_we_d1)
              );
      end
  endgenerate

endmodule
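The localparam block at the top of mini16_soc derives the master write address map from the module parameters: the upper CORE_BITS bits of an address select a bank (master data memory, I/O registers, one PE, or the broadcast bank) and the lower DEPTH_B_M_W bits are the offset inside it. The following is a minimal Java sketch that recomputes those values for the default parameters. The class name AddressMapSketch and the helpers clog2, writeAddr, bankOf and peBank are made up for illustration and are not part of the project; clog2 is assumed to behave like Verilog's $clog2.

```java
// Illustrative sketch only: mirrors the localparam arithmetic of mini16_soc.
// Class and helper names are hypothetical, not part of the project.
class AddressMapSketch {
    // Ceiling of log2(n); assumed to match Verilog $clog2 for n >= 1.
    static int clog2(int n) {
        return (n <= 1) ? 0 : 32 - Integer.numberOfLeadingZeros(n - 1);
    }

    // Default parameter values copied from the mini16_soc parameter list.
    static final int CORES = 32;
    static final int DEPTH_M_D = 11;
    static final int DEPTH_P_I = 10;
    static final int DEPTH_M2S = 8;
    static final int DEPTH_IO_REG = 5;
    static final int PE_ID_START = 4;

    // Same expressions as the localparams in the listing.
    static final int CORE_BITS = clog2(CORES + 6);
    static final int DEPTH_B_M2S = Math.max(DEPTH_P_I, DEPTH_M2S);
    static final int DEPTH_V_M2S = DEPTH_B_M2S + 1;
    static final int DEPTH_B_M_W =
        Math.max(DEPTH_V_M2S, Math.max(DEPTH_M_D, DEPTH_IO_REG));
    static final int DEPTH_V_M_W = DEPTH_B_M_W + CORE_BITS;

    static final int MASTER_W_BANK_BC = (1 << CORE_BITS) - 1;
    static final int MASTER_W_BANK_IO_REG = 1;
    static final int IO_REG_W_LED = 1;

    // Full master write address for a bank and an offset inside that bank.
    static int writeAddr(int bank, int offset) {
        return (bank << DEPTH_B_M_W) + offset;
    }

    // Bank field of a master write address (what master_d_w_bank extracts).
    static int bankOf(int addr) {
        return (addr >>> DEPTH_B_M_W) & ((1 << CORE_BITS) - 1);
    }

    // Bank number used by the PE with generate index i (its CORE_ID).
    static int peBank(int i) {
        return i + PE_ID_START;
    }

    public static void main(String[] args) {
        System.out.println("CORE_BITS        = " + CORE_BITS);
        System.out.println("DEPTH_B_M_W      = " + DEPTH_B_M_W);
        System.out.println("DEPTH_V_M_W      = " + DEPTH_V_M_W);
        System.out.println("broadcast bank   = " + MASTER_W_BANK_BC);
        System.out.println("last PE bank     = " + peBank(CORES - 1));
        System.out.println("LED write addr   = "
            + writeAddr(MASTER_W_BANK_IO_REG, IO_REG_W_LED));
    }
}
```

With the default parameters the highest PE bank stays below the all-ones broadcast bank, so the two never collide. If CORES is changed, the same check (peBank(CORES - 1) < MASTER_W_BANK_BC) is a quick way to sanity-check the bank numbering.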
harvester.v : Data transfer from the PEs
/*
  Copyright (c) 2019, miya
  All rights reserved.

  Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
  PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/


module harvester
  #(
    parameter CORE_BITS = 8,
    parameter CORES = 32,
    parameter WIDTH = 32,
    parameter DEPTH = 8
    )
  (
   input                   clk,
   input                   reset,
   output [CORE_BITS-1:0]  cs,
   input [WIDTH+DEPTH-1:0] r_data,
   input                   r_valid,
   output reg [CORES-1:0]  r_req,
   output [DEPTH-1:0]      w_addr,
   output [WIDTH-1:0]      w_data,
   output reg              we
   );

  localparam TRUE = 1'b1;
  localparam FALSE = 1'b0;
  localparam ONE = 1'd1;
  localparam ZERO = 1'd0;

  // fifo to s2m core select
  reg [CORE_BITS-1:0] core;
  reg [CORE_BITS-1:0] core_d1;
  reg [CORE_BITS-1:0] core_d2;
  reg [CORE_BITS-1:0] core_d3;
  always @(posedge clk)
    begin
      core_d1 <= core;
      core_d2 <= core_d1;
      core_d3 <= core_d2;
      if (reset == TRUE)
        begin
          core <= ZERO;
        end
      else
        begin
          if (core == CORES - 1)
            begin
              core <= ZERO;
            end
          else
            begin
              core <= core + ONE;
            end
        end
    end

  assign cs = core_d3;
  assign w_addr = harvester_r_data_fetch_d1[WIDTH+DEPTH-1:WIDTH];
  assign w_data = harvester_r_data_fetch_d1[WIDTH-1:0];

  reg [WIDTH+DEPTH-1:0] harvester_r_data_fetch;
  reg [WIDTH+DEPTH-1:0] harvester_r_data_fetch_d1;
  reg r_valid_d1;

  always @(posedge clk)
    begin
      r_req[core] <= TRUE;
      r_req[core_d1] <= FALSE;
      r_valid_d1 <= r_valid;
      we <= r_valid_d1;
      harvester_r_data_fetch <= r_data;
      harvester_r_data_fetch_d1 <= harvester_r_data_fetch;
    end

endmodule
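The harvester polls the PE FIFOs in round-robin order: a counter (core) walks over all cores, a one-cycle read request is raised for the current core, and the chip-select output cs is the same counter delayed by three registers. The following is a rough behavioral model in Java of just that counter and request logic, written to make the timing easier to see. The class name HarvesterSketch and its methods are made up for illustration. The model starts every register at zero, which is an assumption: the Verilog only resets core and leaves the delay registers undefined until they fill.

```java
// Illustrative behavioral model of the harvester's core-select logic.
// Names are hypothetical; all registers start at 0 (an assumption, see text).
class HarvesterSketch {
    private final int cores;
    private int core;      // round-robin counter
    private int coreD1;    // core delayed by 1 clock
    private int coreD2;    // core delayed by 2 clocks
    private int coreD3;    // core delayed by 3 clocks, drives cs
    private final boolean[] rReq;

    HarvesterSketch(int cores) {
        this.cores = cores;
        this.rReq = new boolean[cores];
    }

    // One rising clock edge. Every right-hand side uses the register values
    // from before the edge, like Verilog non-blocking assignments.
    void tick() {
        int nextCore = (core == cores - 1) ? 0 : core + 1;
        // Same order as the Verilog: if both indices are equal, the later
        // assignment (FALSE) wins.
        rReq[core] = true;
        rReq[coreD1] = false;
        coreD3 = coreD2;
        coreD2 = coreD1;
        coreD1 = core;
        core = nextCore;
    }

    // Value of the cs output after the last clock edge.
    int cs() {
        return coreD3;
    }

    // Index of the single asserted r_req bit, or -1 if none or several are set.
    int activeRequest() {
        int found = -1;
        for (int i = 0; i < cores; i++) {
            if (rReq[i]) {
                if (found >= 0) {
                    return -1;
                }
                found = i;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        HarvesterSketch h = new HarvesterSketch(4);
        for (int cycle = 1; cycle <= 8; cycle++) {
            h.tick();
            System.out.println("cycle " + cycle
                + ": r_req index=" + h.activeRequest()
                + " cs=" + h.cs());
        }
    }
}
```

In this model, once the pipeline has filled, exactly one request bit is set per cycle and cs reaches a given core two cycles after that core's request bit becomes visible. This is presumably there to cover the read latency of the PE FIFO, although the source does not say so explicitly.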
asm/MasterProgram.java : Mandelbrot set demo: master-core program
/*
  Copyright (c) 2019, miya
  All rights reserved.

  Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
  PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/


import java.lang.Math;

public class MasterProgram extends AsmLib
{
  private static final int PARALLEL = 32;

  private static final int CODE_ROM_WIDTH = 16;
  private static final int DATA_ROM_WIDTH = 32;
  private static final int CODE_ROM_DEPTH = 11;
  private static final int DATA_ROM_DEPTH = 11;

  private static final int M2S_BC_ADDR_H = ((MASTER_W_BANK_BC << (DEPTH_B_M_W - DEPTH_B_M2S)) + M2S_BANK_M2S);
  private static final int M2S_BC_ADDR_SHIFT = DEPTH_B_M2S;
  private static final int S2M_ADDR_H = MASTER_R_BANK_S2M;
  private static final int S2M_ADDR_SHIFT = DEPTH_B_M_R;
  private static final int U2M_ADDR_H = MASTER_R_BANK_U2M;
  private static final int U2M_ADDR_SHIFT = DEPTH_B_M_R;
  private static final int IO_REG_W_ADDR_H = MASTER_W_BANK_IO_REG;
  private static final int IO_REG_W_ADDR_SHIFT = DEPTH_B_M_W;
  private static final int IO_REG_R_ADDR_H = MASTER_R_BANK_IO_REG;
  private static final int IO_REG_R_ADDR_SHIFT = DEPTH_B_M_R;

  private static final int VGA_HEIGHT_BITS = 9;


  private void f_get_m2s_bc_addr()
  {
    // output: R3:m2s_bc_addr
    int m2s_bc_addr = 3;
    // m2s_bc_addr = M2S_BC_ADDR_H << M2S_BC_ADDR_SHIFT;
    label("f_get_m2s_bc_addr");
    lib_wait_dep_pre();
    as_mvi(m2s_bc_addr, M2S_BC_ADDR_H);
    lib_wait_dep_post();
    as_sli(m2s_bc_addr, M2S_BC_ADDR_SHIFT);
    lib_return();
  }

  private void f_get_m2s_core_addr()
  {
    // input: R3:core id(0-(N-1))
    // output: R3:m2s_core_addr
    int core_id = 3;
    int m2s_core_addr = 3;
    int tmp0 = LREG0;
    // m2s_core_addr = ((core_id + PE_ID_START) << DEPTH_B_M_W) + (M2S_BANK_M2S << DEPTH_B_M2S);
    label("f_get_m2s_core_addr");
    as_addi(core_id, PE_ID_START);
    lib_wait_dep_pre();
    as_mvi(tmp0, M2S_BANK_M2S);
    lib_wait_dep_post();
    as_sli(m2s_core_addr, DEPTH_B_M_W);
    lib_wait_dep_pre();
    as_sli(tmp0, DEPTH_B_M2S);
    lib_wait_dep_post();
    as_add(m2s_core_addr, tmp0);
    lib_return();
  }

  private void f_get_s2m_addr()
  {
    // output: R3:s2m_addr
    int s2m_addr = 3;
    // s2m_addr = S2M_ADDR_H << S2M_ADDR_SHIFT;
    label("f_get_s2m_addr");
    lib_wait_dep_pre();
    as_mvi(s2m_addr, S2M_ADDR_H);
    lib_wait_dep_post();
    as_sli(s2m_addr, S2M_ADDR_SHIFT);
    lib_return();
  }

  private void f_get_io_reg_w_addr()
  {
    // input: R3: device reg num
    // output: R3:io_reg_w_addr
    int io_reg_w_addr = 3;
    int tmp0 = LREG0;
    // io_reg_w_addr = (IO_REG_W_ADDR_H << IO_REG_W_ADDR_SHIFT) + R3;
    label("f_get_io_reg_w_addr");
    lib_wait_dep_pre();
    as_mvi(tmp0, IO_REG_W_ADDR_H);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_sli(tmp0, IO_REG_W_ADDR_SHIFT);
    lib_wait_dep_post();
    as_add(io_reg_w_addr, tmp0);
    lib_return();
  }

  private void f_get_io_reg_r_addr()
  {
    // input: R3: device reg num
    // output: R3:io_reg_r_addr
    int io_reg_r_addr = 3;
    int tmp0 = LREG0;
    // io_reg_r_addr = (IO_REG_R_ADDR_H << IO_REG_R_ADDR_SHIFT) + R3;
    label("f_get_io_reg_r_addr");
    lib_wait_dep_pre();
    as_mvi(tmp0, IO_REG_R_ADDR_H);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_sli(tmp0, IO_REG_R_ADDR_SHIFT);
    lib_wait_dep_post();
    as_add(io_reg_r_addr, tmp0);
    lib_return();
  }

  private void f_get_u2m_addr()
  {
    // output: R3:u2m_addr
    int u2m_addr = 3;
    // u2m_addr = U2M_ADDR_H << U2M_ADDR_SHIFT;
    label("f_get_u2m_addr");
    lib_wait_dep_pre();
    as_mvi(u2m_addr, U2M_ADDR_H);
    lib_wait_dep_post();
    as_sli(u2m_addr, U2M_ADDR_SHIFT);
    lib_return();
  }

  private void example_led()
  {
    /*
    led_addr = (MASTER_W_BANK_IO_REG << DEPTH_B_M_W) + IO_REG_W_LED;
    counter = 0;
    shift = 18;
    do
    {
      led = counter >> shift;
      mem[led_addr] = led;
      counter++;
    } while (1);
    */

    int led_addr = 3;
    int counter = 4;
    int shift = 5;
    int led = 6;
    as_nop();
    lib_init_stack();
    lib_set_im(R3, IO_REG_W_LED);
    lib_call("f_get_io_reg_w_addr");
    as_mvi(counter, 0);
    lib_set_im(shift, 18);
    lib_wait_dep_pre();
    as_sli(led_addr, DEPTH_B_M_W);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_addi(led_addr, IO_REG_W_LED);
    lib_wait_dep_post();
    label("example_led_L_0");
    as_mv(led, counter);
    lib_wait_dep_pre();
    as_addi(counter, 1);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_sr(led, shift);
    lib_wait_dep_post();
    as_st(led_addr, led);
    lib_ba("example_led_L_0");
    // link library
    f_get_io_reg_w_addr();
  }

  private void example_helloworld()
  {
    as_nop();
    lib_call("f_get_u2m_data");
    lib_init_stack();
    lib_wait_dep_pre();
    as_mvi(R4, MASTER_R_BANK_U2M);
    lib_wait_dep_post();
    as_sli(R4, DEPTH_B_M_R);
    lib_set_im(R3, addr_abs("d_helloworld"));
    lib_wait_dep_pre();
    as_nop();
    lib_wait_dep_post();
    as_add(R3, R4);
    lib_call("f_uart_print");
    lib_call("f_halt");
    // link library
    f_uart_char();
    f_uart_print();
    f_halt();
    f_get_u2m_data();
  }

  private void example_helloworld_data()
  {
    label("d_helloworld");
    string_data32("Hello, world!\n");
  }

  private void f_reset_pe()
  {
    /*
    addr_reset = MASTER_W_BANK_IO_REG;
    addr_reset <<= DEPTH_B_M_W;
    addr_reset += IO_REG_W_RESET_PE;
    mem[addr_reset] = 1;
    mem[addr_reset] = 0;
    */

    int addr_reset = LREG0;
    label("f_reset_pe");
    lib_wait_dep_pre();
    as_mvi(addr_reset, MASTER_W_BANK_IO_REG);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_sli(addr_reset, DEPTH_B_M_W);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_addi(addr_reset, IO_REG_W_RESET_PE);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_sti(addr_reset, 1);
    lib_wait_dep_post();
    as_sti(addr_reset, 0);
    lib_return();
  }

  // copy data from U2M to MEM_D
  // call before lib_init_stack()
  public void f_get_u2m_data()
  {
    int addr_dst = LREG0;
    int addr_src = LREG1;
    int size = LREG2;
    int data = LREG3;
    label("f_get_u2m_data");
    as_mvi(size, 1);
    lib_wait_dep_pre();
    as_mvi(addr_src, U2M_ADDR_H);
    lib_wait_dep_post();
    as_sli(addr_src, U2M_ADDR_SHIFT);
    as_mvi(addr_dst, 0);
    lib_wait_dep_pre();
    as_sli(size, DEPTH_M_D);
    lib_wait_dep_post();
    label("f_get_u2m_data_L_0");
    as_ld(data, addr_src);
    as_subi(size, 1);
    as_addi(addr_src, 1);
    lib_nop(3);
    as_st(addr_dst, data);
    as_cnz(SP_REG_CP, size);
    as_addi(addr_dst, 1);
    lib_bc("f_get_u2m_data_L_0");
    lib_return();
  }

  public void f_reset_vga()
  {
    /*
    addr_ioreg = MASTER_W_BANK_IO_REG;
    addr_ioreg <<= DEPTH_B_M_W;
    addr_sp_x = addr_ioreg;
    addr_sp_y = addr_ioreg;
    addr_sp_s = addr_ioreg;
    addr_sp_x += 3;
    addr_sp_y += 4;
    addr_sp_s += 5;
    mem[addr_sp_x] = 64;
    mem[addr_sp_y] = 0;
    mem[addr_sp_s] = 7;
    */

    int addr_ioreg = LREG0;
    int addr_sp_x = LREG1;
    int addr_sp_y = LREG2;
    int addr_sp_s = LREG3;
    int scale = LREG4;
    int x = LREG5;
    label("f_reset_vga");
    lib_wait_dep_pre();
    as_mvi(addr_ioreg, MASTER_W_BANK_IO_REG);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_sli(addr_ioreg, DEPTH_B_M_W);
    lib_wait_dep_post();
    as_mv(addr_sp_x, addr_ioreg);
    as_mv(addr_sp_y, addr_ioreg);
    as_mv(addr_sp_s, addr_ioreg);
    lib_nop(3);
    as_addi(addr_sp_x, IO_REG_W_SPRITE_X);
    as_addi(addr_sp_y, IO_REG_W_SPRITE_Y);
    as_addi(addr_sp_s, IO_REG_W_SPRITE_SCALE);
    lib_set_im(x, 64);
    lib_wait_dep_pre();
    as_nop();
    lib_wait_dep_post();
    as_st(addr_sp_x, x);
    as_sti(addr_sp_y, 0);
    as_sti(addr_sp_s, 7);
    lib_return();
  }

  private void f_init_core_id()
  {
    /*
      depends: f_get_m2s_core_addr()
    */

    int addr_core_id = LREG0;
    int next_core_offset = LREG1;
    int i = LREG2;
    int cores = LREG3;
    int addr_cores = LREG4;
    int parallel = LREG5;
    /*
    R3 = cores - 1;
    lib_call("f_get_m2s_core_addr");
    addr_core_id = R3;
    addr_cores = R3 + 1;
    next_core_offset = 1 << DEPTH_B_M_W;
    i = CORES;
    parallel = PARALLEL;
    do
    {
      i--;
      mem[addr_core_id] = i;
      mem[addr_cores] = parallel;
      addr_core_id -= next_core_offset;
      addr_cores -= next_core_offset;
    } while (i != 0);
    */

    label("f_init_core_id");
    lib_push(SP_REG_LINK);
    lib_push(R3);
    lib_set_im(cores, CORES);
    lib_set_im(parallel, PARALLEL);
    lib_set_im(R3, CORES - 1); // cores - 1
    lib_call("f_get_m2s_core_addr");
    as_mv(addr_core_id, R3);
    as_mv(addr_cores, R3);
    as_mvi(next_core_offset, 1);
    lib_wait_dep_pre();
    as_mv(i, cores);
    lib_wait_dep_post();
    as_sli(next_core_offset, DEPTH_B_M_W);
    as_addi(addr_cores, 1);
    label("f_init_core_id_L_0");
    lib_wait_dep_pre();
    as_subi(i, 1);
    lib_wait_dep_post();
    as_st(addr_core_id, i);
    as_st(addr_cores, parallel);
    as_sub(addr_core_id, next_core_offset);
    as_sub(addr_cores, next_core_offset);
    as_cnz(SP_REG_CP, i);
    lib_bc("f_init_core_id_L_0");
    lib_pop(R3);
    lib_pop(SP_REG_LINK);
    lib_return();
  }

  private void m_vga_flip(int reg_task_id)
  {
    int task_id = reg_task_id;
    int addr_sp_y = LREG0;
    int tmp0 = LREG1;
    int page = LREG2;

    lib_wait_dep_pre();
    as_mvi(addr_sp_y, MASTER_W_BANK_IO_REG);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_sli(addr_sp_y, DEPTH_B_M_W);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_addi(addr_sp_y, IO_REG_W_SPRITE_Y);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_mv(tmp0, task_id);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_andi(tmp0, 1);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_xori(tmp0, 1);
    lib_wait_dep_post();
    as_mvi(page, 0);
    lib_wait_dep_pre();
    as_sli(tmp0, VGA_HEIGHT_BITS);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_sub(page, tmp0);
    lib_wait_dep_post();
    as_st(addr_sp_y, page);
  }

  private void m_wait_vsync()
  {
    /*
    addr_vsync = (MASTER_R_BANK_IO_REG << DEPTH_B_M_R) + IO_REG_R_VGA_VSYNC;
    vsync_pre = 0;
    do
    {
      vsync = mem[addr_vsync];
      vsync_start = ((vsync == 0) && (vsync_pre == 1));
      vsync_pre = vsync;
    } while (!vsync_start);
    note: the loop repeats while !vsync_start, i.e. while ((vsync == 1) || (vsync_pre == 0))
    */

    int addr_vsync = LREG0;
    int vsync = LREG1;
    int vsync_start = LREG2;
    int vsync_pre = LREG3;
    lib_wait_dep_pre();
    as_mvi(addr_vsync, MASTER_R_BANK_IO_REG);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_sli(addr_vsync, DEPTH_B_M_R);
    lib_wait_dep_post();
    as_mvi(vsync_pre, 0);
    lib_wait_dep_pre();
    as_addi(addr_vsync, IO_REG_R_VGA_VSYNC);
    lib_wait_dep_post();
    label("m_wait_vsync_L_0");
    lib_wait_dep_pre();
    as_ld(vsync, addr_vsync);
    lib_wait_dep_post();
    as_cnz(vsync_start, vsync);
    as_cnz(SP_REG_CP, vsync_pre);
    lib_wait_dep_pre();
    as_mv(vsync_pre, vsync);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_xori(SP_REG_CP, -1);
    lib_wait_dep_post();
    as_or(SP_REG_CP, vsync_start);
    lib_bc("m_wait_vsync_L_0");
  }

  private void m_init_mandel_param()
  {
    /*
      PE m2s memory map:
      3: scale
      4: cx
      5: cy
     */

    int addr_m2s_root = 3;
    int addr_scale = LREG0;
    int addr_cx = LREG1;
    int addr_cy = LREG2;
    int scale = LREG3;
    int cx = LREG4;
    int cy = LREG5;

    as_mv(addr_scale, addr_m2s_root);
    as_mv(addr_cx, addr_m2s_root);
    as_mv(addr_cy, addr_m2s_root);
    lib_ld(scale, "d_mandel_scale");
    as_addi(addr_scale, 3);
    as_addi(addr_cx, 4);
    as_addi(addr_cy, 5);
    lib_ld(cx, "d_mandel_cx");
    lib_ld(cy, "d_mandel_cy");
    lib_wait_dep_pre();
    as_st(addr_scale, scale);
    lib_wait_dep_post();
    as_st(addr_cx, cx);
    as_st(addr_cy, cy);
  }

  private void m_update_mandel_param()
  {
    int addr_m2s_root = 3;
    int addr_scale = LREG0;
    int scale = LREG1;
    int scale_mask = LREG2;

    as_mv(addr_scale, addr_m2s_root);
    lib_ld(scale, "d_mandel_scale");
    lib_wait_dep_pre();
    as_addi(addr_scale, 3);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_subi(scale, 1);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_cnz(SP_REG_CP, scale);
    lib_wait_dep_post();
    as_mvil(256);
    lib_wait_dep_pre();
    as_xori(SP_REG_CP, -1);
    lib_wait_dep_post();
    as_mvc(scale, SP_REG_MVIL);
    as_st(addr_scale, scale);
    lib_st("d_mandel_scale", scale);
  }

  private void master_thread_manager()
  {
    /*
      PE m2s memory map:
      0: core_id
      1: parallel
      2: task_id
      user parameters
      3: scale
      4: cx
      5: cy

      s2m memory map:
      0 - PARALLEL-1: Incremented task_id from PE
     */

    int addr_m2s_root = 3;
    int addr_s2m_root = 4;
    int addr_task_id = 5;
    int addr_s2m = 6;
    int task_id = 7;
    int pe_ack = 8;
    int i = 9;

    /*
    f_get_u2m_data();
    f_reset_vga();
    init_core_id();
    addr_m2s_root = M2S_BC_ADDR_H << M2S_BC_ADDR_SHIFT;
    addr_task_id = addr_m2s_root + 2;
    addr_s2m_root = MASTER_R_BANK_S2M << DEPTH_B_M_R;
    m_init_mandel_param();
    task_id = 0;
    mem[addr_task_id] = task_id;
    reset_pe();
    do
    {
      i = PARALLEL;
      task_id++;
      addr_s2m = addr_s2m_root;
      do
      {
        i--;
        do
        {
          pe_ack = mem[addr_s2m] - task_id;
        } while (pe_ack != 0);
        addr_s2m++;
      } while (i != 0);
      m_update_mandel_param();
      // m_wait_vsync(); (currently disabled)
      m_vga_flip(task_id);
      mem[addr_task_id] = task_id;
    } while (1);
    */

    as_nop();
    lib_call("f_get_u2m_data");
    lib_init_stack();
    lib_call("f_reset_vga");
    lib_call("f_init_core_id");
    as_mvi(addr_m2s_root, M2S_BC_ADDR_H);
    as_mvi(task_id, 0);
    lib_wait_dep_pre();
    as_mvi(addr_s2m_root, MASTER_R_BANK_S2M);
    lib_wait_dep_post();
    as_sli(addr_s2m_root, DEPTH_B_M_R);
    lib_wait_dep_pre();
    as_sli(addr_m2s_root, M2S_BC_ADDR_SHIFT);
    lib_wait_dep_post();
    m_init_mandel_param();
    lib_wait_dep_pre();
    as_mv(addr_task_id, addr_m2s_root);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_addi(addr_task_id, 2);
    lib_wait_dep_post();
    as_st(addr_task_id, task_id);
    lib_call("f_reset_pe");
    label("master_thread_manager_L_0");
    lib_set_im(i, PARALLEL);
    as_addi(task_id, 1);
    lib_wait_dep_pre();
    as_mv(addr_s2m, addr_s2m_root);
    lib_wait_dep_post();
    label("master_thread_manager_L_1");
    as_subi(i, 1);
    label("master_thread_manager_L_2");
    lib_wait_dep_pre();
    as_ld(pe_ack, addr_s2m);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_sub(pe_ack, task_id);
    lib_wait_dep_post();
    as_cnz(SP_REG_CP, pe_ack);
    lib_bc("master_thread_manager_L_2");
    as_addi(addr_s2m, 1);
    as_cnz(SP_REG_CP, i);
    lib_bc("master_thread_manager_L_1");

    m_update_mandel_param();
    //m_wait_vsync();
    m_vga_flip(task_id);

    as_st(addr_task_id, task_id);
    lib_ba("master_thread_manager_L_0");

    lib_call("f_halt");

    // link library
    f_halt();
    f_init_core_id();
    f_reset_pe();
    f_get_m2s_core_addr();
    f_reset_vga();
    f_get_u2m_data();
    f_get_u2m_addr();
    f_memcpy();
  }

  @Override
  public void init()
  {
    set_rom_width(CODE_ROM_WIDTH, DATA_ROM_WIDTH);
    set_rom_depth(CODE_ROM_DEPTH, DATA_ROM_DEPTH);
    set_stack_address((1 << DATA_ROM_DEPTH) - 1);
    set_filename("default_master");
  }

  @Override
  public void program()
  {
    //example_led();
    //example_helloworld();
    master_thread_manager();
  }

  @Override
  public void data()
  {
    label("d_rand");
    dat(0xfc720c27);
    label("d_mandel_scale");
    dat(256);
    label("d_mandel_cx");
    dat(161 << 6);
    label("d_mandel_cy");
    dat(49 << 6);
    example_helloworld_data();
  }
}
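
The master and PE programs synchronize through a shared task_id counter: each PE writes its incremented task_id to its own s2m slot, and the master publishes the next task_id to the m2s area only after every slot matches. The handshake can be modelled on a PC with ordinary Java threads. This is only a sketch under assumptions: the class name TaskIdBarrierSketch, the atomic variables standing in for the m2s/s2m memories, and the frames limit that lets the threads terminate are all hypothetical and not part of mini16_manycore.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical PC-side model of the task_id handshake; not part of mini16_manycore.
final class TaskIdBarrierSketch
{
  // Runs `frames` rounds on `parallel` PE threads.
  // Returns, per round, how many PEs did their work (expected: `parallel` each).
  static int[] run(final int parallel, final int frames)
  {
    final AtomicInteger m2sTaskId = new AtomicInteger(0);             // models m2s[2]
    final AtomicIntegerArray s2m = new AtomicIntegerArray(parallel);  // models s2m[0..PARALLEL-1]
    final AtomicIntegerArray worked = new AtomicIntegerArray(frames);

    Thread[] pes = new Thread[parallel];
    for (int id = 0; id < parallel; id++)
    {
      final int coreId = id;
      pes[id] = new Thread(() ->
      {
        int taskId = m2sTaskId.get();          // task_id = mem[m2s_addr]
        while (true)
        {
          taskId++;
          s2m.set(coreId, taskId);             // mem[s2m_addr] = task_id
          while (m2sTaskId.get() != taskId)    // wait until the master publishes it
          {
            Thread.yield();
          }
          if (taskId > frames)
          {
            return;                            // sketch-only exit; the real PE loops forever
          }
          worked.incrementAndGet(taskId - 1);  // stand-in for m_mandel()
        }
      });
      pes[id].start();
    }

    // Master side: one extra round releases the PEs so that they can exit.
    for (int taskId = 1; taskId <= frames + 1; taskId++)
    {
      for (int i = 0; i < parallel; i++)
      {
        while (s2m.get(i) != taskId)           // pe_ack = mem[addr_s2m] - task_id
        {
          Thread.yield();
        }
      }
      // The real master updates the parameters and flips the VGA page here.
      m2sTaskId.set(taskId);                   // mem[addr_task_id] = task_id
    }

    try
    {
      for (Thread pe : pes)
      {
        pe.join();
      }
    }
    catch (InterruptedException e)
    {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }

    int[] result = new int[frames];
    for (int f = 0; f < frames; f++)
    {
      result[f] = worked.get(f);
    }
    return result;
  }

  public static void main(String[] args)
  {
    System.out.println(Arrays.toString(run(4, 3)));
  }
}
```

Because a PE acknowledges round N+1 only after it has finished round N, the master's wait doubles as a "frame done" barrier, which is why no separate completion flag appears in the memory map.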
asm/PEProgram.java: Mandelbrot set demo, PE program
/*
  Copyright (c) 2019, miya
  All rights reserved.

  Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
  PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/


public class PEProgram extends AsmLib
{
  private static final int CODE_ROM_WIDTH = 16;
  private static final int DATA_ROM_WIDTH = 32;
  private static final int CODE_ROM_DEPTH = 10;
  private static final int DATA_ROM_DEPTH = 8;
  private static final int FIFO_ADDR = (PE_W_BANK_FIFO << DEPTH_B_S_W);
  private static final int VRAM_ADDR_H = ((FIFO_ADDR + (FIFO_BANK_VRAM << DEPTH_B_F)) >>> 15);
  private static final int VRAM_ADDR_SHIFT = 15;
  private static final int M2S_ADDR_H = PE_R_BANK_M2S;
  private static final int M2S_ADDR_SHIFT = DEPTH_B_S_R;
  private static final int S2M_ADDR_H = ((FIFO_ADDR + (FIFO_BANK_S2M << DEPTH_B_F)) >>> 15);
  private static final int S2M_ADDR_SHIFT = 15;
  private static final int IMAGE_WIDTH_BITS = 8;
  private static final int IMAGE_HEIGHT_BITS = 8;
  private static final int IMAGE_WIDTH_HALF_BITS = (IMAGE_WIDTH_BITS - 1);
  private static final int IMAGE_HEIGHT_HALF_BITS = (IMAGE_HEIGHT_BITS - 1);
  private static final int IMAGE_WIDTH = (1 << IMAGE_WIDTH_BITS);
  private static final int IMAGE_HEIGHT = (1 << IMAGE_HEIGHT_BITS);
  private static final int IMAGE_WIDTH_HALF = (1 << IMAGE_WIDTH_HALF_BITS);
  private static final int IMAGE_HEIGHT_HALF = (1 << IMAGE_HEIGHT_HALF_BITS);

  private void m_mandel_core()
  {
    int x = 9;
    int y = 10;
    int scale = 11;
    int count = 12;
    int cx = 13;
    int cy = 14;
    int a = 16;
    int b = 17;
    int aa = 18;
    int bb = 19;
    int c = 20;
    int x1 = 21;
    int y1 = 22;
    int cmask = 23;
    int max_c = 24;
    int pc = 25;
    int tmp1 = 26;
    int tmp2 = 27;
    int tmp3 = 28;

    // const
    int FIXED_BITS = 13;
    int FIXED_BITS_M1 = 12;
    int MAX_C = 4;

    /*
    a = 0;
    b = 0;
    aa = 0;
    bb = 0;
    count = 100;
    max_c = MAX_C << FIXED_BITS;
    x1 = ((x - IMAGE_WIDTH_HALF) * scale) + cx;
    y1 = ((y - IMAGE_HEIGHT_HALF) * scale) + cy;
    do
    {
      pc = c;
      b = ((a * b) >> FIXED_BITS_M1) - y1;
      a = aa - bb - x1;
      aa = (a * a) >> FIXED_BITS;
      bb = (b * b) >> FIXED_BITS;
      c = aa + bb;
      count--;
      x1 += scale;
      pc -= c;
      pc >>= 5;
      limit = (c <= max_c) && (count >= 0) && (pc != 0);
    } while (limit);

    Same loop without the lib_wait_dep_pre()/lib_wait_dep_post() pairs (count = 1024):

    as_mvi(a, 0);
    as_mvi(b, 0);
    as_mvi(aa, 0);
    as_mvi(bb, 0);
    as_mv(x1, x);
    as_mv(y1, y);
    lib_set_im(count, 1024);
    lib_set_im(tmp1, IMAGE_WIDTH_HALF);
    lib_set_im(tmp2, IMAGE_HEIGHT_HALF);
    as_mvi(max_c, 4);
    as_sli(max_c, FIXED_BITS);
    as_sub(x1, tmp1);
    as_sub(y1, tmp2);
    as_mul(x1, scale);
    as_mul(y1, scale);
    as_add(x1, cx);
    as_add(y1, cy);
    label("m_mandel_L_0");
    as_mv(pc, c);
    as_mul(b, a);
    as_srai(b, FIXED_BITS_M1);
    as_sub(b, y1);
    as_mv(a, aa);
    as_sub(a, bb);
    as_sub(a, x1);
    as_mv(aa, a);
    as_mul(aa, a);
    as_sri(aa, FIXED_BITS);
    as_mv(bb, b);
    as_mul(bb, b);
    as_sri(bb, FIXED_BITS);
    as_mv(c, aa);
    as_add(c, bb);
    as_subi(count, 1);
    as_add(x1, scale);
    as_mv(tmp1, max_c);
    as_sub(pc, c);
    as_sub(tmp1, c);
    as_sri(pc, 5);
    as_cnm(SP_REG_CP, tmp1);
    as_cnm(tmp2, count);
    as_cnz(tmp3, pc);
    as_and(SP_REG_CP, tmp2);
    as_and(SP_REG_CP, tmp3);
    lib_bc("m_mandel_L_0");
    */


    as_mvi(a, 0);
    as_mvi(b, 0);
    as_mvi(aa, 0);
    as_mvi(bb, 0);
    as_mv(x1, x);
    as_mv(y1, y);
    lib_set_im(count, 100);
    lib_set_im(tmp1, IMAGE_WIDTH_HALF);
    lib_set_im(tmp2, IMAGE_HEIGHT_HALF);
    lib_wait_dep_pre();
    as_mvi(max_c, 4);
    lib_wait_dep_post();
    as_sli(max_c, FIXED_BITS);
    as_sub(x1, tmp1);
    lib_wait_dep_pre();
    as_sub(y1, tmp2);
    lib_wait_dep_post();
    as_mul(x1, scale);
    lib_wait_dep_pre();
    as_mul(y1, scale);
    lib_wait_dep_post();
    as_add(x1, cx);
    as_add(y1, cy);
    label("m_mandel_core_L_0");
    as_mv(pc, c);
    lib_wait_dep_pre();
    as_mul(b, a);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_srai(b, FIXED_BITS_M1);
    lib_wait_dep_post();
    as_sub(b, y1);
    lib_wait_dep_pre();
    as_mv(a, aa);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_sub(a, bb);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_sub(a, x1);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_mv(aa, a);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_mul(aa, a);
    lib_wait_dep_post();
    as_sri(aa, FIXED_BITS);
    lib_wait_dep_pre();
    as_mv(bb, b);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_mul(bb, b);
    lib_wait_dep_post();
    as_sri(bb, FIXED_BITS);
    lib_wait_dep_pre();
    as_mv(c, aa);
    lib_wait_dep_post();
    as_add(c, bb);
    as_subi(count, 1);
    as_add(x1, scale);
    lib_wait_dep_pre();
    as_mv(tmp1, max_c);
    lib_wait_dep_post();
    as_sub(pc, c);
    lib_wait_dep_pre();
    as_sub(tmp1, c);
    lib_wait_dep_post();
    as_sri(pc, 5);
    as_cnm(SP_REG_CP, tmp1);
    lib_wait_dep_pre();
    as_cnm(tmp2, count);
    lib_wait_dep_post();
    as_cnz(tmp3, pc);
    lib_wait_dep_pre();
    as_and(SP_REG_CP, tmp2);
    lib_wait_dep_post();
    as_and(SP_REG_CP, tmp3);
    lib_bc("m_mandel_core_L_0");
  }

  private void m_mandel()
  {
    int my_core_id = 4;
    int parallel = 5;
    int task_id = 6;
    int vram_addr = 7;
    int m2s_addr = 8;
    int page = 8; // shares R8 with m2s_addr; m2s_addr is read only before page is first written
    int x = 9;
    int y = 10;
    int scale = 11;
    int count = 12;
    int cx = 13;
    int cy = 14;
    int i = 15;
    // temp
    int tmp0 = 16;
    int param_addr = 16; // shares R16 with tmp0; param_addr is no longer needed once tmp0 is written
    /*
    lib_push_regs(4, 6); // push R4-R9
    scale = mem[m2s_addr + 1];
    cx = mem[m2s_addr + 2];
    cy = mem[m2s_addr + 3];
    page = task_id & 1;
    vram_addr += (page << (IMAGE_WIDTH_BITS + IMAGE_HEIGHT_BITS)) + (1 << (IMAGE_WIDTH_BITS + IMAGE_HEIGHT_BITS)) - 1 - my_core_id;
    i = (1 << (IMAGE_WIDTH_BITS + IMAGE_HEIGHT_BITS)) - 1 - my_core_id;
    do
    {
      x = i & ((1 << IMAGE_WIDTH_BITS) - 1);
      y = i >> IMAGE_WIDTH_BITS;
      m_mandel_core();
      mem[vram_addr] = count;
      vram_addr -= parallel;
      i -= parallel;
    } while (i >= 0);
    lib_pop_regs(4, 6); // pop R4-R9
    */


    lib_push_regs(4, 6);

    // get param
    lib_wait_dep_pre();
    as_mv(param_addr, m2s_addr);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_addi(param_addr, 1);
    lib_wait_dep_post();
    as_ld(scale, param_addr);
    lib_wait_dep_pre();
    as_addi(param_addr, 1);
    lib_wait_dep_post();
    as_ld(cx, param_addr);
    lib_wait_dep_pre();
    as_addi(param_addr, 1);
    lib_wait_dep_post();
    as_ld(cy, param_addr);

    as_mvi(i, 1);
    as_mv(page, task_id);
    as_mvi(tmp0, 1);
    lib_wait_dep_pre();
    as_mvil(IMAGE_WIDTH_BITS + IMAGE_HEIGHT_BITS);
    lib_wait_dep_post();
    as_sl(i, SP_REG_MVIL);
    as_sl(tmp0, SP_REG_MVIL);
    lib_wait_dep_pre();
    as_andi(page, 1);
    lib_wait_dep_post();
    as_subi(i, 1);
    lib_wait_dep_pre();
    as_sl(page, SP_REG_MVIL);
    lib_wait_dep_post();
    as_sub(i, my_core_id);
    lib_wait_dep_pre();
    as_add(page, tmp0);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_subi(page, 1);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_sub(page, my_core_id);
    lib_wait_dep_post();
    as_add(vram_addr, page);
    label("m_mandel_L_0");
    as_mv(x, i);
    as_mv(y, i);
    lib_set_im(tmp0, (1 << IMAGE_WIDTH_BITS) - 1);
    lib_wait_dep_pre();
    as_sri(y, IMAGE_WIDTH_BITS);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_and(x, tmp0);
    lib_wait_dep_post();

    m_mandel_core();

    as_st(vram_addr, count);
    as_sub(vram_addr, parallel);
    lib_wait_dep_pre();
    as_sub(i, parallel);
    lib_wait_dep_post();
    as_cnm(SP_REG_CP, i);
    lib_bc("m_mandel_L_0");
    lib_pop_regs(4, 6);
  }

  private void f_get_m2s_addr()
  {
    // output: R3:m2s_addr
    int m2s_addr = 3;
    // m2s_addr = M2S_ADDR_H << M2S_ADDR_SHIFT;
    label("f_get_m2s_addr");
    lib_wait_dep_pre();
    as_mvi(m2s_addr, M2S_ADDR_H);
    lib_wait_dep_post();
    as_sli(m2s_addr, M2S_ADDR_SHIFT);
    lib_return();
  }

  private void f_get_s2m_addr()
  {
    // output: R3:s2m_addr
    int s2m_addr = 3;
    // s2m_addr = S2M_ADDR_H << S2M_ADDR_SHIFT;
    label("f_get_s2m_addr");
    lib_wait_dep_pre();
    as_mvi(s2m_addr, S2M_ADDR_H);
    lib_wait_dep_post();
    as_sli(s2m_addr, S2M_ADDR_SHIFT);
    lib_return();
  }

  private void f_get_vram_addr()
  {
    // output: R3:vram_addr
    int vram_addr = 3;
    // vram_addr = VRAM_ADDR_H << VRAM_ADDR_SHIFT;
    label("f_get_vram_addr");
    lib_wait_dep_pre();
    as_mvi(vram_addr, VRAM_ADDR_H);
    lib_wait_dep_post();
    as_sli(vram_addr, VRAM_ADDR_SHIFT);
    lib_return();
  }

  private void pe_thread_manager()
  {
    int my_core_id = 4;
    int parallel = 5;
    int task_id = 6;
    int vram_addr = 7;
    int m2s_addr = 8;
    int s2m_addr = 9;
    // temp
    int master_task_id = 10;
    int diff = 11;
    /*
    m2s_addr = lib_call("f_get_m2s_addr");
    vram_addr = lib_call("f_get_vram_addr");
    my_core_id = mem[m2s_addr];
    s2m_addr = lib_call("f_get_s2m_addr");
    s2m_addr += my_core_id;
    m2s_addr++;
    parallel = mem[m2s_addr];
    m2s_addr++;
    task_id = mem[m2s_addr];
    if (my_core_id >= parallel) goto pe_thread_manager_L_end;
    do
    {
      task_id++;
      mem[s2m_addr] = task_id;
      do
      {
        master_task_id = mem[m2s_addr];
        diff = master_task_id - task_id;
      } while (diff != 0);
      m_mandel();
    } while (1);
    */

    as_nop();
    lib_init_stack();
    lib_call("f_get_m2s_addr");
    as_mv(m2s_addr, R3);
    lib_call("f_get_vram_addr");
    as_mv(vram_addr, R3);
    lib_call("f_get_s2m_addr");
    as_mv(s2m_addr, R3);
    as_ld(my_core_id, m2s_addr);
    lib_wait_dep_pre();
    as_addi(m2s_addr, 1);
    lib_wait_dep_post();
    as_mv(diff, my_core_id);
    as_add(s2m_addr, my_core_id);
    as_ld(parallel, m2s_addr);
    lib_wait_dep_pre();
    as_addi(m2s_addr, 1);
    lib_wait_dep_post();
    as_ld(task_id, m2s_addr);
    lib_wait_dep_pre();
    as_sub(diff, parallel);
    lib_wait_dep_post();
    as_cnm(SP_REG_CP, diff);
    lib_bc("pe_thread_manager_L_end");
    label("pe_thread_manager_L_0");
    lib_wait_dep_pre();
    as_addi(task_id, 1);
    lib_wait_dep_post();
    as_st(s2m_addr, task_id);
    label("pe_thread_manager_L_1");
    lib_wait_dep_pre();
    as_ld(master_task_id, m2s_addr);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_mv(diff, master_task_id);
    lib_wait_dep_post();
    lib_wait_dep_pre();
    as_sub(diff, task_id);
    lib_wait_dep_post();
    as_cnz(SP_REG_CP, diff);
    lib_bc("pe_thread_manager_L_1");

    m_mandel();

    lib_ba("pe_thread_manager_L_0");
    label("pe_thread_manager_L_end");
    lib_call("f_halt");
    // link
    f_get_m2s_addr();
    f_get_s2m_addr();
    f_get_vram_addr();
    f_wait();
    f_halt();
  }

  @Override
  public void init()
  {
    set_rom_width(CODE_ROM_WIDTH, DATA_ROM_WIDTH);
    set_rom_depth(CODE_ROM_DEPTH, DATA_ROM_DEPTH);
    set_stack_address((1 << DATA_ROM_DEPTH) - 1);
    set_filename("default_pe");
  }

  @Override
  public void program()
  {
    pe_thread_manager();
  }

  @Override
  public void data()
  {
    label("d_rand");
    dat(0xfc720c27);
  }
}
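
To check the PE output without an FPGA board, the arithmetic of m_mandel_core() and the pixel striding of m_mandel() can be mirrored in plain Java. The following is a minimal sketch, not the author's tooling. It assumes that as_mul wraps like a Java int multiplication, that as_srai and as_sri correspond to `>>` and `>>>`, that as_cnm tests for a non-negative value, and that c starts at 0 (the listing does not initialize it). The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical host-side reference model of the PE program; not part of mini16_manycore.
final class MandelReferenceSketch
{
  static final int FIXED_BITS = 13;
  static final int IMAGE_WIDTH_BITS = 8;
  static final int IMAGE_HEIGHT_BITS = 8;
  static final int MAX_COUNT = 100;

  // Mirrors m_mandel_core(): returns the remaining count that the PE stores to VRAM.
  static int mandelCore(int x, int y, int scale, int cx, int cy)
  {
    int a = 0;
    int b = 0;
    int aa = 0;
    int bb = 0;
    int c = 0; // assumption: the listing reads c before writing it
    int count = MAX_COUNT;
    int maxC = 4 << FIXED_BITS;
    int x1 = (x - (1 << (IMAGE_WIDTH_BITS - 1))) * scale + cx;
    int y1 = (y - (1 << (IMAGE_HEIGHT_BITS - 1))) * scale + cy;
    boolean again;
    do
    {
      int pc = c;
      b = ((a * b) >> (FIXED_BITS - 1)) - y1; // as_mul, as_srai, as_sub
      a = aa - bb - x1;
      aa = (a * a) >>> FIXED_BITS;            // as_mul, as_sri
      bb = (b * b) >>> FIXED_BITS;
      c = aa + bb;
      count--;
      x1 += scale;                            // kept as in the listing
      pc = (pc - c) >>> 5;
      again = (maxC - c >= 0) && (count >= 0) && (pc != 0);
    } while (again);
    return count;
  }

  // Mirrors the striding loop of m_mandel(): how often each pixel index is
  // visited when PEs 0..parallel-1 each step backwards by `parallel`.
  static int[] coverage(int parallel)
  {
    int pixels = 1 << (IMAGE_WIDTH_BITS + IMAGE_HEIGHT_BITS);
    int[] visits = new int[pixels];
    for (int coreId = 0; coreId < parallel; coreId++)
    {
      for (int i = pixels - 1 - coreId; i >= 0; i -= parallel)
      {
        visits[i]++;
      }
    }
    return visits;
  }

  public static void main(String[] args)
  {
    // Same initial parameters as d_mandel_scale, d_mandel_cx and d_mandel_cy.
    System.out.println(mandelCore(100, 100, 256, 161 << 6, 49 << 6));
    System.out.println(Arrays.stream(coverage(170)).allMatch(v -> v == 1));
  }
}
```

The coverage() helper shows why PARALLEL does not have to divide the pixel count: every index belongs to exactly one core, and the cores with the lowest IDs simply process one more pixel than the others.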