
CS:APP-Architecture Lab

2023-10-02 15:03:55

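Preface

This lab asks us to design and implement a pipelined Y86-64 processor. As for why it is called Y86-64, it is presumably a nod to x86-64.

We also have to optimize that processor.

The lab has three parts. In Part A we write a few Y86-64 assembly programs, and in Part B we extend the SEQ simulator. A and B lay the groundwork for Part C, the core of the lab, where we optimize both a Y86-64 benchmark program and the processor design.

Enough talk: open the laptop, grab the keyboard, and start the lab!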

Environment setup

Install the dependencies

sudo apt install tcl tcl-dev tk tk-dev

Unpack the handout:

tar xvf archlab-handout.tar
cd archlab-handout
tar xvf sim.tar

Build the tools

cd sim
make clean; make

If you hit the following error:

multiple definition of `lineno'; yas-grammar.o:(.bss+0x0): first defined here
collect2: error: ld returned 1 exit status

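then set these two variables in misc/Makefile as follows (GCC 10 and later default to -fno-common, which turns the duplicate definition into a link error; -fcommon restores the old behavior):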

CFLAGS=-Wall -O1 -g -fcommon
LCFLAGS=-O1 -fcommon

and set this variable in pipe/Makefile:

CFLAGS=-Wall -O2 -fcommon

Rebuild the tools

cd sim
make clean; make

Part-A

The work for this part is done in the sim/misc directory.

1.

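The first exercise is to write an assembly file, sum.ys, that iteratively sums the elements of a linked list, following the logic of the sum_list function in examples.c. Note that we have to write a complete program, which means we must also initialize the stack pointer, call the function, and halt afterwards.

Figures 4-6 and 4-7 on pages 251 and 252 of the third edition of the textbook show the pattern; following them, we get the program below: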

# sum.ys
.pos 0
        irmovq stack, %rsp
        call main
        halt
.align 8
ele1:
        .quad 0x00a
        .quad ele2
ele2:
        .quad 0x0b0
        .quad ele3
ele3:
        .quad 0xc00
        .quad 0
main:
		irmovq ele1,%rdi
		call sum_list
		ret

sum_list:
		# %rbx is used below, so save it on the stack first
		pushq %rbx
		# long val = 0
        xorq %rax,%rax
loop:
		# while (ls) 
        andq %rdi,%rdi
        je exit_
        # val += ls->val
        mrmovq (%rdi),%rbx
        addq %rbx,%rax
        # ls = ls->next
        mrmovq 8(%rdi),%rdi
        jmp loop
exit_:
		# restore %rbx from the stack before returning
		popq %rbx
        ret

.pos 0x200
stack:

Test:

 ./yas sum.ys && ./yis sum.yo
Stopped in 30 steps at PC = 0x13.  Status 'HLT', CC Z=1 S=0 O=0
Changes to registers:
%rax:   0x0000000000000000      0x0000000000000cba
%rsp:   0x0000000000000000      0x0000000000000200

Changes to memory:
0x01f0: 0x0000000000000000      0x000000000000005b
0x01f8: 0x0000000000000000      0x0000000000000013
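
As a cross-check on the %rax value above, here is a minimal Python sketch of the same loop. The tuple-based node representation is an assumption made for illustration; it is not part of the lab code.

```python
# Model of the ele1 -> ele2 -> ele3 list from sum.ys.
# Each node is (value, next_node); None plays the role of the null pointer.
ele3 = (0xC00, None)
ele2 = (0x0B0, ele3)
ele1 = (0x00A, ele2)

def sum_list(ls):
    """Iterative sum, mirroring the loop in sum.ys."""
    val = 0
    while ls is not None:      # andq %rdi,%rdi / je exit_
        val += ls[0]           # val += ls->val
        ls = ls[1]             # ls = ls->next
    return val

print(hex(sum_list(ele1)))     # 0xcba, the final %rax reported by yis
```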

2.

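The second exercise computes the same sum, but recursively. When recursing in assembly, remember to save any register that the recursive call may overwrite before making the call. In rsum_list, for example, %rax holds the return value of the current invocation, but it is also where the next level of recursion returns its result, so before the recursive call we first copy the current %rax into another register (you could also save it on the stack).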

# rsum.ys
.pos 0
        irmovq stack, %rsp
        call main
        halt
.align 8
ele1:
        .quad 0x00a
        .quad ele2
ele2:
        .quad 0x0b0
        .quad ele3
ele3:
        .quad 0xc00
        .quad 0
main:
		irmovq ele1,%rdi
		call rsum_list
		ret

rsum_list:
		pushq %rbx
		pushq %rcx
        xorq %rax,%rax
		# if (!ls)
        andq %rdi,%rdi
        je exit_
        # long val = ls->val
        mrmovq (%rdi),%rbx
        # save %rax first
        rrmovq %rax,%rcx
        # ls->next
        mrmovq 8(%rdi),%rdi
        # recurse
        call rsum_list
        # add the saved %rax to the value returned by the recursive call
        addq %rcx,%rax
        # add the current node's value to the return value
        addq %rbx,%rax
        
exit_:
		popq %rcx
		popq %rbx
        ret

.pos 0x200
stack:

Test:

./yas rsum.ys && ./yis rsum.yo
Stopped in 56 steps at PC = 0x13.  Status 'HLT', CC Z=0 S=0 O=0
Changes to registers:
%rax:   0x0000000000000000      0x0000000000000cba
%rsp:   0x0000000000000000      0x0000000000000200

Changes to memory:
0x01a0: 0x0000000000000000      0x0000000000000c00
0x01a8: 0x0000000000000000      0x000000000000008c
0x01b8: 0x0000000000000000      0x00000000000000b0
0x01c0: 0x0000000000000000      0x000000000000008c
0x01d0: 0x0000000000000000      0x000000000000000a
0x01d8: 0x0000000000000000      0x000000000000008c
0x01f0: 0x0000000000000000      0x000000000000005b
0x01f8: 0x0000000000000000      0x0000000000000013
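
The recursion can be modeled the same way. The sketch below is only an illustration (the tuple nodes and the invocation counter are assumptions, not lab code). It returns the sum together with the number of invocations, which lines up with the trace above: one call from main (return address 0x5b) plus three recursive calls (return address 0x8c).

```python
# Same three-node list as in rsum.ys; None stands for the null pointer.
ele3 = (0xC00, None)
ele2 = (0x0B0, ele3)
ele1 = (0x00A, ele2)

def rsum_list(ls):
    """Recursive sum; returns (sum, number of invocations)."""
    if ls is None:                 # if (!ls) return 0
        return 0, 1
    val, nxt = ls                  # val must survive the recursive call
    rest, n = rsum_list(nxt)       # call rsum_list
    return val + rest, n + 1       # val + rsum_list(ls->next)

total, invocations = rsum_list(ele1)
print(hex(total), invocations)     # 0xcba 4
```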

3.

The third exercise copies a block of memory and computes a checksum (the XOR of every copied word); it should not be hard.

# copy.ys
.pos 0
        irmovq stack, %rsp
        call main
        halt
.align 8
# Source block
src:
        .quad 0x00a
        .quad 0x0b0
        .quad 0xc00
# Destination block
dest:
        .quad 0x111
        .quad 0x222
        .quad 0x333
main:
		irmovq src,%rdi
		irmovq dest,%rsi
		irmovq $3,%rdx
		call copy_block
		ret

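copy_block:
		pushq %rbx		# save the registers used as scratch below
		pushq %r9
		pushq %r8
		irmovq $8, %r8		# r8 = 8, the size of one word
		irmovq $1, %r9		# r9 = 1, the loop decrement
		xorq %rax,%rax		# result = 0
loop:
		subq %r9,%rdx		# len--
		jl exit_		# stop once len has gone below 0
		mrmovq (%rdi),%rbx	# val = *src
		rmmovq %rbx,(%rsi)	# *dest = val
		addq %r8,%rdi		# src++
		addq %r8,%rsi		# dest++
		xorq %rbx,%rax		# result ^= val
		jmp loop
exit_:
		popq %r8		# restore the saved registers in reverse order
		popq %r9
		popq %rbx
		ret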
.pos 0x200
stack:

Test:

 ./yas copy.ys && ./yis copy.yo
Stopped in 44 steps at PC = 0x13.  Status 'HLT', CC Z=0 S=1 O=0
Changes to registers:
%rax:   0x0000000000000000      0x0000000000000cba
%rdx:   0x0000000000000000      0xffffffffffffffff
%rsp:   0x0000000000000000      0x0000000000000200
%rsi:   0x0000000000000000      0x0000000000000048
%rdi:   0x0000000000000000      0x0000000000000030

Changes to memory:
0x0030: 0x0000000000000111      0x000000000000000a
0x0038: 0x0000000000000222      0x00000000000000b0
0x0040: 0x0000000000000333      0x0000000000000c00
0x01f0: 0x0000000000000000      0x000000000000006f
0x01f8: 0x0000000000000000      0x0000000000000013
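
The checksum in %rax and the three overwritten destination words can be reproduced with a short Python model. This is a sketch that uses Python lists in place of the src and dest memory blocks; it is not the lab's reference code.

```python
def copy_block(src, dest, length):
    """Copy `length` words from src to dest and return the XOR checksum."""
    result = 0
    for i in range(length):
        val = src[i]       # mrmovq (%rdi),%rbx
        dest[i] = val      # rmmovq %rbx,(%rsi)
        result ^= val      # xorq %rbx,%rax
    return result

src = [0x00A, 0x0B0, 0xC00]
dest = [0x111, 0x222, 0x333]
checksum = copy_block(src, dest, 3)
print(hex(checksum))       # 0xcba
```

The three source words have no bits in common, so here the XOR checksum happens to equal their sum.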

Part-B

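In this part we use the hardware control language (HCL) to add one instruction, iaddq. Working in sim/seq, we complete the remaining parts of seq-full.hcl.

Before writing any code, let's work through a bit of theory.

From the hints in the lab PDF, the instruction format is roughly: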

Byte:         |0   |1   |2   3   4   5   6   7   8   9   |
              --------------------------------------------
iaddq V,rB    |C0  |F rB|               V                |

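Following Figure 4-18 of the CS:APP textbook, we can fill in the table below:

Stage        iaddq V, rB
Fetch        icode:ifun ← M1[PC]
             rA:rB ← M1[PC+1]
             valC ← M8[PC+2]
             valP ← PC+10
Decode       valB ← R[rB]
Execute      valE ← valB + valC
             Set CC
Memory       (nothing)
Write back   R[rB] ← valE
PC update    PC ← valP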
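
To make the stage table concrete, here is a rough Python sketch that walks a single iaddq through those steps. The function name, the dictionary register file, and the byte-string memory are assumptions made for illustration; the real simulator is the C code under sim/. The condition-code computation follows the usual two's-complement definitions.

```python
MASK = (1 << 64) - 1

def to_signed(x):
    """Interpret a 64-bit pattern as a two's-complement integer."""
    return x - (1 << 64) if x >> 63 else x

def iaddq(mem, pc, regs):
    """Run one iaddq found at mem[pc:]; updates regs and returns (valP, cc)."""
    icode, ifun = mem[pc] >> 4, mem[pc] & 0xF              # Fetch
    assert (icode, ifun) == (0xC, 0x0), "not an iaddq"
    rB = mem[pc + 1] & 0xF                                 # rA field is 0xF (unused)
    valC = int.from_bytes(mem[pc + 2:pc + 10], "little")
    valP = pc + 10
    valB = regs[rB]                                        # Decode
    valE = (valB + valC) & MASK                            # Execute
    a, b, e = to_signed(valC), to_signed(valB), to_signed(valE)
    cc = {"ZF": valE == 0,
          "SF": e < 0,
          "OF": (a < 0) == (b < 0) and (e < 0) != (a < 0)}
    regs[rB] = valE                                        # Write back
    return valP, cc                                        # PC update

# iaddq $-1, %rdx  (register ID 2 is %rdx), starting with %rdx = 1
program = bytes([0xC0, 0xF2]) + MASK.to_bytes(8, "little")
regs = {2: 1}
new_pc, cc = iaddq(program, 0, regs)
print(new_pc, regs[2], cc["ZF"])    # 10 0 True
```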

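Based on the table, we need to modify the relevant HCL signals, which the textbook introduces starting on page 278 (third edition, Chinese translation):

1. Fetch stage:

iaddq has a register-specifier byte: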

bool need_regids =
	icode in { IRRMOVQ, IOPQ, IPUSHQ, IPOPQ, 
		     IIRMOVQ, IRMMOVQ, IMRMOVQ, IIADDQ };

iaddq needs the constant valC:

bool need_valC =
	icode in { IIRMOVQ, IRMMOVQ, IMRMOVQ, IJXX, ICALL, IIADDQ };
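
These two signals are what make valP come out as PC+10 for iaddq: the PC increment is 1 byte for icode:ifun, plus 1 if need_regids, plus 8 if need_valC. A small Python sketch of that arithmetic follows; the lowercase instruction names and the helper are illustrative assumptions, not identifiers from the simulator.

```python
# Lowercase names stand in for the icodes listed in the two HCL sets above.
NEED_REGIDS = {"rrmovq", "opq", "pushq", "popq",
               "irmovq", "rmmovq", "mrmovq", "iaddq"}
NEED_VALC = {"irmovq", "rmmovq", "mrmovq", "jxx", "call", "iaddq"}

def instr_length(name):
    """Bytes occupied by an instruction: 1 + need_regids + 8 * need_valC."""
    return 1 + (name in NEED_REGIDS) + 8 * (name in NEED_VALC)

print(instr_length("iaddq"))   # 10, hence valP = PC + 10
```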

2. Decode and write-back stages

iaddq does not read register A, so srcA stays as it is.

iaddq reads register B:

word srcB = [
	icode in { IOPQ, IRMMOVQ, IMRMOVQ, IIADDQ  } : rB;
	icode in { IPUSHQ, IPOPQ, ICALL, IRET } : RRSP;
	1 : RNONE;  # Don't need register
];

In the write-back stage, the destination register for port E is rB:

word dstE = [
	icode in { IRRMOVQ } && Cnd : rB;
	icode in { IIRMOVQ, IOPQ, IIADDQ} : rB;
	icode in { IPUSHQ, IPOPQ, ICALL, IRET } : RRSP;
	1 : RNONE;  # Don't write any register
];

iaddq does not access memory, so dstM needs no change.

3. Execute stage

In the execute stage, aluA comes from valC and aluB comes from valB:

word aluA = [
	icode in { IRRMOVQ, IOPQ } : valA;
	icode in { IIRMOVQ, IRMMOVQ, IMRMOVQ, IIADDQ } : valC;
	icode in { ICALL, IPUSHQ } : -8;
	icode in { IRET, IPOPQ } : 8;
	# Other instructions don't need ALU
];
word aluB = [
	icode in { IRMMOVQ, IMRMOVQ, IOPQ, ICALL, 
		      IPUSHQ, IRET, IPOPQ, IIADDQ } : valB;
	icode in { IRRMOVQ, IIRMOVQ } : 0;
	# Other instructions don't need ALU
];

alufun needs no change (its default case already selects ALUADD), but the condition-code register must be updated:

bool set_cc = icode in { IOPQ, IIADDQ };

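4. Memory stage

iaddq does not access memory, so mem_addr, mem_data, mem_write, and mem_read all stay unchanged.

5. PC update stage

Nothing needs to change here either.

Besides the signals in the five stages above, instr_valid must also be updated: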

bool instr_valid = icode in 
	{ INOP, IHALT, IRRMOVQ, IIRMOVQ, IRMMOVQ, IMRMOVQ,
	       IOPQ, IJXX, ICALL, IRET, IPUSHQ, IPOPQ, IIADDQ };

Putting it all together, seq-full.hcl becomes:

#/* $begin seq-all-hcl */
####################################################################
#  HCL Description of Control for Single Cycle Y86-64 Processor SEQ   #
#  Copyright (C) Randal E. Bryant, David R. O'Hallaron, 2010       #
####################################################################

## Your task is to implement the iaddq instruction
## The file contains a declaration of the icodes
## for iaddq (IIADDQ)
## Your job is to add the rest of the logic to make it work

####################################################################
#    C Include's.  Don't alter these                               #
####################################################################

quote '#include <stdio.h>'
quote '#include "isa.h"'
quote '#include "sim.h"'
quote 'int sim_main(int argc, char *argv[]);'
quote 'word_t gen_pc(){return 0;}'
quote 'int main(int argc, char *argv[])'
quote '  {plusmode=0;return sim_main(argc,argv);}'

####################################################################
#    Declarations.  Do not change/remove/delete any of these       #
####################################################################

##### Symbolic representation of Y86-64 Instruction Codes #############
wordsig INOP 	'I_NOP'
wordsig IHALT	'I_HALT'
wordsig IRRMOVQ	'I_RRMOVQ'
wordsig IIRMOVQ	'I_IRMOVQ'
wordsig IRMMOVQ	'I_RMMOVQ'
wordsig IMRMOVQ	'I_MRMOVQ'
wordsig IOPQ	'I_ALU'
wordsig IJXX	'I_JMP'
wordsig ICALL	'I_CALL'
wordsig IRET	'I_RET'
wordsig IPUSHQ	'I_PUSHQ'
wordsig IPOPQ	'I_POPQ'
# Instruction code for iaddq instruction
wordsig IIADDQ	'I_IADDQ'

##### Symbolic represenations of Y86-64 function codes                  #####
wordsig FNONE    'F_NONE'        # Default function code

##### Symbolic representation of Y86-64 Registers referenced explicitly #####
wordsig RRSP     'REG_RSP'    	# Stack Pointer
wordsig RNONE    'REG_NONE'   	# Special value indicating "no register"

##### ALU Functions referenced explicitly                            #####
wordsig ALUADD	'A_ADD'		# ALU should add its arguments

##### Possible instruction status values                             #####
wordsig SAOK	'STAT_AOK'	# Normal execution
wordsig SADR	'STAT_ADR'	# Invalid memory address
wordsig SINS	'STAT_INS'	# Invalid instruction
wordsig SHLT	'STAT_HLT'	# Halt instruction encountered

##### Signals that can be referenced by control logic ####################

##### Fetch stage inputs		#####
wordsig pc 'pc'				# Program counter
##### Fetch stage computations		#####
wordsig imem_icode 'imem_icode'		# icode field from instruction memory
wordsig imem_ifun  'imem_ifun' 		# ifun field from instruction memory
wordsig icode	  'icode'		# Instruction control code
wordsig ifun	  'ifun'		# Instruction function
wordsig rA	  'ra'			# rA field from instruction
wordsig rB	  'rb'			# rB field from instruction
wordsig valC	  'valc'		# Constant from instruction
wordsig valP	  'valp'		# Address of following instruction
boolsig imem_error 'imem_error'		# Error signal from instruction memory
boolsig instr_valid 'instr_valid'	# Is fetched instruction valid?

##### Decode stage computations		#####
wordsig valA	'vala'			# Value from register A port
wordsig valB	'valb'			# Value from register B port

##### Execute stage computations	#####
wordsig valE	'vale'			# Value computed by ALU
boolsig Cnd	'cond'			# Branch test

##### Memory stage computations		#####
wordsig valM	'valm'			# Value read from memory
boolsig dmem_error 'dmem_error'		# Error signal from data memory


####################################################################
#    Control Signal Definitions.                                   #
####################################################################

################ Fetch Stage     ###################################

# Determine instruction code
word icode = [
	imem_error: INOP;
	1: imem_icode;		# Default: get from instruction memory
];

# Determine instruction function
word ifun = [
	imem_error: FNONE;
	1: imem_ifun;		# Default: get from instruction memory
];

bool instr_valid = icode in 
	{ INOP, IHALT, IRRMOVQ, IIRMOVQ, IRMMOVQ, IMRMOVQ,
	       IOPQ, IJXX, ICALL, IRET, IPUSHQ, IPOPQ, IIADDQ };

# Does fetched instruction require a regid byte?
bool need_regids =
	icode in { IRRMOVQ, IOPQ, IPUSHQ, IPOPQ, 
		     IIRMOVQ, IRMMOVQ, IMRMOVQ, IIADDQ };

# Does fetched instruction require a constant word?
bool need_valC =
	icode in { IIRMOVQ, IRMMOVQ, IMRMOVQ, IJXX, ICALL, IIADDQ };

################ Decode Stage    ###################################

## What register should be used as the A source?
word srcA = [
	icode in { IRRMOVQ, IRMMOVQ, IOPQ, IPUSHQ  } : rA;
	icode in { IPOPQ, IRET } : RRSP;
	1 : RNONE; # Don't need register
];

## What register should be used as the B source?
word srcB = [
	icode in { IOPQ, IRMMOVQ, IMRMOVQ, IIADDQ  } : rB;
	icode in { IPUSHQ, IPOPQ, ICALL, IRET } : RRSP;
	1 : RNONE;  # Don't need register
];

## What register should be used as the E destination?
word dstE = [
	icode in { IRRMOVQ } && Cnd : rB;
	icode in { IIRMOVQ, IOPQ, IIADDQ} : rB;
	icode in { IPUSHQ, IPOPQ, ICALL, IRET } : RRSP;
	1 : RNONE;  # Don't write any register
];

## What register should be used as the M destination?
word dstM = [
	icode in { IMRMOVQ, IPOPQ } : rA;
	1 : RNONE;  # Don't write any register
];

################ Execute Stage   ###################################

## Select input A to ALU
word aluA = [
	icode in { IRRMOVQ, IOPQ } : valA;
	icode in { IIRMOVQ, IRMMOVQ, IMRMOVQ, IIADDQ } : valC;
	icode in { ICALL, IPUSHQ } : -8;
	icode in { IRET, IPOPQ } : 8;
	# Other instructions don't need ALU
];

## Select input B to ALU
word aluB = [
	icode in { IRMMOVQ, IMRMOVQ, IOPQ, ICALL, 
		      IPUSHQ, IRET, IPOPQ, IIADDQ } : valB;
	icode in { IRRMOVQ, IIRMOVQ } : 0;
	# Other instructions don't need ALU
];

## Set the ALU function
word alufun = [
	icode == IOPQ : ifun;
	1 : ALUADD;
];

## Should the condition codes be updated?
bool set_cc = icode in { IOPQ, IIADDQ };

################ Memory Stage    ###################################

## Set read control signal
bool mem_read = icode in { IMRMOVQ, IPOPQ, IRET };

## Set write control signal
bool mem_write = icode in { IRMMOVQ, IPUSHQ, ICALL };

## Select memory address
word mem_addr = [
	icode in { IRMMOVQ, IPUSHQ, ICALL, IMRMOVQ } : valE;
	icode in { IPOPQ, IRET } : valA;
	# Other instructions don't need address
];

## Select memory input data
word mem_data = [
	# Value from register
	icode in { IRMMOVQ, IPUSHQ } : valA;
	# Return PC
	icode == ICALL : valP;
	# Default: Don't write anything
];

## Determine instruction status
word Stat = [
	imem_error || dmem_error : SADR;
	!instr_valid: SINS;
	icode == IHALT : SHLT;
	1 : SAOK;
];

################ Program Counter Update ############################

## What address should instruction be fetched at

word new_pc = [
	# Call.  Use instruction constant
	icode == ICALL : valC;
	# Taken branch.  Use instruction constant
	icode == IJXX && Cnd : valC;
	# Completion of RET instruction.  Use value from stack
	icode == IRET : valM;
	# Default: Use incremented PC
	1 : valP;
];
#/* $end seq-all-hcl */

Test:

cd sim/seq
# Build the new simulator
make clean; make VERSION=full
# If you hit "undefined reference to `matherr'", comment out both lines that mention matherr in ssim.c,
# then run make clean; make VERSION=full again

# Test the modified simulator on a single Y86-64 program; its source is ../y86-code/asumi.ys
./ssim -t ../y86-code/asumi.yo
# Run the benchmark programs against the modified simulator
(cd ../y86-code; make testssim)
# Test every instruction except iaddq and leave
(cd ../ptest; make SIM=../seq/ssim)
# Test the iaddq instruction
(cd ../ptest; make SIM=../seq/ssim TFLAGS=-i)

Part-C

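After those two appetizers we finally reach the main course: the pipelined processor.

The working directory for this part is sim/pipe.

The main task here is to modify ncopy.ys and pipe-full.hcl so that ncopy.ys runs as fast as possible.

First, let's see how many points the default version gets: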

make && (cd ../y86-code; make testpsim) && (cd ../ptest; make SIM=../pipe/psim TFLAGS=-i)

Then check the score (note that ./benchmark.pl does not check correctness; it only measures CPE, so an implausibly low CPE most likely means the program is broken):

 ./benchmark.pl
 ...
 ...
 Average CPE     15.18
Score   0.0/60.0

According to the handout, the average CPE has to drop below 10.5 before we earn any points.
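
For reference, the mapping from average CPE to score can be sketched as below. This assumes the piecewise rule I remember from the lab handout (0 points above 10.5, the full 60 below 7.5, linear in between); check it against your copy of the PDF, since the function itself is not part of the lab tools.

```python
def score(cpe):
    """Part C score out of 60 for a given average CPE (assumed handout rule)."""
    if cpe > 10.5:
        return 0.0
    if cpe < 7.5:
        return 60.0
    return 20.0 * (10.5 - cpe)

print(score(15.18))   # 0.0, the unmodified baseline
print(score(9.0))     # 30.0, a hypothetical CPE halfway down the scale
```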

We can start by adding the iaddq instruction. Following Part B, pipe-full.hcl is quick to modify:

#/* $begin pipe-all-hcl */
####################################################################
#    HCL Description of Control for Pipelined Y86-64 Processor     #
#    Copyright (C) Randal E. Bryant, David R. O'Hallaron, 2014     #
####################################################################

## Your task is to implement the iaddq instruction
## The file contains a declaration of the icodes
## for iaddq (IIADDQ)
## Your job is to add the rest of the logic to make it work

####################################################################
#    C Include's.  Don't alter these                               #
####################################################################

quote '#include <stdio.h>'
quote '#include "isa.h"'
quote '#include "pipeline.h"'
quote '#include "stages.h"'
quote '#include "sim.h"'
quote 'int sim_main(int argc, char *argv[]);'
quote 'int main(int argc, char *argv[]){return sim_main(argc,argv);}'

####################################################################
#    Declarations.  Do not change/remove/delete any of these       #
####################################################################

##### Symbolic representation of Y86-64 Instruction Codes #############
wordsig INOP 	'I_NOP'
wordsig IHALT	'I_HALT'
wordsig IRRMOVQ	'I_RRMOVQ'
wordsig IIRMOVQ	'I_IRMOVQ'
wordsig IRMMOVQ	'I_RMMOVQ'
wordsig IMRMOVQ	'I_MRMOVQ'
wordsig IOPQ	'I_ALU'
wordsig IJXX	'I_JMP'
wordsig ICALL	'I_CALL'
wordsig IRET	'I_RET'
wordsig IPUSHQ	'I_PUSHQ'
wordsig IPOPQ	'I_POPQ'
# Instruction code for iaddq instruction
wordsig IIADDQ	'I_IADDQ'

##### Symbolic represenations of Y86-64 function codes            #####
wordsig FNONE    'F_NONE'        # Default function code

##### Symbolic representation of Y86-64 Registers referenced      #####
wordsig RRSP     'REG_RSP'    	     # Stack Pointer
wordsig RNONE    'REG_NONE'   	     # Special value indicating "no register"

##### ALU Functions referenced explicitly ##########################
wordsig ALUADD	'A_ADD'		     # ALU should add its arguments

##### Possible instruction status values                       #####
wordsig SBUB	'STAT_BUB'	# Bubble in stage
wordsig SAOK	'STAT_AOK'	# Normal execution
wordsig SADR	'STAT_ADR'	# Invalid memory address
wordsig SINS	'STAT_INS'	# Invalid instruction
wordsig SHLT	'STAT_HLT'	# Halt instruction encountered

##### Signals that can be referenced by control logic ##############

##### Pipeline Register F ##########################################

wordsig F_predPC 'pc_curr->pc'	     # Predicted value of PC

##### Intermediate Values in Fetch Stage ###########################

wordsig imem_icode  'imem_icode'      # icode field from instruction memory
wordsig imem_ifun   'imem_ifun'       # ifun  field from instruction memory
wordsig f_icode	'if_id_next->icode'  # (Possibly modified) instruction code
wordsig f_ifun	'if_id_next->ifun'   # Fetched instruction function
wordsig f_valC	'if_id_next->valc'   # Constant data of fetched instruction
wordsig f_valP	'if_id_next->valp'   # Address of following instruction
boolsig imem_error 'imem_error'	     # Error signal from instruction memory
boolsig instr_valid 'instr_valid'    # Is fetched instruction valid?

##### Pipeline Register D ##########################################
wordsig D_icode 'if_id_curr->icode'   # Instruction code
wordsig D_rA 'if_id_curr->ra'	     # rA field from instruction
wordsig D_rB 'if_id_curr->rb'	     # rB field from instruction
wordsig D_valP 'if_id_curr->valp'     # Incremented PC

##### Intermediate Values in Decode Stage  #########################

wordsig d_srcA	 'id_ex_next->srca'  # srcA from decoded instruction
wordsig d_srcB	 'id_ex_next->srcb'  # srcB from decoded instruction
wordsig d_rvalA 'd_regvala'	     # valA read from register file
wordsig d_rvalB 'd_regvalb'	     # valB read from register file

##### Pipeline Register E ##########################################
wordsig E_icode 'id_ex_curr->icode'   # Instruction code
wordsig E_ifun  'id_ex_curr->ifun'    # Instruction function
wordsig E_valC  'id_ex_curr->valc'    # Constant data
wordsig E_srcA  'id_ex_curr->srca'    # Source A register ID
wordsig E_valA  'id_ex_curr->vala'    # Source A value
wordsig E_srcB  'id_ex_curr->srcb'    # Source B register ID
wordsig E_valB  'id_ex_curr->valb'    # Source B value
wordsig E_dstE 'id_ex_curr->deste'    # Destination E register ID
wordsig E_dstM 'id_ex_curr->destm'    # Destination M register ID

##### Intermediate Values in Execute Stage #########################
wordsig e_valE 'ex_mem_next->vale'	# valE generated by ALU
boolsig e_Cnd 'ex_mem_next->takebranch' # Does condition hold?
wordsig e_dstE 'ex_mem_next->deste'      # dstE (possibly modified to be RNONE)

##### Pipeline Register M                  #########################
wordsig M_stat 'ex_mem_curr->status'     # Instruction status
wordsig M_icode 'ex_mem_curr->icode'	# Instruction code
wordsig M_ifun  'ex_mem_curr->ifun'	# Instruction function
wordsig M_valA  'ex_mem_curr->vala'      # Source A value
wordsig M_dstE 'ex_mem_curr->deste'	# Destination E register ID
wordsig M_valE  'ex_mem_curr->vale'      # ALU E value
wordsig M_dstM 'ex_mem_curr->destm'	# Destination M register ID
boolsig M_Cnd 'ex_mem_curr->takebranch'	# Condition flag
boolsig dmem_error 'dmem_error'	        # Error signal from instruction memory

##### Intermediate Values in Memory Stage ##########################
wordsig m_valM 'mem_wb_next->valm'	# valM generated by memory
wordsig m_stat 'mem_wb_next->status'	# stat (possibly modified to be SADR)

##### Pipeline Register W ##########################################
wordsig W_stat 'mem_wb_curr->status'     # Instruction status
wordsig W_icode 'mem_wb_curr->icode'	# Instruction code
wordsig W_dstE 'mem_wb_curr->deste'	# Destination E register ID
wordsig W_valE  'mem_wb_curr->vale'      # ALU E value
wordsig W_dstM 'mem_wb_curr->destm'	# Destination M register ID
wordsig W_valM  'mem_wb_curr->valm'	# Memory M value

####################################################################
#    Control Signal Definitions.                                   #
####################################################################

################ Fetch Stage     ###################################

## What address should instruction be fetched at
word f_pc = [
	# Mispredicted branch.  Fetch at incremented PC
	M_icode == IJXX && !M_Cnd : M_valA;
	# Completion of RET instruction
	W_icode == IRET : W_valM;
	# Default: Use predicted value of PC
	1 : F_predPC;
];

## Determine icode of fetched instruction
word f_icode = [
	imem_error : INOP;
	1: imem_icode;
];

# Determine ifun
word f_ifun = [
	imem_error : FNONE;
	1: imem_ifun;
];

# Is instruction valid?
bool instr_valid = f_icode in 
	{ INOP, IHALT, IRRMOVQ, IIRMOVQ, IRMMOVQ, IMRMOVQ,
	  IOPQ, IJXX, ICALL, IRET, IPUSHQ, IPOPQ, IIADDQ };

# Determine status code for fetched instruction
word f_stat = [
	imem_error: SADR;
	!instr_valid : SINS;
	f_icode == IHALT : SHLT;
	1 : SAOK;
];

# Does fetched instruction require a regid byte?
bool need_regids =
	f_icode in { IRRMOVQ, IOPQ, IPUSHQ, IPOPQ, 
		     IIRMOVQ, IRMMOVQ, IMRMOVQ, IIADDQ };

# Does fetched instruction require a constant word?
bool need_valC =
	f_icode in { IIRMOVQ, IRMMOVQ, IMRMOVQ, IJXX, ICALL, IIADDQ };

# Predict next value of PC
word f_predPC = [
	f_icode in { IJXX, ICALL } : f_valC;
	1 : f_valP;
];

################ Decode Stage ######################################


## What register should be used as the A source?
word d_srcA = [
	D_icode in { IRRMOVQ, IRMMOVQ, IOPQ, IPUSHQ  } : D_rA;
	D_icode in { IPOPQ, IRET } : RRSP;
	1 : RNONE; # Don't need register
];

## What register should be used as the B source?
word d_srcB = [
	D_icode in { IOPQ, IRMMOVQ, IMRMOVQ, IIADDQ } : D_rB;
	D_icode in { IPUSHQ, IPOPQ, ICALL, IRET } : RRSP;
	1 : RNONE;  # Don't need register
];

## What register should be used as the E destination?
word d_dstE = [
	D_icode in { IRRMOVQ, IIRMOVQ, IOPQ, IIADDQ} : D_rB;
	D_icode in { IPUSHQ, IPOPQ, ICALL, IRET } : RRSP;
	1 : RNONE;  # Don't write any register
];

## What register should be used as the M destination?
word d_dstM = [
	D_icode in { IMRMOVQ, IPOPQ } : D_rA;
	1 : RNONE;  # Don't write any register
];

## What should be the A value?
## Forward into decode stage for valA
word d_valA = [
	D_icode in { ICALL, IJXX } : D_valP; # Use incremented PC
	d_srcA == e_dstE : e_valE;    # Forward valE from execute
	d_srcA == M_dstM : m_valM;    # Forward valM from memory
	d_srcA == M_dstE : M_valE;    # Forward valE from memory
	d_srcA == W_dstM : W_valM;    # Forward valM from write back
	d_srcA == W_dstE : W_valE;    # Forward valE from write back
	1 : d_rvalA;  # Use value read from register file
];

word d_valB = [
	d_srcB == e_dstE : e_valE;    # Forward valE from execute
	d_srcB == M_dstM : m_valM;    # Forward valM from memory
	d_srcB == M_dstE : M_valE;    # Forward valE from memory
	d_srcB == W_dstM : W_valM;    # Forward valM from write back
	d_srcB == W_dstE : W_valE;    # Forward valE from write back
	1 : d_rvalB;  # Use value read from register file
];

################ Execute Stage #####################################

## Select input A to ALU
word aluA = [
	E_icode in { IRRMOVQ, IOPQ } : E_valA;
	E_icode in { IIRMOVQ, IRMMOVQ, IMRMOVQ , IIADDQ } : E_valC;
	E_icode in { ICALL, IPUSHQ } : -8;
	E_icode in { IRET, IPOPQ } : 8;
	# Other instructions don't need ALU
];

## Select input B to ALU
word aluB = [
	E_icode in { IRMMOVQ, IMRMOVQ, IOPQ, ICALL, 
		     IPUSHQ, IRET, IPOPQ , IIADDQ } : E_valB;
	E_icode in { IRRMOVQ, IIRMOVQ } : 0;
	# Other instructions don't need ALU
];

## Set the ALU function
word alufun = [
	E_icode == IOPQ : E_ifun;
	1 : ALUADD;
];

## Should the condition codes be updated?
bool set_cc = E_icode in { IOPQ, IIADDQ } &&
	# State changes only during normal operation
	!m_stat in { SADR, SINS, SHLT } && !W_stat in { SADR, SINS, SHLT };

## Generate valA in execute stage
word e_valA = E_valA;    # Pass valA through stage

## Set dstE to RNONE in event of not-taken conditional move
word e_dstE = [
	E_icode == IRRMOVQ && !e_Cnd : RNONE;
	1 : E_dstE;
];

################ Memory Stage ######################################

## Select memory address
word mem_addr = [
	M_icode in { IRMMOVQ, IPUSHQ, ICALL, IMRMOVQ } : M_valE;
	M_icode in { IPOPQ, IRET } : M_valA;
	# Other instructions don't need address
];

## Set read control signal
bool mem_read = M_icode in { IMRMOVQ, IPOPQ, IRET };

## Set write control signal
bool mem_write = M_icode in { IRMMOVQ, IPUSHQ, ICALL };

#/* $begin pipe-m_stat-hcl */
## Update the status
word m_stat = [
	dmem_error : SADR;
	1 : M_stat;
];
#/* $end pipe-m_stat-hcl */

## Set E port register ID
word w_dstE = W_dstE;

## Set E port value
word w_valE = W_valE;

## Set M port register ID
word w_dstM = W_dstM;

## Set M port value
word w_valM = W_valM;

## Update processor status
word Stat = [
	W_stat == SBUB : SAOK;
	1 : W_stat;
];

################ Pipeline Register Control #########################

# Should I stall or inject a bubble into Pipeline Register F?
# At most one of these can be true.
bool F_bubble = 0;
bool F_stall =
	# Conditions for a load/use hazard
	E_icode in { IMRMOVQ, IPOPQ } &&
	 E_dstM in { d_srcA, d_srcB } ||
	# Stalling at fetch while ret passes through pipeline
	IRET in { D_icode, E_icode, M_icode };

# Should I stall or inject a bubble into Pipeline Register D?
# At most one of these can be true.
bool D_stall = 
	# Conditions for a load/use hazard
	E_icode in { IMRMOVQ, IPOPQ } &&
	 E_dstM in { d_srcA, d_srcB };

bool D_bubble =
	# Mispredicted branch
	(E_icode == IJXX && !e_Cnd) ||
	# Stalling at fetch while ret passes through pipeline
	# but not condition for a load/use hazard
	!(E_icode in { IMRMOVQ, IPOPQ } && E_dstM in { d_srcA, d_srcB }) &&
	  IRET in { D_icode, E_icode, M_icode };

# Should I stall or inject a bubble into Pipeline Register E?
# At most one of these can be true.
bool E_stall = 0;
bool E_bubble =
	# Mispredicted branch
	(E_icode == IJXX && !e_Cnd) ||
	# Conditions for a load/use hazard
	E_icode in { IMRMOVQ, IPOPQ } &&
	 E_dstM in { d_srcA, d_srcB};

# Should I stall or inject a bubble into Pipeline Register M?
# At most one of these can be true.
bool M_stall = 0;
# Start injecting bubbles as soon as exception passes through memory stage
bool M_bubble = m_stat in { SADR, SINS, SHLT } || W_stat in { SADR, SINS, SHLT };

# Should I stall or inject a bubble into Pipeline Register W?
bool W_stall = W_stat in { SADR, SINS, SHLT };
bool W_bubble = 0;
#/* $end pipe-all-hcl */

First, check that the modified file is correct:

cd sim/pipe
make clean; make VERSION=full;

# Test the modified simulator on a single Y86-64 program; its source is ../y86-code/asumi.ys
./psim -t ../y86-code/asumi.yo
# Seeing "ISA Check Succeeds" means it passed
# Run the benchmark programs against the modified simulator
(cd ../y86-code; make testpsim)
# Test every instruction except iaddq and leave
(cd ../ptest; make SIM=../pipe/psim)
# Test the iaddq instruction
(cd ../ptest; make SIM=../pipe/psim TFLAGS=-i)

Once all of these pass, we replace the suitable instructions in ncopy.ys with iaddq. The improved ncopy.ys looks like this:

#/* $begin ncopy-ys */
##################################################################
# ncopy.ys - Copy a src block of len words to dst.
# Return the number of positive words (>0) contained in src.
#
# Include your name and ID here.
#
# Describe how and why you modified the baseline code.
#
##################################################################
# Do not modify this portion
# Function prologue.
# %rdi = src, %rsi = dst, %rdx = len
ncopy:

##################################################################
# You can modify this portion
	# Loop header
	xorq %rax,%rax		# count = 0;
	andq %rdx,%rdx		# len <= 0?
	jle Done		# if so, goto Done:

Loop:	
	mrmovq (%rdi), %r10	# read val from src...
	rmmovq %r10, (%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle Npos		# if so, goto Npos:
	iaddq $1, %rax		# count++
Npos:	
	iaddq $-1, %rdx		# len--
	iaddq $8, %rdi		# src++
	iaddq $8, %rsi		# dst++
	andq %rdx,%rdx		# len > 0?
	jg Loop			# if so, goto Loop:
##################################################################
# Do not modify the following section of code
# Function epilogue.
Done:
	ret
##################################################################
# Keep the following label at the end of your function
End:
#/* $end ncopy-ys */

Test the file above:

# Check correctness
./correctness.pl
# Measure performance
./benchmark.pl
Average CPE     12.70
Score   0.0/60.0

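An improvement, but not much.

Now it is time to consider loop unrolling, covered in Section 5.8 of Chapter 5. There are many unrolling factors to choose from, such as 4×1, 5×1, 6×1, or 7×1; in my trials 8×1 worked best, and the corresponding ncopy.ys is: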

#/* $begin ncopy-ys */
##################################################################
# ncopy.ys - Copy a src block of len words to dst.
# Return the number of positive words (>0) contained in src.
#
# Include your name and ID here.
#
# Describe how and why you modified the baseline code.
#
##################################################################
# Do not modify this portion
# Function prologue.
# %rdi = src, %rsi = dst, %rdx = len
ncopy:

##################################################################
# You can modify this portion
	# Loop header
	xorq %rax,%rax		# count = 0;
	andq %rdx,%rdx		# len <= 0?
	jle Done		    # if so, goto Done:
	
	rrmovq %rdx, %rbx
	iaddq $-7, %rbx		# (len - 7) <= 0?
	jle less_7_loop		# if so, goto less_7_loop:
						# else (len - 7) > 0, then goto test_1
	
	
test_1:	
	mrmovq (%rdi), %r10	# read val from src...
	rmmovq %r10, (%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle test_2		    # if so, goto test_2:
	iaddq $1, %rax		# count++
	
test_2:	
	mrmovq 8(%rdi), %r10	# read val from src...
	rmmovq %r10, 8(%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle test_3		    # if so, goto test_3:
	iaddq $1, %rax		# count++

test_3:	
	mrmovq 16(%rdi), %r10	# read val from src...
	rmmovq %r10, 16(%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle test_4		    # if so, goto test_4:
	iaddq $1, %rax		# count++

test_4:	
	mrmovq 24(%rdi), %r10	# read val from src...
	rmmovq %r10, 24(%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle test_5		    # if so, goto test_5:
	iaddq $1, %rax		# count++

test_5:	
	mrmovq 32(%rdi), %r10	# read val from src...
	rmmovq %r10, 32(%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle test_6		    # if so, goto test_6:
	iaddq $1, %rax		# count++
test_6:	
	mrmovq 40(%rdi), %r10	# read val from src...
	rmmovq %r10, 40(%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle test_7		    # if so, goto test_7:
	iaddq $1, %rax		# count++
test_7:	
	mrmovq 48(%rdi), %r10	# read val from src...
	rmmovq %r10, 48(%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle test_8		    # if so, goto test_8:
	iaddq $1, %rax		# count++
test_8:	
	mrmovq 56(%rdi), %r10	# read val from src...
	rmmovq %r10, 56(%rsi)	# ...and store it to dst
	iaddq $64, %rdi		# src += 8 (8 words = 64 bytes)
	iaddq $64, %rsi		# dst += 8 (8 words = 64 bytes)
	andq %r10, %r10		# val <= 0?
	jle test_loop		    # if so, goto test_loop:
	iaddq $1, %rax		# count++

test_loop:	
	iaddq $-8, %rdx		# len -= 8
	rrmovq %rdx, %rbx
	iaddq $-7, %rbx		# (len - 7) > 0?
	jg test_1			# if so, goto test_1:
	andq %rdx,%rdx		# else if len == 0?
	je Done		    	# if so, goto Done:
						# else len - 7 < 0, then goto less_7_loop
	
	
less_7_loop:	
	mrmovq (%rdi), %r10	# read val from src...
	rmmovq %r10, (%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle Npos		# if so, goto Npos:
	iaddq $1, %rax		# count++
Npos:	
	iaddq $-1, %rdx		# len--
	iaddq $8, %rdi		# src++
	iaddq $8, %rsi		# dst++
	andq %rdx,%rdx		# len > 0?
	jg less_7_loop		# if so, goto less_7_loop:
##################################################################
# Do not modify the following section of code
# Function epilogue.
Done:
	ret
##################################################################
# Keep the following label at the end of your function
End:
#/* $end ncopy-ys */
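
For reference, the control flow of the listing above (an 8-way unrolled main loop followed by a one-element-at-a-time tail loop) can be modelled in a few lines of Python. This is only a sketch for checking the logic, not part of the lab; the function name `ncopy_model` and the `unroll` parameter are made up for this example.

```python
def ncopy_model(src, unroll=8):
    """Copy src and count positive values, mirroring the unrolled loop structure."""
    dst = [0] * len(src)
    count = 0
    i = 0
    remaining = len(src)

    # Main loop: taken while at least `unroll` elements remain (len - 7 > 0).
    while remaining >= unroll:
        for k in range(unroll):          # corresponds to test_1 .. test_8
            val = src[i + k]
            dst[i + k] = val
            if val > 0:
                count += 1
        i += unroll
        remaining -= unroll

    # Tail loop: the less_7_loop part, one element per iteration.
    while remaining > 0:
        val = src[i]
        dst[i] = val
        if val > 0:
            count += 1
        i += 1
        remaining -= 1

    return dst, count


if __name__ == "__main__":
    data = [3, -1, 0, 7, 2, -5, 9, 4, 1, -2, 6]
    copied, positives = ncopy_model(data)
    print(copied == data, positives)
```

With 11 elements the main loop runs once and the tail loop handles the last 3, which is the same split the assembly makes.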

Test:

# Check correctness
./correctness.pl
# Measure performance
./benchmark.pl
        ncopy
0       13
1       30      30.00
2       43      21.50
3       53      17.67
4       66      16.50
5       76      15.20
6       89      14.83
7       99      14.14
8       82      10.25
9       96      10.67
10      109     10.90
11      119     10.82
12      132     11.00
13      142     10.92
14      155     11.07
15      165     11.00
16      140     8.75
17      154     9.06
18      167     9.28
19      177     9.32
20      190     9.50
21      200     9.52
22      213     9.68
23      223     9.70
24      198     8.25
25      212     8.48
26      225     8.65
27      235     8.70
28      248     8.86
29      258     8.90
30      271     9.03
31      281     9.06
32      256     8.00
33      270     8.18
34      283     8.32
35      293     8.37
36      306     8.50
37      316     8.54
38      329     8.66
39      339     8.69
40      314     7.85
41      328     8.00
42      341     8.12
43      351     8.16
44      364     8.27
45      374     8.31
46      387     8.41
47      397     8.45
48      372     7.75
49      386     7.88
50      399     7.98
51      409     8.02
52      422     8.12
53      432     8.15
54      445     8.24
55      455     8.27
56      430     7.68
57      444     7.79
58      457     7.88
59      467     7.92
60      480     8.00
61      490     8.03
62      503     8.11
63      513     8.14
64      488     7.62
Average CPE     9.84
Score   13.2/60.0
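
The score line follows from the average CPE. As far as I recall, the lab handout grades this part as 0 points for an average CPE above 10.5, 60 points below 7.5, and 20 × (10.5 − CPE) in between; treat those thresholds as an assumption and check your own copy of the handout. A small Python sketch (the function name `part_c_score` is invented here):

```python
def part_c_score(avg_cpe):
    """Map an average CPE to a score out of 60 (thresholds assumed from the handout)."""
    if avg_cpe > 10.5:
        return 0.0
    if avg_cpe < 7.5:
        return 60.0
    return 20.0 * (10.5 - avg_cpe)


if __name__ == "__main__":
    # Average CPE reported by benchmark.pl above.
    print(round(part_c_score(9.84), 1))  # prints 13.2
```

Under that formula every 0.05 shaved off the average CPE is worth one point, so small per-iteration savings are visible in the score.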

Not bad, we have a score now. Can we optimize further? Yes. Some of the code has load/use hazards:

	mrmovq (%rdi), %r10	# read val from src...
	rmmovq %r10, (%rsi)	# ...and store it to dst

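Here the rmmovq has to read %r10, but that register is written by the mrmovq immediately before it, so the pipeline must hold the rmmovq back for a cycle by inserting a bubble. To avoid paying for that bubble, we can put other useful work between the load and its use, for example by issuing several loads from src back to back: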

test_1:	
	mrmovq (%rdi), %r10	# read val from src...
	mrmovq 8(%rdi), %r11
	mrmovq 16(%rdi), %r12
	mrmovq 24(%rdi), %r13
	mrmovq 32(%rdi), %r14
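
To see why reordering helps, here is a rough Python model that counts one cycle per instruction plus one bubble whenever an instruction reads a register loaded by the mrmovq directly before it. It is a simplified sketch, not the simulator: it ignores branch mispredictions and every other source of stalls, and the function name `cycles_with_load_use` and the tuple encoding are made up for this example.

```python
def cycles_with_load_use(instrs):
    """Count instructions plus one bubble for every load/use hazard.

    Each instruction is (opcode, source_registers, destination_register).
    Only mrmovq is treated as a load; branches and other stalls are ignored.
    """
    bubbles = 0
    for prev, cur in zip(instrs, instrs[1:]):
        prev_op, _, prev_dst = prev
        _, cur_srcs, _ = cur
        if prev_op == "mrmovq" and prev_dst in cur_srcs:
            bubbles += 1
    return len(instrs) + bubbles


# Two elements copied one after another: each store directly follows its load.
naive = [
    ("mrmovq", ("%rdi",), "%r10"),
    ("rmmovq", ("%r10", "%rsi"), None),
    ("mrmovq", ("%rdi",), "%r10"),
    ("rmmovq", ("%r10", "%rsi"), None),
]

# Same work with both loads issued first, into different registers.
scheduled = [
    ("mrmovq", ("%rdi",), "%r10"),
    ("mrmovq", ("%rdi",), "%r11"),
    ("rmmovq", ("%r10", "%rsi"), None),
    ("rmmovq", ("%r11", "%rsi"), None),
]

if __name__ == "__main__":
    print(cycles_with_load_use(naive), cycles_with_load_use(scheduled))  # prints 6 4
```

The saving only materializes if the reordering does not add instructions of its own: every extra register-to-register copy costs a cycle and cancels one removed bubble.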

The optimized version:

#/* $begin ncopy-ys */
##################################################################
# ncopy.ys - Copy a src block of len words to dst.
# Return the number of positive words (>0) contained in src.
#
# Include your name and ID here.
#
# Describe how and why you modified the baseline code.
#
##################################################################
# Do not modify this portion
# Function prologue.
# %rdi = src, %rsi = dst, %rdx = len
ncopy:

##################################################################
# You can modify this portion
	# Loop header
	xorq %rax,%rax		# count = 0;
	andq %rdx,%rdx		# len <= 0?
	jle Done		    # if so, goto Done:
	
	rrmovq %rdx, %rbx
	iaddq $-7, %rbx		# (len - 7) <= 0?
	jle less_7_loop		# if so, goto less_7_loop:
						# else (len - 7) > 0, then goto test_1
	
	
test_1:	
	mrmovq (%rdi), %r10	# read val from src...
	mrmovq 8(%rdi), %r11
	mrmovq 16(%rdi), %r12
	mrmovq 24(%rdi), %r13
	mrmovq 32(%rdi), %r14
	rmmovq %r10, (%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle test_2		    # if so, goto test_2:
	iaddq $1, %rax		# count++
	
test_2:	
	rrmovq %r11, %r10	# read val from src...
	rmmovq %r10, 8(%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle test_3		    # if so, goto test_3:
	iaddq $1, %rax		# count++

test_3:	
	rrmovq %r12, %r10	# read val from src...
	rmmovq %r10, 16(%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle test_4		    # if so, goto test_4:
	iaddq $1, %rax		# count++

test_4:	
	rrmovq %r13, %r10	# read val from src...
	rmmovq %r10, 24(%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle test_5		    # if so, goto test_5:
	iaddq $1, %rax		# count++

test_5:	
	rrmovq %r14, %r10	# read val from src...
	rmmovq %r10, 32(%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle test_6		    # if so, goto test_6:
	iaddq $1, %rax		# count++
test_6:	
	mrmovq 40(%rdi), %r10	# read val from src...
	mrmovq 48(%rdi), %r11
	mrmovq 56(%rdi), %r12
	rmmovq %r10, 40(%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle test_7		    # if so, goto test_7:
	iaddq $1, %rax		# count++
test_7:	
	rrmovq %r11, %r10	# read val from src...
	rmmovq %r10, 48(%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle test_8		    # if so, goto test_8:
	iaddq $1, %rax		# count++
test_8:	
	rrmovq %r12, %r10	# read val from src...
	rmmovq %r10, 56(%rsi)	# ...and store it to dst
	iaddq $64, %rdi		# src += 8
	iaddq $64, %rsi		# dst += 8
	andq %r10, %r10		# val <= 0?
	jle test_loop		    # if so, goto test_loop:
	iaddq $1, %rax		# count++

test_loop:	
	iaddq $-8, %rdx		# len -= 8
	rrmovq %rdx, %rbx
	iaddq $-7, %rbx		# (len - 7) > 0?
	jg test_1			# if so, goto test_1:
	andq %rdx,%rdx		# else if len == 0?
	je Done		    	# if so, goto Done:
						# else len - 7 < 0, then goto less_7_loop
	
	
less_7_loop:	
	mrmovq (%rdi), %r10	# read val from src...
	rmmovq %r10, (%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle Npos		# if so, goto Npos:
	iaddq $1, %rax		# count++
Npos:	
	iaddq $-1, %rdx		# len--
	iaddq $8, %rdi		# src++
	iaddq $8, %rsi		# dst++
	andq %rdx,%rdx		# len > 0?
	jg less_7_loop		# if so, goto less_7_loop:
##################################################################
# Do not modify the following section of code
# Function epilogue.
Done:
	ret
##################################################################
# Keep the following label at the end of your function
End:
#/* $end ncopy-ys */

Test:

# Check correctness
./correctness.pl
# Measure performance
./benchmark.pl
        ncopy
0       13
1       30      30.00
2       43      21.50
3       53      17.67
4       66      16.50
5       76      15.20
6       89      14.83
7       99      14.14
8       80      10.00
9       94      10.44
10      107     10.70
11      117     10.64
12      130     10.83
13      140     10.77
14      153     10.93
15      163     10.87
16      136     8.50
17      150     8.82
18      163     9.06
19      173     9.11
20      186     9.30
21      196     9.33
22      209     9.50
23      219     9.52
24      192     8.00
25      206     8.24
26      219     8.42
27      229     8.48
28      242     8.64
29      252     8.69
30      265     8.83
31      275     8.87
32      248     7.75
33      262     7.94
34      275     8.09
35      285     8.14
36      298     8.28
37      308     8.32
38      321     8.45
39      331     8.49
40      304     7.60
41      318     7.76
42      331     7.88
43      341     7.93
44      354     8.05
45      364     8.09
46      377     8.20
47      387     8.23
48      360     7.50
49      374     7.63
50      387     7.74
51      397     7.78
52      410     7.88
53      420     7.92
54      433     8.02
55      443     8.05
56      416     7.43
57      430     7.54
58      443     7.64
59      453     7.68
60      466     7.77
61      476     7.80
62      489     7.89
63      499     7.92
64      472     7.38
Average CPE     9.64
Score   17.2/60.0

Where else can we optimize? When fewer than 8 elements remain, can we unroll that part as well? Let's try a 4-way unroll:

#/* $begin ncopy-ys */
##################################################################
# ncopy.ys - Copy a src block of len words to dst.
# Return the number of positive words (>0) contained in src.
#
# Include your name and ID here.
#
# Describe how and why you modified the baseline code.
#
##################################################################
# Do not modify this portion
# Function prologue.
# %rdi = src, %rsi = dst, %rdx = len
ncopy:

##################################################################
# You can modify this portion
	# Loop header
	xorq %rax,%rax		# count = 0;
	andq %rdx,%rdx		# len <= 0?
	jle Done		    # if so, goto Done:
	
	rrmovq %rdx, %rbx
	iaddq $-7, %rbx		# (len - 7) <= 0?
	jle less_7_loop		# if so, goto less_7_loop:
						# else (len - 7) > 0, then goto test_1
	
	
test_1:	
	mrmovq (%rdi), %r10	# read val from src...
	mrmovq 8(%rdi), %r11
	mrmovq 16(%rdi), %r12
	mrmovq 24(%rdi), %r13
	mrmovq 32(%rdi), %r14
	rmmovq %r10, (%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle test_2		    # if so, goto test_2:
	iaddq $1, %rax		# count++
	
test_2:	
	rrmovq %r11, %r10	# read val from src...
	rmmovq %r10, 8(%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle test_3		    # if so, goto test_3:
	iaddq $1, %rax		# count++

test_3:	
	rrmovq %r12, %r10	# read val from src...
	rmmovq %r10, 16(%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle test_4		    # if so, goto test_4:
	iaddq $1, %rax		# count++

test_4:	
	rrmovq %r13, %r10	# read val from src...
	rmmovq %r10, 24(%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle test_5		    # if so, goto test_5:
	iaddq $1, %rax		# count++

test_5:	
	rrmovq %r14, %r10	# read val from src...
	rmmovq %r10, 32(%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle test_6		    # if so, goto test_6:
	iaddq $1, %rax		# count++
test_6:	
	mrmovq 40(%rdi), %r10	# read val from src...
	mrmovq 48(%rdi), %r11
	mrmovq 56(%rdi), %r12
	rmmovq %r10, 40(%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle test_7		    # if so, goto test_7:
	iaddq $1, %rax		# count++
test_7:	
	rrmovq %r11, %r10	# read val from src...
	rmmovq %r10, 48(%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle test_8		    # if so, goto test_8:
	iaddq $1, %rax		# count++
test_8:	
	rrmovq %r12, %r10	# read val from src...
	rmmovq %r10, 56(%rsi)	# ...and store it to dst
	iaddq $64, %rdi		# src += 8
	iaddq $64, %rsi		# dst += 8
	andq %r10, %r10		# val <= 0?
	jle test_loop		    # if so, goto test_loop:
	iaddq $1, %rax		# count++

test_loop:	
	iaddq $-8, %rdx		# len -= 8
	rrmovq %rdx, %rbx
	iaddq $-7, %rbx		# (len - 7) > 0?
	jg test_1			# if so, goto test_1:
	andq %rdx,%rdx		# else if len == 0?
	je Done		    	# if so, goto Done:
						# else len - 7 < 0, then goto less_7_loop
	
	
less_7_loop:	
	rrmovq %rdx, %rbx
	iaddq $-3, %rbx		# (len - 3) > 0?
	jg test_11			# if so, goto test_11:
	andq %rdx,%rdx		# else if len == 0?
	je Done		    	# if so, goto Done:
						# else len - 3 < 0, then goto less_3_loop
	jmp less_3_loop
test_11:	
	mrmovq (%rdi), %r10	# read val from src...
	mrmovq 8(%rdi), %r11
	mrmovq 16(%rdi), %r12
	mrmovq 24(%rdi), %r13
	rmmovq %r10, (%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle test_22		    # if so, goto test_22:
	iaddq $1, %rax		# count++
	
test_22:	
	rrmovq %r11, %r10	# read val from src...
	rmmovq %r10, 8(%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle test_33		    # if so, goto test_33:
	iaddq $1, %rax		# count++

test_33:	
	rrmovq %r12, %r10	# read val from src...
	rmmovq %r10, 16(%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle test_44		    # if so, goto test_44:
	iaddq $1, %rax		# count++

test_44:	
	rrmovq %r13, %r10	# read val from src...
	rmmovq %r10, 24(%rsi)	# ...and store it to dst
	iaddq $32, %rdi		# src += 4
	iaddq $32, %rsi		# dst += 4
	andq %r10, %r10		# val <= 0?
	jle test_continue		    # if so, goto test_continue:
	iaddq $1, %rax		# count++

test_continue:
	iaddq $-4, %rdx		# len -= 4
	andq %rdx,%rdx		# len <= 0?
	jle Done		    # if so, goto Done:
	
less_3_loop:
	mrmovq (%rdi), %r10	# read val from src...
	rmmovq %r10, (%rsi)	# ...and store it to dst
	andq %r10, %r10		# val <= 0?
	jle Npos		# if so, goto Npos:
	iaddq $1, %rax		# count++

Npos:	
	iaddq $-1, %rdx		# len--
	iaddq $8, %rdi		# src++
	iaddq $8, %rsi		# dst++
	andq %rdx,%rdx		# len > 0?
	jg less_3_loop		# if so, goto less_3_loop:
##################################################################
# Do not modify the following section of code
# Function epilogue.
Done:
	ret
##################################################################
# Keep the following label at the end of your function
End:
#/* $end ncopy-ys */

Test:

# Check correctness
./correctness.pl
# Measure performance
./benchmark.pl
        ncopy
0       13
1       40      40.00
2       53      26.50
3       63      21.00
4       51      12.75
5       65      13.00
6       78      13.00
7       88      12.57
8       80      10.00
9       104     11.56
10      117     11.70
11      127     11.55
12      115     9.58
13      129     9.92
14      142     10.14
15      152     10.13
16      136     8.50
17      160     9.41
18      173     9.61
19      183     9.63
20      171     8.55
21      185     8.81
22      198     9.00
23      208     9.04
24      192     8.00
25      216     8.64
26      229     8.81
27      239     8.85
28      227     8.11
29      241     8.31
30      254     8.47
31      264     8.52
32      248     7.75
33      272     8.24
34      285     8.38
35      295     8.43
36      283     7.86
37      297     8.03
38      310     8.16
39      320     8.21
40      304     7.60
41      328     8.00
42      341     8.12
43      351     8.16
44      339     7.70
45      353     7.84
46      366     7.96
47      376     8.00
48      360     7.50
49      384     7.84
50      397     7.94
51      407     7.98
52      395     7.60
53      409     7.72
54      422     7.81
55      432     7.85
56      416     7.43
57      440     7.72
58      453     7.81
59      463     7.85
60      451     7.52
61      465     7.62
62      478     7.71
63      488     7.75
64      472     7.38
Average CPE     9.74
Score   15.3/60.0

It actually got worse! Fine, I'm calling it a day here.
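
Comparing the last two benchmark tables line by line shows where the loss comes from. Relative to the previous version, lengths with `len % 8` equal to 1, 2, or 3 cost 10 more cycles (for example 30 to 40 at length 1), a remainder of 4 saves 15, remainders of 5 to 7 save 11, and multiples of 8 are unchanged. Assuming the reported average is the plain mean of the per-length CPE values over lengths 1 to 64 (which is how the table reads), the extra cycles at very short lengths weigh the most. A Python sketch of that arithmetic, with the deltas copied from the tables (the names are invented for this example):

```python
# Cycle difference (4-way tail minus simple tail), keyed by len % 8,
# read off the two benchmark tables.
DELTA_BY_REMAINDER = {0: 0, 1: 10, 2: 10, 3: 10, 4: -15, 5: -11, 6: -11, 7: -11}


def average_cpe_change(max_len=64):
    """Mean change in cycles-per-element over lengths 1..max_len."""
    total = 0.0
    for n in range(1, max_len + 1):
        total += DELTA_BY_REMAINDER[n % 8] / n
    return total / max_len


if __name__ == "__main__":
    print(f"{average_cpe_change():+.2f}")
```

The result is positive, roughly a tenth of a cycle per element, which is consistent with the average moving from 9.64 to 9.74: the 10 extra cycles at lengths 1 to 3 are divided by very small lengths and outweigh the savings at 4 to 7.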

Understanding the ideas is what matters; there is no need to grind on this forever.

Afterword

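To make a program faster: on the hardware side, pipelining is a good choice; on the software side, prefer instructions that take immediate operands (such as iaddq) wherever possible, and optimize at the algorithm level as well, for example with loop unrolling.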

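Still, after racking our brains and pulling out every trick, with the code ending up rather less readable, we may find that the gain is modest (not that I'm talking about the code I wrote this time, of course). That doesn't matter. It is exactly these small optimizations, which many people look down on, that keep the whole edifice of computing healthy, stable, and fast. A salute to everyone who works toward faster, stronger code!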
