Charles University in Prague

Faculty of Mathematics and Physics

MASTER THESIS

IA‑64 Architecture Simulation

Jindřich Houska

Supervisor: Mgr. Jakub Yaghob

Department of Software Engineering

Acknowledgement

I would like to thank everyone who helped me while I was writing this work, especially Mgr. Yaghob, whose useful suggestions led it to a successful end.

I declare that I wrote this thesis myself using only the sources cited.

I agree that this work may be lent for further study.

Prague, April 18, 2001

Jindřich Houska

Contents

1      Introduction

1.1       Goal of Thesis

1.2       Text Overview

2      IA-64 Architecture Overview

2.1       General Overview

2.2       EPIC Architecture

2.3       Program Execution

2.4       Register Fields

2.4.1        General Registers

2.4.2        Floating-Point Registers

2.4.3        Predicate Registers

2.4.4        Current Frame Marker and Register Stack Engine

2.4.5        Application Registers

2.4.6        Other Registers

2.5       Interruptions

2.6       Memory and IA-64 Addressing

2.6.1        Virtual Addressing

2.6.2        Control Speculation

2.6.3        Data Speculation

2.7       Some Outstanding Instructions

2.7.1        Compare Type Instructions

2.7.2        Multimedia Instructions

2.8       Branch Instructions

2.8.1        Branch Types

2.8.2        Branch Prediction

2.9       Floating‑Point Architecture

3      IA64Emu Program

3.1       Introduction

3.2       Simulator Overview

3.3       Simulated Applications

3.4       Inside the Simulator

3.4.1        Loader

3.4.2        Processor

3.4.2.1     Registers

3.4.2.2     Program Simulation

3.4.2.3     Dispersal Window

3.4.2.4     Instruction Execution

3.4.2.5     Instruction Implementation

3.4.2.6     Interruption Handling

3.4.3        Register Stack Engine

3.4.4        Physical and Virtual Memory

3.4.4.1     Memory Cache

3.4.4.2     Virtual Hash Page Table

3.4.4.3     Translation Lookaside Buffer

3.4.5        Advanced Load Address Table

3.4.6        Branch Prediction

3.4.6.1     2‑level Predictor Algorithm

3.5       Code Testing

3.5.1        Interruptions

3.5.2        Access Dependency Tests

3.5.3        Branch Prediction Test

3.5.4        Performance Monitoring

3.5.5        Other Statistics

3.6       Extending the Simulator

3.6.1        Architecture Configuration File

3.6.2        Architecture Specific Algorithms

3.6.3        Instruction Definition

4      User Guide

4.1       Diskette Contents

4.2       Installation and IA‑64 Application Compilation

4.3       Simulator's Interface

5      Related Work

5.1       Linux Developer's Kit

5.2       Jason Papadopoulos's Simulator

6      Conclusion

Bibliography

 

1        Introduction

1.1      Goal of Thesis

The complexity of applications and problems solved by computers rises continuously, and with it the demands on processor performance. One way to increase performance is to develop a completely new architecture. Intel and Hewlett‑Packard chose this way and developed the IA‑64 architecture, on which their future processors will be based.

At an abstract level, the specification of the architecture defines the hardware components of a microprocessor, such as the number and types of registers and the cache levels. It also defines the processor instructions and the semantics of processor behavior.

A processor cannot be successful without a number of applications running on it. Intel strongly supports the porting of applications so that they are ready when the first IA‑64 processor is released. Because these applications cannot yet be developed on a real processor based on this architecture, several simulators have been implemented.

The goal of this thesis is to create a simulator of the IA‑64 architecture. The simulator is required neither to be as fast as possible nor to simulate all features of the architecture. It should execute stand-alone applications only, and only the application‑level instructions of IA‑64. Unlike the other simulators, this one should provide several machine code tests and performance monitoring. These features are useful for compiler development and for debugging parts of applications written in assembly language.

Another significant feature the simulator should have is extensibility. It should be designed to simulate more than one implementation of the architecture, and it should provide a set of parameters that define the architecture-dependent features.

1.2      Text Overview

This text is divided into six chapters.

This is Chapter 1.

Chapter 2 contains a short introduction to the IA‑64 architecture. It gives a general overview and describes the significant features.

Chapter 3 provides a detailed description of the IA64Emu simulator. It describes the individual parts of the simulator, the algorithms used, the implemented tests, and the simulator's extensibility.

Chapter 4 is a user guide. It lists the contents of the enclosed diskette, describes the compilation of IA‑64 applications and defines the simulator's user interface.

Chapter 5 presents other IA‑64 architecture simulators. A comparison with this simulator can be found here.

Chapter 6 concludes this text and summarizes which goals have been achieved and which have not.

2        IA-64 Architecture Overview

This chapter gives a brief overview of the IA-64 architecture, including some specifics of the Itanium implementation. Because of the size of the IA‑64 specification, this description is very short. To learn more about the architecture, consult the specification [4]. The Itanium microarchitecture is described in [9].

The first two sections of the chapter contain a general description of the architecture. The remaining ones highlight the specific features that are expected to increase processor performance the most.

2.1      General Overview

The IA-64 architecture was developed to increase the performance of Intel processors. It is derived mainly from HP processor architecture and is not just an extension of the x86 architecture. Nevertheless, it is compatible with x86. The compatibility is achieved neither by a similar assembly language nor by a similar instruction encoding; instead, a separate part of the processor executes x86 code. This compatibility is to be supported for a few years only and is expected to be removed in the near future. Memory can contain both forms of code, and some new instructions (also added to the IA-32 architecture) are defined to switch the processor between the two instruction sets.

To achieve higher performance, many new features have been added to the standard processor functionality.

2.2      EPIC Architecture

This newly developed architecture technology is called EPIC, an abbreviation for Explicitly Parallel Instruction Computing. It is in some ways an extension of the RISC architecture: it retains many RISC features, such as large register fields and memory access restricted to dedicated instructions. In addition, it provides mechanisms to achieve instruction level parallelism.

Instruction level parallelism is the ability to execute multiple instructions at the same time. The instructions are explicitly grouped into “instruction groups”. The grouping is performed by the compiler, which must meet the dependency requirements described in section 2.3. Instructions within a group can therefore be reordered or executed at the same time (if there is a sufficient number of functional units). The processor also allows the compiler to encode information about the expected behavior of a program. Hints about branch prediction, memory cache access, or instruction prefetching can be added to the instruction code. These hints do not affect program behavior; they only tell the processor how to optimize instruction execution, and their implementation can differ between processors. An implementation may even ignore them. For example, many optimization hints and performance-increasing features are not implemented in Itanium (the first processor based on the IA-64 architecture), although they are expected to be implemented in future processors.

Although RISC architectures aim to be very simple, so that the processor can be small and run at high frequencies, the IA-64 architecture makes the processor large, and it therefore runs at lower frequencies than is usual for other current processors.

2.3      Program Execution

To increase instruction level parallelism, IA-64 defines several types of functional units:

·        I‑unit – executes integer and multimedia operations.

·        M‑unit – executes memory access instructions and other system instructions.

·        B‑unit – executes branch instructions such as jumps, calls, loops, and returns.

·        F‑unit – executes floating‑point instructions.

More than one unit of each type can exist in the processor. Itanium has two I‑units, two M‑units, three B‑units, and two F‑units (see the Itanium microarchitecture specification [9]).

Instructions are encoded into bundles. Each bundle contains three slots and occupies 16 bytes of memory. Each slot is 41 bits long and contains one instruction; a few instructions occupy two slots. The remaining 5 bits of the bundle form a field called the bundle type. The bundle type determines the functional unit type of the instruction in each slot (I, M, B or F) and the positions of instruction stops. Stops divide the instructions into instruction groups. They are placed by the programmer or the compiler, who must ensure that nothing prevents the instructions within a group from being executed simultaneously.
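The bundle layout can be illustrated by a short Python sketch. It is only an illustration and is not taken from the simulator: the names `decode_bundle` and `encode_bundle` are hypothetical, and the placement of the bundle type in the five least significant bits, followed by slots 0 to 2, is assumed from the specification [4].

```python
SLOT_BITS = 41
TYPE_BITS = 5  # assumed width of the bundle type field: 128 - 3 * 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def decode_bundle(raw: bytes):
    """Split a 16-byte bundle into (bundle_type, [slot0, slot1, slot2])."""
    if len(raw) != 16:
        raise ValueError("a bundle occupies exactly 16 bytes")
    # Instructions are always accessed as little-endian units.
    value = int.from_bytes(raw, "little")
    bundle_type = value & ((1 << TYPE_BITS) - 1)
    slots = [(value >> (TYPE_BITS + i * SLOT_BITS)) & SLOT_MASK for i in range(3)]
    return bundle_type, slots


def encode_bundle(bundle_type: int, slots) -> bytes:
    """Inverse of decode_bundle, used here only to build test input."""
    value = bundle_type & ((1 << TYPE_BITS) - 1)
    for i, slot in enumerate(slots):
        value |= (slot & SLOT_MASK) << (TYPE_BITS + i * SLOT_BITS)
    return value.to_bytes(16, "little")
```

The sketch does not interpret the bundle type; mapping its value to unit types and stop positions requires the table given in the specification.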

During execution, instructions are loaded into a dispersal window, a processor resource that stores instruction bundles and prepares them for execution. The dispersal window in Itanium holds two bundles. Instructions in the dispersal window are dispersed to functional units according to the bundle type, and additional dispersal rules may apply. In Itanium, for example, an M‑unit instruction is dispersed to the lowest free M‑unit, whereas F‑unit instructions in the first bundle of the dispersal window are always dispersed to unit F0 and F‑unit instructions in the second bundle to unit F1. When all instructions in one or both bundles of the dispersal window have been processed, the window is rotated and new bundles are fetched from memory. If no suitable functional unit is free, the processor waits until a functional unit finishes its instruction.

To enable instruction level parallelism within an instruction group, instructions must meet the following dependency requirements:

·        RAW (read after write) and WAW (write after write) register dependencies are not allowed. Each register read observes the state from before the execution of the current instruction group.

·        RAW, WAW and WAR (write after read) memory dependencies are allowed. Each memory access observes the result of the most recent store to the same address. Because instructions may be reordered, the result of accesses to the same address within an instruction group is undefined.

·        There are some exceptions to these rules. For example, the result of an instruction that allocates a register stack frame can be observed immediately by other instructions in the same instruction group.
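A check of the register dependency rule can be sketched in Python as follows. This is a minimal sketch, not the test implemented in the simulator: the function name `find_register_hazards` and the representation of an instruction as a pair of register-name sets are assumptions made for illustration, and the exceptions mentioned in the last rule are ignored.

```python
def find_register_hazards(group):
    """Report RAW and WAW register dependencies inside one instruction group.

    `group` is a list of (reads, writes) pairs of register-name sets, one pair
    per instruction, in program order. WAR dependencies are not reported
    because they are allowed.
    """
    hazards = []
    last_writer = {}  # register name -> index of the instruction that wrote it
    for index, (reads, writes) in enumerate(group):
        for reg in sorted(reads):
            if reg in last_writer:
                hazards.append(("RAW", reg, last_writer[reg], index))
        for reg in sorted(writes):
            if reg in last_writer:
                hazards.append(("WAW", reg, last_writer[reg], index))
        # Record this instruction's writes only after checking it, so that an
        # instruction reading and writing the same register is not flagged.
        for reg in writes:
            last_writer[reg] = index
    return hazards
```

An empty result means that, under the stated simplifications, the group could be executed in parallel without violating the register rule.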

A program can be executed at one of four privilege levels; the current level is stored in the Processor Status Register. Applications are executed at privilege level 3 (the least privileged), and the kernel is executed at privilege level 0. A switch between privilege levels is performed by a special instruction according to the privilege level and flags of the current virtual page.

2.4      Register Fields

To sustain instruction level parallelism, Intel incorporated a large amount of processor resources.

There are many registers used in the IA-64 architecture as defined in the specification [4]. They are divided into several register fields according to their functionality.

2.4.1      General Registers

There are 128 general registers used for integer arithmetic, multimedia operations, and memory access. Each register has 64 bits of data storage and one “NaT” bit. General registers are numbered GR0 through GR127. GR0 is a special register that always contains zero.

The NaT (Not a Thing) bit indicates whether the register contains a defined value. It can be set by a speculative memory load when the value cannot be read because the memory page is absent.

General registers GR0 through GR31 are called the static general registers. They are visible to all procedures and can be accessed without restriction (except for GR0, which is read-only; a write to GR0 causes an Illegal Operation Fault). Registers GR32 through GR127 are termed the stacked general registers. They are divided into register stack frames and can be saved to memory automatically. A register stack frame is local to a procedure and is not visible to other procedures. Before accessing these registers, a procedure must allocate a new stack frame and state how many registers it will use. The register stack frame consists of two parts: local registers and output registers. Local registers are intended for local use only; output registers are used to pass data between procedures. A special mechanism called register renaming, described in section 2.4.4, is implemented for this purpose.

2.4.2      Floating-Point Registers

IA-64 defines 128 floating-point registers numbered FR0 through FR127. These registers are used for floating-point and some integer computations. There are no integer multiplication and division instructions. Therefore, floating-point numbers and instructions are used to execute such operations.

Each register has 82 bits of data storage. It is an extension to IEEE real types and can be converted to and from them. Besides standard IEEE values, a new value is defined. It is termed NaTVal (Not a Thing Value). This value has the same purpose as NaT in general registers.

Floating-point registers are divided into two parts. Registers FR0 through FR31 are called static floating-point registers. Registers FR32 through FR127 are called rotating floating-point registers. Unlike the stacked general registers, registers FR32 through FR127 are visible to all procedures. They cannot be renamed; they can only be rotated, which is used in software‑pipelined loops.

FR0 always reads as zero and FR1 always reads as one. Writing these registers causes an Illegal Operation Fault.

2.4.3      Predicate Registers

The IA-64 architecture has 64 predicate registers, numbered PR0 through PR63. Each register is 1 bit long and stores a true or false value used for instruction predication.

Instruction predication is the conditional execution of instructions. Each instruction encodes the number of a predicate register, and the processor executes the instruction or not according to the contents of that register. This feature can speed up program execution significantly because it saves the processor many branches, which often break the smooth flow of instructions.

Predicate registers are divided in the same way as floating-point registers. PR0 through PR15 are called static predicate registers. PR16 through PR63 are called rotating predicate registers. The latter ones are also used in software-pipelined loops.

PR0 always reads as one (instructions predicated by this register are always executed) and writes are always discarded.
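The behavior of the predicate registers can be sketched in Python as follows. The class `PredicateFile` and the function `execute_predicated` are hypothetical names chosen for this illustration; register rotation is left out.

```python
class PredicateFile:
    """64 one-bit predicate registers; PR0 is hardwired to true."""

    def __init__(self):
        self._bits = [False] * 64
        self._bits[0] = True  # PR0 always reads as one

    def read(self, number):
        return self._bits[number]

    def write(self, number, value):
        if number != 0:  # writes to PR0 are discarded
            self._bits[number] = bool(value)


def execute_predicated(predicates, qualifying, operation):
    """Run `operation` only if the predicate register `qualifying` is true."""
    if predicates.read(qualifying):
        operation()
        return True
    return False
```

With this model, both arms of an if-then-else can be issued without a branch: the "then" instructions are predicated by one register and the "else" instructions by its complement, and only one arm takes effect.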

2.4.4      Current Frame Marker and Register Stack Engine

A special register called Current Frame Marker (CFM) is used to manage register stack frames. It consists of several fields – the number of local registers, the number of output registers and three bases (GR, FR and PR) for register rotation.

A procedure can access only the local and output registers of its frame on the register stack. The first local register is numbered GR32; the output registers immediately follow the local registers. When a procedure call is executed, the CFM value is updated. Registers are treated according to the following rules:

·        The output registers of the caller become the local registers of the callee.

·        The first local register of the callee (the first output register of the caller) is renamed to GR32.

·        When the callee wants to use more local registers than the caller provided as output registers, it must extend the register stack frame by a special instruction; it can also define the number of output registers for further calls.
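The renaming rules above can be sketched in Python as follows. This is a simplified model under stated assumptions, not the simulator's implementation: the class `RegisterStack` and its method names are hypothetical, the caller's frame is kept on a Python list instead of in an application register, and spilling to memory by the Register Stack Engine is not modeled.

```python
class RegisterStack:
    """Stacked general registers GR32 and up, seen through the current frame."""

    def __init__(self, physical_size=96):
        self.physical = [0] * physical_size
        self.base = 0   # physical index that GR32 currently maps to
        self.sol = 0    # number of local registers
        self.sof = 0    # frame size (local + output registers)
        self._saved = []  # caller frames, standing in for the saved frame state

    def _index(self, gr):
        offset = gr - 32
        if not 0 <= offset < self.sof:
            raise IndexError(f"GR{gr} is outside the current frame")
        return (self.base + offset) % len(self.physical)

    def read(self, gr):
        return self.physical[self._index(gr)]

    def write(self, gr, value):
        self.physical[self._index(gr)] = value

    def alloc(self, n_local, n_output):
        """Resize the current frame (the 'special instruction' of the text)."""
        self.sol, self.sof = n_local, n_local + n_output

    def call(self):
        """The caller's output registers become the callee's frame at GR32."""
        self._saved.append((self.base, self.sol, self.sof))
        self.base += self.sol
        self.sof -= self.sol
        self.sol = 0

    def ret(self):
        """Restore the caller's frame."""
        self.base, self.sol, self.sof = self._saved.pop()
```

For example, a caller with three local and two output registers writes an argument to GR35; after `call()`, the callee reads the same value from GR32.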

To flush general registers to memory, or to fetch them back, when free registers run short, the processor contains a special unit called the Register Stack Engine (RSE). The unit can store or load a set of registers in the background. It works independently of the execution part of the processor and supports several modes of operation. The modes determine whether registers are stored or loaded only when necessary, or whether they are stored or loaded eagerly so that the program is not delayed by a lack of free register space or, conversely, by registers that have not yet been loaded.

2.4.5      Application Registers

The IA-64 architecture provides a set of 128 application registers, numbered AR0 through AR127. Each register has a special meaning. They include registers used by the Register Stack Engine mentioned above, registers containing the previous procedure state, registers containing the loop and epilog counts for counted loops and software-pipelined loops, registers used for transferring information between the kernel and applications, and registers used when the IA-32 instruction set is executed. Many application registers are either reserved for future use or ignored.

2.4.6      Other Registers

Several other special purpose registers are present in the IA-64 architecture. They are briefly described below. All of them are 64-bit registers.

·        IP (Instruction Pointer) – this register contains the address of the currently executed instruction bundle.

·        BR (Branch Registers) – there are eight branch registers (BR0 through BR7) used for storing branch target addresses or return addresses.

·        PSR (Processor Status Register) – this is a status register containing flags that determine the processor behavior. Some of them can be read or written at the application level; others can be accessed at a privileged level only. Applications can access, for example, the endian flag, which tells the processor whether to perform memory accesses in big-endian or little-endian byte order. Other flags determine whether memory accesses must be aligned, whether external interruptions are allowed, which instruction set is executed, and so on.

·        CR (Control Registers) – 128 registers are used to store memory access control flags, interruption addresses, and interruption states. Many of these registers are reserved for future use.

·        Other sets of registers, such as the performance monitoring registers, eight region registers, debugging registers, and registers used for processor identification.

2.5      Interruptions

IA-64 has two classes of interruptions: IVA-based interruptions and PAL-based interruptions. They are divided into these classes according to how they are serviced.

IVA-based interruptions are serviced via the Interruption Vector Table (IVT), whose location is given by the Interruption Vector Address (IVA). When an interruption occurs, the processor saves some architectural state, such as the Instruction Pointer, the slot number of the interrupted instruction, and the Processor Status Register. Then the interruption vector is read from the table and a call to the specified address is executed. After return from the interruption, the previous processor state is restored. To decrease the overhead of interruption processing, there are two banks of general registers GR16 through GR31. These banks have to be switched explicitly.

IVA-based interruptions are External Interrupts, Faults and Traps. External Interrupts are requests from I/O devices or from other processors. Non-Maskable Interrupts (NMI) are special external interrupts used to request critical operating system services. Faults are raised when an inconsistent state occurs and system intervention is necessary to correct the execution state; after return from a fault, the faulting instruction is restarted. Traps are raised when system intervention is necessary after the execution of an instruction; the instruction that caused a trap is therefore not executed again.

PAL-based interruptions are serviced by the operating system or system firmware. They are invoked through hardware entry points, for example a specific pin on the processor. PAL stands for Processor Abstraction Layer. This layer was defined to provide an abstraction between the hardware and the platform firmware or system software.

PAL-based interruptions are Aborts and Interrupts. An Abort occurs when an internal malfunction is detected and the system needs immediate correction, or when the processor is powered up and needs to perform self-test and initialization. Interrupts (more precisely, Platform Management Interrupts, PMI) are raised by platform management when it needs to perform a function such as memory scrubbing or platform error handling. When a PMI is raised, the requested function must be specified. Functions include processor self-tests; functions that obtain the processor, RSE, or register state; functions that return information about memory caches, TLB configuration, or VHPT configuration; and functions that force the processor into the shutdown or halt state.

2.6      Memory and IA-64 Addressing

The memory hierarchy consists of N levels of memory caches and the main memory. Itanium has three levels of cache (L1 and L2 are on the die, L3 is off-chip). L1 is divided into an instruction part and a data part. L2 and L3 are unified and hold both data and instructions. An application can influence the contents of the caches through locality hints, which specify the cache hierarchy level that should be accessed. Besides locality hints, there are instructions for explicit prefetches and flushes between memory and cache.

Memory can be accessed in units of 1, 2, 4, 8, 10 and 16 bytes. A flag in the Processor Status Register determines whether address alignment is checked. Data reads and writes can use either little-endian or big-endian byte ordering. Instructions must be accessed as little-endian units.

2.6.1      Virtual Addressing

As defined in the specification [4], IA-64 uses a flat linear virtual address space. Each virtual address is 64 bits long. Virtual memory is divided into 2^24 virtual regions, each containing one address space. Within a virtual address, the region is selected by the three most significant bits. These 3 bits do not hold the region number itself, but the number of the region register in which the region number is stored. In this way, the processor can access eight regions concurrently. Regions can be coalesced to create a 62-bit, 63-bit or 64-bit address space, but then fewer regions can be accessed concurrently. Usually, the system uses one or two regions for its own purposes. An application has either one region for both data and code or separate regions for data and code. On a task switch, only the contents of the region registers need to be changed to access another application.

Several structures are provided to support the virtual to physical pages mapping.

Translation Lookaside Buffer (TLB) – this buffer holds information about page mappings. It can have more than one level and consists of the Instruction TLB and the Data TLB. Itanium has two levels of the Data TLB and one level of the Instruction TLB.

Each TLB is divided into two subsections: the Translation Registers (TR) and the Translation Cache (TC). The Translation Registers form an array that holds the most important translations. Access to this part is faster than access to the TC. The system must insert records into the Translation Registers explicitly; their contents are not changed by the processor. Itanium does not implement TR. The Translation Cache is a larger and slower array. The processor manages the entries in this structure, but the system can insert or purge entries too.

Translation entries consist of several fields.

·        Dirty Bit – specifies whether data have been written to the page, so that it has to be written back to memory.

·        Memory Attribute – specifies the cacheability and write coalescing of the physical page.

·        Privilege Level – specifies the privilege level or the promotion level of the page.

·        Access Rights – determine read, write or execute rights.

·        Physical Page Number – specifies the number of the physical page (the most significant bits of the physical address). The size of the field depends on the page size.

·        Page Size – various page sizes are supported by the IA-64 architecture. Pages can be from 4 KB to 256 MB in size.

·        Protection Key – besides the Access Rights, a page has a specific protection key. This key points to the Protection Key Registers, where additional read, write or execute restrictions can be defined. With this mechanism, the processor can divide pages into protection domains and restrict all pages of a domain at once.

·        Virtual Page Number – contains the page number within the virtual region.

·        Virtual Region Identifier – specifies the address space of the page.

Each virtual address consists of three fields:

·        Virtual Region Number (3-bits) – specifies index to region registers.

·        Virtual Page Number (number of bits depends on page size)

·        Page Offset (the rest of the address)

To translate a virtual address to a physical one, the region identifier is fetched from the region register selected by the Virtual Region Number. The appropriate translation record is then looked up in the TLB using the Region Identifier and the Virtual Page Number. If the translation is not found, an exception is raised. Access rights are checked against the access rights in the translation record and the protection key registers. The physical address is composed of the Physical Page Number and the Page Offset.
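The translation steps can be sketched in Python as follows. The sketch is illustrative only: the names `split_virtual_address`, `translate` and `TranslationFault` are hypothetical, a single 4 KB page size is assumed, the TLB is modeled as a dictionary keyed by region identifier and virtual page number, and the checks of access rights and protection keys are omitted.

```python
PAGE_SHIFT = 12  # assumption: every page is 4 KB


class TranslationFault(Exception):
    """Raised when no translation record is found."""


def split_virtual_address(va, page_shift=PAGE_SHIFT):
    """Return (virtual region number, virtual page number, page offset)."""
    vrn = (va >> 61) & 0x7                       # three most significant bits
    vpn = (va & ((1 << 61) - 1)) >> page_shift   # page number within the region
    offset = va & ((1 << page_shift) - 1)
    return vrn, vpn, offset


def translate(va, region_registers, tlb, page_shift=PAGE_SHIFT):
    """Translate a 64-bit virtual address to a physical address."""
    vrn, vpn, offset = split_virtual_address(va, page_shift)
    rid = region_registers[vrn]  # the region register holds the region identifier
    try:
        ppn = tlb[(rid, vpn)]
    except KeyError:
        raise TranslationFault(f"no translation for region {rid:#x}, page {vpn:#x}") from None
    return (ppn << page_shift) | offset
```

Because the lookup key contains the region identifier rather than the 3-bit region number, two applications can use the same virtual page numbers without conflict as long as their region registers hold different identifiers.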

Virtual Hash Page Table (VHPT) – this is an extension of the TLBs. The VHPT is placed in the virtual address space in memory. The system can configure the processor to search the VHPT when a translation record is not found in the TLB. The processor does not perform any writes into the VHPT; the system is responsible for the table contents and has to keep them coherent with the TLB contents. VHPT record fields are similar to TLB fields, except that instead of the Virtual Page Number and the Virtual Region Identifier, a record holds a tag computed from these two fields. The tag is used for hashing during the search for a translation record.

A VHPT walker can be enabled to search the VHPT for the appropriate translation. When a translation record is not found in the TLB and the VHPT walker is disabled, an exception is raised; otherwise, the walker searches for the translation in the VHPT. When the translation is found, it is inserted into the Translation Cache, the physical address is computed, and memory is accessed.

Physical Addressing – the processor can use physical addressing instead of virtual addressing when the appropriate flags are set. The physical address space is linear, and no access rights restrict memory access (the switch to physical addressing can be made at a privileged level only). Itanium implements fewer than 64 bits of the physical address.

2.6.2      Control Speculation

Memory can be accessed in the standard way, in which the processor is asked to fetch data from or store data into memory. When data are needed immediately after a load, the processor must wait until they have been read. In some cases, the load instruction can be moved a sufficient distance ahead of the consumer of the loaded data. When the consumer of the target register lies at the beginning of a branch, however, this is not possible: the load would have to be moved above the branch, where it can cause a Page Fault Interruption even though the loaded data may never be used, which increases the execution overhead.

IA-64 provides so-called “speculative loads”. They can be either control speculative or data speculative; data speculation is discussed in section 2.6.3. Control speculative loads and stores perform a memory access only when the specified page is present in memory. Otherwise, they set the NaT bit in the target register. When the processor subsequently branches to the code where the loaded value is used, the value must be checked explicitly. The check can be performed in two ways. The first is a check load: this instruction reads the NaT bit of the specified register and, when it is set, performs a non-speculative load. The other is a check branch instruction: when the NaT bit of the target register is set, the processor branches to the specified recovery code. This recovery code must be created explicitly by the compiler.

When a speculative load fails and the NaT bit is set, some instructions, such as arithmetic instructions, can still operate on the affected operands; the NaT bit is then propagated to their target registers. This state cannot be corrected by a check load alone; recovery code has to be created that loads the proper values and repeats the whole computation.
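The NaT mechanism can be sketched in Python as follows. All names (`Reg`, `speculative_load`, `add`, `check`) are hypothetical, and memory is modeled as a dictionary in which a missing key stands for an absent page.

```python
class Reg:
    """A general register value together with its NaT bit."""

    def __init__(self, value=0, nat=False):
        self.value = value
        self.nat = nat


def speculative_load(memory, address):
    """Load if the address is present; otherwise defer the fault by setting NaT."""
    if address in memory:
        return Reg(memory[address])
    return Reg(0, nat=True)


def add(a, b):
    """Arithmetic propagates NaT from either operand to the result."""
    return Reg((a.value + b.value) & (2**64 - 1), nat=a.nat or b.nat)


def check(reg, recovery):
    """Check branch: run the recovery code when the register carries NaT."""
    return recovery() if reg.nat else reg
```

In this model the recovery code is a function that reloads the operand non-speculatively and repeats the dependent computation, which mirrors the reason why a check load alone is not sufficient once NaT has been propagated.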

Besides the load and store instructions, semaphore instructions and instructions that exchange a register with memory can also perform a speculative access to memory.

2.6.3      Data Speculation

Data speculation can be used to speed up program execution when ambiguous memory dependencies occur, that is, when it cannot be determined in advance whether memory stores and loads will overlap. Without data speculation, if a memory store is to be performed first and a memory load afterwards, the store must always precede the load in the instruction code. If processing the loaded data takes a long time, and it is not known whether the store will be executed at all or whether it will go to the same address, a data speculative load can be used, which can be performed before the store.

A data speculative load fetches data from the specified virtual address into a register and also inserts a record into the ALAT (Advanced Load Address Table). This table contains entries describing data speculative memory loads. Each entry consists of several fields: the physical address and size of the loaded data and the type and number of the target register. When a store overlaps the address of a data speculative load, the corresponding entry is removed from the ALAT. Before the loaded data are used, a check load or a check branch instruction must be performed to ensure that the data are correct. Just as the control speculative check instructions test the NaT bit, the data speculative check instructions search for the appropriate ALAT entry. If the entry is not found, either the memory load is performed again or a branch to recovery code is taken.

ALAT entries can be explicitly purged by specific instructions. Data speculation and control speculation can be combined to achieve higher performance.
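The interplay of advanced loads, stores and check loads can be sketched in Python as follows. The class `Alat` and its methods are hypothetical and heavily simplified: memory is a dictionary holding one value per address, the size field is used only for the overlap test, and the table has unlimited capacity.

```python
class Alat:
    """A toy Advanced Load Address Table keyed by target register."""

    def __init__(self):
        self._entries = {}  # target register -> (address, size)

    def advanced_load(self, memory, register, address, size):
        """Data speculative load: read the value and record the access."""
        self._entries[register] = (address, size)
        return memory.get(address, 0)

    def store(self, memory, address, size, value):
        """Store a value and remove every entry that overlaps the stored range."""
        memory[address] = value
        self._entries = {
            reg: (a, s)
            for reg, (a, s) in self._entries.items()
            if a + s <= address or address + size <= a
        }

    def check_load(self, memory, register, address, current_value):
        """Keep the speculative value if its entry survived, else reload."""
        if register in self._entries:
            return current_value
        return memory.get(address, 0)
```

If no intervening store touched the loaded address, the check load costs nothing beyond the table lookup; otherwise it returns the value written by the store.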

2.7      Some Outstanding Instructions

Some IA‑64 specific or enhanced instructions are described in this section, namely the compare and multimedia instructions. Other improved IA‑64 instructions and technologies are described in the following sections.

2.7.1      Compare Type Instructions

Parallel compares were developed to take advantage of instruction level parallelism. Compare instructions generally test source operands according to a specified relation and compare type and write the result of the compare to the target predicate registers. The compare instructions comprise instructions comparing two integer values or two floating‑point values and instructions testing a specified bit of a register, the NaT value or the floating‑point number class (normal/denormal numbers, infinities, zeros or the NaTVal value).

There are five compare types: Normal, Unconditional, AND, OR, and DeMorgan. The targets of a compare instruction are two predicate registers. The first one receives the compare result, the other one receives the negation of the result.

Normal compare compares two source operands and writes the result and its negation into the target predicates.

Unconditional compares behave like normal compares when the qualifying predicate of the compare instruction is set. When the predicate is not set, so that the instruction would otherwise not be performed, the compare writes zeros to both target predicates.

The AND type compare enables parallel compares by allowing multiple concurrent compares to write the same target predicate register. When the result of the compare is true, it does not write anything. When the result is false, it writes zero to both target predicates. This feature can be used to evaluate an expression such as a < b && c < d && e < f. First the target predicate register has to be set to 1. Then the compiler can schedule three AND type compares into one instruction group. These compares are executed at the same time. If any of these compares fails, the target predicate is reset; otherwise it remains set.

The OR type compare behaves almost like the AND type compare. It sets both target predicates when the result is true; otherwise they are left unchanged. To work properly, the predicates must be reset beforehand.

The DeMorgan type compare is a combination of the previous two. An OR type compare is applied to one of the predicate targets and an AND type compare to the other one.
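The five compare types can be summarized by a short sketch. The names CmpType, PredPair and applyCompare are hypothetical. The DeMorgan case follows the or.andcm form, in which the AND half tests the complemented result; this reading goes beyond what the text above states and should be treated as an assumption.

```cpp
enum class CmpType { Normal, Unconditional, And, Or, DeMorgan };

// The two target predicates of a compare instruction.
struct PredPair {
    bool p1;
    bool p2;
};

// qp:     value of the qualifying predicate of the compare instruction
// result: outcome of the tested relation
// t:      current values of the target predicates
PredPair applyCompare(CmpType type, bool qp, bool result, PredPair t) {
    switch (type) {
    case CmpType::Normal:
        if (qp) { t.p1 = result; t.p2 = !result; }
        break;
    case CmpType::Unconditional:
        if (qp) { t.p1 = result; t.p2 = !result; }
        else    { t.p1 = false;  t.p2 = false; }   // written even when not performed
        break;
    case CmpType::And:
        if (qp && !result) { t.p1 = false; t.p2 = false; }  // only a failure writes
        break;
    case CmpType::Or:
        if (qp && result) { t.p1 = true; t.p2 = true; }     // only a success writes
        break;
    case CmpType::DeMorgan:
        // Assumed or.andcm form: OR on p1, AND with the complemented result on p2.
        if (qp && result) { t.p1 = true; t.p2 = false; }
        break;
    }
    return t;
}
```

For the expression a < b && c < d && e < f, the target pair is first set to 1 and then three AND type compares are applied; because each of them can only clear the targets, the order in which they execute does not matter, which is what allows them to share one instruction group.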

Integer compare instructions can be performed by I‑units and M‑units. Itanium has two M‑units and two I‑units. Hence, this processor can evaluate an expression consisting of as many as four AND type or OR type compares in one clock tick.

2.7.2      Multimedia Instructions

Multimedia instructions operate on general registers but treat them as eight 1‑byte, four 2‑byte or two 4‑byte elements. Arithmetic, shift and data arrangement instructions are defined on these elements.

The arithmetic instructions are parallel addition that discards the overflow bit of each element, parallel addition with signed and unsigned saturation (the result of an addition is bounded by the maximal or minimal value of a register element), parallel normal and saturated subtraction, parallel average of difference, parallel multiplication, parallel compare (comparing corresponding elements of the source registers), parallel sum of absolute differences and parallel minimum and maximum.
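The difference between the wrap‑around and the saturating form of a parallel addition can be shown on eight 1‑byte elements packed in a 64‑bit value. This is a generic sketch with hypothetical function names; it treats all elements as unsigned and does not reproduce the exact operand signedness rules of the individual IA‑64 instruction forms.

```cpp
#include <algorithm>
#include <cstdint>

// Adds eight unsigned 1-byte elements; the overflow bit of each element is discarded.
std::uint64_t paddModular8(std::uint64_t a, std::uint64_t b) {
    std::uint64_t out = 0;
    for (int i = 0; i < 8; ++i) {
        unsigned x = static_cast<unsigned>((a >> (8 * i)) & 0xFF);
        unsigned y = static_cast<unsigned>((b >> (8 * i)) & 0xFF);
        unsigned sum = (x + y) & 0xFF;                 // wrap around inside the element
        out |= static_cast<std::uint64_t>(sum) << (8 * i);
    }
    return out;
}

// Adds eight unsigned 1-byte elements; each sum is bounded by the maximum value 255.
std::uint64_t paddUnsignedSat8(std::uint64_t a, std::uint64_t b) {
    std::uint64_t out = 0;
    for (int i = 0; i < 8; ++i) {
        unsigned x = static_cast<unsigned>((a >> (8 * i)) & 0xFF);
        unsigned y = static_cast<unsigned>((b >> (8 * i)) & 0xFF);
        unsigned sum = std::min(x + y, 255u);          // saturate instead of wrapping
        out |= static_cast<std::uint64_t>(sum) << (8 * i);
    }
    return out;
}
```

In both functions a carry out of one element never reaches its neighbor, which is the property that distinguishes a parallel addition from an ordinary 64‑bit addition.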

Parallel shifts are signed or unsigned shifts to the left or to the right that shift each element separately.

Data arrangement instructions assemble the target register from parts of the source registers. These instructions may interleave odd, even, most significant or least significant elements of both source registers. They also make it possible to permute register elements or to convert from larger to smaller elements with saturation.

2.8      Branch Instructions

2.8.1      Branch Types

There are several types of branches: plain branches, procedure calls, returns, IA‑32 instruction set calls, counted loop branches, modulo‑scheduled counted loops and modulo‑scheduled while loops. Every branch type can be conditional or unconditional depending on its predicate register. Branches are divided into three categories: indirect branches, long branches and IP‑relative branches. Indirect branches use a branch register to specify the target address. Long branches occupy two slots of a bundle and can specify a 60‑bit displacement. These branches can target the whole address space. IP‑relative branches use a 21‑bit displacement, which allows targets within ±16 MB.
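The ±16 MB range of IP‑relative branches follows from the displacement being counted in 16‑byte bundles: a signed 21‑bit value multiplied by 16 covers 2^24 bytes in each direction. A minimal sketch of the target computation, with a hypothetical function name and under the assumption that the displacement is applied to the bundle‑aligned IP:

```cpp
#include <cstdint>

// imm21: raw 21-bit displacement field, interpreted as a signed number of bundles.
std::uint64_t ipRelativeTarget(std::uint64_t ip, std::uint32_t imm21) {
    std::int64_t bundles = static_cast<std::int64_t>(imm21 & 0x1FFFFF);
    if (bundles & 0x100000) {
        bundles -= 0x200000;                 // sign-extend from 21 bits
    }
    std::uint64_t base = ip & ~std::uint64_t{15};   // bundles are 16-byte aligned
    // Unsigned arithmetic wraps modulo 2^64, so negative displacements work too.
    return base + static_cast<std::uint64_t>(bundles * 16);
}
```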

The call branch provides standard procedure calls. When a call is performed, the stack frame is moved so that the output registers of the caller's stack frame become local registers, and there are no output registers in the new frame. The Previous Function State register is copied to the register specified as an operand and the Current Frame Marker register is copied to the Previous Function State. The return address is written into the specified branch register.

The return branch provides a standard return from a procedure. There is no call stack, so a branch register must be used to obtain the return address. The caller's register stack frame is restored because the Current Frame Marker is filled from the Previous Function State value.

There is a branch instruction providing a call to IA‑32 instruction code. After this call the processor executes IA‑32 instructions until another IA‑32 instruction returning back to IA‑64 code is reached.

The counted loop branches are used for loops with a known number of iterations. First the Loop Count register must be filled. When the loop branch instruction is reached, it decrements this register and decides according to its value whether to jump back to the beginning of the loop.

Modulo‑scheduled branch instructions, together with register rotation and predicates, are used to implement software pipelined loops. Software pipelining is similar to hardware instruction pipelining. The algorithm is divided into several stages with some instructions in each stage. Within one loop iteration all stages of the algorithm are performed, but each stage uses different data. Each stage is predicated by its own predicate register. Before entering the loop all the rotating predicates have to be reset, except for PR16, which has to be set. These loops have three phases: prolog, kernel and epilog. The prolog phase fills the software pipeline. When a new loop iteration is started, the predicate, general and floating‑point registers are rotated and the first rotating predicate register (PR16) is set. As a result, the data in the registers are processed by the next stage of the algorithm. Because of the rotation of the PR registers and the setting of PR16, one more stage is started in each iteration until the pipeline is filled. During the kernel phase, all stages are performed. When there are no more data for the first stage of the algorithm (this happens when the Loop Count register is zero), the epilog phase is started. During the epilog phase, all registers are also rotated but the first rotating predicate register (PR16) is reset. This ensures that one stage fewer executes in each iteration. The number of epilog iterations is determined by the Epilog Count register. It must be filled with the number of stages before the loop starts.

While‑loops are similar to modulo‑scheduled counted loops. The only difference is how the end of the loop is determined. No Loop Count is used. While the predicate register qualifying the while‑loop branch instruction is set, the prolog or the kernel phase is performed. When the predicate is reset, the epilog phase can be performed according to the Epilog Count value.
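The interplay of the Loop Count, the Epilog Count and the rotating predicates in a modulo‑scheduled counted loop can be traced with a small model. The names LoopStats and runModuloLoop are hypothetical. The sketch assumes that the Loop Count is initialized to the number of iterations minus one, and it models only as many rotating predicates as there are stages (rot[0] stands for PR16); rotation of general and floating‑point registers is left out.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

struct LoopStats {
    int bodyPasses = 0;           // how many times the loop body was entered
    std::vector<int> stageRuns;   // how many times each stage actually executed
};

LoopStats runModuloLoop(int iterations, int stages) {
    LoopStats stats;
    stats.stageRuns.assign(stages > 0 ? stages : 0, 0);
    if (iterations <= 0 || stages <= 0) return stats;

    std::uint64_t lc = static_cast<std::uint64_t>(iterations - 1);  // assumed LC setup
    std::uint64_t ec = static_cast<std::uint64_t>(stages);          // EC = stage count
    std::deque<bool> rot(stages, false);   // rotating stage predicates, all reset
    rot[0] = true;                         // except PR16, which is set

    bool taken = true;
    while (taken) {
        ++stats.bodyPasses;
        for (int s = 0; s < stages; ++s) {
            if (rot[s]) ++stats.stageRuns[s];   // stage s runs only if predicated on
        }
        // Loop branch at the end of the body.
        bool newFirst;
        if (lc != 0)     { --lc; newFirst = true;  taken = true;  }  // prolog/kernel
        else if (ec > 1) { --ec; newFirst = false; taken = true;  }  // epilog
        else             { if (ec == 1) --ec; newFirst = false; taken = false; }
        // Rotation: every predicate moves one position up, PR16 gets the new value.
        // (Rotating on the final pass as well is a simplification of this sketch.)
        rot.pop_back();
        rot.push_front(newFirst);
    }
    return stats;
}
```

For 5 iterations and 3 stages the body is entered 7 times (5 + 3 − 1) and every stage executes exactly 5 times: two passes of prolog, three of kernel and two of epilog.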

2.8.2      Branch Prediction

Branch prediction is a way to reduce the execution overhead incurred when the control flow changes. Without branch prediction, the instruction pipeline must be discarded after every control flow change. There are many algorithms for determining whether a branch will be taken or not. Such a prediction allows the processor to fill the instruction pipeline with the instructions following either the branch instruction or the branch target.

IA‑64 enables the compiler to use branch prediction hints to provide information about the branch behavior. There are three types of hints:

Whether prediction hints describe how to determine whether the branch will be taken. There are four Whether prediction hints.

·        Static Not‑Taken – Branch is predicted not taken and no dynamic prediction structures are allocated for this branch.

·        Static Taken – Branch is predicted taken and no dynamic prediction structures are allocated for this branch.

·        Dynamic Not‑Taken – Prediction depends on the state of dynamic prediction structures described below. When no information is found for this branch, the branch is predicted not taken.

·        Dynamic Taken – As for the previous case, if the information in dynamic prediction structures is found, the branch is predicted according to this information, otherwise it is predicted taken.

Sequential prefetch hints define how much of the code the processor should prefetch at the branch target. There are two values of this hint:

·        Prefetch few lines – prefetching at the branch target is stopped after a few lines depending on implementation

·        Prefetch many lines – prefetching at the branch target is stopped after more lines.

Deallocation hint can be used to tell the processor to deallocate dynamic prediction information for the branch.

Information about branch prediction can be provided before the execution of the branch instruction by branch predict instructions. These instructions must be placed at a sufficient distance before the branch instruction and they can provide the following information:

·        Location of a branch – this information identifies the bundle containing the branch instruction to which this predict instruction applies.

·        Target of a branch – The target address is computed from this information and remembered.

·        Importance hint – defines the importance of a branch. When a branch is important, its branch target address is stored in faster structures than in the other case.

·        Whether prediction hint – this hint has the same meaning as in the branch instruction described above.

Itanium uses several structures and supporting units to implement dynamic branch prediction.

·        Branch Prediction Table (BPT) – this structure keeps history information about the executed branches. A new record is inserted into this structure when a dynamic prediction is hinted for an executed branch. The prediction algorithm used in the BPT is a local 2‑level predictor with 4 bits of history.

·        Multiway Branch Prediction Table (MBPT) – this structure is almost the same as BPT. The only difference is that it is used for bundles with multiple branch instructions and has prediction resources for each slot.

·        Target Address Cache (TAC) – branch predict instructions store the branch target address here after it is computed. When the branch address is found in this structure, the branch is always predicted taken.

·        Target Address Register (TAR) – this is the faster and smaller version of the previous structure. In this structure branch predict instructions store the branch target addresses for important branches. Branch instructions having entries in this structure are also always predicted taken.

·        RSB is the address stack where return addresses are pushed during calls and popped when a return occurs. This structure speeds up returns from procedures.

·        BAC is the unit which computes a target address defined in branch predict instructions.
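A local 2‑level predictor with 4 bits of history, combined with the Whether hints as a fallback, can be sketched as follows. This is a textbook form of the technique and not a description of the actual BPT organization: the class name, the unbounded table keyed by branch address, the per‑branch array of sixteen 2‑bit counters and their initial values are all assumptions of this example.

```cpp
#include <array>
#include <cstdint>
#include <unordered_map>
#include <utility>

enum class WhetherHint { StaticNotTaken, StaticTaken, DynamicNotTaken, DynamicTaken };

class LocalTwoLevelPredictor {
public:
    bool predict(std::uint64_t branchAddr, WhetherHint hint) const {
        if (hint == WhetherHint::StaticTaken) return true;
        if (hint == WhetherHint::StaticNotTaken) return false;
        auto it = table_.find(branchAddr);
        if (it == table_.end()) {
            // No dynamic information yet: fall back to the hint.
            return hint == WhetherHint::DynamicTaken;
        }
        const Entry& e = it->second;
        return e.counters[e.history] >= 2;   // upper half of the 2-bit counter
    }

    void update(std::uint64_t branchAddr, WhetherHint hint, bool taken) {
        // Static hints allocate no dynamic prediction structures.
        if (hint == WhetherHint::StaticTaken || hint == WhetherHint::StaticNotTaken) return;
        auto res = table_.emplace(branchAddr, Entry{});
        Entry& e = res.first->second;
        if (res.second) {
            // Assumed initial state: weakly biased in the direction of the hint.
            e.counters.fill(hint == WhetherHint::DynamicTaken ? 2 : 1);
        }
        std::uint8_t& c = e.counters[e.history];
        if (taken) { if (c < 3) ++c; } else { if (c > 0) --c; }
        // First level: shift the outcome into the 4-bit local history.
        e.history = static_cast<std::uint8_t>(((e.history << 1) | (taken ? 1 : 0)) & 0xF);
    }

private:
    struct Entry {
        std::uint8_t history = 0;                  // last four outcomes of this branch
        std::array<std::uint8_t, 16> counters{};   // one 2-bit counter per history pattern
    };
    std::unordered_map<std::uint64_t, Entry> table_;
};
```

Because the counter is selected by the recent history of the same branch, a strictly alternating branch is predicted correctly after a short warm‑up, which a single 2‑bit counter cannot achieve.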

The process of branch prediction described above is generally defined in the IA‑64 architecture. Itanium does not implement many of these features and makes many exceptions to the general algorithms (see Itanium microarchitecture specification [9]). These algorithms and exceptions are described in detail together with the IA‑64 simulator and its implementation.

2.9      Floating‑Point Architecture

The IA‑64 floating‑point architecture is compliant with IEEE floating‑point arithmetic. Internally, it does not use any IEEE floating‑point number format but supports conversion to all standard formats.

IA‑64 uses 82‑bit floating‑point registers. The format of these registers is similar to the IEEE double‑extended format, but two more exponent bits are added. To be exact, a number consists of 64 significand bits with an explicit integer bit, 17 exponent bits and a sign bit. Like IEEE, the IA‑64 floating‑point format supports various NaNs (quiet and signaling), positive and negative infinities, normal and denormal numbers, and positive and negative zeros; in addition it defines the special NaTVal value. IA‑64 does not have integer multiplication and division. Floating‑point registers holding numbers in a specific 64‑bit integer format are used for these operations.
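The register format can be made concrete with a small decoding sketch. The struct and function names are hypothetical. The exponent bias of 65535 (0xFFFF) is taken from the IA‑64 architecture definition rather than from the text above. The function handles only zeros and finite normalized values and rounds to double precision, so infinities, NaNs, NaTVal and the extra range of the 17‑bit exponent are not covered.

```cpp
#include <cmath>
#include <cstdint>

// The three fields of an 82-bit floating-point register.
struct FpReg {
    bool          sign;         // 1 bit
    std::uint32_t exponent;     // 17 bits, biased
    std::uint64_t significand;  // 64 bits, explicit integer bit at bit 63
};

const int kExponentBias = 0xFFFF;  // 65535

// value = (-1)^sign * significand * 2^(exponent - bias - 63)
double toDouble(const FpReg& r) {
    int unbiased = static_cast<int>(r.exponent & 0x1FFFF) - kExponentBias;
    // The significand is an integer whose bit 63 has weight 2^0, hence the -63.
    double magnitude = std::ldexp(static_cast<double>(r.significand), unbiased - 63);
    return r.sign ? -magnitude : magnitude;
}
```

With this layout the value 1.0 is encoded as sign 0, exponent 0xFFFF and significand 0x8000000000000000.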

Floating‑point instructions support several memory formats of numbers: single real numbers, double real numbers, double‑extended real numbers and the standard IA‑64 register format. Conversion between the IA‑64 format and the other formats is performed during loads and stores. The 82‑bit floating‑point format is written to a 16‑byte portion of memory. Like double‑extended real numbers, it should be stored at 16‑byte aligned addresses.

The Floating‑point Status Register is used to control the operations. It contains a set of control flags and three sets of status flags. One of the sets of status flags is the main one and the others are alternative. They can be specified as target status flags by several instructions. Control flags determine which faults and traps can be raised. These are, for example, the invalid operation and zero divide faults and the overflow, underflow and inexact result traps. Status flags hold information about rounding control, precision control and faults and traps that occurred while they were disabled.

The number of floating‑point instructions is considerably reduced. There are instructions providing the memory loads and stores described above. Instead of separate multiply and add or subtract instructions, a compound instruction performing both multiplication and addition or subtraction is defined. For this purpose, two floating‑point registers have hardwired values: FR0 is 0 and FR1 is 1. Since there is no integer multiplication implemented, an instruction providing multiplication and addition for the integer format of floating‑point numbers is provided. The division instruction is missing and is replaced by a reciprocal approximation. The last arithmetic instruction is a square root approximation. The reciprocal approximation and the square root approximation fill the specified predicate register depending on the correctness of the approximation. There are no trigonometric instructions. Several instructions are defined that convert between floating‑point numbers and numbers in general registers, or between floating‑point numbers and the special floating‑point format for integers.
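The role of the hardwired registers and of the reciprocal approximation can be illustrated in a few lines. All names are hypothetical. Plain multiplication and addition are expressed through the compound multiply‑add using FR0 and FR1, and division is built from a rough reciprocal estimate refined by Newton‑Raphson steps. The linear estimate in reciprocalApprox is only a stand‑in for the hardware approximation, the sketch works in double precision, and it assumes a positive, finite, nonzero divisor.

```cpp
#include <cmath>

const double FR0 = 0.0;  // hardwired value of FR0
const double FR1 = 1.0;  // hardwired value of FR1

// The compound instruction: a * b + c with a single rounding.
double fmaOp(double a, double b, double c) { return std::fma(a, b, c); }

// Multiplication and addition expressed through the compound instruction.
double fmulOp(double a, double b) { return fmaOp(a, b, FR0); }   // a * b + 0
double faddOp(double a, double b) { return fmaOp(a, FR1, b); }   // a * 1 + b

// Stand-in for the reciprocal approximation: relative error at most 1/17.
double reciprocalApprox(double b) {
    int e = 0;
    double m = std::frexp(b, &e);                    // b = m * 2^e, m in [0.5, 1)
    double y = 48.0 / 17.0 - (32.0 / 17.0) * m;      // linear estimate of 1 / m
    return std::ldexp(y, -e);
}

// Division a / b by refining the reciprocal with Newton-Raphson steps.
double divide(double a, double b) {
    double y = reciprocalApprox(b);
    for (int i = 0; i < 5; ++i) {
        double err = fmaOp(-b, y, FR1);   // err = 1 - b * y
        y = fmaOp(y, err, y);             // y = y + y * err; the error is squared
    }
    return fmulOp(a, y);
}
```

Each refinement step roughly doubles the number of correct bits, which is why a coarse hardware approximation followed by a few multiply‑add instructions can replace a dedicated division instruction.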

Some instructions can work with floating‑point pair registers. Pair registers contain two single‑precision floating‑point numbers. These instructions behave like multimedia instructions for integer registers.

3        IA64Emu Program

3.1      Introduction

No processor based on the IA‑64 architecture had been released at the time of writing this work. Even its first implementation, Itanium, whose release has been announced many times, was delayed again. A few functional samples of the Itanium processor were distributed to the leading companies supporting Intel processors so that they could test their products. Everyone else has to make do with one of the existing simulators, such as the NUE simulator by Hewlett‑Packard.

The main goal of this master thesis was also to create a simulator of the IA‑64 architecture. However, the task was not just to create a simulator as faithful to the architecture as possible, but to create one that provides tests of the correctness and performance of IA‑64 instruction code. This feature should help people debug and optimize compilers, and debug low‑level programs written in IA‑64 assembly language. Although this simulator can recognize many undefined states and illegal operations, it still does not guarantee that the program is correct. There are no high‑level tests telling the programmer: “This algorithm is not correct. You wanted to do something else.” Because of the tests it performs, the simulator is not as fast as other simulators.

3.2      Simulator Overview

The simulator is written in C++. Currently it runs under Windows only. Although it is programmed to be as platform independent as possible, there was no time to port it to Linux or to other operating systems. Partial independence is ensured by the way the user interface and input and output are handled in the program. No platform dependent libraries are used in the program except for the Microsoft CRT console library, whose use is separated and encapsulated in special functions. To port the simulator to another operating system, it is only necessary to implement these functions using another user interface library.

The simulator is built to simulate more than just one IA‑64 implementation. Therefore, an architecture configuration file can specify many easily configurable features, such as the number of memory caches, the size of the Translation Lookaside Buffer, restrictions on instruction dispersal in the dispersal window, etc. More complicated features must be reprogrammed inside the simulator. This concerns adding new instructions when an extension to the IA‑64 architecture is released, and changing the branch prediction algorithms or the algorithms used for searching or inserting entries in the memory cache, Translation Lookaside Buffer and Advanced Load Address Table. To make these extensions easier, all code that depends on the simulated architecture is kept separate. Extending the simulator is described in full later.

3.3      Simulated Applications

IA64Emu can simulate Windows programs as well as Unix programs. To be exact, the simulator can load files in the Windows PE executable format and in the Unix ELF executable format. Several test programs are attached. These programs demonstrate some of the most important IA‑64 architecture innovations and some of the tests provided by the simulator. They are compiled by a prerelease of Microsoft C++ for IA‑64 under Windows and by the gcc compiler running under the Hewlett‑Packard NUE simulator in Linux.

Applications should be compiled without the standard start‑up code and the standard C or C++ library. These restrictions follow from the fact that only a part of the operating system is implemented. Both the standard start‑up code and the standard library make many system calls which are not implemented by the simulator.

To overcome these restrictions and to enable the use of some basic C functions, the simulator's own start‑up code and standard library were created. The simulator's start‑up code simply calls the main function and then the exit system call. A simulated program should be linked with this start‑up code and standard library to ensure proper simulation.

Simulator's standard library contains these functions:

·        void memcpy(void *target, void const *source, qword size);

·        void strcpy(void *target, void const *source);

·        void strncpy(void *target, void const *source, qword size);

·        void strcat(void *target, void const *source);

·        void strncat(void *target, void const *source, qword size);

·        qword strlen(void const *source);

·        void print(char const *text);

·        int kbhit();

·        int getch();

·        void gets(char *str);

·        int64 atoi(char const *str);

·        char *itoa(int64 value, char *str, int radix);

·        long double atof(char const *str);

·        char *ftoa(char *str, long double value);

Each function has the same functionality as the corresponding standard C function. Some of these functions are implemented as a direct system call; the others are C functions that are themselves simulated.

These functions are not intended for building useful programs. They provide only basic interaction between the user and the tested program.

3.4      Inside the Simulator

The simulation aims to be as exact as possible. In some cases it is not exact because the internal processor microarchitecture is not described by Intel. This concerns, for example, measuring the duration of program execution: the specification mentions some instructions as having “variable” latency.

To achieve the most precise simulation, the simulator tries to follow the platform architecture as closely as possible. It consists of many C++ objects corresponding to the objects of the computer. Figure 3‑1 shows the simulator structure. Each rectangle corresponds to a simulator object or a significant part of the simulator. Arrows define the direction of function calls.

 


Figure 3‑1. The Simulator Structure

3.4.1      Loader

The program loader is not represented by any object. It consists of several functions which load the program into the simulator's memory. Currently the loader can load programs with a PE header or an ELF header, and it is designed to make further extensions easy. The contents of the PE header are described in the specification [11]. The contents of the ELF header are described in the specifications [7] and [12].

The loader decodes the program header and reads information about the sections contained in the executable file. If there is a section which should be placed into memory (for example program code or data), a new page is allocated for this section and the section is loaded. The size of this page is computed from the section size.

No relocation is implemented. The simulator cannot load dynamic libraries, and it simulates just a single program. All of the functions used in the program have to be linked with the main code. That is why there cannot be overlapping sections in programs. Because virtual memory mappings are implemented, memory pages can be allocated at the addresses required by the program. Therefore, relocation is not necessary to load the program correctly.

Besides the program sections, some other values are read. One of these values is a “program entry point”. The IP register is set to this value after the program loading or when the execution is being restarted. The other values read by the loader are used for loader internal purposes only.

After the data are loaded into memory, some standard registers have to be set. By convention, the GR1 register is called the “global data pointer”. This register must be set to the address of the global data segment. In ELF programs, it is the segment called “.data”; in PE programs it is called “.sdata”.

Another register that has to be set is GR12. This register is called the “memory stack pointer”. It points to the free memory space used as a stack. This stack is used in the same way as in the IA‑32 architecture except for return addresses. Return branches jump to an address specified by a branch register. This branch register is filled with the proper value by a call branch instruction. If the program needs to save the old value of this branch register (which is necessary in almost all cases), it should store this value in the local part of the register stack frame rather than on the stack in memory. The register stack frame is written to memory by the RSE in the background, and only when necessary. The proper use of the memory stack is to store local variables of a procedure that cannot be kept in general registers because of their size, or to pass large procedure parameters on a call.

There are some rules for using the memory stack pointer. The pointer must be 16‑byte aligned at all times. Called functions must preserve this pointer during their execution. Some functions do not preserve it because they dynamically allocate a part of the stack. A call to such a function must notify the compiler so that it generates code that behaves properly.

The simulator does not limit the size of the memory stack. The only limit is the size of the whole simulated memory, which can be set in the architecture configuration file. After program startup, the simulator does not allocate any page for the stack. When the memory stack pointer is decreased below the mapped memory, or the stack is accessed for the first time, an exception occurs and one more page is allocated. These pages are never deallocated.
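The on‑demand growth of the memory stack, together with the alignment rule for the stack pointer, can be sketched as follows. The class and function names are hypothetical and the page size is a free parameter; how the simulator actually handles the exception and represents its pages is not shown in this text.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Rounds a stack pointer value down to the required 16-byte alignment.
inline std::uint64_t alignStack16(std::uint64_t sp) {
    return sp & ~std::uint64_t{15};
}

class DemandStack {
public:
    explicit DemandStack(std::uint64_t pageSize) : pageSize_(pageSize) {}

    // Models the exception raised by an access to an unmapped stack address:
    // the page containing addr is mapped. Returns true if a new page was allocated.
    bool touch(std::uint64_t addr) {
        return pages_.insert(addr / pageSize_).second;
    }

    // Pages are never deallocated, so this number only grows.
    std::size_t mappedPages() const { return pages_.size(); }

private:
    std::uint64_t pageSize_;
    std::set<std::uint64_t> pages_;   // indices of the mapped stack pages
};
```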

At the top of the stack, a system environment has to be placed by the program loader. The system environment is a part of memory where system variables and command line parameters are placed. Currently the simulator does not use any system variables and no parameters can be specified on the command line. The number of variables in the system environment is set to zero. The number of command line parameters is set to one and the path of the simulated program is written as the first parameter. The system environment address can be obtained from GR12 immediately after the program has started.

There is one more general register which should be set in the real IA‑64 processor. This register is called “thread pointer” and it is numbered GR13. The thread pointer is used by multithreaded applications. This simulator does not support multithreading and does not treat GR13 as a special register.

Register and memory conventions are fully described in the run‑time architecture guide [5].

3.4.2      Processor

The main part of the simulator is the processor, which is represented by an object. The processor class is the biggest one in the program. The processor is responsible for the work of the whole system. It simulates the instructions and contains many other objects such as registers, memory caches, Translation Lookaside Buffers, branch prediction tables, and the Virtual Hash Page Table. The VHPT should normally reside in main memory, but to make the program simpler it was implemented as a part of the processor. This deviation does not reduce the functionality of the simulator.

3.4.2.1  Registers

The IA‑64 architecture registers are divided into application registers and system registers. The simulator implements all of the application registers and a small portion of system registers.

The simulator's register sets:

·        General Registers – Hardwired GR0, register stack frames and register rotation are implemented. Banked GR16 through GR31 registers are not implemented, since the simulator's operating system does not use them.

·        Floating‑point Registers – All features defined by the IA‑64 architecture are supported except for integer and unsigned integer format in floating‑points.

·        Predicate Registers

·        Branch Registers

·        Application Registers – The whole set is implemented but some of the special registers are treated as standard. They are IA‑32 control registers AR21 (FCR) and AR24 through AR30 (EFLAG, CSD, SSD, CFLG, FSR, FID, FDR). The Interval Time Counter (AR44) is used to keep the total number of cycles of a simulated program execution.

·        Instruction Pointer

·        Current Frame Marker

·        Processor Identifiers – CPUID registers are implemented. Their number and contents can be specified by the architecture configuration file.

·        Region Registers

·        Protection Key Registers – just one Protection Key Register with predefined value is implemented.

·        Translation Lookaside Buffer

·        Processor Status Register – the PSR register is implemented, but only its User Mask can be changed.

3.4.2.2  Program Simulation

When the processor object is created it initializes all of its subobjects and resets the processor state. Lots of its subobjects are initialized according to the architecture configuration file. This file and possible simulator extensions are described in chapter 3.6. After that a program is loaded and special registers are set to their start‑up values. Now the program is ready for simulation.

The simulation proceeds in single simulation steps. To give better insight into the simulation, the simulator behaves a little differently from common debuggers: an instruction need not be executed at each step. There are four types of execution steps:

·        Instruction load – at this step the instruction process is started. The instruction is assigned to a proper functional unit. At this point, the instruction is actually executed by the program.

·        Instruction completion – at this step the instruction execution is completed and its functional unit is released.

·        Instruction group completion – this step occurs when all instructions in the current instruction group are completed. Execution advances to the next group. In some cases this step takes place together with the next one.

·        Dispersal window rotation – at this step the dispersal window is “rotated”. Dispersal window and this step are described below.

3.4.2.3  Dispersal Window

The dispersal window is a processor structure where fetched bundles are stored. Itanium's dispersal window can contain two bundles. More information about Itanium's dispersal window can be found in the specification [9].

First, the bundles are fetched into the window. Then they are processed instruction by instruction. When there is a proper functional unit ready to process the instruction, the instruction is dispersed to this unit. The unit type is determined according to instruction type, slot number and position in the dispersal window. When there isn't any functional unit ready, instruction issue is split. The processor has to wait until some functional unit of the same type is ready before it can issue the next instruction.

If all of the instructions in some bundles are processed, the dispersal window is rotated. Processed bundles up to the first unprocessed one are thrown away. The bundles are moved, so that the first unprocessed bundle is now the first one in the dispersal window. The other ones are fetched from the memory.

Instructions are issued within the current instruction group. After all instructions in the group have been processed, execution must progress to the next group. If the next instruction group does not begin in the first bundle, the dispersal window is rotated. The simulator updates the processor state according to the executed instructions. This means that all registers written by the instructions are actually written at this point. During this step memory and register access dependencies are checked.

3.4.2.4  Instruction Execution

Executing an instruction involves several steps.

Before the execution the dispersal window must contain fetched bundles. The first unprocessed instruction is decoded. A functional unit is then looked up according to the instruction type. There can be some restrictions on the units. On Itanium, not all units of the same type are identical. For example, there are two integer units but not all instructions can be processed by the unit I1. The simulator makes it possible to define the encodings of instructions which cannot be executed by certain units.

The other class of rules, called “dispersal rules”, restricts issuing depending on the bundle position in the dispersal window and the slot position in the bundle. These restrictions can also be specified in an architecture configuration file. Several rules are currently defined according to Itanium. These rules are:

·        An I‑slot in the third position of the second bundle is always dispersed to I1 unit. In other cases the instruction in I‑slot is dispersed to the lowest numbered I‑unit not in use.

·        An M slot is always dispersed to the lowest numbered M‑unit not already in use.

·        An F‑slot in the first bundle disperses to F0 unit. An F‑slot in the second bundle disperses to F1 unit.

·        A B‑slot is dispersed according to its position in a bundle. A branch instruction in the first slot is dispersed to B0. A branch instruction in the second slot is dispersed to B1 and a branch instruction in the third slot is dispersed to B2.
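The four dispersal rules above can be expressed as one lookup function. This is only an illustrative sketch with hypothetical names; in the simulator these rules come from the architecture configuration file, and restrictions on which instruction encodings a unit accepts are not modeled here.

```cpp
#include <array>
#include <string>

enum class SlotType { M, I, F, B };

// Busy flags of the Itanium functional units named in the rules.
struct UnitState {
    std::array<bool, 2> m{};  // M0, M1
    std::array<bool, 2> i{};  // I0, I1
    std::array<bool, 2> f{};  // F0, F1
    std::array<bool, 3> b{};  // B0, B1, B2
};

// Marks the unit busy and returns true, or returns false when it is already in use.
inline bool take(bool& busy) {
    if (busy) return false;
    busy = true;
    return true;
}

// bundle: 0 or 1 (position in the dispersal window); slot: 0..2 (position in the bundle).
// Returns the chosen unit name, or an empty string when the instruction must wait.
std::string disperse(SlotType type, int bundle, int slot, UnitState& u) {
    switch (type) {
    case SlotType::I:
        if (bundle == 1 && slot == 2) {   // third slot of the second bundle: always I1
            return take(u.i[1]) ? std::string("I1") : std::string();
        }
        for (int n = 0; n < 2; ++n) {     // otherwise the lowest numbered free I-unit
            if (take(u.i[n])) return "I" + std::to_string(n);
        }
        return std::string();
    case SlotType::M:
        for (int n = 0; n < 2; ++n) {     // the lowest numbered free M-unit
            if (take(u.m[n])) return "M" + std::to_string(n);
        }
        return std::string();
    case SlotType::F:                     // the unit is fixed by the bundle position
        return take(u.f[bundle]) ? "F" + std::to_string(bundle) : std::string();
    case SlotType::B:                     // the unit is fixed by the slot position
        return take(u.b[slot]) ? "B" + std::to_string(slot) : std::string();
    }
    return std::string();
}
```

An empty result corresponds to the case in which no suitable unit is free and the instruction has to wait for one.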

Since the Itanium dispersal window contains two bundles, at most six instructions can be issued at the same time. Because of the limited number of functional units, these instructions can occupy at most two I‑slots, two M‑slots, three B‑slots and two F‑slots.

If all appropriate units are in use and no instruction can be launched, the simulator increases the simulation time by the minimal number of clock ticks necessary to finish the execution of any instruction currently being executed. All instructions completed in this step are marked “processed” and their functional units are marked “not in use”. The number of clock ticks needed to complete the other instructions in functional units is decreased by the same amount.
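The time advance described in this paragraph can be sketched as follows. The function name advance_time and the dictionary mapping a busy unit to its outstanding clock ticks are hypothetical; they only illustrate the bookkeeping.

```python
def advance_time(remaining):
    """Advance simulated time until at least one busy unit finishes.

    remaining maps a busy unit name to its outstanding clock ticks
    (hypothetical representation). The dictionary is updated in place.
    Returns (elapsed, finished): the ticks added to the simulation time
    and the list of units that became free.
    """
    if not remaining:
        return 0, []
    elapsed = min(remaining.values())
    finished = []
    for unit in list(remaining):
        remaining[unit] -= elapsed
        if remaining[unit] == 0:
            finished.append(unit)
            del remaining[unit]      # the unit is marked "not in use"
    return elapsed, finished
```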

If an appropriate unit is found, it is marked “busy” and the simulator's procedure implementing the instruction is called. The procedure executes the instruction and returns the number of simulated clock ticks needed to complete the operation. This number is assigned to the functional unit selected to execute the instruction. According to the instruction sequencing rules, writes to memory take effect immediately, whereas writes to registers are cached and deferred to the end of the instruction group. During the instruction execution, the IP register contains the bundle address of the currently executed instruction.

When the execution of one or more bundles at the beginning of the dispersal window is finished, the dispersal window is rotated.

When all instructions of the currently executed instruction group have been processed, the simulator switches to the next group. All deferred register writes are committed, and memory and register access dependency violations are checked. A dispersal window rotation can also occur when control passes to the next instruction group.
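The deferral of register writes to the end of the instruction group can be illustrated by a minimal sketch. The class RegisterFile and its method names are assumptions made for this example; the dependency checks performed by the simulator at the group boundary are left out.

```python
class RegisterFile:
    """Registers whose writes become visible only at the end of a group."""

    def __init__(self):
        self.values = {}     # committed register contents
        self.pending = {}    # writes cached during the current group

    def read(self, name):
        # Reads inside a group still see the state from before the group.
        return self.values.get(name, 0)

    def write(self, name, value):
        # The write is cached, not applied.
        self.pending[name] = value

    def end_group(self):
        # Commit all deferred writes when the instruction group ends.
        self.values.update(self.pending)
        self.pending.clear()
```

With this scheme, an instruction reading a register written earlier in the same group observes the old value until end_group is called.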

Figure 3‑2 shows an example of instruction dispersal. The bounds of the current instruction group are marked with a dashed line. The first slot of the instruction group (slot 1 in bundle 0) is already processed. The next three slots are being processed and functional units I0, I1 and M0 are marked “busy”. The last two slots are waiting for available functional units.

 

Figure 3‑2. The Process of Instruction Dispersal

 

Instructions whose qualifying predicate is false are also dispersed to a functional unit, but they are not executed. nop instructions are dispersed to functional units as if they were normal instructions. Since a bundle carries only five bits of slot type information, not all combinations of slot types are supported. Therefore, compilers (and even assemblers) insert many nops into the executable code.

3.4.2.5  Instruction Implementation

Almost all IA‑64 instructions are simulated, with the following exceptions:

·        System instructions which operate at the privilege level 0 (the most privileged) and instructions promoting to this level.

·        Floating‑point pair instructions.

The simulated program is executed at privilege level 3, like application programs on a real IA‑64 processor. That is why the system instructions which can be executed at privilege level 0 only are not implemented. The epc (enter privileged code) instruction is not implemented either. This instruction promotes the privilege level according to the memory page flags and the memory page privilege level. All pages created by the simulator have privilege level 3.

Unimplemented system instructions are bsw, epc, flushrs, fwb, itc, itr, loadrs, mov = cr, mov cr =, mov = psr, mov psr =, ptc, ptc.e, ptr, rfi, rsm, ssm, sync, srlz, tak, thash, tpa and ttag. This does not reduce the functionality of the simulator, since the required parts of the operating system, such as system calls, interruption handling, and VHPT maintenance, are implemented by the simulator itself.

There are some exceptions in the implementation of floating‑point instructions. The floating‑point integer number format is not supported; such numbers are represented by standard floating‑point numbers. The xma instruction operates on this representation instead of the floating‑point integer format.

Floating‑point pair instructions are not implemented. These instructions treat the FP register as a pair of two single precision FP registers. Unimplemented floating‑point pair instructions are fpabs, fpamax, fpamin, fpcmp, fpcvt, fpma, fpmax, fpmerge, fpmin, fpmpy, fpms, fpneg, fpnegabs, fpnma, fpnmpy, fprcpa, fprsqrta, ldfp and stfp.

All of the other instructions defined by the IA‑64 architecture are fully implemented.

There is no 82‑bit floating‑point arithmetic implemented in the simulator. Only the conversions between 82‑bit and 80‑bit floating‑point numbers and between 82‑bit and 64‑bit floating‑point numbers are defined. When a floating‑point instruction is executed, it converts the FR register operands to 64‑bit floating‑point numbers. IA‑32 floating‑point arithmetic computes the result, which is then converted back to the 82‑bit format.

The format of IEEE‑854 floating‑point numbers is described in the specification [1]. IA‑64 floating‑point architecture is described in the specification [4].

3.4.2.6  Interruption Handling

During the instruction execution some interruptions can be raised. To achieve a higher performance, interruptions are implemented using a C mechanism similar to C++ exceptions (setjmp, longjmp).

The interruption breaks the instruction execution immediately and passes control to the processor. The processor tries to call the function which can handle the pending exception. This function typically handles a Break Instruction Fault or a VHPT Data Fault. When it receives such an interruption, it jumps to a routine at an address specified by the operating system and returns whether the interruption was left unhandled, or whether it was handled and the instruction that caused it should or should not be restarted. Other interruptions can be handled here when the operating system is extended.

If the “handle_pending_exception” procedure reports the interruption as unhandled, the processor posts it to the debugger, which stops the execution and prints a message; otherwise the execution continues normally. Figure 3‑3 shows the process of interruption handling.

Besides the standard IA‑64 interruptions defined in [4], the simulator uses some interruptions which are raised by program tests. The list of exceptions that the simulator raises and their descriptions can be found in chapter 3.5.1.

The simulator treats faults and traps in the same way. Fault and trap information is carried by the exception structure containing the number of fault or trap and a parameter. The difference can be made in the handling function only. The return value specifies a further behavior – the instruction has to be restarted in case of faults or not restarted in case of traps.
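The control flow described above can be sketched as follows. The simulator itself uses setjmp and longjmp; this sketch substitutes Python exceptions for them, and the names Interruption, step, UNHANDLED, RESTART and NO_RESTART are hypothetical and chosen for this example only.

```python
UNHANDLED, RESTART, NO_RESTART = "unhandled", "restart", "no_restart"


class Interruption(Exception):
    """Carries the number of the fault or trap and one parameter."""

    def __init__(self, number, parameter=0):
        super().__init__(number, parameter)
        self.number = number
        self.parameter = parameter


def step(execute, handle_pending_exception):
    """Run one instruction; return 'done' or 'stopped'."""
    while True:
        try:
            execute()
            return "done"
        except Interruption as exc:
            verdict = handle_pending_exception(exc)
            if verdict == UNHANDLED:
                return "stopped"     # posted to the debugger
            if verdict == NO_RESTART:
                return "done"        # trap-like: go on with the next instruction
            # RESTART: fault-like, loop and execute the instruction again
```

The only difference between a fault and a trap in this sketch is the value returned by the handler, which mirrors the way the simulator distinguishes them.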

The simulator's operating system handles only two exceptions; all others terminate the application simulation. Break Instruction Fault is the first handled exception. When this interruption is raised, the processor calls the break interruption routine defined by the system. System calls are handled by this routine. After a return from the break interruption routine, the break instruction is not restarted.

A system call is invoked by the break instruction with the parameter 0x10000. The GR15 register has to contain the system call number, and the local area of the caller's register stack frame should contain the incoming parameters. After the system call execution, the return value is placed in GR8.

 

Figure 3‑3. The Process of Interruption Handling

 

VHPT Data Fault is the other handled exception. The processor jumps to the interruption handling routine defined by the operating system. This interruption is handled because no page is allocated for the memory stack during the program startup. Only the memory stack address is set, and a real page is created on the first access to the stack. This ensures that a new page is allocated when the stack grows below the last mapped memory address. After the stack page has been mapped, the load or store instruction is restarted. In all other cases an access to an unmapped memory page fails. There is no swap file in the simulator, so there is no reason to create pages other than those created during the program startup and the stack pages.

3.4.3      Register Stack Engine

Register Stack Engine operations are performed by several processor methods operating on the RSE state kept in the processor. The RSE state maintained by the simulator consists of the following variables:

·        rse_bof (RSE.BOF) – Bottom‑of‑frame register number. This is the physical register number of GR32 register. This value corresponds to AR[BSP] register.

·        rse_load_reg (RSE.LoadReg) – RSE Load Register Number. This variable contains the physical number of the next loaded register.

·        rse_store_reg (RSE.StoreReg) – RSE Store Register Number. The physical number of the next register to be stored by RSE. This value corresponds to AR[BSPSTORE] register.

·        AR[BSP] – Backing store pointer. This register contains the memory address where logical GR32 is stored or will be stored.

·        rse_bsp_load (RSE.BspLoad) – Backing Store Pointer for memory loads. This variable contains the memory address of the next loaded register.

·        AR[BSPSTORE] – Backing Store Pointer for memory stores. The memory address of the next stored register can be found here.

·        AR[RNAT] – RSE NaT Collection register. NaT bits are placed here when the register is stored.

Some variables of the RSE state are internal and are not visible to the application. The application can change only the Backing Store Pointer, the Backing Store Pointer for memory stores and the RSE NaT Collection register, which are mapped to application registers.

After the program startup, AR[BSP] and AR[BSPSTORE] are set to the backing store address. No memory page is created for the backing store at this point; the pages are created as the RSE requires them.

Figure 3‑4. Simulator's Register Stack

 

The IA64Emu simulator does not implement the Clean partition of the register stack. It supports the Dirty partition, the Current frame and the Invalid partition defined in [4], volume 2, chapter 6. This is possible because the simulator can load all registers of the Clean partition immediately: instead of providing a smart loading of the Clean partition, it simply omits the clock ticks that loading them would take, which has the same effect. Therefore, rse_load_reg (RSE.LoadReg) has the same value as rse_store_reg (RSE.StoreReg) and rse_bsp_load (RSE.BspLoad) has the same value as AR[BSPSTORE].

The simulator does not reproduce exactly the Enforced Lazy, Load Intensive, Store Intensive, or Eager modes of RSE behavior defined in the IA‑64 architecture. When a request to create a new stack frame is sent to the processor, the processor updates the RSE state variables described above if there are enough free registers in the general register field. If fewer registers are free than required, the simulated Register Stack Engine stores half of the register field to memory and updates the RSE state. If a previous register stack frame has to be restored but is not present in the register field, the simulated RSE loads half of the register field from memory and updates the state variables. To simulate the behavior of each of the four RSE modes more exactly, the simulator counts the time taken by register stack updates according to the following rules:

·        If the RSE mode is Enforced Lazy, the latencies of the register stack load and the register stack store are the durations of the memory load and the memory store, respectively, as computed by the memory load or store function.

·        If the RSE mode is Load Intensive, the latency of the register stack load is zero cycles. The latency of the register stack store is the time of a memory store duration.

·        If the RSE mode is Store Intensive, the latency of the register stack load is the time of a memory load duration. The latency of the register stack store is zero cycles.

·        If the RSE mode is Eager, the latencies of register stack load and register stack store are both zero cycles.
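The four latency rules can be summarized in a few lines of code. The function name rse_latency and the mode identifiers are assumptions made for this sketch; memory_latency stands for the duration computed by the memory load or store function.

```python
# For each RSE mode, the set of operations counted with zero latency.
FREE_OPERATIONS = {
    "enforced_lazy": set(),
    "load_intensive": {"load"},
    "store_intensive": {"store"},
    "eager": {"load", "store"},
}


def rse_latency(mode, operation, memory_latency):
    """Return the cycles charged for one register stack load or store."""
    if operation not in ("load", "store"):
        raise ValueError("operation must be 'load' or 'store'")
    if operation in FREE_OPERATIONS[mode]:
        return 0
    return memory_latency
```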

When a register is about to be stored and AR[BSPSTORE]{8:3} differs from (111111)2, the register value is written to memory and the AR[RNAT] bit at the index AR[BSPSTORE]{8:3} is set to the value of the NaT bit of the stored register. When AR[BSPSTORE]{8:3} equals (111111)2, the AR[RNAT] register is stored to this address first, AR[BSPSTORE] is increased by eight bytes, and the register is then stored normally as described in the first case. The behavior of register stack loads is analogous to the behavior of stores.
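The store rule can be written down as a short sketch. The function name rse_store, the dictionary used as memory, and the advance of the store pointer by eight bytes after every register store are assumptions made for this example.

```python
def rse_store(memory, bspstore, rnat, value, nat):
    """Store one register to the backing store.

    memory is a dict keyed by address (hypothetical representation).
    Returns the updated (bspstore, rnat) pair.
    """
    index = (bspstore >> 3) & 0x3F        # AR[BSPSTORE]{8:3}
    if index == 0x3F:
        memory[bspstore] = rnat           # this slot holds the NaT collection
        bspstore += 8
        index = (bspstore >> 3) & 0x3F    # wraps around to 0
    memory[bspstore] = value
    if nat:
        rnat |= 1 << index                # record the NaT bit of the register
    else:
        rnat &= ~(1 << index)
    return bspstore + 8, rnat             # assumption: pointer moves to next slot
```

Storing at an address whose bits 8:3 are all ones therefore consumes two eight-byte slots, one for AR[RNAT] and one for the register itself.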

3.4.4      Physical and Virtual Memory

The simulator's memory is made up of a block of real memory. Its size has to be specified in the architecture configuration file. The 64‑bit addressing and IA‑64 memory paging defined in [4] are implemented in the simulator. Direct access to physically addressed memory by application instructions is not supported, since the PSR.dt bit, which selects the physical or virtual addressing mode, cannot be set by application programs at privilege level 3.

This memory block is created during the program startup. Access to the memory is provided by several memory and processor functions which ensure the virtual to physical address translation and the proper use of memory caches. No swap file is implemented and the memory size cannot be changed at run time. Therefore, the user has to specify a sufficient memory size. If there is not enough free memory for the application on startup, an error is reported and the application is not loaded.

Eight Region Registers are implemented for the memory address translation. Only one Protection Key Register is implemented. This register is used for all pages, so there is just one protection domain for the program text, data and stack, the global environment, and the register stack. This PKR register defines no read, write, or execute restrictions. Page access restrictions are defined by page attributes in the VHPT table only.

Several functions providing memory reads and writes are defined in the processor or memory classes. Memory read and write functions provide just low‑level access to memory and they operate with physical addresses.

Instruction implementations should access memory through the processor object. It provides read and write functions operating with virtual addresses, proper exception handling, and some statistics.

When a memory read or write request is received, the processor translates the virtual address to a physical one. Figure 3‑5 shows the process of address translation. The virtual address is divided into three parts. The offset, occupying the least significant bits of the address, is copied to the physical address. The number of offset bits depends on the page size. The Virtual Region Number, formed by bits 61 through 63, is the index into the set of Region Registers. The Region Register indexed by the VRN contains the Region Identifier of the specified address. A virtual page record is searched for in the VHPT according to the Region Identifier and the Virtual Page Number obtained from the virtual address.

The protection key register is then looked up and checked. This is a trivial operation, since there is just one protection key register and it has all types of memory access (read, write, and execute) enabled. The page access rights in the translation record are also checked, and the Physical Page Number field is copied into the resulting physical address.
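The translation steps can be sketched as follows. The function name translate and the dictionary layout of a translation record are hypothetical. The sketch assumes that the Virtual Page Number is formed by the address bits between the offset and the Virtual Region Number, it searches the records linearly, and it leaves out the protection key and access rights checks.

```python
def translate(vaddr, region_registers, vhpt):
    """Translate a 64-bit virtual address; return None if no page matches.

    region_registers: list of eight Region Identifiers
    vhpt: list of records with the keys p, rid, ps, vpn and ppn
    """
    vrn = (vaddr >> 61) & 0x7                   # bits 61 through 63
    rid = region_registers[vrn]
    for rec in vhpt:
        if not rec["p"] or rec["rid"] != rid:
            continue
        ps = rec["ps"]                          # base-2 logarithm of the page size
        vpn = (vaddr & ((1 << 61) - 1)) >> ps   # assumption: VPN excludes the VRN
        if rec["vpn"] == vpn:
            offset = vaddr & ((1 << ps) - 1)    # copied unchanged
            return (rec["ppn"] << ps) | offset
    return None
```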

If the required page is not present, or the operation cannot be performed because of insufficient access rights, the processor can behave in one of two ways.

Figure 3‑5. Virtual to Physical Address Translation

 

The simulator's processor can defer exceptions for memory reads and writes as defined by the IA‑64 architecture. When exceptions are deferred and the memory page does not exist or it does not have proper access right flags, the processor just returns that the specified operation failed. Then the instruction implementation sets the NaT bit of a target register to 1. When exceptions are not deferred, an exception is raised and the control is passed to an interruption handling function in the system part of the simulator.

If the translation succeeds, the main memory access can begin. The processor does not read the memory directly but posts the request to the L1 cache. The request propagates through the memory caches to the first level that holds the required data. A correct memory mapping ensures that the data are found in the main memory at the latest. A memory read or write can specify access to a temporal or a non‑temporal memory cache structure. Itanium does not differentiate between temporal and non‑temporal structures and no more exact description of the differences between them is available, so this hint does not have any effect. To enable future work with this hint, it is stored within the cache lines and propagated up to the definable find_cache_line function described in chapter 3.6.2.

The processor returns the latency in cycles of the memory operation depending on the level of memory cache hit and the type of the operation – read or write and integer or floating‑point operation. This number of cycles is added to the total latency of the instruction execution.

3.4.4.1  Memory Cache

The memory hierarchy consists of the main memory and several levels of memory cache. Each level of cache can be formed by a unified cache unit or by separate instruction and data caches. The simulator allows the user to define the number of cache levels, the size of each level, the read and write latencies for integer and floating‑point operations, and whether a level is divided into an instruction cache and a data cache.

Itanium has three levels of cache. L1 cache consists of the instruction part of cache and data part of cache. L2 and L3 caches are unified. The Itanium cache hierarchy is depicted in the specification [9]. Although L3 cache is not on die, the simulator treats it as though it were inside the processor.

Each cache level knows its superior cache or memory. When there is a request to access the memory, it is sent to the memory cache. L1 cache read or write is called. If the required data are not found in L1, the L2 cache read or write is called. If the data are not found in the last level of the cache, the operation is processed by the main memory. The memory should contain the required data, otherwise the operation would have failed while searching the memory mapping before.

A device where the data are found returns the latency of performed operation. This latency can differ depending on the type of the operation (read or write) and the type of data (integer or floating‑point).
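The chain of cache levels can be illustrated by a minimal sketch. The class Level, the latencies used with it, and the filling of a cache line on a miss are assumptions made for this example; the split into instruction and data caches and the distinction between operation types are left out.

```python
class Level:
    """One device of the hierarchy: a cache level or the main memory."""

    def __init__(self, name, latency, parent=None):
        self.name = name
        self.latency = latency    # hypothetical latency in cycles
        self.parent = parent      # superior cache or memory; None for memory
        self.lines = set()        # addresses currently held (caches only)

    def read(self, address):
        """Return (device name, latency) of the device where data are found."""
        if self.parent is None or address in self.lines:
            return self.name, self.latency
        found = self.parent.read(address)   # miss: ask the superior level
        self.lines.add(address)             # assumption: fill on a miss
        return found
```

A hierarchy is built from the main memory upwards, for example Level("L1", 2, Level("L2", 6, Level("memory", 100))), and every request enters at the L1 level.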

Besides the cache level parameters definable in the architecture configuration file, the simulator makes it possible to implement a function that searches the cache lines or finds a place for new data in the cache. A proper treatment of the temporality hint can be implemented in these functions. The user‑definable behavior is described in detail in chapter 3.6.

3.4.4.2  Virtual Hash Page Table

Virtual Hash Page Table is an array of translation records like Translation Lookaside Buffer. The simulator does not make any distinction between the structure of TLB translation records and the structure of VHPT translation records.

Normally the VHPT resides in memory. To simplify the implementation, this simulator places it in the processor, like the TLB structure. This does not have any effect on the functionality of the simulator.

The VHPT walker is not implemented because it is just one of the ways to access the VHPT. In fact, the simulator's VHPT is not a hash table. The number of pages created for the application or for the operating system structures does not exceed ten, and a hash table with so few entries would not be efficient.

VHPT translation record consists of these items:

·        p – Present bit. This bit is set to 1 if the record is used.

·        pl – Privilege Level. It contains the two‑bit privilege level of the page. All pages have this value set to 3.

·        ar – Access Rights. This field contains access rights defined for this page. Access rights can enable read, write or execute access only. Promotion to the higher privilege level is not supported.

·        ppn – Physical Page Number. The number of the physical page is stored here. The position of the Physical Page Number within the physical address depends on the page size.

·        ps – Page Size. This 6‑bit field contains the base‑2 logarithm of the virtual or physical page size.

·        key – Protection Key. This is the protection key which has to correspond with any protection key value in PKR registers. Although 24‑bit protection keys are defined, this value is always zero and corresponds to the protection key in PKR[0] register.

·        vpn – Virtual Page Number. The number of virtual page is stored here.

·        rid – Region Identifier. This 24‑bit field contains a region identifier of a region containing required page. During the translation, a region identifier is obtained from the Region Register indexed by the three most significant bits of virtual address.

·        age – an internal state variable. As the name suggests, the age of the record can be stored here, for example. The VHPT does not use the age field because VHPT entries cannot be replaced without deleting the page. This field is used by the TLB only. Details on definable age‑field access are given in chapter 3.6.2.

The VHPT structure defines several operations on page mappings. It allows creating and deleting a page, obtaining a translation record, and translating a virtual address to a physical address. When a new page is to be created, the required virtual page address can be specified. Virtual pages cannot overlap.

The simulated VHPT structure can raise VHPT Data Fault, Data Key Miss Fault, Data Key Permission Fault, and Access Rights Fault. These exceptions are described in detail in chapter 3.5.1.

The other faults are not supported because the implementation of the simulator is simpler than that of a real IA‑64 processor.

The IA‑64 architecture specification states that VHPT is maintained by the operating system. The simulator does not define any restriction to VHPT use.

3.4.4.3  Translation Lookaside Buffer

The Translation Lookaside Buffer is an array of translation records like the VHPT, although IA‑64 also allows short translation records to be used in the TLB. This array resides in the processor. The TLB can consist of several levels and it is divided into an Instruction TLB and a Data TLB. The number of levels of the Instruction and Data TLBs, their sizes and the access latency of each TLB level can be specified in the architecture configuration file described in chapter 3.6.1.

The simulator does not use the TLB structure to obtain a translation record of memory mapping. When the address is being translated the translation record is searched in VHPT. The TLB record is searched just to count the VHPT and TLB access latency. When the translation is made, the translation record is moved to the proper TLB structure.

The current configuration is set to two levels of Data TLB and one level of Instruction TLB according to the Itanium processor. Itanium does not implement Translation Register. It is not implemented in this simulator either, because translation records can be inserted into or deleted from a TR at privilege level 0 only.

The simulated TLB does not raise any exception, since all exceptions that could be raised have already been raised by the VHPT.

Besides the TLB parameters definable in the architecture configuration file, the simulator allows the user to implement a function that searches the TLB records or finds a place for new translations. A detailed description is given in chapter 3.6.2.

3.4.5      Advanced Load Address Table

The Advanced Load Address Table is represented as an array of ALAT entries. The entries are managed by the processor. The ALAT supports inserting, invalidating and checking entries, and it is updated when the register stack frame changes.

The ALAT table resides in the processor and it is created on processor start‑up. The size of the ALAT table can be specified in the architecture configuration file.

An ALAT entry consists of the following fields:

·        Present Bit – this bit is set when the record is used.

·        Register Type – this is the type of a register which had been the target of advanced memory operation.

·        Register Number – the index of register of the specified type. For General Registers, the physical register number is stored here.

·        Physical Address – contains the physical address of advanced memory load or store.

·        Size – contains the size of accessed memory by the advanced operation.

·        Age – like VHPT age, this is the internal state variable. The age of the record or any other information can be stored here.

The simulator compares the ALAT entries according to their register type and the register index. The ALAT entries can be invalidated explicitly by certain instructions having .clr hint set or by find_alat_entry function defined by user.
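The basic ALAT operations can be sketched as follows. The class Alat, its method names, and the replacement of the oldest entry when the table is full are assumptions made for this example. The invalidation of entries by an overlapping store follows the IA‑64 data speculation model and is not taken from the simulator's code.

```python
class Alat:
    """A small table of advanced load entries, identified by target register."""

    def __init__(self, size):
        self.size = size
        self.entries = []    # tuples (reg_type, reg_num, address, length)

    def insert(self, reg_type, reg_num, address, length):
        # At most one entry exists per target register.
        self.invalidate(reg_type, reg_num)
        if len(self.entries) >= self.size:
            self.entries.pop(0)    # assumption: the oldest entry is replaced
        self.entries.append((reg_type, reg_num, address, length))

    def invalidate(self, reg_type, reg_num):
        self.entries = [e for e in self.entries
                        if (e[0], e[1]) != (reg_type, reg_num)]

    def store(self, address, length):
        # A store removes every entry whose memory range it overlaps.
        self.entries = [e for e in self.entries
                        if e[2] + e[3] <= address or address + length <= e[2]]

    def check(self, reg_type, reg_num):
        return any((e[0], e[1]) == (reg_type, reg_num) for e in self.entries)
```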

Besides the ALAT parameters definable in the architecture configuration file, the simulator allows the user to implement a function that searches the ALAT records or finds a place for new ALAT entries. A detailed description is given in chapter 3.6.2.

3.4.6      Branch Prediction

The IA‑64 branch prediction is rather complicated. It uses several prediction structures, several prediction algorithms and many exceptions to these algorithms. IA‑64 branch prediction is described in specifications [4] and [9]. The branch prediction algorithms used by this simulator are described in the works [2] and [10].

The simulator uses the BPT, MBPT, TAC and TAR structures to implement branch prediction. The BAC1 and BAC2 structures, which compute a branch target address, are not used – the branch target address is computed immediately and the latency of this operation is defined as zero.

The BPT and MBPT structures are the primary resources to predict whether the branch is taken or not. BPT and MBPT entries are written by the br instruction with dynamic prediction hints.

When a static prediction is used (sptk and spnt hints) no entry is allocated. Static prediction has these rules:

·        If sptk hint is used, a branch is predicted taken.

·        If spnt hint is used and there is no entry in the TAR or TAC structure with the same target address, a branch is predicted not taken.

When dynamic prediction hints (dptk and dpnt) are used, the prediction is determined according to the following rules:

·        If a non‑return branch is taken, a TAC entry is allocated.

·        If there is no BPT or MBPT entry allocated, the whether hint is used to predict the branch statically. If the static prediction fails, a BPT or MBPT entry is allocated and the dynamic prediction is used next time.

·        If BPT or MBPT entry is found, a dynamic prediction is used. Prediction algorithm used by simulator is a local 2‑level predictor with 4 history bits. This algorithm is described below.

·        If a return or call branch instruction is taken, a BPT or MBPT entry is always allocated.

The dealloc hint (clr) tells the processor not to allocate a BPT or MBPT entry. This rule takes precedence over the previous ones.

The BPT and MBPT branch structure consists of these fields:

·        Present – this bit is set to 1 if this record is used.

·        Source – this field contains the source address of a branch instruction

·        History – 4 history bits are used as a history pattern. The MBPT structure has one history field for each slot.

·        Taken – branch prediction information. This is a 16 entries long field, each entry is for one history pattern. The MBPT structure has one Taken field for each slot.

·        Age – this is the internal state variable. The age of the record or any other information can be stored here.

The size of the BPT and MBPT structures can be specified in the architecture configuration file. The branch prediction algorithm and the multiway branch prediction algorithm can be reimplemented for future IA‑64 implementations. These features are described in detail in chapter 3.6.2.

TAC and TAR structures are used to store a branch target address. They are not important for the simulator's behavior because the simulator can compute the target branch address with no latency of simulated time. They are implemented to make the simulation time more realistic.

The TAC and TAR structures contain these fields:

·        Present – this bit is set to 1 if this record is used.

·        Source – this field contains the source address of a branch instruction

·        Slot – this field contains the position in the bundle of a branch instruction. There can be just one entry for each source address now. If the entry with the same source address should be inserted into the TAC or TAR structure, the entry is overwritten even if the slot numbers are different. This feature can be changed for new IA‑64 implementations by redefining function looking for a new entry.

·        Age – this field has the same meaning as the age field in BPT.

TAC and TAR entries are allocated by the dynamic prediction mechanisms or explicitly by the brp instruction according to its whether hint. At this time there is no difference between the loop, sptk and dptk hint types. The exit form is not implemented by Itanium and there is no specification of how to treat this hint.

The brp instruction can specify the target address of a branch but it does not have any effect on the simulator.

The TAC entry is allocated when the brp instruction without imp completer is executed and when the hint is either loop, sptk or dptk. When an imp completer is specified an entry is allocated in both the TAR and TAC structures.

The size of the TAC and TAR structures is specified in the architecture configuration file. The function looking for existing entries or determining the position of a new entry can be reimplemented for future IA‑64 implementations. These features are described in detail in chapter 3.6.2.

3.4.6.1  2‑level Predictor Algorithm

This algorithm is used by the dynamic prediction hardware. The algorithm uses the data stored in the BPT and the MBPT structures. There are several forms of this algorithm. The exact form used by the simulator is a local 2‑level predictor with 4 bits of history.

The 1‑level scheme is a simple algorithm using 2‑bit branch history counters. The initial value of a branch history counter is 1. When a branch is taken, the counter is incremented; when it is not taken, the counter is decremented. The counter is a 2‑bit saturating counter – its value is limited to the range 0 through 3. When the branch is to be predicted, the appropriate counter is found. If this counter is 0 or 1, the branch is predicted not taken. If the value is 2 or 3, the branch is predicted taken.

2‑level scheme uses a branch history pattern and the table of the branch history counters. The branch history pattern is used to index the table of branch history counters. Therefore, when there are 4 bits of history pattern, the branch history table has 16 entries.

The history pattern keeps information about the last four executions of the corresponding branch. When the branch is executed, the history pattern is shifted one bit to the left. If the branch was taken, the last bit of the pattern is set to 1; otherwise it is set to 0. Before the history pattern is updated, the branch history counter selected by it is updated in the same way as defined in the 1‑level scheme.

Figure 3‑6. BPT Branch Prediction Algorithm

 

When a branch is to be predicted, its history pattern is read and the branch history table is accessed at the index defined by the history pattern. The decision is made in the same way as in the 1‑level scheme: if the value is 0 or 1, the branch is predicted not taken; if the value is 2 or 3, the branch is predicted taken.

The branch prediction process is also shown in Figure 3‑6.
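The local 2‑level scheme with 4 bits of history can be sketched in C++ as follows. This is a minimal illustration only, not the simulator's code: the BranchEntry type and the predict and update functions are hypothetical names, and the decrement of the counter on a not‑taken branch is assumed as the usual behaviour of a saturating counter.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical per-branch entry: 4 bits of local history and one
// 2-bit saturating counter for each of the 16 possible patterns.
struct BranchEntry {
    std::uint8_t history = 0;               // last four outcomes, newest in bit 0
    std::array<std::uint8_t, 16> counters;  // each counter is in the range 0..3

    BranchEntry() { counters.fill(1); }     // initial counter value is 1
};

// Predict taken when the counter selected by the history pattern is 2 or 3.
inline bool predict(const BranchEntry &e) {
    return e.counters[e.history & 0x0F] >= 2;
}

// Update the selected counter first, then shift the outcome into the history.
inline void update(BranchEntry &e, bool taken) {
    std::uint8_t &c = e.counters[e.history & 0x0F];
    assert(c <= 3);                         // invariant of a 2-bit counter
    if (taken) {
        if (c < 3) ++c;                     // saturate at 3
    } else {
        if (c > 0) --c;                     // saturate at 0 (assumed decrement)
    }
    e.history = static_cast<std::uint8_t>(((e.history << 1) | (taken ? 1 : 0)) & 0x0F);
}
```

A caller would invoke predict before the branch is resolved and update afterwards. Because each history pattern selects its own counter, a branch that is always taken is predicted taken only after the all‑ones pattern has been seen and its counter has reached 2.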

3.5      Code Testing

The IA64Emu simulator differs from common simulators in that it implements code tests and performance monitoring. These tests help compiler developers to detect some important errors in the generated code. Although the results of the performance tests are not as exact as measurements on a real IA‑64 processor, they are useful for detecting wrong compiler optimizations or inconvenient instruction orderings.

Tests are divided into three categories. The first category contains tests of fatal application events which would raise a fault in the real processor. These events also raise a fault in the simulator. These tests cannot be disabled.

The second category contains tests of dependency violations during accesses to registers or memory. If a developer is sure that the code is correct, these tests can be disabled to achieve slightly higher simulator performance. When these tests are disabled and a dependency violation occurs, the result is undefined.

The third category contains performance and optimization tests. These measure the success rate of branch prediction and the application execution time. Branch prediction and taken branch watching can also be disabled. Memory access statistics and application execution statistics are only written into the log‑file.

These three categories are described in detail in the following subchapters.

3.5.1      Interruptions

The simulator currently raises two classes of exceptions. The first class consists of standard IA‑64 exceptions; only the small subset that is useful for the simulator is implemented. The mechanism of interruption handling during instruction execution is fully described in chapter 3.4.2.6.

The IA‑64 faults used by the simulator:

·        Illegal Operation Fault – This exception is raised as a consequence of many events, among them invalid source or target register numbers, inadmissible operand values, reserved instruction encodings and an incorrect privilege level for the performed operation. Detailed restrictions on the instruction operands are described in [4], volume 3, chapter 2.

·        Reserved Register Field Fault – This exception is raised when a move to a reserved register field is requested. It occurs on writes to AR registers or reads from them. Which fields are reserved depends on the number of the AR register. The description of reserved fields is given in [4], volume 1, chapter 3.

·        Break Instruction Fault – This exception is raised due to the break instruction execution. System calls are invoked by this exception. This exception is not an error exception and does not cause any program execution interruption or any undefined processor state.

·        Register NaT Consumption Fault – This exception is raised by nonadvanced operations when one of the source operands contains NaT value for general registers set or NaTVal for floating‑point registers set. Advanced speculative operations propagate the NaT value to the result.

·        Privileged Register Fault – This exception is raised when a move to a privileged register is requested. It occurs on writes to the privileged AR registers or reads from them. The description of privileged register numbers is given in [4], volume 1, chapter 3.

·        VHPT Data Fault – This exception is raised when the required translation record is not found in the VHPT. In almost all cases this exception is fatal and the program cannot continue execution. The only legitimate occurrence of this exception is an access to the memory stack; in this case a new page is created.

·        Data Key Miss Fault – Raised when the specified Protection Key is not found in PKR registers. This exception should not normally arise.

·        Data Key Permission Fault – Raised when the permissions defined by a PKR register forbid the required operation. As in the previous case, this exception should not normally arise.

·        Access Rights Fault – Raised when the permissions defined by the ar field in a translation record forbid the required operation. For example, this exception occurs when the program writes to instruction code that is marked read‑only.

·        Unaligned Data Reference Fault – This exception is raised when an unaligned data reference is forbidden (Alignment Check bit in PSR is set to 1) and memory is read or written at an unaligned address. The default value of Alignment Check bit is zero. This flag can be set or reset by rum and sum instructions only. The simulator counts the unaligned data references even when this flag is reset.

The second class of interruptions consists of the simulator's internal interruptions. These interruptions are used to report failures of the provided tests. Some of these interruptions can be disabled in the simulator configuration file.

The simulator's internal exceptions:

·        Not Implemented – This exception occurs when an unimplemented instruction or unimplemented code is reached, e.g. the floating‑point pair instructions.

·        Other Exception – This can be raised in any situation. A detailed description of the exception should be placed into the exception parameter. It is typically used by the simulator's tests and raised when a test fails, for example when a register access dependency violation occurs; a detailed description of the dependency violation is then placed into the parameter. When this exception arises due to a functional unit overload, the program can continue execution. In other cases, if the program continues execution, the result is undefined.

·        Slot 2 Fault – This exception is raised if the instruction should be placed into the slot 2 but it is not. This event causes an Illegal Operation Fault in the IA‑64 architecture.

·        First In Group Fault – It is raised if the instruction should be the first in the current instruction group but it is not. This event causes an Illegal Operation Fault in the IA‑64 architecture.

·        Last In Group Fault – It is raised if the instruction should be the last in the current instruction group but it is not. This event causes an Illegal Operation Fault in the IA‑64 architecture.

3.5.2      Access Dependency Tests

Access dependency tests are provided to diagnose read‑after‑write (RAW), write‑after‑write (WAW) and write‑after‑read (WAR) dependency violations within an instruction group. The IA‑64 architecture defines many strict restrictions on accesses to architectural resources, such as registers and memory, within an instruction group.

There are no restrictions between instruction groups. Each instruction in a given instruction group behaves as though its read occurs after the update of all the instructions from the previous instruction group.

Within an instruction group, RAW and WAW register dependencies are not allowed, while WAR register dependencies are allowed; the special cases of each rule are listed below. These dependency restrictions also apply to implicit register accesses; application and control registers are accessed in this way by several instructions. The PR0 register is excluded from these restrictions because it is used as the implicit predicate register and as the target register of cmp instructions that discard their result.

There are four special cases in which the RAW register access dependency restrictions within an instruction group do not apply. They are:

·        alloc instruction – This instruction implicitly accesses the Current Frame Marker. CFM is also used by every instruction accessing the stacked subset of general registers. Dependent instructions see the value of CFM changed by alloc instruction.

·        ld.c instruction – Check load instruction may read the memory in the case of missing ALAT entry. Instructions accessing the target register of the check load may appear in the same instruction group. These instructions see the value changed by a check load.

·        Branch instructions – Branch instructions implicitly access predicate registers, the LC and EC registers, Previous Function State and Current Frame Marker and explicitly branch registers. Branch registers, PFS and CFM registers can be accessed before the branch instruction in the same instruction group. The branch will see values of these registers changed by previous instructions. The register dependency for the other registers is not allowed.

·        ld8.fill, st8.spill instructions – These instructions access the AR[UNAT] register. Each execution accesses only one bit of this register, so the restrictions on register dependencies apply at the bit level.

There are three cases of permitted WAW register access dependencies within an instruction group:

·        cmp instruction – if the compare type is either AND‑type for all compares or OR‑type for all compares targeting the same predicate register, the access dependency is allowed. Compares targeting PR0 are also allowed within the instruction group, since writes to PR0 are discarded.

·        mov pr = – move to PR instruction writes the PR registers defined by a mask. Dependency restrictions apply to these registers only. Move from PR instruction reads all of the PR registers.

·        st8.spill – As noted in the RAW exceptions from dependency restrictions, this instruction accesses the AR[UNAT] register. Each execution accesses only one bit of this register, so the restrictions on register dependencies apply at the bit level.

There is one restricted case of WAR register dependency: a read of the PR63 register that predicates a B‑type instruction followed by a write to the same register by a modulo‑scheduled loop type branch is not allowed.

The RAW, WAR and WAW memory dependencies and ALAT dependencies are allowed. The load will observe the most recent store to the same memory address. The result of two stores is the value written by the later store. Certain instructions implicitly access the ALAT table. RAW, WAR and WAW access dependencies for ALAT table are allowed. The access to the ALAT behaves as the memory accesses described above. The result of the memory accesses to the same location is undefined. Data speculation should be used to overcome these dependencies.

These tests can be disabled in the architecture configuration file. There are two variables used to manage these tests. The ACCESSED_REGISTERS variable describes the behavior of the register access dependency tests. The ACCESSED_MEMORY variable describes the behavior of memory access dependency tests.

These variables have one of three values: NOTHING, LOG, EXCEPTION. The NOTHING value means that the test is not performed. The LOG value means that all test results are logged but no exception is raised. The EXCEPTION value means that when a dependency violation occurs, an exception is raised and the application is stopped at the point of the failed test. A detailed description of the situation is produced; when an exception is raised, this description is also logged.
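The basic RAW and WAW register checks and the three test modes can be illustrated by the following C++ sketch. It is a simplified model under stated assumptions, not the simulator's implementation: the GroupTracker class and the TestMode enumeration are hypothetical, only general registers are tracked, and none of the special cases listed above are handled.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// The three values of ACCESSED_REGISTERS / ACCESSED_MEMORY.
enum class TestMode { NOTHING, LOG, EXCEPTION };

// Hypothetical tracker of register accesses within one instruction group.
class GroupTracker {
public:
    explicit GroupTracker(TestMode mode) : mode_(mode) {}

    // Called at every stop: a new instruction group begins.
    void new_group() { written_.clear(); }

    // Record one instruction reading `srcs` and writing `dst`.
    void access(const std::vector<int> &srcs, int dst) {
        assert(dst >= 0);                          // register numbers are non-negative
        if (mode_ == TestMode::NOTHING) return;    // test disabled
        for (int r : srcs)
            if (written_.count(r)) report("RAW on r" + std::to_string(r));
        if (written_.count(dst)) report("WAW on r" + std::to_string(dst));
        written_.insert(dst);                      // WAR is allowed, so reads are not stored
    }

    const std::vector<std::string> &log() const { return log_; }

private:
    void report(const std::string &what) {
        log_.push_back(what);                      // the description is always logged
        if (mode_ == TestMode::EXCEPTION) throw std::runtime_error(what);
    }

    TestMode mode_;
    std::set<int> written_;                        // registers written in this group
    std::vector<std::string> log_;
};
```

In LOG mode the tracker only collects descriptions, while in EXCEPTION mode the first violation stops the caller by throwing. Calling new_group at each stop models the rule that there are no restrictions between instruction groups.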

3.5.3      Branch Prediction Test

This test measures the success rate of branch prediction. Each time a branch instruction is executed, the result of the IA‑64 prediction is computed. Then a table of branch instruction information is updated.

The table of branch instruction information contains an entry for each branch instruction. For MBB and BBB bundles the table can contain two or three entries per bundle. Each entry consists of the following fields:

·        Source address – This field contains the IP address of the bundle containing this instruction.

·        Source slot – This field contains the number of the slot containing this instruction. The slot number is used for MBB and BBB bundles only. For the other bundle types the slot number field contains the NOT_A_NUBMER value.

·        Passes – The number of branch instruction executions. Each time the appropriate instruction is reached this counter is increased.

·        Taken – The counter of taken branches. Each time the appropriate branch is taken this counter is increased.

·        Mispredictions – This is the counter of branch mispredictions. When the algorithm used for predicting the branch target address fails, this counter is increased.

This table is written to the log‑file after the program execution. Below the table entries, the sum of all lines is appended. The percentage of the total number of taken branches and branch mispredictions is also provided.
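A possible shape of the branch information table and of the summary line is sketched below in C++. The BranchInfo layout, the map keyed by bundle address and slot, and the function names are assumptions made for illustration. As a simplification, a misprediction is counted here whenever the predicted direction differs from the actual one.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical layout of one entry of the branch information table.
struct BranchInfo {
    std::uint64_t passes = 0;          // executions of the branch instruction
    std::uint64_t taken = 0;           // executions in which the branch was taken
    std::uint64_t mispredictions = 0;  // executions that were predicted wrongly
};

// Entries are keyed by (bundle address, slot number).
using BranchTable = std::map<std::pair<std::uint64_t, int>, BranchInfo>;

// Update the entry of one executed branch.
inline void record_branch(BranchTable &table, std::uint64_t ip, int slot,
                          bool taken, bool predicted_taken) {
    BranchInfo &e = table[{ip, slot}];
    ++e.passes;
    if (taken) ++e.taken;
    if (taken != predicted_taken) ++e.mispredictions;
}

// Sum of all entries, as appended below the table in the log-file.
inline BranchInfo totals(const BranchTable &table) {
    BranchInfo sum;
    for (const auto &kv : table) {
        sum.passes += kv.second.passes;
        sum.taken += kv.second.taken;
        sum.mispredictions += kv.second.mispredictions;
    }
    return sum;
}

// Percentage of `part` in `whole`; 0 when nothing was executed.
inline double percent(std::uint64_t part, std::uint64_t whole) {
    assert(part <= whole);
    return whole == 0 ? 0.0
                      : 100.0 * static_cast<double>(part) / static_cast<double>(whole);
}
```

The percentages of taken branches and of mispredictions would then be obtained as percent(sum.taken, sum.passes) and percent(sum.mispredictions, sum.passes) for the summed entry.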

The branch prediction tests can be enabled or disabled in the architecture configuration file. The BRANCH_PREDICTION variable describes the behavior of this test. Available variable values are similar to those used by dependency tests.

NOTHING value signifies not to perform the test. Branch information table is not created. LOG value means that the branch information table should be logged after the application simulation. When the EXCEPTION value is specified, every branch misprediction raises an exception, and the application is stopped at the point of the failed test. The branch information table is also logged.

3.5.4      Performance Monitoring

Performance monitoring is implemented to give the overall program execution information. The measured execution time is not as exact as the real application execution time in the original IA‑64 implementation. This is caused by the different technology of implementation and by the fact that Intel has not published the detailed internal structure of its microarchitecture. However, performance statistics computed by the simulator give a good approximation to the real application execution time.

During performance monitoring the simulator counts the processor cycles elapsed since the beginning of the application simulation. This value is stored in the AR[ITC] register. The application can read this register, but it cannot write into it. The current number of cycles is displayed during the simulation.

The simulator measures the execution time of each functional unit. Each time a functional unit executes an instruction, its execution time is increased by the number of cycles necessary to complete the instruction.

The total execution time and the execution times of the functional units are written into the log‑file. Besides the number of cycles, the list of functional units also contains their percentage of the total execution time.

These tests cannot be disabled and do not cause any exception or change in the control flow.

Besides the performance monitoring introduced above, the simulator can detect an attempt to issue an instruction to a functional unit that is already busy. The occurrence of this event is not a fault; it only causes an instruction issue split, which decreases performance.

The behavior of this test is managed by the OVERLOADED_UNIT variable in the simulator's configuration file. This variable can hold one of three values. The NOTHING value means that no unit overloading test is performed. The LOG value means that the unit overloading test is performed; when there is an attempt to issue an instruction to a unit already in use, a detailed description of the event is logged. The EXCEPTION value has almost the same meaning as the LOG value: when there is an attempt to issue an instruction to a unit which is in use, the description is logged, an exception is raised at the point of the attempt and the application is stopped.

3.5.5      Other Statistics

The simulator computes some other useful statistics. They cannot be disabled and they do not cause any exception or change in the control flow.

Besides the execution time, the simulator also counts the number of executed instructions. There is a counter for each functional unit. The appropriate counter is increased by one each time an instruction is dispersed to the functional unit. The sum of all functional unit counters gives the total number of executed instructions. The values of the counters and their sum are written into the log‑file along with their percentage of the total number of executed instructions.

The IA‑64 architecture does not allow all combinations of slot types and stops in a bundle. Because of this, compilers have to insert nop instructions into the compiled code. The simulator has a counter of executed nop instructions. The final value of this counter and its percentage of the total number of executed instructions are written into the log‑file.

The last statistic that the simulator provides is the count of memory accesses and unaligned memory accesses. There are several counters divided according to the operation type and the accessed data size. The operations are memory read and memory write. The watched data sizes are 1, 2, 4, 8, 10 and 16 bytes.

Besides the counting of memory reads and writes the simulator watches how many accesses of specified sizes were unaligned. The access is unaligned if the virtual address is not aligned to boundaries of the size of accessed data for 1‑byte, 2‑byte, 4‑byte, 8‑byte and 16‑byte accesses or if the virtual address is not aligned to 16‑byte boundaries for 10‑byte accesses.
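The alignment rule above can be written as a short C++ function. The name is_unaligned is hypothetical and the function is only a sketch of the rule as stated in the text, not the simulator's code.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when an access of `size` bytes at virtual address `addr`
// counts as unaligned under the rule described above.
inline bool is_unaligned(std::uint64_t addr, unsigned size) {
    assert(size == 1 || size == 2 || size == 4 || size == 8 ||
           size == 10 || size == 16);
    // 10-byte accesses are checked against a 16-byte boundary;
    // every other size is checked against its own size.
    const std::uint64_t boundary = (size == 10) ? 16 : size;
    return (addr & (boundary - 1)) != 0;
}
```

Since every boundary is a power of two, the test reduces to masking the low address bits. A 1‑byte access can therefore never be counted as unaligned.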

A small table of two columns is written into the log‑file. The first column contains the number of memory accesses of the specified type and size along with its percentage of total memory reads or writes. The second one contains the number of unaligned memory accesses according to the type and the size as defined for the column one.

3.6      Extending the Simulator

Although there is currently only one implementation of the IA‑64 architecture, and only one processor specification, the simulator was developed to make extension to future processors as easy as possible. Therefore, an architecture configuration file with many variables was created. In addition, several implementation‑specific algorithms were moved to a special source file, where they can be reimplemented.

3.6.1      Architecture Configuration File

The name of the architecture configuration file used for a simulation is given on the command line. The full command‑line format is described in chapter 4.3.

The syntax of the file is very simple. The file consists of variables, comments and blank lines. Comments can appear on separate lines only. A comment begins with two slashes and ends at the end of the line. Variables have the format VARIABLE_NAME = VALUE. A variable name is a sequence of alphanumeric characters and underscores. The name and the value are separated by spaces, tabulators or an equals sign. In almost all cases the variable values are numbers. The radix of a value depends on the variable; if no radix is mentioned in the variable description, the radix is 10. Variable values can be followed by a multiplier. The available multipliers are:

 

k = 1 000

K = 1 024

m = (1 000)^2 = 1 000 000

M = (1 024)^2 = 1 048 576

g = (1 000)^3 = 1 000 000 000

G = (1 024)^3 = 1 073 741 824
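The syntax described above can be illustrated by a small C++ sketch that splits one configuration line into a name and a value and applies the multiplier. The functions parse_line and parse_value are hypothetical helpers, not the simulator's parser. The sketch assumes that the multiplier immediately follows the number and does not handle the backslash line continuation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Splits "NAME = VALUE" into its parts. Returns false for blank lines,
// comment lines (starting with "//") and lines without a value.
inline bool parse_line(const std::string &line, std::string &name, std::string &value) {
    const std::string seps = " \t=";
    const std::size_t begin = line.find_first_not_of(" \t");
    if (begin == std::string::npos) return false;           // blank line
    if (line.compare(begin, 2, "//") == 0) return false;    // comment line
    const std::size_t name_end = line.find_first_of(seps, begin);
    if (name_end == std::string::npos) return false;        // no separator
    const std::size_t value_begin = line.find_first_not_of(seps, name_end);
    if (value_begin == std::string::npos) return false;     // no value
    const std::size_t value_end = line.find_last_not_of(" \t\r");
    name = line.substr(begin, name_end - begin);
    value = line.substr(value_begin, value_end - value_begin + 1);
    return true;
}

// Converts "NUMBER[multiplier]" into a number. The radix is 10 unless
// the variable description says otherwise.
inline std::uint64_t parse_value(const std::string &text, int radix = 10) {
    assert(!text.empty());
    std::string digits = text;
    std::uint64_t factor = 1;
    switch (text.back()) {
        case 'k': factor = 1000ULL; break;
        case 'K': factor = 1024ULL; break;
        case 'm': factor = 1000ULL * 1000ULL; break;
        case 'M': factor = 1024ULL * 1024ULL; break;
        case 'g': factor = 1000ULL * 1000ULL * 1000ULL; break;
        case 'G': factor = 1024ULL * 1024ULL * 1024ULL; break;
        default: break;                                     // no multiplier
    }
    if (factor != 1) digits.pop_back();                     // strip the multiplier
    return std::stoull(digits, nullptr, radix) * factor;    // throws on malformed input
}
```

For instance, a line such as "MEMORY_SIZE = 64M" would be split into the name MEMORY_SIZE and the value 64M, which parse_value converts to 64 × 1 048 576 bytes. None of the multiplier letters is a hexadecimal digit, so the same function can also be used with radix 16.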

 

RESTRICT_XX variable value has an extended syntax. This syntax is described within the variable description.

The following is a list of the simulator's variables. Any other variable appearing in the configuration file is not allowed and causes the program to exit.

·        PROCESSOR_FREQUENCY – Defines the processor frequency. This value is used for computing the number of processor cycles necessary to access the main memory, because the memory bandwidth is specified in Bps.

·        DISPERSAL_WINDOW_SIZE – Defines the size of the dispersal window in bundles.

·        CPUID_COUNT – The number of CPUID registers. This number can differ according to the architecture implementation.

·        CPUIDx – Defines the contents of a CPUID register. The x stands for a number in the range <0, CPUID_COUNT – 1>. These variables can define different processor identifications in different architecture configuration files. This can be useful for applications whose behavior depends on the processor type. Values defined for this variable are treated as 8‑byte hexadecimal numbers.

·        I_UNITS – The number of the integer units implemented in the processor.

·        M_UNITS – The number of the memory units implemented in the processor.

·        B_UNITS – The number of the branch units implemented in the processor.

·        F_UNITS – The number of the floating‑point units implemented in the processor.

·        ALAT_SIZE – This variable defines the number of records the ALAT table contains.

·        MEMORY_SIZE – Defines the size of the main memory in Bytes. The memory of this size is created at the program start‑up and cannot be enlarged during the program execution.

·        MEMORY_BANDWITH – The variable defines the memory bandwidth in Bps. Therefore, the main memory access time depends on the accessed data size and the processor frequency.

·        MEMORY_CACHE_LEVELS – Defines the number of the levels in cache hierarchy. A lot of options can be set for each cache level. These options begin with Ln prefix where n stands for the number of cache level in the range <0, MEMORY_CACHE_LEVELS – 1>. The first cache index (L1 cache) is zero.

·        Ln_LINE_COUNT – Defines the number of cache lines. This option also defines that Ln cache is not divided into instruction part and data part of the cache. This variable cannot be used with Ln_I_… and Ln_D_… variables.

·        Ln_I_LINE_COUNT – This variable also specifies the number of cache lines, but its use declares that the Ln cache is not unified. The number of lines is specified for the Ln instruction cache. This variable cannot be used with Ln_LINE_COUNT and Ln_LINE_SIZE.

·        Ln_D_LINE_COUNT – This variable has almost the same meaning as the previous one; it defines the number of Ln data cache lines. This variable cannot be used with Ln_LINE_COUNT and Ln_LINE_SIZE.

·        Ln_LINE_SIZE – Defines the size of a cache line in bytes. When this variable is used, the Ln cache is treated as a unified cache. This variable cannot be used with Ln_I_… and Ln_D_… variables.

·        Ln_I_LINE_SIZE – This variable defines the size of Ln instruction cache line. When this variable is used, the cache is divided into the instruction part and the data part of a cache. This variable cannot be used with Ln_LINE_COUNT and Ln_LINE_SIZE.

·        Ln_D_LINE_SIZE – This variable is similar to Ln_I_LINE_SIZE but it defines the size of cache line of the data part of Ln cache level. This variable cannot be used with Ln_LINE_COUNT and Ln_LINE_SIZE.

·        Ln_INT_READ_LATENCY – Defines the number of cycles of an integer load when an Ln cache hit occurs. All read and write latencies are the same for both the instruction and the data part of a cache.

·        Ln_FLOAT_READ_LATENCY – Defines the number of cycles of a floating‑point load when the Ln cache is hit.

·        Ln_INT_WRITE_LATENCY – Defines the number of cycles of an integer store when the Ln cache hit occurs.

·        Ln_FLOAT_WRITE_LATENCY – Defines the number of the cycles of a floating‑point store when the Ln cache hit occurs.

·        VHPT_READ_LATENCY – This variable contains the number of cycles of a VHPT access. This latency is applied when a TLB entry is not found and a translation record is read from VHPT.

·        TLB_LEVELS – This variable specifies the number of TLB levels. The use of this variable supposes that all levels of TLB are unified for both instruction and data translation records. This variable cannot be used with DTLB_LEVELS and ITLB_LEVELS variables.

·        DTLB_LEVELS – This variable specifies the number of Data TLB levels. When this variable is used the TLB is divided into instruction and data part for all levels. Data TLB and Instruction TLB can have different number of levels. This variable cannot be used with TLB_LEVELS.

·        ITLB_LEVELS – This variable specifies the number of Instruction TLB levels. The other restrictions are the same as defined in DTLB_LEVELS.

·        Ln_TLB_SIZE – The number of records of the level n TLB structure. It can be used when the number of levels is defined for unified TLBs.

·        Ln_DTLB_SIZE – The number of records of the level n Data TLB structure. It can be used when the number of levels is defined for Instruction TLBs and Data TLBs.

·        Ln_ITLB_SIZE – The number of records of the level n Instruction TLB structure. It can be used when the number of levels is defined for Instruction TLBs and Data TLBs.

·        Ln_TLB_READ_LATENCY – The duration of the level n TLB reads in cycles. It can be used when the number of the levels is defined for unified TLBs.

·        Ln_DTLB_READ_LATENCY – The duration of the level n Data TLB reads. It can be used when the number of levels is defined for separated Data and Instruction TLBs.

·        Ln_ITLB_READ_LATENCY – The duration of the level n Instruction TLB reads. It can be used when the number of the levels is defined for separated Data and Instruction TLBs.

·        BPT_SIZE – The size of the Branch Prediction Table.

·        MBPT_SIZE – The size of the Multiway Branch Prediction Table.

·        TAC_SIZE – The size of the Target Address Cache.

·        TAR_SIZE – The size of the Target Address Register.

·        RESTRICT_un – Defines restrictions on instruction execution for the unit un. The letter u stands for the functional unit type and the letter n for the unit number. The available values are I, M, B and F for the unit type and a number from the interval <0, u_UNITS – 1> for the unit number. The variable value is a list of encodings of instructions which cannot be executed by the specified functional unit. Entries in the list are separated by space or tabulator characters. Each entry defines bits 27 through 40 of the instruction encoding in binary representation. An entry consists of the digits zero and one, asterisks and dots. The digits zero and one define zero and one bits. An asterisk is used when the encoding can contain either zero or one at that position. Dots are ignored; they are useful for visually separating groups of bits. Bits are written from the most significant one, i.e. bit 40 is the first bit of a record. An instruction encoding can look like “0001.**.1*0*.****”. In this record bits 40, 39, 38 and 32 are set to zero, bits 37 and 34 are set to one and bits 36, 35, 33 and 31 through 27 can have an arbitrary value. Each record must define all 14 bits of the encoding. If a line is too long, a backslash can be used to break the line and continue the records on the next one.

·        FORCE_ubs – This option assigns a particular functional unit to the u‑type instruction residing in slot s of bundle b. u is the instruction type, which can be I, M, B or F; b is the index of an instruction bundle in the dispersal window; s is the number of the slot where the instruction resides. The value of this variable specifies the number of the unit to which the instructions are to be dispersed. A FORCE definition can look like “FORCE_F02 = 0”, which means that floating‑point instructions appearing in the third slot (slot 2) of the first bundle (bundle 0) of the dispersal window are to be dispersed to the F0 unit.
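The record format of the RESTRICT_un variable described in the list above can be sketched in C++ as a mask and value pair. The RestrictPattern type and the compile_record and matches functions are hypothetical names used for illustration only; how the simulator stores these records internally is not stated in this text.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical compiled form of one RESTRICT_un record.
struct RestrictPattern {
    std::uint16_t mask = 0;   // 1 where the record fixes a bit (0 or 1)
    std::uint16_t value = 0;  // required bit values at the fixed positions
};

// Compiles a record such as "0001.**.1*0*.****". The first digit describes
// bit 40 and the last one bit 27; dots are skipped.
inline RestrictPattern compile_record(const std::string &record) {
    RestrictPattern p;
    int bits = 0;
    for (char c : record) {
        if (c == '.') continue;
        assert(c == '0' || c == '1' || c == '*');
        p.mask = static_cast<std::uint16_t>((p.mask << 1) | (c == '*' ? 0 : 1));
        p.value = static_cast<std::uint16_t>((p.value << 1) | (c == '1' ? 1 : 0));
        ++bits;
    }
    assert(bits == 14);       // a record must define exactly 14 bits
    return p;
}

// True when bits 27 through 40 of an instruction encoding match the record.
inline bool matches(const RestrictPattern &p, std::uint64_t instruction) {
    const std::uint16_t field =
        static_cast<std::uint16_t>((instruction >> 27) & 0x3FFF);
    return (field & p.mask) == p.value;
}
```

With this representation an instruction would be rejected by a unit when it matches any record of that unit's list. Asterisk positions are cleared in the mask, so they are ignored in the comparison.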

3.6.2      Architecture Specific Algorithms

There are several algorithms which cannot simply be configured like the previous features. These algorithms are placed in a special source file. Generally, a function implementing such an algorithm receives all of the necessary data as parameters; however, it can also use objects and functions of the whole simulator.

These functions are placed in the “archspec.cpp” file. All user‑definable functions are listed here:

 

Function: cache_accessed

Declaration:   void cache_accessed(Cache_data *cd, bool write)

This function is called when a memory cache is accessed. Code updating the information used for managing cache lines can be placed here. The current implementation increases the age field of each line by one. cd is a pointer to the cache data structure. The parameter write is true when the cache is going to be written. The Cache_data structure is defined in the memory.h file.

 

Function: cache_find_line

Declaration: dword cache_find_line(Cache_data *cd, qword source_line, bool non_temporal)

This function should search for the cache line containing the data at the physical address source_line and return the index of that cache line. If the appropriate cache line is not present, the function should find the place where this line should be written. The parameter non_temporal defines whether the line should be placed in the temporal or the non‑temporal structures of the cache. Since the current implementation of the IA‑64 architecture does not distinguish between these structures, the parameter is not used now.
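How the two cache functions can cooperate is shown by the following C++ sketch of an age‑based replacement policy. The Cache_data structure from memory.h is not reproduced in this text, so the sketch uses the hypothetical stand‑ins SketchLine and SketchCache. Resetting the age on a hit and preferring invalid lines as victims are assumptions of this sketch, not documented behavior of the simulator.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the structures from memory.h.
struct SketchLine {
    bool valid = false;
    std::uint64_t source_line = 0;  // physical address of the cached line
    std::uint32_t age = 0;          // accesses since this line was last used
};

struct SketchCache {
    std::vector<SketchLine> lines;
};

// Counterpart of cache_accessed: every access makes all lines one step older.
inline void sketch_cache_accessed(SketchCache &cd) {
    for (SketchLine &l : cd.lines) ++l.age;
}

// Counterpart of cache_find_line: returns the index of the line holding
// source_line, or else the index of a free line or of the oldest line.
inline std::uint32_t sketch_cache_find_line(SketchCache &cd, std::uint64_t source_line) {
    assert(!cd.lines.empty());
    std::uint32_t victim = 0;
    for (std::uint32_t i = 0; i < cd.lines.size(); ++i) {
        SketchLine &l = cd.lines[i];
        if (l.valid && l.source_line == source_line) {
            l.age = 0;              // hit: the line is fresh again (assumption)
            return i;
        }
        const SketchLine &v = cd.lines[victim];
        // Prefer an invalid line; otherwise prefer the older one.
        if (v.valid && (!l.valid || l.age > v.age)) victim = i;
    }
    SketchLine &slot = cd.lines[victim];
    slot.valid = true;              // miss: the victim is replaced
    slot.source_line = source_line;
    slot.age = 0;
    return victim;
}
```

A different replacement policy for a future processor could be obtained by changing only the victim selection, which is the purpose of keeping such functions in a separate source file.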

 

Function: alat_accessed

Declaration:   void alat_accessed(ALAT *alat, bool write)

This function is called when the ALAT table is accessed. It can be used for updating the information managing the ALAT entries such as aging information. alat is a pointer to the ALAT table. Parameter write is true when the ALAT entry is going to be written. The ALAT table structure is defined in the processor.h file. The description of the ALAT table and ALAT entries is also in the chapter 3.4.5.

 

Function: alat_find_entry

Declaration: dword alat_find_entry(ALAT *alat, byte rtype, word reg)

This function should search the ALAT entry of the register type rtype and index reg and return the index of found entry. If the appropriate entry is not present, the function should look for a new place where this line should be placed. Register type constants are defined in common.h file. However, a user need not know the register type values.

 

Function: tlb_accessed

Declaration:   void tlb_accessed(TLB **tlb, dword tlb_levels)

This function is called when the TLB is accessed. tlb is a pointer to array of TLB levels; tlb_levels defines the number of levels. The TLB structure and translation record are defined in the processor.h file. A brief description is also given in the chapter 3.4.4.3.

 

Function: tlb_find_entry

Declaration: dword tlb_find_entry(TLB *tlb, Translation_record *tr)

This function should search a translation record in the TLB table and return the index of the found entry. If the appropriate record is not present, the function should look for a new place where this record should be placed. tlb points to a TLB level; tr is searched or newly inserted translation record.

 

Function: tac_accessed

Declaration:   void tac_accessed(TAC *tac)

This function is called when the TAC structure is accessed. tac is a pointer to TAC entries. The number of TAC entries is given by the global tac_size variable. TAC structure is defined in the processor.h file. A brief description is also given in chapter 3.4.6.

 

Function: find_tac_entry

Declaration: dword find_tac_entry(TAC *tac, qword src_ip, byte slot)

This function should search the TAC entry of specified parameters and return the index of found entry. If the appropriate record is not present, the function should look for a new place where this entry should be placed. tac points to the TAC structure; src_ip is the source address of a branch instruction; slot is the slot position in bundle of a branch instruction. The present TAC structure contains one TAC entry per bundle at most. That is why the slot parameter is not used.

 

Function: tar_accessed

Declaration:   void tar_accessed(TAR *tar)

This function is called when the TAR structure is accessed. tar is a pointer to an array of TAR entries. The number of the TAR entries is given by the global tar_size variable. The TAR structure is defined in the processor.h file. A brief description is also given in chapter 3.4.6.

 

Function: find_tar_entry

Declaration: dword find_tar_entry(TAR *tar, qword src_ip, byte slot)

This function should search for the TAR entry with the specified parameters and return the index of the entry found. If no such entry is present, the function should look for a place where the entry should be inserted. tar points to the TAR structure; src_ip is the source address of the branch instruction; slot is the slot position of the branch instruction in its bundle. Currently there can be at most one TAR entry per bundle, which is why the slot parameter is not used.

 

Function: predict_branch

Declaration: dword predict_branch(BPT *bpt, qword src_ip, bool dynamic_hint, bool taken_hint, bool clr, bool taken, byte btype, bool &prediction)

The predict_branch function is called to determine whether the real processor would predict the branch as taken or not taken. In addition, this function should update the BPT table and the TAC and TAR entries according to the algorithm used. To find or create the TAC and TAR entries, the find_and_create_tac_entry and find_and_create_tar_entry functions can be called. The function should return the number of cycles necessary to perform the branch. bpt is a pointer to the BPT table; src_ip is the virtual address of the branch instruction. If dynamic_hint is true, dynamic prediction should be used. If taken_hint is true and dynamic_hint is false, the branch should be predicted taken. Together, dynamic_hint and taken_hint form the Whether hint defined by the instruction completer. clr stands for the clr hint in brp instructions. taken is true if the branch is actually taken, which depends on the branch predicate register and the branch type. btype defines the type of the branch; branch type constants are defined in the “common.h” file. The prediction parameter should be filled with the result of the prediction.
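A minimal sketch of such a function is given below, assuming a textbook scheme with one 2-bit saturating counter per table entry. The BPT layout, the table indexing, and the two cycle constants are hypothetical and are not taken from the simulator; the TAC and TAR updates are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

typedef std::uint8_t  byte;
typedef std::uint32_t dword;
typedef std::uint64_t qword;

// Hypothetical branch prediction table: one 2-bit saturating counter
// per entry (0, 1 = predict not taken; 2, 3 = predict taken).
struct BPT {
    std::vector<byte> counters;
};

// Assumed cycle costs; real values depend on the modelled processor.
const dword CYCLES_PREDICTED    = 1;
const dword CYCLES_MISPREDICTED = 9;

dword predict_branch(BPT *bpt, qword src_ip, bool dynamic_hint, bool taken_hint,
                     bool clr, bool taken, byte btype, bool &prediction)
{
    (void)clr; (void)btype;   // not used by this simple scheme
    assert(bpt != nullptr && !bpt->counters.empty());

    // Bundles are 16 bytes long, so the low four address bits are dropped
    // before the address is folded into a table index.
    byte &counter = bpt->counters[(src_ip >> 4) % bpt->counters.size()];

    if (dynamic_hint)
        prediction = counter >= 2;   // dynamic prediction from the counter
    else
        prediction = taken_hint;     // static prediction from the hint

    // Train the counter with the real outcome of the branch.
    if (taken && counter < 3) ++counter;
    if (!taken && counter > 0) --counter;

    return prediction == taken ? CYCLES_PREDICTED : CYCLES_MISPREDICTED;
}
```

The counter is trained even for statically hinted branches here; whether a real implementation does so is a design choice this sketch does not try to settle.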

 

Function: predict_multiway_branch

Declaration: dword predict_multiway_branch(MBPT *mbpt, qword src_ip, byte slot, bool dynamic_hint, bool taken_hint, bool clr, bool taken, byte btype, bool &prediction)

This function is used for the prediction of multiway branches. It differs from the previous function only in that it has the slot parameter and takes a pointer to the MBPT table instead of the BPT table. slot is the slot number of the branch instruction.

3.6.3      Instruction Definition

A new implementation of the IA‑64 architecture may define new instructions. Implementing a new instruction in the simulator proceeds in several steps.

The processor.cpp file contains the list of all instructions (instruction_source_info). A new entry should be inserted into this list, or an existing entry should be updated. List entries consist of the following fields:

·        slot_type – the slot type of the instruction. The following constants can be used in this field: A_UNIT, I_UNIT, M_UNIT, B_UNIT, F_UNIT, and X_UNIT.

·        inst_bits – the instruction encoding definition. An encoding is defined as a string describing instruction bits <40, 27>. The string syntax is the same as for the RESTRICT_un variable in the architecture configuration file (for example “0001.**.1*0*.1.111”). A full description of this syntax is in chapter 3.6.1.

·        inst_code – a member pointer to the function implementing the instruction. These functions are implemented as processor methods in the inst_code.cpp file. Each method takes one qword parameter. For the A, I, M, B and F slot types, this parameter contains the 41‑bit contents of the instruction slot; for the LX slot type, it contains two 32‑bit pointers to 41‑bit slots (the least significant dword contains a pointer to slot 2 of the bundle).

·        inst_text – a string containing the instruction text definition. This text is shown by the debugger in the Disassembly pane. The syntax of this field is somewhat involved and is described below.

·        unit_type – the type of unit that should process the instruction. This field should be defined for LX slots only; in other cases the value is omitted. The available values are I_UNIT, M_UNIT, B_UNIT, and F_UNIT.
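To make the inst_bits field more concrete, the following sketch checks a 41‑bit slot against an encoding string. The function name is hypothetical, and the meaning of the pattern characters is an assumption made for illustration ('0' and '1' must match, '*' matches either value, '.' only separates groups); the authoritative syntax is the one given in chapter 3.6.1.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

typedef std::uint64_t qword;

// Returns true if bits <40, 27> of the 41-bit slot match the pattern.
// Assumed syntax: '0' and '1' must match the bit, '*' matches either
// value, '.' is a separator that is skipped.
bool matches_inst_bits(qword slot, const std::string &pattern)
{
    int bit = 40;                        // the pattern starts at bit 40
    for (char c : pattern) {
        if (c == '.') continue;          // separator, no bit consumed
        assert(bit >= 27 && (c == '0' || c == '1' || c == '*'));
        bool value = (slot >> bit) & 1;
        if ((c == '0' && value) || (c == '1' && !value))
            return false;                // a fixed bit differs
        --bit;
    }
    return true;
}
```

With the example string “0001.**.1*0*.1.111” the fourteen pattern characters cover exactly bits 40 down to 27, so a decoder could walk the instruction list and pick the first entry whose pattern matches.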

An instruction text can contain plain text and expressions similar to printf format specifications. These expressions are evaluated according to the current state.

An expression begins with the “%” character and ends with a d, x, or s character. d signifies that the result of the expression is a number and should be displayed in decimal form. x signifies that the result is also a number, but it should be displayed in hexadecimal form. s means that the result of the expression is a string.

Standard operators are defined for expression arithmetic. They are listed below in order of priority, highest first.

·        (x) – parenthesis

·        -x – unary minus

·        x se y – sign extension of the y‑bit number x to 64 bits

·        x << y, x >> y – shift left, shift right by y bits

·        x * y, x / y – multiplication, division

·        x + y, x - y – addition, subtraction

·        x & y – binary and

·        x | y – binary or

·        x == y, x != y, x <= y, x >= y, x < y, x > y – relational operators

·        x ? y : z – conditional expression. If the result of expression x is non‑zero, y expression is evaluated, otherwise z expression is evaluated.

x, y and z can be expressions, numbers, strings, bit definitions, and the IP address. Numbers are written in decimal form (123) or in hexadecimal form ending with the “H” character (4CH). Strings are enclosed in apostrophes ('ahoj'). The plus operator concatenates two strings. It can also take mixed number and string operands; in that case the number is converted to a string and the operands are concatenated. The other operators work with numbers only.

Instruction encoding bits can be used in an expression. The bx token yields bit x of the encoding and the bx_y token yields bits <x, y) of the encoding; bit y is excluded and x has to be less than y. The ip token stands for the address of the bundle.

Here are some examples of instruction texts:

 

add r%b6_13d = r%b13_20d, r%b20_27d

mf%b27_31 == 2 ? '' : '.a's

break.b %b36 << 20 + b6_26x

 

The add instruction takes its general register numbers from bits <6, 13), <13, 20) and <20, 27). The resulting text looks like add r3 = r34, r56. The mf instruction decides whether the '.a' completer is present according to bits <27, 31), so the result is either “mf” or “mf.a”. The parameter of the break instruction is computed as bit 36 shifted left by 20 bits plus bits <6, 26).
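The arithmetic behind these examples can be sketched in a few lines of C++. The helper names below are hypothetical and are not part of the simulator; they only mirror the bx_y token, the se operator, and the add and break.b examples above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

typedef std::uint64_t qword;

// Value of the bx_y token: bits <x, y) of the encoding, bit y excluded.
qword bits(qword encoding, unsigned x, unsigned y)
{
    assert(x < y && y <= 64);
    unsigned width = y - x;
    qword mask = width == 64 ? ~0ULL : ((1ULL << width) - 1);
    return (encoding >> x) & mask;
}

// Value of "x se y": sign extension of the y-bit number x to 64 bits.
qword sign_extend(qword x, unsigned y)
{
    assert(y >= 1 && y <= 64);
    if (y == 64) return x;
    qword sign = 1ULL << (y - 1);
    x &= (sign << 1) - 1;        // keep only the low y bits
    return (x ^ sign) - sign;    // propagate the sign bit upwards
}

// Text produced by "add r%b6_13d = r%b13_20d, r%b20_27d".
std::string add_text(qword encoding)
{
    return "add r" + std::to_string(bits(encoding, 6, 13)) +
           " = r"  + std::to_string(bits(encoding, 13, 20)) +
           ", r"   + std::to_string(bits(encoding, 20, 27));
}

// Number printed by "break.b %b36 << 20 + b6_26x"; the shift binds
// tighter than the addition, as in the operator priority list.
qword break_immediate(qword encoding)
{
    return (bits(encoding, 36, 37) << 20) + bits(encoding, 6, 26);
}
```

For an encoding with 3, 34 and 56 stored in the three register fields, add_text returns the string add r3 = r34, r56 quoted above.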

4        User Guide

4.1      Diskette Contents

The diskette enclosed with this text contains all the files necessary to run the simulator: the simulator executable and examples of IA‑64 applications. To compile IA‑64 applications, a standard library, start‑up code, header files and makefiles are provided. The diskette also contains the source code of all these programs.

The files are listed below:

·        IA64Emu.exe – simulator program

·        ia64emu.cfg – configuration file

·        ia64emu.arc – architecture configuration file

·        Programs\ – examples of simulated programs

·        Programs\*.S – source code of IA‑64 applications written in assembly language

·        Programs\*.c – source code of IA‑64 applications written in C language

·        Programs\*.exe – IA‑64 Windows applications

·        Programs\*. – IA‑64 Linux applications

·        Programs\makec.bat, Programs\makeasm.bat – batch files for the compilation of Windows applications

·        Programs\makec, Programs\makeasm – scripts for the compilation of Linux applications

·        Programs\Lib – simulator's standard library, start‑up code, and their object files for the compilation of Windows applications

·        Programs\Linux-Lib – simulator's standard library, start‑up code, and their object files for the compilation of Linux applications

·        Source\*.h, Source\*.cpp – source files of simulator

·        Source\*.dsw, Source\*.dsp – Visual C++ workspace and project files

4.2      Installation and IA‑64 Application Compilation

The diskette contents can simply be copied to any folder on the hard drive. The simulator can also be launched from the diskette.

To compile your own IA‑64 application, either the NUE environment in Linux or the Microsoft Platform SDK in Microsoft Windows has to be installed. Two make scripts for Linux and two for Windows are on the diskette to simplify the compilation. The makec script is used to compile C applications in the NUE environment. The makeasm script is used to compile applications written in assembly language in the same environment. The makec.bat and makeasm.bat files are the analogous scripts for compiling applications in Windows.

Linux users first have to start the NUE environment. The NUE environment is distributed in the form of RPM packages and can be downloaded from the Intel or Hewlett‑Packard web pages. Then the make script can be run with the name of the application as a parameter. The name of the application has to be entered without an extension. The make script has to be launched from the directory where it is placed because of the path to the standard library and the start‑up code. Otherwise this path has to be changed in the makec or makeasm file.

Windows users have to start the “Win64 Pre‑release Build Environment”. The Platform SDK installation program installs the compiler and creates a Start menu item with this name. As on Linux, the makec.bat or makeasm.bat script can then be run with the name of the application as a parameter. The name of the application has to be entered without an extension. The make script has to be launched from the directory where it is placed because of the path to the standard library and the start‑up code. Otherwise this path has to be changed in the makec.bat or makeasm.bat file.

4.3      Simulator's Interface

This subchapter describes the appearance and control of the simulator. The appearance was inspired by the NUE simulator by Hewlett‑Packard, so the two simulators look somewhat similar.

The program should be launched from the command line. Several arguments can be given to specify the name of the simulated application, the log file and the configuration files. The command‑line syntax is

 

ia64emu.exe exe=file‑name [log=file‑name] [arc=file‑name] [cfg=file‑name]

 

The “exe” option defines the file name of the simulated application. This argument is mandatory. The “log” option defines the name of the log file; it is optional and defaults to ia64emu.log. The “arc” option defines the name of the architecture configuration file; it is also optional and defaults to ia64emu.arc. The “cfg” option defines the simulator configuration file and defaults to ia64emu.cfg.
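The option handling described above can be sketched as follows. The parse_args helper is hypothetical and only illustrates the key=value syntax and the documented defaults; rejecting unknown options is an assumption of this sketch.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: parses "key=value" arguments and fills in the
// documented defaults. Returns false if the mandatory exe option is
// missing or if an argument is malformed or unknown (an assumption).
bool parse_args(const std::vector<std::string> &args,
                std::map<std::string, std::string> &options)
{
    options.clear();
    options["log"] = "ia64emu.log";
    options["arc"] = "ia64emu.arc";
    options["cfg"] = "ia64emu.cfg";

    for (const std::string &arg : args) {
        std::string::size_type eq = arg.find('=');
        if (eq == std::string::npos) return false;      // no '=' present
        std::string key = arg.substr(0, eq);
        if (key != "exe" && key != "log" && key != "arc" && key != "cfg")
            return false;                               // unknown option
        options[key] = arg.substr(eq + 1);
    }
    return options.count("exe") == 1 && !options["exe"].empty();
}
```

Called with the arguments exe=test.exe and log=run.log, the sketch keeps the default names ia64emu.arc and ia64emu.cfg for the two configuration files.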

IA64Emu consists of a single window divided into four panes. These panes show register values, memory contents, the application's instructions, and the command line. The user controls the simulator with a number of hot keys and commands. The Tab key switches among the panes. Pane sizes and positions in the main window are fixed and cannot be changed. The mouse is not supported; all features are accessible from the keyboard.

The register pane shows the current values of processor registers. Because of the large number of IA‑64 registers, only one type of register is shown at a time. The register type shown can be selected on the command line. Standard keys are used to navigate through this pane.

The memory dump pane shows memory contents. This pane works in four modes: one‑byte, two‑byte, four‑byte, and eight‑byte views. The mode is changed by commands. Standard keys (arrows, page up, page down) are used to scroll this pane. The “Home” key jumps to the system environment address.

The disassembly pane shows the application's instructions and part of the application's execution state. Each line in this pane contains the address of a bundle (shown for the first slot in a bundle only), the instruction text, including the predicate register number if it is different from PR0, and the bundle type (also for the first slot in a bundle only). In addition, several characters can appear next to the instruction text during program execution. The “|” character indicates bundles fetched into the dispersal window. The “>” character marks the instructions currently in process. The “#” character marks the instructions already processed within the current instruction group. When an instruction is marked “in process”, the number of clock ticks necessary to complete the instruction is shown to the right of the instruction text.

Some information is provided in the pane's caption. It lists all functional units of the processor; for example, the “ii mm bbb ff” string is shown for Itanium. A unit that is in use is represented by a capital letter, otherwise by a small letter. The caption also shows the total execution time in processor clock ticks.

To move inside the disassembly pane, the arrow keys, page up and page down can be used. The Home key moves the cursor to the program entry point. Several hot keys make control easier:

·        F5 – Starts the simulation. This hot key can be used in any pane.

·        F10 – Processes one step of the simulation. This hot key can be used in any pane too.

·        F9 – Inserts breakpoint at the address defined by the cursor.

·        Ctrl+F9 – Enables or disables current breakpoint.

The command‑line pane provides an input line for typing simulator commands. Unlike in the other panes, the up and down arrow keys browse the command‑line history. The following commands are supported by the simulator:

·        run – Starts the program execution. The execution can be broken by any key press, breakpoint, or processor exception.

·        step – Processes one step of the simulation (execution of an instruction, loading new bundles into a dispersal window or finishing the currently executing instructions).

·        reset – Restarts the simulated application.

·        brk ins[ert] address [slot] – Inserts a breakpoint at the specified address and slot.

·        brk rem[ove] address [slot] – Removes the breakpoint from the specified address and slot.

·        brk en[able] address [slot] – Enables the breakpoint.

·        brk en[able] all – Enables all breakpoints globally. This command does not change the individual breakpoint settings; only breakpoints that are individually enabled become active.

·        brk dis[able] address [slot] – Disables the breakpoint.

·        brk dis[able] all – Disables all breakpoints (this command does not affect the individual breakpoint settings).

·        brk list – Lists all breakpoints set by the user.

·        view gr – Switches the register pane to the general registers view.

·        view phgr – Switches the register pane to the physical general registers view. Register rotation and the register stack frame configuration are not taken into account in this view.

·        view fr – Switches the register pane to the floating‑point registers view.

·        view pr (view br) – Switches the register pane to the predicate registers and the branch registers view.

·        view ar – Views application registers including the detailed description of RSC and PFS registers.

·        view other – Views IP and CFM registers.

·        view 1 or view byte – Switches memory pane to one-byte view.

·        view 2 or view word – Switches memory pane to two-byte view.

·        view 4 or view dword – Switches memory pane to four-byte view.

·        view 8 or view qword – Switches memory pane to eight-byte view.

·        view address value – Sets the start address of the memory pane to value.

·        set register value – Sets the given register to value, e.g. set r1 10, set ar.bsp 0, set pr10 1.

·        set {byte | word | dword | qword} address value – Writes value to the specified number of bytes at address.

·        log exec – Starts logging each executed instruction.

·        log exec off – Stops logging of executed instructions.

·        exit – Exits the simulator.

The address and value parameters are treated as hexadecimal numbers.
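As a rough illustration of how such commands might be parsed, the sketch below handles the brk address forms. All names are hypothetical. The ins[ert] notation is assumed to accept the mandatory part followed by any prefix of the bracketed part, the slot is assumed to default to 0, and the all and list forms are left out.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

typedef std::uint64_t qword;

// Matches a keyword written as ins[ert]: assumed to accept the mandatory
// part followed by any prefix of the optional part.
bool keyword_matches(const std::string &word,
                     const std::string &mandatory, const std::string &optional)
{
    const std::string full = mandatory + optional;
    return word.size() >= mandatory.size() &&
           word.size() <= full.size() &&
           full.compare(0, word.size(), word) == 0;
}

struct BrkCommand {
    std::string action;   // "insert", "remove", "enable" or "disable"
    qword address;        // parsed as a hexadecimal number
    unsigned slot;        // assumed to default to 0 when omitted
};

// Parses "brk <action> address [slot]".
bool parse_brk(const std::string &line, BrkCommand &cmd)
{
    std::istringstream in(line);
    std::string name, action;
    if (!(in >> name >> action) || name != "brk") return false;

    if      (keyword_matches(action, "ins", "ert"))  cmd.action = "insert";
    else if (keyword_matches(action, "rem", "ove"))  cmd.action = "remove";
    else if (keyword_matches(action, "en", "able"))  cmd.action = "enable";
    else if (keyword_matches(action, "dis", "able")) cmd.action = "disable";
    else return false;

    if (!(in >> std::hex >> cmd.address)) return false;

    cmd.slot = 0;
    std::string rest;
    if (in >> rest) {     // a bundle has slots 0, 1 and 2
        if (rest.size() != 1 || rest[0] < '0' || rest[0] > '2') return false;
        cmd.slot = static_cast<unsigned>(rest[0] - '0');
    }
    return true;
}
```

For the input brk ins 4000 1 the sketch yields the action insert, the address 4000 hexadecimal, and slot 1.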

5        Related Work

A few other simulators can be found on the Internet. There are some differences between those simulators and IA64Emu; in particular, none of them provides code testing as this simulator does. This chapter gives a brief description of each simulator.

5.1      Linux Developer's Kit

The Linux Developer's Kit was created by Hewlett‑Packard. This kit simulates the general IA‑64 architecture and does not support Itanium‑specific features. The packages contain the Ski simulator, the NUE (Native User Environment) environment allowing whole‑system simulation, and a Linux root file system containing native IA‑64 applications, for example the standard Linux shell, utilities, compiler, assembler, and linker.

There are two modes of simulation – user mode and system mode. User‑mode simulation can execute application‑level instructions only. It is faster than system mode but it does not support some features such as multi‑threading. User‑mode application simulation is similar to IA64Emu. The application can use standard system calls, and there is no need to call special ones as designed in IA64Emu. System calls are also translated to the host operating system.

System mode supports all of the IA‑64 features. The IA‑64 Linux kernel is loaded and applications execute along with this kernel. This mode of simulation is used by the NUE environment. Within the NUE environment the standard Linux shell is simulated. The user can work with this system as with a real IA‑64 Linux.

To debug IA‑64 applications, the Ski user interface has been developed. The Ski interface is similar to the IA64Emu interface (to be exact, the IA64Emu interface is similar to the Ski interface). It provides a full‑screen debugger with several panes showing memory contents, registers and the disassembled application. Unlike in IA64Emu, the cursor cannot move among the panes and the panes cannot be scrolled. Hot keys are not supported and all features have to be controlled from the command line. The Ski command line gives access to all of the standard debugger features such as setting breakpoints, watching memory contents and register values, and tracing a program.

The simulator measures the program execution time and counts the number of instructions. Since Ski does not implement processor microarchitecture features such as the dispersal window or functional units, it does not simulate concurrent execution and executes only one instruction at a time. Therefore it cannot be used to detect dependency violations. Hewlett‑Packard themselves wrote that the performance statistics provided by Ski are meaningless and cannot be used to tune program performance.

Nevertheless, the accuracy of application and system execution and the speed of simulation are significant advantages of Ski. Although no application tests are implemented, Ski is very important for Linux application development. For example, the Ski simulator has been used to develop the IA‑64 version of the Linux kernel.

5.2      Jason Papadopoulos's Simulator

This was one of the first available simulators. It was released in the fall of 1999, a few months after the release of the first detailed description of the IA‑64 architecture [3].

The source code of this simulator is available to all users on the Internet (www.glue.umd.edu/~jasonp/). The distribution does not include an executable file; the makefiles in the package should be used to compile the simulator. It can be compiled in Linux with the gcc compiler and in DOS or Windows with the djgpp compiler.

Since the book mentioned above is just a general description of the architecture and defines no binary formats or operating system conventions, the simulator cannot load compiled code. Instead, the simulator parses IA‑64 assembly source code, which slows the simulation down.

There are several restrictions on the simulated source code. The source code can consist of program instructions only. Directives such as .global, .local, .type, .size cannot be used in the source code because the specification of the IA‑64 assembly language [8] was not released until January 2000. Software and operating system conventions [5] were also published in January 2000. Because of this, some other source code restrictions apply (the loc and out names of general registers cannot be used), although the system environment defined by this specification is implemented in the newest version of the simulator.

The simulator can execute only one application at a time. There is no concurrent instruction execution. The simulator supports application‑level instructions only and no system calls are implemented. Dependency violation detection, other tests, and performance monitoring are not provided.

The user interface of this simulator is command‑line oriented. Several commands are implemented for loading, execution, and register and memory inspection. The simulated application and the register and memory contents cannot be displayed automatically during tracing.

In spite of all these limitations, the simulator is worth mentioning. It was significant at the time of its release, since it provided a general view of how the IA‑64 architecture works when no other simulator was available.

6        Conclusion

An IA‑64 architecture simulator providing instruction code tests and performance monitoring, described in chapter 3, was developed.

The simulator could be a useful tool for debugging compilers and optimizing generated code. Although this simulator cannot compete with other simulators in speed, it implements tests that the others do not provide. The simulator's design also allows simple extension for future IA‑64 architecture implementations.

The user interface is quite friendly. It provides easy access to almost all of the standard debugger functions and clearly displays the current execution state.

Even though IA64Emu provides almost complete simulation of IA‑64 applications, many features could still be added to this program.

The simulator could be extended to simulate a whole operating system instead of the execution of a single application. A lot would have to be done for this purpose, including a swap file, exact interruption handling, translation of IA‑64 interruptions to IA‑32 interruptions, and implementation of system‑level instructions. This work would be enormous and would exceed the scope of a standard master thesis.

To improve the current performance monitoring, the simulated processor microarchitecture should be implemented more exactly. This would allow better measurement of program execution time. At present this is not possible because the published microarchitecture specifications are not exact enough for this purpose.

The program loader and the internal operating system could be extended to support the standard program start‑up code, system calls and the standard library. At present, several system calls are implemented to provide basic communication between the user and the program. Implementation of standard system calls would enable the simulation of common applications.

The simulator currently runs in Windows. It could be ported to other operating systems and platforms, but this is not a priority: porting would not improve the quality of the simulator, its design, or its algorithms, although it would increase its usability.

Many other suggestions for future work could be named here. Creating a perfect simulator is a huge task, but it was not the purpose of this thesis. This simulator implements all of the required features and satisfies the original goal of the thesis.

Bibliography

[1]     Carreño Victor A., Miner Paul S.: Specification of the IEEE‑854 Floating‑Point Standard in HOL and PVS, NASA Langley Research Center, 1995

[2]     Prof. Etiemble Daniel: Instruction Level Parallelism in Modern Microprocessors, University of Toronto, 2001

[3]     Intel Co.: IA‑64 Application Developer's Architecture Guide, May 1999

[4]     Intel Co.: IA‑64 Architecture Software Developer's Manual, July 2000

[5]     Intel Co.: IA‑64 Software Conventions and Runtime Architecture Guide, September 2000

[6]     Intel Co.: IA‑64 System Abstraction Layer Specification, July 2000

[7]     Intel Co.: UNIX System V Application Binary Interface, September 2000

[8]     Intel Co.: IA‑64 Assembly Language Reference Guide, January 2000

[9]     Intel Co.: Itanium Processor Microarchitecture Reference for Software Optimization, March 2000

[10]   Shaaban Muhammad Dr., Rochester Institute of Technology: Computer Architecture, Lecture notes, Winter 2000

[11]   Visual C++ Business Unit: Microsoft Portable Executable and Common Object File Format Specification, Revision 6.0, Microsoft Co., February 1999

[12]   The Santa Cruz Operation, Inc., AT&T: System V Application Binary Interface, Edition 4.1, March 1997