search

Google

Sunday, April 13, 2008

Microarchitecture

In computer engineering, microarchitecture (sometime abbreviated to µarch or uarch) is a description of the electrical circuitry of a computer, central processing unit, or digital signal processor that is sufficient for completely describing the operation of the hardware.

In academic circles, the term computer organization is used, while in the computer industry, the term microarchitecture is more often used. Microarchitecture and instruction set architecture (ISA) together constitute the field of computer architecture
Etymology of the term
Since the 1950s, many computers used microprogramming to implement their control logic which decoded the program instructions and executed them. The bits within the microprogram words controlled the processor at the level of electrical signals.

The term microarchitecture was used to describe the units that were controlled by the microprogram words, as opposed to architecture that was visible and documented for programmers. While architecture usually had to be compatible between hardware generations, the underlying microarchitecture could be easily changed.


[edit] Relation to instruction set architecture
Microarchitecture is distinct from a computer's instruction set architecture. The instruction set architecture is the abstract image of a computing system that is seen by a machine language (or assembly language) programmer, including the instruction set, memory address modes, processor registers, and address and data formats. The computer organization is a lower level, more concrete, description of the system than the ISA. The computer organization shows the constituent parts of the system and how they are interconnected and how they interoperate in order to implement the architectural specification. [1][2] [3]

Different machines may have the same instruction set architecture, and thus be capable of executing the same programs, yet have different microarchitectures. These different microarchitectures (along with advances in semiconductor manufacturing technology) are what allows newer generations of processors to achieve higher performance levels as compared to previous generations. In theory, a single microarchitecture (especially if it includes microcode) could be used to implement 2 different instruction sets, by programming 2 different control stores.

The microarchitecture of a machine is usually represented as a block diagram that describes the interconnections of the registers, buses, and functional blocks of the machine. This description includes the number of execution units, the type of execution units (such as floating point, integer, branch prediction, single instruction multiple data (SIMD), the nature of the pipeline (which might include such stages as instruction fetch, decode, assign, execution, completion in a very simple pipeline), the cache memory design (level 1, level 2 interfaces), and the peripheral support.

The actual physical circuit layout, hardware construction, packaging, and other physical details is called the implementation of that microarchitecture. Two machines may have the same microarchitecture, and hence the same block diagram, but different hardware implementations.[4]


[edit] Aspects of microarchitecture
The pipelined datapath is the most commonly used datapath design in microarchitecture today. This technique is used in most modern microprocessors, microcontrollers, and DSPs. The pipelined architecture allows multiple instructions to overlap in execution, much like an assembly line. The pipeline includes several different stages which are fundamental in microarchitecture designs.[4] Some of these stages include instruction fetch, instruction decode, execute, and write back. Some architectures include other stages such as memory access. The design of pipelines is one of the central microarchitectural tasks.

Execution units are also essential to microarchitecture. Execution units include arithmetic logic units (ALU), floating point units (FPU), load/store units, branch prediction, and SIMD. These units perform the operations or calculations of the processor. The choice of the number of execution units, their latency and throughput is a central microarchitectural design task. The size, latency, throughput and connectivity of memories within the system are also microarchitectural decisions.

System-level design decisions such as whether or not to include peripherals, such as memory controllers, can be considered part of the microarchitectural design process. This includes decisions on the performance-level and connectivity of these peripherals.

Unlike architectural design, where achieving a specific performance level is the main goal, microarchitectural design pays closer attention to other constraints. Since microarchitecture design decisions directly affect what goes into a system, attention must be paid to such issues as:

chip area/cost
power consumption
logic complexity
ease of connectivity
manufacturability
ease of debugging
testability

[edit] Micro-Architectural Concepts
In general, all CPUs, single-chip microprocessors or multi-chip implementations run programs by performing the following steps:

read an instruction and decode it
find any associated data that is needed to process the instruction
process the instruction
write the results out
Complicating this simple-looking series of steps is the fact that the memory hierarchy, which includes caching, main memory and non-volatile storage like hard disks, (where the program instructions and data reside) has always been slower than the processor itself. Step (2) often introduces a lengthy (in CPU terms) delay while the data arrives over the computer bus. A considerable amount of research has been put into designs that avoid these delays as much as possible. Over the years, a central goal was to execute more instructions in parallel, thus increasing the effective execution speed of a program. These efforts introduced complicated logic and circuit structures. Initially these techniques could only be implemented on expensive mainframes or supercomputers due to the amount of circuitry needed for these techniques. As semiconductor manufacturing progressed, more and more of these techniques could be implemented on a single semiconductor chip.

See Article Central Processing Unit for a more detailed discussion on operation basics.

See Article History of general purpose CPUs for a more detailed discussion on the development history of CPUs.

What follows is a survey of micro-architectural techniques that are common in modern CPUs.


[edit] Instruction Set choice
The choice of which Instruction Set Architecture to use greatly affects the complexity of implementing high performance devices. Over the years, computer architects have strived to simplify instruction sets, which enables higher performance implementations by allowing designers to spend effort and time on features which improve performance as opposed to spending their energies on the complexity inherent in the instruction set.

Instruction set design has progressed from CISC, RISC, VLIW, EPIC types. Architectures that are dealing with data parallelism include SIMD and Vectors.


[edit] Instruction pipelining
Main article: instruction pipeline
One of the first, and most powerful, techniques to improve performance is the use of the instruction pipeline. Early processor designs would carry out all of the steps above for one instruction before moving onto the next. Large portions of the circuitry were left idle at any one step; for instance, the instruction decoding circuitry would be idle during execution and so on.

Pipelines improve performance by allowing a number of instructions to work their way through the processor at the same time. In the same basic example, the processor would start to decode (step 1) a new instruction while the last one was waiting for results. This would allow up to four instructions to be "in flight" at one time, making the processor look four times as fast. Although any one instruction takes just as long to complete (there are still four steps) the CPU as a whole "retires" instructions much faster and can be run at a much higher clock speed.

RISC make pipelines smaller and much easier to construct by cleanly separating each stage of the instruction process and making them take the same amount of time — one cycle. The processor as a whole operates in an assembly line fashion, with instructions coming in one side and results out the other. Due to the reduced complexity of the Classic RISC pipeline, the pipelined core and an instruction cache could be placed on the same size die that would otherwise fit the core alone on a CISC design. This was the real reason that RISC was faster. Early designs like the SPARC and MIPS often ran over 10 times as fast as Intel and Motorola CISC solutions at the same clock speed and price.

Pipelines are by no means limited to RISC designs. By 1986 the top-of-the-line VAX (the 8800) was a heavily pipelined design, slightly predating the first commercial MIPS and SPARC designs. Most modern CPUs (even embedded CPUs) are now pipelined, and microcoded CPUs with no pipelining are seen only in the most area-constrained embedded processors. Large CISC machines, from the VAX 8800 to the modern Pentium 4 and Athlon, are implemented with both microcode and pipelines. Improvements in pipelining and caching are the two major microarchitectural advances that have enabled processor performance to keep pace with the circuit technology on which they are based.


[edit] Cache
It was not long before improvements in chip manufacturing allowed for even more circuitry to be placed on the die, and designers started looking for ways to use it. One of the most common was to add an ever-increasing amount of cache memory on-die. Cache is simply very fast memory, memory that can be accessed in a few cycles as opposed to "many" needed to talk to main memory. The CPU includes a cache controller which automates reading and writing from the cache, if the data is already in the cache it simply "appears," whereas if it is not the processor is "stalled" while the cache controller reads it in.

RISC designs started adding cache in the mid-to-late 1980s, often only 4 KB in total. This number grew over time, and typical CPUs now have about 512 KB, while more powerful CPUs come with 1 or 2 or even 4, 6, 8 or 12 MB, organized in multiple levels of a memory hierarchy. Generally speaking, more cache means more speed.

Caches and pipelines were a perfect match for each other. Previously, it didn't make much sense to build a pipeline that could run faster than the access latency of off-chip memory. Using on-chip cache memory instead, meant that a pipeline could run at the speed of the cache access latency, a much smaller length of time. This allowed the operating frequencies of processors to increase at a much faster rate than that of off-chip memory.


[edit] Branch Prediction
One of barriers to achieving higher performance through instruction-level parallelism are pipeline stalls and flushes due to branches. Normally, whether a conditional branch will be taken isn't known until late in the pipeline as conditional branches depend on results coming from a register. From the time that the processor's instruction decoder has figured out that it has encountered a conditional branch instruction to the time that the deciding register value can be read out, the pipeline might be stalled for several cycles. On average, every fifth instruction executed is a branch, so that's a high amount of stalling. If the branch is taken, its even worse, as then all of the subsequent instructions which were in the pipeline needs to be flushed.

Techniques such as branch prediction and speculative execution are used to lessen these branch penalties. Branch prediction is where the hardware makes educated guesses on whether a particular branch will be taken. The guess allows the hardware to prefetch instructions without waiting for the register read. Speculative execution is a further enhancement in which the code along the predicted path is executed before it is known whether the branch should be taken or not.


[edit] Superscalar
Even with all of the added complexity and gates needed to support the concepts outlined above, improvements in semiconductor manufacturing soon allowed even more logic gates to be used.

In the outline above the processor processes parts of a single instruction at a time. Computer programs could be executed faster if multiple instructions were processed simultaneously. This is what superscalar processors achieve, by replicating functional units such as ALUs. The replication of functional units was only made possible when the die area of a single-issue processor no longer stretched the limits of what could be reliably manufactured. By the late 1980s, superscalar designs started to enter the market place.

In modern designs it is common to find two load units, one store (many instructions have no results to store), two or more integer math units, two or more floating point units, and often a SIMD unit of some sort. The instruction issue logic grows in complexity by reading in a huge list of instructions from memory and handing them off to the different execution units that are idle at that point. The results are then collected and re-ordered at the end.


[edit] Out-of-order execution
The addition of caches reduces the frequency or duration of stalls due to waiting for data to be fetched from the memory hierarchy, but does not get rid of these stalls entirely. In early designs a cache miss would force the cache controller to stall the processor and wait. Of course there may be some other instruction in the program whose data is available in the cache at that point. Out-of-order execution allows that ready instruction to be processed while an older instruction waits on the cache, then re-orders the results to make it appear that everything happened in the programmed order.


[edit] Speculative execution
One problem with an instruction pipeline is that there are a class of instructions that must make their way entirely through the pipeline before execution can continue. In particular, conditional branches need to know the result of some prior instruction before "which side" of the branch to run is known. For instance, an instruction that says "if x is larger than 5 then do this, otherwise do that" will have to wait for the results of x to be known before it knows if the instructions for this or that can be fetched.

For a small four-deep pipeline this means a delay of up to three cycles — the decode can still happen. But as clock speeds increase the depth of the pipeline increases with it, and some modern processors may have 20 stages or more. In this case the CPU is being stalled for the vast majority of its cycles every time one of these instructions is encountered.

The solution, or one of them, is speculative execution, also known as branch prediction. In reality one side or the other of the branch will be called much more often than the other, so it is often correct to simply go ahead and say "x will likely be smaller than five, start processing that". If the prediction turns out to be correct, a huge amount of time will be saved. Modern designs have rather complex prediction systems, which watch the results of past branches to predict the future with greater accuracy.


[edit] Multiprocessing and multithreading
Computer architects have become stymied by the growing mismatch in CPU operating frequencies and DRAM access times. None of the techniques that exploited instruction-level parallelism within one program could make up for the long stalls that occurred when data had to be fetched from main memory. Additionally, the large transistor counts and high operating frequencies needed for the more advanced ILP techniques required power dissipation levels that could no longer be cheaply cooled. For these reasons, newer generations of computers have started to exploit higher levels of parallelism that exist outside of a single program or program thread.

This trend is sometimes known as throughput computing. This idea originated in the mainframe market where online transaction processing emphasized not just the execution speed of one transaction, but the capacity to deal with massive numbers of transactions. With transaction-based applications such as network routing and web-site serving greatly increasing in the last decade, the computer industry has re-emphasized capacity and throughput issues.

One technique of how this parallelism is achieved is through multiprocessing systems, computer systems with multiple CPUs. Once reserved for high-end mainframes and supercomputers, small scale (2-8) multiprocessors servers have become commonplace for the small business market. For large corporations, large scale (16-256) multiprocessors are common. Even personal computers with multiple CPUs have appeared since the 1990s.

With further transistor size reductions made available with semiconductor technology advances, multicore CPUs have appeared where multiple CPUs are implemented on the same silicon chip. Initially used in chips targeting embedded markets, where simpler and smaller CPUs would allow multiple instantiations to fit on one piece of silicon. By 2005, semiconductor technology allowed dual high-end desktop CPUs CMP chips to be manufactured in volume. Some designs, such as Sun Microsystems' UltraSPARC T1 have reverted back to simpler (scalar, in-order) designs in order to fit more processors on one piece of silicon.

Another technique that has become more popular recently is multithreading. In multithreading, when the processor has to fetch data from slow system memory, instead of stalling for the data to arrive, the processor switches to another program or program thread which is ready to execute. Though this does not speed up a particular program/thread, it increases the overall system throughput by reducing the time the CPU is idle.

Conceptually, multithreading is equivalent to a context switch at the operating system level. The difference is that a multithreaded CPU can do a thread switch in one CPU cycle instead of the hundreds or thousands of CPU cycles a context switch normally requires. This is achieved by replicating the state hardware (such as the register file and program counter) for each active thread.

A further enhancement is simultaneous multithreading. This technique allows superscalar CPUs to execute instructions from different programs/threads simultaneously in the same cycle.

See Article History of general purpose CPUs for other research topics affecting CPU design

No comments: