|GigaHertz Intel Processors - Xeon vs Athlon|
|Originally published April, 2000|
|© 2000, 2005 Carlo Kopp|
One of the more interesting events of the last few months has been the demonstration of the first pre-production commodity market Intel x86 architecture processors running at clock speeds of the order of a GigaHertz.
Were I confronted, a decade ago, with the prospect of a GigaHertz class microprocessor within the coming decade, I would have laughed very hard indeed. The industry was battling to get CPUs to run properly above 50 MHz, and just getting a motherboard to behave itself with a 40 MHz CPU-memory bus was an engineering challenge in itself.
In this month's feature I will explore the Intel Pentium III Xeon and AMD Athlon designs, with some attention to internal device architecture and performance, and discuss some of the implications of this class of processor.
It is worth briefly exploring the history of the Intel x86 family, to place these developments into context.
The "Intel" Architecture
Intel's first real success in the microprocessor market was the humble 8080, a rather basic 8-bit CISC processor. It very soon led to an improved derivative, the 8085 and its third party contemporary, the very popular Z-80 family.
Improved fabrication technology soon allowed for faster and denser designs, and this led to Intel's first foray into the 16-bit microprocessor market, the humble 8086. The 8086 used what was essentially an improved and extended derivative of the 8085 instruction set, and was a very classical microcoded CISC design, comparable to its contemporaries in the 16-bit minicomputer market.
The 8086 sold quite well in the industrial automation market, but its early life was unremarkable. Its stablemate, the 8088, was a "pseudo-8-bit" processor, essentially an 8086 with its datapaths cut down to 8 bits to make it more easily adaptable to the then abundant 8-bit peripheral chips. I worked during this period at an Intel site, where we used 8085 SBCs in telemetry equipment, and I recall distinctly the internal technical debates over whether we should upgrade to the 8086 or 8088 Multibus boards. The 8088 did not stack up, and we never acquired any.
Evidently this viewpoint was not unique, and the 8088 was destined for obscurity like other oddball Intel creations, such as the 432 micro-mainframe.
Then IBM decided that dirt cheap stockpiles of unsalable 8088s were just what they needed for their first IBM PC design. The 8088 and soon after, the 8086, rode on the wave of commercial success produced by IBM's design, and like Microsoft's DOS, became the standards for low end desktop machines.
The 80186 included onboard peripherals and found a niche in the industrial market; however, the next big step was the 80286, used in the PC AT, which extended the 86 instruction set with the inclusion of virtual memory, and some smarter features such as pipelining in the CPU core. Running at dazzling speeds around 10 MHz, the 80286 was a major commercial success in the booming market.
At this point Intel decided to tackle the next big technological step, which was the 32-bit datapath. The 80386 was the CPU to do this, and included further architectural improvements in addition to the wider datapath. Perhaps the most important feature was the further extension of the virtual memory architecture, to include both the segmentation used in the 80286 and paging, the preferred technique in the Unix world. The 80386 had a painful birth; indeed, I recall a visiting Intel engineer at that time commenting on how they stripped engineers from almost every other division of Intel to push the new CPU design out on time. This was 1984.
Running at then impressive speeds of 10-25 MHz, with 32-bit datapaths, the "386" very quickly dominated the market. It was an integer machine, and required an external floating point unit. With the lethargic DRAMs of the day, external caches on the motherboard became a common feature of many later 386 based designs.
By the early nineties, the 486 arrived, essentially an incrementally improved 386 with an onboard floating point unit, higher clock speeds and various detail improvements. It was not particularly competitive against the RISC processors of the day used in Unix workstations, more so since the first superscalar architectures began to appear in the RISC world. In a superscalar processor, multiple execution units are used to concurrently execute multiple instructions, where no mutual dependencies exist between the operands of the instructions.
Intel's flagship during this period was the Pentium, a completely new design released in 1993, using a hardwired rather than microcoded CPU, which incorporated the very important architectural improvement of superscalar processing. The Pentium proved to be highly successful, and began to close the hitherto unbreached performance gap between Intel Architecture machines and the RISC based Unix workstations of the day. Utilising internal techniques similar to those of the RISC processors, the Pentium was a curious hybrid, using a hardwired implementation of a CISC instruction set created originally in microcode.
The most significant of the early Pentium variants was the Pentium-Pro, the first Intel processor to genuinely close the performance gap between Intel Architecture CPUs and RISC workstation CPUs. Since it in many respects formed the basis of the Pentium II family, it is deserving of a closer look.
The Pentium Pro
The now obsoleted Pentium Pro chipset was released in 1996 as a top end follow-on to the baseline Pentium, now Pentium I. The design objectives for the Pentium Pro were ambitious: using the same 0.6 micron resolution Aluminium metallised four layer BiCMOS manufacturing process as used by the baseline Pentium, significantly outstrip the performance of the older processor by architectural improvements alone.
This is a difficult engineering task, insofar as the cooling and real estate constraints leave little opportunity to add hardware to the CPU, since the basic CMOS transistor is of the same size as the earlier design.
The Pentium Pro therefore incorporated a number of very aggressive improvements to the internals of the CPU.
The first and most obvious change is in the cache architecture. Virtually all Pentiums employ a two level cache scheme, with a small internal L1 cache and much larger external L2 cache. The limitation of this arrangement in Pentium I designs was that the access time to the larger L2 cache, off chip, was relatively slow compared to the execution time of an instruction.
The Pentium Pro uses a closely coupled L2 cache chip, on the same chip carrier package as the CPU, which is accessible through a 64 bit wide, buffered and queued bus, capable of supporting four concurrent cache requests. The L2 cache static RAM operates at full CPU clock speed, and is sized at 256 kbyte or 512 kbyte in some variants.
The L1 cache is a split (Harvard) architecture design, using a 4 way set associative 8 kbyte instruction cache and a dual ported two way set associative 8 kbyte data cache, the latter capable of handling concurrent load/store operations.
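To make the cache geometry concrete, the following is a minimal sketch of how a set associative cache maps an address onto a set. The 32 byte line size and the function names are assumptions made for illustration; only the 8 kbyte sizes and the associativities come from the description above.

```python
def cache_sets(size_bytes, ways, line_bytes=32):
    """Number of sets in a set associative cache.

    line_bytes=32 is an assumed line size, not a figure from the article.
    """
    return size_bytes // (ways * line_bytes)


def set_index(address, size_bytes, ways, line_bytes=32):
    """Set that a byte address maps to: drop the line offset, then wrap."""
    return (address // line_bytes) % cache_sets(size_bytes, ways, line_bytes)


# The two halves of the split L1 cache, both 8 kbyte.
icache_sets = cache_sets(8 * 1024, ways=4)  # 4 way instruction cache
dcache_sets = cache_sets(8 * 1024, ways=2)  # 2 way data cache
print(icache_sets, dcache_sets)
```

Under the assumed line size, the higher associativity of the instruction cache means fewer sets, each able to hold more lines that would otherwise collide.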
The large/fast L2 cache produces significant performance gains in compute bound number crunching work, the forte of the Unix RISC workstation, and makes a 200 MHz Pentium Pro competitive against a 233 MHz Pentium II for such work.
While the cache design was vital to performance gains, the internal micro-architecture was also improved. The CPU pipeline depth was increased to 12 stages, and the number of superscalar execution units increased to five, two integer, two floating point, and one for memory operations.
The Pentium family is frequently described as RISC-like, or having a "RISC core". This is in some respects an oversimplification.
What Pentiums do to an incoming stream of Intel CISC instructions is this: each instruction is decoded into one or more "micro-operations" (micro-ops), which are RISC like primitive instructions, each of which operates on two source registers and one destination. Simple instructions will map into one micro-op, complex instructions into two or more micro-ops. The decoding into micro-ops is done by a pair of simple operation decoders, and a single complex operation decoder.
It is these micro-ops which are fed into the pipelines and ultimately the execution units.
The Pentium Pro was the first Pentium variant to incorporate "dynamic execution", Intel-speak for a micro-architecture scheme combining deep branch prediction, data flow analysis and speculative execution.
Deep branch prediction means the CPU looks ahead into the instruction stream and attempts to resolve branches before they are encountered, using a 512 entry branch target buffer. Tags are used to associate micro-ops with specific predicted paths of execution.
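The mechanics of a branch target buffer can be illustrated with a toy model. The sketch below is a textbook direct-mapped table with two-bit saturating counters, chosen for brevity; it should not be read as a description of the Pentium Pro's actual predictor, and the class and method names are hypothetical. Only the 512 entry size is taken from the text.

```python
class BranchTargetBuffer:
    """Toy direct-mapped branch target buffer with 2-bit saturating counters.

    A generic textbook scheme, assumed for illustration only.
    """

    def __init__(self, entries=512):
        self.entries = entries
        self.table = {}  # index -> (tag, target, counter)

    def predict(self, pc):
        """Return the predicted target, or None for 'not taken / unknown'."""
        entry = self.table.get(pc % self.entries)
        if entry and entry[0] == pc and entry[2] >= 2:
            return entry[1]
        return None

    def update(self, pc, taken, target):
        """Record the real outcome of the branch at address pc."""
        index = pc % self.entries
        entry = self.table.get(index)
        if entry is None or entry[0] != pc:
            # New or aliased branch: start weakly biased to what was seen.
            counter = 2 if taken else 1
        elif taken:
            counter = min(3, entry[2] + 1)
        else:
            counter = max(0, entry[2] - 1)
        self.table[index] = (pc, target, counter)


btb = BranchTargetBuffer()
btb.update(0x400, taken=True, target=0x380)
print(btb.predict(0x400))
```

The counter gives the prediction some inertia, so a loop branch that is almost always taken is not mispredicted twice every time the loop exits.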
Data flow analysis allows the detection of opportunities for "out-of-order" execution, whereby micro-ops which do not have mutual data dependencies in operands (ie operands of one micro-op are not the results of another) can be executed in a different order to the instruction stream. The idea is to keep the superscalar execution units busy whenever possible.
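The dependency test described above can be sketched in a few lines. This is a simplified model, not Intel's implementation: the MicroOp record, the register names, the single cycle latency and the issue width are all assumptions, and only true data dependencies (a source register written by an earlier micro-op) are tracked.

```python
from collections import namedtuple

# Hypothetical micro-op: one destination register and two source registers.
MicroOp = namedtuple("MicroOp", "name dest src1 src2")


def dependencies(micro_ops):
    """For each micro-op, the indices of earlier micro-ops whose results it reads."""
    last_writer = {}
    deps = []
    for i, op in enumerate(micro_ops):
        deps.append({last_writer[r] for r in (op.src1, op.src2) if r in last_writer})
        last_writer[op.dest] = i
    return deps


def schedule(micro_ops, units=3):
    """Greedy out-of-order issue, oldest ready micro-ops first.

    Returns a list of cycles, each a list of micro-op names.
    Assumes every micro-op completes in one cycle.
    """
    deps = dependencies(micro_ops)
    done = set()
    cycles = []
    while len(done) < len(micro_ops):
        ready = [i for i in range(len(micro_ops))
                 if i not in done and deps[i] <= done]
        issued = ready[:units]
        cycles.append([micro_ops[i].name for i in issued])
        done.update(issued)
    return cycles


ops = [
    MicroOp("load_a", "r1", "r10", "r11"),
    MicroOp("add_b", "r2", "r1", "r12"),    # needs the result of load_a
    MicroOp("load_c", "r3", "r13", "r14"),  # independent of the first two
    MicroOp("mul_d", "r4", "r3", "r2"),     # needs load_c and add_b
]
print(schedule(ops, units=2))
```

In this example load_c is issued ahead of add_b, even though it comes later in program order, because none of its operands are pending.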
Speculative execution involves precomputing the results for the instruction streams on either side of an upcoming branch instruction. These results are held in limbo until the branch is resolved, upon which the results of the correct branch path are committed, and the rest thrown away. The branch prediction results are used to identify which micro-ops are to be committed and which trashed.
To support this complexity, 40 hidden registers are used to hold operands, and 40 micro-op registers are used in a content addressable memory arrangement. The Pentium Pro is capable of retiring (ie executing and committing) three micro-ops per clock cycle.
With all superscalar processors, the achievable performance of the CPU core at a given clock speed, as distinct from the CPU coupled to its cache, depends critically upon the fraction of time all of the execution units can be kept fully active crunching operations (RISC instructions, or CISC instruction derived micro-ops in a Pentium). The utilisation of execution units improves with improving "Instruction Level Parallelism" (ILP), in other words the proportion of instructions in the code which can be found to have no mutual data dependencies and thus can be executed out-of-order. Finding that ILP requires looking ahead into the instruction stream to be executed: the deeper the CPU can look, the more likely it is to find instructions without mutual dependencies. The difficulty in doing this is analogous to playing a chess game many moves ahead: the number of possibilities to be speculatively evaluated blows out very rapidly the further ahead you look. In a superscalar CPU this effect amounts to requiring ever increasing amounts of hardware to find ILP, with every additional increment in the number of instructions to be explored.
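The benefit of a deeper lookahead can be seen in a small simulation. The sketch below is a toy model under stated assumptions: the instruction mix, the three execution units and the one cycle latency are invented for illustration, and the model captures only the gain from a wider window, not the hardware cost of building one.

```python
def cycles_needed(deps, window, units=3):
    """Cycles to issue every instruction when only the oldest `window`
    unissued instructions may be examined in each cycle.

    deps[i] is the set of earlier instruction indices that i depends on.
    """
    done = set()
    pending = list(range(len(deps)))
    cycles = 0
    while pending:
        visible = pending[:window]
        issued = [i for i in visible if deps[i] <= done][:units]
        done.update(issued)
        pending = [i for i in pending if i not in done]
        cycles += 1
    return cycles


# Hypothetical program: a chain of four dependent instructions (0 to 3),
# followed by four instructions with no dependencies at all (4 to 7).
program = [set(), {0}, {1}, {2}, set(), set(), set(), set()]

for window in (1, 2, 4, 8):
    print(window, cycles_needed(program, window))
```

With this program a window of one instruction takes 8 cycles, a window of two takes 6, and a window of four takes 4. Widening the window to eight brings no further gain, because the dependent chain itself then sets the limit.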
The driving argument for VLIW architecture is that of shifting this burden of ILP detection into the compiler, from the CPU hardware, thereby moving the time overhead of the work from runtime to compile time. The Itanium/IA-64 and Transmeta Crusoe are built around this idea.
Much of the Pentium Pro micro-architecture found its way into the newer Pentium II and derivative Celeron. By the same token, the large/fast L2 cache architecture model of the Pentium Pro was adopted for the follow-on Pentium II Xeon and Pentium III Xeon, the "power user" variants of the baseline Pentium II and III chips. Production Pentium Pros were built using a 0.35 micron process, as distinct from the development models which used the earlier 0.6 micron BiCMOS process, common to the 120 MHz Pentium I.
The Pentium II Xeon L2 cache is a fully error correcting design, running at the CPU core clock speed, and scalable across family models. Sizes of 512 kbyte, 1 Mbyte and 2 Mbyte are quoted by Intel, making these processors directly competitive with their RISC workstation peers. Interestingly the Pentium II family retains the 64 bit wide cache bus pioneered in the Pentium Pro family.
The baseline Pentium II and the Xeon variant are implemented using a 0.25 micron process.
The Pentium III Xeon
The P-III Xeon is Intel's current top end offering for the workstation and server application market, and is derived from variants of the baseline P-III chip. The marketing name Xeon is actually applied to two quite distinct variants of the P-III, one of which is based upon the 0.25 micron process, the other upon the 0.18 micron, six layer Aluminium interconnect process. For convenience, we will label these the P-III/0.25 and P-III/0.18.
In comparison with the P-II series, the P-III family incorporates a number of incremental design improvements to the chip architecture, as well as using a higher density 0.18 micron process in later variants, which provides for clock speeds to date up to 800 MHz.
The common P-III CPU core micro-architecture is built around the "Dynamic Execution Technology" model previously discussed; Intel literature is surprisingly coy about specific changes in this area. The instruction decoder portions of the CPU have been expanded to incorporate the 57 MMX multimedia instructions carried over from the P-II family, as well as 70 new streaming SIMD (Single Instruction Multiple Data) floating point and integer instructions.
A wide range of other design improvements were also incorporated, but the most dramatic architectural change in the P-III is the cache design upgrade in the later P-III/0.18 models.
The cache architecture in the P-III/0.18 and Xeon, named the Advanced Transfer Cache (ATC), incorporates an important change, which is the widening of the CPU core to L2 cache bus from 64 bits to 256 bits, thereby virtually quadrupling bandwidth between the CPU and L2 cache at a given clock speed. The on-chip L2 cache is 8-way set associative to extend the range of cachable addresses, and incorporates improvements to the cache interface to reduce latency, against the earlier Pentium Pro / P-II cache design. The ECC features of the later P-II caches are retained. The on-die cache is sized at 256 kbytes.
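The effect of the wider bus is simple arithmetic, sketched below. The function name is hypothetical, the model assumes one transfer per clock cycle with no stalls, and 733 MHz is simply one of the P-III/0.18 clock speeds quoted in this article.

```python
def peak_bandwidth_bytes_per_s(bus_bits, clock_hz):
    """Peak transfer rate, assuming one full-width transfer per clock cycle."""
    return bus_bits // 8 * clock_hz


clock = 733_000_000  # 733 MHz core clock, with the L2 cache at full core speed

narrow = peak_bandwidth_bytes_per_s(64, clock)   # Pentium Pro / P-II style bus
wide = peak_bandwidth_bytes_per_s(256, clock)    # ATC bus

print(narrow / 1e9, wide / 1e9, wide / narrow)
```

Under these assumptions the 64 bit bus peaks at about 5.9 Gbytes/s and the 256 bit bus at about 23.5 Gbytes/s, a factor of four. Sustained figures will be lower once latency and contention are accounted for.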
The interface to the system bus, another potential bottleneck in sucking instructions from memory, incorporates "Advanced System Buffering", Intel-speak for increased queue depths and buffer sizes. This is intended to reduce the frequency of stalls on the 100 - 133 MHz motherboard system bus.
Early model P-III/0.25 chips, introduced in early 1999, and early Xeons, introduced at the same time, employ a 9.5 million transistor die. The basic P-III/0.25 was delivered with clock speeds between 450 and 600 MHz, and a 512 kbyte L2 cache. Contemporary P-III/0.25 Xeons were supplied in 500 and 550 MHz versions, with L2 cache size options of 512 kbyte, 1 MB and 2 MB.
Introduced in late October, 1999, the P-III/0.18 baseline and Xeon models employ a 28.1 million transistor die, with the Advanced Transfer Cache fixed at 256 kbytes. The 0.18 micron process was first released mid 1999, in the Mobile P-II variants. The basic P-III/0.18 is shipped with clock speeds ranging between 500 and 733 MHz, whereas the Xeon variants are supplied with speeds of 600, 667 and 733 MHz, with an 800 MHz model recently announced.
Available Intel marketing literature does not indicate any significant differences between the P-III/0.18 and P-III/0.18 Xeon models, other than packaging and available speeds. We can speculate that these Xeons share the same die as the baseline P-III/0.18, but are selected for higher speed and packaged into the Xeon SECC2 package variant. At least one Intel press release indicates that variants with L2 cache sizes of 512 kbyte, 1 MB and 2 MB are planned for release in the coming year, which suggests that the current batch of P-III/0.18 Xeon models are a stopgap pending the release of the versions with bigger caches.
The very interesting question, from a machine architecture perspective, is how to integrate a full speed large/fast cache with the P-III/0.18 architecture, while clocking the CPU at 800+ MHz. Several approaches exist using the current 0.18 micron process: one is to add the large external cache as an L3 cache and accept some loss of speed, another is to disable the on-die ATC and add an external ATC die, and a third is to enlarge the die to fit a much bigger version of the current L2 ATC design. The last is arguably the cleanest, but also the costliest approach. If we assume the ATC occupies about 1/3 of the die, for 256 kbytes capacity, then a new die layout with a 2 MB ATC variant would require about a 230% increase in area.
The P-III family does have remaining capacity for growth, even with the evolved P-II/P-Pro micro-architecture core. A transition to a 0.18 micron copper metallisation process would add 25-30% of additional clock speed, while enlargement of the caches, and widening of the system bus interface concurrently with a speed improvement would also add respectable amounts of performance. I have yet to see any announcements of a Pentium IV in the pipeline; given Intel's push to establish the Itanium/Merced/IA-64 family, this is an interesting question in itself.
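The die area estimate can be reproduced with a short calculation. The sketch assumes, as the estimate does, that the cache takes one third of the current die; it further assumes that cache area scales linearly with capacity and that the rest of the die is unchanged. The function name is hypothetical.

```python
def die_growth(cache_fraction, old_kbytes, new_kbytes):
    """Relative die area after enlarging the on-die cache.

    Assumes cache area scales linearly with capacity and the
    non-cache part of the die stays the same size.
    """
    core = 1.0 - cache_fraction
    cache = cache_fraction * new_kbytes / old_kbytes
    return core + cache


new_area = die_growth(1 / 3, old_kbytes=256, new_kbytes=2048)
increase_percent = (new_area - 1.0) * 100
print(round(new_area, 2), round(increase_percent))
```

The cache grows eightfold, from 1/3 to 8/3 of the original die area, so the whole die grows to about 3.3 times its present size, which is an increase of roughly 230%.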
The AMD K-7/Athlon
Advanced Micro Devices have carved themselves a niche in the Intel Architecture market with their line of instruction set compatible processors. Like David vs the Intel Goliath, they have certainly been an aggressive player in the market over recent years. While holding a small share of the total market, they have nevertheless managed to maintain the initiative.
The latest AMD offering is the Athlon family of processors, previously known as the K-7, which incorporates a later generation CPU core design than the Intel P-III family, providing genuinely competitive performance against the top end Intel Xeons.
Indeed, the recent demonstration of a 1.1 GHz clocked Athlon chip, using a 0.18 micron process with copper metallisation, actually gives AMD the lead in basic technology for "Pentium class" processors. Skimming the media limelight and numerous industry and press awards, AMD are riding high at this moment. Whether they can convert this technology advantage into market share remains to be seen. Current volume production Athlons are shipped with speeds between 550 and 850 MHz.
The Athlon family of processors uses a 22 million transistor die, fabricated as a Model 1 in a 0.25 micron 6 layer Aluminium process, or a Model 2 in a 0.18 micron process. The newly opened AMD fab in Germany will be able to manufacture using the 0.18 micron copper metallised process. Copper being more conductive allows for higher clock speeds on a given die, where the resistivity of the interconnections is the limiter to clock speeds.
The micro-architecture of the Athlon family is of considerable interest, as it is newer and in many respects more powerful than the older evolved Pentium Pro/II/III core. The first feature of interest is the set of execution units: the Athlon uses three integer units, three floating point units and three address calculation units, for a total of nine execution units. Independent scheduling is used for the integer and floating point paths, and AMD claim the ability to issue 9 operations concurrently, three integer, three address and three floating point. A 10 stage integer pipeline and a 15 stage floating point pipeline are used. The floating point execution units can perform Intel SIMD MMX instructions as well as AMD 3DNow! instructions.
Like Intel's Pentium family, the AMD Athlons use a "RISC-like" core: Intel CISC instructions are decoded by a three way Instruction Decoder into fixed length "MacroOPs", which are then fed into the Instruction Control Unit, which has a 72 entry Reorder Buffer.
Branch prediction is performed using a two-way 2048 entry branch prediction table, a branch target address table and return address stack.
From an architectural perspective, the Athlon is similar enough to the Pentium to invite comparison, but different enough to rule out direct comparisons. The key difference lies in the potential for better superscalar execution performance, due to a larger number of execution units. Determining whether the logic used to detect ILP and reorder MacroOPs can fully exploit this potential is not a "back of the envelope" calculation!
The other major difference in micro-architecture is the Athlon cache design. The L1 cache is large by any measure, indeed half the size of the P-III/0.18 ATC L2 cache. The L1 cache is also a "Harvard" style split arrangement, two-way set associative for both the instruction and data caches. It is accessed via a 64-bit cache bus, which makes it comparable to the Pentium Pro/II/III "narrow" cache, rather than the newer ATC architecture.
The L2 cache architecture of the Athlon is built around a 72-bit L2 cache bus, with 64 bits committed to the bus proper and 8 for ECC. The interface is designed to support a range of industry standard SDR or DDR Static RAMs, and can be programmed to match the speed of the SRAM. L2 cache sizes between 512 kbyte and 8 MB can be supported.
Against the older generation P-III/0.25 L2 cache design, the Athlon clearly has potential for similar throughput for a given clock speed, and up to four times greater size. However against the P-III/0.18 ATC, it will be penalised by the use of a "narrow" cache bus, yet still retain the advantage of size.
The motherboard chipsets offered for the Athlon use a licenced variant of the Compaq (DEC) EV6 synchronous system bus, designed for the Alpha RISC workstation, and clocked at 200 MHz with growth potential above 400 MHz. This is significantly better system bus bandwidth, in comparison with the 100/133 MHz technology used on the P-III family motherboards. The traditional advantage of Unix RISC workstation motherboards over PC motherboards, decisive in the server and power user market, was very cleverly exploited by AMD.
The Athlon uses "Slot A" packaging, mechanically "pin-compatible" with the Intel standard, but electrically incompatible due to different pinouts. Therefore an Athlon cannot be backfitted into a Pentium III motherboard.
The overall conclusion is that the Athlon Model 2 series has decisive architectural advantages against the early generation P-III/0.25 series processors in almost all key performance drivers, whereas against the latest P-III/0.18 series it has a cache bandwidth bottleneck due to the four times narrower cache bus.
Xeon or Athlon?
The obvious question a reader will ask at this point is: which is the better performer, the Xeon or the Athlon?
This is not an easy question to answer, given the diversity of P-III variants and cache configurations on the Athlon series, even should we assume an identical clock speed. If we factor in motherboard performance, the equation becomes even more difficult. Since P-III/0.18 Xeons with large L2 caches are as yet not shipping, there is no direct equivalent in the Intel P-III stable.
The only real test is to benchmark the intended application, as there are enough architectural differences between the machines to render any paper estimates inaccurate. This is especially true of applications which are sensitive to cache performance.
Benchmarks published by AMD, when normalised for clock speed, would suggest that the Athlon delivers a 32.5% better rating on the Spec Integer Base 95 benchmark, and a 2% better rating on the floating point Spec 95 base, compared to the P-III/0.18. However these were run on NT, rather than Unix, and there are sufficient undocumented differences in system configuration (eg Athlon L2 cache size) to take the benchmark with a grain of salt.
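Normalising for clock speed simply means comparing benchmark score per MHz rather than raw score. A minimal sketch of the arithmetic follows; the function names are hypothetical, and the scores and clock speeds in the example are invented purely to show the calculation. They are not the published AMD or Intel results.

```python
def per_mhz(score, clock_mhz):
    """Benchmark score per MHz of clock speed."""
    return score / clock_mhz


def relative_advantage(score_a, mhz_a, score_b, mhz_b):
    """Percentage by which chip A beats chip B once clock speed is divided out."""
    return (per_mhz(score_a, mhz_a) / per_mhz(score_b, mhz_b) - 1.0) * 100


# Invented figures for illustration only, not real benchmark results.
advantage = relative_advantage(score_a=30.0, mhz_a=600, score_b=25.0, mhz_b=550)
print(round(advantage, 1))
```

With these made-up inputs chip A scores 20% higher in raw terms but only 10% higher per MHz, which is why a normalised figure says more about architecture than about clock speed.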
What is clear is that the Athlon Model 2 is a genuine and credible competitor to the latest P-III Xeon, and in terms of basic architecture, has considerable further growth potential, especially in the L1 and L2 cache architectures.
Size matters, and Intel does hold the upper hand in this respect, but should Intel fall behind in extending the P-III, and not achieve early success with the Itanium, then AMD will have the opportunity to gain considerable market share with the Athlon, at Intel's expense. Pivotal to AMD's play will be the yields in the new German fab, and the supply of quality motherboards.
The longer term outcome is thus not easily predictable in the Xeon vs Athlon play. With IBM about to deploy the 64 bit, 170 million transistor, 1+ GHz Power4, and Compaq working on the 1 GHz Alpha 21364, the market will not be short of contenders for the 1 GHz class processor performance crown.
The turn of the millennium is indeed the time the 1 GHz processor has come of age.
|Last Updated: Sun Apr 24 11:22:45 GMT 2005|
|Artwork and text © 2005 Carlo Kopp|