0 Replies Latest reply on Jan 10, 2014 8:16 AM by William Ives

    How To Size Bulk Processing Power By CPU

    William Ives



      The simple days of computing: sizing a computer's CPU was based only on that CPU's clock speed (Hz), its transistor count, and its cache sizes. Nearly all software written back in the '90s was centered on single-thread processing (time-division multiplexing). (See the attached images.)

       

      Now a single-box computer is, in reality, multiple computers in one, thanks to multi-core CPUs with hardware multithreading (Hyper-Threading).

       

      Welcome to the Xeon E5 Ivy Bridge EP.


      Single 6c die socket: 6 CPUs (a core is a CPU), 12 logical CPUs (with Hyper-Threading, each core presents two logical CPUs to the OS), one 15MB L3 cache, one integrated memory controller. Note: 3+ GHz for each CPU; memory traffic flows through the 15MB cache. Multitasking shared by 12 logical CPUs with a 15MB L3 buffer.* Size according to the I/O speed of your outputs.** The horsepower of the software.***

       

      Single 10c die socket: 10 CPUs (a core is a CPU), 20 logical CPUs (with Hyper-Threading, each core presents two logical CPUs to the OS), one 25MB L3 cache, one integrated memory controller. Note: up to 3.5 GHz for each CPU; memory traffic flows through the 25MB cache. Multitasking shared by 20 logical CPUs with a 25MB L3 buffer.* Size according to the I/O speed of your outputs.** The horsepower of the software.***

       

      Single 12c die socket: 12 CPUs (a core is a CPU), 24 logical CPUs (with Hyper-Threading, each core presents two logical CPUs to the OS), 30MB of L3 cache, two integrated memory controllers. Note: 3+ GHz for each CPU; memory traffic flows through the 30MB cache. Multitasking shared by 24 logical CPUs with a 30MB L3 buffer.* Size according to the I/O speed of your outputs.** The horsepower of the software.***
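      The logical-CPU counts above are just the physical core count times the two Hyper-Threading hardware threads per core. A minimal sketch of that arithmetic:

```python
# Sketch: logical CPUs exposed by one socket. With Hyper-Threading,
# each physical core presents two hardware threads to the OS.
def logical_cpus(cores: int, threads_per_core: int = 2) -> int:
    return cores * threads_per_core

for cores in (6, 10, 12):
    print(f"{cores}-core die -> {logical_cpus(cores)} logical CPUs")
```

Running it reproduces the 12/20/24 figures for the three die flavors.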

       

      *The larger the cache on the CPU, the faster the processor can work through data arrays, regardless of internal clock speed. The L1 to L3 caches keep copies of requested items in case the same or a different core makes a subsequent request. The advantage of on-die cache is that it is faster, more efficient, and less expensive than placing separate cache on the motherboard.

       

      **Size by bus speed: I/O chipset bus speed (motherboard), hard drive access and bus speed (interface, SATA, SAS, RPM, buffer), physical RAM (DDR3) bus speed, Ethernet speed (Mbps), video controller buffer size and speed. High-priced, fast CPUs will only bottleneck on the back end, wasting the expensive front end. Example: oversized, expensive processors finish an array in microseconds, then sit and wait for the slow Southbridge (the hard drive to deliver the data through the LAN and on to the monitor).
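      The "size by bus speed" rule above amounts to finding the slowest link in the chain, because that link caps effective throughput no matter how fast the CPU is. A sketch with illustrative placeholder bandwidths (not benchmarks):

```python
# Sketch: a data pipeline is only as fast as its slowest component.
# All MB/s figures are illustrative placeholders, not measured values.
pipeline_mb_s = {
    "cpu_cache": 20000,
    "ram_ddr3": 10000,
    "sata_hdd": 150,     # spinning disk
    "gigabit_lan": 125,  # 1 Gbps is roughly 125 MB/s
}

bottleneck = min(pipeline_mb_s, key=pipeline_mb_s.get)
print(f"Effective throughput limited by {bottleneck}: "
      f"{pipeline_mb_s[bottleneck]} MB/s")
```

With these numbers the LAN is the ceiling, so a faster CPU buys nothing until the back end is upgraded.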

       

      *** Not all programs will utilize (HT) "hyper-threading" or multiple processors. The difference between a server database, a high-end game, and office/internet software is comparable to a 100-horsepower motor versus a 1-horsepower motor. Office software benefits from a high-speed processor (like a small, fast hair-dryer fan) with fewer cores for the price. A large server hosting many clients or a multi-user game will require more cores at a lower clock speed (like four large turbine fans). If you have money to burn, purchase the highest CPU speeds across 10+ cores with the largest cache for servers (like ten jet engines). For an office computer (internet & MS Office), 10 cores with the largest cache is a total waste of money and will buy you nothing; a quad core is all you need for the processing power of low-end software (a single 4-core will get the work done at less than half the cost of a 12-core). Check the internet for hardware benchmarks of the software you will run to determine the CPU horsepower you need. Programs best suited for multi-core technology: database servers such as SQL Server, which can license each core (CPU) as a separate computer and use a software architecture called parallelism to divide computations across all available cores; and parallel programs such as CAD and games with rendering add-ons and multiplayer components, which will utilize multiple CPUs.
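      One standard way to model the cores-versus-clock trade-off described above (not named in the post, but the same idea) is Amdahl's law: speedup from extra cores is limited by the fraction of the workload that can actually run in parallel. The parallel fractions below are illustrative, not measurements:

```python
# Sketch: Amdahl's law as a rough model for "cores vs clock speed".
# `parallel` is the fraction of the workload that can use extra cores.
def speedup(cores: int, parallel: float) -> float:
    return 1.0 / ((1.0 - parallel) + parallel / cores)

# Mostly serial office-style software barely benefits from 12 cores...
print(round(speedup(12, 0.20), 2))  # 1.22
# ...while a highly parallel server workload scales much further.
print(round(speedup(12, 0.95), 2))  # 7.74
```

This is why the paragraph above recommends fewer, faster cores for office software and many cores for servers and multiplayer games.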

       

       

      The Xeon E5-2600 V2 or "Ivy Bridge EP" (Ivy Bridge microarchitecture)

      Aside from the core architecture itself, the latest high-end Xeon E5-2600 V2, or "Ivy Bridge EP", is completely different from the desktop Ivy Bridge i7-3xxx. With up to twelve cores, two integrated memory controllers, no GPU, and 30MB of L3 cache, it is the big brother of the recently reviewed Ivy Bridge-E (Core i7-4960X). Intel has three die flavors of the Ivy Bridge EP:

       

      The first has the lowest core count (4/6 cores) for high-frequency or low-power SKUs; this is the die used in the enthusiast Ivy Bridge-E processors. The second is targeted at the typical server environment, with higher core counts (6 to 10 cores) and a larger L3 cache (25MB). The third and last is the high-performance HPC and server die, with 12 cores, two memory controllers for lower memory latency, and 30MB of L3 cache.

      The EP uses DDR3 in all of its forms (vanilla, ECC, buffered ECC, LR ECC), whereas the EX version is going to use a serial interface similar in concept to FB-DIMMs. There will be two types of memory buffers for the EX line: one for DDR3 and, later, another that will use DDR4 memory. No changes need to be made to the new EX socket to support both types of memory.

       

      Intel's 1xxx Xeons are uniprocessor, 2xxx dual socket, 4xxx quad, 8xxx octo socket. But the 4xxx series is still on 2012 models and the 8xxx on 2011 releases. A 4960X gets about 70% of the performance of a single 2697 at 38% of the cost. Then again, a 1270v3 gets you 50% of the performance at 10% of the price. So when talking farms (i.e., more than one system cooperating), four single-socket boards with a 1270v3 will get you almost the power of a dual-socket board with 2697v2s (minus communication overhead), will likely draw similar power (plus communication overhead), and save you $$$$$ in the process. Since you use 32 instead of 48 threads, but 4 installations instead of 1, software licensing cost may vary strongly in either direction.
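      The farm comparison above can be checked with the ratios given in the paragraph (performance and price both expressed relative to a single E5-2697 v2 = 1.0):

```python
# Sketch: price/performance of a dual-socket 2697 v2 board vs. a farm
# of four single-socket 1270 v3 boards, using the rough ratios quoted
# above (relative to one E5-2697 v2 = 1.0; CPU cost only).
chips = {
    "E5-2697 v2": {"perf": 1.00, "price": 1.00},
    "i7-4960X":   {"perf": 0.70, "price": 0.38},
    "E3-1270 v3": {"perf": 0.50, "price": 0.10},
}

dual_2697 = {k: 2 * v for k, v in chips["E5-2697 v2"].items()}
farm_1270 = {k: 4 * v for k, v in chips["E3-1270 v3"].items()}

print(dual_2697)  # about 2.0x perf at 2.0x price
print(farm_1270)  # about 2.0x perf at 0.4x price
```

So the four-board farm matches the dual-socket board's aggregate performance at roughly a fifth of the CPU cost, before communication overhead and licensing are counted.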

       

       

      NOTES:

       

      1. There are two physical buses on x86 processors (address + data) and two logical buses (memory + I/O). Special pins on the processor determine whether the operation is memory or I/O, read or write, and so on. On the 8086/8088 the data and address buses shared the same pins (A0-A15 with D0-D15 on the 8086, A0-A7 with D0-D7 on the 8088), with bits A16-A19/A8-A19 being strictly address. On the 80286 they were separate (24 address and 16 data lines); not sure about the 80186/80188. On the 80386 and 80486 there were 32 lines each for address and data. The 80386SX had the same external configuration as the 80286.

      After this, buses get complicated. The processors run so fast internally that they are more or less constantly waiting for their caches, which in turn are more or less constantly waiting on external RAM. To satisfy the caches' insatiable hunger for data, external memory began delivering it in 64-bit-wide chunks, starting with the Pentium and Pentium MMX, both 32-bit processors with 32 address lines but 64 data lines. With later processors the number of address lines was increased to 36, allowing a total addressable external memory of 64 GB; the processors remained 32-bit internally. On multi-core processors the hunger for data is even more pronounced, so they may have several sets of address and data buses to cram data into the processor: desktop processors may have two or three, and server processors three to four. Some may have switched to 128-bit-wide data buses.

      For modern 64-bit processors it is not feasible to also have 64 address lines, since that would allow memory up to 16 billion gigabytes, which is not possible today. Some motherboards allow 128 GB, which means the processor needs at least 37 address lines. As you can see, address and data buses are no longer really usable to determine processor size; they haven't been for the last 25 years (since the 80386). In C the int type is supposed to be equivalent to the register width. On AMD64 it isn't, because there just isn't that great a need for 64-bit ints: 32-bit ints do nicely in most cases. The width of a pointer in C on AMD64 is 64 bits, though.
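      The int-versus-pointer point at the end of the note can be observed directly. On a typical 64-bit (e.g. AMD64) Python build, a C int is still 4 bytes while a pointer is 8; the exact values are platform-dependent:

```python
# Sketch: inspect the C type widths the note describes, via ctypes.
# Results depend on the platform; on AMD64 expect int=32, pointer=64.
import ctypes

int_bits = ctypes.sizeof(ctypes.c_int) * 8
ptr_bits = ctypes.sizeof(ctypes.c_void_p) * 8
print(f"int: {int_bits} bits, pointer: {ptr_bits} bits")
```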

       

      2. The vast majority of microprocessors can be found in embedded microcontrollers. The second most common type of processor is the common desktop processor, such as Intel's Pentium or AMD's Athlon. Less common are the extremely powerful processors used in high-end servers, such as Sun's SPARC, IBM's Power, or Intel's Itanium. Historically, microprocessors and microcontrollers have come in "standard sizes" of 8 bits, 16 bits, 32 bits, and 64 bits. These sizes are common, but that does not mean other sizes are not available. Some microcontrollers (usually specially designed embedded chips) come in "non-standard" sizes such as 4 bits, 12 bits, 18 bits, or 24 bits. The number of bits represents how much physical memory can be directly addressed by the CPU. It also represents the number of bits that can be read by one read/write operation. In some circumstances these are different; for instance, many 8-bit microprocessors have an 8-bit data bus and a 16-bit address bus.

      8-bit processors can read/write 1 byte at a time and can directly address 256 bytes.

      16-bit processors can read/write 2 bytes at a time and can address 65,536 bytes (64 kilobytes).

      32-bit processors can read/write 4 bytes at a time and can address 4,294,967,296 bytes (4 gigabytes).

      64-bit processors can read/write 8 bytes at a time and can address 18,446,744,073,709,551,616 bytes (16 exabytes).
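      The address-space figures in the list above are all instances of one formula, 2^n bytes for an n-bit address:

```python
# Sketch: directly addressable memory for an n-bit address space.
def addressable_bytes(bits: int) -> int:
    return 2 ** bits

assert addressable_bytes(8) == 256
assert addressable_bytes(16) == 65_536                      # 64 KB
assert addressable_bytes(32) == 4_294_967_296               # 4 GB
assert addressable_bytes(64) == 18_446_744_073_709_551_616  # 16 EB
```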

       

      3. The internal frequency of microprocessors is usually based on the front side bus frequency. To calculate internal frequency, the CPU multiplies the bus frequency by a certain number, called the clock multiplier. It's important to note that the CPU uses the actual bus frequency for this calculation, not the effective bus frequency. To determine the actual bus frequency for processors that use double-data-rate buses (AMD Athlon and Duron) or quad-data-rate buses (all Intel microprocessors starting from the Pentium 4), divide the effective bus speed by 2 for AMD or 4 for Intel.

      Clock multipliers on many modern processors are fixed; it is usually not possible to change them. "Extreme" versions of processors have unlocked clock multipliers, meaning they can be "overclocked" by increasing the clock multiplier in the motherboard BIOS. Some CPU engineering samples may also have the clock multiplier unlocked. Many Intel qualification samples have the maximum clock multiplier locked: these CPUs may be underclocked (run at a lower frequency), but they cannot be overclocked by raising the clock multiplier above what the CPU design intends. While these qualification samples and the majority of production microprocessors cannot be overclocked by increasing their clock multiplier, they can still be overclocked with a different technique: increasing the FSB frequency.
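      The multiplier arithmetic from the note can be sketched directly. The example figures below are illustrative: a quad-pumped 800 MHz effective FSB is 200 MHz actual, so a 17x multiplier gives 3400 MHz:

```python
# Sketch: internal frequency = actual bus clock * multiplier, where the
# actual bus clock is the effective (marketing) speed divided by the
# data rate: 2 for AMD's DDR bus, 4 for Intel's quad-pumped bus.
def internal_mhz(effective_bus_mhz: float, data_rate: int,
                 multiplier: float) -> float:
    actual_bus_mhz = effective_bus_mhz / data_rate
    return actual_bus_mhz * multiplier

print(internal_mhz(800, 4, 17))  # 3400.0 (Intel, quad-pumped)
print(internal_mhz(400, 2, 10))  # 2000.0 (AMD, double-pumped)
```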

       

      Frequency: 3400 MHz

      Turbo frequency:

      3800 MHz (1 core)

      3700 MHz (2 cores)

      3600 MHz (3 cores)

      3500 MHz (4 cores)
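      The table above is a lookup from active cores to turbo clock, with the base clock as the fallback:

```python
# Sketch: turbo frequency by active core count, per the table above.
# Each additional active core drops the turbo ceiling by one 100 MHz bin.
TURBO_MHZ = {1: 3800, 2: 3700, 3: 3600, 4: 3500}
BASE_MHZ = 3400

def clock_mhz(active_cores: int) -> int:
    return TURBO_MHZ.get(active_cores, BASE_MHZ)

print(clock_mhz(1))  # 3800
print(clock_mhz(4))  # 3500
```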

       

      4. Base clock: With the release of the Nehalem architecture in November 2008, Intel introduced Nehalem-based processors that include an integrated DDR3 memory controller as well as QPI (QuickPath Interconnect), using a base clock instead of the previous FSB (front side bus) interface. These processors also have 256 KB of L2 cache per core, plus up to 30+ MB of shared level 3 cache. Because of the new I/O interconnect, chipsets and mainboards from previous generations can no longer be used with Nehalem-based processors.

      BCLK (base clock) affects 4 crucial parameters: memory, uncore, CPU, QPI. For example, a 1333 MHz DDR RAM frequency means you have BCLK = 133 and memory multiplier = 10. A CPU at 4 GHz is 200 x 20 = 4 GHz (notice: BCLK = 200 and CPU multiplier = 20, i.e., an overclocked base clock).
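      Both examples in the paragraph above are the same multiplication, frequency = BCLK x multiplier:

```python
# Sketch: BCLK-derived frequencies, per the examples above.
def freq_mhz(bclk_mhz: float, multiplier: float) -> float:
    return bclk_mhz * multiplier

# Stock: BCLK ~133.33 MHz with memory multiplier 10 -> ~1333 MHz DDR3
print(freq_mhz(133.33, 10))
# Overclocked: BCLK 200 MHz with CPU multiplier 20 -> 4000 MHz = 4 GHz
print(freq_mhz(200, 20))
```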

      5. i7-2600K: unlocked; i7-2600: locked. The "2600" is just a model number.

      The K means the processor has an unlocked multiplier, meaning you can overclock the CPU in the BIOS by simply increasing the CPU multiplier.

      When a single core is active, the chip can turbo up to 3.7 GHz. If you want, you can change that turbo state to go as high as 4.1 GHz. Overclocking these chips relies entirely on turbo, however. In the case above, the fastest your chip will run is 4.1 GHz, but only with one core active. If you have four cores active, the fastest your chip can run is 3.8 GHz.

       

      The Core i5-2500K and Core i7-2600K are fully unlocked and will let you overclock them as far as the CPU and/or your cooling can sustain.

       

       

      http://www.overclockers.com/intel-i7-2600k-sandy-bridge-review