Information about the latest advances in Technology, Tweaks and Tech News

Wednesday 19 February 2014

On 04:13 by Unknown


Most of the time, when AMD or Nvidia launches a new graphics architecture, the high-end cards debut first. Nvidia is bucking that pattern today with the Maxwell-based GTX 750 Ti, a midrange card aimed at the $150 market. After years of dual-slot designs and hefty coolers, the GTX 750 Ti is a throwback to earlier years, when midrange graphics cards didn’t require additional power connectors or dedicated cooling slots.
[Image: The GTX 750 Ti, 3/4 view]
Don’t be fooled by the K6-era Golden Orb-style cooler or the lack of a six-pin power connector. This midrange GPU could reshape the market. The GM107 (Maxwell) at the heart of the GTX 750 Ti builds on Nvidia’s Kepler architecture but differs significantly from its predecessor. Every aspect of this core has been redesigned to improve power efficiency, scalability, and die area.
[Image: Nvidia’s GK107 versus the new GM107. Transistor density has significantly improved.]

The GM107 GPU Architecture:

Nvidia’s Kepler, which debuted in 2012, was designed to vastly increase parallelism across the GPU. Unlike the old Fermi-class GPUs, which used processing blocks (SMs) of 32 cores each, Kepler had 192 cores in each of its SMXs. This dramatically shifted where the GPU needed to extract parallelism in order to maximize performance.
According to Nvidia’s own tuning guide, developers needed to expose “roughly twice as much parallelism per multiprocessor on Kepler GPUs via either an increased number of active warps of threads or increased instruction-level parallelism (ILP) or some combination thereof.” Nvidia balanced this by using fewer multiprocessor blocks (eight for GK104 compared to 16 for the older GeForce cards), but the amount of parallelism per block still had to double.
Maxwell walks this trend back a bit, and returns to some design elements that Fermi used — but with a new multiprocessor block design (now called an SMM) of its own. Let’s take a look at the two designs.
[Image: Kepler SMX. Kepler’s highly parallel SMX design]
[Image: Maxwell SMM. Maxwell’s core design and distribution]
In Kepler, 192 GPU cores are fed by one huge register file, four warp schedulers, a unified instruction cache, and eight dispatch units. Maxwell keeps the same total number of schedulers and dispatch units but splits each 128-core SMM into four 32-core blocks, each with its own warp scheduler and pair of dispatch units. Previously, all 192 cores inside a Kepler SMX shared a texture cache, unified cache, and L1 cache. Now each pair of blocks shares a combined L1/texture cache, and dispatch/decode resources are split between the blocks.
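To make the resource split concrete, here is a rough sketch in Python that models the two multiprocessor designs using the counts quoted above. The class and field names are invented for illustration, and the four-way partitioning of the SMM is an assumption taken from Nvidia’s public description of GM107 rather than something this article measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Multiprocessor:
    """Toy model of one GPU multiprocessor (names are illustrative only)."""
    name: str
    cores: int
    warp_schedulers: int
    dispatch_units: int
    partitions: int  # number of independent blocks the resources are split into

    def per_partition(self) -> dict:
        """Resources available inside a single partition."""
        p = self.partitions
        return {
            "cores": self.cores // p,
            "warp_schedulers": self.warp_schedulers // p,
            "dispatch_units": self.dispatch_units // p,
        }

    def cores_per_scheduler(self) -> float:
        """How many cores each warp scheduler has to keep busy."""
        return self.cores / self.warp_schedulers


# Kepler: one monolithic block of 192 cores sharing every scheduler.
kepler_smx = Multiprocessor("Kepler SMX", cores=192, warp_schedulers=4,
                            dispatch_units=8, partitions=1)
# Maxwell: same scheduler/dispatch totals, assumed split into four blocks.
maxwell_smm = Multiprocessor("Maxwell SMM", cores=128, warp_schedulers=4,
                             dispatch_units=8, partitions=4)

for mp in (kepler_smx, maxwell_smm):
    print(mp.name, mp.per_partition(), mp.cores_per_scheduler())
```

Under these assumptions each Maxwell scheduler feeds 32 cores of its own, where a Kepler scheduler had the equivalent of 48 shared cores to keep busy, which is one way to read Nvidia’s claim that the smaller blocks are easier to use efficiently.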
According to Nvidia, breaking the unified SMX design into smaller blocks simplified the chip and allows for higher compute efficiency. Each 128-core SMM is able to hit roughly 90% of the performance of a 192-core SMX. For those of you keeping score, the implication is that the 128-core design is far more efficient per core: an SMX has 50% more cores than an SMM, but according to Nvidia, the SMM gives up just 10% of the performance. The benefit of these smaller, simpler multiprocessors is that Nvidia can fit far more of them into the same space, thereby increasing the total number of cores on each GPU.
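The per-core arithmetic behind that claim can be sketched as follows. This is an illustrative calculation in Python that uses only the figures quoted above (192 cores, 128 cores, roughly 90%); the variable names are made up for the example, and the 90% figure is Nvidia’s claim, not a measurement.

```python
SMX_CORES = 192           # cores in one Kepler SMX
SMM_CORES = 128           # cores in one Maxwell SMM
SMM_RELATIVE_PERF = 0.90  # SMM performance relative to an SMX, per Nvidia's claim

# How much larger the SMX is, measured in cores.
smx_size_ratio = SMX_CORES / SMM_CORES

# Performance delivered per core, relative to Kepler:
# (0.90 of the performance) / (128/192 of the cores).
per_core_gain = SMM_RELATIVE_PERF / (SMM_CORES / SMX_CORES)

print(f"SMX has {smx_size_ratio - 1:.0%} more cores than an SMM")
print(f"Per-core throughput vs. SMX: {per_core_gain:.2f}x")
```

In other words, if Nvidia’s 90% figure holds, each Maxwell core delivers roughly 1.35 times the work of a Kepler core.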
Maxwell has a much larger L2 cache than any previous GPU in this price bracket. Nvidia doesn’t give many details on why it expanded the L2, but we’re guessing it’s a critical component of the new SMM structure. In Kepler, 192 cores shared a contiguous L1 and a separate “Unified Cache.” With Maxwell, each pair of blocks within the SMM shares a combined L1/texture cache. According to Nvidia, the new, larger L2 acts as a buffer for slower caches and for data sharing across the entire core.
Since Maxwell has far more SMMs than comparable Kepler designs had SMXs, the larger L2 cache may be an effective way of ensuring multiple SMMs can update a shared data pool quickly. The card’s power savings and higher transistor density are the result of a great deal of work: Nvidia redesigned the control logic partitions, clock gating granularity, and compiler-based scheduling, tweaked the number of instructions issued per clock, and rebuilt the interconnect structure.
