Sunday, September 2, 2012

Keep Adding Cores?

There is a notion among the futurists out there that we just need to keep adding cores to our processors to make multi-processing (MP) the ultimate solution to all our computing problems. I think this comes from the usual conclusions about Moore's Law and the physical limits that we seem to be reaching at present.

But, for gadgets, it is not generally the case that adding cores will make everything faster. The trend is, instead, toward specialized processors and distribution of tasks. When possible, these specialized processing units are placed on-die, as in the case of a typical System-on-a-Chip (SoC).

Why specialized processors? Because using some cores of a general CPU to do a specific computationally-intensive task will be far slower and use far more power than using a specialized processor designed to do that task in hardware. And there are plenty of tasks for which this will be true. On the flip side, the tasks we are required to do are changing, so specific hardware will not necessarily be able to do them.

What happens is that these tasks are not all the same. Taking a picture is different from making a phone call or connecting to Wi-Fi, which is different from zooming into an image, which is different from real-time encryption, which is different from rendering millions of textured 3D polygons into a frame buffer. Once you see this, it becomes obvious that you need specialized processors to handle these specific tasks.

The moral of the story is this: one processor model does not fit all.

Adding More Cores

When it comes to adding more cores, one thing is certain: the amount of die space the chip needs goes up, because each core uses its own die space. Oh, and heat production and power consumption go up as well. So what are the ways to combat this? The first seems obvious: use a smaller and smaller fabrication process for the multiple-core designs. So, if you started at a 45-nanometer process for a single-CPU design, then you might want to go to a 32-nanometer process for a dual-CPU design, and a 22-nanometer process for a 4-core design. You will have to go even finer for an 8-core design, and it just goes up from there. The number of gates you can place on the die goes up roughly as the square of the ratio of the old process size to the new one. So when you go from 45 nm to 32 nm, you get the ability to put in 1.978x the number of gates. When you go from 32 nm to 22 nm, you get the ability to put in 2.116x as many gates. This gives you room for more cores.
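As a quick back-of-the-envelope check of those numbers (a sketch only; real gate density depends on far more than the nominal process number), the scaling factor is just the square of the ratio of the old feature size to the new one:

    #include <stdio.h>

    /* Rough gate-density scaling: shrinking from an old process size to a
       new one gives about (old/new)^2 as many gates in the same die area.
       This is only the back-of-the-envelope figure quoted above. */
    static double gate_scale(double old_nm, double new_nm)
    {
        return (old_nm / new_nm) * (old_nm / new_nm);
    }

    int main(void)
    {
        printf("45 nm -> 32 nm: %.3fx the gates\n", gate_scale(45.0, 32.0)); /* ~1.978 */
        printf("32 nm -> 22 nm: %.3fx the gates\n", gate_scale(32.0, 22.0)); /* ~2.116 */
        return 0;
    }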

A change in process resolution gives you more gates and thus more computation per square inch. But it also requires less power to do the same amount of work. This is useful for gadgets, for which keeping power consumption down is paramount. And if it takes less power, it may also run cooler.
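(The usual first-order reason, not spelled out in the post: dynamic switching power goes roughly as

    P ≈ C · V² · f

so the smaller gate capacitance C and lower supply voltage V that come with a finer process pull power down even at the same clock frequency f. This is a standard rule of thumb, not a precise model.)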

But wait, we seem to be at the current limits of the process resolution, right? Correct, 22 nm is about the limit at the current time. So we will have to do something else to increase the number of cores.

The conventional wisdom for increasing the number of cores is to use a Reduced Instruction Set Computer (RISC) design, since simpler cores take up less die space. ARM uses one, as does the PowerPC, but Intel really doesn't.

When you use a RISC processor, it generally takes more instructions to do something than on a non-RISC processor, though your experience may vary.

Increasing the die size also can allow for more cores, but that is impractical for many gadgets because the die size is already at the maximum they can bear.

The only option left is to agglomerate more features onto the die. This is the typical procedure for an SoC: move the accelerometer in; embed the baseband processor, the ISP, and so on onto the die. This reduces the number of separate components and allows more room for the die itself. This is hard, because your typical smartphone company usually just buys components and assembles them. Yes, the packaging for those components takes up real space!

Heat dissipation becomes a major issue with large die sizes and extreme amounts of computation. This means we have to mount fans on the dies. Oops. That won't do for a gadget. Gadgets don't have fans!

Gadgets

Modern gadgets are going the way of SoCs. And the advantages are staggering for their use cases.

Consider power management. You can turn each processor on and off individually. This means that if you are not taking a picture, you can turn off the Image Signal Processor (ISP). If you are not making a call (or, even more useful, if you are in Airplane Mode), then you can turn off the baseband processor. If you are not zooming the image in real time, then you can turn off the specialized scaler, if there is one. If you are not communicating using encryption, like under a VPN, then you can turn off the encryption processor, if you have one. If you are not playing a point-and-shoot game, then maybe you can even turn off the Graphics Processing Unit (GPU).
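As a purely illustrative sketch (the unit names and the power-control call here are made up; a real SoC does this through its vendor's power-management framework and drivers), the bookkeeping looks something like this:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical per-unit power gating on an SoC. Everything here is
       illustrative; real hardware is controlled through clock/power-gate
       registers behind a vendor power-management framework. */
    typedef enum { UNIT_ISP, UNIT_BASEBAND, UNIT_SCALER, UNIT_CRYPTO, UNIT_GPU } soc_unit;

    static void set_unit_power(soc_unit u, bool on)
    {
        printf("unit %d -> %s\n", (int)u, on ? "on" : "off");  /* stand-in for register writes */
    }

    static void update_power_state(bool camera, bool call, bool zoom, bool vpn, bool game)
    {
        set_unit_power(UNIT_ISP,      camera);   /* off when not taking a picture     */
        set_unit_power(UNIT_BASEBAND, call);     /* off in Airplane Mode               */
        set_unit_power(UNIT_SCALER,   zoom);     /* off when not zooming in real time  */
        set_unit_power(UNIT_CRYPTO,   vpn);      /* off when not using encryption      */
        set_unit_power(UNIT_GPU,      game);     /* off when nothing needs drawing     */
    }

    int main(void)
    {
        /* Airplane Mode, reading a book: everything here is off. */
        update_power_state(false, false, false, false, false);
        return 0;
    }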

Every piece you can turn off saves you power. Every core you can turn off saves you power. And the more power you save, the longer your battery will last before it must be recharged. And the amount of time a device will operate on its built-in battery is a huge selling point.

Now consider parallelism. Sure, four cores are useful for increasing parallelism. But the tendency is to use all the cores for a computationally-intensive process, and this ties up the CPU for noticeable amounts of time, which can make the UI slow. By using specialized processors, you can free up the CPU cores for the stuff that has to be done all the time, and the device can finally be an actual multitasking device.

Really Big Computers

Massive parallelization does lend itself to a few really important problems, and this is the domain of the supercomputing center. When one gets built these days, thousands, if not millions, of cores are added to make a huge petaflop-scale processing unit. The Sequoia machine, a BlueGene/Q parallel array of 1,572,864 cores, is capable of 16.32 petaflops.

But wait, the era of processing specialization has found its way into the supercomputing center as well. This is why many supercomputers are adding GPUs into the mix.

And let's face it, very few people use supercomputers. The computing power of the earth is measured in gadgets these days. In 2011, there were about 500 million smartphones sold on the planet. And it's accelerating fast.

The Multi-Processing Challenge

And how the hell do you code on multi-processors? The answer is this: very carefully.

Seriously, it is a hard problem! On GPUs, you set up each shader (what a single processor is called) with the same program and operate them all in parallel. Each small set of shaders (called a work group) shares some memory and also can share the texture cache (where the pixels come from).

It takes some fairly complex analysis and knowledge of the underlying structure of the GPU to really make any kind of general computation go fast. This general-purpose computation on GPUs is known as GPGPU. The OpenCL language is designed to meet this challenge and bring general computation to the GPU.
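To make that concrete, here is a minimal OpenCL C kernel sketch (device code only; the host-side setup of platforms, contexts, queues, and buffers is omitted, and the per-pixel gain is just an illustrative operation):

    /* Every work-item runs this same program on a different index; work-items
       in the same work-group can also share fast __local memory (not used in
       this trivial example). */
    __kernel void scale_pixels(__global const float *src,
                               __global float *dst,
                               const float gain,
                               const unsigned int count)
    {
        size_t i = get_global_id(0);      /* which work-item am I? */
        if (i < count)
            dst[i] = src[i] * gain;       /* same instruction stream, different data */
    }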

On multiple cores, you set up a computation thread on one of the cores, and you can set up multiple threads across multiple cores. Microthreading is a technique used to make multiple threads operate efficiently on one core; which technique you use depends upon how the core is designed. With hyperthreading, one thread can be waiting for data or stalled on a mispredicted branch while the other is computing at full bore, and vice versa. On the same core!
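A minimal POSIX-threads sketch of the multi-core case (the thread count here is just an assumption; which core or hardware thread each one lands on is left to the OS scheduler, and core pinning is platform-specific):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4   /* assumed core count, for illustration */

    static void *worker(void *arg)
    {
        long id = (long)arg;
        double sum = 0.0;
        for (long i = 0; i < 10000000; i++)   /* stand-in for real computation */
            sum += (double)i * 0.5;
        printf("thread %ld done (sum=%.0f)\n", id, sum);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }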

So you need to know lots about the underlying architecture to program multiple cores efficiently as well.

But there are general computation solutions that help you to make this work without doing a lot of special-case thought. One such method is Grand Central Dispatch on Mac OS X.
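For example, a small Grand Central Dispatch sketch (it uses Apple's blocks extension, so it needs clang on Mac OS X, and the loop body is just a placeholder computation): dispatch_apply hands the iterations to a global concurrent queue, and GCD decides how many threads and cores to actually use.

    #include <dispatch/dispatch.h>
    #include <stdio.h>

    int main(void)
    {
        const size_t n = 8;
        float results[8];
        float *out = results;   /* the block captures this pointer by value */

        dispatch_queue_t q = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
        dispatch_apply(n, q, ^(size_t i) {
            out[i] = (float)(i * i);   /* each iteration may run on a different core */
        });

        for (size_t i = 0; i < n; i++)
            printf("%zu -> %.0f\n", i, results[i]);
        return 0;
    }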

At the Cellular Level

There is a multi-core architecture, a massively-parallel model, that departs from simply adding identical cores. The Cell Architecture does this by combining a general processor (in this case a PowerPC) with multiple cores for specific, hard computation. This architecture, pioneered by Sony, Toshiba, and IBM, targets such applications as cryptography, matrix transforms, lighting, physics, and Fast Fourier Transforms (FFTs).

Take a PowerPC processor and combine it with multiple (8) Synergistic Processing Elements capable of excellent (but simplified) Single-Instruction Multiple Data (SIMD) floating-point operations, and you have the Cell Broadband Engine, a unit capable of 256 Gflops on a single die.

This architecture is used in the Sony PlayStation 3. But there is some talk that Sony is going to a conventional multi-core-plus-GPU model, possibly supplied by AMD.

But what if you apply a cellular design to computation itself? The GCA model for massively-parallel computation is a potential avenue to consider. It is based on cellular automata: each processor has a small set of rules to perform in the cycles between communications with its neighboring units. That's right: it uses geometric location to decide which processors to talk with.

This eliminates little complications like an infinitely fast global bus, which might be required by a massively parallel system where each processor can potentially talk to every other processor.
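A toy sketch of the neighbor-only idea on a one-dimensional ring (this is ordinary serial C standing in for what would be per-cell hardware in a real cellular machine; it shows the communication pattern, not the GCA architecture itself):

    #include <stdio.h>
    #include <string.h>

    #define N 16

    /* Each cell's next state depends only on itself and its two immediate
       neighbors, so no global bus is needed -- only local communication. */
    static void step(const float *cur, float *next)
    {
        for (int i = 0; i < N; i++) {
            int left  = (i + N - 1) % N;    /* wrap around at the edges */
            int right = (i + 1) % N;
            next[i] = (cur[left] + cur[i] + cur[right]) / 3.0f;
        }
    }

    int main(void)
    {
        float a[N] = {0}, b[N];
        a[N / 2] = 1.0f;                    /* a single "hot" cell */
        for (int t = 0; t < 8; t++) {
            step(a, b);
            memcpy(a, b, sizeof a);
        }
        for (int i = 0; i < N; i++)
            printf("%.3f ", a[i]);
        printf("\n");
        return 0;
    }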

The theory is that, without some kind of structure, massively parallel computation is not really possible. And its proponents are right, because there is a bandwidth limitation in any massively parallel architecture that eventually puts a cap on the petaflops of throughput.

I suspect a cellular model is probably a good architecture for at least two-dimensional simulation. One example of this is weather prediction, which is mostly a two-and-a-half dimensional problem.

So, in answer to another question, "How do you keep adding cores?", the response is also "very carefully".

6 comments:

  1. I think a lot of ARTISTS and some others would go for a larger (17" diagonal screen) tablet with high resolution and a pressure-sensitive pen. People seem to get emotional about not having this, so I am thinking about tearing my Wacom Cintiq apart and adding laptop parts so at least I can have one. I will take it to the coffee shop so I can be like all the other people with their 17" laptops, only I will be able to draw on mine. Hopefully, I will have Painter running on it, as that is still the best drawing program in my opinion. See you there Ron

    1. Painter does serve me well also. And I wonder if Corel is moving Painter to the tablet environment.

      I also like drawing on my iPad, but I do miss the pressure-sensitive aspects of the Wacom. The optimal thing would be a huge iPad-like tablet with a pressure-sensitive pen. There are several technologies to bring pressure sensitivity to the iPad, even without Wacom's magical antenna-array tech. But let's face it, the Wacom is tried and true. It also supports tilt and bearing, which both open up even more creative possibilities.

      The big Cintiq is the cleanest way I have seen to paint directly on any screen surface. But the clarity, contrast, and resolution of screens has gone way up lately. Those retina displays are gorgeous!

      You are right: what we clearly need is a larger tablet with pressure sensitivity for artists. A friend has also mentioned this to me.

      One of the main issues with tablets (capacitive tablet touch screens) is that your hand causes spurious contact points, disturbing the drawing process. I have taken to putting down a handkerchief (to lay my hand on) while sketching. There is also a glove being sold for just this purpose.

      --Mark

  2. Specialized processors are more efficient in terms of silicon expended for the task they do, so the tradeoff is that they are worthwhile when the task they do is frequent enough that the silicon efficiency gained is positive overall, i.e. the duty cycle of the specialized task must be factored in (as well as other considerations such as I/O load on the general CPU, etc.).

    That silicon (and energy) efficiency equation can be applied vice versa, in that making the general CPU simpler (e.g. RISC) can lead to greater silicon efficiency, because the complex instruction sets have a lower duty cycle.

    I expect that for some years or decades the number of cores will continue to increase in line with Moore's law. Some ideas I have seen include using a material other than silicon, 3D circuits (Intel’s tri-gate technology), and making cores simpler.

    Ultimately the amount of processing we can fit in a very small gadget will reach a limit:

    http://www.dailygalaxy.com/my_weblog/2012/05/is-the-age-of-silicon-coming-to-an-end-physicist-michio-kaku-says-yes.html
    http://www.lifeslittlemysteries.com/2878-future-computers.html

    However, there is a solution to this. With near-field radio (e.g. Bluetooth, etc.), we can put more cores somewhere on our body or clothing and offload processing from the gadget we hold in our hand. Hopefully we can charge it without wires too, so we don't have to think about it. Hopefully these spare processors become so cheap that they come standard in clothing and shoes, etc.

    Identifying parallelism in software is not the same as concurrency:

    http://existentialtype.wordpress.com/2011/03/17/parallelism-is-not-concurrency/
    http://existentialtype.wordpress.com/2012/08/26/yet-another-reason-not-to-be-lazy-or-imperative/

    And we can coax parallelism into a series of conditional operations on sets, by using a State monad and Traversable:

    http://augustss.blogspot.com/2011/05/more-points-for-lazy-evaluation-in.html#c2904150906369733736

    1. Your comment about parallelism vs. concurrency is exactly my point, really. Although it does at first seem to bolster the objective of adding more cores, because multiple cores actually do properly support concurrency of diverse operations. The real point is that each individual operation needs to be handled properly, and using a general-purpose core for that operation will not really make sense from a performance or power-footprint standpoint. Even a specialized massively parallel processor, like a GPU, can't really handle operations like sensor processing, RAW demosaicing, and the entire pipeline required for imaging. It's just not fast enough for the demands of video and for the increasing size of the sensors.

      This is the era of specialized processors: scalers, baseband processors, imaging and signal processors, encryption hardware, etc. These processors can be practically optimal, can be designed for a specific useful throughput, can be specifically tailored to a power consumption footprint, and can also take up less die space.

      The world of an SoC often contains several such processors embedded on its die.

      The GPU has proven itself to be totally up to the task of a smooth, animating user interface, though. And also, as it was designed for, OpenGL-based gaming: rendering textured polygons to a z-buffer, three-dimensional transforms, and implementing shaders (geometry shaders as well).

      The multiple cores and the GPU become the tableau for the application programmer, so an application can make use of the device and do wonderful things.

      Back to parallelism and concurrency: concurrency happens when a diverse set of operations, such as video processing and dynamic control of sensors and MEMS, needs to be done. Parallelism happens when a specialized computation needs to be performed that is homogeneous.

      Parallelism is really the domain of Grand Central Dispatch, the multithreading helper. A block of code may be distributed onto multiple micro threads, for instance, on the Intel platform.

      But I distinguish between this kind of threaded parallelism and lockstep parallelism, like that which occurs in a SIMD instruction. Perhaps 32 identical operations happen at once, in a vector instruction set, such as those on Core i7 processors.

      And on a GPU, the parallelism is, increasingly, best done on scalar shaders. The work is spread about in workgroups onto multiple shaders that operate in parallel on the GPU. Some GPUs have 500+ shaders operating in parallel. When the operation fits this model, it can definitely be sped up. Indeed, I have experience in this, as well as in the vector instruction model, which is quite different to program, I assure you.

      You can coax typical code into parallelism using something like a select instruction: program a vector comparison operation and use the resulting comparison mask to select the proper results from a vector. While this is eminently programmable and similar to microcode in some ways, it also carries some inefficiency, since you end up having to compute all the answers and choose between them at the end.
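      For instance, a sketch with SSE intrinsics (one concrete instruction set for the idea; the comparison and the two candidate computations here are arbitrary examples): both sides are computed for all four lanes, and the comparison mask blends them at the end.

        #include <stdio.h>
        #include <xmmintrin.h>   /* SSE intrinsics */

        int main(void)
        {
            __m128 x      = _mm_set_ps(4.0f, -1.0f, 2.0f, -3.0f);
            __m128 zero   = _mm_setzero_ps();
            __m128 if_pos = _mm_mul_ps(x, _mm_set1_ps(2.0f));  /* result if x > 0 */
            __m128 if_neg = _mm_sub_ps(zero, x);               /* result otherwise */

            /* All-ones lanes where x > 0, all-zeros elsewhere. */
            __m128 mask   = _mm_cmpgt_ps(x, zero);

            /* select = (mask & if_pos) | (~mask & if_neg) */
            __m128 sel    = _mm_or_ps(_mm_and_ps(mask, if_pos),
                                      _mm_andnot_ps(mask, if_neg));

            float out[4];
            _mm_storeu_ps(out, sel);
            printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
            return 0;
        }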

      --Mark

    2. Parallelism is the isolation of independent calculations from dependent ones.

      Thus referential transparency is critical.

      Concurrency is used to run independent code in parallel, but it is also used for multiple threads that run on one processor round-robin via interrupts.

      Correction: the last link I provided in prior post was not about coaxing parallelism, rather about accomplishing the analogous reuse of orthogonal functions in an eager (strict) programming language, as can be accomplished in a lazy (total) language.

      More on that:

      http://copute.com/index.html.orig
      Skeptical -> Purity -> Eager vs. Lazy -> Tradeoffs -> Performance

    3. Pragmatically, concurrency allows you to overlap diverse processes. This is useful in I/O for instance, or in a truly coroutined process typical of a pipeline. Parallelism is the static duplication of code and simultaneous execution on multiple logical compute units. In modern GPUs, this is very clearly parallelism since each shader in the workgroup is running the same code in lockstep.

      Sadly, data can vary, and so GPU shader kernels with decisions cause less-than-optimal performance because of the lockstep requirements. In a multicore CPU, this is not the case, and best average running times can be realized. A branch taken on one does not mean the same branch will be taken on another.

      SIMD vector units share this specific downside caused by lockstep parallelism.

      So in some ways, adding more cores is a better parallelism than the GPU and SIMD (vector) model.

      The "recompiling" of a standard scalar algorithm for parallelism is the main problem we are looking to solve today in order to extend the Moore's law-like performance of our existing software. Essentially, we're looking for a "magic bullet".

      This magic bullet is a hard problem, made harder by the lack of referential transparency, as you say. This has been known since the earliest days of optimization: if you can't tell which variables are const within a given scope, then it's nearly impossible to optimize.

      So the inevitable procession of programming languages to higher and higher levels is fraught with problems for optimization and parallelism determination.

      It has led me to use a leveled-language approach to development. The outer sections that aren't time-critical are written in a higher-level language, like Objective-C++. The lower-level, time-critical sections of the code are written in C or OpenCL.

      In other words, I pay attention to what needs to be fast. To have a program recognize what needs to be fast is currently beyond our understanding. At a minimum, there should be pragmas for that!
