
But for gadgets, it is not generally the case that adding cores will make everything faster. The trend is, instead, toward specialized processors and distribution of tasks. When possible, these specialized processing units are placed on-die, as in the case of a typical System-on-a-Chip (SoC).
Why specialized processors? Because using some cores of a general CPU to do a specific, computationally-intensive task will be far slower and use far more power than using a specialized processor designed to do that task in hardware. And there are plenty of tasks for which this is true. On the flip side, the tasks we need to perform keep changing, so fixed-function hardware will not necessarily be able to handle them.
The reality is that these tasks are not all the same. Taking a picture is different from making a phone call or connecting to Wi-Fi, which is different from zooming into an image, which is different from real-time encryption, which is different from rendering millions of textured 3D polygons into a frame buffer. Once you see this, it becomes obvious that you need specialized processors to handle these specific tasks.
The moral of the story is this: one processor model does not fit all.
Adding More Cores
When it comes to adding more cores, one thing is certain: the amount of die space used goes up, because each core needs its own die space. Oh, and heat production and power consumption go up as well. So what are the ways to combat this? The first seems obvious: use a smaller and smaller fabrication process to build the multiple-core systems. So, if you started at a 45-nanometer process for a single-CPU design, then you might want to go to a 32-nanometer process for a dual-CPU design, a 22-nanometer process for a 4-core design, and something even finer for an 8-core design. And it just goes on from there. The number of gates you can place on the die goes up roughly as the square of the ratio of the old feature size to the new one. So when you go from 45 nm to 32 nm, you get the ability to put in 1.978x the number of gates. When you go from 32 nm to 22 nm, you get the ability to put in 2.116x as many gates. This gives you room for more cores.
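If you want to check those numbers yourself, the scaling rule fits in a few lines of Python. This is a back-of-the-envelope sketch of the inverse-square rule above, not a model of a real fabrication process:

def gate_scaling(old_nm, new_nm):
    """Approximate gate-count multiplier when shrinking from old_nm to new_nm.

    Gate density scales roughly with the inverse square of the feature size,
    so the multiplier is (old / new) ** 2.
    """
    return (old_nm / new_nm) ** 2

print(gate_scaling(45, 32))   # ~1.978x more gates going from 45 nm to 32 nm
print(gate_scaling(32, 22))   # ~2.116x more gates going from 32 nm to 22 nm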
A change in process resolution gives you more gates and thus more computation per square inch. But it also takes less power to do the same amount of work, which is useful for gadgets, for which keeping power consumption down is paramount. And if it takes less power, it may also run cooler.
But wait, we seem to be at the current limits of process resolution, right? Correct: 22 nm is about the limit as of this writing. So we will have to do something else to increase the number of cores.
The conventional wisdom for increasing the number of cores is to use a Reduced Instruction Set Computer (RISC) design. ARM uses one, and so does the PowerPC; Intel's x86 really doesn't. A RISC core is simpler, and therefore smaller and less power-hungry, so more of them fit on a die.
When you use a RISC processor, it generally takes more instructions to accomplish a given task than on a non-RISC processor, though your mileage may vary.
Increasing the die size also can allow for more cores, but that is impractical for many gadgets because the die size is already at the maximum they can bear.
The remaining option is to agglomerate more features onto the die. This is the typical procedure for an SoC. Move the accelerometer in. Embed the baseband processor, the Image Signal Processor (ISP), and so on onto the die. This reduces the number of separate components and leaves more room for the die itself. It is hard to do, because your typical smartphone company usually just buys components and assembles them. Yes, the packaging for those components actually takes up space!
Heat dissipation becomes a major issue with large die sizes and extreme amounts of computation. On a desktop, that means mounting heat sinks and fans on the chips. Oops. That's no use for a gadget. Gadgets don't have fans!

Modern gadgets are going the way of SoCs. And the advantages are staggering for their use cases.
Consider power management. You can turn each processor on and off individually. This means that if you are not taking a picture, you can turn off the ISP. If you are not making a call (or, even more useful, if you are in Airplane Mode), then you can turn off the baseband processor. If you are not zooming the image in real time, then you can turn off a specialized scaler, if there is one. If you are not communicating over encryption, as under a VPN, then you can turn off the encryption processor, if you have one. And if you are not playing a point-and-shoot game, then maybe you can even turn off the Graphics Processing Unit (GPU).
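There is no single standard API for this kind of gating; real SoCs do it through vendor-specific power-management registers and drivers. But the logic is easy to sketch. Everything in the snippet below is hypothetical for illustration: the subsystem names, the activity table, and the set_power callback are made up, not a real driver interface.

# Hypothetical sketch of activity-based power gating on an SoC.
# Subsystem names and set_power() are illustrative, not a real driver API.

ACTIVITY_TO_UNITS = {
    "camera":  ["isp"],
    "call":    ["baseband"],
    "zoom":    ["scaler"],
    "vpn":     ["crypto"],
    "3d_game": ["gpu"],
}

def update_power(active_activities, set_power):
    """Power each specialized unit only when some current activity needs it."""
    needed = {unit
              for activity in active_activities
              for unit in ACTIVITY_TO_UNITS.get(activity, [])}
    all_units = {u for units in ACTIVITY_TO_UNITS.values() for u in units}
    for unit in sorted(all_units):
        set_power(unit, on=(unit in needed))

# Example: camera app open, everything else idle -- only the ISP stays powered.
update_power({"camera"}, lambda unit, on: print(unit, "on" if on else "off"))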
Every piece you can turn off saves you power. Every core you can turn off saves you power. And the more power you save, the longer your battery will last before it must be recharged. And the amount of time a device will operate on its built-in battery is a huge selling point.
Now consider parallelism. Sure, four cores are useful for increasing parallelism. But the tendency is to use all the cores for a computationally-intensive process, and that ties up the CPU for noticeable amounts of time, which can make the UI sluggish. By using specialized processors, you free up the CPU cores for the work that has to be done all the time, and the device can finally behave like a real multitasking device.
Really Big Computers
Massive parallelization does lend itself to a few really important problems, and this is the domain of the supercomputing center. When one gets built these days, thousands, if not millions, of CPU cores are wired together to make a huge petaflop processing unit. Sequoia, a BlueGene/Q parallel array of 1,572,864 cores, is capable of 16.32 petaflops.
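As a quick sanity check on those figures, dividing the throughput by the core count gives the average contribution per core:

cores = 1_572_864            # BlueGene/Q cores in Sequoia
petaflops = 16.32            # reported throughput
per_core_flops = petaflops * 1e15 / cores
print(f"{per_core_flops / 1e9:.1f} GFLOPS per core")   # roughly 10.4 GFLOPS per core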
But wait, the era of processing specialization has found its way into the supercomputing center as well. This is why many supercomputers are adding GPUs into the mix.
And let's face it, very few people use supercomputers. The computing power of the earth is measured in gadgets these days. In 2011, there were about 500 million smartphones sold on the planet. And it's accelerating fast.
The Multi-Processing Challenge
And how the hell do you code on multi-processors? The answer is this: very carefully.
Seriously, it is a hard problem! On GPUs, you set up each shader (what a single processor is called) with the same program and operate them all in parallel. Each small set of shaders (called a work group) shares some memory and also can share the texture cache (where the pixels come from).
It takes some fairly complex analysis and knowledge of the underlying structure of the GPU to really make general computation go fast. Using the GPU for general-purpose computation is known as GPGPU (general-purpose computing on GPUs). The OpenCL language is designed to meet this challenge and bring general computation to the GPU.
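To give a minimal taste of the model, here is a sketch using the pyopencl bindings: the same tiny kernel program runs on every shader, and each invocation picks out its own element with get_global_id. The kernel name and sizes are illustrative, and the sketch assumes pyopencl plus a working OpenCL driver are installed.

import numpy as np
import pyopencl as cl

# Kernel source: every shader runs this same program on a different element.
KERNEL = """
__kernel void scale(__global const float *src, __global float *dst, float k) {
    int i = get_global_id(0);   /* which element this shader instance handles */
    dst[i] = src[i] * k;
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
program = cl.Program(ctx, KERNEL).build()

src = np.arange(1024, dtype=np.float32)
mf = cl.mem_flags
src_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=src)
dst_buf = cl.Buffer(ctx, mf.WRITE_ONLY, src.nbytes)

# Launch 1024 parallel work items; the runtime groups them into work groups.
program.scale(queue, src.shape, None, src_buf, dst_buf, np.float32(2.0))

dst = np.empty_like(src)
cl.enqueue_copy(queue, dst, dst_buf)
print(dst[:4])   # [0. 2. 4. 6.]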
On multiple cores, you set up a computation thread on one of the cores, and you can set up multiple threads across multiple cores. Microthreading is one technique used to make multiple threads operate efficiently on a single core; which technique you use depends upon how the core is designed. With hyperthreading (Intel's simultaneous multithreading), one thread can be waiting for data or stalled on a branch misprediction while the other is computing at full bore, and vice versa. On the same core!
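In a systems language you would spin up one OS thread per core; in Python (used for the sketches here) the usual way to actually occupy multiple cores is the standard multiprocessing module, since CPython threads share one interpreter lock. The crunch function below is just a stand-in for a heavy computation:

import os
from multiprocessing import Pool

def crunch(n):
    """Stand-in for a computationally heavy task handed to one worker."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    cores = os.cpu_count()                      # one worker process per core
    with Pool(processes=cores) as pool:
        results = pool.map(crunch, [2_000_000] * cores)
    print(f"ran {cores} workers, first result: {results[0]}")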
So you need to know lots about the underlying architecture to program multiple cores efficiently as well.
But there are general computation solutions that help you to make this work without doing a lot of special-case thought. One such method is Grand Central Dispatch on Mac OS X.
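Grand Central Dispatch itself is a C/Objective-C API built around dispatching blocks onto queues. The general idea, though, is to hand tasks to a system-managed pool instead of managing threads yourself, and that idea looks like this in Python's standard concurrent.futures. This is only an analogy to the dispatch model, not GCD's actual API, and render_tile is a made-up stand-in task:

from concurrent.futures import ProcessPoolExecutor, as_completed

def render_tile(tile_id):
    """Stand-in for an independent chunk of work dispatched to the pool."""
    return tile_id, sum(i * i for i in range(200_000))

if __name__ == "__main__":
    # The executor decides how to spread the tasks over the available cores,
    # much as a dispatch framework hides thread management from the programmer.
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(render_tile, t) for t in range(8)]
        for fut in as_completed(futures):
            tile_id, value = fut.result()
            print(f"tile {tile_id} done: {value}")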

There is one multi-core architecture that takes a massively-parallel approach and departs from simply adding identical cores. The Cell Architecture does this by combining a general-purpose processor (in this case a PowerPC) with multiple cores for specific hard computation. This architecture, pioneered by Sony, Toshiba, and IBM, targets such applications as cryptography, matrix transforms, lighting, physics, and Fast Fourier Transforms (FFTs).
Take a PowerPC processor and combine it with multiple (eight) Synergistic Processing Elements (SPEs) capable of excellent (but simplified) Single-Instruction Multiple Data (SIMD) floating-point operations, and you have the Cell Broadband Engine, a unit capable of 256 Gflops on a single die.
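To give a flavor of the SIMD style the SPEs are built around, the same arithmetic operation is applied across a whole vector of values in one step rather than one element at a time. The NumPy sketch below is only an analogy for that programming model; real Cell code is written in C with SPE intrinsics.

import numpy as np

# SIMD-style thinking: one operation applied across many floats at once,
# instead of a scalar loop over each element. (An analogy, not SPE code.)
a = np.linspace(0.0, 1.0, 1_048_576, dtype=np.float32)
b = np.full_like(a, 2.0)
c = np.full_like(a, 0.5)

result = a * b + c        # a multiply-add pattern applied to the whole vector
print(result[:4])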
This architecture is used in the Sony PlayStation 3. But there is some talk that Sony is moving to a conventional multi-core-with-GPU model, possibly supplied by AMD.
But what if you apply a cellular design to computation itself? The GCA model for massively-parallel computation is a potential avenue to consider. It is based on cellular automata: each processor has a small set of rules to apply during the cycles in between communications with its neighboring units. That's right: it uses geometric location to decide which processors to talk with.
This eliminates little complications like the need for an infinitely fast global bus, which a massively parallel system might otherwise require if each processor could potentially talk to every other processor.
The theory is that, without some kind of structure, massively parallel computation is not really possible. And its proponents are right, because there is a bandwidth limitation in any massively parallel architecture that eventually puts a cap on the number of petaflops of throughput.
I suspect a cellular model is probably a good architecture for at least two-dimensional simulation. One example of this is weather prediction, which is mostly a two-and-a-half dimensional problem.
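Here is a hedged sketch of what "each processor talks only to its geometric neighbors" means in practice: a 2D stencil update in which every cell's next value depends only on its four nearest neighbors, which is the communication pattern a cellular architecture would map onto hardware. Real weather models are, of course, far more involved than this toy diffusion step.

import numpy as np

def step(grid):
    """One update where each cell averages itself with its four nearest neighbors.

    Every cell needs only local (neighbor) data -- no global bus required.
    """
    up    = np.roll(grid,  1, axis=0)
    down  = np.roll(grid, -1, axis=0)
    left  = np.roll(grid,  1, axis=1)
    right = np.roll(grid, -1, axis=1)
    return (grid + up + down + left + right) / 5.0

grid = np.random.rand(64, 64)       # e.g. a coarse temperature field
for _ in range(10):
    grid = step(grid)
print(grid.mean())                  # values diffuse and smooth out over time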
So, in answer to another question, "How do you keep adding cores?", the response is also "very carefully."