The long answer begins with a 1-minute history of computer architecture. A processor generally has something called an instruction pipeline. In the earliest scalar architectures, only one instruction (read: hardware function) could ever be executed at any given time. Some clever hardware engineers determined that this was not utilizing the hardware as effectively as possible, so they came up with pipelining and, later, the superscalar architecture, which allow more than one hardware function to be in flight at a time. Generally speaking, this meant that while the 'add' unit was in use at one point in time, the 'memory fetch' unit could be in use at the same point in time. This introduced something the industry termed a 'data hazard'. For example, if a certain add operation depended on the result of a memory fetch operation, the add would produce unanticipated results if the memory fetch had not completed in time. The first solution to this problem was to introduce stalls in the pipeline, which were (and still are) very bad. The second solution (really an improvement on the first) was to add another hardware unit to the chip that re-orders instructions before sending them down the pipeline, in order to minimize pipeline stalls due to data hazards. That hardware unit is called an out-of-order execution unit. Instruction scheduling can actually be done in software as well, by the compiler and linker, but since this only allows off-line (static) instruction scheduling, it cannot account for asynchronous events that are only stochastically predictable. This is where the branch prediction unit comes into play, but I'll omit that for brevity. So far, only instruction-level parallelism has been covered.
[Image: The Cortex-A9 pipeline]
[Image: The Cortex-A9 MPCore]
Getting back to the original question, it's important to consider what a 'thread' actually is. A thread is a pure software abstraction for a logical sequence of events. Threads are often associated with a priority, a state (e.g. ready, waiting, zombie), and an instruction pointer. A threading abstraction (e.g. POSIX threads) must a) provide data protection primitives, b) provide mechanisms to wait until data is not in use, and c) provide mechanisms to signal when data is no longer in use. Usually the operating system handles scheduling which threads are running at any given time, although it isn't that hard to do this without an operating system. The fundamental method of synchronizing threads is shared sections of memory combined with atomic processor instructions. The thread scheduler uses timer-generated hardware interrupts to periodically evaluate the state of all threads, and then schedules code (i.e. determines the next branch target) for 1 to N cores. In the case of a uniprocessor system, this means that the scheduler itself is swapped in and out after a certain number of time slices, where each time slice is occupied by a thread based on priority, state, etc. The number of cores available at any given time is also controllable in software, since cores can be dynamically powered off to save energy; this is something the thread scheduler must take into account.
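To make the 'shared memory plus atomic instructions' point concrete, here is a minimal sketch of a spinlock built on C11 atomics. The names are illustrative; a real threading library such as POSIX threads wraps this kind of primitive in a mutex and asks the OS scheduler to put waiters to sleep instead of spinning.

```c
#include <stdatomic.h>

/* One word of shared memory plus an atomic read-modify-write
 * instruction is all that mutual exclusion fundamentally needs. */
static atomic_flag lock = ATOMIC_FLAG_INIT;

static void acquire(void)
{
    /* atomic test-and-set; on ARMv7 (e.g. Cortex-A9) this typically
     * compiles to an exclusive load/store (LDREX/STREX) sequence */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;  /* busy-wait; a real mutex would yield or sleep here */
}

static void release(void)
{
    atomic_flag_clear_explicit(&lock, memory_order_release);
}
```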
As for initialization of each core, typically a single core is activated at power-on; then, as the operating system (or main binary) launches, a threading manager is launched as well. The threading manager initializes and creates descriptive data structures for the remaining cores on the system, and so on. As each core runs, it literally operates in a loop: 'jump' to an instruction and start executing, or go to sleep if not needed, then do the same thing again. The details of power-up, particularly in the case of the ARM architecture, are very manufacturer dependent, since e.g. an OMAP MPCore implementation can differ in physical layout and register locations from, e.g., an MSM MPCore implementation.
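As a purely conceptual sketch of that loop, assuming a hypothetical shared 'mailbox' variable that core 0's threading manager writes to (on real silicon the mailbox location and the wake-up mechanism are SoC-specific: boot-ROM spin tables, mailbox registers, and so on):

```c
typedef void (*entry_fn)(void);

/* Hypothetical shared location written by core 0; in reality this is
 * usually a fixed, SoC-specific register or memory address. */
volatile entry_fn core_mailbox = 0;

void secondary_core_main(void)
{
    for (;;) {
        entry_fn entry = core_mailbox;
        if (entry)
            entry();                   /* "jump" to the scheduled code */
        else
            __asm__ volatile("wfi");   /* ARM wait-for-interrupt: sleep if not needed */
    }
}
```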
In summary: sure, it's easy to have several cores running at the same time, but getting them to coordinate on shared data properly (i.e. run threads with shared data sections) requires that the concept of parallel execution be built into an application or library, which is not always easy. For a simple example, assume a library allocates 512 MB (i.e. 2**29 bytes) of memory, sets it all to zero, and then deallocates the memory. Would it run any faster on a multi-core system than on a single-core system? Absolutely not, because processor cores do not follow the programming methodology of DWIMNWIS ('do what I mean, not what I say') unless they have a pretty advanced hardware rescheduler.
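A hypothetical sketch of that serial library (the function name is made up for illustration):

```c
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE (512UL * 1024 * 1024)   /* 512 MB = 2**29 bytes */

/* Single-threaded version: one core does all the work, the others idle. */
void clear_buffer_serial(void)
{
    unsigned char *buf = malloc(BUF_SIZE);
    if (!buf)
        return;
    memset(buf, 0, BUF_SIZE);   /* runs entirely on whichever core the thread lands on */
    free(buf);
}
```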
If I modify the library to first query a threading library for the number of logical cores, partition my buffer into N sections, and then create several threads that are each aware of their own partition boundaries, then I can expect the library to perform faster according to Amdahl's law: S = 1/(F + (1-F)/N), where F is the fraction of the work that cannot be parallelized. In this idealized case, since none of the problem is inherently serial, F = 0 and the speedup is S = N. However, even when a thread scheduler is present, there is still variability in where the code will actually run - for example, all of the threads could end up on a single core rather than being distributed among them.
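Here is a minimal sketch of that partitioned version using POSIX threads. `sysconf(_SC_NPROCESSORS_ONLN)` stands in for 'query a threading library for the number of logical cores' (it is a common Linux/glibc facility rather than part of the pthreads API itself), and all of the names are illustrative:

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE (512UL * 1024 * 1024)   /* 512 MB = 2**29 bytes */

struct partition { unsigned char *start; size_t len; };

/* Each thread zeroes only its own partition of the buffer. */
static void *clear_partition(void *arg)
{
    struct partition *p = arg;
    memset(p->start, 0, p->len);
    return NULL;
}

void clear_buffer_parallel(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);   /* number of logical cores */
    if (n < 1)
        n = 1;

    unsigned char *buf = malloc(BUF_SIZE);
    pthread_t *threads = malloc(n * sizeof(*threads));
    struct partition *parts = malloc(n * sizeof(*parts));
    if (!buf || !threads || !parts)
        goto out;

    size_t chunk = BUF_SIZE / n;
    for (long i = 0; i < n; i++) {
        parts[i].start = buf + i * chunk;
        /* the last partition also takes any remainder */
        parts[i].len = (i == n - 1) ? BUF_SIZE - i * chunk : chunk;
        pthread_create(&threads[i], NULL, clear_partition, &parts[i]);
    }
    for (long i = 0; i < n; i++)
        pthread_join(threads[i], NULL);

out:
    free(parts);
    free(threads);
    free(buf);
}
```

Compile with -pthread. As noted above, whether the speedup actually approaches N still depends on how the scheduler distributes these threads across cores.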
[Image: Amdahl's Law: S = 1 / (F + (1-F)/N)]
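As a quick worked example of the formula: if 10% of the work were serial (F = 0.1), then on N = 4 cores the speedup would be S = 1 / (0.1 + 0.9/4) ≈ 3.1, noticeably short of the ideal 4x.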
Also, I'm looking forward to receiving a PandaBoard shortly, with its dual-core Cortex-A9 OMAP4430. This will give me incentive to do some SMP performance tweaks on some of the NEON-enabled software I've written lately (e.g. FFTW).