D. Kalinsky Associates
Technical Paper:

"Designing Software for Multicore Systems"

When you first think about it, the design of software for multicore systems seems  to be similar to the design of
traditional embedded multitasking software. But application software for multicore systems is often much larger in
size and much more complex than software that has in the past been developed for traditional single-CPU embedded
systems.  So we need some additional methodological recommendations to guide our work at this new scale and
new level of complexity.  In a first approximation, those guidelines can be similar to those for the design of large
distributed multiprocessor systems.

But, multicore systems are not simply miniaturized distributed systems.  Before embarking on the design of software
for a multicore system, some differences between multi-core and distributed systems must be understood, and then
taken into account when architecting software that is destined to run on multi-core SOCs (Systems-On-a-Chip).  Many
of the design assumptions that have been the firm foundation of embedded systems multitasking software design for
the past 20-30 years, are no longer valid when designing software for multi-core SOCs.


At the heart of architectural design is the partitioning of embedded software into ‘chunks’.  As a result, each ‘chunk’
will then be of smaller scale and complexity than the software system as a whole, and will thus be easier to work on.  
We can apply some starting-point guidelines for how to do this partitioning, based on the concepts of methodologist
Hassan Gomaa [ref. 1].

The basic idea is to divide a large and complex system into a number of ‘chunks’ that we will call sub-systems.  And
those sub-systems can then, if necessary, be further sub-divided into smaller ‘chunks’ (‘sub-sub-systems’, etc.).  This
hierarchical decomposition of the complex system can continue to as many levels as needed -- until the software
architectural designer can say something like, “Oh, I know how to write the software for this chunk in probably fewer
than 1,000 lines of code”, or “Oh, I know how to implement this chunk in probably fewer than 10 concurrent tasks.”

I like to think of the leaf-node ‘chunks’ or sub-systems as plastic containers – where each container can hold within it
a number of application tasks that are related to one another so that they work together to provide a major software
service.  Some operating systems that we use in the embedded world have actually built into the operating system
itself such a container-for-tasks concept.  For example, Linux processes may be thought of as containers for POSIX
threads (concurrent tasks).

The sub-systems or ‘chunks’ need to be as independent of one another as possible.  But of course they will never be
totally independent of one another – because if they were then they would be considered separate projects or
separate products. Here are some guidelines for partitioning a complex system into sub-systems that are minimally
interdependent:
* Partitioning into sub-systems should be guided by the problem to be solved.
    (… the project’s system and software requirements).
* A sub-system should perform 1 major service.
    (… not ½ of a service, or 2 services).
* There should be high “cohesion” between parts of a sub-system.
    (The components of a sub-system should work together very closely.)
* A data store should never be an interface between sub-systems.
    (It should always be encapsulated entirely within 1 sub-system.)

What are typical ‘major services’ that could well deserve to be assigned their own sub-systems?
    * Real-Time Control sub-system
    * Real-Time Coordination sub-system
    * Data Acquisition sub-system
    * Data Analysis sub-system
    * Server sub-system
    * User Services sub-system
    * System Services sub-system (… your embedded operating system).

Here is an example data acquisition system consisting of 4 sub-systems:

1. Gomaa, H., “Software Design Methods for Concurrent and Real-Time Systems”, ISBN-13: 978-0-201-52577-9.
The diagram above is an example of a race condition that can crop up when using message communication (green
envelopes) between tasks (blue rounded rectangles).  Say the temperature sensor at the left is measuring
temperatures around 72 F.  If the management task at the top sends messages to the temperature measurement
task and to the temperature display task to change the measurement units to Celsius,  it is entirely possible for the
temperature display task to receive a measurement update message that’s still carrying the value 72, after it’s already
received the message telling it to go forward using Celsius units.  In that case, it would treat the update as
representing 72 degrees Celsius, which is about 162 F.  This is obviously an error.  After that, the software would
pretty quickly settle down and go forward reporting values around 22 C., which is again approximately 72 F.  In other
words, the screw-up would appear as a transient “glitch”.

While similar race conditions could also crop up in single-processor environments, they are more likely in SMP
environments because of the generally less orderly timing of task scheduling in SMP.  Serially shareable resources
that are not protected by SMP-safe mutual exclusion mechanisms, or have software errors in the usage of the mutual
exclusion mechanisms, are also more likely to produce “glitches” in SMP.

For similar reasons, bugs relating to lack of reentrancy of code shared by multiple tasks are also more likely in SMP
environments.

In addition, problems relating to task priorities are more likely to crop up in the truly parallel environment of a multi-
core SOC than in the traditional ‘pseudo-parallel’ environment of a single-core multitasking system.  One of the main
reasons is that in multi-core work, task priorities can no longer be used to guarantee mutual exclusion.  Since there
are multiple processing cores available to an SMP operating system, it can run multiple tasks of the same priority
simultaneously on the cores it controls --- an impossible situation in a single-core system.  In other words, design
strategies such as “cooperative” scheduling do not work in SMP.  Or the multiple cores could be used to run tasks of
different priorities simultaneously, thus violating the traditional single-processor assumption that when a higher
priority task is running – no lower priority task will be running concurrently.


The design of software for multi-core systems is very different from the design of traditional embedded multitasking
software. Software for multi-core systems is often much larger in scale and much more complex than application
software for traditional single-CPU embedded systems.  This paper began with some methodological
recommendations to guide design work at this new scale and new level of complexity.  The first-cut guidelines are
similar to ones for the design of distributed multi-processor systems.

But multi-core systems are far from miniaturized distributed systems.  There are significant differences in the areas of
inter-core communication capacities, and inter-core topology, that must be taken into account when architecting
software that is targeted to run on multi-core SOCs. Many of the design assumptions that have been a foundation of
traditional embedded systems multitasking software design, are no longer valid when designing software for multi-
core SOCs.

This material was presented as a half-day tutorial at the Embedded Systems Conference 2007 Silicon Valley, April 1-5, 2007.

© Copyright 2016, D. Kalinsky Associates, All Rights Reserved.
This page updated March 25, 2016.
 Once a large and complex system has been partitioned into sub-systems (and those sub-systems have been further
hierarchically decomposed to as many levels as needed), the next step in software architectural design is to
decompose the leaf-node sub-systems (my ‘containers’) into concurrent tasks.  This decomposition is done very
much as for traditional single-CPU embedded software designs. [See H. Gomaa ref.1 for further guidance.]

The primary mechanism for communication between sub-systems is message passing.

Only after all of the above design steps are complete should the architectural designer begin to think about how to
map the software architecture into the processor cores of a multi-core target SOC.  When thinking about mapping the
software into the available cores, additional considerations may come into play, such as …
    * Proximity of processing software to the source of its data
    * Autonomy (or near-autonomy) of software sub-systems
    * Performance considerations
    * Hardware devices and their Software-Hardware interfaces
    * User Interface
    * Servers and large data stores.
These considerations may result in a re-design of the hierarchy of sub-systems, before going forward with the
mapping of the software architecture into the available cores.

In general, one or more sub-systems will be mapped to a processing core.  Leaf-node sub-systems of a sub-system
hierarchy should be mapped to processing cores as complete units.  In other words, a leaf-node sub-system should
not be split so that parts of its software will run on one core, while other parts run on another core.


In the first approximation above, guidelines for the design of software for multi-core systems can be similar to those
for the design of distributed multi-processor systems.  But a more in-depth view makes it apparent that they are not
the same.

Regarding core-to-core inter-processor communication capacities, these two types of systems are vastly different.  
Multi-core SOCs often have 1-2 orders of magnitude greater inter-processor communication bandwidth, with several
orders of magnitude smaller messaging latency and an underlying reliable physical layer.  They can be thought of as
containing small processors interconnected via wide conduits; whereas distributed systems are better thought of as
possibly containing larger processors but with very narrow interconnection conduits.

Further differences stem from the fact that multi-core SOCs contain a small, fixed number of processing cores
(“CPUs”) in a stable topology; whereas distributed systems can have dynamically changing numbers of processors in
dynamically varying topologies.

The communication-related differences tell us that larger flows of data can be passed between software running on
different cores in a multi-core SOC than in distributed designs.  And these data can usually be passed from core to
core with less reliability-assurance communication overhead than, for example, TCP/IP.

The topology-related differences tell us that monitoring (sometimes called ‘supervision’) of the availability of software
functionality on multiple cores can be handled differently in a multi-core SOC than in distributed designs.  In general,
the monitoring can be done much more simply in a multi-core SOC environment.


A multi-core SOC contains 2 or more processing cores within a single silicon chip.  Those cores can be either
identical or heterogeneous.   For example, the T.I. “OMAP” SOCs, popular in cell phones and PDAs, contain both an
ARM general-purpose processor and a DSP for signal processing.  They are an example of a heterogeneous multi-
core chip, where the different processor cores actually run different instruction sets and different operating systems,
possibly with totally disjoint memories.  Totally different and distinct work assignments are given to the different cores,
in what is called “Asymmetric Multiprocessing”.

On the other hand, some multi-core SOCs contain multiple identical processors, with a common shared memory.  In
that case, it is possible to run one operating system on the SOC that will control software execution on all of the cores
in the SOC.  The operating system can view all of the cores as equivalent and could possibly hand out work
assignments to different cores in its efforts to maximize throughput.  This is called “Symmetric Multiprocessing”.

Please note that it is possible to do “Asymmetric Multiprocessing” on a multi-core SOC containing identical cores
(“homogeneous”).  But it is not possible to do “Symmetric Multiprocessing” on a heterogeneous multi-core SOC.


For the different categories of multi-core multiprocessing, there exist different categories of embedded operating
systems:
* “Symmetric Multi-Processing” (“SMP”) operating systems, for homogeneous multi-core SOCs only; and
* “Asymmetric Multi-Processing” (“AMP”) operating systems, for heterogeneous and homogeneous multi-core SOCs.

SMP operating systems use a single operating system instance to control software execution on all of the cores of the
homogeneous multi-core SOC at the same time.  Such an operating system can do “Load Balancing”, by which it views
all of the cores as equivalent and can shift work assignments from core to core as it strives to maximize throughput.  
Or it can be told to do “Processor Affinity”, by which specific work assignments are tied to specific cores.

On the other hand, AMP operating systems are built as separate, possibly differing, operating systems on the
multiple cores of the multi-core chip.  They resemble (or can be identical to) the traditional Real-Time Operating
Systems (“RTOSs”) that are long familiar in the embedded single-processor world.  They are well suited for hard real-
time and deadline-oriented applications. See the illustration below.
The RTOS ‘kernel’ services shown above are centered upon a priority-based preemptive scheduler that is the heart of
the ‘task management’ services.  Above ‘task management’ are shown a number of mechanisms for communication
and synchronization among tasks running on the same processor, including messaging, semaphores, mutexes and
event flags.  Additional major categories of RTOS kernel services shown are dynamic memory allocation (similar to
‘malloc’ and ‘free’), timer services and device driver management services.

But these RTOS kernel services must be augmented for use in a multi-core environment: mechanisms are also
needed for reliable communication and synchronization among tasks running on different processing cores.  For
multi-core devices having shared memory, the RTOS can use some of the shared memory as a communication
channel, encapsulating the shared memory within a higher-level, more abstract mechanism such as inter-core
inter-task message passing.

As shown in the illustration here, it is actually possible for an RTOS to support precisely the same message-passing
model that was used in inter-task communication within a single processor, and to extend its support to
communication between tasks now running on different (perhaps heterogeneous) cores of a multi-core SOC.  A
number of RTOSs have been extended to provide this feature, by implementing within a new component of the RTOS
a communication software design pattern called “Forwarder-Receiver”.


SMP operating systems are quite different from hard deadline-oriented RTOSs. SMP operating systems use a single
operating system instance to control software execution on all of the cores of the homogeneous multi-core SOC.

Some SMP operating systems provide the kind of core-transparent inter-task message communication described for
AMP above.  For example, see <mqueue.h> of POSIX in the Linux world.

In addition, a number of SMP operating systems provide a mutual-exclusion mechanism somewhat like a
semaphore or a mutex, but specializing in multi-processing.  It is called a ‘spinlock’.  For example, see the threads
library of POSIX.

Spinlocks are an operating system mechanism for regulating the access to a serially shareable resource in a multi-
processing or multi-core environment.  Serially shareable resources can include such things as data tables, I/O
devices, or even non-reentrant algorithms, which are to be shared among tasks that may run on different processing
cores. A spinlock can be assigned to ‘protect’ a serially shareable resource, and then all tasks must be instructed to
ask ‘permission’ of the spinlock before actually accessing the resource.  This is usually called ‘locking’ the spinlock.  
And, of course, the task must ‘unlock’ the spinlock when it is done accessing the shareable resource.

So far this sounds pretty much like a classic semaphore.  And indeed, users of spinlocks need to watch out for some
of the same problems that users of semaphores often encounter: danger of deadlocks, danger of lockouts, priority
inversions, difficulties in debugging, etc.

But, unlike semaphores (or mutexes), attempting to lock a spinlock can put the caller into a busy checking loop
(“spinning”).  That’s because spinlocks are built around a hardware “test & set” operation.  If the lock is not currently
available, the caller software will “spin” until the lock does become available.  The caller software is not put into a
Blocked state, or a Waiting state, or a Pending state. Instead, it is still actively running on its processor core.  No other
task can run on that processor core during that time.
Hence, it is recommended that spinlocks be used only in SMP situations.  They are to be avoided in single-processor
designs.  And they should be used only when the expected maximum waiting time is less than the operating system’s
context switch time.  In addition, application software should not block (‘wait’, ‘pend’, ‘suspend’, …) while it’s holding a
spinlock.

Here’s an example of two POSIX threads in two Linux processes serially sharing a resource protected by a spinlock:
Each thread (‘task’) calls  pthread_spin_lock()  to lock the spinlock before accessing the resource; and then later
when it’s done with the resource it calls  pthread_spin_unlock() to release the lock.

Unlike RTOS semaphores or mutexes, it is normally OK for a spinlock to be locked and unlocked by an interrupt
handler or interrupt service routine (“ISR”) --- even if the ISR must spin on the lock for a short while.  However, working
in this way could negatively impact the ISR’s interrupt latency and responsiveness.


A few SMP software design pitfalls have already been mentioned in this paper, but there are a good number more
that deserve your caution.  They all boil down to the fact that humans, including skilled software architects, are weak at
thinking about parallelism in complex systems.  And multi-core SOCs are truly parallel systems.   Many a good single-
processor software design can fail in SMP.  SMP design is fundamentally different.

Timing-related bugs such as race conditions, in which the correctness of a result depends on the relative timing of
tasks, are more likely to crop up in the truly parallel environment of a multi-core SOC than in the traditional ‘pseudo-
parallel’ environment of a multi-tasking single-core system.