The “Delete Member B” operation is shown in the upper part of the figure.  But in parallel, an “Insert Member Z after
Member B” operation could be going on at exactly the same time, as shown in the lower part of the figure.  The result
would be that Member Z would be left inaccessible.

   So here we see that CAS is not an all-powerful ‘magic bullet’ in itself.  Developers of lock-free programs need to
‘help it along’ with solutions to problems such as the race condition of Figure 7.  Here’s an idea that may help: Before
the “Delete Member B” code calls CAS, it should intentionally modify the pointer field of Member B with some kind of
deletion mark.  This would make a parallel CAS from an “Insert” request involving Member B fail.  In that way, the
“Insert” would be postponed until a later retry, which would then succeed in inserting the new list member correctly.
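   One possible shape for this ‘deletion mark’ idea is sketched below in C.  The sketch keeps the mark in the low bit
of the next-pointer (a common trick, assuming list members are at least 2-byte aligned); ‘CASptr’ stands for an
assumed atomic Compare-And-Swap on a pointer field, and all names here are illustrative rather than part of any
particular library:

    #include <stdint.h>

    struct member {
        void          *data;
        struct member *next;
    };

    /* Assumed atomic compare-and-swap on a pointer field; returns
       nonzero on success.  Provided by hardware or the OS.        */
    int CASptr(struct member **var, struct member *expected,
               struct member *new_value);

    /* The 'deletion mark' lives in the low bit of the pointer.    */
    #define MARK(p)      ((struct member *)((uintptr_t)(p) | 1u))
    #define UNMARK(p)    ((struct member *)((uintptr_t)(p) & ~(uintptr_t)1u))
    #define IS_MARKED(p) (((uintptr_t)(p) & 1u) != 0)

    int delete_member(struct member *member_a, struct member *member_b)
    {
        struct member *succ;

        /* Step 1: mark Member B's pointer field.  From now on, a
           parallel "Insert after Member B" will fail its CAS,
           because the field no longer holds the value that the
           Insert expected to find there.                          */
        do {
            succ = member_b->next;
            if (IS_MARKED(succ))
                return 0;             /* already being deleted     */
        } while (!CASptr(&member_b->next, succ, MARK(succ)));

        /* Step 2: unlink Member B by swinging Member A's pointer
           past it.  If this CAS fails (Member A's pointer changed
           in the interim), the caller must retry the Delete.      */
        return CASptr(&member_a->next, member_b, UNMARK(succ));
    }

A production-quality lock-free list needs further care beyond this sketch – for example, safe reclamation of the
memory of deleted members.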

   In this way, it is possible to do ultra-high-performance lock-free manipulation of a linked list.  But once again, we
have encountered logic that is more complex and ‘delicate’ than its lock-based alternative.  So again, it is harder to
understand, and much harder to debug.  And once again, there is looping behavior in this logic that can make its
execution time uncontrollable.


CONCLUSIONS

   Perhaps some of the readers of this article are disappointed at this point.  Wasn’t one of the big expectations of
Lock-Free Programming that we could eliminate the task waiting and core waiting inherent in lock-based
programming on multicore SOCs?  But now we have seen that application software designed for Lock-Free
Programming could itself involve significant task waiting and core waiting.  And this waiting, as in the examples we’ve
seen, is typically busy-waiting where application software just keeps on running (‘hogging’ the “Running State” of a
core) while preparing for the next attempt to CAS.

   Some readers were perhaps expecting that “Lock-Free Programming” would be synonymous with “Wait-Free
Programming”.  But now we know they are different.  And “Wait-Free Programming” is even more challenging to
design than “Lock-Free Programming” (Ref. 3).

   However, Lock-Free Programming does have distinct advantages over traditional lock-based programming in areas
such as typical performance, deadlock avoidance, lockout avoidance, and priority inversion avoidance.  Program
execution can sometimes be accelerated by factors of two or more at former ‘critical sections’.

   Lock-Free Programming is not often used at present, perhaps because of the ultra-fine “craftsmanship” required
of the software developers involved.  The design of library software might be a sensible place to consider it.  Or
perhaps there is one small corner of your application where the elimination of high-frequency locking would give a
major performance boost?  Those might be good opportunities for you to try out a lock-free programming approach
for the first time.


END.


TECHNICAL REFERENCES

1.        D. Kalinsky, “Multicore’s Fourth Semaphore: Multiple Reader-Writer Lock”.

2.        F. Siebert, “Facing the Challenges for Real-Time Software Development on Multi-Cores”, Embedded Systems
Conference 2010 Boston, paper ESC-422.

3.        A. Alexandrescu, “Lock-Free Data Structures”, Dr. Dobb’s, Oct. 1, 2004.  http://www.drdobbs.com/184401865
"Is Lock-Free Programming
                           
              Practical for Multicore ?"
BACKGROUND

At my courses on multicore software design, the professional engineers attending often feel uncomfortable in the
‘Brave New World’ of truly parallel software, which is significantly different from the ‘pseudo-concurrency’ of traditional
single-CPU multitasking.  During the part of the course about shared resources, participants shift nervously in their
seats as 4 different kinds of semaphores are described (Ref. 1).  By the time I add to the discussion 3 or more
varieties of ‘spin-locks’, with their characteristic active ‘spinning’, some attendees begin to audibly groan.  As I try to
explain the often subtle differences in usage of these 7 or more mechanisms, attendees frequently stop me in mid-
sentence to ask:
“Hey David, is there any way I can just avoid that entire ‘can of worms’?”

 What they are really asking is: “Hey David, is there a way I can do resource sharing efficiently without locks in a
multicore environment?”


INTRODUCTION

 Facilities for Lock-Free programming are available in the hardware of many multicore SOCs, as well as in the
software of some multicore operating systems.  Unfortunately, developing application software algorithms with
these facilities is both different from and more challenging than working in the traditional way: when using traditional
locks, a software designer first identifies ‘critical sections’ of code, and then ‘protects’ them by selecting one of the
traditional locking mechanisms to ‘guard’ those critical sections.  In the past, avoiding locks seemed dangerous, or
tended to involve intricate, convoluted algorithms.  For these reasons, Lock-Free programming has not been widely
practiced.

 But as application software developers gain experience with multicore SOCs and multicore operating systems, they
frequently discover that traditional multitasking design approaches are often inappropriate in a multicore
environment.  Multiple tasks, and often multiple cores, can spend a great deal of time waiting on locked locks, thus
reducing parallelism, increasing latencies, and reducing the benefits of using a multicore SOC.

 While Lock-Free programming is not a ‘cure-all’, it can be used sensibly to provide a significant performance
advantage over lock-based programming.  This is most often done by using it within a small percentage of application
software’s tightest, most deeply nested and heavily-executed loops.

 The very best form of Lock-Free programming is simply to design application software as very large and totally
independent parallel chunks of code.  Such immense blocks of code could then run concurrently (typically on different
cores) for long periods of time without any interaction at all between them. There would be no need for locks, since
there would be no data interactions.  But this is not practical in many applications.

 If data interactions are unavoidable, some simple data interactions between parallel chunks of code can be
governed by a Lock-Free operation called “CAS” or atomic “Compare-And-Swap” that is offered in hardware on various
multicore SOCs and also in a number of multicore operating systems.  Several examples of the use of “CAS” will be
given later in this article.


PROBLEMS WITH LOCKS

 Locking mechanisms such as semaphores, mutexes and multiple reader-writer locks are problematic even in
single-CPU multitasking environments.  Using them in a multicore environment tends to exacerbate their problems
because of the true parallelism involved and the more ‘chaotic’ nature of task scheduling done by multicore operating
systems.  In multicore environments, additional locking mechanisms called ‘spin-locks’ join the fray.  Here are some
oft-encountered problems involving these locks:

•        Problem: DEADLOCKS

A deadlock is a situation where 2 (or more) tasks wait for each other to release a resource that the other is holding.  
Since they’re waiting, they will not release anything; and as a result they will not do any useful work ever again.  An
example is shown in Figure 1 below.
FIGURE 1: A Deadlock Scenario
In a multicore SOC, the deadlocking tasks may well be on different cores.  And if they are trying to lock spin-locks
(instead of the semaphores in the figure), the deadlocking tasks might be ‘spinning’ at full speed on the different
cores, while making absolutely no progress.

•        Problem: LOCKOUTS

A lockout is a situation where a task locks a lock, then uses the resource the lock is ‘guarding’, but later forgets to
release the lock.  If another task (or perhaps the same task again) tries to lock the lock after this, it will never get the
lock.  If this second task waits indefinitely for the lock, it will never run its application code again.  It will appear to have
‘died’.

•        Problem: PRIORITY INVERSIONS

A priority inversion is a situation in which a low-priority task prevents a high-priority task from executing.  This situation
can happen very easily if a low-priority task locks a lock, and then a high-priority task (perhaps running simultaneously
on another core) tries to lock the same lock.  The high-priority task will be made to wait on the lock, until the low-priority
task releases it.  If the locking, and hence the waiting, goes on for a long time, the performance of the high-priority task
can be seriously impacted.

•        Problem: PERFORMANCE DEGRADATIONS

Locks are operating system abstractions, and they are used by calling operating system services that are often
cumbersome.  For example, operating systems associate task waiting queues with most kinds of locks, and must
manage the associated queue during every operation on a lock.  This can be time-consuming.  In addition, tasks
asking to lock a lock may be delayed in the task waiting queue if the lock is in use.  In a multicore environment,
multiple tasks on multiple cores may be forced to wait, possibly in a busy-wait if they are ‘spinning’ on a ‘spin-lock’.

•        Problem: INTERRUPT HANDLER LIMITATIONS

Interrupt handlers such as ISRs are limited in their interactions with most locks.  For example, ISRs are forbidden to
wait for semaphores.  They are forbidden to operate in any way at all on (priority-promotion) mutexes.  Thus, tasks
can’t use semaphores or mutexes to protect critical sections of code that are shared with ISRs.  ISRs are permitted
to wait for spin-locks; however, such waiting could inject significant additional latency into the ISR.


LOCK-FREE PROGRAMMING TO THE RESCUE?

Lock-Free programming can be used in some situations to provide a significant performance advantage, and avoid
the above litany of problems inherent in lock-based programming.  Simple data interactions between parallel chunks
of code can be governed by a Lock-Free operation such as “CAS” or atomic “Compare-And-Swap” that is offered as a
machine instruction on a number of multicore SOCs as well as a service of some multicore operating systems.

   The operation of CAS, expressed in a C-like language, is shown in Figure 2 below (Ref. 2).
FIGURE 2: Semantics of a typical implementation of CAS.
   What CAS does is compare the current content of ‘var’ with an ‘expected’ value.  If it is as expected, then ‘var’ is
updated to a ‘new’ value.  If it is not as expected, then ‘var’ is not updated.  All of this is done as a single non-
interruptible, non-preemptable operation.
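   Since Figure 2 itself is not reproduced here, the semantics just described can be sketched in C-like code as
follows.  This is a description of behavior only – a real CAS performs the whole function as a single atomic machine
instruction or operating system service, and the exact rendering in Ref. 2 may differ in detail:

    /* Behavioral sketch only: in reality ALL of this executes as
       one non-interruptible, non-preemptable operation.           */
    int CAS(int *var, int expected, int new_value)
    {
        if (*var == expected) {    /* still the value we expected? */
            *var = new_value;      /* yes: install the 'new' value */
            return 1;              /* report success               */
        }
        return 0;                  /* no: 'var' has been tampered
                                      with; leave it unchanged     */
    }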

   The way this is used in the development of lock-free code is to begin by copying the original value of a variable (or
pointer) into ‘expected’.  Then a second copy is used in a (possibly long and complex) calculation to arrive at a ‘new’
value for the variable.  But during that processing, another task – perhaps on another core – may have already
modified the variable from its original value.  We would like to update the variable with the ‘new’ value only if it has
not been “tampered with” by another task in the interim.  CAS does the combined check and update together, without
the need for a locking mechanism.

      If CAS shows that the variable no longer contains its original or ‘expected’ value, application software needs to
decide what to do next.  The variable has not been updated with the ‘new’ value.  What is often done is to repeat the
entire calculation of the (possibly long and complex) processing, this time based on a new reading of the current
value of the variable.  Then CAS is invoked again, in the hope that this time there was no “tampering” by another task.


EXAMPLE 1: INCREMENTING A COUNTER

   Let’s begin with a simple example of two tasks that both increment a value using the C-language “++” operator.
This operation is not an atomic operation.  See Figure 3.
FIGURE 3: Two tasks increment a shared variable, with random failures.
   The two tasks may become active at the same time in a multicore environment, whether they have the same priority
or differing priorities.  Occasionally, the “++” operations in the two tasks will be executing at precisely the same time –
resulting in a corruption of the value stored in “counter”.  For example, a count may be lost.
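   Since Figure 3 itself is not reproduced here, the failing code amounts to something like the following sketch, in
which ‘counter’ is shared by both tasks:

    int counter = 0;    /* shared by Task A and Task B             */

    /* Executed by BOTH tasks, possibly at the same instant on
       different cores:                                            */
    void count_event(void)
    {
        counter++;      /* NOT atomic: typically compiles to a
                           load, an add, and a store.  If the two
                           tasks interleave inside that sequence,
                           one increment is silently lost.         */
    }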

   This can be avoided by using a lock to protect the “++” operation as a “critical section”.  But we are trying to do Lock-
Free programming!  Figure 4 shows how to achieve the same objective in a Lock-Free fashion (Ref. 2).
FIGURE 4: Two tasks increment a shared variable with no failures, no locks.
   In the Figure 4 code, it does not matter whether the “+ 1” operation is atomic or not.  It only matters whether or not
another task has been “tampering with” (or “working on”) the variable “counter” during the calculation.  The CAS
operation checks this, and will only update “counter” if there has been no tampering.  If there has been tampering, this
code will simply loop around to the beginning and try to increment again.
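   Since Figure 4 itself is not reproduced here, the lock-free version can be sketched as follows, re-using the CAS
operation sketched earlier (all names illustrative):

    void lock_free_increment(int *counter)
    {
        int expected, new_value;
        do {
            expected  = *counter;       /* snapshot current value  */
            new_value = expected + 1;   /* compute the new value   */
            /* CAS fails -- and we loop back to try again -- if
               'counter' was tampered with between the snapshot
               and this point                                      */
        } while (!CAS(counter, expected, new_value));
    }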

   Before leaving this example, let’s note several things.  First of all, this code is much more complex and ‘delicate’
than its lock-based alternative.  So it is much harder to understand, and excruciatingly harder to debug.  Secondly, the
looping behavior of this code makes its execution time uncontrollable in situations where “tampering” might be
detected again and again.  This style of coding might be unacceptable in a hard real-time system.

   In some applications, it may be possible to limit the number of iterations through the loop, in order to limit the worst-
case execution time.  This may require justification such as a probabilistic calculation to show that the iteration limit
will be hit so rarely as to be negligible.
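   One possible shape for such a bounded loop is sketched below; the limit and the fallback policy are hypothetical
and application-specific:

    #define MAX_TRIES 10   /* hypothetical bound, to be justified
                              per-application                      */

    int bounded_increment(int *counter)
    {
        int tries;
        for (tries = 0; tries < MAX_TRIES; tries++) {
            int expected  = *counter;
            int new_value = expected + 1;
            if (CAS(counter, expected, new_value))
                return 1;       /* succeeded within the bound      */
        }
        return 0;   /* bound exceeded: the caller must fall back,
                       for example to a conventional lock or to an
                       error path                                  */
    }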


EXAMPLE 2: LOCK-FREE LINKED LIST

   Linked lists are often used in embedded software to organize large quantities of sensor data or other data
streams.  Figure 5 illustrates a linked list, with each list member shown as a rectangle containing both a data field
and a pointer to the next member of the list.  [The data field in a list member could be used to store a pointer to a large
buffer, if there is a large quantity of data associated with it.]
FIGURE 5: A Linked List.
Inserting a new member into a linked list is straightforward with a Lock-Free approach.  First, the proper position in
the list for the new member must be located.  Then the content of the new member must be prepared.  And finally, the
new member is inserted into the list using a single CAS operation.  This is shown in Figure 6.
FIGURE 6: Linked List -- Lock-Free Insert Operation, using CAS.
In this situation, CAS operates on the pointer-to-next-member field of an existing list member to verify that there has
been no “tampering” with it, before CAS changes it to point to the new member.  Here, “tampering” really means that
while the new member was being prepared for insertion into the list, another member had already been inserted at
that position in the list.  Hence all preparations for inserting the new member must be started again ‘from scratch’ in
that situation.
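   Since Figure 6 itself is not reproduced here, the Insert can be sketched as follows, assuming an atomic
Compare-And-Swap on pointer fields (called ‘CASptr’ below; all names are illustrative):

    struct member {
        void          *data;    /* or a pointer to a large buffer  */
        struct member *next;
    };

    /* Assumed atomic compare-and-swap on a pointer field; returns
       nonzero on success.  Provided by hardware or the OS.        */
    int CASptr(struct member **var, struct member *expected,
               struct member *new_value);

    void insert_after(struct member *prev, struct member *new_member)
    {
        struct member *succ;
        do {
            succ = prev->next;        /* the position located      */
            new_member->next = succ;  /* new member prepared       */
            /* publish with a single CAS; it fails -- and we start
               over 'from scratch' -- if another member was
               inserted at this position while we were preparing   */
        } while (!CASptr(&prev->next, succ, new_member));
    }

Note that this sketch quietly assumes that ‘prev’ itself is not being deleted in parallel – that assumption is exactly
what breaks down next.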

   At first glance, it may seem that Deleting an existing member from a list ought to be just as easy: “Just ‘CAS’ the
pointer field of the previous list member to a pointer to the following list member.”  It sounds like it ought to work.  But
there is a “race condition” here, as illustrated in Figure 7.
FIGURE 7: Linked List -- Naïve Lock-Free Delete Operation might 'collide'
with an Insert Operation involving the deleted member.