Computer: Programming
Safe Thread Suspension   (+1, -3)  [vote for, against]
A sneaky trick, of course!

One of the things that modern programming languages allow is "multi-threaded" applications. I probably need to explain this for any non-computer-nerd who is curious about the title of this Idea.

First, a modern computer Operating System (OS) is usually "multi-tasking", meaning that it can run multiple tasks or applications simultaneously --or at least SEEM to run them simultaneously. The trick is, the OS rapidly switches between each application and runs a little bit of it, before switching to the next task. Computers are so fast, compared to human perception, that this trick worked nicely even on 1980s-era PCs that had a clock speed of 2 Megahertz (clock speeds are 1000 times faster than that, today). The result is that every task running on the computer appears to be smoothly and seamlessly chugging along, at normal speed. (Too many of them, of course, does lead to notice-able slowdowns.)

Well, suppose a single application was divided into sections that the Operating System could be told to run "simultaneously"? In this case each section is, in computer lingo, called a "thread"; the OS switches between threads of a single application in exactly the same way that it switches between different tasks.

It can be tricky to write a multi-threaded application. The different threads often need to communicate with each other, and one way is to save some data in one thread, and to read it by another thread. Well, "deadlocks" can happen when both threads try do do that at the same time. Various ways of dealing with these issues have been devised, having such names as "locks", "semaphores", "critical sections", "signals", and "mutexes".

One other thing that a Master Thread might want to do, occasionally, is to temporarily "suspend" one or more of the other threads of an application. It might want to do this if it is about to do some serious data-crunching that cannot be spread across multiple threads. The Master Thread would get more "processor time" (processing time devoted to each task by the Operating System), if the other threads weren't running. Well, technically, the Master Thread cannot suspend the other threads by itself, it has to send a request to the Operating System, which is responsible for switching between the applications and threads that it is running.

Now we come to the Problem that this Idea exists to solve. Just the other day I encountered a note, in a programming-language function-library, that reads: "About Suspend and Resume. POSIX does not support suspending/resuming a thread. Suspending a thread is considered dangerous since it is not guaranteed where the thread would be suspend. It might be holding a lock, mutex or it might be inside a critical section."

POSIX is an international standard, a set of behaviors that an Operating System is expected to follow, for interfacing with other software. While created for Unix, the POSIX standard is largely followed by most other Operating Systems, including, believe it or not, Microsoft Windows. (And just because Windows allows thread-suspending/resuming in spite of POSIX, that doesn't mean other parts of Windows are not POSIX-compliant.)

One peculiarity about multi-threaded programming I need to explain clearly. A good way to think of each thread of an application is to pretend it has a complete copy of all the code, to itself. This is not the truth (only one copy of the program-code exists in the computer's memory, usually), but it is handy to avoid certain mistakes. For example, suppose Thread A wants to Suspend Thread B. Obviously it shouldn't call the Suspend function in its own copy of the application! It should call the Suspend function in the copy that Thread B is using. (Technically, each Thread has a numerical identifier, so the solely-existing Suspend function merely operates on the identifier that it is told to Suspend, no matter what Thread is doing the telling.) Well, let's pretend we have copies of the code, and that Thread A does call B's copy of the Suspend function. Carefully note that it is actually Thread A that is running that code, and not Thread B. This is why, when B is suspended, POSIX says it is dangerous. We don't know what piece of code B was running, when it got Suspended.

The purpose of this Idea is to solve the Problem. We want to guarantee that a thread, when suspended, is always in a safe place. So:

1. Let each Thread have access to a Suspend function in the already-existing way, as explained above (there is actually only one copy of the code being accessed by multiple Threads).

2. Regarding the Resume function, it is accessed like Suspend, except of course a Suspended Thread cannot Resume itself. Some other Thread has to ask the OS to Resume it.

3. Let the Suspend function have two parts. The first part is simply some message-passing code, and nothing else. Any Thread that calls this first part will actually be sending a message *to*the*Thread* that we want to Suspend.

4. The Thread that called Suspend now goes on to do other things. Meanwhile, the Thread that is to be Suspended eventually receives the message, and as a result it calls the second part of the Suspend function. This is the part that asks the Operating System to Suspend the Thread. But this soon-to-be-Suspended Thread does not make that request just yet.

5. A "Resume location" is now required. This is the address of the first program-code-instruction in the computer's memory, that a specified Thread will start to do upon being Resumed, some time after being Suspended. I'll be a little more specific about this in a moment.

6. Inside the second part of the Suspend function, the Thread that is running this code obtains that Resume-location, and now it passes this on to the Operating System, along with its own Thread-identifying code-number, as it makes the request for Suspension. The OS will eventually save the Resume-location and use the identifier to stop switching to that task, as requested.

7. Here is the sneaky trick. Ordinarily, after a Thread passes information on to another process (like the OS), it continues on to do many more instructions. POSIX recognizes that even a Thread that requests its own Suspension might be in the middle of doing something else when that Suspension finally occurs. But all we really need is an endless loop! That is, inside the second part of the Suspend function, immediately after the System-call to request Suspension, the ONLY code that the Thread will now access is a simple loop, essentially "JUMP TO THIS JUMP INSTRUCTION". So the Thread will do that and nothing else, until Suspension finally occurs. NOTE: There is a particular kind of code, called "re-entrant code" that this loop needs to be. We need it to work, and to NOT deadlock, no matter how many Threads might have been told to Suspend themselves, and are all executing this endless loop until it happens.

8. The Resume-location specified earlier, logically, is the next instruction in the overall computer program, that follows the JUMP loop. This will essentially cause a newly-Resumed Thread to exit the second part of the Suspend function, and allow it to go on to do other things.
-- Vernon, Aug 17 2006

What does your busywait loop accomplish that couldn't be done with semaphones?

Give each suspendable thread a semaphone, initialized to the "GO" state. When it reaches a point after which it shouldn't be suspendable, it waits on the semaphore (if necessary); when it's done with that, it resets it, waits on it, and resets it.

To suspend the suspendable thread, the main thread should wait on the semaphore itself. It will sit if the suspendable task is in an unsuspendable section. Once the unsuspendable thread is out of that section, the main thread will get in. The unsuspendable thread will then be suspended at the semaphore wait following its suspended section. When the main thread wants to unsuspend the other thread, it will reset the semaphore allowing the other thread to resume.
-- supercat, Aug 17 2006


Ummm

Have not really thought about what you are trying to do --- but --- it is better to use a wait instead of a busy loop.

Before threads there was/is forking applications this type of application has suspend but not resume i.e. the kill signal.
-- madness, Aug 17 2006


[supercat], the loop seems simpler to me, than what you wrote. No synchronization needed. Note that the OS is responsible for doing/not-doing the stuff you specified, yet the natural tendency for a processor is always to execute the next instruction. So, here Thread A sends its message to Thread B, and goes on to do other things. Thread B, doing other things, eventually receives the message and then ends up executing the same next-instruction, until suspended. Nor is this particularly wasteful of clock cycles, when, if you want Thread B to be suspended, you don't want it processing anything beyond a suspension-point, anyway.

It occurs to me to wonder, if safely stopping a thread is as simple as you say, why the note I quoted existed. Perhaps it is older than I thought.... :)
-- Vernon, Aug 17 2006


//Nor is this particularly wasteful of clock cycles, when, if you want Thread B to be suspended, you don't want it processing anything beyond a suspension-point, anyway.//

Running even a single thread that's in a "while(1);" loop will greatly degrade system performance unless or until the operating system decide it really isn't doing anything useful. The thread itself might not need the CPU cycles it wastes, but that doesn't mean other applications or threads wouldn't like to have them.
-- supercat, Aug 17 2006


[supercat], a jump-to-itself loop is faster than a while(1) loop (which at the machine-language level includes a loopback jump, to the extra code to process the while. Also, I think you are confusing other high-level-language concepts, such as the waiting you described in a semaphore, with what goes on at the machine-language level. Do you know if the waiting you described is actually a do-no-instructions thing, or is it really a block of loop-til-some-condition-met code? I know that some processors have "wait-for-an-interrupt" instructions, and in this case the processor can literally halt until an external hardware signal arrives. Not a good thing for a multi-tasking OS to encounter, though!

Anyway, I think you are missing the point. Once we decide we want a Thread suspended, we no longer care what it is running; we just want it to stop, exactly because it frees up clock cycles for other Threads/applications. I am simply saying that while we are waiting for the OS to actually stop spending cycles on the chosen Thread (and usually this only takes a fraction of a second), that Thread should be doing a known-safe thing, like performing an endless loop. Computer cycles aren't so expensive in most applications, any more, that we need be concerned that some are being wasted for a fraction of a second.
-- Vernon, Aug 17 2006


I know all the words you're using, Vernon, but the order you put them in doesn't make sense.

Imagine you can snap your fingers and freeze people in place. OK? Now, let's say there's a guy holding a ball. You want to go play football with the other kids, but he's holding the ball you want to play with. You snap your fingers, the guy holding the football freezes up. Damn! Now you can't get at the ball. He's holding it tight to his chest, and you can't pry it away from him.

That's the problem with asynchronous suspend. The threads you're suspending may hold resources that other threads need. It doesn't have anything to do with idle loops, jumps, message passing, or any of that stuff - a thread doesn't need to actively "do" something to hold a resource; it just aquired it at some point and it hasn't released it yet.

So, if threads are to share resources with each other, they need to get a say about which point they give away control.

(For example, you should "BOO!", the guy gets scared and puts the ball down, *then* you snap your fingers, he freezes, you take the ball.)

There are many mechanisms available for building systems where threads *do* have a say in their suspension. What the POSIX quote is saying is that if you *don't* give them a say, very rarely, you'll end up with threads suspended while they're holding resources.
-- jutta, Aug 17 2006


This is very similar to a debate that was just going on at the Python-3000 mailing list, over exposing a function to raise an exception in another thread at any time. The problem is exactly the same, that it is simply unsafe. Of course, a thread could check at every safe point for a request to raise an exception or to suspend, but that would just be too much overhead when there are no such messages and it has to keep checking just in case.

In the end, this just isn't something that can be nicely built into thread APIs, and the only good solution is to build it into your application layer. Just make your threads cooperate. Maybe you can do something like have a "priority" mutex that threads can grab when they need to do work. Grab one for a little work, ten for a lot, or all one-hundred if you require everyone else to suspend.
-- ironfroggy, Aug 18 2006


[jutta], thanks for the viewpoint, and I see that I failed to say anything in the main text about message-handling between Threads. It seems to me that a natural way to do this is to have a queue, like the Windows message-handling system, but each Thread should in this model only check for a new message, from some/any other Thread, after it is done doing whatever-its-current-task-is. Then there should never be any resource of any sort held up like you describe.

[ironfroggy], what I have already described does require some stuff to be built into the Operating System, and not just into a multi-threaded application. The overhead problem you mention, of constantly checking a task queue when no tasks are in the queue, I think might be resolved by the OS having code to recognize when a Thread's task queue is empty/being-checked, and to behind-the-scenes automatically give that Thread fewer clock cycles until a message gets into the queue.
-- Vernon, Aug 18 2006


implementation of suspend kill -9 -1
implementation of resume "click start"

-- madness, Aug 18 2006


//[supercat], a jump-to-itself loop is faster than a while(1) loop (which at the machine-language level includes a loopback jump, to the extra code to process the while. Also, I think you are confusing other high-level-language concepts, such as the waiting you described in a semaphore, with what goes on at the machine-language level.//

Programming and math makes me sad. I am not a computer nerd, and tried to understand this idea, but I lost the will a quarter of the way through.
-- jellydoughnut, Aug 19 2006



random, halfbakery