Expected behaviour when a worker returns RCC_FATAL from its run method

dwpmd1 · September 14, 2023, 8:03pm

Hello

Apologies in advance if this is difficult to follow, my understanding of the inner workings of openCPI is far from perfect. To avoid confusion, I will use the word ‘application’ to refer to an openCPI Application object that has been created from the ACI and the word ‘program’ to refer to the C++ executable that is responsible for interacting with the ACI and managing one or more openCPI applications.

Currently, if an RCC worker’s run() method returns RCC_FATAL, the entire program halts. This differs from what happens when a worker’s start() method returns RCC_FATAL: in that case an exception is thrown that can be caught by surrounding the app.start() ACI call in a try-catch block.
I discovered this whilst attempting to create a test program that runs multiple test applications, some of which purposely cause a worker to fail. Currently the entire program halts when one of these tests runs, preventing any further tests from running.

I believe this issue is a result of the worker’s run method being called within a container that runs in its own thread. The exception is caught within the runContainer method and printed to the console, and then abort() is called. Surely the container should instead signal that the exception needs to be thrown in the main thread. The code responsible for this is in opencpi/runtime/container/src/Container.cc on lines 237-239.
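To illustrate what I mean by throwing in the main thread, here is a minimal standalone sketch using std::exception_ptr. It is not OpenCPI code: ContainerThread, wait() and runFailingWorker() are hypothetical names made up for this example, and I am assuming the failure reaches the thread as a std::exception.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical stand-in for a container that runs workers in its own thread.
// None of these names come from OpenCPI.
class ContainerThread {
public:
  // Start the background thread. Any exception escaping `body` is captured
  // instead of being printed and followed by abort().
  explicit ContainerThread(std::function<void()> body)
    : m_thread([this, body] {
        try {
          body();
        } catch (...) {
          m_error = std::current_exception();
        }
      }) {}

  ~ContainerThread() {
    if (m_thread.joinable())
      m_thread.join();
  }

  // Called from the main thread: wait for the background thread to finish,
  // then rethrow whatever it captured.
  void wait() {
    if (m_thread.joinable())
      m_thread.join();
    if (m_error)
      std::rethrow_exception(m_error);
  }

private:
  std::exception_ptr m_error; // written by the thread, read only after join()
  std::thread m_thread;       // declared last so m_error exists before the thread starts
};

// Runs a failing "worker" and returns the message seen by the calling thread.
std::string runFailingWorker() {
  ContainerThread container([] {
    throw std::runtime_error("worker returned RCC_FATAL");
  });
  try {
    container.wait();
  } catch (const std::exception &e) {
    return e.what(); // the caller decides what to do; nothing aborts
  }
  return "";
}
```

With something along these lines the calling program could wrap the wait in a try-catch, just as it can for app.start() today.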

There is also strange behaviour within the Worker class in how the return value of the run method is handled. If setError() is called before returning, the return value is never actually looked at and the behaviour is the same regardless of what is returned. See opencpi/runtime/rcc/src/RccWorker.cc line 621, which is where I believe the worker’s run method is called from. On line 633 checkError is called, and it throws an exception if setError() has been used, before the switch statement that examines the return value is ever reached.
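A simplified model of the ordering I am describing, to make the effect concrete. This is not the RccWorker.cc source: RunResult, ModelWorker, dispatch() and outcome() are hypothetical names, and the only thing taken from the real code is the order (error check first, return-value switch second).

```cpp
#include <stdexcept>
#include <string>

// Simplified model, not the OpenCPI source. All names are hypothetical.
enum class RunResult { Ok, Done, Fatal };

struct ModelWorker {
  std::string error; // filled in by setError(), empty when no error was set

  void setError(const std::string &msg) { error = msg; }

  // Throws if an error message has been recorded.
  void checkError() const {
    if (!error.empty())
      throw std::runtime_error("error set: " + error);
  }
};

// Mirrors the order described above: error check first, then the switch.
std::string dispatch(const ModelWorker &w, RunResult rc) {
  w.checkError(); // throws here if setError() was used, so rc is never examined
  switch (rc) {
    case RunResult::Ok:    return "ok";
    case RunResult::Done:  return "done";
    case RunResult::Fatal: return "fatal";
  }
  return "unknown";
}

// Describes what the caller observes for a given combination.
std::string outcome(bool callSetError, RunResult rc) {
  ModelWorker w;
  if (callSetError)
    w.setError("bad input");
  try {
    return dispatch(w, rc);
  } catch (const std::runtime_error &e) {
    return e.what();
  }
}
```

In this model outcome(true, RunResult::Ok) and outcome(true, RunResult::Fatal) give the same result, which matches what I see: once setError() has been called, the return value makes no difference.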

I experimented with the behaviour when setError() is not called and found that it differs depending on where I put breakpoints. See RccWorker.cc line 667. If I put a breakpoint here, the following exception is thrown in the main thread: Worker "workerName" is now unusable. If I do not have this breakpoint, then before the main thread detects that the worker is unusable, the exception is thrown on line 674, where it is caught from within the container and printed to the console, and abort() is called.

If anyone with a greater understanding of how this is all working / is supposed to work is able to suggest a solution that would be really appreciated.

Many thanks,
Dan

dwpmd1 · September 15, 2023, 1:02pm

I have created a temporary solution which most likely will break other things but it is working for what I need right now.

The changes to the first 2 files mean that, rather than setting an error message that causes an exception to be thrown, we instead copy that error message into a variable within the container.

The change in the third file means that when the container is detected to be in an unusable state, it will check the above variable and throw an exception containing the error message in the main thread.

The changes in the 4th and 5th files mean that when the container detects an exception in its internal thread, it will not abort the entire program and will instead disable itself, so that the next time the container is needed it can be recreated.
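For anyone who wants the general shape of the workaround without the OpenCPI specifics, here is a rough standalone sketch. It is a model only: ModelContainer, run(), checkUsable() and runAll() are hypothetical names, not the functions I actually changed, and the container thread is joined immediately to keep the example short.

```cpp
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <thread>
#include <vector>

// Hypothetical model of the workaround. None of these names are OpenCPI's.
class ModelContainer {
public:
  // Run `work` in the container's own thread. On failure, record the message
  // and disable the container rather than calling abort().
  void run(const std::function<void()> &work) {
    std::thread t([this, &work] {
      try {
        work();
      } catch (const std::exception &e) {
        m_lastError = e.what();
        m_enabled = false;
      }
    });
    t.join(); // join() makes the thread's writes visible to the caller
  }

  // Called from the main thread: throw the stored message if disabled.
  void checkUsable() const {
    if (!m_enabled)
      throw std::runtime_error(m_lastError);
  }

private:
  bool m_enabled = true;
  std::string m_lastError;
};

// Runs each test in turn, recreating the container after a failure so that
// later tests still run. Returns the number of tests that failed.
int runAll(const std::vector<std::function<void()>> &tests) {
  auto container = std::make_unique<ModelContainer>();
  int failures = 0;
  for (const auto &test : tests) {
    container->run(test);
    try {
      container->checkUsable();
    } catch (const std::runtime_error &) {
      ++failures;
      container = std::make_unique<ModelContainer>(); // fresh container for the next test
    }
  }
  return failures;
}
```

The point is that a failing test costs one container rather than the whole program, which is all I need for my test runner.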