Hello,
Sometimes a step fails but its duration keeps increasing. The step then never ends, or ends only after a very long time (about 12 hours, from memory).
After some investigation, we noticed that the issue happens when the parent node is disconnected from the system (machine crash or similar).
The child node's log shows this error:
2017-11-30 14:59:56,187 [pool-1-thread-1] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://parent:8811/service/node' failed: Connection refused: connect, retrying after 5 seconds...
2017-11-30 15:00:02,218 [pool-1-thread-1] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://parent:8811/service/node' failed: Connection refused: connect, retrying after 5 seconds...
2017-11-30 15:00:08,218 [pool-1-thread-1] ERROR com.pmease.quickbuild.grid.GridJob - Error notifying task node of job finishing (job class: de92494a-ce73-4d67-80ad-d161e3cd7f0d, job id: com.pmease.quickbuild.stepsupport.StepExecutionJob, task node: parent:8811)
com.caucho.hessian.client.HessianRuntimeException: com.caucho.hessian.client.HessianRuntimeException: Error connecting 'http://parent:8811/service/node'
at com.caucho.hessian.client.HessianProxy.sendRequest(HessianProxy.java:285)
at com.caucho.hessian.client.HessianProxy.invoke(HessianProxy.java:171)
at com.sun.proxy.$Proxy48.gridJobFinished(Unknown Source)
at com.pmease.quickbuild.grid.GridJob.run(GridJob.java:128)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Once we restart the parent node, the system does not retry notifying the parent, and the step stays frozen.
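To make the request concrete, here is a minimal sketch of the kind of retry I would expect around the "job finished" notification. This is not QuickBuild code: `RetryingNotifier`, `notifyWithRetry`, the attempt count and the delay values are all hypothetical, and the call to the parent is simulated by a lambda that fails twice.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetryingNotifier {

    /**
     * Runs the notification until it succeeds or maxAttempts is reached,
     * doubling the wait between attempts up to maxDelayMs.
     * Returns the number of attempts used, or -1 if every attempt failed.
     * (Hypothetical helper, not part of QuickBuild or Hessian.)
     */
    public static int notifyWithRetry(Callable<Void> notification, int maxAttempts,
                                      long initialDelayMs, long maxDelayMs)
            throws InterruptedException {
        long delay = initialDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                notification.call();
                return attempt;
            } catch (InterruptedException e) {
                throw e; // let the caller stop the retry loop
            } catch (Exception e) {
                if (attempt == maxAttempts) {
                    break; // give up; the caller should report the failure elsewhere
                }
                Thread.sleep(delay);
                delay = Math.min(delay * 2, maxDelayMs);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated parent that refuses the first two connections.
        int used = notifyWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("Connection refused: connect");
            }
            return null;
        }, 5, 10, 100);
        System.out.println("attempts used: " + used);
    }
}
```

With something like this, a parent that comes back after a restart would still receive the notification, and a `-1` result could trigger an alert instead of leaving the step running forever.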
Machine crashes are hard to prevent, but the system should be more robust.
How can we be notified when a node crashes? I saw the System Alerts feature, but I don't see how to use it for my case: http://wiki.pmease.com/display/QB70/Configuring+System+Alerts
At the very least, administrators should be notified when a machine goes down.
Moreover, the system should be able to report a failure correctly through the whole step tree, even if a node is not responding, maybe by notifying its parents recursively until the server is reached.
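The idea of notifying the parents recursively could be sketched as follows. This is only an illustration of the behaviour I am asking for, under my own assumptions: `Node`, `SimpleNode` and `escalate` are hypothetical names, not QuickBuild types, and node reachability is simulated by a boolean flag.

```java
import java.util.Optional;

public class FailureEscalation {

    /** Hypothetical view of a grid node. */
    public interface Node {
        String name();
        Node parent();                   // null for the server (root of the tree)
        boolean deliver(String message); // false if the node cannot be reached
    }

    /** Minimal in-memory node used to simulate a crashed machine. */
    public static class SimpleNode implements Node {
        private final String name;
        private final Node parent;
        private final boolean reachable;

        public SimpleNode(String name, Node parent, boolean reachable) {
            this.name = name;
            this.parent = parent;
            this.reachable = reachable;
        }

        public String name() { return name; }
        public Node parent() { return parent; }
        public boolean deliver(String message) { return reachable; }
    }

    /**
     * Walks up the tree from the given node and returns the name of the first
     * ancestor that accepted the message, or empty if none could be reached.
     */
    public static Optional<String> escalate(Node start, String message) {
        for (Node n = start.parent(); n != null; n = n.parent()) {
            if (n.deliver(message)) {
                return Optional.of(n.name());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Node server = new SimpleNode("server", null, true);
        Node parent = new SimpleNode("parent", server, false); // crashed machine
        Node child = new SimpleNode("child", parent, true);
        System.out.println(escalate(child, "step failed").orElse("nobody reachable"));
    }
}
```

In this simulation the crashed parent is skipped and the server receives the failure, so the step could be marked as failed instead of hanging.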
We could even choose to wait longer, and be able to stop the specific step, or restart the node that crashed and let the remaining steps continue.
How can the system address these issues?
Regards