Step failed but duration continues to increase - Breakdown recovery #3873

rpallares · 6 years ago

Hello,

Sometimes a step fails but its duration continues to increase. The step then never ends, or ends only after a very long time (12 hours, from memory).
After some investigation, we noticed that the issue happens when the parent node is disconnected from the system (machine crash or similar).

The child node logs display this error:

2017-11-30 14:59:56,187 [pool-1-thread-1] WARN  com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://parent:8811/service/node' failed: Connection refused: connect, retrying after 5 seconds...
2017-11-30 15:00:02,218 [pool-1-thread-1] WARN  com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://parent:8811/service/node' failed: Connection refused: connect, retrying after 5 seconds...
2017-11-30 15:00:08,218 [pool-1-thread-1] ERROR com.pmease.quickbuild.grid.GridJob - Error notifying task node of job finishing (job class: de92494a-ce73-4d67-80ad-d161e3cd7f0d, job id: com.pmease.quickbuild.stepsupport.StepExecutionJob, task node: parent:8811)
    com.caucho.hessian.client.HessianRuntimeException: com.caucho.hessian.client.HessianRuntimeException: Error connecting 'http://parent:8811/service/node'
        at com.caucho.hessian.client.HessianProxy.sendRequest(HessianProxy.java:285)
        at com.caucho.hessian.client.HessianProxy.invoke(HessianProxy.java:171)
        at com.sun.proxy.$Proxy48.gridJobFinished(Unknown Source)
        at com.pmease.quickbuild.grid.GridJob.run(GridJob.java:128)
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)

Once we restart the parent node, the system doesn't retry notifying the parent, and the step stays frozen.

Machine crashes are hard to prevent, but the system should be more robust.
How can we be notified when a node crashes? I saw the System Alerts feature, but I don't see how to use it for my case: http://wiki.pmease.com/display/QB70/Configuring+System+Alerts

At the very least, administrators should be notified when a machine goes down.

Moreover, the system should be able to propagate a failure correctly through the whole step tree, even if a node is not responding, maybe by notifying its parents recursively until the server is reached.
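
To make that suggestion concrete, here is a minimal Python sketch of the idea. It is not QuickBuild code: `Node`, `notify_upward`, and the `send` callback are hypothetical names, and `send` stands in for whatever remote call the grid actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """Hypothetical node in the step tree; `parent` is None for the server."""
    name: str
    parent: Optional["Node"] = None


def notify_upward(start: Node, send: Callable[[Node], bool]) -> Optional[Node]:
    """Try each ancestor of `start` in turn, nearest first.

    `send` is a placeholder for the real remote call; it returns True when
    the ancestor acknowledged the notification. Returns the ancestor that
    accepted it, or None if nobody up to the root could be reached.
    """
    current = start.parent
    while current is not None:
        try:
            if send(current):
                return current
        except ConnectionError:
            pass  # treat an unreachable ancestor like a refusal
        current = current.parent
    return None
```

With a chain child -> parent -> server, a crashed parent would be skipped and the notification would land on the server instead of being lost.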

We could even choose to wait longer and be able to stop the specific step, or restart the crashed node and continue the stack.

How can the system address these issues?

Regards

robinshen ADMIN · 6 years ago

Please check the active nodes view on the grid page and make sure the offline alert check is enabled for all your build agents. Then switch to the "alert subscription" tab shown in your screenshot to add the users who should receive the notification.

If a node dies, QB should be able to detect the failure and terminate the build. Please upgrade to the latest version, 7.0.29, as we recently fixed some bugs related to this.

rpallares · 6 years ago

Hi,

Thank you for the information.
On the page https://build.pmease.com/dashboard;jsessionid=1c0jr9bvi1bcozujfj0cl72lh
I don't see many issues that would improve breakdown recovery between version 7.0.24 and the latest 7.0.29.
Are there more fixed issues and activity than what is displayed on this dashboard?

Regards

robinshen ADMIN · 6 years ago

The most recent fix for this issue is included in 7.0.22, so your 7.0.24 version already contains it. So in your case, if a step runs on an agent and the agent dies while the step is running, the build never ends? I tried to reproduce this on my side, and QB is able to detect the failure and stop the build in about 1 minute.

rpallares · 6 years ago

Hi,

The problem was:

The main agent dispatches a parallel build to some child machines
The main agent dies
A child wants to inform the main agent that it has finished its job
The job in the step ends, but if I refresh the browser, the duration continues to increase
Restarting the main agent does not fix the problem

robinshen ADMIN · 6 years ago

QB tests all agents involved in a build every 30 seconds and stops the build if any agent has died. I tested with the steps below and it works:

Create a test configuration, and run the master step on agent1.
Inside the master step, add a simple test step that sleeps for some time on agent2.
Run the build, and kill agent1 while the build is running.
QB marks the whole build as failed after 30 seconds.

Can you please test whether this case works on your side?
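
For readers following along, one round of the periodic check described above could be sketched like this in Python. This is only an illustration, not QB's actual implementation: `dead_agents`, `check_build`, `is_alive`, and `stop_build` are hypothetical names, and the real probe and scheduling are not shown in this thread.

```python
from typing import Callable, Iterable, List


def dead_agents(agents: Iterable[str], is_alive: Callable[[str], bool]) -> List[str]:
    """Return the agents that fail the liveness probe, in input order."""
    return [a for a in agents if not is_alive(a)]


def check_build(agents: Iterable[str],
                is_alive: Callable[[str], bool],
                stop_build: Callable[[List[str]], None]) -> bool:
    """Run one round of the check; stop the build if any agent is dead.

    Returns True if the build was stopped. A scheduler would call this
    every 30 seconds while the build is running.
    """
    dead = dead_agents(agents, is_alive)
    if dead:
        stop_build(dead)
        return True
    return False
```

In the test case above, killing agent1 would make the next round report `["agent1"]` and stop the whole build, even though agent2 is still healthy.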

kururin · 6 years ago

Hello Robin
I have encountered the same problem as Rafael several times (we are working on the same QB instance), but it's quite difficult to find out how it happens.
The build has failed (red) and all steps are finished (red as well), but the duration of one of them keeps increasing.
I have seen this behavior since QB 7.0 but never found a way to reproduce it. The last time it happened was with QB 7.0.24, as Rafael said above.
We'll run more tests on our side and try to find a sequence of operations that leads to this issue.

Thank you
Mathieu

robinshen ADMIN · 6 years ago

I will add some additional code to make sure the step duration is updated once the build is stopped. The fix will be available in the next patch release.
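
As a rough illustration of the intended behavior (a sketch under assumptions, not the actual patch), the displayed duration could be capped at the moment the build stopped whenever the step never recorded its own finish time. The function name and parameters below are hypothetical; times are seconds on any common clock.

```python
from typing import Optional


def step_duration(start: float, finish: Optional[float],
                  build_stopped_at: Optional[float], now: float) -> float:
    """Seconds a step should display as its duration.

    Hypothetical rule: use the step's own finish time if it has one;
    otherwise freeze at the moment the build stopped; otherwise keep
    counting up to `now` because the step is still running.
    """
    if finish is not None:
        end = finish
    elif build_stopped_at is not None:
        end = build_stopped_at
    else:
        end = now
    return max(0.0, end - start)
```

Under this rule, a step that lost its finish notification stops growing as soon as the build is marked stopped, instead of tracking the wall clock on every page refresh.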

kururin · 6 years ago

Thanks a lot Robin.

rpallares · 6 years ago

Thank you for that fix.

rpallares · 6 years ago

Hi @robinshen,
We still have this kind of problem in version 7.0.30.

In my example this time, the behavior is quite different: the step failed with this error message: Build is already stopped.
But the step has the Failed status, and its duration continues to increase. Here is the build node log from when we received the failure notification:

2017-12-15 08:51:46,232 [pool-1-thread-52] INFO  com.pmease.quickbuild.CheckConditionJob - Taking repository snapshots...
2017-12-15 09:14:26,660 [qtp946185866-1023] WARN  com.pmease.quickbuild.grid.NodeServiceImpl - testGridJob is cancelling job '58e202f1-0769-4249-b971-022cedf3d541'...
2017-12-15 09:14:26,661 [pool-1-thread-42] WARN  com.pmease.quickbuild.grid.NodeServiceImpl - testGridJob is cancelling job '09a007be-9ee8-4f75-a2e0-963b7587967f'...
2017-12-15 09:14:26,662 [pool-1-thread-42] WARN  com.pmease.quickbuild.grid.GridTaskFuture - Job still exists on job node and cancel command is issued (job id: 09a007be-9ee8-4f75-a2e0-963b7587967f, build id: 217174, job node: NODE:8811)...
2017-12-15 09:14:26,662 [pool-1-thread-44] WARN  com.pmease.quickbuild.grid.NodeServiceImpl - testGridJob is cancelling job '77e71cd7-33bf-43a2-b3ec-1cdf8bfd06c0'...
2017-12-15 09:14:26,663 [pool-1-thread-44] WARN  com.pmease.quickbuild.grid.GridTaskFuture - Job still exists on job node and cancel command is issued (job id: 77e71cd7-33bf-43a2-b3ec-1cdf8bfd06c0, build id: 217174, job node: NODE:8811)...
2017-12-15 09:14:44,224 [pool-1-thread-53] INFO  com.pmease.quickbuild.CheckConditionJob - Taking repository snapshots...
2017-12-15 09:28:18,420 [pool-1-thread-52] WARN  com.pmease.quickbuild.grid.GridJob - Job entry no longer exists at task node 'MASTER_SERVER:8810', will cancel running job...
2017-12-15 09:28:18,420 [pool-1-thread-53] WARN  com.pmease.quickbuild.grid.NodeServiceImpl - testGridJob is cancelling job 'f008e662-f429-4cb3-a86b-6e7bb9ada106'...
2017-12-15 09:28:18,420 [pool-1-thread-53] WARN  com.pmease.quickbuild.grid.GridTaskFuture - Job still exists on job node and cancel command is issued (job id: f008e662-f429-4cb3-a86b-6e7bb9ada106, build id: 217214, job node: NODE:8811)...
2017-12-15 09:28:18,421 [pool-1-thread-54] WARN  com.pmease.quickbuild.grid.NodeServiceImpl - testGridJob is cancelling job '1cdb9310-fa66-4d1c-a8ef-99d9be7a17e9'...
2017-12-15 09:28:18,421 [pool-1-thread-54] WARN  com.pmease.quickbuild.grid.GridTaskFuture - Job still exists on job node and cancel command is issued (job id: 1cdb9310-fa66-4d1c-a8ef-99d9be7a17e9, build id: 217214, job node: NODE:8811)...
2017-12-15 09:28:21,828 [pool-1-thread-53] ERROR com.pmease.quickbuild.grid.GridJob - Error notifying task node of job finishing (job class: fb33b0f0-cc0a-44ca-9ecd-2d097a5bd0bb, job id: com.pmease.quickbuild.stepsupport.StepExecutionJob, task node: MASTER_SERVER:8810)
    com.pmease.quickbuild.RemotingException: Unable to find task future of job 'fb33b0f0-cc0a-44ca-9ecd-2d097a5bd0bb' at node 'MASTER_SERVER:8810'
        at com.pmease.quickbuild.grid.NodeServiceImpl.gridJobFinished(NodeServiceImpl.java:144)
        at sun.reflect.GeneratedMethodAccessor308.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at com.caucho.hessian.server.HessianSkeleton.invoke(HessianSkeleton.java:302)
        at com.caucho.hessian.server.HessianSkeleton.invoke(HessianSkeleton.java:198)
        at com.caucho.hessian.server.HessianServlet.invoke(HessianServlet.java:399)
        at com.caucho.hessian.server.HessianServlet.service(HessianServlet.java:379)
        at com.pmease.quickbuild.grid.GridServlet.service(GridServlet.java:36)
        at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1669)
        at com.pmease.quickbuild.Quickbuild$DisableTraceFilter.doFilter(Quickbuild.java:1076)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:221)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
        at org.eclipse.jetty.server.Server.handle(Server.java:499)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:258)
        at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
        at java.lang.Thread.run(Unknown Source)
2017-12-15 09:29:35,769 [pool-1-thread-58] INFO  com.pmease.quickbuild.CheckConditionJob - Taking repository snapshots...

To be precise, we didn't cancel that job, and we don't understand what happened.

Regards

robinshen ADMIN · 6 years ago

What is your main concern here? Is it the unexpected build failure, or the step duration that keeps increasing? If the former, please check the log of the agent running the failed step to see if there are any errors. Typically it is caused by your build agent being offline. For the latter, please send me a screenshot and also tell me what the step duration is after you refresh the build page.

rpallares · 6 years ago

Hi,

My main concern is really the step duration that keeps increasing.
In my case, I have a failed step without any log. The log I shared comes from the agent log.

The error seems to point to a connection problem with MASTER_SERVER, which hosts the QuickBuild server. But we didn't observe any connection problem with MASTER_SERVER. Moreover, we didn't observe any connection problem with NODE.

So whenever connection problems occur on agent nodes, the step systematically fails and the duration keeps increasing.

robinshen ADMIN · 6 years ago

Can you please also show me the master server log from when this is happening? Also, please send me [robin AT pmease DOT com] a screenshot of the build step status.

rpallares · 6 years ago

Hi Robin, I found something strange in the server logs.
The times are not exactly the same, but this looks like the right exception.

2017-12-15 09:30:56,446 [qtp667076143-58] WARN  org.eclipse.jetty.servlet.ServletHandler - /rest/configurations
    java.lang.RuntimeException: org.eclipse.jetty.io.EofException
    	at com.pmease.quickbuild.bootstrap.BootstrapUtils.wrapAsUnchecked(BootstrapUtils.java:56)
    	at com.pmease.quickbuild.util.ExceptionUtils.wrapAsUnchecked(ExceptionUtils.java:82)
    	at com.pmease.quickbuild.rest.providers.XmlProvider.writeTo(XmlProvider.java:92)
    	at com.sun.jersey.spi.container.ContainerResponse.write(ContainerResponse.java:289)
    	at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1029)
    	at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:941)
    	at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:932)
    	at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:384)
    	at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:451)
    	at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:632)
    	at com.pmease.quickbuild.rest.RestServlet.service(RestServlet.java:48)
    	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
    	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1669)
    	at com.pmease.quickbuild.Quickbuild$DisableTraceFilter.doFilter(Quickbuild.java:1076)
    	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
    	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
    	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:221)
    	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
    	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
    	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
    	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    	at org.eclipse.jetty.server.Server.handle(Server.java:499)
    	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
    	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:258)
    	at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
    	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
    	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
    	at java.lang.Thread.run(Unknown Source)
    Caused by: org.eclipse.jetty.io.EofException
    	at org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:192)
    	at org.eclipse.jetty.io.WriteFlusher.flush(WriteFlusher.java:408)
    	at org.eclipse.jetty.io.WriteFlusher.completeWrite(WriteFlusher.java:364)
    	at org.eclipse.jetty.io.SelectChannelEndPoint.onSelected(SelectChannelEndPoint.java:111)
    	at org.eclipse.jetty.io.SelectorManager$ManagedSelector.processKey(SelectorManager.java:641)
    	at org.eclipse.jetty.io.SelectorManager$ManagedSelector.select(SelectorManager.java:612)
    	at org.eclipse.jetty.io.SelectorManager$ManagedSelector.run(SelectorManager.java:550)
    	at org.eclipse.jetty.util.thread.NonBlockingThread.run(NonBlockingThread.java:52)
    	... 3 more
    Caused by: java.io.IOException: An existing connection was forcibly closed by the remote host
    	at sun.nio.ch.SocketDispatcher.writev0(Native Method)
    	at sun.nio.ch.SocketDispatcher.writev(Unknown Source)
    	at sun.nio.ch.IOUtil.write(Unknown Source)
    	at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
    	at org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:172)
    	... 10 more
2017-12-15 09:30:56,452 [qtp667076143-58] WARN  org.eclipse.jetty.server.HttpChannel - /rest/configurations?recursive=true
    java.lang.RuntimeException: org.eclipse.jetty.io.EofException
    	at com.pmease.quickbuild.bootstrap.BootstrapUtils.wrapAsUnchecked(BootstrapUtils.java:56)
    	at com.pmease.quickbuild.util.ExceptionUtils.wrapAsUnchecked(ExceptionUtils.java:82)
    	at com.pmease.quickbuild.rest.providers.XmlProvider.writeTo(XmlProvider.java:92)
    	at com.sun.jersey.spi.container.ContainerResponse.write(ContainerResponse.java:289)
    	at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1029)
    	at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:941)
    	at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:932)
    	at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:384)
    	at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:451)
    	at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:632)
    	at com.pmease.quickbuild.rest.RestServlet.service(RestServlet.java:48)
    	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
    	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1669)
    	at com.pmease.quickbuild.Quickbuild$DisableTraceFilter.doFilter(Quickbuild.java:1076)
    	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
    	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
    	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:221)
    	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
    	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
    	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
    	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    	at org.eclipse.jetty.server.Server.handle(Server.java:499)
    	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
    	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:258)
    	at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
    	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
    	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
    	at java.lang.Thread.run(Unknown Source)
    Caused by: org.eclipse.jetty.io.EofException
    	at org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:192)
    	at org.eclipse.jetty.io.WriteFlusher.flush(WriteFlusher.java:408)
    	at org.eclipse.jetty.io.WriteFlusher.completeWrite(WriteFlusher.java:364)
    	at org.eclipse.jetty.io.SelectChannelEndPoint.onSelected(SelectChannelEndPoint.java:111)
    	at org.eclipse.jetty.io.SelectorManager$ManagedSelector.processKey(SelectorManager.java:641)
    	at org.eclipse.jetty.io.SelectorManager$ManagedSelector.select(SelectorManager.java:612)
    	at org.eclipse.jetty.io.SelectorManager$ManagedSelector.run(SelectorManager.java:550)
    	at org.eclipse.jetty.util.thread.NonBlockingThread.run(NonBlockingThread.java:52)
    	... 3 more
    Caused by: java.io.IOException: An existing connection was forcibly closed by the remote host
    	at sun.nio.ch.SocketDispatcher.writev0(Native Method)
    	at sun.nio.ch.SocketDispatcher.writev(Unknown Source)
    	at sun.nio.ch.IOUtil.write(Unknown Source)
    	at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
    	at org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:172)
    	... 10 more
2017-12-15 09:30:56,452 [qtp667076143-58] WARN  org.eclipse.jetty.server.HttpChannel - Could not send response error 500: java.lang.RuntimeException: org.eclipse.jetty.io.EofException

kururin · 6 years ago

I did this test today:

starting a build on a specific agent
unplugging the network cable from this agent for a few seconds, then plugging it back in

The step failed... but its duration continues to grow.
The step log is empty.
The server log is:

14:36:22,605 ERROR - Build is failed.
    java.lang.RuntimeException: Error executing step execution job. 
        at com.pmease.quickbuild.stepsupport.StepExecutionTask.reduce(StepExecutionTask.java:29) 
        at com.pmease.quickbuild.stepsupport.StepExecutionTask.reduce(StepExecutionTask.java:19) 
        at com.pmease.quickbuild.grid.GridTaskFuture.get(GridTaskFuture.java:123) 
        at com.pmease.quickbuild.DefaultBuildEngine.run(DefaultBuildEngine.java:598) 
        at com.pmease.quickbuild.DefaultBuildEngine.process(DefaultBuildEngine.java:461) 
        at com.pmease.quickbuild.DefaultBuildEngine.access$000(DefaultBuildEngine.java:141) 
        at com.pmease.quickbuild.DefaultBuildEngine$2.run(DefaultBuildEngine.java:1220) 
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) 
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) 
        at java.lang.Thread.run(Unknown Source) 
    Caused by: com.pmease.quickbuild.QuickbuildException: Unable to find job 'fbe01c3f-7f59-4f35-a272-72144f5ac9a3' on node 'MTP-TGT-BENCH03:8811'. 
        at com.pmease.quickbuild.grid.GridTaskFuture.testJobs(GridTaskFuture.java:55) 
        at com.pmease.quickbuild.grid.GridTaskFuture.get(GridTaskFuture.java:105)

robinshen ADMIN · 6 years ago

It seems that some of your steps are not interruptible. I ran the test below:

Create a test configuration with the master step running on the QB server.
Add a test step inside the master step that runs the command below on a build agent:
sleep 300
Now run the test configuration.
After 5 seconds, unplug the cable from the build agent for a few seconds, then plug it in again.
Since QB tests network connectivity every 30 seconds, if a test happens while the cable is unplugged, the step is cancelled, the sleep step is stopped, and the build fails. If no test happens while the cable is unplugged, the build is not affected and eventually succeeds after 300 seconds.

Can you please run my test case on your side to see what happens?
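
The timing argument above can be checked with a small Python sketch. It assumes the connectivity tests fire at fixed offsets from the start of the build, which is a simplification made for illustration; `probe_hits_outage` and its parameters are hypothetical names, not part of QB.

```python
import math


def probe_hits_outage(first_probe: float, interval: float,
                      outage_start: float, outage_end: float) -> bool:
    """True if any periodic probe falls inside [outage_start, outage_end).

    Probes fire at first_probe, first_probe + interval, and so on, in
    seconds since the build started. All timings are illustrative.
    """
    if outage_end <= outage_start:
        return False
    # Index of the first probe at or after the start of the outage.
    k = max(0, math.ceil((outage_start - first_probe) / interval))
    return first_probe + k * interval < outage_end
```

For example, with tests at 30, 60, 90, ... seconds, a cable unplugged from second 5 to second 15 is never seen by a test, while one unplugged from second 25 to second 35 is caught by the test at second 30 and the step is cancelled.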