Build Agent communication failure #2496

amn · 1 decade ago

We've seen cases where QuickBuild agents lose their connection to the master server, leaving the build hanging indefinitely in the UI. Here is an example.

2013-08-20 10:42:40,568 [WrapperStartStopAppMain] INFO com.pmease.quickbuild.Quickbuild - QuickBuild agent started.
2013-08-20 11:26:07,641 [pool-1-thread-18] INFO com.pmease.quickbuild.execution.ProcessTree - killAll (process: java.lang.UNIXProcess@59ce3eef, env: QB_COMMAND_EXECUTOR_SESSION=fd2a5790-944a-4462-97cf-f5b5cad77794
2013-08-20 11:26:13,120 [pool-1-thread-28] INFO com.pmease.quickbuild.execution.ProcessTree - killAll (process: java.lang.UNIXProcess@140bbe6b, env: QB_COMMAND_EXECUTOR_SESSION=1d34adce-df2f-46bc-96af-abc3e2e5bb80
2013-08-20 11:26:13,120 [pool-1-thread-28] INFO com.pmease.quickbuild.execution.ProcessTree - Recursively killing pid=10837
2013-08-20 11:26:13,121 [pool-1-thread-28] INFO com.pmease.quickbuild.execution.ProcessTree - Killing pid=10837
2013-08-20 11:26:13,123 [pool-1-thread-28] INFO com.pmease.quickbuild.execution.ProcessTree - Recursively killing pid=10837
2013-08-20 11:26:13,123 [pool-1-thread-28] INFO com.pmease.quickbuild.execution.ProcessTree - Killing pid=10837
2013-08-20 11:28:41,364 [pool-1-thread-22] INFO com.pmease.quickbuild.execution.ProcessTree - killAll (process: java.lang.UNIXProcess@69c43daa, env: QB_COMMAND_EXECUTOR_SESSION=98a56185-1933-4c1c-bf05-ce499a0b6766
2013-08-20 11:28:41,365 [pool-1-thread-22] INFO com.pmease.quickbuild.execution.ProcessTree - Recursively killing pid=10847
2013-08-20 11:28:41,365 [pool-1-thread-22] INFO com.pmease.quickbuild.execution.ProcessTree - Killing pid=10847
2013-08-20 11:29:14,068 [pool-1-thread-25] INFO com.pmease.quickbuild.execution.ProcessTree - killAll (process: java.lang.UNIXProcess@212f4655, env: QB_COMMAND_EXECUTOR_SESSION=dc691596-b9e4-4f66-bfb7-38b887692f21
2013-08-20 11:29:14,069 [pool-1-thread-25] INFO com.pmease.quickbuild.execution.ProcessTree - Recursively killing pid=10851
2013-08-20 11:29:14,069 [pool-1-thread-25] INFO com.pmease.quickbuild.execution.ProcessTree - Killing pid=10851
2013-08-27 21:35:09,564 [Thread-13] ERROR com.pmease.quickbuild.Quickbuild - Error connecting server.
 com.caucho.hessian.client.HessianConnectionException: HessianProxy cannot connect to 'http://server:8810/service/connect
 	at com.caucho.hessian.client.HessianURLConnection.sendRequest(HessianURLConnection.java:156)
 	at com.caucho.hessian.client.HessianProxy.sendRequest(HessianProxy.java:280)
 	at com.caucho.hessian.client.HessianProxy.invoke(HessianProxy.java:175)
 	at $Proxy20.connect(Unknown Source)
 	at com.pmease.quickbuild.grid.AgentConnectivityTask.run(AgentConnectivityTask.java:51)
 	at java.lang.Thread.run(Thread.java:722)
 Caused by: java.io.FileNotFoundException: http://server:8810/service/connect
 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 	at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
 	at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1674)
 	at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1672)
 	at java.security.AccessController.doPrivileged(Native Method)
 	at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1670)
 	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1243)
 	at com.caucho.hessian.client.HessianURLConnection.sendRequest(HessianURLConnection.java:139)
 	... 5 more
 Caused by: java.io.FileNotFoundException: http://server:8810/service/connect
 	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1623)
 	at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
 	at com.caucho.hessian.client.HessianURLConnection.sendRequest(HessianURLConnection.java:126)
 	... 5 more

There is a gap of 7 days before the connection error appears - is this an inactivity problem?
If this is caused by a temporary network problem, can the build agent be made to recover?
If the master server does lose the connection to the build agent, can it detect this (e.g. via a timeout) and fail the build automatically?

robinshen ADMIN · 1 decade ago

From the message, it seems that the QB server was being restarted. Are you able to cancel the dead build manually in this case? If so, the build timeout in the general configuration settings should also work.

amn · 1 decade ago

This relates to the other forum post requesting that cancelling a triggered build should also cancel the builds it triggers.

It also seems like a bug to me that, if communication with an agent is interrupted while builds are running, the agent no longer knows about the builds that were running on it once communication is restored. It would seem to me that the agent should be able to find those builds and either stop them or pick up where it left off. Stopping them probably makes more sense.

robinshen ADMIN · 1 decade ago

Some background on why it behaves like this:
One goal of the QB design is to reduce server load as much as possible, so that the server can handle many configurations and manage many agents. When a step runs on an agent, the QB server simply waits for the agent to report the step status instead of actively querying it. So if the communication link between server and agent breaks, the server has no way to know about the problem unless the build itself times out. On the other side, the QB agent is designed as a dumb agent that only knows how to execute tasks sent by the server, without persisting any build information. This makes it easier to run different builds and easier to upgrade or replace agents. So we chose to use the build timeout mechanism to handle abnormal server/agent communication errors.
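
For illustration only, the passive wait described above might be sketched like this in Python. This is not QuickBuild code; `StepWaiter`, `report`, `wait`, and the status strings are hypothetical names. The server never polls the agent, so a lost report can only surface as a timeout.

```python
import threading


class StepWaiter:
    """Hypothetical server-side handle for one step running on a remote agent."""

    def __init__(self):
        self._done = threading.Event()
        self.status = "RUNNING"

    def report(self, status):
        # Called when the agent pushes its final step status to the server.
        self.status = status
        self._done.set()

    def wait(self, timeout_seconds):
        # The server does not query the agent. It blocks until a report
        # arrives or the build timeout expires, whichever comes first.
        if not self._done.wait(timeout_seconds):
            self.status = "TIMED_OUT"
        return self.status
```

If the link breaks and `report` is never called, `wait` returns `"TIMED_OUT"` once the timeout passes, which is why a build without a timeout can hang indefinitely.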

amn · 1 decade ago

When the buildagent starts, it must register with the server to tell it that it is available for work (certainly the server seems to be aware of which nodes are active). Is it not then possible for the server to check for any jobs that it thinks are running on the agent that has just started, and to remove them?

Also, is there any time frame you are looking at to make QuickBuild 5.1.0 an official release?

robinshen ADMIN · 1 decade ago

[quote="amn"]When the buildagent starts, it must register with the server to tell it that it is available for work (certainly the server seems to be aware of which nodes are active). Is it not then possible for the server to check for any jobs that it thinks are running on the agent that has just started, and to remove them?

Also, is there any time frame you are looking at to make QuickBuild 5.1.0 an official release?[/quote]
Thanks for the idea. We thought about this previously, but it has a problem. Consider the following case:
1. There is an intermittent communication breakage between server and agent.
2. The server cancels the step; the agent is not aware of the cancellation and continues to run the step.
3. The agent and server communicate again.
4. The server thinks that there is no job running on the agent and sends it another job.
5. The agent may then run more than one job, even if the resource constraint does not allow this.
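
The five steps can be replayed in a small Python simulation. This is only a sketch of the reasoning, not how QuickBuild is implemented; the `Agent` and `Server` classes, the `link_up` flag, and the job names are all made up for the example.

```python
class Agent:
    """Hypothetical agent: tracks only what it is actually running."""

    def __init__(self):
        self.running = set()


class Server:
    """Hypothetical server with a limit of one job per agent."""

    def __init__(self, max_jobs_per_agent=1):
        self.max_jobs = max_jobs_per_agent
        self.assigned = set()  # the server's view of jobs on the agent

    def dispatch(self, agent, job):
        # Resource constraint is checked against the server's view only.
        if len(self.assigned) >= self.max_jobs:
            return False
        self.assigned.add(job)
        agent.running.add(job)
        return True

    def cancel(self, agent, job, link_up):
        self.assigned.discard(job)
        if link_up:
            # The cancellation reaches the agent only if the link works.
            agent.running.discard(job)


def simulate():
    server, agent = Server(), Agent()
    server.dispatch(agent, "job-1")
    server.cancel(agent, "job-1", link_up=False)  # steps 1 and 2
    server.dispatch(agent, "job-2")               # steps 3 and 4
    return len(agent.running)                     # step 5
```

Here `simulate()` returns 2: the agent is running two jobs although the server believes it has assigned only one.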

This can be worked around with additional communication between server and agent (for instance, the agent periodically checks with the server to see whether the step has been cancelled), but the QB server might be running thousands of steps, and we want to keep communication to a minimum so that it scales.
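
As a rough sketch of that workaround and its cost (Python, with hypothetical names; `run_step`, `is_cancelled`, and `check_every` are not part of any QuickBuild API), an agent could ask the server at a fixed interval whether its step was cancelled:

```python
def run_step(work_units, is_cancelled, check_every=10):
    """Run a step as `work_units` units of work, asking the server every
    `check_every` units whether the step has been cancelled.

    `is_cancelled` stands in for a remote call to the server.
    Returns (units_done, checks_made).
    """
    checks = 0
    for done in range(work_units):
        if done % check_every == 0:
            checks += 1
            if is_cancelled():
                return done, checks
        # ... one unit of build work would run here ...
    return work_units, checks
```

Every check is one extra round trip to the server, so the total number of checks grows with both the number of concurrent steps and how often each one polls. That is the overhead weighed against in the reply above.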

As for the formal release of 5.1.0, we plan to get it out before 30 Nov.

amn · 1 decade ago

So there is nothing that distinguishes a build agent starting up and contacting the master server for the first time from a build agent contacting the server after completing a step, following an intermittent communication outage?

robinshen ADMIN · 1 decade ago

Yes that is true.

amn · 1 decade ago

Could this be changed so that when an agent does start it can tell the server and stop all builds that were running (and now hanging) on it previously?
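
One hypothetical way to do this, sketched in Python: the agent generates a random boot ID each time its process starts and sends it with every connect. This is an assumption for illustration, not something QuickBuild is known to do; `Server`, `register`, `assign`, and the boot ID itself are invented names.

```python
import uuid


class Server:
    """Hypothetical server that tells agent restarts from reconnects."""

    def __init__(self):
        self.boot_ids = {}  # agent name -> last boot ID seen
        self.running = {}   # agent name -> build IDs believed to be running

    def assign(self, agent, build_id):
        self.running.setdefault(agent, set()).add(build_id)

    def register(self, agent, boot_id):
        """Handle a connect from an agent; return the builds to fail."""
        previous = self.boot_ids.get(agent)
        self.boot_ids[agent] = boot_id
        if previous is None or previous == boot_id:
            # First contact, or the same agent process reconnecting after
            # an outage: its builds may still be running, so keep them.
            return set()
        # A new boot ID means the agent process restarted, so anything the
        # server still lists for it cannot be running any more.
        return self.running.pop(agent, set())


def new_boot_id():
    # Generated once per agent process start (assumption of this sketch).
    return str(uuid.uuid4())
```

With this, a reconnect after a network blip keeps the same boot ID and leaves builds alone, while a restarted agent presents a new one and its hanging builds can be stopped.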

robinshen ADMIN · 1 decade ago

Please file an improvement request for this at track.pmease.com, and we will check what we can improve in this case.

amn · 1 decade ago

Created enhancement request: http://track.pmease.com/browse/QB-1818

probert · 1 decade ago

We had a similar problem this weekend. We were upgrading the firmware on our switches and rebooting them, which caused a short network outage. One of the agents reported that it could not resolve the DNS name of the QB server. I restarted the agent and that error went away. But I didn't notice that a build was "stuck": even though a build timeout was set, the job was not cancelled and, more importantly, the agent would not accept new jobs, even though QB saw it as an active node. I had to cancel the job in the QB web GUI, and the agent started accepting jobs after that.

robinshen ADMIN · 1 decade ago

This problem has been solved in QB 5.1.

amn · 1 decade ago

Is there a reason it wasn't shown in the change log and this request is still listed as open then?

robinshen ADMIN · 1 decade ago

Sorry, I forgot to close that issue. The enhancement itself is logged in this issue: http://track.pmease.com/browse/QB-1847