Build Agent communication failure #2496

amn ·
We've seen cases where QuickBuild agents lose their connection to the master server, leaving the build hanging indefinitely in the UI. Here is an example.

2013-08-20 10:42:40,568 [WrapperStartStopAppMain] INFO com.pmease.quickbuild.Quickbuild - QuickBuild agent started.
2013-08-20 11:26:07,641 [pool-1-thread-18] INFO com.pmease.quickbuild.execution.ProcessTree - killAll (process: java.lang.UNIXProcess@59ce3eef, env: QB_COMMAND_EXECUTOR_SESSION=fd2a5790-944a-4462-97cf-f5b5cad77794
2013-08-20 11:26:13,120 [pool-1-thread-28] INFO com.pmease.quickbuild.execution.ProcessTree - killAll (process: java.lang.UNIXProcess@140bbe6b, env: QB_COMMAND_EXECUTOR_SESSION=1d34adce-df2f-46bc-96af-abc3e2e5bb80
2013-08-20 11:26:13,120 [pool-1-thread-28] INFO com.pmease.quickbuild.execution.ProcessTree - Recursively killing pid=10837
2013-08-20 11:26:13,121 [pool-1-thread-28] INFO com.pmease.quickbuild.execution.ProcessTree - Killing pid=10837
2013-08-20 11:26:13,123 [pool-1-thread-28] INFO com.pmease.quickbuild.execution.ProcessTree - Recursively killing pid=10837
2013-08-20 11:26:13,123 [pool-1-thread-28] INFO com.pmease.quickbuild.execution.ProcessTree - Killing pid=10837
2013-08-20 11:28:41,364 [pool-1-thread-22] INFO com.pmease.quickbuild.execution.ProcessTree - killAll (process: java.lang.UNIXProcess@69c43daa, env: QB_COMMAND_EXECUTOR_SESSION=98a56185-1933-4c1c-bf05-ce499a0b6766
2013-08-20 11:28:41,365 [pool-1-thread-22] INFO com.pmease.quickbuild.execution.ProcessTree - Recursively killing pid=10847
2013-08-20 11:28:41,365 [pool-1-thread-22] INFO com.pmease.quickbuild.execution.ProcessTree - Killing pid=10847
2013-08-20 11:29:14,068 [pool-1-thread-25] INFO com.pmease.quickbuild.execution.ProcessTree - killAll (process: java.lang.UNIXProcess@212f4655, env: QB_COMMAND_EXECUTOR_SESSION=dc691596-b9e4-4f66-bfb7-38b887692f21
2013-08-20 11:29:14,069 [pool-1-thread-25] INFO com.pmease.quickbuild.execution.ProcessTree - Recursively killing pid=10851
2013-08-20 11:29:14,069 [pool-1-thread-25] INFO com.pmease.quickbuild.execution.ProcessTree - Killing pid=10851
2013-08-27 21:35:09,564 [Thread-13] ERROR com.pmease.quickbuild.Quickbuild - Error connecting server.
com.caucho.hessian.client.HessianConnectionException: HessianProxy cannot connect to 'http://server:8810/service/connect
at com.caucho.hessian.client.HessianURLConnection.sendRequest(HessianURLConnection.java:156)
at com.caucho.hessian.client.HessianProxy.sendRequest(HessianProxy.java:280)
at com.caucho.hessian.client.HessianProxy.invoke(HessianProxy.java:175)
at $Proxy20.connect(Unknown Source)
at com.pmease.quickbuild.grid.AgentConnectivityTask.run(AgentConnectivityTask.java:51)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.FileNotFoundException: http://server:8810/service/connect
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1674)
at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1672)
at java.security.AccessController.doPrivileged(Native Method)
at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1670)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1243)
at com.caucho.hessian.client.HessianURLConnection.sendRequest(HessianURLConnection.java:139)
... 5 more
Caused by: java.io.FileNotFoundException: http://server:8810/service/connect
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1623)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
at com.caucho.hessian.client.HessianURLConnection.sendRequest(HessianURLConnection.java:126)
... 5 more


There is a gap of 7 days before the connection error appears - is this an inactivity problem?
If this is caused by a temporary network problem, can the build agent be made to recover?
If the master server does lose the connection to the build agent, can it detect this (e.g. via a timeout) and fail the build automatically?
robinshen ADMIN ·
From the message, it seems the QB server was being restarted. Are you able to cancel the dead build manually in this case? If so, the timeout setting in the general configuration settings should also work.
amn ·
This relates to the other forum post proposing that cancelling a build should also cancel the builds it triggers.

It also seems like a bug to me that if communication with an agent is interrupted while builds are running, the agent no longer knows about those builds once the connection is restored. It seems to me that the agent should be able to find the builds that were previously running on it and either stop them or pick up where it left off. Stopping them probably makes more sense.
robinshen ADMIN ·
Some background on why it behaves like this:
One goal of the QB design is to reduce QB server load as much as possible, so that it can handle many configurations and manage many agents. When a step runs on an agent, the QB server simply sits there and waits for the agent to report the step status instead of actively querying it. So if the communication link between server and agent breaks, the server has no way to know about the problem unless the build itself times out. On the other side, the QB agent is designed as a dummy agent that only knows how to execute tasks for the server, without persisting any build information. This makes it easier to run different builds and to upgrade or replace agents. So we chose to use the build timeout mechanism to handle abnormal server/agent communication errors.
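As a rough illustration of the timeout mechanism described above, the server side can be pictured as a passive wait that only gives up when the build timeout expires. This is a minimal sketch, not QuickBuild's actual code: the function name `wait_for_step` and the use of `threading.Event` to stand in for "the agent reported back" are assumptions made for the example.

```python
import threading


def wait_for_step(report_received: threading.Event, build_timeout_s: float) -> str:
    """Passively wait for the agent's report; never query the agent.

    `report_received` is a hypothetical flag that the communication layer
    would set when the agent reports the step status.
    """
    # Event.wait returns True if the flag was set, False if the timeout expired.
    if report_received.wait(timeout=build_timeout_s):
        return "finished"
    # No report arrived: the only signal the server has is the timeout.
    return "timed_out"
```

If the link breaks and the report never arrives, the call returns "timed_out" once `build_timeout_s` has elapsed. With no timeout configured, the wait would never end, which matches the hanging build described in the first post.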
amn ·
When the build agent starts, it must register with the server to tell it that it is available for work (certainly the server seems to be aware of which nodes are active). Is it not then possible for the server to check for any jobs that it thinks are running on the agent that has just started, and to remove them?

Also, is there any time frame you are looking at to make QuickBuild 5.1.0 an official release?
robinshen ADMIN ·
Thanks for the idea. We thought about this previously, but it has a problem. Consider the following case:
1. there is an intermittent communication breakage between server and agent
2. server cancels the step, agent is not aware of the cancellation and continues to run the step
3. agent and server communicate again
4. server thinks that there is no job running on agent and sends another job to agent
5. then it is possible that agent runs more than one job even if the resource constraint does not allow this.
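The five steps above can be replayed in a small simulation. This is only a sketch under stated assumptions: the classes `RaceAgent` and `RaceServer` and their methods are hypothetical and do not correspond to QuickBuild's internals. The server tracks what it believes is running, and the agent simply runs whatever it is sent.

```python
class RaceAgent:
    """Hypothetical agent: runs what it is sent, keeps no server-side state."""

    def __init__(self, max_jobs=1):
        self.max_jobs = max_jobs      # resource constraint
        self.running = set()

    def start(self, job):
        self.running.add(job)


class RaceServer:
    """Hypothetical server: enforces the constraint from its own bookkeeping."""

    def __init__(self, agent):
        self.agent = agent
        self.believed_running = set()

    def dispatch(self, job):
        if len(self.believed_running) >= self.agent.max_jobs:
            return False              # constraint says the agent is busy
        self.believed_running.add(job)
        self.agent.start(job)
        return True

    def cancel_without_agent(self, job):
        # Link is down: the server forgets the job, the agent never hears.
        self.believed_running.discard(job)


def simulate_race():
    agent = RaceAgent(max_jobs=1)
    server = RaceServer(agent)
    server.dispatch("job-1")
    server.cancel_without_agent("job-1")  # steps 1-2
    server.dispatch("job-2")              # steps 3-4
    return len(agent.running), agent.max_jobs  # step 5
```

Calling `simulate_race()` returns `(2, 1)`: two jobs are running on an agent that is only allowed one.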

This can be worked around with additional communication between server and agent (for instance, the agent periodically checks with the server to see whether a step has been cancelled), but the QB server might be running thousands of steps, and we want to keep communication to a minimum so that it scales.

As for the formal release of 5.1.0, we plan to get it out before 30 Nov.
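To make the trade-off concrete, here is a minimal sketch of the periodic check mentioned above, together with a back-of-the-envelope estimate of the extra load it creates. All names (`poll_until_done`, `status_requests_per_second`) and the callback parameters are hypothetical, and the 5-second interval is an arbitrary assumption.

```python
import time


def poll_until_done(is_cancelled, is_finished, kill_step,
                    interval_s=5.0, sleep=time.sleep):
    """Agent-side loop: ask the server whether the step was cancelled.

    `is_cancelled` stands in for a request to the server, `is_finished`
    and `kill_step` for local checks on the running step.
    """
    polls = 0
    while not is_finished():
        polls += 1
        if is_cancelled():
            kill_step()
            return "killed", polls
        sleep(interval_s)
    return "finished", polls


def status_requests_per_second(running_steps, interval_s):
    """Extra server load if every running step polls once per interval."""
    return running_steps / interval_s
```

With, say, 3000 running steps and a 5-second interval, `status_requests_per_second(3000, 5.0)` gives 600 additional requests per second on the server, which is the kind of overhead the design tries to avoid.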
amn ·
So there is nothing that distinguishes a build agent starting up and communicating with the master server for the first time from a build agent contacting the server after completing a step, following an intermittent communication outage?
robinshen ADMIN ·
Yes, that is true.
amn ·
Could this be changed so that when an agent starts, it tells the server, which then stops all builds that were previously running (and are now hanging) on it?
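One possible way to picture this request is sketched below. It is purely an assumption for illustration, not how QuickBuild works or was later changed: the agent sends an identifier generated once per process start (here called a boot ID), and the server treats a changed ID as a restart and releases the builds it still had assigned to that agent. The class `AgentRegistry` and its methods are hypothetical.

```python
import uuid


def new_boot_id() -> str:
    """Generated once when the agent process starts (hypothetical)."""
    return uuid.uuid4().hex


class AgentRegistry:
    """Hypothetical server-side bookkeeping keyed by agent name."""

    def __init__(self):
        self.last_boot_id = {}   # agent name -> boot ID last seen
        self.running = {}        # agent name -> set of build names

    def assign(self, agent, build):
        self.running.setdefault(agent, set()).add(build)

    def on_connect(self, agent, boot_id):
        """Return the builds to cancel because the agent restarted."""
        previous = self.last_boot_id.get(agent)
        self.last_boot_id[agent] = boot_id
        if previous is not None and previous != boot_id:
            # New process: nothing from the old one can still be running.
            return sorted(self.running.pop(agent, set()))
        # Same process reconnecting after an outage: leave builds alone.
        return []
```

A reconnect with the same boot ID returns an empty list, so a brief outage does not cancel anything. A connect with a new boot ID returns the stale builds so that they can be failed instead of hanging.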
robinshen ADMIN ·
Please file an improvement request for this at track.pmease.com, and we will check what we can improve in this case.
amn ·
Created enhancement request: http://track.pmease.com/browse/QB-1818
probert ·
We had a similar problem this weekend. We were upgrading the firmware on our switches and rebooting them, which caused a short network outage. One of the agents reported that it could not resolve the DNS name of the QB server. I restarted the agent and that error went away. But I didn't notice that a build was "stuck": even though a build timeout was set, the job was not cancelled and, more importantly, the agent would not accept new jobs, even though QB saw it as an active node. I had to cancel the job in the QB web GUI, after which the agent started accepting jobs again.
robinshen ADMIN ·
This problem has been solved in QB 5.1.
amn ·
Is there a reason it isn't shown in the change log, and why is this request still listed as open?
robinshen ADMIN ·
Sorry, I forgot to close that issue. The enhancement itself is logged in this issue: http://track.pmease.com/browse/QB-1847