Cannot cancel request when agent server hangs. #4571

hung ·

Hello Mr. Robin Shen,

We are facing an issue in production with the agent-master connection.
We cannot reproduce the issue, but here are the symptoms:

  • A build request hung in CHECKING_BUILD_CONDITION status
  • The build request could not be cancelled, and other requests stayed pending in the queue
  • The build server was unauthorized, but the build was still not cancelled
  • We could not open or remote into the server at that time
  • The server's log stops at "Active build agent {agentAddress} timed out..."
    After restarting the server, the log continues with:
    Job still exists on job node...
    Unable to find job (job class:...
    Error processing build request...

I think the cause of the problem is that the proxy cannot be created in the getNodeService function when the agent server is lagging.
So we want to add a timeout to that function.
Please let me know your opinion.
Thank you.
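
To make the timeout idea concrete, here is a minimal Java sketch of putting a deadline around a blocking call. It is not QuickBuild code: `BoundedCall`, `callWithTimeout`, and the lambdas standing in for proxy creation are hypothetical names, and the real getNodeService implementation is not shown in this thread. The sketch only assumes that the blocking work can be wrapped in a `Callable`.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of QuickBuild.
class BoundedCall {
    // Daemon threads so a stuck worker cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "bounded-call");
        t.setDaemon(true);
        return t;
    });

    // Runs the task and waits at most timeoutMillis for its result.
    // Returns Optional.empty() if the deadline passes or the caller is interrupted.
    static <T> Optional<T> callWithTimeout(Callable<T> task, long timeoutMillis) {
        Future<T> future = POOL.submit(task);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; the caller is released either way
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            throw new IllegalStateException("Task failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        // Stand-in for a proxy creation that returns promptly.
        Optional<String> fast = callWithTimeout(() -> "proxy", 1000);
        // Stand-in for a proxy creation that blocks on a lagging agent.
        Optional<String> slow = callWithTimeout(() -> {
            Thread.sleep(5000);
            return "late";
        }, 200);
        System.out.println("fast present: " + fast.isPresent());
        System.out.println("slow present: " + slow.isPresent());
    }
}
```

One caveat with this approach: `cancel(true)` only interrupts the worker thread. A thread blocked in classic socket I/O usually does not react to interruption, so the worker may stay stuck while the caller moves on. Timeouts set on the connection itself are generally the more reliable option where the client library offers them.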

robinshen ADMIN ·

Do you mean agent machine is down in this case, but build job never finishes? Have you tried the "Check Condition Timeout" in general setting of the configuration?

hung ·

Do you mean agent machine is down in this case, but build job never finishes?
-> Yes, the job only finishes when I reboot the server.
Have you tried the "Check Condition Timeout" in general setting of the configuration?
-> Check Condition Timeout was set to 5 minutes.

robinshen ADMIN ·

I tested a hard shutdown of a build agent, and the job eventually terminated when the QB server hit an error while testing job connectivity:

Caused by: com.pmease.quickbuild.QuickbuildException: Error testing job.
 	at com.pmease.quickbuild.grid.GridTaskFuture.testJobs(GridTaskFuture.java:111)
 	at com.pmease.quickbuild.grid.GridTaskFuture.get(GridTaskFuture.java:150)
 	... 6 more
 Caused by: com.caucho.hessian.client.HessianRuntimeException: com.caucho.hessian.client.HessianRuntimeException: Error connecting 'http://192.168.71.9:8811/service/node'
 	at com.caucho.hessian.client.HessianProxy.sendRequest(HessianProxy.java:285)
 	at com.caucho.hessian.client.HessianProxy.invoke(HessianProxy.java:171)
 	at com.sun.proxy.$Proxy93.testGridJob(Unknown Source)
 	at com.pmease.quickbuild.grid.GridNode$1.testGridJob(GridNode.java:254)
 	at com.pmease.quickbuild.grid.GridTaskFuture.testJobs(GridTaskFuture.java:89)
 	... 7 more
 Caused by: com.caucho.hessian.client.HessianRuntimeException: Error connecting 'http://192.168.71.9:8811/service/node'
 	at com.caucho.hessian.client.HessianURLConnection.getOutputStream(HessianURLConnection.java:101)
 	at com.caucho.hessian.client.HessianProxy.sendRequest(HessianProxy.java:283)
 	... 11 more
 Caused by: java.net.ConnectException: Operation timed out (Connection timed out)
 	at java.net.PlainSocketImpl.socketConnect(Native Method)
 	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
 	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
 	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
 	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
 	at java.net.Socket.connect(Socket.java:607)
 	at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
 	at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
 	at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
 	at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
 	at sun.net.www.http.HttpClient.New(HttpClient.java:339)
 	at sun.net.www.http.HttpClient.New(HttpClient.java:357)
 	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1228)
 	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162)
 	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056)
 	at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:990)
 	at sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1342)
 	at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1317)
 	at com.caucho.hessian.client.HessianURLConnection.getOutputStream(HessianURLConnection.java:99)

What I did:

  1. Run QB server on a machine
  2. Run QB agent on another machine
  3. Create a test configuration with the build condition set to run a groovy script sleeping for 5 minutes:
groovy:
sleep(300000);
  4. Edit the pre-queue script setting of the configuration in advanced settings to run the script below, so that manually triggered jobs evaluate the build condition:
request.respectBuildCondition
  5. Configure the master step to run on the build agent
  6. Trigger the test configuration manually and then power off the build agent machine
  7. The build agent becomes inactive after a short while, and the job keeps running. After 5 minutes or so, the job is terminated due to connectivity test failure

Can you please do the same to see if it works at your side?

hung ·

Yes, it worked on our servers.
But these are not the symptoms our servers showed when the issue occurred.
I simulated a case similar to a hung server:

  • Run agent on a physical server
  • Run fork bomb in agent server
  • Trigger builds continuously
    After a period of time, the build hung and the request stopped at CHECKING_BUILD_CONDITION status,
    even though "Check Condition Timeout" was set to 5
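
The difference between a powered-off agent and an overloaded one can be sketched with plain `HttpURLConnection`. This is an illustration under assumptions, not a diagnosis of QuickBuild: `AgentProbe`, `probe`, and `openSilentListener` are hypothetical names, and the silent listener merely imitates a machine whose TCP stack still accepts connections while the agent process never answers. In that situation a connect timeout alone does not help, and a read timeout is what bounds the wait.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;

// Hypothetical probe, not part of QuickBuild or Hessian.
class AgentProbe {
    enum Result { REACHABLE, TIMED_OUT, FAILED }

    // Sends a request with both connect and read timeouts set (milliseconds).
    static Result probe(String url, int connectTimeoutMs, int readTimeoutMs) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(connectTimeoutMs); // bounds the TCP connect
            conn.setReadTimeout(readTimeoutMs);       // bounds the wait for a response
            conn.getResponseCode();
            return Result.REACHABLE;
        } catch (SocketTimeoutException e) {
            return Result.TIMED_OUT;
        } catch (IOException e) {
            return Result.FAILED;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    // Opens a listening socket that never accepts or replies,
    // imitating an agent that is up at the network level but hung.
    static ServerSocket openSilentListener() {
        try {
            return new ServerSocket(0); // port 0: let the OS pick a free port
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket silent = openSilentListener()) {
            String url = "http://127.0.0.1:" + silent.getLocalPort() + "/service/node";
            // The connection is accepted by the OS backlog, but no response ever arrives.
            System.out.println(probe(url, 1000, 500));
        }
    }
}
```

If the server-side call to the agent has no read timeout, a case like the fork bomb one could block indefinitely, whereas the hard shutdown case ends with the `ConnectException` shown in the stack trace above. Whether that is what happens inside QB is an open question in this thread.
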
robinshen ADMIN ·

Run fork bomb in agent server

What do you mean by this? Is this happening even when the agent server is not down?

hung ·

I ran a fork bomb command to simulate an overloaded server.
Actually, when the server was hanging, I couldn't access it.

robinshen ADMIN ·

Are things working fine when the agent machine is not down, even if you are running a fork bomb on the agent?