Your browser was unable to load all of the resources. They may have been blocked by your firewall, proxy or browser configuration.
Press Ctrl+F5 or Ctrl+Shift+R to have your browser try again.

Curious Error On Build Agent, Build Happened but Never Logged #4568

tomz ·

@robinshen

Two overnight build jobs completely lost their build logs. The errors around the issue look as follows:
AGENT LOG

2024-04-29 04:50:49,414 [qtp885338582-34] ERROR com.pmease.quickbuild.grid.NodeServiceImpl - Unable to find job '191035ff-1ef7-4bfd-837e-3f6c20c69b18' on node 'buildhostaddress' (Job is ever started: false).

BUILD LOG

04:50:49,285 ERROR - Build is failed.
    java.lang.RuntimeException: Error executing step execution job.
    	at com.pmease.quickbuild.stepsupport.StepExecutionTask.reduce(StepExecutionTask.java:29)
    	at com.pmease.quickbuild.stepsupport.StepExecutionTask.reduce(StepExecutionTask.java:19)
    	at com.pmease.quickbuild.grid.GridTaskFuture.get(GridTaskFuture.java:168)
    	at com.pmease.quickbuild.DefaultBuildEngine.run(DefaultBuildEngine.java:645)
    	at com.pmease.quickbuild.DefaultBuildEngine.process(DefaultBuildEngine.java:482)
    	at com.pmease.quickbuild.DefaultBuildEngine.access$000(DefaultBuildEngine.java:156)
    	at com.pmease.quickbuild.DefaultBuildEngine$2.run(DefaultBuildEngine.java:1333)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:750)
    Caused by: com.pmease.quickbuild.QuickbuildException: Unable to find job '191035ff-1ef7-4bfd-837e-3f6c20c69b18' on node 'buildhostaddress'.

However, the build ran to completion just fine. Any idea what might cause this?

  • replies 13
  • views 570
  • stars 0
robinshen ADMIN ·

Please check console log of agent "buildhostaddress" to see if there are any errors printed around that time.

tomz ·

The only error log on the agent around the time of the problem is the one already printed in the first post. However, more information in the log does look important. It appears to be in the middle of an unplanned agent restart. More log here:

2024-04-29 04:50:15,621 [WrapperStartStopAppMain] INFO  com.pmease.quickbuild.bootstrap.Bootstrap - Cleaning cached plugin libraries...
2024-04-29 04:50:16,012 [WrapperStartStopAppMain] INFO  com.pmease.quickbuild.bootstrap.Bootstrap - Cleaning temp directory...
2024-04-29 04:50:16,012 [WrapperStartStopAppMain] INFO  com.pmease.quickbuild.bootstrap.Bootstrap - Benchmarking...
2024-04-29 04:50:16,684 [WrapperStartStopAppMain] INFO  com.pmease.quickbuild.bootstrap.Bootstrap - Connecting to server 'https://bwe-joqbmaster.bioware.ad.ea.com:1443'...
2024-04-29 04:50:17,356 [WrapperStartStopAppMain] INFO  com.pmease.quickbuild.bootstrap.AgentUpdater - Updating framework...
2024-04-29 04:50:17,403 [WrapperStartStopAppMain] INFO  com.pmease.quickbuild.bootstrap.AgentUpdater - Updating plugins...
2024-04-29 04:50:17,559 [WrapperStartStopAppMain] INFO  com.pmease.quickbuild.bootstrap.AgentUpdater - Updating plugins/com.pmease.quickbuild.libs...
2024-04-29 04:50:17,574 [WrapperStartStopAppMain] INFO  com.pmease.quickbuild.bootstrap.AgentUpdater - Updating plugins/site...
2024-04-29 04:50:17,590 [WrapperStartStopAppMain] INFO  com.pmease.quickbuild.bootstrap.AgentUpdater - Updating asset...
2024-04-29 04:50:17,621 [WrapperStartStopAppMain] INFO  com.pmease.quickbuild.bootstrap.Bootstrap - Booting QuickBuild from 'C:\QBAgent'...
2024-04-29 04:50:17,621 [WrapperStartStopAppMain] INFO  com.pmease.quickbuild.bootstrap.Bootstrap - Starting framework...
2024-04-29 04:50:18,696 [WrapperStartStopAppMain] INFO  com.pmease.quickbuild.equinoxadapter.EquinoxAdapter - Caching plugin libraries...
2024-04-29 04:50:29,330 [WrapperStartStopAppMain] INFO  com.pmease.quickbuild.Quickbuild - Starting QuickBuild agent...
2024-04-29 04:50:29,330 [WrapperStartStopAppMain] INFO  com.pmease.quickbuild.Quickbuild - Initializing velocity...
2024-04-29 04:50:29,439 [WrapperStartStopAppMain] INFO  com.pmease.quickbuild.Quickbuild - Checking node launch data...
2024-04-29 04:50:49,414 [qtp885338582-34] ERROR com.pmease.quickbuild.grid.NodeServiceImpl - Unable to find job '191035ff-1ef7-4bfd-837e-3f6c20c69b18' on node 'buildhostaddress' (Job is ever started: false).
2024-04-29 04:50:49,537 [WrapperStartStopAppMain] INFO  com.pmease.quickbuild.Quickbuild - Starting plugins...

It was pushing an update to the agent by the looks of it - but it was doing so when a job was activating. This looks like a race condition that should be avoided.

robinshen ADMIN ·

I see. Agent is only updated upon server upgrade, or when an old agent is started after server upgrade. Is this the case at your side?

tomz ·

There was a server upgrade but it was a very long time ago (9 days before this error). I find it surprising that it took this long for the agent to do the upgrade. We normally have the build schedules paused for a few hours after the upgrade which I thought was sufficient to let all the agents take the update at that time.

robinshen ADMIN ·

How many agents to you have? And is it possible that this agent is started recently? (hence it does not have chance to upgrade previously).

tomz ·

Roughly 425 agents. The last time they were restarted on purpose was the same date mentioned, 9 days before update. Maybe it took so long due to the amount of agents?

robinshen ADMIN ·

It should not take so long. Please watch it for some time to see if this happens again.

tomz ·

I am starting to see it more frequently now, but I believe it is because we recently added a lot of new hosts to the build grid. I'm led to believe that these two things are the cause

  • We have a slightly older template VM where our agents are spawned so the version of the agent is possibly out of date
  • there exists an ugly bug that will force update and reboot an agent, even if it's in the middle of a job.

Essentially, the problem's first sign is this in the agent log:

2024-06-06 08:06:14,691 [WrapperListener_stop_runner] INFO com.pmease.quickbuild.Quickbuild - Stopping QuickBuild agent...

followed by a bunch of logs about updating and restarting, followed by errors about job's that are no longer found

2024-06-06 08:07:34,312 [qtp321292933-94] ERROR com.pmease.quickbuild.grid.NodeServiceImpl - Unable to find job 'e3fd235e-f5f5-4179-b53c-07e8c183df3e' on node 'buildagent' (Job is ever started: false).
2024-06-06 08:07:40,963 [pool-2-thread-5] ERROR com.pmease.quickbuild.grid.GridTaskFuture - Error finishing job.

Is it possible to investigate and maybe fix the second issue that forces an update on an agent with an active job?

tomz ·

Alternatively, I notice there's an AgentUpdater class in the build system. Can you provide an example of using that to update an agent and then we can do it when provisioning the node on the grid, as opposed to having it randomly happen later? If that's the wrong class, is there another way to do this within groovy script?

robinshen ADMIN ·

We have a slightly older template VM where our agents are spawned so the version of the agent is possibly out of date

AgentUpdater will run as soon as agent connects to server and the agent will be forced to update and restart if it is using an old version to make sure both agent and server are using same program versions to avoid unexected errors during job run. Are you launching agent via VM image with cloud profiles?

tomz ·

They are not cloud profiles. And we definitely need debugging to find out what's going on here - is there some way to force logging when connected to server what version it thinks the server is, and what version it thinks the agent is? All we know is that long after the agent has first connected, it's still trying to update something based on the agent logs. Also, is there any way to know from server or agent logs what is triggering:

[WrapperListener_stop_runner] INFO com.pmease.quickbuild.Quickbuild - Stopping

Stopping because of "server requesting stop to perform X" or "user requesting stop on host" would be a really good log entry. When you shutdown a windows server OS you have to give it a shutdown command and I think we need one of those on the quickbuild agent so it logs better information.

robinshen ADMIN ·

Will add more debug logging for this.

robinshen ADMIN ·

Please login to machine running problematic agent, and check " /logs/console.log" (and console.log.1, console.log.2 etc if the entry is old) to see if there is below line around the time the agent is restarted:

TERM trapped.  Shutting down

If exists, the build agent is terminated by some other programs or due to restart of agent machine.