Weird recent behavior - zombie builds #4387

tomz ·

Is there some way in groovy scripting to retrieve the same data from the build grid that you get by viewing the Queue tab?

I'm trying to debug an issue with our build system that frequently causes build agent deadlocks and zombie build jobs. The zombies are builds that QuickBuild shows as running in the configuration's build history. If I browse to the agent that the master node is running on, it says there are NO running steps on that node. A zombie build has catastrophic consequences down the line because its pre-build and post-build scripts take a mutex lock on that agent. Any other resource requesting that agent is then deadlocked behind the zombie.

I first want to find a way to detect this error. The easiest way to do this manually is to do the following:

Find all builds in queue triggered by a specific user that fall under a specific build tree (which is massive) and have been running longer than expected. From these builds check the node they run on, and find the mutex file described above. The mutex file will have all buildIDs inside that are waiting on that node. The oldest (lowest numbered one, which is always the topmost in the file) is the stuck/zombie build. You can go to that build ID in quickbuild and see it stuck on the master step, but you can never find that build in the build queue, and you can't see that build ID in the node's "running steps" tab.

How can I get the queue entries in script so I can start working on a work-around for this problem? (automating the manual steps above)

Do you know of anything we might be doing wrong on the grid that could result in these kinds of zombies? Our mutexing system hasn't changed in months, perhaps years, but it has recently become heavily problematic, hanging nodes roughly 10 to 15 times a day.

Help or suggestions would be highly appreciated.

Tom Z.

robinshen ADMIN ·

To get all build requests in the queue, use the script below on the QB server:

groovy:
for (eachRequest in system.buildEngine.getBuildRequests(null)) {
  // do something with eachRequest.build.id
}

If a build entry no longer exists in the queue, the build has actually finished. It may still be displayed as running on the build detail page because an error prevented the build entry from being updated in the database. For these odd builds, please check whether the server log contains entries starting with "Error processing build request".
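
As a rough sketch of how the queue IDs from the loop above could be checked against the mutex file described earlier in this thread: the helper names `readWaitingBuildIds` and `findZombie` are hypothetical, and the file layout (one build ID per line, oldest at the top) is an assumption taken from that description, not something QuickBuild defines.

```groovy
// Hypothetical helper: assumes the mutex file holds one build ID per line,
// oldest first. Lines that are not plain numbers are skipped.
List<Long> readWaitingBuildIds(File mutexFile) {
    mutexFile.readLines()
             .collect { it.trim() }
             .findAll { it.isLong() }
             .collect { it.toLong() }
}

// Returns the topmost build ID in the mutex file if it is no longer
// present in the build queue (a suspected zombie), otherwise null.
Long findZombie(File mutexFile, Set<Long> queuedIds) {
    def waiting = readWaitingBuildIds(mutexFile)
    if (waiting.isEmpty()) return null
    def holder = waiting.first()
    return queuedIds.contains(holder) ? null : holder
}

// On the QB server, the queued IDs would presumably be gathered from the loop above, e.g.
//   def queuedIds = system.buildEngine.getBuildRequests(null).collect { it.build.id } as Set
```

The two helpers only touch a local file and a set of IDs, so they can be tried outside QuickBuild before being wired into a maintenance script.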

tomz ·

I have determined that a few things we were doing were problematic. The most egregious issues involved exceptions being thrown in either the master step's pre-step script or post-step script, where the file node locking mechanisms are implemented. Because this pattern is used in roughly 100 configurations, it would be useful to be able to modify steps via quickbuild scripting, rather than manually fixing each instance.

What is the best way to accomplish this? The REST API?

Thanks in advance,
Tom Z.

robinshen ADMIN ·

You may create a maintenance configuration that runs the script below:

groovy:
import com.pmease.quickbuild.setting.step.executeaction.ScriptExecuteAction;
import com.pmease.quickbuild.migration.VersionedDocument;

for (eachConf in system.configurationManager.getAll()) {
  def masterStep = eachConf.getStep("master");
  if (masterStep != null) {
    if (masterStep.preExecuteAction instanceof ScriptExecuteAction) {
      masterStep.preExecuteAction.script = .. // fix script
    }
    if (masterStep.postExecuteAction instanceof ScriptExecuteAction) {
      masterStep.postExecuteAction.script = .. // fix script
    }
    eachConf.stepDOMs["master"] = VersionedDocument.fromBean(masterStep);
    system.configurationManager.save(eachConf);
  }
}
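
What goes in place of the `.. // fix script` placeholders depends on the existing locking scripts, which are not shown in this thread. As one possible shape for the post-step side, the sketch below always removes the build's ID from the mutex file, even when the rest of the script throws. The helper names `releaseLock` and `withLockRelease` and the one-ID-per-line file layout are assumptions for illustration only.

```groovy
// Hypothetical helper: removes one build ID from a line-per-ID mutex file.
void releaseLock(File mutexFile, String buildId) {
    def remaining = mutexFile.readLines().findAll { it.trim() != buildId }
    mutexFile.text = remaining ? remaining.join('\n') + '\n' : ''
}

// Runs the script's own work, then always attempts to release the lock,
// whether or not the work threw an exception.
void withLockRelease(File mutexFile, String buildId, Closure work) {
    try {
        work()
    } finally {
        try {
            releaseLock(mutexFile, buildId)
        } catch (IOException e) {
            // Report instead of rethrowing, so a cleanup failure
            // does not mask the original error from work().
            System.err.println("Could not release lock for build ${buildId}: ${e.message}")
        }
    }
}
```

With this pattern an exception in the script body still propagates and fails the step, but the build's ID no longer stays at the top of the mutex file.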