Handling an agent going offline temporarily #4289

dlregisx · 3 years ago

We're considering using QB for a case where the work of the agents entails them rebooting and going offline for a period of time. The overall build needs to be able to span the offline period.

I envision the following:

1. Master step would run on target agent doing various prep tasks
2. To handle agent reboot, a sequential composition step would be used and set to run on the server:
   2a. An inner-step targeting the agent would initiate the agent reboot/offline action
   2b. An inner-step set to run on the server would wait for the agent's return
3. Continue with steps on the agent

Using the QB API, I've been able to determine which agent the current job is targeting, e.g.:

masterStep = build.getMasterStep();
masterNode = masterStep.getNode();  // get node on which this job is running
logger.info("Master node agent = " + masterNode.getHostName());

Once I have the GridNode, I'd like to check periodically whether the agent is active. Merely checking PING status won't work, because the system may have booted but still be doing work before the agent is restarted. Is there a REST API on the agent that I can use for this? Or perhaps some API calls that I'm not seeing? Or is there a better overall approach you might suggest?

Thanks,
Dave Regis

robinshen ADMIN · 3 years ago

Unfortunately, a QB step running on an agent that reboots will cause the whole build to fail and be cancelled, because QB monitors each agent that is running a step to make sure it is active. It cannot recover from an agent reboot for now.

dlregisx · 3 years ago

Robin,

Thank you for your quick reply!

I've noticed that the task to initiate the offline work will succeed if it's performed asynchronously, i.e., the script launches a separate process and then immediately returns.
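
For illustration, here is a minimal sketch of the "launch and return" pattern in Java. The class name `DetachedLaunch`, the `launch` helper, and the log file are hypothetical, and `java -version` is only a stand-in for the real offline command, which isn't shown here. Whether the child keeps running after the step ends depends on the OS and on how the agent handles child processes, so treat this as a starting point rather than a verified QB recipe.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

final class DetachedLaunch {

    /** Starts the command and returns immediately, without waiting for it to finish. */
    static Process launch(List<String> command, File logFile) throws IOException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true);                                  // merge stderr into stdout
        builder.redirectOutput(ProcessBuilder.Redirect.appendTo(logFile));  // write to a file, not a pipe to the caller
        return builder.start();                                             // deliberately no waitFor()
    }

    public static void main(String[] args) throws IOException {
        // Stand-in command (assumption): replace with the real reboot/offline command.
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        File log = File.createTempFile("offline-work", ".log");
        Process child = launch(List.of(javaBin, "-version"), log);
        System.out.println("started pid " + child.pid() + ", returning without waiting");
    }
}
```

Sending the child's output to a file rather than a pipe means the launching script has nothing left to read, so it can exit with success while the work carries on.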

As an experiment, I've constructed a build which runs on the server but only the specific step that initiates offline work runs on the target agent. The step on the target agent returns success before the agent goes offline and the build continues without errors on the server. I still need to work out having the server wait for the agent's return, but this looks promising. Once I work out that detail, it will be trivial to verify that the offline work was successful and then handle fail/success logic from there.

To prevent overloading the server, I think we could designate persistent agents to share the load of the overall build execution.

Any thoughts on this approach?

Regards,
Dave Regis

robinshen ADMIN · 3 years ago

It will be fine if the agent is rebooted asynchronously. To check whether the agent is back online afterwards, please follow the procedure below:

Define a variable, say "rebootingAgentId", with an empty value in the configuration
Create a step running on the server to execute the script below before rebooting the agent:

groovy:
vars.get("rebootingAgentId").setValue(grid.getNode("rebooting-agent-host-name:8811").id.toString(), false);

Create a step running on the server to execute the script below after rebooting the agent:

groovy:
while (true) {
  def rebootingAgent = grid.getNode("rebooting-agent-host-name:8811");
  if (rebootingAgent != null && !rebootingAgent.id.toString().equals(vars.getValue("rebootingAgentId"))) { // agent restarted successfully
    break;
  } else {
    Thread.sleep(1000);
  }
}

The idea behind this is that the agent id changes after each reboot, so a node id that differs from the saved one means the agent has come back. You can also run this script on other build agents not involved in the reboot to reduce server load.
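
One thing to be aware of: the loop above waits forever if the agent never comes back. Below is a rough sketch of the same id-change check with a deadline added, written as standalone Java so it can be tried outside QB. The names `AgentWait` and `waitForNewId`, and the `Supplier` used for the lookup, are hypothetical. In a real step the supplier would wrap the `grid.getNode(...)` call from the script above and return null while the node is missing.

```java
import java.time.Duration;
import java.util.function.Supplier;

final class AgentWait {

    /**
     * Polls until the looked-up id differs from the id recorded before the reboot.
     *
     * @param currentId  hypothetical lookup: the agent's current id, or null while it is offline
     * @param previousId the id saved before the reboot
     * @return true if a new id was seen before the timeout, false otherwise
     */
    static boolean waitForNewId(Supplier<String> currentId, String previousId,
                                Duration timeout, Duration interval) throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            String id = currentId.get();
            if (id != null && !id.equals(previousId)) {
                return true;   // agent is back with a new id
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;  // timed out; the caller decides how to fail the build
            }
            Thread.sleep(interval.toMillis());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated agent: offline (null) for two polls, then back with a new id.
        String[] observed = {null, null, "id-2"};
        int[] calls = {0};
        Supplier<String> lookup = () -> observed[Math.min(calls[0]++, observed.length - 1)];
        boolean back = waitForNewId(lookup, "id-1", Duration.ofSeconds(5), Duration.ofMillis(10));
        System.out.println("agent back: " + back);  // prints "agent back: true"
    }
}
```

Returning false on timeout, instead of throwing, leaves the fail/success handling to the calling step.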

dlregisx · 3 years ago

Robin,

Excellent! Again, thank you very much!

Dave

robinshen ADMIN · 3 years ago

Glad to be helpful. Let me know if I can be of any further help.