Step to check if node is active #2495

amn · 1 decade ago

At the beginning of some builds we start the build agent first and then have a hard-coded sleep for 2 minutes to cover any time it might take to show up in the Active Nodes list. Instead of this sleep I would rather have a second step that runs until the agent shows up in the Active Nodes list or times out after a certain point. I started trying to do this:

groovy:

import com.pmease.quickbuild.QuickbuildException;

def node = "agent:8811"
def running = false

//while(current.getDuration() < 120000) {
  logger.info "Status: " + step.getStatus()
  if (grid.getNode(node) != null) {
    logger.info "Node is running..."
    running = true;
    //break;
  }
//}

if (!running) {
  logger.info "Node is not running..."
  throw new QuickbuildException("The agent could not be found in the Active Nodes list after 2 minutes.")
}

However, I can't get the current duration of a step while it is running. That just returns null. Do you have a suggested way to do this?

robinshen ADMIN · 1 decade ago

The step duration is only calculated after the step finishes. You will need to calculate the duration yourself: for instance, record a timestamp at the start of the loop and check the elapsed time against that timestamp later on.
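
A minimal sketch of that timestamp approach, written as plain Groovy so it runs outside QuickBuild. The helper name `waitFor` and the `nodeIsActive` closure are hypothetical; the closure only stands in for a real check such as `grid.getNode(node) != null`.

```groovy
// Hypothetical helper (not a QuickBuild API): polls `check` until it returns
// true or until `timeoutMs` milliseconds have elapsed since the loop started.
def waitFor(long timeoutMs, long pollMs, Closure<Boolean> check) {
    long start = System.currentTimeMillis()   // timestamp recorded before the loop
    while (System.currentTimeMillis() - start < timeoutMs) {
        if (check()) {
            return true
        }
        Thread.sleep(pollMs)                  // pause between checks
    }
    return false
}

// Stand-in for the real check: reports "active" from the third call onward.
int calls = 0
def nodeIsActive = { ++calls >= 3 }

def found = waitFor(2000, 10, nodeIsActive)
println "found=${found} after ${calls} checks"
```

In a QuickBuild step the closure would presumably be something like `{ grid.getNode("agent:8811") != null }`, with the step throwing a QuickbuildException when `waitFor` returns false.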

amn · 1 decade ago

We are seeing periodic failures with this where grid.getNode does not return null, but then the next step which runs through every node in the grid to find a match can't find it and fails. Is there some delay that might happen here even after grid.getNode is successful where it wouldn't show up in the Node Selection list?

robinshen ADMIN · 1 decade ago

Generally, a non-null return value from grid.getNode means the node is active. The exception is a node that was forcibly stopped: the QB server waits for the agent to time out before removing it from the active node list. I'd suggest printing grid.getNode() again right after step matching fails. You might also want to upgrade to QB 5.1.0-rc1, as it can report the reason each node did not match, which helps when investigating issues like this.

amn · 1 decade ago

We check to see if the agent is active:

groovy:

import com.pmease.quickbuild.QuickbuildException;

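def node = params.get("currentNodeAddress");
// Short host name: everything before the first dot, or the whole address if it has no dot
def dot = node.indexOf('.')
def trimmedNode = dot >= 0 ? node.substring(0, dot) : node
def running = false

def time = System.currentTimeMillis()
logger.info "Time: " + time;

// Poll the grid for up to 5 minutes
while(System.currentTimeMillis() < time+300000) {
  if (grid.getNode(node + ":8811") != null || grid.getNode(trimmedNode + ":8811") != null) {
    logger.info "Node is running..."
    running = true;
    // Extra wait after the node is found, before the next step runs
    sleep(10*1000)
    break;
  }
  // Pause between checks instead of busy-waiting
  sleep(5*1000)
}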

if (!running) {
  logger.info "Node is not running..."
  throw new QuickbuildException("The agent could not be found in the Active Nodes list after 5 minutes.")
}

Then the next step uses the Node Selection option "on node with specified script evaluating to true" to find it:

groovy:logger.info(node.getAddress());
logger.info(node.getHostName());
logger.info(params.get('currentNodeAddress'));
def testNode = params.get("currentNodeAddress");
def dot = testNode.indexOf('.');
def trimmedTestNode = dot >= 0 ? testNode.substring(0, dot) : testNode;
logger.info(trimmedTestNode);
return (node.getHostName() == trimmedTestNode || node.getHostName() == testNode);

Sometimes the node that was found in the first step does not show up in the list for the Node Selection, and the step then fails because it can't find a node that matches. It seems to become more stable when I increase the sleep time after the node is found in the first step, but I feel like there should not have to be a sleep at all.

In this case where we are seeing it fail there might be around 14 to 20 new agents that are being started up at the same time. Is it just not able to refresh the node selection in time? Is there a way to force that?

robinshen ADMIN · 1 decade ago

It turns out that QB caches nodes while processing steps and only refreshes them after a process round, so that it does not lock the grid data structure too frequently. A process round might take some time if many steps are running, so please increase the sleep time to tolerate this. A more reliable approach is to use the QB cloud profile API to start and stop agents, as we have done for EC2 agents.
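
As a rough sketch of the "increase sleep time" workaround, the wait can be split into two phases: poll until the node is visible, then pause for a configurable settle period so the server has a chance to finish its current process round. The helper name `waitThenSettle` and its parameters are hypothetical, the closure stands in for a `grid.getNode(...) != null` check, and a suitable settle time depends on how long a process round takes on a given server.

```groovy
// Hypothetical two-phase wait (not a QuickBuild API).
// Phase 1: poll `nodeIsActive` until it returns true or `timeoutMs` passes.
// Phase 2: once the node is seen, sleep `settleMs` before reporting success.
def waitThenSettle(long timeoutMs, long pollMs, long settleMs, Closure<Boolean> nodeIsActive) {
    long deadline = System.currentTimeMillis() + timeoutMs
    while (System.currentTimeMillis() < deadline) {
        if (nodeIsActive()) {
            Thread.sleep(settleMs)   // settle period after the node is found
            return true
        }
        Thread.sleep(pollMs)         // pause between checks
    }
    return false
}

// Stand-in check that succeeds immediately, so only the settle period is waited.
long began = System.currentTimeMillis()
boolean ready = waitThenSettle(1000, 10, 200) { true }
long waited = System.currentTimeMillis() - began
println "ready=${ready}, waited about ${waited} ms"
```

Keeping the settle period as a parameter makes it easy to raise when many agents start at once, without touching the polling logic.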

amn · 1 decade ago

I'm not familiar with the QB cloud profile API and how that would be used to start/stop agents. Could you provide an example or direct me to the documentation for this?

Thanks,
Adam

robinshen ADMIN · 1 decade ago

The best documentation is the example. You may refer to the built-in EC2 plugin to see how to use the cloud integration API. Let me know if you have any specific questions while implementing your own.