Your browser was unable to load all of the resources. They may have been blocked by your firewall, proxy or browser configuration.
Press Ctrl+F5 or Ctrl+Shift+R to have your browser try again.

Steps executed even if previous one is failed #3996

kururin ·

Hello

Quickbuild 8.0.20

If you look at the screenshot below, you'll see steps being executed whereas a previous step is failed:
qb8_bug.png

The "revert all" step is set to "Always execute" but the next step ("unshelve files") is set to "If all previous sibling steps are successful".
It always happens in this configuration. Could the problem come from the error returned by the "sync code" step?

11:00:32,740 ERROR - Step 'master>== Targets>== PC Final>== Perforce>Sync code (do not unshelve)' is failed: Failed to run command: p4 -p p4-tgt-codeedge-mtr:1806 -u silex-buildfarm -P ****** -c qb-WIN-7UARQU9O190-412 sync //qb-WIN-7UARQU9O190-412/...@3228179
Command return code: 1
Command error output: unlink: D:\PRESUBMIT\main\source\server\script\matchmaking_rules.py: The system cannot find the file specified.
Fatal client error; disconnecting!
chmod: D:\PRESUBMIT\main\source\server\script\matchmaking_rules.py: The system cannot find the file specified.

Thank you
Mathieu

  • solved #18
  • replies 20
  • views 4228
  • stars 0
kururin ·
robinshen ADMIN ·

This looks odd. Do you have any customer script running to change the step status unintentionally?

kururin ·

No we don't have script modifying the status.
This problem never happens in other configurations, that's why I think it comes from our "Sync" step...
We are still investigating this perforce error that came out a few days after upgrading to QB8.

robinshen ADMIN ·

How about running this command from console directly to see how it works?

kururin ·

The step fails very rarely, but each time it does it returns this strange error and this weird QB behavior.

robinshen ADMIN ·

There might exist some race conditions. Do you have two or more parallel builds or steps writing/cleaning files under the Perforce workspace while the sync is going on?

rpallares ·

Hi,

I suppose we have the same issue on QuickBuild 8.0.20, but I'll probably add new information on what happened.
Our issue is a whole sequential step executed in parallel with others is executed twice.

Steps to reproduce:

  1. Setup a parallel composite step executed on one machine (not the server to be closer to our real case)
  2. Add several sequential composite steps into
  3. Each sequential reserve a new machine using a unique resource
  4. Do several things on each machine, each step must have a pre and post execute actions that log "Hey, I will execute this step once", "Hey I have executed this step"
  5. Execute this process several times
  6. Observe that sometimes (1/100), a sequetial step is executed twice on the same machine.
    QuickBuild don't add a new column, but if we look at the logs of all steps in this column, all steps are executed twice with ~20ms offset (clearly identified by the pre and post execute steps scripts).

As second symptom, we see like described in this issue (that's why I made the connection) the same issue:

  1. A step fail (due to the concurrent double execution)
  2. The following step with execute condition defined as "If all previous sibling steps are successful" is executed then cancelled immediatly

We suppose that these two symptoms have the same cause, the whole column (or sequential composite setp) is executed twice concurrently that produce our build failures.
As the issue have a low reproducibilty, it will quite hard to reproduce it at your side.
But may be an automated process that will automatically parse the logs and detect the double execution will catch the error quickly.

Can you confirm the issue please ?
Is this issue is known ?
How can we workaround that ?
I suppose that a kind of mutex is missing when resevering several machines by resource into a parallel step.

Do not hesitate to contact me by email if you need more information to reproduce the issue.

Thanks

robinshen ADMIN ·

Thanks for the info. Will set up test env to check this issue...

robinshen ADMIN ·

Having tested this all day generating tens of thousands of builds and no one is failing. I am sending my database backup to you, can you please check if I am missing something?

kururin ·

Just a few more details:

When we say that the steps are duplicated, the duplication is not visible on the build view of QB. You can just see it by opening the log. Here is an example of step log we have:

02:28:30,758 INFO - Checking step execute condition...
02:28:30,758 INFO - Step execute condition satisfied, executing...
02:28:30,880 INFO - Executing pre-execute action...
02:28:30,880 INFO - Running step...
02:28:59,939 INFO - Executing post-execute action...
02:28:27,776 INFO - Checking step execute condition...
02:28:27,776 INFO - Step execute condition satisfied, executing...
02:28:27,898 INFO - Executing pre-execute action...
02:28:27,898 INFO - Running step...
02:28:44,490 INFO - Executing post-execute action...

The step is executed twice but as soon as one of this launch is finished, the next step is triggered without waiting for the end of the second launch of the current step.
That explains why sometimes we have 2 steps failed in a row.

We're still trying to find what these builds have in common that could lead to this kind of problem.
It happens on different configurations, not just one and I'm sure we missed some of them because the builds can finish green even if this problem happens.
The config in the backup you sent is correct even if ours are more complexe (with triggers by example). But because we have no idea what can cause the problem, it's difficult to say what the configuration must look like to reproduce the issue...

kururin ·

This still often happens. We just live with it because we have no idea of the cause of such a behavior but it's really annoying.
Again today, look at this step log:

17:31:41,096 INFO  - Checking step execute condition...
17:31:41,096 INFO  - Step execute condition satisfied, executing...
17:31:41,226 INFO  - Executing pre-execute action...
17:31:41,226 INFO  - Running step...
17:31:42,444 INFO  - Enter password:
17:31:42,444 INFO  - User silex-buildfarm logged in.
17:31:42,447 INFO  - Executing post-execute action...
17:31:41,089 INFO  - Checking step execute condition...
17:31:41,089 INFO  - Step execute condition satisfied, executing...
17:31:41,218 INFO  - Executing pre-execute action...
17:31:41,218 INFO  - Running step...
17:31:42,614 INFO  - Enter password:
17:31:42,614 INFO  - User silex-buildfarm logged in.
17:31:42,617 INFO  - Executing post-execute action...

The step is launched twice almost in the same time :(
Are we the only customers who reported this issue?

More details on our configurations:

  • They are running on a "master" node then splitted on many agents
  • Each agent is selected according a few rules, for example:
node_selection.png

We'll probably try to upgrade to QB9 but we are not very optimistic...

robinshen ADMIN ·

Could it be that it is just a duplicated capture of build log? To verify, please execute below logic in pre-execute action of the problematic step:

groovy:
logger.info("UUID of this running: " + UUID.randomUUID());
kururin ·

I've just added this step to one of our configuration where we saw the problem is happened.
But it is not a duplicated capture of build log as you suggest. If it was the case, it won't be a problem.
The steps are really executed twice and doe not wait for both execution are finished to jump to the next step.

  1. first execution of step1
  2. second execution of step1 (almost in the same time)
  3. as soon as one execution is finished, step2 starts
  4. when the other execution of step1 is finished, step2 starts again
  5. and so on...

We'll post the log with UUID here when it happens again.

robinshen ADMIN ·

Will check it further.

robinshen ADMIN ·

Can you please send backup of your database (can be taken from QB admin page) and let me know the build and step to check? Please send to [robin AT pmease DOT com]

robinshen ADMIN ·

Also please send me corresponding build log. When download the build log, please make sure to check the option "show steps".

robinshen ADMIN ·
kururin ·

Thanks a lot!!!

kururin ·

Just curious, could you shortly explain what was the problem and when this bug could happen?

robinshen ADMIN ·

There is a race condition in our parallel step execution scheduling, and it was introduced when worker count feature of parallel composition step is added. It only happens when some steps finish very quickly in the parallel composition before some other steps being triggered. For instance, when some child steps are skipped.