Your browser was unable to load all of the resources. They may have been blocked by your firewall, proxy or browser configuration.
Press Ctrl+F5 or Ctrl+Shift+R to have your browser try again.

build step is canceled by unknown reason, and the build status is failed not canceled #4056

OnQuickBuild ·

Hello Quickbuild Support

I've encountered a very confusing build failure.
In a build, a step cancels itself due to an unknown reason.
I'm trying to find the reason on my end, but the step doesn't report any error code. The "Terminating launched command gracefully..." just shows right after my normal command output.

The step is a batch step. In it, it only runs one line "load_bit.bat my_bit.bit".

Breakdown:
This batch script involves certain tools we use that require licenses.
And I have four artifact directories, inside each one, I use a sequential step to run this for them. The pattern is the step always cancels the fourth one. So I'm only guessing it has something to do with my license.
I'm still looking for the real cause. Since it doesn't report any error, I'm not sure what went wrong.

Could you tell me why a step would be canceled by itself? And the final build status says "failed", not the pink canceled status.

Thank you!

  • replies 22
  • views 4079
  • stars 0
Laba42 ·

Could it be a timeout?
With me that was the mistake!

Georg

OnQuickBuild ·

No it was not due to timeout. Besides, the build result shows status "failed", not the pink "timed out"

OnQuickBuild ·

The config timeout is set to 720 minutes. No timeout restrictions on the steps either. No one clicked stop the build.

At first I thought it was some license issue, preventing from using that step too frequently in a short time. But I logged on to that build agent, and tried the same thing with command prompt, it didn't report issues.

Again, on QB GUI, the step didn't report any error, that "terminate gracefully" just appeared after some normal stdout.

robinshen ADMIN ·

Is this happening every time you run the build? If yes, please run the build again and observe console log of both the QB server and the build agent running that step to see if there are any errors.

OnQuickBuild ·

agent node log:

2019-02-26 13:53:20,070 [qtp2116429876-28] WARN  com.pmease.quickbuild.grid.NodeServiceImpl - testGridJob is cancelling job '14692f74-bc42-4d57-bae4-747ef62357de'...
2019-02-26 13:53:20,131 [pool-1-thread-1] WARN  com.pmease.quickbuild.grid.NodeServiceImpl - testGridJob is cancelling job 'f4e799b1-74ef-48b7-86f2-6339f58b3222'...
2019-02-26 13:53:20,133 [pool-1-thread-1] WARN  com.pmease.quickbuild.grid.GridTaskFuture - Job still exists on job node and cancel command is issued (job id: f4e799b1-74ef-48b7-86f2-6339f58b3222, build id: 10688, job node: L3:8811)...
2019-02-26 13:53:20,133 [pool-1-thread-3] WARN  com.pmease.quickbuild.grid.NodeServiceImpl - testGridJob is cancelling job 'a0ae6e8d-7c94-4c4e-b669-8a821c5a928f'...
2019-02-26 13:53:20,134 [pool-1-thread-5] WARN  com.pmease.quickbuild.grid.NodeServiceImpl - testGridJob is cancelling job 'e547675e-8e16-46f0-922a-825574d871e3'...
2019-02-26 13:53:20,134 [pool-1-thread-3] WARN  com.pmease.quickbuild.grid.GridTaskFuture - Job still exists on job node and cancel command is issued (job id: a0ae6e8d-7c94-4c4e-b669-8a821c5a928f, build id: 10688, job node: L3:8811)...
2019-02-26 13:53:20,135 [pool-1-thread-7] WARN  com.pmease.quickbuild.grid.NodeServiceImpl - testGridJob is cancelling job '61085295-0937-44e6-9415-d745c72eb2e3'...
2019-02-26 13:53:20,135 [pool-1-thread-5] WARN  com.pmease.quickbuild.grid.GridTaskFuture - Job still exists on job node and cancel command is issued (job id: e547675e-8e16-46f0-922a-825574d871e3, build id: 10688, job node: L3:8811)...
2019-02-26 13:53:20,135 [pool-1-thread-7] WARN  com.pmease.quickbuild.grid.GridTaskFuture - Job still exists on job node and cancel command is issued (job id: 61085295-0937-44e6-9415-d745c72eb2e3, build id: 10688, job node: L3:8811)...
2019-02-26 13:53:20,136 [pool-1-thread-9] WARN  com.pmease.quickbuild.grid.NodeServiceImpl - testGridJob is cancelling job 'bcdeb859-d659-40f0-9ba3-0c2a687e8708'...
2019-02-26 13:53:20,137 [pool-1-thread-9] WARN  com.pmease.quickbuild.grid.GridTaskFuture - Job still exists on job node and cancel command is issued (job id: bcdeb859-d659-40f0-9ba3-0c2a687e8708, build id: 10688, job node: L3:8811)...
2019-02-26 13:53:20,137 [pool-1-thread-11] WARN  com.pmease.quickbuild.grid.NodeServiceImpl - testGridJob is cancelling job 'd0af9b46-d640-468f-b66f-00a59f243d68'...
2019-02-26 13:53:20,137 [pool-1-thread-11] WARN  com.pmease.quickbuild.grid.GridTaskFuture - Job still exists on job node and cancel command is issued (job id: d0af9b46-d640-468f-b66f-00a59f243d68, build id: 10688, job node:L3:8811)...
2019-02-26 13:53:20,137 [pool-1-thread-13] WARN  com.pmease.quickbuild.grid.NodeServiceImpl - testGridJob is cancelling job 'c0f8998e-0fe3-43e3-9fbf-1f4f478ec829'...
2019-02-26 13:53:20,138 [pool-1-thread-13] WARN  com.pmease.quickbuild.grid.GridTaskFuture - Job still exists on job node and cancel command is issued (job id: c0f8998e-0fe3-43e3-9fbf-1f4f478ec829, build id: 10688, job node:L3:8811)...
2019-02-26 13:53:20,139 [pool-1-thread-15] WARN  com.pmease.quickbuild.grid.NodeServiceImpl - testGridJob is cancelling job '7e4592e4-396f-41b6-9cbf-0eb407d1f80e'...
2019-02-26 13:53:20,139 [pool-1-thread-15] WARN  com.pmease.quickbuild.grid.GridTaskFuture - Job still exists on job node and cancel command is issued (job id: 7e4592e4-396f-41b6-9cbf-0eb407d1f80e, build id: 10688, job node:L3:8811)...

server log:

2019-02-26 13:53:20,099 [pool-1-thread-17774] WARN  com.pmease.quickbuild.grid.NodeServiceImpl - testGridJob is cancelling job '9634e1cc-cfba-4efa-824d-9b24430ebbed'...
2019-02-26 13:53:20,099 [pool-1-thread-17774] WARN  com.pmease.quickbuild.grid.GridTaskFuture - Job still exists on job node and cancel command is issued (job id: 9634e1cc-cfba-4efa-824d-9b24430ebbed, build id: 10688, job node: quickbuild:80)...
2019-02-26 13:53:20,099 [pool-1-thread-17780] WARN  com.pmease.quickbuild.grid.NodeServiceImpl - testGridJob is cancelling job '9966448c-82e1-4842-ba65-ad508e709711'...
2019-02-26 13:53:20,100 [pool-1-thread-17780] WARN  com.pmease.quickbuild.grid.GridTaskFuture - Job still exists on job node and cancel command is issued (job id: 9966448c-82e1-4842-ba65-ad508e709711, build id: 10688, job node: quickbuild:80)...
2019-02-26 13:53:20,100 [pool-1-thread-17796] WARN  com.pmease.quickbuild.grid.NodeServiceImpl - testGridJob is cancelling job 'f608d540-e85f-4122-9ff1-2a553323a127'...
2019-02-26 13:53:20,100 [pool-1-thread-17796] WARN  com.pmease.quickbuild.grid.GridTaskFuture - Job still exists on job node and cancel command is issued (job id: f608d540-e85f-4122-9ff1-2a553323a127, build id: 10688, job node: quickbuild:80)...
2019-02-26 13:53:20,100 [pool-1-thread-17791] WARN  com.pmease.quickbuild.grid.NodeServiceImpl - testGridJob is cancelling job '5383314b-4261-46cf-96c6-fe4d9d6637e9'...
2019-02-26 13:53:20,101 [pool-1-thread-17791] WARN  com.pmease.quickbuild.grid.GridTaskFuture - Job still exists on job node and cancel command is issued (job id: 5383314b-4261-46cf-96c6-fe4d9d6637e9, build id: 10688, job node: quickbuild:80)...
2019-02-26 13:53:20,192 [pool-1-thread-17806] WARN  com.pmease.quickbuild.grid.GridTaskFuture - Job still exists on job node and cancel command is issued (job id: 14692f74-bc42-4d57-bae4-747ef62357de, build id: 10688, job node: L3:8811)...
2019-02-26 13:55:35,962 [pool-1-thread-17793] WARN  com.pmease.quickbuild.Context - {"build_id":10688,"begin_date":"2019-02-26T13:37:57","end_date":"2019-02-26T13:55:34","duration_ms":1057012,"status":"FAILED","status_date":"2019-02-26T13:55:34"
OnQuickBuild ·

my QB version:
PMEase QuickBuild, 8.0.0, 2018-02-26, COMMERCIAL LICENSE FOR SITE XXX

robinshen ADMIN ·

The log looks fine. Are there any clue in audit log of this configuration for the cancellation? Also please run a test batch file containing a simple sleep command instead of your current command to see if it will be cancelled.

OnQuickBuild ·

Hi Robin

I haven't told you that there are mysterious canceling outside of the batch file execution. There is a python script right after the batch step to run tests. And sometimes it would cancel without reason at the Python step. So I didn't bother to try the batch command with sleep (timeout in batch)

Anyway I suspect it could be the Java heap size defined in wrapper.conf. I increased it to 10 GB (it used to be 2 GB) hoping it would solve the problem.

OnQuickBuild ·

increased Java heap to 10 GB and ran the build. Still canceled without a reason.

robinshen ADMIN ·

Running out of Java heap should report other errors instead of cancelling. Have you checked the audit log to see if there is anything related to build cancellation? Also are you able to reproduce this issue with a sample configuration?

OnQuickBuild ·

I ran the build again. It still canceled and the audit log didn't say anything. Do you mean the audit log in the "administration" tab? It only says that configuration "Build was requested." And nothing newer than that entry in the log.

Also, I haven't tried using batch sleep(timeout) command. Could you tell me why using this way to debug the problem? I'll try that later when I have more time.

In the meantime, yes it ALWAYS happens. I know it could very well be my part of responsibility, but when it doesn't report any reason or error code, I really don't know what's going on.

Furthermore, sometimes it would cancel during a batch step that runs a Python script. This doesn't happen all the time. Only once or twice in total. But every time it cancels during the batch file execution.

OnQuickBuild ·

Just tried using batch timeout command

timeout /t 10 /nobreak

The /nobreak here prevents any key interruption except for ctrl+c.

But apparently QB doesn't support this kind of behavior:

18:49:13,016 WARN  - ERROR: Input redirection is not supported, exiting the process immediately.
OnQuickBuild ·

problem still exists, and it doesn't only happen at one step. There are two steps, one runs batch file, another runs a python script. Totally different steps.
This is the step error log from the step that cancels. Today it only cancels during the python step

java.lang.RuntimeException: java.lang.InterruptedException 
        at com.pmease.quickbuild.execution.Commandline.execute(Commandline.java:395) 
        at com.pmease.quickbuild.plugin.basis.CommandBuildStep.run(CommandBuildStep.java:231) 
        at com.pmease.quickbuild.plugin.basis.CommandBuildStep$$EnhancerByCGLIB$$6e9eb833.CGLIB$run$0(<generated>) 
        at com.pmease.quickbuild.plugin.basis.CommandBuildStep$$EnhancerByCGLIB$$6e9eb833$$FastClassByCGLIB$$c8319d74.invoke(<generated>) 
        at net.sf.cglib.proxy.MethodProxy.invokeSuper(MethodProxy.java:215) 
        at com.pmease.quickbuild.DefaultScriptEngine$Interpolator.intercept(DefaultScriptEngine.java:261) 
        at com.pmease.quickbuild.plugin.basis.CommandBuildStep$$EnhancerByCGLIB$$6e9eb833.run(<generated>) 
        at com.pmease.quickbuild.stepsupport.Step.doExecute(Step.java:654) 
        at com.pmease.quickbuild.stepsupport.Step.execute(Step.java:568) 
        at com.pmease.quickbuild.stepsupport.StepExecutionJob.executeStepAwareJob(StepExecutionJob.java:31) 
        at com.pmease.quickbuild.stepsupport.StepAwareJob.executeBuildAwareJob(StepAwareJob.java:56) 
        at com.pmease.quickbuild.BuildAwareJob.execute(BuildAwareJob.java:77) 
        at com.pmease.quickbuild.grid.GridJob.run(GridJob.java:129) 
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) 
        at java.util.concurrent.FutureTask.run(Unknown Source) 
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) 
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) 
        at java.lang.Thread.run(Unknown Source) 
    Caused by: java.lang.InterruptedException 
        at java.lang.ProcessImpl.waitFor(Unknown Source) 
        at com.pmease.quickbuild.execution.Commandline.execute(Commandline.java:357) 
        ... 17 more

I don't suppose this general error log would mean anything to you?
I just hope you guys have seen it and know what it might be based on your experience.

Thanks very much

robinshen ADMIN ·

Can you please take a backup of your database via QB administration menu and send it to [robin AT pmease DOT com] for diagnostics?

OnQuickBuild ·

Hi Robin

I'm afraid I'm not allowed to do that.

You told me to test with sleep commands. I'm already using a Python script that waits for certain tasks to finish. To slow down CPU consumption, I used Python time.sleep() function.
Could you tell me if Quickbuild tries to cancel a build if this command is used too frequently?

The thing is the same Python script has been used for about a year and never had such problem, not once; but now our product moves to revision 2 dev, this unknown cancelling starts to occur.

Thank you!

robinshen ADMIN ·

We just fixed an issue of unexpected build cancelling when number of workers are specified in parallel steps. If this is your case, please upgrade to QB 9.0.4. For 8.0.x, the issue was fixed since 8.0.32

OnQuickBuild ·

Our version is 9.0.0, 2019-02-28

And no the cancelling step for me is shell/batch and it's in a sequential wrapper.
The sequential wrapper is in a parallel wrapper but currently there's only one resource so there's only one running at all times.
Besides, our resource is defined based on hardware setup (separate laptops)

Thanks anyway

robinshen ADMIN ·

Please upgrade to 9.0.4 and if the problem still exists, let's schedule an online meeting to check the problem.

OnQuickBuild ·

Hi Robin

I'll suggest my manager to update. In the meantime, we increased the logging level to debug to see if there are any clues. While we are at it, the canceling issue (sort of) disappeared, but a new problem occurred. JVM (heap) memory would max out and throw Java exception

java.lang.RuntimeException: java.lang.OutOfMemoryError: GC overhead limit exceeded 
        at com.pmease.quickbuild.bootstrap.BootstrapUtils.wrapAsUnchecked(BootstrapUtils.java:56) 
        at com.pmease.quickbuild.util.ExceptionUtils.wrapAsUnchecked(ExceptionUtils.java:82) 
        at com.pmease.quickbuild.stepsupport.Step.doExecute(Step.java:666) 
        at com.pmease.quickbuild.stepsupport.Step.execute(Step.java:568) 
        at com.pmease.quickbuild.stepsupport.StepExecutionJob.executeStepAwareJob(StepExecutionJob.java:31) 
        at com.pmease.quickbuild.stepsupport.StepAwareJob.executeBuildAwareJob(StepAwareJob.java:56) 
        at com.pmease.quickbuild.BuildAwareJob.execute(BuildAwareJob.java:77) 
        at com.pmease.quickbuild.grid.GridJob.run(GridJob.java:129) 
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) 
        at java.util.concurrent.FutureTask.run(Unknown Source) 
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) 
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) 
        at java.lang.Thread.run(Unknown Source) 
    Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded

There is a sequential step that repeats 3 times. It happened 3 times in one build. Later builds finished but took very long time due to high memory usage. GUI indicates that a sequential step is finished but no further step is being executed. The parent step is still yellow when all child steps are finished. The build just kept on halt like this for a considerable amount of time. (sometimes 30 minutes)
This is when the agent JVM memory is set to 10240 MB. In the grid it will show the agent's JVM memory is near 100%, making the build very slow.
I saw a similar post (https://support.pmease.com/PMEase/QuickBuild/topics/3725/java-lang-outofmemoryerror-java-heap-space?7) in the forum about JVM memory. I made sure I don't have a custom groovy script that fetches all builds/configs whatsoever. Besides, in the first build that failed 3 times, it threw this exception in different steps. So I don't know how to identify the problem from there.

My question is could you introduce some ways to debug this issue? How can I find out the cause of it?
I haven't yet found anything useful in the log (server or agent). Still searching...

Thank you!

robinshen ADMIN ·

The OutOfMemoryError issue may explain all odd things you've encountered, as thread execution may exit unexpectedly upon such error. Let's solve it first. While the memory of the agent is running high, please get a memory dump by running below command as the user running the agent process:

/path/to/jdk/bin/jmap -dump:format=b,file=/tmp/heapdump.bin <QB agent JVM process id> (not the wrapper process id)

After getting the dump you may analyze it with Yourkit trial version (yourkit.com) to see what is occupying most of the memory.

OnQuickBuild ·

Hey Robin

Thanks. "not the wrapper process id". Then what pid should I specify?

I get it. Quickbuild build agent service PID. Correct?

robinshen ADMIN ·

QB (either agent or server) will spawn two processes, one for wrapper (on Linux it is normally wrapper-linux-x86-64), and another for JVM process. We should dump memory of JVM process.