Error reporting on deployment failure

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

Error reporting on deployment failure

Stuart Douglas
Hi Everyone,

I have been thinking a bit about the way we report errors in WildFly, and I think this is something that we can improve on. At the moment I think we are way to liberal with what we report, which results in a ton of services being listed in the error report that have nothing to do with the actual failure.

As an example to work from I have created [1], which is a simple EJB application. This consists of 10 EJB's, one of which has a reference to a non-existant data source, the rest are simply empty no-op EJB's (just @Stateless on an empty class).

This app fails to deploy because the java:global/NonExistant data source is missing, which gives the failure description in [2]. This is ~120 lines long and lists multiple services for every single component in the application (part of the reason this is so long is because the failures are reported twice, once when the deployment fails and once when the server starts).

I think we can improve on this. I think in every failure case there will be some root causes that are all the end user cares about, and we should limit our reporting to just these cases, rather than listing every internal service that can no longer start due to missing transitive deps.

In particular these root causes are:
1) A service threw and exception in its start() method and failed to start
2) A dependency is actually missing (i.e. not installed, not just not started)

I think that one or both of these two cases will be the root cause of any failure, and as such that is all we should be reporting on.

We already do an OK job of handing case 1), services that have failed, as they get their own line item in the error report, however case 2) results in a huge report that lists every service that has not come up, no matter how far removed they are from the actual problem.

I think we could make a change to the way this is reported so that only direct problems are reported [3], so the error report would look something like [4] (note that this commit only changes the operation report, the container state reporting after boot is still quite verbose).

I am guessing that this is not as simple as it sounds, otherwise it would have already been addressed, but I think we can do better that the current state of affairs so I thought I would get a discussion started.

Stuart


_______________________________________________
wildfly-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/wildfly-dev
Reply | Threaded
Open this post in threaded view
|

Re: Error reporting on deployment failure

Brian Stansberry
On Tue, Feb 13, 2018 at 8:24 PM, Stuart Douglas <[hidden email]> wrote:
Hi Everyone,

I have been thinking a bit about the way we report errors in WildFly, and I think this is something that we can improve on. At the moment I think we are way to liberal with what we report, which results in a ton of services being listed in the error report that have nothing to do with the actual failure.

As an example to work from I have created [1], which is a simple EJB application. This consists of 10 EJB's, one of which has a reference to a non-existant data source, the rest are simply empty no-op EJB's (just @Stateless on an empty class).

This app fails to deploy because the java:global/NonExistant data source is missing, which gives the failure description in [2]. This is ~120 lines long and lists multiple services for every single component in the application (part of the reason this is so long is because the failures are reported twice, once when the deployment fails and once when the server starts).

I think we can improve on this. I think in every failure case there will be some root causes that are all the end user cares about, and we should limit our reporting to just these cases, rather than listing every internal service that can no longer start due to missing transitive deps.

In particular these root causes are:
1) A service threw and exception in its start() method and failed to start
2) A dependency is actually missing (i.e. not installed, not just not started)

I think that one or both of these two cases will be the root cause of any failure, and as such that is all we should be reporting on.

We already do an OK job of handing case 1), services that have failed, as they get their own line item in the error report, however case 2) results in a huge report that lists every service that has not come up, no matter how far removed they are from the actual problem.

If the 2) case can be correctly determined, then +1 to reporting some new section and not reporting the current "WFLYCTL0180: Services with missing/unavailable dependencies" section. The WFLYCTL0180 section could only be reported as a fallback if for some reason the 1) and 2) stuff is empty.
 

I think we could make a change to the way this is reported so that only direct problems are reported [3], so the error report would look something like [4] (note that this commit only changes the operation report, the container state reporting after boot is still quite verbose).

I think the container state reporting is ok. IMHO the proper fix to the container state reporting is to rollback and fail boot if Stage.RUNTIME failures occur. Configurable, but rollback by default. If we did that there would be no container state reporting. If you deploy your broken app post-boot you shouldn't see the container state reporting because by the time the report is written the op should have rolled back and the services are no longer "missing". It's only because we don't rollback on boot that this is reported.
 

I am guessing that this is not as simple as it sounds, otherwise it would have already been addressed, but I think we can do better that the current state of affairs so I thought I would get a discussion started.

It sounds pretty simple. Any "problem" ServiceController exposes its ServiceContainer, and if relying on that registry to check if a missing dependency is installed is not correct for some reason, the ModelControllerImpl exposes its ServiceRegistry via a package protected getter. So AbstractOperationContext can provide that to the SVH.


_______________________________________________
wildfly-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/wildfly-dev



--
Brian Stansberry
Manager, Senior Principal Software Engineer
Red Hat

_______________________________________________
wildfly-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/wildfly-dev
Reply | Threaded
Open this post in threaded view
|

Re: Error reporting on deployment failure

Stuart Douglas


On Wed, Feb 14, 2018 at 4:43 PM, Brian Stansberry <[hidden email]> wrote:
On Tue, Feb 13, 2018 at 8:24 PM, Stuart Douglas <[hidden email]> wrote:
Hi Everyone,

I have been thinking a bit about the way we report errors in WildFly, and I think this is something that we can improve on. At the moment I think we are way to liberal with what we report, which results in a ton of services being listed in the error report that have nothing to do with the actual failure.

As an example to work from I have created [1], which is a simple EJB application. This consists of 10 EJB's, one of which has a reference to a non-existant data source, the rest are simply empty no-op EJB's (just @Stateless on an empty class).

This app fails to deploy because the java:global/NonExistant data source is missing, which gives the failure description in [2]. This is ~120 lines long and lists multiple services for every single component in the application (part of the reason this is so long is because the failures are reported twice, once when the deployment fails and once when the server starts).

I think we can improve on this. I think in every failure case there will be some root causes that are all the end user cares about, and we should limit our reporting to just these cases, rather than listing every internal service that can no longer start due to missing transitive deps.

In particular these root causes are:
1) A service threw and exception in its start() method and failed to start
2) A dependency is actually missing (i.e. not installed, not just not started)

I think that one or both of these two cases will be the root cause of any failure, and as such that is all we should be reporting on.

We already do an OK job of handing case 1), services that have failed, as they get their own line item in the error report, however case 2) results in a huge report that lists every service that has not come up, no matter how far removed they are from the actual problem.

If the 2) case can be correctly determined, then +1 to reporting some new section and not reporting the current "WFLYCTL0180: Services with missing/unavailable dependencies" section. The WFLYCTL0180 section could only be reported as a fallback if for some reason the 1) and 2) stuff is empty.

I have adjusted this a bit so a service with mode NEVER is treated the same as if it is missing. I am pretty sure that with this change 1) and 2) will cover 100% of cases.

 
 

I think we could make a change to the way this is reported so that only direct problems are reported [3], so the error report would look something like [4] (note that this commit only changes the operation report, the container state reporting after boot is still quite verbose).

I think the container state reporting is ok. IMHO the proper fix to the container state reporting is to rollback and fail boot if Stage.RUNTIME failures occur. Configurable, but rollback by default. If we did that there would be no container state reporting. If you deploy your broken app post-boot you shouldn't see the container state reporting because by the time the report is written the op should have rolled back and the services are no longer "missing". It's only because we don't rollback on boot that this is reported.

I don't think it is nessesary to report on services that are only down because their dependents are down. It basically just adds noise, as they are not really related to the underlying issue. I have expanded my branch to also do this:

 
This ends up with very concise reports that just detail the services that are the root cause of the problem: https://gist.github.com/stuartwdouglas/42a68aaaa130ceee38ca5f66d0040de3

Does this approach seem reasonable? lf a user really does want a complete dump of all services that are down that information is still available directly from MSC anyway.

Stuart

 

I am guessing that this is not as simple as it sounds, otherwise it would have already been addressed, but I think we can do better that the current state of affairs so I thought I would get a discussion started.

It sounds pretty simple. Any "problem" ServiceController exposes its ServiceContainer, and if relying on that registry to check if a missing dependency is installed is not correct for some reason, the ModelControllerImpl exposes its ServiceRegistry via a package protected getter. So AbstractOperationContext can provide that to the SVH.


_______________________________________________
wildfly-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/wildfly-dev



--
Brian Stansberry
Manager, Senior Principal Software Engineer
Red Hat


_______________________________________________
wildfly-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/wildfly-dev
Reply | Threaded
Open this post in threaded view
|

Re: Error reporting on deployment failure

Bob McWhirter
Agreed. I’ve had to track giant errors from A to B to C etc only to figure out Z was missing. 

On Wed, Feb 14, 2018 at 10:38 PM Stuart Douglas <[hidden email]> wrote:
On Wed, Feb 14, 2018 at 4:43 PM, Brian Stansberry <[hidden email]> wrote:
On Tue, Feb 13, 2018 at 8:24 PM, Stuart Douglas <[hidden email]> wrote:
Hi Everyone,

I have been thinking a bit about the way we report errors in WildFly, and I think this is something that we can improve on. At the moment I think we are way to liberal with what we report, which results in a ton of services being listed in the error report that have nothing to do with the actual failure.

As an example to work from I have created [1], which is a simple EJB application. This consists of 10 EJB's, one of which has a reference to a non-existant data source, the rest are simply empty no-op EJB's (just @Stateless on an empty class).

This app fails to deploy because the java:global/NonExistant data source is missing, which gives the failure description in [2]. This is ~120 lines long and lists multiple services for every single component in the application (part of the reason this is so long is because the failures are reported twice, once when the deployment fails and once when the server starts).

I think we can improve on this. I think in every failure case there will be some root causes that are all the end user cares about, and we should limit our reporting to just these cases, rather than listing every internal service that can no longer start due to missing transitive deps.

In particular these root causes are:
1) A service threw and exception in its start() method and failed to start
2) A dependency is actually missing (i.e. not installed, not just not started)

I think that one or both of these two cases will be the root cause of any failure, and as such that is all we should be reporting on.

We already do an OK job of handing case 1), services that have failed, as they get their own line item in the error report, however case 2) results in a huge report that lists every service that has not come up, no matter how far removed they are from the actual problem.

If the 2) case can be correctly determined, then +1 to reporting some new section and not reporting the current "WFLYCTL0180: Services with missing/unavailable dependencies" section. The WFLYCTL0180 section could only be reported as a fallback if for some reason the 1) and 2) stuff is empty.

I have adjusted this a bit so a service with mode NEVER is treated the same as if it is missing. I am pretty sure that with this change 1) and 2) will cover 100% of cases.

 
 

I think we could make a change to the way this is reported so that only direct problems are reported [3], so the error report would look something like [4] (note that this commit only changes the operation report, the container state reporting after boot is still quite verbose).

I think the container state reporting is ok. IMHO the proper fix to the container state reporting is to rollback and fail boot if Stage.RUNTIME failures occur. Configurable, but rollback by default. If we did that there would be no container state reporting. If you deploy your broken app post-boot you shouldn't see the container state reporting because by the time the report is written the op should have rolled back and the services are no longer "missing". It's only because we don't rollback on boot that this is reported.

I don't think it is nessesary to report on services that are only down because their dependents are down. It basically just adds noise, as they are not really related to the underlying issue. I have expanded my branch to also do this:

 
This ends up with very concise reports that just detail the services that are the root cause of the problem: https://gist.github.com/stuartwdouglas/42a68aaaa130ceee38ca5f66d0040de3

Does this approach seem reasonable? lf a user really does want a complete dump of all services that are down that information is still available directly from MSC anyway.

Stuart

 

I am guessing that this is not as simple as it sounds, otherwise it would have already been addressed, but I think we can do better that the current state of affairs so I thought I would get a discussion started.

It sounds pretty simple. Any "problem" ServiceController exposes its ServiceContainer, and if relying on that registry to check if a missing dependency is installed is not correct for some reason, the ModelControllerImpl exposes its ServiceRegistry via a package protected getter. So AbstractOperationContext can provide that to the SVH.


_______________________________________________
wildfly-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/wildfly-dev



--
Brian Stansberry
Manager, Senior Principal Software Engineer
Red Hat
_______________________________________________
wildfly-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/wildfly-dev

_______________________________________________
wildfly-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/wildfly-dev
Reply | Threaded
Open this post in threaded view
|

Re: Error reporting on deployment failure

Tomaž Cerar-2
In reply to this post by Stuart Douglas
Hey,

One of things we did talk about at f2f, but never got into details,
that would help with this is adding capabilities to deployments.

This way one failure you would get error message with telling you what capability is not available.
for example, that datasource is missing and where you can define it.
Or datasource defined at address xyz is in error so you know where to look to fix it.

For start we should need to expose capability registry and few other things to DUPs
and continue from there.

--
tomaz

On Thu, Feb 15, 2018 at 4:37 AM, Stuart Douglas <[hidden email]> wrote:


On Wed, Feb 14, 2018 at 4:43 PM, Brian Stansberry <[hidden email]> wrote:
On Tue, Feb 13, 2018 at 8:24 PM, Stuart Douglas <[hidden email]> wrote:
Hi Everyone,

I have been thinking a bit about the way we report errors in WildFly, and I think this is something that we can improve on. At the moment I think we are way to liberal with what we report, which results in a ton of services being listed in the error report that have nothing to do with the actual failure.

As an example to work from I have created [1], which is a simple EJB application. This consists of 10 EJB's, one of which has a reference to a non-existant data source, the rest are simply empty no-op EJB's (just @Stateless on an empty class).

This app fails to deploy because the java:global/NonExistant data source is missing, which gives the failure description in [2]. This is ~120 lines long and lists multiple services for every single component in the application (part of the reason this is so long is because the failures are reported twice, once when the deployment fails and once when the server starts).

I think we can improve on this. I think in every failure case there will be some root causes that are all the end user cares about, and we should limit our reporting to just these cases, rather than listing every internal service that can no longer start due to missing transitive deps.

In particular these root causes are:
1) A service threw and exception in its start() method and failed to start
2) A dependency is actually missing (i.e. not installed, not just not started)

I think that one or both of these two cases will be the root cause of any failure, and as such that is all we should be reporting on.

We already do an OK job of handing case 1), services that have failed, as they get their own line item in the error report, however case 2) results in a huge report that lists every service that has not come up, no matter how far removed they are from the actual problem.

If the 2) case can be correctly determined, then +1 to reporting some new section and not reporting the current "WFLYCTL0180: Services with missing/unavailable dependencies" section. The WFLYCTL0180 section could only be reported as a fallback if for some reason the 1) and 2) stuff is empty.

I have adjusted this a bit so a service with mode NEVER is treated the same as if it is missing. I am pretty sure that with this change 1) and 2) will cover 100% of cases.

 
 

I think we could make a change to the way this is reported so that only direct problems are reported [3], so the error report would look something like [4] (note that this commit only changes the operation report, the container state reporting after boot is still quite verbose).

I think the container state reporting is ok. IMHO the proper fix to the container state reporting is to rollback and fail boot if Stage.RUNTIME failures occur. Configurable, but rollback by default. If we did that there would be no container state reporting. If you deploy your broken app post-boot you shouldn't see the container state reporting because by the time the report is written the op should have rolled back and the services are no longer "missing". It's only because we don't rollback on boot that this is reported.

I don't think it is nessesary to report on services that are only down because their dependents are down. It basically just adds noise, as they are not really related to the underlying issue. I have expanded my branch to also do this:

 
This ends up with very concise reports that just detail the services that are the root cause of the problem: https://gist.github.com/stuartwdouglas/42a68aaaa130ceee38ca5f66d0040de3

Does this approach seem reasonable? lf a user really does want a complete dump of all services that are down that information is still available directly from MSC anyway.

Stuart

 

I am guessing that this is not as simple as it sounds, otherwise it would have already been addressed, but I think we can do better that the current state of affairs so I thought I would get a discussion started.

It sounds pretty simple. Any "problem" ServiceController exposes its ServiceContainer, and if relying on that registry to check if a missing dependency is installed is not correct for some reason, the ModelControllerImpl exposes its ServiceRegistry via a package protected getter. So AbstractOperationContext can provide that to the SVH.


_______________________________________________
wildfly-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/wildfly-dev



--
Brian Stansberry
Manager, Senior Principal Software Engineer
Red Hat


_______________________________________________
wildfly-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/wildfly-dev


_______________________________________________
wildfly-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/wildfly-dev
Reply | Threaded
Open this post in threaded view
|

Re: Error reporting on deployment failure

Brian Stansberry
In reply to this post by Stuart Douglas
On Wed, Feb 14, 2018 at 9:37 PM, Stuart Douglas <[hidden email]> wrote:


On Wed, Feb 14, 2018 at 4:43 PM, Brian Stansberry <[hidden email]> wrote:
On Tue, Feb 13, 2018 at 8:24 PM, Stuart Douglas <[hidden email]> wrote:
Hi Everyone,

I have been thinking a bit about the way we report errors in WildFly, and I think this is something that we can improve on. At the moment I think we are way to liberal with what we report, which results in a ton of services being listed in the error report that have nothing to do with the actual failure.

As an example to work from I have created [1], which is a simple EJB application. This consists of 10 EJB's, one of which has a reference to a non-existant data source, the rest are simply empty no-op EJB's (just @Stateless on an empty class).

This app fails to deploy because the java:global/NonExistant data source is missing, which gives the failure description in [2]. This is ~120 lines long and lists multiple services for every single component in the application (part of the reason this is so long is because the failures are reported twice, once when the deployment fails and once when the server starts).

I think we can improve on this. I think in every failure case there will be some root causes that are all the end user cares about, and we should limit our reporting to just these cases, rather than listing every internal service that can no longer start due to missing transitive deps.

In particular these root causes are:
1) A service threw and exception in its start() method and failed to start
2) A dependency is actually missing (i.e. not installed, not just not started)

I think that one or both of these two cases will be the root cause of any failure, and as such that is all we should be reporting on.

We already do an OK job of handing case 1), services that have failed, as they get their own line item in the error report, however case 2) results in a huge report that lists every service that has not come up, no matter how far removed they are from the actual problem.

If the 2) case can be correctly determined, then +1 to reporting some new section and not reporting the current "WFLYCTL0180: Services with missing/unavailable dependencies" section. The WFLYCTL0180 section could only be reported as a fallback if for some reason the 1) and 2) stuff is empty.

I have adjusted this a bit so a service with mode NEVER is treated the same as if it is missing. I am pretty sure that with this change 1) and 2) will cover 100% of cases.

 
 

I think we could make a change to the way this is reported so that only direct problems are reported [3], so the error report would look something like [4] (note that this commit only changes the operation report, the container state reporting after boot is still quite verbose).

I think the container state reporting is ok. IMHO the proper fix to the container state reporting is to rollback and fail boot if Stage.RUNTIME failures occur. Configurable, but rollback by default. If we did that there would be no container state reporting. If you deploy your broken app post-boot you shouldn't see the container state reporting because by the time the report is written the op should have rolled back and the services are no longer "missing". It's only because we don't rollback on boot that this is reported.

I don't think it is nessesary to report on services that are only down because their dependents are down. It basically just adds noise, as they are not really related to the underlying issue. I have expanded my branch to also do this:

 
This ends up with very concise reports that just detail the services that are the root cause of the problem: https://gist.github.com/stuartwdouglas/42a68aaaa130ceee38ca5f66d0040de3

Does this approach seem reasonable? lf a user really does want a complete dump of all services that are down that information is still available directly from MSC anyway.

It seems reasonable.

I'm going to get all lawyerly now. This is because while we don't treat our failure messages as "API" requiring compatibility, for these particular ones I think we should be as careful as possible.

1)  "WFLYCTL0180: Services with missing/unavailable dependencies" => ["jboss.naming.context.java.comp.\"error-reporting-1.0-SNAPSHOT\".\"error-reporting-1.0-SNAPSHOT\".ErrorEjb.env.\"com.stuartdouglas.ErrorEjb\".nonExistant is missing [jboss.naming.context.java.global.NonExistant]"]

Here you've somewhat repurposed an existing message. That can be quite ok IMHO so long as what's gone is just noise and the English meaning of the message is still correct. Basically, what did "missing/unavailable dependencies" mean before, what does it mean now, and is there a clear story behind the shift from A to B.  The "missing" part is pretty clear -- not installed or NEVER is "missing". For "unavailable" now we've dropped the installed but unstarted ones. If we're including the ones that directly depend on *failed* services then that's a coherent definition of "unavailable". If we're not then "unavailable" is misleading. Sorry, I'm juggling stuff so I haven't checked the code. :(

2) I think "38 additional services are down due to their dependencies being missing or failed" should have a message code, not NONE. It's a separate message that may or may not appear. Plus it's new. And I think we're better off in these complex message structures to be precise vs trying to avoid codes for cosmetic reasons.



Stuart

 

I am guessing that this is not as simple as it sounds, otherwise it would have already been addressed, but I think we can do better that the current state of affairs so I thought I would get a discussion started.

It sounds pretty simple. Any "problem" ServiceController exposes its ServiceContainer, and if relying on that registry to check if a missing dependency is installed is not correct for some reason, the ModelControllerImpl exposes its ServiceRegistry via a package protected getter. So AbstractOperationContext can provide that to the SVH.


_______________________________________________
wildfly-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/wildfly-dev



--
Brian Stansberry
Manager, Senior Principal Software Engineer
Red Hat




--
Brian Stansberry
Manager, Senior Principal Software Engineer
Red Hat

_______________________________________________
wildfly-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/wildfly-dev
Reply | Threaded
Open this post in threaded view
|

Re: Error reporting on deployment failure

Stuart Douglas


On Thu, Feb 15, 2018 at 6:51 PM, Brian Stansberry <[hidden email]> wrote:
On Wed, Feb 14, 2018 at 9:37 PM, Stuart Douglas <[hidden email]> wrote:


On Wed, Feb 14, 2018 at 4:43 PM, Brian Stansberry <[hidden email]> wrote:
On Tue, Feb 13, 2018 at 8:24 PM, Stuart Douglas <[hidden email]> wrote:
Hi Everyone,

I have been thinking a bit about the way we report errors in WildFly, and I think this is something that we can improve on. At the moment I think we are way to liberal with what we report, which results in a ton of services being listed in the error report that have nothing to do with the actual failure.

As an example to work from I have created [1], which is a simple EJB application. This consists of 10 EJB's, one of which has a reference to a non-existant data source, the rest are simply empty no-op EJB's (just @Stateless on an empty class).

This app fails to deploy because the java:global/NonExistant data source is missing, which gives the failure description in [2]. This is ~120 lines long and lists multiple services for every single component in the application (part of the reason this is so long is because the failures are reported twice, once when the deployment fails and once when the server starts).

I think we can improve on this. I think in every failure case there will be some root causes that are all the end user cares about, and we should limit our reporting to just these cases, rather than listing every internal service that can no longer start due to missing transitive deps.

In particular these root causes are:
1) A service threw and exception in its start() method and failed to start
2) A dependency is actually missing (i.e. not installed, not just not started)

I think that one or both of these two cases will be the root cause of any failure, and as such that is all we should be reporting on.

We already do an OK job of handing case 1), services that have failed, as they get their own line item in the error report, however case 2) results in a huge report that lists every service that has not come up, no matter how far removed they are from the actual problem.

If the 2) case can be correctly determined, then +1 to reporting some new section and not reporting the current "WFLYCTL0180: Services with missing/unavailable dependencies" section. The WFLYCTL0180 section could only be reported as a fallback if for some reason the 1) and 2) stuff is empty.

I have adjusted this a bit so a service with mode NEVER is treated the same as if it is missing. I am pretty sure that with this change 1) and 2) will cover 100% of cases.

 
 

I think we could make a change to the way this is reported so that only direct problems are reported [3], so the error report would look something like [4] (note that this commit only changes the operation report, the container state reporting after boot is still quite verbose).

I think the container state reporting is ok. IMHO the proper fix to the container state reporting is to rollback and fail boot if Stage.RUNTIME failures occur. Configurable, but rollback by default. If we did that there would be no container state reporting. If you deploy your broken app post-boot you shouldn't see the container state reporting because by the time the report is written the op should have rolled back and the services are no longer "missing". It's only because we don't rollback on boot that this is reported.

I don't think it is nessesary to report on services that are only down because their dependents are down. It basically just adds noise, as they are not really related to the underlying issue. I have expanded my branch to also do this:

 
This ends up with very concise reports that just detail the services that are the root cause of the problem: https://gist.github.com/stuartwdouglas/42a68aaaa130ceee38ca5f66d0040de3

Does this approach seem reasonable? lf a user really does want a complete dump of all services that are down that information is still available directly from MSC anyway.

It seems reasonable.

I'm going to get all lawyerly now. This is because while we don't treat our failure messages as "API" requiring compatibility, for these particular ones I think we should be as careful as possible.

1)  "WFLYCTL0180: Services with missing/unavailable dependencies" => ["jboss.naming.context.java.comp.\"error-reporting-1.0-SNAPSHOT\".\"error-reporting-1.0-SNAPSHOT\".ErrorEjb.env.\"com.stuartdouglas.ErrorEjb\".nonExistant is missing [jboss.naming.context.java.global.NonExistant]"]

Here you've somewhat repurposed an existing message. That can be quite ok IMHO so long as what's gone is just noise and the English meaning of the message is still correct. Basically, what did "missing/unavailable dependencies" mean before, what does it mean now, and is there a clear story behind the shift from A to B.  The "missing" part is pretty clear -- not installed or NEVER is "missing". For "unavailable" now we've dropped the installed but unstarted ones. If we're including the ones that directly depend on *failed* services then that's a coherent definition of "unavailable". If we're not then "unavailable" is misleading. Sorry, I'm juggling stuff so I haven't checked the code. :(

Previously this section would display every service that was down due to its dependencies being down. This would include services that were many levels away from the actual problem (e.g. if A depends on B which depends on C which depends on D which is down, A, B and C would all be listed in this section). This change displays the same information, but only for direct dependents, so in the example about only C would be listed in this section.

The 'New missing/unsatisfied dependencies:' section in the container state report is similar. Previously it would list every service that had failed to come up, now it will only list services that are directly affected by a problem.
 

2) I think "38 additional services are down due to their dependencies being missing or failed" should have a message code, not NONE. It's a separate message that may or may not appear. Plus it's new. And I think we're better off in these complex message structures to be precise vs trying to avoid codes for cosmetic reasons.

Ok.

Stuart
 



Stuart

 

I am guessing that this is not as simple as it sounds, otherwise it would have already been addressed, but I think we can do better that the current state of affairs so I thought I would get a discussion started.

It sounds pretty simple. Any "problem" ServiceController exposes its ServiceContainer, and if relying on that registry to check if a missing dependency is installed is not correct for some reason, the ModelControllerImpl exposes its ServiceRegistry via a package protected getter. So AbstractOperationContext can provide that to the SVH.


_______________________________________________
wildfly-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/wildfly-dev



--
Brian Stansberry
Manager, Senior Principal Software Engineer
Red Hat




--
Brian Stansberry
Manager, Senior Principal Software Engineer
Red Hat


_______________________________________________
wildfly-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/wildfly-dev
Reply | Threaded
Open this post in threaded view
|

Re: Error reporting on deployment failure

Stuart Douglas
I have opened https://github.com/wildfly/wildfly-core/pull/3114 to allow for testing/further review.

Stuart

On Thu, Feb 15, 2018 at 11:32 PM, Stuart Douglas <[hidden email]> wrote:


On Thu, Feb 15, 2018 at 6:51 PM, Brian Stansberry <[hidden email]> wrote:
On Wed, Feb 14, 2018 at 9:37 PM, Stuart Douglas <[hidden email]> wrote:


On Wed, Feb 14, 2018 at 4:43 PM, Brian Stansberry <[hidden email]> wrote:
On Tue, Feb 13, 2018 at 8:24 PM, Stuart Douglas <[hidden email]> wrote:
Hi Everyone,

I have been thinking a bit about the way we report errors in WildFly, and I think this is something that we can improve on. At the moment I think we are way to liberal with what we report, which results in a ton of services being listed in the error report that have nothing to do with the actual failure.

As an example to work from I have created [1], which is a simple EJB application. This consists of 10 EJB's, one of which has a reference to a non-existant data source, the rest are simply empty no-op EJB's (just @Stateless on an empty class).

This app fails to deploy because the java:global/NonExistant data source is missing, which gives the failure description in [2]. This is ~120 lines long and lists multiple services for every single component in the application (part of the reason this is so long is because the failures are reported twice, once when the deployment fails and once when the server starts).

I think we can improve on this. I think in every failure case there will be some root causes that are all the end user cares about, and we should limit our reporting to just these cases, rather than listing every internal service that can no longer start due to missing transitive deps.

In particular these root causes are:
1) A service threw and exception in its start() method and failed to start
2) A dependency is actually missing (i.e. not installed, not just not started)

I think that one or both of these two cases will be the root cause of any failure, and as such that is all we should be reporting on.

We already do an OK job of handing case 1), services that have failed, as they get their own line item in the error report, however case 2) results in a huge report that lists every service that has not come up, no matter how far removed they are from the actual problem.

If the 2) case can be correctly determined, then +1 to reporting some new section and not reporting the current "WFLYCTL0180: Services with missing/unavailable dependencies" section. The WFLYCTL0180 section could only be reported as a fallback if for some reason the 1) and 2) stuff is empty.

I have adjusted this a bit so a service with mode NEVER is treated the same as if it is missing. I am pretty sure that with this change 1) and 2) will cover 100% of cases.

 
 

I think we could make a change to the way this is reported so that only direct problems are reported [3], so the error report would look something like [4] (note that this commit only changes the operation report, the container state reporting after boot is still quite verbose).

I think the container state reporting is ok. IMHO the proper fix to the container state reporting is to rollback and fail boot if Stage.RUNTIME failures occur. Configurable, but rollback by default. If we did that there would be no container state reporting. If you deploy your broken app post-boot you shouldn't see the container state reporting because by the time the report is written the op should have rolled back and the services are no longer "missing". It's only because we don't rollback on boot that this is reported.

I don't think it is nessesary to report on services that are only down because their dependents are down. It basically just adds noise, as they are not really related to the underlying issue. I have expanded my branch to also do this:

 
This ends up with very concise reports that just detail the services that are the root cause of the problem: https://gist.github.com/stuartwdouglas/42a68aaaa130ceee38ca5f66d0040de3

Does this approach seem reasonable? lf a user really does want a complete dump of all services that are down that information is still available directly from MSC anyway.

It seems reasonable.

I'm going to get all lawyerly now. This is because while we don't treat our failure messages as "API" requiring compatibility, for these particular ones I think we should be as careful as possible.

1)  "WFLYCTL0180: Services with missing/unavailable dependencies" => ["jboss.naming.context.java.comp.\"error-reporting-1.0-SNAPSHOT\".\"error-reporting-1.0-SNAPSHOT\".ErrorEjb.env.\"com.stuartdouglas.ErrorEjb\".nonExistant is missing [jboss.naming.context.java.global.NonExistant]"]

Here you've somewhat repurposed an existing message. That can be quite ok IMHO so long as what's gone is just noise and the English meaning of the message is still correct. Basically, what did "missing/unavailable dependencies" mean before, what does it mean now, and is there a clear story behind the shift from A to B.  The "missing" part is pretty clear -- not installed or NEVER is "missing". For "unavailable" now we've dropped the installed but unstarted ones. If we're including the ones that directly depend on *failed* services then that's a coherent definition of "unavailable". If we're not then "unavailable" is misleading. Sorry, I'm juggling stuff so I haven't checked the code. :(

Previously this section would display every service that was down due to its dependencies being down. This would include services that were many levels away from the actual problem (e.g. if A depends on B which depends on C which depends on D which is down, A, B and C would all be listed in this section). This change displays the same information, but only for direct dependents, so in the example about only C would be listed in this section.

The 'New missing/unsatisfied dependencies:' section in the container state report is similar. Previously it would list every service that had failed to come up, now it will only list services that are directly affected by a problem.
 

2) I think "38 additional services are down due to their dependencies being missing or failed" should have a message code, not NONE. It's a separate message that may or may not appear. Plus it's new. And I think we're better off in these complex message structures to be precise vs trying to avoid codes for cosmetic reasons.

Ok.

Stuart
 



Stuart

 

I am guessing that this is not as simple as it sounds, otherwise it would have already been addressed, but I think we can do better that the current state of affairs so I thought I would get a discussion started.

It sounds pretty simple. Any "problem" ServiceController exposes its ServiceContainer, and if relying on that registry to check if a missing dependency is installed is not correct for some reason, the ModelControllerImpl exposes its ServiceRegistry via a package protected getter. So AbstractOperationContext can provide that to the SVH.


_______________________________________________
wildfly-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/wildfly-dev



--
Brian Stansberry
Manager, Senior Principal Software Engineer
Red Hat




--
Brian Stansberry
Manager, Senior Principal Software Engineer
Red Hat



_______________________________________________
wildfly-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/wildfly-dev
Reply | Threaded
Open this post in threaded view
|

Re: Error reporting on deployment failure

Brad Maxwell
Yes, we had some bz/jiras opened about this before.  We get cases where customers application is failing and has pages upon pages of dependency errors and the customer cannot easily determine the issue.  And even support has difficultly, we usually try searching for common things like datasources or other JNDI references that might be missing, but I have seen several where it was not a datasource and took a while of tearing the apps apart to resolve.  It looks like there was some improvement in EAP 7.1 [2], but it sounds like Stuart's PR may be even better.

I found one example deployment on [1] that we could try and see what the logging looks like with the new PR.
I figure the service dump would show all of the failed dependencies in case there was a need to look at the others?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1283294
[2] https://issues.jboss.org/browse/JBEAP-5311

On 2/15/18 5:15 PM, Stuart Douglas wrote:
I have opened https://github.com/wildfly/wildfly-core/pull/3114 to allow for testing/further review.

Stuart

On Thu, Feb 15, 2018 at 11:32 PM, Stuart Douglas <[hidden email]> wrote:


On Thu, Feb 15, 2018 at 6:51 PM, Brian Stansberry <[hidden email]> wrote:
On Wed, Feb 14, 2018 at 9:37 PM, Stuart Douglas <[hidden email]> wrote:


On Wed, Feb 14, 2018 at 4:43 PM, Brian Stansberry <[hidden email]> wrote:
On Tue, Feb 13, 2018 at 8:24 PM, Stuart Douglas <[hidden email]> wrote:
Hi Everyone,

I have been thinking a bit about the way we report errors in WildFly, and I think this is something that we can improve on. At the moment I think we are way to liberal with what we report, which results in a ton of services being listed in the error report that have nothing to do with the actual failure.

As an example to work from I have created [1], which is a simple EJB application. This consists of 10 EJB's, one of which has a reference to a non-existant data source, the rest are simply empty no-op EJB's (just @Stateless on an empty class).

This app fails to deploy because the java:global/NonExistant data source is missing, which gives the failure description in [2]. This is ~120 lines long and lists multiple services for every single component in the application (part of the reason this is so long is because the failures are reported twice, once when the deployment fails and once when the server starts).

I think we can improve on this. I think in every failure case there will be some root causes that are all the end user cares about, and we should limit our reporting to just these cases, rather than listing every internal service that can no longer start due to missing transitive deps.

In particular these root causes are:
1) A service threw and exception in its start() method and failed to start
2) A dependency is actually missing (i.e. not installed, not just not started)

I think that one or both of these two cases will be the root cause of any failure, and as such that is all we should be reporting on.

We already do an OK job of handing case 1), services that have failed, as they get their own line item in the error report, however case 2) results in a huge report that lists every service that has not come up, no matter how far removed they are from the actual problem.

If the 2) case can be correctly determined, then +1 to reporting some new section and not reporting the current "WFLYCTL0180: Services with missing/unavailable dependencies" section. The WFLYCTL0180 section could only be reported as a fallback if for some reason the 1) and 2) stuff is empty.

I have adjusted this a bit so a service with mode NEVER is treated the same as if it is missing. I am pretty sure that with this change 1) and 2) will cover 100% of cases.

 
 

I think we could make a change to the way this is reported so that only direct problems are reported [3], so the error report would look something like [4] (note that this commit only changes the operation report, the container state reporting after boot is still quite verbose).

I think the container state reporting is ok. IMHO the proper fix to the container state reporting is to rollback and fail boot if Stage.RUNTIME failures occur. Configurable, but rollback by default. If we did that there would be no container state reporting. If you deploy your broken app post-boot you shouldn't see the container state reporting because by the time the report is written the op should have rolled back and the services are no longer "missing". It's only because we don't rollback on boot that this is reported.

I don't think it is nessesary to report on services that are only down because their dependents are down. It basically just adds noise, as they are not really related to the underlying issue. I have expanded my branch to also do this:

 
This ends up with very concise reports that just detail the services that are the root cause of the problem: https://gist.github.com/stuartwdouglas/42a68aaaa130ceee38ca5f66d0040de3

Does this approach seem reasonable? lf a user really does want a complete dump of all services that are down that information is still available directly from MSC anyway.

It seems reasonable.

I'm going to get all lawyerly now. This is because while we don't treat our failure messages as "API" requiring compatibility, for these particular ones I think we should be as careful as possible.

1)  "WFLYCTL0180: Services with missing/unavailable dependencies" => ["jboss.naming.context.java.comp.\"error-reporting-1.0-SNAPSHOT\".\"error-reporting-1.0-SNAPSHOT\".ErrorEjb.env.\"com.stuartdouglas.ErrorEjb\".nonExistant is missing [jboss.naming.context.java.global.NonExistant]"]

Here you've somewhat repurposed an existing message. That can be quite ok IMHO so long as what's gone is just noise and the English meaning of the message is still correct. Basically, what did "missing/unavailable dependencies" mean before, what does it mean now, and is there a clear story behind the shift from A to B.  The "missing" part is pretty clear -- not installed or NEVER is "missing". For "unavailable" now we've dropped the installed but unstarted ones. If we're including the ones that directly depend on *failed* services then that's a coherent definition of "unavailable". If we're not then "unavailable" is misleading. Sorry, I'm juggling stuff so I haven't checked the code. :(

Previously this section would display every service that was down due to its dependencies being down. This would include services that were many levels away from the actual problem (e.g. if A depends on B which depends on C which depends on D which is down, A, B and C would all be listed in this section). This change displays the same information, but only for direct dependents, so in the example about only C would be listed in this section.

The 'New missing/unsatisfied dependencies:' section in the container state report is similar. Previously it would list every service that had failed to come up, now it will only list services that are directly affected by a problem.
 

2) I think "38 additional services are down due to their dependencies being missing or failed" should have a message code, not NONE. It's a separate message that may or may not appear. Plus it's new. And I think we're better off in these complex message structures to be precise vs trying to avoid codes for cosmetic reasons.

Ok.

Stuart
 



Stuart

 

I am guessing that this is not as simple as it sounds, otherwise it would have already been addressed, but I think we can do better that the current state of affairs so I thought I would get a discussion started.

It sounds pretty simple. Any "problem" ServiceController exposes its ServiceContainer, and if relying on that registry to check if a missing dependency is installed is not correct for some reason, the ModelControllerImpl exposes its ServiceRegistry via a package protected getter. So AbstractOperationContext can provide that to the SVH.


_______________________________________________
wildfly-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/wildfly-dev



--
Brian Stansberry
Manager, Senior Principal Software Engineer
Red Hat




--
Brian Stansberry
Manager, Senior Principal Software Engineer
Red Hat




_______________________________________________
wildfly-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/wildfly-dev


_______________________________________________
wildfly-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/wildfly-dev