A look at Eclipse MicroProfile Healthcheck

Jeff Mesnil
Hi,

I had a look at the Eclipse MicroProfile Healthcheck spec[1] and wanted to share some thoughts and experiments about it: how it relates to WildFly, and its use in containers (such as OpenShift).

# Eclipse MicroProfile Healthcheck

The Eclipse MicroProfile Healthcheck (MPHC for short) is a specification for determining the healthiness of an application.
It defines a Health Check Procedure (HCP for short) interface that can be implemented by an application to determine its healthiness. It’s a single method that returns a Health Status: either UP or DOWN (plus some metadata).
Typically, an application would provide one or more HCPs to check the healthiness of its parts.
The overall healthiness of the application is determined by aggregating all the HCPs provided by the application. If any HCP is DOWN, the overall outcome is DOWN. Otherwise, the application is considered UP.
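
For illustration, a minimal HCP sketch (reusing the HealthResponse/HealthStatus names from my prototype shown below; the exact interface in the spec is still in flux) could look like:

    // Reports DOWN when the application's job backlog grows beyond a threshold.
    HealthStatus checkJobQueue(Queue<?> jobQueue) {
        int depth = jobQueue.size();
        HealthResponse response = HealthResponse.named("job-queue")
                .withAttribute("depth", depth);
        return depth < 1000 ? response.up() : response.down();
    }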

The MPHC spec has a companion document[2] that specifies an HTTP format to check the healthiness of an application.

Heiko is leading the spec, and Swarm provides the sample implementation for it (MicroProfile does not have the notion of a reference implementation).
The spec is still in flux, so we have a good opportunity to contribute to it and ensure that it meets our requirements and use cases.

# Use case

Using the HTTP endpoint, a container can ask an application whether it is healthy. If it is not healthy, the container could stop the application and respin a new instance.
For example, OpenShift/Kubernetes can configure liveness probes[3][4].

Supporting MPHC in WildFly would allow better integration with containers and ensure that any unhealthy WildFly process is restarted promptly.

# Prototype

I’ve written a prototype of a WildFly extension to support MPHC for applications deployed in WildFly *and* add health check procedures inside WildFly:

https://github.com/jmesnil/wildfly-microprofile-health

and it passes the MPHC TCK :)

The microprofile-health subsystem supports an operation to check the health of the app server:

[standalone@localhost:9990 /] /subsystem=microprofile-health:check
{
    "outcome" => "success",
    "result" => {
        "checks" => [{
            "id" => "heap-memory",
            "result" => "UP",
            "data" => {
                "max" => "477626368",
                "used" => "156216336"
            }
        }],
        "outcome" => "UP"
    }
}

It also exposes an (unauthenticated) HTTP endpoint:

$ curl http://localhost:8080/health/
{
   "checks":[
      {
         "id":"heap-memory",
         "result":"UP",
         "data":{
            "max":"477626368",
            "used":"160137128"
         }
      }
   ],
   "outcome":"UP"
}

This HTTP endpoint can be used by OpenShift for its liveness probe.
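
For reference, wiring this endpoint into a pod spec would look roughly like this (values are illustrative, not a recommendation):

    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 60
      timeoutSeconds: 1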

Any deployment that defines Health Check Procedures will have them registered to determine the overall healthiness of the process.

# WildFly health check procedures

The MPHC specification mainly targets user applications that can apply application logic to determine their healthiness.
However, I wonder if we could reuse the concepts *inside* WildFly. There are things that we could check to determine if the app server runtime is healthy, e.g.:
* The amount of heap memory is close to the max
* Some deployments have failed
* Excessive GC
* Running out of disk space

Subsystems inside WildFly could provide health check procedures that would be queried to check the overall healthiness.
We could for example provide a health check that verifies the used heap memory is less than 90% of the max:

        HealthCheck.install(context, "heap-memory", () -> {
            MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
            long memUsed = memoryBean.getHeapMemoryUsage().getUsed();
            long memMax = memoryBean.getHeapMemoryUsage().getMax();
            HealthResponse response = HealthResponse.named("heap-memory")
                    .withAttribute("used", memUsed)
                    .withAttribute("max", memMax);
            // Status is DOWN if used memory is greater than 90% of max memory.
            HealthStatus status = (memUsed < memMax * 0.9) ? response.up() : response.down();
            return status;
        });

HealthCheck.install creates an MSC service and makes sure that it is registered with the health monitor that queries all the procedures.
A subsystem would just have to call HealthCheck.install/uninstall with a health check procedure to help determine the healthiness of the app server.

What do you think about this use case?

I even wonder if this is something that should instead be provided by our core-management subsystem with a private API (one interface and some data structures).
The microprofile-health extension would then map our private API to the MPHC spec and handle health check procedures coming from deployments.

# Summary

To better integrate WildFly with OpenShift, we should provide a way to let OpenShift check the healthiness of WildFly. The MPHC spec is a good candidate for providing such a feature.
It is worth exploring how we could leverage it for user deployments and also for WildFly internals (when that makes sense).
Swarm provides an implementation of MPHC; we also need to see how WildFly and Swarm can collaborate to avoid duplicating code and effort in providing the same feature to our users.

jeff


[1] https://github.com/eclipse/microprofile-evolution-process/blob/master/proposals/0003-health-checks.md
[2] https://github.com/eclipse/microprofile-evolution-process/blob/master/proposals/0003-spec.md
[3] https://docs.openshift.com/enterprise/3.0/dev_guide/application_health.html
[4] https://kubernetes.io/v1.0/docs/user-guide/walkthrough/k8s201.html#health-checking
--
Jeff Mesnil
JBoss, a division of Red Hat
http://jmesnil.net/


Re: A look at Eclipse MicroProfile Healthcheck

David Lloyd
On Thu, Jul 6, 2017 at 4:45 AM, Jeff Mesnil <[hidden email]> wrote:

> [full original post snipped]
>
> # Summary
>
> To better integrate WildFly with OpenShift, we should provide a way to let OpenShift check the healthiness of WildFly. The MPHC spec is a good candidate for providing such a feature.
> It is worth exploring how we could leverage it for user deployments and also for WildFly internals (when that makes sense).
> Swarm provides an implementation of MPHC; we also need to see how WildFly and Swarm can collaborate to avoid duplicating code and effort in providing the same feature to our users.

I like the idea of having a WildFly health API that can bridge to MPHC
via a subsystem; this is consistent with what we've done in other
areas.  I'm not so sure about having (more?) APIs which drive
services.  It might be better to use cap/req to have a health
capability to which other systems can be registered.  This might allow
multiple independent health check resources to be defined, for systems
which perform more than one function; downstream health providers
could reference the resource(s) to register with by capability name.
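
A rough sketch of what such a capability might look like (the name, the
HealthMonitor type and the builder call are all illustrative, not an
existing capability):

    static final RuntimeCapability<Void> HEALTH_MONITOR_CAPABILITY =
            RuntimeCapability.Builder.of("org.wildfly.management.health-monitor",
                    HealthMonitor.class)
                    .build();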

Is this a polling-only service, or is there a "push" mechanism?

Just brainstorming, I can think of a few more potentially useful
health checks beyond what you've listed (a rough sketch follows the list):

• EJB failure rate (if an EJB starts failing more than some percentage
of the last, say 50 or 100 invocations, it could report an "unhealthy"
condition)
• Database failure rate (something with JDBC exceptions maybe)
• Authentication realm failure rate (Elytron's RealmUnavailableException)
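
For the failure-rate checks, a minimal sliding-window sketch (illustrative
only; not tied to any existing SPI) might be:

    class FailureRateCheck {
        private final boolean[] window = new boolean[100]; // last 100 invocations
        private int index;
        private int failures;

        synchronized void record(boolean failed) {
            if (window[index]) failures--;  // drop the outcome leaving the window
            window[index] = failed;
            if (failed) failures++;
            index = (index + 1) % window.length;
        }

        // "unhealthy" when more than 20% of the last 100 invocations failed
        synchronized boolean healthy() {
            return failures <= 20;
        }
    }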

--
- DML


Re: A look at Eclipse MicroProfile Healthcheck

Jeff Mesnil

> On 6 Jul 2017, at 15:00, David Lloyd <[hidden email]> wrote:
>
>> [summary snipped]
>
> I like the idea of having a WildFly health API that can bridge to MPHC
> via a subsystem; this is consistent with what we've done in other
> areas.  I'm not so sure about having (more?) APIs which drive
> services.  It might be better to use cap/req to have a health
> capability to which other systems can be registered.  This might allow
> multiple independent health check resources to be defined, for systems
> which perform more than one function; downstream health providers
> could reference the resource(s) to register with by capability name.

You are right.
If we provide our own health API, it will rely on req/cap to bind everything.
My idea was to provide an API that hides the req/cap plumbing but is built on top of it.
It’d be similar to what I’m doing in the messaging-activemq subsystem, where I almost always hide the installation of services in static install() methods such as [1] that require only a few parameters and hide all the dependency/capability service names and injection.

[1] https://github.com/wildfly/wildfly/blob/master/messaging-activemq/src/main/java/org/wildfly/extension/messaging/activemq/HTTPUpgradeService.java#L91
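
A condensed sketch of the pattern (HealthCheckService and the monitor service name are hypothetical; this just shows the shape of such an install() method):

    public static void install(ServiceTarget target, String checkName,
                               HealthCheckProcedure procedure) {
        HealthCheckService service = new HealthCheckService(checkName, procedure);
        target.addService(ServiceName.of("wildfly", "health", checkName), service)
                .addDependency(HealthMonitorService.SERVICE_NAME,
                        HealthMonitor.class, service.getHealthMonitorInjector())
                .install();
    }

Callers pass plain parameters; all the MSC service names and injection stay inside.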

> Is this a polling-only service, or is there a "push" mechanism?

Polling only.
The container (OpenShift) will call the HTTP endpoint regularly to check the application's healthiness.

> Just brainstorming, I can think of a few more potentially useful
> health checks beyond what you've listed:
>
> • EJB failure rate (if an EJB starts failing more than some percentage
> of the last, say 50 or 100 invocations, it could report an "unhealthy"
> condition)
> • Database failure rate (something with JDBC exceptions maybe)

That one is interesting.
I proposed a health check that pings a JDBC connection to Heiko when we talked about the API, and he told me that might be a bad idea after all.
If the database fails, the application will not function as expected. But restarting the application will not make the problem go away (it’s likely the DB that has the problem).
Having health checks that cross service boundaries (such as "my app" <—> “DB”) may have a snowballing effect where one unhealthy service (the DB) propagates its unhealthiness to other services (“my app”).
In that case, the DB should be probed and restarted as soon as possible, but there is nothing that should be done in the app server.

We would need guidelines to determine which health checks actually make sense for WildFly extensions.

Caucho has an interesting list of health checks[1] that could make sense for WildFly.
There are the usual suspects (memory, CPU) and some more interesting ones:
* JVM deadlock check
* Transaction failure rate

We’d have to be careful implementing a JVM deadlock health check though.
These health checks should not impact the app server runtime too much and should be fast (by default, Kubernetes has a 1-second timeout for its liveness probe).
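
The deadlock detection itself is simple to write, e.g. (a sketch reusing my prototype's HealthResponse names, inside a procedure; the cost caveat above still applies):

    ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
    long[] deadlocked = threadBean.findDeadlockedThreads(); // null if none
    HealthResponse response = HealthResponse.named("jvm-deadlock");
    return (deadlocked == null) ? response.up() : response.down();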

jeff

[1] http://www.caucho.com/resin-4.0/admin/health-checking.xtp#Defaulthealthconfiguration

--
Jeff Mesnil
JBoss, a division of Red Hat
http://jmesnil.net/



Re: A look at Eclipse MicroProfile Healthcheck

Rob Cernich
In reply to this post by Jeff Mesnil
> [...]
>
> This HTTP endpoint can be used by OpenShift for its liveness probe.

Regarding the probes, three states would be best, if you can swing it, as OpenShift defines two probe types: liveness and readiness.  Live means it's running, though possibly unable to handle requests, while ready means it's running and able to handle requests.  For example, while the server is initializing, it's alive, but not ready.  Something to think about.


Re: A look at Eclipse MicroProfile Healthcheck

Jeff Mesnil

> On 6 Jul 2017, at 16:13, Rob Cernich <[hidden email]> wrote:
>
>> [...]
>
> Regarding the probes, three states would be best, if you can swing it, as OpenShift defines two probe types: liveness and readiness. Live means it's running, though possibly unable to handle requests, while ready means it's running and able to handle requests. For example, while the server is initializing, it's alive, but not ready. Something to think about.

Three states (red/orange/green) were discussed when the healthcheck API was proposed. The idea was rejected as it puts the burden on the consumer to determine the overall healthiness.
Besides, Kubernetes expects a binary response from its probes. If the HTTP status code is between 200 and 400, the probe is successful[1]. Anything else is considered a failure.

Kubernetes distinguishes between readiness and liveness. As defined, the healthcheck API deals mainly with liveness.
But it could be possible to provide an annotation to specify that some health check procedures determine when an application is ready.
For example, WildFly could be considered ready (i.e. Kubernetes will start to route requests to it) when:
* its status health check is “up and running”
* its deployment health check verifies that all its deployments are enabled.

We could then provide a second HTTP endpoint that Kubernetes could query to check that the server is ready to serve requests.
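
Something along these lines (hypothetical URL, same response format as /health):

    $ curl http://localhost:8080/health/ready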

jeff

[1] https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#examples

--
Jeff Mesnil
JBoss, a division of Red Hat
http://jmesnil.net/



Re: A look at Eclipse MicroProfile Healthcheck

Heiko Braun


> On 6 Jul 2017, at 16:30, Jeff Mesnil <[hidden email]> wrote:
>
> Kubernetes distinguishes between readiness and liveness. As defined, the healthcheck API deals mainly with liveness.

Kubernetes has different semantics for live and ready, i.e. how it interprets and reacts to certain health responses, but underneath it's the same protocol.

With the health API it's no different: you can model readiness or liveness checks with it. I think the only constraint we are facing at the moment is the use of a single protocol entry point. This approach makes it impossible to have separate liveness and readiness checks for the same node (i.e. there is just /health).

Maybe we should change that to support multiple protocol entry points? These could either be custom ones or defined in the spec. For instance, rather than having /health, we could introduce /live and /ready. The API underneath would remain the same.

Food for thought.


Re: A look at Eclipse MicroProfile Healthcheck

Rob Cernich
In reply to this post by Jeff Mesnil
> [...]
>
> For example, WildFly could be considered ready (i.e. Kubernetes will start to
> route requests to it) when:
> * its status health check is “up and running”
> * its deployment health check verifies that all its deployments are enabled.

I think this is a bit oversimplified, e.g. what happens if a deployment failed to start?  The pod would be marked as alive, but would never become ready.  If the deployment failed, presumably, the pod should be marked as dead.  One use case we've seen is using JPA to initialize a DB.  If the DB isn't available, the deployment fails and the pod continues to bounce until the DB is accessible.  (Yes, this is not good practice, but it illustrates the point, and some of our layered products actually do this, so...)


Re: A look at Eclipse MicroProfile Healthcheck

Jeff Mesnil
In reply to this post by Heiko Braun

> On 6 Jul 2017, at 21:00, Heiko Braun <[hidden email]> wrote:
> Kubernetes has different semantics for live and ready, i.e. how it interprets and reacts to certain health responses, but underneath it's the same protocol.
>
> With the health API it's no different: you can model readiness or liveness checks with it. I think the only constraint we are facing at the moment is the use of a single protocol entry point. This approach makes it impossible to have separate liveness and readiness checks for the same node (i.e. there is just /health).
>
> Maybe we should change that to support multiple protocol entry points? These could either be custom ones or defined in the spec. For instance, rather than having /health, we could introduce /live and /ready. The API underneath would remain the same.

Some further food for thought (this will go into an issue on the spec later today):

Maybe the spec could provide some additional metadata to further characterize the health check procedures.
For example, a procedure annotated with @Ready would return UP when the component is “ready” (for whatever it is doing) and DOWN otherwise.
The same component could have another procedure (without any annotation, or with a @Live one) that returns UP when the component checks that it is healthy.

The component would return different statuses depending on its state:
* during its initialization (READY = DOWN, LIVE = UP)
* after its initialization (READY = UP, LIVE = UP)
* if it is encountering issues (LIVE = DOWN, READY = <whatever>)

The spec could then provide a different entry point that only checks procedures annotated with @Ready.
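
In code, it could look like this (the @Ready annotation, the perform() method name and allDeploymentsEnabled() are all hypothetical at this point):

    @Ready
    public class DeploymentsProcedure implements HealthCheckProcedure {
        public HealthStatus perform() {
            HealthResponse response = HealthResponse.named("deployments");
            // UP once all deployments are enabled, DOWN before that
            return allDeploymentsEnabled() ? response.up() : response.down();
        }
    }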

TL;DR:
The spec should enforce only two states (UP and DOWN) for each check procedure but could characterize them to let consumers query different types of healthiness (live and ready).

jeff


--
Jeff Mesnil
JBoss, a division of Red Hat
http://jmesnil.net/



Re: A look at Eclipse MicroProfile Healthcheck

Jeff Mesnil
In reply to this post by Rob Cernich

> On 6 Jul 2017, at 21:26, Rob Cernich <[hidden email]> wrote:
>
>> [...]
>>
>> For example, WildFly could be considered ready (i.e. Kubernetes will start to
>> route requests to it) when:
>> * its status health check is “up and running”
>> * its deployment health check verifies that all its deployments are enabled.
>
> I think this is a bit oversimplified, e.g. what happens if a deployment failed to start?  The pod would be marked as alive, but would never become ready.  If the deployment failed, presumably, the pod should be marked as dead.  One use case we've seen is using JPA to initialize a DB.  If the DB isn't available, the deployment fails and the pod continues to bounce until the DB is accessible.  (Yes, this is not good practice, but it illustrates the point, and some of our layered products actually do this, so…)

You are right, these were only some (simple) ideas, not actual definitions of health checks.
The deployment healthiness is a bit complex. It needs to identify deployments that failed, but to do that we need to know about deployment attempts (I don’t think we get such info at the moment).
We could check, when the server is started, whether there is any deployment in the deployments directories and correlate that with the /deployment resources after the server is started, etc.
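
For example, each deployment resource already exposes a status attribute we could read (illustrative output; my-app.war is a placeholder):

    [standalone@localhost:9990 /] /deployment=my-app.war:read-attribute(name=status)
    {
        "outcome" => "success",
        "result" => "OK"
    }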

But the idea remains the same: WildFly should be able to identify whether its deployment status is healthy or not.

jeff

--
Jeff Mesnil
JBoss, a division of Red Hat
http://jmesnil.net/



Re: A look at Eclipse MicroProfile Healthcheck

Rob Cernich
In reply to this post by Rob Cernich
> On 6 Jul 2017, at 16:13, Rob Cernich wrote:
> > Regarding the probes, three states would be best, if you can swing it,
> > as OpenShift defines two probe types: liveness and readiness.  Live is
> > running, but unable to handle requests, while ready means it's running
> > and able to handle requests.  For example, while the server is
> > initializing, it's alive, but not ready.  Something to think about.
>
> OpenShift / Kube only cares about up down.
> So instead of three states, there should be a parameter
> to the endpoint ?ready to indicate that the readiness
> HCP is queried and not the liveness (or the other
> way around).
>

It cares about live and ready.  Dead implies not ready.  Not ready does not imply dead.

Re: A look at Eclipse MicroProfile Healthcheck

Jeff Mesnil
In reply to this post by Heiko Braun

> On 7 Jul 2017, at 14:22, Heiko Rupp <[hidden email]> wrote:
>
> On 6 Jul 2017, at 21:00, Heiko Braun wrote:
>> Maybe we should change that to support multiple protocol entry points? These could either be custom ones or defined in the spec. For instance, rather than having /health, we could introduce /live and /ready. The API underneath would remain the same.
>
> I think only keeping /health and adding a parameter
> is better, as it "pollutes" less of the URL namespace
> left to applications.

Related issues to improve the spec that are relevant to this conversation:

* Provide different types of health check - https://github.com/eclipse/microprofile-health/issues/35
* Health check endpoint should have app name - https://github.com/eclipse/microprofile-health/issues/29

jeff

--
Jeff Mesnil
JBoss, a division of Red Hat
http://jmesnil.net/



Re: A look at Eclipse MicroProfile Healthcheck

Jeff Mesnil
In reply to this post by David Lloyd

> On 7 Jul 2017, at 14:18, Heiko Rupp <[hidden email]> wrote:
>
> On 6 Jul 2017, at 15:00, David Lloyd wrote:
>> Is this a polling-only service, or is there a "push" mechanism?
>
> Right now poll only, basically building the counterpart
> of the Kubernetes/OpenShift health check calls.
>
>> Just brainstorming, I can think of a few more potentially useful health checks beyond what you've listed:
>> [EJB, database and authentication realm failure-rate checks snipped]
>
> I think those are internal implementation details, something
> each application needs to come up with good values for.
>
> While the HCP can have a detailed payload as above, it is not
> required, and could simply return 204 UP or 503 DOWN.
> Kube will not look at the body anyway.

Some food for thought.
It’s not clear to me whether checks dealing with ranges ("over time" or "over invocations", such as a failure rate) should be handled by a health check procedure at all.
Heiko R. has proposed a spec to cover metrics via HTTP endpoints[1] that is also relevant here.
Anything dealing with such ranges might be better addressed by monitoring tools, which can detect trends.
Health checks and telemetry are somehow related, but it’s not clear to me where the boundaries lie.

[1] https://github.com/eclipse/microprofile-evolution-process/blob/master/proposals/0002-metrics.md

--
Jeff Mesnil
JBoss, a division of Red Hat
http://jmesnil.net/



Re: A look at Eclipse MicroProfile Healthcheck

Rob Cernich
In reply to this post by Rob Cernich
> On 7 Jul 2017, at 14:27, Rob Cernich wrote:
>
> > It cares about live and ready.  Dead implies not ready.  Not ready
> > does not imply dead.
>
> Isn't that what I wrote? The query from Kube expects
> a binary result (or no result at all), but not up/down/perhaps/white
>

It does, and that would imply you have two different interfaces, one for ready and one for liveness.  For EAP, we've been using the Exec probes, which provide more flexibility for reading state, beyond just status code good/bad.

Re: A look at Eclipse MicroProfile Healthcheck

David M. Lloyd
In reply to this post by David Lloyd
On Fri, Jul 7, 2017 at 7:18 AM, Heiko Rupp <[hidden email]> wrote:

> On 6 Jul 2017, at 15:00, David Lloyd wrote:
>>
>> Is this a polling-only service, or is there a "push" mechanism?
>
> Right now poll only, basically building the counterpart
> of the Kubernetes/OpenShift health check calls.
>
>> Just brainstorming, I can think of a few more potentially useful health
>> checks beyond what you've listed:
>> [EJB, database and authentication realm failure-rate checks snipped]
>
> I think those are internal implementation details, something
> each application needs to come up with good values for.
>
> While the HCP can have a detailed payload as above, it is not
> required, and could simply return 204 UP or 503 DOWN.
> Kube will not look at the body anyway.

Right, I'm thinking of the data that we can feed to the internal health
SPI, not the actual payload of the service (which would presumably be
one implementation of a more general SPI), based on Jeff's
description.

Thinking ahead, it would probably be good if the SPI itself were
agnostic of push/poll unless we can determine that all existent or
likely implementations are poll-only.
