Inter Host Controller group communication mesh

Inter Host Controller group communication mesh

Brian Stansberry
Just an FYI: I spent a couple days and worked up a POC[1] of creating a
JGroups-based reliable group communication mesh over the sockets our
Host Controllers use for intra-domain management communications.

Currently those sockets are used to form a tree of connections; master
HC to slave HCs and then HCs to their servers. Slave HCs don't talk to
each other. That kind of topology works fine for our current use cases,
but not for other use cases, where a full communication mesh is more
appropriate.

2 use cases led me to explore this:

1) A longstanding request to have automatic failover of the master HC to
a backup. There are different ways to do this, but group communication
based leader election is a possible solution. My preference, really.

2) https://issues.jboss.org/browse/WFLY-1066, which has led to various
design alternatives, one of which is a distributed cache of topology
information, available via each HC. See [2] for some of that discussion.

I don't know if this kind of communication is a good idea, or if it's
the right solution to either of these use cases. Lots of things need
careful thought!! But I figured it was worth some time to experiment.
And it worked in at least a basic POC way, hence this FYI.

If you're interested in details, here are some Q&A:

Q: Why JGroups?

A: Because 1) I know it well, 2) I trust it, and 3) it's already used for
this kind of group communication in full WildFly.

Q: Why the management sockets? Why not other sockets?

A: Slave HCs already need configuration for how to discover the master.
Using the same sockets lets us reuse that discovery configuration for
the JGroups communications as well. If we're going to use this kind of
communication in a serious way, the configuration needs to be as easy
as possible.

Q: How does it work?

A: JGroups is based on a stack of "protocols" each of which handles one
aspect of reliable group communications. The POC creates and uses a
standard protocol stack, except it replaces two standard protocols with
custom ones:

a) JGroups has various "Discovery" protocols which are used to find
possible peers. I implemented one that integrates with the HC's domain
controller discovery logic. It's basically a copy of the oft-used
TCPPING protocol with about 10-15 lines of code changed.

b) JGroups has various "Transport" protocols which are responsible for
actually sending/receiving over the network. I created a new one of
those that knows how to use the WF management comms stuff built on JBoss
Remoting. JGroups provides a number of base classes to use in this
transport area, so I was able to rely on a lot of existing functionality
and could just focus on the details specific to this case.
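
To make the "stack of protocols" idea a bit more concrete, here's a minimal,
hypothetical sketch of JGroups' programmatic stack creation. It uses the
stock UDP transport and PING discovery purely because they need no extra
configuration; in the POC those two slots are instead filled by the custom
Remoting-based transport and DC-discovery-based discovery protocol described
above. Names like "MeshDemo" and "hc-mesh" are made up.

// Illustrative sketch only; not the POC code.
import org.jgroups.JChannel;
import org.jgroups.protocols.*;
import org.jgroups.protocols.pbcast.GMS;
import org.jgroups.protocols.pbcast.NAKACK2;
import org.jgroups.protocols.pbcast.STABLE;

public class MeshDemo {
    public static void main(String[] args) throws Exception {
        JChannel ch = new JChannel(
                new UDP(),             // transport: the POC swaps in one built on JBoss Remoting
                new PING(),            // discovery: the POC swaps in one reusing DC discovery config
                new MERGE3(),          // re-merges subgroups after a network partition heals
                new FD_SOCK(),         // failure detection via dedicated sockets
                new FD_ALL(),          // failure detection via heartbeats
                new VERIFY_SUSPECT(),  // double-checks suspected members before excluding them
                new NAKACK2(),         // reliable, ordered group messages
                new UNICAST3(),        // reliable one-to-one messages
                new STABLE(),          // agreement on which messages can be garbage-collected
                new GMS());            // group membership: joins, leaves, view installation
        ch.connect("hc-mesh");         // join (or form) the group named "hc-mesh"
        System.out.println("Current view: " + ch.getView());
        ch.close();
    }
}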

Q: What have you done using the POC?

A: I created a master HC and a slave on my laptop and saw them form a
cluster and exchange messages. Typical stuff like starting and stopping
the HCs worked. I see no reason why having multiple slaves wouldn't have
worked too; I just didn't do it.

Q: What's next?

A: Nothing really. We have a couple concrete use cases we're looking to
solve. We need to figure out the best solution for those use cases. If
this kind of thing is useful in that, great. If not, it was a fun POC.

[1]
https://github.com/wildfly/wildfly-core/compare/master...bstansberry:jgroups-dc 
. See the commit message on the single commit to learn a bit more.

[2] https://developer.jboss.org/wiki/ADomainManagedServiceRegistry

--
Brian Stansberry
Senior Principal Software Engineer
JBoss by Red Hat

Re: Inter Host Controller group communication mesh

Ken Wills


On Mon, Apr 11, 2016 at 11:57 AM, Brian Stansberry <[hidden email]> wrote:
Just an FYI: I spent a couple days and worked up a POC[1] of creating a
JGroups-based reliable group communication mesh over the sockets our
Host Controllers use for intra-domain management communications.


Nice! I've been thinking about the mechanics of this a bit recently, but I hadn't gotten to any sort of transport details; this looks interesting.
 
Currently those sockets are used to form a tree of connections; master
HC to slave HCs and then HCs to their servers. Slave HCs don't talk to
each other. That kind of topology works fine for our current use cases,
but not for other use cases, where a full communication mesh is more
appropriate.

2 use cases led me to explore this:

1) A longstanding request to have automatic failover of the master HC to
a backup. There are different ways to do this, but group communication
based leader election is a possible solution. My preference, really.

I'd come to the same conclusion of it being an election: a deterministic election algorithm, perhaps allowing the configuration to supply some sort of weighted value to influence the election on each node, perhaps analogous to how the SMB master browser election works (version + weight + etc.).
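
For illustration only, a deterministic ordering of that sort might look like
the sketch below (the names and fields are invented; the weight would be the
operator-supplied value from each host's configuration):

// Hypothetical sketch: newest management version wins, then the configured
// weight, then a stable tie-breaker so every member computes the same result.
import java.util.Collection;
import java.util.Comparator;

public final class MasterElection {

    public static final class Candidate {
        final int majorVersion;
        final int minorVersion;
        final int weight;     // operator-supplied, e.g. from host configuration
        final String nodeId;  // stable identifier, e.g. the node's group address

        Candidate(int majorVersion, int minorVersion, int weight, String nodeId) {
            this.majorVersion = majorVersion;
            this.minorVersion = minorVersion;
            this.weight = weight;
            this.nodeId = nodeId;
        }
    }

    // Highest (version, weight, nodeId) wins.
    public static final Comparator<Candidate> ORDER = Comparator
            .comparingInt((Candidate c) -> c.majorVersion)
            .thenComparingInt(c -> c.minorVersion)
            .thenComparingInt(c -> c.weight)
            .thenComparing(c -> c.nodeId);

    // Given identical inputs, every member elects the same master with no
    // extra coordination.
    public static Candidate elect(Collection<Candidate> members) {
        return members.stream().max(ORDER).orElse(null);
    }
}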
 

2) https://issues.jboss.org/browse/WFLY-1066, which has led to various
design alternatives, one of which is a distributed cache of topology
information, available via each HC. See [2] for some of that discussion.

I don't know if this kind of communication is a good idea, or if it's
the right solution to either of these use cases. Lots of things need
careful thought!! But I figured it was worth some time to experiment.
And it worked in at least a basic POC way, hence this FYI.

Not knowing a lot about JGroups: for very large domains, is the mesh NxN in size? For thousands of nodes, would this become a problem, or would
a mechanism to segment into local groups be needed, with only certain nodes participating in the mesh and being eligible for election?
 
Ken



Re: Inter Host Controller group communication mesh

Brian Stansberry
On 4/11/16 3:43 PM, Ken Wills wrote:

>
>
> On Mon, Apr 11, 2016 at 11:57 AM, Brian Stansberry
> <[hidden email] <mailto:[hidden email]>> wrote:
>
>     Just an FYI: I spent a couple days and worked up a POC[1] of creating a
>     JGroups-based reliable group communication mesh over the sockets our
>     Host Controllers use for intra-domain management communications.
>
>
> Nice! I've been thinking about the mechanics of this a bit recently, but
> I hadn't gotten to any sort of transport details, this looks interesting.
>
>     Currently those sockets are used to form a tree of connections; master
>     HC to slave HCs and then HCs to their servers. Slave HCs don't talk to
>     each other. That kind of topology works fine for our current use cases,
>     but not for other use cases, where a full communication mesh is more
>     appropriate.
>
>     2 use cases led me to explore this:
>
>     1) A longstanding request to have automatic failover of the master HC to
>     a backup. There are different ways to do this, but group communication
>     based leader election is a possible solution. My preference, really.
>
>
> I'd come to the same conclusion of it being an election. A deterministic
> election algorithm, perhaps allowing the configuration to supply some
> sort of weighted value to influence the election on each node, perhaps
> analogous to how the master browser smb election works (version + weight
> + etc).

Yep.

For sure the master must be running the latest version.

>
>
>     2) https://issues.jboss.org/browse/WFLY-1066, which has led to various
>     design alternatives, one of which is a distributed cache of topology
>     information, available via each HC. See [2] for some of that discussion.
>
>     I don't know if this kind of communication is a good idea, or if it's
>     the right solution to either of these use cases. Lots of things need
>     careful thought!! But I figured it was worth some time to experiment.
>     And it worked in at least a basic POC way, hence this FYI.
>
>
> Not knowing a lot about jgroups .. for very large domains is the mesh
> NxN in size?

Yes.

> For thousands of nodes would this become a problem,

It's one concern I have, yes. There are large JGroups clusters, but they
may be based on the UDP multicast transport JGroups offers.

> or would
> a mechanism to segment into local groups perhaps, with only certain
> nodes participating in the mesh and being eligible for election?


For sure we'd have something in the host.xml that controls whether a
particular HC joins the group.

I don't think this is a big problem for the DC election use case, as you
don't need a large number of HCs in the group. You'd have a few
"potential" DCs that could join the group, and the remaining slaves
don't need to.

For use cases where you want slave HCs to be in the cluster though, it's
a concern. The distributed topology cache thing may or may not need
that. It needs a few HCs to provide HA, but those could be the same ones
that are "potential" HCs. But if only a few are in the group, the
servers need to be told how to reach those HCs. Chicken and egg, as the
point of the topology cache is to provide that kind of data to servers!
If a server's own HC is required to be a part of the group though, that
helps cut through the chicken/egg problem.


> Ken
>


--
Brian Stansberry
Senior Principal Software Engineer
JBoss by Red Hat

Re: Inter Host Controller group communication mesh

Heiko Braun
In reply to this post by Brian Stansberry

Have you seen the RAFT implementation in JGroups [1]? It may be helpful to implement the leader election.

[1] http://belaban.github.io/jgroups-raft/manual/index.html

On 11 Apr 2016, at 18:57, Brian Stansberry <[hidden email]> wrote:

1) A longstanding request to have automatic failover of the master HC to 
a backup. There are different ways to do this, but group communication 
based leader election is a possible solution. My preference, really.



Re: Inter Host Controller group communication mesh

Sebastian Laskawiec
In reply to this post by Brian Stansberry
Adding Bela to the thread...

The POC looks really nice to me. I could try to take it from here and finish the WFLY-1066 implementation to see how everything works together.

The only thing that comes to my mind is whether or not we should add capability and server group information to it. I think most of the subsystems would be interested in that.


Re: Inter Host Controller group communication mesh

Ryan Emerson
In reply to this post by Brian Stansberry
Overall this looks good to me; however, I have a question about the automatic failover use case: how do you intend to handle split-brain scenarios?

Example scenarios: You have a network of {Master HC, Slave1, Slave2, Slave3, Slave4, Slave5} and the network splits into two partitions of {Master HC, Slave1, Slave2} and {Slave3, Slave4, Slave5}. Or even three distinct partitions of two nodes each.

If no additional provisions were added, how detrimental would it be if two Master HCs were elected in distinct partitions and the network partitions became one again (resulting in two Master HCs)?


Re: Inter Host Controller group communication mesh

Brian Stansberry
In reply to this post by Heiko Braun
Yes, I planned to look further into that.

On 4/12/16 1:57 AM, Heiko Braun wrote:

>
> Have you seen the RAFT implementation in jgroups [1]? It may be helpful
> to implement the leader election.
>
>
> [1] http://belaban.github.io/jgroups-raft/manual/index.html
>
>
>> On 11 Apr 2016, at 18:57, Brian Stansberry
>> <[hidden email] <mailto:[hidden email]>> wrote:
>>
>> 1) A longstanding request to have automatic failover of the master HC to
>> a backup. There are different ways to do this, but group communication
>> based leader election is a possible solution. My preference, really.
>


--
Brian Stansberry
Senior Principal Software Engineer
JBoss by Red Hat

Re: Inter Host Controller group communication mesh

Brian Stansberry
In reply to this post by Sebastian Laskawiec
On 4/12/16 4:29 AM, Sebastian Laskawiec wrote:
> Adding Bela to the thread...
>
> The POC looks really nice to me. I could try to take it from here and
> finish WFLY-1066 implementation to see how everything works together.
>
> The only thing that comes to my mind is whether or not we should add
> capability and server group information to it. I think most of the
> subsystems would be interested in that.
>

We'd still need a design for exactly how a distributed cache of topology
info would work. Using JGroups opens up the possibility of using
Infinispan, but the structure of the data in the cache is still TBD. I
think capability and server group data will be part of that.

We also have to work out how the servers access the cache data. As Ken
pointed out, having a large TCP mesh might be problematic, so do we want
each HC in the cluster, or a subset, with some other mechanism for the
servers to access the cache?
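
Purely to illustrate the general shape (this is not a design, and the
key/value structure shown is invented), a replicated embedded Infinispan
cache for topology data might look something like this:

// Rough illustration only. In a real integration the cache manager's
// transport would presumably ride on the HC's JGroups channel rather than
// Infinispan's default stack, and the data model is still TBD.
import org.infinispan.Cache;
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class TopologyCacheDemo {
    public static void main(String[] args) {
        DefaultCacheManager manager = new DefaultCacheManager(
                GlobalConfigurationBuilder.defaultClusteredBuilder()
                        .transport().clusterName("hc-topology")
                        .build());

        manager.defineConfiguration("topology", new ConfigurationBuilder()
                .clustering().cacheMode(CacheMode.REPL_SYNC) // every member holds a full copy
                .build());

        Cache<String, String> topology = manager.getCache("topology");
        topology.put("host=primary,server=server-one", "10.0.0.5:8080"); // invented entry format
        System.out.println(topology.entrySet());

        manager.stop();
    }
}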

--
Brian Stansberry
Senior Principal Software Engineer
JBoss by Red Hat

Re: Inter Host Controller group communication mesh

Brian Stansberry
In reply to this post by Ryan Emerson
On 4/12/16 4:43 AM, Ryan Emerson wrote:
> Overall looks good to me, however I have a question about the automatic failover use case, how do you intend to handle split brain scenarios?
>

My basic thought was to require a quorum and if no quorum is available,
provide a degraded level of service.

A degraded level of service probably means no master. A domain can
function with no master. I can brainstorm about possible slight
enhancements to service beyond that, but I question whether they are
worth the effort, at least initially.

> Example scenarios: You have a network of {Master HC, Slave1, Slave2, Slave3, Slave4, Slave5} and the network splits into two partitions of {Master HC, Slave1, Slave2} and {Slave3, Slave4, Slave5}. Or even three distinct partitions consisting of #2 nodes.
>

I think we need a 3rd conceptual type -- a Potential Master. Not just
any HC can become master. It has to:

1) Be the latest version.
2) Be configured such that it's keeping a complete set of the domain
config and any domain managed content.
3) Be configured to use the group communication service used for leader
election.
4) Most likely, also have specific config saying it can be a master. I
doubt this is something users will want to leave to chance.

So, electing a leader requires a quorum of Potential Masters.
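
As a trivial sketch of that rule (illustration only, invented names): an
election only proceeds when a strict majority of the configured Potential
Masters is present in the current group view.

// Illustration only: require a strict majority of Potential Masters.
import java.util.Collection;
import java.util.Set;

public final class Quorum {

    public static boolean hasQuorum(Set<String> potentialMasters, Collection<String> viewMembers) {
        long present = viewMembers.stream().filter(potentialMasters::contains).count();
        return present > potentialMasters.size() / 2; // e.g. 2 of 3, 3 of 5
    }
}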

> If no additional provisions were added, how detrimental would it be if two Master HCs were elected in distinct partitions and the network partitions became one again (resulting in two Master HCs)?
>

Two masters means two potentially inconsistent domain configurations
(i.e. domain.xml and content repo) are possible. We don't want that,
hence the quorum requirement.

A question is what should slave HCs do in the absence of a master. They
are isolated from control by a master, but don't know if there is still
a functioning set of DC+slaves out there, meaning the slaves may be
missing relevant config changes. Should they shut down, or keep going?

We already have this issue though, and we've elected to have the slaves
keep going, updating their config if they can reconnect to a master. We
chose to keep the appservers running, and not to have them be vulnerable
to problems with master-slave connectivity. Having autopromotion of a
new master makes it slightly more valid to just shut down, since going
masterless is less likely, but I still think it's not a good idea.



--
Brian Stansberry
Senior Principal Software Engineer
JBoss by Red Hat

Re: Inter Host Controller group communication mesh

Brian Stansberry
In reply to this post by Brian Stansberry
As an FYI, copying Bela Ban, who I stupidly forgot to copy on the first
post. Sebastian kindly copied him on the other main branch of the thread.

Bela, tl;dr on this branch is it mostly discusses concerns about N^2 TCP
connections in a possibly very large cluster. Whether the JGroups
cluster would need to get very large depends on what use cases we used
it to solve.


--
Brian Stansberry
Senior Principal Software Engineer
JBoss by Red Hat

Re: Inter Host Controller group communication mesh

Brian Stansberry
On 4/18/16 10:18 AM, Bela Ban wrote:

> Hey Brian,
>
> On 18/04/16 17:04, Brian Stansberry wrote:
>> As an FYI, copying Bela Ban, who I stupidly forgot to copy on the first
>> post. Sebastian kindly copied him on the other main branch of the thread.
>
> Yes, I read that thread.
>
>> Bela, tl;dr on this branch is it mostly discusses concerns about N^2 TCP
>> connections in a possibly very large cluster. Whether the JGroups
>> cluster would need to get very large depends on what use cases we used
>> it to solve.
>
> When I tested on a 2000+ node cluster running over TCP in Google Compute
> Engine, TCP wasn't that much of an issue.

Good!

> The main drawbacks were that
> every node needed to have ~2000 connections open, which means 1 reader
> thread running per connection. However, connections are closed after a
> configurable idle time.
>
> TCP_NIO2 is much better in that respect as it gets rid of the reader
> threads even if the connection is open.
>

In the POC I did, the transport uses the NIO-based JBoss Remoting
infrastructure we already use for intra-domain communications. All
connections are created using a single Remoting Endpoint instance, which
in turn uses an XNIO worker. That worker is configured with two I/O
threads, plus a task pool of min 5 / max 10 threads.

Those pool sizes are just the settings that were already in the code for
the existing uses of the endpoint (CLI requests, intra-domain comms, etc.),
and I spent zero time when I did the POC thinking about whether those
settings are appropriate if we also add JGroups traffic to the mix.
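
For reference, a worker configured that way would look roughly like the
sketch below; this is just an illustration of the settings described above,
not the actual Host Controller code.

// Sketch of an XNIO worker with 2 I/O threads and a 5-10 thread task pool.
import java.io.IOException;
import org.xnio.OptionMap;
import org.xnio.Options;
import org.xnio.Xnio;
import org.xnio.XnioWorker;

public class ManagementWorker {

    public static XnioWorker create() throws IOException {
        return Xnio.getInstance().createWorker(OptionMap.builder()
                .set(Options.WORKER_IO_THREADS, 2)         // non-blocking read/write threads
                .set(Options.WORKER_TASK_CORE_THREADS, 5)  // task pool minimum
                .set(Options.WORKER_TASK_MAX_THREADS, 10)  // task pool maximum
                .getMap());
    }
}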

For the existing management uses of the endpoint, most of the work
actually gets shunted off to threads from a separate pool. A management
request comes in and the XNIO worker threads deal with the initial work
of reading it off the wire or writing the response, but the bulk of the
work in between (actually doing the management work) is done on
another thread. I'd need to refamiliarize myself with TP and the thread
pools JGroups uses to see if we get a similar effect with the JGroups
communications. I have some vague memories of up pools and down pools
and OOB pools and .... ;) All surely out of date.

> The other option is to use UDP without multicasting, ie. ip_mcast=false.
> This would not create N-1 connections and possibly N-1 reader threads
> and sockets, but only 2 sockets (constant) and no reader threads. A
> message would still need to be sent N-1 times though, creating increased
> traffic.
>

I don't think we could do that, at least not with this approach using
the existing JBoss Remoting server sockets. That's all TCP based.

> A potential solution for going from N-1 to a constant number of
> connections/threads would be daisy chaining where you only connect to
> your neighbor and a multicast basically is 1 round across the logical
> overlay, see [1] for details. I'd have to revisit this protocol though
> if you wanted to use it, so let me know asap for me to include this in
> the roadmap.

Ok, good to know. Will do.

> Cheers,
>
> [1] http://belaban.blogspot.ch/2010/08/daisychaining-in-clouds.html
>

--
Brian Stansberry
Senior Principal Software Engineer
JBoss by Red Hat