[wildfly-dev] Call For Help (Testsuite)


Re: [wildfly-dev] Call For Help (Testsuite)

Jaikiran Pai
Hi Scott,

I added a comment to that JIRA
https://issues.jboss.org/browse/WFLY-1531#comment-12781836 explaining
what the problem is.

-Jaikiran
On Friday 14 June 2013 08:14 PM, Scott Marlow wrote:

> I'm looking into why the
> org.hibernate.cache.infinispan.InfinispanRegionFactory.manager instance
> variable is null after it should still be non-null (WFLY-1531) that
> caused
> org.jboss.as.test.clustering.cluster.ejb3.security.FailoverWithSecurityTestCase
> to fail with a NPE on Tomaz's test machine.
>
> If anyone sees NPE failures in
> org.jboss.as.test.clustering.cluster.ejb3.security.FailoverWithSecurityTestCase,
> please let me know.  +10 if you can save the clustered testsuite
> server.log so we can see if the previous test completed before starting
> FailoverWithSecurityTestCase. :)
>
> On 06/12/2013 03:01 PM, Jason Greene wrote:
>> We still have a number of intermittent test failures that have been around for over a year now. I'm asking for everyone's help in doing what we can to make them stable. If you submit a PR, and you see what looks like an intermittent failure, can you do some investigation and report your findings even if it is not your area? It would be awesome if you can report what you find to the mailing list, and rope in help.
>>
>> Nearly all of these seem to only occur when virtualization is involved, so if need be we can work out a plan to create either a special run to capture diagnostic info, or I can give access to a dedicated slave.
>>
>> If anyone has any further ideas on how to tackle these issues I am all ears.
>>
>> --
>> Jason T. Greene
>> WildFly Lead / JBoss EAP Platform Architect
>> JBoss, a division of Red Hat
>>
>>
>> _______________________________________________
>> wildfly-dev mailing list
>> [hidden email]
>> https://lists.jboss.org/mailman/listinfo/wildfly-dev
>>
> _______________________________________________
> wildfly-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/wildfly-dev

_______________________________________________
wildfly-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/wildfly-dev

Re: [wildfly-dev] Call For Help (Testsuite)

Scott Marlow
On 06/14/2013 11:49 PM, Jaikiran Pai wrote:
> Hi Scott,
>
> I added a comment to that JIRA
> https://issues.jboss.org/browse/WFLY-1531#comment-12781836 explaining
> what the problem is.

Thanks for the help, Jaikiran!  As a start, I'll try the minimal fix
that you mentioned: calling ServiceController.awaitValue() instead of
getValue().
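
To illustrate why a blocking wait can help here, below is a minimal Java sketch of the difference between reading a value that may not be published yet and waiting for it. It does not use the MSC API: LazyService, started() and demo() are hypothetical stand-ins built on java.util.concurrent.CompletableFuture, and only the method names getValue/awaitValue are borrowed from the message above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class AwaitValueSketch {

    /** Hypothetical stand-in for a service whose value is published asynchronously. */
    public static final class LazyService<T> {
        private final CompletableFuture<T> value = new CompletableFuture<>();

        /** Called by the (imaginary) container once the service has started. */
        public void started(T v) {
            value.complete(v);
        }

        /** Non-blocking read: fails if the service has not started yet. */
        public T getValue() {
            T v = value.getNow(null);
            if (v == null) {
                throw new IllegalStateException("service not started");
            }
            return v;
        }

        /** Blocking read with an upper bound on how long the caller waits. */
        public T awaitValue(long timeout, TimeUnit unit)
                throws InterruptedException, TimeoutException {
            try {
                return value.get(timeout, unit);
            } catch (ExecutionException e) {
                throw new IllegalStateException("service failed to start", e.getCause());
            }
        }
    }

    /** Reads once before the service starts, then waits for it on a second read. */
    public static String demo() throws Exception {
        LazyService<String> cache = new LazyService<>();

        // First read happens before anything has started the service.
        String early;
        try {
            early = cache.getValue();
        } catch (IllegalStateException e) {
            early = "not-ready";
        }

        // Start the service from another thread after a short delay.
        Thread starter = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
            cache.started("cache-manager");
        });
        starter.start();

        // The blocking read waits until the value is published.
        String awaited = cache.awaitValue(5, TimeUnit.SECONDS);
        starter.join();
        return early + "/" + awaited;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo()); // prints "not-ready/cache-manager"
    }
}
```

Waiting only narrows the window in which a caller can observe a missing value; it does not replace declaring the dependency between the two services.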

As far as what the missing "proper dependencies" are, perhaps we could
discuss that in freenode irc room #jboss-clustering sometime, so Paul
can also be involved in the discussion.


Re: [wildfly-dev] Call For Help (Testsuite)

Scott Marlow
Which Phase is the last one in which application deployment dependencies
can be added?

Is Phase.INSTALL too late to add an application dependency on the 2LC?


Re: [wildfly-dev] Call For Help (Testsuite)

Scott Marlow
Rewording to make (a little) more sense.

On 06/17/2013 04:26 PM, Scott Marlow wrote:
> Which Phase is the last one that application deployment dependencies can
> be added?

I didn't mean the deployment module spec dependencies (the JPA DUP, i.e.
deployment unit processor, sets those during Phase.DEPENDENCIES).

The persistence unit service depends on the second-level cache service,
and the deployment's ComponentDescriptions have a required dependency on
the persistence unit service.

However, we may wait until Phase.INSTALL to update the deployment's
ComponentDescriptions dependencies (via
ComponentDescriptions.addDependency(persistenceUnitService, REQUIRED)).


Re: [wildfly-dev] Call For Help (Testsuite)

Carlo de Wolf
In reply to this post by Jeff Mesnil
On 06/14/2013 11:33 AM, Jeff Mesnil wrote:

> On 14 June 2013 at 11:25, Tomaž Cerar <[hidden email]> wrote:
>
>> org.jboss.as.test.manualmode.messaging.HornetQBackupActivationTestCase.testActiveBackupReload
>> org.jboss.as.test.manualmode.messaging.HornetQBackupActivationTestCase.testLiveReload
> I've sent a PR[1] for these tests. The intermittent failures occur during a :reload operation when the connection is closed before the operation result returns.
> I swallow the exception for this special case, and this should prevent any more failures.
>
> Note that these tests would be revisited when we provide a high-level reload operation on the client that takes care of the disconnection/reconnection issues.
>
> jeff
>
> [1] https://github.com/wildfly/wildfly/pull/4639
>
This is a hack to hide the real issue. I don't want anything that hides
issues; if possible, I want a real fix. If not, then just let it stand
out for further analysis.
I propose we roll back https://github.com/wildfly/wildfly/pull/4639.

Carlo

Re: [wildfly-dev] Call For Help (Testsuite)

Carlo de Wolf
In reply to this post by Tomaž Cerar-2
On 06/14/2013 11:25 AM, Tomaž Cerar wrote:
Carlo,

what are the most common problems your team found when running on bare metal?

anything standing out?

The one problem I see happening a lot is a channel closed prematurely. It affects different tests in different ways.
It also lies at the very core of our functional offering, so if possible I would like to see that one tackled.

Carlo


From the analysis (a combination of Linux and Windows jobs) we have on TC, we can see something like this:

Most commonly failing, more than 10% of the time

org.jboss.as.test.integration.osgi.jta.TransactionTestCase.testUserTransaction (this one fails about 50% of the time)
org.jboss.as.test.integration.osgi.deployment.BundleReplaceTestCase.testDirectBundleReplace

A bit less common, < 10% of the time (no particular order)

org.jboss.as.test.integration.domain.suites.DeploymentManagementTestCase.testFullReplaceViaHash
org.jboss.as.test.manualmode.messaging.HornetQBackupActivationTestCase.testActiveBackupReload
org.jboss.as.test.manualmode.messaging.HornetQBackupActivationTestCase.testLiveReload
org.jboss.as.test.integration.osgi.jndi.JNDITestCase.testInitialContextFactoryBuilderService    
org.jboss.as.test.integration.osgi.jndi.JNDITestCase.testObjectFactoryOSGiService
org.jboss.as.test.manualmode.ejb.shutdown.RemoteCallWhileShuttingDownTestCase.testServerShutdownRemoteCall
org.jboss.as.test.clustering.cluster.ejb2.invalidation.CacheInvalidationTestCase(SYNC-tcp).testCacheInvalidation

plus a few more that fail less commonly.

From my point of view, the most annoying one is the OSGi testUserTransaction. Thomas: can you please take a look at it?

But as can be seen from the list above, most of the intermittent test failures are either manual mode tests or OSGi; everything else is in the minority, and we are working on fixing it.
There used to be lots of problems with clustering, but that was mostly fixed.

--
tomaz


On Fri, Jun 14, 2013 at 11:05 AM, Carlo de Wolf <[hidden email]> wrote:
There are intermittent failures running on bare metal as well. I think
they are confined to conditions where the machine is stressed.
I'm especially interested in cases where the GC seems to come through at
an ill-timed moment, at which point it looks like connections that are
in use are actually closed asynchronously.

Carlo

On 06/12/2013 11:38 PM, Andrig Miller wrote:
> I recently helped out on a customer case, which came from the Linux side of the house and had to do with Java when virtualized.  The customer produced a test case that emulated the application where the performance slowdown was occurring.  It turned out that they were using ReentrantReadWriteLock for some synchronization.  I did some thread dumps of the test case, looked at the source too, and found that ReentrantReadWriteLock was using the ...Unsafe.park() native method, which does a futex() system call.  That system call has terrible performance when virtualized, compared to using ReentrantLock, which was changed in JDK 7 to no longer depend on the native code and does everything in Java, in user space.  Performance was so much better, and with some virtualization configuration settings it was almost equal to bare-metal performance.
>
> That's a long story, just to say, if we have intermittent test suite failures only when run virtualized, perhaps we have some Java code that is using native code that calls futex() in the OS.  If so, it will perform very poorly compared to bare metal, and could be part of the problem with intermittent test failures, especially if there are any timing issues in the test cases.
>
> Just an FYI for something to look out for.
>
> Andy
>
> ----- Original Message -----
>> From: "Jason Greene" <[hidden email]>
>> To: [hidden email]
>> Sent: Wednesday, June 12, 2013 1:01:10 PM
>> Subject: [wildfly-dev] Call For Help (Testsuite)
>>
>> We still have a number of intermittent test failures that have been
>> around for over a year now. I'm asking for everyone's help in doing
>> what we can to make them stable. If you submit a PR, and you see
>> what looks like an intermittent failure, can you do some
>> investigation and report your findings even if it is not your area?
>> It would be awesome if you can report what you find to the mailing
>> list, and rope in help.
>>
>> Nearly all of these seem to only occur when virtualization is
>> involved, so if need be we can work out a plan to create either a
>> special run to capture diagnostic info, or I can give access to a
>> dedicated slave.
>>
>> If anyone has any further ideas on how to tackle these issues I am
>> all ears.
>>
>> --
>> Jason T. Greene
>> WildFly Lead / JBoss EAP Platform Architect
>> JBoss, a division of Red Hat
>>
>>
>> _______________________________________________
>> wildfly-dev mailing list
>> [hidden email]
>> https://lists.jboss.org/mailman/listinfo/wildfly-dev
>>
> _______________________________________________
> wildfly-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/wildfly-dev


Re: [wildfly-dev] Call For Help (Testsuite)

Carlo de Wolf
In reply to this post by jtgreene
One timing issue everybody should be on the lookout for is unexpected
time spent during a thread context switch.

For example, this piece assumes that if the timeout has been reached, the
condition has not been met:
https://github.com/wildfly/wildfly/blob/e4d4ac54685d531f1ee62d0f8c919ccbde898568/testsuite/integration/clust/src/test/java/org/jboss/as/test/clustering/ViewChangeListenerBean.java#L72
In actuality, the notify might have happened much sooner, and only the
thread context switch took too long.
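
One way to avoid that assumption is to decide on the guarded state rather than on elapsed time. The following is a minimal Java sketch of that idea, not a reproduction of the linked ViewChangeListenerBean code; the class name TimedWaitSketch, the flag established and the method names are all hypothetical.

```java
public class TimedWaitSketch {
    private final Object lock = new Object();
    private boolean established; // guarded by lock

    /** Called by the notifying thread once the condition holds. */
    public void markEstablished() {
        synchronized (lock) {
            established = true;
            lock.notifyAll();
        }
    }

    /**
     * Waits up to timeoutMillis for the condition. The result is taken from
     * the flag itself, so a waiter that is scheduled late still reports
     * success if the notification has already happened.
     */
    public boolean awaitEstablished(long timeoutMillis) throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        synchronized (lock) {
            while (!established) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    break; // out of time; fall through and report the flag
                }
                // Never pass 0 to wait(): that would mean "wait forever".
                lock.wait(Math.max(1L, remainingNanos / 1_000_000L));
            }
            return established;
        }
    }

    /** Simulates a waiter that only gets to run after the notification. */
    public static boolean lateWaiterStillSucceeds() throws InterruptedException {
        TimedWaitSketch sketch = new TimedWaitSketch();
        sketch.markEstablished();
        return sketch.awaitEstablished(0);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(lateWaiterStillSucceeds());
    }
}
```

The loop also covers spurious wakeups, since the flag is re-checked after every return from wait().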

Carlo

On 06/14/2013 06:46 PM, Jason Greene wrote:

> Well the problem is that you really need more than one core to catch timing & memory visibility issues.
>
> On Jun 14, 2013, at 11:40 AM, Aleksandar Kostadinov <[hidden email]> wrote:
>
>> My hope was that test suite would run reliably on a smaller instance - at most medium which is the largest 32bit instance.
>>
>> The test suite should as well be run on modern multi-cpu/core machines but being run on slower ones will avoid issues later when certifying legacy hardware and cloud environments.
>>
>> Jason Greene wrote, On 06/14/2013 07:01 PM (EEST):
>>> Ah yes duh. We have to do some Linux hacks to get IPv6 multicast to work on the loopback with 2 addresses, but that should work on EC2 as well.
>>>
>>> We'd have to get a large instance because we need more than 1 core to adequately test. So I'll have to ask around about budget.
>>>
>>> On Jun 14, 2013, at 10:47 AM, Aleksandar Kostadinov <[hidden email]> wrote:
>>>
>>>> Multicast is not a problem. It works fine on the local machine as long as you don't expect another machine to see the messages.
>>>>
>>>> Jason Greene wrote, On 06/14/2013 06:35 PM (EEST):
>>>>> Unfortunately EC2 won't run the test suite since some of the tests rely on multicast. Lightning is virtualized but it's at capacity. We may be getting some more hardware, though, which would help.
>>>>>
>>>>> On Jun 14, 2013, at 10:32 AM, Aleksandar Kostadinov <[hidden email]> wrote:
>>>>>
>>>>>> Not exactly an answer but can we have a persistent amazon EC2 small instance to run the testsuite? This way such intermittent issues would be caught early after being introduced.
>>>>>>
>>>>>> Jason Greene wrote, On 06/12/2013 10:01 PM (EEST):
>>>>>>> We still have a number of intermittent test failures that have been around for over a year now. I'm asking for everyone's help in doing what we can to make them stable. If you submit a PR, and you see what looks like an intermittent failure, can you do some investigation and report your findings even if it is not your area? It would be awesome if you can report what you find to the mailing list, and rope in help.
>>>>>>>
>>>>>>> Nearly all of these seem to only occur when virtualization is involved, so if need be we can work out a plan to create either a special run to capture diagnostic info, or I can give access to a dedicated slave.
>>>>>>>
>>>>>>> If anyone has any further ideas on how to tackle these issues I am all ears.
>>>>>>>
>>>>>>> --
>>>>>>> Jason T. Greene
>>>>>>> WildFly Lead / JBoss EAP Platform Architect
>>>>>>> JBoss, a division of Red Hat
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> wildfly-dev mailing list
>>>>>>> [hidden email]
>>>>>>> https://lists.jboss.org/mailman/listinfo/wildfly-dev
>>>>>>>
>>>>> --
>>>>> Jason T. Greene
>>>>> WildFly Lead / JBoss EAP Platform Architect
>>>>> JBoss, a division of Red Hat
>>>>>
>>> --
>>> Jason T. Greene
>>> WildFly Lead / JBoss EAP Platform Architect
>>> JBoss, a division of Red Hat
>>>
> --
> Jason T. Greene
> WildFly Lead / JBoss EAP Platform Architect
> JBoss, a division of Red Hat
>
>
> _______________________________________________
> wildfly-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/wildfly-dev


Re: [wildfly-dev] Call For Help (Testsuite)

Jeff Mesnil
In reply to this post by Carlo de Wolf

On 18 June 2013 at 08:53, Carlo de Wolf <[hidden email]> wrote:

> On 06/14/2013 11:33 AM, Jeff Mesnil wrote:
>> On 14 June 2013 at 11:25, Tomaž Cerar <[hidden email]> wrote:
>>
>>> org.jboss.as.test.manualmode.messaging.HornetQBackupActivationTestCase.testActiveBackupReload
>>> org.jboss.as.test.manualmode.messaging.HornetQBackupActivationTestCase.testLiveReload
>> I've sent a PR[1] for these tests. The intermittent failures occur during a :reload operation when the connection is closed before the operation result returns.
>> I swallow the exception for this special case, and this should prevent any more failures.
>>
>> Note that these tests would be revisited when we provide a high-level reload operation on the client that takes care of the disconnection/reconnection issues.
>>
>> jeff
>>
>> [1] https://github.com/wildfly/wildfly/pull/4639
>>
> This is a hack to hide the real issue. I don't want anything that hides issues; if possible, I want a real fix. If not, then just let it stand out for further analysis.
> I propose we roll back https://github.com/wildfly/wildfly/pull/4639.

AFAICT, the tests that execute a :reload swallow the exception if the channel is closed prematurely [1][2].

If we want to remove this, the :reload command must be refactored to make sure it returns before the channel of the server is closed.

Note that in the HornetQBackupActivationTestCase, the :reload operation would normally be executed by an admin using the CLI and would not be intermingled with application code. The :reload is not the important part of the test; the test checks that the HornetQ master/slave servers are correctly configured after a server reload.
We could roll back the test (and keep having intermittent failures) or @Ignore the test, but I'd prefer to keep it with the exception swallowing rather than have a regression in this test go undetected until we get a customer issue.
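
For readers who have not looked at the PR, the pattern under discussion can be sketched roughly as follows. This is a minimal illustration in Java and not the code from the PR or from the linked test utilities: ManagementClient is a hypothetical interface (not the WildFly client API), and the returned marker string is invented for the example.

```java
import java.io.IOException;

public class ReloadSketch {

    /** Hypothetical management client; an assumption for this sketch only. */
    public interface ManagementClient {
        String execute(String operation) throws IOException;
    }

    /** Marker returned when the connection dropped during a reload (invented). */
    public static final String CLOSED_DURING_RELOAD = "connection-closed-during-reload";

    /**
     * Executes an operation. A dropped connection is tolerated only for
     * ":reload", where the server may close the channel before the result
     * is sent back; for every other operation the failure is rethrown.
     */
    public static String executeTolerantOfReload(ManagementClient client, String operation)
            throws IOException {
        try {
            return client.execute(operation);
        } catch (IOException e) {
            if (":reload".equals(operation)) {
                return CLOSED_DURING_RELOAD;
            }
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        // A client that always loses its connection, to exercise both paths.
        ManagementClient dropping = op -> {
            throw new IOException("Channel closed");
        };
        System.out.println(executeTolerantOfReload(dropping, ":reload"));
    }
}
```

Keeping the tolerated case this narrow means a prematurely closed channel during any other operation still fails the test, which limits how much the workaround can hide.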

jeff

[1] https://github.com/wildfly/wildfly/blob/master/testsuite/domain/src/test/java/org/jboss/as/test/integration/respawn/RespawnTestCase.java#L384
[2] https://github.com/wildfly/wildfly/blob/master/testsuite/shared/src/main/java/org/jboss/as/test/integration/domain/management/util/DomainLifecycleUtil.java#L335

--
Jeff Mesnil
JBoss, a division of Red Hat
http://jmesnil.net/



Re: [wildfly-dev] Call For Help (Testsuite)

Dimitris Andreadis
In reply to this post by Aleksandar Kostadinov
Can't we set up a "slow" virtualized machine locally? I mean, why does it have to be on EC2?

On 14/06/2013 22:18, Aleksandar Kostadinov wrote:

> My experience shows that slow machines tend to reveal a lot of problems.
> That might be more of test suite deficiencies than project issues but
> they still do a lot of harm when found too late after being introduced.
>
> I'm all for running the test suite at least daily on a range of machine
> types but I think a slow and a fast run will mostly help. A large
> instance will be an improvement in any case though.

--
xxxxxxxxxxxxxxxxxxxxxxxxxxxx
Dimitris Andreadis
Engineering Manager
JBoss Application Server
by Red Hat
xxxxxxxxxxxxxxxxxxxxxxxxxxxx

http://dandreadis.blogspot.com/

Re: [wildfly-dev] Call For Help (Testsuite)

Jason T. Greene
We certainly could. However, we are short on capacity, so it makes more sense to keep the configuration we have, which has no problem producing issues.

I will say again, though, that single-core runs hide memory visibility issues that we want to catch. The minimum requirement for running a multi-threaded test suite and a multi-threaded server on the same box is 2 cores.
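
A test harness could make that requirement explicit before a run. The sketch below is only an illustration of such a check in Java; CoreCheckSketch, MIN_CORES and enoughCores are hypothetical names and nothing like this is claimed to exist in the WildFly test suite.

```java
public class CoreCheckSketch {

    /** Minimum core count taken from the requirement stated above. */
    public static final int MIN_CORES = 2;

    /** Returns true if the given number of cores meets the minimum. */
    public static boolean enoughCores(int availableCores) {
        return availableCores >= MIN_CORES;
    }

    public static void main(String[] args) {
        // availableProcessors() reports what the JVM sees, which on a
        // virtualized host is the number of virtual CPUs.
        int cores = Runtime.getRuntime().availableProcessors();
        if (!enoughCores(cores)) {
            System.err.println("Only " + cores + " core(s) visible; "
                    + "timing and memory visibility issues may go unnoticed.");
        } else {
            System.out.println(cores + " cores visible; OK for concurrent tests.");
        }
    }
}
```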

On Jun 18, 2013, at 7:12 AM, Dimitris Andreadis <[hidden email]> wrote:

> Can't we set up a "slow" virtualized machine locally? I mean, why does it have to be on EC2?
>
> On 14/06/2013 22:18, Aleksandar Kostadinov wrote:
>> My experience shows that slow machines tend to reveal a lot of problems.
>> That might be more of test suite deficiencies than project issues but
>> they still do a lot of harm when found too late after being introduced.
>>
>> I'm all for running the test suite at least daily on a range of machine
>> types but I think a slow and a fast run will mostly help. A large
>> instance will be an improvement in any case though.
>>
>> Jason Greene wrote, On 06/14/2013 07:46 PM (EEST):
>>> Well the problem is that you really need more than one core to catch timing & memory visibility issues.
>>>
>>> On Jun 14, 2013, at 11:40 AM, Aleksandar Kostadinov <[hidden email]> wrote:
>>>
>>>> My hope was that test suite would run reliably on a smaller instance - at most medium which is the largest 32bit instance.
>>>>
>>>> The test suite should as well be run on modern multi-cpu/core machines but being run on slower ones will avoid issues later when certifying legacy hardware and cloud environments.
>>>>
>>>> Jason Greene wrote, On 06/14/2013 07:01 PM (EEST):
>>>>> Ah yes duh. We have to do some Linux hacks to get IPv6 multicast to work on the loopback with 2 addresses, but that should work on EC2 as well.
>>>>>
>>>>> We'd have to get a large instance because we need more than 1 core to adequately test. So I'll have to ask around about budget.
>>>>>
>>>>> On Jun 14, 2013, at 10:47 AM, Aleksandar Kostadinov <[hidden email]> wrote:
>>>>>
>>>>>> Multicast is not a problem. It works fine on the local machine as long as you don't expect another machine to see the messages.
>>>>>>
>>>>>> Jason Greene wrote, On 06/14/2013 06:35 PM (EEST):
>>>>>>> Unfortunately EC2 won't run the test suite since some of the tests rely on multicast. Lightning is virtualized but it's at capacity. We may be getting some more hardware though, which would help.
>>>>>>>
>>>>>>> On Jun 14, 2013, at 10:32 AM, Aleksandar Kostadinov <[hidden email]> wrote:
>>>>>>>
>>>>>>>> Not exactly an answer but can we have a persistent amazon EC2 small instance to run the testsuite? This way such intermittent issues would be caught early after being introduced.
>>>>>>>>
>>>>>>>> Jason Greene wrote, On 06/12/2013 10:01 PM (EEST):
>>>>>>>>> We still have a number of intermittent test failures that have been around for over a year now. I'm asking for everyone's help in doing what we can to make them stable. If you submit a PR, and you see what looks like an intermittent failure, can you do some investigation and report your findings even if it is not your area? It would be awesome if you can report what you find to the mailing list, and rope in help.
>>>>>>>>>
>>>>>>>>> Nearly all of these seem to only occur when virtualization is involved, so if need be we can work out a plan to create either a special run to capture diagnostic info, or I can give access to a dedicated slave.
>>>>>>>>>
>>>>>>>>> If anyone has any further ideas on how to tackle these issues I am all ears.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jason T. Greene
>>>>>>>>> WildFly Lead / JBoss EAP Platform Architect
>>>>>>>>> JBoss, a division of Red Hat
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> wildfly-dev mailing list
>>>>>>>>> [hidden email]
>>>>>>>>> https://lists.jboss.org/mailman/listinfo/wildfly-dev
>>>>>>>
>>>>>>> --
>>>>>>> Jason T. Greene
>>>>>>> WildFly Lead / JBoss EAP Platform Architect
>>>>>>> JBoss, a division of Red Hat
>>>>>
>>>>> --
>>>>> Jason T. Greene
>>>>> WildFly Lead / JBoss EAP Platform Architect
>>>>> JBoss, a division of Red Hat
>>>
>>> --
>>> Jason T. Greene
>>> WildFly Lead / JBoss EAP Platform Architect
>>> JBoss, a division of Red Hat
>> _______________________________________________
>> wildfly-dev mailing list
>> [hidden email]
>> https://lists.jboss.org/mailman/listinfo/wildfly-dev
>
> --
> xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> Dimitris Andreadis
> Engineering Manager
> JBoss Application Server
> by Red Hat
> xxxxxxxxxxxxxxxxxxxxxxxxxxxx
>
> http://dandreadis.blogspot.com/
> _______________________________________________
> wildfly-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/wildfly-dev


Re: [wildfly-dev] Call For Help (Testsuite)

Jaikiran Pai
In reply to this post by Carlo de Wolf
We have been filing JIRAs for specific issues that we have noticed in
these intermittent failures and relevant owners have been helping with
the fixes. I think we should do the same here.

-Jaikiran
On Tuesday 18 June 2013 12:40 PM, Carlo de Wolf wrote:

> One timing issue everybody should be on the lookout for is unexpected
> time spent during a thread context switch.
>
> For example this piece assumes that if the time out has been reached the
> condition has not been obtained:
> https://github.com/wildfly/wildfly/blob/e4d4ac54685d531f1ee62d0f8c919ccbde898568/testsuite/integration/clust/src/test/java/org/jboss/as/test/clustering/ViewChangeListenerBean.java#L72
> While in actuality the notify might have already happened way sooner,
> but the thread context switch took too long.
>
> Carlo
>
> On 06/14/2013 06:46 PM, Jason Greene wrote:
>> Well the problem is that you really need more than one core to catch timing & memory visibility issues.
>>
>> On Jun 14, 2013, at 11:40 AM, Aleksandar Kostadinov <[hidden email]> wrote:
>>
>>> My hope was that test suite would run reliably on a smaller instance - at most medium which is the largest 32bit instance.
>>>
>>> The test suite should as well be run on modern multi-cpu/core machines but being run on slower ones will avoid issues later when certifying legacy hardware and cloud environments.
>>>
>>> Jason Greene wrote, On 06/14/2013 07:01 PM (EEST):
>>>> Ah yes duh. We have to do some Linux hacks to get IPv6 multicast to work on the loopback with 2 addresses, but that should work on EC2 as well.
>>>>
>>>> We'd have to get a large instance because we need more than 1 core to adequately test. So I'll have to ask around about budget.
>>>>
>>>> On Jun 14, 2013, at 10:47 AM, Aleksandar Kostadinov <[hidden email]> wrote:
>>>>
>>>>> Multicast is not a problem. It works fine on the local machine as long as you don't expect another machine to see the messages.
>>>>>
>>>>> Jason Greene wrote, On 06/14/2013 06:35 PM (EEST):
>>>>>> Unfortunately EC2 won't run the test suite since some of the tests rely on multicast. Lightning is virtualized but it's at capacity. We may be getting some more hardware though, which would help.
>>>>>>
>>>>>> On Jun 14, 2013, at 10:32 AM, Aleksandar Kostadinov <[hidden email]> wrote:
>>>>>>
>>>>>>> Not exactly an answer but can we have a persistent amazon EC2 small instance to run the testsuite? This way such intermittent issues would be caught early after being introduced.
>>>>>>>
>>>>>>> Jason Greene wrote, On 06/12/2013 10:01 PM (EEST):
>>>>>>>> We still have a number of intermittent test failures that have been around for over a year now. I'm asking for everyone's help in doing what we can to make them stable. If you submit a PR, and you see what looks like an intermittent failure, can you do some investigation and report your findings even if it is not your area? It would be awesome if you can report what you find to the mailing list, and rope in help.
>>>>>>>>
>>>>>>>> Nearly all of these seem to only occur when virtualization is involved, so if need be we can work out a plan to create either a special run to capture diagnostic info, or I can give access to a dedicated slave.
>>>>>>>>
>>>>>>>> If anyone has any further ideas on how to tackle these issues I am all ears.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jason T. Greene
>>>>>>>> WildFly Lead / JBoss EAP Platform Architect
>>>>>>>> JBoss, a division of Red Hat
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> wildfly-dev mailing list
>>>>>>>> [hidden email]
>>>>>>>> https://lists.jboss.org/mailman/listinfo/wildfly-dev
>>>>>>>>
>>>>>> --
>>>>>> Jason T. Greene
>>>>>> WildFly Lead / JBoss EAP Platform Architect
>>>>>> JBoss, a division of Red Hat
>>>>>>
>>>> --
>>>> Jason T. Greene
>>>> WildFly Lead / JBoss EAP Platform Architect
>>>> JBoss, a division of Red Hat
>>>>
>> --
>> Jason T. Greene
>> WildFly Lead / JBoss EAP Platform Architect
>> JBoss, a division of Red Hat
>>
>>
>> _______________________________________________
>> wildfly-dev mailing list
>> [hidden email]
>> https://lists.jboss.org/mailman/listinfo/wildfly-dev
> _______________________________________________
> wildfly-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/wildfly-dev
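
The timing issue Carlo describes in the quoted message (treating "the timeout elapsed" as "the condition was not met") can be sketched generically as follows. This is not the linked ViewChangeListenerBean code; the class and its members are hypothetical. The idea is to let a flag guarded by the lock decide the outcome, so a notify that arrived before a slow context switch still counts.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical example class; not taken from the WildFly test suite.
class TimedWaitSketch {
    private final Object lock = new Object();
    private boolean conditionMet; // guarded by 'lock'

    /** Called by the notifying thread. */
    void signal() {
        synchronized (lock) {
            conditionMet = true;
            lock.notifyAll();
        }
    }

    /**
     * Waits up to timeoutMillis for the condition.
     * The result comes from the flag, not from the clock, so a signal that was
     * delivered in time is not misreported just because this thread woke up late.
     */
    boolean await(long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        synchronized (lock) {
            try {
                while (!conditionMet) { // the loop also covers spurious wakeups
                    long remainingMillis =
                            TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                    if (remainingMillis <= 0) {
                        break; // out of time; never call wait(0), which waits forever
                    }
                    lock.wait(remainingMillis);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return conditionMet;
        }
    }

    public static void main(String[] args) {
        TimedWaitSketch sketch = new TimedWaitSketch();
        new Thread(sketch::signal).start();
        System.out.println("condition met: " + sketch.await(1000));
    }
}
```

A test that instead compares elapsed time against the timeout after waking up would fail whenever the wakeup itself is delayed, which is more likely on a loaded or virtualized machine.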


Re: [wildfly-dev] Call For Help (Testsuite)

Brian Stansberry
In reply to this post by Carlo de Wolf
On 6/18/13 1:59 AM, Carlo de Wolf wrote:

> On 06/14/2013 11:25 AM, Tomaž Cerar wrote:
>> Carlo,
>>
>> what are the most common problems your team found when running on bare
>> metal?
>>
>> anything standing out?
>
> The one problem I see happening a lot is a channel closed prematurely.
> It affects different tests in different ways.
> It also lies at the very core of our functional offering, so if possible
> I would like to see that one tackled.
>

Can you point to an example?

--
Brian Stansberry
Principal Software Engineer
JBoss by Red Hat

Re: [wildfly-dev] Call For Help (Testsuite)

Scott Marlow
In reply to this post by Scott Marlow
I pushed an update to WFLY-1531 to add a (persistence unit service)
dependency on EmbeddedCacheManagerService.
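
To show why declaring the dependency helps, here is a plain-JDK analogy (a sketch under assumptions, not the WFLY-1531 change and not the MSC API; `CacheManager`, `PersistenceUnit`, and both start methods are hypothetical stand-ins). Reading a value that may not be there yet can yield null, while chaining on the dependency delays the dependent until the value exists.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-ins; these are not the WildFly or Infinispan classes.
class DependencySketch {
    static final class CacheManager {
        final String name;
        CacheManager(String name) { this.name = name; }
    }

    static final class PersistenceUnit {
        final CacheManager manager;
        PersistenceUnit(CacheManager manager) {
            if (manager == null) {
                throw new NullPointerException("cache manager not started yet");
            }
            this.manager = manager;
        }
    }

    // Racy: takes whatever is available right now, null if the cache manager has not started.
    static PersistenceUnit startWithoutDependency(CompletableFuture<CacheManager> cache) {
        return new PersistenceUnit(cache.getNow(null));
    }

    // Ordered: the persistence unit is only created after the cache manager completes.
    static CompletableFuture<PersistenceUnit> startWithDependency(CompletableFuture<CacheManager> cache) {
        return cache.thenApply(PersistenceUnit::new);
    }

    public static void main(String[] args) {
        CompletableFuture<CacheManager> cache = new CompletableFuture<>();
        CompletableFuture<PersistenceUnit> unit = startWithDependency(cache);
        System.out.println("started before cache manager: " + unit.isDone());
        cache.complete(new CacheManager("example"));
        System.out.println("started with cache manager: " + unit.join().manager.name);
    }
}
```

In the server itself the ordering is expressed through service dependencies rather than futures; the sketch only illustrates the difference between racing for a value and depending on it.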

On 06/17/2013 09:57 PM, Scott Marlow wrote:

> Rewording to make (a little) more sense.
>
> On 06/17/2013 04:26 PM, Scott Marlow wrote:
>> Which Phase is the last one in which application deployment dependencies
>> can be added?
>
> I didn't mean the deployment module spec dependencies (the JPA DUP sets
> those during Phase.DEPENDENCIES).
>
> The persistence unit service depends on the second level cache service,
> and the deployment's ComponentDescriptions have a required dependency on
> the persistence unit service.
>
> However, we may wait until Phase.INSTALL to update the deployment's
> ComponentDescriptions dependencies (via
> ComponentDescriptions.addDependency(persistenceUnitService, REQUIRED)).
>
>>
>> Is Phase.INSTALL too late to add an application dependency on the 2LC?
>>
>> On 06/15/2013 08:25 AM, Scott Marlow wrote:
>>> On 06/14/2013 11:49 PM, Jaikiran Pai wrote:
>>>> Hi Scott,
>>>>
>>>> I added a comment to that JIRA
>>>> https://issues.jboss.org/browse/WFLY-1531#comment-12781836 explaining
>>>> what the problem is.
>>>
>>> Thanks for the help Jaikiran!  As a start, I'll try the (minimal fix
>>> that you mentioned) calling ServiceController.awaitValue() instead of
>>> getValue().
>>>
>>> As far as what the missing "proper dependencies" are, perhaps we could
>>> discuss that in freenode irc room #jboss-clustering sometime, so Paul
>>> can also be involved in the discussion.
>>>
>>>>
>>>> -Jaikiran
>>>> On Friday 14 June 2013 08:14 PM, Scott Marlow wrote:
>>>>> I'm looking into why the
>>>>> org.hibernate.cache.infinispan.InfinispanRegionFactory.manager instance
>>>>> variable is null after it should still be non-null (WFLY-1531) that
>>>>> caused
>>>>> org.jboss.as.test.clustering.cluster.ejb3.security.FailoverWithSecurityTestCase
>>>>>
>>>>> to fail with a NPE on Tomaz's test machine.
>>>>>
>>>>> If anyone sees NPE failures in
>>>>> org.jboss.as.test.clustering.cluster.ejb3.security.FailoverWithSecurityTestCase,
>>>>>
>>>>> please let me know.  +10 if you can save the clustered testsuite
>>>>> server.log so we can see if the previous test completed before starting
>>>>> FailoverWithSecurityTestCase. :)
>>>>>
>>>>> On 06/12/2013 03:01 PM, Jason Greene wrote:
>>>>>> We still have a number of intermittent test failures that have been
>>>>>> around for over a year now. I'm asking for everyone's help in doing
>>>>>> what we can to make them stable. If you submit a PR, and you see what
>>>>>> looks like an intermittent failure, can you do some investigation and
>>>>>> report your findings even if it is not your area? It would be awesome
>>>>>> if you can report what you find to the mailing list, and rope in help.
>>>>>>
>>>>>> Nearly all of these seem to only occur when virtualization is
>>>>>> involved, so if need be we can work out a plan to create either a
>>>>>> special run to capture diagnostic info, or I can give access to a
>>>>>> dedicated slave.
>>>>>>
>>>>>> If anyone has any further ideas on how to tackle these issues I am
>>>>>> all ears.
>>>>>>
>>>>>> --
>>>>>> Jason T. Greene
>>>>>> WildFly Lead / JBoss EAP Platform Architect
>>>>>> JBoss, a division of Red Hat
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> wildfly-dev mailing list
>>>>>> [hidden email]
>>>>>> https://lists.jboss.org/mailman/listinfo/wildfly-dev
>>>>>>
>>>>> _______________________________________________
>>>>> wildfly-dev mailing list
>>>>> [hidden email]
>>>>> https://lists.jboss.org/mailman/listinfo/wildfly-dev
>>>>
>>>
>>> _______________________________________________
>>> wildfly-dev mailing list
>>> [hidden email]
>>> https://lists.jboss.org/mailman/listinfo/wildfly-dev
>>>
>>
>> _______________________________________________
>> wildfly-dev mailing list
>> [hidden email]
>> https://lists.jboss.org/mailman/listinfo/wildfly-dev
>>
>
> _______________________________________________
> wildfly-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/wildfly-dev
>


Re: [wildfly-dev] Call For Help (Testsuite)

Carlo de Wolf
In reply to this post by Brian Stansberry
On 06/18/2013 03:21 PM, Brian Stansberry wrote:

> On 6/18/13 1:59 AM, Carlo de Wolf wrote:
>> On 06/14/2013 11:25 AM, Tomaž Cerar wrote:
>>> Carlo,
>>>
>>> what are the most common problems your team found when running on bare
>>> metal?
>>>
>>> anything standing out?
>> The one problem I see happening a lot is a channel closed prematurely.
>> It affects different tests in different ways.
>> It also lies at the very core of our functional offering, so if possible
>> I would like to see that one tackled.
>>
> Can you point to an example?
>
https://issues.jboss.org/browse/WFLY-1518

Carlo

Re: [wildfly-dev] Call For Help (Testsuite)

Carlo de Wolf
On 06/19/2013 12:06 AM, Carlo de Wolf wrote:

> On 06/18/2013 03:21 PM, Brian Stansberry wrote:
>> On 6/18/13 1:59 AM, Carlo de Wolf wrote:
>>> On 06/14/2013 11:25 AM, Tomaž Cerar wrote:
>>>> Carlo,
>>>>
>>>> what are the most common problems your team found when running on bare
>>>> metal?
>>>>
>>>> anything standing out?
>>> The one problem I see happening a lot is a channel closed prematurely.
>>> It affects different tests in different ways.
>>> It also lies at the very core of our functional offering, so if
>>> possible
>>> I would like to see that one tackled.
>>>
>> Can you point to an example?
>>
> https://issues.jboss.org/browse/WFLY-1518
>
> Carlo

A "channel closed" during regular clustered EJB invocations, which
happens when finalize is being run:

https://bugzilla.redhat.com/show_bug.cgi?id=979935

This seems to be spot on the sore area.

Carlo