昨天办公网挂了一上午,因为压测用的是线下办公网络,有一个slave节点虚拟机所在的物理机网络异常,结果无心插柳平台异常了;一直在做平台功能扩充,而对于稳定性可靠性方面做得比较少,特别是当初做了个节点管理,也没做太多的的异常测试,基本都扑在了业务迭代上了,以前都是测试各种分布式系统,如今得自己fix分布式问题,把至今遇到的分布式异常问题总结一波
全篇以问题描述,重现方法,原因描述,解决方法来描述
1、Master节点JMeter客户端非正常结束,导致Slave节点JMeter-Server还在执行压测任务的线程依旧存在,当Master节点重新启动新的任务时,会提示Engine is busy
重现方法:写一个脚本,JMeter分布式压测,Master节点上直接kill调jmeter进程,Slave节点jmeter-server.log会输出如下日志
2021-07-07 22:29:14,046 ERROR o.a.j.t.RemoteThreadsListenerWrapper: Exception invoking listener on threadFinished. java.rmi.ConnectException: Connection refused to host: 172.30.0.99; nested exception is: java.net.ConnectException: Connection refused (Connection refused) at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:623) ~[?:1.8.0_275] at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:216) ~[?:1.8.0_275] at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:202) ~[?:1.8.0_275] at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:132) ~[?:1.8.0_275] at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:235) ~[?:1.8.0_275] at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:180) ~[?:1.8.0_275] at com.sun.proxy.$Proxy21.threadFinished(Unknown Source) ~[?:?] at org.apache.jmeter.threads.RemoteThreadsListenerWrapper.threadFinished(RemoteThreadsListenerWrapper.java:60) [ApacheJMeter_core.jar:5.4] at org.apache.jmeter.threads.JMeterThread$ThreadListenerTraverser.addNode(JMeterThread.java:787) [ApacheJMeter_core.jar:5.4] at org.apache.jorphan.collections.HashTree.traverseInto(HashTree.java:993) [jorphan.jar:5.4] at org.apache.jorphan.collections.HashTree.traverse(HashTree.java:976) [jorphan.jar:5.4] at org.apache.jmeter.threads.JMeterThread.threadFinished(JMeterThread.java:754) [ApacheJMeter_core.jar:5.4] at org.apache.jmeter.threads.JMeterThread.run(JMeterThread.java:328) [ApacheJMeter_core.jar:5.4] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275] Caused by: java.net.ConnectException: Connection refused (Connection refused) at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_275] at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_275] at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_275] at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_275] at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_275] at java.net.Socket.connect(Socket.java:607) ~[?:1.8.0_275] at java.net.Socket.connect(Socket.java:556) ~[?:1.8.0_275] at java.net.Socket.(Socket.java:452) ~[?:1.8.0_275] at java.net.Socket.(Socket.java:229) ~[?:1.8.0_275] at sun.rmi.transport.proxy.RMIDirectSocketFactory.createSocket(RMIDirectSocketFactory.java:40) ~[?:1.8.0_275] at sun.rmi.transport.proxy.RMIMasterSocketFactory.createSocket(RMIMasterSocketFactory.java:148) ~[?:1.8.0_275] at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:617) ~[?:1.8.0_275] ... 13 more
此时假如你不清楚此事,开启一个新的压测任务
Configuring remote engine: 172.30.0.14 Configuring remote engine: 172.30.0.202 Starting distributed test with remote engines: [172.30.0.202, 172.30.0.14] @ Wed Jul 07 22:31:55 CST 2021 (1625650315999) Engine is busy - please try later Engine is busy - please try later Remote engines have been started:[] Waiting for possible Shutdown/StopTestNow/HeapDump/ThreadDump message on port 4445
压测任务直接结束,如果是平台上操作,会一脸懵逼,其实系统早已经异常了
原因描述:前面一个压测任务没正常结束,Slave节点线程并未退出,而Slave节点上压测Engine绑定IP和PORT,因此无法启动新的Engine
具体JMeter源码报错部分
/**
* Adds a feature to the ThreadGroup attribute of the RemoteJMeterEngineImpl
* object.
*
* @param testTree
* the feature to be added to the ThreadGroup attribute
* @param hostAndPort Host and Port
* @param jmxBase JMX base
* @param scriptName Name of script
*/
@Override
public void rconfigure(HashTree testTree, String hostAndPort, File jmxBase, String scriptName) throws RemoteException {
log.info("Creating JMeter engine on host {} base '{}'", hostAndPort, jmxBase);
try {
if (log.isInfoEnabled()) {
log.info("Remote client host: {}", getClientHost());
}
} catch (ServerNotActiveException e) {
// ignored
}
synchronized(LOCK) { // close window where another remote client might jump in
if (backingEngine != null && backingEngine.isActive()) {
log.warn("Engine is busy - cannot create JMeter engine");
throw new IllegalStateException("Engine is busy - please try later");
}
ownerThread = Thread.currentThread();
JMeterUtils.setProperty(JMeterUtils.THREAD_GROUP_DISTRIBUTED_PREFIX_PROPERTY_NAME, hostAndPort);
backingEngine = new StandardJMeterEngine(hostAndPort);
backingEngine.configure(testTree); // sets active = true
}
FileServer.getFileServer().setScriptName(scriptName);
FileServer.getFileServer().setBase(jmxBase);
}
这个就是RMI调用的具体压测前的一些状态变更,比如异常的条件Engine存在,且是Active的时候,就会抛出is busy;正常流程就会走到下面起一个线程执行压测脚本,并且在configure方法里将isActive置为true,表示正在压测中;
而对于当前异常场景,显然engine是存在的,那么校验一下backingEngine.isActive()是否为false
active在notifyTestListenersOfEnd()方法里置为false
notifyTestListenersOfEnd()方法是在StandardJMeterEngine的run()方法压测结束前调用
可见原因就是,未正常结束,该字段为true,然后新的任务起来之后,发现该slave节点上engine存在且active,因此无法启动新的engine
解决方法:待续
2、Slave节点JMeter-Server所在的机器网络突然异常,导致节点都是启用状态,但压测的节点只有其中一个
重现方法:这个问题比较直接,如果是自己的测试机,直接将网卡down调即可,当然要保证还有其它方法再up起来,比如console操作;应该不需要重现也很清除问题
原因描述:网络正常的时候,两个slave节点都是启用状态,但其中一个slave节点网络异常,此时master并未察觉,在进行压测任务的时候,还是按2个slave节点进行分布式压测,导致压力机减少了一半,原因还是平台不够完善,导致节点状态未同步更新
解决方法:考虑主从心跳检测(待续)
3、Slave节点JMeter-Server启动了两个进程
重现方法:这个问题是平台不健壮的问题,只要一个slave节点连续启动jmeter-server进程两次即可,第二次进程依旧是可以起来的
原因描述:具体出现启动两次的原因比较奇怪,因为平台定时任务的时候,是先关闭在启动,启动的时候已经判断进程不存在了,可是不知道为啥重新启动jmeter-server会起来两个
Master节点,平台打印日志,这时候没有进程
2021-07-12 00:30:01.639, TID: N/A, INFO, c.p.n.c.u.SSHUtils, 85, execCommand 执行命令: ps aux | grep jmeter-server | grep -v grep, 返回结果为: null
Slave节点端口已占用
2021-07-12 00:30:03,448 WARN o.a.j.e.RemoteJMeterEngineImpl: Problem creating registry: java.rmi.server.ExportException: Port already in use: 1099; nested exception is: java.net.BindException: 地址已在使用 (Bind failed)
抛出异常的源码在
private void init() throws RemoteException {
log.info("Starting backing engine on {}", this.rmiRegistryPort);
InetAddress localHost = RmiUtils.getRmiHost();
if (localHost.isSiteLocalAddress()){
// should perhaps be log.warn, but this causes the client-server test to fail
log.info("IP address is a site-local address; this may cause problems with remote access.\n"
+ "\tCan be overridden by defining the system property 'java.rmi.server.hostname' - see jmeter-server script file");
}
log.debug("This = {}", this);
Registry reg = null;
if (CREATE_SERVER){
log.info("Creating RMI registry (server.rmi.create=true)");
try {
reg = LocateRegistry.createRegistry(this.rmiRegistryPort,
RmiUtils.createClientSocketFactory(),
RmiUtils.createServerSocketFactory());
log.debug("Created registry: {}", reg);
} catch (RemoteException e){
String msg="Problem creating registry: "+e;
log.warn(msg);
System.err.println(msg);
System.err.println("Continuing...");
}
}
具体是在RemoteJMeterEngineImpl.startServer(RmiUtils.getRmiRegistryPort()); 也就是server启动时,engine调用init()方法里报错
解决方法:考虑server启动和关闭的完善(待续)
4、Master节点出现两个JMeter进程,第一个无法正常退出
重现方法:这也是平台不健壮的问题,因为如果手动执行jmeter是可以执行多个的;真实场景是JMeter脚本里启动的线程数非常多,遇到的一个用例,一个脚本里起了5000+线程,超过了单节点的资源能力,最终导致脚本ThreadGroup到达上限了,也无法正常退出,手动这个脚本也无法退出,平台直接调用shutdown.sh也无法结束,导致资源占用,影响压测
有如下异常日志
原因描述:一个JMeter脚本在单节点执行,起5000+线程,我已经修改了linux用户最大线程数,挂起无法退出,导致平台结束用例失败,进程还残留,影响压测
解决方法:网上说配置jmeterengine.force.system.exit=true,我不太倾向于这种解决方法,因为如果有异常也会强制结束;但是也不能直接杀进程,这样会导致slave节点线程残留,再考虑下解决办法吧(待续)
5、
待续……