<p>vow's personal blog "也曾鲜衣怒马年少时" — https://www.louxiaohui.com/feed.xml</p>
<h1>Why are some nodes' ports unreachable when a Service is of type NodePort? (2023-08-06)</h1>
<h2 id="问题">Problem</h2>
<p>Testers recently reported that a service in the test environment, exposed externally on port 8081, was only reachable through some of the cluster's hosts. The service is exposed through a Kubernetes Service of type NodePort, so in theory it should be reachable through any host in the cluster.</p>
<p>In practice, 40.5 was reachable but 40.4 was not.
<img src="https://s2.loli.net/2023/08/06/pLqYf9WP1SubElj.png" alt="svc-nodeport-telnet.png" /></p>
<h2 id="排查">Troubleshooting</h2>
<p>When a Service exposes a port this way, traffic arriving on the node port is forwarded by the kube-proxy to the backend pods' targetPort, and from there reaches the containers. If the port is unreachable on some hosts, the traffic is not being forwarded to a target pod.</p>
<ul>
<li>1. Checked the firewall rules — there are no restrictions on port 8081</li>
<li>2. Checked the unreachable nodes — their status is normal</li>
<li>3. Looked further at the pods and found that the four reachable IPs, 40.5–40.8, are exactly the hosts the pods run on, suggesting the issue is related to the Service's forwarding policy</li>
</ul>
<p><img src="https://s2.loli.net/2023/08/06/53rbfOqAElQRnHY.png" alt="svc-nodeport-get-pod.png" /></p>
<p>Running <code class="language-plaintext highlighter-rouge">kubectl get svc service-name -n rtmp -o yaml</code> shows the following Service configuration:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: v1
kind: Service
metadata:
annotations:
prometheus.io/app-metrics-project: rtmp
labels:
name: srs-service-name
namespace: rtmp
spec:
clusterIP: 10.99.99.99
clusterIPs:
- 10.99.99.99
externalTrafficPolicy: Local
ipFamilies:
- IPv4
ipFamilyPolicy: SingleStack
ports:
- name: srs-service-name
nodePort: 8081
port: 1935
protocol: TCP
targetPort: 1935
sessionAffinity: None
<span class="nb">type</span>: NodePort
status:
loadBalancer: <span class="o">{}</span>
</code></pre></div></div>
<p>Note that externalTrafficPolicy is set to Local. This field controls how traffic arriving from outside the cluster is routed: with the default value Cluster, any node forwards the traffic on to a backend pod, possibly on another node; with Local, a node only forwards traffic to pods running on that same node, so nodes without a local pod simply do not answer on the NodePort.</p>
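<p>The routing behaviour above can be sketched as a toy model (this is only an illustration, not how kube-proxy is implemented; the node IPs follow the example in this post):</p>

```python
# Toy model: which nodes answer on a NodePort under each externalTrafficPolicy.
def reachable_nodes(nodes, pod_nodes, policy):
    if policy == "Cluster":        # default: every node forwards, possibly cross-node
        return set(nodes) if pod_nodes else set()
    if policy == "Local":          # only nodes that run a pod themselves answer
        return set(nodes) & set(pod_nodes)
    raise ValueError(f"unknown policy: {policy}")

nodes = ["40.4", "40.5", "40.6", "40.7", "40.8"]
pods_on = ["40.5", "40.6", "40.7", "40.8"]   # pods run on these hosts only

print(sorted(reachable_nodes(nodes, pods_on, "Local")))    # 40.4 is missing
print(sorted(reachable_nodes(nodes, pods_on, "Cluster")))  # every node answers
```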
<h2 id="解决">Solution</h2>
<p>Change the number of deployed replicas so that every node in the cluster runs a Pod. (Alternatively, switching externalTrafficPolicy to Cluster would also make every node answer, at the cost of an extra hop and losing the client source IP.)</p>
<h2 id="ref">REF</h2>
<p><a href="https://blog.csdn.net/agonie201218/article/details/122215040">What is external-traffic-policy in K8s?</a></p>
<h1>Pre-installed Android app cannot play live streams (2022-10-11)</h1>
<h2 id="问题描述">Problem</h2>
<p>A user reported failing to open a live-stream room with the pre-installed app. In the backend monitoring system, filtering the tracking logs by the customer ID for the relevant time window gave:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Client IP: 153.153.153.207
Node IP:   124.198.198.143
Province:  Jiangsu
ISP:       China Unicom
URL:       https://pull.test.com/live/6666666.flv
</code></pre></div></div>
<h2 id="排查">Troubleshooting</h2>
<p>1. Bound the host entry and played the stream in VLC — it worked fine</p>
<blockquote>
<p>124.198.198.143 pull.test.com</p>
</blockquote>
<p>2. Checked the third-party Bonree synthetic monitoring — availability is normal</p>
<p>3. Checked overall bandwidth and status codes for the Jiangsu region — normal</p>
<p>These three points largely rule out a problem with the serving node, so the issue is most likely an isolated case. We later learned that playback worked fine for this customer with the latest version downloaded from the app store, further suggesting the node was healthy — though a server-side compatibility problem could not be ruled out. To dig further, we asked the user to capture the failing traffic.</p>
<p>4. Analyzed the capture with Wireshark</p>
<ul>
<li>Filter out the HTTPS packets for pull.test.com with:</li>
</ul>
<blockquote>
<p>tls.handshake.extensions_server_name == "pull.test.com"</p>
</blockquote>
<p><img src="https://s2.loli.net/2022/12/04/6rh31te8YfyqIMg.png" alt="pre-installed-app-play-failed-1.png" /></p>
<ul>
<li>Follow the TCP stream of the first packet, number 50</li>
</ul>
<p><img src="https://s2.loli.net/2022/12/04/kRXBjUSwmINe8Hd.png" alt="pre-installed-app-play-failed-2.png" /></p>
<ul>
<li>Stream details</li>
</ul>
<p>The three-way handshake completes normally, so the network itself is fine. Right after the client sends its Client Hello, the server raises an alert and then actively closes the connection. The alert is: <code class="language-plaintext highlighter-rouge">Description: Protocol Version (70)</code>.</p>
<p><img src="https://s2.loli.net/2022/12/04/YOEWeT5ALo74ju6.png" alt="pre-installed-app-play-failed-3.png" /></p>
<p>The RFC, <a href="https://www.rfc-editor.org/rfc/rfc5246#section-7.2.2">rfc5246#section-7.2.2</a>, describes this alert as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The protocol version the client has attempted to negotiate is recognized but not supported.
(For example, old protocol versions might be avoided for security reasons.) This message is
always fatal.
</code></pre></div></div>
<p>In short: the protocol version the client tried to negotiate is recognized but not supported — for example, old protocol versions may be refused for security reasons.</p>
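<p>For reference, the alert's numeric code can be looked up in the RFC 5246 alert table; a partial mapping (only a handful of the defined codes) looks like:</p>

```python
# A few of the TLS alert description codes defined in RFC 5246, section 7.2.
TLS_ALERT_DESCRIPTIONS = {
    0: "close_notify",
    40: "handshake_failure",
    42: "bad_certificate",
    46: "certificate_unknown",
    70: "protocol_version",   # the alert seen in this capture
}

print(TLS_ALERT_DESCRIPTIONS[70])  # protocol_version
```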
<ul>
<li>The TLS version in the client's request is 1.0</li>
</ul>
<p><img src="https://s2.loli.net/2022/12/04/Lq2fhmk6HDScFso.png" alt="pre-installed-app-play-failed-4.png" /></p>
<p>Putting this together, the server almost certainly does not support TLS 1.0.</p>
<p>5. Verify whether the server supports TLS 1.0</p>
<p>Testing with curl and an explicit protocol version reproduces the "unsupported version" error from the peer:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-vo</span> /dev/null https://pull.test.com/live/6666666.flv <span class="nt">--resolve</span> pull.test.com:443:124.198.198.143 <span class="nt">--tlsv1</span>.0
<span class="k">*</span> Added pull.test.com:443:124.198.198.143 to DNS cache
<span class="k">*</span> About to connect<span class="o">()</span> to pull.test.com port 443 <span class="o">(</span><span class="c">#0)</span>
<span class="k">*</span> Trying 124.198.198.143...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 <span class="nt">--</span>:--:-- <span class="nt">--</span>:--:-- <span class="nt">--</span>:--:-- 0<span class="k">*</span> Connected to pull.test.com <span class="o">(</span>124.198.198.143<span class="o">)</span> port 443 <span class="o">(</span><span class="c">#0)</span>
<span class="k">*</span> Initializing NSS with certpath: sql:/etc/pki/nssdb
<span class="k">*</span> CAfile: /etc/pki/tls/certs/ca-bundle.crt
CApath: none
0 0 0 0 0 0 0 0 <span class="nt">--</span>:--:-- <span class="nt">--</span>:--:-- <span class="nt">--</span>:--:-- 0<span class="k">*</span> NSS error <span class="nt">-12190</span> <span class="o">(</span>SSL_ERROR_PROTOCOL_VERSION_ALERT<span class="o">)</span>
<span class="k">*</span> Peer reports incompatible or unsupported protocol version.
0 0 0 0 0 0 0 0 <span class="nt">--</span>:--:-- <span class="nt">--</span>:--:-- <span class="nt">--</span>:--:-- 0
<span class="k">*</span> Closing connection 0
curl: <span class="o">(</span>35<span class="o">)</span> Peer reports incompatible or unsupported protocol version.
</code></pre></div></div>
<p>6. Contacted the third-party CDN (via a WeChat group) to confirm whether they support TLS 1.0</p>
<p>The vendor confirmed that, for security reasons, TLS 1.0 and TLS 1.1 are disabled. They started an emergency configuration change for this, but with approval and rollout it would take roughly two hours.</p>
<p>7. Had the user play with the latest app version while capturing</p>
<p>The capture shows the protocol in use is TLS 1.2:</p>
<p><img src="https://s2.loli.net/2022/12/04/hvoeVTQrc5YFkb4.png" alt="pre-installed-app-play-failed-5.png" /></p>
<h2 id="解决">Solution</h2>
<p>Asked the third-party CDN to enable TLS 1.0 and TLS 1.1 support.</p>
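<p>For comparison, this is how a server built on Python's standard-library ssl module would choose between the two policies (a sketch only — the CDN's actual configuration is of course not Python):</p>

```python
import ssl

# Strict context: rejects TLS 1.0/1.1 clients -- the CDN's original behaviour.
strict = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
strict.minimum_version = ssl.TLSVersion.TLSv1_2

# Legacy-friendly context: also accepts TLS 1.0 clients -- the emergency change.
legacy = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
legacy.minimum_version = ssl.TLSVersion.TLSv1

print(strict.minimum_version.name)  # TLSv1_2
print(legacy.minimum_version.name)  # TLSv1
```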
<h2 id="ref">REF</h2>
<p><a href="https://github.com/halfrost/Halfrost-Field/blob/master/contents/Protocol/TLS_1.3_Alert_Protocol.md">TLS 1.3 Alert Protocol</a></p>
<p><a href="http://kipway.com/kipway_TSL.html">The TLS Protocol V1.0 - RFC 2246 (Chinese translation)</a></p>
<h1>Images in GitHub markdown documents fail to display (2022-07-12)</h1>
<p>Images embedded via external links in a GitHub markdown document do not display, while opening the image URLs directly works fine.</p>
<p>GitHub's preview shows a broken-image icon, as below:</p>
<p><a href="https://imgse.com/i/vWvhQK"><img src="https://s1.ax1x.com/2022/08/29/vWvhQK.png" alt="vWvhQK.png" /></a></p>
<h2 id="分析">Analysis</h2>
<p>GitHub serves images through its Camo proxy, and Camo only supports image <code class="language-plaintext highlighter-rouge">Content-Type</code>s; see <a href="https://github.com/atmos/camo/blob/master/mime-types.json">the list of types supported by Camo</a>.</p>
<p>Check the <code class="language-plaintext highlighter-rouge">Content-Type</code> returned by the image URL with <code class="language-plaintext highlighter-rouge">curl</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># curl -i https://img1.test.com/cn/1.server-manager.jpg</span>
HTTP/1.1 302 Moved Temporarily
Date: Tue, 12 Jul 2022 19:26:05 GMT
Content-Type: text/html
Content-Length: 138
Connection: keep-alive
Server: nginx
Location: https://img1.test.com/cn/1.server-manager.jpg
X-Trace: 302-1657653965433-0-0-0-0-0
Strict-Transport-Security: max-age<span class="o">=</span>3600
X-Via: 1.1 hx172:1 <span class="o">(</span>Cdn Cache Server V2.0<span class="o">)</span>, 1.1 am20:10 <span class="o">(</span>Cdn Cache Server V2.0<span class="o">)</span>
X-Ws-Request-Id: 62cdcacd_PSmgmamMIA2dr149_2157-48853
</code></pre></div></div>
<p>The returned <code class="language-plaintext highlighter-rouge">Content-Type</code> is <code class="language-plaintext highlighter-rouge">text/html</code>, which Camo does not support, so the image cannot be displayed.</p>
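<p>Camo's acceptance check can be approximated in a few lines (the type list here is abbreviated for illustration, not the full contents of mime-types.json):</p>

```python
# Abbreviated set of Content-Types that Camo will proxy (see mime-types.json).
CAMO_IMAGE_TYPES = {"image/png", "image/jpeg", "image/gif", "image/svg+xml"}

def camo_accepts(content_type: str) -> bool:
    # Ignore parameters such as "; charset=utf-8" before comparing.
    return content_type.split(";")[0].strip().lower() in CAMO_IMAGE_TYPES

print(camo_accepts("text/html"))   # False - the broken-image case above
print(camo_accepts("image/jpeg"))  # True  - after the CDN fix
```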
<h2 id="解决">Solution</h2>
<p>Since img1.test.com is served through a CDN, we asked the third-party CDN to change the returned <code class="language-plaintext highlighter-rouge">Content-Type</code> header to <code class="language-plaintext highlighter-rouge">image/jpeg</code>. After the change, images in the document display normally.</p>
<h2 id="ref">REF</h2>
<p><a href="https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/about-anonymized-urls">about-anonymized-urls</a></p>
<h1>Illegal HTTP/2 headers cause a blank page on iPhone (2022-05-26)</h1>
<h2 id="故障现象">Symptom</h2>
<p>A user reported that opening a web page in the browser on an iPhone fails with the error "cannot parse response".</p>
<p>URL:</p>
<blockquote>
<p>https://s.test.com/protocols/format/66bd0c66a96d56a8b33b11b664451337.html?_t=1632403876589</p>
</blockquote>
<p><a href="https://imgtu.com/i/XAWrCj"><img src="https://s1.ax1x.com/2022/05/26/XAWrCj.png" alt="XAWrCj.png" /></a></p>
<h2 id="排查">Troubleshooting</h2>
<h3 id="绑定源站测试">Testing against the origin</h3>
<p>Since the domain is accelerated by a CDN, we first bound the origin's IP in the hosts file and tested — the result was normal.</p>
<p><a href="https://imgtu.com/i/Xinmi8"><img src="https://s1.ax1x.com/2022/05/24/Xinmi8.png" alt="Xinmi8.png" /></a></p>
<h3 id="绑定-cdn-节点测试">Testing against a CDN node</h3>
<p>We bound a CDN node's IP on a computer, started a fiddler capture, connected the phone to the computer's hotspot, and opened the reported URL — the problem reproduced.
The capture shows the request protocol is HTTP/2.</p>
<p>Reproduce the request with curl, where 115.230.205.224 is the CDN IP:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -Lvo /dev/null --http2 "https://s.test.com/protocols/format/66bd0c66a96d56a8b33b11b664451337.html?_t=163163163163" --resolve s.test.com:443:115.230.205.224
</code></pre></div></div>
<p><a href="https://imgtu.com/i/XAW6vq"><img src="https://s1.ax1x.com/2022/05/26/XAW6vq.md.png" alt="XAW6vq.md.png" /></a></p>
<p>curl reports <code class="language-plaintext highlighter-rouge">Invalid HTTP header field</code>; the full error is:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">*</span> Connection state changed <span class="o">(</span>MAX_CONCURRENT_STREAMS <span class="o">==</span> 128<span class="o">)!</span>
<span class="k">*</span> http2 error: Invalid HTTP header field was received: frame <span class="nb">type</span>: 1, stream: 1, name: <span class="o">[</span>proxy-connection], value: <span class="o">[</span>keep-alive]
<span class="k">*</span> HTTP/2 stream 0 was not closed cleanly: PROTOCOL_ERROR <span class="o">(</span>err 1<span class="o">)</span>
<span class="k">*</span> stopped the pause stream!
0 0 0 0 0 0 0 0 <span class="nt">--</span>:--:-- <span class="nt">--</span>:--:-- <span class="nt">--</span>:--:-- 0
<span class="k">*</span> Connection <span class="c">#0 to host s.test.com left intact</span>
curl: <span class="o">(</span>92<span class="o">)</span> HTTP/2 stream 0 was not closed cleanly: PROTOCOL_ERROR <span class="o">(</span>err 1<span class="o">)</span>
<span class="k">*</span> Closing connection 0
</code></pre></div></div>
<p>This suggested a compatibility problem between the <code class="language-plaintext highlighter-rouge">proxy-connection</code> header and HTTP/2. The RFC, <a href="https://httpwg.org/specs/rfc7540.html#rfc.section.8.1.2.2">8.1.2.2. Connection-Specific Header Fields</a>, confirms it: HTTP/2 does not use the <code class="language-plaintext highlighter-rouge">Connection</code> header, and any message carrying connection-specific headers — such as Keep-Alive, Proxy-Connection, Transfer-Encoding, and Upgrade — must be treated as malformed. An intermediary converting HTTP/1.x messages to HTTP/2 has to remove these headers.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>HTTP/2 does not use the Connection header field to indicate connection-specific header fields; in this protocol, connection-specific metadata is conveyed by other means. An endpoint MUST NOT generate an HTTP/2 message containing connection-specific header fields; any message containing connection-specific header fields MUST be treated as malformed
......
This means that an intermediary transforming an HTTP/1.x message to HTTP/2 will need to remove any header fields nominated by the Connection header field, along with the Connection header field itself. Such intermediaries SHOULD also remove other connection-specific header fields, such as Keep-Alive, Proxy-Connection, Transfer-Encoding, and Upgrade, even if they are not nominated by the Connection header field.
</code></pre></div></div>
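<p>The removal step the RFC describes can be sketched as a small function (illustrative only — real intermediaries do this inside the proxy, and the sample header names below are made up):</p>

```python
# Drop connection-specific headers when translating HTTP/1.x to HTTP/2
# (RFC 7540, 8.1.2.2), including anything nominated by Connection itself.
HOP_BY_HOP = {"connection", "keep-alive", "proxy-connection",
              "transfer-encoding", "upgrade"}

def strip_connection_headers(headers: dict) -> dict:
    nominated = {t.strip().lower()
                 for t in headers.get("connection", "").split(",") if t.strip()}
    drop = HOP_BY_HOP | nominated
    return {k: v for k, v in headers.items() if k.lower() not in drop}

h1_response = {"content-type": "text/html",
               "proxy-connection": "keep-alive",   # the offending header
               "connection": "x-custom",
               "x-custom": "1"}
print(strip_connection_headers(h1_response))  # {'content-type': 'text/html'}
```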
<h2 id="解决">Solution</h2>
<p>1. Asked the CDN vendor to strip the Proxy-Connection, Keep-Alive, Transfer-Encoding, and Upgrade headers when serving HTTP/2 responses.</p>
<p>2. Asked the CDN vendor to change the configuration so that these headers are not cached.</p>
<h2 id="ref">REF</h2>
<p><a href="https://cloud.tencent.com/developer/article/1754005">Illegal HTTP/2.0 headers cause a blank screen on iPhone</a></p>
<h1>Why did a request sent to another host actually land on the local machine? (2022-03-09)</h1>
<h2 id="问题">Problem</h2>
<p>In our live-streaming relay cluster, nodes pull HLS ts segments from each other and receive transcoded segment streams pushed from an external transcoding cluster. This traffic used to go over the public internet; to cut costs it was moved to a leased line. After configuring the leased-line IPs:</p>
<ul>
<li>relay –&gt; transcoding cluster: mutual access works</li>
<li>relay A –&gt; relay B: pulling between relay nodes fails</li>
</ul>
<p>Host IPs of A and B:</p>
<table>
<thead>
<tr>
<th>A</th>
<th>10.174.20.81</th>
</tr>
</thead>
<tbody>
<tr>
<td>B</td>
<td>10.174.20.83</td>
</tr>
</tbody>
</table>
<p>On host A, we sent a curl request to host B and checked the nginx access logs on both. The request landed on host A, not on host B — contrary to expectations. The curl command and the log are below:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># 指定特殊 UA 头方便过滤日志</span>
curl <span class="nt">-svo</span> /dev/null http://10.174.20.83/monitor.html <span class="nt">-H</span> <span class="s2">"Host: mon.cloud.com"</span> <span class="nt">-A</span> lxhlxhlxh
<span class="c"># Log entry on host A; host B has no matching entries</span>
<span class="c"># 10.174.20.81 is the client IP, 10.174.20.83 the server IP</span>
2022-03-09T02:37:12+08:00 :1646764632.579 :- :10.174.20.81 :10.174.20.83 :GET :http :mon.cloud.com :/monitor.html :/monitor.html :HTTP/1.1 :200 :lxhlxhlxh :END
</code></pre></div></div>
<h2 id="排查">Troubleshooting</h2>
<p>When a request is sent to 10.174.20.83, the kernel first consults the local routing table.</p>
<table>
<thead>
<tr>
<th>Destination</th>
<th>Gateway</th>
<th>Genmask</th>
<th>Use Iface</th>
</tr>
</thead>
<tbody>
<tr>
<td>10.174.20.0</td>
<td>0.0.0.0</td>
<td>255.255.255.0</td>
<td>lo</td>
</tr>
<tr>
<td>0.0.0.0</td>
<td>100.88.88.254</td>
<td>0.0.0.0</td>
<td>eth0</td>
</tr>
</tbody>
</table>
<p>(A Gateway of 0.0.0.0 or * means the destination is on a network this host belongs to, so no routing through a gateway is needed.)</p>
<p>Note: the lo entry above is hypothetical, added to make the explanation easier to follow. Running <code class="language-plaintext highlighter-rouge">route -n</code> on Linux will not actually show the lo interface, since it is not externally visible.</p>
<p>The destination IP is 10.174.20.83 and the local lo interface has IP 10.174.20.81. ANDing (<code class="language-plaintext highlighter-rouge">&</code>) each IP with the netmask yields the network part:</p>
<ul>
<li>
<p>lo interface: 10.174.20.81 & 255.255.255.0 = 10.174.20.0</p>
</li>
<li>
<p>destination IP: 10.174.20.83 & 255.255.255.0 = 10.174.20.0</p>
</li>
</ul>
<p>So the destination IP is in the same subnet as this host. By longest-prefix matching, the packet takes the lo route rather than the final default route via 100.88.88.254. Moreover, the loopback route is the first entry in the table, so even with several matching routes the packet goes out through the loopback interface.</p>
<p>The loopback interface has a defining property: <strong>whatever it receives is delivered back to the local machine — the loopback interface only ever talks to itself</strong>. A packet sent via loopback therefore never leaves the host. To stop these packets from taking the loopback route, the netmask must be changed to 255.255.255.255.</p>
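<p>The subnet-membership test is easy to reproduce with Python's ipaddress module — with a /24 on lo the destination looks locally reachable, with a /32 it does not:</p>

```python
import ipaddress

lo_24 = ipaddress.ip_interface("10.174.20.81/24")
lo_32 = ipaddress.ip_interface("10.174.20.81/32")
dst = ipaddress.ip_address("10.174.20.83")

print(dst in lo_24.network)  # True  -> matches the lo route, stays on this host
print(dst in lo_32.network)  # False -> falls through to the route via eth0
```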
<p>To verify that the packets are indeed delivered to the local machine, ping 10.174.20.83 from 10.174.20.81 while capturing on both hosts.</p>
<ul>
<li>81 pings 83</li>
</ul>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># 使用 -s 指定数据包大小</span>
<span class="c"># ping -s 1000 -c 3 -i 0.5 10.174.20.83 </span>
PING 10.174.20.83 <span class="o">(</span>10.174.20.83<span class="o">)</span> 1000<span class="o">(</span>1028<span class="o">)</span> bytes of data.
1008 bytes from 10.174.20.83: <span class="nv">icmp_seq</span><span class="o">=</span>1 <span class="nv">ttl</span><span class="o">=</span>64 <span class="nb">time</span><span class="o">=</span>0.059 ms
1008 bytes from 10.174.20.83: <span class="nv">icmp_seq</span><span class="o">=</span>2 <span class="nv">ttl</span><span class="o">=</span>64 <span class="nb">time</span><span class="o">=</span>0.050 ms
1008 bytes from 10.174.20.83: <span class="nv">icmp_seq</span><span class="o">=</span>3 <span class="nv">ttl</span><span class="o">=</span>64 <span class="nb">time</span><span class="o">=</span>0.050 ms
<span class="nt">---</span> 10.174.20.83 ping statistics <span class="nt">---</span>
3 packets transmitted, 3 received, 0% packet loss, <span class="nb">time </span>1003ms
rtt min/avg/max/mdev <span class="o">=</span> 0.050/0.053/0.059/0.004 ms
</code></pre></div></div>
<ul>
<li>Capture on 81</li>
</ul>
<p>The echo replies are visible:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># tcpdump -i lo:6 -s0 -nv '((icmp) and (dst host 10.174.20.81))' </span>
tcpdump: listening on lo:6, link-type EN10MB <span class="o">(</span>Ethernet<span class="o">)</span>, capture size 65535 bytes
23:27:13.156396 IP <span class="o">(</span>tos 0x0, ttl 64, <span class="nb">id </span>27424, offset 0, flags <span class="o">[</span>none], proto ICMP <span class="o">(</span>1<span class="o">)</span>, length 1028<span class="o">)</span>
10.174.20.83 <span class="o">></span> 10.174.20.81: ICMP <span class="nb">echo </span>reply, <span class="nb">id </span>63337, <span class="nb">seq </span>1, length 1008
23:27:13.657089 IP <span class="o">(</span>tos 0x0, ttl 64, <span class="nb">id </span>27861, offset 0, flags <span class="o">[</span>none], proto ICMP <span class="o">(</span>1<span class="o">)</span>, length 1028<span class="o">)</span>
10.174.20.83 <span class="o">></span> 10.174.20.81: ICMP <span class="nb">echo </span>reply, <span class="nb">id </span>63337, <span class="nb">seq </span>2, length 1008
23:27:14.160123 IP <span class="o">(</span>tos 0x0, ttl 64, <span class="nb">id </span>28266, offset 0, flags <span class="o">[</span>none], proto ICMP <span class="o">(</span>1<span class="o">)</span>, length 1028<span class="o">)</span>
10.174.20.83 <span class="o">></span> 10.174.20.81: ICMP <span class="nb">echo </span>reply, <span class="nb">id </span>63337, <span class="nb">seq </span>3, length 1008
</code></pre></div></div>
<ul>
<li>Capture on 83</li>
</ul>
<p>No packets at all:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># tcpdump -i eth0 -s0 -nvvv '((icmp) and (dst host 10.174.20.81))' </span>
tcpdump: WARNING: eth0: no IPv4 address assigned
tcpdump: listening on eth0, link-type EN10MB <span class="o">(</span>Ethernet<span class="o">)</span>, capture size 65535 bytes
</code></pre></div></div>
<h2 id="解决">Solution</h2>
<p>1. Change the netmask to 32 bits</p>
<blockquote>
<p>sed -i 's#255.255.255.0#255.255.255.255#g' /etc/sysconfig/network-scripts/ifcfg-lo:6</p>
</blockquote>
<p>After restarting the interface, pinging 10.174.20.83 failed outright. The routing table turned out to have no route for the 10.174.20.0/24 subnet; adding the following policy route restored connectivity:</p>
<blockquote>
<p>ip route add 10.174.20.0/24 via 100.88.88.254 dev eth0 src 10.174.20.81</p>
</blockquote>
<p>The route also needs to be written to the configuration file so it survives a reboot:</p>
<blockquote>
<p>echo '10.174.20.0/24 via 100.88.88.254 src 10.174.20.81' &gt;&gt; /etc/sysconfig/network-scripts/route-eth0</p>
</blockquote>
<h2 id="ref">REF</h2>
<p><a href="https://blog.csdn.net/SmallCatBaby/article/details/89876508">Why is the netmask four 255s when the VIP is configured on the loopback interface in LVS?</a></p>
<h1>Why the process priority value differs between top and ps (2021-09-05)</h1>
<h2 id="现象">Symptom</h2>
<p>Check the priority of the <code class="language-plaintext highlighter-rouge">nginx</code> process (PID 19910) with both <code class="language-plaintext highlighter-rouge">top</code> and <code class="language-plaintext highlighter-rouge">ps</code>.</p>
<p>1. In <code class="language-plaintext highlighter-rouge">top</code> the value is <code class="language-plaintext highlighter-rouge">20</code>:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>top <span class="nt">-b</span> <span class="nt">-n</span> 1 <span class="nt">-p</span> 19910
top - 09:57:13 up 435 days, 19 min, 2 <span class="nb">users</span>, load average: 0.00, 0.00, 0.00
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu<span class="o">(</span>s<span class="o">)</span>: 0.1 us, 0.1 sy, 0.0 ni, 99.9 <span class="nb">id</span>, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 8175360 total, 4617480 used, 3557880 free, 167444 buffers
KiB Swap: 8384508 total, 10408 used, 8374100 free. 3955108 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19910 root 20 0 85948 2944 1784 S 0.0 0.0 0:00.00 nginx
</code></pre></div></div>
<p>2. In <code class="language-plaintext highlighter-rouge">ps -eo pri</code> the value is <code class="language-plaintext highlighter-rouge">19</code>:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ps <span class="nt">-eo</span> pid,ppid,ni,pri,psr,pcpu,stat,cmd |head <span class="nt">-n1</span><span class="p">;</span>ps <span class="nt">-eo</span> pid,ppid,ni,pri,psr,pcpu,stat,cmd |grep nginx
PID PPID NI PRI PSR %CPU STAT CMD
19910 1 0 19 0 0.0 Ss nginx: master process /usr/sbin/nginx
19911 19910 0 19 2 0.0 S nginx: worker process
19912 19910 0 19 1 0.0 S nginx: worker process
19914 19910 0 19 1 0.0 S nginx: worker process
19915 19910 0 19 0 0.0 S nginx: worker process
</code></pre></div></div>
<p>3. In <code class="language-plaintext highlighter-rouge">ps -elf</code> the value is <code class="language-plaintext highlighter-rouge">80</code>:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ps <span class="nt">-elf</span> |head <span class="nt">-n1</span><span class="p">;</span>ps <span class="nt">-elf</span> |grep <span class="o">[</span>n]ginx
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
5 S root 19910 1 0 80 0 - 34452 sigsus Jul29 ? 00:00:00 nginx: master process /usr/sbin/nginx
5 S www-data 25732 19910 0 80 0 - 34485 ep_pol Aug31 ? 00:00:22 nginx: worker process
5 S www-data 25733 19910 0 80 0 - 34555 ep_pol Aug31 ? 00:00:23 nginx: worker process
5 S www-data 25734 19910 0 80 0 - 34452 ep_pol Aug31 ? 00:00:24 nginx: worker process
5 S www-data 25735 19910 0 80 0 - 34452 ep_pol Aug31 ? 00:00:23 nginx: worker process
</code></pre></div></div>
<h2 id="ps-中的-priority">priority in ps</h2>
<p>ps knows the following <code class="language-plaintext highlighter-rouge">priority</code> variants. priority is the raw value — column 18 of <code class="language-plaintext highlighter-rouge">/proc/[pid]/stat</code> — and all the others are computed from it. The smaller the value, the higher the priority.</p>
<table>
<thead>
<tr>
<th>option</th>
<th>calculation</th>
<th>rt range</th>
<th>nrt range</th>
<th>notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>pri</td>
<td>39-priority</td>
<td>41 to 139</td>
<td>0 to 39</td>
<td> </td>
</tr>
<tr>
<td>priority</td>
<td>priority</td>
<td>-100 to -2</td>
<td>-39 to 0</td>
<td>raw value</td>
</tr>
<tr>
<td>opri</td>
<td>60+priority</td>
<td>-40 to 58</td>
<td>60 to 99</td>
<td> </td>
</tr>
<tr>
<td>pri_api</td>
<td>-1 - priority</td>
<td>1 to 99</td>
<td>-40 to -1</td>
<td>correct for RT</td>
</tr>
<tr>
<td>pri_bar</td>
<td>priority + 1</td>
<td>-99 to -1</td>
<td>1 to 40</td>
<td> </td>
</tr>
<tr>
<td>pri_baz</td>
<td>priority + 100</td>
<td>1 to 99</td>
<td>100 to 139</td>
<td>internal kernel priority</td>
</tr>
<tr>
<td>pri_foo</td>
<td>priority-20</td>
<td>-120 to -21</td>
<td>-20 to 19</td>
<td>correct for NRT</td>
</tr>
</tbody>
</table>
<ul>
<li>
<p>Real-time Tasks</p>
<ul>
<li>A real-time task has timing constraints: it has a deadline and must finish before it.</li>
<li>It has no nice value.</li>
<li>Real-time tasks are always scheduled ahead of all non-real-time tasks.</li>
</ul>
</li>
<li>
<p>Non-Real-time Tasks</p>
<ul>
<li>A non-real-time task has no timing constraints — there is no deadline it must meet.</li>
<li>It has a nice value.</li>
<li>This is the scheduling class ordinary processes run in.</li>
</ul>
</li>
</ul>
<p>The following ps command prints each priority variant for the nginx process:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ps <span class="nt">-o</span> pid,nice,pri,priority,opri,pri_api,pri_bar,pri_baz,pri_foo 19910
PID NI PRI PRI PRI API BAR BAZ FOO
19910 0 19 20 80 <span class="nt">-21</span> 21 120 0
</code></pre></div></div>
<h2 id="分析">分析</h2>
<ul>
<li>top</li>
</ul>
<p>top prints the raw <code class="language-plaintext highlighter-rouge">priority</code> value, 20, which can be read directly with:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">awk</span> <span class="s1">'{print $18}'</span> /proc/19910/stat
20
</code></pre></div></div>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// legal as UNIX <span class="s2">"PRI"</span>
// <span class="s2">"priority"</span> <span class="o">(</span>was <span class="nt">-20</span>..20, now <span class="nt">-100</span>..39<span class="o">)</span>
static int pr_priority<span class="o">(</span>char <span class="k">*</span>restrict const outbuf, const proc_t <span class="k">*</span>restrict const pp<span class="o">){</span> /<span class="k">*</span> <span class="nt">-20</span>..20 <span class="k">*</span>/
<span class="k">return </span>snprintf<span class="o">(</span>outbuf, COLWID, <span class="s2">"%ld"</span>, pp->priority<span class="o">)</span><span class="p">;</span>
<span class="o">}</span>
</code></pre></div></div>
<ul>
<li>ps -eo pri</li>
</ul>
<p><code class="language-plaintext highlighter-rouge">pri</code> is 39 minus <code class="language-plaintext highlighter-rouge">priority</code>:</p>
<blockquote>
<p>pri = 39 - priority = 39 - 20 = 19</p>
</blockquote>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// not legal as UNIX <span class="s2">"PRI"</span>
// <span class="s2">"pri"</span> <span class="o">(</span>was 20..60, now 0..139<span class="o">)</span>
static int pr_pri<span class="o">(</span>char <span class="k">*</span>restrict const outbuf, const proc_t <span class="k">*</span>restrict const pp<span class="o">){</span> /<span class="k">*</span> 20..60 <span class="k">*</span>/
<span class="k">return </span>snprintf<span class="o">(</span>outbuf, COLWID, <span class="s2">"%ld"</span>, 39 - pp->priority<span class="o">)</span><span class="p">;</span>
</code></pre></div></div>
<ul>
<li><code class="language-plaintext highlighter-rouge">ps -elf</code></li>
</ul>
<p>The <code class="language-plaintext highlighter-rouge">-l</code> option shows the <code class="language-plaintext highlighter-rouge">opri</code> value by default, which is <code class="language-plaintext highlighter-rouge">priority</code> plus 60:</p>
<blockquote>
<p>opri = priority + 60 = 20 + 60 = 80</p>
</blockquote>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// legal as UNIX <span class="s2">"PRI"</span>
// <span class="s2">"intpri"</span> and <span class="s2">"opri"</span> <span class="o">(</span>was 39..79, now <span class="nt">-40</span>..99<span class="o">)</span>
static int pr_opri<span class="o">(</span>char <span class="k">*</span>restrict const outbuf, const proc_t <span class="k">*</span>restrict const pp<span class="o">){</span> /<span class="k">*</span> 39..79 <span class="k">*</span>/
<span class="k">return </span>snprintf<span class="o">(</span>outbuf, COLWID, <span class="s2">"%ld"</span>, 60 + pp->priority<span class="o">)</span><span class="p">;</span>
<span class="o">}</span>
</code></pre></div></div>
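<p>The three conversions analyzed above can be checked directly against the raw value from /proc/19910/stat:</p>

```python
# ps's different views of the same raw priority (field 18 of /proc/<pid>/stat).
def ps_views(priority: int) -> dict:
    return {
        "pri": 39 - priority,       # ps -o pri
        "opri": 60 + priority,      # ps -l (the PRI column)
        "pri_baz": priority + 100,  # internal kernel priority
    }

print(ps_views(20))  # {'pri': 19, 'opri': 80, 'pri_baz': 120}
```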
<h2 id="ref">REF</h2>
<p><a href="https://unix.stackexchange.com/questions/613717/commands-top-and-ps-show-different-values-for-priority-why#comment1146281_613727">Commands ‘top’ and ‘ps’ show different values for priority - why?</a></p>
<p><a href="https://gitlab.com/procps-ng/procps/-/issues/111">ps priority(‘pri) different from value shown by top</a></p>
<p><a href="https://gitlab.com/procps-ng/procps/-/blob/master/ps/output.c">ps/output.c</a></p>
<p><a href="https://www.geeksforgeeks.org/difference-between-real-time-tasks-and-non-real-time-tasks/">difference-between-real-time-tasks-and-non-real-time-tasks</a></p>
<p><a href="https://www.zhihu.com/question/272086181">How is the PRI value shown by ps related to prio in task_struct?</a></p>
<h1>A case of excessive file-handle usage on a business machine (2021-08-08)</h1>
<h3 id="现象">Symptom</h3>
<p>Monitoring showed that file-handle usage on a business machine exceeded 15000.</p>
<p><img src="https://i.loli.net/2021/08/08/p6Un14HRtbEPCdz.png" alt="fd-used-too-high.png" /></p>
<h3 id="排查">Troubleshooting</h3>
<p>Log in to the machine and run the following command to find the process with the most open files:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>lsof <span class="nt">-n</span> |awk <span class="s1">'{num[$2]++}END{for(i in num) print num[i],i}'</span> |sort <span class="nt">-nr</span> |head
</code></pre></div></div>
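<p>The per-PID counting that the lsof | awk pipeline does can be written equivalently in Python (the sample lines below are shortened and made up for illustration):</p>

```python
from collections import Counter

# Count open-file lines per PID (column 2 of lsof output), busiest first.
lsof_output = """\
appserver 11342 root 224u IPv4 2940079 0t0 TCP a:9702->b:https (CLOSE_WAIT)
appserver 11342 root 225u IPv4 2934562 0t0 TCP a:55478->c:https (CLOSE_WAIT)
nginx 19910 root 5u IPv4 123 0t0 TCP *:80 (LISTEN)
"""

counts = Counter(line.split()[1] for line in lsof_output.splitlines())
for pid, n in counts.most_common():
    print(n, pid)
```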
<p>The process turns out to be PID <code class="language-plaintext highlighter-rouge">11342</code>. Running <code class="language-plaintext highlighter-rouge">lsof -p 11342</code> to inspect its handles shows a large number of TCP connections in the <code class="language-plaintext highlighter-rouge">CLOSE_WAIT</code> state:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
appserver 11342 root 224u IPv4 2940079 0t0 TCP 2zqw0uq5rrsZ:9702->192.168.121.44:https <span class="o">(</span>CLOSE_WAIT<span class="o">)</span>
appserver 11342 root 225u IPv4 2934562 0t0 TCP 2zqw0uq5rrsZ:55478->192.168.120.181:https <span class="o">(</span>CLOSE_WAIT<span class="o">)</span>
appserver 11342 root 226u IPv4 2946272 0t0 TCP 2zqw0uq5rrsZ:9856->192.168.121.44:https <span class="o">(</span>CLOSE_WAIT<span class="o">)</span>
appserver 11342 root 227u IPv4 2943998 0t0 TCP 2zqw0uq5rrsZ:55622->192.168.120.181:https <span class="o">(</span>CLOSE_WAIT<span class="o">)</span>
appserver 11342 root 228u IPv4 2947267 0t0 TCP 2zqw0uq5rrsZ:9868->192.168.121.44:https <span class="o">(</span>CLOSE_WAIT<span class="o">)</span>
appserver 11342 root 229u IPv4 3016704 0t0 TCP 2zqw0uq5rrsZ:11326->192.168.121.44:https <span class="o">(</span>CLOSE_WAIT<span class="o">)</span>
appserver 11342 root 230u IPv4 3025554 0t0 TCP 2zqw0uq5rrsZ:57236->192.168.120.181:https <span class="o">(</span>CLOSE_WAIT<span class="o">)</span>
</code></pre></div></div>
<p>First, recall the TCP four-way close:</p>
<p><img src="https://i.loli.net/2021/08/08/OwtzZYJhlom3n7y.png" alt="tcp-4-way-handshake.png" /></p>
<p><code class="language-plaintext highlighter-rouge">CLOSE_WAIT</code> is one of the states on the passive-close side of the four-way close, meaning that side is waiting to close. The passive closer is usually the server.</p>
<pre><code class="language-plain">When the client close()s a socket and sends a FIN to the server, the server will of course reply
with an ACK, and the connection enters the CLOSE_WAIT state. The server then has to check whether
it still has data to send to the client; if not, it can close() the socket too, sending its own
FIN and shutting down its direction of the connection. If there is data, what happens depends on
the application's policy: keep sending, or drop it.
In short, while in CLOSE_WAIT the server is waiting for the application to close the connection.
</code></pre>
<p>A server with many sockets stuck in CLOSE_WAIT therefore means the application is not closing its connections in time — it never sends the FIN back to the client.</p>
<h3 id="解决">Solution</h3>
<p>We escalated to the developers, who confirmed the program was not closing connections and was leaking handles; a fix will ship in the next release.</p>
<h3 id="ref">REF</h3>
<p><a href="https://blog.csdn.net/qq_39382769/article/details/90703382">Understanding the CLOSE_WAIT state</a></p>
<p><a href="https://blog.csdn.net/pmt123456/article/details/56677578">TCP three-way handshake / four-way close and the state transition diagram</a></p>
<p><a href="https://blog.csdn.net/hfhhgfv/article/details/84064230">TCP/IP Illustrated — connection state transition diagram, CLOSE_WAIT</a></p>
<h1>Upgrading node specs in an Alibaba Cloud ACK containerized environment (2021-07-15)</h1>
<p>For cost reasons we recently moved some AWS customers to Alibaba Cloud, but the original machine spec of the Alibaba Cloud meeting nodes was too low and needed an upgrade: from instance type ecs.c6.4xlarge to ecs.c6.6xlarge.</p>
<p>The meeting nodes are somewhat special:</p>
<ul>
<li>they must have a public IP bound</li>
<li>the public IPs are whitelisted and have public domains associated with them</li>
<li>the workload is deployed as a StatefulSet</li>
</ul>
<p>Pods in a StatefulSet are numbered (starting from 0 by default and increasing). To keep the pod ordinal aligned with the last octet of the new nodes' IPs — the first pod (ordinal 0) on the first node, and so on — the nodes have to be replaced one by one.</p>
<p>Practical testing showed that when ACK scales out several nodes at once, they all land in the same availability zone. To spread them across zones, we scale out 1 node at a time; with 6 machines in total, that means 6 separate scale-out operations.</p>
<p>Replacing a node takes the following steps:</p>
<ul>
<li>Add an ecs.c6.6xlarge node and mark it unschedulable</li>
<li>Mark the old node unschedulable, so the deleted pod is not recreated on it by the controller</li>
<li>Delete the old pod, in preparation for removing the node</li>
<li>Unbind the elastic public IP from the old node</li>
<li>Bind the original elastic IP to the new node</li>
<li>Mark the new node schedulable again</li>
</ul>
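<p>The steps above can be turned into a per-node command checklist. This is only a sketch: the namespace "meeting" and the node names below are assumptions, and the elastic-IP rebinding happens in the cloud console, so it appears only as a reminder line.</p>

```python
# Generate the kubectl commands for replacing one old node with a new one.
def replacement_plan(old_node: str, new_node: str, pod: str) -> list:
    return [
        f"kubectl cordon {new_node}",              # keep the new node empty for now
        f"kubectl cordon {old_node}",              # stop rescheduling onto the old node
        f"kubectl delete pod {pod} -n meeting",    # free the pod before node removal
        "# move the elastic IP from the old node to the new one (cloud console)",
        f"kubectl uncordon {new_node}",            # let the pod land on the new node
    ]

for cmd in replacement_plan("cn-beijing.192.168.241.108",
                            "cn-beijing.192.168.240.88", "meeting-0"):
    print(cmd)
```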
<p>Replacement order:</p>
<table>
<thead>
<tr>
<th>Order</th>
<th>pod</th>
<th>Old node</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>meeting-0</td>
<td>cn-beijing.192.168.241.108</td>
</tr>
<tr>
<td>2</td>
<td>meeting-1</td>
<td>cn-beijing.192.168.240.42</td>
</tr>
<tr>
<td>3</td>
<td>meeting-2</td>
<td>cn-beijing.192.168.240.52</td>
</tr>
<tr>
<td>4</td>
<td>meeting-3</td>
<td>cn-beijing.192.168.241.118</td>
</tr>
<tr>
<td>5</td>
<td>meeting-4</td>
<td>cn-beijing.192.168.240.39</td>
</tr>
<tr>
<td>6</td>
<td>meeting-5</td>
<td>cn-beijing.192.168.241.100</td>
</tr>
</tbody>
</table>
<h2 id="新增节点">Adding New Nodes</h2>
<h3 id="节点扩容">Scale Out the Node Pool</h3>
<p>1. Change the node pool's instance type to ecs.c6.6xlarge.</p>
<p>Go to Node Management -> Node Pools -> meeting -> Edit.
<img src="https://i.loli.net/2021/08/02/hIupwO9BbEXA3YH.png" alt="prod-ack-increase-node-edit.png" /></p>
<p>In the instance-type search box, search for <code class="language-plaintext highlighter-rouge">ecs.c6.6x</code> and click <code class="language-plaintext highlighter-rouge">+</code> to add it to the selected types; then click <code class="language-plaintext highlighter-rouge">-</code> next to the existing <code class="language-plaintext highlighter-rouge">ecs.c6.4x</code> to remove it.</p>
<p><img src="https://i.loli.net/2021/08/02/VxvENYd3kFUonJ9.png" alt="prod-ack-increase-node-add-6x.png" /></p>
<p>2. Click Node Management -> Node Pools -> meeting -> Scale Out.</p>
<p><img src="https://i.loli.net/2021/08/02/qa6bXGUhdP4JNLS.png" alt="prod-ack-increase-node-2.png" /></p>
<p>3. Set the number of nodes to add to 1 and click Submit.</p>
<p><img src="https://i.loli.net/2021/08/02/rIKBAMPFueTlnQ6.png" alt="prod-ack-increase-node-1.png" /></p>
<p>Run the following command to watch whether the new node becomes Ready:</p>
<blockquote>
<p>kubectl get node -l k8s-meeting=true -w</p>
</blockquote>
<p>4. Label the new node with its ROLE:</p>
<blockquote>
<p>kubectl label node cn-beijing.192.168.240.88 node-role.kubernetes.io/meeting=node</p>
</blockquote>
<pre><code class="language-plain">Replace cn-beijing.192.168.240.88 with the IP of the new node
</code></pre>
<p>5. Cordon the new node (mark it unschedulable):</p>
<blockquote>
<p>kubectl cordon cn-beijing.192.168.240.88</p>
</blockquote>
<pre><code class="language-plain">Replace cn-beijing.192.168.240.88 with the IP of the new node
</code></pre>
<h3 id="旧节点置为不可调度">Cordon the Old Node</h3>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl cordon cn-beijing.192.168.241.108
</code></pre></div></div>
<pre><code class="language-plain">cn-beijing.192.168.241.108 is the node name; substitute the real one when running
</code></pre>
<h3 id="删除旧-pod">Delete the Old Pod</h3>
<blockquote>
<p>kubectl delete pod meeting-0</p>
</blockquote>
<pre><code class="language-plain">meeting-0 is the pod name; substitute the real one when running
</code></pre>
<h3 id="旧节点解绑弹性公网-ip">Unbind the Elastic Public IP from the Old Node</h3>
<p>In the console, switch to <code class="language-plaintext highlighter-rouge">云服务器 ECS</code> (Elastic Compute Service), search for the node IP, and unbind the EIP.</p>
<h3 id="新节点绑定原有弹性公网-ip">Bind the Original Elastic Public IP to the New Node</h3>
<p>In the console, switch to <code class="language-plaintext highlighter-rouge">云服务器 ECS</code>, search for the new node's IP, and bind the elastic IP that was unbound from the old node.</p>
<pre><code class="language-plain">The business docker container in the pod depends on the public IP at startup, so the elastic IP must be bound before the pod is scheduled onto the new node
</code></pre>
<h3 id="新节点解除不可调度">Uncordon the New Node</h3>
<p>Run the following command:</p>
<blockquote>
<p>kubectl uncordon cn-beijing.192.168.240.88</p>
</blockquote>
<pre><code class="language-plain">Replace cn-beijing.192.168.240.88 with the IP of the new node
</code></pre>
<p>The scheduler will then place the pod on the new node.</p>
<h2 id="更改-sts-的资源限制">Update the StatefulSet Resource Limits</h2>
<p>Once all 6 nodes have been replaced, update the resource limits in the StatefulSet.</p>
<p>Run the following command to edit the config:</p>
<blockquote>
<p>kubectl edit sts meeting</p>
</blockquote>
<ul>
<li>Change the cpu value under limits to "22"</li>
<li>Change the memory value under limits to 46Gi</li>
</ul>
<p>The updated configuration:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">resources</span><span class="pi">:</span>
<span class="na">limits</span><span class="pi">:</span>
<span class="na">cpu</span><span class="pi">:</span> <span class="s2">"</span><span class="s">22"</span>
<span class="na">memory</span><span class="pi">:</span> <span class="s">46Gi</span>
</code></pre></div></div>
<h2 id="移除旧节点">Remove the Old Nodes</h2>
<h3 id="更改为按量付费">Switch to Pay-As-You-Go Billing</h3>
<p>The old nodes were on subscription (monthly) billing; they must be switched to pay-as-you-go before the remaining subscription fee is refunded.</p>
<p>In the console, switch to <code class="language-plaintext highlighter-rouge">云服务器 ECS</code>, search for the old node, and change it to pay-as-you-go.</p>
<p><img src="https://i.loli.net/2021/08/02/UXzkj9LJ6tZfEoR.png" alt="prod-ack-increase-node-change-cost.png" /></p>
<h3 id="移除节点">Remove the Nodes</h3>
<p>Node Management -> Node Pools -> meeting -> Details</p>
<p><img src="https://i.loli.net/2021/08/02/rmBd6sSkpziTRJ5.png" alt="prod-ack-increase-node-detail.png" /></p>
<p>Switch to the <code class="language-plaintext highlighter-rouge">节点管理</code> (node management) tab, select the node, and click <code class="language-plaintext highlighter-rouge">移除节点</code> (remove node)</p>
<p><img src="https://i.loli.net/2021/08/02/WyocGKZhbux8aS3.png" alt="prod-ack-increase-node-remove.png" /></p>
<p>Check <code class="language-plaintext highlighter-rouge">自动排空节点(drain)</code> (drain the node automatically) and <code class="language-plaintext highlighter-rouge">同时释放 ECS</code> (release the ECS as well), then click OK</p>
<p><img src="https://i.loli.net/2021/08/02/KfsAJYe4o2hBOw1.png" alt="prod-ack-increase-node-remove-confirm.png" /></p>vow由于成本因素,近期将部分 AWS 客户切到了阿里云,但阿里云会议节点原来的机器配置较低,需要升配。原有实例规格为 ecs.c6.4xlarge,升级后为 ecs.c6.6xlarge 会议节点的机器比较特殊: 需要绑定公网 IP 公网 IP 有白名单并关联的有公网域名 部署方式为 StatefulSetLVS+keepalived+nginx集群部署2021-05-22T00:00:00+08:002021-05-22T00:00:00+08:00https://www.louxiaohui.com/2021/05/22/deploy-lvs-and-keepalived-cluster<h3 id="主机信息">Host Information</h3>
<table>
<thead>
<tr>
<th>Host</th>
<th>IP-LAN</th>
<th>IP-WAN</th>
</tr>
</thead>
<tbody>
<tr>
<td>LVS01</td>
<td>172.17.85.54</td>
<td>119.253.282.16</td>
</tr>
<tr>
<td>LVS02</td>
<td>172.17.85.55</td>
<td>119.253.282.17</td>
</tr>
<tr>
<td>VIP</td>
<td>172.17.85.56</td>
<td>119.253.282.6</td>
</tr>
<tr>
<td>nginx01</td>
<td>172.17.85.57</td>
<td> </td>
</tr>
<tr>
<td>nginx02</td>
<td>172.17.85.58</td>
<td> </td>
</tr>
</tbody>
</table>
<h3 id="更新源">Update the Package Sources</h3>
<p>Run this on all four machines:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cat</span> <span class="o">></span> /etc/apt/sources.list <<EOF
deb http://mirrors.163.com/ubuntu/ trusty main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ trusty-security main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ trusty-updates main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ trusty-proposed main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ trusty-backports main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ trusty main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ trusty-security main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ trusty-updates main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ trusty-proposed main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ trusty-backports main restricted universe multiverse
EOF
apt-get update
</code></pre></div></div>
<h3 id="安装-keepalived">Install keepalived</h3>
<p>Run on both LVS machines:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt-get <span class="nb">install</span> <span class="nt">-y</span> ipvsadm keepalived
</code></pre></div></div>
<h3 id="安装-nginx">Install nginx</h3>
<p>Run on both nginx machines:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt-get <span class="nb">install</span> <span class="nt">-y</span> nginx
</code></pre></div></div>
<h3 id="添加-lvs-配置脚本">Add the LVS Real-Server Script</h3>
<p>Run on both nginx machines:</p>
<blockquote>
<p>vim /usr/local/sbin/lvs_dr_rs.sh</p>
</blockquote>
<p>Paste in the following content, taking care to change the internal and external VIPs:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/bash
#########################################################################
# File Name: realserver.sh
# Author: LookBack
# Email: admin#05hd.com
# Version:
# Created Time: 2015年05月29日 星期五 18时48分39秒
#########################################################################
# Script to start LVS DR real server.
# description: LVS DR real server
#
#. /etc/rc.d/init.d/functions
VIP=119.253.282.6
VIP1=172.17.85.56
host=`/bin/hostname`
case "$1" in
start)
# Start LVS-DR real server on this machine.
/sbin/ifconfig lo down
/sbin/ifconfig lo up
echo 1 > /proc/sys/net/ipv4/conf/lo/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/lo/arp_announce
echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
/sbin/ifconfig lo:0 $VIP broadcast $VIP netmask 255.255.255.255 up
/sbin/ifconfig lo:1 $VIP1 broadcast $VIP1 netmask 255.255.255.255 up
/sbin/route add -host $VIP dev lo:0
/sbin/route add -host $VIP1 dev lo:1
;;
stop)
# Stop LVS-DR real server loopback device(s).
/sbin/ifconfig lo:0 down
/sbin/ifconfig lo:1 down
echo 0 > /proc/sys/net/ipv4/conf/lo/arp_ignore
echo 0 > /proc/sys/net/ipv4/conf/lo/arp_announce
echo 0 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 0 > /proc/sys/net/ipv4/conf/all/arp_announce
;;
status)
# Status of LVS-DR real server.
islothere=`/sbin/ifconfig lo:0 | grep $VIP`
isrothere=`netstat -rn | grep "lo:0" | grep $VIP`
if [ ! "$islothere" -o ! "$isrothere" ]; then
# Either the route or the lo:0 device
# not found.
echo "LVS-DR real server Stopped."
else
echo "LVS-DR real server Running."
fi
;;
*)
# Invalid entry.
echo "$0: Usage: $0 {start|status|stop}"
exit 1
;;
esac
</code></pre></div></div>
<p>Make the script executable and start it:</p>
<blockquote>
<p>chmod a+x /usr/local/sbin/lvs_dr_rs.sh
bash /usr/local/sbin/lvs_dr_rs.sh start</p>
</blockquote>
<p>Add it to startup:</p>
<blockquote>
<p>echo 'bash /usr/local/sbin/lvs_dr_rs.sh start' >> /etc/rc.local</p>
</blockquote>
<h3 id="lvs01-机器配置">LVS01 Configuration</h3>
<p>Notes on virtual_router_id:</p>
<ul>
<li>It identifies the virtual router; the corresponding backup node must use the same value, marking the nodes as members of the same VRRP group</li>
<li>The two vrrp_instance blocks on the same machine must not share this value, or it can trigger a broadcast storm on the switch</li>
<li>By convention, virtual_router_id is taken from the last octet of the IP</li>
</ul>
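<p>As a quick sanity check of the last-octet convention, the ID can be derived mechanically from each VIP (the two VIPs below are the ones from the host table above):</p>

```shell
#!/bin/sh
# Derive virtual_router_id from a VIP's last octet (the convention noted above).
for vip in 172.17.85.56 119.253.282.6; do
  vrid="${vip##*.}"   # strip everything up to and including the last dot
  echo "VIP $vip -> virtual_router_id $vrid"
done
```

<p>This yields 56 for the internal VIP and 6 for the external one, matching the two vrrp_instance blocks in the configuration below.</p>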
<blockquote>
<p>vim /etc/keepalived/keepalived.conf</p>
</blockquote>
<p>Paste in the following content, and double-check <strong>virtual_router_id</strong>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> global_defs {
router_id D-Medial-1
}
vrrp_instance VI_1 {
state MASTER
interface eth0
virtual_router_id 56 #本机两个vrrp_instance组的此值不能相同,但对应备用节点的此值必须相同,以指明各个节点属于同一VRRP组
nopreempt
priority 100
advert_int 1
authentication {
auth_type PASS
auth_pass 1111
}
virtual_ipaddress {
172.17.85.56 dev eth0 label eth0:0
}
}
virtual_server 172.17.85.56 80 {
delay_loop 5
lb_algo wrr
lb_kind DR
# persistence_timeout 10
protocol TCP
real_server 172.17.85.54 80 {
weight 1
TCP_CHECK {
connect_timeout 5
nb_get_retry 3
delay_before_retry 3
connect_port 80
}
}
real_server 172.17.85.55 80 {
weight 1
TCP_CHECK {
connect_timeout 5
nb_get_retry 3
delay_before_retry 3
connect_port 80
}
}
}
vrrp_instance VI_2 {
state MASTER
interface eth1
virtual_router_id 6 # must differ between the two vrrp_instance blocks on this machine, but must match the corresponding backup node (same VRRP group)
nopreempt
priority 100
advert_int 1
authentication {
auth_type PASS
auth_pass 1111
}
virtual_ipaddress {
119.253.282.6 dev eth1 label eth1:0
}
}
virtual_server 119.253.282.6 80 {
delay_loop 5
lb_algo wrr
lb_kind DR
# persistence_timeout 10
protocol TCP
real_server 119.253.282.16 80 {
weight 1
TCP_CHECK {
connect_timeout 5
nb_get_retry 3
delay_before_retry 3
connect_port 80
}
}
real_server 119.253.282.17 80 {
weight 1
TCP_CHECK {
connect_timeout 5
nb_get_retry 3
delay_before_retry 3
connect_port 80
}
}
}
</code></pre></div></div>
<h3 id="lvs02-机器配置">LVS02 Configuration</h3>
<p>Copy keepalived.conf from the master node, then change:</p>
<ul>
<li>In both vrrp_instance blocks (internal and external), change state from MASTER to BACKUP</li>
<li>In both vrrp_instance blocks, change priority from 100 to 90</li>
<li>Change router_id D-Medial-1 to router_id D-Medial-2</li>
</ul>
<h3 id="两台-nginx-开启转发功能">Enable IP Forwarding on Both nginx Machines</h3>
<blockquote>
<p>echo 1 > /proc/sys/net/ipv4/ip_forward</p>
</blockquote>
<p>Add it to startup:</p>
<blockquote>
<p>echo 'echo 1 > /proc/sys/net/ipv4/ip_forward' >> /etc/rc.local</p>
</blockquote>
<h3 id="启动-keepalived">Start keepalived</h3>
<p>Start LVS01 first, wait 30 seconds, then start LVS02, so the two do not claim the VIP at the same time.</p>
<blockquote>
<p>service keepalived start</p>
</blockquote>
<h3 id="加入开机启动">Enable at Boot</h3>
<blockquote>
<p>update-rc.d keepalived defaults 90</p>
</blockquote>
<h3 id="验证">Verification</h3>
<p>After stopping one of the nginx machines, test connectivity to port 80 against 172.17.85.56 and 119.253.282.6 respectively:</p>
<blockquote>
<p>for i in $(seq 1 1 20);do nc -zv -w 2 172.17.85.56 80;done
for i in $(seq 1 1 20);do nc -zv -w 2 119.253.282.6 80;done</p>
</blockquote>
<p>The main things to check:</p>
<ul>
<li>Whether load balancing works correctly</li>
<li>Whether LVS automatically evicts a backend Real Server (nginx) that goes bad</li>
<li>Whether the VIP fails over to the other LVS machine and keeps serving when one LVS goes down</li>
</ul>
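<p>To read the result of the probe loop at a glance, the attempts can be tallied. A sketch that uses bash's <code class="language-plaintext highlighter-rouge">/dev/tcp</code> redirection instead of nc (handy where nc is not installed; the VIP below is the internal one from the host table):</p>

```shell
#!/bin/bash
# Probe the VIP 20 times and report how many TCP connects to port 80 succeed.
vip="172.17.85.56"
ok=0
total=20
for i in $(seq 1 "$total"); do
  # /dev/tcp/<host>/<port> is a bash builtin path; timeout caps each attempt at 2s.
  if timeout 2 bash -c "exec 3<>/dev/tcp/$vip/80" 2>/dev/null; then
    ok=$((ok+1))
  fi
done
echo "reachable: $ok/$total"
```

<p>With both Real Servers healthy the count should be 20/20; a partial count points at one backend being down or not yet evicted.</p>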
<h3 id="问题记录">Issue Log</h3>
<h4 id="更新源报错">Errors While Updating Sources</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>W: GPG error: http://mirrors.aliyun.com wheezy-updates InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 8B48AD6246925553 NO_PUBKEY 7638D0442B90D010
W: GPG error: http://mirrors.aliyun.com wheezy/updates InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 9D6D8F6BC857C906 NO_PUBKEY 8B48AD6246925553
W: GPG error: http://mirrors.aliyun.com wheezy Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 8B48AD6246925553 NO_PUBKEY 7638D0442B90D010 NO_PUBKEY 6FB2A1C265FFB764
</code></pre></div></div>
<p>Fix:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt-key adv <span class="nt">--keyserver</span> keyserver.ubuntu.com <span class="nt">--recv-keys</span> 8B48AD6246925553
<span class="nb">sudo </span>apt-key adv <span class="nt">--keyserver</span> keyserver.ubuntu.com <span class="nt">--recv-keys</span> 7638D0442B90D010
<span class="nb">sudo </span>apt-key adv <span class="nt">--keyserver</span> keyserver.ubuntu.com <span class="nt">--recv-keys</span> 9D6D8F6BC857C906
<span class="nb">sudo </span>apt-key adv <span class="nt">--keyserver</span> keyserver.ubuntu.com <span class="nt">--recv-keys</span> 6FB2A1C265FFB764
</code></pre></div></div>
<h3 id="注意事项">Caveats</h3>
<p>Keep the following in mind when configuring keepalived:</p>
<ul>
<li>virtual_router_id must be between 0 and 255 and distinguishes the VRRP multicast groups of different instances</li>
<li>Check which virtual_router_id values are already in use on the local network segment first; the value must not collide with other machines</li>
</ul>vow主机信息redis 集群机器宕机后主从关系恢复2021-04-11T00:00:00+08:002021-04-11T00:00:00+08:00https://www.louxiaohui.com/2021/04/11/restore-redis-cluster-status-after-machine-down<p>The redis cluster spans three machines with one master and two replicas per shard, 9 instances in total.
After a machine goes down, two master nodes can end up on the same machine; if that machine then fails, the cluster is left with a single master until the remaining nodes elect new ones, which is a single point of risk.</p>
<h2 id="redis-集群信息">Redis Cluster Topology</h2>
<table>
<thead>
<tr>
<th>IP</th>
<th>Ports</th>
</tr>
</thead>
<tbody>
<tr>
<td>192.168.100.71</td>
<td>7000 (master) 7002 7003</td>
</tr>
<tr>
<td>192.168.100.72</td>
<td>7000 (master) 7001 7003</td>
</tr>
<tr>
<td>192.168.100.73</td>
<td>7000 (master) 7001 7002</td>
</tr>
</tbody>
</table>
<h2 id="初始化并重新创建集群">Initialize and Recreate the Cluster</h2>
<h3 id="禁止业务机器往集群写入">Block Writes from the Business Machines</h3>
<p>The services on the business machines keep connecting to redis and writing data, which makes cluster initialization fail, so writes must be blocked first. iptables can be used to allow only the cluster machines to reach ports 7000-7003 and reject connections from everything else.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>iptables <span class="nt">-A</span> INPUT <span class="nt">-p</span> tcp <span class="nt">-s</span> 192.168.100.71 <span class="nt">-m</span> multiport <span class="nt">--dports</span> 7000:7003 <span class="nt">-j</span> ACCEPT
iptables <span class="nt">-A</span> INPUT <span class="nt">-p</span> tcp <span class="nt">-s</span> 192.168.100.72 <span class="nt">-m</span> multiport <span class="nt">--dports</span> 7000:7003 <span class="nt">-j</span> ACCEPT
iptables <span class="nt">-A</span> INPUT <span class="nt">-p</span> tcp <span class="nt">-s</span> 192.168.100.73 <span class="nt">-m</span> multiport <span class="nt">--dports</span> 7000:7003 <span class="nt">-j</span> ACCEPT
iptables <span class="nt">-A</span> INPUT <span class="nt">-p</span> tcp <span class="nt">-m</span> multiport <span class="nt">--dports</span> 7000:7003 <span class="nt">-j</span> DROP
</code></pre></div></div>
<h3 id="删除本地备份文件">Delete the Local State Files</h3>
<blockquote>
<p>rm -f /usr/local/redis/cluster/nodes-7000.conf /usr/local/redis/cluster/nodes-7001.conf /usr/local/redis/cluster/nodes-7002.conf /usr/local/redis/cluster/nodes-7003.conf<br />
rm -f /usr/local/redis/cluster/dump.rdb</p>
</blockquote>
<h3 id="停止-redis">Stop redis</h3>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ps -ef | grep -E '7000|7001|7002|7003' | grep -v grep | awk '{print $2}' | xargs kill -9
</code></pre></div></div>
<h3 id="启动-redis">Start redis</h3>
<ul>
<li>192.168.100.71</li>
</ul>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/usr/local/redis-cluster/bin/redis-server /usr/local/redis-cluster/cluster/7000/redis7000.conf
/usr/local/redis-cluster/bin/redis-server /usr/local/redis-cluster/cluster/7002/redis7002.conf
/usr/local/redis-cluster/bin/redis-server /usr/local/redis-cluster/cluster/7003/redis7003.conf
</code></pre></div></div>
<ul>
<li>192.168.100.72</li>
</ul>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/usr/local/redis-cluster/bin/redis-server /usr/local/redis-cluster/cluster/7000/redis7000.conf
/usr/local/redis-cluster/bin/redis-server /usr/local/redis-cluster/cluster/7001/redis7001.conf
/usr/local/redis-cluster/bin/redis-server /usr/local/redis-cluster/cluster/7003/redis7003.conf
</code></pre></div></div>
<ul>
<li>192.168.100.73</li>
</ul>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/usr/local/redis-cluster/bin/redis-server /usr/local/redis-cluster/cluster/7000/redis7000.conf
/usr/local/redis-cluster/bin/redis-server /usr/local/redis-cluster/cluster/7001/redis7001.conf
/usr/local/redis-cluster/bin/redis-server /usr/local/redis-cluster/cluster/7002/redis7002.conf
</code></pre></div></div>
<h3 id="重置集群">Reset the Cluster</h3>
<p>Run cluster reset on every instance to clear the old cluster state.</p>
<ul>
<li>192.168.100.71</li>
</ul>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>redis-cli <span class="nt">-h</span> 192.168.100.71 <span class="nt">-p</span> 7000 <span class="nt">-c</span> cluster reset
redis-cli <span class="nt">-h</span> 192.168.100.71 <span class="nt">-p</span> 7002 <span class="nt">-c</span> cluster reset
redis-cli <span class="nt">-h</span> 192.168.100.71 <span class="nt">-p</span> 7003 <span class="nt">-c</span> cluster reset
</code></pre></div></div>
<ul>
<li>192.168.100.72</li>
</ul>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>redis-cli <span class="nt">-h</span> 192.168.100.72 <span class="nt">-p</span> 7000 <span class="nt">-c</span> cluster reset
redis-cli <span class="nt">-h</span> 192.168.100.72 <span class="nt">-p</span> 7001 <span class="nt">-c</span> cluster reset
redis-cli <span class="nt">-h</span> 192.168.100.72 <span class="nt">-p</span> 7003 <span class="nt">-c</span> cluster reset
</code></pre></div></div>
<ul>
<li>192.168.100.73</li>
</ul>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>redis-cli <span class="nt">-h</span> 192.168.100.73 <span class="nt">-p</span> 7000 <span class="nt">-c</span> cluster reset
redis-cli <span class="nt">-h</span> 192.168.100.73 <span class="nt">-p</span> 7001 <span class="nt">-c</span> cluster reset
redis-cli <span class="nt">-h</span> 192.168.100.73 <span class="nt">-p</span> 7002 <span class="nt">-c</span> cluster reset
</code></pre></div></div>
<h3 id="清空数据">Flush the Data</h3>
<ul>
<li>192.168.100.71</li>
</ul>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>redis-cli <span class="nt">-h</span> 192.168.100.71 <span class="nt">-p</span> 7000 flushdb
redis-cli <span class="nt">-h</span> 192.168.100.71 <span class="nt">-p</span> 7002 flushdb
redis-cli <span class="nt">-h</span> 192.168.100.71 <span class="nt">-p</span> 7003 flushdb
</code></pre></div></div>
<ul>
<li>192.168.100.72</li>
</ul>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>redis-cli <span class="nt">-h</span> 192.168.100.72 <span class="nt">-p</span> 7000 flushdb
redis-cli <span class="nt">-h</span> 192.168.100.72 <span class="nt">-p</span> 7001 flushdb
redis-cli <span class="nt">-h</span> 192.168.100.72 <span class="nt">-p</span> 7003 flushdb
</code></pre></div></div>
<ul>
<li>192.168.100.73</li>
</ul>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>redis-cli <span class="nt">-h</span> 192.168.100.73 <span class="nt">-p</span> 7000 flushdb
redis-cli <span class="nt">-h</span> 192.168.100.73 <span class="nt">-p</span> 7001 flushdb
redis-cli <span class="nt">-h</span> 192.168.100.73 <span class="nt">-p</span> 7002 flushdb
</code></pre></div></div>
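<p>The nine reset commands and nine flushdb commands above follow a fixed pattern: every host runs port 7000 plus two of 7001-7003, and host <em>N</em> skips port 700<em>N</em>. A small generator (a dry run: it only prints the commands, so they can be reviewed before running) saves typing them by hand:</p>

```shell
#!/bin/sh
# Print the per-instance reset and flush commands (host N skips port 700N).
i=0
for host in 192.168.100.71 192.168.100.72 192.168.100.73; do
  i=$((i+1))
  for port in 7000 7001 7002 7003; do
    [ "$port" = "700$i" ] && continue   # this instance does not run on this host
    echo "redis-cli -h $host -p $port -c cluster reset"
    echo "redis-cli -h $host -p $port flushdb"
  done
done
```

<p>Piping the output through <code class="language-plaintext highlighter-rouge">sh</code> would execute all 18 commands in one go.</p>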
<h3 id="创建集群">Create the Cluster</h3>
<p>Log in to the first machine, 192.168.100.71, and run the following to create the cluster.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> /data/pkg/redis-3.0.4/src
./redis-trib.rb create 192.168.100.71:7000 192.168.100.72:7000 192.168.100.73:7000
<span class="o">>>></span> Creating cluster
/usr/local/lib/ruby/gems/2.5.0/gems/redis-3.3.0/lib/redis/client.rb:459: warning: constant ::Fixnum is deprecated
<span class="o">>>></span> Performing <span class="nb">hash </span>slots allocation on 3 nodes...
Using 3 masters:
192.168.100.71:7000
192.168.100.72:7000
192.168.100.73:7000
M: c966ac76a40be0c58a8295f0ce4fac800a89ffc0 192.168.100.71:7000
slots:0-5460 <span class="o">(</span>5461 slots<span class="o">)</span> master
M: 8a3b3c98d2d9feb75227b3054da00ed5abb6a 192.168.100.72:7000
slots:5461-10922 <span class="o">(</span>5462 slots<span class="o">)</span> master
M: a6396903ffc958711481836ceff121ddd2ff752d 192.168.100.73:7000
slots:10923-16383 <span class="o">(</span>5461 slots<span class="o">)</span> master
Can I <span class="nb">set </span>the above configuration? <span class="o">(</span><span class="nb">type</span> <span class="s1">'yes'</span> to accept<span class="o">)</span>: <span class="nb">yes</span>
<span class="o">>>></span> Nodes configuration updated
<span class="o">>>></span> Assign a different config epoch to each node
<span class="o">>>></span> Sending CLUSTER MEET messages to <span class="nb">join </span>the cluster
Waiting <span class="k">for </span>the cluster to join..
<span class="o">>>></span> Performing Cluster Check <span class="o">(</span>using node 192.168.100.71:7000<span class="o">)</span>
M: c966ac76a40be0c58a8295f0ce4fac800a89ffc0 192.168.100.71:7000
slots:0-5460 <span class="o">(</span>5461 slots<span class="o">)</span> master
0 additional replica<span class="o">(</span>s<span class="o">)</span>
M: a6396903ffc958711481836ceff121ddd2ff752d 192.168.100.73:7000
slots:10923-16383 <span class="o">(</span>5461 slots<span class="o">)</span> master
0 additional replica<span class="o">(</span>s<span class="o">)</span>
M: 8a3b3c98d2d9feb75227b3054da00ed5abb6a 192.168.100.72:7000
slots:5461-10922 <span class="o">(</span>5462 slots<span class="o">)</span> master
0 additional replica<span class="o">(</span>s<span class="o">)</span>
<span class="o">[</span>OK] All nodes agree about slots configuration.
<span class="o">>>></span> Check <span class="k">for </span>open slots...
<span class="o">>>></span> Check slots coverage...
<span class="o">[</span>OK] All 16384 slots covered.
</code></pre></div></div>
<h3 id="指定主节点">Assign the Master Nodes</h3>
<p>Continuing on the first machine, manually assign the three masters and their corresponding replicas.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./redis-trib.rb add-node <span class="nt">--slave</span> <span class="nt">--master-id</span> c966ac76a40be0c58a8295f0ce4fac800a89ffc0 192.168.100.72:7001 192.168.100.71:7000
./redis-trib.rb add-node <span class="nt">--slave</span> <span class="nt">--master-id</span> c966ac76a40be0c58a8295f0ce4fac800a89ffc0 192.168.100.73:7001 192.168.100.71:7000
./redis-trib.rb add-node <span class="nt">--slave</span> <span class="nt">--master-id</span> 8a3b3c98d2d9feb75227b3054da00ed5abb6a113 192.168.100.71:7002 192.168.100.72:7000
./redis-trib.rb add-node <span class="nt">--slave</span> <span class="nt">--master-id</span> 8a3b3c98d2d9feb75227b3054da00ed5abb6a113 192.168.100.73:7002 192.168.100.72:7000
./redis-trib.rb add-node <span class="nt">--slave</span> <span class="nt">--master-id</span> a6396903ffc958711481836ceff121ddd2ff752d 192.168.100.71:7003 192.168.100.73:7000
./redis-trib.rb add-node <span class="nt">--slave</span> <span class="nt">--master-id</span> a6396903ffc958711481836ceff121ddd2ff752d 192.168.100.72:7003 192.168.100.73:7000
</code></pre></div></div>
<h3 id="查看集群状态">Check the Cluster Status</h3>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>redis-cli <span class="nt">-h</span> 192.168.100.71 <span class="nt">-p</span> 7000 cluster info
redis-cli <span class="nt">-h</span> 192.168.100.71 <span class="nt">-p</span> 7000 cluster nodes
</code></pre></div></div>
<h3 id="删除新增的防火墙规则">Remove the Added Firewall Rules</h3>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># list the rules with their line numbers</span>
iptables <span class="nt">-L</span> <span class="nt">-n</span> <span class="nt">--line-number</span>
<span class="c"># find the number of the rule added earlier and delete it</span>
iptables <span class="nt">-D</span> INPUT RULE_NUMBER
</code></pre></div></div>
<h2 id="ref">REF</h2>
<p><a href="https://blog.51cto.com/u_14661718/2468022">Redis集群节点主从关系调整 </a></p>vowredis 集群总共有三台机器,一主两从,共计启动 9 个实例。 redis 集群机器宕机后,可能会出现两个 master 节点落到一台机器上的情况,若此机器宕机,在其他节点尚未选举出新的主节点之前,集群只有一个主节点,存在单点风险。