Xiaozheng (Rio) Liu (刘晓征)

Principal Software Quality Engineer at Red Hat

Architect of the OAR framework, a Python-based OpenShift release orchestration platform spanning MCP servers, Slack bots, CI/CD pipelines, automated agents, and AI-driven workflows. Users can execute OAR commands via Slack or AI agents without any local deployment: the platform runs as a fully managed service with systemd supervision and component monitoring.

15+ Repositories
600+ Commits
3 Core Languages
7 OCP Versions
Python Go TypeScript React Kubernetes OpenShift MCP CI/CD RAG
Technical Skills
Core technologies and domains across the full stack

Languages

Python · Go · TypeScript · JavaScript · Bash · YAML

Frameworks & Tools

Click (CLI) · FastMCP · React · Vite · pytest · golangci-lint

Cloud & DevOps

Kubernetes · OpenShift · Docker · GitHub Actions · Jenkins · GitLab CI · Prow

Integrations

Jira API · Slack SDK · Errata Tool · LDAP · Google Sheets · GitHub API · GitLab API

AI & Data

RAG Systems · Vector Stores · LLM Integration · Streamlit · Pandas

Architecture

MCP Protocol · State Machines · Event-Driven · Caching · Concurrency · REST APIs
🏗 System Architecture: OAR Framework
Designed and implemented the OAR platform, a 6-agent automation system for managing z-stream releases end-to-end
Interface Layer: Multi-Channel Remote Execution
OAR CLI (Click)
oarctl Controller
jobctl Prow
MCP Server (HTTP)
Slack Bot (Socket Mode)
▼ Direct Click Invocation | MCP Protocol | Socket Mode | Gangway API
Automated Agents (6 Background Workers)
Release Detector
Job Controller
Test Result Aggregator
Jira Notificator
Slack Bot
Test Result Checker
▼ Event-Driven | File-Based Locking | Background Scheduler
Orchestration Layer
ReleaseShipmentOperator
BugOperator
ApprovalOperator
ReleaseOwnershipOperator
▼ Composite Pattern | Cross-Module Coordination
Core Services
ConfigStore
StateBox
AdvisoryManager
Notification
Shipment
▼ JWE Encrypted Config | GitHub-backed YAML | Errata Tool
External Integrations
Errata Tool
Jira
GitHub
GitLab
Jenkins
Slack
Prow/Gangway
▼ Kerberos Auth | REST APIs | Webhooks

StateBox: GitHub-backed State Machine

  • YAML-based state persistence using GitHub as a backend
  • Automatic task status tracking via CLI output parsing
  • Blocker issue management with resolution tracking
  • Transparent migration path from legacy Google Sheets

ConfigStore Caching

  • LRU + TTL (7-day) cache eliminates redundant JWE decryption
  • Reduces latency from ~1000ms to <10ms for cached releases
  • Thread-safe with RLock for concurrent AI agent requests
  • Cache hit rate exceeds 80% in typical AI workflows
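
The caching behaviour listed above can be illustrated with a minimal sketch. It uses only the Python standard library rather than any particular caching package, and the class name, the loader function, and the hit/miss counters are hypothetical; only the 7-day TTL, the LRU eviction, and the RLock come from the description.

```python
import threading
import time
from collections import OrderedDict


class LruTtlCache:
    """Small LRU cache whose entries also expire after a fixed TTL."""

    def __init__(self, maxsize=50, ttl_seconds=7 * 24 * 3600, clock=time.monotonic):
        self._maxsize = maxsize
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.RLock()
        self._entries = OrderedDict()  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key, loader):
        """Return the cached value for key, calling loader(key) on a miss."""
        with self._lock:
            now = self._clock()
            entry = self._entries.get(key)
            if entry is not None and entry[0] > now:
                self._entries.move_to_end(key)  # mark as most recently used
                self.hits += 1
                return entry[1]
            self.misses += 1
            # Simplification: the loader runs under the lock, so concurrent
            # callers never repeat the same expensive load, but all loads
            # are serialised.
            value = loader(key)
            self._entries[key] = (now + self._ttl, value)
            self._entries.move_to_end(key)
            while len(self._entries) > self._maxsize:
                self._entries.popitem(last=False)  # evict least recently used
            return value


def load_release_config(release):
    """Hypothetical stand-in for the slow ConfigStore initialisation."""
    return {"release": release}


cache = LruTtlCache()
first = cache.get_or_load("4.19.1", load_release_config)   # miss: loader runs
second = cache.get_or_load("4.19.1", load_release_config)  # hit: same object
```

Because release data is described as immutable after announcement, a long TTL is safe here; a cache over mutable data would need explicit invalidation instead.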

MCP Server: AI Agent Integration

  • 28+ OAR commands wrapped as structured MCP tools
  • Async concurrency via ThreadPoolExecutor (20 threads)
  • Direct Click invocation is 70-90% faster than spawning a subprocess
  • Health endpoint, cache management, structured error reporting
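
One common way to combine an async server with blocking command code is to hand each call to a thread pool. The sketch below shows that pattern with the standard library only. The 20-thread pool size comes from the list above; the function names and the simulated command are hypothetical and do not reflect the actual server code.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Pool size taken from the text above; everything else is illustrative.
EXECUTOR = ThreadPoolExecutor(max_workers=20)


def run_command_blocking(name):
    """Hypothetical blocking command; sleep stands in for network I/O."""
    time.sleep(0.1)
    return f"{name}: ok"


async def call_tool(name):
    """Run the blocking command in the pool so the event loop stays responsive."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(EXECUTOR, run_command_blocking, name)


async def main():
    names = [f"cmd-{i}" for i in range(5)]
    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*(call_tool(n) for n in names))


results = asyncio.run(main())
```

The five simulated commands overlap in the pool instead of running back to back, which is the effect the thread pool is meant to provide for concurrent agent requests.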

Prow Job Controller & Aggregator

  • Polling loop detects new accepted payloads and triggers Prow blocking tests via the Gangway API
  • Test job registry (YAML) defines which jobs to run, with upgrade/optional/retry flags
  • Aggregator evaluates results against the retry policy and auto-labels payloads QE Accepted
  • Duplicate detection prevents re-triggering jobs already tracked in GitHub state
💼 Key Projects & Contributions
Significant open-source contributions spanning automation, infrastructure, and tooling
Release Tests (OAR)
openshift/release-tests
388 commits
Core automation framework for the OpenShift z-stream release lifecycle. Orchestrates advisory management, bug tracking, CI test verification, and release approval across Errata Tool, Jira, Jenkins, Slack, and GitLab. Features a modular manager/helper architecture, an MCP server for AI agent access, a Slack bot for remote command execution (no local deploy needed), and a GitHub-backed state system.
StateBox Design MCP Server Job Controller + Aggregator Slack Bot CI Failure Analysis Zero-Local-Deploy ConfigStore Caching Async Concurrency Jira Cloud Migration CLI Framework
OpenShift Tests Private
openshift/openshift-tests-private
128 commits
Test automation for the Machine Config Operator (MCO) in OpenShift. Automated 20+ e2e test cases covering SSH key policies, file-based policies, unit-based policies, and container runtime configuration. Designed Go build overlays for custom test template development and implemented JSON test metadata extraction.
MCO Test Automation Go Build Overlay Cypress Integration OCPERT Initiative ROSA Detection
Release CI/CD Configuration
openshift/release
47 commits
Designed and maintained Prow-based CI/CD pipeline configurations for OpenShift QE auto-release jobs. Set up multi-architecture (amd64, arm64) job controllers, QE release gate testing, catalog source management, and daily/weekly presubmit job scheduling across 4.x releases.
Prow Job Config Multi-arch Pipeline Auto Release Chain Stage/Gate Testing Slack Alerting
SHIP Status Dashboard
openshift-eng/ship-status-dash
9 commits
Go + React/TypeScript dashboard for visualizing OpenShift release shipment status across multiple components. Added systemd unit monitoring via D-Bus integration for the component monitor, including a SystemdProber with context-aware D-Bus calls and containerized integration tests.
Systemd Monitor D-Bus Integration E2E Test Framework Errata Reliability Component
OpenShift Misc (Jenkins)
openshift-eng/openshift-misc
22 commits
Jenkins pipeline libraries for OpenShift release testing: CVE bug tracker checking, shipment image pullspec fetching via the Pyxis API, and Konflux workflow integration for errata_optional_step and shipped_image_check.
CVE Bug Tracking Pyxis API Shipment Data Jenkins Shared Lib
RAG Chatbot
rioliu-rh/rag-chatbot
14 commits
Built a Retrieval-Augmented Generation chatbot from scratch in Python, with document ingestion, a vector store (ChromaDB), LLM-based relevance checking, and a Streamlit chat interface. Retrieval is two-stage: vector similarity search followed by an LLM relevance gate, which helps prevent hallucinations.
RAG Pipeline Vector Store LLM Integration Streamlit UI
OpenShift Origin
openshift/origin
7 commits
Contributed to the core OpenShift origin repo: Kubernetes Pod Security Admission label migration, HyperShift guest kubeconfig support for debugging, unique namespace generation for UDN tests, and CLI argument handling improvements.
PSA Migration HyperShift Support UDN Testing CLI Improvements
OpenShift Tests Extension
openshift-eng/openshift-tests-extension
9 commits
Go-based framework for extending OpenShift e2e testing with custom test suites. Added Cypress test integration: converting metadata to Go test specs, running headless tests, and parsing JUnit XML into the Go framework for unified reporting.
Cypress Integration Go Test Extension JUnit Parsing
Cross-Repository Contributions
openshift/sippy · flexy-templates · release-controller · more
20+ commits
QE release views in Sippy (OpenShift CI analysis dashboard), stage catalog source configuration in flexy-templates, Jira payload verification in release-controller, and community fixes to verification-tests and cucushift.
QE Views (Sippy) Stage Catalog Sources Payload Bug Verification
💡 Technical Design Highlights
Key engineering decisions and their rationale

MCP Server: Direct Click Invocation

Instead of spawning a subprocess for each OAR CLI command (300ms overhead), the MCP server wraps Click commands directly via the Python invocation API. This reduced latency by 70-90% and enabled clean error propagation.

Python · FastMCP

StateBox: YAML + GitHub = State Machine

Replaced Google Sheets as the release state backend. Every release has a YAML file in _releases/. CLI commands auto-update state via output-parsing callbacks. No API rate limits or spreadsheet dependencies.

State Machine · YAML · GitHub API
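
The output-parsing callback idea in the StateBox card can be sketched roughly as follows. The status names, the regular expressions, and the `example-task` name are all hypothetical, since the actual output markers are not given here. An in-memory dict stands in for the per-release YAML file.

```python
import re

# Hypothetical output markers; the real patterns are not given in the text.
STATUS_PATTERNS = [
    (re.compile(r"\b(error|failed|traceback)\b", re.IGNORECASE), "Fail"),
    (re.compile(r"\b(done|completed|success)\b", re.IGNORECASE), "Pass"),
]


def parse_status(output):
    """Map captured CLI output to a task status; default to 'In Progress'."""
    for pattern, status in STATUS_PATTERNS:
        if pattern.search(output):
            return status
    return "In Progress"


def record_task(state, task, output):
    """Callback run after a command: store the derived status in the state dict."""
    state.setdefault("tasks", {})[task] = {"status": parse_status(output)}
    return state


# In the described design this dict would be serialised to one YAML file per release.
state = {"release": "4.19.1"}
record_task(state, "update-bug-list", "bug list updated, done")
record_task(state, "example-task", "ERROR: something went wrong")
```

Failure patterns are checked first, so output that contains both an error and a completion message is recorded as a failure rather than a pass.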

ConfigStore LRU+TTL Cache

ConfigStore init costs ~1000ms (JWE decrypt + GitHub HTTP + YAML parse). Since release data is immutable after announcement, a 50-entry LRU with a 7-day TTL eliminates redundant work. A thread-safe RLock enables concurrent AI agent requests.

cachetools · JWE · RLock

Go Build Overlay for Custom Test Templates

Implemented Go build overlays to allow custom test development without modifying the upstream origin dependency. Overlay patches replace the origin import path with a local fork for rapid iteration.

Go · go.mod

Background Metadata Checker with Graceful Shutdown

Background process for periodic release metadata checks that survives container restarts. Switched from daemon threads to subprocess-based workers with proper cleanup hooks. Fixed systemd KillMode to prevent process group termination.

Multiprocessing · systemd · Signal Handling

RAG Pipeline Two-Stage Retrieval

First stage: vector similarity search on embedded docs. Second stage: an LLM-based relevance gate filters out non-applicable results. This helps prevent hallucinated answers from irrelevant context while maintaining high recall.

ChromaDB · LLM · Streamlit
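
The two-stage retrieval described in this card can be sketched with the standard library alone. Hand-written vectors stand in for embeddings, and a keyword-overlap function stands in for the LLM relevance check. The function names and sample documents are hypothetical, and no ChromaDB or LLM call is made.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, question, docs, is_relevant, top_k=3):
    """Stage 1: rank by vector similarity. Stage 2: keep only gated chunks."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    candidates = ranked[:top_k]
    return [d for d in candidates if is_relevant(question, d["text"])]


def keyword_gate(question, text):
    """Stand-in for the LLM relevance check: require at least one shared word."""
    return bool(set(question.lower().split()) & set(text.lower().split()))


DOCS = [
    {"text": "advisory status is tracked per release", "vec": [0.9, 0.1]},
    {"text": "lunch menu for friday", "vec": [0.8, 0.2]},
    {"text": "prow jobs run on payloads", "vec": [0.1, 0.9]},
]

context = retrieve([1.0, 0.0], "what is the advisory status", DOCS, keyword_gate, top_k=2)
# An empty context list is the signal to answer "I don't know" instead of guessing.
```

In this toy data the second document is close to the query in vector space but unrelated in content, so it survives stage one and is dropped by the gate. That is the case the second stage exists to catch.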

Slack Bot: Remote OAR Execution

Socket Mode listens for @mentions, validates OAR commands with strict security (shell metacharacter blocking, command whitelisting), and executes them server-side. Users type @ERTReleaseBot oar -r 4.19.1 update-bug-list in Slack, with no local install, SSH, or VPN required.

Slack Socket Mode · Command Injection Prevention · systemd
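
A minimal sketch of the validation step described in this card follows. It assumes the bot mention has already been stripped from the message. The allow-list contents, the metacharacter set, and the fixed four-token command shape are assumptions for illustration, not the bot's actual rules.

```python
import re
import shlex

# Hypothetical allow-list; only update-bug-list appears in the text above.
ALLOWED_SUBCOMMANDS = {"update-bug-list"}
SHELL_METACHARACTERS = re.compile(r"[;&|`$<>(){}\\\n\r]")
RELEASE_PATTERN = re.compile(r"\d+\.\d+\.\d+")


def validate_oar_command(text):
    """Return the argv list for an allowed command, or raise ValueError."""
    if SHELL_METACHARACTERS.search(text):
        raise ValueError("shell metacharacters are not allowed")
    argv = shlex.split(text)
    if len(argv) != 4 or argv[0] != "oar" or argv[1] != "-r":
        raise ValueError("expected: oar -r <release> <subcommand>")
    if not RELEASE_PATTERN.fullmatch(argv[2]):
        raise ValueError("release must look like 4.19.1")
    if argv[3] not in ALLOWED_SUBCOMMANDS:
        raise ValueError(f"subcommand not allowed: {argv[3]}")
    return argv


argv = validate_oar_command("oar -r 4.19.1 update-bug-list")
```

Returning a parsed argument list rather than a string means the command can be executed without a shell, so the metacharacter check is a second line of defence rather than the only one.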

Prow Job Controller + Test Result Aggregator

When a release payload reaches Accepted, the Controller auto-triggers pre-configured Prow blocking tests via the Gangway API. Results are persisted on GitHub via a YAML registry. The Aggregator polls status, evaluates results with a retry policy (majority-wins), and auto-labels the payload QE Accepted when all required jobs pass. Multi-arch: amd64/arm64/ppc64le/s390x.

Gangway API · Prow · Retry Policy · Multi-arch
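
The acceptance rule in this card can be sketched as below, under one reading of "majority-wins": a job passes when strictly more than half of its attempts succeeded, and a payload is accepted only when every non-optional job passes. The job names, result strings, and registry shape are hypothetical.

```python
def job_passed(attempts):
    """Majority-wins: pass when more than half of the attempts succeeded."""
    successes = sum(1 for result in attempts if result == "success")
    return successes * 2 > len(attempts)


def payload_accepted(results, registry):
    """Accept only if every required (non-optional) job passes under the policy."""
    for job in registry:
        if job.get("optional"):
            continue
        attempts = results.get(job["name"], [])
        if not attempts or not job_passed(attempts):
            return False
    return True


# Hypothetical job names and results, for illustration only.
REGISTRY = [
    {"name": "e2e-aws-amd64", "optional": False},
    {"name": "e2e-aws-arm64", "optional": False},
    {"name": "e2e-extra", "optional": True},
]
RESULTS = {
    "e2e-aws-amd64": ["failure", "success", "success"],
    "e2e-aws-arm64": ["success"],
    "e2e-extra": ["failure"],
}

accepted = payload_accepted(RESULTS, REGISTRY)
```

Under this reading a one-to-one tie counts as a failure and a required job with no recorded attempts blocks acceptance. Both are conservative choices made for the sketch, not confirmed behaviour.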

CI Failure Analysis: AI-Driven Root Cause

The Slack command /ci:analyze-failures triggers an automated CI investigation: it fetches Prow job logs, build diagnostics, and infrastructure failure data, then dispatches to specialized analyzers (Prow, Jenkins, installation). Failures are classified as infra flakiness (waive) vs. product regression (block). Supports must-gather extraction for deep pod/operator lifecycle analysis.

Prow API · Must-Gather · Dispatcher Pattern · AI Classification
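
The dispatcher pattern named in this card can be sketched as a mapping from failure source to analyzer function. The keyword-based classifier is only a toy placeholder for the AI-driven classification, and the record fields and marker strings are hypothetical.

```python
# Hypothetical keyword rules; the real classification is described as AI-driven.
INFRA_MARKERS = ("timeout", "quota exceeded", "connection refused")


def classify(log_text):
    """Toy classifier: infra flakiness is waived, anything else blocks."""
    lowered = log_text.lower()
    if any(marker in lowered for marker in INFRA_MARKERS):
        return "waive"
    return "block"


def analyze_prow(failure):
    return {"analyzer": "prow", "verdict": classify(failure["log"])}


def analyze_jenkins(failure):
    return {"analyzer": "jenkins", "verdict": classify(failure["log"])}


def analyze_installation(failure):
    return {"analyzer": "installation", "verdict": classify(failure["log"])}


ANALYZERS = {
    "prow": analyze_prow,
    "jenkins": analyze_jenkins,
    "installation": analyze_installation,
}


def dispatch(failure):
    """Route a failure record to the analyzer registered for its source."""
    try:
        analyzer = ANALYZERS[failure["source"]]
    except KeyError:
        raise ValueError(f"no analyzer for source: {failure.get('source')}") from None
    return analyzer(failure)


report = dispatch({"source": "prow", "log": "pod scheduling timeout after 30m"})
```

Adding a new failure source then only requires registering another function in the mapping, without touching the routing logic.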

Multi-Channel Remote Execution

OAR commands execute through three independent remote channels: AI agents via the MCP Server (HTTP), the Slack Bot via Socket Mode (real-time), and Prow CI jobs (event-driven). All three share the same code path through direct Click invocation. Users never need to clone the repo, install packages, or manage Kerberos tickets locally.

MCP · Slack Bot · Prow · Zero-Local-Deploy
📅 Impact Timeline
Evolution of contributions across projects
2024 - Present
OAR Framework Architecture & MCP Server
Architected the OAR framework: StateBox state management, MCP server with HTTP transport and direct Click invocation, ConfigStore LRU+TTL caching. Built the Prow Job Controller with Gangway API integration and the Test Result Aggregator with retry policy. Developed the Slack Bot for remote execution with zero local deploy. Created AI-driven CI failure analysis with a dispatcher pattern. Migrated Jira from Server to Cloud.
2023 - 2024
MCO Test Automation
Automated 20+ MCO e2e test cases covering SSH key policies, file-based policies, and unit-based configuration. Designed Go build overlays for custom test templates. Established OCPERT initiative infrastructure.
2022 - 2024
CI/CD Pipeline Engineering
Designed Prow auto-release job chains for OpenShift 4.12-4.22+. Set up multi-architecture job controllers, QE release gate testing, and Slack alerting for pipeline failures.
2022 - 2023
Release Automation Foundation
Built the initial OAR CLI framework with Click, integrated the Errata Tool and Jira APIs, and implemented advisory lifecycle management and bug verification automation. Developed Jenkins pipeline libraries for CVE tracking and image testing.
2021 - 2022
OpenShift Origin & Test Infrastructure
Contributed to core OpenShift test infrastructure: Pod Security Admission migration, HyperShift debugging support, UDN namespace generation, and CLI tooling improvements.