About
- Passionate, technology-driven, and self-motivated Full Stack Developer with expertise in cloud computing, DevOps culture and practice, and cloud-native solutions. My core mission is to transform systems into highly resilient, self-healing entities through advanced architectural design, deep automation, and cutting-edge engineering practices.
- I build unbreakable systems and enjoy the thrill of solving complex challenges with technology.
Experience
-
Summary:
- Passionate, technology-driven System Architect and Technical Director who plays a critical role in making applications and infrastructure services reliable, visible, resilient, and self-healing for stakeholders, supporting daily operations, troubleshooting, performance analysis, and capacity planning through advanced architectural design, deep automation, and cutting-edge engineering practices.
Responsibilities:
- Architecture Design & Optimization: Spearhead the design, implementation, and optimization of High Availability (HA) and Disaster Recovery (DR) solutions across on-premises and public cloud environments, encompassing multi-AZ (Availability Zone) and multi-region architectures to eliminate single points of failure and ensure business continuity.
- Infrastructure as Code (IaC) Leadership: Champion and drive Infrastructure as Code (IaC) practices, leveraging tools such as Terraform, ARM templates, and AWS CloudFormation to automate, version-control, and ensure repeatable, idempotent provisioning of all infrastructure components, achieving consistency and efficiency.
- Observability & Monitoring Enhancement: Enhance system observability by establishing comprehensive metrics collection, alerting mechanisms, and centralized logging using platforms like Prometheus, Grafana, Alertmanager, and OpenSearch or the ELK stack, enabling real-time performance analysis, proactive issue detection, and rapid troubleshooting.
- Automation & Operational Efficiency: Develop robust automation scripts and tools using Ansible, Bash, and Python to eliminate manual, repetitive tasks, thereby streamlining operational workflows, accelerating incident response, and improving recovery times.
- Incident Management & Continuous Improvement: Participate in on-call rotations for critical incidents and lead blameless post-mortem processes to perform deep-dive root cause analysis, driving actionable insights and continuous improvement across systems and processes.
- Software Engineering Principles & CI/CD: Apply strong software development domain knowledge, including design patterns, code structure, and programming languages, with expertise in continuous integration and deployment (CI/CD) pipelines using Git, GitHub/GitLab, Jenkins, and ArgoCD.
- Team Leadership & Mentorship: Lead and mentor a team of engineers, fostering a culture of collaboration, continuous learning, and professional growth, while ensuring alignment with organizational goals and technical excellence.
- Communication & ITSM Integration: Possess excellent written and verbal communication skills, with experience in ITOM/ITSM integration, specifically ServiceNow ITOM for event management and operational intelligence, alongside strong people management capabilities.
- DevOps & Modern Ops Practices: Maintain deep awareness and implementation expertise in DevOps, DevSecOps, GitOps, and AIOps strategies, fostering a culture of automation, collaboration, and continuous delivery.
Achievements:
- Collaborated successfully with geographically dispersed teams across different countries, adapting communication to time zone and cultural differences during critical incident resolutions.
- Built a dual-active architecture across Azure and Alibaba Cloud to support daily processing of over 100,000 orders for the PO system.
- Built an on-premises high-availability Kubernetes cluster and implemented an elastic scaling strategy, keeping peak Pod startup latency under 500 ms.
- Built a Proxmox bare-metal cluster to support millisecond-level scaling for over 100 containerized applications.
- Enhanced system disaster recovery capability through Chaos Monkey drills, achieving zero business interruptions throughout the year.
- Developed a microservice monitoring system based on Spring Cloud, with an average response time of less than 200 ms.
- Implemented OpenTelemetry, Prometheus + Grafana + Alertmanager, and an ELK log analysis platform, improving fault localization efficiency by 80%.
- Designed an Argo CD continuous delivery pipeline that achieved 1,000 failure-free deployments per day.
- Standardized deployments with Helm charts, reducing the environment consistency error rate by 90%.
- Implemented GitOps practices, reducing the configuration drift rate from 15% per week to 0.5% per month.
- Chaos Engineering
- SRE
- DevOps
- CI/CD
- GitOps
- Linux
- Docker
- Kubernetes
- Prometheus
- OpenTelemetry
- Grafana
- Azure Cloud
- Alibaba Cloud
- AWS
- EKS
- Jenkins
- ArgoCD
- Helm
- Terraform
- Ansible
- Bash
- Python
- Java
- Spring Boot
- Spring Cloud
- ELK
- -
Summary:
- Collaborated with the business development team in China and worked with global teams, ensuring seamless alignment between business requirements and technical solutions such as server operations, storage capacity, application deployment, and SRE.
Responsibilities:
- Application Operational Management: Collaborated with development teams throughout the application lifecycle to ensure seamless deployment of new systems, maintaining production-grade quality and zero customer impact.
- Change Management: Owned production environment governance under ITIL4 framework, enforcing compliance with Change Management, Incident Management, and Release Management policies.
- Incident Management: Conducted proactive monitoring, root cause analysis, and resolution of production incidents, escalating to cross-functional teams when critical business continuity risks emerged.
- Environment Patch Management: Executed systematic patching strategies to uphold security posture, ensuring 100% compliance with the latest vulnerability remediation protocols.
- Mentored two Management Trainees, accelerating their technical growth and productivity.
Achievements:
- Led the migration of a legacy system from traditional servers to a virtualized platform (VMware vSphere), reaching a virtualization rate of 60% and boosting scalability and reliability.
- Passed ISO 27001 information security management certification.
- Implemented Hierarchical Storage Management across HDS SAN, IBM SAN, Dell iSCSI, and NetApp NAS storage systems, significantly reducing overall storage cost (TCO).
- ITIL4
- ISO 27001
- TCO
- Incident Management
- Change Management
- Security Patch Management
- VMware vSphere
- SAN Storage
- NAS Storage
- iSCSI Storage
- Windows Server
- Linux Server
- -
Summary:
- Worked with the Dev team on the deployment, operation, and maintenance of the Sourcing Platform (the major business system), including HA servers, network, and storage systems.
Achievements:
- For the Sourcing Platform, oversaw the full infrastructure lifecycle, including server and storage selection, backup, and high-availability solutions, from design through daily monitoring. Collaborated closely with development teams to implement agile practices, enabling rapid, reliable iteration and accelerated version releases.
- -
Summary:
- During this time, the company grew from 50 to more than 500 people, and I rose to head of the company's IT department.
Achievements:
- I built the company's IT system from scratch, including the network system, server system, storage system, security system, OA system, and business system. I also led a team to maintain the company's IT system and provide technical support to the company's employees.
-
Projects
On-premises Kubernetes clusters can be isolated from the public cloud, reducing the risk of attacks. This setup also helps in meeting stringent regulatory and data privacy requirements, as data is stored and managed within the physical premises.
- Deploying Kubernetes on-premises offers significant benefits in terms of security, compliance, and cost-effectiveness but also presents challenges in scalability, management complexity, and networking.
Cloud-native AI provides a set of essential features and services to help clients build an AI platform, accelerate AI workloads, and simplify MLOps.
- A Kubernetes-based service with a modular and extensible architecture that accelerates the construction of AI platforms and improves resource utilization and delivery efficiency.
DevSecOps integrates practices like Continuous Integration (CI), Continuous Delivery (CD), and automation to streamline the software development lifecycle.
- By adopting DevSecOps, organizations can achieve faster, more secure software delivery while fostering a culture of shared responsibility for security. This transformation is essential in today's fast-paced and threat-prone software development landscape.
A great self-hosted option that gives teams and developers high efficiency and easy operations from planning to production.
- Gitea enables the creation and management of repositories based on Git. It also makes code review incredibly easy and convenient, enhancing code quality for users and businesses.
Argo CD is a declarative, GitOps continuous delivery tool for Kubernetes.
- Application definitions, configurations, and environments should be declarative and version controlled. Application deployment and lifecycle management should be automated, auditable, and easy to understand.
Keycloak is an open-source identity and access management solution that supports Single Sign-On (SSO) for web applications and RESTful services.
- Implement SSO with Keycloak, providing seamless authentication across your applications.
Skills
Education
-
Zhejiang University
Computer Science and Technology, B.S. - Essential theories and expertise in computer science and information technology, connecting computer theory with applications, software with hardware, and engineering methodology with technology.
-
Guangdong University of Foreign Studies
English Junior College - Learned the foundations of composition, critical thinking, and research in an English program covering all areas of literature and language.
Certificates
- Alibaba Cloud-
- Alibaba Cloud-
- Alibaba Cloud-
- HKQAA-
- Microsoft-