Your key responsibilities are:
- Cluster Maintenance: Perform major/minor version upgrades and manage cluster scaling and capacity expansion.
Deep-Dive Debugging: Use Ceph logs and diagnostic tools to identify and resolve issues in production environments.
Linux & Hardware Tuning: Optimize the Linux OS (kernel, networking, and I/O) and server hardware (BIOS/firmware) for peak Ceph performance.
Operational Automation: Create scripts and automation to handle routine maintenance and CRUSH map cleanups.
Lifecycle Management: Execute major and minor Ceph version upgrades to ensure cluster stability and security.
Performance Engineering: Conduct deep-dive performance investigations and tuning if required.
Automation: Develop and maintain automation for operational tasks, including capacity expansion, CRUSH map optimizations, and routine maintenance.
Cluster Architecture: Build and configure new Ceph clusters from the ground up to support our rapidly scaling infrastructure.
Incident Response: Participate in a 24x7 on-call rotation, monitoring cluster health and resolving complex issues efficiently.
Upstream Contribution: Participate in the Ceph community through bug fixes, feature enhancements, and code reviews if required.
