Job Requirements
Our purpose is to help a billion people find the right job! Phenom is an AI-powered talent experience platform that is redefining the HR tech space. We have grown into a global organization with offices in 6 countries and over 1,700 employees. As an HR tech unicorn, innovation and creativity are in our DNA. Come help us make every talent moment Phenomenal!
As a Site Reliability Engineer (SRE) at Phenom, you’ll help ensure the reliability, performance, and operational excellence of our platform. Working within our IT Operations organization, you’ll collaborate closely with Engineering, Product, and Platform teams to support production deployments, manage change execution, and drive a culture of stability and continuous improvement.
This role requires strong hands-on experience in CloudOps and DevOps to execute production changes, contribute to root cause investigations, and automate processes that improve service health, platform resilience, and cloud optimization. You’ll also take ownership of the availability, performance, maintenance, and scaling of cloud components.
Key Responsibilities
- Reliability & Scalability: Ensure high availability and performance across Phenom’s cloud-native infrastructure, services, and databases, with ownership or working knowledge of SLIs/SLOs/SLAs, error budgets, and reporting.
- DevSecOps/Security: Integrate security into CI/CD pipelines, scan for vulnerabilities, and handle security incidents.
- Infrastructure Scaling Patterns: Apply auto-scaling, load balancing, and capacity planning methodologies.
- Modern GitOps Practices: Use GitOps workflows for automated infrastructure and configuration management.
- Platform Engineering Integration: Collaborate with platform engineering on developer enablement and toolchain automation.
- Change Management: Implement and support infrastructure, application, and database changes following governance policies and ServiceNow-based Change workflows.
- Major Incident Handling: Serve as a key technical responder during Major Incidents, collaborating with cross-functional teams to rapidly restore service, communicate status, and drive post-incident actions.
- Problem Management Collaboration: Contribute to Root Cause Analysis (RCA) efforts and provide technical input for corrective and preventive actions (CAPAs).
- Release & Deployment Support: Actively support and execute production deployments, ensuring readiness, rollback planning, and validation during releases and patches, including database schema/version changes.
Skills & Experience
- 5+ years of experience in Cloud Ops/DevOps/SRE/Software engineering with hands-on responsibility for production systems.
- Proficiency in one or more programming/scripting languages (e.g., Python, JavaScript/TypeScript, Java).
- Hands-on experience with:
  - Cloud compute, networking, and storage
  - Kubernetes, ArgoCD, Helm, and Linkerd/Istio/NGINX
  - Public cloud platforms (AWS, GCP, or Azure)
  - Kafka, Redis, MongoDB, and relational databases (e.g., PostgreSQL, MySQL, or Aurora)
- Strong understanding of production Change Management processes and use of ServiceNow for change execution and tracking.
- Proven experience supporting and executing production deployments in structured release environments, including database updates.
- Familiarity with observability tooling and best practices for monitoring and diagnostics of both applications and databases.
- Experience with CI/CD, container orchestration, and Infrastructure as Code.
- Solid Linux system administration and troubleshooting skills.
Experience with one or more of the following would be an asset:
- Familiarity with SaaS platforms.
- Prior experience with handling federal requirements such as FIPS, FedRAMP, and FISMA.
- Experience participating in release readiness reviews, Go/No-Go meetings, and Early Life Support (ELS).
- Exposure to structured RCA methodologies and configuration item tracking in a CMDB.
- Understanding of ITSM practices and service lifecycle principles.
- Prior experience as a Database Reliability Engineer (DBRE) or supporting mission-critical databases at scale.
Applicants selected will be subject to a background security investigation and may need to meet eligibility requirements for access to classified information. US ‘Secret’ clearance will be required (must be a US citizen).
Salary
Expected salary range: $110,000 - $130,000
Please note that the salary range is subject to change in accordance with Phenom’s policies.
Diversity, Equity, & Inclusion
Our commitment to diversity runs deep! Diversity is essential to building phenomenal teams, products, and customer experiences. Phenom is proud to be an equal opportunity employer taking collective action to build a more inclusive environment where every candidate and employee feels welcome.