Site Reliability Engineer - OpenStack
In the OVH Public Cloud team, we aim to deliver best-in-class service to customers of every scale, from single-VM start-ups, through DevOps development playgrounds, up to hybrid-cloud clusters of hundreds of VMs. In the OVH Public Cloud OpenStack team, you will be challenged by huge-scale deployments and the issues that come with them, by cooperation with upstream OpenStack developers from all over the world, and by delivering the latest cutting-edge technologies as a service.
Much of SRE software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation. Practices such as limiting time spent on operational work, blameless postmortems and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting and dynamic day-to-day work.
Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to create an environment that provides the support and mentorship needed to learn and grow.
- Engage in and improve the whole lifecycle of services from inception and design, through deployment, operation and refinement.
- Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
- Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
- Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
- Practice sustainable incident response and blameless postmortems.
- Experience with managing distributed, highly available, high traffic infrastructure based on Linux
- Experience with the use, maintenance and configuration of monitoring, metrics and logging infrastructure (Icinga/Nagios, Prometheus, Grafana, Graphite, Logstash/Kibana, etc.)
- Comfortable with shell and scripting languages used in an SRE/Operations engineering context (Python, Go, Bash, Perl, etc.)
- Comfortable with Open Source configuration management and orchestration tools (Puppet, Ansible, Terraform, etc.)
- Experience with software and service deployment and package management, including packaging as well as container systems
- Aptitude for automation and streamlining of tasks, CI/CD
- Desire to dive in and understand and fix complex problems in large environments, assisting in the architectural design of new services and making them operate at scale
- Performing day-to-day operational/DevOps tasks, accountable for code, code review, test and documentation quality
Our Public Cloud team members are experts in matters of infrastructure and scalability. They are responsible for a product that is still…