Site Reliability Engineer - OpenStack

Ref
SREOS0118
Permanent contract (CDI)
Montréal
Canada
OVH offers a wide range of IT services to companies, and to individuals who are passionate about tech. Whether you're looking at our Private Cloud, Public Cloud or Hybrid Cloud services, web hosting plans, virtual datacentres, dedicated servers, storage solutions or even xDSL and VoIP connections, our services are constantly being improved with the very latest innovations, and are regularly developed with new features.

In the OVH Public Cloud team, we aim to deliver best-in-class service to customers of every scale, from one-VM start-ups through DevOps development playgrounds up to hybrid-cloud clusters of hundreds of VMs. In the OVH Public Cloud OpenStack team, you will be challenged by huge-scale deployments and the issues related to them, cooperate with upstream OpenStack developers from all over the world, and deliver the latest cutting-edge technologies as a service.

Your Role?

The Site Reliability Engineer (SRE) is responsible for, among other things, the availability, performance, monitoring and incident response of the platforms and services that Product Unit Public Cloud Instances runs and owns. Thanks to your software and systems engineering skills, you are able to build and run large-scale, massively distributed, fault-tolerant systems. The SRE ensures that systems have reliability and uptime appropriate to users' needs and a fast rate of improvement, while keeping an ever-watchful eye on capacity and performance.

Much of SRE software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation. Practices such as limiting time spent on operational work, blameless postmortems and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting and dynamic day-to-day work.
Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to create an environment that provides the support and mentorship needed to learn and grow.
  • Engage in and improve the whole lifecycle of services from inception and design, through deployment, operation and refinement.
  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
  • Practice sustainable incident response and blameless postmortems.
  
Your skills?

  • English
  • Experience with managing distributed, highly available, high traffic infrastructure based on Linux
  • Experience with the use, maintenance and configuration of monitoring, metrics and logging infrastructure (Icinga/Nagios, Prometheus, Grafana, Graphite, Logstash/Kibana, etc.)
  • Comfortable with shell and scripting languages used in an SRE/Operations engineering context (Python, Go, Bash, Perl, etc.)
  • Comfortable with Open Source configuration management and orchestration tools (Puppet, Ansible, Terraform, etc.)

Your experience?
  • Experience with software and service deployment and package management, including packaging as well as container systems
  • Aptitude for automation and streamlining of tasks, CI/CD
  • Desire to dive in and understand/fix complex problems in large environments, assisting in the architectural design of new services and making them operate at scale
  • Performing day-to-day operational/DevOps tasks, accountable for code, code review, test and documentation quality

Your team

Public Cloud

Our Public Cloud team members are experts in matters of infrastructure and scalability. They are responsible for a product that is still…